From 4b3d50062ce06530adc27224a1e10f087d0c9caf Mon Sep 17 00:00:00 2001 From: Linus Walleij Date: Thu, 23 May 2019 10:17:36 +0200 Subject: gpio: Fix minor grammar errors in documentation This fixes up some of my own mistakes when I stressed to refresh the documentation. Signed-off-by: Linus Walleij --- Documentation/driver-api/gpio/driver.rst | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) (limited to 'Documentation/driver-api') diff --git a/Documentation/driver-api/gpio/driver.rst b/Documentation/driver-api/gpio/driver.rst index 1ce7fcd0f989..58036c2d84d2 100644 --- a/Documentation/driver-api/gpio/driver.rst +++ b/Documentation/driver-api/gpio/driver.rst @@ -235,7 +235,7 @@ means that a pull up or pull-down resistor is available on the output of the GPIO line, and this resistor is software controlled. In discrete designs, a pull-up or pull-down resistor is simply soldered on -the circuit board. This is not something we deal or model in software. The +the circuit board. This is not something we deal with or model in software. The most you will think about these lines is that they will very likely be configured as open drain or open source (see the section above). @@ -292,18 +292,18 @@ We can divide GPIO irqchips in two broad categories: - HIERARCHICAL INTERRUPT CHIPS: this means that each GPIO line has a dedicated irq line to a parent interrupt controller one level up. There is no need - to inquire the GPIO hardware to figure out which line has figured, but it - may still be necessary to acknowledge the interrupt and set up the - configuration such as edge sensitivity. + to inquire the GPIO hardware to figure out which line has fired, but it + may still be necessary to acknowledge the interrupt and set up configuration + such as edge sensitivity. Realtime considerations: a realtime compliant GPIO driver should not use spinlock_t or any sleepable APIs (like PM runtime) as part of its irqchip implementation. 
-- spinlock_t should be replaced with raw_spinlock_t [1].
+- spinlock_t should be replaced with raw_spinlock_t.[1]
 - If sleepable APIs have to be used, these can be done from the .irq_bus_lock()
   and .irq_bus_unlock() callbacks, as these are the only slowpath callbacks
-  on an irqchip. Create the callbacks if needed [2].
+  on an irqchip. Create the callbacks if needed.[2]
 
 
 Cascaded GPIO irqchips
@@ -361,7 +361,7 @@ Cascaded GPIO irqchips usually fall in one of three categories:
 
   Realtime considerations: this kind of handlers will be forced threaded on -RT,
   and as result the IRQ core will complain that generic_handle_irq() is called
-  with IRQ enabled and the same work around as for "CHAINED GPIO irqchips" can
+  with IRQ enabled and the same work-around as for "CHAINED GPIO irqchips" can
   be applied.
 
 - NESTED THREADED GPIO IRQCHIPS: these are off-chip GPIO expanders and any
--
cgit v1.2.3

From 919c46c89bff432f45994259d28d774212cf6e8d Mon Sep 17 00:00:00 2001
From: Luca Ceresoli
Date: Fri, 10 May 2019 11:03:39 +0200
Subject: Documentation: gpio: remove duplicated lines

The 'default (active high)' lines are repeated twice. Avoid making people
stare at their screens looking for differences.

Signed-off-by: Luca Ceresoli
Reviewed-by: Bartosz Golaszewski
Signed-off-by: Linus Walleij
---
 Documentation/driver-api/gpio/consumer.rst | 2 --
 1 file changed, 2 deletions(-)

(limited to 'Documentation/driver-api')

diff --git a/Documentation/driver-api/gpio/consumer.rst b/Documentation/driver-api/gpio/consumer.rst
index 5e4d8aa68913..23d68c321c5c 100644
--- a/Documentation/driver-api/gpio/consumer.rst
+++ b/Documentation/driver-api/gpio/consumer.rst
@@ -283,8 +283,6 @@ To summarize::
   gpiod_set_value(desc, 1);        default (active high)   high
   gpiod_set_value(desc, 0);        active low              high
   gpiod_set_value(desc, 1);        active low              low
-  gpiod_set_value(desc, 0);        default (active high)   low
-  gpiod_set_value(desc, 1);        default (active high)   high
   gpiod_set_value(desc, 0);        open drain              low
   gpiod_set_value(desc, 1);        open drain              high impedance
   gpiod_set_value(desc, 0);        open source             high impedance
--
cgit v1.2.3

From be1038846b8063c448baf5ddcdc2387241c4133e Mon Sep 17 00:00:00 2001
From: Mauro Carvalho Chehab
Date: Tue, 4 Jun 2019 11:17:47 -0300
Subject: docs: soundwire: locking: fix tags for a code-block

There's an ascii artwork at Example 1 whose code-block is not properly
indented, causing those warnings.

    Documentation/driver-api/soundwire/locking.rst:50: WARNING: Inconsistent literal block quoting.
    Documentation/driver-api/soundwire/locking.rst:51: WARNING: Line block ends without a blank line.
    Documentation/driver-api/soundwire/locking.rst:55: WARNING: Inline substitution_reference start-string without end-string.
    Documentation/driver-api/soundwire/locking.rst:56: WARNING: Line block ends without a blank line.

Signed-off-by: Mauro Carvalho Chehab Acked-by: Pierre-Louis Bossart Signed-off-by: Vinod Koul --- Documentation/driver-api/soundwire/locking.rst | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'Documentation/driver-api') diff --git a/Documentation/driver-api/soundwire/locking.rst b/Documentation/driver-api/soundwire/locking.rst index 253f73555255..3a7ffb3d87f3 100644 --- a/Documentation/driver-api/soundwire/locking.rst +++ b/Documentation/driver-api/soundwire/locking.rst @@ -44,7 +44,9 @@ Message transfer. b. Transfer message (Read/Write) to Slave1 or broadcast message on Bus in case of bank switch. - c. Release Message lock :: + c. Release Message lock + + :: +----------+ +---------+ | | | | -- cgit v1.2.3 From 7e527e11d672e90f1a3dc8de84e0bfaccda15bba Mon Sep 17 00:00:00 2001 From: Tomas Winkler Date: Mon, 3 Jun 2019 12:14:00 +0300 Subject: mei: docs: move documentation under driver-api Move mei driver documentation under Documentation/driver-api/ Perform some minimal formating changes to produce correct sphinx rendering and add index.rst Signed-off-by: Tomas Winkler Signed-off-by: Greg Kroah-Hartman --- Documentation/driver-api/index.rst | 1 + Documentation/driver-api/mei/index.rst | 22 ++ Documentation/driver-api/mei/mei-client-bus.rst | 152 +++++++++++++ Documentation/driver-api/mei/mei.rst | 250 ++++++++++++++++++++ Documentation/misc-devices/mei/mei-client-bus.txt | 141 ------------ Documentation/misc-devices/mei/mei.txt | 266 ---------------------- MAINTAINERS | 2 +- 7 files changed, 426 insertions(+), 408 deletions(-) create mode 100644 Documentation/driver-api/mei/index.rst create mode 100644 Documentation/driver-api/mei/mei-client-bus.rst create mode 100644 Documentation/driver-api/mei/mei.rst delete mode 100644 Documentation/misc-devices/mei/mei-client-bus.txt delete mode 100644 Documentation/misc-devices/mei/mei.txt (limited to 'Documentation/driver-api') diff --git a/Documentation/driver-api/index.rst 
b/Documentation/driver-api/index.rst index d26308af6036..0dbaa987aa11 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -42,6 +42,7 @@ available subsections can be seen below. target mtdnand miscellaneous + mei/index w1 rapidio s390-drivers diff --git a/Documentation/driver-api/mei/index.rst b/Documentation/driver-api/mei/index.rst new file mode 100644 index 000000000000..35c1117d8366 --- /dev/null +++ b/Documentation/driver-api/mei/index.rst @@ -0,0 +1,22 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. include:: + +=================================================== +Intel(R) Management Engine Interface (Intel(R) MEI) +=================================================== + +**Copyright** |copy| 2019 Intel Corporation + + +.. only:: html + + .. class:: toc-title + + Table of Contents + +.. toctree:: + :maxdepth: 2 + + mei + mei-client-bus diff --git a/Documentation/driver-api/mei/mei-client-bus.rst b/Documentation/driver-api/mei/mei-client-bus.rst new file mode 100644 index 000000000000..a26a85453bdf --- /dev/null +++ b/Documentation/driver-api/mei/mei-client-bus.rst @@ -0,0 +1,152 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================== +Intel(R) Management Engine (ME) Client bus API +============================================== + + +Rationale +========= + +MEI misc character device is useful for dedicated applications to send and receive +data to the many FW appliance found in Intel's ME from the user space. +However for some of the ME functionalities it make sense to leverage existing software +stack and expose them through existing kernel subsystems. + +In order to plug seamlessly into the kernel device driver model we add kernel virtual +bus abstraction on top of the MEI driver. This allows implementing linux kernel drivers +for the various MEI features as a stand alone entities found in their respective subsystem. 
+Existing device drivers can even potentially be re-used by adding an MEI CL bus layer to +the existing code. + + +MEI CL bus API +============== + +A driver implementation for an MEI Client is very similar to existing bus +based device drivers. The driver registers itself as an MEI CL bus driver through +the ``struct mei_cl_driver`` structure: + +.. code-block:: C + + struct mei_cl_driver { + struct device_driver driver; + const char *name; + + const struct mei_cl_device_id *id_table; + + int (*probe)(struct mei_cl_device *dev, const struct mei_cl_id *id); + int (*remove)(struct mei_cl_device *dev); + }; + + struct mei_cl_id { + char name[MEI_NAME_SIZE]; + kernel_ulong_t driver_info; + }; + +The mei_cl_id structure allows the driver to bind itself against a device name. + +To actually register a driver on the ME Client bus one must call the mei_cl_add_driver() +API. This is typically called at module init time. + +Once registered on the ME Client bus, a driver will typically try to do some I/O on +this bus and this should be done through the mei_cl_send() and mei_cl_recv() +routines. The latter is synchronous (blocks and sleeps until data shows up). +In order for drivers to be notified of pending events waiting for them (e.g. +an Rx event) they can register an event handler through the +mei_cl_register_event_cb() routine. Currently only the MEI_EVENT_RX event +will trigger an event handler call and the driver implementation is supposed +to call mei_recv() from the event handler in order to fetch the pending +received buffers. + + +Example +======= + +As a theoretical example let's pretend the ME comes with a "contact" NFC IP. +The driver init and exit routines for this device would look like: + +.. 
code-block:: C + + #define CONTACT_DRIVER_NAME "contact" + + static struct mei_cl_device_id contact_mei_cl_tbl[] = { + { CONTACT_DRIVER_NAME, }, + + /* required last entry */ + { } + }; + MODULE_DEVICE_TABLE(mei_cl, contact_mei_cl_tbl); + + static struct mei_cl_driver contact_driver = { + .id_table = contact_mei_tbl, + .name = CONTACT_DRIVER_NAME, + + .probe = contact_probe, + .remove = contact_remove, + }; + + static int contact_init(void) + { + int r; + + r = mei_cl_driver_register(&contact_driver); + if (r) { + pr_err(CONTACT_DRIVER_NAME ": driver registration failed\n"); + return r; + } + + return 0; + } + + static void __exit contact_exit(void) + { + mei_cl_driver_unregister(&contact_driver); + } + + module_init(contact_init); + module_exit(contact_exit); + +And the driver's simplified probe routine would look like that: + +.. code-block:: C + + int contact_probe(struct mei_cl_device *dev, struct mei_cl_device_id *id) + { + struct contact_driver *contact; + + [...] + mei_cl_enable_device(dev); + + mei_cl_register_event_cb(dev, contact_event_cb, contact); + + return 0; + } + +In the probe routine the driver first enable the MEI device and then registers +an ME bus event handler which is as close as it can get to registering a +threaded IRQ handler. +The handler implementation will typically call some I/O routine depending on +the pending events: + +#define MAX_NFC_PAYLOAD 128 + +.. 
code-block:: C + + static void contact_event_cb(struct mei_cl_device *dev, u32 events, + void *context) + { + struct contact_driver *contact = context; + + if (events & BIT(MEI_EVENT_RX)) { + u8 payload[MAX_NFC_PAYLOAD]; + int payload_size; + + payload_size = mei_recv(dev, payload, MAX_NFC_PAYLOAD); + if (payload_size <= 0) + return; + + /* Hook to the NFC subsystem */ + nfc_hci_recv_frame(contact->hdev, payload, payload_size); + } + } diff --git a/Documentation/driver-api/mei/mei.rst b/Documentation/driver-api/mei/mei.rst new file mode 100644 index 000000000000..5aa3a5e6496a --- /dev/null +++ b/Documentation/driver-api/mei/mei.rst @@ -0,0 +1,250 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Introduction +============ + +The Intel Management Engine (Intel ME) is an isolated and protected computing +resource (Co-processor) residing inside certain Intel chipsets. The Intel ME +provides support for computer/IT management features. The feature set +depends on the Intel chipset SKU. + +The Intel Management Engine Interface (Intel MEI, previously known as HECI) +is the interface between the Host and Intel ME. This interface is exposed +to the host as a PCI device. The Intel MEI Driver is in charge of the +communication channel between a host application and the Intel ME feature. + +Each Intel ME feature (Intel ME Client) is addressed by a GUID/UUID and +each client has its own protocol. The protocol is message-based with a +header and payload up to 512 bytes. + +Prominent usage of the Intel ME Interface is to communicate with Intel(R) +Active Management Technology (Intel AMT) implemented in firmware running on +the Intel ME. + +Intel AMT provides the ability to manage a host remotely out-of-band (OOB) +even when the operating system running on the host processor has crashed or +is in a sleep state. 
+ +Some examples of Intel AMT usage are: + - Monitoring hardware state and platform components + - Remote power off/on (useful for green computing or overnight IT + maintenance) + - OS updates + - Storage of useful platform information such as software assets + - Built-in hardware KVM + - Selective network isolation of Ethernet and IP protocol flows based + on policies set by a remote management console + - IDE device redirection from remote management console + +Intel AMT (OOB) communication is based on SOAP (deprecated +starting with Release 6.0) over HTTP/S or WS-Management protocol over +HTTP/S that are received from a remote management console application. + +For more information about Intel AMT: +http://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide + + +Intel MEI Driver +================ + +The driver exposes a misc device called /dev/mei. + +An application maintains communication with an Intel ME feature while +/dev/mei is open. The binding to a specific feature is performed by calling +MEI_CONNECT_CLIENT_IOCTL, which passes the desired UUID. +The number of instances of an Intel ME feature that can be opened +at the same time depends on the Intel ME feature, but most of the +features allow only a single instance. + +The Intel AMT Host Interface (Intel AMTHI) feature supports multiple +simultaneous user connected applications. The Intel MEI driver +handles this internally by maintaining request queues for the applications. + +The driver is transparent to data that are passed between firmware feature +and host application. + +Because some of the Intel ME features can change the system +configuration, the driver by default allows only a privileged +user to access it. + +A code snippet for an application communicating with Intel AMTHI client: + +.. 
code-block:: C + + struct mei_connect_client_data data; + fd = open(MEI_DEVICE); + + data.d.in_client_uuid = AMTHI_UUID; + + ioctl(fd, IOCTL_MEI_CONNECT_CLIENT, &data); + + printf("Ver=%d, MaxLen=%ld\n", + data.d.in_client_uuid.protocol_version, + data.d.in_client_uuid.max_msg_length); + + [...] + + write(fd, amthi_req_data, amthi_req_data_len); + + [...] + + read(fd, &amthi_res_data, amthi_res_data_len); + + [...] + close(fd); + + +IOCTLs +====== + +The Intel MEI Driver supports the following IOCTL commands: + IOCTL_MEI_CONNECT_CLIENT Connect to firmware Feature (client). + + usage: + struct mei_connect_client_data clientData; + ioctl(fd, IOCTL_MEI_CONNECT_CLIENT, &clientData); + + inputs: + mei_connect_client_data struct contain the following + input field: + + in_client_uuid - UUID of the FW Feature that needs + to connect to. + outputs: + out_client_properties - Client Properties: MTU and Protocol Version. + + error returns: + EINVAL Wrong IOCTL Number + ENODEV Device or Connection is not initialized or ready. (e.g. Wrong UUID) + ENOMEM Unable to allocate memory to client internal data. + EFAULT Fatal Error (e.g. Unable to access user input data) + EBUSY Connection Already Open + + Notes: + max_msg_length (MTU) in client properties describes the maximum + data that can be sent or received. (e.g. if MTU=2K, can send + requests up to bytes 2k and received responses up to 2k bytes). + + IOCTL_MEI_NOTIFY_SET: enable or disable event notifications + + Usage: + uint32_t enable; + ioctl(fd, IOCTL_MEI_NOTIFY_SET, &enable); + + Inputs: + uint32_t enable = 1; + or + uint32_t enable[disable] = 0; + + Error returns: + EINVAL Wrong IOCTL Number + ENODEV Device is not initialized or the client not connected + ENOMEM Unable to allocate memory to client internal data. + EFAULT Fatal Error (e.g. 
Unable to access user input data) + EOPNOTSUPP if the device doesn't support the feature + + Notes: + The client must be connected in order to enable notification events + + + IOCTL_MEI_NOTIFY_GET : retrieve event + + Usage: + uint32_t event; + ioctl(fd, IOCTL_MEI_NOTIFY_GET, &event); + + Outputs: + 1 - if an event is pending + 0 - if there is no even pending + + Error returns: + EINVAL Wrong IOCTL Number + ENODEV Device is not initialized or the client not connected + ENOMEM Unable to allocate memory to client internal data. + EFAULT Fatal Error (e.g. Unable to access user input data) + EOPNOTSUPP if the device doesn't support the feature + + Notes: + The client must be connected and event notification has to be enabled + in order to receive an event + + +Intel ME Applications +===================== + + 1) Intel Local Management Service (Intel LMS) + + Applications running locally on the platform communicate with Intel AMT Release + 2.0 and later releases in the same way that network applications do via SOAP + over HTTP (deprecated starting with Release 6.0) or with WS-Management over + SOAP over HTTP. This means that some Intel AMT features can be accessed from a + local application using the same network interface as a remote application + communicating with Intel AMT over the network. + + When a local application sends a message addressed to the local Intel AMT host + name, the Intel LMS, which listens for traffic directed to the host name, + intercepts the message and routes it to the Intel MEI. 
+ For more information: + http://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide + Under "About Intel AMT" => "Local Access" + + For downloading Intel LMS: + http://software.intel.com/en-us/articles/download-the-latest-intel-amt-open-source-drivers/ + + The Intel LMS opens a connection using the Intel MEI driver to the Intel LMS + firmware feature using a defined UUID and then communicates with the feature + using a protocol called Intel AMT Port Forwarding Protocol (Intel APF protocol). + The protocol is used to maintain multiple sessions with Intel AMT from a + single application. + + See the protocol specification in the Intel AMT Software Development Kit (SDK) + http://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide + Under "SDK Resources" => "Intel(R) vPro(TM) Gateway (MPS)" + => "Information for Intel(R) vPro(TM) Gateway Developers" + => "Description of the Intel AMT Port Forwarding (APF) Protocol" + + 2) Intel AMT Remote configuration using a Local Agent + + A Local Agent enables IT personnel to configure Intel AMT out-of-the-box + without requiring installing additional data to enable setup. The remote + configuration process may involve an ISV-developed remote configuration + agent that runs on the host. + For more information: + http://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide + Under "Setup and Configuration of Intel AMT" => + "SDK Tools Supporting Setup and Configuration" => + "Using the Local Agent Sample" + + An open source Intel AMT configuration utility, implementing a local agent + that accesses the Intel MEI driver, can be found here: + http://software.intel.com/en-us/articles/download-the-latest-intel-amt-open-source-drivers/ + + +Intel AMT OS Health Watchdog +============================ + +The Intel AMT Watchdog is an OS Health (Hang/Crash) watchdog. +Whenever the OS hangs or crashes, Intel AMT will send an event +to any subscriber to this event. 
This mechanism means that +IT knows when a platform crashes even when there is a hard failure on the host. + +The Intel AMT Watchdog is composed of two parts: + 1) Firmware feature - receives the heartbeats + and sends an event when the heartbeats stop. + 2) Intel MEI iAMT watchdog driver - connects to the watchdog feature, + configures the watchdog and sends the heartbeats. + +The Intel iAMT watchdog MEI driver uses the kernel watchdog API to configure +the Intel AMT Watchdog and to send heartbeats to it. The default timeout of the +watchdog is 120 seconds. + +If the Intel AMT is not enabled in the firmware then the watchdog client won't enumerate +on the me client bus and watchdog devices won't be exposed. + +Supported Chipsets +================== +82X38/X48 Express and newer + + +--- +linux-mei@linux.intel.com diff --git a/Documentation/misc-devices/mei/mei-client-bus.txt b/Documentation/misc-devices/mei/mei-client-bus.txt deleted file mode 100644 index 743be4ec8989..000000000000 --- a/Documentation/misc-devices/mei/mei-client-bus.txt +++ /dev/null @@ -1,141 +0,0 @@ -Intel(R) Management Engine (ME) Client bus API -============================================== - - -Rationale -========= - -MEI misc character device is useful for dedicated applications to send and receive -data to the many FW appliance found in Intel's ME from the user space. -However for some of the ME functionalities it make sense to leverage existing software -stack and expose them through existing kernel subsystems. - -In order to plug seamlessly into the kernel device driver model we add kernel virtual -bus abstraction on top of the MEI driver. This allows implementing linux kernel drivers -for the various MEI features as a stand alone entities found in their respective subsystem. -Existing device drivers can even potentially be re-used by adding an MEI CL bus layer to -the existing code. 
- - -MEI CL bus API -============== - -A driver implementation for an MEI Client is very similar to existing bus -based device drivers. The driver registers itself as an MEI CL bus driver through -the mei_cl_driver structure: - -struct mei_cl_driver { - struct device_driver driver; - const char *name; - - const struct mei_cl_device_id *id_table; - - int (*probe)(struct mei_cl_device *dev, const struct mei_cl_id *id); - int (*remove)(struct mei_cl_device *dev); -}; - -struct mei_cl_id { - char name[MEI_NAME_SIZE]; - kernel_ulong_t driver_info; -}; - -The mei_cl_id structure allows the driver to bind itself against a device name. - -To actually register a driver on the ME Client bus one must call the mei_cl_add_driver() -API. This is typically called at module init time. - -Once registered on the ME Client bus, a driver will typically try to do some I/O on -this bus and this should be done through the mei_cl_send() and mei_cl_recv() -routines. The latter is synchronous (blocks and sleeps until data shows up). -In order for drivers to be notified of pending events waiting for them (e.g. -an Rx event) they can register an event handler through the -mei_cl_register_event_cb() routine. Currently only the MEI_EVENT_RX event -will trigger an event handler call and the driver implementation is supposed -to call mei_recv() from the event handler in order to fetch the pending -received buffers. - - -Example -======= - -As a theoretical example let's pretend the ME comes with a "contact" NFC IP. 
-The driver init and exit routines for this device would look like: - -#define CONTACT_DRIVER_NAME "contact" - -static struct mei_cl_device_id contact_mei_cl_tbl[] = { - { CONTACT_DRIVER_NAME, }, - - /* required last entry */ - { } -}; -MODULE_DEVICE_TABLE(mei_cl, contact_mei_cl_tbl); - -static struct mei_cl_driver contact_driver = { - .id_table = contact_mei_tbl, - .name = CONTACT_DRIVER_NAME, - - .probe = contact_probe, - .remove = contact_remove, -}; - -static int contact_init(void) -{ - int r; - - r = mei_cl_driver_register(&contact_driver); - if (r) { - pr_err(CONTACT_DRIVER_NAME ": driver registration failed\n"); - return r; - } - - return 0; -} - -static void __exit contact_exit(void) -{ - mei_cl_driver_unregister(&contact_driver); -} - -module_init(contact_init); -module_exit(contact_exit); - -And the driver's simplified probe routine would look like that: - -int contact_probe(struct mei_cl_device *dev, struct mei_cl_device_id *id) -{ - struct contact_driver *contact; - - [...] - mei_cl_enable_device(dev); - - mei_cl_register_event_cb(dev, contact_event_cb, contact); - - return 0; -} - -In the probe routine the driver first enable the MEI device and then registers -an ME bus event handler which is as close as it can get to registering a -threaded IRQ handler. 
-The handler implementation will typically call some I/O routine depending on -the pending events: - -#define MAX_NFC_PAYLOAD 128 - -static void contact_event_cb(struct mei_cl_device *dev, u32 events, - void *context) -{ - struct contact_driver *contact = context; - - if (events & BIT(MEI_EVENT_RX)) { - u8 payload[MAX_NFC_PAYLOAD]; - int payload_size; - - payload_size = mei_recv(dev, payload, MAX_NFC_PAYLOAD); - if (payload_size <= 0) - return; - - /* Hook to the NFC subsystem */ - nfc_hci_recv_frame(contact->hdev, payload, payload_size); - } -} diff --git a/Documentation/misc-devices/mei/mei.txt b/Documentation/misc-devices/mei/mei.txt deleted file mode 100644 index 2b80a0cd621f..000000000000 --- a/Documentation/misc-devices/mei/mei.txt +++ /dev/null @@ -1,266 +0,0 @@ -Intel(R) Management Engine Interface (Intel(R) MEI) -=================================================== - -Introduction -============ - -The Intel Management Engine (Intel ME) is an isolated and protected computing -resource (Co-processor) residing inside certain Intel chipsets. The Intel ME -provides support for computer/IT management features. The feature set -depends on the Intel chipset SKU. - -The Intel Management Engine Interface (Intel MEI, previously known as HECI) -is the interface between the Host and Intel ME. This interface is exposed -to the host as a PCI device. The Intel MEI Driver is in charge of the -communication channel between a host application and the Intel ME feature. - -Each Intel ME feature (Intel ME Client) is addressed by a GUID/UUID and -each client has its own protocol. The protocol is message-based with a -header and payload up to 512 bytes. - -Prominent usage of the Intel ME Interface is to communicate with Intel(R) -Active Management Technology (Intel AMT) implemented in firmware running on -the Intel ME. 
- -Intel AMT provides the ability to manage a host remotely out-of-band (OOB) -even when the operating system running on the host processor has crashed or -is in a sleep state. - -Some examples of Intel AMT usage are: - - Monitoring hardware state and platform components - - Remote power off/on (useful for green computing or overnight IT - maintenance) - - OS updates - - Storage of useful platform information such as software assets - - Built-in hardware KVM - - Selective network isolation of Ethernet and IP protocol flows based - on policies set by a remote management console - - IDE device redirection from remote management console - -Intel AMT (OOB) communication is based on SOAP (deprecated -starting with Release 6.0) over HTTP/S or WS-Management protocol over -HTTP/S that are received from a remote management console application. - -For more information about Intel AMT: -http://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide - - -Intel MEI Driver -================ - -The driver exposes a misc device called /dev/mei. - -An application maintains communication with an Intel ME feature while -/dev/mei is open. The binding to a specific feature is performed by calling -MEI_CONNECT_CLIENT_IOCTL, which passes the desired UUID. -The number of instances of an Intel ME feature that can be opened -at the same time depends on the Intel ME feature, but most of the -features allow only a single instance. - -The Intel AMT Host Interface (Intel AMTHI) feature supports multiple -simultaneous user connected applications. The Intel MEI driver -handles this internally by maintaining request queues for the applications. - -The driver is transparent to data that are passed between firmware feature -and host application. - -Because some of the Intel ME features can change the system -configuration, the driver by default allows only a privileged -user to access it. 
- -A code snippet for an application communicating with Intel AMTHI client: - - struct mei_connect_client_data data; - fd = open(MEI_DEVICE); - - data.d.in_client_uuid = AMTHI_UUID; - - ioctl(fd, IOCTL_MEI_CONNECT_CLIENT, &data); - - printf("Ver=%d, MaxLen=%ld\n", - data.d.in_client_uuid.protocol_version, - data.d.in_client_uuid.max_msg_length); - - [...] - - write(fd, amthi_req_data, amthi_req_data_len); - - [...] - - read(fd, &amthi_res_data, amthi_res_data_len); - - [...] - close(fd); - - -IOCTL -===== - -The Intel MEI Driver supports the following IOCTL commands: - IOCTL_MEI_CONNECT_CLIENT Connect to firmware Feature (client). - - usage: - struct mei_connect_client_data clientData; - ioctl(fd, IOCTL_MEI_CONNECT_CLIENT, &clientData); - - inputs: - mei_connect_client_data struct contain the following - input field: - - in_client_uuid - UUID of the FW Feature that needs - to connect to. - outputs: - out_client_properties - Client Properties: MTU and Protocol Version. - - error returns: - EINVAL Wrong IOCTL Number - ENODEV Device or Connection is not initialized or ready. - (e.g. Wrong UUID) - ENOMEM Unable to allocate memory to client internal data. - EFAULT Fatal Error (e.g. Unable to access user input data) - EBUSY Connection Already Open - - Notes: - max_msg_length (MTU) in client properties describes the maximum - data that can be sent or received. (e.g. if MTU=2K, can send - requests up to bytes 2k and received responses up to 2k bytes). - - IOCTL_MEI_NOTIFY_SET: enable or disable event notifications - - Usage: - uint32_t enable; - ioctl(fd, IOCTL_MEI_NOTIFY_SET, &enable); - - Inputs: - uint32_t enable = 1; - or - uint32_t enable[disable] = 0; - - Error returns: - EINVAL Wrong IOCTL Number - ENODEV Device is not initialized or the client not connected - ENOMEM Unable to allocate memory to client internal data. - EFAULT Fatal Error (e.g. 
Unable to access user input data) - EOPNOTSUPP if the device doesn't support the feature - - Notes: - The client must be connected in order to enable notification events - - - IOCTL_MEI_NOTIFY_GET : retrieve event - - Usage: - uint32_t event; - ioctl(fd, IOCTL_MEI_NOTIFY_GET, &event); - - Outputs: - 1 - if an event is pending - 0 - if there is no even pending - - Error returns: - EINVAL Wrong IOCTL Number - ENODEV Device is not initialized or the client not connected - ENOMEM Unable to allocate memory to client internal data. - EFAULT Fatal Error (e.g. Unable to access user input data) - EOPNOTSUPP if the device doesn't support the feature - - Notes: - The client must be connected and event notification has to be enabled - in order to receive an event - - -Intel ME Applications -===================== - - 1) Intel Local Management Service (Intel LMS) - - Applications running locally on the platform communicate with Intel AMT Release - 2.0 and later releases in the same way that network applications do via SOAP - over HTTP (deprecated starting with Release 6.0) or with WS-Management over - SOAP over HTTP. This means that some Intel AMT features can be accessed from a - local application using the same network interface as a remote application - communicating with Intel AMT over the network. - - When a local application sends a message addressed to the local Intel AMT host - name, the Intel LMS, which listens for traffic directed to the host name, - intercepts the message and routes it to the Intel MEI. 
- For more information: - http://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide - Under "About Intel AMT" => "Local Access" - - For downloading Intel LMS: - http://software.intel.com/en-us/articles/download-the-latest-intel-amt-open-source-drivers/ - - The Intel LMS opens a connection using the Intel MEI driver to the Intel LMS - firmware feature using a defined UUID and then communicates with the feature - using a protocol called Intel AMT Port Forwarding Protocol (Intel APF protocol). - The protocol is used to maintain multiple sessions with Intel AMT from a - single application. - - See the protocol specification in the Intel AMT Software Development Kit (SDK) - http://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide - Under "SDK Resources" => "Intel(R) vPro(TM) Gateway (MPS)" - => "Information for Intel(R) vPro(TM) Gateway Developers" - => "Description of the Intel AMT Port Forwarding (APF) Protocol" - - 2) Intel AMT Remote configuration using a Local Agent - - A Local Agent enables IT personnel to configure Intel AMT out-of-the-box - without requiring installing additional data to enable setup. The remote - configuration process may involve an ISV-developed remote configuration - agent that runs on the host. - For more information: - http://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide - Under "Setup and Configuration of Intel AMT" => - "SDK Tools Supporting Setup and Configuration" => - "Using the Local Agent Sample" - - An open source Intel AMT configuration utility, implementing a local agent - that accesses the Intel MEI driver, can be found here: - http://software.intel.com/en-us/articles/download-the-latest-intel-amt-open-source-drivers/ - - -Intel AMT OS Health Watchdog -============================ - -The Intel AMT Watchdog is an OS Health (Hang/Crash) watchdog. -Whenever the OS hangs or crashes, Intel AMT will send an event -to any subscriber to this event. 
This mechanism means that -IT knows when a platform crashes even when there is a hard failure on the host. - -The Intel AMT Watchdog is composed of two parts: - 1) Firmware feature - receives the heartbeats - and sends an event when the heartbeats stop. - 2) Intel MEI iAMT watchdog driver - connects to the watchdog feature, - configures the watchdog and sends the heartbeats. - -The Intel iAMT watchdog MEI driver uses the kernel watchdog API to configure -the Intel AMT Watchdog and to send heartbeats to it. The default timeout of the -watchdog is 120 seconds. - -If the Intel AMT is not enabled in the firmware then the watchdog client won't enumerate -on the me client bus and watchdog devices won't be exposed. - - -Supported Chipsets -================== - -7 Series Chipset Family -6 Series Chipset Family -5 Series Chipset Family -4 Series Chipset Family -Mobile 4 Series Chipset Family -ICH9 -82946GZ/GL -82G35 Express -82Q963/Q965 -82P965/G965 -Mobile PM965/GM965 -Mobile GME965/GLE960 -82Q35 Express -82G33/G31/P35/P31 Express -82Q33 Express -82X38/X48 Express - ---- -linux-mei@linux.intel.com diff --git a/MAINTAINERS b/MAINTAINERS index 5cfbea4ce575..bfe48cbea84c 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -8021,7 +8021,7 @@ F: include/uapi/linux/mei.h F: include/linux/mei_cl_bus.h F: drivers/misc/mei/* F: drivers/watchdog/mei_wdt.c -F: Documentation/misc-devices/mei/* +F: Documentation/driver-api/mei/* F: samples/mei/* INTEL MENLOW THERMAL DRIVER -- cgit v1.2.3 From 815d0f26c104873eb829e24510383d4d098417dd Mon Sep 17 00:00:00 2001 From: Tomas Winkler Date: Mon, 3 Jun 2019 12:14:01 +0300 Subject: mei: docs: move iamt docs to a iamt.rst file Move Intel AMT documentation to a separate file.
Signed-off-by: Tomas Winkler Signed-off-by: Greg Kroah-Hartman --- Documentation/driver-api/mei/iamt.rst | 106 +++++++++++++++++++++++++++++++++ Documentation/driver-api/mei/index.rst | 1 + Documentation/driver-api/mei/mei.rst | 100 ------------------------------- 3 files changed, 107 insertions(+), 100 deletions(-) create mode 100644 Documentation/driver-api/mei/iamt.rst (limited to 'Documentation/driver-api') diff --git a/Documentation/driver-api/mei/iamt.rst b/Documentation/driver-api/mei/iamt.rst new file mode 100644 index 000000000000..6dcf5b16e958 --- /dev/null +++ b/Documentation/driver-api/mei/iamt.rst @@ -0,0 +1,106 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Intel(R) Active Management Technology (Intel AMT) +================================================= + +Prominent usage of the Intel ME Interface is to communicate with Intel(R) +Active Management Technology (Intel AMT) implemented in firmware running on +the Intel ME. + +Intel AMT provides the ability to manage a host remotely out-of-band (OOB) +even when the operating system running on the host processor has crashed or +is in a sleep state. + +Some examples of Intel AMT usage are: + - Monitoring hardware state and platform components + - Remote power off/on (useful for green computing or overnight IT + maintenance) + - OS updates + - Storage of useful platform information such as software assets + - Built-in hardware KVM + - Selective network isolation of Ethernet and IP protocol flows based + on policies set by a remote management console + - IDE device redirection from remote management console + +Intel AMT (OOB) communication is based on SOAP (deprecated +starting with Release 6.0) over HTTP/S or WS-Management protocol over +HTTP/S that are received from a remote management console application. 
+ +For more information about Intel AMT: +http://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide + + +Intel AMT Applications +====================== + + 1) Intel Local Management Service (Intel LMS) + + Applications running locally on the platform communicate with Intel AMT Release + 2.0 and later releases in the same way that network applications do via SOAP + over HTTP (deprecated starting with Release 6.0) or with WS-Management over + SOAP over HTTP. This means that some Intel AMT features can be accessed from a + local application using the same network interface as a remote application + communicating with Intel AMT over the network. + + When a local application sends a message addressed to the local Intel AMT host + name, the Intel LMS, which listens for traffic directed to the host name, + intercepts the message and routes it to the Intel MEI. + For more information: + http://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide + Under "About Intel AMT" => "Local Access" + + For downloading Intel LMS: + http://software.intel.com/en-us/articles/download-the-latest-intel-amt-open-source-drivers/ + + The Intel LMS opens a connection using the Intel MEI driver to the Intel LMS + firmware feature using a defined UUID and then communicates with the feature + using a protocol called Intel AMT Port Forwarding Protocol (Intel APF protocol). + The protocol is used to maintain multiple sessions with Intel AMT from a + single application. 
+ + See the protocol specification in the Intel AMT Software Development Kit (SDK) + http://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide + Under "SDK Resources" => "Intel(R) vPro(TM) Gateway (MPS)" + => "Information for Intel(R) vPro(TM) Gateway Developers" + => "Description of the Intel AMT Port Forwarding (APF) Protocol" + + 2) Intel AMT Remote configuration using a Local Agent + + A Local Agent enables IT personnel to configure Intel AMT out-of-the-box + without requiring installing additional data to enable setup. The remote + configuration process may involve an ISV-developed remote configuration + agent that runs on the host. + For more information: + http://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide + Under "Setup and Configuration of Intel AMT" => + "SDK Tools Supporting Setup and Configuration" => + "Using the Local Agent Sample" + + An open source Intel AMT configuration utility, implementing a local agent + that accesses the Intel MEI driver, can be found here: + http://software.intel.com/en-us/articles/download-the-latest-intel-amt-open-source-drivers/ + + +Intel AMT OS Health Watchdog +============================ + +The Intel AMT Watchdog is an OS Health (Hang/Crash) watchdog. +Whenever the OS hangs or crashes, Intel AMT will send an event +to any subscriber to this event. This mechanism means that +IT knows when a platform crashes even when there is a hard failure on the host. + +The Intel AMT Watchdog is composed of two parts: + 1) Firmware feature - receives the heartbeats + and sends an event when the heartbeats stop. + 2) Intel MEI iAMT watchdog driver - connects to the watchdog feature, + configures the watchdog and sends the heartbeats. + +The Intel iAMT watchdog MEI driver uses the kernel watchdog API to configure +the Intel AMT Watchdog and to send heartbeats to it. The default timeout of the +watchdog is 120 seconds. 
+ +If the Intel AMT is not enabled in the firmware then the watchdog client won't enumerate +on the me client bus and watchdog devices won't be exposed. + +--- +linux-mei@linux.intel.com diff --git a/Documentation/driver-api/mei/index.rst b/Documentation/driver-api/mei/index.rst index 35c1117d8366..d261afac6852 100644 --- a/Documentation/driver-api/mei/index.rst +++ b/Documentation/driver-api/mei/index.rst @@ -20,3 +20,4 @@ Intel(R) Management Engine Interface (Intel(R) MEI) mei mei-client-bus + iamt diff --git a/Documentation/driver-api/mei/mei.rst b/Documentation/driver-api/mei/mei.rst index 5aa3a5e6496a..c7f10a4b46ff 100644 --- a/Documentation/driver-api/mei/mei.rst +++ b/Documentation/driver-api/mei/mei.rst @@ -17,33 +17,6 @@ Each Intel ME feature (Intel ME Client) is addressed by a GUID/UUID and each client has its own protocol. The protocol is message-based with a header and payload up to 512 bytes. -Prominent usage of the Intel ME Interface is to communicate with Intel(R) -Active Management Technology (Intel AMT) implemented in firmware running on -the Intel ME. - -Intel AMT provides the ability to manage a host remotely out-of-band (OOB) -even when the operating system running on the host processor has crashed or -is in a sleep state. - -Some examples of Intel AMT usage are: - - Monitoring hardware state and platform components - - Remote power off/on (useful for green computing or overnight IT - maintenance) - - OS updates - - Storage of useful platform information such as software assets - - Built-in hardware KVM - - Selective network isolation of Ethernet and IP protocol flows based - on policies set by a remote management console - - IDE device redirection from remote management console - -Intel AMT (OOB) communication is based on SOAP (deprecated -starting with Release 6.0) over HTTP/S or WS-Management protocol over -HTTP/S that are received from a remote management console application. 
- -For more information about Intel AMT: -http://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide - - Intel MEI Driver ================ @@ -169,82 +142,9 @@ The Intel MEI Driver supports the following IOCTL commands: in order to receive an event -Intel ME Applications -===================== - - 1) Intel Local Management Service (Intel LMS) - - Applications running locally on the platform communicate with Intel AMT Release - 2.0 and later releases in the same way that network applications do via SOAP - over HTTP (deprecated starting with Release 6.0) or with WS-Management over - SOAP over HTTP. This means that some Intel AMT features can be accessed from a - local application using the same network interface as a remote application - communicating with Intel AMT over the network. - - When a local application sends a message addressed to the local Intel AMT host - name, the Intel LMS, which listens for traffic directed to the host name, - intercepts the message and routes it to the Intel MEI. - For more information: - http://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide - Under "About Intel AMT" => "Local Access" - - For downloading Intel LMS: - http://software.intel.com/en-us/articles/download-the-latest-intel-amt-open-source-drivers/ - - The Intel LMS opens a connection using the Intel MEI driver to the Intel LMS - firmware feature using a defined UUID and then communicates with the feature - using a protocol called Intel AMT Port Forwarding Protocol (Intel APF protocol). - The protocol is used to maintain multiple sessions with Intel AMT from a - single application. 
- - See the protocol specification in the Intel AMT Software Development Kit (SDK) - http://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide - Under "SDK Resources" => "Intel(R) vPro(TM) Gateway (MPS)" - => "Information for Intel(R) vPro(TM) Gateway Developers" - => "Description of the Intel AMT Port Forwarding (APF) Protocol" - - 2) Intel AMT Remote configuration using a Local Agent - - A Local Agent enables IT personnel to configure Intel AMT out-of-the-box - without requiring installing additional data to enable setup. The remote - configuration process may involve an ISV-developed remote configuration - agent that runs on the host. - For more information: - http://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide - Under "Setup and Configuration of Intel AMT" => - "SDK Tools Supporting Setup and Configuration" => - "Using the Local Agent Sample" - - An open source Intel AMT configuration utility, implementing a local agent - that accesses the Intel MEI driver, can be found here: - http://software.intel.com/en-us/articles/download-the-latest-intel-amt-open-source-drivers/ - - -Intel AMT OS Health Watchdog -============================ - -The Intel AMT Watchdog is an OS Health (Hang/Crash) watchdog. -Whenever the OS hangs or crashes, Intel AMT will send an event -to any subscriber to this event. This mechanism means that -IT knows when a platform crashes even when there is a hard failure on the host. - -The Intel AMT Watchdog is composed of two parts: - 1) Firmware feature - receives the heartbeats - and sends an event when the heartbeats stop. - 2) Intel MEI iAMT watchdog driver - connects to the watchdog feature, - configures the watchdog and sends the heartbeats. - -The Intel iAMT watchdog MEI driver uses the kernel watchdog API to configure -the Intel AMT Watchdog and to send heartbeats to it. The default timeout of the -watchdog is 120 seconds. 
- -If the Intel AMT is not enabled in the firmware then the watchdog client won't enumerate -on the me client bus and watchdog devices won't be exposed. Supported Chipsets ================== 82X38/X48 Express and newer - ---- linux-mei@linux.intel.com -- cgit v1.2.3 From 6080e0cff2bf7108d3f2855a7177b1f7f1830035 Mon Sep 17 00:00:00 2001 From: Tomas Winkler Date: Mon, 3 Jun 2019 12:14:03 +0300 Subject: mei: docs: update mei client bus documentation. The mei client bus API has changed significantly since it was last documented, and the documentation required an update. Signed-off-by: Tomas Winkler Signed-off-by: Greg Kroah-Hartman --- Documentation/driver-api/mei/mei-client-bus.rst | 162 +++++++++++++----------- 1 file changed, 85 insertions(+), 77 deletions(-) (limited to 'Documentation/driver-api') diff --git a/Documentation/driver-api/mei/mei-client-bus.rst b/Documentation/driver-api/mei/mei-client-bus.rst index a26a85453bdf..7310dd45c484 100644 --- a/Documentation/driver-api/mei/mei-client-bus.rst +++ b/Documentation/driver-api/mei/mei-client-bus.rst @@ -8,13 +8,13 @@ Intel(R) Management Engine (ME) Client bus API Rationale ========= -MEI misc character device is useful for dedicated applications to send and receive +The MEI character device is useful for dedicated applications to send and receive data to the many FW appliance found in Intel's ME from the user space. -However for some of the ME functionalities it make sense to leverage existing software +However, for some of the ME functionalities it makes sense to leverage existing software stack and expose them through existing kernel subsystems. In order to plug seamlessly into the kernel device driver model we add kernel virtual -bus abstraction on top of the MEI driver. This allows implementing linux kernel drivers +bus abstraction on top of the MEI driver. This allows implementing Linux kernel drivers for the various MEI features as a stand alone entities found in their respective subsystem.
Existing device drivers can even potentially be re-used by adding an MEI CL bus layer to the existing code. @@ -23,9 +23,9 @@ the existing code. MEI CL bus API ============== -A driver implementation for an MEI Client is very similar to existing bus +A driver implementation for an MEI Client is very similar to any other existing bus based device drivers. The driver registers itself as an MEI CL bus driver through -the ``struct mei_cl_driver`` structure: +the ``struct mei_cl_driver`` structure defined in :file:`include/linux/mei_cl_bus.h` .. code-block:: C @@ -39,25 +39,38 @@ the ``struct mei_cl_driver`` structure: int (*remove)(struct mei_cl_device *dev); }; - struct mei_cl_id { - char name[MEI_NAME_SIZE]; + + +The mei_cl_device_id structure defined in :file:`include/linux/mod_devicetable.h` allows a +driver to bind itself against a device name. + +.. code-block:: C + + struct mei_cl_device_id { + char name[MEI_CL_NAME_SIZE]; + uuid_le uuid; + __u8 version; kernel_ulong_t driver_info; }; -The mei_cl_id structure allows the driver to bind itself against a device name. +To actually register a driver on the ME Client bus one must call the :c:func:`mei_cl_add_driver` +API. This is typically called at module initialization time. + +Once the driver is registered and bound to the device, a driver will typically +try to do some I/O on this bus and this should be done through the :c:func:`mei_cl_send` +and :c:func:`mei_cl_recv` functions. More detailed information is in the :ref:`api` section. + +In order for a driver to be notified about pending traffic or events, the driver +should register a callback via the :c:func:`mei_cldev_register_rx_cb` and +:c:func:`mei_cldev_register_notify_cb` functions, respectively. -To actually register a driver on the ME Client bus one must call the mei_cl_add_driver() -API. This is typically called at module init time. +.. _api: + +API: +---- +.. 
kernel-doc:: drivers/misc/mei/bus.c + :export: drivers/misc/mei/bus.c -Once registered on the ME Client bus, a driver will typically try to do some I/O on -this bus and this should be done through the mei_cl_send() and mei_cl_recv() -routines. The latter is synchronous (blocks and sleeps until data shows up). -In order for drivers to be notified of pending events waiting for them (e.g. -an Rx event) they can register an event handler through the -mei_cl_register_event_cb() routine. Currently only the MEI_EVENT_RX event -will trigger an event handler call and the driver implementation is supposed -to call mei_recv() from the event handler in order to fetch the pending -received buffers. Example @@ -68,85 +81,80 @@ The driver init and exit routines for this device would look like: .. code-block:: C - #define CONTACT_DRIVER_NAME "contact" + #define CONTACT_DRIVER_NAME "contact" - static struct mei_cl_device_id contact_mei_cl_tbl[] = { - { CONTACT_DRIVER_NAME, }, + static struct mei_cl_device_id contact_mei_cl_tbl[] = { + { CONTACT_DRIVER_NAME, }, - /* required last entry */ - { } - }; - MODULE_DEVICE_TABLE(mei_cl, contact_mei_cl_tbl); + /* required last entry */ + { } + }; + MODULE_DEVICE_TABLE(mei_cl, contact_mei_cl_tbl); - static struct mei_cl_driver contact_driver = { - .id_table = contact_mei_tbl, - .name = CONTACT_DRIVER_NAME, + static struct mei_cl_driver contact_driver = { + .id_table = contact_mei_cl_tbl, + .name = CONTACT_DRIVER_NAME, - .probe = contact_probe, - .remove = contact_remove, - }; + .probe = contact_probe, + .remove = contact_remove, + }; - static int contact_init(void) - { - int r; + static int contact_init(void) + { + int r; - r = mei_cl_driver_register(&contact_driver); - if (r) { - pr_err(CONTACT_DRIVER_NAME ": driver registration failed\n"); - return r; - } + r = mei_cl_driver_register(&contact_driver); + if (r) { + pr_err(CONTACT_DRIVER_NAME ": driver registration failed\n"); + return r; + } - return 0; - } + return 0; + } - static void __exit 
contact_exit(void) - { - mei_cl_driver_unregister(&contact_driver); - } + static void __exit contact_exit(void) + { + mei_cl_driver_unregister(&contact_driver); + } - module_init(contact_init); - module_exit(contact_exit); + module_init(contact_init); + module_exit(contact_exit); And the driver's simplified probe routine would look like that: .. code-block:: C - int contact_probe(struct mei_cl_device *dev, struct mei_cl_device_id *id) - { - struct contact_driver *contact; + int contact_probe(struct mei_cl_device *dev, struct mei_cl_device_id *id) + { + [...] + mei_cldev_enable(dev); - [...] - mei_cl_enable_device(dev); + mei_cldev_register_rx_cb(dev, contact_rx_cb); - mei_cl_register_event_cb(dev, contact_event_cb, contact); - - return 0; - } + return 0; + } In the probe routine the driver first enable the MEI device and then registers -an ME bus event handler which is as close as it can get to registering a -threaded IRQ handler. -The handler implementation will typically call some I/O routine depending on -the pending events: - -#define MAX_NFC_PAYLOAD 128 +an rx handler which is as close as it can get to registering a threaded IRQ handler. +The handler implementation will typically call :c:func:`mei_cldev_recv` and then +process received data. .. 
code-block:: C - static void contact_event_cb(struct mei_cl_device *dev, u32 events, - void *context) - { - struct contact_driver *contact = context; + #define MAX_PAYLOAD 128 + #define HDR_SIZE 4 + static void contact_rx_cb(struct mei_cl_device *cldev) + { + struct contact *c = mei_cldev_get_drvdata(cldev); + unsigned char payload[MAX_PAYLOAD]; + ssize_t payload_sz; + + payload_sz = mei_cldev_recv(cldev, payload, MAX_PAYLOAD); + if (payload_sz < HDR_SIZE) { + return; + } - if (events & BIT(MEI_EVENT_RX)) { - u8 payload[MAX_NFC_PAYLOAD]; - int payload_size; + c->process_rx(payload); - payload_size = mei_recv(dev, payload, MAX_NFC_PAYLOAD); - if (payload_size <= 0) - return; + } - /* Hook to the NFC subsystem */ - nfc_hci_recv_frame(contact->hdev, payload, payload_size); - } - } -- cgit v1.2.3 From 4e3d3b784ae7cd86ace2776c01be99ddfd378801 Mon Sep 17 00:00:00 2001 From: Tomas Winkler Date: Mon, 3 Jun 2019 12:14:04 +0300 Subject: mei: docs: add a short description for nfc behind mei Signed-off-by: Tomas Winkler Signed-off-by: Greg Kroah-Hartman --- Documentation/driver-api/mei/index.rst | 2 +- Documentation/driver-api/mei/mei-client-bus.rst | 7 +++++++ Documentation/driver-api/mei/nfc.rst | 28 +++++++++++++++++++++++++ 3 files changed, 36 insertions(+), 1 deletion(-) create mode 100644 Documentation/driver-api/mei/nfc.rst (limited to 'Documentation/driver-api') diff --git a/Documentation/driver-api/mei/index.rst b/Documentation/driver-api/mei/index.rst index d261afac6852..3a22b522ee78 100644 --- a/Documentation/driver-api/mei/index.rst +++ b/Documentation/driver-api/mei/index.rst @@ -16,7 +16,7 @@ Intel(R) Management Engine Interface (Intel(R) MEI) Table of Contents .. 
toctree:: - :maxdepth: 2 + :maxdepth: 3 mei mei-client-bus diff --git a/Documentation/driver-api/mei/mei-client-bus.rst b/Documentation/driver-api/mei/mei-client-bus.rst index 7310dd45c484..bfe28ebc3ca8 100644 --- a/Documentation/driver-api/mei/mei-client-bus.rst +++ b/Documentation/driver-api/mei/mei-client-bus.rst @@ -158,3 +158,10 @@ process received data. } +MEI Client Bus Drivers +====================== + +.. toctree:: + :maxdepth: 2 + + nfc diff --git a/Documentation/driver-api/mei/nfc.rst b/Documentation/driver-api/mei/nfc.rst new file mode 100644 index 000000000000..b5b6fc96f85e --- /dev/null +++ b/Documentation/driver-api/mei/nfc.rst @@ -0,0 +1,28 @@ +.. SPDX-License-Identifier: GPL-2.0 + +MEI NFC +------- + +Some Intel 8 and 9 Series chipsets support NFC devices connected behind +the Intel Management Engine controller. +The MEI client bus exposes the NFC chips as NFC phy devices and enables +binding with the Microread and NXP PN544 NFC device drivers from the Linux NFC +subsystem. + +.. kernel-render:: DOT + :alt: MEI NFC digraph + :caption: **MEI NFC** Stack + + digraph NFC { + cl_nfc -> me_cl_nfc; + "drivers/nfc/mei_phy" -> cl_nfc [lhead=bus]; + "drivers/nfc/microread/mei" -> cl_nfc; + "drivers/nfc/microread/mei" -> "drivers/nfc/mei_phy"; + "drivers/nfc/pn544/mei" -> cl_nfc; + "drivers/nfc/pn544/mei" -> "drivers/nfc/mei_phy"; + "net/nfc" -> "drivers/nfc/microread/mei"; + "net/nfc" -> "drivers/nfc/pn544/mei"; + "neard" -> "net/nfc"; + cl_nfc [label="mei/bus(nfc)"]; + me_cl_nfc [label="me fw (nfc)"]; + } -- cgit v1.2.3 From 0475afd2a5dee99defdb7b030c09ba202ea3c64a Mon Sep 17 00:00:00 2001 From: Tomas Winkler Date: Mon, 3 Jun 2019 12:14:05 +0300 Subject: mei: docs: add hdcp documentation 1.
Add a short documentation for the MEI HDCP driver, and fix DOC comments in drivers/misc/mei/hdcp/mei_hdcp.c Signed-off-by: Tomas Winkler Signed-off-by: Greg Kroah-Hartman --- Documentation/driver-api/mei/hdcp.rst | 32 +++++++++++++++++++++++++ Documentation/driver-api/mei/mei-client-bus.rst | 1 + drivers/misc/mei/hdcp/mei_hdcp.c | 11 ++++----- 3 files changed, 37 insertions(+), 7 deletions(-) create mode 100644 Documentation/driver-api/mei/hdcp.rst (limited to 'Documentation/driver-api') diff --git a/Documentation/driver-api/mei/hdcp.rst b/Documentation/driver-api/mei/hdcp.rst new file mode 100644 index 000000000000..e85a065b1cdc --- /dev/null +++ b/Documentation/driver-api/mei/hdcp.rst @@ -0,0 +1,32 @@ +.. SPDX-License-Identifier: GPL-2.0 + +HDCP: +===== + +ME FW as a security engine provides the capability for setting up +HDCP2.2 protocol negotiation between the Intel graphics device and +an HDCP2.2 sink. + +ME FW prepares HDCP2.2 negotiation parameters, signs and encrypts them +according to the HDCP 2.2 spec. The Intel graphics device sends the created blob +to the HDCP2.2 sink. + +Similarly, the HDCP2.2 sink's response is transferred to ME FW +for decryption and verification. + +Once all the steps of HDCP2.2 negotiation are completed, +upon request ME FW will configure the port as authenticated and supply +the HDCP encryption keys to Intel graphics hardware. + + +mei_hdcp driver +--------------- +.. kernel-doc:: drivers/misc/mei/hdcp/mei_hdcp.c + :doc: MEI_HDCP Client Driver + +mei_hdcp api +------------ + +.. kernel-doc:: drivers/misc/mei/hdcp/mei_hdcp.c + :functions: + diff --git a/Documentation/driver-api/mei/mei-client-bus.rst b/Documentation/driver-api/mei/mei-client-bus.rst index bfe28ebc3ca8..f242b3f8d6aa 100644 --- a/Documentation/driver-api/mei/mei-client-bus.rst +++ b/Documentation/driver-api/mei/mei-client-bus.rst @@ -164,4 +164,5 @@ MEI Client Bus Drivers .. 
toctree:: :maxdepth: 2 + hdcp nfc diff --git a/drivers/misc/mei/hdcp/mei_hdcp.c b/drivers/misc/mei/hdcp/mei_hdcp.c index b07000202d4a..ed816939fb32 100644 --- a/drivers/misc/mei/hdcp/mei_hdcp.c +++ b/drivers/misc/mei/hdcp/mei_hdcp.c @@ -2,7 +2,7 @@ /* * Copyright © 2019 Intel Corporation * - * Mei_hdcp.c: HDCP client driver for mei bus + * mei_hdcp.c: HDCP client driver for mei bus * * Author: * Ramalingam C @@ -11,12 +11,9 @@ /** * DOC: MEI_HDCP Client Driver * - * This is a client driver to the mei_bus to make the HDCP2.2 services of - * ME FW available for the interested consumers like I915. - * - * This module will act as a translation layer between HDCP protocol - * implementor(I915) and ME FW by translating HDCP2.2 authentication - * messages to ME FW command payloads and vice versa. + * The mei_hdcp driver acts as a translation layer between the HDCP 2.2 + * protocol implementer (I915) and ME FW by translating HDCP2.2 + * negotiation messages to ME FW command payloads and vice versa. */ #include -- cgit v1.2.3 From 7e706da35a458f4b0d4c565c7b71023d8bfe279b Mon Sep 17 00:00:00 2001 From: Tomas Winkler Date: Mon, 3 Jun 2019 12:14:06 +0300 Subject: mei: docs: fix broken links in iamt documentation. The iAMT documentation moved from http:// to https://, and LMS has moved to github.com Signed-off-by: Tomas Winkler Signed-off-by: Greg Kroah-Hartman --- Documentation/driver-api/mei/iamt.rst | 105 ++++++++++++++++------------------ 1 file changed, 50 insertions(+), 55 deletions(-) (limited to 'Documentation/driver-api') diff --git a/Documentation/driver-api/mei/iamt.rst b/Documentation/driver-api/mei/iamt.rst index 6dcf5b16e958..6ef3e613684b 100644 --- a/Documentation/driver-api/mei/iamt.rst +++ b/Documentation/driver-api/mei/iamt.rst @@ -27,62 +27,57 @@ starting with Release 6.0) over HTTP/S or WS-Management protocol over HTTP/S that are received from a remote management console application.
 For more information about Intel AMT:
-http://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide
+https://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide/default.htm
 Intel AMT Applications
-======================
-
-  1) Intel Local Management Service (Intel LMS)
-
-     Applications running locally on the platform communicate with Intel AMT Release
-     2.0 and later releases in the same way that network applications do via SOAP
-     over HTTP (deprecated starting with Release 6.0) or with WS-Management over
-     SOAP over HTTP. This means that some Intel AMT features can be accessed from a
-     local application using the same network interface as a remote application
-     communicating with Intel AMT over the network.
-
-     When a local application sends a message addressed to the local Intel AMT host
-     name, the Intel LMS, which listens for traffic directed to the host name,
-     intercepts the message and routes it to the Intel MEI.
-     For more information:
-     http://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide
-     Under "About Intel AMT" => "Local Access"
-
-     For downloading Intel LMS:
-     http://software.intel.com/en-us/articles/download-the-latest-intel-amt-open-source-drivers/
-
-     The Intel LMS opens a connection using the Intel MEI driver to the Intel LMS
-     firmware feature using a defined UUID and then communicates with the feature
-     using a protocol called Intel AMT Port Forwarding Protocol (Intel APF protocol).
-     The protocol is used to maintain multiple sessions with Intel AMT from a
-     single application.
-
-     See the protocol specification in the Intel AMT Software Development Kit (SDK)
-     http://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide
-     Under "SDK Resources" => "Intel(R) vPro(TM) Gateway (MPS)"
-       => "Information for Intel(R) vPro(TM) Gateway Developers"
-       => "Description of the Intel AMT Port Forwarding (APF) Protocol"
-
-  2) Intel AMT Remote configuration using a Local Agent
-
-     A Local Agent enables IT personnel to configure Intel AMT out-of-the-box
-     without requiring installing additional data to enable setup. The remote
-     configuration process may involve an ISV-developed remote configuration
-     agent that runs on the host.
-     For more information:
-     http://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide
-     Under "Setup and Configuration of Intel AMT" =>
-       "SDK Tools Supporting Setup and Configuration" =>
-       "Using the Local Agent Sample"
-
-     An open source Intel AMT configuration utility, implementing a local agent
-     that accesses the Intel MEI driver, can be found here:
-     http://software.intel.com/en-us/articles/download-the-latest-intel-amt-open-source-drivers/
-
+----------------------
+
+  1) Intel Local Management Service (Intel LMS)
+
+     Applications running locally on the platform communicate with Intel AMT Release
+     2.0 and later releases in the same way that network applications do via SOAP
+     over HTTP (deprecated starting with Release 6.0) or with WS-Management over
+     SOAP over HTTP. This means that some Intel AMT features can be accessed from a
+     local application using the same network interface as a remote application
+     communicating with Intel AMT over the network.
+
+     When a local application sends a message addressed to the local Intel AMT host
+     name, the Intel LMS, which listens for traffic directed to the host name,
+     intercepts the message and routes it to the Intel MEI.
+     For more information:
+     https://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide/default.htm
+     Under "About Intel AMT" => "Local Access"
+
+     For downloading Intel LMS:
+     https://github.com/intel/lms
+
+     The Intel LMS opens a connection using the Intel MEI driver to the Intel LMS
+     firmware feature using a defined GUID and then communicates with the feature
+     using a protocol called Intel AMT Port Forwarding Protocol (Intel APF protocol).
+     The protocol is used to maintain multiple sessions with Intel AMT from a
+     single application.
+
+     See the protocol specification in the Intel AMT Software Development Kit (SDK)
+     https://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide/default.htm
+     Under "SDK Resources" => "Intel(R) vPro(TM) Gateway (MPS)"
+       => "Information for Intel(R) vPro(TM) Gateway Developers"
+       => "Description of the Intel AMT Port Forwarding (APF) Protocol"
+
+  2) Intel AMT Remote configuration using a Local Agent
+
+     A Local Agent enables IT personnel to configure Intel AMT out-of-the-box
+     without requiring installing additional data to enable setup. The remote
+     configuration process may involve an ISV-developed remote configuration
+     agent that runs on the host.
+     For more information:
+     https://software.intel.com/sites/manageability/AMT_Implementation_and_Reference_Guide/default.htm
+     Under "Setup and Configuration of Intel AMT" =>
+       "SDK Tools Supporting Setup and Configuration" =>
+       "Using the Local Agent Sample"
 Intel AMT OS Health Watchdog
-============================
+----------------------------
 The Intel AMT Watchdog is an OS Health (Hang/Crash) watchdog. Whenever the OS hangs or crashes, Intel AMT will send an event
@@ -90,10 +85,10 @@ to any subscriber to this event. This mechanism means that IT knows when a platform crashes even when there is a hard failure on the host.
The Intel AMT Watchdog is composed of two parts: - 1) Firmware feature - receives the heartbeats - and sends an event when the heartbeats stop. - 2) Intel MEI iAMT watchdog driver - connects to the watchdog feature, - configures the watchdog and sends the heartbeats. + 1) Firmware feature - receives the heartbeats + and sends an event when the heartbeats stop. + 2) Intel MEI iAMT watchdog driver - connects to the watchdog feature, + configures the watchdog and sends the heartbeats. The Intel iAMT watchdog MEI driver uses the kernel watchdog API to configure the Intel AMT Watchdog and to send heartbeats to it. The default timeout of the -- cgit v1.2.3 From d0a178095c5fbbd25454c20e49bc3a7d70ecb769 Mon Sep 17 00:00:00 2001 From: Tomas Winkler Date: Thu, 6 Jun 2019 16:31:08 +0300 Subject: mei: docs: update mei documentation The mei driver went via multiple changes, update the documentation and fix formatting. Signed-off-by: Tomas Winkler Signed-off-by: Greg Kroah-Hartman --- Documentation/driver-api/mei/mei.rst | 96 +++++++++++++++++++++++------------- 1 file changed, 61 insertions(+), 35 deletions(-) (limited to 'Documentation/driver-api') diff --git a/Documentation/driver-api/mei/mei.rst b/Documentation/driver-api/mei/mei.rst index c7f10a4b46ff..c800d8e5f422 100644 --- a/Documentation/driver-api/mei/mei.rst +++ b/Documentation/driver-api/mei/mei.rst @@ -5,34 +5,32 @@ Introduction The Intel Management Engine (Intel ME) is an isolated and protected computing resource (Co-processor) residing inside certain Intel chipsets. The Intel ME -provides support for computer/IT management features. The feature set -depends on the Intel chipset SKU. +provides support for computer/IT management and security features. +The actual feature set depends on the Intel chipset SKU. The Intel Management Engine Interface (Intel MEI, previously known as HECI) is the interface between the Host and Intel ME. This interface is exposed -to the host as a PCI device. 
The Intel MEI Driver is in charge of the -communication channel between a host application and the Intel ME feature. +to the host as a PCI device, actually multiple PCI devices might be exposed. +The Intel MEI Driver is in charge of the communication channel between +a host application and the Intel ME features. -Each Intel ME feature (Intel ME Client) is addressed by a GUID/UUID and +Each Intel ME feature, or Intel ME Client is addressed by a unique GUID and each client has its own protocol. The protocol is message-based with a -header and payload up to 512 bytes. +header and payload up to maximal number of bytes advertised by the client, +upon connection. Intel MEI Driver ================ -The driver exposes a misc device called /dev/mei. +The driver exposes a character device with device nodes /dev/meiX. An application maintains communication with an Intel ME feature while -/dev/mei is open. The binding to a specific feature is performed by calling -MEI_CONNECT_CLIENT_IOCTL, which passes the desired UUID. +/dev/meiX is open. The binding to a specific feature is performed by calling +:c:macro:`MEI_CONNECT_CLIENT_IOCTL`, which passes the desired GUID. The number of instances of an Intel ME feature that can be opened at the same time depends on the Intel ME feature, but most of the features allow only a single instance. -The Intel AMT Host Interface (Intel AMTHI) feature supports multiple -simultaneous user connected applications. The Intel MEI driver -handles this internally by maintaining request queues for the applications. - The driver is transparent to data that are passed between firmware feature and host application. @@ -40,6 +38,8 @@ Because some of the Intel ME features can change the system configuration, the driver by default allows only a privileged user to access it. +The session is terminated calling :c:func:`close(int fd)`. + A code snippet for an application communicating with Intel AMTHI client: .. 
code-block:: C @@ -47,13 +47,13 @@ A code snippet for an application communicating with Intel AMTHI client: struct mei_connect_client_data data; fd = open(MEI_DEVICE); - data.d.in_client_uuid = AMTHI_UUID; + data.d.in_client_uuid = AMTHI_GUID; ioctl(fd, IOCTL_MEI_CONNECT_CLIENT, &data); printf("Ver=%d, MaxLen=%ld\n", - data.d.in_client_uuid.protocol_version, - data.d.in_client_uuid.max_msg_length); + data.d.in_client_uuid.protocol_version, + data.d.in_client_uuid.max_msg_length); [...] @@ -67,60 +67,86 @@ A code snippet for an application communicating with Intel AMTHI client: close(fd); -IOCTLs -====== +User space API + +IOCTLs: +======= The Intel MEI Driver supports the following IOCTL commands: - IOCTL_MEI_CONNECT_CLIENT Connect to firmware Feature (client). - usage: - struct mei_connect_client_data clientData; - ioctl(fd, IOCTL_MEI_CONNECT_CLIENT, &clientData); +IOCTL_MEI_CONNECT_CLIENT +------------------------- +Connect to firmware Feature/Client. + +.. code-block:: none + + Usage: - inputs: - mei_connect_client_data struct contain the following - input field: + struct mei_connect_client_data client_data; - in_client_uuid - UUID of the FW Feature that needs + ioctl(fd, IOCTL_MEI_CONNECT_CLIENT, &client_data); + + Inputs: + + struct mei_connect_client_data - contain the following + Input field: + + in_client_uuid - GUID of the FW Feature that needs to connect to. - outputs: + Outputs: out_client_properties - Client Properties: MTU and Protocol Version. - error returns: + Error returns: + + ENOTTY No such client (i.e. wrong GUID) or connection is not allowed. EINVAL Wrong IOCTL Number - ENODEV Device or Connection is not initialized or ready. (e.g. Wrong UUID) + ENODEV Device or Connection is not initialized or ready. ENOMEM Unable to allocate memory to client internal data. EFAULT Fatal Error (e.g. 
Unable to access user input data) EBUSY Connection Already Open - Notes: +:Note: max_msg_length (MTU) in client properties describes the maximum data that can be sent or received. (e.g. if MTU=2K, can send requests up to bytes 2k and received responses up to 2k bytes). - IOCTL_MEI_NOTIFY_SET: enable or disable event notifications + +IOCTL_MEI_NOTIFY_SET +--------------------- +Enable or disable event notifications. + + +.. code-block:: none Usage: + uint32_t enable; + ioctl(fd, IOCTL_MEI_NOTIFY_SET, &enable); - Inputs: + uint32_t enable = 1; or uint32_t enable[disable] = 0; Error returns: + + EINVAL Wrong IOCTL Number ENODEV Device is not initialized or the client not connected ENOMEM Unable to allocate memory to client internal data. EFAULT Fatal Error (e.g. Unable to access user input data) EOPNOTSUPP if the device doesn't support the feature - Notes: +:Note: The client must be connected in order to enable notification events - IOCTL_MEI_NOTIFY_GET : retrieve event +IOCTL_MEI_NOTIFY_GET +-------------------- +Retrieve event + +.. code-block:: none Usage: uint32_t event; @@ -137,7 +163,7 @@ The Intel MEI Driver supports the following IOCTL commands: EFAULT Fatal Error (e.g. Unable to access user input data) EOPNOTSUPP if the device doesn't support the feature - Notes: +:Note: The client must be connected and event notification has to be enabled in order to receive an event -- cgit v1.2.3 From 0aa3ebffc43cb8974f2fca92d07b9ebeba0f67c1 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Wed, 29 May 2019 20:23:47 -0300 Subject: docs: gpio: driver.rst: fix a bad tag With ReST, [foo]_ means a reference to foo, causing this warning: Documentation/driver-api/gpio/driver.rst:419: WARNING: Unknown target name: "devm". Fix it by using a literal for the name. 
Signed-off-by: Mauro Carvalho Chehab
Signed-off-by: Linus Walleij
---
 Documentation/driver-api/gpio/driver.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'Documentation/driver-api')

diff --git a/Documentation/driver-api/gpio/driver.rst b/Documentation/driver-api/gpio/driver.rst
index 58036c2d84d2..4af9aae724f0 100644
--- a/Documentation/driver-api/gpio/driver.rst
+++ b/Documentation/driver-api/gpio/driver.rst
@@ -418,7 +418,7 @@ symbol:
 If there is a need to exclude certain GPIO lines from the IRQ domain handled by these helpers, we can set .irq.need_valid_mask of the gpiochip before
-[devm_]gpiochip_add_data() is called. This allocates an .irq.valid_mask with as
+``[devm_]gpiochip_add_data()`` is called. This allocates an .irq.valid_mask with as
 many bits set as there are GPIO lines in the chip, each bit representing line 0..n-1. Drivers can exclude GPIO lines by clearing bits from this mask. The mask must be filled in before gpiochip_irqchip_add() or gpiochip_irqchip_add_nested()
--
cgit v1.2.3

From 8b4a503d659b32cae8266aeb306f7fd6717e6a53 Mon Sep 17 00:00:00 2001
From: Mauro Carvalho Chehab
Date: Sat, 8 Jun 2019 23:27:16 -0300
Subject: docs: s390: convert docs to ReST and rename to *.rst

Convert all text files with s390 documentation to ReST format.

Tried to preserve as much as possible the original document format. Still, some of the files required some work in order for it to be visible on both plain text and after converted to html.

The conversion is actually:
  - add blank lines and identation in order to identify paragraphs;
  - fix tables markups;
  - add some lists markups;
  - mark literal blocks;
  - adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab
Signed-off-by: Heiko Carstens
---
 Documentation/admin-guide/kernel-parameters.txt |    4 +-
 Documentation/driver-api/s390-drivers.rst       |    4 +-
 Documentation/s390/3270.rst                     |  298 +++
 Documentation/s390/3270.txt                     |  271 ---
 Documentation/s390/CommonIO                     |  125 --
 Documentation/s390/DASD                         |   73 -
 Documentation/s390/Debugging390.txt             | 2172 -------------------
 Documentation/s390/cds.rst                      |  530 +++++
 Documentation/s390/cds.txt                      |  472 ----
 Documentation/s390/common_io.rst                |  140 ++
 Documentation/s390/dasd.rst                     |   84 +
 Documentation/s390/debugging390.rst             | 2613 +++++++++++++++++++++++
 Documentation/s390/driver-model.rst             |  328 +++
 Documentation/s390/driver-model.txt             |  287 ---
 Documentation/s390/index.rst                    |   30 +
 Documentation/s390/monreader.rst                |  212 ++
 Documentation/s390/monreader.txt                |  197 --
 Documentation/s390/qeth.rst                     |   64 +
 Documentation/s390/qeth.txt                     |   50 -
 Documentation/s390/s390dbf.rst                  |  803 +++++++
 Documentation/s390/s390dbf.txt                  |  667 ------
 Documentation/s390/text_files.rst               |   11 +
 Documentation/s390/vfio-ap.rst                  |  866 ++++++++
 Documentation/s390/vfio-ap.txt                  |  837 --------
 Documentation/s390/vfio-ccw.rst                 |  326 +++
 Documentation/s390/vfio-ccw.txt                 |  300 ---
 Documentation/s390/zfcpdump.rst                 |   50 +
 Documentation/s390/zfcpdump.txt                 |   48 -
 MAINTAINERS                                     |    4 +-
 arch/s390/Kconfig                               |    4 +-
 arch/s390/include/asm/debug.h                   |    4 +-
 drivers/s390/char/zcore.c                       |    2 +-
 32 files changed, 6366 insertions(+), 5510 deletions(-)
 create mode 100644 Documentation/s390/3270.rst
 delete mode 100644 Documentation/s390/3270.txt
 delete mode 100644 Documentation/s390/CommonIO
 delete mode 100644 Documentation/s390/DASD
 delete mode 100644 Documentation/s390/Debugging390.txt
 create mode 100644 Documentation/s390/cds.rst
 delete mode 100644 Documentation/s390/cds.txt
 create mode 100644 Documentation/s390/common_io.rst
 create mode 100644 Documentation/s390/dasd.rst
 create mode 100644 Documentation/s390/debugging390.rst
 create mode 100644 Documentation/s390/driver-model.rst
 delete mode 100644 Documentation/s390/driver-model.txt
 create mode 100644 Documentation/s390/index.rst
 create mode 100644 Documentation/s390/monreader.rst
 delete mode 100644 Documentation/s390/monreader.txt
 create mode 100644 Documentation/s390/qeth.rst
 delete mode 100644 Documentation/s390/qeth.txt
 create mode 100644 Documentation/s390/s390dbf.rst
 delete mode 100644 Documentation/s390/s390dbf.txt
 create mode 100644 Documentation/s390/text_files.rst
 create mode 100644 Documentation/s390/vfio-ap.rst
 delete mode 100644 Documentation/s390/vfio-ap.txt
 create mode 100644 Documentation/s390/vfio-ccw.rst
 delete mode 100644 Documentation/s390/vfio-ccw.txt
 create mode 100644 Documentation/s390/zfcpdump.rst
 delete mode 100644 Documentation/s390/zfcpdump.txt

(limited to 'Documentation/driver-api')

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 138f6664b2e2..b9b0623be925 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -478,7 +478,7 @@
 others). ccw_timeout_log [S390]
- See Documentation/s390/CommonIO for details.
+ See Documentation/s390/common_io.rst for details.
 cgroup_disable= [KNL] Disable a particular controller Format: {name of the controller(s) to disable}
@@ -516,7 +516,7 @@
 /selinux/checkreqprot. cio_ignore= [S390]
- See Documentation/s390/CommonIO for details.
+ See Documentation/s390/common_io.rst for details.
 clk_ignore_unused [CLK] Prevents the clock framework from automatically gating
diff --git a/Documentation/driver-api/s390-drivers.rst b/Documentation/driver-api/s390-drivers.rst
index 30e6aa7e160b..5158577bc29b 100644
--- a/Documentation/driver-api/s390-drivers.rst
+++ b/Documentation/driver-api/s390-drivers.rst
@@ -27,7 +27,7 @@ not strictly considered I/O devices. They are considered here as well, although they are not the focus of this document.
 Some additional information can also be found in the kernel source under
-Documentation/s390/driver-model.txt.
+Documentation/s390/driver-model.rst.
 The css bus
 ===========
@@ -38,7 +38,7 @@ into several categories:
 * Standard I/O subchannels, for use by the system. They have a child device on the ccw bus and are described below.
 * I/O subchannels bound to the vfio-ccw driver. See
- Documentation/s390/vfio-ccw.txt.
+ Documentation/s390/vfio-ccw.rst.
 * Message subchannels. No Linux driver currently exists.
 * CHSC subchannels (at most one). The chsc subchannel driver can be used to send asynchronous chsc commands.
diff --git a/Documentation/s390/3270.rst b/Documentation/s390/3270.rst
new file mode 100644
index 000000000000..e09e77954238
--- /dev/null
+++ b/Documentation/s390/3270.rst
@@ -0,0 +1,298 @@
+===============================
+IBM 3270 Display System support
+===============================
+
+This file describes the driver that supports local channel attachment
+of IBM 3270 devices. It consists of three sections:
+
+  * Introduction
+  * Installation
+  * Operation
+
+
+Introduction
+============
+
+This paper describes installing and operating 3270 devices under
+Linux/390. A 3270 device is a block-mode rows-and-columns terminal of
+which I'm sure hundreds of millions were sold by IBM and clonemakers
+twenty and thirty years ago.
+
+You may have 3270s in-house and not know it. If you're using the
+VM-ESA operating system, define a 3270 to your virtual machine by using
+the command "DEF GRAF " This paper presumes you will be
+defining four 3270s with the CP/CMS commands:
+
+  - DEF GRAF 620
+  - DEF GRAF 621
+  - DEF GRAF 622
+  - DEF GRAF 623
+
+Your network connection from VM-ESA allows you to use x3270, tn3270, or
+another 3270 emulator, started from an xterm window on your PC or
+workstation. With the DEF GRAF command, an application such as xterm,
+and this Linux-390 3270 driver, you have another way of talking to your
+Linux box.
+
+This paper covers installation of the driver and operation of a
+dialed-in x3270.
+
+
+Installation
+============
+
+You install the driver by installing a patch, doing a kernel build, and
+running the configuration script (config3270.sh, in this directory).
+
+WARNING: If you are using 3270 console support, you must rerun the
+configuration script every time you change the console's address (perhaps
+by using the condev= parameter in silo's /boot/parmfile). More precisely,
+you should rerun the configuration script every time your set of 3270s,
+including the console 3270, changes subchannel identifier relative to
+one another. ReIPL as soon as possible after running the configuration
+script and the resulting /tmp/mkdev3270.
+
+If you have chosen to make tub3270 a module, you add a line to a
+configuration file under /etc/modprobe.d/. If you are working on a VM
+virtual machine, you can use DEF GRAF to define virtual 3270 devices.
+
+You may generate both 3270 and 3215 console support, or one or the
+other, or neither. If you generate both, the console type under VM is
+not changed. Use #CP Q TERM to see what the current console type is.
+Use #CP TERM CONMODE 3270 to change it to 3270. If you generate only
+3270 console support, then the driver automatically converts your console
+at boot time to a 3270 if it is a 3215.
+
+In brief, these are the steps:
+
+  1. Install the tub3270 patch
+  2. (If a module) add a line to a file in `/etc/modprobe.d/*.conf`
+  3. (If VM) define devices with DEF GRAF
+  4. Reboot
+  5. Configure
+
+To test that everything works, assuming VM and x3270,
+
+  1. Bring up an x3270 window.
+  2. Use the DIAL command in that window.
+  3. You should immediately see a Linux login screen.
+
+Here are the installation steps in detail:
+
+  1. The 3270 driver is a part of the official Linux kernel
+     source. Build a tree with the kernel source and any necessary
+     patches. Then do::
+
+        make oldconfig
+        (If you wish to disable 3215 console support, edit
+        .config; change CONFIG_TN3215's value to "n";
+        and rerun "make oldconfig".)
+        make image
+        make modules
+        make modules_install
+
+  2. (Perform this step only if you have configured tub3270 as a
+     module.) Add a line to a file `/etc/modprobe.d/*.conf` to automatically
+     load the driver when it's needed. With this line added, you will see
+     login prompts appear on your 3270s as soon as boot is complete (or
+     with emulated 3270s, as soon as you dial into your vm guest using the
+     command "DIAL "). Since the line-mode major number is
+     227, the line to add should be::
+
+        alias char-major-227 tub3270
+
+  3. Define graphic devices to your vm guest machine, if you
+     haven't already. Define them before you reboot (reipl):
+
+     - DEFINE GRAF 620
+     - DEFINE GRAF 621
+     - DEFINE GRAF 622
+     - DEFINE GRAF 623
+
+  4. Reboot. The reboot process scans hardware devices, including
+     3270s, and this enables the tub3270 driver once loaded to respond
+     correctly to the configuration requests of the next step. If
+     you have chosen 3270 console support, your console now behaves
+     as a 3270, not a 3215.
+
+  5. Run the 3270 configuration script config3270. It is
+     distributed in this same directory, Documentation/s390, as
+     config3270.sh. Inspect the output script it produces,
+     /tmp/mkdev3270, and then run that script. This will create the
+     necessary character special device files and make the necessary
+     changes to /etc/inittab.
+
+     Then notify /sbin/init that /etc/inittab has changed, by issuing
+     the telinit command with the q operand::
+
+        cd Documentation/s390
+        sh config3270.sh
+        sh /tmp/mkdev3270
+        telinit q
+
+     This should be sufficient for your first time. If your 3270
+     configuration has changed and you're reusing config3270, you
+     should follow these steps::
+
+        Change 3270 configuration
+        Reboot
+        Run config3270 and /tmp/mkdev3270
+        Reboot
+
+Here are the testing steps in detail:
+
+  1. Bring up an x3270 window, or use an actual hardware 3278 or
+     3279, or use the 3270 emulator of your choice. You would be
+     running the emulator on your PC or workstation. You would use
+     the command, for example::
+
+        x3270 vm-esa-domain-name &
+
+     if you wanted a 3278 Model 4 with 43 rows of 80 columns, the
+     default model number. The driver does not take advantage of
+     extended attributes.
+
+     The screen you should now see contains a VM logo with input
+     lines near the bottom. Use TAB to move to the bottom line,
+     probably labeled "COMMAND ===>".
+
+  2. Use the DIAL command instead of the LOGIN command to connect
+     to one of the virtual 3270s you defined with the DEF GRAF
+     commands::
+
+        dial my-vm-guest-name
+
+  3. You should immediately see a login prompt from your
+     Linux-390 operating system. If that does not happen, you would
+     see instead the line "DIALED TO my-vm-guest-name 0620".
+
+     To troubleshoot: do these things.
+
+     A. Is the driver loaded? Use the lsmod command (no operands)
+        to find out. Probably it isn't. Try loading it manually, with
+        the command "insmod tub3270". Does that command give error
+        messages? Ha! There's your problem.
+
+     B. Is the /etc/inittab file modified as in installation step 3
+        above? Use the grep command to find out; for instance, issue
+        "grep 3270 /etc/inittab". Nothing found? There's your
+        problem!
+
+     C. Are the device special files created, as in installation
+        step 2 above? Use the ls -l command to find out; for instance,
+        issue "ls -l /dev/3270/tty620". The output should start with the
+        letter "c" meaning character device and should contain "227, 1"
+        just to the left of the device name. No such file? no "c"?
+        Wrong major number? Wrong minor number? There's your
+        problem!
+
+     D. Do you get the message::
+
+           "HCPDIA047E my-vm-guest-name 0620 does not exist"?
+
+        If so, you must issue the command "DEF GRAF 620" from your VM
+        3215 console and then reboot the system.
+
+
+
+OPERATION.
+==========
+
+The driver defines three areas on the 3270 screen: the log area, the
+input area, and the status area.
+
+The log area takes up all but the bottom two lines of the screen. The
+driver writes terminal output to it, starting at the top line and going
+down. When it fills, the status area changes from "Linux Running" to
+"Linux More...". After a scrolling timeout of (default) 5 sec, the
+screen clears and more output is written, from the top down.
+
+The input area extends from the beginning of the second-to-last screen
+line to the start of the status area. You type commands in this area
+and hit ENTER to execute them.
+
+The status area initializes to "Linux Running" to give you a warm
+fuzzy feeling. When the log area fills up and output awaits, it
+changes to "Linux More...". At this time you can do several things or
+nothing. If you do nothing, the screen will clear in (default) 5 sec
+and more output will appear. You may hit ENTER with nothing typed in
+the input area to toggle between "Linux More..." and "Linux Holding",
+which indicates no scrolling will occur. (If you hit ENTER with "Linux
+Running" and nothing typed, the application receives a newline.)
+
+You may change the scrolling timeout value. For example, the following
+command line::
+
+    echo scrolltime=60 > /proc/tty/driver/tty3270
+
+changes the scrolling timeout value to 60 sec. Set scrolltime to 0 if
+you wish to prevent scrolling entirely.
+
+Other things you may do when the log area fills up are: hit PA2 to
+clear the log area and write more output to it, or hit CLEAR to clear
+the log area and the input area and write more output to the log area.
+
+Some of the Program Function (PF) and Program Attention (PA) keys are
+preassigned special functions. The ones that are not yield an alarm
+when pressed.
+
+PA1 causes a SIGINT to the currently running application. You may do
+the same thing from the input area, by typing "^C" and hitting ENTER.
+
+PA2 causes the log area to be cleared. If output awaits, it is then
+written to the log area.
+
+PF3 causes an EOF to be received as input by the application. You may
+cause an EOF also by typing "^D" and hitting ENTER.
+
+No PF key is preassigned to cause a job suspension, but you may cause a
+job suspension by typing "^Z" and hitting ENTER. You may wish to
+assign this function to a PF key. To make PF7 cause job suspension,
+execute the command::
+
+    echo pf7=^z > /proc/tty/driver/tty3270
+
+If the input you type does not end with the two characters "^n", the
+driver appends a newline character and sends it to the tty driver;
+otherwise the driver strips the "^n" and does not append a newline.
+The IBM 3215 driver behaves similarly.
+
+Pf10 causes the most recent command to be retrieved from the tube's
+command stack (default depth 20) and displayed in the input area. You
+may hit PF10 again for the next-most-recent command, and so on. A
+command is entered into the stack only when the input area is not made
+invisible (such as for password entry) and it is not identical to the
+current top entry. PF10 rotates backward through the command stack;
+PF11 rotates forward. You may assign the backward function to any PF
+key (or PA key, for that matter), say, PA3, with the command::
+
+    echo -e pa3=\\033k > /proc/tty/driver/tty3270
+
+This assigns the string ESC-k to PA3. Similarly, the string ESC-j
+performs the forward function. (Rationale: In bash with vi-mode line
+editing, ESC-k and ESC-j retrieve backward and forward history.
+Suggestions welcome.)
+
+Is a stack size of twenty commands not to your liking? Change it on
+the fly. To change to saving the last 100 commands, execute the
+command::
+
+    echo recallsize=100 > /proc/tty/driver/tty3270
+
+Have a command you issue frequently? Assign it to a PF or PA key! Use
+the command::
+
+    echo pf24="mkdir foobar; cd foobar" > /proc/tty/driver/tty3270
+
+to execute the commands mkdir foobar and cd foobar immediately when you
+hit PF24. Want to see the command line first, before you execute it?
+Use the -n option of the echo command::
+
+    echo -n pf24="mkdir foo; cd foo" > /proc/tty/driver/tty3270
+
+
+
+Happy testing! I welcome any and all comments about this document, the
+driver, etc etc.
+
+Dick Hitt
diff --git a/Documentation/s390/3270.txt b/Documentation/s390/3270.txt
deleted file mode 100644
index 7c715de99774..000000000000
--- a/Documentation/s390/3270.txt
+++ /dev/null
@@ -1,271 +0,0 @@
-IBM 3270 Display System support
-
-This file describes the driver that supports local channel attachment
-of IBM 3270 devices. It consists of three sections:
-  * Introduction
-  * Installation
-  * Operation
-
-
-INTRODUCTION.
-
-This paper describes installing and operating 3270 devices under
-Linux/390. A 3270 device is a block-mode rows-and-columns terminal of
-which I'm sure hundreds of millions were sold by IBM and clonemakers
-twenty and thirty years ago.
-
-You may have 3270s in-house and not know it. If you're using the
-VM-ESA operating system, define a 3270 to your virtual machine by using
-the command "DEF GRAF " This paper presumes you will be
-defining four 3270s with the CP/CMS commands
-
-    DEF GRAF 620
-    DEF GRAF 621
-    DEF GRAF 622
-    DEF GRAF 623
-
-Your network connection from VM-ESA allows you to use x3270, tn3270, or
-another 3270 emulator, started from an xterm window on your PC or
-workstation. With the DEF GRAF command, an application such as xterm,
-and this Linux-390 3270 driver, you have another way of talking to your
-Linux box.
-
-This paper covers installation of the driver and operation of a
-dialed-in x3270.
-
-
-INSTALLATION.
-
-You install the driver by installing a patch, doing a kernel build, and
-running the configuration script (config3270.sh, in this directory).
-
-WARNING: If you are using 3270 console support, you must rerun the
-configuration script every time you change the console's address (perhaps
-by using the condev= parameter in silo's /boot/parmfile). More precisely,
-you should rerun the configuration script every time your set of 3270s,
-including the console 3270, changes subchannel identifier relative to
-one another. ReIPL as soon as possible after running the configuration
-script and the resulting /tmp/mkdev3270.
-
-If you have chosen to make tub3270 a module, you add a line to a
-configuration file under /etc/modprobe.d/. If you are working on a VM
-virtual machine, you can use DEF GRAF to define virtual 3270 devices.
-
-You may generate both 3270 and 3215 console support, or one or the
-other, or neither. If you generate both, the console type under VM is
-not changed. Use #CP Q TERM to see what the current console type is.
-Use #CP TERM CONMODE 3270 to change it to 3270. If you generate only
-3270 console support, then the driver automatically converts your console
-at boot time to a 3270 if it is a 3215.
-
-In brief, these are the steps:
-  1. Install the tub3270 patch
-  2. (If a module) add a line to a file in /etc/modprobe.d/*.conf
-  3. (If VM) define devices with DEF GRAF
-  4. Reboot
-  5. Configure
-
-To test that everything works, assuming VM and x3270,
-  1. Bring up an x3270 window.
-  2. Use the DIAL command in that window.
-  3. You should immediately see a Linux login screen.
-
-Here are the installation steps in detail:
-
-  1. The 3270 driver is a part of the official Linux kernel
-     source. Build a tree with the kernel source and any necessary
-     patches. Then do
-        make oldconfig
-        (If you wish to disable 3215 console support, edit
-        .config; change CONFIG_TN3215's value to "n";
-        and rerun "make oldconfig".)
-        make image
-        make modules
-        make modules_install
-
-  2. (Perform this step only if you have configured tub3270 as a
-     module.) Add a line to a file /etc/modprobe.d/*.conf to automatically
-     load the driver when it's needed. With this line added, you will see
-     login prompts appear on your 3270s as soon as boot is complete (or
-     with emulated 3270s, as soon as you dial into your vm guest using the
-     command "DIAL "). Since the line-mode major number is
-     227, the line to add should be:
-        alias char-major-227 tub3270
-
-  3. Define graphic devices to your vm guest machine, if you
-     haven't already. Define them before you reboot (reipl):
-        DEFINE GRAF 620
-        DEFINE GRAF 621
-        DEFINE GRAF 622
-        DEFINE GRAF 623
-
-  4. Reboot. The reboot process scans hardware devices, including
-     3270s, and this enables the tub3270 driver once loaded to respond
-     correctly to the configuration requests of the next step. If
-     you have chosen 3270 console support, your console now behaves
-     as a 3270, not a 3215.
-
-  5. Run the 3270 configuration script config3270. It is
-     distributed in this same directory, Documentation/s390, as
-     config3270.sh. Inspect the output script it produces,
-     /tmp/mkdev3270, and then run that script. This will create the
-     necessary character special device files and make the necessary
-     changes to /etc/inittab.
-
-     Then notify /sbin/init that /etc/inittab has changed, by issuing
-     the telinit command with the q operand:
-        cd Documentation/s390
-        sh config3270.sh
-        sh /tmp/mkdev3270
-        telinit q
-
-     This should be sufficient for your first time. If your 3270
-     configuration has changed and you're reusing config3270, you
-     should follow these steps:
-        Change 3270 configuration
-        Reboot
-        Run config3270 and /tmp/mkdev3270
-        Reboot
-
-Here are the testing steps in detail:
-
-  1. Bring up an x3270 window, or use an actual hardware 3278 or
-     3279, or use the 3270 emulator of your choice. You would be
-     running the emulator on your PC or workstation. You would use
-     the command, for example,
-        x3270 vm-esa-domain-name &
-     if you wanted a 3278 Model 4 with 43 rows of 80 columns, the
-     default model number. The driver does not take advantage of
-     extended attributes.
-
-     The screen you should now see contains a VM logo with input
-     lines near the bottom. Use TAB to move to the bottom line,
-     probably labeled "COMMAND ===>".
-
-  2. Use the DIAL command instead of the LOGIN command to connect
-     to one of the virtual 3270s you defined with the DEF GRAF
-     commands:
-        dial my-vm-guest-name
-
-  3. You should immediately see a login prompt from your
-     Linux-390 operating system. If that does not happen, you would
-     see instead the line "DIALED TO my-vm-guest-name 0620".
-
-     To troubleshoot: do these things.
-
-     A. Is the driver loaded? Use the lsmod command (no operands)
-        to find out. Probably it isn't. Try loading it manually, with
-        the command "insmod tub3270". Does that command give error
-        messages? Ha! There's your problem.
-
-     B. Is the /etc/inittab file modified as in installation step 3
-        above? Use the grep command to find out; for instance, issue
-        "grep 3270 /etc/inittab". Nothing found? There's your
-        problem!
-
-     C. Are the device special files created, as in installation
-        step 2 above? Use the ls -l command to find out; for instance,
-        issue "ls -l /dev/3270/tty620". The output should start with the
-        letter "c" meaning character device and should contain "227, 1"
-        just to the left of the device name. No such file? no "c"?
-        Wrong major number? Wrong minor number? There's your
-        problem!
-
-     D. Do you get the message
-           "HCPDIA047E my-vm-guest-name 0620 does not exist"?
-        If so, you must issue the command "DEF GRAF 620" from your VM
-        3215 console and then reboot the system.
-
-
-
-OPERATION.
-
-The driver defines three areas on the 3270 screen: the log area, the
-input area, and the status area.
-
-The log area takes up all but the bottom two lines of the screen. The
-driver writes terminal output to it, starting at the top line and going
-down. When it fills, the status area changes from "Linux Running" to
-"Linux More...". After a scrolling timeout of (default) 5 sec, the
-screen clears and more output is written, from the top down.
-
-The input area extends from the beginning of the second-to-last screen
-line to the start of the status area. You type commands in this area
-and hit ENTER to execute them.
-
-The status area initializes to "Linux Running" to give you a warm
-fuzzy feeling. When the log area fills up and output awaits, it
-changes to "Linux More...". At this time you can do several things or
-nothing. If you do nothing, the screen will clear in (default) 5 sec
-and more output will appear. You may hit ENTER with nothing typed in
-the input area to toggle between "Linux More..." and "Linux Holding",
-which indicates no scrolling will occur. (If you hit ENTER with "Linux
-Running" and nothing typed, the application receives a newline.)
-
-You may change the scrolling timeout value. For example, the following
-command line:
-    echo scrolltime=60 > /proc/tty/driver/tty3270
-changes the scrolling timeout value to 60 sec. Set scrolltime to 0 if
-you wish to prevent scrolling entirely.
-
-Other things you may do when the log area fills up are: hit PA2 to
-clear the log area and write more output to it, or hit CLEAR to clear
-the log area and the input area and write more output to the log area.
-
-Some of the Program Function (PF) and Program Attention (PA) keys are
-preassigned special functions. The ones that are not yield an alarm
-when pressed.
-
-PA1 causes a SIGINT to the currently running application. You may do
-the same thing from the input area, by typing "^C" and hitting ENTER.
-
-PA2 causes the log area to be cleared. If output awaits, it is then
-written to the log area.
-
-PF3 causes an EOF to be received as input by the application. You may
-cause an EOF also by typing "^D" and hitting ENTER.
- -No PF key is preassigned to cause a job suspension, but you may cause a -job suspension by typing "^Z" and hitting ENTER. You may wish to -assign this function to a PF key. To make PF7 cause job suspension, -execute the command: - echo pf7=^z > /proc/tty/driver/tty3270 - -If the input you type does not end with the two characters "^n", the -driver appends a newline character and sends it to the tty driver; -otherwise the driver strips the "^n" and does not append a newline. -The IBM 3215 driver behaves similarly. - -Pf10 causes the most recent command to be retrieved from the tube's -command stack (default depth 20) and displayed in the input area. You -may hit PF10 again for the next-most-recent command, and so on. A -command is entered into the stack only when the input area is not made -invisible (such as for password entry) and it is not identical to the -current top entry. PF10 rotates backward through the command stack; -PF11 rotates forward. You may assign the backward function to any PF -key (or PA key, for that matter), say, PA3, with the command: - echo -e pa3=\\033k > /proc/tty/driver/tty3270 -This assigns the string ESC-k to PA3. Similarly, the string ESC-j -performs the forward function. (Rationale: In bash with vi-mode line -editing, ESC-k and ESC-j retrieve backward and forward history. -Suggestions welcome.) - -Is a stack size of twenty commands not to your liking? Change it on -the fly. To change to saving the last 100 commands, execute the -command: - echo recallsize=100 > /proc/tty/driver/tty3270 - -Have a command you issue frequently? Assign it to a PF or PA key! Use -the command - echo pf24="mkdir foobar; cd foobar" > /proc/tty/driver/tty3270 -to execute the commands mkdir foobar and cd foobar immediately when you -hit PF24. Want to see the command line first, before you execute it? -Use the -n option of the echo command: - echo -n pf24="mkdir foo; cd foo" > /proc/tty/driver/tty3270 - - - -Happy testing! 
I welcome any and all comments about this document, the -driver, etc etc. - -Dick Hitt diff --git a/Documentation/s390/CommonIO b/Documentation/s390/CommonIO deleted file mode 100644 index 6e0f63f343b4..000000000000 --- a/Documentation/s390/CommonIO +++ /dev/null @@ -1,125 +0,0 @@ -S/390 common I/O-Layer - command line parameters, procfs and debugfs entries -============================================================================ - -Command line parameters ------------------------ - -* ccw_timeout_log - - Enable logging of debug information in case of ccw device timeouts. - -* cio_ignore = device[,device[,..]] - - device := {all | [!]ipldev | [!]condev | [!]<devno> | [!]<devno>-<devno>} - - The given devices will be ignored by the common I/O-layer; no detection - and device sensing will be done on any of those devices. The subchannel to - which the device in question is attached will be treated as if no device was - attached. - - An ignored device can be un-ignored later; see the "/proc entries"-section for - details. - - The devices must be given either as bus ids (0.x.abcd) or as hexadecimal - device numbers (0xabcd or abcd, for 2.4 backward compatibility). If you - give a device number 0xabcd, it will be interpreted as 0.0.abcd. - - You can use the 'all' keyword to ignore all devices. The 'ipldev' and 'condev' - keywords can be used to refer to the CCW based boot device and CCW console - device respectively (these are probably useful only when combined with the '!' - operator). The '!' operator will cause the I/O-layer to _not_ ignore a device. - The command line is parsed from left to right. - - For example, - cio_ignore=0.0.0023-0.0.0042,0.0.4711 - will ignore all devices ranging from 0.0.0023 to 0.0.0042 and the device - 0.0.4711, if detected. - As another example, - cio_ignore=all,!0.0.4711,!0.0.fd00-0.0.fd02 - will ignore all devices but 0.0.4711, 0.0.fd00, 0.0.fd01, 0.0.fd02. - - By default, no devices are ignored.
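The left-to-right parsing rule for cio_ignore described above can be illustrated with a small user-space sketch. This is not the kernel's implementation: the function names cio_ignored() and parse_devno() are hypothetical, the 'ipldev' and 'condev' keywords are not handled, and only the 16-bit device number of a bus id is compared (the css and subchannel-set parts are parsed and discarded). It assumes that a later entry in the list overrides an earlier one for the same device, which is what left-to-right processing of 'all' and '!' implies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse one device id in the forms "0.x.abcd", "0xabcd" or "abcd".
 * Hypothetical helper: only the device number is returned.
 */
static bool parse_devno(const char *s, char **end, unsigned long *devno)
{
    unsigned long v = strtoul(s, end, 16);   /* base 16 also accepts "0x" */

    if (*end == s)
        return false;
    if (**end == '.') {                      /* bus id form: css.ssid.devno */
        s = *end + 1;
        strtoul(s, end, 16);                 /* subchannel set id, unused */
        if (*end == s || **end != '.')
            return false;
        s = *end + 1;
        v = strtoul(s, end, 16);
        if (*end == s)
            return false;
    }
    *devno = v;
    return true;
}

/* Return true if devno would be ignored under the given cio_ignore value. */
bool cio_ignored(const char *param, unsigned long devno)
{
    bool ignored = false;                    /* default: nothing is ignored */
    const char *p = param;

    assert(param != NULL);
    while (*p) {
        bool negate = false;
        unsigned long from, to;
        char *end;

        if (*p == '!') {                     /* '!' means "do not ignore" */
            negate = true;
            p++;
        }
        if (strncmp(p, "all", 3) == 0) {
            from = 0;
            to = 0xffff;
            p += 3;
        } else {
            if (!parse_devno(p, &end, &from))
                return ignored;              /* stop at the first bad entry */
            to = from;
            p = end;
            if (*p == '-') {                 /* range: from-to */
                if (!parse_devno(p + 1, &end, &to))
                    return ignored;
                p = end;
            }
        }
        if (devno >= from && devno <= to)
            ignored = !negate;               /* later entries win */
        if (*p != ',')
            break;
        p++;
    }
    return ignored;
}
```

With the second example from the text, cio_ignored("all,!0.0.4711,!0.0.fd00-0.0.fd02", 0xfd01) is false while any other device number yields true.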
- - -/proc entries -------------- - -* /proc/cio_ignore - - Lists the ranges of devices (by bus id) which are ignored by common I/O. - - You can un-ignore certain or all devices by piping to /proc/cio_ignore. - "free all" will un-ignore all ignored devices, - "free <device range>, <device range>, ..." will un-ignore the specified - devices. - - For example, if devices 0.0.0023 to 0.0.0042 and 0.0.4711 are ignored, - - echo free 0.0.0030-0.0.0032 > /proc/cio_ignore - will un-ignore devices 0.0.0030 to 0.0.0032 and will leave devices 0.0.0023 - to 0.0.002f, 0.0.0033 to 0.0.0042 and 0.0.4711 ignored; - - echo free 0.0.0041 > /proc/cio_ignore will furthermore un-ignore device - 0.0.0041; - - echo free all > /proc/cio_ignore will un-ignore all remaining ignored - devices. - - When a device is un-ignored, device recognition and sensing is performed and - the device driver will be notified if possible, so the device will become - available to the system. Note that un-ignoring is performed asynchronously. - - You can also add ranges of devices to be ignored by piping to - /proc/cio_ignore; "add <device range>, <device range>, ..." will ignore the - specified devices. - - Note: While already known devices can be added to the list of devices to be - ignored, there will be no effect on then. However, if such a device - disappears and then reappears, it will then be ignored. To make - known devices go away, you need the "purge" command (see below). - - For example, - "echo add 0.0.a000-0.0.accc, 0.0.af00-0.0.afff > /proc/cio_ignore" - will add 0.0.a000-0.0.accc and 0.0.af00-0.0.afff to the list of ignored - devices. - - You can remove already known but now ignored devices via - "echo purge > /proc/cio_ignore" - All devices ignored but still registered and not online (= not in use) - will be deregistered and thus removed from the system. - - The devices can be specified either by bus id (0.x.abcd) or, for 2.4 backward - compatibility, by the device number in hexadecimal (0xabcd or abcd).
Device - numbers given as 0xabcd will be interpreted as 0.0.abcd. - -* /proc/cio_settle - - A write request to this file is blocked until all queued cio actions are - handled. This will allow userspace to wait for pending work affecting - device availability after changing cio_ignore or the hardware configuration. - -* For some of the information present in the /proc filesystem in 2.4 (namely, - /proc/subchannels and /proc/chpids), see driver-model.txt. - Information formerly in /proc/irq_count is now in /proc/interrupts. - - -debugfs entries ---------------- - -* /sys/kernel/debug/s390dbf/cio_*/ (S/390 debug feature) - - Some views generated by the debug feature to hold various debug outputs. - - - /sys/kernel/debug/s390dbf/cio_crw/sprintf - Messages from the processing of pending channel report words (machine check - handling). - - - /sys/kernel/debug/s390dbf/cio_msg/sprintf - Various debug messages from the common I/O-layer. - - - /sys/kernel/debug/s390dbf/cio_trace/hex_ascii - Logs the calling of functions in the common I/O-layer and, if applicable, - which subchannel they were called for, as well as dumps of some data - structures (like irb in an error case). - - The level of logging can be changed to be more or less verbose by piping to - /sys/kernel/debug/s390dbf/cio_*/level a number between 0 and 6; see the - documentation on the S/390 debug feature (Documentation/s390/s390dbf.txt) - for details. diff --git a/Documentation/s390/DASD b/Documentation/s390/DASD deleted file mode 100644 index 9963f1e9c98a..000000000000 --- a/Documentation/s390/DASD +++ /dev/null @@ -1,73 +0,0 @@ -DASD device driver - -S/390's disk devices (DASDs) are managed by Linux via the DASD device -driver. It is valid for all types of DASDs and represents them to -Linux as block devices, namely "dd". Currently the DASD driver uses a -single major number (254) and 4 minor numbers per volume (1 for the -physical volume and 3 for partitions). With respect to partitions see -below. 
Thus you may have up to 64 DASD devices in your system. - -The kernel parameter 'dasd=from-to,...' may be issued arbitrary times -in the kernel's parameter line or not at all. The 'from' and 'to' -parameters are to be given in hexadecimal notation without a leading -0x. -If you supply kernel parameters the different instances are processed -in order of appearance and a minor number is reserved for any device -covered by the supplied range up to 64 volumes. Additional DASDs are -ignored. If you do not supply the 'dasd=' kernel parameter at all, the -DASD driver registers all supported DASDs of your system to a minor -number in ascending order of the subchannel number. - -The driver currently supports ECKD-devices and there are stubs for -support of the FBA and CKD architectures. For the FBA architecture -only some smart data structures are missing to make the support -complete. -We performed our testing on 3380 and 3390 type disks of different -sizes, under VM and on the bare hardware (LPAR), using internal disks -of the multiprise as well as a RAMAC virtual array. Disks exported by -an Enterprise Storage Server (Seascape) should work fine as well. - -We currently implement one partition per volume, which is the whole -volume, skipping the first blocks up to the volume label. These are -reserved for IPL records and IBM's volume label to assure -accessibility of the DASD from other OSs. In a later stage we will -provide support of partitions, maybe VTOC oriented or using a kind of -partition table in the label record. - -USAGE - --Low-level format (?CKD only) -For using an ECKD-DASD as a Linux harddisk you have to low-level -format the tracks by issuing the BLKDASDFORMAT-ioctl on that -device. This will erase any data on that volume including IBM volume -labels, VTOCs etc. The ioctl may take a 'struct format_data *' or -'NULL' as an argument. 
-typedef struct { - int start_unit; - int stop_unit; - int blksize; -} format_data_t; -When a NULL argument is passed to the BLKDASDFORMAT ioctl the whole -disk is formatted to a blocksize of 1024 bytes. Otherwise start_unit -and stop_unit are the first and last track to be formatted. If -stop_unit is -1 it implies that the DASD is formatted from start_unit -up to the last track. blksize can be any power of two between 512 and -4096. We recommend no blksize lower than 1024 because the ext2fs uses -1kB blocks anyway and you gain approx. 50% of capacity increasing your -blksize from 512 byte to 1kB. - --Make a filesystem -Then you can mk??fs the filesystem of your choice on that volume or -partition. For reasons of sanity you should build your filesystem on -the partition /dev/dd?1 instead of the whole volume. You only lose 3kB -but may be sure that you can reuse your data after introduction of a -real partition table. - -BUGS: -- Performance sometimes is rather low because we don't fully exploit clustering - -TODO-List: -- Add IBM'S Disk layout to genhd -- Enhance driver to use more than one major number -- Enable usage as a module -- Support Cache fast write and DASD fast write (ECKD) diff --git a/Documentation/s390/Debugging390.txt b/Documentation/s390/Debugging390.txt deleted file mode 100644 index c35804c238ad..000000000000 --- a/Documentation/s390/Debugging390.txt +++ /dev/null @@ -1,2172 +0,0 @@ - - Debugging on Linux for s/390 & z/Architecture - by - Denis Joseph Barrow (djbarrow@de.ibm.com,barrow_dj@yahoo.com) - Copyright (C) 2000-2001 IBM Deutschland Entwicklung GmbH, IBM Corporation - Best viewed with fixed width fonts - -Overview of Document: -===================== -This document is intended to give a good overview of how to debug Linux for -s/390 and z/Architecture. It is not intended as a complete reference and not a -tutorial on the fundamentals of C & assembly. It doesn't go into -390 IO in any detail. 
It is intended to complement the documents in the -reference section below & any other worthwhile references you get. - -It is intended like the Enterprise Systems Architecture/390 Reference Summary -to be printed out & used as a quick cheat sheet self help style reference when -problems occur. - -Contents -======== -Register Set -Address Spaces on Intel Linux -Address Spaces on Linux for s/390 & z/Architecture -The Linux for s/390 & z/Architecture Kernel Task Structure -Register Usage & Stackframes on Linux for s/390 & z/Architecture -A sample program with comments -Compiling programs for debugging on Linux for s/390 & z/Architecture -Debugging under VM -s/390 & z/Architecture IO Overview -Debugging IO on s/390 & z/Architecture under VM -GDB on s/390 & z/Architecture -Stack chaining in gdb by hand -Examining core dumps -ldd -Debugging modules -The proc file system -SysRq -References -Special Thanks - -Register Set -============ -The current architectures have the following registers. - -16 General propose registers, 32 bit on s/390 and 64 bit on z/Architecture, -r0-r15 (or gpr0-gpr15), used for arithmetic and addressing. - -16 Control registers, 32 bit on s/390 and 64 bit on z/Architecture, cr0-cr15, -kernel usage only, used for memory management, interrupt control, debugging -control etc. - -16 Access registers (ar0-ar15), 32 bit on both s/390 and z/Architecture, -normally not used by normal programs but potentially could be used as -temporary storage. These registers have a 1:1 association with general -purpose registers and are designed to be used in the so-called access -register mode to select different address spaces. -Access register 0 (and access register 1 on z/Architecture, which needs a -64 bit pointer) is currently used by the pthread library as a pointer to -the current running threads private area. 
- -16 64 bit floating point registers (fp0-fp15 ) IEEE & HFP floating -point format compliant on G5 upwards & a Floating point control reg (FPC) -4 64 bit registers (fp0,fp2,fp4 & fp6) HFP only on older machines. -Note: -Linux (currently) always uses IEEE & emulates G5 IEEE format on older machines, -( provided the kernel is configured for this ). - - -The PSW is the most important register on the machine it -is 64 bit on s/390 & 128 bit on z/Architecture & serves the roles of -a program counter (pc), condition code register,memory space designator. -In IBM standard notation I am counting bit 0 as the MSB. -It has several advantages over a normal program counter -in that you can change address translation & program counter -in a single instruction. To change address translation, -e.g. switching address translation off requires that you -have a logical=physical mapping for the address you are -currently running at. - -+-------------------------+-------------------------------------------------+ -| Bit | | -+--------+----------------+ Value | -| s/390 | z/Architecture | | -+========+================+=================================================+ -| 0 | 0 | Reserved (must be 0) otherwise specification | -| | | exception occurs. | -+--------+----------------+-------------------------------------------------+ -| 1 | 1 | Program Event Recording 1 PER enabled, | -| | | PER is used to facilitate debugging e.g. | -| | | single stepping. | -+--------+----------------+-------------------------------------------------+ -| 2-4 | 2-4 | Reserved (must be 0). | -+--------+----------------+-------------------------------------------------+ -| 5 | 5 | Dynamic address translation 1=DAT on. 
| -+--------+----------------+-------------------------------------------------+ -| 6 | 6 | Input/Output interrupt Mask | -+--------+----------------+-------------------------------------------------+ -| 7 | 7 | External interrupt Mask used primarily for | -| | | interprocessor signalling and clock interrupts. | -+--------+----------------+-------------------------------------------------+ -| 8-11 | 8-11 | PSW Key used for complex memory protection | -| | | mechanism (not used under linux) | -+--------+----------------+-------------------------------------------------+ -| 12 | 12 | 1 on s/390 0 on z/Architecture | -+--------+----------------+-------------------------------------------------+ -| 13 | 13 | Machine Check Mask 1=enable machine check | -| | | interrupts | -+--------+----------------+-------------------------------------------------+ -| 14 | 14 | Wait State. Set this to 1 to stop the processor | -| | | except for interrupts and give time to other | -| | | LPARS. Used in CPU idle in the kernel to | -| | | increase overall usage of processor resources. | -+--------+----------------+-------------------------------------------------+ -| 15 | 15 | Problem state (if set to 1 certain instructions | -| | | are disabled). All linux user programs run with | -| | | this bit 1 (useful info for debugging under VM).| -+--------+----------------+-------------------------------------------------+ -| 16-17 | 16-17 | Address Space Control | -| | | | -| | | 00 Primary Space Mode: | -| | | | -| | | The register CR1 contains the primary | -| | | address-space control element (PASCE), which | -| | | points to the primary space region/segment | -| | | table origin. | -| | | | -| | | 01 Access register mode | -| | | | -| | | 10 Secondary Space Mode: | -| | | | -| | | The register CR7 contains the secondary | -| | | address-space control element (SASCE), which | -| | | points to the secondary space region or | -| | | segment table origin. 
| -| | | | -| | | 11 Home Space Mode: | -| | | | -| | | The register CR13 contains the home space | -| | | address-space control element (HASCE), which | -| | | points to the home space region/segment | -| | | table origin. | -| | | | -| | | See "Address Spaces on Linux for s/390 & | -| | | z/Architecture" below for more information | -| | | about address space usage in Linux. | -+--------+----------------+-------------------------------------------------+ -| 18-19 | 18-19 | Condition codes (CC) | -+--------+----------------+-------------------------------------------------+ -| 20 | 20 | Fixed point overflow mask if 1=FPU exceptions | -| | | for this event occur (normally 0) | -+--------+----------------+-------------------------------------------------+ -| 21 | 21 | Decimal overflow mask if 1=FPU exceptions for | -| | | this event occur (normally 0) | -+--------+----------------+-------------------------------------------------+ -| 22 | 22 | Exponent underflow mask if 1=FPU exceptions | -| | | for this event occur (normally 0) | -+--------+----------------+-------------------------------------------------+ -| 23 | 23 | Significance Mask if 1=FPU exceptions for this | -| | | event occur (normally 0) | -+--------+----------------+-------------------------------------------------+ -| 24-31 | 24-30 | Reserved Must be 0. 
| -| +----------------+-------------------------------------------------+ -| | 31 | Extended Addressing Mode | -| +----------------+-------------------------------------------------+ -| | 32 | Basic Addressing Mode | -| | | | -| | | Used to set addressing mode | -| | | | -| | | +---------+----------+----------+ | -| | | | PSW 31 | PSW 32 | | | -| | | +---------+----------+----------+ | -| | | | 0 | 0 | 24 bit | | -| | | +---------+----------+----------+ | -| | | | 0 | 1 | 31 bit | | -| | | +---------+----------+----------+ | -| | | | 1 | 1 | 64 bit | | -| | | +---------+----------+----------+ | -+--------+----------------+-------------------------------------------------+ -| 32 | | 1=31 bit addressing mode 0=24 bit addressing | -| | | mode (for backward compatibility), linux | -| | | always runs with this bit set to 1 | -+--------+----------------+-------------------------------------------------+ -| 33-64 | | Instruction address. | -| +----------------+-------------------------------------------------+ -| | 33-63 | Reserved must be 0 | -| +----------------+-------------------------------------------------+ -| | 64-127 | Address | -| | | | -| | | - In 24 bits mode bits 64-103=0 bits 104-127 | -| | | Address | -| | | - In 31 bits mode bits 64-96=0 bits 97-127 | -| | | Address | -| | | | -| | | Note: | -| | | unlike 31 bit mode on s/390 bit 96 must be | -| | | zero when loading the address with LPSWE | -| | | otherwise a specification exception occurs, | -| | | LPSW is fully backward compatible. | -+--------+----------------+-------------------------------------------------+ - -Prefix Page(s) --------------- -This per cpu memory area is too intimately tied to the processor not to mention. -It exists between the real addresses 0-4096 on s/390 and between 0-8192 on -z/Architecture and is exchanged with one page on s/390 or two pages on -z/Architecture in absolute storage by the set prefix instruction during Linux -startup. 
-This page is mapped to a different prefix for each processor in an SMP -configuration (assuming the OS designer is sane of course). -Bytes 0-512 (200 hex) on s/390 and 0-512, 4096-4544, 4604-5119 currently on -z/Architecture are used by the processor itself for holding such information -as exception indications and entry points for exceptions. -Bytes after 0xc00 hex are used by linux for per processor globals on s/390 and -z/Architecture (there is a gap on z/Architecture currently between 0xc00 and -0x1000, too, which is used by Linux). -The closest thing to this on traditional architectures is the interrupt -vector table. This is a good thing & does simplify some of the kernel coding -however it means that we now cannot catch stray NULL pointers in the -kernel without hard coded checks. - - - -Address Spaces on Intel Linux -============================= - -The traditional Intel Linux is approximately mapped as follows forgive -the ascii art. -0xFFFFFFFF 4GB Himem ***************** - * * - * Kernel Space * - * * - ***************** **************** -User Space Himem * User Stack * * * -(typically 0xC0000000 3GB ) ***************** * * - * Shared Libs * * Next Process * - ***************** * to * - * * <== * Run * <== - * User Program * * * - * Data BSS * * * - * Text * * * - * Sections * * * -0x00000000 ***************** **************** - -Now it is easy to see that on Intel it is quite easy to recognise a kernel -address as being one greater than user space himem (in this case 0xC0000000), -and addresses of less than this are the ones in the current running program on -this processor (if an smp box). -If using the virtual machine ( VM ) as a debugger it is quite difficult to -know which user process is running as the address space you are looking at -could be from any process in the run queue. 
- -The limitation of Intels addressing technique is that the linux -kernel uses a very simple real address to virtual addressing technique -of Real Address=Virtual Address-User Space Himem. -This means that on Intel the kernel linux can typically only address -Himem=0xFFFFFFFF-0xC0000000=1GB & this is all the RAM these machines -can typically use. -They can lower User Himem to 2GB or lower & thus be -able to use 2GB of RAM however this shrinks the maximum size -of User Space from 3GB to 2GB they have a no win limit of 4GB unless -they go to 64 Bit. - - -On 390 our limitations & strengths make us slightly different. -For backward compatibility we are only allowed use 31 bits (2GB) -of our 32 bit addresses, however, we use entirely separate address -spaces for the user & kernel. - -This means we can support 2GB of non Extended RAM on s/390, & more -with the Extended memory management swap device & -currently 4TB of physical memory currently on z/Architecture. - - -Address Spaces on Linux for s/390 & z/Architecture -================================================== - -Our addressing scheme is basically as follows: - - Primary Space Home Space -Himem 0x7fffffff 2GB on s/390 ***************** **************** -currently 0x3ffffffffff (2^42)-1 * User Stack * * * -on z/Architecture. ***************** * * - * Shared Libs * * * - ***************** * * - * * * Kernel * - * User Program * * * - * Data BSS * * * - * Text * * * - * Sections * * * -0x00000000 ***************** **************** - -This also means that we need to look at the PSW problem state bit and the -addressing mode to decide whether we are looking at user or kernel space. - -User space runs in primary address mode (or access register mode within -the vdso code). 
- -The kernel usually also runs in home space mode, however when accessing -user space the kernel switches to primary or secondary address mode if -the mvcos instruction is not available or if a compare-and-swap (futex) -instruction on a user space address is performed. - -When also looking at the ASCE control registers, this means: - -User space: -- runs in primary or access register mode -- cr1 contains the user asce -- cr7 contains the user asce -- cr13 contains the kernel asce - -Kernel space: -- runs in home space mode -- cr1 contains the user or kernel asce - -> the kernel asce is loaded when a uaccess requires primary or - secondary address mode -- cr7 contains the user or kernel asce, (changed with set_fs()) -- cr13 contains the kernel asce - -In case of uaccess the kernel changes to: -- primary space mode in case of a uaccess (copy_to_user) and uses - e.g. the mvcp instruction to access user space. However the kernel - will stay in home space mode if the mvcos instruction is available -- secondary space mode in case of futex atomic operations, so that the - instructions come from primary address space and data from secondary - space - -In case of KVM, the kernel runs in home space mode, but cr1 gets switched -to contain the gmap asce before the SIE instruction gets executed. When -the SIE instruction is finished, cr1 will be switched back to contain the -user asce. - - -Virtual Addresses on s/390 & z/Architecture -=========================================== - -A virtual address on s/390 is made up of 3 parts -The SX (segment index, roughly corresponding to the PGD & PMD in Linux -terminology) being bits 1-11. -The PX (page index, corresponding to the page table entry (pte) in Linux -terminology) being bits 12-19. -The remaining bits BX (the byte index are the offset in the page ) -i.e. bits 20 to 31. - -On z/Architecture in linux we currently make up an address from 4 parts. 
-The region index bits (RX) 0-32 we currently use bits 22-32 -The segment index (SX) being bits 33-43 -The page index (PX) being bits 44-51 -The byte index (BX) being bits 52-63 - -Notes: -1) s/390 has no PMD so the PMD is really the PGD also. -A lot of this stuff is defined in pgtable.h. - -2) Also seeing as s/390's page indexes are only 1k in size -(bits 12-19 x 4 bytes per pte ) we use 1 ( page 4k ) -to make the best use of memory by updating 4 segment indices -entries each time we mess with a PMD & use offsets -0,1024,2048 & 3072 in this page as for our segment indexes. -On z/Architecture our page indexes are now 2k in size -( bits 12-19 x 8 bytes per pte ) we do a similar trick -but only mess with 2 segment indices each time we mess with -a PMD. - -3) As z/Architecture supports up to a massive 5-level page table lookup we -can only use 3 currently on Linux ( as this is all the generic kernel -currently supports ) however this may change in future -this allows us to access ( according to my sums ) -4TB of virtual storage per process i.e. -4096*512(PTES)*1024(PMDS)*2048(PGD) = 4398046511104 bytes, -enough for another 2 or 3 of years I think :-). -to do this we use a region-third-table designation type in -our address space control registers. - - -The Linux for s/390 & z/Architecture Kernel Task Structure -========================================================== -Each process/thread under Linux for S390 has its own kernel task_struct -defined in linux/include/linux/sched.h -The S390 on initialisation & resuming of a process on a cpu sets -the __LC_KERNEL_STACK variable in the spare prefix area for this cpu -(which we use for per-processor globals). - -The kernel stack pointer is intimately tied with the task structure for -each processor as follows. 
- - s/390 - ************************ - * 1 page kernel stack * - * ( 4K ) * - ************************ - * 1 page task_struct * - * ( 4K ) * -8K aligned ************************ - - z/Architecture - ************************ - * 2 page kernel stack * - * ( 8K ) * - ************************ - * 2 page task_struct * - * ( 8K ) * -16K aligned ************************ - -What this means is that we don't need to dedicate any register or global -variable to point to the current running process & can retrieve it with the -following very simple construct for s/390 & one very similar for z/Architecture. - -static inline struct task_struct * get_current(void) -{ - struct task_struct *current; - __asm__("lhi %0,-8192\n\t" - "nr %0,15" - : "=r" (current) ); - return current; -} - -i.e. just anding the current kernel stack pointer with the mask -8192. -Thankfully because Linux doesn't have support for nested IO interrupts -& our devices have large buffers can survive interrupts being shut for -short amounts of time we don't need a separate stack for interrupts. - - - - -Register Usage & Stackframes on Linux for s/390 & z/Architecture -================================================================= -Overview: ---------- -This is the code that gcc produces at the top & the bottom of -each function. It usually is fairly consistent & similar from -function to function & if you know its layout you can probably -make some headway in finding the ultimate cause of a problem -after a crash without a source level debugger. - -Note: To follow stackframes requires a knowledge of C or Pascal & -limited knowledge of one assembly language. - -It should be noted that there are some differences between the -s/390 and z/Architecture stack layouts as the z/Architecture stack layout -didn't have to maintain compatibility with older linkage formats. 
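The stack layouts discussed here sit on top of the task structure arrangement shown earlier, where get_current() ANDs the kernel stack pointer with -8192. The masking step alone can be sketched in portable C as below. The function name task_base_from_sp() and the two macros are hypothetical, and the 16384 value for z/Architecture is an assumption derived from the "16K aligned" diagram rather than from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Alignment of the combined task_struct + kernel stack block. */
#define S390_STACK_ALIGN   8192u   /* 4K task_struct + 4K stack */
#define ZARCH_STACK_ALIGN 16384u   /* 8K task_struct + 8K stack (assumed) */

/*
 * Round a kernel stack pointer down to the start of its aligned block,
 * which is where the task_struct lives in the layout described above.
 * For a power-of-two align, sp & ~(align - 1) equals sp & -align, i.e.
 * the same operation as "lhi %0,-8192; nr %0,15" on s/390.
 */
uintptr_t task_base_from_sp(uintptr_t sp, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return sp & ~(align - 1);
}
```

For instance, a stack pointer of 0x00a03f58 maps to 0x00a02000 with the s/390 alignment and to 0x00a00000 with the assumed z/Architecture alignment.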
- -Glossary: ---------- -alloca: -This is a built in compiler function for runtime allocation -of extra space on the callers stack which is obviously freed -up on function exit ( e.g. the caller may choose to allocate nothing -of a buffer of 4k if required for temporary purposes ), it generates -very efficient code ( a few cycles ) when compared to alternatives -like malloc. - -automatics: These are local variables on the stack, -i.e they aren't in registers & they aren't static. - -back-chain: -This is a pointer to the stack pointer before entering a -framed functions ( see frameless function ) prologue got by -dereferencing the address of the current stack pointer, - i.e. got by accessing the 32 bit value at the stack pointers -current location. - -base-pointer: -This is a pointer to the back of the literal pool which -is an area just behind each procedure used to store constants -in each function. - -call-clobbered: The caller probably needs to save these registers if there -is something of value in them, on the stack or elsewhere before making a -call to another procedure so that it can restore it later. - -epilogue: -The code generated by the compiler to return to the caller. - -frameless-function -A frameless function in Linux for s390 & z/Architecture is one which doesn't -need more than the register save area (96 bytes on s/390, 160 on z/Architecture) -given to it by the caller. -A frameless function never: -1) Sets up a back chain. -2) Calls alloca. -3) Calls other normal functions -4) Has automatics. - -GOT-pointer: -This is a pointer to the global-offset-table in ELF -( Executable Linkable Format, Linux'es most common executable format ), -all globals & shared library objects are found using this pointer. - -lazy-binding -ELF shared libraries are typically only loaded when routines in the shared -library are actually first called at runtime. This is lazy binding. 
- -procedure-linkage-table -This is a table found from the GOT which contains pointers to routines -in other shared libraries which can't be called to by easier means. - -prologue: -The code generated by the compiler to set up the stack frame. - -outgoing-args: -This is extra area allocated on the stack of the calling function if the -parameters for the callee's cannot all be put in registers, the same -area can be reused by each function the caller calls. - -routine-descriptor: -A COFF executable format based concept of a procedure reference -actually being 8 bytes or more as opposed to a simple pointer to the routine. -This is typically defined as follows -Routine Descriptor offset 0=Pointer to Function -Routine Descriptor offset 4=Pointer to Table of Contents -The table of contents/TOC is roughly equivalent to a GOT pointer. -& it means that shared libraries etc. can be shared between several -environments each with their own TOC. - - -static-chain: This is used in nested functions a concept adopted from pascal -by gcc not used in ansi C or C++ ( although quite useful ), basically it -is a pointer used to reference local variables of enclosing functions. -You might come across this stuff once or twice in your lifetime. - -e.g. -The function below should return 11 though gcc may get upset & toss warnings -about unused variables. -int FunctionA(int a) -{ - int b; - FunctionC(int c) - { - b=c+1; - } - FunctionC(10); - return(b); -} - - -s/390 & z/Architecture Register usage -===================================== -r0 used by syscalls/assembly call-clobbered -r1 used by syscalls/assembly call-clobbered -r2 argument 0 / return value 0 call-clobbered -r3 argument 1 / return value 1 (if long long) call-clobbered -r4 argument 2 call-clobbered -r5 argument 3 call-clobbered -r6 argument 4 saved -r7 pointer-to arguments 5 to ... 
saved -r8 this & that saved -r9 this & that saved -r10 static-chain ( if nested function ) saved -r11 frame-pointer ( if function used alloca ) saved -r12 got-pointer saved -r13 base-pointer saved -r14 return-address saved -r15 stack-pointer saved - -f0 argument 0 / return value ( float/double ) call-clobbered -f2 argument 1 call-clobbered -f4 z/Architecture argument 2 saved -f6 z/Architecture argument 3 saved -The remaining floating points -f1,f3,f5 f7-f15 are call-clobbered. - -Notes: ------- -1) The only requirement is that registers which are used -by the callee are saved, e.g. the compiler is perfectly -capable of using r11 for purposes other than a frame a -frame pointer if a frame pointer is not needed. -2) In functions with variable arguments e.g. printf the calling procedure -is identical to one without variable arguments & the same number of -parameters. However, the prologue of this function is somewhat more -hairy owing to it having to move these parameters to the stack to -get va_start, va_arg & va_end to work. -3) Access registers are currently unused by gcc but are used in -the kernel. Possibilities exist to use them at the moment for -temporary storage but it isn't recommended. -4) Only 4 of the floating point registers are used for -parameter passing as older machines such as G3 only have only 4 -& it keeps the stack frame compatible with other compilers. -However with IEEE floating point emulation under linux on the -older machines you are free to use the other 12. -5) A long long or double parameter cannot be have the -first 4 bytes in a register & the second four bytes in the -outgoing args area. It must be purely in the outgoing args -area if crossing this boundary. -6) Floating point parameters are mixed with outgoing args -on the outgoing args area in the order the are passed in as parameters. 
-7) Floating point arguments 2 & 3 are saved in the outgoing args area for -z/Architecture - - -Stack Frame Layout ------------------- -s/390 z/Architecture -0 0 back chain ( a 0 here signifies end of back chain ) -4 8 eos ( end of stack, not used on Linux for S390 used in other linkage formats ) -8 16 glue used in other s/390 linkage formats for saved routine descriptors etc. -12 24 glue used in other s/390 linkage formats for saved routine descriptors etc. -16 32 scratch area -20 40 scratch area -24 48 saved r6 of caller function -28 56 saved r7 of caller function -32 64 saved r8 of caller function -36 72 saved r9 of caller function -40 80 saved r10 of caller function -44 88 saved r11 of caller function -48 96 saved r12 of caller function -52 104 saved r13 of caller function -56 112 saved r14 of caller function -60 120 saved r15 of caller function -64 128 saved f4 of caller function -72 132 saved f6 of caller function -80 undefined -96 160 outgoing args passed from caller to callee -96+x 160+x possible stack alignment ( 8 bytes desirable ) -96+x+y 160+x+y alloca space of caller ( if used ) -96+x+y+z 160+x+y+z automatics of caller ( if used ) -0 back-chain - -A sample program with comments. -=============================== - -Comments on the function test ------------------------------ -1) It didn't need to set up a pointer to the constant pool gpr13 as it is not -used ( :-( ). -2) This is a frameless function & no stack is bought. -3) The compiler was clever enough to recognise that it could return the -value in r2 as well as use it for the passed in parameter ( :-) ). -4) The basr ( branch relative & save ) trick works as follows the instruction -has a special case with r0,r0 with some instruction operands is understood as -the literal value 0, some risc architectures also do this ). 
So now -we are branching to the next address & the address new program counter is -in r13,so now we subtract the size of the function prologue we have executed -+ the size of the literal pool to get to the top of the literal pool -0040037c int test(int b) -{ # Function prologue below - 40037c: 90 de f0 34 stm %r13,%r14,52(%r15) # Save registers r13 & r14 - 400380: 0d d0 basr %r13,%r0 # Set up pointer to constant pool using - 400382: a7 da ff fa ahi %r13,-6 # basr trick - return(5+b); - # Huge main program - 400386: a7 2a 00 05 ahi %r2,5 # add 5 to r2 - - # Function epilogue below - 40038a: 98 de f0 34 lm %r13,%r14,52(%r15) # restore registers r13 & 14 - 40038e: 07 fe br %r14 # return -} - -Comments on the function main ------------------------------ -1) The compiler did this function optimally ( 8-) ) - -Literal pool for main. -400390: ff ff ff ec .long 0xffffffec -main(int argc,char *argv[]) -{ # Function prologue below - 400394: 90 bf f0 2c stm %r11,%r15,44(%r15) # Save necessary registers - 400398: 18 0f lr %r0,%r15 # copy stack pointer to r0 - 40039a: a7 fa ff a0 ahi %r15,-96 # Make area for callee saving - 40039e: 0d d0 basr %r13,%r0 # Set up r13 to point to - 4003a0: a7 da ff f0 ahi %r13,-16 # literal pool - 4003a4: 50 00 f0 00 st %r0,0(%r15) # Save backchain - - return(test(5)); # Main Program Below - 4003a8: 58 e0 d0 00 l %r14,0(%r13) # load relative address of test from - # literal pool - 4003ac: a7 28 00 05 lhi %r2,5 # Set first parameter to 5 - 4003b0: 4d ee d0 00 bas %r14,0(%r14,%r13) # jump to test setting r14 as return - # address using branch & save instruction. - - # Function Epilogue below - 4003b4: 98 bf f0 8c lm %r11,%r15,140(%r15)# Restore necessary registers. 
- 4003b8: 07 fe br %r14 # return to do program exit -} - - -Compiler updates ----------------- - -main(int argc,char *argv[]) -{ - 4004fc: 90 7f f0 1c stm %r7,%r15,28(%r15) - 400500: a7 d5 00 04 bras %r13,400508 - 400504: 00 40 04 f4 .long 0x004004f4 - # compiler now puts constant pool in code to so it saves an instruction - 400508: 18 0f lr %r0,%r15 - 40050a: a7 fa ff a0 ahi %r15,-96 - 40050e: 50 00 f0 00 st %r0,0(%r15) - return(test(5)); - 400512: 58 10 d0 00 l %r1,0(%r13) - 400516: a7 28 00 05 lhi %r2,5 - 40051a: 0d e1 basr %r14,%r1 - # compiler adds 1 extra instruction to epilogue this is done to - # avoid processor pipeline stalls owing to data dependencies on g5 & - # above as register 14 in the old code was needed directly after being loaded - # by the lm %r11,%r15,140(%r15) for the br %14. - 40051c: 58 40 f0 98 l %r4,152(%r15) - 400520: 98 7f f0 7c lm %r7,%r15,124(%r15) - 400524: 07 f4 br %r4 -} - - -Hartmut ( our compiler developer ) also has been threatening to take out the -stack backchain in optimised code as this also causes pipeline stalls, you -have been warned. - -64 bit z/Architecture code disassembly --------------------------------------- - -If you understand the stuff above you'll understand the stuff -below too so I'll avoid repeating myself & just say that -some of the instructions have g's on the end of them to indicate -they are 64 bit & the stack offsets are a bigger, -the only other difference you'll find between 32 & 64 bit is that -we now use f4 & f6 for floating point arguments on 64 bit. -00000000800005b0 : -int test(int b) -{ - return(5+b); - 800005b0: a7 2a 00 05 ahi %r2,5 - 800005b4: b9 14 00 22 lgfr %r2,%r2 # downcast to integer - 800005b8: 07 fe br %r14 - 800005ba: 07 07 bcr 0,%r7 - - -} - -00000000800005bc
: -main(int argc,char *argv[]) -{ - 800005bc: eb bf f0 58 00 24 stmg %r11,%r15,88(%r15) - 800005c2: b9 04 00 1f lgr %r1,%r15 - 800005c6: a7 fb ff 60 aghi %r15,-160 - 800005ca: e3 10 f0 00 00 24 stg %r1,0(%r15) - return(test(5)); - 800005d0: a7 29 00 05 lghi %r2,5 - # brasl allows jumps > 64k & is overkill here bras would do fune - 800005d4: c0 e5 ff ff ff ee brasl %r14,800005b0 - 800005da: e3 40 f1 10 00 04 lg %r4,272(%r15) - 800005e0: eb bf f0 f8 00 04 lmg %r11,%r15,248(%r15) - 800005e6: 07 f4 br %r4 -} - - - -Compiling programs for debugging on Linux for s/390 & z/Architecture -==================================================================== --gdwarf-2 now works it should be considered the default debugging -format for s/390 & z/Architecture as it is more reliable for debugging -shared libraries, normal -g debugging works much better now -Thanks to the IBM java compiler developers bug reports. - -This is typically done adding/appending the flags -g or -gdwarf-2 to the -CFLAGS & LDFLAGS variables Makefile of the program concerned. - -If using gdb & you would like accurate displays of registers & - stack traces compile without optimisation i.e make sure -that there is no -O2 or similar on the CFLAGS line of the Makefile & -the emitted gcc commands, obviously this will produce worse code -( not advisable for shipment ) but it is an aid to the debugging process. - -This aids debugging because the compiler will copy parameters passed in -in registers onto the stack so backtracing & looking at passed in -parameters will work, however some larger programs which use inline functions -will not compile without optimisation. - -Debugging with optimisation has since much improved after fixing -some bugs, please make sure you are using gdb-5.0 or later developed -after Nov'2000. - - - -Debugging under VM -================== - -Notes ------ -Addresses & values in the VM debugger are always hex never decimal -Address ranges are of the format - or -. 
-For example, the address range 0x2000 to 0x3000 can be described as 2000-3000 -or 2000.1000 - -The VM Debugger is case insensitive. - -VM's strengths are usually other debuggers weaknesses you can get at any -resource no matter how sensitive e.g. memory management resources, change -address translation in the PSW. For kernel hacking you will reap dividends if -you get good at it. - -The VM Debugger displays operators but not operands, and also the debugger -displays useful information on the same line as the author of the code probably -felt that it was a good idea not to go over the 80 columns on the screen. -This isn't as unintuitive as it may seem as the s/390 instructions are easy to -decode mentally and you can make a good guess at a lot of them as all the -operands are nibble (half byte aligned). -So if you have an objdump listing by hand, it is quite easy to follow, and if -you don't have an objdump listing keep a copy of the s/390 Reference Summary -or alternatively the s/390 principles of operation next to you. -e.g. even I can guess that -0001AFF8' LR 180F CC 0 -is a ( load register ) lr r0,r15 - -Also it is very easy to tell the length of a 390 instruction from the 2 most -significant bits in the instruction (not that this info is really useful except -if you are trying to make sense of a hexdump of code). -Here is a table -Bits Instruction Length ------------------------------------------- -00 2 Bytes -01 4 Bytes -10 4 Bytes -11 6 Bytes - -The debugger also displays other useful info on the same line such as the -addresses being operated on destination addresses of branches & condition codes. -e.g. -00019736' AHI A7DAFF0E CC 1 -000198BA' BRC A7840004 -> 000198C2' CC 0 -000198CE' STM 900EF068 >> 0FA95E78 CC 2 - - - -Useful VM debugger commands ---------------------------- - -I suppose I'd better mention this before I start -to list the current active traces do -Q TR -there can be a maximum of 255 of these per set -( more about trace sets later ). 
-To stop traces issue a -TR END. -To delete a particular breakpoint issue -TR DEL - -The PA1 key drops to CP mode so you can issue debugger commands, -Doing alt c (on my 3270 console at least ) clears the screen. -hitting b comes back to the running operating system -from cp mode ( in our case linux ). -It is typically useful to add shortcuts to your profile.exec file -if you have one ( this is roughly equivalent to autoexec.bat in DOS ). -file here are a few from mine. -/* this gives me command history on issuing f12 */ -set pf12 retrieve -/* this continues */ -set pf8 imm b -/* goes to trace set a */ -set pf1 imm tr goto a -/* goes to trace set b */ -set pf2 imm tr goto b -/* goes to trace set c */ -set pf3 imm tr goto c - - - -Instruction Tracing -------------------- -Setting a simple breakpoint -TR I PSWA
-To debug a particular function try -TR I R -TR I on its own will single step. -TR I DATA will trace for particular mnemonics -e.g. -TR I DATA 4D R 0197BC.4000 -will trace for BAS'es ( opcode 4D ) in the range 0197BC.4000 -if you were inclined you could add traces for all branch instructions & -suffix them with the run prefix so you would have a backtrace on screen -when a program crashes. -TR BR will trace branches into or out of an address. -e.g. -TR BR INTO 0 is often quite useful if a program is getting awkward & deciding -to branch to 0 & crashing as this will stop at the address before in jumps to 0. -TR I R
RUN cmd d g -single steps a range of addresses but stays running & -displays the gprs on each step. - - - -Displaying & modifying Registers --------------------------------- -D G will display all the gprs -Adding a extra G to all the commands is necessary to access the full 64 bit -content in VM on z/Architecture. Obviously this isn't required for access -registers as these are still 32 bit. -e.g. DGG instead of DG -D X will display all the control registers -D AR will display all the access registers -D AR4-7 will display access registers 4 to 7 -CPU ALL D G will display the GRPS of all CPUS in the configuration -D PSW will display the current PSW -st PSW 2000 will put the value 2000 into the PSW & -cause crash your machine. -D PREFIX displays the prefix offset - - -Displaying Memory ------------------ -To display memory mapped using the current PSW's mapping try -D -To make VM display a message each time it hits a particular address and -continue try -D I will disassemble/display a range of instructions. -ST addr 32 bit word will store a 32 bit aligned address -D T will display the EBCDIC in an address (if you are that way inclined) -D R will display real addresses ( without DAT ) but with prefixing. -There are other complex options to display if you need to get at say home space -but are in primary space the easiest thing to do is to temporarily -modify the PSW to the other addressing mode, display the stuff & then -restore it. - - - -Hints ------ -If you want to issue a debugger command without halting your virtual machine -with the PA1 key try prefixing the command with #CP e.g. -#cp tr i pswa 2000 -also suffixing most debugger commands with RUN will cause them not -to stop just display the mnemonic at the current instruction on the console. -If you have several breakpoints you want to put into your program & -you get fed up of cross referencing with System.map -you can do the following trick for several symbols. 
-grep do_signal System.map -which emits the following among other things -0001f4e0 T do_signal -now you can do - -TR I PSWA 0001f4e0 cmd msg * do_signal -This sends a message to your own console each time do_signal is entered. -( As an aside I wrote a perl script once which automatically generated a REXX -script with breakpoints on every kernel procedure, this isn't a good idea -because there are thousands of these routines & VM can only set 255 breakpoints -at a time so you nearly had to spend as long pruning the file down as you would -entering the msgs by hand), however, the trick might be useful for a single -object file. In the 3270 terminal emulator x3270 there is a very useful option -in the file menu called "Save Screen In File" - this is very good for keeping a -copy of traces. - -From CMS help will give you online help on a particular command. -e.g. -HELP DISPLAY - -Also CP has a file called profile.exec which automatically gets called -on startup of CMS ( like autoexec.bat ), keeping on a DOS analogy session -CP has a feature similar to doskey, it may be useful for you to -use profile.exec to define some keystrokes. -e.g. -SET PF9 IMM B -This does a single step in VM on pressing F8. -SET PF10 ^ -This sets up the ^ key. -which can be used for ^c (ctrl-c),^z (ctrl-z) which can't be typed directly -into some 3270 consoles. -SET PF11 ^- -This types the starting keystrokes for a sysrq see SysRq below. -SET PF12 RETRIEVE -This retrieves command history on pressing F12. - - -Sometimes in VM the display is set up to scroll automatically this -can be very annoying if there are messages you wish to look at -to stop this do -TERM MORE 255 255 -This will nearly stop automatic screen updates, however it will -cause a denial of service if lots of messages go to the 3270 console, -so it would be foolish to use this as the default on a production machine. 
- - -Tracing particular processes ----------------------------- -The kernel's text segment is intentionally at an address in memory that it will -very seldom collide with text segments of user programs ( thanks Martin ), -this simplifies debugging the kernel. -However it is quite common for user processes to have addresses which collide -this can make debugging a particular process under VM painful under normal -circumstances as the process may change when doing a -TR I R
. -Thankfully after reading VM's online help I figured out how to debug -I particular process. - -Your first problem is to find the STD ( segment table designation ) -of the program you wish to debug. -There are several ways you can do this here are a few -1) objdump --syms | grep main -To get the address of main in the program. -tr i pswa
-Start the program, if VM drops to CP on what looks like the entry -point of the main function this is most likely the process you wish to debug. -Now do a D X13 or D XG13 on z/Architecture. -On 31 bit the STD is bits 1-19 ( the STO segment table origin ) -& 25-31 ( the STL segment table length ) of CR13. -now type -TR I R STD 0.7fffffff -e.g. -TR I R STD 8F32E1FF 0.7fffffff -Another very useful variation is -TR STORE INTO STD
-for finding out when a particular variable changes. - -An alternative way of finding the STD of a currently running process -is to do the following, ( this method is more complex but -could be quite convenient if you aren't updating the kernel much & -so your kernel structures will stay constant for a reasonable period of -time ). - -grep task /proc//status -from this you should see something like -task: 0f160000 ksp: 0f161de8 pt_regs: 0f161f68 -This now gives you a pointer to the task structure. -Now make CC:="s390-gcc -g" kernel/sched.s -To get the task_struct stabinfo. -( task_struct is defined in include/linux/sched.h ). -Now we want to look at -task->active_mm->pgd -on my machine the active_mm in the task structure stab is -active_mm:(4,12),672,32 -its offset is 672/8=84=0x54 -the pgd member in the mm_struct stab is -pgd:(4,6)=*(29,5),96,32 -so its offset is 96/8=12=0xc - -so we'll -hexdump -s 0xf160054 /dev/mem | more -i.e. task_struct+active_mm offset -to look at the active_mm member -f160054 0fee cc60 0019 e334 0000 0000 0000 0011 -hexdump -s 0x0feecc6c /dev/mem | more -i.e. active_mm+pgd offset -feecc6c 0f2c 0000 0000 0001 0000 0001 0000 0010 -we get something like -now do -TR I R STD 0.7fffffff -i.e. the 0x7f is added because the pgd only -gives the page table origin & we need to set the low bits -to the maximum possible segment table length. -TR I R STD 0f2c007f 0.7fffffff -on z/Architecture you'll probably need to do -TR I R STD 0.ffffffffffffffff -to set the TableType to 0x1 & the Table length to 3. - - - -Tracing Program Exceptions --------------------------- -If you get a crash which says something like -illegal operation or specification exception followed by a register dump -You can restart linux & trace these using the tr prog trace -option. 
- - -The most common ones you will normally be tracing for is -1=operation exception -2=privileged operation exception -4=protection exception -5=addressing exception -6=specification exception -10=segment translation exception -11=page translation exception - -The full list of these is on page 22 of the current s/390 Reference Summary. -e.g. -tr prog 10 will trace segment translation exceptions. -tr prog on its own will trace all program interruption codes. - -Trace Sets ----------- -On starting VM you are initially in the INITIAL trace set. -You can do a Q TR to verify this. -If you have a complex tracing situation where you wish to wait for instance -till a driver is open before you start tracing IO, but know in your -heart that you are going to have to make several runs through the code till you -have a clue whats going on. - -What you can do is -TR I PSWA -hit b to continue till breakpoint -reach the breakpoint -now do your -TR GOTO B -TR IO 7c08-7c09 inst int run -or whatever the IO channels you wish to trace are & hit b - -To got back to the initial trace set do -TR GOTO INITIAL -& the TR I PSWA will be the only active breakpoint again. - - -Tracing linux syscalls under VM -------------------------------- -Syscalls are implemented on Linux for S390 by the Supervisor call instruction -(SVC). There 256 possibilities of these as the instruction is made up of a 0xA -opcode and the second byte being the syscall number. They are traced using the -simple command: -TR SVC -the syscalls are defined in linux/arch/s390/include/asm/unistd.h -e.g. 
to trace all file opens just do -TR SVC 5 ( as this is the syscall number of open ) - - -SMP Specific commands ---------------------- -To find out how many cpus you have -Q CPUS displays all the CPU's available to your virtual machine -To find the cpu that the current cpu VM debugger commands are being directed at -do Q CPU to change the current cpu VM debugger commands are being directed at do -CPU - -On a SMP guest issue a command to all CPUs try prefixing the command with cpu -all. To issue a command to a particular cpu try cpu e.g. -CPU 01 TR I R 2000.3000 -If you are running on a guest with several cpus & you have a IO related problem -& cannot follow the flow of code but you know it isn't smp related. -from the bash prompt issue -shutdown -h now or halt. -do a Q CPUS to find out how many cpus you have -detach each one of them from cp except cpu 0 -by issuing a -DETACH CPU 01-(number of cpus in configuration) -& boot linux again. -TR SIGP will trace inter processor signal processor instructions. -DEFINE CPU 01-(number in configuration) -will get your guests cpus back. - - -Help for displaying ascii textstrings -------------------------------------- -On the very latest VM Nucleus'es VM can now display ascii -( thanks Neale for the hint ) by doing -D TX. -e.g. -D TX0.100 - -Alternatively -============= -Under older VM debuggers (I love EBDIC too) you can use following little -program which converts a command line of hex digits to ascii text. It can be -compiled under linux and you can copy the hex digits from your x3270 terminal -to your xterm if you are debugging from a linuxbox. - -This is quite useful when looking at a parameter passed in as a text string -under VM ( unless you are good at decoding ASCII in your head ). - -e.g. 
consider tracing an open syscall -TR SVC 5 -We have stopped at a breakpoint -000151B0' SVC 0A05 -> 0001909A' CC 0 - -D 20.8 to check the SVC old psw in the prefix area and see was it from userspace -(for the layout of the prefix area consult the "Fixed Storage Locations" -chapter of the s/390 Reference Summary if you have it available). -V00000020 070C2000 800151B2 -The problem state bit wasn't set & it's also too early in the boot sequence -for it to be a userspace SVC if it was we would have to temporarily switch the -psw to user space addressing so we could get at the first parameter of the open -in gpr2. -Next do a -D G2 -GPR 2 = 00014CB4 -Now display what gpr2 is pointing to -D 00014CB4.20 -V00014CB4 2F646576 2F636F6E 736F6C65 00001BF5 -V00014CC4 FC00014C B4001001 E0001000 B8070707 -Now copy the text till the first 00 hex ( which is the end of the string -to an xterm & do hex2ascii on it. -hex2ascii 2F646576 2F636F6E 736F6C65 00 -outputs -Decoded Hex:=/ d e v / c o n s o l e 0x00 -We were opening the console device, - -You can compile the code below yourself for practice :-), -/* - * hex2ascii.c - * a useful little tool for converting a hexadecimal command line to ascii - * - * Author(s): Denis Joseph Barrow (djbarrow@de.ibm.com,barrow_dj@yahoo.com) - * (C) 2000 IBM Deutschland Entwicklung GmbH, IBM Corporation. 
- */ -#include <stdio.h> - -int main(int argc,char *argv[]) -{ - int cnt1,cnt2,len,toggle=0; - int startcnt=1; - unsigned char c,hex; - - if(argc>1&&(strcmp(argv[1],"-a")==0)) - startcnt=2; - printf("Decoded Hex:="); - for(cnt1=startcnt;cnt1<argc;cnt1++) - { - len=strlen(argv[cnt1]); - for(cnt2=0;cnt2<len;cnt2++) - { - c=argv[cnt1][cnt2]; - if(c>='0'&&c<='9') - c=c-'0'; - if(c>='A'&&c<='F') - c=c-'A'+10; - if(c>='a'&&c<='f') - c=c-'a'+10; - switch(toggle) - { - case 0: - hex=c<<4; - toggle=1; - break; - case 1: - hex+=c; - if(hex<32||hex>127) - { - if(startcnt==1) - printf("0x%02X ",(int)hex); - else - printf("."); - } - else - { - printf("%c",hex); - if(startcnt==1) - printf(" "); - } - toggle=0; - break; - } - } - } - printf("\n"); -} - - - - -Stack tracing under VM ----------------------- -A basic backtrace ------------------ - -Here are the tricks I use 9 out of 10 times it works pretty well, - -When your backchain reaches a dead end --------------------------------------- -This can happen when an exception happens in the kernel and the kernel is -entered twice. If you reach the NULL pointer at the end of the back chain you -should be able to sniff further back if you follow the following tricks. -1) A kernel address should be easy to recognise since it is in -primary space & the problem state bit isn't set & also -The Hi bit of the address is set. -2) Another backchain should also be easy to recognise since it is an -address pointing to another address approximately 100 bytes or 0x70 hex -behind the current stackpointer. - - -Here is some practice. -boot the kernel & hit PA1 at some random time -d g to display the gprs, this should display something like -GPR 0 = 00000001 00156018 0014359C 00000000 -GPR 4 = 00000001 001B8888 000003E0 00000000 -GPR 8 = 00100080 00100084 00000000 000FE000 -GPR 12 = 00010400 8001B2DC 8001B36A 000FFED8 -Note that GPR14 is a return address but as we are real men we are going to -trace the stack. -display 0x40 bytes after the stack pointer.
- -V000FFED8 000FFF38 8001B838 80014C8E 000FFF38 -V000FFEE8 00000000 00000000 000003E0 00000000 -V000FFEF8 00100080 00100084 00000000 000FE000 -V000FFF08 00010400 8001B2DC 8001B36A 000FFED8 - - -Ah now look at whats in sp+56 (sp+0x38) this is 8001B36A our saved r14 if -you look above at our stackframe & also agrees with GPR14. - -now backchain -d 000FFF38.40 -we now are taking the contents of SP to get our first backchain. - -V000FFF38 000FFFA0 00000000 00014995 00147094 -V000FFF48 00147090 001470A0 000003E0 00000000 -V000FFF58 00100080 00100084 00000000 001BF1D0 -V000FFF68 00010400 800149BA 80014CA6 000FFF38 - -This displays a 2nd return address of 80014CA6 - -now do d 000FFFA0.40 for our 3rd backchain - -V000FFFA0 04B52002 0001107F 00000000 00000000 -V000FFFB0 00000000 00000000 FF000000 0001107F -V000FFFC0 00000000 00000000 00000000 00000000 -V000FFFD0 00010400 80010802 8001085A 000FFFA0 - - -our 3rd return address is 8001085A - -as the 04B52002 looks suspiciously like rubbish it is fair to assume that the -kernel entry routines for the sake of optimisation don't set up a backchain. - -now look at System.map to see if the addresses make any sense. - -grep -i 0001b3 System.map -outputs among other things -0001b304 T cpu_idle -so 8001B36A -is cpu_idle+0x66 ( quiet the cpu is asleep, don't wake it ) - - -grep -i 00014 System.map -produces among other things -00014a78 T start_kernel -so 0014CA6 is start_kernel+some hex number I can't add in my head. - -grep -i 00108 System.map -this produces -00010800 T _stext -so 8001085A is _stext+0x5a - -Congrats you've done your first backchain. - - - -s/390 & z/Architecture IO Overview -================================== - -I am not going to give a course in 390 IO architecture as this would take me -quite a while and I'm no expert. Instead I'll give a 390 IO architecture -summary for Dummies. If you have the s/390 principles of operation available -read this instead. 
If nothing else you may find a few useful keywords in here -and be able to use them on a web search engine to find more useful information. - -Unlike other bus architectures modern 390 systems do their IO using mostly -fibre optics and devices such as tapes and disks can be shared between several -mainframes. Also S390 can support up to 65536 devices while a high end PC based -system might be choking with around 64. - -Here is some of the common IO terminology: - -Subchannel: -This is the logical number most IO commands use to talk to an IO device. There -can be up to 0x10000 (65536) of these in a configuration, typically there are a -few hundred. Under VM for simplicity they are allocated contiguously, however -on the native hardware they are not. They typically stay consistent between -boots provided no new hardware is inserted or removed. -Under Linux for s390 we use these as IRQ's and also when issuing an IO command -(CLEAR SUBCHANNEL, HALT SUBCHANNEL, MODIFY SUBCHANNEL, RESUME SUBCHANNEL, -START SUBCHANNEL, STORE SUBCHANNEL and TEST SUBCHANNEL). We use this as the ID -of the device we wish to talk to. The most important of these instructions are -START SUBCHANNEL (to start IO), TEST SUBCHANNEL (to check whether the IO -completed successfully) and HALT SUBCHANNEL (to kill IO). A subchannel can have -up to 8 channel paths to a device, this offers redundancy if one is not -available. - -Device Number: -This number remains static and is closely tied to the hardware. There are 65536 -of these, made up of a CHPID (Channel Path ID, the most significant 8 bits) and -another lsb 8 bits. These remain static even if more devices are inserted or -removed from the hardware. There is a 1 to 1 mapping between subchannels and -device numbers, provided devices aren't inserted or removed. 
- -Channel Control Words: -CCWs are linked lists of instructions initially pointed to by an operation -request block (ORB), which is initially given to Start Subchannel (SSCH) -command along with the subchannel number for the IO subsystem to process -while the CPU continues executing normal code. -CCWs come in two flavours, Format 0 (24 bit for backward compatibility) and -Format 1 (31 bit). These are typically used to issue read and write (and many -other) instructions. They consist of a length field and an absolute address -field. -Each IO typically gets 1 or 2 interrupts, one for channel end (primary status) -when the channel is idle, and the second for device end (secondary status). -Sometimes you get both concurrently. You check how the IO went on by issuing a -TEST SUBCHANNEL at each interrupt, from which you receive an Interruption -response block (IRB). If you get channel and device end status in the IRB -without channel checks etc. your IO probably went okay. If you didn't you -probably need to examine the IRB, extended status word etc. -If an error occurs, more sophisticated control units have a facility known as -concurrent sense. This means that if an error occurs Extended sense information -will be presented in the Extended status word in the IRB. If not you have to -issue a subsequent SENSE CCW command after the test subchannel. - - -TPI (Test pending interrupt) can also be used for polled IO, but in -multitasking multiprocessor systems it isn't recommended except for -checking special cases (i.e. non looping checks for pending IO etc.). - -Store Subchannel and Modify Subchannel can be used to examine and modify -operating characteristics of a subchannel (e.g. channel paths). - -Other IO related Terms: -Sysplex: S390's Clustering Technology -QDIO: S390's new high speed IO architecture to support devices such as gigabit -ethernet, this architecture is also designed to be forward compatible with -upcoming 64 bit machines. 
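To make the "length field and absolute address field" of a CCW more concrete, here is a small C sketch that pulls the fields out of a CCW given as two 32 bit words. It assumes the Format 1 layout (command code, flags, 16 bit count, then the data address); the struct and function names are invented for this example, and the Principles of Operation remains the authoritative definition. The sample words in the usage note come from the VM trace shown later in this document, and treating them as a Format 1 CCW is an assumption.

```c
#include <stdint.h>

/* Illustrative only: fields of a CCW, assuming the Format 1 layout. */
struct ccw_fields {
	uint8_t  cmd;	/* command code */
	uint8_t  flags;	/* chaining and control flags */
	uint16_t count;	/* byte count (the length field) */
	uint32_t addr;	/* absolute data address */
};

/* Decode a CCW passed as two big-endian 32 bit words, as a trace prints it. */
static struct ccw_fields decode_format1_ccw(uint32_t word0, uint32_t word1)
{
	struct ccw_fields f;

	f.cmd   = (uint8_t)(word0 >> 24);
	f.flags = (uint8_t)(word0 >> 16);
	f.count = (uint16_t)(word0 & 0xffff);
	f.addr  = word1;
	return f;
}
```

Fed the words E4200100 00487FE8 from the sample SSCH trace, this gives command code 0xE4, flags 0x20, a count of 0x0100 bytes and a data address of 0x00487FE8.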
- - -General Concepts - -Input Output Processors (IOP's) are responsible for communicating between -the mainframe CPU's & the channel & relieve the mainframe CPU's from the -burden of communicating with IO devices directly, this allows the CPU's to -concentrate on data processing. - -IOP's can use one or more links ( known as channel paths ) to talk to each -IO device. It first checks for path availability & chooses an available one, -then starts ( & sometimes terminates IO ). -There are two types of channel path: ESCON & the Parallel IO interface. - -IO devices are attached to control units, control units provide the -logic to interface the channel paths & channel path IO protocols to -the IO devices, they can be integrated with the devices or housed separately -& often talk to several similar devices ( typical examples would be raid -controllers or a control unit which connects to 1000 3270 terminals ). - - - +---------------------------------------------------------------+ - | +-----+ +-----+ +-----+ +-----+ +----------+ +----------+ | - | | CPU | | CPU | | CPU | | CPU | | Main | | Expanded | | - | | | | | | | | | | Memory | | Storage | | - | +-----+ +-----+ +-----+ +-----+ +----------+ +----------+ | - |---------------------------------------------------------------+ - | IOP | IOP | IOP | - |--------------------------------------------------------------- - | C | C | C | C | C | C | C | C | C | C | C | C | C | C | C | C | - ---------------------------------------------------------------- - || || - || Bus & Tag Channel Path || ESCON - || ====================== || Channel - || || || || Path - +----------+ +----------+ +----------+ - | | | | | | - | CU | | CU | | CU | - | | | | | | - +----------+ +----------+ +----------+ - | | | | | -+----------+ +----------+ +----------+ +----------+ +----------+ -|I/O Device| |I/O Device| |I/O Device| |I/O Device| |I/O Device| -+----------+ +----------+ +----------+ +----------+ +----------+ - CPU = Central Processing Unit - C = 
Channel - IOP = IP Processor - CU = Control Unit - -The 390 IO systems come in 2 flavours the current 390 machines support both - -The Older 360 & 370 Interface,sometimes called the Parallel I/O interface, -sometimes called Bus-and Tag & sometimes Original Equipment Manufacturers -Interface (OEMI). - -This byte wide Parallel channel path/bus has parity & data on the "Bus" cable -and control lines on the "Tag" cable. These can operate in byte multiplex mode -for sharing between several slow devices or burst mode and monopolize the -channel for the whole burst. Up to 256 devices can be addressed on one of these -cables. These cables are about one inch in diameter. The maximum unextended -length supported by these cables is 125 Meters but this can be extended up to -2km with a fibre optic channel extended such as a 3044. The maximum burst speed -supported is 4.5 megabytes per second. However, some really old processors -support only transfer rates of 3.0, 2.0 & 1.0 MB/sec. -One of these paths can be daisy chained to up to 8 control units. - - -ESCON if fibre optic it is also called FICON -Was introduced by IBM in 1990. Has 2 fibre optic cables and uses either leds or -lasers for communication at a signaling rate of up to 200 megabits/sec. As -10bits are transferred for every 8 bits info this drops to 160 megabits/sec -and to 18.6 Megabytes/sec once control info and CRC are added. ESCON only -operates in burst mode. - -ESCONs typical max cable length is 3km for the led version and 20km for the -laser version known as XDF (extended distance facility). This can be further -extended by using an ESCON director which triples the above mentioned ranges. -Unlike Bus & Tag as ESCON is serial it uses a packet switching architecture, -the standard Bus & Tag control protocol is however present within the packets. -Up to 256 devices can be attached to each control unit that uses one of these -interfaces. 
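The ESCON rate figures above can be checked with a little arithmetic. The C sketch below only reproduces the numbers quoted in the text (10 bits on the link for every 8 bits of information); the function names are made up for this example.

```c
/* Information rate left when 10 link bits carry 8 bits of information. */
static unsigned int escon_info_mbit(unsigned int signal_mbit)
{
	return signal_mbit * 8 / 10;
}

/* Convert megabits/sec to megabytes/sec. */
static unsigned int mbit_to_mbyte(unsigned int mbit)
{
	return mbit / 8;
}
```

A 200 megabits/sec signaling rate leaves 160 megabits/sec of information, which is 20 Megabytes/sec before control info and CRC are subtracted; the text quotes 18.6 Megabytes/sec once those are accounted for.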
- -Common 390 Devices include: -Network adapters typically OSA2,3172's,2116's & OSA-E gigabit ethernet adapters, -Consoles 3270 & 3215 (a teletype emulated under linux for a line mode console). -DASD's direct access storage devices ( otherwise known as hard disks ). -Tape Drives. -CTC ( Channel to Channel Adapters ), -ESCON or Parallel Cables used as a very high speed serial link -between 2 machines. - - -Debugging IO on s/390 & z/Architecture under VM -=============================================== - -Now we are ready to go on with IO tracing commands under VM - -A few self explanatory queries: -Q OSA -Q CTC -Q DISK ( This command is CMS specific ) -Q DASD - - - - - - -Q OSA on my machine returns -OSA 7C08 ON OSA 7C08 SUBCHANNEL = 0000 -OSA 7C09 ON OSA 7C09 SUBCHANNEL = 0001 -OSA 7C14 ON OSA 7C14 SUBCHANNEL = 0002 -OSA 7C15 ON OSA 7C15 SUBCHANNEL = 0003 - -If you have a guest with certain privileges you may be able to see devices -which don't belong to you. To avoid this, add the option V. -e.g. -Q V OSA - -Now using the device numbers returned by this command we will -Trace the io starting up on the first device 7c08 & 7c09 -In our simplest case we can trace the -start subchannels -like TR SSCH 7C08-7C09 -or the halt subchannels -or TR HSCH 7C08-7C09 -MSCH's ,STSCH's I think you can guess the rest - -A good trick is tracing all the IO's and CCWS and spooling them into the reader -of another VM guest so he can ftp the logfile back to his own machine. I'll do -a small bit of this and give you a look at the output. 
- -1) Spool stdout to VM reader -SP PRT TO (another vm guest ) or * for the local vm guest -2) Fill the reader with the trace -TR IO 7c08-7c09 INST INT CCW PRT RUN -3) Start up linux -i 00c -4) Finish the trace -TR END -5) close the reader -C PRT -6) list reader contents -RDRLIST -7) copy it to linux4's minidisk -RECEIVE / LOG TXT A1 ( replace -8) -filel & press F11 to look at it -You should see something like: - -00020942' SSCH B2334000 0048813C CC 0 SCH 0000 DEV 7C08 - CPA 000FFDF0 PARM 00E2C9C4 KEY 0 FPI C0 LPM 80 - CCW 000FFDF0 E4200100 00487FE8 0000 E4240100 ........ - IDAL 43D8AFE8 - IDAL 0FB76000 -00020B0A' I/O DEV 7C08 -> 000197BC' SCH 0000 PARM 00E2C9C4 -00021628' TSCH B2354000 >> 00488164 CC 0 SCH 0000 DEV 7C08 - CCWA 000FFDF8 DEV STS 0C SCH STS 00 CNT 00EC - KEY 0 FPI C0 CC 0 CTLS 4007 -00022238' STSCH B2344000 >> 00488108 CC 0 SCH 0000 DEV 7C08 - -If you don't like messing up your readed ( because you possibly booted from it ) -you can alternatively spool it to another readers guest. - - -Other common VM device related commands ---------------------------------------------- -These commands are listed only because they have -been of use to me in the past & may be of use to -you too. For more complete info on each of the commands -use type HELP from CMS. -detaching devices -DET -ATT -attach a device to guest * for your own guest -READY cause VM to issue a fake interrupt. - -The VARY command is normally only available to VM administrators. -VARY ON PATH TO -VARY OFF PATH FROM -This is used to switch on or off channel paths to devices. - -Q CHPID -This displays state of devices using this channel path -D SCHIB -This displays the subchannel information SCHIB block for the device. -this I believe is also only available to administrators. -DEFINE CTC -defines a virtual CTC channel to channel connection -2 need to be defined on each guest for the CTC driver to use. 
-COUPLE devno userid remote devno -Joins a local virtual device to a remote virtual device -( commonly used for the CTC driver ). - -Building a VM ramdisk under CMS which linux can use -def vfb- -blocksize is commonly 4096 for linux. -Formatting it -format (blksize - -Sharing a disk between multiple guests -LINK userid devno1 devno2 mode password - - - -GDB on S390 -=========== -N.B. if compiling for debugging gdb works better without optimisation -( see Compiling programs for debugging ) - -invocation ----------- -gdb - -Online help ------------ -help: gives help on commands -e.g. -help -help display -Note gdb's online help is very good use it. - - -Assembly --------- -info registers: displays registers other than floating point. -info all-registers: displays floating points as well. -disassemble: disassembles -e.g. -disassemble without parameters will disassemble the current function -disassemble $pc $pc+10 - -Viewing & modifying variables ------------------------------ -print or p: displays variable or register -e.g. p/x $sp will display the stack pointer - -display: prints variable or register each time program stops -e.g. -display/x $pc will display the program counter -display argc - -undisplay : undo's display's - -info breakpoints: shows all current breakpoints - -info stack: shows stack back trace (if this doesn't work too well, I'll show -you the stacktrace by hand below). - -info locals: displays local variables. - -info args: display current procedure arguments. - -set args: will set argc & argv each time the victim program is invoked. - -set =value -set argc=100 -set $pc=0 - - - -Modifying execution -------------------- -step: steps n lines of sourcecode -step steps 1 line. -step 100 steps 100 lines of code. - -next: like step except this will not step into subroutines - -stepi: steps a single machine code instruction. -e.g. stepi 100 - -nexti: steps a single machine code instruction but will not step into -subroutines. 
- -finish: will run until exit of the current routine - -run: (re)starts a program - -cont: continues a program - -quit: exits gdb. - - -breakpoints ------------- - -break -sets a breakpoint -e.g. - -break main - -break *$pc - -break *0x400618 - -Here's a really useful one for large programs -rbr -Set a breakpoint for all functions matching REGEXP -e.g. -rbr 390 -will set a breakpoint with all functions with 390 in their name. - -info breakpoints -lists all breakpoints - -delete: delete breakpoint by number or delete them all -e.g. -delete 1 will delete the first breakpoint -delete will delete them all - -watch: This will set a watchpoint ( usually hardware assisted ), -This will watch a variable till it changes -e.g. -watch cnt, will watch the variable cnt till it changes. -As an aside unfortunately gdb's, architecture independent watchpoint code -is inconsistent & not very good, watchpoints usually work but not always. - -info watchpoints: Display currently active watchpoints - -condition: ( another useful one ) -Specify breakpoint number N to break only if COND is true. -Usage is `condition N COND', where N is an integer and COND is an -expression to be evaluated whenever breakpoint N is reached. - - - -User defined functions/macros ------------------------------ -define: ( Note this is very very useful,simple & powerful ) -usage define end - -examples which you should consider putting into .gdbinit in your home directory -define d -stepi -disassemble $pc $pc+10 -end - -define e -nexti -disassemble $pc $pc+10 -end - - -Other hard to classify stuff ----------------------------- -signal n: -sends the victim program a signal. -e.g. signal 3 will send a SIGQUIT. - -info signals: -what gdb does when the victim receives certain signals. - -list: -e.g. -list lists current function source -list 1,10 list first 10 lines of current file. -list test.c:1,10 - - -directory: -Adds directories to be searched for source if gdb cannot find the source. 
-(note it is a bit sensitive about slashes) -e.g. To add the root of the filesystem to the searchpath do -directory // - - -call -This calls a function in the victim program, this is pretty powerful -e.g. -(gdb) call printf("hello world") -outputs: -$1 = 11 - -You might now be thinking that the line above didn't work, something extra had -to be done. -(gdb) call fflush(stdout) -hello world$2 = 0 -As an aside the debugger also calls malloc & free under the hood -to make space for the "hello world" string. - - - -hints ------ -1) command completion works just like bash -( if you are a bad typist like me this really helps ) -e.g. hit br & cursor up & down :-). - -2) if you have a debugging problem that takes a few steps to recreate -put the steps into a file called .gdbinit in your current working directory -if you have defined a few extra useful user defined commands put these in -your home directory & they will be read each time gdb is launched. - -A typical .gdbinit file might be. -break main -run -break runtime_exception -cont - - -stack chaining in gdb by hand ------------------------------ -This is done using a the same trick described for VM -p/x (*($sp+56))&0x7fffffff get the first backchain. - -For z/Architecture -Replace 56 with 112 & ignore the &0x7fffffff -in the macros below & do nasty casts to longs like the following -as gdb unfortunately deals with printed arguments as ints which -messes up everything. -i.e. here is a 3rd backchain dereference -p/x *(long *)(***(long ***)$sp+112) - - -this outputs -$5 = 0x528f18 -on my machine. -Now you can use -info symbol (*($sp+56))&0x7fffffff -you might see something like. -rl_getc + 36 in section .text telling you what is located at address 0x528f18 -Now do. -p/x (*(*$sp+56))&0x7fffffff -This outputs -$6 = 0x528ed0 -Now do. -info symbol (*(*$sp+56))&0x7fffffff -rl_read_key + 180 in section .text -now do -p/x (*(**$sp+56))&0x7fffffff -& so on. 
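The &0x7fffffff used in the backchain expressions above strips the top bit of a saved 31 bit return address, which holds the addressing mode rather than part of the address. As a minimal sketch in C (helper names are invented for this example), the same masking and the subtraction against a System.map entry look like this:

```c
#include <stdint.h>

/* Drop the addressing mode bit from a saved 31 bit return address. */
static inline uint32_t strip_amode_bit(uint32_t saved_r14)
{
	return saved_r14 & 0x7fffffffu;
}

/* Offset of a return address from a symbol address taken from System.map. */
static inline uint32_t symbol_offset(uint32_t saved_r14, uint32_t symbol_addr)
{
	return strip_amode_bit(saved_r14) - symbol_addr;
}
```

Using the values from the VM backchain example earlier in this file, 8001B36A against cpu_idle at 0001b304 gives 0x66, 80014CA6 against start_kernel at 00014a78 gives 0x22e, and 8001085A against _stext at 00010800 gives 0x5a.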
- -Disassembling instructions without debug info ---------------------------------------------- -gdb typically complains if there is a lack of debugging -symbols in the disassemble command with -"No function contains specified address." To get around -this do -x/xi
-e.g. -x/20xi 0x400730 - - - -Note: Remember gdb has history just like bash you don't need to retype the -whole line just use the up & down arrows. - - - -For more info -------------- -From your linuxbox do -man gdb or info gdb. - -core dumps ----------- -What a core dump ?, -A core dump is a file generated by the kernel (if allowed) which contains the -registers and all active pages of the program which has crashed. -From this file gdb will allow you to look at the registers, stack trace and -memory of the program as if it just crashed on your system. It is usually -called core and created in the current working directory. -This is very useful in that a customer can mail a core dump to a technical -support department and the technical support department can reconstruct what -happened. Provided they have an identical copy of this program with debugging -symbols compiled in and the source base of this build is available. -In short it is far more useful than something like a crash log could ever hope -to be. - -Why have I never seen one ?. -Probably because you haven't used the command -ulimit -c unlimited in bash -to allow core dumps, now do -ulimit -a -to verify that the limit was accepted. - -A sample core dump -To create this I'm going to do -ulimit -c unlimited -gdb -to launch gdb (my victim app. ) now be bad & do the following from another -telnet/xterm session to the same machine -ps -aux | grep gdb -kill -SIGSEGV -or alternatively use killall -SIGSEGV gdb if you have the killall command. -Now look at the core dump. -./gdb core -Displays the following -GNU gdb 4.18 -Copyright 1998 Free Software Foundation, Inc. -GDB is free software, covered by the GNU General Public License, and you are -welcome to change it and/or distribute copies of it under certain conditions. -Type "show copying" to see the conditions. -There is absolutely no warranty for GDB. Type "show warranty" for details. -This GDB was configured as "s390-ibm-linux"... 
-Core was generated by `./gdb'. -Program terminated with signal 11, Segmentation fault. -Reading symbols from /usr/lib/libncurses.so.4...done. -Reading symbols from /lib/libm.so.6...done. -Reading symbols from /lib/libc.so.6...done. -Reading symbols from /lib/ld-linux.so.2...done. -#0 0x40126d1a in read () from /lib/libc.so.6 -Setting up the environment for debugging gdb. -Breakpoint 1 at 0x4dc6f8: file utils.c, line 471. -Breakpoint 2 at 0x4d87a4: file top.c, line 2609. -(top-gdb) info stack -#0 0x40126d1a in read () from /lib/libc.so.6 -#1 0x528f26 in rl_getc (stream=0x7ffffde8) at input.c:402 -#2 0x528ed0 in rl_read_key () at input.c:381 -#3 0x5167e6 in readline_internal_char () at readline.c:454 -#4 0x5168ee in readline_internal_charloop () at readline.c:507 -#5 0x51692c in readline_internal () at readline.c:521 -#6 0x5164fe in readline (prompt=0x7ffff810) - at readline.c:349 -#7 0x4d7a8a in command_line_input (prompt=0x564420 "(gdb) ", repeat=1, - annotation_suffix=0x4d6b44 "prompt") at top.c:2091 -#8 0x4d6cf0 in command_loop () at top.c:1345 -#9 0x4e25bc in main (argc=1, argv=0x7ffffdf4) at main.c:635 - - -LDD -=== -This is a program which lists the shared libraries which a library needs, -Note you also get the relocations of the shared library text segments which -help when using objdump --source. -e.g. 
- ldd ./gdb -outputs -libncurses.so.4 => /usr/lib/libncurses.so.4 (0x40018000) -libm.so.6 => /lib/libm.so.6 (0x4005e000) -libc.so.6 => /lib/libc.so.6 (0x40084000) -/lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000) - - -Debugging shared libraries -========================== -Most programs use shared libraries, however it can be very painful -when you single step instruction into a function like printf for the -first time & you end up in functions like _dl_runtime_resolve this is -the ld.so doing lazy binding, lazy binding is a concept in ELF where -shared library functions are not loaded into memory unless they are -actually used, great for saving memory but a pain to debug. -To get around this either relink the program -static or exit gdb type -export LD_BIND_NOW=true this will stop lazy binding & restart the gdb'ing -the program in question. - - - -Debugging modules -================= -As modules are dynamically loaded into the kernel their address can be -anywhere to get around this use the -m option with insmod to emit a load -map which can be piped into a file if required. - -The proc file system -==================== -What is it ?. -It is a filesystem created by the kernel with files which are created on demand -by the kernel if read, or can be used to modify kernel parameters, -it is a powerful concept. - -e.g. - -cat /proc/sys/net/ipv4/ip_forward -On my machine outputs -0 -telling me ip_forwarding is not on to switch it on I can do -echo 1 > /proc/sys/net/ipv4/ip_forward -cat it again -cat /proc/sys/net/ipv4/ip_forward -On my machine now outputs -1 -IP forwarding is on. -There is a lot of useful info in here best found by going in and having a look -around, so I'll take you through some entries I consider important. 
- -All the processes running on the machine have their own entry defined by -/proc/ -So lets have a look at the init process -cd /proc/1 - -cat cmdline -emits -init [2] - -cd /proc/1/fd -This contains numerical entries of all the open files, -some of these you can cat e.g. stdout (2) - -cat /proc/29/maps -on my machine emits - -00400000-00478000 r-xp 00000000 5f:00 4103 /bin/bash -00478000-0047e000 rw-p 00077000 5f:00 4103 /bin/bash -0047e000-00492000 rwxp 00000000 00:00 0 -40000000-40015000 r-xp 00000000 5f:00 14382 /lib/ld-2.1.2.so -40015000-40016000 rw-p 00014000 5f:00 14382 /lib/ld-2.1.2.so -40016000-40017000 rwxp 00000000 00:00 0 -40017000-40018000 rw-p 00000000 00:00 0 -40018000-4001b000 r-xp 00000000 5f:00 14435 /lib/libtermcap.so.2.0.8 -4001b000-4001c000 rw-p 00002000 5f:00 14435 /lib/libtermcap.so.2.0.8 -4001c000-4010d000 r-xp 00000000 5f:00 14387 /lib/libc-2.1.2.so -4010d000-40111000 rw-p 000f0000 5f:00 14387 /lib/libc-2.1.2.so -40111000-40114000 rw-p 00000000 00:00 0 -40114000-4011e000 r-xp 00000000 5f:00 14408 /lib/libnss_files-2.1.2.so -4011e000-4011f000 rw-p 00009000 5f:00 14408 /lib/libnss_files-2.1.2.so -7fffd000-80000000 rwxp ffffe000 00:00 0 - - -Showing us the shared libraries init uses where they are in memory -& memory access permissions for each virtual memory area. - -/proc/1/cwd is a softlink to the current working directory. -/proc/1/root is the root of the filesystem for this process. - -/proc/1/mem is the current running processes memory which you -can read & write to like a file. -strace uses this sometimes as it is a bit faster than the -rather inefficient ptrace interface for peeking at DATA. 
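If you want to pick the maps listing above apart from a program rather than by eye, a line can be parsed with sscanf. This is only a sketch: the struct and function names are invented for this example, and it reads just the address range and the permission string, ignoring the offset, device, inode and path columns.

```c
#include <stdio.h>

/* Illustrative only: the first three columns of a /proc/<pid>/maps line. */
struct map_entry {
	unsigned long start;	/* first address of the area */
	unsigned long end;	/* first address past the area */
	char perms[5];		/* e.g. "r-xp", NUL terminated */
};

/* Returns 0 on success, -1 if the line does not look like a maps entry. */
static int parse_maps_line(const char *line, struct map_entry *e)
{
	if (sscanf(line, "%lx-%lx %4s", &e->start, &e->end, e->perms) != 3)
		return -1;
	return 0;
}
```

For the first line of the listing, the /bin/bash text segment, this yields start 0x400000, end 0x478000 and permissions r-xp.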
- - -cat status - -Name: init -State: S (sleeping) -Pid: 1 -PPid: 0 -Uid: 0 0 0 0 -Gid: 0 0 0 0 -Groups: -VmSize: 408 kB -VmLck: 0 kB -VmRSS: 208 kB -VmData: 24 kB -VmStk: 8 kB -VmExe: 368 kB -VmLib: 0 kB -SigPnd: 0000000000000000 -SigBlk: 0000000000000000 -SigIgn: 7fffffffd7f0d8fc -SigCgt: 00000000280b2603 -CapInh: 00000000fffffeff -CapPrm: 00000000ffffffff -CapEff: 00000000fffffeff - -User PSW: 070de000 80414146 -task: 004b6000 tss: 004b62d8 ksp: 004b7ca8 pt_regs: 004b7f68 -User GPRS: -00000400 00000000 0000000b 7ffffa90 -00000000 00000000 00000000 0045d9f4 -0045cafc 7ffffa90 7fffff18 0045cb08 -00010400 804039e8 80403af8 7ffff8b0 -User ACRS: -00000000 00000000 00000000 00000000 -00000001 00000000 00000000 00000000 -00000000 00000000 00000000 00000000 -00000000 00000000 00000000 00000000 -Kernel BackChain CallChain BackChain CallChain - 004b7ca8 8002bd0c 004b7d18 8002b92c - 004b7db8 8005cd50 004b7e38 8005d12a - 004b7f08 80019114 -Showing among other things memory usage & status of some signals & -the processes'es registers from the kernel task_structure -as well as a backchain which may be useful if a process crashes -in the kernel for some unknown reason. - -Some driver debugging techniques -================================ -debug feature -------------- -Some of our drivers now support a "debug feature" in -/proc/s390dbf see s390dbf.txt in the linux/Documentation directory -for more info. -e.g. -to switch on the lcs "debug feature" -echo 5 > /proc/s390dbf/lcs/level -& then after the error occurred. -cat /proc/s390dbf/lcs/sprintf >/logfile -the logfile now contains some information which may help -tech support resolve a problem in the field. - - - -high level debugging network drivers ------------------------------------- -ifconfig is a quite useful command -it gives the current state of network drivers. - -If you suspect your network device driver is dead -one way to check is type -ifconfig -e.g. 
tr0 -You should see something like -tr0 Link encap:16/4 Mbps Token Ring (New) HWaddr 00:04:AC:20:8E:48 - inet addr:9.164.185.132 Bcast:9.164.191.255 Mask:255.255.224.0 - UP BROADCAST RUNNING MULTICAST MTU:2000 Metric:1 - RX packets:246134 errors:0 dropped:0 overruns:0 frame:0 - TX packets:5 errors:0 dropped:0 overruns:0 carrier:0 - collisions:0 txqueuelen:100 - -if the device doesn't say up -try -/etc/rc.d/init.d/network start -( this starts the network stack & hopefully calls ifconfig tr0 up ). -ifconfig looks at the output of /proc/net/dev and presents it in a more -presentable form. -Now ping the device from a machine in the same subnet. -if the RX packets count & TX packets counts don't increment you probably -have problems. -next -cat /proc/net/arp -Do you see any hardware addresses in the cache if not you may have problems. -Next try -ping -c 5 i.e. the Bcast field above in the output of -ifconfig. Do you see any replies from machines other than the local machine -if not you may have problems. also if the TX packets count in ifconfig -hasn't incremented either you have serious problems in your driver -(e.g. the txbusy field of the network device being stuck on ) -or you may have multiple network devices connected. - - -chandev -------- -There is a new device layer for channel devices, some -drivers e.g. lcs are registered with this layer. -If the device uses the channel device layer you'll be -able to find what interrupts it uses & the current state -of the device. -See the manpage chandev.8 &type cat /proc/chandev for more info. - - -SysRq -===== -This is now supported by linux for s/390 & z/Architecture. -To enable it do compile the kernel with -Kernel Hacking -> Magic SysRq Key Enabled -echo "1" > /proc/sys/kernel/sysrq -also type -echo "8" >/proc/sys/kernel/printk -To make printk output go to console. -On 390 all commands are prefixed with -^- -e.g. -^-t will show tasks. -^-? or some unknown command will display help. 
-The sysrq key reading is very picky ( I have to type the keys in an - xterm session & paste them into the x3270 console ) -& it may be wise to predefine the keys as described in the VM hints above - -This is particularly useful for syncing disks unmounting & rebooting -if the machine gets partially hung. - -Read Documentation/admin-guide/sysrq.rst for more info - -References: -=========== -Enterprise Systems Architecture Reference Summary -Enterprise Systems Architecture Principles of Operation -Hartmut Penners s390 stack frame sheet. -IBM Mainframe Channel Attachment a technology brief from a CISCO webpage -Various bits of man & info pages of Linux. -Linux & GDB source. -Various info & man pages. -CMS Help on tracing commands. -Linux for s/390 Elf Application Binary Interface -Linux for z/Series Elf Application Binary Interface ( Both Highly Recommended ) -z/Architecture Principles of Operation SA22-7832-00 -Enterprise Systems Architecture/390 Reference Summary SA22-7209-01 & the -Enterprise Systems Architecture/390 Principles of Operation SA22-7201-05 - -Special Thanks -============== -Special thanks to Neale Ferguson who maintains a much -prettier HTML version of this page at -http://linuxvm.org/penguinvm/ -Bob Grainger Stefan Bader & others for reporting bugs diff --git a/Documentation/s390/cds.rst b/Documentation/s390/cds.rst new file mode 100644 index 000000000000..7006d8209d2e --- /dev/null +++ b/Documentation/s390/cds.rst @@ -0,0 +1,530 @@ +=========================== +Linux for S/390 and zSeries +=========================== + +Common Device Support (CDS) +Device Driver I/O Support Routines + +Authors: + - Ingo Adlung + - Cornelia Huck + +Copyright, IBM Corp. 1999-2002 + +Introduction +============ + +This document describes the common device support routines for Linux/390. +Different than other hardware architectures, ESA/390 has defined a unified +I/O access method. 
This gives relief to the device drivers as they don't +have to deal with different bus types, polling versus interrupt +processing, shared versus non-shared interrupt processing, DMA versus port +I/O (PIO), and other hardware features more. However, this implies that +either every single device driver needs to implement the hardware I/O +attachment functionality itself, or the operating system provides for a +unified method to access the hardware, providing all the functionality that +every single device driver would have to provide itself. + +The document does not intend to explain the ESA/390 hardware architecture in +every detail.This information can be obtained from the ESA/390 Principles of +Operation manual (IBM Form. No. SA22-7201). + +In order to build common device support for ESA/390 I/O interfaces, a +functional layer was introduced that provides generic I/O access methods to +the hardware. + +The common device support layer comprises the I/O support routines defined +below. Some of them implement common Linux device driver interfaces, while +some of them are ESA/390 platform specific. + +Note: + In order to write a driver for S/390, you also need to look into the interface + described in Documentation/s390/driver-model.rst. + +Note for porting drivers from 2.4: + +The major changes are: + +* The functions use a ccw_device instead of an irq (subchannel). +* All drivers must define a ccw_driver (see driver-model.txt) and the associated + functions. +* request_irq() and free_irq() are no longer done by the driver. +* The oper_handler is (kindof) replaced by the probe() and set_online() functions + of the ccw_driver. +* The not_oper_handler is (kindof) replaced by the remove() and set_offline() + functions of the ccw_driver. +* The channel device layer is gone. +* The interrupt handlers must be adapted to use a ccw_device as argument. + Moreover, they don't return a devstat, but an irb. 
+* Before initiating an io, the options must be set via ccw_device_set_options(). +* Instead of calling read_dev_chars()/read_conf_data(), the driver issues + the channel program and handles the interrupt itself. + +ccw_device_get_ciw() + get commands from extended sense data. + +ccw_device_start(), ccw_device_start_timeout(), ccw_device_start_key(), ccw_device_start_key_timeout() + initiate an I/O request. + +ccw_device_resume() + resume channel program execution. + +ccw_device_halt() + terminate the current I/O request processed on the device. + +do_IRQ() + generic interrupt routine. This function is called by the interrupt entry + routine whenever an I/O interrupt is presented to the system. The do_IRQ() + routine determines the interrupt status and calls the device specific + interrupt handler according to the rules (flags) defined during I/O request + initiation with do_IO(). + +The next chapters describe the functions other than do_IRQ() in more details. +The do_IRQ() interface is not described, as it is called from the Linux/390 +first level interrupt handler only and does not comprise a device driver +callable interface. Instead, the functional description of do_IO() also +describes the input to the device specific interrupt handler. + +Note: + All explanations apply also to the 64 bit architecture s390x. + + +Common Device Support (CDS) for Linux/390 Device Drivers +======================================================== + +General Information +------------------- + +The following chapters describe the I/O related interface routines the +Linux/390 common device support (CDS) provides to allow for device specific +driver implementations on the IBM ESA/390 hardware platform. Those interfaces +intend to provide the functionality required by every device driver +implementation to allow to drive a specific hardware device on the ESA/390 +platform. 
Some of the interface routines are specific to Linux/390 and some +of them can be found on other Linux platforms implementations too. +Miscellaneous function prototypes, data declarations, and macro definitions +can be found in the architecture specific C header file +linux/arch/s390/include/asm/irq.h. + +Overview of CDS interface concepts +---------------------------------- + +Different to other hardware platforms, the ESA/390 architecture doesn't define +interrupt lines managed by a specific interrupt controller and bus systems +that may or may not allow for shared interrupts, DMA processing, etc.. Instead, +the ESA/390 architecture has implemented a so called channel subsystem, that +provides a unified view of the devices physically attached to the systems. +Though the ESA/390 hardware platform knows about a huge variety of different +peripheral attachments like disk devices (aka. DASDs), tapes, communication +controllers, etc. they can all be accessed by a well defined access method and +they are presenting I/O completion a unified way : I/O interruptions. Every +single device is uniquely identified to the system by a so called subchannel, +where the ESA/390 architecture allows for 64k devices be attached. + +Linux, however, was first built on the Intel PC architecture, with its two +cascaded 8259 programmable interrupt controllers (PICs), that allow for a +maximum of 15 different interrupt lines. All devices attached to such a system +share those 15 interrupt levels. Devices attached to the ISA bus system must +not share interrupt levels (aka. IRQs), as the ISA bus bases on edge triggered +interrupts. MCA, EISA, PCI and other bus systems base on level triggered +interrupts, and therewith allow for shared IRQs. 
However, if multiple devices +present their hardware status by the same (shared) IRQ, the operating system +has to call every single device driver registered on this IRQ in order to +determine the device driver owning the device that raised the interrupt. + +Up to kernel 2.4, Linux/390 used to provide interfaces via the IRQ (subchannel). +For internal use of the common I/O layer, these are still there. However, +device drivers should use the new calling interface via the ccw_device only. + +During its startup the Linux/390 system checks for peripheral devices. Each +of those devices is uniquely defined by a so called subchannel by the ESA/390 +channel subsystem. While the subchannel numbers are system generated, each +subchannel also takes a user defined attribute, the so called device number. +Both subchannel number and device number cannot exceed 65535. During sysfs +initialisation, the information about control unit type and device types that +imply specific I/O commands (channel command words - CCWs) in order to operate +the device are gathered. Device drivers can retrieve this set of hardware +information during their initialization step to recognize the devices they +support using the information saved in the struct ccw_device given to them. +This method implies that Linux/390 doesn't need to probe for free (not +armed) interrupt request lines (IRQs) to drive its devices with. Where +applicable, the device drivers can issue the READ DEVICE CHARACTERISTICS +ccw to retrieve device characteristics in their online routine. + +In order to allow for easy I/O initiation the CDS layer provides a +ccw_device_start() interface that takes a device specific channel program (one +or more CCWs) as input, sets up the required architecture specific control blocks +and initiates an I/O request on behalf of the device driver.
The +ccw_device_start() routine lets the driver specify whether it expects the CDS layer +to notify the device driver for every interrupt it observes, or with final status +only. See ccw_device_start() for more details. A device driver must never issue +ESA/390 I/O commands itself, but must use the Linux/390 CDS interfaces instead. + +To cancel long-running I/O requests, the CDS layer provides the +ccw_device_halt() function. Some devices require an initial HALT +SUBCHANNEL (HSCH) command to be issued without any pending I/O requests. This function is +also covered by ccw_device_halt(). + + +ccw_device_get_ciw() - get command information word + +This call enables a device driver to get information about supported commands +from the extended SenseID data. + +:: + + struct ciw * + ccw_device_get_ciw(struct ccw_device *cdev, __u32 cmd); + +==== ======================================================== +cdev The ccw_device for which the command is to be retrieved. +cmd The command type to be retrieved. +==== ======================================================== + +ccw_device_get_ciw() returns: + +===== ================================================================ + NULL No extended data available, invalid device or command not found. +!NULL The command requested. +===== ================================================================ + +:: + + ccw_device_start() - Initiate I/O Request + +The ccw_device_start() routine is the I/O request front-end processor. All +device driver I/O requests must be issued using this routine. A device driver +must not issue ESA/390 I/O commands itself. Instead the ccw_device_start() +routine provides all interfaces required to drive arbitrary devices. + +This description also covers the status information passed to the device +driver's interrupt handler as this is related to the rules (flags) defined +with the associated I/O request when calling ccw_device_start().
+ +:: + + int ccw_device_start(struct ccw_device *cdev, + struct ccw1 *cpa, + unsigned long intparm, + __u8 lpm, + unsigned long flags); + int ccw_device_start_timeout(struct ccw_device *cdev, + struct ccw1 *cpa, + unsigned long intparm, + __u8 lpm, + unsigned long flags, + int expires); + int ccw_device_start_key(struct ccw_device *cdev, + struct ccw1 *cpa, + unsigned long intparm, + __u8 lpm, + __u8 key, + unsigned long flags); + int ccw_device_start_key_timeout(struct ccw_device *cdev, + struct ccw1 *cpa, + unsigned long intparm, + __u8 lpm, + __u8 key, + unsigned long flags, + int expires); + +============= ============================================================= +cdev ccw_device the I/O is destined for +cpa logical start address of channel program +user_intparm user specific interrupt information; will be presented + back to the device driver's interrupt handler. Allows a + device driver to associate the interrupt with a + particular I/O request. +lpm defines the channel path to be used for a specific I/O + request. A value of 0 will make cio use the opm. +key the storage key to use for the I/O (useful for operating on a + storage with a storage key != default key) +flag defines the action to be performed for I/O processing +expires timeout value in jiffies. The common I/O layer will terminate + the running program after this and call the interrupt handler + with ERR_PTR(-ETIMEDOUT) as irb. 
+============= ============================================================= + +Possible flag values are: + +========================= ============================================= +DOIO_ALLOW_SUSPEND channel program may become suspended +DOIO_DENY_PREFETCH don't allow for CCW prefetch; usually + this implies the channel program might + become modified +DOIO_SUPPRESS_INTER don't call the handler on intermediate status +========================= ============================================= + +The cpa parameter points to the first format 1 CCW of a channel program:: + + struct ccw1 { + __u8 cmd_code;/* command code */ + __u8 flags; /* flags, like IDA addressing, etc. */ + __u16 count; /* byte count */ + __u32 cda; /* data address */ + } __attribute__ ((packed,aligned(8))); + +with the following CCW flags values defined: + +=================== ========================= +CCW_FLAG_DC data chaining +CCW_FLAG_CC command chaining +CCW_FLAG_SLI suppress incorrect length +CCW_FLAG_SKIP skip +CCW_FLAG_PCI PCI +CCW_FLAG_IDA indirect addressing +CCW_FLAG_SUSPEND suspend +=================== ========================= + + +Via ccw_device_set_options(), the device driver may specify the following +options for the device: + +========================= ====================================== +DOIO_EARLY_NOTIFICATION allow for early interrupt notification +DOIO_REPORT_ALL report all interrupt conditions +========================= ====================================== + + +The ccw_device_start() function returns: + +======== ====================================================================== + 0 successful completion or request successfully initiated + -EBUSY The device is currently processing a previous I/O request, or there is + a status pending at the device. +-ENODEV cdev is invalid, the device is not operational or the ccw_device is + not online. 
+======== ====================================================================== + +When the I/O request completes, the CDS first level interrupt handler will +accumulate the status in a struct irb and then call the device interrupt handler. +The intparm field will contain the value the device driver has associated with a +particular I/O request. If a pending device status was recognized, +intparm will be set to 0 (zero). This may happen during I/O initiation or delayed +by an alert status notification. In any case this status is not related to the +current (last) I/O request. In case of a delayed status notification no special +interrupt will be presented to indicate I/O completion as the I/O request was +never started, even though ccw_device_start() returned with successful completion. + +The irb may contain an error value, and the device driver should check for this +first: + +========== ================================================================= +-ETIMEDOUT the common I/O layer terminated the request after the specified + timeout value +-EIO the common I/O layer terminated the request due to an error state +========== ================================================================= + +If the concurrent sense flag in the extended status word (esw) in the irb is +set, the field erw.scnt in the esw describes the number of device specific +sense bytes available in the extended control word irb->scsw.ecw[]. No device +sensing by the device driver itself is required. + +The device interrupt handler can use the following definitions to investigate +the primary unit check source coded in sense byte 0 : + +======================= ==== +SNS0_CMD_REJECT 0x80 +SNS0_INTERVENTION_REQ 0x40 +SNS0_BUS_OUT_CHECK 0x20 +SNS0_EQUIPMENT_CHECK 0x10 +SNS0_DATA_CHECK 0x08 +SNS0_OVERRUN 0x04 +SNS0_INCOMPL_DOMAIN 0x01 +======================= ==== + +Depending on the device status, multiple of those values may be set together. 
+Please refer to the device specific documentation for details. + +The irb->scsw.cstat field provides the (accumulated) subchannel status : + +========================= ============================ +SCHN_STAT_PCI program controlled interrupt +SCHN_STAT_INCORR_LEN incorrect length +SCHN_STAT_PROG_CHECK program check +SCHN_STAT_PROT_CHECK protection check +SCHN_STAT_CHN_DATA_CHK channel data check +SCHN_STAT_CHN_CTRL_CHK channel control check +SCHN_STAT_INTF_CTRL_CHK interface control check +SCHN_STAT_CHAIN_CHECK chaining check +========================= ============================ + +The irb->scsw.dstat field provides the (accumulated) device status : + +===================== ================= +DEV_STAT_ATTENTION attention +DEV_STAT_STAT_MOD status modifier +DEV_STAT_CU_END control unit end +DEV_STAT_BUSY busy +DEV_STAT_CHN_END channel end +DEV_STAT_DEV_END device end +DEV_STAT_UNIT_CHECK unit check +DEV_STAT_UNIT_EXCEP unit exception +===================== ================= + +Please see the ESA/390 Principles of Operation manual for details on the +individual flag meanings. + +Usage Notes: + +ccw_device_start() must be called disabled and with the ccw device lock held. + +The device driver is allowed to issue the next ccw_device_start() call from +within its interrupt handler already. It is not required to schedule a +bottom-half, unless a non deterministically long running error recovery procedure +or similar needs to be scheduled. During I/O processing the Linux/390 generic +I/O device driver support has already obtained the IRQ lock, i.e. the handler +must not try to obtain it again when calling ccw_device_start() or we end in a +deadlock situation! + +If a device driver relies on an I/O request to be completed prior to start the +next it can reduce I/O processing overhead by chaining a NoOp I/O command +CCW_CMD_NOOP to the end of the submitted CCW chain. This will force Channel-End +and Device-End status to be presented together, with a single interrupt. 
+However, this should be used with care as it implies the channel will remain +busy, not being able to process I/O requests for other devices on the same +channel. Therefore e.g. read commands should never use this technique, as the +result will be presented by a single interrupt anyway. + +In order to minimize I/O overhead, a device driver should use the +DOIO_REPORT_ALL only if the device can report intermediate interrupt +information prior to device-end the device driver urgently relies on. In this +case all I/O interruptions are presented to the device driver until final +status is recognized. + +If a device is able to recover from asynchronously presented I/O errors, it can +perform overlapping I/O using the DOIO_EARLY_NOTIFICATION flag. While some +devices always report channel-end and device-end together, with a single +interrupt, others present primary status (channel-end) when the channel is +ready for the next I/O request and secondary status (device-end) when the data +transmission has been completed at the device. + +Above flag allows to exploit this feature, e.g. for communication devices that +can handle lost data on the network to allow for enhanced I/O processing. + +Unless the channel subsystem at any time presents a secondary status interrupt, +exploiting this feature will cause only primary status interrupts to be +presented to the device driver while overlapping I/O is performed. When a +secondary status without error (alert status) is presented, this indicates +successful completion for all overlapping ccw_device_start() requests that have +been issued since the last secondary (final) status. + +Channel programs that intend to set the suspend flag on a channel command word +(CCW) must start the I/O operation with the DOIO_ALLOW_SUSPEND option or the +suspend flag will cause a channel program check. At the time the channel program +becomes suspended an intermediate interrupt will be generated by the channel +subsystem. 
+ +ccw_device_resume() - Resume Channel Program Execution + +If a device driver chooses to suspend the current channel program execution by +setting the CCW suspend flag on a particular CCW, the channel program execution +is suspended. In order to resume channel program execution the CIO layer +provides the ccw_device_resume() routine. + +:: + + int ccw_device_resume(struct ccw_device *cdev); + +==== ================================================ +cdev ccw_device the resume operation is requested for +==== ================================================ + +The ccw_device_resume() function returns: + +========= ============================================== + 0 suspended channel program is resumed + -EBUSY status pending + -ENODEV cdev invalid or not-operational subchannel + -EINVAL resume function not applicable +-ENOTCONN there is no I/O request pending for completion +========= ============================================== + +Usage Notes: + +Please have a look at the ccw_device_start() usage notes for more details on +suspended channel programs. + +ccw_device_halt() - Halt I/O Request Processing + +Sometimes a device driver might need a possibility to stop the processing of +a long-running channel program or the device might require to initially issue +a halt subchannel (HSCH) I/O command. For those purposes the ccw_device_halt() +command is provided. + +ccw_device_halt() must be called disabled and with the ccw device lock held. 
+ +:: + + int ccw_device_halt(struct ccw_device *cdev, + unsigned long intparm); + +======= ===================================================== +cdev ccw_device the halt operation is requested for +intparm interruption parameter; value is only used if no I/O + is outstanding, otherwise the intparm associated with + the I/O request is returned +======= ===================================================== + +The ccw_device_halt() function returns: + +======= ============================================================== + 0 request successfully initiated +-EBUSY the device is currently busy, or status pending. +-ENODEV cdev invalid. +-EINVAL The device is not operational or the ccw device is not online. +======= ============================================================== + +Usage Notes: + +A device driver may write a never-ending channel program by writing a channel +program that at its end loops back to its beginning by means of a transfer in +channel (TIC) command (CCW_CMD_TIC). Usually this is performed by network +device drivers by setting the PCI CCW flag (CCW_FLAG_PCI). Once this CCW is +executed a program controlled interrupt (PCI) is generated. The device driver +can then perform an appropriate action. Before interrupting an outstanding +read to a network device (with or without PCI flag) a ccw_device_halt() +is required to end the pending operation. + +:: + + ccw_device_clear() - Terminate I/O Request Processing + +In order to terminate all I/O processing at the subchannel, the clear subchannel +(CSCH) command is used. It can be issued via ccw_device_clear(). + +ccw_device_clear() must be called disabled and with the ccw device lock held.
+ +:: + + int ccw_device_clear(struct ccw_device *cdev, unsigned long intparm); + +======= =============================================== +cdev ccw_device the clear operation is requested for +intparm interruption parameter (see ccw_device_halt()) +======= =============================================== + +The ccw_device_clear() function returns: + +======= ============================================================== + 0 request successfully initiated +-ENODEV cdev invalid +-EINVAL The device is not operational or the ccw device is not online. +======= ============================================================== + +Miscellaneous Support Routines +------------------------------ + +This chapter describes various routines to be used in a Linux/390 device +driver programming environment. + +get_ccwdev_lock() + +Get the address of the device specific lock. This is then used in +spin_lock() / spin_unlock() calls. + +:: + + __u8 ccw_device_get_path_mask(struct ccw_device *cdev); + +Get the mask of the path currently available for cdev. diff --git a/Documentation/s390/cds.txt b/Documentation/s390/cds.txt deleted file mode 100644 index 480a78ef5a1e..000000000000 --- a/Documentation/s390/cds.txt +++ /dev/null @@ -1,472 +0,0 @@ -Linux for S/390 and zSeries - -Common Device Support (CDS) -Device Driver I/O Support Routines - -Authors : Ingo Adlung - Cornelia Huck - -Copyright, IBM Corp. 1999-2002 - -Introduction - -This document describes the common device support routines for Linux/390. -Different than other hardware architectures, ESA/390 has defined a unified -I/O access method. This gives relief to the device drivers as they don't -have to deal with different bus types, polling versus interrupt -processing, shared versus non-shared interrupt processing, DMA versus port -I/O (PIO), and other hardware features more. 
However, this implies that -either every single device driver needs to implement the hardware I/O -attachment functionality itself, or the operating system provides for a -unified method to access the hardware, providing all the functionality that -every single device driver would have to provide itself. - -The document does not intend to explain the ESA/390 hardware architecture in -every detail.This information can be obtained from the ESA/390 Principles of -Operation manual (IBM Form. No. SA22-7201). - -In order to build common device support for ESA/390 I/O interfaces, a -functional layer was introduced that provides generic I/O access methods to -the hardware. - -The common device support layer comprises the I/O support routines defined -below. Some of them implement common Linux device driver interfaces, while -some of them are ESA/390 platform specific. - -Note: -In order to write a driver for S/390, you also need to look into the interface -described in Documentation/s390/driver-model.txt. - -Note for porting drivers from 2.4: -The major changes are: -* The functions use a ccw_device instead of an irq (subchannel). -* All drivers must define a ccw_driver (see driver-model.txt) and the associated - functions. -* request_irq() and free_irq() are no longer done by the driver. -* The oper_handler is (kindof) replaced by the probe() and set_online() functions - of the ccw_driver. -* The not_oper_handler is (kindof) replaced by the remove() and set_offline() - functions of the ccw_driver. -* The channel device layer is gone. -* The interrupt handlers must be adapted to use a ccw_device as argument. - Moreover, they don't return a devstat, but an irb. -* Before initiating an io, the options must be set via ccw_device_set_options(). -* Instead of calling read_dev_chars()/read_conf_data(), the driver issues - the channel program and handles the interrupt itself. - -ccw_device_get_ciw() - get commands from extended sense data. 
- -ccw_device_start() -ccw_device_start_timeout() -ccw_device_start_key() -ccw_device_start_key_timeout() - initiate an I/O request. - -ccw_device_resume() - resume channel program execution. - -ccw_device_halt() - terminate the current I/O request processed on the device. - -do_IRQ() - generic interrupt routine. This function is called by the interrupt entry - routine whenever an I/O interrupt is presented to the system. The do_IRQ() - routine determines the interrupt status and calls the device specific - interrupt handler according to the rules (flags) defined during I/O request - initiation with do_IO(). - -The next chapters describe the functions other than do_IRQ() in more details. -The do_IRQ() interface is not described, as it is called from the Linux/390 -first level interrupt handler only and does not comprise a device driver -callable interface. Instead, the functional description of do_IO() also -describes the input to the device specific interrupt handler. - -Note: All explanations apply also to the 64 bit architecture s390x. - - -Common Device Support (CDS) for Linux/390 Device Drivers - -General Information - -The following chapters describe the I/O related interface routines the -Linux/390 common device support (CDS) provides to allow for device specific -driver implementations on the IBM ESA/390 hardware platform. Those interfaces -intend to provide the functionality required by every device driver -implementation to allow to drive a specific hardware device on the ESA/390 -platform. Some of the interface routines are specific to Linux/390 and some -of them can be found on other Linux platforms implementations too. -Miscellaneous function prototypes, data declarations, and macro definitions -can be found in the architecture specific C header file -linux/arch/s390/include/asm/irq.h. 
- -Overview of CDS interface concepts - -Different to other hardware platforms, the ESA/390 architecture doesn't define -interrupt lines managed by a specific interrupt controller and bus systems -that may or may not allow for shared interrupts, DMA processing, etc.. Instead, -the ESA/390 architecture has implemented a so called channel subsystem, that -provides a unified view of the devices physically attached to the systems. -Though the ESA/390 hardware platform knows about a huge variety of different -peripheral attachments like disk devices (aka. DASDs), tapes, communication -controllers, etc. they can all be accessed by a well defined access method and -they are presenting I/O completion a unified way : I/O interruptions. Every -single device is uniquely identified to the system by a so called subchannel, -where the ESA/390 architecture allows for 64k devices be attached. - -Linux, however, was first built on the Intel PC architecture, with its two -cascaded 8259 programmable interrupt controllers (PICs), that allow for a -maximum of 15 different interrupt lines. All devices attached to such a system -share those 15 interrupt levels. Devices attached to the ISA bus system must -not share interrupt levels (aka. IRQs), as the ISA bus bases on edge triggered -interrupts. MCA, EISA, PCI and other bus systems base on level triggered -interrupts, and therewith allow for shared IRQs. However, if multiple devices -present their hardware status by the same (shared) IRQ, the operating system -has to call every single device driver registered on this IRQ in order to -determine the device driver owning the device that raised the interrupt. - -Up to kernel 2.4, Linux/390 used to provide interfaces via the IRQ (subchannel). -For internal use of the common I/O layer, these are still there. However, -device drivers should use the new calling interface via the ccw_device only. - -During its startup the Linux/390 system checks for peripheral devices. 
Each -of those devices is uniquely defined by a so called subchannel by the ESA/390 -channel subsystem. While the subchannel numbers are system generated, each -subchannel also takes a user defined attribute, the so called device number. -Both subchannel number and device number cannot exceed 65535. During sysfs -initialisation, the information about control unit type and device types that -imply specific I/O commands (channel command words - CCWs) in order to operate -the device are gathered. Device drivers can retrieve this set of hardware -information during their initialization step to recognize the devices they -support using the information saved in the struct ccw_device given to them. -This methods implies that Linux/390 doesn't require to probe for free (not -armed) interrupt request lines (IRQs) to drive its devices with. Where -applicable, the device drivers can use issue the READ DEVICE CHARACTERISTICS -ccw to retrieve device characteristics in its online routine. - -In order to allow for easy I/O initiation the CDS layer provides a -ccw_device_start() interface that takes a device specific channel program (one -or more CCWs) as input sets up the required architecture specific control blocks -and initiates an I/O request on behalf of the device driver. The -ccw_device_start() routine allows to specify whether it expects the CDS layer -to notify the device driver for every interrupt it observes, or with final status -only. See ccw_device_start() for more details. A device driver must never issue -ESA/390 I/O commands itself, but must use the Linux/390 CDS interfaces instead. - -For long running I/O request to be canceled, the CDS layer provides the -ccw_device_halt() function. Some devices require to initially issue a HALT -SUBCHANNEL (HSCH) command without having pending I/O requests. This function is -also covered by ccw_device_halt(). 
- - -get_ciw() - get command information word - -This call enables a device driver to get information about supported commands -from the extended SenseID data. - -struct ciw * -ccw_device_get_ciw(struct ccw_device *cdev, __u32 cmd); - -cdev - The ccw_device for which the command is to be retrieved. -cmd - The command type to be retrieved. - -ccw_device_get_ciw() returns: -NULL - No extended data available, invalid device or command not found. -!NULL - The command requested. - - -ccw_device_start() - Initiate I/O Request - -The ccw_device_start() routines is the I/O request front-end processor. All -device driver I/O requests must be issued using this routine. A device driver -must not issue ESA/390 I/O commands itself. Instead the ccw_device_start() -routine provides all interfaces required to drive arbitrary devices. - -This description also covers the status information passed to the device -driver's interrupt handler as this is related to the rules (flags) defined -with the associated I/O request when calling ccw_device_start(). - -int ccw_device_start(struct ccw_device *cdev, - struct ccw1 *cpa, - unsigned long intparm, - __u8 lpm, - unsigned long flags); -int ccw_device_start_timeout(struct ccw_device *cdev, - struct ccw1 *cpa, - unsigned long intparm, - __u8 lpm, - unsigned long flags, - int expires); -int ccw_device_start_key(struct ccw_device *cdev, - struct ccw1 *cpa, - unsigned long intparm, - __u8 lpm, - __u8 key, - unsigned long flags); -int ccw_device_start_key_timeout(struct ccw_device *cdev, - struct ccw1 *cpa, - unsigned long intparm, - __u8 lpm, - __u8 key, - unsigned long flags, - int expires); - -cdev : ccw_device the I/O is destined for -cpa : logical start address of channel program -user_intparm : user specific interrupt information; will be presented - back to the device driver's interrupt handler. Allows a - device driver to associate the interrupt with a - particular I/O request. 
-lpm : defines the channel path to be used for a specific I/O - request. A value of 0 will make cio use the opm. -key : the storage key to use for the I/O (useful for operating on a - storage with a storage key != default key) -flag : defines the action to be performed for I/O processing -expires : timeout value in jiffies. The common I/O layer will terminate - the running program after this and call the interrupt handler - with ERR_PTR(-ETIMEDOUT) as irb. - -Possible flag values are : - -DOIO_ALLOW_SUSPEND - channel program may become suspended -DOIO_DENY_PREFETCH - don't allow for CCW prefetch; usually - this implies the channel program might - become modified -DOIO_SUPPRESS_INTER - don't call the handler on intermediate status - -The cpa parameter points to the first format 1 CCW of a channel program : - -struct ccw1 { - __u8 cmd_code;/* command code */ - __u8 flags; /* flags, like IDA addressing, etc. */ - __u16 count; /* byte count */ - __u32 cda; /* data address */ -} __attribute__ ((packed,aligned(8))); - -with the following CCW flags values defined : - -CCW_FLAG_DC - data chaining -CCW_FLAG_CC - command chaining -CCW_FLAG_SLI - suppress incorrect length -CCW_FLAG_SKIP - skip -CCW_FLAG_PCI - PCI -CCW_FLAG_IDA - indirect addressing -CCW_FLAG_SUSPEND - suspend - - -Via ccw_device_set_options(), the device driver may specify the following -options for the device: - -DOIO_EARLY_NOTIFICATION - allow for early interrupt notification -DOIO_REPORT_ALL - report all interrupt conditions - - -The ccw_device_start() function returns : - - 0 - successful completion or request successfully initiated --EBUSY - The device is currently processing a previous I/O request, or there is - a status pending at the device. --ENODEV - cdev is invalid, the device is not operational or the ccw_device is - not online. - -When the I/O request completes, the CDS first level interrupt handler will -accumulate the status in a struct irb and then call the device interrupt handler. 
-The intparm field will contain the value the device driver has associated with a -particular I/O request. If a pending device status was recognized, -intparm will be set to 0 (zero). This may happen during I/O initiation or delayed -by an alert status notification. In any case this status is not related to the -current (last) I/O request. In case of a delayed status notification no special -interrupt will be presented to indicate I/O completion as the I/O request was -never started, even though ccw_device_start() returned with successful completion. - -The irb may contain an error value, and the device driver should check for this -first: - --ETIMEDOUT: the common I/O layer terminated the request after the specified - timeout value --EIO: the common I/O layer terminated the request due to an error state - -If the concurrent sense flag in the extended status word (esw) in the irb is -set, the field erw.scnt in the esw describes the number of device specific -sense bytes available in the extended control word irb->scsw.ecw[]. No device -sensing by the device driver itself is required. - -The device interrupt handler can use the following definitions to investigate -the primary unit check source coded in sense byte 0 : - -SNS0_CMD_REJECT 0x80 -SNS0_INTERVENTION_REQ 0x40 -SNS0_BUS_OUT_CHECK 0x20 -SNS0_EQUIPMENT_CHECK 0x10 -SNS0_DATA_CHECK 0x08 -SNS0_OVERRUN 0x04 -SNS0_INCOMPL_DOMAIN 0x01 - -Depending on the device status, multiple of those values may be set together. -Please refer to the device specific documentation for details. 
- -The irb->scsw.cstat field provides the (accumulated) subchannel status : - -SCHN_STAT_PCI - program controlled interrupt -SCHN_STAT_INCORR_LEN - incorrect length -SCHN_STAT_PROG_CHECK - program check -SCHN_STAT_PROT_CHECK - protection check -SCHN_STAT_CHN_DATA_CHK - channel data check -SCHN_STAT_CHN_CTRL_CHK - channel control check -SCHN_STAT_INTF_CTRL_CHK - interface control check -SCHN_STAT_CHAIN_CHECK - chaining check - -The irb->scsw.dstat field provides the (accumulated) device status : - -DEV_STAT_ATTENTION - attention -DEV_STAT_STAT_MOD - status modifier -DEV_STAT_CU_END - control unit end -DEV_STAT_BUSY - busy -DEV_STAT_CHN_END - channel end -DEV_STAT_DEV_END - device end -DEV_STAT_UNIT_CHECK - unit check -DEV_STAT_UNIT_EXCEP - unit exception - -Please see the ESA/390 Principles of Operation manual for details on the -individual flag meanings. - -Usage Notes : - -ccw_device_start() must be called disabled and with the ccw device lock held. - -The device driver is allowed to issue the next ccw_device_start() call from -within its interrupt handler already. It is not required to schedule a -bottom-half, unless a non deterministically long running error recovery procedure -or similar needs to be scheduled. During I/O processing the Linux/390 generic -I/O device driver support has already obtained the IRQ lock, i.e. the handler -must not try to obtain it again when calling ccw_device_start() or we end in a -deadlock situation! - -If a device driver relies on an I/O request to be completed prior to start the -next it can reduce I/O processing overhead by chaining a NoOp I/O command -CCW_CMD_NOOP to the end of the submitted CCW chain. This will force Channel-End -and Device-End status to be presented together, with a single interrupt. -However, this should be used with care as it implies the channel will remain -busy, not being able to process I/O requests for other devices on the same -channel. Therefore e.g. 
read commands should never use this technique, as the -result will be presented by a single interrupt anyway. - -In order to minimize I/O overhead, a device driver should use the -DOIO_REPORT_ALL only if the device can report intermediate interrupt -information prior to device-end the device driver urgently relies on. In this -case all I/O interruptions are presented to the device driver until final -status is recognized. - -If a device is able to recover from asynchronously presented I/O errors, it can -perform overlapping I/O using the DOIO_EARLY_NOTIFICATION flag. While some -devices always report channel-end and device-end together, with a single -interrupt, others present primary status (channel-end) when the channel is -ready for the next I/O request and secondary status (device-end) when the data -transmission has been completed at the device. - -Above flag allows to exploit this feature, e.g. for communication devices that -can handle lost data on the network to allow for enhanced I/O processing. - -Unless the channel subsystem at any time presents a secondary status interrupt, -exploiting this feature will cause only primary status interrupts to be -presented to the device driver while overlapping I/O is performed. When a -secondary status without error (alert status) is presented, this indicates -successful completion for all overlapping ccw_device_start() requests that have -been issued since the last secondary (final) status. - -Channel programs that intend to set the suspend flag on a channel command word -(CCW) must start the I/O operation with the DOIO_ALLOW_SUSPEND option or the -suspend flag will cause a channel program check. At the time the channel program -becomes suspended an intermediate interrupt will be generated by the channel -subsystem. 
- -ccw_device_resume() - Resume Channel Program Execution - -If a device driver chooses to suspend the current channel program execution by -setting the CCW suspend flag on a particular CCW, the channel program execution -is suspended. In order to resume channel program execution the CIO layer -provides the ccw_device_resume() routine. - -int ccw_device_resume(struct ccw_device *cdev); - -cdev - ccw_device the resume operation is requested for - -The ccw_device_resume() function returns: - - 0 - suspended channel program is resumed --EBUSY - status pending --ENODEV - cdev invalid or not-operational subchannel --EINVAL - resume function not applicable --ENOTCONN - there is no I/O request pending for completion - -Usage Notes: -Please have a look at the ccw_device_start() usage notes for more details on -suspended channel programs. - -ccw_device_halt() - Halt I/O Request Processing - -Sometimes a device driver might need a possibility to stop the processing of -a long-running channel program or the device might require to initially issue -a halt subchannel (HSCH) I/O command. For those purposes the ccw_device_halt() -command is provided. - -ccw_device_halt() must be called disabled and with the ccw device lock held. - -int ccw_device_halt(struct ccw_device *cdev, - unsigned long intparm); - -cdev : ccw_device the halt operation is requested for -intparm : interruption parameter; value is only used if no I/O - is outstanding, otherwise the intparm associated with - the I/O request is returned - -The ccw_device_halt() function returns : - - 0 - request successfully initiated --EBUSY - the device is currently busy, or status pending. --ENODEV - cdev invalid. --EINVAL - The device is not operational or the ccw device is not online. - -Usage Notes : - -A device driver may write a never-ending channel program by writing a channel -program that at its end loops back to its beginning by means of a transfer in -channel (TIC) command (CCW_CMD_TIC). 
Usually this is performed by network -device drivers by setting the PCI CCW flag (CCW_FLAG_PCI). Once this CCW is -executed a program controlled interrupt (PCI) is generated. The device driver -can then perform an appropriate action. Prior to interrupt of an outstanding -read to a network device (with or without PCI flag) a ccw_device_halt() -is required to end the pending operation. - -ccw_device_clear() - Terminage I/O Request Processing - -In order to terminate all I/O processing at the subchannel, the clear subchannel -(CSCH) command is used. It can be issued via ccw_device_clear(). - -ccw_device_clear() must be called disabled and with the ccw device lock held. - -int ccw_device_clear(struct ccw_device *cdev, unsigned long intparm); - -cdev: ccw_device the clear operation is requested for -intparm: interruption parameter (see ccw_device_halt()) - -The ccw_device_clear() function returns: - - 0 - request successfully initiated --ENODEV - cdev invalid --EINVAL - The device is not operational or the ccw device is not online. - -Miscellaneous Support Routines - -This chapter describes various routines to be used in a Linux/390 device -driver programming environment. - -get_ccwdev_lock() - -Get the address of the device specific lock. This is then used in -spin_lock() / spin_unlock() calls. - - -__u8 ccw_device_get_path_mask(struct ccw_device *cdev); - -Get the mask of the path currently available for cdev. diff --git a/Documentation/s390/common_io.rst b/Documentation/s390/common_io.rst new file mode 100644 index 000000000000..846485681ce7 --- /dev/null +++ b/Documentation/s390/common_io.rst @@ -0,0 +1,140 @@ +====================== +S/390 common I/O-Layer +====================== + +command line parameters, procfs and debugfs entries +=================================================== + +Command line parameters +----------------------- + +* ccw_timeout_log + + Enable logging of debug information in case of ccw device timeouts. 
+ +* cio_ignore = device[,device[,..]] + + device := {all | [!]ipldev | [!]condev | [!] | [!]-} + + The given devices will be ignored by the common I/O-layer; no detection + and device sensing will be done on any of those devices. The subchannel to + which the device in question is attached will be treated as if no device was + attached. + + An ignored device can be un-ignored later; see the "/proc entries"-section for + details. + + The devices must be given either as bus ids (0.x.abcd) or as hexadecimal + device numbers (0xabcd or abcd, for 2.4 backward compatibility). If you + give a device number 0xabcd, it will be interpreted as 0.0.abcd. + + You can use the 'all' keyword to ignore all devices. The 'ipldev' and 'condev' + keywords can be used to refer to the CCW based boot device and CCW console + device respectively (these are probably useful only when combined with the '!' + operator). The '!' operator will cause the I/O-layer to _not_ ignore a device. + The command line + is parsed from left to right. + + For example:: + + cio_ignore=0.0.0023-0.0.0042,0.0.4711 + + will ignore all devices ranging from 0.0.0023 to 0.0.0042 and the device + 0.0.4711, if detected. + + As another example:: + + cio_ignore=all,!0.0.4711,!0.0.fd00-0.0.fd02 + + will ignore all devices but 0.0.4711, 0.0.fd00, 0.0.fd01, 0.0.fd02. + + By default, no devices are ignored. + + +/proc entries +------------- + +* /proc/cio_ignore + + Lists the ranges of devices (by bus id) which are ignored by common I/O. + + You can un-ignore certain or all devices by piping to /proc/cio_ignore. + "free all" will un-ignore all ignored devices, + "free , , ..." will un-ignore the specified + devices. 
+ + For example, if devices 0.0.0023 to 0.0.0042 and 0.0.4711 are ignored, + + - echo free 0.0.0030-0.0.0032 > /proc/cio_ignore + will un-ignore devices 0.0.0030 to 0.0.0032 and will leave devices 0.0.0023 + to 0.0.002f, 0.0.0033 to 0.0.0042 and 0.0.4711 ignored; + - echo free 0.0.0041 > /proc/cio_ignore will furthermore un-ignore device + 0.0.0041; + - echo free all > /proc/cio_ignore will un-ignore all remaining ignored + devices. + + When a device is un-ignored, device recognition and sensing are performed and + the device driver will be notified if possible, so the device will become + available to the system. Note that un-ignoring is performed asynchronously. + + You can also add ranges of devices to be ignored by piping to + /proc/cio_ignore; "add , , ..." will ignore the + specified devices. + + Note: While already known devices can be added to the list of devices to be + ignored, there will be no effect on them. However, if such a device + disappears and then reappears, it will then be ignored. To make + known devices go away, you need the "purge" command (see below). + + For example:: + + "echo add 0.0.a000-0.0.accc, 0.0.af00-0.0.afff > /proc/cio_ignore" + + will add 0.0.a000-0.0.accc and 0.0.af00-0.0.afff to the list of ignored + devices. + + You can remove already known but now ignored devices via:: + + "echo purge > /proc/cio_ignore" + + All devices ignored but still registered and not online (= not in use) + will be deregistered and thus removed from the system. + + The devices can be specified either by bus id (0.x.abcd) or, for 2.4 backward + compatibility, by the device number in hexadecimal (0xabcd or abcd). Device + numbers given as 0xabcd will be interpreted as 0.0.abcd. + +* /proc/cio_settle + + A write request to this file is blocked until all queued cio actions are + handled. This will allow userspace to wait for pending work affecting + device availability after changing cio_ignore or the hardware configuration.
+ +* For some of the information present in the /proc filesystem in 2.4 (namely, + /proc/subchannels and /proc/chpids), see driver-model.txt. + Information formerly in /proc/irq_count is now in /proc/interrupts. + + +debugfs entries +--------------- + +* /sys/kernel/debug/s390dbf/cio_*/ (S/390 debug feature) + + Some views generated by the debug feature to hold various debug outputs. + + - /sys/kernel/debug/s390dbf/cio_crw/sprintf + Messages from the processing of pending channel report words (machine check + handling). + + - /sys/kernel/debug/s390dbf/cio_msg/sprintf + Various debug messages from the common I/O-layer. + + - /sys/kernel/debug/s390dbf/cio_trace/hex_ascii + Logs the calling of functions in the common I/O-layer and, if applicable, + which subchannel they were called for, as well as dumps of some data + structures (like irb in an error case). + + The level of logging can be changed to be more or less verbose by piping to + /sys/kernel/debug/s390dbf/cio_*/level a number between 0 and 6; see the + documentation on the S/390 debug feature (Documentation/s390/s390dbf.rst) + for details. diff --git a/Documentation/s390/dasd.rst b/Documentation/s390/dasd.rst new file mode 100644 index 000000000000..9e22247285c8 --- /dev/null +++ b/Documentation/s390/dasd.rst @@ -0,0 +1,84 @@ +================== +DASD device driver +================== + +S/390's disk devices (DASDs) are managed by Linux via the DASD device +driver. It is valid for all types of DASDs and represents them to +Linux as block devices, namely "dd". Currently the DASD driver uses a +single major number (254) and 4 minor numbers per volume (1 for the +physical volume and 3 for partitions). With respect to partitions see +below. Thus you may have up to 64 DASD devices in your system. + +The kernel parameter 'dasd=from-to,...' may be issued arbitrary times +in the kernel's parameter line or not at all. The 'from' and 'to' +parameters are to be given in hexadecimal notation without a leading +0x. 
+If you supply kernel parameters the different instances are processed +in order of appearance and a minor number is reserved for any device +covered by the supplied range up to 64 volumes. Additional DASDs are +ignored. If you do not supply the 'dasd=' kernel parameter at all, the +DASD driver registers all supported DASDs of your system to a minor +number in ascending order of the subchannel number. + +The driver currently supports ECKD-devices and there are stubs for +support of the FBA and CKD architectures. For the FBA architecture +only some smart data structures are missing to make the support +complete. +We performed our testing on 3380 and 3390 type disks of different +sizes, under VM and on the bare hardware (LPAR), using internal disks +of the multiprise as well as a RAMAC virtual array. Disks exported by +an Enterprise Storage Server (Seascape) should work fine as well. + +We currently implement one partition per volume, which is the whole +volume, skipping the first blocks up to the volume label. These are +reserved for IPL records and IBM's volume label to assure +accessibility of the DASD from other OSs. In a later stage we will +provide support of partitions, maybe VTOC oriented or using a kind of +partition table in the label record. + +Usage +===== + +-Low-level format (?CKD only) +For using an ECKD-DASD as a Linux harddisk you have to low-level +format the tracks by issuing the BLKDASDFORMAT-ioctl on that +device. This will erase any data on that volume including IBM volume +labels, VTOCs etc. The ioctl may take a `struct format_data *` or +'NULL' as an argument:: + + typedef struct { + int start_unit; + int stop_unit; + int blksize; + } format_data_t; + +When a NULL argument is passed to the BLKDASDFORMAT ioctl the whole +disk is formatted to a blocksize of 1024 bytes. Otherwise start_unit +and stop_unit are the first and last track to be formatted. If +stop_unit is -1 it implies that the DASD is formatted from start_unit +up to the last track. 
blksize can be any power of two between 512 and +4096. We recommend no blksize lower than 1024 because the ext2fs uses +1kB blocks anyway and you gain approx. 50% of capacity increasing your +blksize from 512 byte to 1kB. + +Make a filesystem +================= + +Then you can mk??fs the filesystem of your choice on that volume or +partition. For reasons of sanity you should build your filesystem on +the partition /dev/dd?1 instead of the whole volume. You only lose 3kB +but may be sure that you can reuse your data after introduction of a +real partition table. + +Bugs +==== + +- Performance sometimes is rather low because we don't fully exploit clustering + +TODO-List +========= + +- Add IBM'S Disk layout to genhd +- Enhance driver to use more than one major number +- Enable usage as a module +- Support Cache fast write and DASD fast write (ECKD) diff --git a/Documentation/s390/debugging390.rst b/Documentation/s390/debugging390.rst new file mode 100644 index 000000000000..d49305fd5e1a --- /dev/null +++ b/Documentation/s390/debugging390.rst @@ -0,0 +1,2613 @@ +============================================= +Debugging on Linux for s/390 & z/Architecture +============================================= + +Denis Joseph Barrow (djbarrow@de.ibm.com,barrow_dj@yahoo.com) + +Copyright (C) 2000-2001 IBM Deutschland Entwicklung GmbH, IBM Corporation + +.. Best viewed with fixed width fonts + +Overview of Document: +===================== +This document is intended to give a good overview of how to debug Linux for +s/390 and z/Architecture. It is not intended as a complete reference and not a +tutorial on the fundamentals of C & assembly. It doesn't go into +390 IO in any detail. It is intended to complement the documents in the +reference section below & any other worthwhile references you get. + +It is intended like the Enterprise Systems Architecture/390 Reference Summary +to be printed out & used as a quick cheat sheet self help style reference when +problems occur. + +.. 
Contents + ======== + Register Set + Address Spaces on Intel Linux + Address Spaces on Linux for s/390 & z/Architecture + The Linux for s/390 & z/Architecture Kernel Task Structure + Register Usage & Stackframes on Linux for s/390 & z/Architecture + A sample program with comments + Compiling programs for debugging on Linux for s/390 & z/Architecture + Debugging under VM + s/390 & z/Architecture IO Overview + Debugging IO on s/390 & z/Architecture under VM + GDB on s/390 & z/Architecture + Stack chaining in gdb by hand + Examining core dumps + ldd + Debugging modules + The proc file system + SysRq + References + Special Thanks + +Register Set +============ +The current architectures have the following registers. + +16 General purpose registers, 32 bit on s/390 and 64 bit on z/Architecture, +r0-r15 (or gpr0-gpr15), used for arithmetic and addressing. + +16 Control registers, 32 bit on s/390 and 64 bit on z/Architecture, cr0-cr15, +kernel usage only, used for memory management, interrupt control, debugging +control etc. + +16 Access registers (ar0-ar15), 32 bit on both s/390 and z/Architecture, +normally not used by normal programs but potentially could be used as +temporary storage. These registers have a 1:1 association with general +purpose registers and are designed to be used in the so-called access +register mode to select different address spaces. +Access register 0 (and access register 1 on z/Architecture, which needs a +64 bit pointer) is currently used by the pthread library as a pointer to +the currently running thread's private area. + +16 64-bit floating point registers (fp0-fp15) IEEE & HFP floating +point format compliant on G5 upwards & a Floating point control reg (FPC) + +4 64-bit registers (fp0,fp2,fp4 & fp6) HFP only on older machines. + +Note: + Linux (currently) always uses IEEE & emulates G5 IEEE format on older + machines (provided the kernel is configured for this).
+ + +The PSW is the most important register on the machine it +is 64 bit on s/390 & 128 bit on z/Architecture & serves the roles of +a program counter (pc), condition code register,memory space designator. +In IBM standard notation I am counting bit 0 as the MSB. +It has several advantages over a normal program counter +in that you can change address translation & program counter +in a single instruction. To change address translation, +e.g. switching address translation off requires that you +have a logical=physical mapping for the address you are +currently running at. + ++-------------------------+-------------------------------------------------+ +| Bit | | ++--------+----------------+ Value | +| s/390 | z/Architecture | | ++========+================+=================================================+ +| 0 | 0 | Reserved (must be 0) otherwise specification | +| | | exception occurs. | ++--------+----------------+-------------------------------------------------+ +| 1 | 1 | Program Event Recording 1 PER enabled, | +| | | PER is used to facilitate debugging e.g. | +| | | single stepping. | ++--------+----------------+-------------------------------------------------+ +| 2-4 | 2-4 | Reserved (must be 0). | ++--------+----------------+-------------------------------------------------+ +| 5 | 5 | Dynamic address translation 1=DAT on. | ++--------+----------------+-------------------------------------------------+ +| 6 | 6 | Input/Output interrupt Mask | ++--------+----------------+-------------------------------------------------+ +| 7 | 7 | External interrupt Mask used primarily for | +| | | interprocessor signalling and clock interrupts. 
| ++--------+----------------+-------------------------------------------------+ +| 8-11 | 8-11 | PSW Key used for complex memory protection | +| | | mechanism (not used under linux) | ++--------+----------------+-------------------------------------------------+ +| 12 | 12 | 1 on s/390 0 on z/Architecture | ++--------+----------------+-------------------------------------------------+ +| 13 | 13 | Machine Check Mask 1=enable machine check | +| | | interrupts | ++--------+----------------+-------------------------------------------------+ +| 14 | 14 | Wait State. Set this to 1 to stop the processor | +| | | except for interrupts and give time to other | +| | | LPARS. Used in CPU idle in the kernel to | +| | | increase overall usage of processor resources. | ++--------+----------------+-------------------------------------------------+ +| 15 | 15 | Problem state (if set to 1 certain instructions | +| | | are disabled). All linux user programs run with | +| | | this bit 1 (useful info for debugging under VM).| ++--------+----------------+-------------------------------------------------+ +| 16-17 | 16-17 | Address Space Control | +| | | | +| | | 00 Primary Space Mode: | +| | | | +| | | The register CR1 contains the primary | +| | | address-space control element (PASCE), which | +| | | points to the primary space region/segment | +| | | table origin. | +| | | | +| | | 01 Access register mode | +| | | | +| | | 10 Secondary Space Mode: | +| | | | +| | | The register CR7 contains the secondary | +| | | address-space control element (SASCE), which | +| | | points to the secondary space region or | +| | | segment table origin. | +| | | | +| | | 11 Home Space Mode: | +| | | | +| | | The register CR13 contains the home space | +| | | address-space control element (HASCE), which | +| | | points to the home space region/segment | +| | | table origin. 
| +| | | | +| | | See "Address Spaces on Linux for s/390 & | +| | | z/Architecture" below for more information | +| | | about address space usage in Linux. | ++--------+----------------+-------------------------------------------------+ +| 18-19 | 18-19 | Condition codes (CC) | ++--------+----------------+-------------------------------------------------+ +| 20 | 20 | Fixed point overflow mask if 1=FPU exceptions | +| | | for this event occur (normally 0) | ++--------+----------------+-------------------------------------------------+ +| 21 | 21 | Decimal overflow mask if 1=FPU exceptions for | +| | | this event occur (normally 0) | ++--------+----------------+-------------------------------------------------+ +| 22 | 22 | Exponent underflow mask if 1=FPU exceptions | +| | | for this event occur (normally 0) | ++--------+----------------+-------------------------------------------------+ +| 23 | 23 | Significance Mask if 1=FPU exceptions for this | +| | | event occur (normally 0) | ++--------+----------------+-------------------------------------------------+ +| 24-31 | 24-30 | Reserved Must be 0. 
| +| +----------------+-------------------------------------------------+ +| | 31 | Extended Addressing Mode | +| +----------------+-------------------------------------------------+ +| | 32 | Basic Addressing Mode | +| | | | +| | | Used to set addressing mode | +| | | | +| | | +---------+----------+----------+ | +| | | | PSW 31 | PSW 32 | | | +| | | +---------+----------+----------+ | +| | | | 0 | 0 | 24 bit | | +| | | +---------+----------+----------+ | +| | | | 0 | 1 | 31 bit | | +| | | +---------+----------+----------+ | +| | | | 1 | 1 | 64 bit | | +| | | +---------+----------+----------+ | ++--------+----------------+-------------------------------------------------+ +| 32 | | 1=31 bit addressing mode 0=24 bit addressing | +| | | mode (for backward compatibility), linux | +| | | always runs with this bit set to 1 | ++--------+----------------+-------------------------------------------------+ +| 33-64 | | Instruction address. | +| +----------------+-------------------------------------------------+ +| | 33-63 | Reserved must be 0 | +| +----------------+-------------------------------------------------+ +| | 64-127 | Address | +| | | | +| | | - In 24 bits mode bits 64-103=0 bits 104-127 | +| | | Address | +| | | - In 31 bits mode bits 64-96=0 bits 97-127 | +| | | Address | +| | | | +| | | Note: | +| | | unlike 31 bit mode on s/390 bit 96 must be | +| | | zero when loading the address with LPSWE | +| | | otherwise a specification exception occurs, | +| | | LPSW is fully backward compatible. | ++--------+----------------+-------------------------------------------------+ + +Prefix Page(s) +-------------- +This per cpu memory area is too intimately tied to the processor not to mention. +It exists between the real addresses 0-4096 on s/390 and between 0-8192 on +z/Architecture and is exchanged with one page on s/390 or two pages on +z/Architecture in absolute storage by the set prefix instruction during Linux +startup. 
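Looking back at the PSW bit table above, the following is a minimal sketch of how the fields of a 64-bit s/390 PSW could be picked apart when reading one from a dump. It assumes the IBM numbering used in the table (bit 0 is the most significant bit) and takes the instruction address to be the low 31 bits of the PSW. The names `psw_bits()`, `psw_decode()` and `struct psw_fields` are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/*
 * Extract bits first..last of a 64-bit PSW using IBM numbering,
 * where bit 0 is the most significant bit. The field width is
 * assumed to be between 1 and 63 bits.
 */
uint64_t psw_bits(uint64_t psw, unsigned first, unsigned last)
{
    unsigned width = last - first + 1;

    return (psw >> (63 - last)) & ((UINT64_C(1) << width) - 1);
}

/* Hypothetical container for the s/390 PSW fields named in the table. */
struct psw_fields {
    unsigned per;           /* bit 1: Program Event Recording        */
    unsigned dat;           /* bit 5: dynamic address translation    */
    unsigned io_mask;       /* bit 6: I/O interrupt mask             */
    unsigned ext_mask;      /* bit 7: external interrupt mask        */
    unsigned key;           /* bits 8-11: PSW key                    */
    unsigned mcheck_mask;   /* bit 13: machine check mask            */
    unsigned wait;          /* bit 14: wait state                    */
    unsigned problem;       /* bit 15: problem state                 */
    unsigned asc;           /* bits 16-17: address space control     */
    unsigned cc;            /* bits 18-19: condition code            */
    unsigned amode31;       /* bit 32: 1 = 31 bit addressing mode    */
    uint32_t addr;          /* low 31 bits: instruction address      */
};

struct psw_fields psw_decode(uint64_t psw)
{
    struct psw_fields f;

    f.per         = (unsigned)psw_bits(psw, 1, 1);
    f.dat         = (unsigned)psw_bits(psw, 5, 5);
    f.io_mask     = (unsigned)psw_bits(psw, 6, 6);
    f.ext_mask    = (unsigned)psw_bits(psw, 7, 7);
    f.key         = (unsigned)psw_bits(psw, 8, 11);
    f.mcheck_mask = (unsigned)psw_bits(psw, 13, 13);
    f.wait        = (unsigned)psw_bits(psw, 14, 14);
    f.problem     = (unsigned)psw_bits(psw, 15, 15);
    f.asc         = (unsigned)psw_bits(psw, 16, 17);
    f.cc          = (unsigned)psw_bits(psw, 18, 19);
    f.amode31     = (unsigned)psw_bits(psw, 32, 32);
    f.addr        = (uint32_t)psw_bits(psw, 33, 63);
    return f;
}
```

For instance, a made-up PSW value of 0x070d200080001234 decodes to DAT on, I/O and external interrupts enabled, problem state set, primary space mode, condition code 2, 31 bit addressing and instruction address 0x1234.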
+ +This page is mapped to a different prefix for each processor in an SMP +configuration (assuming the OS designer is sane of course). + +Bytes 0-512 (200 hex) on s/390 and 0-512, 4096-4544, 4604-5119 currently on +z/Architecture are used by the processor itself for holding such information +as exception indications and entry points for exceptions. + +Bytes after 0xc00 hex are used by linux for per processor globals on s/390 and +z/Architecture (there is a gap on z/Architecture currently between 0xc00 and +0x1000, too, which is used by Linux). + +The closest thing to this on traditional architectures is the interrupt +vector table. This is a good thing & does simplify some of the kernel coding +however it means that we now cannot catch stray NULL pointers in the +kernel without hard coded checks. + + + +Address Spaces on Intel Linux +============================= + +The traditional Intel Linux is approximately mapped as follows forgive +the ascii art:: + + 0xFFFFFFFF 4GB Himem ***************** + * * + * Kernel Space * + * * + ***************** **************** + User Space Himem * User Stack * * * + (typically 0xC0000000 3GB ) ***************** * * + * Shared Libs * * Next Process * + ***************** * to * + * * <== * Run * <== + * User Program * * * + * Data BSS * * * + * Text * * * + * Sections * * * + 0x00000000 ***************** **************** + +Now it is easy to see that on Intel it is quite easy to recognise a kernel +address as being one greater than user space himem (in this case 0xC0000000), +and addresses of less than this are the ones in the current running program on +this processor (if an smp box). + +If using the virtual machine ( VM ) as a debugger it is quite difficult to +know which user process is running as the address space you are looking at +could be from any process in the run queue. 
+ +The limitation of Intel's addressing technique is that the linux +kernel uses a very simple real address to virtual addressing technique +of Real Address=Virtual Address-User Space Himem. +This means that on Intel the Linux kernel can typically only address +Himem=0xFFFFFFFF-0xC0000000=1GB & this is all the RAM these machines +can typically use. + +They can lower User Himem to 2GB or lower & thus be +able to use 2GB of RAM; however, this shrinks the maximum size +of User Space from 3GB to 2GB. They have a no-win limit of 4GB unless +they go to 64 Bit. + + +On 390 our limitations & strengths make us slightly different. +For backward compatibility we are only allowed to use 31 bits (2GB) +of our 32 bit addresses, however, we use entirely separate address +spaces for the user & kernel. + +This means we can support 2GB of non Extended RAM on s/390, & more +with the Extended memory management swap device & +currently 4TB of physical memory on z/Architecture. + + +Address Spaces on Linux for s/390 & z/Architecture +================================================== + +Our addressing scheme is basically as follows:: + + Primary Space Home Space + Himem 0x7fffffff 2GB on s/390 ***************** **************** + currently 0x3ffffffffff (2^42)-1 * User Stack * * * + on z/Architecture. ***************** * * + * Shared Libs * * * + ***************** * * + * * * Kernel * + * User Program * * * + * Data BSS * * * + * Text * * * + * Sections * * * + 0x00000000 ***************** **************** + +This also means that we need to look at the PSW problem state bit and the +addressing mode to decide whether we are looking at user or kernel space. + +User space runs in primary address mode (or access register mode within +the vdso code).
+ +The kernel usually also runs in home space mode, however when accessing +user space the kernel switches to primary or secondary address mode if +the mvcos instruction is not available or if a compare-and-swap (futex) +instruction on a user space address is performed. + +When also looking at the ASCE control registers, this means: + +User space: + +- runs in primary or access register mode +- cr1 contains the user asce +- cr7 contains the user asce +- cr13 contains the kernel asce + +Kernel space: + +- runs in home space mode +- cr1 contains the user or kernel asce + + - the kernel asce is loaded when a uaccess requires primary or + secondary address mode + +- cr7 contains the user or kernel asce, (changed with set_fs()) +- cr13 contains the kernel asce + +In case of uaccess the kernel changes to: + +- primary space mode in case of a uaccess (copy_to_user) and uses + e.g. the mvcp instruction to access user space. However the kernel + will stay in home space mode if the mvcos instruction is available +- secondary space mode in case of futex atomic operations, so that the + instructions come from primary address space and data from secondary + space + +In case of KVM, the kernel runs in home space mode, but cr1 gets switched +to contain the gmap asce before the SIE instruction gets executed. When +the SIE instruction is finished, cr1 will be switched back to contain the +user asce. + + +Virtual Addresses on s/390 & z/Architecture +=========================================== + +A virtual address on s/390 is made up of 3 parts +The SX (segment index, roughly corresponding to the PGD & PMD in Linux +terminology) being bits 1-11. + +The PX (page index, corresponding to the page table entry (pte) in Linux +terminology) being bits 12-19. + +The remaining bits BX (the byte index are the offset in the page ) +i.e. bits 20 to 31. + +On z/Architecture in linux we currently make up an address from 4 parts. 
+ +- The region index bits (RX) 0-32 we currently use bits 22-32 +- The segment index (SX) being bits 33-43 +- The page index (PX) being bits 44-51 +- The byte index (BX) being bits 52-63 + +Notes: + 1) s/390 has no PMD so the PMD is really the PGD also. + A lot of this stuff is defined in pgtable.h. + + 2) Also seeing as s/390's page indexes are only 1k in size + (bits 12-19 x 4 bytes per pte ) we use 1 ( page 4k ) + to make the best use of memory by updating 4 segment indices + entries each time we mess with a PMD & use offsets + 0,1024,2048 & 3072 in this page for our segment indexes. + On z/Architecture our page indexes are now 2k in size + ( bits 12-19 x 8 bytes per pte ) we do a similar trick + but only mess with 2 segment indices each time we mess with + a PMD. + + 3) Although z/Architecture supports up to a massive 5-level page table lookup, we + can only use 3 currently on Linux ( as this is all the generic kernel + currently supports ); however, this may change in future. + This allows us to access ( according to my sums ) + 4TB of virtual storage per process i.e. + 4096*512(PTES)*1024(PMDS)*2048(PGD) = 4398046511104 bytes, + enough for another 2 or 3 years I think :-). + To do this we use a region-third-table designation type in + our address space control registers. + + +The Linux for s/390 & z/Architecture Kernel Task Structure +========================================================== +Each process/thread under Linux for S390 has its own kernel task_struct +defined in linux/include/linux/sched.h +The S390 on initialisation & resuming of a process on a cpu sets +the __LC_KERNEL_STACK variable in the spare prefix area for this cpu +(which we use for per-processor globals).
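Returning briefly to the virtual address layout described in the previous section, here is a minimal sketch of the s/390 split into SX (bits 1-11), PX (bits 12-19) and BX (bits 20-31), with bit 0 as the most significant bit of a 32 bit address. It also re-checks the 4TB sum quoted in note 3. The names `s390_split_vaddr()`, `struct s390_vaddr` and `zarch_virtual_bytes()` are hypothetical and are not taken from pgtable.h.

```c
#include <stdint.h>

/* Hypothetical holder for the three parts of an s/390 virtual address. */
struct s390_vaddr {
    unsigned sx;    /* segment index, bits 1-11 (11 bits) */
    unsigned px;    /* page index, bits 12-19 (8 bits)    */
    unsigned bx;    /* byte index, bits 20-31 (12 bits)   */
};

/* Split a 32 bit s/390 virtual address; bit 0 (the MSB) is ignored. */
struct s390_vaddr s390_split_vaddr(uint32_t va)
{
    struct s390_vaddr v;

    v.sx = (va >> 20) & 0x7ff;
    v.px = (va >> 12) & 0xff;
    v.bx = va & 0xfff;
    return v;
}

/* The sum from note 3: 4096 * 512 * 1024 * 2048 bytes, which is 2^42. */
uint64_t zarch_virtual_bytes(void)
{
    return UINT64_C(4096) * 512 * 1024 * 2048;
}
```

With these definitions the address 0x12345678 splits into SX 0x123, PX 0x45 and BX 0x678, and the product in note 3 comes out as 4398046511104 bytes.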
+ +The kernel stack pointer is intimately tied with the task structure for +each processor as follows:: + + s/390 + ************************ + * 1 page kernel stack * + * ( 4K ) * + ************************ + * 1 page task_struct * + * ( 4K ) * + 8K aligned ************************ + + z/Architecture + ************************ + * 2 page kernel stack * + * ( 8K ) * + ************************ + * 2 page task_struct * + * ( 8K ) * + 16K aligned ************************ + +What this means is that we don't need to dedicate any register or global +variable to point to the current running process & can retrieve it with the +following very simple construct for s/390 & one very similar for +z/Architecture:: + + static inline struct task_struct * get_current(void) + { + struct task_struct *current; + __asm__("lhi %0,-8192\n\t" + "nr %0,15" + : "=r" (current) ); + return current; + } + +i.e. just anding the current kernel stack pointer with the mask -8192. +Thankfully because Linux doesn't have support for nested IO interrupts +& our devices have large buffers can survive interrupts being shut for +short amounts of time we don't need a separate stack for interrupts. + + + + +Register Usage & Stackframes on Linux for s/390 & z/Architecture +================================================================= +Overview: +--------- +This is the code that gcc produces at the top & the bottom of +each function. It usually is fairly consistent & similar from +function to function & if you know its layout you can probably +make some headway in finding the ultimate cause of a problem +after a crash without a source level debugger. + +Note: To follow stackframes requires a knowledge of C or Pascal & +limited knowledge of one assembly language. + +It should be noted that there are some differences between the +s/390 and z/Architecture stack layouts as the z/Architecture stack layout +didn't have to maintain compatibility with older linkage formats. 
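To make the stack pointer masking used by get_current() in the previous section concrete, the sketch below performs the same AND in portable C. It assumes the layout drawn there: an 8K aligned region on s/390 and a 16K aligned region on z/Architecture, with the task_struct at the bottom. The function name `task_base_from_sp()` is hypothetical, and the addresses in the usage note are made up.

```c
#include <stdint.h>

/*
 * Round a kernel stack pointer down to the start of its aligned
 * task_struct + stack region. region_size must be a power of two
 * (8192 on s/390, 16384 on z/Architecture in the layout above).
 * For region_size 8192 the mask ~(8192 - 1) equals -8192, the value
 * loaded by the lhi instruction in get_current().
 */
uintptr_t task_base_from_sp(uintptr_t sp, uintptr_t region_size)
{
    return sp & ~(region_size - 1);
}
```

For example, a stack pointer of 0x00a03f40 in an 8K region yields 0x00a02000, and 0x00a07f40 in a 16K region yields 0x00a04000.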
+ +Glossary: +--------- +alloca: + This is a built in compiler function for runtime allocation + of extra space on the callers stack which is obviously freed + up on function exit ( e.g. the caller may choose to allocate nothing + of a buffer of 4k if required for temporary purposes ), it generates + very efficient code ( a few cycles ) when compared to alternatives + like malloc. + +automatics: + These are local variables on the stack, i.e they aren't in registers & + they aren't static. + +back-chain: + This is a pointer to the stack pointer before entering a + framed functions ( see frameless function ) prologue got by + dereferencing the address of the current stack pointer, + i.e. got by accessing the 32 bit value at the stack pointers + current location. + +base-pointer: + This is a pointer to the back of the literal pool which + is an area just behind each procedure used to store constants + in each function. + +call-clobbered: + The caller probably needs to save these registers if there + is something of value in them, on the stack or elsewhere before making a + call to another procedure so that it can restore it later. + +epilogue: + The code generated by the compiler to return to the caller. + +frameless-function: + A frameless function in Linux for s390 & z/Architecture is one which doesn't + need more than the register save area (96 bytes on s/390, 160 on z/Architecture) + given to it by the caller. + + A frameless function never: + + 1) Sets up a back chain. + 2) Calls alloca. + 3) Calls other normal functions + 4) Has automatics. + +GOT-pointer: + This is a pointer to the global-offset-table in ELF + ( Executable Linkable Format, Linux'es most common executable format ), + all globals & shared library objects are found using this pointer. + +lazy-binding + ELF shared libraries are typically only loaded when routines in the shared + library are actually first called at runtime. This is lazy binding. 
+ +procedure-linkage-table + This is a table found from the GOT which contains pointers to routines + in other shared libraries which can't be called to by easier means. + +prologue: + The code generated by the compiler to set up the stack frame. + +outgoing-args: + This is extra area allocated on the stack of the calling function if the + parameters for the callee's cannot all be put in registers, the same + area can be reused by each function the caller calls. + +routine-descriptor: + A COFF executable format based concept of a procedure reference + actually being 8 bytes or more as opposed to a simple pointer to the routine. + This is typically defined as follows: + + - Routine Descriptor offset 0=Pointer to Function + - Routine Descriptor offset 4=Pointer to Table of Contents + + The table of contents/TOC is roughly equivalent to a GOT pointer. + & it means that shared libraries etc. can be shared between several + environments each with their own TOC. + +static-chain: + This is used in nested functions a concept adopted from pascal + by gcc not used in ansi C or C++ ( although quite useful ), basically it + is a pointer used to reference local variables of enclosing functions. + You might come across this stuff once or twice in your lifetime. + + e.g. + + The function below should return 11 though gcc may get upset & toss warnings + about unused variables:: + + int FunctionA(int a) + { + int b; + FunctionC(int c) + { + b=c+1; + } + FunctionC(10); + return(b); + } + + +s/390 & z/Architecture Register usage +===================================== + +======== ========================================== =============== +r0 used by syscalls/assembly call-clobbered +r1 used by syscalls/assembly call-clobbered +r2 argument 0 / return value 0 call-clobbered +r3 argument 1 / return value 1 (if long long) call-clobbered +r4 argument 2 call-clobbered +r5 argument 3 call-clobbered +r6 argument 4 saved +r7 pointer-to arguments 5 to ... 
saved +r8 this & that saved +r9 this & that saved +r10 static-chain ( if nested function ) saved +r11 frame-pointer ( if function used alloca ) saved +r12 got-pointer saved +r13 base-pointer saved +r14 return-address saved +r15 stack-pointer saved + +f0 argument 0 / return value ( float/double ) call-clobbered +f2 argument 1 call-clobbered +f4 z/Architecture argument 2 saved +f6 z/Architecture argument 3 saved +======== ========================================== =============== + +The remaining floating point registers +f1,f3,f5 f7-f15 are call-clobbered. + +Notes: +------ +1) The only requirement is that registers which are used + by the callee are saved, e.g. the compiler is perfectly + capable of using r11 for purposes other than a + frame pointer if a frame pointer is not needed. +2) In functions with variable arguments e.g. printf the calling procedure + is identical to one without variable arguments & the same number of + parameters. However, the prologue of this function is somewhat more + hairy owing to it having to move these parameters to the stack to + get va_start, va_arg & va_end to work. +3) Access registers are currently unused by gcc but are used in + the kernel. Possibilities exist to use them at the moment for + temporary storage but it isn't recommended. +4) Only 4 of the floating point registers are used for + parameter passing as older machines such as G3 have only 4 + & it keeps the stack frame compatible with other compilers. + However with IEEE floating point emulation under linux on the + older machines you are free to use the other 12. +5) A long long or double parameter cannot have the + first 4 bytes in a register & the second four bytes in the + outgoing args area. It must be purely in the outgoing args + area if crossing this boundary. +6) Floating point parameters are mixed with outgoing args + on the outgoing args area in the order they are passed in as parameters.
+7) Floating point arguments 2 & 3 are saved in the outgoing args area for + z/Architecture + + +Stack Frame Layout +------------------ + +========= ============== ====================================================== +s/390 z/Architecture +========= ============== ====================================================== +0 0 back chain ( a 0 here signifies end of back chain ) +4 8 eos ( end of stack, not used on Linux for S390 used + in other linkage formats ) +8 16 glue used in other s/390 linkage formats for saved + routine descriptors etc. +12 24 glue used in other s/390 linkage formats for saved + routine descriptors etc. +16 32 scratch area +20 40 scratch area +24 48 saved r6 of caller function +28 56 saved r7 of caller function +32 64 saved r8 of caller function +36 72 saved r9 of caller function +40 80 saved r10 of caller function +44 88 saved r11 of caller function +48 96 saved r12 of caller function +52 104 saved r13 of caller function +56 112 saved r14 of caller function +60 120 saved r15 of caller function +64 128 saved f4 of caller function +72 132 saved f6 of caller function +80 undefined +96 160 outgoing args passed from caller to callee +96+x 160+x possible stack alignment ( 8 bytes desirable ) +96+x+y 160+x+y alloca space of caller ( if used ) +96+x+y+z 160+x+y+z automatics of caller ( if used ) +0 back-chain +========= ============== ====================================================== + +A sample program with comments. +=============================== + +Comments on the function test +----------------------------- +1) It didn't need to set up a pointer to the constant pool gpr13 as it is not + used ( :-( ). +2) This is a frameless function & no stack is bought. +3) The compiler was clever enough to recognise that it could return the + value in r2 as well as use it for the passed in parameter ( :-) ). 
+4) The basr ( branch relative & save ) trick works as follows the instruction + has a special case with r0,r0 with some instruction operands is understood as + the literal value 0, some risc architectures also do this ). So now + we are branching to the next address & the address new program counter is + in r13,so now we subtract the size of the function prologue we have executed + the size of the literal pool to get to the top of the literal pool:: + + + 0040037c int test(int b) + { # Function prologue below + 40037c: 90 de f0 34 stm %r13,%r14,52(%r15) # Save registers r13 & r14 + 400380: 0d d0 basr %r13,%r0 # Set up pointer to constant pool using + 400382: a7 da ff fa ahi %r13,-6 # basr trick + return(5+b); + # Huge main program + 400386: a7 2a 00 05 ahi %r2,5 # add 5 to r2 + + # Function epilogue below + 40038a: 98 de f0 34 lm %r13,%r14,52(%r15) # restore registers r13 & 14 + 40038e: 07 fe br %r14 # return + } + +Comments on the function main +----------------------------- +1) The compiler did this function optimally ( 8-) ):: + + Literal pool for main. + 400390: ff ff ff ec .long 0xffffffec + main(int argc,char *argv[]) + { # Function prologue below + 400394: 90 bf f0 2c stm %r11,%r15,44(%r15) # Save necessary registers + 400398: 18 0f lr %r0,%r15 # copy stack pointer to r0 + 40039a: a7 fa ff a0 ahi %r15,-96 # Make area for callee saving + 40039e: 0d d0 basr %r13,%r0 # Set up r13 to point to + 4003a0: a7 da ff f0 ahi %r13,-16 # literal pool + 4003a4: 50 00 f0 00 st %r0,0(%r15) # Save backchain + + return(test(5)); # Main Program Below + 4003a8: 58 e0 d0 00 l %r14,0(%r13) # load relative address of test from + # literal pool + 4003ac: a7 28 00 05 lhi %r2,5 # Set first parameter to 5 + 4003b0: 4d ee d0 00 bas %r14,0(%r14,%r13) # jump to test setting r14 as return + # address using branch & save instruction. + + # Function Epilogue below + 4003b4: 98 bf f0 8c lm %r11,%r15,140(%r15)# Restore necessary registers. 
+ 4003b8: 07 fe br %r14 # return to do program exit + } + + +Compiler updates +---------------- + +:: + + main(int argc,char *argv[]) + { + 4004fc: 90 7f f0 1c stm %r7,%r15,28(%r15) + 400500: a7 d5 00 04 bras %r13,400508 + 400504: 00 40 04 f4 .long 0x004004f4 + # compiler now puts constant pool in code to so it saves an instruction + 400508: 18 0f lr %r0,%r15 + 40050a: a7 fa ff a0 ahi %r15,-96 + 40050e: 50 00 f0 00 st %r0,0(%r15) + return(test(5)); + 400512: 58 10 d0 00 l %r1,0(%r13) + 400516: a7 28 00 05 lhi %r2,5 + 40051a: 0d e1 basr %r14,%r1 + # compiler adds 1 extra instruction to epilogue this is done to + # avoid processor pipeline stalls owing to data dependencies on g5 & + # above as register 14 in the old code was needed directly after being loaded + # by the lm %r11,%r15,140(%r15) for the br %14. + 40051c: 58 40 f0 98 l %r4,152(%r15) + 400520: 98 7f f0 7c lm %r7,%r15,124(%r15) + 400524: 07 f4 br %r4 + } + + +Hartmut ( our compiler developer ) also has been threatening to take out the +stack backchain in optimised code as this also causes pipeline stalls, you +have been warned. + +64 bit z/Architecture code disassembly +-------------------------------------- + +If you understand the stuff above you'll understand the stuff +below too so I'll avoid repeating myself & just say that +some of the instructions have g's on the end of them to indicate +they are 64 bit & the stack offsets are a bigger, +the only other difference you'll find between 32 & 64 bit is that +we now use f4 & f6 for floating point arguments on 64 bit:: + + 00000000800005b0 : + int test(int b) + { + return(5+b); + 800005b0: a7 2a 00 05 ahi %r2,5 + 800005b4: b9 14 00 22 lgfr %r2,%r2 # downcast to integer + 800005b8: 07 fe br %r14 + 800005ba: 07 07 bcr 0,%r7 + + + } + + 00000000800005bc
: + main(int argc,char *argv[]) + { + 800005bc: eb bf f0 58 00 24 stmg %r11,%r15,88(%r15) + 800005c2: b9 04 00 1f lgr %r1,%r15 + 800005c6: a7 fb ff 60 aghi %r15,-160 + 800005ca: e3 10 f0 00 00 24 stg %r1,0(%r15) + return(test(5)); + 800005d0: a7 29 00 05 lghi %r2,5 + # brasl allows jumps > 64k & is overkill here, bras would do fine + 800005d4: c0 e5 ff ff ff ee brasl %r14,800005b0 + 800005da: e3 40 f1 10 00 04 lg %r4,272(%r15) + 800005e0: eb bf f0 f8 00 04 lmg %r11,%r15,248(%r15) + 800005e6: 07 f4 br %r4 + } + + + +Compiling programs for debugging on Linux for s/390 & z/Architecture +==================================================================== +-gdwarf-2 now works & should be considered the default debugging +format for s/390 & z/Architecture as it is more reliable for debugging +shared libraries. Normal -g debugging works much better now, +thanks to bug reports from the IBM java compiler developers. + +This is typically done by adding/appending the flags -g or -gdwarf-2 to the +CFLAGS & LDFLAGS variables in the Makefile of the program concerned. + +If using gdb & you would like accurate displays of registers & +stack traces compile without optimisation i.e. make sure +that there is no -O2 or similar on the CFLAGS line of the Makefile & +the emitted gcc commands. Obviously this will produce worse code +( not advisable for shipment ) but it is an aid to the debugging process. + +This aids debugging because the compiler will copy parameters passed in +in registers onto the stack so backtracing & looking at passed in +parameters will work, however some larger programs which use inline functions +will not compile without optimisation. + +Debugging with optimisation has since much improved after fixing +some bugs, please make sure you are using gdb-5.0 or later developed +after Nov'2000. + + + +Debugging under VM +================== + +Notes +----- +Addresses & values in the VM debugger are always hex, never decimal. +Address ranges are of the format - or +.
+For example, the address range 0x2000 to 0x3000 can be described as 2000-3000 +or 2000.1000 + +The VM Debugger is case insensitive. + +VM's strengths are usually other debuggers weaknesses you can get at any +resource no matter how sensitive e.g. memory management resources, change +address translation in the PSW. For kernel hacking you will reap dividends if +you get good at it. + +The VM Debugger displays operators but not operands, and also the debugger +displays useful information on the same line as the author of the code probably +felt that it was a good idea not to go over the 80 columns on the screen. +This isn't as unintuitive as it may seem as the s/390 instructions are easy to +decode mentally and you can make a good guess at a lot of them as all the +operands are nibble (half byte aligned). +So if you have an objdump listing by hand, it is quite easy to follow, and if +you don't have an objdump listing keep a copy of the s/390 Reference Summary +or alternatively the s/390 principles of operation next to you. +e.g. even I can guess that +0001AFF8' LR 180F CC 0 +is a ( load register ) lr r0,r15 + +Also it is very easy to tell the length of a 390 instruction from the 2 most +significant bits in the instruction (not that this info is really useful except +if you are trying to make sense of a hexdump of code). +Here is a table + +======================= ================== +Bits Instruction Length +======================= ================== +00 2 Bytes +01 4 Bytes +10 4 Bytes +11 6 Bytes +======================= ================== + +The debugger also displays other useful info on the same line such as the +addresses being operated on destination addresses of branches & condition codes. 
+e.g.:: + + 00019736' AHI A7DAFF0E CC 1 + 000198BA' BRC A7840004 -> 000198C2' CC 0 + 000198CE' STM 900EF068 >> 0FA95E78 CC 2 + + + +Useful VM debugger commands +--------------------------- + +I suppose I'd better mention this before I start +to list the current active traces do:: + + Q TR + +there can be a maximum of 255 of these per set +( more about trace sets later ). + +To stop traces issue a:: + + TR END. + +To delete a particular breakpoint issue:: + + TR DEL + +The PA1 key drops to CP mode so you can issue debugger commands, +Doing alt c (on my 3270 console at least ) clears the screen. + +hitting b comes back to the running operating system +from cp mode ( in our case linux ). + +It is typically useful to add shortcuts to your profile.exec file +if you have one ( this is roughly equivalent to autoexec.bat in DOS ). +file here are a few from mine:: + + /* this gives me command history on issuing f12 */ + set pf12 retrieve + /* this continues */ + set pf8 imm b + /* goes to trace set a */ + set pf1 imm tr goto a + /* goes to trace set b */ + set pf2 imm tr goto b + /* goes to trace set c */ + set pf3 imm tr goto c + + + +Instruction Tracing +------------------- +Setting a simple breakpoint:: + + TR I PSWA
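Before going further into tracing, a note on reading the raw trace lines shown earlier: the instruction length table above can be applied mechanically. The sketch below is illustrative only; s390_insn_length is a made-up name, and it takes the first byte of the instruction, i.e. the first two hex digits of the instruction text in a trace line.

```c
#include <stdint.h>

/* Hypothetical helper: length in bytes of a 390 instruction, decided by
 * the two most significant bits of its first byte, as in the table
 * above ( 00 -> 2, 01 -> 4, 10 -> 4, 11 -> 6 ). */
static int s390_insn_length(uint8_t first_byte)
{
    switch (first_byte >> 6) {
    case 0:
        return 2;   /* bits 00 */
    case 1:         /* bits 01 */
    case 2:
        return 4;   /* bits 10 */
    default:
        return 6;   /* bits 11 */
    }
}
```

Applied to instructions that appear in this document: LR 180F starts with 18 and is 2 bytes, AHI A7DAFF0E starts with A7 and is 4 bytes, STM 900EF068 starts with 90 and is 4 bytes, and the 64 bit stmg starting with eb is 6 bytes.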
+ +To debug a particular function try:: + + TR I R + TR I on its own will single step. + TR I DATA will trace for particular mnemonics + +e.g.:: + + TR I DATA 4D R 0197BC.4000 + +will trace for BAS'es ( opcode 4D ) in the range 0197BC.4000 + +if you were inclined you could add traces for all branch instructions & +suffix them with RUN so you would have a backtrace on screen +when a program crashes:: + + TR BR will trace branches into or out of an address. + +e.g.:: + + TR BR INTO 0 + +is often quite useful if a program is getting awkward & deciding +to branch to 0 & crashing, as this will stop at the address before it jumps to 0. + +:: + + TR I R
RUN cmd d g + +single steps a range of addresses but stays running & +displays the gprs on each step. + + + +Displaying & modifying Registers +-------------------------------- +D G + will display all the gprs + +Adding an extra G to all the commands is necessary to access the full 64 bit +content in VM on z/Architecture. Obviously this isn't required for access +registers as these are still 32 bit. + +e.g. + +DGG + instead of DG + +D X + will display all the control registers +D AR + will display all the access registers +D AR4-7 + will display access registers 4 to 7 +CPU ALL D G + will display the GPRs of all CPUs in the configuration +D PSW + will display the current PSW +st PSW 2000 + will put the value 2000 into the PSW & crash your machine. +D PREFIX + displays the prefix offset + + +Displaying Memory +----------------- +To display memory mapped using the current PSW's mapping try:: + + D + +To make VM display a message each time it hits a particular address and +continue try: + +D I + will disassemble/display a range of instructions. + +ST addr 32 bit word + will store a 32 bit word at a 32 bit aligned address +D T + will display the EBCDIC in an address (if you are that way inclined) +D R + will display real addresses ( without DAT ) but with prefixing. + +There are other complex options to display. If you need to get at say home space +but are in primary space, the easiest thing to do is to temporarily +modify the PSW to the other addressing mode, display the stuff & then +restore it. + + + +Hints +----- +If you want to issue a debugger command without halting your virtual machine +with the PA1 key try prefixing the command with #CP e.g.:: + + #cp tr i pswa 2000 + +also suffixing most debugger commands with RUN will cause them not +to stop, just display the mnemonic at the current instruction on the console.
+ +If you have several breakpoints you want to put into your program & +you get fed up of cross referencing with System.map +you can do the following trick for several symbols. + +:: + + grep do_signal System.map + +which emits the following among other things:: + + 0001f4e0 T do_signal + +now you can do:: + + TR I PSWA 0001f4e0 cmd msg * do_signal + +This sends a message to your own console each time do_signal is entered. +( As an aside I wrote a perl script once which automatically generated a REXX +script with breakpoints on every kernel procedure, this isn't a good idea +because there are thousands of these routines & VM can only set 255 breakpoints +at a time so you nearly had to spend as long pruning the file down as you would +entering the msgs by hand), however, the trick might be useful for a single +object file. In the 3270 terminal emulator x3270 there is a very useful option +in the file menu called "Save Screen In File" - this is very good for keeping a +copy of traces. + +From CMS help will give you online help on a particular command. +e.g.:: + + HELP DISPLAY + +Also CP has a file called profile.exec which automatically gets called +on startup of CMS ( like autoexec.bat ), keeping on a DOS analogy session +CP has a feature similar to doskey, it may be useful for you to +use profile.exec to define some keystrokes. + +SET PF9 IMM B + This does a single step in VM on pressing F8. + +SET PF10 ^ + This sets up the ^ key. + which can be used for ^c (ctrl-c),^z (ctrl-z) which can't be typed + directly into some 3270 consoles. + +SET PF11 ^- + This types the starting keystrokes for a sysrq see SysRq below. +SET PF12 RETRIEVE + This retrieves command history on pressing F12. 
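The System.map trick shown earlier in these hints is easy to mechanise for a handful of symbols. The following is a minimal sketch, not a finished tool: map_line_to_trace is a hypothetical name, and it only reformats one System.map line into the TR I PSWA command form used above. Remember that VM only allows 255 breakpoints per trace set, so feed it a pruned list.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: turn one System.map line such as
 *   "0001f4e0 T do_signal"
 * into the CP command
 *   "TR I PSWA 0001f4e0 cmd msg * do_signal".
 * Returns 0 on success, -1 if the line cannot be parsed or the
 * command does not fit into the output buffer. */
static int map_line_to_trace(const char *line, char *out, size_t outlen)
{
    char addr[17];
    char type;
    char sym[128];
    int n;

    /* address ( hex ), symbol type letter, symbol name */
    if (sscanf(line, "%16[0-9a-fA-F] %c %127s", addr, &type, sym) != 3)
        return -1;

    n = snprintf(out, outlen, "TR I PSWA %s cmd msg * %s", addr, sym);
    if (n < 0 || (size_t)n >= outlen)
        return -1;
    return 0;
}
```

Calling it on each line of the grep output and writing the results to a file gives a list of commands which can be pasted into the console or wrapped in a REXX script.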
+ + +Sometimes in VM the display is set up to scroll automatically. This +can be very annoying if there are messages you wish to look at; +to stop this do + +TERM MORE 255 255 + This will nearly stop automatic screen updates, however it will + cause a denial of service if lots of messages go to the 3270 console, + so it would be foolish to use this as the default on a production machine. + + +Tracing particular processes +---------------------------- +The kernel's text segment is intentionally at an address in memory where it will +very seldom collide with text segments of user programs ( thanks Martin ), +this simplifies debugging the kernel. +However it is quite common for user processes to have addresses which collide. +This can make debugging a particular process under VM painful under normal +circumstances as the process may change when doing a:: + + TR I R
. + +Thankfully after reading VM's online help I figured out how to debug +a particular process. + +Your first problem is to find the STD ( segment table designation ) +of the program you wish to debug. +There are several ways you can do this; here are a few. + +Run:: + + objdump --syms | grep main + +To get the address of main in the program. Then:: + + tr i pswa
+ +Start the program, if VM drops to CP on what looks like the entry +point of the main function this is most likely the process you wish to debug. +Now do a D X13 or D XG13 on z/Architecture. + +On 31 bit the STD is bits 1-19 ( the STO segment table origin ) +& 25-31 ( the STL segment table length ) of CR13. + +now type:: + + TR I R STD 0.7fffffff + +e.g.:: + + TR I R STD 8F32E1FF 0.7fffffff + +Another very useful variation is:: + + TR STORE INTO STD
+ +for finding out when a particular variable changes. + +An alternative way of finding the STD of a currently running process +is to do the following, ( this method is more complex but +could be quite convenient if you aren't updating the kernel much & +so your kernel structures will stay constant for a reasonable period of +time ). + +:: + + grep task /proc//status + +from this you should see something like:: + + task: 0f160000 ksp: 0f161de8 pt_regs: 0f161f68 + +This now gives you a pointer to the task structure. + +Now make:: + + CC:="s390-gcc -g" kernel/sched.s + +To get the task_struct stabinfo. + +( task_struct is defined in include/linux/sched.h ). + +Now we want to look at +task->active_mm->pgd + +on my machine the active_mm in the task structure stab is +active_mm:(4,12),672,32 + +its offset is 672/8=84=0x54 + +the pgd member in the mm_struct stab is +pgd:(4,6)=*(29,5),96,32 +so its offset is 96/8=12=0xc + +so we'll:: + + hexdump -s 0xf160054 /dev/mem | more + +i.e. task_struct+active_mm offset +to look at the active_mm member:: + + f160054 0fee cc60 0019 e334 0000 0000 0000 0011 + +:: + + hexdump -s 0x0feecc6c /dev/mem | more + +i.e. active_mm+pgd offset:: + + feecc6c 0f2c 0000 0000 0001 0000 0001 0000 0010 + +we get something like +now do:: + + TR I R STD 0.7fffffff + +i.e. the 0x7f is added because the pgd only +gives the page table origin & we need to set the low bits +to the maximum possible segment table length. + +:: + + TR I R STD 0f2c007f 0.7fffffff + +on z/Architecture you'll probably need to do:: + + TR I R STD 0.ffffffffffffffff + +to set the TableType to 0x1 & the Table length to 3. + + + +Tracing Program Exceptions +-------------------------- +If you get a crash which says something like +illegal operation or specification exception followed by a register dump +You can restart linux & trace these using the tr prog trace +option. 
+ + +The most common ones you will normally be tracing for are: + +- 1=operation exception +- 2=privileged operation exception +- 4=protection exception +- 5=addressing exception +- 6=specification exception +- 10=segment translation exception +- 11=page translation exception + +The full list of these is on page 22 of the current s/390 Reference Summary. +e.g. + +tr prog 10 will trace segment translation exceptions. + +tr prog on its own will trace all program interruption codes. + +Trace Sets +---------- +On starting VM you are initially in the INITIAL trace set. +You can do a Q TR to verify this. +Trace sets help if you have a complex tracing situation where you wish to wait for instance +till a driver is open before you start tracing IO, but know in your +heart that you are going to have to make several runs through the code till you +have a clue what's going on. + +What you can do is:: + + TR I PSWA + +hit b to continue till breakpoint + +reach the breakpoint + +now do your:: + + TR GOTO B + TR IO 7c08-7c09 inst int run + +or whatever the IO channels you wish to trace are & hit b + +To go back to the initial trace set do:: + + TR GOTO INITIAL + +& the TR I PSWA will be the only active breakpoint again. + + +Tracing linux syscalls under VM +------------------------------- +Syscalls are implemented on Linux for S390 by the Supervisor call instruction +(SVC). There are 256 possibilities of these as the instruction is made up of a 0x0A +opcode and the second byte being the syscall number. They are traced using the +simple command:: + + TR SVC + +the syscalls are defined in linux/arch/s390/include/asm/unistd.h +e.g.
to trace all file opens just do:: + + TR SVC 5 ( as this is the syscall number of open ) + + +SMP Specific commands +--------------------- +To find out how many cpus you have, +Q CPUS displays all the CPUs available to your virtual machine. +To find the cpu that the current VM debugger commands are being directed at +do Q CPU; to change the cpu that VM debugger commands are being directed at +do:: + + CPU + +On an SMP guest, to issue a command to all CPUs try prefixing the command with cpu +all. To issue a command to a particular cpu try cpu e.g.:: + + CPU 01 TR I R 2000.3000 + +If you are running on a guest with several cpus & you have an IO related problem +& cannot follow the flow of code but you know it isn't smp related, +try the following. + +From the bash prompt issue:: + + shutdown -h now or halt. + +do a:: + + Q CPUS + +to find out how many cpus you have, then detach each one of them from cp except +cpu 0 by issuing a:: + + DETACH CPU 01-(number of cpus in configuration) + +& boot linux again. + +TR SIGP + will trace inter processor signal processor instructions. + +DEFINE CPU 01-(number in configuration) + will get your guest's cpus back. + + +Help for displaying ascii textstrings +------------------------------------- +On the very latest VM Nucleus'es VM can now display ascii +( thanks Neale for the hint ) by doing:: + + D TX. + +e.g.:: + + D TX0.100 + +Alternatively +============= +Under older VM debuggers (I love EBCDIC too) you can use the following little +program which converts a command line of hex digits to ascii text. It can be +compiled under linux and you can copy the hex digits from your x3270 terminal +to your xterm if you are debugging from a linuxbox. + +This is quite useful when looking at a parameter passed in as a text string +under VM ( unless you are good at decoding ASCII in your head ). + +e.g.
consider tracing an open syscall:: + + TR SVC 5 + +We have stopped at a breakpoint:: + + 000151B0' SVC 0A05 -> 0001909A' CC 0 + +D 20.8 to check the SVC old psw in the prefix area and see was it from userspace +(for the layout of the prefix area consult the "Fixed Storage Locations" +chapter of the s/390 Reference Summary if you have it available). + +:: + + V00000020 070C2000 800151B2 + +The problem state bit wasn't set & it's also too early in the boot sequence +for it to be a userspace SVC if it was we would have to temporarily switch the +psw to user space addressing so we could get at the first parameter of the open +in gpr2. + +Next do a:: + + D G2 + GPR 2 = 00014CB4 + +Now display what gpr2 is pointing to:: + + D 00014CB4.20 + V00014CB4 2F646576 2F636F6E 736F6C65 00001BF5 + V00014CC4 FC00014C B4001001 E0001000 B8070707 + +Now copy the text till the first 00 hex ( which is the end of the string +to an xterm & do hex2ascii on it:: + + hex2ascii 2F646576 2F636F6E 736F6C65 00 + +outputs:: + + Decoded Hex:=/ d e v / c o n s o l e 0x00 + +We were opening the console device, + +You can compile the code below yourself for practice :-), + +:: + + /* + * hex2ascii.c + * a useful little tool for converting a hexadecimal command line to ascii + * + * Author(s): Denis Joseph Barrow (djbarrow@de.ibm.com,barrow_dj@yahoo.com) + * (C) 2000 IBM Deutschland Entwicklung GmbH, IBM Corporation. 
+ */ + #include <stdio.h> + #include <string.h> + + int main(int argc,char *argv[]) + { + int cnt1,cnt2,len,toggle=0; + int startcnt=1; + unsigned char c,hex; + + if(argc>1&&(strcmp(argv[1],"-a")==0)) + startcnt=2; + printf("Decoded Hex:="); + /* walk every argument & every hex digit in it */ + for(cnt1=startcnt;cnt1<argc;cnt1++) + { + len=strlen(argv[cnt1]); + for(cnt2=0;cnt2<len;cnt2++) + { + c=argv[cnt1][cnt2]; + if(c>='0'&&c<='9') + c=c-'0'; + if(c>='A'&&c<='F') + c=c-'A'+10; + if(c>='a'&&c<='f') + c=c-'a'+10; + switch(toggle) + { + case 0: + hex=c<<4; + toggle=1; + break; + case 1: + hex+=c; + if(hex<32||hex>127) + { + if(startcnt==1) + printf("0x%02X ",(int)hex); + else + printf("."); + } + else + { + printf("%c",hex); + if(startcnt==1) + printf(" "); + } + toggle=0; + break; + } + } + } + printf("\n"); + return 0; + } + + + + +Stack tracing under VM +---------------------- +A basic backtrace +----------------- + +Here are the tricks I use; 9 out of 10 times they work pretty well. + +When your backchain reaches a dead end +-------------------------------------- +This can happen when an exception happens in the kernel and the kernel is +entered twice. If you reach the NULL pointer at the end of the back chain you +should be able to sniff further back if you use the following tricks. +1) A kernel address should be easy to recognise since it is in +primary space & the problem state bit isn't set & also +the Hi bit of the address is set. +2) Another backchain should also be easy to recognise since it is an +address pointing to another address approximately 100 bytes ( 0x64 hex ) +behind the current stackpointer. + + +Here is some practice. + +boot the kernel & hit PA1 at some random time + +d g to display the gprs, this should display something like:: + + GPR 0 = 00000001 00156018 0014359C 00000000 + GPR 4 = 00000001 001B8888 000003E0 00000000 + GPR 8 = 00100080 00100084 00000000 000FE000 + GPR 12 = 00010400 8001B2DC 8001B36A 000FFED8 + +Note that GPR14 is a return address but as we are real men we are going to +trace the stack.
+Display 0x40 bytes after the stack pointer:: + + V000FFED8 000FFF38 8001B838 80014C8E 000FFF38 + V000FFEE8 00000000 00000000 000003E0 00000000 + V000FFEF8 00100080 00100084 00000000 000FE000 + V000FFF08 00010400 8001B2DC 8001B36A 000FFED8 + + +Ah now look at what's in sp+56 (sp+0x38), this is 8001B36A, our saved r14 if +you look above at our stackframe, & it also agrees with GPR14. + +now backchain:: + + d 000FFF38.40 + +we now are taking the contents of SP to get our first backchain:: + + V000FFF38 000FFFA0 00000000 00014995 00147094 + V000FFF48 00147090 001470A0 000003E0 00000000 + V000FFF58 00100080 00100084 00000000 001BF1D0 + V000FFF68 00010400 800149BA 80014CA6 000FFF38 + +This displays a 2nd return address of 80014CA6 + +now do:: + + d 000FFFA0.40 + +for our 3rd backchain:: + + V000FFFA0 04B52002 0001107F 00000000 00000000 + V000FFFB0 00000000 00000000 FF000000 0001107F + V000FFFC0 00000000 00000000 00000000 00000000 + V000FFFD0 00010400 80010802 8001085A 000FFFA0 + + +our 3rd return address is 8001085A + +as the 04B52002 looks suspiciously like rubbish it is fair to assume that the +kernel entry routines for the sake of optimisation don't set up a backchain. + +now look at System.map to see if the addresses make any sense:: + + grep -i 0001b3 System.map + +outputs among other things:: + + 0001b304 T cpu_idle + +so 8001B36A +is cpu_idle+0x66 ( quiet the cpu is asleep, don't wake it ) + +:: + + grep -i 00014 System.map + +produces among other things:: + + 00014a78 T start_kernel + +so 80014CA6 is start_kernel+0x22e ( a hex sum I can't do in my head ). + +:: + + grep -i 00108 System.map + +this produces:: + + 00010800 T _stext + +so 8001085A is _stext+0x5a + +Congrats you've done your first backchain. + + + +s/390 & z/Architecture IO Overview +================================== + +I am not going to give a course in 390 IO architecture as this would take me +quite a while and I'm no expert. Instead I'll give a 390 IO architecture +summary for Dummies.
If you have the s/390 principles of operation available +read this instead. If nothing else you may find a few useful keywords in here +and be able to use them on a web search engine to find more useful information. + +Unlike other bus architectures modern 390 systems do their IO using mostly +fibre optics and devices such as tapes and disks can be shared between several +mainframes. Also S390 can support up to 65536 devices while a high end PC based +system might be choking with around 64. + +Here is some of the common IO terminology: + +Subchannel: + This is the logical number most IO commands use to talk to an IO device. There + can be up to 0x10000 (65536) of these in a configuration, typically there are a + few hundred. Under VM for simplicity they are allocated contiguously, however + on the native hardware they are not. They typically stay consistent between + boots provided no new hardware is inserted or removed. + + Under Linux for s390 we use these as IRQ's and also when issuing an IO command + (CLEAR SUBCHANNEL, HALT SUBCHANNEL, MODIFY SUBCHANNEL, RESUME SUBCHANNEL, + START SUBCHANNEL, STORE SUBCHANNEL and TEST SUBCHANNEL). We use this as the ID + of the device we wish to talk to. The most important of these instructions are + START SUBCHANNEL (to start IO), TEST SUBCHANNEL (to check whether the IO + completed successfully) and HALT SUBCHANNEL (to kill IO). A subchannel can have + up to 8 channel paths to a device, this offers redundancy if one is not + available. + +Device Number: + This number remains static and is closely tied to the hardware. There are 65536 + of these, made up of a CHPID (Channel Path ID, the most significant 8 bits) and + another lsb 8 bits. These remain static even if more devices are inserted or + removed from the hardware. There is a 1 to 1 mapping between subchannels and + device numbers, provided devices aren't inserted or removed. 
+ +Channel Control Words: + CCWs are linked lists of instructions initially pointed to by an operation + request block (ORB), which is initially given to Start Subchannel (SSCH) + command along with the subchannel number for the IO subsystem to process + while the CPU continues executing normal code. + CCWs come in two flavours, Format 0 (24 bit for backward compatibility) and + Format 1 (31 bit). These are typically used to issue read and write (and many + other) instructions. They consist of a length field and an absolute address + field. + + Each IO typically gets 1 or 2 interrupts, one for channel end (primary status) + when the channel is idle, and the second for device end (secondary status). + Sometimes you get both concurrently. You check how the IO went on by issuing a + TEST SUBCHANNEL at each interrupt, from which you receive an Interruption + response block (IRB). If you get channel and device end status in the IRB + without channel checks etc. your IO probably went okay. If you didn't you + probably need to examine the IRB, extended status word etc. + If an error occurs, more sophisticated control units have a facility known as + concurrent sense. This means that if an error occurs Extended sense information + will be presented in the Extended status word in the IRB. If not you have to + issue a subsequent SENSE CCW command after the test subchannel. + + +TPI (Test pending interrupt) can also be used for polled IO, but in +multitasking multiprocessor systems it isn't recommended except for +checking special cases (i.e. non looping checks for pending IO etc.). + +Store Subchannel and Modify Subchannel can be used to examine and modify +operating characteristics of a subchannel (e.g. channel paths). + +Other IO related Terms: + +Sysplex: + S390's Clustering Technology +QDIO: + S390's new high speed IO architecture to support devices such as gigabit + ethernet, this architecture is also designed to be forward compatible with + upcoming 64 bit machines. 
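As a rough illustration of the Channel Control Words entry above, the sketch below splits the two 32-bit words of a Format 1 CCW into a command code, flags, the length (count) field and the 31-bit data address. The struct and function names are hypothetical, chosen only for this example, and the flag constants are modelled on the kernel's CCW_FLAG_* definitions, so check them against your own tree before relying on them. The sample words E4200100 00487FE8 are taken from the TR IO trace shown in the VM debugging section of this document.

```c
#include <stdint.h>

/* Flag bits of a CCW; names follow the kernel's CCW_FLAG_* style (assumption). */
#define CCW_FLAG_DC   0x80  /* chain data */
#define CCW_FLAG_CC   0x40  /* chain command */
#define CCW_FLAG_SLI  0x20  /* suppress incorrect-length indication */
#define CCW_FLAG_IDA  0x04  /* data address points to an indirect address list */

/* Hypothetical decoded view of a Format 1 CCW, for illustration only. */
struct ccw1_fields {
    uint8_t  cmd_code;   /* command code, bits 0-7 of word 0 */
    uint8_t  flags;      /* flag byte, bits 8-15 of word 0 */
    uint16_t count;      /* length field, bits 16-31 of word 0 */
    uint32_t data_addr;  /* 31-bit absolute address from word 1 */
};

/* Decode the two words exactly as they are printed in a TR IO trace. */
struct ccw1_fields decode_ccw1(uint32_t word0, uint32_t word1)
{
    struct ccw1_fields f;

    f.cmd_code  = (uint8_t)(word0 >> 24);
    f.flags     = (uint8_t)(word0 >> 16);
    f.count     = (uint16_t)(word0 & 0xffff);
    f.data_addr = word1 & 0x7fffffff;
    return f;
}
```

Calling decode_ccw1(0xE4200100, 0x00487FE8) gives command code 0xE4, the SLI flag, a count of 0x100 bytes and a data address of 0x00487FE8. Shifts are used instead of a packed struct so the sketch behaves the same on little and big endian hosts.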
+ + +General Concepts +---------------- + +Input Output Processors (IOP's) are responsible for communicating between +the mainframe CPU's & the channel & relieve the mainframe CPU's from the +burden of communicating with IO devices directly, this allows the CPU's to +concentrate on data processing. + +IOP's can use one or more links ( known as channel paths ) to talk to each +IO device. It first checks for path availability & chooses an available one, +then starts ( & sometimes terminates IO ). +There are two types of channel path: ESCON & the Parallel IO interface. + +IO devices are attached to control units, control units provide the +logic to interface the channel paths & channel path IO protocols to +the IO devices, they can be integrated with the devices or housed separately +& often talk to several similar devices ( typical examples would be raid +controllers or a control unit which connects to 1000 3270 terminals ):: + + + +---------------------------------------------------------------+ + | +-----+ +-----+ +-----+ +-----+ +----------+ +----------+ | + | | CPU | | CPU | | CPU | | CPU | | Main | | Expanded | | + | | | | | | | | | | Memory | | Storage | | + | +-----+ +-----+ +-----+ +-----+ +----------+ +----------+ | + |---------------------------------------------------------------+ + | IOP | IOP | IOP | + |--------------------------------------------------------------- + | C | C | C | C | C | C | C | C | C | C | C | C | C | C | C | C | + ---------------------------------------------------------------- + || || + || Bus & Tag Channel Path || ESCON + || ====================== || Channel + || || || || Path + +----------+ +----------+ +----------+ + | | | | | | + | CU | | CU | | CU | + | | | | | | + +----------+ +----------+ +----------+ + | | | | | + +----------+ +----------+ +----------+ +----------+ +----------+ + |I/O Device| |I/O Device| |I/O Device| |I/O Device| |I/O Device| + +----------+ +----------+ +----------+ +----------+ +----------+ + CPU = Central 
Processing Unit + C = Channel + IOP = IO Processor + CU = Control Unit + +The 390 IO systems come in 2 flavours; the current 390 machines support both. + +The Older 360 & 370 Interface, sometimes called the Parallel I/O interface, +sometimes called Bus and Tag & sometimes Original Equipment Manufacturers +Interface (OEMI). + +This byte wide Parallel channel path/bus has parity & data on the "Bus" cable +and control lines on the "Tag" cable. These can operate in byte multiplex mode +for sharing between several slow devices or burst mode and monopolize the +channel for the whole burst. Up to 256 devices can be addressed on one of these +cables. These cables are about one inch in diameter. The maximum unextended +length supported by these cables is 125 Meters but this can be extended up to +2km with a fibre optic channel extender such as a 3044. The maximum burst speed +supported is 4.5 megabytes per second. However, some really old processors +support only transfer rates of 3.0, 2.0 & 1.0 MB/sec. +One of these paths can be daisy chained to up to 8 control units. + + +ESCON if fibre optic it is also called FICON +Was introduced by IBM in 1990. Has 2 fibre optic cables and uses either leds or +lasers for communication at a signaling rate of up to 200 megabits/sec. As +10 bits are transferred for every 8 bits info this drops to 160 megabits/sec +and to 18.6 Megabytes/sec once control info and CRC are added. ESCON only +operates in burst mode. + +ESCON's typical max cable length is 3km for the led version and 20km for the +laser version known as XDF (extended distance facility). This can be further +extended by using an ESCON director which triples the above mentioned ranges. +Unlike Bus & Tag as ESCON is serial it uses a packet switching architecture, +the standard Bus & Tag control protocol is however present within the packets. +Up to 256 devices can be attached to each control unit that uses one of these +interfaces.
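The ESCON rate figures quoted above can be checked with a little arithmetic. The sketch below is only a worked calculation, not a model of the ESCON protocol, and the function names are made up for this example. It applies the stated ratio of 10 line bits per 8 data bits to the signaling rate and then converts megabits to megabytes.

```c
/* Payload rate left when 10 line bits are sent for every 8 data bits. */
double escon_payload_mbit(double signalling_mbit)
{
    return signalling_mbit * 8.0 / 10.0;
}

/* Convert megabits per second to megabytes per second (8 bits per byte). */
double mbit_to_mbyte(double mbit)
{
    return mbit / 8.0;
}
```

For the 200 megabits/sec signaling rate this gives 160 megabits/sec, or 20 megabytes/sec, before protocol overhead. The 18.6 Megabytes/sec figure in the text is about 7% below that, which the text attributes to control info and CRC.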
+ +Common 390 Devices include: +Network adapters typically OSA2,3172's,2116's & OSA-E gigabit ethernet adapters, +Consoles 3270 & 3215 (a teletype emulated under linux for a line mode console). +DASD's direct access storage devices ( otherwise known as hard disks ). +Tape Drives. +CTC ( Channel to Channel Adapters ), +ESCON or Parallel Cables used as a very high speed serial link +between 2 machines. + + +Debugging IO on s/390 & z/Architecture under VM +=============================================== + +Now we are ready to go on with IO tracing commands under VM + +A few self explanatory queries:: + + Q OSA + Q CTC + Q DISK ( This command is CMS specific ) + Q DASD + +Q OSA on my machine returns:: + + OSA 7C08 ON OSA 7C08 SUBCHANNEL = 0000 + OSA 7C09 ON OSA 7C09 SUBCHANNEL = 0001 + OSA 7C14 ON OSA 7C14 SUBCHANNEL = 0002 + OSA 7C15 ON OSA 7C15 SUBCHANNEL = 0003 + +If you have a guest with certain privileges you may be able to see devices +which don't belong to you. To avoid this, add the option V. +e.g.:: + + Q V OSA + +Now using the device numbers returned by this command we will +Trace the io starting up on the first device 7c08 & 7c09 +In our simplest case we can trace the +start subchannels +like TR SSCH 7C08-7C09 +or the halt subchannels +or TR HSCH 7C08-7C09 +MSCH's ,STSCH's I think you can guess the rest + +A good trick is tracing all the IO's and CCWS and spooling them into the reader +of another VM guest so he can ftp the logfile back to his own machine. I'll do +a small bit of this and give you a look at the output. 
+ +1) Spool stdout to VM reader:: + + SP PRT TO (another vm guest ) or * for the local vm guest + +2) Fill the reader with the trace:: + + TR IO 7c08-7c09 INST INT CCW PRT RUN + +3) Start up linux:: + + i 00c +4) Finish the trace:: + + TR END + +5) close the reader:: + + C PRT + +6) list reader contents:: + + RDRLIST + +7) copy it to linux4's minidisk:: + + RECEIVE / LOG TXT A1 ( replace + +8) +filel & press F11 to look at it +You should see something like:: + + 00020942' SSCH B2334000 0048813C CC 0 SCH 0000 DEV 7C08 + CPA 000FFDF0 PARM 00E2C9C4 KEY 0 FPI C0 LPM 80 + CCW 000FFDF0 E4200100 00487FE8 0000 E4240100 ........ + IDAL 43D8AFE8 + IDAL 0FB76000 + 00020B0A' I/O DEV 7C08 -> 000197BC' SCH 0000 PARM 00E2C9C4 + 00021628' TSCH B2354000 >> 00488164 CC 0 SCH 0000 DEV 7C08 + CCWA 000FFDF8 DEV STS 0C SCH STS 00 CNT 00EC + KEY 0 FPI C0 CC 0 CTLS 4007 + 00022238' STSCH B2344000 >> 00488108 CC 0 SCH 0000 DEV 7C08 + +If you don't like messing up your reader ( because you possibly booted from it ) +you can alternatively spool it to another guest's reader. + + +Other common VM device related commands +--------------------------------------------- +These commands are listed only because they have +been of use to me in the past & may be of use to +you too. For more complete info on each of the commands +type HELP from CMS. + +detaching devices:: + + DET + ATT + +attach a device to guest * for your own guest + +READY + cause VM to issue a fake interrupt. + +The VARY command is normally only available to VM administrators:: + + VARY ON PATH TO + VARY OFF PATH FROM + +This is used to switch on or off channel paths to devices. + +Q CHPID + This displays state of devices using this channel path + +D SCHIB + This displays the subchannel information SCHIB block for the device. + this I believe is also only available to administrators. + +DEFINE CTC + defines a virtual CTC channel to channel connection + 2 need to be defined on each guest for the CTC driver to use.
+ +COUPLE devno userid remote devno + Joins a local virtual device to a remote virtual device + ( commonly used for the CTC driver ). + +Building a VM ramdisk under CMS which linux can use:: + + def vfb- + +blocksize is commonly 4096 for linux. + +Formatting it:: + + format (blksize + +Sharing a disk between multiple guests:: + + LINK userid devno1 devno2 mode password + + + +GDB on S390 +=========== +N.B. if compiling for debugging gdb works better without optimisation +( see Compiling programs for debugging ) + +invocation +---------- +gdb + +Online help +----------- +help: gives help on commands + +e.g.:: + + help + help display + +Note gdb's online help is very good use it. + + +Assembly +-------- +info registers: + displays registers other than floating point. + +info all-registers: + displays floating points as well. + +disassemble: + disassembles + +e.g.:: + + disassemble without parameters will disassemble the current function + disassemble $pc $pc+10 + +Viewing & modifying variables +----------------------------- +print or p: + displays variable or register + +e.g. p/x $sp will display the stack pointer + +display: + prints variable or register each time program stops + +e.g.:: + + display/x $pc will display the program counter + display argc + +undisplay: + undo's display's + +info breakpoints: + shows all current breakpoints + +info stack: + shows stack back trace (if this doesn't work too well, I'll show + you the stacktrace by hand below). + +info locals: + displays local variables. + +info args: + display current procedure arguments. + +set args: + will set argc & argv each time the victim program is invoked + +e.g.:: + + set =value + set argc=100 + set $pc=0 + + + +Modifying execution +------------------- +step: + steps n lines of sourcecode + +step + steps 1 line. + +step 100 + steps 100 lines of code. + +next: + like step except this will not step into subroutines + +stepi: + steps a single machine code instruction. 
+ +e.g.:: + + stepi 100 + +nexti: + steps a single machine code instruction but will not step into + subroutines. + +finish: + will run until exit of the current routine + +run: + (re)starts a program + +cont: + continues a program + +quit: + exits gdb. + + +breakpoints +------------ + +break + sets a breakpoint + +e.g.:: + + break main + break *$pc + break *0x400618 + +Here's a really useful one for large programs + +rbr + Set a breakpoint for all functions matching REGEXP + +e.g.:: + + rbr 390 + +will set a breakpoint with all functions with 390 in their name. + +info breakpoints + lists all breakpoints + +delete: + delete breakpoint by number or delete them all + +e.g. + +delete 1 + will delete the first breakpoint + + +delete + will delete them all + +watch: + This will set a watchpoint ( usually hardware assisted ), + +This will watch a variable till it changes + +e.g. + +watch cnt + will watch the variable cnt till it changes. + +As an aside unfortunately gdb's, architecture independent watchpoint code +is inconsistent & not very good, watchpoints usually work but not always. + +info watchpoints: + Display currently active watchpoints + +condition: ( another useful one ) + Specify breakpoint number N to break only if COND is true. + +Usage is `condition N COND`, where N is an integer and COND is an +expression to be evaluated whenever breakpoint N is reached. + + + +User defined functions/macros +----------------------------- +define: ( Note this is very very useful,simple & powerful ) + +usage define end + +examples which you should consider putting into .gdbinit in your home +directory:: + + define d + stepi + disassemble $pc $pc+10 + end + define e + nexti + disassemble $pc $pc+10 + end + + +Other hard to classify stuff +---------------------------- +signal n: + sends the victim program a signal. + +e.g. `signal 3` will send a SIGQUIT. + +info signals: + what gdb does when the victim receives certain signals. 
+ +list: + +e.g.: + +list + lists current function source +list 1,10 + list first 10 lines of current file. + +list test.c:1,10 + + +directory: + Adds directories to be searched for source if gdb cannot find the source. + (note it is a bit sensitive about slashes) + +e.g. To add the root of the filesystem to the searchpath do:: + + directory // + + +call +This calls a function in the victim program, this is pretty powerful +e.g. +(gdb) call printf("hello world") +outputs: +$1 = 11 + +You might now be thinking that the line above didn't work, something extra had +to be done. +(gdb) call fflush(stdout) +hello world$2 = 0 +As an aside the debugger also calls malloc & free under the hood +to make space for the "hello world" string. + + + +hints +----- +1) command completion works just like bash + ( if you are a bad typist like me this really helps ) + +e.g. hit br & cursor up & down :-). + +2) if you have a debugging problem that takes a few steps to recreate +put the steps into a file called .gdbinit in your current working directory +if you have defined a few extra useful user defined commands put these in +your home directory & they will be read each time gdb is launched. + +A typical .gdbinit file might be.:: + + break main + run + break runtime_exception + cont + + +stack chaining in gdb by hand +----------------------------- +This is done using a the same trick described for VM:: + + p/x (*($sp+56))&0x7fffffff + +get the first backchain. + +For z/Architecture +Replace 56 with 112 & ignore the &0x7fffffff +in the macros below & do nasty casts to longs like the following +as gdb unfortunately deals with printed arguments as ints which +messes up everything. + +i.e. here is a 3rd backchain dereference:: + + p/x *(long *)(***(long ***)$sp+112) + + +this outputs:: + + $5 = 0x528f18 + +on my machine. 
+ +Now you can use:: + + info symbol (*($sp+56))&0x7fffffff + +you might see something like:: + + rl_getc + 36 in section .text + +telling you what is located at address 0x528f18 +Now do:: + + p/x (*(*$sp+56))&0x7fffffff + +This outputs:: + + $6 = 0x528ed0 + +Now do:: + + info symbol (*(*$sp+56))&0x7fffffff + rl_read_key + 180 in section .text + +now do:: + + p/x (*(**$sp+56))&0x7fffffff + +& so on. + +Disassembling instructions without debug info +--------------------------------------------- +gdb typically complains if there is a lack of debugging +symbols in the disassemble command with +"No function contains specified address." To get around +this do:: + + x/xi
+ +e.g.:: + + x/20xi 0x400730 + + + +Note: + Remember gdb has history just like bash you don't need to retype the + whole line just use the up & down arrows. + + + +For more info +------------- +From your linuxbox do:: + + man gdb + +or:: + + info gdb + +core dumps +---------- + +What is a core dump ? +^^^^^^^^^^^^^^^^^^^^^ + +A core dump is a file generated by the kernel (if allowed) which contains the +registers and all active pages of the program which has crashed. + +From this file gdb will allow you to look at the registers, stack trace and +memory of the program as if it just crashed on your system. It is usually +called core and created in the current working directory. + +This is very useful in that a customer can mail a core dump to a technical +support department and the technical support department can reconstruct what +happened, provided they have an identical copy of this program with debugging +symbols compiled in and the source base of this build is available. + +In short it is far more useful than something like a crash log could ever hope +to be. + +Why have I never seen one ? +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Probably because you haven't used the command:: + + ulimit -c unlimited in bash + +to allow core dumps, now do:: + + ulimit -a + +to verify that the limit was accepted. + +A sample core dump + To create this I'm going to do:: + + ulimit -c unlimited + gdb + +to launch gdb (my victim app. ) now be bad & do the following from another +telnet/xterm session to the same machine:: + + ps -aux | grep gdb + kill -SIGSEGV + +or alternatively use `killall -SIGSEGV gdb` if you have the killall command. + +Now look at the core dump:: + + ./gdb core + +Displays the following:: + + GNU gdb 4.18 + Copyright 1998 Free Software Foundation, Inc. + GDB is free software, covered by the GNU General Public License, and you are + welcome to change it and/or distribute copies of it under certain conditions. + Type "show copying" to see the conditions.
+ There is absolutely no warranty for GDB. Type "show warranty" for details. + This GDB was configured as "s390-ibm-linux"... + Core was generated by `./gdb'. + Program terminated with signal 11, Segmentation fault. + Reading symbols from /usr/lib/libncurses.so.4...done. + Reading symbols from /lib/libm.so.6...done. + Reading symbols from /lib/libc.so.6...done. + Reading symbols from /lib/ld-linux.so.2...done. + #0 0x40126d1a in read () from /lib/libc.so.6 + Setting up the environment for debugging gdb. + Breakpoint 1 at 0x4dc6f8: file utils.c, line 471. + Breakpoint 2 at 0x4d87a4: file top.c, line 2609. + (top-gdb) info stack + #0 0x40126d1a in read () from /lib/libc.so.6 + #1 0x528f26 in rl_getc (stream=0x7ffffde8) at input.c:402 + #2 0x528ed0 in rl_read_key () at input.c:381 + #3 0x5167e6 in readline_internal_char () at readline.c:454 + #4 0x5168ee in readline_internal_charloop () at readline.c:507 + #5 0x51692c in readline_internal () at readline.c:521 + #6 0x5164fe in readline (prompt=0x7ffff810) + at readline.c:349 + #7 0x4d7a8a in command_line_input (prompt=0x564420 "(gdb) ", repeat=1, + annotation_suffix=0x4d6b44 "prompt") at top.c:2091 + #8 0x4d6cf0 in command_loop () at top.c:1345 + #9 0x4e25bc in main (argc=1, argv=0x7ffffdf4) at main.c:635 + + +LDD +=== +This is a program which lists the shared libraries which a library needs, +Note you also get the relocations of the shared library text segments which +help when using objdump --source. 
+ +e.g.:: + + ldd ./gdb + +outputs:: + + libncurses.so.4 => /usr/lib/libncurses.so.4 (0x40018000) + libm.so.6 => /lib/libm.so.6 (0x4005e000) + libc.so.6 => /lib/libc.so.6 (0x40084000) + /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000) + + +Debugging shared libraries +========================== +Most programs use shared libraries, however it can be very painful +when you single step instruction into a function like printf for the +first time & you end up in functions like _dl_runtime_resolve this is +the ld.so doing lazy binding, lazy binding is a concept in ELF where +shared library functions are not loaded into memory unless they are +actually used, great for saving memory but a pain to debug. + +To get around this either relink the program -static or exit gdb type +export LD_BIND_NOW=true this will stop lazy binding & restart the gdb'ing +the program in question. + + + +Debugging modules +================= +As modules are dynamically loaded into the kernel their address can be +anywhere to get around this use the -m option with insmod to emit a load +map which can be piped into a file if required. + +The proc file system +==================== +What is it ?. +It is a filesystem created by the kernel with files which are created on demand +by the kernel if read, or can be used to modify kernel parameters, +it is a powerful concept. + +e.g.:: + + cat /proc/sys/net/ipv4/ip_forward + +On my machine outputs:: + + 0 + +telling me ip_forwarding is not on to switch it on I can do:: + + echo 1 > /proc/sys/net/ipv4/ip_forward + +cat it again:: + + cat /proc/sys/net/ipv4/ip_forward + +On my machine now outputs:: + + 1 + +IP forwarding is on. + +There is a lot of useful info in here best found by going in and having a look +around, so I'll take you through some entries I consider important. 
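The cat and echo steps shown above for ip_forward are plain file reads and writes, so a program can do the same thing. The following is a minimal sketch under that assumption. The helper names read_proc_long and write_proc_long are invented for this example, and writing under /proc/sys normally requires root.

```c
#include <stdio.h>

/* Read a single integer from a /proc-style text file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
int read_proc_long(const char *path, long *value)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%ld", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Write a single integer followed by a newline, like "echo 1 > file".
 * Returns 0 on success, -1 on any open or write error. */
int write_proc_long(const char *path, long value)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fprintf(f, "%ld\n", value) > 0);
    if (fclose(f) != 0)   /* the kernel may report the error only on flush */
        ok = 0;
    return ok ? 0 : -1;
}
```

With these helpers, read_proc_long("/proc/sys/net/ipv4/ip_forward", &v) corresponds to the cat command and write_proc_long("/proc/sys/net/ipv4/ip_forward", 1) to the echo command.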
+ +All the processes running on the machine have their own entry defined by +/proc/ + +So lets have a look at the init process:: + + cd /proc/1 + cat cmdline + +emits:: + + init [2] + +:: + + cd /proc/1/fd + +This contains numerical entries of all the open files, +some of these you can cat e.g. stdout (2):: + + cat /proc/29/maps + +on my machine emits:: + + 00400000-00478000 r-xp 00000000 5f:00 4103 /bin/bash + 00478000-0047e000 rw-p 00077000 5f:00 4103 /bin/bash + 0047e000-00492000 rwxp 00000000 00:00 0 + 40000000-40015000 r-xp 00000000 5f:00 14382 /lib/ld-2.1.2.so + 40015000-40016000 rw-p 00014000 5f:00 14382 /lib/ld-2.1.2.so + 40016000-40017000 rwxp 00000000 00:00 0 + 40017000-40018000 rw-p 00000000 00:00 0 + 40018000-4001b000 r-xp 00000000 5f:00 14435 /lib/libtermcap.so.2.0.8 + 4001b000-4001c000 rw-p 00002000 5f:00 14435 /lib/libtermcap.so.2.0.8 + 4001c000-4010d000 r-xp 00000000 5f:00 14387 /lib/libc-2.1.2.so + 4010d000-40111000 rw-p 000f0000 5f:00 14387 /lib/libc-2.1.2.so + 40111000-40114000 rw-p 00000000 00:00 0 + 40114000-4011e000 r-xp 00000000 5f:00 14408 /lib/libnss_files-2.1.2.so + 4011e000-4011f000 rw-p 00009000 5f:00 14408 /lib/libnss_files-2.1.2.so + 7fffd000-80000000 rwxp ffffe000 00:00 0 + + +Showing us the shared libraries init uses where they are in memory +& memory access permissions for each virtual memory area. + +/proc/1/cwd is a softlink to the current working directory. + +/proc/1/root is the root of the filesystem for this process. + +/proc/1/mem is the current running processes memory which you +can read & write to like a file. + +strace uses this sometimes as it is a bit faster than the +rather inefficient ptrace interface for peeking at DATA. 
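Each line of the maps listing above has a fixed layout: address range, permissions, file offset, device, inode and an optional path. The sketch below shows one way to pick a line apart with sscanf. The struct and function names are hypothetical and exist only for this illustration, and the 255 character path limit is an arbitrary choice.

```c
#include <stdio.h>

/* Hypothetical holder for one parsed line of /proc/<pid>/maps. */
struct map_entry {
    unsigned long start;    /* first address of the area */
    unsigned long end;      /* address just past the area */
    char perms[5];          /* e.g. "r-xp" */
    unsigned long offset;   /* offset into the mapped file */
    char path[256];         /* empty for anonymous mappings */
};

/* Parse one maps line. Returns 0 on success, -1 if the line does not match. */
int parse_maps_line(const char *line, struct map_entry *e)
{
    unsigned int dev_major, dev_minor;
    unsigned long inode;
    int n;

    e->path[0] = '\0';
    n = sscanf(line, "%lx-%lx %4s %lx %x:%x %lu %255s",
               &e->start, &e->end, e->perms, &e->offset,
               &dev_major, &dev_minor, &inode, e->path);
    /* 7 fields for an anonymous mapping, 8 when a path is present */
    return n >= 7 ? 0 : -1;
}
```

Feeding it the first line of the listing yields start 0x00400000, end 0x00478000, permissions r-xp and the path /bin/bash. Subtracting start from end gives the size of the area, which is handy when working out which library an address belongs to.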
+ +:: + + cat status + + Name: init + State: S (sleeping) + Pid: 1 + PPid: 0 + Uid: 0 0 0 0 + Gid: 0 0 0 0 + Groups: + VmSize: 408 kB + VmLck: 0 kB + VmRSS: 208 kB + VmData: 24 kB + VmStk: 8 kB + VmExe: 368 kB + VmLib: 0 kB + SigPnd: 0000000000000000 + SigBlk: 0000000000000000 + SigIgn: 7fffffffd7f0d8fc + SigCgt: 00000000280b2603 + CapInh: 00000000fffffeff + CapPrm: 00000000ffffffff + CapEff: 00000000fffffeff + + User PSW: 070de000 80414146 + task: 004b6000 tss: 004b62d8 ksp: 004b7ca8 pt_regs: 004b7f68 + User GPRS: + 00000400 00000000 0000000b 7ffffa90 + 00000000 00000000 00000000 0045d9f4 + 0045cafc 7ffffa90 7fffff18 0045cb08 + 00010400 804039e8 80403af8 7ffff8b0 + User ACRS: + 00000000 00000000 00000000 00000000 + 00000001 00000000 00000000 00000000 + 00000000 00000000 00000000 00000000 + 00000000 00000000 00000000 00000000 + Kernel BackChain CallChain BackChain CallChain + 004b7ca8 8002bd0c 004b7d18 8002b92c + 004b7db8 8005cd50 004b7e38 8005d12a + 004b7f08 80019114 + +Showing among other things memory usage & status of some signals & +the processes'es registers from the kernel task_structure +as well as a backchain which may be useful if a process crashes +in the kernel for some unknown reason. + +Some driver debugging techniques +================================ +debug feature +------------- +Some of our drivers now support a "debug feature" in +/proc/s390dbf see s390dbf.txt in the linux/Documentation directory +for more info. + +e.g. +to switch on the lcs "debug feature":: + + echo 5 > /proc/s390dbf/lcs/level + +& then after the error occurred:: + + cat /proc/s390dbf/lcs/sprintf >/logfile + +the logfile now contains some information which may help +tech support resolve a problem in the field. + + + +high level debugging network drivers +------------------------------------ +ifconfig is a quite useful command +it gives the current state of network drivers. + +If you suspect your network device driver is dead +one way to check is type:: + + ifconfig + +e.g. 
tr0 + +You should see something like:: + + ifconfig tr0 + tr0 Link encap:16/4 Mbps Token Ring (New) HWaddr 00:04:AC:20:8E:48 + inet addr:9.164.185.132 Bcast:9.164.191.255 Mask:255.255.224.0 + UP BROADCAST RUNNING MULTICAST MTU:2000 Metric:1 + RX packets:246134 errors:0 dropped:0 overruns:0 frame:0 + TX packets:5 errors:0 dropped:0 overruns:0 carrier:0 + collisions:0 txqueuelen:100 + +if the device doesn't say up +try:: + + /etc/rc.d/init.d/network start + +( this starts the network stack & hopefully calls ifconfig tr0 up ). +ifconfig looks at the output of /proc/net/dev and presents it in a more +presentable form. + +Now ping the device from a machine in the same subnet. + +if the RX packets count & TX packets counts don't increment you probably +have problems. + +next:: + + cat /proc/net/arp + +Do you see any hardware addresses in the cache if not you may have problems. +Next try:: + + ping -c 5 + +i.e. the Bcast field above in the output of +ifconfig. Do you see any replies from machines other than the local machine +if not you may have problems. also if the TX packets count in ifconfig +hasn't incremented either you have serious problems in your driver +(e.g. the txbusy field of the network device being stuck on ) +or you may have multiple network devices connected. + + +chandev +------- +There is a new device layer for channel devices, some +drivers e.g. lcs are registered with this layer. + +If the device uses the channel device layer you'll be +able to find what interrupts it uses & the current state +of the device. + +See the manpage chandev.8 &type cat /proc/chandev for more info. + + +SysRq +===== +This is now supported by linux for s/390 & z/Architecture. + +To enable it do compile the kernel with:: + + Kernel Hacking -> Magic SysRq Key Enabled + +Then:: + + echo "1" > /proc/sys/kernel/sysrq + +also type:: + + echo "8" >/proc/sys/kernel/printk + +To make printk output go to console. 
+ +On 390 all commands are prefixed with:: + + ^- + +e.g.:: + + ^-t will show tasks. + ^-? or some unknown command will display help. + +The sysrq key reading is very picky ( I have to type the keys in an +xterm session & paste them into the x3270 console ) +& it may be wise to predefine the keys as described in the VM hints above + +This is particularly useful for syncing disks unmounting & rebooting +if the machine gets partially hung. + +Read Documentation/admin-guide/sysrq.rst for more info + +References: +=========== +- Enterprise Systems Architecture Reference Summary +- Enterprise Systems Architecture Principles of Operation +- Hartmut Penners s390 stack frame sheet. +- IBM Mainframe Channel Attachment a technology brief from a CISCO webpage +- Various bits of man & info pages of Linux. +- Linux & GDB source. +- Various info & man pages. +- CMS Help on tracing commands. +- Linux for s/390 Elf Application Binary Interface +- Linux for z/Series Elf Application Binary Interface ( Both Highly Recommended ) +- z/Architecture Principles of Operation SA22-7832-00 +- Enterprise Systems Architecture/390 Reference Summary SA22-7209-01 & the +- Enterprise Systems Architecture/390 Principles of Operation SA22-7201-05 + +Special Thanks +============== +Special thanks to Neale Ferguson who maintains a much +prettier HTML version of this page at +http://linuxvm.org/penguinvm/ +Bob Grainger Stefan Bader & others for reporting bugs diff --git a/Documentation/s390/driver-model.rst b/Documentation/s390/driver-model.rst new file mode 100644 index 000000000000..ad4bc2dbea43 --- /dev/null +++ b/Documentation/s390/driver-model.rst @@ -0,0 +1,328 @@ +============================= +S/390 driver model interfaces +============================= + +1. CCW devices +-------------- + +All devices which can be addressed by means of ccws are called 'CCW devices' - +even if they aren't actually driven by ccws. 
+ +All ccw devices are accessed via a subchannel, this is reflected in the +structures under devices/:: + + devices/ + - system/ + - css0/ + - 0.0.0000/0.0.0815/ + - 0.0.0001/0.0.4711/ + - 0.0.0002/ + - 0.1.0000/0.1.1234/ + ... + - defunct/ + +In this example, device 0815 is accessed via subchannel 0 in subchannel set 0, +device 4711 via subchannel 1 in subchannel set 0, and subchannel 2 is a non-I/O +subchannel. Device 1234 is accessed via subchannel 0 in subchannel set 1. + +The subchannel named 'defunct' does not represent any real subchannel on the +system; it is a pseudo subchannel where disconnected ccw devices are moved to +if they are displaced by another ccw device becoming operational on their +former subchannel. The ccw devices will be moved again to a proper subchannel +if they become operational again on that subchannel. + +You should address a ccw device via its bus id (e.g. 0.0.4711); the device can +be found under bus/ccw/devices/. + +All ccw devices export some data via sysfs. + +cutype: + The control unit type / model. + +devtype: + The device type / model, if applicable. + +availability: + Can be 'good' or 'boxed'; 'no path' or 'no device' for + disconnected devices. + +online: + An interface to set the device online and offline. + In the special case of the device being disconnected (see the + notify function under 1.2), piping 0 to online will forcibly delete + the device. + +The device drivers can add entries to export per-device data and interfaces. + +There is also some data exported on a per-subchannel basis (see under +bus/css/devices/): + +chpids: + Via which chpids the device is connected. + +pimpampom: + The path installed, path available and path operational masks. + +There also might be additional data, for example for block devices. + + +1.1 Bringing up a ccw device +---------------------------- + +This is done in several steps. + +a. Each driver can provide one or more parameter interfaces where parameters can + be specified. 
These interfaces are also in the driver's responsibility. +b. After a. has been performed, if necessary, the device is finally brought up + via the 'online' interface. + + +1.2 Writing a driver for ccw devices +------------------------------------ + +The basic struct ccw_device and struct ccw_driver data structures can be found +under include/asm/ccwdev.h:: + + struct ccw_device { + spinlock_t *ccwlock; + struct ccw_device_private *private; + struct ccw_device_id id; + + struct ccw_driver *drv; + struct device dev; + int online; + + void (*handler) (struct ccw_device *dev, unsigned long intparm, + struct irb *irb); + }; + + struct ccw_driver { + struct module *owner; + struct ccw_device_id *ids; + int (*probe) (struct ccw_device *); + int (*remove) (struct ccw_device *); + int (*set_online) (struct ccw_device *); + int (*set_offline) (struct ccw_device *); + int (*notify) (struct ccw_device *, int); + struct device_driver driver; + char *name; + }; + +The 'private' field contains data needed for internal i/o operation only, and +is not available to the device driver. + +Each driver should declare in a MODULE_DEVICE_TABLE into which CU types/models +and/or device types/models it is interested. This information can later be found +in the struct ccw_device_id fields:: + + struct ccw_device_id { + __u16 match_flags; + + __u16 cu_type; + __u16 dev_type; + __u8 cu_model; + __u8 dev_model; + + unsigned long driver_info; + }; + +The functions in ccw_driver should be used in the following way: + +probe: + This function is called by the device layer for each device the driver + is interested in. The driver should only allocate private structures + to put in dev->driver_data and create attributes (if needed). Also, + the interrupt handler (see below) should be set here. + +:: + + int (*probe) (struct ccw_device *cdev); + +Parameters: + cdev + - the device to be probed. 
+ + +remove: + This function is called by the device layer upon removal of the driver, + the device or the module. The driver should perform cleanups here. + +:: + + int (*remove) (struct ccw_device *cdev); + +Parameters: + cdev + - the device to be removed. + + +set_online: + This function is called by the common I/O layer when the device is + activated via the 'online' attribute. The driver should finally + set up and activate the device here. + +:: + + int (*set_online) (struct ccw_device *); + +Parameters: + cdev + - the device to be activated. The common layer has + verified that the device is not already online. + + +set_offline: This function is called by the common I/O layer when the device is + de-activated via the 'online' attribute. The driver should shut + down the device, but not de-allocate its private data. + +:: + + int (*set_offline) (struct ccw_device *); + +Parameters: + cdev + - the device to be deactivated. The common layer has + verified that the device is online. + + +notify: + This function is called by the common I/O layer for some state changes + of the device. + + Signalled to the driver are: + + * In online state, device detached (CIO_GONE) or last path gone + (CIO_NO_PATH). The driver must return !0 to keep the device; for + return code 0, the device will be deleted as usual (also when no + notify function is registered). If the driver wants to keep the + device, it is moved into disconnected state. + * In disconnected state, device operational again (CIO_OPER). The + common I/O layer performs some sanity checks on device number and + Device / CU to be reasonably sure that it is still the same device. + If not, the old device is removed and a new one registered. By the + return code of the notify function the device driver signals if it + wants the device back: !0 for keeping, 0 to have the device + removed and re-registered. + +:: + + int (*notify) (struct ccw_device *, int); + +Parameters: + cdev + - the device whose state changed.
+ + event + - the event that happened. This can be one of CIO_GONE, + CIO_NO_PATH or CIO_OPER. + +The handler field of the struct ccw_device is meant to be set to the interrupt +handler for the device. In order to accommodate drivers which use several +distinct handlers (e.g. multi subchannel devices), this is a member of ccw_device +instead of ccw_driver. +The handler is registered with the common layer during set_online() processing +before the driver is called, and is deregistered during set_offline() after the +driver has been called. Also, after registering / before deregistering, path +grouping or, respectively, disbanding of the path group (if applicable) is performed. + +:: + + void (*handler) (struct ccw_device *dev, unsigned long intparm, struct irb *irb); + +Parameters: dev - the device the handler is called for + intparm - the intparm which allows the device driver to identify + the i/o the interrupt is associated with, or to recognize + the interrupt as unsolicited. + irb - interruption response block which contains the accumulated + status. + +The device driver is called from the common ccw_device layer and can retrieve +information about the interrupt from the irb parameter. + + +1.3 ccwgroup devices +-------------------- + +The ccwgroup mechanism is designed to handle devices consisting of multiple ccw +devices, like lcs or ctc. + +The ccw driver provides a 'group' attribute. Piping bus ids of ccw devices to +this attribute creates a ccwgroup device consisting of these ccw devices (if +possible). This ccwgroup device can be set online or offline just like a normal +ccw device. + +Each ccwgroup device also provides an 'ungroup' attribute to destroy the device +again (only when offline). This is a generic ccwgroup mechanism (the driver does +not need to implement anything beyond normal removal routines). + +A ccw device which is a member of a ccwgroup device carries a pointer to the +ccwgroup device in the driver_data of its device struct.
This field must not be +touched by the driver - it should use the ccwgroup device's driver_data for its +private data. + +To implement a ccwgroup driver, please refer to include/asm/ccwgroup.h. Keep in +mind that most drivers will need to implement both a ccwgroup and a ccw +driver. + + +2. Channel paths +----------------- + +Channel paths show up, like subchannels, under the channel subsystem root (css0) +and are called 'chp0.'. They have no driver and do not belong to any bus. +Please note, that unlike /proc/chpids in 2.4, the channel path objects reflect +only the logical state and not the physical state, since we cannot track the +latter consistently due to lacking machine support (we don't need to be aware +of it anyway). + +status + - Can be 'online' or 'offline'. + Piping 'on' or 'off' sets the chpid logically online/offline. + Piping 'on' to an online chpid triggers path reprobing for all devices + the chpid connects to. This can be used to force the kernel to re-use + a channel path the user knows to be online, but the machine hasn't + created a machine check for. + +type + - The physical type of the channel path. + +shared + - Whether the channel path is shared. + +cmg + - The channel measurement group. + +3. System devices +----------------- + +3.1 xpram +--------- + +xpram shows up under devices/system/ as 'xpram'. + +3.2 cpus +-------- + +For each cpu, a directory is created under devices/system/cpu/. Each cpu has an +attribute 'online' which can be 0 or 1. + + +4. Other devices +---------------- + +4.1 Netiucv +----------- + +The netiucv driver creates an attribute 'connection' under +bus/iucv/drivers/netiucv. Piping to this attribute creates a new netiucv +connection to the specified host. + +Netiucv connections show up under devices/iucv/ as "netiucv". The interface +number is assigned sequentially to the connections defined via the 'connection' +attribute. + +user + - shows the connection partner. + +buffer + - maximum buffer size. 
Pipe to it to change buffer size. diff --git a/Documentation/s390/driver-model.txt b/Documentation/s390/driver-model.txt deleted file mode 100644 index ed265cf54cde..000000000000 --- a/Documentation/s390/driver-model.txt +++ /dev/null @@ -1,287 +0,0 @@ -S/390 driver model interfaces ------------------------------ - -1. CCW devices --------------- - -All devices which can be addressed by means of ccws are called 'CCW devices' - -even if they aren't actually driven by ccws. - -All ccw devices are accessed via a subchannel, this is reflected in the -structures under devices/: - -devices/ - - system/ - - css0/ - - 0.0.0000/0.0.0815/ - - 0.0.0001/0.0.4711/ - - 0.0.0002/ - - 0.1.0000/0.1.1234/ - ... - - defunct/ - -In this example, device 0815 is accessed via subchannel 0 in subchannel set 0, -device 4711 via subchannel 1 in subchannel set 0, and subchannel 2 is a non-I/O -subchannel. Device 1234 is accessed via subchannel 0 in subchannel set 1. - -The subchannel named 'defunct' does not represent any real subchannel on the -system; it is a pseudo subchannel where disconnected ccw devices are moved to -if they are displaced by another ccw device becoming operational on their -former subchannel. The ccw devices will be moved again to a proper subchannel -if they become operational again on that subchannel. - -You should address a ccw device via its bus id (e.g. 0.0.4711); the device can -be found under bus/ccw/devices/. - -All ccw devices export some data via sysfs. - -cutype: The control unit type / model. - -devtype: The device type / model, if applicable. - -availability: Can be 'good' or 'boxed'; 'no path' or 'no device' for - disconnected devices. - -online: An interface to set the device online and offline. - In the special case of the device being disconnected (see the - notify function under 1.2), piping 0 to online will forcibly delete - the device. - -The device drivers can add entries to export per-device data and interfaces. 
- -There is also some data exported on a per-subchannel basis (see under -bus/css/devices/): - -chpids: Via which chpids the device is connected. - -pimpampom: The path installed, path available and path operational masks. - -There also might be additional data, for example for block devices. - - -1.1 Bringing up a ccw device ----------------------------- - -This is done in several steps. - -a. Each driver can provide one or more parameter interfaces where parameters can - be specified. These interfaces are also in the driver's responsibility. -b. After a. has been performed, if necessary, the device is finally brought up - via the 'online' interface. - - -1.2 Writing a driver for ccw devices ------------------------------------- - -The basic struct ccw_device and struct ccw_driver data structures can be found -under include/asm/ccwdev.h. - -struct ccw_device { - spinlock_t *ccwlock; - struct ccw_device_private *private; - struct ccw_device_id id; - - struct ccw_driver *drv; - struct device dev; - int online; - - void (*handler) (struct ccw_device *dev, unsigned long intparm, - struct irb *irb); -}; - -struct ccw_driver { - struct module *owner; - struct ccw_device_id *ids; - int (*probe) (struct ccw_device *); - int (*remove) (struct ccw_device *); - int (*set_online) (struct ccw_device *); - int (*set_offline) (struct ccw_device *); - int (*notify) (struct ccw_device *, int); - struct device_driver driver; - char *name; -}; - -The 'private' field contains data needed for internal i/o operation only, and -is not available to the device driver. - -Each driver should declare in a MODULE_DEVICE_TABLE into which CU types/models -and/or device types/models it is interested. 
This information can later be found -in the struct ccw_device_id fields: - -struct ccw_device_id { - __u16 match_flags; - - __u16 cu_type; - __u16 dev_type; - __u8 cu_model; - __u8 dev_model; - - unsigned long driver_info; -}; - -The functions in ccw_driver should be used in the following way: -probe: This function is called by the device layer for each device the driver - is interested in. The driver should only allocate private structures - to put in dev->driver_data and create attributes (if needed). Also, - the interrupt handler (see below) should be set here. - -int (*probe) (struct ccw_device *cdev); - -Parameters: cdev - the device to be probed. - - -remove: This function is called by the device layer upon removal of the driver, - the device or the module. The driver should perform cleanups here. - -int (*remove) (struct ccw_device *cdev); - -Parameters: cdev - the device to be removed. - - -set_online: This function is called by the common I/O layer when the device is - activated via the 'online' attribute. The driver should finally - setup and activate the device here. - -int (*set_online) (struct ccw_device *); - -Parameters: cdev - the device to be activated. The common layer has - verified that the device is not already online. - - -set_offline: This function is called by the common I/O layer when the device is - de-activated via the 'online' attribute. The driver should shut - down the device, but not de-allocate its private data. - -int (*set_offline) (struct ccw_device *); - -Parameters: cdev - the device to be deactivated. The common layer has - verified that the device is online. - - -notify: This function is called by the common I/O layer for some state changes - of the device. - Signalled to the driver are: - * In online state, device detached (CIO_GONE) or last path gone - (CIO_NO_PATH). The driver must return !0 to keep the device; for - return code 0, the device will be deleted as usual (also when no - notify function is registered). 
If the driver wants to keep the - device, it is moved into disconnected state. - * In disconnected state, device operational again (CIO_OPER). The - common I/O layer performs some sanity checks on device number and - Device / CU to be reasonably sure if it is still the same device. - If not, the old device is removed and a new one registered. By the - return code of the notify function the device driver signals if it - wants the device back: !0 for keeping, 0 to make the device being - removed and re-registered. - -int (*notify) (struct ccw_device *, int); - -Parameters: cdev - the device whose state changed. - event - the event that happened. This can be one of CIO_GONE, - CIO_NO_PATH or CIO_OPER. - -The handler field of the struct ccw_device is meant to be set to the interrupt -handler for the device. In order to accommodate drivers which use several -distinct handlers (e.g. multi subchannel devices), this is a member of ccw_device -instead of ccw_driver. -The handler is registered with the common layer during set_online() processing -before the driver is called, and is deregistered during set_offline() after the -driver has been called. Also, after registering / before deregistering, path -grouping resp. disbanding of the path group (if applicable) are performed. - -void (*handler) (struct ccw_device *dev, unsigned long intparm, struct irb *irb); - -Parameters: dev - the device the handler is called for - intparm - the intparm which allows the device driver to identify - the i/o the interrupt is associated with, or to recognize - the interrupt as unsolicited. - irb - interruption response block which contains the accumulated - status. - -The device driver is called from the common ccw_device layer and can retrieve -information about the interrupt from the irb parameter. - - -1.3 ccwgroup devices --------------------- - -The ccwgroup mechanism is designed to handle devices consisting of multiple ccw -devices, like lcs or ctc. 
- -The ccw driver provides a 'group' attribute. Piping bus ids of ccw devices to -this attributes creates a ccwgroup device consisting of these ccw devices (if -possible). This ccwgroup device can be set online or offline just like a normal -ccw device. - -Each ccwgroup device also provides an 'ungroup' attribute to destroy the device -again (only when offline). This is a generic ccwgroup mechanism (the driver does -not need to implement anything beyond normal removal routines). - -A ccw device which is a member of a ccwgroup device carries a pointer to the -ccwgroup device in the driver_data of its device struct. This field must not be -touched by the driver - it should use the ccwgroup device's driver_data for its -private data. - -To implement a ccwgroup driver, please refer to include/asm/ccwgroup.h. Keep in -mind that most drivers will need to implement both a ccwgroup and a ccw -driver. - - -2. Channel paths ------------------ - -Channel paths show up, like subchannels, under the channel subsystem root (css0) -and are called 'chp0.'. They have no driver and do not belong to any bus. -Please note, that unlike /proc/chpids in 2.4, the channel path objects reflect -only the logical state and not the physical state, since we cannot track the -latter consistently due to lacking machine support (we don't need to be aware -of it anyway). - -status - Can be 'online' or 'offline'. - Piping 'on' or 'off' sets the chpid logically online/offline. - Piping 'on' to an online chpid triggers path reprobing for all devices - the chpid connects to. This can be used to force the kernel to re-use - a channel path the user knows to be online, but the machine hasn't - created a machine check for. - -type - The physical type of the channel path. - -shared - Whether the channel path is shared. - -cmg - The channel measurement group. - -3. System devices ------------------ - -3.1 xpram ---------- - -xpram shows up under devices/system/ as 'xpram'. 
- -3.2 cpus --------- - -For each cpu, a directory is created under devices/system/cpu/. Each cpu has an -attribute 'online' which can be 0 or 1. - - -4. Other devices ----------------- - -4.1 Netiucv ------------ - -The netiucv driver creates an attribute 'connection' under -bus/iucv/drivers/netiucv. Piping to this attribute creates a new netiucv -connection to the specified host. - -Netiucv connections show up under devices/iucv/ as "netiucv". The interface -number is assigned sequentially to the connections defined via the 'connection' -attribute. - -user - shows the connection partner. - -buffer - maximum buffer size. - Pipe to it to change buffer size. - - diff --git a/Documentation/s390/index.rst b/Documentation/s390/index.rst new file mode 100644 index 000000000000..1a914da2a07b --- /dev/null +++ b/Documentation/s390/index.rst @@ -0,0 +1,30 @@ +:orphan: + +================= +s390 Architecture +================= + +.. toctree:: + :maxdepth: 1 + + cds + 3270 + debugging390 + driver-model + monreader + qeth + s390dbf + vfio-ap + vfio-ccw + zfcpdump + dasd + common_io + + text_files + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/s390/monreader.rst b/Documentation/s390/monreader.rst new file mode 100644 index 000000000000..1e857575c113 --- /dev/null +++ b/Documentation/s390/monreader.rst @@ -0,0 +1,212 @@ +================================================= +Linux API for read access to z/VM Monitor Records +================================================= + +Date : 2004-Nov-26 + +Author: Gerald Schaefer (geraldsc@de.ibm.com) + + + + +Description +=========== +This item delivers a new Linux API in the form of a misc char device that is +usable from user space and allows read access to the z/VM Monitor Records +collected by the `*MONITOR` System Service of z/VM. 
+ + +User Requirements +================= +The z/VM guest on which you want to access this API needs to be configured in +order to allow IUCV connections to the `*MONITOR` service, i.e. it needs the +IUCV `*MONITOR` statement in its user entry. If the monitor DCSS to be used is +restricted (likely), you also need the NAMESAVE statement. +This item will use the IUCV device driver to access the z/VM services, so you +need a kernel with IUCV support. You also need z/VM version 4.4 or 5.1. + +There are two options for being able to load the monitor DCSS (examples assume +that the monitor DCSS begins at 144 MB and ends at 152 MB). You can query the +location of the monitor DCSS with the Class E privileged CP command Q NSS MAP +(the values BEGPAG and ENDPAG are given in units of 4K pages). + +See also "CP Command and Utility Reference" (SC24-6081-00) for more information +on the DEF STOR and Q NSS MAP commands, as well as "Saved Segments Planning +and Administration" (SC24-6116-00) for more information on DCSSes. + +1st option: +----------- +You can use the CP command DEF STOR CONFIG to define a "memory hole" in your +guest virtual storage around the address range of the DCSS. + +Example: DEF STOR CONFIG 0.140M 200M.200M + +This defines two blocks of storage: the first is 140MB in size and begins at +address 0MB, the second is 200MB in size and begins at address 200MB, +resulting in a total storage of 340MB. Note that the first block should +always start at 0 and be at least 64MB in size. + +2nd option: +----------- +Your guest virtual storage has to end below the starting address of the DCSS +and you have to specify the "mem=" kernel parameter in your parmfile with a +value greater than the ending address of the DCSS. + +Example:: + + DEF STOR 140M + +This defines 140MB storage size for your guest; the parameter "mem=160M" is +added to the parmfile.
+ + +User Interface +============== +The char device is implemented as a kernel module named "monreader", +which can be loaded via the modprobe command, or it can be compiled into the +kernel instead. There is one optional module (or kernel) parameter, "mondcss", +to specify the name of the monitor DCSS. If the module is compiled into the +kernel, the kernel parameter "monreader.mondcss=" can be specified +in the parmfile. + +The default name for the DCSS is "MONDCSS" if none is specified. In case that +there are other users already connected to the `*MONITOR` service (e.g. +Performance Toolkit), the monitor DCSS is already defined and you have to use +the same DCSS. The CP command Q MONITOR (Class E privileged) shows the name +of the monitor DCSS, if already defined, and the users connected to the +`*MONITOR` service. +Refer to the "z/VM Performance" book (SC24-6109-00) on how to create a monitor +DCSS if your z/VM doesn't have one already, you need Class E privileges to +define and save a DCSS. + +Example: +-------- + +:: + + modprobe monreader mondcss=MYDCSS + +This loads the module and sets the DCSS name to "MYDCSS". + +NOTE: +----- +This API provides no interface to control the `*MONITOR` service, e.g. specify +which data should be collected. This can be done by the CP command MONITOR +(Class E privileged), see "CP Command and Utility Reference". + +Device nodes with udev: +----------------------- +After loading the module, a char device will be created along with the device +node //monreader. + +Device nodes without udev: +-------------------------- +If your distribution does not support udev, a device node will not be created +automatically and you have to create it manually after loading the module. +Therefore you need to know the major and minor numbers of the device. These +numbers can be found in /sys/class/misc/monreader/dev. + +Typing cat /sys/class/misc/monreader/dev will give an output of the form +:. 
The device node can be created via the mknod command, enter +mknod c , where is the name of the device node +to be created. + +Example: +-------- + +:: + + # modprobe monreader + # cat /sys/class/misc/monreader/dev + 10:63 + # mknod /dev/monreader c 10 63 + +This loads the module with the default monitor DCSS (MONDCSS) and creates a +device node. + +File operations: +---------------- +The following file operations are supported: open, release, read, poll. +There are two alternative methods for reading: either non-blocking read in +conjunction with polling, or blocking read without polling. IOCTLs are not +supported. + +Read: +----- +Reading from the device provides a 12 Byte monitor control element (MCE), +followed by a set of one or more contiguous monitor records (similar to the +output of the CMS utility MONWRITE without the 4K control blocks). The MCE +contains information on the type of the following record set (sample/event +data), the monitor domains contained within it and the start and end address +of the record set in the monitor DCSS. The start and end address can be used +to determine the size of the record set, the end address is the address of the +last byte of data. The start address is needed to handle "end-of-frame" records +correctly (domain 1, record 13), i.e. it can be used to determine the record +start offset relative to a 4K page (frame) boundary. + +See "Appendix A: `*MONITOR`" in the "z/VM Performance" document for a description +of the monitor control element layout. The layout of the monitor records can +be found here (z/VM 5.1): http://www.vm.ibm.com/pubs/mon510/index.html + +The layout of the data stream provided by the monreader device is as follows:: + + ... + <0 byte read> + \ + | + ... |- data set + | + / + <0 byte read> + ... + +There may be more than one combination of MCE and corresponding record set +within one data set and the end of each data set is indicated by a successful +read with a return value of 0 (0 byte read). 
+Any received data must be considered invalid until a complete set was +read successfully, including the closing 0 byte read. Therefore you should +always read the complete set into a buffer before processing the data. + +The maximum size of a data set can be as large as the size of the +monitor DCSS, so design the buffer adequately or use dynamic memory allocation. +The size of the monitor DCSS will be printed into syslog after loading the +module. You can also use the (Class E privileged) CP command Q NSS MAP to +list all available segments and information about them. + +As with most char devices, error conditions are indicated by returning a +negative value for the number of bytes read. In this case, the errno variable +indicates the error condition: + +EIO: + reply failed, read data is invalid and the application + should discard the data read since the last successful read with 0 size. +EFAULT: + copy_to_user failed, read data is invalid and the application should + discard the data read since the last successful read with 0 size. +EAGAIN: + occurs on a non-blocking read if there is no data available at the + moment. There is no data missing or corrupted, just try again or rather + use polling for non-blocking reads. +EOVERFLOW: + message limit reached, the data read since the last successful + read with 0 size is valid but subsequent records may be missing. + +In the last case (EOVERFLOW) there may be missing data, in the first two cases +(EIO, EFAULT) there will be missing data. It's up to the application if it will +continue reading subsequent data or rather exit. + +Open: +----- +Only one user is allowed to open the char device. If it is already in use, the +open function will fail (return a negative value) and set errno to EBUSY. +The open function may also fail if an IUCV connection to the `*MONITOR` service +cannot be established. In this case errno will be set to EIO and an error +message with an IPUSER SEVER code will be printed into syslog. 
The IPUSER SEVER +codes are described in the "z/VM Performance" book, Appendix A. + +NOTE: +----- +As soon as the device is opened, incoming messages will be accepted and they +will account for the message limit, i.e. opening the device without reading +from it will provoke the "message limit reached" error (EOVERFLOW error code) +eventually. diff --git a/Documentation/s390/monreader.txt b/Documentation/s390/monreader.txt deleted file mode 100644 index d3729585fdb0..000000000000 --- a/Documentation/s390/monreader.txt +++ /dev/null @@ -1,197 +0,0 @@ - -Date : 2004-Nov-26 -Author: Gerald Schaefer (geraldsc@de.ibm.com) - - - Linux API for read access to z/VM Monitor Records - ================================================= - - -Description -=========== -This item delivers a new Linux API in the form of a misc char device that is -usable from user space and allows read access to the z/VM Monitor Records -collected by the *MONITOR System Service of z/VM. - - -User Requirements -================= -The z/VM guest on which you want to access this API needs to be configured in -order to allow IUCV connections to the *MONITOR service, i.e. it needs the -IUCV *MONITOR statement in its user entry. If the monitor DCSS to be used is -restricted (likely), you also need the NAMESAVE statement. -This item will use the IUCV device driver to access the z/VM services, so you -need a kernel with IUCV support. You also need z/VM version 4.4 or 5.1. - -There are two options for being able to load the monitor DCSS (examples assume -that the monitor DCSS begins at 144 MB and ends at 152 MB). You can query the -location of the monitor DCSS with the Class E privileged CP command Q NSS MAP -(the values BEGPAG and ENDPAG are given in units of 4K pages). - -See also "CP Command and Utility Reference" (SC24-6081-00) for more information -on the DEF STOR and Q NSS MAP commands, as well as "Saved Segments Planning -and Administration" (SC24-6116-00) for more information on DCSSes. 
- -1st option: ------------ -You can use the CP command DEF STOR CONFIG to define a "memory hole" in your -guest virtual storage around the address range of the DCSS. - -Example: DEF STOR CONFIG 0.140M 200M.200M - -This defines two blocks of storage, the first is 140MB in size an begins at -address 0MB, the second is 200MB in size and begins at address 200MB, -resulting in a total storage of 340MB. Note that the first block should -always start at 0 and be at least 64MB in size. - -2nd option: ------------ -Your guest virtual storage has to end below the starting address of the DCSS -and you have to specify the "mem=" kernel parameter in your parmfile with a -value greater than the ending address of the DCSS. - -Example: DEF STOR 140M - -This defines 140MB storage size for your guest, the parameter "mem=160M" is -added to the parmfile. - - -User Interface -============== -The char device is implemented as a kernel module named "monreader", -which can be loaded via the modprobe command, or it can be compiled into the -kernel instead. There is one optional module (or kernel) parameter, "mondcss", -to specify the name of the monitor DCSS. If the module is compiled into the -kernel, the kernel parameter "monreader.mondcss=" can be specified -in the parmfile. - -The default name for the DCSS is "MONDCSS" if none is specified. In case that -there are other users already connected to the *MONITOR service (e.g. -Performance Toolkit), the monitor DCSS is already defined and you have to use -the same DCSS. The CP command Q MONITOR (Class E privileged) shows the name -of the monitor DCSS, if already defined, and the users connected to the -*MONITOR service. -Refer to the "z/VM Performance" book (SC24-6109-00) on how to create a monitor -DCSS if your z/VM doesn't have one already, you need Class E privileges to -define and save a DCSS. - -Example: --------- -modprobe monreader mondcss=MYDCSS - -This loads the module and sets the DCSS name to "MYDCSS". 
- -NOTE: ------ -This API provides no interface to control the *MONITOR service, e.g. specify -which data should be collected. This can be done by the CP command MONITOR -(Class E privileged), see "CP Command and Utility Reference". - -Device nodes with udev: ------------------------ -After loading the module, a char device will be created along with the device -node //monreader. - -Device nodes without udev: --------------------------- -If your distribution does not support udev, a device node will not be created -automatically and you have to create it manually after loading the module. -Therefore you need to know the major and minor numbers of the device. These -numbers can be found in /sys/class/misc/monreader/dev. -Typing cat /sys/class/misc/monreader/dev will give an output of the form -:. The device node can be created via the mknod command, enter -mknod c , where is the name of the device node -to be created. - -Example: --------- -# modprobe monreader -# cat /sys/class/misc/monreader/dev -10:63 -# mknod /dev/monreader c 10 63 - -This loads the module with the default monitor DCSS (MONDCSS) and creates a -device node. - -File operations: ----------------- -The following file operations are supported: open, release, read, poll. -There are two alternative methods for reading: either non-blocking read in -conjunction with polling, or blocking read without polling. IOCTLs are not -supported. - -Read: ------ -Reading from the device provides a 12 Byte monitor control element (MCE), -followed by a set of one or more contiguous monitor records (similar to the -output of the CMS utility MONWRITE without the 4K control blocks). The MCE -contains information on the type of the following record set (sample/event -data), the monitor domains contained within it and the start and end address -of the record set in the monitor DCSS. The start and end address can be used -to determine the size of the record set, the end address is the address of the -last byte of data. 
The start address is needed to handle "end-of-frame" records -correctly (domain 1, record 13), i.e. it can be used to determine the record -start offset relative to a 4K page (frame) boundary. - -See "Appendix A: *MONITOR" in the "z/VM Performance" document for a description -of the monitor control element layout. The layout of the monitor records can -be found here (z/VM 5.1): http://www.vm.ibm.com/pubs/mon510/index.html - -The layout of the data stream provided by the monreader device is as follows: -... -<0 byte read> - \ - | -... |- data set - | - / -<0 byte read> -... - -There may be more than one combination of MCE and corresponding record set -within one data set and the end of each data set is indicated by a successful -read with a return value of 0 (0 byte read). -Any received data must be considered invalid until a complete set was -read successfully, including the closing 0 byte read. Therefore you should -always read the complete set into a buffer before processing the data. - -The maximum size of a data set can be as large as the size of the -monitor DCSS, so design the buffer adequately or use dynamic memory allocation. -The size of the monitor DCSS will be printed into syslog after loading the -module. You can also use the (Class E privileged) CP command Q NSS MAP to -list all available segments and information about them. - -As with most char devices, error conditions are indicated by returning a -negative value for the number of bytes read. In this case, the errno variable -indicates the error condition: - -EIO: reply failed, read data is invalid and the application - should discard the data read since the last successful read with 0 size. -EFAULT: copy_to_user failed, read data is invalid and the application should - discard the data read since the last successful read with 0 size. -EAGAIN: occurs on a non-blocking read if there is no data available at the - moment. 
There is no data missing or corrupted, just try again or rather - use polling for non-blocking reads. -EOVERFLOW: message limit reached, the data read since the last successful - read with 0 size is valid but subsequent records may be missing. - -In the last case (EOVERFLOW) there may be missing data, in the first two cases -(EIO, EFAULT) there will be missing data. It's up to the application if it will -continue reading subsequent data or rather exit. - -Open: ------ -Only one user is allowed to open the char device. If it is already in use, the -open function will fail (return a negative value) and set errno to EBUSY. -The open function may also fail if an IUCV connection to the *MONITOR service -cannot be established. In this case errno will be set to EIO and an error -message with an IPUSER SEVER code will be printed into syslog. The IPUSER SEVER -codes are described in the "z/VM Performance" book, Appendix A. - -NOTE: ------ -As soon as the device is opened, incoming messages will be accepted and they -will account for the message limit, i.e. opening the device without reading -from it will provoke the "message limit reached" error (EOVERFLOW error code) -eventually. - diff --git a/Documentation/s390/qeth.rst b/Documentation/s390/qeth.rst new file mode 100644 index 000000000000..f02fdaa68de0 --- /dev/null +++ b/Documentation/s390/qeth.rst @@ -0,0 +1,64 @@ +============================= +IBM s390 QDIO Ethernet Driver +============================= + +OSA and HiperSockets Bridge Port Support +======================================== + +Uevents +------- + +To generate the events the device must be assigned a role of either +a primary or a secondary Bridge Port. For more information, see +"z/VM Connectivity, SC24-6174". + +When run on an OSA or HiperSockets Bridge Capable Port hardware, and the state +of some configured Bridge Port device on the channel changes, a udev +event with ACTION=CHANGE is emitted on behalf of the corresponding +ccwgroup device. 
The event has the following attributes: + +BRIDGEPORT=statechange + indicates that the Bridge Port device changed + its state. + +ROLE={primary|secondary|none} + the role assigned to the port. + +STATE={active|standby|inactive} + the newly assumed state of the port. + +When run on HiperSockets Bridge Capable Port hardware with host address +notifications enabled, a udev event with ACTION=CHANGE is emitted. +It is emitted on behalf of the corresponding ccwgroup device when a host +or a VLAN is registered or unregistered on the network served by the device. +The event has the following attributes: + +BRIDGEDHOST={reset|register|deregister|abort} + host address + notifications are started afresh, a new host or VLAN is registered or + deregistered on the Bridge Port HiperSockets channel, or address + notifications are aborted. + +VLAN=numeric-vlan-id + VLAN ID on which the event occurred. Not included + if no VLAN is involved in the event. + +MAC=xx:xx:xx:xx:xx:xx + MAC address of the host that is being registered + or deregistered from the HiperSockets channel. Not reported if the + event reports the creation or destruction of a VLAN. + +NTOK_BUSID=x.y.zzzz + device bus ID (CSSID, SSID and device number). + +NTOK_IID=xx + device IID. + +NTOK_CHPID=xx + device CHPID. + +NTOK_CHID=xxxx + device channel ID. + +Note that the `NTOK_*` attributes refer to devices other than the one +connected to the system on which the OS is running. diff --git a/Documentation/s390/qeth.txt b/Documentation/s390/qeth.txt deleted file mode 100644 index aa06fcf5f8c2..000000000000 --- a/Documentation/s390/qeth.txt +++ /dev/null @@ -1,50 +0,0 @@ -IBM s390 QDIO Ethernet Driver - -OSA and HiperSockets Bridge Port Support - -Uevents - -To generate the events the device must be assigned a role of either -a primary or a secondary Bridge Port. For more information, see -"z/VM Connectivity, SC24-6174". 
- -When run on an OSA or HiperSockets Bridge Capable Port hardware, and the state -of some configured Bridge Port device on the channel changes, a udev -event with ACTION=CHANGE is emitted on behalf of the corresponding -ccwgroup device. The event has the following attributes: - -BRIDGEPORT=statechange - indicates that the Bridge Port device changed - its state. - -ROLE={primary|secondary|none} - the role assigned to the port. - -STATE={active|standby|inactive} - the newly assumed state of the port. - -When run on HiperSockets Bridge Capable Port hardware with host address -notifications enabled, a udev event with ACTION=CHANGE is emitted. -It is emitted on behalf of the corresponding ccwgroup device when a host -or a VLAN is registered or unregistered on the network served by the device. -The event has the following attributes: - -BRIDGEDHOST={reset|register|deregister|abort} - host address - notifications are started afresh, a new host or VLAN is registered or - deregistered on the Bridge Port HiperSockets channel, or address - notifications are aborted. - -VLAN=numeric-vlan-id - VLAN ID on which the event occurred. Not included - if no VLAN is involved in the event. - -MAC=xx:xx:xx:xx:xx:xx - MAC address of the host that is being registered - or deregistered from the HiperSockets channel. Not reported if the - event reports the creation or destruction of a VLAN. - -NTOK_BUSID=x.y.zzzz - device bus ID (CSSID, SSID and device number). - -NTOK_IID=xx - device IID. - -NTOK_CHPID=xx - device CHPID. - -NTOK_CHID=xxxx - device channel ID. - -Note that the NTOK_* attributes refer to devices other than the one -connected to the system on which the OS is running. 
diff --git a/Documentation/s390/s390dbf.rst b/Documentation/s390/s390dbf.rst new file mode 100644 index 000000000000..ec2a1faa414b --- /dev/null +++ b/Documentation/s390/s390dbf.rst @@ -0,0 +1,803 @@ +================== +S390 Debug Feature +================== + +files: + - arch/s390/kernel/debug.c + - arch/s390/include/asm/debug.h + +Description: +------------ +The goal of this feature is to provide a kernel debug logging API +where log records can be stored efficiently in memory, where each component +(e.g. device drivers) can have one separate debug log. +One purpose of this is to inspect the debug logs after a production system crash +in order to analyze the reason for the crash. + +If the system still runs but only a subcomponent which uses dbf fails, +it is possible to look at the debug logs on a live system via the Linux +debugfs filesystem. + +The debug feature may also be very useful for kernel and driver development. + +Design: +------- +Kernel components (e.g. device drivers) can register themselves at the debug +feature with the function call debug_register(). This function initializes a +debug log for the caller. For each debug log exists a number of debug areas +where exactly one is active at one time. Each debug area consists of contiguous +pages in memory. In the debug areas there are stored debug entries (log records) +which are written by event- and exception-calls. + +An event-call writes the specified debug entry to the active debug +area and updates the log pointer for the active area. If the end +of the active debug area is reached, a wrap around is done (ring buffer) +and the next debug entry will be written at the beginning of the active +debug area. + +An exception-call writes the specified debug entry to the log and +switches to the next debug area. This is done in order to be sure +that the records which describe the origin of the exception are not +overwritten when a wrap around for the current area occurs.
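The event and exception behaviour described above can be modelled with a small user-space sketch. This is only an illustration under stated assumptions, not the kernel implementation: the names `toy_log`, `toy_event` and `toy_exception` and the two size constants are hypothetical, and each debug entry is reduced to a single int.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_AREAS   2   /* hypothetical: number of debug areas */
#define TOY_ENTRIES 3   /* hypothetical: entries that fit in one area */

/* Toy model of one debug log; not the kernel's debug_info_t. */
struct toy_log {
	int entries[TOY_AREAS][TOY_ENTRIES];
	size_t next[TOY_AREAS];   /* write position inside each area */
	size_t active;            /* index of the active area */
};

/* Event: write to the active area and wrap at its end (ring buffer). */
void toy_event(struct toy_log *log, int value)
{
	size_t a = log->active;

	assert(a < TOY_AREAS);
	log->entries[a][log->next[a]] = value;
	log->next[a] = (log->next[a] + 1) % TOY_ENTRIES;
}

/* Exception: write like an event, then switch to the next area. */
void toy_exception(struct toy_log *log, int value)
{
	toy_event(log, value);
	log->active = (log->active + 1) % TOY_AREAS;
}
```

In this model a burst of events after an exception can only overwrite entries in the newly active area, so the entries leading up to the exception stay intact until the areas themselves wrap around.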
+ +The debug areas themselves are also ordered in form of a ring buffer. +When an exception is thrown in the last debug area, the following debug +entries are then written again in the very first area. + +There are three versions for the event- and exception-calls: One for +logging raw data, one for text and one for numbers. + +Each debug entry contains the following data: + +- Timestamp +- Cpu-Number of calling task +- Level of debug entry (0...6) +- Return Address to caller +- Flag, if entry is an exception or not + +The debug logs can be inspected in a live system through entries in +the debugfs-filesystem. Under the toplevel directory "s390dbf" there is +a directory for each registered component, which is named like the +corresponding component. The debugfs normally should be mounted to +/sys/kernel/debug therefore the debug feature can be accessed under +/sys/kernel/debug/s390dbf. + +The content of the directories are files which represent different views +to the debug log. Each component can decide which views should be +used through registering them with the function debug_register_view(). +Predefined views for hex/ascii, sprintf and raw binary data are provided. +It is also possible to define other views. The content of +a view can be inspected simply by reading the corresponding debugfs file. + +All debug logs have an actual debug level (range from 0 to 6). +The default level is 3. Event and Exception functions have a 'level' +parameter. Only debug entries with a level that is lower or equal +than the actual level are written to the log. This means, when +writing events, high priority log entries should have a low level +value whereas low priority entries should have a high one. +The actual debug level can be changed with the help of the debugfs-filesystem +through writing a number string "x" to the 'level' debugfs file which is +provided for every debug log. Debugging can be switched off completely +by using "-" on the 'level' debugfs file. 
+ +Example:: + + > echo "-" > /sys/kernel/debug/s390dbf/dasd/level + +It is also possible to deactivate the debug feature globally for every +debug log. You can change the behavior using 2 sysctl parameters in +/proc/sys/s390dbf: + +There are currently 2 possible triggers, which stop the debug feature +globally. The first possibility is to use the "debug_active" sysctl. If +set to 1 the debug feature is running. If "debug_active" is set to 0 the +debug feature is turned off. + +The second trigger which stops the debug feature is a kernel oops. +That prevents the debug feature from overwriting debug information that +happened before the oops. After an oops you can reactivate the debug feature +by piping 1 to /proc/sys/s390dbf/debug_active. Nevertheless, it's not +suggested to use an oopsed kernel in a production environment. + +If you want to disallow the deactivation of the debug feature, you can use +the "debug_stoppable" sysctl. If you set "debug_stoppable" to 0 the debug +feature cannot be stopped. If the debug feature is already stopped, it +will stay deactivated. + +---------------------------------------------------------------------------- + +Kernel Interfaces: +------------------ + +:: + + debug_info_t *debug_register(char *name, int pages, int nr_areas, + int buf_size); + +Parameter: + name: + Name of debug log (e.g. used for debugfs entry) + pages: + Number of pages, which will be allocated per area + nr_areas: + Number of debug areas + buf_size: + Size of data area in each debug entry + +Return Value: + Handle for generated debug area + + NULL if register failed + +Description: Allocates memory for a debug log + Must not be called within an interrupt handler + +---------------------------------------------------------------------------- + +:: + + debug_info_t *debug_register_mode(char *name, int pages, int nr_areas, + int buf_size, mode_t mode, uid_t uid, + gid_t gid); + +Parameter: + name: + Name of debug log (e.g.
used for debugfs entry) + pages: + Number of pages, which will be allocated per area + nr_areas: + Number of debug areas + buf_size: + Size of data area in each debug entry + mode: + File mode for debugfs files. E.g. S_IRWXUGO + uid: + User ID for debugfs files. Currently only 0 is + supported. + gid: + Group ID for debugfs files. Currently only 0 is + supported. + +Return Value: + Handle for generated debug area + + NULL if register failed + +Description: + Allocates memory for a debug log + Must not be called within an interrupt handler + +--------------------------------------------------------------------------- + +:: + + void debug_unregister (debug_info_t * id); + +Parameter: + id: + handle for debug log + +Return Value: + none + +Description: + frees memory for a debug log and removes all registered debug + views. + + Must not be called within an interrupt handler + +--------------------------------------------------------------------------- + +:: + + void debug_set_level (debug_info_t * id, int new_level); + +Parameter: id: handle for debug log + new_level: new debug level + +Return Value: + none + +Description: + Sets new actual debug level if new_level is valid. + +--------------------------------------------------------------------------- + +:: + + bool debug_level_enabled (debug_info_t * id, int level); + +Parameter: + id: + handle for debug log + level: + debug level + +Return Value: + True if level is less or equal to the current debug level. + +Description: + Returns true if debug events for the specified level would be + logged. Otherwise returns false. + +--------------------------------------------------------------------------- + +:: + + void debug_stop_all(void); + +Parameter: + none + +Return Value: + none + +Description: + stops the debug feature if stopping is allowed. Currently + used in case of a kernel oops. 
+ +--------------------------------------------------------------------------- + +:: + + debug_entry_t* debug_event (debug_info_t* id, int level, void* data, + int length); + +Parameter: + id: + handle for debug log + level: + debug level + data: + pointer to data for debug entry + length: + length of data in bytes + +Return Value: + Address of written debug entry + +Description: + writes debug entry to active debug area (if level <= actual + debug level) + +--------------------------------------------------------------------------- + +:: + + debug_entry_t* debug_int_event (debug_info_t * id, int level, + unsigned int data); + debug_entry_t* debug_long_event(debug_info_t * id, int level, + unsigned long data); + +Parameter: + id: + handle for debug log + level: + debug level + data: + integer value for debug entry + +Return Value: + Address of written debug entry + +Description: + writes debug entry to active debug area (if level <= actual + debug level) + +--------------------------------------------------------------------------- + +:: + + debug_entry_t* debug_text_event (debug_info_t * id, int level, + const char* data); + +Parameter: + id: + handle for debug log + level: + debug level + data: + string for debug entry + +Return Value: + Address of written debug entry + +Description: + writes debug entry in ascii format to active debug area + (if level <= actual debug level) + +--------------------------------------------------------------------------- + +:: + + debug_entry_t* debug_sprintf_event (debug_info_t * id, int level, + char* string,...); + +Parameter: + id: + handle for debug log + level: + debug level + string: + format string for debug entry + ...: + varargs used as in sprintf() + +Return Value: Address of written debug entry + +Description: + writes debug entry with format string and varargs (longs) to + active debug area (if level $<=$ actual debug level). + floats and long long datatypes cannot be used as varargs. 
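Because the sprintf event functions store a pointer to the format string and one long per vararg, the data field size passed as buf_size to debug_register() has to cover one long more than the number of varargs. A minimal sketch of that arithmetic follows; the helper name `toy_sprintf_buf_size` is hypothetical and not part of the debug feature API.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: data field size needed for sprintf events that
 * take nr_varargs arguments, i.e. one long for the format string
 * pointer plus one long per vararg.
 */
size_t toy_sprintf_buf_size(size_t nr_varargs)
{
	assert(nr_varargs < 64);   /* arbitrary sanity bound for this sketch */
	return (1 + nr_varargs) * sizeof(long);
}
```

For a format string with two varargs this yields 3 * sizeof(long), the value used in the sprintf-view example of this document.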
+ +--------------------------------------------------------------------------- + +:: + + debug_entry_t* debug_exception (debug_info_t* id, int level, void* data, + int length); + +Parameter: + id: + handle for debug log + level: + debug level + data: + pointer to data for debug entry + length: + length of data in bytes + +Return Value: + Address of written debug entry + +Description: + writes debug entry to active debug area (if level <= actual + debug level) and switches to next debug area + +--------------------------------------------------------------------------- + +:: + + debug_entry_t* debug_int_exception (debug_info_t * id, int level, + unsigned int data); + debug_entry_t* debug_long_exception(debug_info_t * id, int level, + unsigned long data); + +Parameter: id: handle for debug log + level: debug level + data: integer value for debug entry + +Return Value: Address of written debug entry + +Description: writes debug entry to active debug area (if level <= actual + debug level) and switches to next debug area + +--------------------------------------------------------------------------- + +:: + + debug_entry_t* debug_text_exception (debug_info_t * id, int level, + const char* data); + +Parameter: id: handle for debug log + level: debug level + data: string for debug entry + +Return Value: Address of written debug entry + +Description: writes debug entry in ascii format to active debug area + (if level <= actual debug level) and switches to next debug + area + +--------------------------------------------------------------------------- + +:: + + debug_entry_t* debug_sprintf_exception (debug_info_t * id, int level, + char* string,...); + +Parameter: id: handle for debug log + level: debug level + string: format string for debug entry + ...: varargs used as in sprintf() + +Return Value: Address of written debug entry + +Description: writes debug entry with format string and varargs (longs) to + active debug area (if level $<=$ actual debug level) and + 
switches to next debug area. + floats and long long datatypes cannot be used as varargs. + +--------------------------------------------------------------------------- + +:: + + int debug_register_view (debug_info_t * id, struct debug_view *view); + +Parameter: id: handle for debug log + view: pointer to debug view struct + +Return Value: 0 : ok + < 0: Error + +Description: registers new debug view and creates debugfs dir entry + +--------------------------------------------------------------------------- + +:: + + int debug_unregister_view (debug_info_t * id, struct debug_view *view); + +Parameter: id: handle for debug log + view: pointer to debug view struct + +Return Value: 0 : ok + < 0: Error + +Description: unregisters debug view and removes debugfs dir entry + + + +Predefined views: +----------------- + +extern struct debug_view debug_hex_ascii_view; + +extern struct debug_view debug_raw_view; + +extern struct debug_view debug_sprintf_view; + +Examples +-------- + +:: + + /* + * hex_ascii- + raw-view Example + */ + + #include + #include + + static debug_info_t* debug_info; + + static int init(void) + { + /* register 4 debug areas with one page each and 4 byte data field */ + + debug_info = debug_register ("test", 1, 4, 4 ); + debug_register_view(debug_info,&debug_hex_ascii_view); + debug_register_view(debug_info,&debug_raw_view); + + debug_text_event(debug_info, 4 , "one "); + debug_int_exception(debug_info, 4, 4711); + debug_event(debug_info, 3, &debug_info, 4); + + return 0; + } + + static void cleanup(void) + { + debug_unregister (debug_info); + } + + module_init(init); + module_exit(cleanup); + +--------------------------------------------------------------------------- + +:: + + /* + * sprintf-view Example + */ + + #include + #include + + static debug_info_t* debug_info; + + static int init(void) + { + /* register 4 debug areas with one page each and data field for */ + /* format string pointer + 2 varargs (= 3 * sizeof(long)) */ + + debug_info = 
debug_register ("test", 1, 4, sizeof(long) * 3); + debug_register_view(debug_info,&debug_sprintf_view); + + debug_sprintf_event(debug_info, 2 , "first event in %s:%i\n",__FILE__,__LINE__); + debug_sprintf_exception(debug_info, 1, "pointer to debug info: %p\n",&debug_info); + + return 0; + } + + static void cleanup(void) + { + debug_unregister (debug_info); + } + + module_init(init); + module_exit(cleanup); + +Debugfs Interface +----------------- +Views to the debug logs can be investigated through reading the corresponding +debugfs-files: + +Example:: + + > ls /sys/kernel/debug/s390dbf/dasd + flush hex_ascii level pages raw + > cat /sys/kernel/debug/s390dbf/dasd/hex_ascii | sort -k2,2 -s + 00 00974733272:680099 2 - 02 0006ad7e 07 ea 4a 90 | .... + 00 00974733272:682210 2 - 02 0006ade6 46 52 45 45 | FREE + 00 00974733272:682213 2 - 02 0006adf6 07 ea 4a 90 | .... + 00 00974733272:682281 1 * 02 0006ab08 41 4c 4c 43 | EXCP + 01 00974733272:682284 2 - 02 0006ab16 45 43 4b 44 | ECKD + 01 00974733272:682287 2 - 02 0006ab28 00 00 00 04 | .... + 01 00974733272:682289 2 - 02 0006ab3e 00 00 00 20 | ... + 01 00974733272:682297 2 - 02 0006ad7e 07 ea 4a 90 | .... + 01 00974733272:684384 2 - 00 0006ade6 46 52 45 45 | FREE + 01 00974733272:684388 2 - 00 0006adf6 07 ea 4a 90 | .... + +See section about predefined views for explanation of the above output! + +Changing the debug level +------------------------ + +Example:: + + + > cat /sys/kernel/debug/s390dbf/dasd/level + 3 + > echo "5" > /sys/kernel/debug/s390dbf/dasd/level + > cat /sys/kernel/debug/s390dbf/dasd/level + 5 + +Flushing debug areas +-------------------- +Debug areas can be flushed with piping the number of the desired +area (0...n) to the debugfs file "flush". When using "-" all debug areas +are flushed. + +Examples: + +1. Flush debug area 0:: + + > echo "0" > /sys/kernel/debug/s390dbf/dasd/flush + +2. 
Flush all debug areas:: + + > echo "-" > /sys/kernel/debug/s390dbf/dasd/flush + +Changing the size of debug areas +------------------------------------ +It is possible to change the size of debug areas through piping +the number of pages to the debugfs file "pages". The resize request will +also flush the debug areas. + +Example: + +Define 4 pages for the debug areas of debug feature "dasd":: + + > echo "4" > /sys/kernel/debug/s390dbf/dasd/pages + +Stopping the debug feature +-------------------------- +Example: + +1. Check if stopping is allowed:: + + > cat /proc/sys/s390dbf/debug_stoppable + +2. Stop debug feature:: + + > echo 0 > /proc/sys/s390dbf/debug_active + +lcrash Interface +---------------- +It is planned that the dump analysis tool lcrash gets an additional command +'s390dbf' to display all the debug logs. With this tool it will be possible +to investigate the debug logs on a live system and with a memory dump after +a system crash. + +Investigating raw memory +------------------------ +One last possibility to investigate the debug logs at a live +system and after a system crash is to look at the raw memory +under VM or at the Service Element. +It is possible to find the anchor of the debug-logs through +the 'debug_area_first' symbol in the System map. Then one has +to follow the correct pointers of the data-structures defined +in debug.h and find the debug-areas in memory. +Normally modules which use the debug feature will also have +a global variable with the pointer to the debug-logs. Following +this pointer it will also be possible to find the debug logs in +memory. + +For this method it is recommended to use '16 * x + 4' byte (x = 0..n) +for the length of the data field in debug_register() in +order to see the debug entries well formatted. + + +Predefined Views +---------------- + +There are three predefined views: hex_ascii, raw and sprintf. +The hex_ascii view shows the data field in hex and ascii representation +(e.g. '45 43 4b 44 | ECKD').
+The raw view returns a bytestream as the debug areas are stored in memory. + +The sprintf view formats the debug entries in the same way as the sprintf +function would do. The sprintf event/exception functions write to the +debug entry a pointer to the format string (size = sizeof(long)) +and for each vararg a long value. So e.g. for a debug entry with a format +string plus two varargs one would need to allocate a (3 * sizeof(long)) +byte data area in the debug_register() function. + +IMPORTANT: + Using "%s" in sprintf event functions is dangerous. You can only + use "%s" in the sprintf event functions, if the memory for the passed string + is available as long as the debug feature exists. The reason behind this is + that due to performance considerations only a pointer to the string is stored + in the debug feature. If you log a string that is freed afterwards, you will + get an OOPS when inspecting the debug feature, because then the debug feature + will access the already freed memory. + +NOTE: + If using the sprintf view do NOT use other event/exception functions + than the sprintf-event and -exception functions. + +The format of the hex_ascii and sprintf view is as follows: + +- Number of area +- Timestamp (formatted as seconds and microseconds since 00:00:00 Coordinated + Universal Time (UTC), January 1, 1970) +- level of debug entry +- Exception flag (* = Exception) +- Cpu-Number of calling task +- Return Address to caller +- data field + +The format of the raw view is: + +- Header as described in debug.h +- datafield + +A typical line of the hex_ascii view will look like the following (first line +is only for explanation and will not be displayed when 'cating' the view): + +area time level exception cpu caller data (hex + ascii) +-------------------------------------------------------------------------- +00 00964419409:440690 1 - 00 88023fe + + +Defining views +-------------- + +Views are specified with the 'debug_view' structure. 
There are defined +callback functions which are used for reading and writing the debugfs files:: + + struct debug_view { + char name[DEBUG_MAX_PROCF_LEN]; + debug_prolog_proc_t* prolog_proc; + debug_header_proc_t* header_proc; + debug_format_proc_t* format_proc; + debug_input_proc_t* input_proc; + void* private_data; + }; + +where:: + + typedef int (debug_header_proc_t) (debug_info_t* id, + struct debug_view* view, + int area, + debug_entry_t* entry, + char* out_buf); + + typedef int (debug_format_proc_t) (debug_info_t* id, + struct debug_view* view, char* out_buf, + const char* in_buf); + typedef int (debug_prolog_proc_t) (debug_info_t* id, + struct debug_view* view, + char* out_buf); + typedef int (debug_input_proc_t) (debug_info_t* id, + struct debug_view* view, + struct file* file, const char* user_buf, + size_t in_buf_size, loff_t* offset); + + +The "private_data" member can be used as pointer to view specific data. +It is not used by the debug feature itself. + +The output when reading a debugfs file is structured like this:: + + "prolog_proc output" + + "header_proc output 1" "format_proc output 1" + "header_proc output 2" "format_proc output 2" + "header_proc output 3" "format_proc output 3" + ... + +When a view is read from the debugfs, the Debug Feature calls the +'prolog_proc' once for writing the prolog. +Then 'header_proc' and 'format_proc' are called for each +existing debug entry. + +The input_proc can be used to implement functionality when the view is +written to (e.g. like with 'echo "0" > /sys/kernel/debug/s390dbf/dasd/level'). + +For header_proc, the default function debug_dflt_header_fn(), +which is defined in debug.h, can be used. +It produces the same header output as the predefined views. +E.g:: + + 00 00964419409:440761 2 - 00 88023ec + +In order to see how to use the callback functions check the implementation +of the default views!
+ +Example:: + + #include + + #define UNKNOWNSTR "data: %08x" + + const char* messages[] = + {"This error...........\n", + "That error...........\n", + "Problem..............\n", + "Something went wrong.\n", + "Everything ok........\n", + NULL + }; + + static int debug_test_format_fn( + debug_info_t * id, struct debug_view *view, + char *out_buf, const char *in_buf + ) + { + int i, rc = 0; + + if(id->buf_size >= 4) { + int msg_nr = *((int*)in_buf); + if(msg_nr < sizeof(messages)/sizeof(char*) - 1) + rc += sprintf(out_buf, "%s", messages[msg_nr]); + else + rc += sprintf(out_buf, UNKNOWNSTR, msg_nr); + } + out: + return rc; + } + + struct debug_view debug_test_view = { + "myview", /* name of view */ + NULL, /* no prolog */ + &debug_dflt_header_fn, /* default header for each entry */ + &debug_test_format_fn, /* our own format function */ + NULL, /* no input function */ + NULL /* no private data */ + }; + +test: +===== + +:: + + debug_info_t *debug_info; + ... + debug_info = debug_register("test", 0, 4, 4); + debug_register_view(debug_info, &debug_test_view); + for(i = 0; i < 10; i ++) debug_int_event(debug_info, 1, i); + + > cat /sys/kernel/debug/s390dbf/test/myview + 00 00964419734:611402 1 - 00 88042ca This error........... + 00 00964419734:611405 1 - 00 88042ca That error........... + 00 00964419734:611408 1 - 00 88042ca Problem.............. + 00 00964419734:611411 1 - 00 88042ca Something went wrong. + 00 00964419734:611414 1 - 00 88042ca Everything ok........
+ 00 00964419734:611417 1 - 00 88042ca data: 00000005 + 00 00964419734:611419 1 - 00 88042ca data: 00000006 + 00 00964419734:611422 1 - 00 88042ca data: 00000007 + 00 00964419734:611425 1 - 00 88042ca data: 00000008 + 00 00964419734:611428 1 - 00 88042ca data: 00000009 diff --git a/Documentation/s390/s390dbf.txt b/Documentation/s390/s390dbf.txt deleted file mode 100644 index 61329fd62e89..000000000000 --- a/Documentation/s390/s390dbf.txt +++ /dev/null @@ -1,667 +0,0 @@ -S390 Debug Feature -================== - -files: arch/s390/kernel/debug.c - arch/s390/include/asm/debug.h - -Description: ------------- -The goal of this feature is to provide a kernel debug logging API -where log records can be stored efficiently in memory, where each component -(e.g. device drivers) can have one separate debug log. -One purpose of this is to inspect the debug logs after a production system crash -in order to analyze the reason for the crash. -If the system still runs but only a subcomponent which uses dbf fails, -it is possible to look at the debug logs on a live system via the Linux -debugfs filesystem. -The debug feature may also very useful for kernel and driver development. - -Design: -------- -Kernel components (e.g. device drivers) can register themselves at the debug -feature with the function call debug_register(). This function initializes a -debug log for the caller. For each debug log exists a number of debug areas -where exactly one is active at one time. Each debug area consists of contiguous -pages in memory. In the debug areas there are stored debug entries (log records) -which are written by event- and exception-calls. - -An event-call writes the specified debug entry to the active debug -area and updates the log pointer for the active area. If the end -of the active debug area is reached, a wrap around is done (ring buffer) -and the next debug entry will be written at the beginning of the active -debug area. 
- -An exception-call writes the specified debug entry to the log and -switches to the next debug area. This is done in order to be sure -that the records which describe the origin of the exception are not -overwritten when a wrap around for the current area occurs. - -The debug areas themselves are also ordered in form of a ring buffer. -When an exception is thrown in the last debug area, the following debug -entries are then written again in the very first area. - -There are three versions for the event- and exception-calls: One for -logging raw data, one for text and one for numbers. - -Each debug entry contains the following data: - -- Timestamp -- Cpu-Number of calling task -- Level of debug entry (0...6) -- Return Address to caller -- Flag, if entry is an exception or not - -The debug logs can be inspected in a live system through entries in -the debugfs-filesystem. Under the toplevel directory "s390dbf" there is -a directory for each registered component, which is named like the -corresponding component. The debugfs normally should be mounted to -/sys/kernel/debug therefore the debug feature can be accessed under -/sys/kernel/debug/s390dbf. - -The content of the directories are files which represent different views -to the debug log. Each component can decide which views should be -used through registering them with the function debug_register_view(). -Predefined views for hex/ascii, sprintf and raw binary data are provided. -It is also possible to define other views. The content of -a view can be inspected simply by reading the corresponding debugfs file. - -All debug logs have an actual debug level (range from 0 to 6). -The default level is 3. Event and Exception functions have a 'level' -parameter. Only debug entries with a level that is lower or equal -than the actual level are written to the log. This means, when -writing events, high priority log entries should have a low level -value whereas low priority entries should have a high one. 
-The actual debug level can be changed with the help of the debugfs-filesystem -through writing a number string "x" to the 'level' debugfs file which is -provided for every debug log. Debugging can be switched off completely -by using "-" on the 'level' debugfs file. - -Example: - -> echo "-" > /sys/kernel/debug/s390dbf/dasd/level - -It is also possible to deactivate the debug feature globally for every -debug log. You can change the behavior using 2 sysctl parameters in -/proc/sys/s390dbf: -There are currently 2 possible triggers, which stop the debug feature -globally. The first possibility is to use the "debug_active" sysctl. If -set to 1 the debug feature is running. If "debug_active" is set to 0 the -debug feature is turned off. -The second trigger which stops the debug feature is a kernel oops. -That prevents the debug feature from overwriting debug information that -happened before the oops. After an oops you can reactivate the debug feature -by piping 1 to /proc/sys/s390dbf/debug_active. Nevertheless, its not -suggested to use an oopsed kernel in a production environment. -If you want to disallow the deactivation of the debug feature, you can use -the "debug_stoppable" sysctl. If you set "debug_stoppable" to 0 the debug -feature cannot be stopped. If the debug feature is already stopped, it -will stay deactivated. - -Kernel Interfaces: ------------------- - ----------------------------------------------------------------------------- -debug_info_t *debug_register(char *name, int pages, int nr_areas, - int buf_size); - -Parameter: name: Name of debug log (e.g. 
used for debugfs entry) - pages: number of pages, which will be allocated per area - nr_areas: number of debug areas - buf_size: size of data area in each debug entry - -Return Value: Handle for generated debug area - NULL if register failed - -Description: Allocates memory for a debug log - Must not be called within an interrupt handler - ----------------------------------------------------------------------------- -debug_info_t *debug_register_mode(char *name, int pages, int nr_areas, - int buf_size, mode_t mode, uid_t uid, - gid_t gid); - -Parameter: name: Name of debug log (e.g. used for debugfs entry) - pages: Number of pages, which will be allocated per area - nr_areas: Number of debug areas - buf_size: Size of data area in each debug entry - mode: File mode for debugfs files. E.g. S_IRWXUGO - uid: User ID for debugfs files. Currently only 0 is - supported. - gid: Group ID for debugfs files. Currently only 0 is - supported. - -Return Value: Handle for generated debug area - NULL if register failed - -Description: Allocates memory for a debug log - Must not be called within an interrupt handler - ---------------------------------------------------------------------------- -void debug_unregister (debug_info_t * id); - -Parameter: id: handle for debug log - -Return Value: none - -Description: frees memory for a debug log and removes all registered debug - views. - Must not be called within an interrupt handler - ---------------------------------------------------------------------------- -void debug_set_level (debug_info_t * id, int new_level); - -Parameter: id: handle for debug log - new_level: new debug level - -Return Value: none - -Description: Sets new actual debug level if new_level is valid. 
- ---------------------------------------------------------------------------- -bool debug_level_enabled (debug_info_t * id, int level); - -Parameter: id: handle for debug log - level: debug level - -Return Value: True if level is less or equal to the current debug level. - -Description: Returns true if debug events for the specified level would be - logged. Otherwise returns false. ---------------------------------------------------------------------------- -void debug_stop_all(void); - -Parameter: none - -Return Value: none - -Description: stops the debug feature if stopping is allowed. Currently - used in case of a kernel oops. - ---------------------------------------------------------------------------- -debug_entry_t* debug_event (debug_info_t* id, int level, void* data, - int length); - -Parameter: id: handle for debug log - level: debug level - data: pointer to data for debug entry - length: length of data in bytes - -Return Value: Address of written debug entry - -Description: writes debug entry to active debug area (if level <= actual - debug level) - ---------------------------------------------------------------------------- -debug_entry_t* debug_int_event (debug_info_t * id, int level, - unsigned int data); -debug_entry_t* debug_long_event(debug_info_t * id, int level, - unsigned long data); - -Parameter: id: handle for debug log - level: debug level - data: integer value for debug entry - -Return Value: Address of written debug entry - -Description: writes debug entry to active debug area (if level <= actual - debug level) - ---------------------------------------------------------------------------- -debug_entry_t* debug_text_event (debug_info_t * id, int level, - const char* data); - -Parameter: id: handle for debug log - level: debug level - data: string for debug entry - -Return Value: Address of written debug entry - -Description: writes debug entry in ascii format to active debug area - (if level <= actual debug level) - 
---------------------------------------------------------------------------- -debug_entry_t* debug_sprintf_event (debug_info_t * id, int level, - char* string,...); - -Parameter: id: handle for debug log - level: debug level - string: format string for debug entry - ...: varargs used as in sprintf() - -Return Value: Address of written debug entry - -Description: writes debug entry with format string and varargs (longs) to - active debug area (if level $<=$ actual debug level). - floats and long long datatypes cannot be used as varargs. - ---------------------------------------------------------------------------- - -debug_entry_t* debug_exception (debug_info_t* id, int level, void* data, - int length); - -Parameter: id: handle for debug log - level: debug level - data: pointer to data for debug entry - length: length of data in bytes - -Return Value: Address of written debug entry - -Description: writes debug entry to active debug area (if level <= actual - debug level) and switches to next debug area - ---------------------------------------------------------------------------- -debug_entry_t* debug_int_exception (debug_info_t * id, int level, - unsigned int data); -debug_entry_t* debug_long_exception(debug_info_t * id, int level, - unsigned long data); - -Parameter: id: handle for debug log - level: debug level - data: integer value for debug entry - -Return Value: Address of written debug entry - -Description: writes debug entry to active debug area (if level <= actual - debug level) and switches to next debug area - ---------------------------------------------------------------------------- -debug_entry_t* debug_text_exception (debug_info_t * id, int level, - const char* data); - -Parameter: id: handle for debug log - level: debug level - data: string for debug entry - -Return Value: Address of written debug entry - -Description: writes debug entry in ascii format to active debug area - (if level <= actual debug level) and switches to next debug - area - 
---------------------------------------------------------------------------- -debug_entry_t* debug_sprintf_exception (debug_info_t * id, int level, - char* string,...); - -Parameter: id: handle for debug log - level: debug level - string: format string for debug entry - ...: varargs used as in sprintf() - -Return Value: Address of written debug entry - -Description: writes debug entry with format string and varargs (longs) to - active debug area (if level $<=$ actual debug level) and - switches to next debug area. - floats and long long datatypes cannot be used as varargs. - ---------------------------------------------------------------------------- - -int debug_register_view (debug_info_t * id, struct debug_view *view); - -Parameter: id: handle for debug log - view: pointer to debug view struct - -Return Value: 0 : ok - < 0: Error - -Description: registers new debug view and creates debugfs dir entry - ---------------------------------------------------------------------------- -int debug_unregister_view (debug_info_t * id, struct debug_view *view); - -Parameter: id: handle for debug log - view: pointer to debug view struct - -Return Value: 0 : ok - < 0: Error - -Description: unregisters debug view and removes debugfs dir entry - - - -Predefined views: ------------------ - -extern struct debug_view debug_hex_ascii_view; -extern struct debug_view debug_raw_view; -extern struct debug_view debug_sprintf_view; - -Examples --------- - -/* - * hex_ascii- + raw-view Example - */ - -#include -#include - -static debug_info_t* debug_info; - -static int init(void) -{ - /* register 4 debug areas with one page each and 4 byte data field */ - - debug_info = debug_register ("test", 1, 4, 4 ); - debug_register_view(debug_info,&debug_hex_ascii_view); - debug_register_view(debug_info,&debug_raw_view); - - debug_text_event(debug_info, 4 , "one "); - debug_int_exception(debug_info, 4, 4711); - debug_event(debug_info, 3, &debug_info, 4); - - return 0; -} - -static void cleanup(void) 
-{ - debug_unregister (debug_info); -} - -module_init(init); -module_exit(cleanup); - ---------------------------------------------------------------------------- - -/* - * sprintf-view Example - */ - -#include -#include - -static debug_info_t* debug_info; - -static int init(void) -{ - /* register 4 debug areas with one page each and data field for */ - /* format string pointer + 2 varargs (= 3 * sizeof(long)) */ - - debug_info = debug_register ("test", 1, 4, sizeof(long) * 3); - debug_register_view(debug_info,&debug_sprintf_view); - - debug_sprintf_event(debug_info, 2 , "first event in %s:%i\n",__FILE__,__LINE__); - debug_sprintf_exception(debug_info, 1, "pointer to debug info: %p\n",&debug_info); - - return 0; -} - -static void cleanup(void) -{ - debug_unregister (debug_info); -} - -module_init(init); -module_exit(cleanup); - - - -Debugfs Interface ----------------- -Views to the debug logs can be investigated through reading the corresponding -debugfs-files: - -Example: - -> ls /sys/kernel/debug/s390dbf/dasd -flush hex_ascii level pages raw -> cat /sys/kernel/debug/s390dbf/dasd/hex_ascii | sort -k2,2 -s -00 00974733272:680099 2 - 02 0006ad7e 07 ea 4a 90 | .... -00 00974733272:682210 2 - 02 0006ade6 46 52 45 45 | FREE -00 00974733272:682213 2 - 02 0006adf6 07 ea 4a 90 | .... -00 00974733272:682281 1 * 02 0006ab08 41 4c 4c 43 | EXCP -01 00974733272:682284 2 - 02 0006ab16 45 43 4b 44 | ECKD -01 00974733272:682287 2 - 02 0006ab28 00 00 00 04 | .... -01 00974733272:682289 2 - 02 0006ab3e 00 00 00 20 | ... -01 00974733272:682297 2 - 02 0006ad7e 07 ea 4a 90 | .... -01 00974733272:684384 2 - 00 0006ade6 46 52 45 45 | FREE -01 00974733272:684388 2 - 00 0006adf6 07 ea 4a 90 | .... - -See section about predefined views for explanation of the above output! 
- -Changing the debug level ------------------------- - -Example: - - -> cat /sys/kernel/debug/s390dbf/dasd/level -3 -> echo "5" > /sys/kernel/debug/s390dbf/dasd/level -> cat /sys/kernel/debug/s390dbf/dasd/level -5 - -Flushing debug areas --------------------- -Debug areas can be flushed with piping the number of the desired -area (0...n) to the debugfs file "flush". When using "-" all debug areas -are flushed. - -Examples: - -1. Flush debug area 0: -> echo "0" > /sys/kernel/debug/s390dbf/dasd/flush - -2. Flush all debug areas: -> echo "-" > /sys/kernel/debug/s390dbf/dasd/flush - -Changing the size of debug areas ------------------------------------- -It is possible the change the size of debug areas through piping -the number of pages to the debugfs file "pages". The resize request will -also flush the debug areas. - -Example: - -Define 4 pages for the debug areas of debug feature "dasd": -> echo "4" > /sys/kernel/debug/s390dbf/dasd/pages - -Stooping the debug feature --------------------------- -Example: - -1. Check if stopping is allowed -> cat /proc/sys/s390dbf/debug_stoppable -2. Stop debug feature -> echo 0 > /proc/sys/s390dbf/debug_active - -lcrash Interface ----------------- -It is planned that the dump analysis tool lcrash gets an additional command -'s390dbf' to display all the debug logs. With this tool it will be possible -to investigate the debug logs on a live system and with a memory dump after -a system crash. - -Investigating raw memory ------------------------- -One last possibility to investigate the debug logs at a live -system and after a system crash is to look at the raw memory -under VM or at the Service Element. -It is possible to find the anker of the debug-logs through -the 'debug_area_first' symbol in the System map. Then one has -to follow the correct pointers of the data-structures defined -in debug.h and find the debug-areas in memory. 
-Normally modules which use the debug feature will also have -a global variable with the pointer to the debug-logs. Following -this pointer it will also be possible to find the debug logs in -memory. - -For this method it is recommended to use '16 * x + 4' byte (x = 0..n) -for the length of the data field in debug_register() in -order to see the debug entries well formatted. - - -Predefined Views ----------------- - -There are three predefined views: hex_ascii, raw and sprintf. -The hex_ascii view shows the data field in hex and ascii representation -(e.g. '45 43 4b 44 | ECKD'). -The raw view returns a bytestream as the debug areas are stored in memory. - -The sprintf view formats the debug entries in the same way as the sprintf -function would do. The sprintf event/exception functions write to the -debug entry a pointer to the format string (size = sizeof(long)) -and for each vararg a long value. So e.g. for a debug entry with a format -string plus two varargs one would need to allocate a (3 * sizeof(long)) -byte data area in the debug_register() function. - -IMPORTANT: Using "%s" in sprintf event functions is dangerous. You can only -use "%s" in the sprintf event functions, if the memory for the passed string is -available as long as the debug feature exists. The reason behind this is that -due to performance considerations only a pointer to the string is stored in -the debug feature. If you log a string that is freed afterwards, you will get -an OOPS when inspecting the debug feature, because then the debug feature will -access the already freed memory. - -NOTE: If using the sprintf view do NOT use other event/exception functions -than the sprintf-event and -exception functions. 
- -The format of the hex_ascii and sprintf view is as follows: -- Number of area -- Timestamp (formatted as seconds and microseconds since 00:00:00 Coordinated - Universal Time (UTC), January 1, 1970) -- level of debug entry -- Exception flag (* = Exception) -- Cpu-Number of calling task -- Return Address to caller -- data field - -The format of the raw view is: -- Header as described in debug.h -- datafield - -A typical line of the hex_ascii view will look like the following (first line -is only for explanation and will not be displayed when 'cating' the view): - -area time level exception cpu caller data (hex + ascii) --------------------------------------------------------------------------- -00 00964419409:440690 1 - 00 88023fe - - -Defining views --------------- - -Views are specified with the 'debug_view' structure. There are defined -callback functions which are used for reading and writing the debugfs files: - -struct debug_view { - char name[DEBUG_MAX_PROCF_LEN]; - debug_prolog_proc_t* prolog_proc; - debug_header_proc_t* header_proc; - debug_format_proc_t* format_proc; - debug_input_proc_t* input_proc; - void* private_data; -}; - -where - -typedef int (debug_header_proc_t) (debug_info_t* id, - struct debug_view* view, - int area, - debug_entry_t* entry, - char* out_buf); - -typedef int (debug_format_proc_t) (debug_info_t* id, - struct debug_view* view, char* out_buf, - const char* in_buf); -typedef int (debug_prolog_proc_t) (debug_info_t* id, - struct debug_view* view, - char* out_buf); -typedef int (debug_input_proc_t) (debug_info_t* id, - struct debug_view* view, - struct file* file, const char* user_buf, - size_t in_buf_size, loff_t* offset); - - -The "private_data" member can be used as pointer to view specific data. -It is not used by the debug feature itself. 
- -The output when reading a debugfs file is structured like this: - -"prolog_proc output" - -"header_proc output 1" "format_proc output 1" -"header_proc output 2" "format_proc output 2" -"header_proc output 3" "format_proc output 3" -... - -When a view is read from the debugfs, the Debug Feature calls the -'prolog_proc' once for writing the prolog. -Then 'header_proc' and 'format_proc' are called for each -existing debug entry. - -The input_proc can be used to implement functionality when it is written to -the view (e.g. like with 'echo "0" > /sys/kernel/debug/s390dbf/dasd/level). - -For header_proc there can be used the default function -debug_dflt_header_fn() which is defined in debug.h. -and which produces the same header output as the predefined views. -E.g: -00 00964419409:440761 2 - 00 88023ec - -In order to see how to use the callback functions check the implementation -of the default views! - -Example - -#include - -#define UNKNOWNSTR "data: %08x" - -const char* messages[] = -{"This error...........\n", - "That error...........\n", - "Problem..............\n", - "Something went wrong.\n", - "Everything ok........\n", - NULL -}; - -static int debug_test_format_fn( - debug_info_t * id, struct debug_view *view, - char *out_buf, const char *in_buf -) -{ - int i, rc = 0; - - if(id->buf_size >= 4) { - int msg_nr = *((int*)in_buf); - if(msg_nr < sizeof(messages)/sizeof(char*) - 1) - rc += sprintf(out_buf, "%s", messages[msg_nr]); - else - rc += sprintf(out_buf, UNKNOWNSTR, msg_nr); - } - out: - return rc; -} - -struct debug_view debug_test_view = { - "myview", /* name of view */ - NULL, /* no prolog */ - &debug_dflt_header_fn, /* default header for each entry */ - &debug_test_format_fn, /* our own format function */ - NULL, /* no input function */ - NULL /* no private data */ -}; - -===== -test: -===== -debug_info_t *debug_info; -... 
-debug_info = debug_register ("test", 0, 4, 4 )); -debug_register_view(debug_info, &debug_test_view); -for(i = 0; i < 10; i ++) debug_int_event(debug_info, 1, i); - -> cat /sys/kernel/debug/s390dbf/test/myview -00 00964419734:611402 1 - 00 88042ca This error........... -00 00964419734:611405 1 - 00 88042ca That error........... -00 00964419734:611408 1 - 00 88042ca Problem.............. -00 00964419734:611411 1 - 00 88042ca Something went wrong. -00 00964419734:611414 1 - 00 88042ca Everything ok........ -00 00964419734:611417 1 - 00 88042ca data: 00000005 -00 00964419734:611419 1 - 00 88042ca data: 00000006 -00 00964419734:611422 1 - 00 88042ca data: 00000007 -00 00964419734:611425 1 - 00 88042ca data: 00000008 -00 00964419734:611428 1 - 00 88042ca data: 00000009 diff --git a/Documentation/s390/text_files.rst b/Documentation/s390/text_files.rst new file mode 100644 index 000000000000..c94d05d4fa17 --- /dev/null +++ b/Documentation/s390/text_files.rst @@ -0,0 +1,11 @@ +ibm 3270 changelog +------------------ + +.. include:: 3270.ChangeLog + :literal: + +ibm 3270 config3270.sh +---------------------- + +.. literalinclude:: config3270.sh + :language: shell diff --git a/Documentation/s390/vfio-ap.rst b/Documentation/s390/vfio-ap.rst new file mode 100644 index 000000000000..b5c51f7c748d --- /dev/null +++ b/Documentation/s390/vfio-ap.rst @@ -0,0 +1,866 @@ +=============================== +Adjunct Processor (AP) facility +=============================== + + +Introduction +============ +The Adjunct Processor (AP) facility is an IBM Z cryptographic facility comprised +of three AP instructions and from 1 up to 256 PCIe cryptographic adapter cards. +The AP devices provide cryptographic functions to all CPUs assigned to a +linux system running in an IBM Z system LPAR. + +The AP adapter cards are exposed via the AP bus. The motivation for vfio-ap +is to make AP cards available to KVM guests using the VFIO mediated device +framework. 
This implementation relies considerably on the s390 virtualization +facilities which do most of the hard work of providing direct access to AP +devices. + +AP Architectural Overview +========================= +To facilitate the comprehension of the design, let's start with some +definitions: + +* AP adapter + + An AP adapter is an IBM Z adapter card that can perform cryptographic + functions. There can be from 0 to 256 adapters assigned to an LPAR. Adapters + assigned to the LPAR in which a linux host is running will be available to + the linux host. Each adapter is identified by a number from 0 to 255; however, + the maximum adapter number is determined by machine model and/or adapter type. + When installed, an AP adapter is accessed by AP instructions executed by any + CPU. + + The AP adapter cards are assigned to a given LPAR via the system's Activation + Profile which can be edited via the HMC. When the linux host system is IPL'd + in the LPAR, the AP bus detects the AP adapter cards assigned to the LPAR and + creates a sysfs device for each assigned adapter. For example, if AP adapters + 4 and 10 (0x0a) are assigned to the LPAR, the AP bus will create the following + sysfs device entries:: + + /sys/devices/ap/card04 + /sys/devices/ap/card0a + + Symbolic links to these devices will also be created in the AP bus devices + sub-directory:: + + /sys/bus/ap/devices/[card04] + /sys/bus/ap/devices/[card0a] + +* AP domain + + An adapter is partitioned into domains. An adapter can hold up to 256 domains + depending upon the adapter type and hardware configuration. A domain is + identified by a number from 0 to 255; however, the maximum domain number is + determined by machine model and/or adapter type. A domain can be thought of + as a set of hardware registers and memory used for processing AP commands. A + domain can be configured with a secure private key used for clear key + encryption.
A domain is classified in one of two ways depending upon how it + may be accessed: + + * Usage domains are domains that are targeted by an AP instruction to + process an AP command. + + * Control domains are domains that are changed by an AP command sent to a + usage domain; for example, to set the secure private key for the control + domain. + + The AP usage and control domains are assigned to a given LPAR via the system's + Activation Profile which can be edited via the HMC. When a linux host system + is IPL'd in the LPAR, the AP bus module detects the AP usage and control + domains assigned to the LPAR. The domain number of each usage domain and + adapter number of each AP adapter are combined to create AP queue devices + (see AP Queue section below). The domain number of each control domain will be + represented in a bitmask and stored in a sysfs file + /sys/bus/ap/ap_control_domain_mask. The bits in the mask, from most to least + significant bit, correspond to domains 0-255. + +* AP Queue + + An AP queue is the means by which an AP command is sent to a usage domain + inside a specific adapter. An AP queue is identified by a tuple + comprised of an AP adapter ID (APID) and an AP queue index (APQI). The + APQI corresponds to a given usage domain number within the adapter. This tuple + forms an AP Queue Number (APQN) uniquely identifying an AP queue. AP + instructions include a field containing the APQN to identify the AP queue to + which the AP command is to be sent for processing. + + The AP bus will create a sysfs device for each APQN that can be derived from + the cross product of the AP adapter and usage domain numbers detected when the + AP bus module is loaded. 
For example, if adapters 4 and 10 (0x0a) and usage + domains 6 and 71 (0x47) are assigned to the LPAR, the AP bus will create the + following sysfs entries:: + + /sys/devices/ap/card04/04.0006 + /sys/devices/ap/card04/04.0047 + /sys/devices/ap/card0a/0a.0006 + /sys/devices/ap/card0a/0a.0047 + + The following symbolic links to these devices will be created in the AP bus + devices subdirectory:: + + /sys/bus/ap/devices/[04.0006] + /sys/bus/ap/devices/[04.0047] + /sys/bus/ap/devices/[0a.0006] + /sys/bus/ap/devices/[0a.0047] + +* AP Instructions: + + There are three AP instructions: + + * NQAP: to enqueue an AP command-request message to a queue + * DQAP: to dequeue an AP command-reply message from a queue + * PQAP: to administer the queues + + AP instructions identify the domain that is targeted to process the AP + command; this must be one of the usage domains. An AP command may modify a + domain that is not one of the usage domains, but the modified domain + must be one of the control domains. + +AP and SIE +========== +Let's now take a look at how AP instructions executed on a guest are interpreted +by the hardware. + +A satellite control block called the Crypto Control Block (CRYCB) is attached to +our main hardware virtualization control block. The CRYCB contains three fields +to identify the adapters, usage domains and control domains assigned to the KVM +guest: + +* The AP Mask (APM) field is a bit mask that identifies the AP adapters assigned + to the KVM guest. Each bit in the mask, from left to right (i.e. from most + significant to least significant bit in big endian order), corresponds to + an APID from 0-255. If a bit is set, the corresponding adapter is valid for + use by the KVM guest. + +* The AP Queue Mask (AQM) field is a bit mask identifying the AP usage domains + assigned to the KVM guest. Each bit in the mask, from left to right (i.e. 
from + most significant to least significant bit in big endian order), corresponds to + an AP queue index (APQI) from 0-255. If a bit is set, the corresponding queue + is valid for use by the KVM guest. + +* The AP Domain Mask (ADM) field is a bit mask that identifies the AP control domains + assigned to the KVM guest. The ADM bit mask controls which domains can be + changed by an AP command-request message sent to a usage domain from the + guest. Each bit in the mask, from left to right (i.e. from most significant to + least significant bit in big endian order), corresponds to a domain from + 0-255. If a bit is set, the corresponding domain can be modified by an AP + command-request message sent to a usage domain. + +If you recall from the description of an AP Queue, AP instructions include +an APQN to identify the AP queue to which an AP command-request message is to be +sent (NQAP and PQAP instructions), or from which a command-reply message is to +be received (DQAP instruction). The validity of an APQN is defined by the matrix +calculated from the APM and AQM; it is the cross product of all assigned adapter +numbers (APM) with all assigned queue indexes (AQM). For example, if adapters 1 +and 2 and usage domains 5 and 6 are assigned to a guest, the APQNs (1,5), (1,6), +(2,5) and (2,6) will be valid for the guest.
+ +The APQNs can provide secure key functionality - i.e., a private key is stored +on the adapter card for each of its domains - so each APQN must be assigned to +at most one guest or to the linux host:: + + Example 1: Valid configuration: + ------------------------------ + Guest1: adapters 1,2 domains 5,6 + Guest2: adapters 1,2 domain 7 + + This is valid because both guests have a unique set of APQNs: + Guest1 has APQNs (1,5), (1,6), (2,5), (2,6); + Guest2 has APQNs (1,7), (2,7) + + Example 2: Valid configuration: + ------------------------------ + Guest1: adapters 1,2 domains 5,6 + Guest2: adapters 3,4 domains 5,6 + + This is also valid because both guests have a unique set of APQNs: + Guest1 has APQNs (1,5), (1,6), (2,5), (2,6); + Guest2 has APQNs (3,5), (3,6), (4,5), (4,6) + + Example 3: Invalid configuration: + -------------------------------- + Guest1: adapters 1,2 domains 5,6 + Guest2: adapter 1 domains 6,7 + + This is an invalid configuration because both guests have access to + APQN (1,6). + +The Design +========== +The design introduces three new objects: + +1. AP matrix device +2. VFIO AP device driver (vfio_ap.ko) +3. VFIO AP mediated matrix pass-through device + +The VFIO AP device driver +------------------------- +The VFIO AP (vfio_ap) device driver serves the following purposes: + +1. Provides the interfaces to secure APQNs for exclusive use of KVM guests. + +2. Sets up the VFIO mediated device interfaces to manage a mediated matrix + device and creates the sysfs interfaces for assigning adapters, usage + domains, and control domains comprising the matrix for a KVM guest. + +3.
Configures the APM, AQM and ADM in the CRYCB referenced by a KVM guest's + SIE state description to grant the guest access to a matrix of AP devices + +Reserve APQNs for exclusive use of KVM guests +--------------------------------------------- +The following block diagram illustrates the mechanism by which APQNs are +reserved:: + + +------------------+ + 7 remove | | + +--------------------> cex4queue driver | + | | | + | +------------------+ + | + | + | +------------------+ +----------------+ + | 5 register driver | | 3 create | | + | +----------------> Device core +----------> matrix device | + | | | | | | + | | +--------^---------+ +----------------+ + | | | + | | +-------------------+ + | | +-----------------------------------+ | + | | | 4 register AP driver | | 2 register device + | | | | | + +--------+---+-v---+ +--------+-------+-+ + | | | | + | ap_bus +--------------------- > vfio_ap driver | + | | 8 probe | | + +--------^---------+ +--^--^------------+ + 6 edit | | | + apmask | +-----------------------------+ | 9 mdev create + aqmask | | 1 modprobe | + +--------+-----+---+ +----------------+-+ +----------------+ + | | | |8 create | mediated | + | admin | | VFIO device core |---------> matrix | + | + | | | device | + +------+-+---------+ +--------^---------+ +--------^-------+ + | | | | + | | 9 create vfio_ap-passthrough | | + | +------------------------------+ | + +-------------------------------------------------------------+ + 10 assign adapter/domain/control domain + +The process for reserving an AP queue for use by a KVM guest is: + +1. The administrator loads the vfio_ap device driver +2. The vfio-ap driver during its initialization will register a single 'matrix' + device with the device core. This will serve as the parent device for + all mediated matrix devices used to configure an AP matrix for a guest. +3. The /sys/devices/vfio_ap/matrix device is created by the device core +4. 
The vfio_ap device driver will register with the AP bus for AP queue devices + of type 10 and higher (CEX4 and newer). The driver will provide the vfio_ap + driver's probe and remove callback interfaces. Devices older than CEX4 queues + are not supported to simplify the implementation by not needlessly + complicating the design by supporting older devices that will go out of + service in the relatively near future, and for which there are few older + systems around on which to test. +5. The AP bus registers the vfio_ap device driver with the device core +6. The administrator edits the AP adapter and queue masks to reserve AP queues + for use by the vfio_ap device driver. +7. The AP bus removes the AP queues reserved for the vfio_ap driver from the + default zcrypt cex4queue driver. +8. The AP bus probes the vfio_ap device driver to bind the queues reserved for + it. +9. The administrator creates a passthrough type mediated matrix device to be + used by a guest +10. The administrator assigns the adapters, usage domains and control domains + to be exclusively used by a guest. + +Set up the VFIO mediated device interfaces +------------------------------------------ +The VFIO AP device driver utilizes the common interface of the VFIO mediated +device core driver to: + +* Register an AP mediated bus driver to add a mediated matrix device to and + remove it from a VFIO group. 
+* Create and destroy a mediated matrix device +* Add a mediated matrix device to and remove it from the AP mediated bus driver +* Add a mediated matrix device to and remove it from an IOMMU group + +The following high-level block diagram shows the main components and interfaces +of the VFIO AP mediated matrix device driver:: + + +-------------+ + | | + | +---------+ | mdev_register_driver() +--------------+ + | | Mdev | +<-----------------------+ | + | | bus | | | vfio_mdev.ko | + | | driver | +----------------------->+ |<-> VFIO user + | +---------+ | probe()/remove() +--------------+ APIs + | | + | MDEV CORE | + | MODULE | + | mdev.ko | + | +---------+ | mdev_register_device() +--------------+ + | |Physical | +<-----------------------+ | + | | device | | | vfio_ap.ko |<-> matrix + | |interface| +----------------------->+ | device + | +---------+ | callback +--------------+ + +-------------+ + +During initialization of the vfio_ap module, the matrix device is registered +with an 'mdev_parent_ops' structure that provides the sysfs attribute +structures, mdev functions and callback interfaces for managing the mediated +matrix device. + +* sysfs attribute structures: + + supported_type_groups + The VFIO mediated device framework supports creation of user-defined + mediated device types. These mediated device types are specified + via the 'supported_type_groups' structure when a device is registered + with the mediated device framework. The registration process creates the + sysfs structures for each mediated device type specified in the + 'mdev_supported_types' sub-directory of the device being registered. Along + with the device type, the sysfs attributes of the mediated device type are + provided. + + The VFIO AP device driver will register one mediated device type for + passthrough devices: + + /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough + + Only the read-only attributes required by the VFIO mdev framework will + be provided:: + + ... 
name + ... device_api + ... available_instances + + Where: + + * name: + specifies the name of the mediated device type + * device_api: + specifies the VFIO API of the mediated device type + * available_instances: + the number of mediated matrix passthrough devices + that can be created + + mdev_attr_groups + This attribute group identifies the user-defined sysfs attributes of the + mediated device. When a device is registered with the VFIO mediated device + framework, the sysfs attribute files identified in the 'mdev_attr_groups' + structure will be created in the mediated matrix device's directory. The + sysfs attributes for a mediated matrix device are: + + assign_adapter / unassign_adapter: + Write-only attributes for assigning/unassigning an AP adapter to/from the + mediated matrix device. To assign/unassign an adapter, the APID of the + adapter is echoed to the respective attribute file. + assign_domain / unassign_domain: + Write-only attributes for assigning/unassigning an AP usage domain to/from + the mediated matrix device. To assign/unassign a domain, the domain + number of the usage domain is echoed to the respective attribute + file. + matrix: + A read-only file for displaying the APQNs derived from the cross product + of the adapter and domain numbers assigned to the mediated matrix device. + assign_control_domain / unassign_control_domain: + Write-only attributes for assigning/unassigning an AP control domain + to/from the mediated matrix device. To assign/unassign a control domain, + the ID of the domain to be assigned/unassigned is echoed to the respective + attribute file. + control_domains: + A read-only file for displaying the control domain numbers assigned to the + mediated matrix device.
+ +* functions: + + create: + allocates the ap_matrix_mdev structure used by the vfio_ap driver to: + + * Store the reference to the KVM structure for the guest using the mdev + * Store the AP matrix configuration for the adapters, domains, and control + domains assigned via the corresponding sysfs attribute files + + remove: + deallocates the mediated matrix device's ap_matrix_mdev structure. This will + be allowed only if a running guest is not using the mdev. + +* callback interfaces: + + open: + The vfio_ap driver uses this callback to register a + VFIO_GROUP_NOTIFY_SET_KVM notifier callback function for the mdev matrix + device. The open is invoked when QEMU connects the VFIO iommu group + for the mdev matrix device to the MDEV bus. Access to the KVM structure used + to configure the KVM guest is provided via this callback. The KVM structure + is used to configure the guest's access to the AP matrix defined via the + mediated matrix device's sysfs attribute files. + release: + unregisters the VFIO_GROUP_NOTIFY_SET_KVM notifier callback function for the + mdev matrix device and deconfigures the guest's AP matrix. + +Configure the APM, AQM and ADM in the CRYCB +------------------------------------------- +Configuring the AP matrix for a KVM guest will be performed when the +VFIO_GROUP_NOTIFY_SET_KVM notifier callback is invoked. The notifier +function is called when QEMU connects to KVM. The guest's AP matrix is +configured via its CRYCB by: + +* Setting the bits in the APM corresponding to the APIDs assigned to the + mediated matrix device via its 'assign_adapter' interface. +* Setting the bits in the AQM corresponding to the domains assigned to the + mediated matrix device via its 'assign_domain' interface. +* Setting the bits in the ADM corresponding to the domain IDs assigned to the + mediated matrix device via its 'assign_control_domain' interface.
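
As a rough illustration of the bit numbering involved when bits are set in the APM, AQM and ADM, the following Python sketch builds a 256-bit mask in which bit 0 is the leftmost (most significant) bit. It is not taken from the vfio_ap driver: the helper names ``mask_from_ids`` and ``mask_to_hex`` are hypothetical, and a plain Python integer merely stands in for the CRYCB mask fields.

```python
MASK_BITS = 256  # APM, AQM and ADM each cover IDs 0-255


def mask_from_ids(ids):
    """Return a 256-bit mask (as an int) with one bit set per ID.

    Bit 0 is the most significant (leftmost) bit, matching the
    left-to-right numbering described in this document.
    """
    mask = 0
    for i in ids:
        if not 0 <= i < MASK_BITS:
            raise ValueError(f"ID {i} is outside 0-{MASK_BITS - 1}")
        mask |= 1 << (MASK_BITS - 1 - i)
    return mask


def mask_to_hex(mask):
    """Format a mask as a zero-padded 64-digit hex string."""
    return f"0x{mask:064x}"


# Assumed input: Guest1 from the Example section of this document
# (adapters 5 and 6, usage domains 4 and 0xab).
apm = mask_from_ids([5, 6])
aqm = mask_from_ids([4, 0xAB])
print(mask_to_hex(apm))
print(mask_to_hex(aqm))
```

Because the numbering runs from the most significant bit, adapter 5 and adapter 6 both land in the first byte of the APM, while domain 0xab lands in byte 21 of the AQM.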
+ +The CPU model features for AP +----------------------------- +The AP stack relies on the presence of the AP instructions as well as two +facilities: The AP Facilities Test (APFT) facility; and the AP Query +Configuration Information (QCI) facility. These features/facilities are made +available to a KVM guest via the following CPU model features: + +1. ap: Indicates whether the AP instructions are installed on the guest. This + feature will be enabled by KVM only if the AP instructions are installed + on the host. + +2. apft: Indicates the APFT facility is available on the guest. This facility + can be made available to the guest only if it is available on the host (i.e., + facility bit 15 is set). + +3. apqci: Indicates the AP QCI facility is available on the guest. This facility + can be made available to the guest only if it is available on the host (i.e., + facility bit 12 is set). + +Note: If the user chooses to specify a CPU model different than the 'host' +model to QEMU, the CPU model features and facilities need to be turned on +explicitly; for example:: + + /usr/bin/qemu-system-s390x ... -cpu z13,ap=on,apqci=on,apft=on + +A guest can be precluded from using AP features/facilities by turning them off +explicitly; for example:: + + /usr/bin/qemu-system-s390x ... -cpu host,ap=off,apqci=off,apft=off + +Note: If the APFT facility is turned off (apft=off) for the guest, the guest +will not see any AP devices. The zcrypt device drivers that register for type 10 +and newer AP devices - i.e., the cex4card and cex4queue device drivers - need +the APFT facility to ascertain the facilities installed on a given AP device. If +the APFT facility is not installed on the guest, then the probe of device +drivers will fail since only type 10 and newer devices can be configured for +guest use. + +Example +======= +Let's now provide an example to illustrate how KVM guests may be given +access to AP facilities. 
For this example, we will show how to configure +three guests such that executing the lszcrypt command on the guests would +look like this: + +Guest1 +------ +=========== ===== ============ +CARD.DOMAIN TYPE MODE +=========== ===== ============ +05 CEX5C CCA-Coproc +05.0004 CEX5C CCA-Coproc +05.00ab CEX5C CCA-Coproc +06 CEX5A Accelerator +06.0004 CEX5A Accelerator +06.00ab CEX5A Accelerator +=========== ===== ============ + +Guest2 +------ +=========== ===== ============ +CARD.DOMAIN TYPE MODE +=========== ===== ============ +05 CEX5A Accelerator +05.0047 CEX5A Accelerator +05.00ff CEX5A Accelerator +=========== ===== ============ + +Guest3 +------ +=========== ===== ============ +CARD.DOMAIN TYPE MODE +=========== ===== ============ +06 CEX5A Accelerator +06.0047 CEX5A Accelerator +06.00ff CEX5A Accelerator +=========== ===== ============ + +These are the steps: + +1. Install the vfio_ap module on the linux host. The dependency chain for the + vfio_ap module is: + * iommu + * s390 + * zcrypt + * vfio + * vfio_mdev + * vfio_mdev_device + * KVM + + To build the vfio_ap module, the kernel build must be configured with the + following Kconfig elements selected: + * IOMMU_SUPPORT + * S390 + * ZCRYPT + * S390_AP_IOMMU + * VFIO + * VFIO_MDEV + * VFIO_MDEV_DEVICE + * KVM + + If using make menuconfig, select the following to build the vfio_ap module:: + + -> Device Drivers + -> IOMMU Hardware Support + select S390 AP IOMMU Support + -> VFIO Non-Privileged userspace driver framework + -> Mediated device driver framework + -> VFIO driver for Mediated devices + -> I/O subsystem + -> VFIO support for AP devices + +2. Secure the AP queues to be used by the three guests so that the host cannot + access them. To secure them, there are two sysfs files that specify + bitmasks marking a subset of the APQN range as 'usable only by the default AP + queue device drivers' or 'not usable by the default device drivers' and thus + available for use by the vfio_ap device driver.
The location of the sysfs + files containing the masks are:: + + /sys/bus/ap/apmask + /sys/bus/ap/aqmask + + The 'apmask' is a 256-bit mask that identifies a set of AP adapter IDs + (APID). Each bit in the mask, from left to right (i.e., from most significant + to least significant bit in big endian order), corresponds to an APID from + 0-255. If a bit is set, the APID is marked as usable only by the default AP + queue device drivers; otherwise, the APID is usable by the vfio_ap + device driver. + + The 'aqmask' is a 256-bit mask that identifies a set of AP queue indexes + (APQI). Each bit in the mask, from left to right (i.e., from most significant + to least significant bit in big endian order), corresponds to an APQI from + 0-255. If a bit is set, the APQI is marked as usable only by the default AP + queue device drivers; otherwise, the APQI is usable by the vfio_ap device + driver. + + Take, for example, the following mask:: + + 0x7dffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff + + It indicates: + + 1, 2, 3, 4, 5, and 7-255 belong to the default drivers' pool, and 0 and 6 + belong to the vfio_ap device driver's pool. + + The APQN of each AP queue device assigned to the linux host is checked by the + AP bus against the set of APQNs derived from the cross product of APIDs + and APQIs marked as usable only by the default AP queue device drivers. If a + match is detected, only the default AP queue device drivers will be probed; + otherwise, the vfio_ap device driver will be probed. + + By default, the two masks are set to reserve all APQNs for use by the default + AP queue device drivers. There are two ways the default masks can be changed: + + 1. The sysfs mask files can be edited by echoing a string into the + respective sysfs mask file in one of two formats: + + * An absolute hex string starting with 0x - like "0x12345678" - sets + the mask. 
If the given string is shorter than the mask, it is padded + with 0s on the right; for example, specifying a mask value of 0x41 is + the same as specifying:: + + 0x4100000000000000000000000000000000000000000000000000000000000000 + + Keep in mind that the mask reads from left to right (i.e., most + significant to least significant bit in big endian order), so the mask + above identifies device numbers 1 and 7 (01000001). + + If the string is longer than the mask, the operation is terminated with + an error (EINVAL). + + * Individual bits in the mask can be switched on and off by specifying + each bit number to be switched in a comma-separated list. Each bit + number string must be prefixed with a plus ('+') or minus ('-') to indicate + whether the corresponding bit is to be switched on ('+') or off ('-'). Some + valid values are: + + - "+0" switches bit 0 on + - "-13" switches bit 13 off + - "+0x41" switches bit 65 on + - "-0xff" switches bit 255 off + + The following example: + + +0,-6,+0x47,-0xf0 + + Switches bits 0 and 71 (0x47) on + + Switches bits 6 and 240 (0xf0) off + + Note that the bits not specified in the list remain as they were before + the operation. + + 2.
The masks can also be changed at boot time via parameters on the kernel + command line like this: + + ap.apmask=0xffff ap.aqmask=0x40 + + This would create the following masks:: + + apmask: + 0xffff000000000000000000000000000000000000000000000000000000000000 + + aqmask: + 0x4000000000000000000000000000000000000000000000000000000000000000 + + Resulting in these two pools:: + + default drivers pool: adapter 0-15, domain 1 + alternate drivers pool: adapter 16-255, domains 0, 2-255 + +Securing the APQNs for our example +---------------------------------- + To secure the AP queues 05.0004, 05.0047, 05.00ab, 05.00ff, 06.0004, 06.0047, + 06.00ab, and 06.00ff for use by the vfio_ap device driver, the corresponding + APQNs can either be removed from the default masks:: + + echo -5,-6 > /sys/bus/ap/apmask + + echo -4,-0x47,-0xab,-0xff > /sys/bus/ap/aqmask + + Or the masks can be set as follows:: + + echo 0xf9ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff \ + > apmask + + echo 0xf7fffffffffffffffeffffffffffffffffffffffffeffffffffffffffffffffe \ + > aqmask + + This will result in AP queues 05.0004, 05.0047, 05.00ab, 05.00ff, 06.0004, + 06.0047, 06.00ab, and 06.00ff getting bound to the vfio_ap device driver. The + sysfs directory for the vfio_ap device driver will now contain symbolic links + to the AP queue devices bound to it:: + + /sys/bus/ap + ... [drivers] + ...... [vfio_ap] + ......... [05.0004] + ......... [05.0047] + ......... [05.00ab] + ......... [05.00ff] + ......... [06.0004] + ......... [06.0047] + ......... [06.00ab] + ......... [06.00ff] + + Keep in mind that only type 10 and newer adapters (i.e., CEX4 and later) + can be bound to the vfio_ap device driver. The reason for this is to + simplify the implementation by not needlessly complicating the design by + supporting older devices that will go out of service in the relatively near + future and for which there are few older systems on which to test. 
+ + The administrator, therefore, must take care to secure only AP queues that + can be bound to the vfio_ap device driver. The device type for a given AP + queue device can be read from the parent card's sysfs directory. For example, + to see the hardware type of the queue 05.0004: + + cat /sys/bus/ap/devices/card05/hwtype + + The hwtype must be 10 or higher (CEX4 or newer) in order to be bound to the + vfio_ap device driver. + +3. Create the mediated devices needed to configure the AP matrixes for the + three guests and to provide an interface to the vfio_ap driver for + use by the guests:: + + /sys/devices/vfio_ap/matrix/ + --- [mdev_supported_types] + ------ [vfio_ap-passthrough] (passthrough mediated matrix device type) + --------- create + --------- [devices] + + To create the mediated devices for the three guests:: + + uuidgen > create + uuidgen > create + uuidgen > create + + or + + echo $uuid1 > create + echo $uuid2 > create + echo $uuid3 > create + + This will create three mediated devices in the [devices] subdirectory named + after the UUID written to the create attribute file. 
We call them $uuid1, + $uuid2 and $uuid3 and this is the sysfs directory structure after creation:: + + /sys/devices/vfio_ap/matrix/ + --- [mdev_supported_types] + ------ [vfio_ap-passthrough] + --------- [devices] + ------------ [$uuid1] + --------------- assign_adapter + --------------- assign_control_domain + --------------- assign_domain + --------------- matrix + --------------- unassign_adapter + --------------- unassign_control_domain + --------------- unassign_domain + + ------------ [$uuid2] + --------------- assign_adapter + --------------- assign_control_domain + --------------- assign_domain + --------------- matrix + --------------- unassign_adapter + ----------------unassign_control_domain + ----------------unassign_domain + + ------------ [$uuid3] + --------------- assign_adapter + --------------- assign_control_domain + --------------- assign_domain + --------------- matrix + --------------- unassign_adapter + ----------------unassign_control_domain + ----------------unassign_domain + +4. The administrator now needs to configure the matrixes for the mediated + devices $uuid1 (for Guest1), $uuid2 (for Guest2) and $uuid3 (for Guest3). + + This is how the matrix is configured for Guest1:: + + echo 5 > assign_adapter + echo 6 > assign_adapter + echo 4 > assign_domain + echo 0xab > assign_domain + + Control domains can similarly be assigned using the assign_control_domain + sysfs file. + + If a mistake is made configuring an adapter, domain or control domain, + you can use the unassign_xxx files to unassign the adapter, domain or + control domain. 
+ + To display the matrix configuration for Guest1:: + + cat matrix + + This is how the matrix is configured for Guest2:: + + echo 5 > assign_adapter + echo 0x47 > assign_domain + echo 0xff > assign_domain + + This is how the matrix is configured for Guest3:: + + echo 6 > assign_adapter + echo 0x47 > assign_domain + echo 0xff > assign_domain + + In order to successfully assign an adapter: + + * The adapter number specified must represent a value from 0 up to the + maximum adapter number configured for the system. If an adapter number + higher than the maximum is specified, the operation will terminate with + an error (ENODEV). + + * All APQNs that can be derived from the adapter ID and the IDs of + the previously assigned domains must be bound to the vfio_ap device + driver. If no domains have yet been assigned, then there must be at least + one APQN with the specified APID bound to the vfio_ap driver. If no such + APQNs are bound to the driver, the operation will terminate with an + error (EADDRNOTAVAIL). + + No APQN that can be derived from the adapter ID and the IDs of the + previously assigned domains can be assigned to another mediated matrix + device. If an APQN is assigned to another mediated matrix device, the + operation will terminate with an error (EADDRINUSE). + + In order to successfully assign a domain: + + * The domain number specified must represent a value from 0 up to the + maximum domain number configured for the system. If a domain number + higher than the maximum is specified, the operation will terminate with + an error (ENODEV). + + * All APQNs that can be derived from the domain ID and the IDs of + the previously assigned adapters must be bound to the vfio_ap device + driver. If no adapters have yet been assigned, then there must be at least + one APQN with the specified APQI bound to the vfio_ap driver. If no such + APQNs are bound to the driver, the operation will terminate with an + error (EADDRNOTAVAIL).
+ + No APQN that can be derived from the domain ID and the IDs of the + previously assigned adapters can be assigned to another mediated matrix + device. If an APQN is assigned to another mediated matrix device, the + operation will terminate with an error (EADDRINUSE). + + In order to successfully assign a control domain, the domain number + specified must represent a value from 0 up to the maximum domain number + configured for the system. If a control domain number higher than the maximum + is specified, the operation will terminate with an error (ENODEV). + +5. Start Guest1:: + + /usr/bin/qemu-system-s390x ... -cpu host,ap=on,apqci=on,apft=on \ + -device vfio-ap,sysfsdev=/sys/devices/vfio_ap/matrix/$uuid1 ... + +6. Start Guest2:: + + /usr/bin/qemu-system-s390x ... -cpu host,ap=on,apqci=on,apft=on \ + -device vfio-ap,sysfsdev=/sys/devices/vfio_ap/matrix/$uuid2 ... + +7. Start Guest3:: + + /usr/bin/qemu-system-s390x ... -cpu host,ap=on,apqci=on,apft=on \ + -device vfio-ap,sysfsdev=/sys/devices/vfio_ap/matrix/$uuid3 ... + +When the guest is shut down, the mediated matrix devices may be removed. + +Using our example again, to remove the mediated matrix device $uuid1:: + + /sys/devices/vfio_ap/matrix/ + --- [mdev_supported_types] + ------ [vfio_ap-passthrough] + --------- [devices] + ------------ [$uuid1] + --------------- remove + +:: + + echo 1 > remove + +This will remove all of the mdev matrix device's sysfs structures including +the mdev device itself. To recreate and reconfigure the mdev matrix device, +all of the steps starting with step 3 will have to be performed again. Note +that the remove will fail if a guest using the mdev is still running. + +It is not necessary to remove an mdev matrix device, but one may want to +remove it if no guest will use it during the remaining lifetime of the linux +host. If the mdev matrix device is removed, one may want to also reconfigure +the pool of adapters and queues reserved for use by the default drivers.
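
Reconfiguring that pool means writing the apmask and aqmask files again, so it may help to see the mask formats from step 2 spelled out. The Python sketch below only illustrates the semantics stated there (right-padding of short hex strings, left-to-right bit numbering, and '+'/'-' relative updates). It is not the AP bus implementation, and the function names ``parse_absolute``, ``apply_relative`` and ``ids_set`` are hypothetical.

```python
MASK_BITS = 256
HEX_DIGITS = MASK_BITS // 4


def parse_absolute(text):
    """Parse an absolute mask such as "0x41", padding with 0s on the right."""
    if not text.startswith("0x"):
        raise ValueError("absolute mask must start with 0x")
    digits = text[2:]
    if not digits or len(digits) > HEX_DIGITS:
        raise ValueError("mask is empty or longer than 256 bits")
    return int(digits.ljust(HEX_DIGITS, "0"), 16)


def apply_relative(mask, text):
    """Apply a list such as "+0,-6,+0x47,-0xf0" to an existing mask."""
    for item in text.split(","):
        sign, number = item[0], int(item[1:], 0)
        if sign not in "+-" or not 0 <= number < MASK_BITS:
            raise ValueError(f"bad entry: {item!r}")
        bit = 1 << (MASK_BITS - 1 - number)
        mask = mask | bit if sign == "+" else mask & ~bit
    return mask


def ids_set(mask):
    """Return the IDs whose bits are set, counting from the leftmost bit."""
    return [i for i in range(MASK_BITS) if mask >> (MASK_BITS - 1 - i) & 1]


# "0x41" is padded on the right, so it selects IDs 1 and 7.
print(ids_set(parse_absolute("0x41")))  # [1, 7]

# Starting from the default (all bits set), release the example's domains
# from the default drivers' pool.
all_ones = (1 << MASK_BITS) - 1
aqmask = apply_relative(all_ones, "-4,-0x47,-0xab,-0xff")
print(f"0x{aqmask:064x}")
```

Bits left set in the resulting mask stay with the default AP queue device drivers; cleared bits become available to the vfio_ap device driver.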
+ +Limitations +=========== +* The KVM/kernel interfaces do not provide a way to prevent restoring an APQN + to the default drivers pool of a queue that is still assigned to a mediated + device in use by a guest. It is incumbent upon the administrator to + ensure there is no mediated device in use by a guest to which the APQN is + assigned lest the host be given access to the private data of the AP queue + device such as a private key configured specifically for the guest. + +* Dynamically modifying the AP matrix for a running guest (which would amount to + hot(un)plug of AP devices for the guest) is currently not supported + +* Live guest migration is not supported for guests using AP devices. diff --git a/Documentation/s390/vfio-ap.txt b/Documentation/s390/vfio-ap.txt deleted file mode 100644 index 65167cfe4485..000000000000 --- a/Documentation/s390/vfio-ap.txt +++ /dev/null @@ -1,837 +0,0 @@ -Introduction: -============ -The Adjunct Processor (AP) facility is an IBM Z cryptographic facility comprised -of three AP instructions and from 1 up to 256 PCIe cryptographic adapter cards. -The AP devices provide cryptographic functions to all CPUs assigned to a -linux system running in an IBM Z system LPAR. - -The AP adapter cards are exposed via the AP bus. The motivation for vfio-ap -is to make AP cards available to KVM guests using the VFIO mediated device -framework. This implementation relies considerably on the s390 virtualization -facilities which do most of the hard work of providing direct access to AP -devices. - -AP Architectural Overview: -========================= -To facilitate the comprehension of the design, let's start with some -definitions: - -* AP adapter - - An AP adapter is an IBM Z adapter card that can perform cryptographic - functions. There can be from 0 to 256 adapters assigned to an LPAR. Adapters - assigned to the LPAR in which a linux host is running will be available to - the linux host. 
Each adapter is identified by a number from 0 to 255; however, - the maximum adapter number is determined by machine model and/or adapter type. - When installed, an AP adapter is accessed by AP instructions executed by any - CPU. - - The AP adapter cards are assigned to a given LPAR via the system's Activation - Profile which can be edited via the HMC. When the linux host system is IPL'd - in the LPAR, the AP bus detects the AP adapter cards assigned to the LPAR and - creates a sysfs device for each assigned adapter. For example, if AP adapters - 4 and 10 (0x0a) are assigned to the LPAR, the AP bus will create the following - sysfs device entries: - - /sys/devices/ap/card04 - /sys/devices/ap/card0a - - Symbolic links to these devices will also be created in the AP bus devices - sub-directory: - - /sys/bus/ap/devices/[card04] - /sys/bus/ap/devices/[card04] - -* AP domain - - An adapter is partitioned into domains. An adapter can hold up to 256 domains - depending upon the adapter type and hardware configuration. A domain is - identified by a number from 0 to 255; however, the maximum domain number is - determined by machine model and/or adapter type.. A domain can be thought of - as a set of hardware registers and memory used for processing AP commands. A - domain can be configured with a secure private key used for clear key - encryption. A domain is classified in one of two ways depending upon how it - may be accessed: - - * Usage domains are domains that are targeted by an AP instruction to - process an AP command. - - * Control domains are domains that are changed by an AP command sent to a - usage domain; for example, to set the secure private key for the control - domain. - - The AP usage and control domains are assigned to a given LPAR via the system's - Activation Profile which can be edited via the HMC. When a linux host system - is IPL'd in the LPAR, the AP bus module detects the AP usage and control - domains assigned to the LPAR. 
The domain number of each usage domain and - adapter number of each AP adapter are combined to create AP queue devices - (see AP Queue section below). The domain number of each control domain will be - represented in a bitmask and stored in a sysfs file - /sys/bus/ap/ap_control_domain_mask. The bits in the mask, from most to least - significant bit, correspond to domains 0-255. - -* AP Queue - - An AP queue is the means by which an AP command is sent to a usage domain - inside a specific adapter. An AP queue is identified by a tuple - comprised of an AP adapter ID (APID) and an AP queue index (APQI). The - APQI corresponds to a given usage domain number within the adapter. This tuple - forms an AP Queue Number (APQN) uniquely identifying an AP queue. AP - instructions include a field containing the APQN to identify the AP queue to - which the AP command is to be sent for processing. - - The AP bus will create a sysfs device for each APQN that can be derived from - the cross product of the AP adapter and usage domain numbers detected when the - AP bus module is loaded. For example, if adapters 4 and 10 (0x0a) and usage - domains 6 and 71 (0x47) are assigned to the LPAR, the AP bus will create the - following sysfs entries: - - /sys/devices/ap/card04/04.0006 - /sys/devices/ap/card04/04.0047 - /sys/devices/ap/card0a/0a.0006 - /sys/devices/ap/card0a/0a.0047 - - The following symbolic links to these devices will be created in the AP bus - devices subdirectory: - - /sys/bus/ap/devices/[04.0006] - /sys/bus/ap/devices/[04.0047] - /sys/bus/ap/devices/[0a.0006] - /sys/bus/ap/devices/[0a.0047] - -* AP Instructions: - - There are three AP instructions: - - * NQAP: to enqueue an AP command-request message to a queue - * DQAP: to dequeue an AP command-reply message from a queue - * PQAP: to administer the queues - - AP instructions identify the domain that is targeted to process the AP - command; this must be one of the usage domains. 
An AP command may modify a - domain that is not one of the usage domains, but the modified domain - must be one of the control domains. - -AP and SIE: -========== -Let's now take a look at how AP instructions executed on a guest are interpreted -by the hardware. - -A satellite control block called the Crypto Control Block (CRYCB) is attached to -our main hardware virtualization control block. The CRYCB contains three fields -to identify the adapters, usage domains and control domains assigned to the KVM -guest: - -* The AP Mask (APM) field is a bit mask that identifies the AP adapters assigned - to the KVM guest. Each bit in the mask, from left to right (i.e. from most - significant to least significant bit in big endian order), corresponds to - an APID from 0-255. If a bit is set, the corresponding adapter is valid for - use by the KVM guest. - -* The AP Queue Mask (AQM) field is a bit mask identifying the AP usage domains - assigned to the KVM guest. Each bit in the mask, from left to right (i.e. from - most significant to least significant bit in big endian order), corresponds to - an AP queue index (APQI) from 0-255. If a bit is set, the corresponding queue - is valid for use by the KVM guest. - -* The AP Domain Mask field is a bit mask that identifies the AP control domains - assigned to the KVM guest. The ADM bit mask controls which domains can be - changed by an AP command-request message sent to a usage domain from the - guest. Each bit in the mask, from left to right (i.e. from most significant to - least significant bit in big endian order), corresponds to a domain from - 0-255. If a bit is set, the corresponding domain can be modified by an AP - command-request message sent to a usage domain. 
- -If you recall from the description of an AP Queue, AP instructions include -an APQN to identify the AP queue to which an AP command-request message is to be -sent (NQAP and PQAP instructions), or from which a command-reply message is to -be received (DQAP instruction). The validity of an APQN is defined by the matrix -calculated from the APM and AQM; it is the cross product of all assigned adapter -numbers (APM) with all assigned queue indexes (AQM). For example, if adapters 1 -and 2 and usage domains 5 and 6 are assigned to a guest, the APQNs (1,5), (1,6), -(2,5) and (2,6) will be valid for the guest. - -The APQNs can provide secure key functionality - i.e., a private key is stored -on the adapter card for each of its domains - so each APQN must be assigned to -at most one guest or to the linux host. - - Example 1: Valid configuration: - ------------------------------ - Guest1: adapters 1,2 domains 5,6 - Guest2: adapter 1,2 domain 7 - - This is valid because both guests have a unique set of APQNs: - Guest1 has APQNs (1,5), (1,6), (2,5), (2,6); - Guest2 has APQNs (1,7), (2,7) - - Example 2: Valid configuration: - ------------------------------ - Guest1: adapters 1,2 domains 5,6 - Guest2: adapters 3,4 domains 5,6 - - This is also valid because both guests have a unique set of APQNs: - Guest1 has APQNs (1,5), (1,6), (2,5), (2,6); - Guest2 has APQNs (3,5), (3,6), (4,5), (4,6) - - Example 3: Invalid configuration: - -------------------------------- - Guest1: adapters 1,2 domains 5,6 - Guest2: adapter 1 domains 6,7 - - This is an invalid configuration because both guests have access to - APQN (1,6). - -The Design: -=========== -The design introduces three new objects: - -1. AP matrix device -2. VFIO AP device driver (vfio_ap.ko) -3. VFIO AP mediated matrix pass-through device - -The VFIO AP device driver -------------------------- -The VFIO AP (vfio_ap) device driver serves the following purposes: - -1. 
Provides the interfaces to secure APQNs for exclusive use of KVM guests. - -2. Sets up the VFIO mediated device interfaces to manage a mediated matrix - device and creates the sysfs interfaces for assigning adapters, usage - domains, and control domains comprising the matrix for a KVM guest. - -3. Configures the APM, AQM and ADM in the CRYCB referenced by a KVM guest's - SIE state description to grant the guest access to a matrix of AP devices - -Reserve APQNs for exclusive use of KVM guests ---------------------------------------------- -The following block diagram illustrates the mechanism by which APQNs are -reserved: - - +------------------+ - 7 remove | | - +--------------------> cex4queue driver | - | | | - | +------------------+ - | - | - | +------------------+ +-----------------+ - | 5 register driver | | 3 create | | - | +----------------> Device core +----------> matrix device | - | | | | | | - | | +--------^---------+ +-----------------+ - | | | - | | +-------------------+ - | | +-----------------------------------+ | - | | | 4 register AP driver | | 2 register device - | | | | | -+--------+---+-v---+ +--------+-------+-+ -| | | | -| ap_bus +--------------------- > vfio_ap driver | -| | 8 probe | | -+--------^---------+ +--^--^------------+ -6 edit | | | - apmask | +-----------------------------+ | 9 mdev create - aqmask | | 1 modprobe | -+--------+-----+---+ +----------------+-+ +------------------+ -| | | |8 create | mediated | -| admin | | VFIO device core |---------> matrix | -| + | | | device | -+------+-+---------+ +--------^---------+ +--------^---------+ - | | | | - | | 9 create vfio_ap-passthrough | | - | +------------------------------+ | - +-------------------------------------------------------------+ - 10 assign adapter/domain/control domain - -The process for reserving an AP queue for use by a KVM guest is: - -1. The administrator loads the vfio_ap device driver -2. 
The vfio_ap driver during its initialization will register a single 'matrix' - device with the device core. This will serve as the parent device for - all mediated matrix devices used to configure an AP matrix for a guest. -3. The /sys/devices/vfio_ap/matrix device is created by the device core -4. The vfio_ap device driver will register with the AP bus for AP queue devices - of type 10 and higher (CEX4 and newer). The driver will provide the vfio_ap - driver's probe and remove callback interfaces. Devices older than CEX4 queues - are not supported. This keeps the implementation simple, since older - devices will go out of service in the relatively near future and there - are few older systems around on which to test. -5. The AP bus registers the vfio_ap device driver with the device core -6. The administrator edits the AP adapter and queue masks to reserve AP queues - for use by the vfio_ap device driver. -7. The AP bus removes the AP queues reserved for the vfio_ap driver from the - default zcrypt cex4queue driver. -8. The AP bus probes the vfio_ap device driver to bind the queues reserved for - it. -9. The administrator creates a passthrough type mediated matrix device to be - used by a guest -10. The administrator assigns the adapters, usage domains and control domains - to be exclusively used by a guest. - -Set up the VFIO mediated device interfaces ------------------------------------------- -The VFIO AP device driver utilizes the common interface of the VFIO mediated -device core driver to: -* Register an AP mediated bus driver to add a mediated matrix device to and - remove it from a VFIO group. 
-* Create and destroy a mediated matrix device -* Add a mediated matrix device to and remove it from the AP mediated bus driver -* Add a mediated matrix device to and remove it from an IOMMU group - -The following high-level block diagram shows the main components and interfaces -of the VFIO AP mediated matrix device driver: - - +-------------+ - | | - | +---------+ | mdev_register_driver() +--------------+ - | | Mdev | +<-----------------------+ | - | | bus | | | vfio_mdev.ko | - | | driver | +----------------------->+ |<-> VFIO user - | +---------+ | probe()/remove() +--------------+ APIs - | | - | MDEV CORE | - | MODULE | - | mdev.ko | - | +---------+ | mdev_register_device() +--------------+ - | |Physical | +<-----------------------+ | - | | device | | | vfio_ap.ko |<-> matrix - | |interface| +----------------------->+ | device - | +---------+ | callback +--------------+ - +-------------+ - -During initialization of the vfio_ap module, the matrix device is registered -with an 'mdev_parent_ops' structure that provides the sysfs attribute -structures, mdev functions and callback interfaces for managing the mediated -matrix device. - -* sysfs attribute structures: - * supported_type_groups - The VFIO mediated device framework supports creation of user-defined - mediated device types. These mediated device types are specified - via the 'supported_type_groups' structure when a device is registered - with the mediated device framework. The registration process creates the - sysfs structures for each mediated device type specified in the - 'mdev_supported_types' sub-directory of the device being registered. Along - with the device type, the sysfs attributes of the mediated device type are - provided. - - The VFIO AP device driver will register one mediated device type for - passthrough devices: - /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough - Only the read-only attributes required by the VFIO mdev framework will - be provided: - ... name - ... 
device_api - ... available_instances - Where: - * name: specifies the name of the mediated device type - * device_api: specifies the VFIO API of the mediated device type - * available_instances: the number of mediated matrix passthrough devices - that can be created - * mdev_attr_groups - This attribute group identifies the user-defined sysfs attributes of the - mediated device. When a device is registered with the VFIO mediated device - framework, the sysfs attribute files identified in the 'mdev_attr_groups' - structure will be created in the mediated matrix device's directory. The - sysfs attributes for a mediated matrix device are: - * assign_adapter: - * unassign_adapter: - Write-only attributes for assigning/unassigning an AP adapter to/from the - mediated matrix device. To assign/unassign an adapter, the APID of the - adapter is echoed to the respective attribute file. - * assign_domain: - * unassign_domain: - Write-only attributes for assigning/unassigning an AP usage domain to/from - the mediated matrix device. To assign/unassign a domain, the domain - number of the usage domain is echoed to the respective attribute - file. - * matrix: - A read-only file for displaying the APQNs derived from the cross product - of the adapter and domain numbers assigned to the mediated matrix device. - * assign_control_domain: - * unassign_control_domain: - Write-only attributes for assigning/unassigning an AP control domain - to/from the mediated matrix device. To assign/unassign a control domain, - the ID of the domain to be assigned/unassigned is echoed to the respective - attribute file. - * control_domains: - A read-only file for displaying the control domain numbers assigned to the - mediated matrix device. 
- -* functions: - * create: - allocates the ap_matrix_mdev structure used by the vfio_ap driver to: - * Store the reference to the KVM structure for the guest using the mdev - * Store the AP matrix configuration for the adapters, domains, and control - domains assigned via the corresponding sysfs attribute files - * remove: - deallocates the mediated matrix device's ap_matrix_mdev structure. This will - be allowed only if a running guest is not using the mdev. - -* callback interfaces - * open: - The vfio_ap driver uses this callback to register a - VFIO_GROUP_NOTIFY_SET_KVM notifier callback function for the mdev matrix - device. The open is invoked when QEMU connects the VFIO iommu group - for the mdev matrix device to the MDEV bus. Access to the KVM structure used - to configure the KVM guest is provided via this callback. The KVM structure - is used to configure the guest's access to the AP matrix defined via the - mediated matrix device's sysfs attribute files. - * release: - unregisters the VFIO_GROUP_NOTIFY_SET_KVM notifier callback function for the - mdev matrix device and deconfigures the guest's AP matrix. - -Configure the APM, AQM and ADM in the CRYCB: -------------------------------------------- -Configuring the AP matrix for a KVM guest will be performed when the -VFIO_GROUP_NOTIFY_SET_KVM notifier callback is invoked. The notifier -function is called when QEMU connects to KVM. The guest's AP matrix is -configured via its CRYCB by: -* Setting the bits in the APM corresponding to the APIDs assigned to the - mediated matrix device via its 'assign_adapter' interface. -* Setting the bits in the AQM corresponding to the domains assigned to the - mediated matrix device via its 'assign_domain' interface. -* Setting the bits in the ADM corresponding to the domain IDs assigned to the - mediated matrix device via its 'assign_control_domain' interface. 
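As a rough illustration of what this configuration amounts to, the set of APQNs a guest can address is the cross product of the adapter numbers set in its APM and the usage domain numbers set in its AQM. The Python sketch below models that cross product and the rule that no APQN may be shared between guests. The function names are hypothetical and nothing here touches the real CRYCB; the guest configurations are taken from Example 3 earlier in this document.

```python
from itertools import product


def guest_apqns(adapters, usage_domains):
    """APQNs valid for a guest: every (APID, APQI) pair from the cross product."""
    return set(product(adapters, usage_domains))


def shared_apqns(guest_a, guest_b):
    """APQNs that two guest configurations would both have access to."""
    return guest_apqns(*guest_a) & guest_apqns(*guest_b)


# Example 3 from above: the two guests overlap on one APQN.
guest1 = ([1, 2], [5, 6])   # (adapters, usage domains)
guest2 = ([1], [6, 7])

print(sorted(guest_apqns(*guest1)))           # [(1, 5), (1, 6), (2, 5), (2, 6)]
print(sorted(shared_apqns(guest1, guest2)))   # [(1, 6)]
```

An empty intersection for every pair of guests corresponds to a valid configuration; a non-empty one corresponds to the invalid case.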
- -The CPU model features for AP ------------------------------ -The AP stack relies on the presence of the AP instructions as well as two -facilities: The AP Facilities Test (APFT) facility; and the AP Query -Configuration Information (QCI) facility. These features/facilities are made -available to a KVM guest via the following CPU model features: - -1. ap: Indicates whether the AP instructions are installed on the guest. This - feature will be enabled by KVM only if the AP instructions are installed - on the host. - -2. apft: Indicates the APFT facility is available on the guest. This facility - can be made available to the guest only if it is available on the host (i.e., - facility bit 15 is set). - -3. apqci: Indicates the AP QCI facility is available on the guest. This facility - can be made available to the guest only if it is available on the host (i.e., - facility bit 12 is set). - -Note: If the user chooses to specify a CPU model different than the 'host' -model to QEMU, the CPU model features and facilities need to be turned on -explicitly; for example: - - /usr/bin/qemu-system-s390x ... -cpu z13,ap=on,apqci=on,apft=on - -A guest can be precluded from using AP features/facilities by turning them off -explicitly; for example: - - /usr/bin/qemu-system-s390x ... -cpu host,ap=off,apqci=off,apft=off - -Note: If the APFT facility is turned off (apft=off) for the guest, the guest -will not see any AP devices. The zcrypt device drivers that register for type 10 -and newer AP devices - i.e., the cex4card and cex4queue device drivers - need -the APFT facility to ascertain the facilities installed on a given AP device. If -the APFT facility is not installed on the guest, then the probe of device -drivers will fail since only type 10 and newer devices can be configured for -guest use. - -Example: -======= -Let's now provide an example to illustrate how KVM guests may be given -access to AP facilities. 
For this example, we will show how to configure -three guests such that executing the lszcrypt command on the guests would -look like this: - -Guest1 ------- -CARD.DOMAIN TYPE MODE ------------------------------- -05 CEX5C CCA-Coproc -05.0004 CEX5C CCA-Coproc -05.00ab CEX5C CCA-Coproc -06 CEX5A Accelerator -06.0004 CEX5A Accelerator -06.00ab CEX5A Accelerator - -Guest2 ------- -CARD.DOMAIN TYPE MODE ------------------------------- -05 CEX5A Accelerator -05.0047 CEX5A Accelerator -05.00ff CEX5A Accelerator - -Guest3 ------- -CARD.DOMAIN TYPE MODE ------------------------------- -06 CEX5A Accelerator -06.0047 CEX5A Accelerator -06.00ff CEX5A Accelerator - -These are the steps: - -1. Install the vfio_ap module on the linux host. The dependency chain for the - vfio_ap module is: - * iommu - * s390 - * zcrypt - * vfio - * vfio_mdev - * vfio_mdev_device - * KVM - - To build the vfio_ap module, the kernel build must be configured with the - following Kconfig elements selected: - * IOMMU_SUPPORT - * S390 - * ZCRYPT - * S390_AP_IOMMU - * VFIO - * VFIO_MDEV - * VFIO_MDEV_DEVICE - * KVM - - If using make menuconfig select the following to build the vfio_ap module: - -> Device Drivers - -> IOMMU Hardware Support - select S390 AP IOMMU Support - -> VFIO Non-Privileged userspace driver framework - -> Mediated device driver framework - -> VFIO driver for Mediated devices - -> I/O subsystem - -> VFIO support for AP devices - -2. Secure the AP queues to be used by the three guests so that the host cannot - access them. Two sysfs files specify bitmasks that mark a subset of the - APQN range as either 'usable by the default AP queue device drivers' or - 'not usable by the default device drivers' and thus available for use by - the vfio_ap device driver. The sysfs files containing the masks are - located at: - - /sys/bus/ap/apmask - /sys/bus/ap/aqmask - - The 'apmask' is a 256-bit mask that identifies a set of AP adapter IDs - (APID). 
Each bit in the mask, from left to right (i.e., from most significant - to least significant bit in big endian order), corresponds to an APID from - 0-255. If a bit is set, the APID is marked as usable only by the default AP - queue device drivers; otherwise, the APID is usable by the vfio_ap - device driver. - - The 'aqmask' is a 256-bit mask that identifies a set of AP queue indexes - (APQI). Each bit in the mask, from left to right (i.e., from most significant - to least significant bit in big endian order), corresponds to an APQI from - 0-255. If a bit is set, the APQI is marked as usable only by the default AP - queue device drivers; otherwise, the APQI is usable by the vfio_ap device - driver. - - Take, for example, the following mask: - - 0x7dffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff - - It indicates: - - 1, 2, 3, 4, 5, and 7-255 belong to the default drivers' pool, and 0 and 6 - belong to the vfio_ap device driver's pool. - - The APQN of each AP queue device assigned to the linux host is checked by the - AP bus against the set of APQNs derived from the cross product of APIDs - and APQIs marked as usable only by the default AP queue device drivers. If a - match is detected, only the default AP queue device drivers will be probed; - otherwise, the vfio_ap device driver will be probed. - - By default, the two masks are set to reserve all APQNs for use by the default - AP queue device drivers. There are two ways the default masks can be changed: - - 1. The sysfs mask files can be edited by echoing a string into the - respective sysfs mask file in one of two formats: - - * An absolute hex string starting with 0x - like "0x12345678" - sets - the mask. 
If the given string is shorter than the mask, it is padded - with 0s on the right; for example, specifying a mask value of 0x41 is - the same as specifying: - - 0x4100000000000000000000000000000000000000000000000000000000000000 - - Keep in mind that the mask reads from left to right (i.e., most - significant to least significant bit in big endian order), so the mask - above identifies device numbers 1 and 7 (01000001). - - If the string is longer than the mask, the operation is terminated with - an error (EINVAL). - - * Individual bits in the mask can be switched on and off by specifying - each bit number to be switched in a comma separated list. Each bit - number string must be prefixed with a plus ('+') or minus ('-') to indicate - the corresponding bit is to be switched on ('+') or off ('-'). Some - valid values are: - - "+0" switches bit 0 on - "-13" switches bit 13 off - "+0x41" switches bit 65 on - "-0xff" switches bit 255 off - - The following example: - +0,-6,+0x47,-0xf0 - - Switches bits 0 and 71 (0x47) on - Switches bits 6 and 240 (0xf0) off - - Note that the bits not specified in the list remain as they were before - the operation. - - 2. 
The masks can also be changed at boot time via parameters on the kernel - command line like this: - - ap.apmask=0xffff ap.aqmask=0x40 - - This would create the following masks: - - apmask: - 0xffff000000000000000000000000000000000000000000000000000000000000 - - aqmask: - 0x4000000000000000000000000000000000000000000000000000000000000000 - - Resulting in these two pools: - - default drivers pool: adapter 0-15, domain 1 - alternate drivers pool: adapter 16-255, domains 0, 2-255 - - Securing the APQNs for our example: - ---------------------------------- - To secure the AP queues 05.0004, 05.0047, 05.00ab, 05.00ff, 06.0004, 06.0047, - 06.00ab, and 06.00ff for use by the vfio_ap device driver, the corresponding - APQNs can either be removed from the default masks: - - echo -5,-6 > /sys/bus/ap/apmask - - echo -4,-0x47,-0xab,-0xff > /sys/bus/ap/aqmask - - Or the masks can be set as follows: - - echo 0xf9ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff \ - > apmask - - echo 0xf7fffffffffffffffeffffffffffffffffffffffffeffffffffffffffffffffe \ - > aqmask - - This will result in AP queues 05.0004, 05.0047, 05.00ab, 05.00ff, 06.0004, - 06.0047, 06.00ab, and 06.00ff getting bound to the vfio_ap device driver. The - sysfs directory for the vfio_ap device driver will now contain symbolic links - to the AP queue devices bound to it: - - /sys/bus/ap - ... [drivers] - ...... [vfio_ap] - ......... [05.0004] - ......... [05.0047] - ......... [05.00ab] - ......... [05.00ff] - ......... [06.0004] - ......... [06.0047] - ......... [06.00ab] - ......... [06.00ff] - - Keep in mind that only type 10 and newer adapters (i.e., CEX4 and later) - can be bound to the vfio_ap device driver. The reason for this is to - simplify the implementation by not needlessly complicating the design by - supporting older devices that will go out of service in the relatively near - future and for which there are few older systems on which to test. 
- - The administrator, therefore, must take care to secure only AP queues that - can be bound to the vfio_ap device driver. The device type for a given AP - queue device can be read from the parent card's sysfs directory. For example, - to see the hardware type of the queue 05.0004: - - cat /sys/bus/ap/devices/card05/hwtype - - The hwtype must be 10 or higher (CEX4 or newer) in order to be bound to the - vfio_ap device driver. - -3. Create the mediated devices needed to configure the AP matrixes for the - three guests and to provide an interface to the vfio_ap driver for - use by the guests: - - /sys/devices/vfio_ap/matrix/ - --- [mdev_supported_types] - ------ [vfio_ap-passthrough] (passthrough mediated matrix device type) - --------- create - --------- [devices] - - To create the mediated devices for the three guests: - - uuidgen > create - uuidgen > create - uuidgen > create - - or - - echo $uuid1 > create - echo $uuid2 > create - echo $uuid3 > create - - This will create three mediated devices in the [devices] subdirectory named - after the UUID written to the create attribute file. 
We call them $uuid1, - $uuid2 and $uuid3 and this is the sysfs directory structure after creation: - - /sys/devices/vfio_ap/matrix/ - --- [mdev_supported_types] - ------ [vfio_ap-passthrough] - --------- [devices] - ------------ [$uuid1] - --------------- assign_adapter - --------------- assign_control_domain - --------------- assign_domain - --------------- matrix - --------------- unassign_adapter - --------------- unassign_control_domain - --------------- unassign_domain - - ------------ [$uuid2] - --------------- assign_adapter - --------------- assign_control_domain - --------------- assign_domain - --------------- matrix - --------------- unassign_adapter - --------------- unassign_control_domain - --------------- unassign_domain - - ------------ [$uuid3] - --------------- assign_adapter - --------------- assign_control_domain - --------------- assign_domain - --------------- matrix - --------------- unassign_adapter - --------------- unassign_control_domain - --------------- unassign_domain - -4. The administrator now needs to configure the matrixes for the mediated - devices $uuid1 (for Guest1), $uuid2 (for Guest2) and $uuid3 (for Guest3). - - This is how the matrix is configured for Guest1: - - echo 5 > assign_adapter - echo 6 > assign_adapter - echo 4 > assign_domain - echo 0xab > assign_domain - - Control domains can similarly be assigned using the assign_control_domain - sysfs file. - - If a mistake is made configuring an adapter, domain or control domain, - you can use the unassign_xxx files to unassign the adapter, domain or - control domain. 
- - To display the matrix configuration for Guest1: - - cat matrix - - This is how the matrix is configured for Guest2: - - echo 5 > assign_adapter - echo 0x47 > assign_domain - echo 0xff > assign_domain - - This is how the matrix is configured for Guest3: - - echo 6 > assign_adapter - echo 0x47 > assign_domain - echo 0xff > assign_domain - - In order to successfully assign an adapter: - - * The adapter number specified must represent a value from 0 up to the - maximum adapter number configured for the system. If an adapter number - higher than the maximum is specified, the operation will terminate with - an error (ENODEV). - - * All APQNs that can be derived from the adapter ID and the IDs of - the previously assigned domains must be bound to the vfio_ap device - driver. If no domains have yet been assigned, then there must be at least - one APQN with the specified APID bound to the vfio_ap driver. If no such - APQNs are bound to the driver, the operation will terminate with an - error (EADDRNOTAVAIL). - - * No APQN that can be derived from the adapter ID and the IDs of the - previously assigned domains can be assigned to another mediated matrix - device. If an APQN is assigned to another mediated matrix device, the - operation will terminate with an error (EADDRINUSE). - - In order to successfully assign a domain: - - * The domain number specified must represent a value from 0 up to the - maximum domain number configured for the system. If a domain number - higher than the maximum is specified, the operation will terminate with - an error (ENODEV). - - * All APQNs that can be derived from the domain ID and the IDs of - the previously assigned adapters must be bound to the vfio_ap device - driver. If no adapters have yet been assigned, then there must be at least - one APQN with the specified APQI bound to the vfio_ap driver. If no such - APQNs are bound to the driver, the operation will terminate with an - error (EADDRNOTAVAIL). 
- - * No APQN that can be derived from the domain ID and the IDs of the - previously assigned adapters can be assigned to another mediated matrix - device. If an APQN is assigned to another mediated matrix device, the - operation will terminate with an error (EADDRINUSE). - - In order to successfully assign a control domain, the domain number - specified must represent a value from 0 up to the maximum domain number - configured for the system. If a control domain number higher than the maximum - is specified, the operation will terminate with an error (ENODEV). - -5. Start Guest1: - - /usr/bin/qemu-system-s390x ... -cpu host,ap=on,apqci=on,apft=on \ - -device vfio-ap,sysfsdev=/sys/devices/vfio_ap/matrix/$uuid1 ... - -6. Start Guest2: - - /usr/bin/qemu-system-s390x ... -cpu host,ap=on,apqci=on,apft=on \ - -device vfio-ap,sysfsdev=/sys/devices/vfio_ap/matrix/$uuid2 ... - -7. Start Guest3: - - /usr/bin/qemu-system-s390x ... -cpu host,ap=on,apqci=on,apft=on \ - -device vfio-ap,sysfsdev=/sys/devices/vfio_ap/matrix/$uuid3 ... - -When the guest is shut down, the mediated matrix devices may be removed. - -Using our example again, to remove the mediated matrix device $uuid1: - - /sys/devices/vfio_ap/matrix/ - --- [mdev_supported_types] - ------ [vfio_ap-passthrough] - --------- [devices] - ------------ [$uuid1] - --------------- remove - - - echo 1 > remove - - This will remove all of the mdev matrix device's sysfs structures including - the mdev device itself. To recreate and reconfigure the mdev matrix device, - all of the steps starting with step 3 will have to be performed again. Note - that the remove will fail if a guest using the mdev is still running. - - It is not necessary to remove an mdev matrix device, but one may want to - remove it if no guest will use it during the remaining lifetime of the linux - host. If the mdev matrix device is removed, one may want to also reconfigure - the pool of adapters and queues reserved for use by the default drivers. 
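The apmask/aqmask syntax used in step 2 of the example above (an absolute hex string padded with zeros on the right, or a comma separated list of '+'/'-' bit edits) can be modelled with the following Python sketch. It is only an approximation of the rules as stated in this document, not the AP bus parser itself; the function names are hypothetical, and ValueError stands in for the EINVAL the kernel would return.

```python
MASK_BITS = 256
HEX_DIGITS = MASK_BITS // 4


def parse_absolute(text):
    """Parse an absolute mask such as '0x41'; short strings are padded on the right."""
    if not text.lower().startswith("0x"):
        raise ValueError("absolute mask must start with 0x")
    digits = text[2:]
    if len(digits) > HEX_DIGITS:
        raise ValueError("mask string is longer than the mask")
    return int(digits.ljust(HEX_DIGITS, "0"), 16)


def apply_relative(mask, edits):
    """Apply edits such as '+0,-6,+0x47'; bit 0 is the leftmost bit of the mask."""
    for item in edits.split(","):
        sign, number = item[:1], item[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"missing '+' or '-' in {item!r}")
        base = 16 if number.lower().startswith("0x") else 10
        bit_number = int(number, base)
        if not 0 <= bit_number < MASK_BITS:
            raise ValueError(f"bit {bit_number} is outside the mask")
        bit = 1 << (MASK_BITS - 1 - bit_number)
        mask = mask | bit if sign == "+" else mask & ~bit
    return mask


def format_mask(mask):
    """Render a mask as the 64-digit hex string used in the sysfs files."""
    return f"0x{mask:064x}"


default_mask = (1 << MASK_BITS) - 1   # all IDs reserved for the default drivers
print(format_mask(apply_relative(default_mask, "-5,-6")))
```

Starting from the default all-ones mask, removing adapters 5 and 6 clears the sixth and seventh bits from the left, which is why the resulting string begins with 0xf9.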
- -Limitations -=========== -* The KVM/kernel interfaces do not provide a way to prevent an APQN from - being restored to the default drivers pool while its queue is still assigned - to a mediated device in use by a guest. It is incumbent upon the administrator - to ensure there is no mediated device in use by a guest to which the APQN is - assigned lest the host be given access to the private data of the AP queue - device such as a private key configured specifically for the guest. - -* Dynamically modifying the AP matrix for a running guest (which would amount to - hot(un)plug of AP devices for the guest) is currently not supported. - -* Live guest migration is not supported for guests using AP devices. diff --git a/Documentation/s390/vfio-ccw.rst b/Documentation/s390/vfio-ccw.rst new file mode 100644 index 000000000000..1f6d0b56d53e --- /dev/null +++ b/Documentation/s390/vfio-ccw.rst @@ -0,0 +1,326 @@ +================================== +vfio-ccw: the basic infrastructure +================================== + +Introduction +------------ + +Here we describe the vfio support for I/O subchannel devices for +Linux/s390. The motivation for vfio-ccw is to pass subchannels through to +a virtual machine, while vfio is the means. + +Unlike other hardware architectures, s390 has defined a unified +I/O access method, the so-called Channel I/O. It has its own access +patterns: + +- Channel programs run asynchronously on a separate (co)processor. +- The channel subsystem will access any memory designated by the caller + in the channel program directly, i.e. there is no iommu involved. + +Thus when we introduce vfio support for these devices, we realize it +with a mediated device (mdev) implementation. The vfio mdev will be +added to an iommu group, so that it can be managed by the +vfio framework. 
And we add read/write callbacks for special vfio I/O +regions to pass the channel programs from the mdev to its parent device +(the real I/O subchannel device) to do further address translation and +to perform I/O instructions. + +This document does not intend to explain the s390 I/O architecture in +every detail. More information/reference could be found here: + +- A good start to know Channel I/O in general: + https://en.wikipedia.org/wiki/Channel_I/O +- s390 architecture: + s390 Principles of Operation manual (IBM Form. No. SA22-7832) +- The existing QEMU code which implements a simple emulated channel + subsystem could also be a good reference. It makes it easier to follow + the flow. + qemu/hw/s390x/css.c + +For vfio mediated device framework: +- Documentation/vfio-mediated-device.txt + +Motivation of vfio-ccw +---------------------- + +Typically, a guest virtualized via QEMU/KVM on s390 only sees +paravirtualized virtio devices via the "Virtio Over Channel I/O +(virtio-ccw)" transport. This makes virtio devices discoverable via +standard operating system algorithms for handling channel devices. + +However this is not enough. On s390 for the majority of devices, which +use the standard Channel I/O based mechanism, we also need to provide +the functionality of passing through them to a QEMU virtual machine. +This includes devices that don't have a virtio counterpart (e.g. tape +drives) or that have specific characteristics which guests want to +exploit. + +For passing a device to a guest, we want to use the same interface as +everybody else, namely vfio. We implement this vfio support for channel +devices via the vfio mediated device framework and the subchannel device +driver "vfio_ccw". + +Access patterns of CCW devices +------------------------------ + +s390 architecture has implemented a so called channel subsystem, that +provides a unified view of the devices physically attached to the +systems. 
Though the s390 hardware platform knows about a huge variety of +different peripheral attachments like disk devices (aka. DASDs), tapes, +communication controllers, etc., they can all be accessed by a +well-defined access method, and they present I/O completion in a unified +way: I/O interruptions. + +All I/O requires the use of channel command words (CCWs). A CCW is an +instruction to a specialized I/O channel processor. A channel program is +a sequence of CCWs which are executed by the I/O channel subsystem. To +issue a channel program to the channel subsystem, it is required to +build an operation request block (ORB), which can be used to point out +the format of the CCW and other control information to the system. The +operating system signals the I/O channel subsystem to begin executing +the channel program with a SSCH (start sub-channel) instruction. The +central processor is then free to proceed with non-I/O instructions +until interrupted. The I/O completion result is received by the +interrupt handler in the form of an interrupt response block (IRB). + +Back to vfio-ccw, in short: + +- ORBs and channel programs are built in the guest kernel (with guest + physical addresses). +- ORBs and channel programs are passed to the host kernel. +- The host kernel translates the guest physical addresses to real addresses + and starts the I/O by issuing a privileged Channel I/O instruction + (e.g. SSCH). +- Channel programs run asynchronously on a separate processor. +- I/O completion will be signaled to the host with I/O interruptions. + And it will be copied as IRB to user space to pass it back to the + guest. + +Physical vfio ccw device and its child mdev +------------------------------------------- + +As mentioned above, we realize vfio-ccw with a mdev implementation. + +Channel I/O does not have IOMMU hardware support, so the physical +vfio-ccw device does not have an IOMMU level translation or isolation. + +Subchannel I/O instructions are all privileged instructions. 
When +handling the I/O instruction interception, vfio-ccw polices and +translates the channel program in software before +it gets sent to hardware. + +Within this implementation, we have two drivers for two types of +devices: + +- The vfio_ccw driver for the physical subchannel device. + This is an I/O subchannel driver for the real subchannel device. It + realizes a group of callbacks and registers to the mdev framework as a + parent (physical) device. As a consequence, mdev provides vfio_ccw a + generic interface (sysfs) to create mdev devices. A vfio mdev could be + created by vfio_ccw then and added to the mediated bus. It is the vfio + device that is added to an IOMMU group and a vfio group. + vfio_ccw also provides an I/O region to accept channel program + requests from user space and store I/O interrupt results for user + space to retrieve. To notify user space of an I/O completion, it offers + an interface to set up an eventfd for asynchronous signaling. + +- The vfio_mdev driver for the mediated vfio ccw device. + This is provided by the mdev framework. It is a vfio device driver for + the mdev that is created by vfio_ccw. + It realizes a group of vfio device driver callbacks, adds itself to a + vfio group, and registers itself to the mdev framework as an mdev + driver. + It uses a vfio iommu backend that uses the existing map and unmap + ioctls, but rather than programming them into an IOMMU for a device, + it simply stores the translations for use by later requests. This + means that a device programmed in a VM with guest physical addresses + can have the vfio kernel convert that address to a process virtual + address, pin the page and program the hardware with the host physical + address in one step. + For a mdev, the vfio iommu backend will not pin the pages during the + VFIO_IOMMU_MAP_DMA ioctl. The mdev framework will only maintain a database + of the iova<->vaddr mappings in this operation. 
And they export a + vfio_pin_pages and a vfio_unpin_pages interfaces from the vfio iommu + backend for the physical devices to pin and unpin pages by demand. + +Below is a high Level block diagram:: + + +-------------+ + | | + | +---------+ | mdev_register_driver() +--------------+ + | | Mdev | +<-----------------------+ | + | | bus | | | vfio_mdev.ko | + | | driver | +----------------------->+ |<-> VFIO user + | +---------+ | probe()/remove() +--------------+ APIs + | | + | MDEV CORE | + | MODULE | + | mdev.ko | + | +---------+ | mdev_register_device() +--------------+ + | |Physical | +<-----------------------+ | + | | device | | | vfio_ccw.ko |<-> subchannel + | |interface| +----------------------->+ | device + | +---------+ | callback +--------------+ + +-------------+ + +The process of how these work together. + +1. vfio_ccw.ko drives the physical I/O subchannel, and registers the + physical device (with callbacks) to mdev framework. + When vfio_ccw probing the subchannel device, it registers device + pointer and callbacks to the mdev framework. Mdev related file nodes + under the device node in sysfs would be created for the subchannel + device, namely 'mdev_create', 'mdev_destroy' and + 'mdev_supported_types'. +2. Create a mediated vfio ccw device. + Use the 'mdev_create' sysfs file, we need to manually create one (and + only one for our case) mediated device. +3. vfio_mdev.ko drives the mediated ccw device. + vfio_mdev is also the vfio device drvier. It will probe the mdev and + add it to an iommu_group and a vfio_group. Then we could pass through + the mdev to a guest. + +vfio-ccw I/O region +------------------- + +An I/O region is used to accept channel program request from user +space and store I/O interrupt result for user space to retrieve. 
The +definition of the region is:: + + struct ccw_io_region { + #define ORB_AREA_SIZE 12 + __u8 orb_area[ORB_AREA_SIZE]; + #define SCSW_AREA_SIZE 12 + __u8 scsw_area[SCSW_AREA_SIZE]; + #define IRB_AREA_SIZE 96 + __u8 irb_area[IRB_AREA_SIZE]; + __u32 ret_code; + } __packed; + +While starting an I/O request, orb_area should be filled with the +guest ORB, and scsw_area should be filled with the SCSW of the Virtual +Subchannel. + +irb_area stores the I/O result. + +ret_code stores a return code for each access of the region. + +vfio-ccw operation details +-------------------------- + +vfio-ccw follows what vfio-pci did on the s390 platform and uses +vfio-iommu-type1 as the vfio iommu backend. + +* CCW translation APIs + A group of APIs (start with `cp_`) to do CCW translation. The CCWs + passed in by a user space program are organized with their guest + physical memory addresses. These APIs will copy the CCWs into kernel + space, and assemble a runnable kernel channel program by updating the + guest physical addresses with their corresponding host physical addresses. + Note that we have to use IDALs even for direct-access CCWs, as the + referenced memory can be located anywhere, including above 2G. + +* vfio_ccw device driver + This driver utilizes the CCW translation APIs and introduces + vfio_ccw, which is the driver for the I/O subchannel devices you want + to pass through. + vfio_ccw implements the following vfio ioctls:: + + VFIO_DEVICE_GET_INFO + VFIO_DEVICE_GET_IRQ_INFO + VFIO_DEVICE_GET_REGION_INFO + VFIO_DEVICE_RESET + VFIO_DEVICE_SET_IRQS + + This provides an I/O region, so that the user space program can pass a + channel program to the kernel, to do further CCW translation before + issuing them to a real device. + This also provides the SET_IRQ ioctl to setup an event notifier to + notify the user space program the I/O completion in an asynchronous + way. 
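As a rough illustration of the I/O region described in the hunk above (not part of the patch itself), the layout and the "fill orb_area and scsw_area before starting a request" step can be sketched in plain user-space C. The fixed-width types stand in for the kernel's `__u8`/`__u32`, the compiler attribute stands in for `__packed`, and `ccw_io_region_prepare()` is a hypothetical helper name; zeroing `irb_area` and `ret_code` before a request is an assumption of this sketch, not something the documentation requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ORB_AREA_SIZE  12
#define SCSW_AREA_SIZE 12
#define IRB_AREA_SIZE  96

/*
 * Mirrors the region definition quoted in the documentation.
 * uint8_t/uint32_t replace __u8/__u32 and the GCC/Clang attribute
 * replaces the kernel's __packed macro (both are stand-ins).
 */
struct ccw_io_region {
	uint8_t  orb_area[ORB_AREA_SIZE];   /* guest ORB, written by user space */
	uint8_t  scsw_area[SCSW_AREA_SIZE]; /* SCSW of the virtual subchannel */
	uint8_t  irb_area[IRB_AREA_SIZE];   /* I/O result, filled by the kernel */
	uint32_t ret_code;                  /* return code for each region access */
} __attribute__((packed));

/* 12 + 12 + 96 + 4 bytes, with no padding because the struct is packed. */
_Static_assert(sizeof(struct ccw_io_region) == 124,
	       "unexpected ccw_io_region size");

/*
 * Hypothetical helper: prepare a region image for a new I/O request.
 * Clearing the result fields first is an assumption made for this sketch.
 */
static void ccw_io_region_prepare(struct ccw_io_region *region,
				  const void *orb, const void *scsw)
{
	assert(region && orb && scsw);
	memset(region, 0, sizeof(*region));
	memcpy(region->orb_area, orb, ORB_AREA_SIZE);
	memcpy(region->scsw_area, scsw, SCSW_AREA_SIZE);
}
```

A user space program such as QEMU would then presumably write the prepared image to the vfio device file descriptor at the offset reported by VFIO_DEVICE_GET_REGION_INFO, and later read the region back to pick up `irb_area` and `ret_code` once the completion has been signaled.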
+ +The use of vfio-ccw is not limited to QEMU, while QEMU is definitely a +good example to get understand how these patches work. Here is a little +bit more detail how an I/O request triggered by the QEMU guest will be +handled (without error handling). + +Explanation: + +- Q1-Q7: QEMU side process. +- K1-K5: Kernel side process. + +Q1. + Get I/O region info during initialization. + +Q2. + Setup event notifier and handler to handle I/O completion. + +... ... + +Q3. + Intercept a ssch instruction. +Q4. + Write the guest channel program and ORB to the I/O region. + + K1. + Copy from guest to kernel. + K2. + Translate the guest channel program to a host kernel space + channel program, which becomes runnable for a real device. + K3. + With the necessary information contained in the orb passed in + by QEMU, issue the ccwchain to the device. + K4. + Return the ssch CC code. +Q5. + Return the CC code to the guest. + +... ... + + K5. + Interrupt handler gets the I/O result and write the result to + the I/O region. + K6. + Signal QEMU to retrieve the result. + +Q6. + Get the signal and event handler reads out the result from the I/O + region. +Q7. + Update the irb for the guest. + +Limitations +----------- + +The current vfio-ccw implementation focuses on supporting basic commands +needed to implement block device functionality (read/write) of DASD/ECKD +device only. Some commands may need special handling in the future, for +example, anything related to path grouping. + +DASD is a kind of storage device. While ECKD is a data recording format. +More information for DASD and ECKD could be found here: +https://en.wikipedia.org/wiki/Direct-access_storage_device +https://en.wikipedia.org/wiki/Count_key_data + +Together with the corresponding work in QEMU, we can bring the passed +through DASD/ECKD device online in a guest now and use it as a block +device. 
+ +While the current code allows the guest to start channel programs via +START SUBCHANNEL, support for HALT SUBCHANNEL or CLEAR SUBCHANNEL is +not yet implemented. + +vfio-ccw supports classic (command mode) channel I/O only. Transport +mode (HPF) is not supported. + +QDIO subchannels are currently not supported. Classic devices other than +DASD/ECKD might work, but have not been tested. + +Reference +--------- +1. ESA/s390 Principles of Operation manual (IBM Form. No. SA22-7832) +2. ESA/390 Common I/O Device Commands manual (IBM Form. No. SA22-7204) +3. https://en.wikipedia.org/wiki/Channel_I/O +4. Documentation/s390/cds.rst +5. Documentation/vfio.txt +6. Documentation/vfio-mediated-device.txt diff --git a/Documentation/s390/vfio-ccw.txt b/Documentation/s390/vfio-ccw.txt deleted file mode 100644 index 2be11ad864ff..000000000000 --- a/Documentation/s390/vfio-ccw.txt +++ /dev/null @@ -1,300 +0,0 @@ -vfio-ccw: the basic infrastructure -================================== - -Introduction ------------- - -Here we describe the vfio support for I/O subchannel devices for -Linux/s390. Motivation for vfio-ccw is to passthrough subchannels to a -virtual machine, while vfio is the means. - -Different than other hardware architectures, s390 has defined a unified -I/O access method, which is so called Channel I/O. It has its own access -patterns: -- Channel programs run asynchronously on a separate (co)processor. -- The channel subsystem will access any memory designated by the caller - in the channel program directly, i.e. there is no iommu involved. -Thus when we introduce vfio support for these devices, we realize it -with a mediated device (mdev) implementation. The vfio mdev will be -added to an iommu group, so as to make itself able to be managed by the -vfio framework. 
And we add read/write callbacks for special vfio I/O -regions to pass the channel programs from the mdev to its parent device -(the real I/O subchannel device) to do further address translation and -to perform I/O instructions. - -This document does not intend to explain the s390 I/O architecture in -every detail. More information/reference could be found here: -- A good start to know Channel I/O in general: - https://en.wikipedia.org/wiki/Channel_I/O -- s390 architecture: - s390 Principles of Operation manual (IBM Form. No. SA22-7832) -- The existing QEMU code which implements a simple emulated channel - subsystem could also be a good reference. It makes it easier to follow - the flow. - qemu/hw/s390x/css.c - -For vfio mediated device framework: -- Documentation/vfio-mediated-device.txt - -Motivation of vfio-ccw ----------------------- - -Typically, a guest virtualized via QEMU/KVM on s390 only sees -paravirtualized virtio devices via the "Virtio Over Channel I/O -(virtio-ccw)" transport. This makes virtio devices discoverable via -standard operating system algorithms for handling channel devices. - -However this is not enough. On s390 for the majority of devices, which -use the standard Channel I/O based mechanism, we also need to provide -the functionality of passing through them to a QEMU virtual machine. -This includes devices that don't have a virtio counterpart (e.g. tape -drives) or that have specific characteristics which guests want to -exploit. - -For passing a device to a guest, we want to use the same interface as -everybody else, namely vfio. We implement this vfio support for channel -devices via the vfio mediated device framework and the subchannel device -driver "vfio_ccw". - -Access patterns of CCW devices ------------------------------- - -s390 architecture has implemented a so called channel subsystem, that -provides a unified view of the devices physically attached to the -systems. 
Though the s390 hardware platform knows about a huge variety of -different peripheral attachments like disk devices (aka. DASDs), tapes, -communication controllers, etc. They can all be accessed by a well -defined access method and they are presenting I/O completion a unified -way: I/O interruptions. - -All I/O requires the use of channel command words (CCWs). A CCW is an -instruction to a specialized I/O channel processor. A channel program is -a sequence of CCWs which are executed by the I/O channel subsystem. To -issue a channel program to the channel subsystem, it is required to -build an operation request block (ORB), which can be used to point out -the format of the CCW and other control information to the system. The -operating system signals the I/O channel subsystem to begin executing -the channel program with a SSCH (start sub-channel) instruction. The -central processor is then free to proceed with non-I/O instructions -until interrupted. The I/O completion result is received by the -interrupt handler in the form of interrupt response block (IRB). - -Back to vfio-ccw, in short: -- ORBs and channel programs are built in guest kernel (with guest - physical addresses). -- ORBs and channel programs are passed to the host kernel. -- Host kernel translates the guest physical addresses to real addresses - and starts the I/O with issuing a privileged Channel I/O instruction - (e.g SSCH). -- channel programs run asynchronously on a separate processor. -- I/O completion will be signaled to the host with I/O interruptions. - And it will be copied as IRB to user space to pass it back to the - guest. - -Physical vfio ccw device and its child mdev -------------------------------------------- - -As mentioned above, we realize vfio-ccw with a mdev implementation. - -Channel I/O does not have IOMMU hardware support, so the physical -vfio-ccw device does not have an IOMMU level translation or isolation. - -Subchannel I/O instructions are all privileged instructions. 
When -handling the I/O instruction interception, vfio-ccw has the software -policing and translation how the channel program is programmed before -it gets sent to hardware. - -Within this implementation, we have two drivers for two types of -devices: -- The vfio_ccw driver for the physical subchannel device. - This is an I/O subchannel driver for the real subchannel device. It - realizes a group of callbacks and registers to the mdev framework as a - parent (physical) device. As a consequence, mdev provides vfio_ccw a - generic interface (sysfs) to create mdev devices. A vfio mdev could be - created by vfio_ccw then and added to the mediated bus. It is the vfio - device that added to an IOMMU group and a vfio group. - vfio_ccw also provides an I/O region to accept channel program - request from user space and store I/O interrupt result for user - space to retrieve. To notify user space an I/O completion, it offers - an interface to setup an eventfd fd for asynchronous signaling. - -- The vfio_mdev driver for the mediated vfio ccw device. - This is provided by the mdev framework. It is a vfio device driver for - the mdev that created by vfio_ccw. - It realizes a group of vfio device driver callbacks, adds itself to a - vfio group, and registers itself to the mdev framework as a mdev - driver. - It uses a vfio iommu backend that uses the existing map and unmap - ioctls, but rather than programming them into an IOMMU for a device, - it simply stores the translations for use by later requests. This - means that a device programmed in a VM with guest physical addresses - can have the vfio kernel convert that address to process virtual - address, pin the page and program the hardware with the host physical - address in one step. - For a mdev, the vfio iommu backend will not pin the pages during the - VFIO_IOMMU_MAP_DMA ioctl. Mdev framework will only maintain a database - of the iova<->vaddr mappings in this operation. 
And they export a - vfio_pin_pages and a vfio_unpin_pages interfaces from the vfio iommu - backend for the physical devices to pin and unpin pages by demand. - -Below is a high Level block diagram. - - +-------------+ - | | - | +---------+ | mdev_register_driver() +--------------+ - | | Mdev | +<-----------------------+ | - | | bus | | | vfio_mdev.ko | - | | driver | +----------------------->+ |<-> VFIO user - | +---------+ | probe()/remove() +--------------+ APIs - | | - | MDEV CORE | - | MODULE | - | mdev.ko | - | +---------+ | mdev_register_device() +--------------+ - | |Physical | +<-----------------------+ | - | | device | | | vfio_ccw.ko |<-> subchannel - | |interface| +----------------------->+ | device - | +---------+ | callback +--------------+ - +-------------+ - -The process of how these work together. -1. vfio_ccw.ko drives the physical I/O subchannel, and registers the - physical device (with callbacks) to mdev framework. - When vfio_ccw probing the subchannel device, it registers device - pointer and callbacks to the mdev framework. Mdev related file nodes - under the device node in sysfs would be created for the subchannel - device, namely 'mdev_create', 'mdev_destroy' and - 'mdev_supported_types'. -2. Create a mediated vfio ccw device. - Use the 'mdev_create' sysfs file, we need to manually create one (and - only one for our case) mediated device. -3. vfio_mdev.ko drives the mediated ccw device. - vfio_mdev is also the vfio device drvier. It will probe the mdev and - add it to an iommu_group and a vfio_group. Then we could pass through - the mdev to a guest. - -vfio-ccw I/O region -------------------- - -An I/O region is used to accept channel program request from user -space and store I/O interrupt result for user space to retrieve. 
The -definition of the region is: - -struct ccw_io_region { -#define ORB_AREA_SIZE 12 - __u8 orb_area[ORB_AREA_SIZE]; -#define SCSW_AREA_SIZE 12 - __u8 scsw_area[SCSW_AREA_SIZE]; -#define IRB_AREA_SIZE 96 - __u8 irb_area[IRB_AREA_SIZE]; - __u32 ret_code; -} __packed; - -While starting an I/O request, orb_area should be filled with the -guest ORB, and scsw_area should be filled with the SCSW of the Virtual -Subchannel. - -irb_area stores the I/O result. - -ret_code stores a return code for each access of the region. - -vfio-ccw operation details --------------------------- - -vfio-ccw follows what vfio-pci did on the s390 platform and uses -vfio-iommu-type1 as the vfio iommu backend. - -* CCW translation APIs - A group of APIs (start with 'cp_') to do CCW translation. The CCWs - passed in by a user space program are organized with their guest - physical memory addresses. These APIs will copy the CCWs into kernel - space, and assemble a runnable kernel channel program by updating the - guest physical addresses with their corresponding host physical addresses. - Note that we have to use IDALs even for direct-access CCWs, as the - referenced memory can be located anywhere, including above 2G. - -* vfio_ccw device driver - This driver utilizes the CCW translation APIs and introduces - vfio_ccw, which is the driver for the I/O subchannel devices you want - to pass through. - vfio_ccw implements the following vfio ioctls: - VFIO_DEVICE_GET_INFO - VFIO_DEVICE_GET_IRQ_INFO - VFIO_DEVICE_GET_REGION_INFO - VFIO_DEVICE_RESET - VFIO_DEVICE_SET_IRQS - This provides an I/O region, so that the user space program can pass a - channel program to the kernel, to do further CCW translation before - issuing them to a real device. - This also provides the SET_IRQ ioctl to setup an event notifier to - notify the user space program the I/O completion in an asynchronous - way. 
- -The use of vfio-ccw is not limited to QEMU, while QEMU is definitely a -good example to get understand how these patches work. Here is a little -bit more detail how an I/O request triggered by the QEMU guest will be -handled (without error handling). - -Explanation: -Q1-Q7: QEMU side process. -K1-K5: Kernel side process. - -Q1. Get I/O region info during initialization. -Q2. Setup event notifier and handler to handle I/O completion. - -... ... - -Q3. Intercept a ssch instruction. -Q4. Write the guest channel program and ORB to the I/O region. - K1. Copy from guest to kernel. - K2. Translate the guest channel program to a host kernel space - channel program, which becomes runnable for a real device. - K3. With the necessary information contained in the orb passed in - by QEMU, issue the ccwchain to the device. - K4. Return the ssch CC code. -Q5. Return the CC code to the guest. - -... ... - - K5. Interrupt handler gets the I/O result and write the result to - the I/O region. - K6. Signal QEMU to retrieve the result. -Q6. Get the signal and event handler reads out the result from the I/O - region. -Q7. Update the irb for the guest. - -Limitations ------------ - -The current vfio-ccw implementation focuses on supporting basic commands -needed to implement block device functionality (read/write) of DASD/ECKD -device only. Some commands may need special handling in the future, for -example, anything related to path grouping. - -DASD is a kind of storage device. While ECKD is a data recording format. -More information for DASD and ECKD could be found here: -https://en.wikipedia.org/wiki/Direct-access_storage_device -https://en.wikipedia.org/wiki/Count_key_data - -Together with the corresponding work in QEMU, we can bring the passed -through DASD/ECKD device online in a guest now and use it as a block -device. 
- -While the current code allows the guest to start channel programs via -START SUBCHANNEL, support for HALT SUBCHANNEL or CLEAR SUBCHANNEL is -not yet implemented. - -vfio-ccw supports classic (command mode) channel I/O only. Transport -mode (HPF) is not supported. - -QDIO subchannels are currently not supported. Classic devices other than -DASD/ECKD might work, but have not been tested. - -Reference ---------- -1. ESA/s390 Principles of Operation manual (IBM Form. No. SA22-7832) -2. ESA/390 Common I/O Device Commands manual (IBM Form. No. SA22-7204) -3. https://en.wikipedia.org/wiki/Channel_I/O -4. Documentation/s390/cds.txt -5. Documentation/vfio.txt -6. Documentation/vfio-mediated-device.txt diff --git a/Documentation/s390/zfcpdump.rst b/Documentation/s390/zfcpdump.rst new file mode 100644 index 000000000000..54e8e7caf7e7 --- /dev/null +++ b/Documentation/s390/zfcpdump.rst @@ -0,0 +1,50 @@ +================================== +The s390 SCSI dump tool (zfcpdump) +================================== + +System z machines (z900 or higher) provide hardware support for creating system +dumps on SCSI disks. The dump process is initiated by booting a dump tool, which +has to create a dump of the current (probably crashed) Linux image. In order to +not overwrite memory of the crashed Linux with data of the dump tool, the +hardware saves some memory plus the register sets of the boot CPU before the +dump tool is loaded. There exists an SCLP hardware interface to obtain the saved +memory afterwards. Currently 32 MB are saved. + +This zfcpdump implementation consists of a Linux dump kernel together with +a user space dump tool, which are loaded together into the saved memory region +below 32 MB. zfcpdump is installed on a SCSI disk using zipl (as contained in +the s390-tools package) to make the device bootable. The operator of a Linux +system can then trigger a SCSI dump by booting the SCSI disk, where zfcpdump +resides on. 
+ +The user space dump tool accesses the memory of the crashed system by means +of the /proc/vmcore interface. This interface exports the crashed system's +memory and registers in ELF core dump format. To access the memory which has +been saved by the hardware SCLP requests will be created at the time the data +is needed by /proc/vmcore. The tail part of the crashed systems memory which +has not been stashed by hardware can just be copied from real memory. + +To build a dump enabled kernel the kernel config option CONFIG_CRASH_DUMP +has to be set. + +To get a valid zfcpdump kernel configuration use "make zfcpdump_defconfig". + +The s390 zipl tool looks for the zfcpdump kernel and optional initrd/initramfs +under the following locations: + +* kernel: /zfcpdump.image +* ramdisk: /zfcpdump.rd + +The zfcpdump directory is defined in the s390-tools package. + +The user space application of zfcpdump can reside in an intitramfs or an +initrd. It can also be included in a built-in kernel initramfs. The application +reads from /proc/vmcore or zcore/mem and writes the system dump to a SCSI disk. + +The s390-tools package version 1.24.0 and above builds an external zfcpdump +initramfs with a user space application that writes the dump to a SCSI +partition. + +For more information on how to use zfcpdump refer to the s390 'Using the Dump +Tools book', which is available from +http://www.ibm.com/developerworks/linux/linux390. diff --git a/Documentation/s390/zfcpdump.txt b/Documentation/s390/zfcpdump.txt deleted file mode 100644 index b064aa59714d..000000000000 --- a/Documentation/s390/zfcpdump.txt +++ /dev/null @@ -1,48 +0,0 @@ -The s390 SCSI dump tool (zfcpdump) - -System z machines (z900 or higher) provide hardware support for creating system -dumps on SCSI disks. The dump process is initiated by booting a dump tool, which -has to create a dump of the current (probably crashed) Linux image. 
In order to -not overwrite memory of the crashed Linux with data of the dump tool, the -hardware saves some memory plus the register sets of the boot CPU before the -dump tool is loaded. There exists an SCLP hardware interface to obtain the saved -memory afterwards. Currently 32 MB are saved. - -This zfcpdump implementation consists of a Linux dump kernel together with -a user space dump tool, which are loaded together into the saved memory region -below 32 MB. zfcpdump is installed on a SCSI disk using zipl (as contained in -the s390-tools package) to make the device bootable. The operator of a Linux -system can then trigger a SCSI dump by booting the SCSI disk, where zfcpdump -resides on. - -The user space dump tool accesses the memory of the crashed system by means -of the /proc/vmcore interface. This interface exports the crashed system's -memory and registers in ELF core dump format. To access the memory which has -been saved by the hardware SCLP requests will be created at the time the data -is needed by /proc/vmcore. The tail part of the crashed systems memory which -has not been stashed by hardware can just be copied from real memory. - -To build a dump enabled kernel the kernel config option CONFIG_CRASH_DUMP -has to be set. - -To get a valid zfcpdump kernel configuration use "make zfcpdump_defconfig". - -The s390 zipl tool looks for the zfcpdump kernel and optional initrd/initramfs -under the following locations: - -* kernel: /zfcpdump.image -* ramdisk: /zfcpdump.rd - -The zfcpdump directory is defined in the s390-tools package. - -The user space application of zfcpdump can reside in an intitramfs or an -initrd. It can also be included in a built-in kernel initramfs. The application -reads from /proc/vmcore or zcore/mem and writes the system dump to a SCSI disk. - -The s390-tools package version 1.24.0 and above builds an external zfcpdump -initramfs with a user space application that writes the dump to a SCSI -partition. 
- -For more information on how to use zfcpdump refer to the s390 'Using the Dump -Tools book', which is available from -http://www.ibm.com/developerworks/linux/linux390. diff --git a/MAINTAINERS b/MAINTAINERS index a6954776a37e..0e904873fb0a 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -13703,7 +13703,7 @@ L: linux-s390@vger.kernel.org L: kvm@vger.kernel.org S: Supported F: drivers/s390/cio/vfio_ccw* -F: Documentation/s390/vfio-ccw.txt +F: Documentation/s390/vfio-ccw.rst F: include/uapi/linux/vfio_ccw.h S390 ZCRYPT DRIVER @@ -13723,7 +13723,7 @@ S: Supported F: drivers/s390/crypto/vfio_ap_drv.c F: drivers/s390/crypto/vfio_ap_private.h F: drivers/s390/crypto/vfio_ap_ops.c -F: Documentation/s390/vfio-ap.txt +F: Documentation/s390/vfio-ap.rst S390 ZFCP DRIVER M: Steffen Maier diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index 66be2d813951..65522d6956ca 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -810,9 +810,9 @@ config CRASH_DUMP Crash dump kernels are loaded in the main kernel with kexec-tools into a specially reserved region and then later executed after a crash by kdump/kexec. - Refer to for more details on this. + Refer to for more details on this. This option also enables s390 zfcpdump. - See also + See also endmenu diff --git a/arch/s390/include/asm/debug.h b/arch/s390/include/asm/debug.h index c305d39f5016..b94783f71322 100644 --- a/arch/s390/include/asm/debug.h +++ b/arch/s390/include/asm/debug.h @@ -152,7 +152,7 @@ static inline debug_entry_t *debug_text_event(debug_info_t *id, int level, /* * IMPORTANT: Use "%s" in sprintf format strings with care! Only pointers are - * stored in the s390dbf. See Documentation/s390/s390dbf.txt for more details! + * stored in the s390dbf. See Documentation/s390/s390dbf.rst for more details! */ extern debug_entry_t * __debug_sprintf_event(debug_info_t *id, int level, char *string, ...) 
@@ -210,7 +210,7 @@ static inline debug_entry_t *debug_text_exception(debug_info_t *id, int level, /* * IMPORTANT: Use "%s" in sprintf format strings with care! Only pointers are - * stored in the s390dbf. See Documentation/s390/s390dbf.txt for more details! + * stored in the s390dbf. See Documentation/s390/s390dbf.rst for more details! */ extern debug_entry_t * __debug_sprintf_exception(debug_info_t *id, int level, char *string, ...) diff --git a/drivers/s390/char/zcore.c b/drivers/s390/char/zcore.c index 405a60538630..08f812475f5e 100644 --- a/drivers/s390/char/zcore.c +++ b/drivers/s390/char/zcore.c @@ -4,7 +4,7 @@ * dumps on SCSI disks (zfcpdump). The "zcore/mem" debugfs file shows the same * dump format as s390 standalone dumps. * - * For more information please refer to Documentation/s390/zfcpdump.txt + * For more information please refer to Documentation/s390/zfcpdump.rst * * Copyright IBM Corp. 2003, 2008 * Author(s): Michael Holzheu -- cgit v1.2.3 From 1e87fec9fa52a6f7c223998d6bfbd3464eb37e31 Mon Sep 17 00:00:00 2001 From: Johannes Berg Date: Thu, 16 May 2019 11:44:52 +0200 Subject: mac80211: call rate_control_send_low() internally There's no rate control algorithm that *doesn't* want to call it internally, and calling it internally will let us modify its behaviour in the future. 
Signed-off-by: Johannes Berg --- .../driver-api/80211/mac80211-advanced.rst | 3 --- drivers/net/wireless/intel/iwlegacy/3945-rs.c | 3 --- drivers/net/wireless/intel/iwlegacy/4965-rs.c | 4 ---- drivers/net/wireless/intel/iwlwifi/dvm/rs.c | 4 ---- drivers/net/wireless/intel/iwlwifi/mvm/rs.c | 4 ---- drivers/net/wireless/realtek/rtlwifi/rc.c | 3 --- include/net/mac80211.h | 23 ---------------------- net/mac80211/rate.c | 13 ++++++------ net/mac80211/rc80211_minstrel.c | 4 ---- net/mac80211/rc80211_minstrel_ht.c | 3 --- 10 files changed, 7 insertions(+), 57 deletions(-) (limited to 'Documentation/driver-api') diff --git a/Documentation/driver-api/80211/mac80211-advanced.rst b/Documentation/driver-api/80211/mac80211-advanced.rst index 70a89b2163c2..9f1c5bb7ac35 100644 --- a/Documentation/driver-api/80211/mac80211-advanced.rst +++ b/Documentation/driver-api/80211/mac80211-advanced.rst @@ -226,9 +226,6 @@ TBD .. kernel-doc:: include/net/mac80211.h :functions: ieee80211_tx_rate_control -.. kernel-doc:: include/net/mac80211.h - :functions: rate_control_send_low - TBD This part of the book describes mac80211 internals. 
diff --git a/drivers/net/wireless/intel/iwlegacy/3945-rs.c b/drivers/net/wireless/intel/iwlegacy/3945-rs.c index a697edd46e7f..922f09f7ea3e 100644 --- a/drivers/net/wireless/intel/iwlegacy/3945-rs.c +++ b/drivers/net/wireless/intel/iwlegacy/3945-rs.c @@ -646,9 +646,6 @@ il3945_rs_get_rate(void *il_r, struct ieee80211_sta *sta, void *il_sta, il_sta = NULL; } - if (rate_control_send_low(sta, il_sta, txrc)) - return; - rate_mask = sta->supp_rates[sband->band]; /* get user max rate if set */ diff --git a/drivers/net/wireless/intel/iwlegacy/4965-rs.c b/drivers/net/wireless/intel/iwlegacy/4965-rs.c index 54ff83829afb..946f352fd9a4 100644 --- a/drivers/net/wireless/intel/iwlegacy/4965-rs.c +++ b/drivers/net/wireless/intel/iwlegacy/4965-rs.c @@ -2224,10 +2224,6 @@ il4965_rs_get_rate(void *il_r, struct ieee80211_sta *sta, void *il_sta, il_sta = NULL; } - /* Send management frames and NO_ACK data using lowest rate. */ - if (rate_control_send_low(sta, il_sta, txrc)) - return; - if (!lq_sta) return; diff --git a/drivers/net/wireless/intel/iwlwifi/dvm/rs.c b/drivers/net/wireless/intel/iwlwifi/dvm/rs.c index ef4b9de256f7..838e76a5db68 100644 --- a/drivers/net/wireless/intel/iwlwifi/dvm/rs.c +++ b/drivers/net/wireless/intel/iwlwifi/dvm/rs.c @@ -2731,10 +2731,6 @@ static void rs_get_rate(void *priv_r, struct ieee80211_sta *sta, void *priv_sta, priv_sta = NULL; } - /* Send management frames and NO_ACK data using lowest rate. */ - if (rate_control_send_low(sta, priv_sta, txrc)) - return; - rate_idx = lq_sta->last_txrate_idx; if (lq_sta->last_rate_n_flags & RATE_MCS_HT_MSK) { diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/rs.c b/drivers/net/wireless/intel/iwlwifi/mvm/rs.c index c182821ab22b..9107b1698b0f 100644 --- a/drivers/net/wireless/intel/iwlwifi/mvm/rs.c +++ b/drivers/net/wireless/intel/iwlwifi/mvm/rs.c @@ -2960,10 +2960,6 @@ static void rs_drv_get_rate(void *mvm_r, struct ieee80211_sta *sta, mvm_sta = NULL; } - /* Send management frames and NO_ACK data using lowest rate. 
*/ - if (rate_control_send_low(sta, mvm_sta, txrc)) - return; - if (!mvm_sta) return; diff --git a/drivers/net/wireless/realtek/rtlwifi/rc.c b/drivers/net/wireless/realtek/rtlwifi/rc.c index cf8e42a01015..0c7d74902d33 100644 --- a/drivers/net/wireless/realtek/rtlwifi/rc.c +++ b/drivers/net/wireless/realtek/rtlwifi/rc.c @@ -173,9 +173,6 @@ static void rtl_get_rate(void *ppriv, struct ieee80211_sta *sta, u8 try_per_rate, i, rix; bool not_data = !ieee80211_is_data(fc); - if (rate_control_send_low(sta, priv_sta, txrc)) - return; - rix = _rtl_rc_get_highest_rix(rtlpriv, sta, skb, not_data); try_per_rate = 1; _rtl_rc_rate_set_series(rtlpriv, sta, &rates[0], txrc, diff --git a/include/net/mac80211.h b/include/net/mac80211.h index ed4911306f03..4411120e5a9a 100644 --- a/include/net/mac80211.h +++ b/include/net/mac80211.h @@ -5960,29 +5960,6 @@ static inline int rate_supported(struct ieee80211_sta *sta, return (sta == NULL || sta->supp_rates[band] & BIT(index)); } -/** - * rate_control_send_low - helper for drivers for management/no-ack frames - * - * Rate control algorithms that agree to use the lowest rate to - * send management frames and NO_ACK data with the respective hw - * retries should use this in the beginning of their mac80211 get_rate - * callback. If true is returned the rate control can simply return. - * If false is returned we guarantee that sta and sta and priv_sta is - * not null. - * - * Rate control algorithms wishing to do more intelligent selection of - * rate for multicast/broadcast frames may choose to not use this. - * - * @sta: &struct ieee80211_sta pointer to the target destination. Note - * that this may be null. - * @priv_sta: private rate control structure. This may be null. - * @txrc: rate control information we sholud populate for mac80211. 
- */ -bool rate_control_send_low(struct ieee80211_sta *sta, - void *priv_sta, - struct ieee80211_tx_rate_control *txrc); - - static inline s8 rate_lowest_index(struct ieee80211_supported_band *sband, struct ieee80211_sta *sta) diff --git a/net/mac80211/rate.c b/net/mac80211/rate.c index 76f303fda3ed..09f89d004a70 100644 --- a/net/mac80211/rate.c +++ b/net/mac80211/rate.c @@ -369,9 +369,8 @@ static void __rate_control_send_low(struct ieee80211_hw *hw, } -bool rate_control_send_low(struct ieee80211_sta *pubsta, - void *priv_sta, - struct ieee80211_tx_rate_control *txrc) +static bool rate_control_send_low(struct ieee80211_sta *pubsta, + struct ieee80211_tx_rate_control *txrc) { struct ieee80211_tx_info *info = IEEE80211_SKB_CB(txrc->skb); struct ieee80211_supported_band *sband = txrc->sband; @@ -379,7 +378,7 @@ bool rate_control_send_low(struct ieee80211_sta *pubsta, int mcast_rate; bool use_basicrate = false; - if (!pubsta || !priv_sta || rc_no_data_or_no_ack_use_min(txrc)) { + if (!pubsta || rc_no_data_or_no_ack_use_min(txrc)) { __rate_control_send_low(txrc->hw, sband, pubsta, info, txrc->rate_idx_mask); @@ -405,7 +404,6 @@ bool rate_control_send_low(struct ieee80211_sta *pubsta, } return false; } -EXPORT_SYMBOL(rate_control_send_low); static bool rate_idx_match_legacy_mask(s8 *rate_idx, int n_bitrates, u32 mask) { @@ -902,12 +900,15 @@ void rate_control_get_rate(struct ieee80211_sub_if_data *sdata, if (ieee80211_hw_check(&sdata->local->hw, HAS_RATE_CONTROL)) return; + if (rate_control_send_low(ista, txrc)) + return; + if (ista) { spin_lock_bh(&sta->rate_ctrl_lock); ref->ops->get_rate(ref->priv, ista, priv_sta, txrc); spin_unlock_bh(&sta->rate_ctrl_lock); } else { - ref->ops->get_rate(ref->priv, NULL, NULL, txrc); + rate_control_send_low(NULL, txrc); } if (ieee80211_hw_check(&sdata->local->hw, SUPPORTS_RC_TABLE)) diff --git a/net/mac80211/rc80211_minstrel.c b/net/mac80211/rc80211_minstrel.c index a34e9c2ca626..ee86c3333999 100644 --- 
a/net/mac80211/rc80211_minstrel.c +++ b/net/mac80211/rc80211_minstrel.c @@ -340,10 +340,6 @@ minstrel_get_rate(void *priv, struct ieee80211_sta *sta, int delta; int sampling_ratio; - /* management/no-ack frames do not use rate control */ - if (rate_control_send_low(sta, priv_sta, txrc)) - return; - /* check multi-rate-retry capabilities & adjust lookaround_rate */ mrr_capable = mp->has_mrr && !txrc->rts && diff --git a/net/mac80211/rc80211_minstrel_ht.c b/net/mac80211/rc80211_minstrel_ht.c index 8b168724c5e7..da18c6fb6c1d 100644 --- a/net/mac80211/rc80211_minstrel_ht.c +++ b/net/mac80211/rc80211_minstrel_ht.c @@ -1098,9 +1098,6 @@ minstrel_ht_get_rate(void *priv, struct ieee80211_sta *sta, void *priv_sta, struct minstrel_priv *mp = priv; int sample_idx; - if (rate_control_send_low(sta, priv_sta, txrc)) - return; - if (!msp->is_ht) return mac80211_minstrel.get_rate(priv, sta, &msp->legacy, txrc); -- cgit v1.2.3 From 151f4e2bdc7a04020ae5c533896fb91a16e1f501 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 13 Jun 2019 07:10:36 -0300 Subject: docs: power: convert docs to ReST and rename to *.rst Convert the PM documents to ReST, in order to allow them to build with Sphinx. The conversion is actually: - add blank lines and indentation in order to identify paragraphs; - fix tables markups; - add some lists markups; - mark literal blocks; - adjust title markups. At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Bjorn Helgaas Acked-by: Mark Brown Acked-by: Srivatsa S. 
Bhat (VMware) --- Documentation/ABI/testing/sysfs-class-powercap | 2 +- Documentation/admin-guide/kernel-parameters.txt | 6 +- Documentation/cpu-freq/core.txt | 2 +- Documentation/driver-api/pm/devices.rst | 6 +- Documentation/driver-api/usb/power-management.rst | 2 +- Documentation/power/apm-acpi.rst | 36 + Documentation/power/apm-acpi.txt | 32 - Documentation/power/basic-pm-debugging.rst | 269 +++++ Documentation/power/basic-pm-debugging.txt | 254 ----- Documentation/power/charger-manager.rst | 205 ++++ Documentation/power/charger-manager.txt | 200 ---- Documentation/power/drivers-testing.rst | 51 + Documentation/power/drivers-testing.txt | 46 - Documentation/power/energy-model.rst | 147 +++ Documentation/power/energy-model.txt | 144 --- Documentation/power/freezing-of-tasks.rst | 244 +++++ Documentation/power/freezing-of-tasks.txt | 231 ---- Documentation/power/index.rst | 46 + Documentation/power/interface.rst | 79 ++ Documentation/power/interface.txt | 77 -- Documentation/power/opp.rst | 379 +++++++ Documentation/power/opp.txt | 342 ------ Documentation/power/pci.rst | 1135 ++++++++++++++++++++ Documentation/power/pci.txt | 1094 ------------------- Documentation/power/pm_qos_interface.rst | 225 ++++ Documentation/power/pm_qos_interface.txt | 212 ---- Documentation/power/power_supply_class.rst | 282 +++++ Documentation/power/power_supply_class.txt | 231 ---- Documentation/power/powercap/powercap.rst | 257 +++++ Documentation/power/powercap/powercap.txt | 236 ---- Documentation/power/regulator/consumer.rst | 229 ++++ Documentation/power/regulator/consumer.txt | 218 ---- Documentation/power/regulator/design.rst | 38 + Documentation/power/regulator/design.txt | 33 - Documentation/power/regulator/machine.rst | 97 ++ Documentation/power/regulator/machine.txt | 96 -- Documentation/power/regulator/overview.rst | 178 +++ Documentation/power/regulator/overview.txt | 171 --- Documentation/power/regulator/regulator.rst | 32 + Documentation/power/regulator/regulator.txt | 
30 - Documentation/power/runtime_pm.rst | 940 ++++++++++++++++ Documentation/power/runtime_pm.txt | 928 ---------------- Documentation/power/s2ram.rst | 87 ++ Documentation/power/s2ram.txt | 85 -- Documentation/power/suspend-and-cpuhotplug.rst | 286 +++++ Documentation/power/suspend-and-cpuhotplug.txt | 274 ----- Documentation/power/suspend-and-interrupts.rst | 137 +++ Documentation/power/suspend-and-interrupts.txt | 135 --- Documentation/power/swsusp-and-swap-files.rst | 63 ++ Documentation/power/swsusp-and-swap-files.txt | 60 -- Documentation/power/swsusp-dmcrypt.rst | 140 +++ Documentation/power/swsusp-dmcrypt.txt | 138 --- Documentation/power/swsusp.rst | 501 +++++++++ Documentation/power/swsusp.txt | 446 -------- Documentation/power/tricks.rst | 29 + Documentation/power/tricks.txt | 27 - Documentation/power/userland-swsusp.rst | 191 ++++ Documentation/power/userland-swsusp.txt | 170 --- Documentation/power/video.rst | 213 ++++ Documentation/power/video.txt | 185 ---- Documentation/process/submitting-drivers.rst | 2 +- Documentation/scheduler/sched-energy.txt | 6 +- Documentation/trace/coresight-cpu-debug.txt | 2 +- .../zh_CN/process/submitting-drivers.rst | 2 +- MAINTAINERS | 4 +- arch/x86/Kconfig | 2 +- drivers/gpu/drm/i915/i915_drv.h | 2 +- drivers/opp/Kconfig | 2 +- drivers/power/supply/power_supply_core.c | 2 +- include/linux/interrupt.h | 2 +- include/linux/pci.h | 2 +- include/linux/pm.h | 2 +- kernel/power/Kconfig | 6 +- net/wireless/Kconfig | 2 +- 74 files changed, 6544 insertions(+), 6123 deletions(-) create mode 100644 Documentation/power/apm-acpi.rst delete mode 100644 Documentation/power/apm-acpi.txt create mode 100644 Documentation/power/basic-pm-debugging.rst delete mode 100644 Documentation/power/basic-pm-debugging.txt create mode 100644 Documentation/power/charger-manager.rst delete mode 100644 Documentation/power/charger-manager.txt create mode 100644 Documentation/power/drivers-testing.rst delete mode 100644 
Documentation/power/drivers-testing.txt create mode 100644 Documentation/power/energy-model.rst delete mode 100644 Documentation/power/energy-model.txt create mode 100644 Documentation/power/freezing-of-tasks.rst delete mode 100644 Documentation/power/freezing-of-tasks.txt create mode 100644 Documentation/power/index.rst create mode 100644 Documentation/power/interface.rst delete mode 100644 Documentation/power/interface.txt create mode 100644 Documentation/power/opp.rst delete mode 100644 Documentation/power/opp.txt create mode 100644 Documentation/power/pci.rst delete mode 100644 Documentation/power/pci.txt create mode 100644 Documentation/power/pm_qos_interface.rst delete mode 100644 Documentation/power/pm_qos_interface.txt create mode 100644 Documentation/power/power_supply_class.rst delete mode 100644 Documentation/power/power_supply_class.txt create mode 100644 Documentation/power/powercap/powercap.rst delete mode 100644 Documentation/power/powercap/powercap.txt create mode 100644 Documentation/power/regulator/consumer.rst delete mode 100644 Documentation/power/regulator/consumer.txt create mode 100644 Documentation/power/regulator/design.rst delete mode 100644 Documentation/power/regulator/design.txt create mode 100644 Documentation/power/regulator/machine.rst delete mode 100644 Documentation/power/regulator/machine.txt create mode 100644 Documentation/power/regulator/overview.rst delete mode 100644 Documentation/power/regulator/overview.txt create mode 100644 Documentation/power/regulator/regulator.rst delete mode 100644 Documentation/power/regulator/regulator.txt create mode 100644 Documentation/power/runtime_pm.rst delete mode 100644 Documentation/power/runtime_pm.txt create mode 100644 Documentation/power/s2ram.rst delete mode 100644 Documentation/power/s2ram.txt create mode 100644 Documentation/power/suspend-and-cpuhotplug.rst delete mode 100644 Documentation/power/suspend-and-cpuhotplug.txt create mode 100644 
Documentation/power/suspend-and-interrupts.rst delete mode 100644 Documentation/power/suspend-and-interrupts.txt create mode 100644 Documentation/power/swsusp-and-swap-files.rst delete mode 100644 Documentation/power/swsusp-and-swap-files.txt create mode 100644 Documentation/power/swsusp-dmcrypt.rst delete mode 100644 Documentation/power/swsusp-dmcrypt.txt create mode 100644 Documentation/power/swsusp.rst delete mode 100644 Documentation/power/swsusp.txt create mode 100644 Documentation/power/tricks.rst delete mode 100644 Documentation/power/tricks.txt create mode 100644 Documentation/power/userland-swsusp.rst delete mode 100644 Documentation/power/userland-swsusp.txt create mode 100644 Documentation/power/video.rst delete mode 100644 Documentation/power/video.txt (limited to 'Documentation/driver-api') diff --git a/Documentation/ABI/testing/sysfs-class-powercap b/Documentation/ABI/testing/sysfs-class-powercap index db3b3ff70d84..742dfd966592 100644 --- a/Documentation/ABI/testing/sysfs-class-powercap +++ b/Documentation/ABI/testing/sysfs-class-powercap @@ -5,7 +5,7 @@ Contact: linux-pm@vger.kernel.org Description: The powercap/ class sub directory belongs to the power cap subsystem. Refer to - Documentation/power/powercap/powercap.txt for details. + Documentation/power/powercap/powercap.rst for details. 
What: /sys/class/powercap/ Date: September 2013 diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 138f6664b2e2..7f5ca6e7c4d3 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -13,7 +13,7 @@ For ARM64, ONLY "acpi=off", "acpi=on" or "acpi=force" are available - See also Documentation/power/runtime_pm.txt, pci=noacpi + See also Documentation/power/runtime_pm.rst, pci=noacpi acpi_apic_instance= [ACPI, IOAPIC] Format: @@ -223,7 +223,7 @@ acpi_sleep= [HW,ACPI] Sleep options Format: { s3_bios, s3_mode, s3_beep, s4_nohwsig, old_ordering, nonvs, sci_force_enable, nobl } - See Documentation/power/video.txt for information on + See Documentation/power/video.rst for information on s3_bios and s3_mode. s3_beep is for debugging; it makes the PC's speaker beep as soon as the kernel's real-mode entry point is called. @@ -4108,7 +4108,7 @@ Specify the offset from the beginning of the partition given by "resume=" at which the swap header is located, in units (needed only for swap files). - See Documentation/power/swsusp-and-swap-files.txt + See Documentation/power/swsusp-and-swap-files.rst resumedelay= [HIBERNATION] Delay (in seconds) to pause before attempting to read the resume files diff --git a/Documentation/cpu-freq/core.txt b/Documentation/cpu-freq/core.txt index 073f128af5a7..55193e680250 100644 --- a/Documentation/cpu-freq/core.txt +++ b/Documentation/cpu-freq/core.txt @@ -95,7 +95,7 @@ flags - flags of the cpufreq driver 3. 
CPUFreq Table Generation with Operating Performance Point (OPP) ================================================================== -For details about OPP, see Documentation/power/opp.txt +For details about OPP, see Documentation/power/opp.rst dev_pm_opp_init_cpufreq_table - This function provides a ready to use conversion routine to translate diff --git a/Documentation/driver-api/pm/devices.rst b/Documentation/driver-api/pm/devices.rst index 30835683616a..f66c7b9126ea 100644 --- a/Documentation/driver-api/pm/devices.rst +++ b/Documentation/driver-api/pm/devices.rst @@ -225,7 +225,7 @@ system-wide transition to a sleep state even though its :c:member:`runtime_auto` flag is clear. For more information about the runtime power management framework, refer to -:file:`Documentation/power/runtime_pm.txt`. +:file:`Documentation/power/runtime_pm.rst`. Calling Drivers to Enter and Leave System Sleep States @@ -728,7 +728,7 @@ it into account in any way. Devices may be defined as IRQ-safe which indicates to the PM core that their runtime PM callbacks may be invoked with disabled interrupts (see -:file:`Documentation/power/runtime_pm.txt` for more information). If an +:file:`Documentation/power/runtime_pm.rst` for more information). If an IRQ-safe device belongs to a PM domain, the runtime PM of the domain will be disallowed, unless the domain itself is defined as IRQ-safe. However, it makes sense to define a PM domain as IRQ-safe only if all the devices in it @@ -795,7 +795,7 @@ so on) and the final state of the device must reflect the "active" runtime PM status in that case. During system-wide resume from a sleep state it's easiest to put devices into -the full-power state, as explained in :file:`Documentation/power/runtime_pm.txt`. +the full-power state, as explained in :file:`Documentation/power/runtime_pm.rst`. 
[Refer to that document for more information regarding this particular issue as well as for information on the device runtime power management framework in general.] diff --git a/Documentation/driver-api/usb/power-management.rst b/Documentation/driver-api/usb/power-management.rst index 4a74cf6f2797..2525c3622cae 100644 --- a/Documentation/driver-api/usb/power-management.rst +++ b/Documentation/driver-api/usb/power-management.rst @@ -46,7 +46,7 @@ device is turned off while the system as a whole remains running, we call it a "dynamic suspend" (also known as a "runtime suspend" or "selective suspend"). This document concentrates mostly on how dynamic PM is implemented in the USB subsystem, although system PM is -covered to some extent (see ``Documentation/power/*.txt`` for more +covered to some extent (see ``Documentation/power/*.rst`` for more information about system PM). System PM support is present only if the kernel was built with diff --git a/Documentation/power/apm-acpi.rst b/Documentation/power/apm-acpi.rst new file mode 100644 index 000000000000..5b90d947126d --- /dev/null +++ b/Documentation/power/apm-acpi.rst @@ -0,0 +1,36 @@ +============ +APM or ACPI? +============ + +If you have a relatively recent x86 mobile, desktop, or server system, +odds are it supports either Advanced Power Management (APM) or +Advanced Configuration and Power Interface (ACPI). ACPI is the newer +of the two technologies and puts power management in the hands of the +operating system, allowing for more intelligent power management than +is possible with BIOS controlled APM. + +The best way to determine which, if either, your system supports is to +build a kernel with both ACPI and APM enabled (as of 2.3.x ACPI is +enabled by default). If a working ACPI implementation is found, the +ACPI driver will override and disable APM, otherwise the APM driver +will be used. + +No, sorry, you cannot have both ACPI and APM enabled and running at +once. 
Some people with broken ACPI or broken APM implementations +would like to use both to get a full set of working features, but you +simply cannot mix and match the two. Only one power management +interface can be in control of the machine at once. Think about it.. + +User-space Daemons +------------------ +Both APM and ACPI rely on user-space daemons, apmd and acpid +respectively, to be completely functional. Obtain both of these +daemons from your Linux distribution or from the Internet (see below) +and be sure that they are started sometime in the system boot process. +Go ahead and start both. If ACPI or APM is not available on your +system the associated daemon will exit gracefully. + + ===== ======================================= + apmd http://ftp.debian.org/pool/main/a/apmd/ + acpid http://acpid.sf.net/ + ===== ======================================= diff --git a/Documentation/power/apm-acpi.txt b/Documentation/power/apm-acpi.txt deleted file mode 100644 index 6cc423d3662e..000000000000 --- a/Documentation/power/apm-acpi.txt +++ /dev/null @@ -1,32 +0,0 @@ -APM or ACPI? ------------- -If you have a relatively recent x86 mobile, desktop, or server system, -odds are it supports either Advanced Power Management (APM) or -Advanced Configuration and Power Interface (ACPI). ACPI is the newer -of the two technologies and puts power management in the hands of the -operating system, allowing for more intelligent power management than -is possible with BIOS controlled APM. - -The best way to determine which, if either, your system supports is to -build a kernel with both ACPI and APM enabled (as of 2.3.x ACPI is -enabled by default). If a working ACPI implementation is found, the -ACPI driver will override and disable APM, otherwise the APM driver -will be used. - -No, sorry, you cannot have both ACPI and APM enabled and running at -once. 
Some people with broken ACPI or broken APM implementations -would like to use both to get a full set of working features, but you -simply cannot mix and match the two. Only one power management -interface can be in control of the machine at once. Think about it.. - -User-space Daemons ------------------- -Both APM and ACPI rely on user-space daemons, apmd and acpid -respectively, to be completely functional. Obtain both of these -daemons from your Linux distribution or from the Internet (see below) -and be sure that they are started sometime in the system boot process. -Go ahead and start both. If ACPI or APM is not available on your -system the associated daemon will exit gracefully. - - apmd: http://ftp.debian.org/pool/main/a/apmd/ - acpid: http://acpid.sf.net/ diff --git a/Documentation/power/basic-pm-debugging.rst b/Documentation/power/basic-pm-debugging.rst new file mode 100644 index 000000000000..69862e759c30 --- /dev/null +++ b/Documentation/power/basic-pm-debugging.rst @@ -0,0 +1,269 @@ +================================= +Debugging hibernation and suspend +================================= + + (C) 2007 Rafael J. Wysocki , GPL + +1. Testing hibernation (aka suspend to disk or STD) +=================================================== + +To check if hibernation works, you can try to hibernate in the "reboot" mode:: + + # echo reboot > /sys/power/disk + # echo disk > /sys/power/state + +and the system should create a hibernation image, reboot, resume and get back to +the command prompt where you have started the transition. If that happens, +hibernation is most likely to work correctly. Still, you need to repeat the +test at least a couple of times in a row for confidence. [This is necessary, +because some problems only show up on a second attempt at suspending and +resuming the system.] 
Moreover, hibernating in the "reboot" and "shutdown" +modes causes the PM core to skip some platform-related callbacks which on ACPI +systems might be necessary to make hibernation work. Thus, if your machine +fails to hibernate or resume in the "reboot" mode, you should try the +"platform" mode:: + + # echo platform > /sys/power/disk + # echo disk > /sys/power/state + +which is the default and recommended mode of hibernation. + +Unfortunately, the "platform" mode of hibernation does not work on some systems +with broken BIOSes. In such cases the "shutdown" mode of hibernation might +work:: + + # echo shutdown > /sys/power/disk + # echo disk > /sys/power/state + +(it is similar to the "reboot" mode, but it requires you to press the power +button to make the system resume). + +If neither "platform" nor "shutdown" hibernation mode works, you will need to +identify what goes wrong. + +a) Test modes of hibernation +---------------------------- + +To find out why hibernation fails on your system, you can use a special testing +facility available if the kernel is compiled with CONFIG_PM_DEBUG set. Then, +there is the file /sys/power/pm_test that can be used to make the hibernation +core run in a test mode. There are 5 test modes available: + +freezer + - test the freezing of processes + +devices + - test the freezing of processes and suspending of devices + +platform + - test the freezing of processes, suspending of devices and platform + global control methods [1]_ + +processors + - test the freezing of processes, suspending of devices, platform + global control methods [1]_ and the disabling of nonboot CPUs + +core + - test the freezing of processes, suspending of devices, platform global + control methods\ [1]_, the disabling of nonboot CPUs and suspending + of platform/system devices + +.. 
[1] + + the platform global control methods are only available on ACPI systems + and are only tested if the hibernation mode is set to "platform" + +To use one of them it is necessary to write the corresponding string to +/sys/power/pm_test (eg. "devices" to test the freezing of processes and +suspending devices) and issue the standard hibernation commands. For example, +to use the "devices" test mode along with the "platform" mode of hibernation, +you should do the following:: + + # echo devices > /sys/power/pm_test + # echo platform > /sys/power/disk + # echo disk > /sys/power/state + +Then, the kernel will try to freeze processes, suspend devices, wait a few +seconds (5 by default, but configurable by the suspend.pm_test_delay module +parameter), resume devices and thaw processes. If "platform" is written to +/sys/power/pm_test , then after suspending devices the kernel will additionally +invoke the global control methods (eg. ACPI global control methods) used to +prepare the platform firmware for hibernation. Next, it will wait a +configurable number of seconds and invoke the platform (eg. ACPI) global +methods used to cancel hibernation etc. + +Writing "none" to /sys/power/pm_test causes the kernel to switch to the normal +hibernation/suspend operations. Also, when open for reading, /sys/power/pm_test +contains a space-separated list of all available tests (including "none" that +represents the normal functionality) in which the current test level is +indicated by square brackets. + +Generally, as you can see, each test level is more "invasive" than the previous +one and the "core" level tests the hardware and drivers as deeply as possible +without creating a hibernation image. Obviously, if the "devices" test fails, +the "platform" test will fail as well and so on. 
Thus, as a rule of thumb, you +should try the test modes starting from "freezer", through "devices", "platform" +and "processors" up to "core" (repeat the test on each level a couple of times +to make sure that any random factors are avoided). + +If the "freezer" test fails, there is a task that cannot be frozen (in that case +it usually is possible to identify the offending task by analysing the output of +dmesg obtained after the failing test). Failure at this level usually means +that there is a problem with the tasks freezer subsystem that should be +reported. + +If the "devices" test fails, most likely there is a driver that cannot suspend +or resume its device (in the latter case the system may hang or become unstable +after the test, so please take that into consideration). To find this driver, +you can carry out a binary search according to the rules: + +- if the test fails, unload a half of the drivers currently loaded and repeat + (that would probably involve rebooting the system, so always note what drivers + have been loaded before the test), +- if the test succeeds, load a half of the drivers you have unloaded most + recently and repeat. + +Once you have found the failing driver (there can be more than just one of +them), you have to unload it every time before hibernation. In that case please +make sure to report the problem with the driver. + +It is also possible that the "devices" test will still fail after you have +unloaded all modules. In that case, you may want to look in your kernel +configuration for the drivers that can be compiled as modules (and test again +with these drivers compiled as modules). You may also try to use some special +kernel command line options such as "noapic", "noacpi" or even "acpi=off". + +If the "platform" test fails, there is a problem with the handling of the +platform (eg. ACPI) firmware on your system. In that case the "platform" mode +of hibernation is not likely to work. 
You can try the "shutdown" mode, but that +is rather a poor man's workaround. + +If the "processors" test fails, the disabling/enabling of nonboot CPUs does not +work (of course, this only may be an issue on SMP systems) and the problem +should be reported. In that case you can also try to switch the nonboot CPUs +off and on using the /sys/devices/system/cpu/cpu*/online sysfs attributes and +see if that works. + +If the "core" test fails, which means that suspending of the system/platform +devices has failed (these devices are suspended on one CPU with interrupts off), +the problem is most probably hardware-related and serious, so it should be +reported. + +A failure of any of the "platform", "processors" or "core" tests may cause your +system to hang or become unstable, so please beware. Such a failure usually +indicates a serious problem that very well may be related to the hardware, but +please report it anyway. + +b) Testing minimal configuration +-------------------------------- + +If all of the hibernation test modes work, you can boot the system with the +"init=/bin/bash" command line parameter and attempt to hibernate in the +"reboot", "shutdown" and "platform" modes. If that does not work, there +probably is a problem with a driver statically compiled into the kernel and you +can try to compile more drivers as modules, so that they can be tested +individually. Otherwise, there is a problem with a modular driver and you can +find it by loading a half of the modules you normally use and binary searching +in accordance with the algorithm: +- if there are n modules loaded and the attempt to suspend and resume fails, +unload n/2 of the modules and try again (that would probably involve rebooting +the system), +- if there are n modules loaded and the attempt to suspend and resume succeeds, +load n/2 modules more and try again. 
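The module bisection described in the list above can be sketched roughly as follows. This is an illustration only and is not part of the patch or of the kernel tree: the names `find_failing_module` and `suspend_ok` are hypothetical, and the sketch assumes that exactly one module is at fault and that its effect does not depend on which other modules are loaded. In practice `suspend_ok` would stand for a manual step (reboot, load only the given modules, attempt to hibernate and resume), not an automated call.

```python
from typing import Callable, List, Optional, Sequence


def find_failing_module(
    modules: Sequence[str],
    suspend_ok: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Bisect `modules` for the one that makes suspend/resume fail.

    `suspend_ok(loaded)` is a hypothetical callback that returns True when
    a suspend/resume cycle succeeds with only `loaded` modules present.
    Assumes a single, independent offender (an assumption of this sketch).
    """
    # If everything works with all modules loaded, there is nothing to find.
    if suspend_ok(modules):
        return None

    candidates: List[str] = list(modules)
    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        if suspend_ok(first_half):
            # The cycle succeeded, so the offender is in the other half.
            candidates = candidates[mid:]
        else:
            # The cycle failed, so the offender is in this half.
            candidates = first_half
    return candidates[0]


if __name__ == "__main__":
    loaded_modules = ["snd", "usbhid", "e1000e", "radeon", "btusb"]
    # Stand-in for the manual test: pretend "radeon" is the broken driver.
    print(find_failing_module(loaded_modules, lambda mods: "radeon" not in mods))
```

With n modules this needs about log2(n) test cycles. If more than one module is broken, the sketch still returns one of them, and the search can be repeated with that module left unloaded.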
+ +Again, if you find the offending module(s), it(they) must be unloaded every time +before hibernation, and please report the problem with it(them). + +c) Using the "test_resume" hibernation option +--------------------------------------------- + +/sys/power/disk generally tells the kernel what to do after creating a +hibernation image. One of the available options is "test_resume" which +causes the just created image to be used for immediate restoration. Namely, +after doing:: + + # echo test_resume > /sys/power/disk + # echo disk > /sys/power/state + +a hibernation image will be created and a resume from it will be triggered +immediately without involving the platform firmware in any way. + +That test can be used to check if failures to resume from hibernation are +related to bad interactions with the platform firmware. That is, if the above +works every time, but resume from actual hibernation does not work or is +unreliable, the platform firmware may be responsible for the failures. + +On architectures and platforms that support using different kernels to restore +hibernation images (that is, the kernel used to read the image from storage and +load it into memory is different from the one included in the image) or support +kernel address space randomization, it also can be used to check if failures +to resume may be related to the differences between the restore and image +kernels. + +d) Advanced debugging +--------------------- + +In case that hibernation does not work on your system even in the minimal +configuration and compiling more drivers as modules is not practical or some +modules cannot be unloaded, you can use one of the more advanced debugging +techniques to find the problem. First, if there is a serial port in your box, +you can boot the kernel with the 'no_console_suspend' parameter and try to log +kernel messages using the serial console. This may provide you with some +information about the reasons of the suspend (resume) failure. 
Alternatively, +it may be possible to use a FireWire port for debugging with firescope +(http://v3.sk/~lkundrak/firescope/). On x86 it is also possible to +use the PM_TRACE mechanism documented in Documentation/power/s2ram.rst . + +2. Testing suspend to RAM (STR) +=============================== + +To verify that the STR works, it is generally more convenient to use the s2ram +tool available from http://suspend.sf.net and documented at +http://en.opensuse.org/SDB:Suspend_to_RAM (S2RAM_LINK). + +Namely, after writing "freezer", "devices", "platform", "processors", or "core" +into /sys/power/pm_test (available if the kernel is compiled with +CONFIG_PM_DEBUG set) the suspend code will work in the test mode corresponding +to given string. The STR test modes are defined in the same way as for +hibernation, so please refer to Section 1 for more information about them. In +particular, the "core" test allows you to test everything except for the actual +invocation of the platform firmware in order to put the system into the sleep +state. + +Among other things, the testing with the help of /sys/power/pm_test may allow +you to identify drivers that fail to suspend or resume their devices. They +should be unloaded every time before an STR transition. + +Next, you can follow the instructions at S2RAM_LINK to test the system, but if +it does not work "out of the box", you may need to boot it with +"init=/bin/bash" and test s2ram in the minimal configuration. In that case, +you may be able to search for failing drivers by following the procedure +analogous to the one described in section 1. If you find some failing drivers, +you will have to unload them every time before an STR transition (ie. before +you run s2ram), and please report the problems with them. + +There is a debugfs entry which shows the suspend to RAM statistics. 
Here is an +example of its output:: + + # mount -t debugfs none /sys/kernel/debug + # cat /sys/kernel/debug/suspend_stats + success: 20 + fail: 5 + failed_freeze: 0 + failed_prepare: 0 + failed_suspend: 5 + failed_suspend_noirq: 0 + failed_resume: 0 + failed_resume_noirq: 0 + failures: + last_failed_dev: alarm + adc + last_failed_errno: -16 + -16 + last_failed_step: suspend + suspend + +Field success means the success number of suspend to RAM, and field fail means +the failure number. Others are the failure number of different steps of suspend +to RAM. suspend_stats just lists the last 2 failed devices, error number and +failed step of suspend. diff --git a/Documentation/power/basic-pm-debugging.txt b/Documentation/power/basic-pm-debugging.txt deleted file mode 100644 index 708f87f78a75..000000000000 --- a/Documentation/power/basic-pm-debugging.txt +++ /dev/null @@ -1,254 +0,0 @@ -Debugging hibernation and suspend - (C) 2007 Rafael J. Wysocki , GPL - -1. Testing hibernation (aka suspend to disk or STD) - -To check if hibernation works, you can try to hibernate in the "reboot" mode: - -# echo reboot > /sys/power/disk -# echo disk > /sys/power/state - -and the system should create a hibernation image, reboot, resume and get back to -the command prompt where you have started the transition. If that happens, -hibernation is most likely to work correctly. Still, you need to repeat the -test at least a couple of times in a row for confidence. [This is necessary, -because some problems only show up on a second attempt at suspending and -resuming the system.] Moreover, hibernating in the "reboot" and "shutdown" -modes causes the PM core to skip some platform-related callbacks which on ACPI -systems might be necessary to make hibernation work. 
Thus, if your machine fails -to hibernate or resume in the "reboot" mode, you should try the "platform" mode: - -# echo platform > /sys/power/disk -# echo disk > /sys/power/state - -which is the default and recommended mode of hibernation. - -Unfortunately, the "platform" mode of hibernation does not work on some systems -with broken BIOSes. In such cases the "shutdown" mode of hibernation might -work: - -# echo shutdown > /sys/power/disk -# echo disk > /sys/power/state - -(it is similar to the "reboot" mode, but it requires you to press the power -button to make the system resume). - -If neither "platform" nor "shutdown" hibernation mode works, you will need to -identify what goes wrong. - -a) Test modes of hibernation - -To find out why hibernation fails on your system, you can use a special testing -facility available if the kernel is compiled with CONFIG_PM_DEBUG set. Then, -there is the file /sys/power/pm_test that can be used to make the hibernation -core run in a test mode. There are 5 test modes available: - -freezer -- test the freezing of processes - -devices -- test the freezing of processes and suspending of devices - -platform -- test the freezing of processes, suspending of devices and platform - global control methods(*) - -processors -- test the freezing of processes, suspending of devices, platform - global control methods(*) and the disabling of nonboot CPUs - -core -- test the freezing of processes, suspending of devices, platform global - control methods(*), the disabling of nonboot CPUs and suspending of - platform/system devices - -(*) the platform global control methods are only available on ACPI systems - and are only tested if the hibernation mode is set to "platform" - -To use one of them it is necessary to write the corresponding string to -/sys/power/pm_test (eg. "devices" to test the freezing of processes and -suspending devices) and issue the standard hibernation commands. 
For example, -to use the "devices" test mode along with the "platform" mode of hibernation, -you should do the following: - -# echo devices > /sys/power/pm_test -# echo platform > /sys/power/disk -# echo disk > /sys/power/state - -Then, the kernel will try to freeze processes, suspend devices, wait a few -seconds (5 by default, but configurable by the suspend.pm_test_delay module -parameter), resume devices and thaw processes. If "platform" is written to -/sys/power/pm_test , then after suspending devices the kernel will additionally -invoke the global control methods (eg. ACPI global control methods) used to -prepare the platform firmware for hibernation. Next, it will wait a -configurable number of seconds and invoke the platform (eg. ACPI) global -methods used to cancel hibernation etc. - -Writing "none" to /sys/power/pm_test causes the kernel to switch to the normal -hibernation/suspend operations. Also, when open for reading, /sys/power/pm_test -contains a space-separated list of all available tests (including "none" that -represents the normal functionality) in which the current test level is -indicated by square brackets. - -Generally, as you can see, each test level is more "invasive" than the previous -one and the "core" level tests the hardware and drivers as deeply as possible -without creating a hibernation image. Obviously, if the "devices" test fails, -the "platform" test will fail as well and so on. Thus, as a rule of thumb, you -should try the test modes starting from "freezer", through "devices", "platform" -and "processors" up to "core" (repeat the test on each level a couple of times -to make sure that any random factors are avoided). - -If the "freezer" test fails, there is a task that cannot be frozen (in that case -it usually is possible to identify the offending task by analysing the output of -dmesg obtained after the failing test). 
Failure at this level usually means -that there is a problem with the tasks freezer subsystem that should be -reported. - -If the "devices" test fails, most likely there is a driver that cannot suspend -or resume its device (in the latter case the system may hang or become unstable -after the test, so please take that into consideration). To find this driver, -you can carry out a binary search according to the rules: -- if the test fails, unload a half of the drivers currently loaded and repeat -(that would probably involve rebooting the system, so always note what drivers -have been loaded before the test), -- if the test succeeds, load a half of the drivers you have unloaded most -recently and repeat. - -Once you have found the failing driver (there can be more than just one of -them), you have to unload it every time before hibernation. In that case please -make sure to report the problem with the driver. - -It is also possible that the "devices" test will still fail after you have -unloaded all modules. In that case, you may want to look in your kernel -configuration for the drivers that can be compiled as modules (and test again -with these drivers compiled as modules). You may also try to use some special -kernel command line options such as "noapic", "noacpi" or even "acpi=off". - -If the "platform" test fails, there is a problem with the handling of the -platform (eg. ACPI) firmware on your system. In that case the "platform" mode -of hibernation is not likely to work. You can try the "shutdown" mode, but that -is rather a poor man's workaround. - -If the "processors" test fails, the disabling/enabling of nonboot CPUs does not -work (of course, this only may be an issue on SMP systems) and the problem -should be reported. In that case you can also try to switch the nonboot CPUs -off and on using the /sys/devices/system/cpu/cpu*/online sysfs attributes and -see if that works. 
- -If the "core" test fails, which means that suspending of the system/platform -devices has failed (these devices are suspended on one CPU with interrupts off), -the problem is most probably hardware-related and serious, so it should be -reported. - -A failure of any of the "platform", "processors" or "core" tests may cause your -system to hang or become unstable, so please beware. Such a failure usually -indicates a serious problem that very well may be related to the hardware, but -please report it anyway. - -b) Testing minimal configuration - -If all of the hibernation test modes work, you can boot the system with the -"init=/bin/bash" command line parameter and attempt to hibernate in the -"reboot", "shutdown" and "platform" modes. If that does not work, there -probably is a problem with a driver statically compiled into the kernel and you -can try to compile more drivers as modules, so that they can be tested -individually. Otherwise, there is a problem with a modular driver and you can -find it by loading a half of the modules you normally use and binary searching -in accordance with the algorithm: -- if there are n modules loaded and the attempt to suspend and resume fails, -unload n/2 of the modules and try again (that would probably involve rebooting -the system), -- if there are n modules loaded and the attempt to suspend and resume succeeds, -load n/2 modules more and try again. - -Again, if you find the offending module(s), it(they) must be unloaded every time -before hibernation, and please report the problem with it(them). - -c) Using the "test_resume" hibernation option - -/sys/power/disk generally tells the kernel what to do after creating a -hibernation image. One of the available options is "test_resume" which -causes the just created image to be used for immediate restoration. 
Namely, -after doing: - -# echo test_resume > /sys/power/disk -# echo disk > /sys/power/state - -a hibernation image will be created and a resume from it will be triggered -immediately without involving the platform firmware in any way. - -That test can be used to check if failures to resume from hibernation are -related to bad interactions with the platform firmware. That is, if the above -works every time, but resume from actual hibernation does not work or is -unreliable, the platform firmware may be responsible for the failures. - -On architectures and platforms that support using different kernels to restore -hibernation images (that is, the kernel used to read the image from storage and -load it into memory is different from the one included in the image) or support -kernel address space randomization, it also can be used to check if failures -to resume may be related to the differences between the restore and image -kernels. - -d) Advanced debugging - -In case that hibernation does not work on your system even in the minimal -configuration and compiling more drivers as modules is not practical or some -modules cannot be unloaded, you can use one of the more advanced debugging -techniques to find the problem. First, if there is a serial port in your box, -you can boot the kernel with the 'no_console_suspend' parameter and try to log -kernel messages using the serial console. This may provide you with some -information about the reasons of the suspend (resume) failure. Alternatively, -it may be possible to use a FireWire port for debugging with firescope -(http://v3.sk/~lkundrak/firescope/). On x86 it is also possible to -use the PM_TRACE mechanism documented in Documentation/power/s2ram.txt . - -2. Testing suspend to RAM (STR) - -To verify that the STR works, it is generally more convenient to use the s2ram -tool available from http://suspend.sf.net and documented at -http://en.opensuse.org/SDB:Suspend_to_RAM (S2RAM_LINK). 
- -Namely, after writing "freezer", "devices", "platform", "processors", or "core" -into /sys/power/pm_test (available if the kernel is compiled with -CONFIG_PM_DEBUG set) the suspend code will work in the test mode corresponding -to given string. The STR test modes are defined in the same way as for -hibernation, so please refer to Section 1 for more information about them. In -particular, the "core" test allows you to test everything except for the actual -invocation of the platform firmware in order to put the system into the sleep -state. - -Among other things, the testing with the help of /sys/power/pm_test may allow -you to identify drivers that fail to suspend or resume their devices. They -should be unloaded every time before an STR transition. - -Next, you can follow the instructions at S2RAM_LINK to test the system, but if -it does not work "out of the box", you may need to boot it with -"init=/bin/bash" and test s2ram in the minimal configuration. In that case, -you may be able to search for failing drivers by following the procedure -analogous to the one described in section 1. If you find some failing drivers, -you will have to unload them every time before an STR transition (ie. before -you run s2ram), and please report the problems with them. - -There is a debugfs entry which shows the suspend to RAM statistics. Here is an -example of its output. - # mount -t debugfs none /sys/kernel/debug - # cat /sys/kernel/debug/suspend_stats - success: 20 - fail: 5 - failed_freeze: 0 - failed_prepare: 0 - failed_suspend: 5 - failed_suspend_noirq: 0 - failed_resume: 0 - failed_resume_noirq: 0 - failures: - last_failed_dev: alarm - adc - last_failed_errno: -16 - -16 - last_failed_step: suspend - suspend -Field success means the success number of suspend to RAM, and field fail means -the failure number. Others are the failure number of different steps of suspend -to RAM. suspend_stats just lists the last 2 failed devices, error number and -failed step of suspend. 
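As an illustration of reading the suspend_stats output quoted above, the following shell sketch prints only the per-step failure counters that are non-zero. The function name `nonzero_failures` is hypothetical (it is not part of the kernel tree or of this patch), and the sketch assumes the `name: value` layout shown in the example output.

```shell
#!/bin/sh
# Hypothetical helper, not part of the kernel tree: read suspend_stats-style
# text on stdin and print the failed_* counters whose value is above zero.
nonzero_failures() {
    awk -F':[ \t]*' '
        /^[ \t]*failed_/ && ($2 + 0) > 0 {
            name = $1
            sub(/^[ \t]+/, "", name)   # drop any leading indentation
            print name "=" ($2 + 0)
        }'
}

# Typical use (assumes debugfs is mounted at /sys/kernel/debug):
#   nonzero_failures < /sys/kernel/debug/suspend_stats
```

With the example output above, this would report only the failed_suspend counter, which points at the device suspend step as the place to look first.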
diff --git a/Documentation/power/charger-manager.rst b/Documentation/power/charger-manager.rst new file mode 100644 index 000000000000..84fab9376792 --- /dev/null +++ b/Documentation/power/charger-manager.rst @@ -0,0 +1,205 @@ +=============== +Charger Manager +=============== + + (C) 2011 MyungJoo Ham , GPL + +Charger Manager provides in-kernel battery charger management that +requires temperature monitoring during suspend-to-RAM state +and where each battery may have multiple chargers attached and the userland +wants to look at the aggregated information of the multiple chargers. + +Charger Manager is a platform_driver with power-supply-class entries. +An instance of Charger Manager (a platform-device created with Charger-Manager) +represents an independent battery with chargers. If there are multiple +batteries with their own chargers acting independently in a system, +the system may need multiple instances of Charger Manager. + +1. Introduction +=============== + +Charger Manager supports the following: + +* Support for multiple chargers (e.g., a device with USB, AC, and solar panels) + A system may have multiple chargers (or power sources) and some of + them may be activated at the same time. Each charger may have its + own power-supply-class and each power-supply-class can provide + different information about the battery status. This framework + aggregates charger-related information from multiple sources and + shows combined information as a single power-supply-class. + +* Support for in suspend-to-RAM polling (with suspend_again callback) + While the battery is being charged and the system is in suspend-to-RAM, + we may need to monitor the battery health by looking at the ambient or + battery temperature. We can accomplish this by waking up the system + periodically. However, just to monitor the battery health, such a method + unnecessarily wakes up devices, tasks, and user processes that are + supposed to be kept suspended. 
That, in turn, incurs unnecessary power + consumption and slows down the charging process. Worse, such peak power + consumption can stop chargers in the middle of charging + (external power input < device power consumption), which affects + not only the charging time but also the lifespan of the battery. + + Charger Manager provides a function "cm_suspend_again" that can be + used as the suspend_again callback of platform_suspend_ops. If the platform + requires tasks other than cm_suspend_again, it may implement its own + suspend_again callback that calls cm_suspend_again in the middle. + Normally, the platform will need to resume and suspend some devices + that are used by Charger Manager. + +* Support for premature full-battery event handling + If the battery voltage drops by "fullbatt_vchkdrop_uV" after + "fullbatt_vchkdrop_ms" from the full-battery event, the framework + restarts charging. This check is also performed while suspended by + setting wakeup time accordingly and using suspend_again. + +* Support for uevent-notify + With the charger-related events, the device sends + notification to users with UEVENT. + +2. Global Charger-Manager Data related with suspend_again +========================================================= +In order to set up Charger Manager with the suspend-again feature +(in-suspend monitoring), the user should provide charger_global_desc +with setup_charger_manager(`struct charger_global_desc *`). +This charger_global_desc data for in-suspend monitoring is global +as the name suggests. Thus, the user needs to provide it only once even +if there are multiple batteries. If there are multiple batteries, the +multiple instances of Charger Manager share the same charger_global_desc +and it will manage in-suspend monitoring for all instances of Charger Manager. 
+ +The user needs to provide all three entries to `struct charger_global_desc` +properly in order to activate in-suspend monitoring: + +`char *rtc_name;` + The name of rtc (e.g., "rtc0") used to wake up the system from + suspend for Charger Manager. The alarm interrupt (AIE) of the rtc + should be able to wake up the system from suspend. Charger Manager + saves and restores the alarm value and uses the previously-defined + alarm if it is going to go off earlier than Charger Manager's, so that + Charger Manager does not interfere with previously-defined alarms. + +`bool (*rtc_only_wakeup)(void);` + This callback should let CM know whether + the wakeup-from-suspend is caused only by the alarm of "rtc" in the + same struct. If any other wakeup source triggered the + wakeup, it should return false. If the "rtc" is the only wakeup + reason, it should return true. + +`bool assume_timer_stops_in_suspend;` + If true, Charger Manager assumes that + the timer (CM uses jiffies as timer) stops during suspend. Then, CM + assumes that the suspend-duration is the same as the alarm length. + + +3. How to setup suspend_again +============================= +Charger Manager provides a function "extern bool cm_suspend_again(void)". +When cm_suspend_again is called, it monitors every battery. The suspend_again +callback of the system's platform_suspend_ops can call the cm_suspend_again +function to know whether Charger Manager wants to suspend again or not. +If there are no other devices or tasks that want to use the suspend_again +feature, the platform_suspend_ops may directly refer to cm_suspend_again +for its suspend_again callback. + +cm_suspend_again() returns true (meaning "I want to suspend again") +if the system was woken up by Charger Manager and the polling +(in-suspend monitoring) results in "normal". + +4. 
Charger-Manager Data (struct charger_desc) +============================================= +For each battery charged independently from other batteries (if a series of +batteries are charged by a single charger, they are counted as one independent +battery), an instance of Charger Manager is attached to it. + +The following are the struct charger_desc elements: + +`char *psy_name;` + The power-supply-class name of the battery. Default is + "battery" if psy_name is NULL. Users can access the psy entries + at "/sys/class/power_supply/[psy_name]/". + +`enum polling_modes polling_mode;` + CM_POLL_DISABLE: + do not poll this battery. + CM_POLL_ALWAYS: + always poll this battery. + CM_POLL_EXTERNAL_POWER_ONLY: + poll this battery if and only if an external power + source is attached. + CM_POLL_CHARGING_ONLY: + poll this battery if and only if the battery is being charged. + +`unsigned int fullbatt_vchkdrop_ms; / unsigned int fullbatt_vchkdrop_uV;` + If both have non-zero values, Charger Manager will check the + battery voltage drop fullbatt_vchkdrop_ms after the battery is fully + charged. If the voltage drop is over fullbatt_vchkdrop_uV, Charger + Manager will try to recharge the battery by disabling and enabling + chargers. Recharging on the voltage drop condition alone (without the delay + condition) needs to be implemented with hardware interrupts from + fuel gauges or charger devices/chips. + +`unsigned int fullbatt_uV;` + If specified with a non-zero value, Charger Manager assumes + that the battery is full (capacity = 100) if the battery is not being + charged and the battery voltage is equal to or greater than + fullbatt_uV. + +`unsigned int polling_interval_ms;` + Required polling interval in ms. Charger Manager will poll + this battery every polling_interval_ms or more frequently. + +`enum data_source battery_present;` + CM_BATTERY_PRESENT: + assume that the battery exists. + CM_NO_BATTERY: + assume that the battery does not exist. 
+ CM_FUEL_GAUGE: + get battery presence information from the fuel gauge. + CM_CHARGER_STAT: + get battery presence from chargers. + +`char **psy_charger_stat;` + A NULL-terminated array of the power-supply-class names of + chargers. Each power-supply-class should provide "PRESENT" (if + battery_present is "CM_CHARGER_STAT"), "ONLINE" (shows whether an + external power source is attached or not), and "STATUS" (shows whether + the battery is {"FULL" or not FULL} or {"FULL", "Charging", + "Discharging", "NotCharging"}). + +`int num_charger_regulators; / struct regulator_bulk_data *charger_regulators;` + Regulators representing the chargers in the form used by the + regulator framework's bulk functions. + +`char *psy_fuel_gauge;` + Power-supply-class name of the fuel gauge. + +`int (*temperature_out_of_range)(int *mC); / bool measure_battery_temp;` + This callback returns 0 if the temperature is safe for charging, + a positive number if it is too hot to charge, and a negative number + if it is too cold to charge. Through the variable mC, the callback returns + the temperature in 1/1000 of a degree centigrade. + The temperature source can be the battery or the ambient air, according to + the value of measure_battery_temp. + + +5. Notify Charger-Manager of charger events: cm_notify_event() +============================================================== +If a charger event needs to be reported to +Charger Manager, the charger device driver that triggers the event can call +cm_notify_event(psy, type, msg) to notify the corresponding Charger Manager. +In the function, psy is the charger driver's power_supply pointer, which is +associated with Charger-Manager. The parameter "type" +is the same as irq's type (enum cm_event_types). The event message "msg" is +optional and is effective only if the event type is "UNDESCRIBED" or "OTHERS". + +6. 
Other Considerations +======================= + +At the charger/battery-related events such as battery-pulled-out, +charger-pulled-out, charger-inserted, DCIN-over/under-voltage, charger-stopped, +and others critical to chargers, the system should be configured to wake up. +At least the following should wake up the system from a suspend: +a) charger-on/off b) external-power-in/out c) battery-in/out (while charging) + +It is usually accomplished by configuring the PMIC as a wakeup source. diff --git a/Documentation/power/charger-manager.txt b/Documentation/power/charger-manager.txt deleted file mode 100644 index 9ff1105e58d6..000000000000 --- a/Documentation/power/charger-manager.txt +++ /dev/null @@ -1,200 +0,0 @@ -Charger Manager - (C) 2011 MyungJoo Ham , GPL - -Charger Manager provides in-kernel battery charger management that -requires temperature monitoring during suspend-to-RAM state -and where each battery may have multiple chargers attached and the userland -wants to look at the aggregated information of the multiple chargers. - -Charger Manager is a platform_driver with power-supply-class entries. -An instance of Charger Manager (a platform-device created with Charger-Manager) -represents an independent battery with chargers. If there are multiple -batteries with their own chargers acting independently in a system, -the system may need multiple instances of Charger Manager. - -1. Introduction -=============== - -Charger Manager supports the following: - -* Support for multiple chargers (e.g., a device with USB, AC, and solar panels) - A system may have multiple chargers (or power sources) and some of - they may be activated at the same time. Each charger may have its - own power-supply-class and each power-supply-class can provide - different information about the battery status. This framework - aggregates charger-related information from multiple sources and - shows combined information as a single power-supply-class. 
- -* Support for in suspend-to-RAM polling (with suspend_again callback) - While the battery is being charged and the system is in suspend-to-RAM, - we may need to monitor the battery health by looking at the ambient or - battery temperature. We can accomplish this by waking up the system - periodically. However, such a method wakes up devices unnecessarily for - monitoring the battery health and tasks, and user processes that are - supposed to be kept suspended. That, in turn, incurs unnecessary power - consumption and slow down charging process. Or even, such peak power - consumption can stop chargers in the middle of charging - (external power input < device power consumption), which not - only affects the charging time, but the lifespan of the battery. - - Charger Manager provides a function "cm_suspend_again" that can be - used as suspend_again callback of platform_suspend_ops. If the platform - requires tasks other than cm_suspend_again, it may implement its own - suspend_again callback that calls cm_suspend_again in the middle. - Normally, the platform will need to resume and suspend some devices - that are used by Charger Manager. - -* Support for premature full-battery event handling - If the battery voltage drops by "fullbatt_vchkdrop_uV" after - "fullbatt_vchkdrop_ms" from the full-battery event, the framework - restarts charging. This check is also performed while suspended by - setting wakeup time accordingly and using suspend_again. - -* Support for uevent-notify - With the charger-related events, the device sends - notification to users with UEVENT. - -2. Global Charger-Manager Data related with suspend_again -======================================================== -In order to setup Charger Manager with suspend-again feature -(in-suspend monitoring), the user should provide charger_global_desc -with setup_charger_manager(struct charger_global_desc *). -This charger_global_desc data for in-suspend monitoring is global -as the name suggests. 
Thus, the user needs to provide only once even -if there are multiple batteries. If there are multiple batteries, the -multiple instances of Charger Manager share the same charger_global_desc -and it will manage in-suspend monitoring for all instances of Charger Manager. - -The user needs to provide all the three entries properly in order to activate -in-suspend monitoring: - -struct charger_global_desc { - -char *rtc_name; - : The name of rtc (e.g., "rtc0") used to wakeup the system from - suspend for Charger Manager. The alarm interrupt (AIE) of the rtc - should be able to wake up the system from suspend. Charger Manager - saves and restores the alarm value and use the previously-defined - alarm if it is going to go off earlier than Charger Manager so that - Charger Manager does not interfere with previously-defined alarms. - -bool (*rtc_only_wakeup)(void); - : This callback should let CM know whether - the wakeup-from-suspend is caused only by the alarm of "rtc" in the - same struct. If there is any other wakeup source triggered the - wakeup, it should return false. If the "rtc" is the only wakeup - reason, it should return true. - -bool assume_timer_stops_in_suspend; - : if true, Charger Manager assumes that - the timer (CM uses jiffies as timer) stops during suspend. Then, CM - assumes that the suspend-duration is same as the alarm length. -}; - -3. How to setup suspend_again -============================= -Charger Manager provides a function "extern bool cm_suspend_again(void)". -When cm_suspend_again is called, it monitors every battery. The suspend_ops -callback of the system's platform_suspend_ops can call cm_suspend_again -function to know whether Charger Manager wants to suspend again or not. -If there are no other devices or tasks that want to use suspend_again -feature, the platform_suspend_ops may directly refer to cm_suspend_again -for its suspend_again callback. 
- -The cm_suspend_again() returns true (meaning "I want to suspend again") -if the system was woken up by Charger Manager and the polling -(in-suspend monitoring) results in "normal". - -4. Charger-Manager Data (struct charger_desc) -============================================= -For each battery charged independently from other batteries (if a series of -batteries are charged by a single charger, they are counted as one independent -battery), an instance of Charger Manager is attached to it. - -struct charger_desc { - -char *psy_name; - : The power-supply-class name of the battery. Default is - "battery" if psy_name is NULL. Users can access the psy entries - at "/sys/class/power_supply/[psy_name]/". - -enum polling_modes polling_mode; - : CM_POLL_DISABLE: do not poll this battery. - CM_POLL_ALWAYS: always poll this battery. - CM_POLL_EXTERNAL_POWER_ONLY: poll this battery if and only if - an external power source is attached. - CM_POLL_CHARGING_ONLY: poll this battery if and only if the - battery is being charged. - -unsigned int fullbatt_vchkdrop_ms; -unsigned int fullbatt_vchkdrop_uV; - : If both have non-zero values, Charger Manager will check the - battery voltage drop fullbatt_vchkdrop_ms after the battery is fully - charged. If the voltage drop is over fullbatt_vchkdrop_uV, Charger - Manager will try to recharge the battery by disabling and enabling - chargers. Recharge with voltage drop condition only (without delay - condition) is needed to be implemented with hardware interrupts from - fuel gauges or charger devices/chips. - -unsigned int fullbatt_uV; - : If specified with a non-zero value, Charger Manager assumes - that the battery is full (capacity = 100) if the battery is not being - charged and the battery voltage is equal to or greater than - fullbatt_uV. - -unsigned int polling_interval_ms; - : Required polling interval in ms. Charger Manager will poll - this battery every polling_interval_ms or more frequently. 
- -enum data_source battery_present; - : CM_BATTERY_PRESENT: assume that the battery exists. - CM_NO_BATTERY: assume that the battery does not exists. - CM_FUEL_GAUGE: get battery presence information from fuel gauge. - CM_CHARGER_STAT: get battery presence from chargers. - -char **psy_charger_stat; - : An array ending with NULL that has power-supply-class names of - chargers. Each power-supply-class should provide "PRESENT" (if - battery_present is "CM_CHARGER_STAT"), "ONLINE" (shows whether an - external power source is attached or not), and "STATUS" (shows whether - the battery is {"FULL" or not FULL} or {"FULL", "Charging", - "Discharging", "NotCharging"}). - -int num_charger_regulators; -struct regulator_bulk_data *charger_regulators; - : Regulators representing the chargers in the form for - regulator framework's bulk functions. - -char *psy_fuel_gauge; - : Power-supply-class name of the fuel gauge. - -int (*temperature_out_of_range)(int *mC); -bool measure_battery_temp; - : This callback returns 0 if the temperature is safe for charging, - a positive number if it is too hot to charge, and a negative number - if it is too cold to charge. With the variable mC, the callback returns - the temperature in 1/1000 of centigrade. - The source of temperature can be battery or ambient one according to - the value of measure_battery_temp. -}; - -5. Notify Charger-Manager of charger events: cm_notify_event() -========================================================= -If there is an charger event is required to notify -Charger Manager, a charger device driver that triggers the event can call -cm_notify_event(psy, type, msg) to notify the corresponding Charger Manager. -In the function, psy is the charger driver's power_supply pointer, which is -associated with Charger-Manager. The parameter "type" -is the same as irq's type (enum cm_event_types). The event message "msg" is -optional and is effective only if the event type is "UNDESCRIBED" or "OTHERS". - -6. 
Other Considerations -======================= - -At the charger/battery-related events such as battery-pulled-out, -charger-pulled-out, charger-inserted, DCIN-over/under-voltage, charger-stopped, -and others critical to chargers, the system should be configured to wake up. -At least the following should wake up the system from a suspend: -a) charger-on/off b) external-power-in/out c) battery-in/out (while charging) - -It is usually accomplished by configuring the PMIC as a wakeup source. diff --git a/Documentation/power/drivers-testing.rst b/Documentation/power/drivers-testing.rst new file mode 100644 index 000000000000..e53f1999fc39 --- /dev/null +++ b/Documentation/power/drivers-testing.rst @@ -0,0 +1,51 @@ +==================================================== +Testing suspend and resume support in device drivers +==================================================== + + (C) 2007 Rafael J. Wysocki , GPL + +1. Preparing the test system +============================ + +Unfortunately, to effectively test the support for the system-wide suspend and +resume transitions in a driver, it is necessary to suspend and resume a fully +functional system with this driver loaded. Moreover, that should be done +several times, preferably several times in a row, and separately for hibernation +(aka suspend to disk or STD) and suspend to RAM (STR), because each of these +cases involves slightly different operations and different interactions with +the machine's BIOS. + +Of course, for this purpose the test system has to be known to suspend and +resume without the driver being tested. Thus, if possible, you should first +resolve all suspend/resume-related problems in the test system before you start +testing the new driver. Please see Documentation/power/basic-pm-debugging.rst +for more information about the debugging of suspend/resume functionality. + +2. 
Testing the driver +===================== + +Once you have resolved the suspend/resume-related problems with your test system +without the new driver, you are ready to test it: + +a) Build the driver as a module, load it and try the test modes of hibernation + (see: Documentation/power/basic-pm-debugging.rst, 1). + +b) Load the driver and attempt to hibernate in the "reboot", "shutdown" and + "platform" modes (see: Documentation/power/basic-pm-debugging.rst, 1). + +c) Compile the driver directly into the kernel and try the test modes of + hibernation. + +d) Attempt to hibernate with the driver compiled directly into the kernel + in the "reboot", "shutdown" and "platform" modes. + +e) Try the test modes of suspend (see: Documentation/power/basic-pm-debugging.rst, + 2). [As far as the STR tests are concerned, it should not matter whether or + not the driver is built as a module.] + +f) Attempt to suspend to RAM using the s2ram tool with the driver loaded + (see: Documentation/power/basic-pm-debugging.rst, 2). + +Each of the above tests should be repeated several times and the STD tests +should be mixed with the STR tests. If any of them fails, the driver cannot be +regarded as suspend/resume-safe. diff --git a/Documentation/power/drivers-testing.txt b/Documentation/power/drivers-testing.txt deleted file mode 100644 index 638afdf4d6b8..000000000000 --- a/Documentation/power/drivers-testing.txt +++ /dev/null @@ -1,46 +0,0 @@ -Testing suspend and resume support in device drivers - (C) 2007 Rafael J. Wysocki , GPL - -1. Preparing the test system - -Unfortunately, to effectively test the support for the system-wide suspend and -resume transitions in a driver, it is necessary to suspend and resume a fully -functional system with this driver loaded. 
Moreover, that should be done -several times, preferably several times in a row, and separately for hibernation -(aka suspend to disk or STD) and suspend to RAM (STR), because each of these -cases involves slightly different operations and different interactions with -the machine's BIOS. - -Of course, for this purpose the test system has to be known to suspend and -resume without the driver being tested. Thus, if possible, you should first -resolve all suspend/resume-related problems in the test system before you start -testing the new driver. Please see Documentation/power/basic-pm-debugging.txt -for more information about the debugging of suspend/resume functionality. - -2. Testing the driver - -Once you have resolved the suspend/resume-related problems with your test system -without the new driver, you are ready to test it: - -a) Build the driver as a module, load it and try the test modes of hibernation - (see: Documentation/power/basic-pm-debugging.txt, 1). - -b) Load the driver and attempt to hibernate in the "reboot", "shutdown" and - "platform" modes (see: Documentation/power/basic-pm-debugging.txt, 1). - -c) Compile the driver directly into the kernel and try the test modes of - hibernation. - -d) Attempt to hibernate with the driver compiled directly into the kernel - in the "reboot", "shutdown" and "platform" modes. - -e) Try the test modes of suspend (see: Documentation/power/basic-pm-debugging.txt, - 2). [As far as the STR tests are concerned, it should not matter whether or - not the driver is built as a module.] - -f) Attempt to suspend to RAM using the s2ram tool with the driver loaded - (see: Documentation/power/basic-pm-debugging.txt, 2). - -Each of the above tests should be repeated several times and the STD tests -should be mixed with the STR tests. If any of them fails, the driver cannot be -regarded as suspend/resume-safe. 
diff --git a/Documentation/power/energy-model.rst b/Documentation/power/energy-model.rst
new file mode 100644
index 000000000000..90a345d57ae9
--- /dev/null
+++ b/Documentation/power/energy-model.rst
@@ -0,0 +1,147 @@
+====================
+Energy Model of CPUs
+====================
+
+1. Overview
+-----------
+
+The Energy Model (EM) framework serves as an interface between drivers knowing
+the power consumed by CPUs at various performance levels, and the kernel
+subsystems willing to use that information to make energy-aware decisions.
+
+The source of the information about the power consumed by CPUs can vary greatly
+from one platform to another. These power costs can be estimated using
+devicetree data in some cases. In others, the firmware will know better.
+Alternatively, userspace might be best positioned. And so on. To spare each
+client subsystem from re-implementing support for every possible source of
+information on its own, the EM framework acts as an abstraction layer which
+standardizes the format of power cost tables in the kernel, thereby avoiding
+redundant work.
+
+The figure below depicts an example of drivers (Arm-specific here, but the
+approach is applicable to any architecture) providing power costs to the EM
+framework, and interested clients reading the data from it::
+
+       +---------------+  +-----------------+  +---------------+
+       | Thermal (IPA) |  | Scheduler (EAS) |  |     Other     |
+       +---------------+  +-----------------+  +---------------+
+               |                   | em_pd_energy()    |
+               |                   | em_cpu_get()      |
+               +---------+         |         +---------+
+                         |         |         |
+                         v         v         v
+                        +---------------------+
+                        |    Energy Model     |
+                        |     Framework       |
+                        +---------------------+
+                           ^       ^       ^
+                           |       |       | em_register_perf_domain()
+                +----------+       |       +---------+
+                |                  |                 |
+        +---------------+  +---------------+  +--------------+
+        |  cpufreq-dt   |  |   arm_scmi    |  |    Other     |
+        +---------------+  +---------------+  +--------------+
+                ^                  ^                 ^
+                |                  |                 |
+        +--------------+   +---------------+  +--------------+
+        | Device Tree  |   |   Firmware    |  |      ?       |
+        +--------------+   +---------------+  +--------------+
+
+The EM framework manages power cost tables per 'performance domain' in the
+system. A performance domain is a group of CPUs whose performance is scaled
+together. Performance domains generally have a 1-to-1 mapping with CPUFreq
+policies. All CPUs in a performance domain are required to have the same
+micro-architecture. CPUs in different performance domains can have different
+micro-architectures.
+
+
+2. Core APIs
+------------
+
+2.1 Config options
+^^^^^^^^^^^^^^^^^^
+
+CONFIG_ENERGY_MODEL must be enabled to use the EM framework.
+
+
+2.2 Registration of performance domains
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Drivers are expected to register performance domains into the EM framework by
+calling the following API::
+
+  int em_register_perf_domain(cpumask_t *span, unsigned int nr_states,
+                              struct em_data_callback *cb);
+
+Drivers must specify the CPUs of the performance domains using the cpumask
+argument, and provide a callback function returning <frequency, power> tuples
+for each capacity state. The callback function provided by the driver is free
+to fetch data from any relevant location (DT, firmware, ...), and by any means
+deemed necessary. See Section 3 for an example of a driver implementing this
+callback, and kernel/power/energy_model.c for further documentation on this
+API.
+
+
+2.3 Accessing performance domains
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Subsystems interested in the energy model of a CPU can retrieve it using the
+em_cpu_get() API. The energy model tables are allocated once upon creation of
+the performance domains, and kept in memory untouched.
+
+The energy consumed by a performance domain can be estimated using the
+em_pd_energy() API. The estimation is performed assuming that the schedutil
+CPUfreq governor is in use.
+
+More details about the above APIs can be found in include/linux/energy_model.h.
+
+
+3. Example driver
+-----------------
+
+This section provides a simple example of a CPUFreq driver registering a
+performance domain in the Energy Model framework using the (fake) 'foo'
+protocol. The driver implements an est_power() function to be provided to the
+EM framework::
+
+  -> drivers/cpufreq/foo_cpufreq.c
+
+  01	static int est_power(unsigned long *mW, unsigned long *KHz, int cpu)
+  02	{
+  03		long freq, power;
+  04
+  05		/* Use the 'foo' protocol to ceil the frequency */
+  06		freq = foo_get_freq_ceil(cpu, *KHz);
+  07		if (freq < 0)
+  08			return freq;
+  09
+  10		/* Estimate the power cost for the CPU at the relevant freq. */
+  11		power = foo_estimate_power(cpu, freq);
+  12		if (power < 0)
+  13			return power;
+  14
+  15		/* Return the values to the EM framework */
+  16		*mW = power;
+  17		*KHz = freq;
+  18
+  19		return 0;
+  20	}
+  21
+  22	static int foo_cpufreq_init(struct cpufreq_policy *policy)
+  23	{
+  24		struct em_data_callback em_cb = EM_DATA_CB(est_power);
+  25		int nr_opp, ret;
+  26
+  27		/* Do the actual CPUFreq init work ... */
+  28		ret = do_foo_cpufreq_init(policy);
+  29		if (ret)
+  30			return ret;
+  31
+  32		/* Find the number of OPPs for this policy */
+  33		nr_opp = foo_get_nr_opp(policy);
+  34
+  35		/* And register the new performance domain */
+  36		em_register_perf_domain(policy->cpus, nr_opp, &em_cb);
+  37
+  38		return 0;
+  39	}
diff --git a/Documentation/power/energy-model.txt b/Documentation/power/energy-model.txt
deleted file mode 100644
index a2b0ae4c76bd..000000000000
--- a/Documentation/power/energy-model.txt
+++ /dev/null
@@ -1,144 +0,0 @@
-                           ====================
-                           Energy Model of CPUs
-                           ====================
-
-1. Overview
------------
-
-The Energy Model (EM) framework serves as an interface between drivers knowing
-the power consumed by CPUs at various performance levels, and the kernel
-subsystems willing to use that information to make energy-aware decisions.
-
-The source of the information about the power consumed by CPUs can vary greatly
-from one platform to another. These power costs can be estimated using
-devicetree data in some cases. In others, the firmware will know better.
-Alternatively, userspace might be best positioned. And so on. In order to avoid
-each and every client subsystem to re-implement support for each and every
-possible source of information on its own, the EM framework intervenes as an
-abstraction layer which standardizes the format of power cost tables in the
-kernel, hence enabling to avoid redundant work.
-
-The figure below depicts an example of drivers (Arm-specific here, but the
-approach is applicable to any architecture) providing power costs to the EM
-framework, and interested clients reading the data from it.
- - +---------------+ +-----------------+ +---------------+ - | Thermal (IPA) | | Scheduler (EAS) | | Other | - +---------------+ +-----------------+ +---------------+ - | | em_pd_energy() | - | | em_cpu_get() | - +---------+ | +---------+ - | | | - v v v - +---------------------+ - | Energy Model | - | Framework | - +---------------------+ - ^ ^ ^ - | | | em_register_perf_domain() - +----------+ | +---------+ - | | | - +---------------+ +---------------+ +--------------+ - | cpufreq-dt | | arm_scmi | | Other | - +---------------+ +---------------+ +--------------+ - ^ ^ ^ - | | | - +--------------+ +---------------+ +--------------+ - | Device Tree | | Firmware | | ? | - +--------------+ +---------------+ +--------------+ - -The EM framework manages power cost tables per 'performance domain' in the -system. A performance domain is a group of CPUs whose performance is scaled -together. Performance domains generally have a 1-to-1 mapping with CPUFreq -policies. All CPUs in a performance domain are required to have the same -micro-architecture. CPUs in different performance domains can have different -micro-architectures. - - -2. Core APIs ------------- - - 2.1 Config options - -CONFIG_ENERGY_MODEL must be enabled to use the EM framework. - - - 2.2 Registration of performance domains - -Drivers are expected to register performance domains into the EM framework by -calling the following API: - - int em_register_perf_domain(cpumask_t *span, unsigned int nr_states, - struct em_data_callback *cb); - -Drivers must specify the CPUs of the performance domains using the cpumask -argument, and provide a callback function returning tuples -for each capacity state. The callback function provided by the driver is free -to fetch data from any relevant location (DT, firmware, ...), and by any mean -deemed necessary. See Section 3. for an example of driver implementing this -callback, and kernel/power/energy_model.c for further documentation on this -API. 
- - - 2.3 Accessing performance domains - -Subsystems interested in the energy model of a CPU can retrieve it using the -em_cpu_get() API. The energy model tables are allocated once upon creation of -the performance domains, and kept in memory untouched. - -The energy consumed by a performance domain can be estimated using the -em_pd_energy() API. The estimation is performed assuming that the schedutil -CPUfreq governor is in use. - -More details about the above APIs can be found in include/linux/energy_model.h. - - -3. Example driver ------------------ - -This section provides a simple example of a CPUFreq driver registering a -performance domain in the Energy Model framework using the (fake) 'foo' -protocol. The driver implements an est_power() function to be provided to the -EM framework. - - -> drivers/cpufreq/foo_cpufreq.c - -01 static int est_power(unsigned long *mW, unsigned long *KHz, int cpu) -02 { -03 long freq, power; -04 -05 /* Use the 'foo' protocol to ceil the frequency */ -06 freq = foo_get_freq_ceil(cpu, *KHz); -07 if (freq < 0); -08 return freq; -09 -10 /* Estimate the power cost for the CPU at the relevant freq. */ -11 power = foo_estimate_power(cpu, freq); -12 if (power < 0); -13 return power; -14 -15 /* Return the values to the EM framework */ -16 *mW = power; -17 *KHz = freq; -18 -19 return 0; -20 } -21 -22 static int foo_cpufreq_init(struct cpufreq_policy *policy) -23 { -24 struct em_data_callback em_cb = EM_DATA_CB(est_power); -25 int nr_opp, ret; -26 -27 /* Do the actual CPUFreq init work ... 
*/ -28 ret = do_foo_cpufreq_init(policy); -29 if (ret) -30 return ret; -31 -32 /* Find the number of OPPs for this policy */ -33 nr_opp = foo_get_nr_opp(policy); -34 -35 /* And register the new performance domain */ -36 em_register_perf_domain(policy->cpus, nr_opp, &em_cb); -37 -38 return 0; -39 } diff --git a/Documentation/power/freezing-of-tasks.rst b/Documentation/power/freezing-of-tasks.rst new file mode 100644 index 000000000000..ef110fe55e82 --- /dev/null +++ b/Documentation/power/freezing-of-tasks.rst @@ -0,0 +1,244 @@ +================= +Freezing of tasks +================= + +(C) 2007 Rafael J. Wysocki , GPL + +I. What is the freezing of tasks? +================================= + +The freezing of tasks is a mechanism by which user space processes and some +kernel threads are controlled during hibernation or system-wide suspend (on some +architectures). + +II. How does it work? +===================== + +There are three per-task flags used for that, PF_NOFREEZE, PF_FROZEN +and PF_FREEZER_SKIP (the last one is auxiliary). The tasks that have +PF_NOFREEZE unset (all user space processes and some kernel threads) are +regarded as 'freezable' and treated in a special way before the system enters a +suspend state as well as before a hibernation image is created (in what follows +we only consider hibernation, but the description also applies to suspend). + +Namely, as the first step of the hibernation procedure the function +freeze_processes() (defined in kernel/power/process.c) is called. A system-wide +variable system_freezing_cnt (as opposed to a per-task flag) is used to indicate +whether the system is to undergo a freezing operation. And freeze_processes() +sets this variable. After this, it executes try_to_freeze_tasks() that sends a +fake signal to all user space processes, and wakes up all the kernel threads. 
+All freezable tasks must react to that by calling try_to_freeze(), which +results in a call to __refrigerator() (defined in kernel/freezer.c), which sets +the task's PF_FROZEN flag, changes its state to TASK_UNINTERRUPTIBLE and makes +it loop until PF_FROZEN is cleared for it. Then, we say that the task is +'frozen' and therefore the set of functions handling this mechanism is referred +to as 'the freezer' (these functions are defined in kernel/power/process.c, +kernel/freezer.c & include/linux/freezer.h). User space processes are generally +frozen before kernel threads. + +__refrigerator() must not be called directly. Instead, use the +try_to_freeze() function (defined in include/linux/freezer.h), that checks +if the task is to be frozen and makes the task enter __refrigerator(). + +For user space processes try_to_freeze() is called automatically from the +signal-handling code, but the freezable kernel threads need to call it +explicitly in suitable places or use the wait_event_freezable() or +wait_event_freezable_timeout() macros (defined in include/linux/freezer.h) +that combine interruptible sleep with checking if the task is to be frozen and +calling try_to_freeze(). The main loop of a freezable kernel thread may look +like the following one:: + + set_freezable(); + do { + hub_events(); + wait_event_freezable(khubd_wait, + !list_empty(&hub_event_list) || + kthread_should_stop()); + } while (!kthread_should_stop() || !list_empty(&hub_event_list)); + +(from drivers/usb/core/hub.c::hub_thread()). + +If a freezable kernel thread fails to call try_to_freeze() after the freezer has +initiated a freezing operation, the freezing of tasks will fail and the entire +hibernation operation will be cancelled. For this reason, freezable kernel +threads must call try_to_freeze() somewhere or use one of the +wait_event_freezable() and wait_event_freezable_timeout() macros. 
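To make the flag handling described above more concrete, here is a toy userspace model in C. It is only a sketch under stated assumptions: the toy_ function names and the TOY_ flag values are invented for illustration and are not the kernel's definitions, and the real try_to_freeze() does not return until the task has been thawed, whereas this model merely records the state change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy flag bits: arbitrary values, NOT the kernel's definitions. */
#define TOY_PF_NOFREEZE	0x1u	/* task must not be frozen */
#define TOY_PF_FROZEN	0x2u	/* task is currently frozen */

struct toy_task {
	unsigned int flags;
};

/* Stand-in for the system-wide "freezing in progress" indicator. */
static bool toy_system_freezing;

/* Models set_freezable(): the task clears its own no-freeze flag. */
void toy_set_freezable(struct toy_task *t)
{
	t->flags &= ~TOY_PF_NOFREEZE;
}

/* Models the first step of freeze_processes(): announce the freeze. */
void toy_freeze_begin(void)
{
	toy_system_freezing = true;
}

/*
 * Models try_to_freeze(): a freezable task marks itself frozen if a
 * freeze has been announced. Returns true if the task became frozen.
 */
bool toy_try_to_freeze(struct toy_task *t)
{
	if (!toy_system_freezing || (t->flags & TOY_PF_NOFREEZE))
		return false;
	t->flags |= TOY_PF_FROZEN;
	return true;
}

/* Models thaw_processes(): end the freeze and clear every frozen flag. */
void toy_thaw_all(struct toy_task *tasks, size_t n)
{
	size_t i;

	toy_system_freezing = false;
	for (i = 0; i < n; i++)
		tasks[i].flags &= ~TOY_PF_FROZEN;
}

/* Walk one kernel-thread-like task through a freeze/thaw cycle. */
void toy_demo(void)
{
	struct toy_task kthread = { .flags = TOY_PF_NOFREEZE };

	toy_freeze_begin();
	assert(!toy_try_to_freeze(&kthread));	/* not freezable by default */
	toy_set_freezable(&kthread);
	assert(toy_try_to_freeze(&kthread));	/* freezable now, so it freezes */
	toy_thaw_all(&kthread, 1);
	assert(!(kthread.flags & TOY_PF_FROZEN));
}
```

The model keeps the "freeze announced" state separate from the per-task flags, mirroring the split the text draws between the system-wide variable and the per-task PF_ flags.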
+ +After the system memory state has been restored from a hibernation image and +devices have been reinitialized, the function thaw_processes() is called in +order to clear the PF_FROZEN flag for each frozen task. Then, the tasks that +have been frozen leave __refrigerator() and continue running. + + +Rationale behind the functions dealing with freezing and thawing of tasks +------------------------------------------------------------------------- + +freeze_processes(): + - freezes only userspace tasks + +freeze_kernel_threads(): + - freezes all tasks (including kernel threads) because we can't freeze + kernel threads without freezing userspace tasks + +thaw_kernel_threads(): + - thaws only kernel threads; this is particularly useful if we need to do + anything special in between thawing of kernel threads and thawing of + userspace tasks, or if we want to postpone the thawing of userspace tasks + +thaw_processes(): + - thaws all tasks (including kernel threads) because we can't thaw userspace + tasks without thawing kernel threads + + +III. Which kernel threads are freezable? +======================================== + +Kernel threads are not freezable by default. However, a kernel thread may clear +PF_NOFREEZE for itself by calling set_freezable() (the resetting of PF_NOFREEZE +directly is not allowed). From this point it is regarded as freezable +and must call try_to_freeze() in a suitable place. + +IV. Why do we do that? +====================== + +Generally speaking, there is a couple of reasons to use the freezing of tasks: + +1. The principal reason is to prevent filesystems from being damaged after + hibernation. At the moment we have no simple means of checkpointing + filesystems, so if there are any modifications made to filesystem data and/or + metadata on disks, we cannot bring them back to the state from before the + modifications. 
At the same time each hibernation image contains some + filesystem-related information that must be consistent with the state of the + on-disk data and metadata after the system memory state has been restored + from the image (otherwise the filesystems will be damaged in a nasty way, + usually making them almost impossible to repair). We therefore freeze + tasks that might cause the on-disk filesystems' data and metadata to be + modified after the hibernation image has been created and before the + system is finally powered off. The majority of these are user space + processes, but if any of the kernel threads may cause something like this + to happen, they have to be freezable. + +2. Next, to create the hibernation image we need to free a sufficient amount of + memory (approximately 50% of available RAM) and we need to do that before + devices are deactivated, because we generally need them for swapping out. + Then, after the memory for the image has been freed, we don't want tasks + to allocate additional memory and we prevent them from doing that by + freezing them earlier. [Of course, this also means that device drivers + should not allocate substantial amounts of memory from their .suspend() + callbacks before hibernation, but this is a separate issue.] + +3. The third reason is to prevent user space processes and some kernel threads + from interfering with the suspending and resuming of devices. A user space + process running on a second CPU while we are suspending devices may, for + example, be troublesome and without the freezing of tasks we would need some + safeguards against race conditions that might occur in such a case. + +Although Linus Torvalds doesn't like the freezing of tasks, he said this in one +of the discussions on LKML (http://lkml.org/lkml/2007/4/27/608): + +"RJW:> Why we freeze tasks at all or why we freeze kernel threads? + +Linus: In many ways, 'at all'. 
+ +I **do** realize the IO request queue issues, and that we cannot actually do +s2ram with some devices in the middle of a DMA. So we want to be able to +avoid *that*, there's no question about that. And I suspect that stopping +user threads and then waiting for a sync is practically one of the easier +ways to do so. + +So in practice, the 'at all' may become a 'why freeze kernel threads?' and +freezing user threads I don't find really objectionable." + +Still, there are kernel threads that may want to be freezable. For example, if +a kernel thread that belongs to a device driver accesses the device directly, it +in principle needs to know when the device is suspended, so that it doesn't try +to access it at that time. However, if the kernel thread is freezable, it will +be frozen before the driver's .suspend() callback is executed and it will be +thawed after the driver's .resume() callback has run, so it won't be accessing +the device while it's suspended. + +4. Another reason for freezing tasks is to prevent user space processes from + realizing that hibernation (or suspend) operation takes place. Ideally, user + space processes should not notice that such a system-wide operation has + occurred and should continue running without any problems after the restore + (or resume from suspend). Unfortunately, in the most general case this + is quite difficult to achieve without the freezing of tasks. Consider, + for example, a process that depends on all CPUs being online while it's + running. Since we need to disable nonboot CPUs during the hibernation, + if this process is not frozen, it may notice that the number of CPUs has + changed and may start to work incorrectly because of that. + +V. Are there any problems related to the freezing of tasks? +=========================================================== + +Yes, there are. + +First of all, the freezing of kernel threads may be tricky if they depend one +on another. 
For example, if kernel thread A waits for a completion (in the +TASK_UNINTERRUPTIBLE state) that needs to be done by freezable kernel thread B +and B is frozen in the meantime, then A will be blocked until B is thawed, which +may be undesirable. That's why kernel threads are not freezable by default. + +Second, there are the following two problems related to the freezing of user +space processes: + +1. Putting processes into an uninterruptible sleep distorts the load average. +2. Now that we have FUSE, plus the framework for doing device drivers in + userspace, it gets even more complicated because some userspace processes are + now doing the sorts of things that kernel threads do + (https://lists.linux-foundation.org/pipermail/linux-pm/2007-May/012309.html). + +The problem 1. seems to be fixable, although it hasn't been fixed so far. The +other one is more serious, but it seems that we can work around it by using +hibernation (and suspend) notifiers (in that case, though, we won't be able to +avoid the realization by the user space processes that the hibernation is taking +place). + +There are also problems that the freezing of tasks tends to expose, although +they are not directly related to it. For example, if request_firmware() is +called from a device driver's .resume() routine, it will timeout and eventually +fail, because the user land process that should respond to the request is frozen +at this point. So, seemingly, the failure is due to the freezing of tasks. +Suppose, however, that the firmware file is located on a filesystem accessible +only through another device that hasn't been resumed yet. In that case, +request_firmware() will fail regardless of whether or not the freezing of tasks +is used. Consequently, the problem is not really related to the freezing of +tasks, since it generally exists anyway. + +A driver must have all firmwares it may need in RAM before suspend() is called. 
+If keeping them is not practical, for example due to their size, they must be
+requested early enough using the suspend notifier API described in
+Documentation/driver-api/pm/notifiers.rst.
+
+VI. Are there any precautions to be taken to prevent freezing failures?
+=======================================================================
+
+Yes, there are.
+
+First of all, grabbing the 'system_transition_mutex' lock to mutually exclude a
+piece of code from system-wide sleep such as suspend/hibernation is not
+encouraged. If possible, that piece of code must instead hook onto the
+suspend/hibernation notifiers to achieve mutual exclusion. Look at the
+CPU-Hotplug code (kernel/cpu.c) for an example.
+
+However, if that is not feasible, and grabbing 'system_transition_mutex' is
+deemed necessary, calling mutex_[un]lock(&system_transition_mutex) directly
+is strongly discouraged, since that could lead to freezing failures: if the
+suspend/hibernate code has already acquired 'system_transition_mutex', the
+other task fails to acquire it and blocks in the TASK_UNINTERRUPTIBLE state.
+As a consequence, the freezer is unable to freeze that task, and the freezing
+operation fails.
+
+However, the [un]lock_system_sleep() APIs are safe to use in this scenario,
+because they ask the freezer to skip freezing this task: it is anyway "frozen
+enough" while blocked on 'system_transition_mutex', which will be released
+only after the entire suspend/hibernation sequence is complete.
+So, to summarize, use [un]lock_system_sleep() instead of directly using
+mutex_[un]lock(&system_transition_mutex). That would prevent freezing failures.
+
+VII. Miscellaneous
+==================
+
+/sys/power/pm_freeze_timeout sets the maximum time, in milliseconds, allowed for
+freezing all user space processes or all freezable kernel threads. The default
+value is 20000 and the accepted range is that of an unsigned integer.
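As a rough illustration of the pm_freeze_timeout knob described in the Miscellaneous section above, the following userspace C sketch reads the value and parses it as an unsigned number of milliseconds. The helper names parse_freeze_timeout() and read_freeze_timeout() are hypothetical, and the sketch assumes the file holds a single decimal number, optionally followed by a newline.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse the text read from pm_freeze_timeout into milliseconds.
 * Returns 0 on success and -1 on malformed or out-of-range input.
 */
int parse_freeze_timeout(const char *text, unsigned int *out_ms)
{
	char *end;
	unsigned long val;

	assert(text && out_ms);

	/* Reject empty, signed or otherwise non-numeric input up front. */
	if (text[0] < '0' || text[0] > '9')
		return -1;

	errno = 0;
	val = strtoul(text, &end, 10);
	if (errno || val > UINT_MAX)
		return -1;

	/* Only a newline or the end of the string may follow the number. */
	if (*end != '\n' && *end != '\0')
		return -1;

	*out_ms = (unsigned int)val;
	return 0;
}

/* Read the timeout from path (normally /sys/power/pm_freeze_timeout). */
int read_freeze_timeout(const char *path, unsigned int *out_ms)
{
	char buf[32];
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return parse_freeze_timeout(buf, out_ms);
}
```

A caller would pass "/sys/power/pm_freeze_timeout" as the path and treat a non-zero return as "value not available", which keeps the sketch usable on systems where the file is absent.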
diff --git a/Documentation/power/freezing-of-tasks.txt b/Documentation/power/freezing-of-tasks.txt deleted file mode 100644 index cd283190855a..000000000000 --- a/Documentation/power/freezing-of-tasks.txt +++ /dev/null @@ -1,231 +0,0 @@ -Freezing of tasks - (C) 2007 Rafael J. Wysocki , GPL - -I. What is the freezing of tasks? - -The freezing of tasks is a mechanism by which user space processes and some -kernel threads are controlled during hibernation or system-wide suspend (on some -architectures). - -II. How does it work? - -There are three per-task flags used for that, PF_NOFREEZE, PF_FROZEN -and PF_FREEZER_SKIP (the last one is auxiliary). The tasks that have -PF_NOFREEZE unset (all user space processes and some kernel threads) are -regarded as 'freezable' and treated in a special way before the system enters a -suspend state as well as before a hibernation image is created (in what follows -we only consider hibernation, but the description also applies to suspend). - -Namely, as the first step of the hibernation procedure the function -freeze_processes() (defined in kernel/power/process.c) is called. A system-wide -variable system_freezing_cnt (as opposed to a per-task flag) is used to indicate -whether the system is to undergo a freezing operation. And freeze_processes() -sets this variable. After this, it executes try_to_freeze_tasks() that sends a -fake signal to all user space processes, and wakes up all the kernel threads. -All freezable tasks must react to that by calling try_to_freeze(), which -results in a call to __refrigerator() (defined in kernel/freezer.c), which sets -the task's PF_FROZEN flag, changes its state to TASK_UNINTERRUPTIBLE and makes -it loop until PF_FROZEN is cleared for it. Then, we say that the task is -'frozen' and therefore the set of functions handling this mechanism is referred -to as 'the freezer' (these functions are defined in kernel/power/process.c, -kernel/freezer.c & include/linux/freezer.h). 
User space processes are generally -frozen before kernel threads. - -__refrigerator() must not be called directly. Instead, use the -try_to_freeze() function (defined in include/linux/freezer.h), that checks -if the task is to be frozen and makes the task enter __refrigerator(). - -For user space processes try_to_freeze() is called automatically from the -signal-handling code, but the freezable kernel threads need to call it -explicitly in suitable places or use the wait_event_freezable() or -wait_event_freezable_timeout() macros (defined in include/linux/freezer.h) -that combine interruptible sleep with checking if the task is to be frozen and -calling try_to_freeze(). The main loop of a freezable kernel thread may look -like the following one: - - set_freezable(); - do { - hub_events(); - wait_event_freezable(khubd_wait, - !list_empty(&hub_event_list) || - kthread_should_stop()); - } while (!kthread_should_stop() || !list_empty(&hub_event_list)); - -(from drivers/usb/core/hub.c::hub_thread()). - -If a freezable kernel thread fails to call try_to_freeze() after the freezer has -initiated a freezing operation, the freezing of tasks will fail and the entire -hibernation operation will be cancelled. For this reason, freezable kernel -threads must call try_to_freeze() somewhere or use one of the -wait_event_freezable() and wait_event_freezable_timeout() macros. - -After the system memory state has been restored from a hibernation image and -devices have been reinitialized, the function thaw_processes() is called in -order to clear the PF_FROZEN flag for each frozen task. Then, the tasks that -have been frozen leave __refrigerator() and continue running. 
-
-
-Rationale behind the functions dealing with freezing and thawing of tasks:
--------------------------------------------------------------------------
-
-freeze_processes():
-  - freezes only userspace tasks
-
-freeze_kernel_threads():
-  - freezes all tasks (including kernel threads) because we can't freeze
-    kernel threads without freezing userspace tasks
-
-thaw_kernel_threads():
-  - thaws only kernel threads; this is particularly useful if we need to do
-    anything special in between thawing of kernel threads and thawing of
-    userspace tasks, or if we want to postpone the thawing of userspace tasks
-
-thaw_processes():
-  - thaws all tasks (including kernel threads) because we can't thaw userspace
-    tasks without thawing kernel threads
-
-
-III. Which kernel threads are freezable?
-
-Kernel threads are not freezable by default. However, a kernel thread may clear
-PF_NOFREEZE for itself by calling set_freezable() (the resetting of PF_NOFREEZE
-directly is not allowed). From this point it is regarded as freezable
-and must call try_to_freeze() in a suitable place.
-
-IV. Why do we do that?
-
-Generally speaking, there is a couple of reasons to use the freezing of tasks:
-
-1. The principal reason is to prevent filesystems from being damaged after
-hibernation. At the moment we have no simple means of checkpointing
-filesystems, so if there are any modifications made to filesystem data and/or
-metadata on disks, we cannot bring them back to the state from before the
-modifications. At the same time each hibernation image contains some
-filesystem-related information that must be consistent with the state of the
-on-disk data and metadata after the system memory state has been restored from
-the image (otherwise the filesystems will be damaged in a nasty way, usually
-making them almost impossible to repair). We therefore freeze tasks that might
-cause the on-disk filesystems' data and metadata to be modified after the
-hibernation image has been created and before the system is finally powered off.
-The majority of these are user space processes, but if any of the kernel threads
-may cause something like this to happen, they have to be freezable.
-
-2. Next, to create the hibernation image we need to free a sufficient amount of
-memory (approximately 50% of available RAM) and we need to do that before
-devices are deactivated, because we generally need them for swapping out. Then,
-after the memory for the image has been freed, we don't want tasks to allocate
-additional memory and we prevent them from doing that by freezing them earlier.
-[Of course, this also means that device drivers should not allocate substantial
-amounts of memory from their .suspend() callbacks before hibernation, but this
-is a separate issue.]
-
-3. The third reason is to prevent user space processes and some kernel threads
-from interfering with the suspending and resuming of devices. A user space
-process running on a second CPU while we are suspending devices may, for
-example, be troublesome and without the freezing of tasks we would need some
-safeguards against race conditions that might occur in such a case.
-
-Although Linus Torvalds doesn't like the freezing of tasks, he said this in one
-of the discussions on LKML (http://lkml.org/lkml/2007/4/27/608):
-
-"RJW:> Why we freeze tasks at all or why we freeze kernel threads?
-
-Linus: In many ways, 'at all'.
-
-I _do_ realize the IO request queue issues, and that we cannot actually do
-s2ram with some devices in the middle of a DMA. So we want to be able to
-avoid *that*, there's no question about that. And I suspect that stopping
-user threads and then waiting for a sync is practically one of the easier
-ways to do so.
-
-So in practice, the 'at all' may become a 'why freeze kernel threads?' and
-freezing user threads I don't find really objectionable."
-
-Still, there are kernel threads that may want to be freezable. For example, if
-a kernel thread that belongs to a device driver accesses the device directly, it
-in principle needs to know when the device is suspended, so that it doesn't try
-to access it at that time. However, if the kernel thread is freezable, it will
-be frozen before the driver's .suspend() callback is executed and it will be
-thawed after the driver's .resume() callback has run, so it won't be accessing
-the device while it's suspended.
-
-4. Another reason for freezing tasks is to prevent user space processes from
-realizing that hibernation (or suspend) operation takes place. Ideally, user
-space processes should not notice that such a system-wide operation has occurred
-and should continue running without any problems after the restore (or resume
-from suspend). Unfortunately, in the most general case this is quite difficult
-to achieve without the freezing of tasks. Consider, for example, a process
-that depends on all CPUs being online while it's running. Since we need to
-disable nonboot CPUs during the hibernation, if this process is not frozen, it
-may notice that the number of CPUs has changed and may start to work incorrectly
-because of that.
-
-V. Are there any problems related to the freezing of tasks?
-
-Yes, there are.
-
-First of all, the freezing of kernel threads may be tricky if they depend one
-on another. For example, if kernel thread A waits for a completion (in the
-TASK_UNINTERRUPTIBLE state) that needs to be done by freezable kernel thread B
-and B is frozen in the meantime, then A will be blocked until B is thawed, which
-may be undesirable. That's why kernel threads are not freezable by default.
-
-Second, there are the following two problems related to the freezing of user
-space processes:
-1. Putting processes into an uninterruptible sleep distorts the load average.
-2. Now that we have FUSE, plus the framework for doing device drivers in
-userspace, it gets even more complicated because some userspace processes are
-now doing the sorts of things that kernel threads do
-(https://lists.linux-foundation.org/pipermail/linux-pm/2007-May/012309.html).
-
-The problem 1. seems to be fixable, although it hasn't been fixed so far. The
-other one is more serious, but it seems that we can work around it by using
-hibernation (and suspend) notifiers (in that case, though, we won't be able to
-avoid the realization by the user space processes that the hibernation is taking
-place).
-
-There are also problems that the freezing of tasks tends to expose, although
-they are not directly related to it. For example, if request_firmware() is
-called from a device driver's .resume() routine, it will timeout and eventually
-fail, because the user land process that should respond to the request is frozen
-at this point. So, seemingly, the failure is due to the freezing of tasks.
-Suppose, however, that the firmware file is located on a filesystem accessible
-only through another device that hasn't been resumed yet. In that case,
-request_firmware() will fail regardless of whether or not the freezing of tasks
-is used. Consequently, the problem is not really related to the freezing of
-tasks, since it generally exists anyway.
-
-A driver must have all firmwares it may need in RAM before suspend() is called.
-If keeping them is not practical, for example due to their size, they must be
-requested early enough using the suspend notifier API described in
-Documentation/driver-api/pm/notifiers.rst.
-
-VI. Are there any precautions to be taken to prevent freezing failures?
-
-Yes, there are.
-
-First of all, grabbing the 'system_transition_mutex' lock to mutually exclude a piece of code
-from system-wide sleep such as suspend/hibernation is not encouraged.
-If possible, that piece of code must instead hook onto the suspend/hibernation
-notifiers to achieve mutual exclusion. Look at the CPU-Hotplug code
-(kernel/cpu.c) for an example.
-
-However, if that is not feasible, and grabbing 'system_transition_mutex' is deemed necessary,
-it is strongly discouraged to directly call mutex_[un]lock(&system_transition_mutex) since
-that could lead to freezing failures, because if the suspend/hibernate code
-successfully acquired the 'system_transition_mutex' lock, and hence that other entity failed
-to acquire the lock, then that task would get blocked in TASK_UNINTERRUPTIBLE
-state. As a consequence, the freezer would not be able to freeze that task,
-leading to freezing failure.
-
-However, the [un]lock_system_sleep() APIs are safe to use in this scenario,
-since they ask the freezer to skip freezing this task, since it is anyway
-"frozen enough" as it is blocked on 'system_transition_mutex', which will be released
-only after the entire suspend/hibernation sequence is complete.
-So, to summarize, use [un]lock_system_sleep() instead of directly using
-mutex_[un]lock(&system_transition_mutex). That would prevent freezing failures.
-
-V. Miscellaneous
-/sys/power/pm_freeze_timeout controls how long it will cost at most to freeze
-all user space processes or all freezable kernel threads, in unit of millisecond.
-The default value is 20000, with range of unsigned integer.
diff --git a/Documentation/power/index.rst b/Documentation/power/index.rst
new file mode 100644
index 000000000000..20415f21e48a
--- /dev/null
+++ b/Documentation/power/index.rst
@@ -0,0 +1,46 @@
+:orphan:
+
+================
+Power Management
+================
+
+.. toctree::
+    :maxdepth: 1
+
+    apm-acpi
+    basic-pm-debugging
+    charger-manager
+    drivers-testing
+    energy-model
+    freezing-of-tasks
+    interface
+    opp
+    pci
+    pm_qos_interface
+    power_supply_class
+    runtime_pm
+    s2ram
+    suspend-and-cpuhotplug
+    suspend-and-interrupts
+    swsusp-and-swap-files
+    swsusp-dmcrypt
+    swsusp
+    video
+    tricks
+
+    userland-swsusp
+
+    powercap/powercap
+
+    regulator/consumer
+    regulator/design
+    regulator/machine
+    regulator/overview
+    regulator/regulator
+
+.. only:: subproject and html
+
+   Indices
+   =======
+
+   * :ref:`genindex`
diff --git a/Documentation/power/interface.rst b/Documentation/power/interface.rst
new file mode 100644
index 000000000000..8d270ed27228
--- /dev/null
+++ b/Documentation/power/interface.rst
@@ -0,0 +1,79 @@
+===========================================
+Power Management Interface for System Sleep
+===========================================
+
+Copyright (c) 2016 Intel Corp., Rafael J. Wysocki
+
+The power management subsystem provides userspace with a unified sysfs interface
+for system sleep regardless of the underlying system architecture or platform.
+The interface is located in the /sys/power/ directory (assuming that sysfs is
+mounted at /sys).
+
+/sys/power/state is the system sleep state control file.
+
+Reading from it returns a list of supported sleep states, encoded as:
+
+- 'freeze' (Suspend-to-Idle)
+- 'standby' (Power-On Suspend)
+- 'mem' (Suspend-to-RAM)
+- 'disk' (Suspend-to-Disk)
+
+Suspend-to-Idle is always supported. Suspend-to-Disk is always supported
+too as long the kernel has been configured to support hibernation at all
+(ie. CONFIG_HIBERNATION is set in the kernel configuration file). Support
+for Suspend-to-RAM and Power-On Suspend depends on the capabilities of the
+platform.
+
+If one of the strings listed in /sys/power/state is written to it, the system
+will attempt to transition into the corresponding sleep state. Refer to
+Documentation/admin-guide/pm/sleep-states.rst for a description of each of
+those states.
+
+/sys/power/disk controls the operating mode of hibernation (Suspend-to-Disk).
+Specifically, it tells the kernel what to do after creating a hibernation image.
+
+Reading from it returns a list of supported options encoded as:
+
+- 'platform' (put the system into sleep using a platform-provided method)
+- 'shutdown' (shut the system down)
+- 'reboot' (reboot the system)
+- 'suspend' (trigger a Suspend-to-RAM transition)
+- 'test_resume' (resume-after-hibernation test mode)
+
+The currently selected option is printed in square brackets.
+
+The 'platform' option is only available if the platform provides a special
+mechanism to put the system to sleep after creating a hibernation image (ACPI
+does that, for example). The 'suspend' option is available if Suspend-to-RAM
+is supported. Refer to Documentation/power/basic-pm-debugging.rst for the
+description of the 'test_resume' option.
+
+To select an option, write the string representing it to /sys/power/disk.
+
+/sys/power/image_size controls the size of hibernation images.
+
+It can be written a string representing a non-negative integer that will be
+used as a best-effort upper limit of the image size, in bytes. The hibernation
+core will do its best to ensure that the image size will not exceed that number.
+However, if that turns out to be impossible to achieve, a hibernation image will
+still be created and its size will be as small as possible. In particular,
+writing '0' to this file will enforce hibernation images to be as small as
+possible.
+
+Reading from this file returns the current image size limit, which is set to
+around 2/5 of available RAM by default.
+
+/sys/power/pm_trace controls the PM trace mechanism saving the last suspend
+or resume event point in the RTC across reboots.
+
+It helps to debug hard lockups or reboots due to device driver failures that
+occur during system suspend or resume (which is more common) more effectively.
+
+If /sys/power/pm_trace contains '1', the fingerprint of each suspend/resume
+event point in turn will be stored in the RTC memory (overwriting the actual
+RTC information), so it will survive a system crash if one occurs right after
+storing it and it can be used later to identify the driver that caused the crash
+to happen (see Documentation/power/s2ram.rst for more information).
+
+Initially it contains '0' which may be changed to '1' by writing a string
+representing a nonzero integer into it.
diff --git a/Documentation/power/interface.txt b/Documentation/power/interface.txt
deleted file mode 100644
index 27df7f98668a..000000000000
--- a/Documentation/power/interface.txt
+++ /dev/null
@@ -1,77 +0,0 @@
-Power Management Interface for System Sleep
-
-Copyright (c) 2016 Intel Corp., Rafael J. Wysocki
-
-The power management subsystem provides userspace with a unified sysfs interface
-for system sleep regardless of the underlying system architecture or platform.
-The interface is located in the /sys/power/ directory (assuming that sysfs is
-mounted at /sys).
-
-/sys/power/state is the system sleep state control file.
-
-Reading from it returns a list of supported sleep states, encoded as:
-
-'freeze' (Suspend-to-Idle)
-'standby' (Power-On Suspend)
-'mem' (Suspend-to-RAM)
-'disk' (Suspend-to-Disk)
-
-Suspend-to-Idle is always supported. Suspend-to-Disk is always supported
-too as long the kernel has been configured to support hibernation at all
-(ie. CONFIG_HIBERNATION is set in the kernel configuration file). Support
-for Suspend-to-RAM and Power-On Suspend depends on the capabilities of the
-platform.
-
-If one of the strings listed in /sys/power/state is written to it, the system
-will attempt to transition into the corresponding sleep state. Refer to
-Documentation/admin-guide/pm/sleep-states.rst for a description of each of
-those states.
-
-/sys/power/disk controls the operating mode of hibernation (Suspend-to-Disk).
-Specifically, it tells the kernel what to do after creating a hibernation image.
-
-Reading from it returns a list of supported options encoded as:
-
-'platform' (put the system into sleep using a platform-provided method)
-'shutdown' (shut the system down)
-'reboot' (reboot the system)
-'suspend' (trigger a Suspend-to-RAM transition)
-'test_resume' (resume-after-hibernation test mode)
-
-The currently selected option is printed in square brackets.
-
-The 'platform' option is only available if the platform provides a special
-mechanism to put the system to sleep after creating a hibernation image (ACPI
-does that, for example). The 'suspend' option is available if Suspend-to-RAM
-is supported. Refer to Documentation/power/basic-pm-debugging.txt for the
-description of the 'test_resume' option.
-
-To select an option, write the string representing it to /sys/power/disk.
-
-/sys/power/image_size controls the size of hibernation images.
-
-It can be written a string representing a non-negative integer that will be
-used as a best-effort upper limit of the image size, in bytes. The hibernation
-core will do its best to ensure that the image size will not exceed that number.
-However, if that turns out to be impossible to achieve, a hibernation image will
-still be created and its size will be as small as possible. In particular,
-writing '0' to this file will enforce hibernation images to be as small as
-possible.
-
-Reading from this file returns the current image size limit, which is set to
-around 2/5 of available RAM by default.
-
-/sys/power/pm_trace controls the PM trace mechanism saving the last suspend
-or resume event point in the RTC across reboots.
-
-It helps to debug hard lockups or reboots due to device driver failures that
-occur during system suspend or resume (which is more common) more effectively.
-
-If /sys/power/pm_trace contains '1', the fingerprint of each suspend/resume
-event point in turn will be stored in the RTC memory (overwriting the actual
-RTC information), so it will survive a system crash if one occurs right after
-storing it and it can be used later to identify the driver that caused the crash
-to happen (see Documentation/power/s2ram.txt for more information).
-
-Initially it contains '0' which may be changed to '1' by writing a string
-representing a nonzero integer into it.
diff --git a/Documentation/power/opp.rst b/Documentation/power/opp.rst
new file mode 100644
index 000000000000..b3cf1def9dee
--- /dev/null
+++ b/Documentation/power/opp.rst
@@ -0,0 +1,379 @@
+==========================================
+Operating Performance Points (OPP) Library
+==========================================
+
+(C) 2009-2010 Nishanth Menon , Texas Instruments Incorporated
+
+.. Contents
+
+  1. Introduction
+  2. Initial OPP List Registration
+  3. OPP Search Functions
+  4. OPP Availability Control Functions
+  5. OPP Data Retrieval Functions
+  6. Data Structures
+
+1. Introduction
+===============
+
+1.1 What is an Operating Performance Point (OPP)?
+-------------------------------------------------
+
+Complex SoCs of today consists of a multiple sub-modules working in conjunction.
+In an operational system executing varied use cases, not all modules in the SoC
+need to function at their highest performing frequency all the time. To
+facilitate this, sub-modules in a SoC are grouped into domains, allowing some
+domains to run at lower voltage and frequency while other domains run at
+voltage/frequency pairs that are higher.
+
+The set of discrete tuples consisting of frequency and voltage pairs that
+the device will support per domain are called Operating Performance Points or
+OPPs.
+
+As an example:
+
+Let us consider an MPU device which supports the following:
+{300MHz at minimum voltage of 1V}, {800MHz at minimum voltage of 1.2V},
+{1GHz at minimum voltage of 1.3V}
+
+We can represent these as three OPPs as the following {Hz, uV} tuples:
+
+- {300000000, 1000000}
+- {800000000, 1200000}
+- {1000000000, 1300000}
+
+1.2 Operating Performance Points Library
+----------------------------------------
+
+OPP library provides a set of helper functions to organize and query the OPP
+information. The library is located in drivers/base/power/opp.c and the header
+is located in include/linux/pm_opp.h. OPP library can be enabled by enabling
+CONFIG_PM_OPP from power management menuconfig menu. OPP library depends on
+CONFIG_PM as certain SoCs such as Texas Instrument's OMAP framework allows to
+optionally boot at a certain OPP without needing cpufreq.
+
+Typical usage of the OPP library is as follows::
+
+ (users)        -> registers a set of default OPPs              -> (library)
+ SoC framework  -> modifies on required cases certain OPPs      -> OPP layer
+                -> queries to search/retrieve information       ->
+
+OPP layer expects each domain to be represented by a unique device pointer. SoC
+framework registers a set of initial OPPs per device with the OPP layer. This
+list is expected to be an optimally small number typically around 5 per device.
+This initial list contains a set of OPPs that the framework expects to be safely
+enabled by default in the system.
+
+Note on OPP Availability
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+As the system proceeds to operate, SoC framework may choose to make certain
+OPPs available or not available on each device based on various external
+factors. Example usage: Thermal management or other exceptional situations where
+SoC framework might choose to disable a higher frequency OPP to safely continue
+operations until that OPP could be re-enabled if possible.
+
+OPP library facilitates this concept in it's implementation. The following
+operational functions operate only on available opps:
+opp_find_freq_{ceil, floor}, dev_pm_opp_get_voltage, dev_pm_opp_get_freq, dev_pm_opp_get_opp_count
+
+dev_pm_opp_find_freq_exact is meant to be used to find the opp pointer which can then
+be used for dev_pm_opp_enable/disable functions to make an opp available as required.
+
+WARNING: Users of OPP library should refresh their availability count using
+get_opp_count if dev_pm_opp_enable/disable functions are invoked for a device, the
+exact mechanism to trigger these or the notification mechanism to other
+dependent subsystems such as cpufreq are left to the discretion of the SoC
+specific framework which uses the OPP library. Similar care needs to be taken
+care to refresh the cpufreq table in cases of these operations.
+
+2. Initial OPP List Registration
+================================
+The SoC implementation calls dev_pm_opp_add function iteratively to add OPPs per
+device. It is expected that the SoC framework will register the OPP entries
+optimally- typical numbers range to be less than 5. The list generated by
+registering the OPPs is maintained by OPP library throughout the device
+operation. The SoC framework can subsequently control the availability of the
+OPPs dynamically using the dev_pm_opp_enable / disable functions.
+
+dev_pm_opp_add
+    Add a new OPP for a specific domain represented by the device pointer.
+    The OPP is defined using the frequency and voltage. Once added, the OPP
+    is assumed to be available and control of it's availability can be done
+    with the dev_pm_opp_enable/disable functions. OPP library internally stores
+    and manages this information in the opp struct. This function may be
+    used by SoC framework to define a optimal list as per the demands of
+    SoC usage environment.
+
+    WARNING:
+        Do not use this function in interrupt context.
+
+    Example::
+
+        soc_pm_init()
+        {
+            /* Do things */
+            r = dev_pm_opp_add(mpu_dev, 1000000, 900000);
+            if (!r) {
+                pr_err("%s: unable to register mpu opp(%d)\n", r);
+                goto no_cpufreq;
+            }
+            /* Do cpufreq things */
+        no_cpufreq:
+            /* Do remaining things */
+        }
+
+3. OPP Search Functions
+=======================
+High level framework such as cpufreq operates on frequencies. To map the
+frequency back to the corresponding OPP, OPP library provides handy functions
+to search the OPP list that OPP library internally manages. These search
+functions return the matching pointer representing the opp if a match is
+found, else returns error. These errors are expected to be handled by standard
+error checks such as IS_ERR() and appropriate actions taken by the caller.
+
+Callers of these functions shall call dev_pm_opp_put() after they have used the
+OPP. Otherwise the memory for the OPP will never get freed and result in
+memleak.
+
+dev_pm_opp_find_freq_exact
+    Search for an OPP based on an *exact* frequency and
+    availability. This function is especially useful to enable an OPP which
+    is not available by default.
+    Example: In a case when SoC framework detects a situation where a
+    higher frequency could be made available, it can use this function to
+    find the OPP prior to call the dev_pm_opp_enable to actually make
+    it available::
+
+        opp = dev_pm_opp_find_freq_exact(dev, 1000000000, false);
+        dev_pm_opp_put(opp);
+        /* dont operate on the pointer.. just do a sanity check.. */
+        if (IS_ERR(opp)) {
+            pr_err("frequency not disabled!\n");
+            /* trigger appropriate actions.. */
+        } else {
+            dev_pm_opp_enable(dev,1000000000);
+        }
+
+    NOTE:
+        This is the only search function that operates on OPPs which are
+        not available.
+
+dev_pm_opp_find_freq_floor
+    Search for an available OPP which is *at most* the
+    provided frequency. This function is useful while searching for a lesser
+    match OR operating on OPP information in the order of decreasing
+    frequency.
+    Example: To find the highest opp for a device::
+
+        freq = ULONG_MAX;
+        opp = dev_pm_opp_find_freq_floor(dev, &freq);
+        dev_pm_opp_put(opp);
+
+dev_pm_opp_find_freq_ceil
+    Search for an available OPP which is *at least* the
+    provided frequency. This function is useful while searching for a
+    higher match OR operating on OPP information in the order of increasing
+    frequency.
+    Example 1: To find the lowest opp for a device::
+
+        freq = 0;
+        opp = dev_pm_opp_find_freq_ceil(dev, &freq);
+        dev_pm_opp_put(opp);
+
+    Example 2: A simplified implementation of a SoC cpufreq_driver->target::
+
+        soc_cpufreq_target(..)
+        {
+            /* Do stuff like policy checks etc. */
+            /* Find the best frequency match for the req */
+            opp = dev_pm_opp_find_freq_ceil(dev, &freq);
+            dev_pm_opp_put(opp);
+            if (!IS_ERR(opp))
+                soc_switch_to_freq_voltage(freq);
+            else
+                /* do something when we can't satisfy the req */
+            /* do other stuff */
+        }
+
+4. OPP Availability Control Functions
+=====================================
+A default OPP list registered with the OPP library may not cater to all possible
+situation. The OPP library provides a set of functions to modify the
+availability of a OPP within the OPP list. This allows SoC frameworks to have
+fine grained dynamic control of which sets of OPPs are operationally available.
+These functions are intended to *temporarily* remove an OPP in conditions such
+as thermal considerations (e.g. don't use OPPx until the temperature drops).
+
+WARNING:
+    Do not use these functions in interrupt context.
+
+dev_pm_opp_enable
+    Make a OPP available for operation.
+    Example: Lets say that 1GHz OPP is to be made available only if the
+    SoC temperature is lower than a certain threshold. The SoC framework
+    implementation might choose to do something as follows::
+
+        if (cur_temp < temp_low_thresh) {
+            /* Enable 1GHz if it was disabled */
+            opp = dev_pm_opp_find_freq_exact(dev, 1000000000, false);
+            dev_pm_opp_put(opp);
+            /* just error check */
+            if (!IS_ERR(opp))
+                ret = dev_pm_opp_enable(dev, 1000000000);
+            else
+                goto try_something_else;
+        }
+
+dev_pm_opp_disable
+    Make an OPP to be not available for operation
+    Example: Lets say that 1GHz OPP is to be disabled if the temperature
+    exceeds a threshold value. The SoC framework implementation might
+    choose to do something as follows::
+
+        if (cur_temp > temp_high_thresh) {
+            /* Disable 1GHz if it was enabled */
+            opp = dev_pm_opp_find_freq_exact(dev, 1000000000, true);
+            dev_pm_opp_put(opp);
+            /* just error check */
+            if (!IS_ERR(opp))
+                ret = dev_pm_opp_disable(dev, 1000000000);
+            else
+                goto try_something_else;
+        }
+
+5. OPP Data Retrieval Functions
+===============================
+Since OPP library abstracts away the OPP information, a set of functions to pull
+information from the OPP structure is necessary. Once an OPP pointer is
+retrieved using the search functions, the following functions can be used by SoC
+framework to retrieve the information represented inside the OPP layer.
+
+dev_pm_opp_get_voltage
+    Retrieve the voltage represented by the opp pointer.
+    Example: At a cpufreq transition to a different frequency, SoC
+    framework requires to set the voltage represented by the OPP using
+    the regulator framework to the Power Management chip providing the
+    voltage::
+
+        soc_switch_to_freq_voltage(freq)
+        {
+            /* do things */
+            opp = dev_pm_opp_find_freq_ceil(dev, &freq);
+            v = dev_pm_opp_get_voltage(opp);
+            dev_pm_opp_put(opp);
+            if (v)
+                regulator_set_voltage(.., v);
+            /* do other things */
+        }
+
+dev_pm_opp_get_freq
+    Retrieve the freq represented by the opp pointer.
+    Example: Lets say the SoC framework uses a couple of helper functions
+    we could pass opp pointers instead of doing additional parameters to
+    handle quiet a bit of data parameters::
+
+        soc_cpufreq_target(..)
+        {
+            /* do things.. */
+            max_freq = ULONG_MAX;
+            max_opp = dev_pm_opp_find_freq_floor(dev,&max_freq);
+            requested_opp = dev_pm_opp_find_freq_ceil(dev,&freq);
+            if (!IS_ERR(max_opp) && !IS_ERR(requested_opp))
+                r = soc_test_validity(max_opp, requested_opp);
+            dev_pm_opp_put(max_opp);
+            dev_pm_opp_put(requested_opp);
+            /* do other things */
+        }
+        soc_test_validity(..)
+        {
+            if(dev_pm_opp_get_voltage(max_opp) < dev_pm_opp_get_voltage(requested_opp))
+                return -EINVAL;
+            if(dev_pm_opp_get_freq(max_opp) < dev_pm_opp_get_freq(requested_opp))
+                return -EINVAL;
+            /* do things.. */
+        }
+
+dev_pm_opp_get_opp_count
+    Retrieve the number of available opps for a device
+    Example: Lets say a co-processor in the SoC needs to know the available
+    frequencies in a table, the main processor can notify as following::
+
+        soc_notify_coproc_available_frequencies()
+        {
+            /* Do things */
+            num_available = dev_pm_opp_get_opp_count(dev);
+            speeds = kzalloc(sizeof(u32) * num_available, GFP_KERNEL);
+            /* populate the table in increasing order */
+            freq = 0;
+            while (!IS_ERR(opp = dev_pm_opp_find_freq_ceil(dev, &freq))) {
+                speeds[i] = freq;
+                freq++;
+                i++;
+                dev_pm_opp_put(opp);
+            }
+
+            soc_notify_coproc(AVAILABLE_FREQs, speeds, num_available);
+            /* Do other things */
+        }
+
+6. Data Structures
+==================
+Typically an SoC contains multiple voltage domains which are variable. Each
+domain is represented by a device pointer. The relationship to OPP can be
+represented as follows::
+
+  SoC
+   |- device 1
+   |    |- opp 1 (availability, freq, voltage)
+   |    |- opp 2 ..
+   ...  ...
+   |    `- opp n ..
+   |- device 2
+   ...
+   `- device m
+
+OPP library maintains a internal list that the SoC framework populates and
+accessed by various functions as described above. However, the structures
+representing the actual OPPs and domains are internal to the OPP library itself
+to allow for suitable abstraction reusable across systems.
+
+struct dev_pm_opp
+    The internal data structure of OPP library which is used to
+    represent an OPP. In addition to the freq, voltage, availability
+    information, it also contains internal book keeping information required
+    for the OPP library to operate on. Pointer to this structure is
+    provided back to the users such as SoC framework to be used as a
+    identifier for OPP in the interactions with OPP layer.
+
+    WARNING:
+        The struct dev_pm_opp pointer should not be parsed or modified by the
+        users. The defaults of for an instance is populated by
+        dev_pm_opp_add, but the availability of the OPP can be modified
+        by dev_pm_opp_enable/disable functions.
+
+struct device
+    This is used to identify a domain to the OPP layer. The
+    nature of the device and it's implementation is left to the user of
+    OPP library such as the SoC framework.
+
+Overall, in a simplistic view, the data structure operations is represented as
+following::
+
+  Initialization / modification:
+                     +-----+        /- dev_pm_opp_enable
+  dev_pm_opp_add --> | opp | <-------
+    |                +-----+        \- dev_pm_opp_disable
+    \-------> domain_info(device)
+
+  Search functions:
+               /-- dev_pm_opp_find_freq_ceil  ---\   +-----+
+  domain_info<---- dev_pm_opp_find_freq_exact -----> | opp |
+               \-- dev_pm_opp_find_freq_floor ---/   +-----+
+
+  Retrieval functions:
+  +-----+     /- dev_pm_opp_get_voltage
+  | opp | <---
+  +-----+     \- dev_pm_opp_get_freq
+
+  domain_info <- dev_pm_opp_get_opp_count
diff --git a/Documentation/power/opp.txt b/Documentation/power/opp.txt
deleted file mode 100644
index 0c007e250cd1..000000000000
--- a/Documentation/power/opp.txt
+++ /dev/null
@@ -1,342 +0,0 @@
-Operating Performance Points (OPP) Library
-==========================================
-
-(C) 2009-2010 Nishanth Menon , Texas Instruments Incorporated
-
-Contents
---------
-1. Introduction
-2.
Initial OPP List Registration
-3. OPP Search Functions
-4. OPP Availability Control Functions
-5. OPP Data Retrieval Functions
-6. Data Structures
-
-1. Introduction
-===============
-1.1 What is an Operating Performance Point (OPP)?
-
-Complex SoCs of today consists of a multiple sub-modules working in conjunction.
-In an operational system executing varied use cases, not all modules in the SoC
-need to function at their highest performing frequency all the time. To
-facilitate this, sub-modules in a SoC are grouped into domains, allowing some
-domains to run at lower voltage and frequency while other domains run at
-voltage/frequency pairs that are higher.
-
-The set of discrete tuples consisting of frequency and voltage pairs that
-the device will support per domain are called Operating Performance Points or
-OPPs.
-
-As an example:
-Let us consider an MPU device which supports the following:
-{300MHz at minimum voltage of 1V}, {800MHz at minimum voltage of 1.2V},
-{1GHz at minimum voltage of 1.3V}
-
-We can represent these as three OPPs as the following {Hz, uV} tuples:
-{300000000, 1000000}
-{800000000, 1200000}
-{1000000000, 1300000}
-
-1.2 Operating Performance Points Library
-
-OPP library provides a set of helper functions to organize and query the OPP
-information. The library is located in drivers/base/power/opp.c and the header
-is located in include/linux/pm_opp.h. OPP library can be enabled by enabling
-CONFIG_PM_OPP from power management menuconfig menu. OPP library depends on
-CONFIG_PM as certain SoCs such as Texas Instrument's OMAP framework allows to
-optionally boot at a certain OPP without needing cpufreq.
-
-Typical usage of the OPP library is as follows:
-(users)        -> registers a set of default OPPs              -> (library)
-SoC framework  -> modifies on required cases certain OPPs      -> OPP layer
-               -> queries to search/retrieve information       ->
-
-OPP layer expects each domain to be represented by a unique device pointer. SoC
-framework registers a set of initial OPPs per device with the OPP layer. This
-list is expected to be an optimally small number typically around 5 per device.
-This initial list contains a set of OPPs that the framework expects to be safely
-enabled by default in the system.
-
-Note on OPP Availability:
-------------------------
-As the system proceeds to operate, SoC framework may choose to make certain
-OPPs available or not available on each device based on various external
-factors. Example usage: Thermal management or other exceptional situations where
-SoC framework might choose to disable a higher frequency OPP to safely continue
-operations until that OPP could be re-enabled if possible.
-
-OPP library facilitates this concept in it's implementation. The following
-operational functions operate only on available opps:
-opp_find_freq_{ceil, floor}, dev_pm_opp_get_voltage, dev_pm_opp_get_freq, dev_pm_opp_get_opp_count
-
-dev_pm_opp_find_freq_exact is meant to be used to find the opp pointer which can then
-be used for dev_pm_opp_enable/disable functions to make an opp available as required.
-
-WARNING: Users of OPP library should refresh their availability count using
-get_opp_count if dev_pm_opp_enable/disable functions are invoked for a device, the
-exact mechanism to trigger these or the notification mechanism to other
-dependent subsystems such as cpufreq are left to the discretion of the SoC
-specific framework which uses the OPP library. Similar care needs to be taken
-care to refresh the cpufreq table in cases of these operations.
-
-2. Initial OPP List Registration
-================================
-The SoC implementation calls dev_pm_opp_add function iteratively to add OPPs per
-device. It is expected that the SoC framework will register the OPP entries
-optimally- typical numbers range to be less than 5. The list generated by
-registering the OPPs is maintained by OPP library throughout the device
-operation. The SoC framework can subsequently control the availability of the
-OPPs dynamically using the dev_pm_opp_enable / disable functions.
-
-dev_pm_opp_add - Add a new OPP for a specific domain represented by the device pointer.
-    The OPP is defined using the frequency and voltage. Once added, the OPP
-    is assumed to be available and control of it's availability can be done
-    with the dev_pm_opp_enable/disable functions. OPP library internally stores
-    and manages this information in the opp struct. This function may be
-    used by SoC framework to define a optimal list as per the demands of
-    SoC usage environment.
-
-    WARNING: Do not use this function in interrupt context.
-
-    Example:
-     soc_pm_init()
-     {
-        /* Do things */
-        r = dev_pm_opp_add(mpu_dev, 1000000, 900000);
-        if (!r) {
-            pr_err("%s: unable to register mpu opp(%d)\n", r);
-            goto no_cpufreq;
-        }
-        /* Do cpufreq things */
-     no_cpufreq:
-        /* Do remaining things */
-     }
-
-3. OPP Search Functions
-=======================
-High level framework such as cpufreq operates on frequencies. To map the
-frequency back to the corresponding OPP, OPP library provides handy functions
-to search the OPP list that OPP library internally manages. These search
-functions return the matching pointer representing the opp if a match is
-found, else returns error. These errors are expected to be handled by standard
-error checks such as IS_ERR() and appropriate actions taken by the caller.
-
-Callers of these functions shall call dev_pm_opp_put() after they have used the
-OPP. Otherwise the memory for the OPP will never get freed and result in
-memleak.
-
-dev_pm_opp_find_freq_exact - Search for an OPP based on an *exact* frequency and
-    availability. This function is especially useful to enable an OPP which
-    is not available by default.
- Example: In a case when SoC framework detects a situation where a - higher frequency could be made available, it can use this function to - find the OPP prior to call the dev_pm_opp_enable to actually make it available. - opp = dev_pm_opp_find_freq_exact(dev, 1000000000, false); - dev_pm_opp_put(opp); - /* dont operate on the pointer.. just do a sanity check.. */ - if (IS_ERR(opp)) { - pr_err("frequency not disabled!\n"); - /* trigger appropriate actions.. */ - } else { - dev_pm_opp_enable(dev,1000000000); - } - - NOTE: This is the only search function that operates on OPPs which are - not available. - -dev_pm_opp_find_freq_floor - Search for an available OPP which is *at most* the - provided frequency. This function is useful while searching for a lesser - match OR operating on OPP information in the order of decreasing - frequency. - Example: To find the highest opp for a device: - freq = ULONG_MAX; - opp = dev_pm_opp_find_freq_floor(dev, &freq); - dev_pm_opp_put(opp); - -dev_pm_opp_find_freq_ceil - Search for an available OPP which is *at least* the - provided frequency. This function is useful while searching for a - higher match OR operating on OPP information in the order of increasing - frequency. - Example 1: To find the lowest opp for a device: - freq = 0; - opp = dev_pm_opp_find_freq_ceil(dev, &freq); - dev_pm_opp_put(opp); - Example 2: A simplified implementation of a SoC cpufreq_driver->target: - soc_cpufreq_target(..) - { - /* Do stuff like policy checks etc. */ - /* Find the best frequency match for the req */ - opp = dev_pm_opp_find_freq_ceil(dev, &freq); - dev_pm_opp_put(opp); - if (!IS_ERR(opp)) - soc_switch_to_freq_voltage(freq); - else - /* do something when we can't satisfy the req */ - /* do other stuff */ - } - -4. OPP Availability Control Functions -===================================== -A default OPP list registered with the OPP library may not cater to all possible -situation. 
The OPP library provides a set of functions to modify the -availability of a OPP within the OPP list. This allows SoC frameworks to have -fine grained dynamic control of which sets of OPPs are operationally available. -These functions are intended to *temporarily* remove an OPP in conditions such -as thermal considerations (e.g. don't use OPPx until the temperature drops). - -WARNING: Do not use these functions in interrupt context. - -dev_pm_opp_enable - Make a OPP available for operation. - Example: Lets say that 1GHz OPP is to be made available only if the - SoC temperature is lower than a certain threshold. The SoC framework - implementation might choose to do something as follows: - if (cur_temp < temp_low_thresh) { - /* Enable 1GHz if it was disabled */ - opp = dev_pm_opp_find_freq_exact(dev, 1000000000, false); - dev_pm_opp_put(opp); - /* just error check */ - if (!IS_ERR(opp)) - ret = dev_pm_opp_enable(dev, 1000000000); - else - goto try_something_else; - } - -dev_pm_opp_disable - Make an OPP to be not available for operation - Example: Lets say that 1GHz OPP is to be disabled if the temperature - exceeds a threshold value. The SoC framework implementation might - choose to do something as follows: - if (cur_temp > temp_high_thresh) { - /* Disable 1GHz if it was enabled */ - opp = dev_pm_opp_find_freq_exact(dev, 1000000000, true); - dev_pm_opp_put(opp); - /* just error check */ - if (!IS_ERR(opp)) - ret = dev_pm_opp_disable(dev, 1000000000); - else - goto try_something_else; - } - -5. OPP Data Retrieval Functions -=============================== -Since OPP library abstracts away the OPP information, a set of functions to pull -information from the OPP structure is necessary. Once an OPP pointer is -retrieved using the search functions, the following functions can be used by SoC -framework to retrieve the information represented inside the OPP layer. - -dev_pm_opp_get_voltage - Retrieve the voltage represented by the opp pointer. 
- Example: At a cpufreq transition to a different frequency, SoC - framework requires to set the voltage represented by the OPP using - the regulator framework to the Power Management chip providing the - voltage. - soc_switch_to_freq_voltage(freq) - { - /* do things */ - opp = dev_pm_opp_find_freq_ceil(dev, &freq); - v = dev_pm_opp_get_voltage(opp); - dev_pm_opp_put(opp); - if (v) - regulator_set_voltage(.., v); - /* do other things */ - } - -dev_pm_opp_get_freq - Retrieve the freq represented by the opp pointer. - Example: Lets say the SoC framework uses a couple of helper functions - we could pass opp pointers instead of doing additional parameters to - handle quiet a bit of data parameters. - soc_cpufreq_target(..) - { - /* do things.. */ - max_freq = ULONG_MAX; - max_opp = dev_pm_opp_find_freq_floor(dev,&max_freq); - requested_opp = dev_pm_opp_find_freq_ceil(dev,&freq); - if (!IS_ERR(max_opp) && !IS_ERR(requested_opp)) - r = soc_test_validity(max_opp, requested_opp); - dev_pm_opp_put(max_opp); - dev_pm_opp_put(requested_opp); - /* do other things */ - } - soc_test_validity(..) - { - if(dev_pm_opp_get_voltage(max_opp) < dev_pm_opp_get_voltage(requested_opp)) - return -EINVAL; - if(dev_pm_opp_get_freq(max_opp) < dev_pm_opp_get_freq(requested_opp)) - return -EINVAL; - /* do things.. */ - } - -dev_pm_opp_get_opp_count - Retrieve the number of available opps for a device - Example: Lets say a co-processor in the SoC needs to know the available - frequencies in a table, the main processor can notify as following: - soc_notify_coproc_available_frequencies() - { - /* Do things */ - num_available = dev_pm_opp_get_opp_count(dev); - speeds = kzalloc(sizeof(u32) * num_available, GFP_KERNEL); - /* populate the table in increasing order */ - freq = 0; - while (!IS_ERR(opp = dev_pm_opp_find_freq_ceil(dev, &freq))) { - speeds[i] = freq; - freq++; - i++; - dev_pm_opp_put(opp); - } - - soc_notify_coproc(AVAILABLE_FREQs, speeds, num_available); - /* Do other things */ - } - -6. 
Data Structures -================== -Typically an SoC contains multiple voltage domains which are variable. Each -domain is represented by a device pointer. The relationship to OPP can be -represented as follows: -SoC - |- device 1 - | |- opp 1 (availability, freq, voltage) - | |- opp 2 .. - ... ... - | `- opp n .. - |- device 2 - ... - `- device m - -OPP library maintains a internal list that the SoC framework populates and -accessed by various functions as described above. However, the structures -representing the actual OPPs and domains are internal to the OPP library itself -to allow for suitable abstraction reusable across systems. - -struct dev_pm_opp - The internal data structure of OPP library which is used to - represent an OPP. In addition to the freq, voltage, availability - information, it also contains internal book keeping information required - for the OPP library to operate on. Pointer to this structure is - provided back to the users such as SoC framework to be used as a - identifier for OPP in the interactions with OPP layer. - - WARNING: The struct dev_pm_opp pointer should not be parsed or modified by the - users. The defaults of for an instance is populated by dev_pm_opp_add, but the - availability of the OPP can be modified by dev_pm_opp_enable/disable functions. - -struct device - This is used to identify a domain to the OPP layer. The - nature of the device and it's implementation is left to the user of - OPP library such as the SoC framework. 
- -Overall, in a simplistic view, the data structure operations is represented as -following: - -Initialization / modification: - +-----+ /- dev_pm_opp_enable -dev_pm_opp_add --> | opp | <------- - | +-----+ \- dev_pm_opp_disable - \-------> domain_info(device) - -Search functions: - /-- dev_pm_opp_find_freq_ceil ---\ +-----+ -domain_info<---- dev_pm_opp_find_freq_exact -----> | opp | - \-- dev_pm_opp_find_freq_floor ---/ +-----+ - -Retrieval functions: -+-----+ /- dev_pm_opp_get_voltage -| opp | <--- -+-----+ \- dev_pm_opp_get_freq - -domain_info <- dev_pm_opp_get_opp_count diff --git a/Documentation/power/pci.rst b/Documentation/power/pci.rst new file mode 100644 index 000000000000..0e2ef7429304 --- /dev/null +++ b/Documentation/power/pci.rst @@ -0,0 +1,1135 @@ +==================== +PCI Power Management +==================== + +Copyright (c) 2010 Rafael J. Wysocki , Novell Inc. + +An overview of concepts and the Linux kernel's interfaces related to PCI power +management. Based on previous work by Patrick Mochel +(and others). + +This document only covers the aspects of power management specific to PCI +devices. For general description of the kernel's interfaces related to device +power management refer to Documentation/driver-api/pm/devices.rst and +Documentation/power/runtime_pm.rst. + +.. contents: + + 1. Hardware and Platform Support for PCI Power Management + 2. PCI Subsystem and Device Power Management + 3. PCI Device Drivers and Power Management + 4. Resources + + +1. Hardware and Platform Support for PCI Power Management +========================================================= + +1.1. Native and Platform-Based Power Management +----------------------------------------------- + +In general, power management is a feature allowing one to save energy by putting +devices into states in which they draw less power (low-power states) at the +price of reduced functionality or performance. 
+ +Usually, a device is put into a low-power state when it is underutilized or +completely inactive. However, when it is necessary to use the device once +again, it has to be put back into the "fully functional" state (full-power +state). This may happen when there are some data for the device to handle or +as a result of an external event requiring the device to be active, which may +be signaled by the device itself. + +PCI devices may be put into low-power states in two ways, by using the device +capabilities introduced by the PCI Bus Power Management Interface Specification, +or with the help of platform firmware, such as an ACPI BIOS. In the first +approach, that is referred to as the native PCI power management (native PCI PM) +in what follows, the device power state is changed as a result of writing a +specific value into one of its standard configuration registers. The second +approach requires the platform firmware to provide special methods that may be +used by the kernel to change the device's power state. + +Devices supporting the native PCI PM usually can generate wakeup signals called +Power Management Events (PMEs) to let the kernel know about external events +requiring the device to be active. After receiving a PME the kernel is supposed +to put the device that sent it into the full-power state. However, the PCI Bus +Power Management Interface Specification doesn't define any standard method of +delivering the PME from the device to the CPU and the operating system kernel. +It is assumed that the platform firmware will perform this task and therefore, +even though a PCI device is set up to generate PMEs, it also may be necessary to +prepare the platform firmware for notifying the CPU of the PMEs coming from the +device (e.g. by generating interrupts). 
+ +In turn, if the methods provided by the platform firmware are used for changing +the power state of a device, usually the platform also provides a method for +preparing the device to generate wakeup signals. In that case, however, it +often also is necessary to prepare the device for generating PMEs using the +native PCI PM mechanism, because the method provided by the platform depends on +that. + +Thus in many situations both the native and the platform-based power management +mechanisms have to be used simultaneously to obtain the desired result. + +1.2. Native PCI Power Management +-------------------------------- + +The PCI Bus Power Management Interface Specification (PCI PM Spec) was +introduced between the PCI 2.1 and PCI 2.2 Specifications. It defined a +standard interface for performing various operations related to power +management. + +The implementation of the PCI PM Spec is optional for conventional PCI devices, +but it is mandatory for PCI Express devices. If a device supports the PCI PM +Spec, it has an 8 byte power management capability field in its PCI +configuration space. This field is used to describe and control the standard +features related to the native PCI power management. + +The PCI PM Spec defines 4 operating states for devices (D0-D3) and for buses +(B0-B3). The higher the number, the less power is drawn by the device or bus +in that state. However, the higher the number, the longer the latency for +the device or bus to return to the full-power state (D0 or B0, respectively). + +There are two variants of the D3 state defined by the specification. The first +one is D3hot, referred to as the software accessible D3, because devices can be +programmed to go into it. The second one, D3cold, is the state that PCI devices +are in when the supply voltage (Vcc) is removed from them. 
It is not possible +to program a PCI device to go into D3cold, although there may be a programmable +interface for putting the bus the device is on into a state in which Vcc is +removed from all devices on the bus. + +PCI bus power management, however, is not supported by the Linux kernel at the +time of this writing and therefore it is not covered by this document. + +Note that every PCI device can be in the full-power state (D0) or in D3cold, +regardless of whether or not it implements the PCI PM Spec. In addition to +that, if the PCI PM Spec is implemented by the device, it must support D3hot +as well as D0. The support for the D1 and D2 power states is optional. + +PCI devices supporting the PCI PM Spec can be programmed to go to any of the +supported low-power states (except for D3cold). While in D1-D3hot the +standard configuration registers of the device must be accessible to software +(i.e. the device is required to respond to PCI configuration accesses), although +its I/O and memory spaces are then disabled. This allows the device to be +programmatically put into D0. Thus the kernel can switch the device back and +forth between D0 and the supported low-power states (except for D3cold) and the +possible power state transitions the device can undergo are the following: + ++----------------------------+ +| Current State | New State | ++----------------------------+ +| D0 | D1, D2, D3 | ++----------------------------+ +| D1 | D2, D3 | ++----------------------------+ +| D2 | D3 | ++----------------------------+ +| D1, D2, D3 | D0 | ++----------------------------+ + +The transition from D3cold to D0 occurs when the supply voltage is provided to +the device (i.e. power is restored). In that case the device returns to D0 with +a full power-on reset sequence and the power-on defaults are restored to the +device by hardware just as at initial power up. 
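The software-initiated transitions listed in the table above can be sketched as a small predicate. This is an illustrative user-space model only: the enum and function names are hypothetical and are not the kernel's pci_power_t API, D3 here means D3hot, and the optional nature of D1 and D2 support is deliberately not modelled.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; not the kernel's pci_power_t. */
enum pm_state { PM_D0 = 0, PM_D1, PM_D2, PM_D3HOT };

/*
 * Returns true if software may program the device from 'from' to 'to'
 * according to the transition table: a device may be moved to any deeper
 * state, or from any low-power state straight back to D0.
 */
static bool pm_transition_allowed(enum pm_state from, enum pm_state to)
{
	if (from == to)
		return false;	/* staying put is not a transition */
	if (to == PM_D0)
		return true;	/* D1, D2, D3hot -> D0 */
	return to > from;	/* D0 -> D1/D2/D3, D1 -> D2/D3, D2 -> D3 */
}
```

With this model, going from D2 to D1 or from D3hot to D2 is rejected, because the only way out of a deeper state is a return to D0. The D3cold to D0 transition is not covered, since it is caused by restoring power rather than by programming the device.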
+ +PCI devices supporting the PCI PM Spec can be programmed to generate PMEs +while in a low-power state (D1-D3), but they are not required to be capable +of generating PMEs from all supported low-power states. In particular, the +capability of generating PMEs from D3cold is optional and depends on the +presence of additional voltage (3.3Vaux) allowing the device to remain +sufficiently active to generate a wakeup signal. + +1.3. ACPI Device Power Management +--------------------------------- + +The platform firmware support for the power management of PCI devices is +system-specific. However, if the system in question is compliant with the +Advanced Configuration and Power Interface (ACPI) Specification, like the +majority of x86-based systems, it is supposed to implement device power +management interfaces defined by the ACPI standard. + +For this purpose the ACPI BIOS provides special functions called "control +methods" that may be executed by the kernel to perform specific tasks, such as +putting a device into a low-power state. These control methods are encoded +using special byte-code language called the ACPI Machine Language (AML) and +stored in the machine's BIOS. The kernel loads them from the BIOS and executes +them as needed using an AML interpreter that translates the AML byte code into +computations and memory or I/O space accesses. This way, in theory, a BIOS +writer can provide the kernel with a means to perform actions depending +on the system design in a system-specific fashion. + +ACPI control methods may be divided into global control methods, that are not +associated with any particular devices, and device control methods, that have +to be defined separately for each device supposed to be handled with the help of +the platform. This means, in particular, that ACPI device control methods can +only be used to handle devices that the BIOS writer knew about in advance. The +ACPI methods used for device power management fall into that category. 
+ +The ACPI specification assumes that devices can be in one of four power states +labeled as D0, D1, D2, and D3 that roughly correspond to the native PCI PM +D0-D3 states (although the difference between D3hot and D3cold is not taken +into account by ACPI). Moreover, for each power state of a device there is a +set of power resources that have to be enabled for the device to be put into +that state. These power resources are controlled (i.e. enabled or disabled) +with the help of their own control methods, _ON and _OFF, that have to be +defined individually for each of them. + +To put a device into the ACPI power state Dx (where x is a number between 0 and +3 inclusive) the kernel is supposed to (1) enable the power resources required +by the device in this state using their _ON control methods and (2) execute the +_PSx control method defined for the device. In addition to that, if the device +is going to be put into a low-power state (D1-D3) and is supposed to generate +wakeup signals from that state, the _DSW (or _PSW, replaced with _DSW by ACPI +3.0) control method defined for it has to be executed before _PSx. Power +resources that are not required by the device in the target power state and are +not required any more by any other device should be disabled (by executing their +_OFF control methods). If the current power state of the device is D3, it can +only be put into D0 this way. + +However, quite often the power states of devices are changed during a +system-wide transition into a sleep state or back into the working state. ACPI +defines four system sleep states, S1, S2, S3, and S4, and denotes the system +working state as S0. In general, the target system sleep (or working) state +determines the highest power (lowest number) state the device can be put +into and the kernel is supposed to obtain this information by executing the +device's _SxD control method (where x is a number between 0 and 4 inclusive). 
+If the device is required to wake up the system from the target sleep state, the +lowest power (highest number) state it can be put into is also determined by the +target state of the system. The kernel is then supposed to use the device's +_SxW control method to obtain the number of that state. It also is supposed to +use the device's _PRW control method to learn which power resources need to be +enabled for the device to be able to generate wakeup signals. + +1.4. Wakeup Signaling +--------------------- + +Wakeup signals generated by PCI devices, either as native PCI PMEs, or as +a result of the execution of the _DSW (or _PSW) ACPI control method before +putting the device into a low-power state, have to be caught and handled as +appropriate. If they are sent while the system is in the working state +(ACPI S0), they should be translated into interrupts so that the kernel can +put the devices generating them into the full-power state and take care of the +events that triggered them. In turn, if they are sent while the system is +sleeping, they should cause the system's core logic to trigger wakeup. + +On ACPI-based systems wakeup signals sent by conventional PCI devices are +converted into ACPI General-Purpose Events (GPEs) which are hardware signals +from the system core logic generated in response to various events that need to +be acted upon. Every GPE is associated with one or more sources of potentially +interesting events. In particular, a GPE may be associated with a PCI device +capable of signaling wakeup. The information on the connections between GPEs +and event sources is recorded in the system's ACPI BIOS from where it can be +read by the kernel. + +If a PCI device known to the system's ACPI BIOS signals wakeup, the GPE +associated with it (if there is one) is triggered. 
The GPEs associated with PCI +bridges may also be triggered in response to a wakeup signal from one of the +devices below the bridge (this also is the case for root bridges) and, for +example, native PCI PMEs from devices unknown to the system's ACPI BIOS may be +handled this way. + +A GPE may be triggered when the system is sleeping (i.e. when it is in one of +the ACPI S1-S4 states), in which case system wakeup is started by its core logic +(the device that was the source of the signal causing the system wakeup to occur +may be identified later). The GPEs used in such situations are referred to as +wakeup GPEs. + +Usually, however, GPEs are also triggered when the system is in the working +state (ACPI S0) and in that case the system's core logic generates a System +Control Interrupt (SCI) to notify the kernel of the event. Then, the SCI +handler identifies the GPE that caused the interrupt to be generated which, +in turn, allows the kernel to identify the source of the event (that may be +a PCI device signaling wakeup). The GPEs used for notifying the kernel of +events occurring while the system is in the working state are referred to as +runtime GPEs. + +Unfortunately, there is no standard way of handling wakeup signals sent by +conventional PCI devices on systems that are not ACPI-based, but there is one +for PCI Express devices. Namely, the PCI Express Base Specification introduced +a native mechanism for converting native PCI PMEs into interrupts generated by +root ports. For conventional PCI devices native PMEs are out-of-band, so they +are routed separately and they need not pass through bridges (in principle they +may be routed directly to the system's core logic), but for PCI Express devices +they are in-band messages that have to pass through the PCI Express hierarchy, +including the root port on the path from the device to the Root Complex. 
Thus +it was possible to introduce a mechanism by which a root port generates an +interrupt whenever it receives a PME message from one of the devices below it. +The PCI Express Requester ID of the device that sent the PME message is then +recorded in one of the root port's configuration registers from where it may be +read by the interrupt handler allowing the device to be identified. [PME +messages sent by PCI Express endpoints integrated with the Root Complex don't +pass through root ports, but instead they cause a Root Complex Event Collector +(if there is one) to generate interrupts.] + +In principle the native PCI Express PME signaling may also be used on ACPI-based +systems along with the GPEs, but to use it the kernel has to ask the system's +ACPI BIOS to release control of root port configuration registers. The ACPI +BIOS, however, is not required to allow the kernel to control these registers +and if it doesn't do that, the kernel must not modify their contents. Of course +the native PCI Express PME signaling cannot be used by the kernel in that case. + + +2. PCI Subsystem and Device Power Management +============================================ + +2.1. Device Power Management Callbacks +-------------------------------------- + +The PCI Subsystem participates in the power management of PCI devices in a +number of ways. First of all, it provides an intermediate code layer between +the device power management core (PM core) and PCI device drivers. 
+Specifically, the pm field of the PCI subsystem's struct bus_type object, +pci_bus_type, points to a struct dev_pm_ops object, pci_dev_pm_ops, containing +pointers to several device power management callbacks:: + + const struct dev_pm_ops pci_dev_pm_ops = { + .prepare = pci_pm_prepare, + .complete = pci_pm_complete, + .suspend = pci_pm_suspend, + .resume = pci_pm_resume, + .freeze = pci_pm_freeze, + .thaw = pci_pm_thaw, + .poweroff = pci_pm_poweroff, + .restore = pci_pm_restore, + .suspend_noirq = pci_pm_suspend_noirq, + .resume_noirq = pci_pm_resume_noirq, + .freeze_noirq = pci_pm_freeze_noirq, + .thaw_noirq = pci_pm_thaw_noirq, + .poweroff_noirq = pci_pm_poweroff_noirq, + .restore_noirq = pci_pm_restore_noirq, + .runtime_suspend = pci_pm_runtime_suspend, + .runtime_resume = pci_pm_runtime_resume, + .runtime_idle = pci_pm_runtime_idle, + }; + +These callbacks are executed by the PM core in various situations related to +device power management and they, in turn, execute power management callbacks +provided by PCI device drivers. They also perform power management operations +involving some standard configuration registers of PCI devices that device +drivers need not know or care about. + +The structure representing a PCI device, struct pci_dev, contains several fields +that these callbacks operate on:: + + struct pci_dev { + ... + pci_power_t current_state; /* Current operating state. */ + int pm_cap; /* PM capability offset in the + configuration space */ + unsigned int pme_support:5; /* Bitmask of states from which PME# + can be generated */ + unsigned int pme_interrupt:1;/* Is native PCIe PME signaling used? */ + unsigned int d1_support:1; /* Low power state D1 is supported */ + unsigned int d2_support:1; /* Low power state D2 is supported */ + unsigned int no_d1d2:1; /* D1 and D2 are forbidden */ + unsigned int wakeup_prepared:1; /* Device prepared for wake up */ + unsigned int d3_delay; /* D3->D0 transition time in ms */ + ... 
+ }; + +They also indirectly use some fields of the struct device that is embedded in +struct pci_dev. + +2.2. Device Initialization +-------------------------- + +The PCI subsystem's first task related to device power management is to +prepare the device for power management and initialize the fields of struct +pci_dev used for this purpose. This happens in two functions defined in +drivers/pci/pci.c, pci_pm_init() and platform_pci_wakeup_init(). + +The first of these functions checks if the device supports native PCI PM +and if that's the case the offset of its power management capability structure +in the configuration space is stored in the pm_cap field of the device's struct +pci_dev object. Next, the function checks which PCI low-power states are +supported by the device and from which low-power states the device can generate +native PCI PMEs. The power management fields of the device's struct pci_dev and +the struct device embedded in it are updated accordingly and the generation of +PMEs by the device is disabled. + +The second function checks if the device can be prepared to signal wakeup with +the help of the platform firmware, such as the ACPI BIOS. If that is the case, +the function updates the wakeup fields in struct device embedded in the +device's struct pci_dev and uses the firmware-provided method to prevent the +device from signaling wakeup. + +At this point the device is ready for power management. For driverless devices, +however, this functionality is limited to a few basic operations carried out +during system-wide transitions to a sleep state and back to the working state. + +2.3. Runtime Device Power Management +------------------------------------ + +The PCI subsystem plays a vital role in the runtime power management of PCI +devices. For this purpose it uses the general runtime power management +(runtime PM) framework described in Documentation/power/runtime_pm.rst. 
+Namely, it provides subsystem-level callbacks:: + + pci_pm_runtime_suspend() + pci_pm_runtime_resume() + pci_pm_runtime_idle() + +that are executed by the core runtime PM routines. It also implements the +entire mechanics necessary for handling runtime wakeup signals from PCI devices +in low-power states, which at the time of this writing works for both the native +PCI Express PME signaling and the ACPI GPE-based wakeup signaling described in +Section 1. + +First, a PCI device is put into a low-power state, or suspended, with the help +of pm_schedule_suspend() or pm_runtime_suspend() which for PCI devices call +pci_pm_runtime_suspend() to do the actual job. For this to work, the device's +driver has to provide a pm->runtime_suspend() callback (see below), which is +run by pci_pm_runtime_suspend() as the first action. If the driver's callback +returns successfully, the device's standard configuration registers are saved, +the device is prepared to generate wakeup signals and, finally, it is put into +the target low-power state. + +The low-power state to put the device into is the lowest-power (highest number) +state from which it can signal wakeup. The exact method of signaling wakeup is +system-dependent and is determined by the PCI subsystem on the basis of the +reported capabilities of the device and the platform firmware. To prepare the +device for signaling wakeup and put it into the selected low-power state, the +PCI subsystem can use the platform firmware as well as the device's native PCI +PM capabilities, if supported. + +It is expected that the device driver's pm->runtime_suspend() callback will +not attempt to prepare the device for signaling wakeup or to put it into a +low-power state. The driver ought to leave these tasks to the PCI subsystem +that has all of the information necessary to perform them. 
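The division of labour described above, where the driver callback runs first and the subsystem then does the PCI-specific work, can be sketched with a toy user-space model. Every name below is hypothetical and stands in for kernel objects; the error code returned for a missing callback is an assumption made for the sketch, not something the text specifies.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a PCI device; all fields are illustrative only. */
struct toy_dev {
	int (*drv_runtime_suspend)(struct toy_dev *dev); /* driver callback */
	bool config_saved;	/* standard config registers saved */
	bool wakeup_armed;	/* device prepared to signal wakeup */
	int state;		/* 0 = D0 ... 3 = D3hot */
	int wake_state;		/* deepest state that can still signal wakeup */
};

/* Example driver callbacks: they only quiesce the device, nothing more. */
static int toy_drv_quiesce(struct toy_dev *dev) { (void)dev; return 0; }
static int toy_drv_busy(struct toy_dev *dev) { (void)dev; return -EBUSY; }

/* Models the order of operations attributed to pci_pm_runtime_suspend(). */
static int toy_bus_runtime_suspend(struct toy_dev *dev)
{
	int err;

	if (!dev->drv_runtime_suspend)
		return -ENOSYS;	/* assumed code: driver must supply the callback */

	err = dev->drv_runtime_suspend(dev);	/* 1. driver callback first */
	if (err)
		return err;	/* driver refused: device stays as it was */

	dev->config_saved = true;	/* 2. save config registers */
	dev->wakeup_armed = true;	/* 3. prepare wakeup signaling */
	dev->state = dev->wake_state;	/* 4. enter the low-power state */
	return 0;
}
```

Note that neither toy driver callback touches wakeup_armed or state. That mirrors the expectation that the driver leaves wakeup preparation and the power state change to the subsystem.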
+ +A suspended device is brought back into the "active" state, or resumed, +with the help of pm_request_resume() or pm_runtime_resume() which both call +pci_pm_runtime_resume() for PCI devices. Again, this only works if the device's +driver provides a pm->runtime_resume() callback (see below). However, before +the driver's callback is executed, pci_pm_runtime_resume() brings the device +back into the full-power state, prevents it from signaling wakeup while in that +state and restores its standard configuration registers. Thus the driver's +callback need not worry about the PCI-specific aspects of the device resume. + +Note that generally pci_pm_runtime_resume() may be called in two different +situations. First, it may be called at the request of the device's driver, for +example if there are some data for it to process. Second, it may be called +as a result of a wakeup signal from the device itself (this sometimes is +referred to as "remote wakeup"). Of course, for this purpose the wakeup signal +is handled in one of the ways described in Section 1 and finally converted into +a notification for the PCI subsystem after the source device has been +identified. + +The pci_pm_runtime_idle() function, called for PCI devices by pm_runtime_idle() +and pm_request_idle(), executes the device driver's pm->runtime_idle() +callback, if defined, and if that callback doesn't return error code (or is not +present at all), suspends the device with the help of pm_runtime_suspend(). +Sometimes pci_pm_runtime_idle() is called automatically by the PM core (for +example, it is called right after the device has just been resumed), in which +cases it is expected to suspend the device if that makes sense. Usually, +however, the PCI subsystem doesn't really know if the device really can be +suspended, so it lets the device's driver decide by running its +pm->runtime_idle() callback. + +2.4. 
System-Wide Power Transitions +---------------------------------- +There are a few different types of system-wide power transitions, described in +Documentation/driver-api/pm/devices.rst. Each of them requires devices to be handled +in a specific way and the PM core executes subsystem-level power management +callbacks for this purpose. They are executed in phases such that each phase +involves executing the same subsystem-level callback for every device belonging +to the given subsystem before the next phase begins. These phases always run +after tasks have been frozen. + +2.4.1. System Suspend +^^^^^^^^^^^^^^^^^^^^^ + +When the system is going into a sleep state in which the contents of memory will +be preserved, such as one of the ACPI sleep states S1-S3, the phases are: + + prepare, suspend, suspend_noirq. + +The following PCI bus type's callbacks, respectively, are used in these phases:: + + pci_pm_prepare() + pci_pm_suspend() + pci_pm_suspend_noirq() + +The pci_pm_prepare() routine first puts the device into the "fully functional" +state with the help of pm_runtime_resume(). Then, it executes the device +driver's pm->prepare() callback if defined (i.e. if the driver's struct +dev_pm_ops object is present and the prepare pointer in that object is valid). + +The pci_pm_suspend() routine first checks if the device's driver implements +legacy PCI suspend routines (see Section 3), in which case the driver's legacy +suspend callback is executed, if present, and its result is returned. Next, if +the device's driver doesn't provide a struct dev_pm_ops object (containing +pointers to the driver's callbacks), pci_pm_default_suspend() is called, which +simply turns off the device's bus master capability and runs +pcibios_disable_device() to disable it, unless the device is a bridge (PCI +bridges are ignored by this routine). Next, the device driver's pm->suspend() +callback is executed, if defined, and its result is returned if it fails. 
+Finally, pci_fixup_device() is called to apply hardware suspend quirks related +to the device if necessary. + +Note that the suspend phase is carried out asynchronously for PCI devices, so +the pci_pm_suspend() callback may be executed in parallel for any pair of PCI +devices that don't depend on each other in a known way (i.e. none of the paths +in the device tree from the root bridge to a leaf device contains both of them). + +The pci_pm_suspend_noirq() routine is executed after suspend_device_irqs() has +been called, which means that the device driver's interrupt handler won't be +invoked while this routine is running. It first checks if the device's driver +implements legacy PCI suspend routines (Section 3), in which case the legacy +late suspend routine is called and its result is returned (the standard +configuration registers of the device are saved if the driver's callback hasn't +done that). Second, if the device driver's struct dev_pm_ops object is not +present, the device's standard configuration registers are saved and the routine +returns success. Otherwise the device driver's pm->suspend_noirq() callback is +executed, if present, and its result is returned if it fails. Next, if the +device's standard configuration registers haven't been saved yet (one of the +device driver's callbacks executed before might do that), pci_pm_suspend_noirq() +saves them, prepares the device to signal wakeup (if necessary) and puts it into +a low-power state. + +The low-power state to put the device into is the lowest-power (highest number) +state from which it can signal wakeup while the system is in the target sleep +state. Just like in the runtime PM case described above, the mechanism of +signaling wakeup is system-dependent and determined by the PCI subsystem, which +is also responsible for preparing the device to signal wakeup from the system's +target sleep state as appropriate.
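The decision logic described for pci_pm_suspend_noirq() on the non-legacy path can be illustrated with a minimal user-space model. It is only a sketch under stated assumptions: the legacy-callback branch is left out, and all names (noirq_dev, noirq_ops, noirq_model_suspend()) are hypothetical rather than kernel identifiers.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; these are not kernel types. */
struct noirq_dev;

struct noirq_ops {
	int (*suspend_noirq)(struct noirq_dev *dev); /* may be NULL */
};

struct noirq_dev {
	const struct noirq_ops *pm; /* NULL: driver has no dev_pm_ops-like object */
	bool state_saved;           /* standard config registers already saved */
	bool low_power;             /* set when the model arms wakeup and powers down */
};

/* Models the non-legacy path described for pci_pm_suspend_noirq(). */
static int noirq_model_suspend(struct noirq_dev *dev)
{
	if (!dev->pm) {
		/* No callbacks object: save the registers and report success. */
		dev->state_saved = true;
		return 0;
	}

	if (dev->pm->suspend_noirq) {
		int error = dev->pm->suspend_noirq(dev);

		if (error)
			return error; /* the result is returned if the callback fails */
	}

	if (!dev->state_saved) {
		/* Nobody saved the registers yet: the subsystem does all the work. */
		dev->state_saved = true;
		dev->low_power = true; /* arm wakeup if needed, enter low power */
	}
	return 0;
}

/* Hypothetical driver callback that saves the registers itself. */
static int noirq_drv_saves_state(struct noirq_dev *dev)
{
	dev->state_saved = true;
	return 0;
}
```

In the model, a driver that saves the registers in its own callback causes the final step to be skipped, so putting the device into a low-power state is then left to that driver.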
+ +PCI device drivers (that don't implement legacy power management callbacks) are +generally not expected to prepare devices for signaling wakeup or to put them +into low-power states. However, if one of the driver's suspend callbacks +(pm->suspend() or pm->suspend_noirq()) saves the device's standard configuration +registers, pci_pm_suspend_noirq() will assume that the device has been prepared +to signal wakeup and put into a low-power state by the driver (the driver is +then assumed to have used the helper functions provided by the PCI subsystem for +this purpose). PCI device drivers are not encouraged to do that, but in some +rare cases doing that in the driver may be the optimum approach. + +2.4.2. System Resume +^^^^^^^^^^^^^^^^^^^^ + +When the system is undergoing a transition from a sleep state in which the +contents of memory have been preserved, such as one of the ACPI sleep states +S1-S3, into the working state (ACPI S0), the phases are: + + resume_noirq, resume, complete. + +The following PCI bus type's callbacks, respectively, are executed in these +phases:: + + pci_pm_resume_noirq() + pci_pm_resume() + pci_pm_complete() + +The pci_pm_resume_noirq() routine first puts the device into the full-power +state, restores its standard configuration registers and applies early resume +hardware quirks related to the device, if necessary. This is done +unconditionally, regardless of whether or not the device's driver implements +legacy PCI power management callbacks (this way all PCI devices are in the +full-power state and their standard configuration registers have been restored +when their interrupt handlers are invoked for the first time during resume, +which allows the kernel to avoid problems with the handling of shared interrupts +by drivers whose devices are still suspended). If legacy PCI power management +callbacks (see Section 3) are implemented by the device's driver, the legacy +early resume callback is executed and its result is returned. 
Otherwise, the +device driver's pm->resume_noirq() callback is executed, if defined, and its +result is returned. + +The pci_pm_resume() routine first checks if the device's standard configuration +registers have been restored and restores them if that's not the case (this +is only necessary in the error path during a failing suspend). Next, resume +hardware quirks related to the device are applied, if necessary, and if the +device's driver implements legacy PCI power management callbacks (see +Section 3), the driver's legacy resume callback is executed and its result is +returned. Otherwise, the device's wakeup signaling mechanisms are blocked and +its driver's pm->resume() callback is executed, if defined (the callback's +result is then returned). + +The resume phase is carried out asynchronously for PCI devices, like the +suspend phase described above, which means that if two PCI devices don't depend +on each other in a known way, the pci_pm_resume() routine may be executed for +both of them in parallel. + +The pci_pm_complete() routine only executes the device driver's pm->complete() +callback, if defined. + +2.4.3. System Hibernation +^^^^^^^^^^^^^^^^^^^^^^^^^ + +System hibernation is more complicated than system suspend, because it requires +a system image to be created and written into a persistent storage medium. The +image is created atomically and all devices are quiesced, or frozen, before that +happens. + +The freezing of devices is carried out after enough memory has been freed (at +the time of this writing the image creation requires at least 50% of system RAM +to be free) in the following three phases: + + prepare, freeze, freeze_noirq + +that correspond to the PCI bus type's callbacks:: + + pci_pm_prepare() + pci_pm_freeze() + pci_pm_freeze_noirq() + +This means that the prepare phase is exactly the same as for system suspend. +The other two phases, however, are different.
+ +The pci_pm_freeze() routine is quite similar to pci_pm_suspend(), but it runs +the device driver's pm->freeze() callback, if defined, instead of pm->suspend(), +and it doesn't apply the suspend-related hardware quirks. It is executed +asynchronously for different PCI devices that don't depend on each other in a +known way. + +The pci_pm_freeze_noirq() routine, in turn, is similar to +pci_pm_suspend_noirq(), but it calls the device driver's pm->freeze_noirq() +routine instead of pm->suspend_noirq(). It also doesn't attempt to prepare the +device for signaling wakeup and put it into a low-power state. Still, it saves +the device's standard configuration registers if they haven't been saved by one +of the driver's callbacks. + +Once the image has been created, it has to be saved. However, at this point all +devices are frozen and they cannot handle I/O, while their ability to handle +I/O is obviously necessary for the image saving. Thus they have to be brought +back to the fully functional state and this is done in the following phases: + + thaw_noirq, thaw, complete + +using the following PCI bus type's callbacks:: + + pci_pm_thaw_noirq() + pci_pm_thaw() + pci_pm_complete() + +respectively. + +The first of them, pci_pm_thaw_noirq(), is analogous to pci_pm_resume_noirq(), +but it doesn't put the device into the full power state and doesn't attempt to +restore its standard configuration registers. It also executes the device +driver's pm->thaw_noirq() callback, if defined, instead of pm->resume_noirq(). + +The pci_pm_thaw() routine is similar to pci_pm_resume(), but it runs the device +driver's pm->thaw() callback instead of pm->resume(). It is executed +asynchronously for different PCI devices that don't depend on each other in a +known way. + +The complete phase is the same as for system resume. + +After saving the image, devices need to be powered down before the system can +enter the target sleep state (ACPI S4 for ACPI-based systems).
This is done in +three phases: + + prepare, poweroff, poweroff_noirq + +where the prepare phase is exactly the same as for system suspend. The other +two phases are analogous to the suspend and suspend_noirq phases, respectively. +The PCI subsystem-level callbacks they correspond to:: + + pci_pm_poweroff() + pci_pm_poweroff_noirq() + +work in analogy with pci_pm_suspend() and pci_pm_suspend_noirq(), respectively, +although they don't attempt to save the device's standard configuration +registers. + +2.4.4. System Restore +^^^^^^^^^^^^^^^^^^^^^ + +System restore requires a hibernation image to be loaded into memory and the +pre-hibernation memory contents to be restored before the pre-hibernation system +activity can be resumed. + +As described in Documentation/driver-api/pm/devices.rst, the hibernation image is loaded +into memory by a fresh instance of the kernel, called the boot kernel, which in +turn is loaded and run by a boot loader in the usual way. After the boot kernel +has loaded the image, it needs to replace its own code and data with the code +and data of the "hibernated" kernel stored within the image, called the image +kernel. For this purpose all devices are frozen just like before creating +the image during hibernation, in the + + prepare, freeze, freeze_noirq + +phases described above. However, the devices affected by these phases are only +those having drivers in the boot kernel; other devices will still be in whatever +state the boot loader left them. + +Should the restoration of the pre-hibernation memory contents fail, the boot +kernel would go through the "thawing" procedure described above, using the +thaw_noirq, thaw, and complete phases (that will only affect the devices having +drivers in the boot kernel), and then continue running normally.
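The phase sequences introduced so far can be collected into a small lookup table. The sketch below is a user-space illustration only: the table and the pm_phase_lookup() helper are hypothetical names, and the transition labels ("suspend", "freeze", and so on) are chosen here for convenience rather than taken from the kernel.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical summary table; labels are illustrative, not kernel API. */
struct pm_phase_row {
	const char *transition;
	const char *phases[3]; /* in execution order */
};

static const struct pm_phase_row pm_phase_table[] = {
	{ "suspend",  { "prepare", "suspend", "suspend_noirq" } },
	{ "resume",   { "resume_noirq", "resume", "complete" } },
	{ "freeze",   { "prepare", "freeze", "freeze_noirq" } },
	{ "thaw",     { "thaw_noirq", "thaw", "complete" } },
	{ "poweroff", { "prepare", "poweroff", "poweroff_noirq" } },
};

/* Return the name of the index-th phase of a transition, or NULL. */
static const char *pm_phase_lookup(const char *transition, size_t index)
{
	size_t rows = sizeof(pm_phase_table) / sizeof(pm_phase_table[0]);

	if (index >= 3)
		return NULL;

	for (size_t i = 0; i < rows; i++)
		if (strcmp(pm_phase_table[i].transition, transition) == 0)
			return pm_phase_table[i].phases[index];

	return NULL; /* unknown transition label */
}
```

The table makes the symmetry visible: each "going down" sequence starts with prepare and ends with a noirq phase, while each "coming up" sequence starts with a noirq phase and ends with complete.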
+ +If the pre-hibernation memory contents are restored successfully, which is the +usual situation, control is passed to the image kernel, which then becomes +responsible for bringing the system back to the working state. To achieve this, +it must restore the devices' pre-hibernation functionality, which is done much +like waking up from the memory sleep state, although it involves different +phases: + + restore_noirq, restore, complete + +The first two of these are analogous to the resume_noirq and resume phases +described above, respectively, and correspond to the following PCI subsystem +callbacks:: + + pci_pm_restore_noirq() + pci_pm_restore() + +These callbacks work in analogy with pci_pm_resume_noirq() and pci_pm_resume(), +respectively, but they execute the device driver's pm->restore_noirq() and +pm->restore() callbacks, if available. + +The complete phase is carried out in exactly the same way as during system +resume. + + +3. PCI Device Drivers and Power Management +========================================== + +3.1. Power Management Callbacks +------------------------------- + +PCI device drivers participate in power management by providing callbacks to be +executed by the PCI subsystem's power management routines described above and by +controlling the runtime power management of their devices. + +At the time of this writing there are two ways to define power management +callbacks for a PCI device driver, the recommended one, based on using a +dev_pm_ops structure described in Documentation/driver-api/pm/devices.rst, and the +"legacy" one, in which the .suspend(), .suspend_late(), .resume_early(), and +.resume() callbacks from struct pci_driver are used. The legacy approach, +however, doesn't allow one to define runtime power management callbacks and is +not really suitable for any new drivers. Therefore it is not covered by this +document (refer to the source code to learn more about it). 
+ +It is recommended that all PCI device drivers define a struct dev_pm_ops object +containing pointers to power management (PM) callbacks that will be executed by +the PCI subsystem's PM routines in various circumstances. A pointer to the +driver's struct dev_pm_ops object has to be assigned to the driver.pm field in +its struct pci_driver object. Once that has happened, the "legacy" PM callbacks +in struct pci_driver are ignored (even if they are not NULL). + +The PM callbacks in struct dev_pm_ops are not mandatory and if they are not +defined (i.e. the respective fields of struct dev_pm_ops are unset) the PCI +subsystem will handle the device in a simplified default manner. If they are +defined, though, they are expected to behave as described in the following +subsections. + +3.1.1. prepare() +^^^^^^^^^^^^^^^^ + +The prepare() callback is executed during system suspend, during hibernation +(when a hibernation image is about to be created), during power-off after +saving a hibernation image and during system restore, when a hibernation image +has just been loaded into memory. + +This callback is only necessary if the driver's device has children that in +general may be registered at any time. In that case the role of the prepare() +callback is to prevent new children of the device from being registered until +one of the resume_noirq(), thaw_noirq(), or restore_noirq() callbacks is run. + +In addition to that the prepare() callback may carry out some operations +preparing the device to be suspended, although it should not allocate memory +(if additional memory is required to suspend the device, it has to be +preallocated earlier, for example in a suspend/hibernate notifier as described +in Documentation/driver-api/pm/notifiers.rst). + +3.1.2. suspend() +^^^^^^^^^^^^^^^^ + +The suspend() callback is only executed during system suspend, after prepare() +callbacks have been executed for all devices in the system. 
+ +This callback is expected to quiesce the device and prepare it to be put into a +low-power state by the PCI subsystem. It is not required (in fact it even is +not recommended) that a PCI driver's suspend() callback save the standard +configuration registers of the device, prepare it for waking up the system, or +put it into a low-power state. All of these operations can very well be taken +care of by the PCI subsystem, without the driver's participation. + +However, in some rare cases it is convenient to carry out these operations in +a PCI driver. Then, pci_save_state(), pci_prepare_to_sleep(), and +pci_set_power_state() should be used to save the device's standard configuration +registers, to prepare it for system wakeup (if necessary), and to put it into a +low-power state, respectively. Moreover, if the driver calls pci_save_state(), +the PCI subsystem will not execute either pci_prepare_to_sleep(), or +pci_set_power_state() for its device, so the driver is then responsible for +handling the device as appropriate. + +While the suspend() callback is being executed, the driver's interrupt handler +can be invoked to handle an interrupt from the device, so all suspend-related +operations relying on the driver's ability to handle interrupts should be +carried out in this callback. + +3.1.3. suspend_noirq() +^^^^^^^^^^^^^^^^^^^^^^ + +The suspend_noirq() callback is only executed during system suspend, after +suspend() callbacks have been executed for all devices in the system and +after device interrupts have been disabled by the PM core. + +The difference between suspend_noirq() and suspend() is that the driver's +interrupt handler will not be invoked while suspend_noirq() is running. Thus +suspend_noirq() can carry out operations that would cause race conditions to +arise if they were performed in suspend(). + +3.1.4.
freeze() +^^^^^^^^^^^^^^^ + +The freeze() callback is hibernation-specific and is executed in two situations, +during hibernation, after prepare() callbacks have been executed for all devices +in preparation for the creation of a system image, and during restore, +after a system image has been loaded into memory from persistent storage and the +prepare() callbacks have been executed for all devices. + +The role of this callback is analogous to the role of the suspend() callback +described above. In fact, they only need to be different in the rare cases when +the driver takes the responsibility for putting the device into a low-power +state. + +In those cases the freeze() callback should not prepare the device for system wakeup +or put it into a low-power state. Still, either it or freeze_noirq() should +save the device's standard configuration registers using pci_save_state(). + +3.1.5. freeze_noirq() +^^^^^^^^^^^^^^^^^^^^^ + +The freeze_noirq() callback is hibernation-specific. It is executed during +hibernation, after prepare() and freeze() callbacks have been executed for all +devices in preparation for the creation of a system image, and during restore, +after a system image has been loaded into memory and after prepare() and +freeze() callbacks have been executed for all devices. It is always executed +after device interrupts have been disabled by the PM core. + +The role of this callback is analogous to the role of the suspend_noirq() +callback described above and it very rarely is necessary to define +freeze_noirq(). + +The difference between freeze_noirq() and freeze() is analogous to the +difference between suspend_noirq() and suspend(). + +3.1.6. poweroff() +^^^^^^^^^^^^^^^^^ + +The poweroff() callback is hibernation-specific. It is executed when the system +is about to be powered off after saving a hibernation image to a persistent +storage. prepare() callbacks are executed for all devices before poweroff() is +called.
+ +The role of this callback is analogous to the role of the suspend() and freeze() +callbacks described above, although it does not need to save the contents of +the device's registers. In particular, if the driver wants to put the device +into a low-power state itself instead of allowing the PCI subsystem to do that, +the poweroff() callback should use pci_prepare_to_sleep() and +pci_set_power_state() to prepare the device for system wakeup and to put it +into a low-power state, respectively, but it need not save the device's standard +configuration registers. + +3.1.7. poweroff_noirq() +^^^^^^^^^^^^^^^^^^^^^^^ + +The poweroff_noirq() callback is hibernation-specific. It is executed after +poweroff() callbacks have been executed for all devices in the system. + +The role of this callback is analogous to the role of the suspend_noirq() and +freeze_noirq() callbacks described above, but it does not need to save the +contents of the device's registers. + +The difference between poweroff_noirq() and poweroff() is analogous to the +difference between suspend_noirq() and suspend(). + +3.1.8. resume_noirq() +^^^^^^^^^^^^^^^^^^^^^ + +The resume_noirq() callback is only executed during system resume, after the +PM core has enabled the non-boot CPUs. The driver's interrupt handler will not +be invoked while resume_noirq() is running, so this callback can carry out +operations that might race with the interrupt handler. + +Since the PCI subsystem unconditionally puts all devices into the full power +state in the resume_noirq phase of system resume and restores their standard +configuration registers, resume_noirq() is usually not necessary. In general +it should only be used for performing operations that would lead to race +conditions if carried out by resume(). + +3.1.9. 
resume() +^^^^^^^^^^^^^^^ + +The resume() callback is only executed during system resume, after +resume_noirq() callbacks have been executed for all devices in the system and +device interrupts have been enabled by the PM core. + +This callback is responsible for restoring the pre-suspend configuration of the +device and bringing it back to the fully functional state. The device should be +able to process I/O in a usual way after resume() has returned. + +3.1.10. thaw_noirq() +^^^^^^^^^^^^^^^^^^^^ + +The thaw_noirq() callback is hibernation-specific. It is executed after a +system image has been created and the non-boot CPUs have been enabled by the PM +core, in the thaw_noirq phase of hibernation. It also may be executed if the +loading of a hibernation image fails during system restore (it is then executed +after enabling the non-boot CPUs). The driver's interrupt handler will not be +invoked while thaw_noirq() is running. + +The role of this callback is analogous to the role of resume_noirq(). The +difference between these two callbacks is that thaw_noirq() is executed after +freeze() and freeze_noirq(), so in general it does not need to modify the +contents of the device's registers. + +3.1.11. thaw() +^^^^^^^^^^^^^^ + +The thaw() callback is hibernation-specific. It is executed after thaw_noirq() +callbacks have been executed for all devices in the system and after device +interrupts have been enabled by the PM core. + +This callback is responsible for restoring the pre-freeze configuration of +the device, so that it will work in a usual way after thaw() has returned. + +3.1.12. restore_noirq() +^^^^^^^^^^^^^^^^^^^^^^^ + +The restore_noirq() callback is hibernation-specific. It is executed in the +restore_noirq phase of hibernation, when the boot kernel has passed control to +the image kernel and the non-boot CPUs have been enabled by the image kernel's +PM core. 
+ +This callback is analogous to resume_noirq() with the exception that it cannot +make any assumption on the previous state of the device, even if the BIOS (or +generally the platform firmware) is known to preserve that state over a +suspend-resume cycle. + +For the vast majority of PCI device drivers there is no difference between +resume_noirq() and restore_noirq(). + +3.1.13. restore() +^^^^^^^^^^^^^^^^^ + +The restore() callback is hibernation-specific. It is executed after +restore_noirq() callbacks have been executed for all devices in the system and +after the PM core has enabled device drivers' interrupt handlers to be invoked. + +This callback is analogous to resume(), just like restore_noirq() is analogous +to resume_noirq(). Consequently, the difference between restore_noirq() and +restore() is analogous to the difference between resume_noirq() and resume(). + +For the vast majority of PCI device drivers there is no difference between +resume() and restore(). + +3.1.14. complete() +^^^^^^^^^^^^^^^^^^ + +The complete() callback is executed in the following situations: + + - during system resume, after resume() callbacks have been executed for all + devices, + - during hibernation, before saving the system image, after thaw() callbacks + have been executed for all devices, + - during system restore, when the system is going back to its pre-hibernation + state, after restore() callbacks have been executed for all devices. + +It also may be executed if the loading of a hibernation image into memory fails +(in that case it is run after thaw() callbacks have been executed for all +devices that have drivers in the boot kernel). + +This callback is entirely optional, although it may be necessary if the +prepare() callback performs operations that need to be reversed. + +3.1.15. runtime_suspend() +^^^^^^^^^^^^^^^^^^^^^^^^^ + +The runtime_suspend() callback is specific to device runtime power management +(runtime PM). 
It is executed by the PM core's runtime PM framework when the +device is about to be suspended (i.e. quiesced and put into a low-power state) +at run time. + +This callback is responsible for freezing the device and preparing it to be +put into a low-power state, but it must allow the PCI subsystem to perform all +of the PCI-specific actions necessary for suspending the device. + +3.1.16. runtime_resume() +^^^^^^^^^^^^^^^^^^^^^^^^ + +The runtime_resume() callback is specific to device runtime PM. It is executed +by the PM core's runtime PM framework when the device is about to be resumed +(i.e. put into the full-power state and programmed to process I/O normally) at +run time. + +This callback is responsible for restoring the normal functionality of the +device after it has been put into the full-power state by the PCI subsystem. +The device is expected to be able to process I/O in the usual way after +runtime_resume() has returned. + +3.1.17. runtime_idle() +^^^^^^^^^^^^^^^^^^^^^^ + +The runtime_idle() callback is specific to device runtime PM. It is executed +by the PM core's runtime PM framework whenever it may be desirable to suspend +the device according to the PM core's information. In particular, it is +automatically executed right after runtime_resume() has returned in case the +resume of the device has happened as a result of a spurious event. + +This callback is optional, but if it is not implemented or if it returns 0, the +PCI subsystem will call pm_runtime_suspend() for the device, which in turn will +cause the driver's runtime_suspend() callback to be executed. + +3.1.18. Pointing Multiple Callback Pointers to One Routine +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Although in principle each of the callbacks described in the previous +subsections can be defined as a separate function, it often is convenient to +point two or more members of struct dev_pm_ops to the same routine. 
There are +a few convenience macros that can be used for this purpose. + +The SIMPLE_DEV_PM_OPS macro declares a struct dev_pm_ops object with one +suspend routine pointed to by the .suspend(), .freeze(), and .poweroff() +members and one resume routine pointed to by the .resume(), .thaw(), and +.restore() members. The other function pointers in this struct dev_pm_ops are +unset. + +The UNIVERSAL_DEV_PM_OPS macro is similar to SIMPLE_DEV_PM_OPS, but it +additionally sets the .runtime_resume() pointer to the same value as +.resume() (and .thaw(), and .restore()) and the .runtime_suspend() pointer to +the same value as .suspend() (and .freeze() and .poweroff()). + +The SET_SYSTEM_SLEEP_PM_OPS can be used inside of a declaration of struct +dev_pm_ops to indicate that one suspend routine is to be pointed to by the +.suspend(), .freeze(), and .poweroff() members and one resume routine is to +be pointed to by the .resume(), .thaw(), and .restore() members. + +3.1.19. Driver Flags for Power Management +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The PM core allows device drivers to set flags that influence the handling of +power management for the devices by the core itself and by middle layer code +including the PCI bus type. The flags should be set once at the driver probe +time with the help of the dev_pm_set_driver_flags() function and they should not +be updated directly afterwards. + +The DPM_FLAG_NEVER_SKIP flag prevents the PM core from using the direct-complete +mechanism allowing device suspend/resume callbacks to be skipped if the device +is in runtime suspend when the system suspend starts. That also affects all of +the ancestors of the device, so this flag should only be used if absolutely +necessary. + +The DPM_FLAG_SMART_PREPARE flag instructs the PCI bus type to only return a +positive value from pci_pm_prepare() if the ->prepare callback provided by the +driver of the device returns a positive value. 
That allows the driver to opt +out from using the direct-complete mechanism dynamically. + +The DPM_FLAG_SMART_SUSPEND flag tells the PCI bus type that from the driver's +perspective the device can be safely left in runtime suspend during system +suspend. That causes pci_pm_suspend(), pci_pm_freeze() and pci_pm_poweroff() +to skip resuming the device from runtime suspend unless there are PCI-specific +reasons for doing that. Also, it causes pci_pm_suspend_late/noirq(), +pci_pm_freeze_late/noirq() and pci_pm_poweroff_late/noirq() to return early +if the device remains in runtime suspend in the beginning of the "late" phase +of the system-wide transition under way. Moreover, if the device is in +runtime suspend in pci_pm_resume_noirq() or pci_pm_restore_noirq(), its runtime +power management status will be changed to "active" (as it is going to be put +into D0 going forward), but if it is in runtime suspend in pci_pm_thaw_noirq(), +the function will set the power.direct_complete flag for it (to make the PM core +skip the subsequent "thaw" callbacks for it) and return. + +Setting the DPM_FLAG_LEAVE_SUSPENDED flag means that the driver prefers the +device to be left in suspend after system-wide transitions to the working state. +This flag is checked by the PM core, but the PCI bus type informs the PM core +which devices may be left in suspend from its perspective (that happens during +the "noirq" phase of system-wide suspend and analogous transitions) and next it +uses the dev_pm_may_skip_resume() helper to decide whether or not to return from +pci_pm_resume_noirq() early, as the PM core will skip the remaining resume +callbacks for the device during the transition under way and will set its +runtime PM status to "suspended" if dev_pm_may_skip_resume() returns "true" for +it. + +3.2. 
Device Runtime Power Management +------------------------------------ + +In addition to providing device power management callbacks PCI device drivers +are responsible for controlling the runtime power management (runtime PM) of +their devices. + +The PCI device runtime PM is optional, but it is recommended that PCI device +drivers implement it at least in the cases where there is a reliable way of +verifying that the device is not used (like when the network cable is detached +from an Ethernet adapter or there are no devices attached to a USB controller). + +To support the PCI runtime PM the driver first needs to implement the +runtime_suspend() and runtime_resume() callbacks. It also may need to implement +the runtime_idle() callback to prevent the device from being suspended again +every time right after the runtime_resume() callback has returned +(alternatively, the runtime_suspend() callback will have to check if the +device should really be suspended and return -EAGAIN if that is not the case). + +The runtime PM of PCI devices is enabled by default by the PCI core. PCI +device drivers do not need to enable it and should not attempt to do so. +However, it is blocked by pci_pm_init() that runs the pm_runtime_forbid() +helper function. In addition to that, the runtime PM usage counter of +each PCI device is incremented by local_pci_probe() before executing the +probe callback provided by the device's driver. + +If a PCI driver implements the runtime PM callbacks and intends to use the +runtime PM framework provided by the PM core and the PCI subsystem, it needs +to decrement the device's runtime PM usage counter in its probe callback +function. If it doesn't do that, the counter will always be different from +zero for the device and it will never be runtime-suspended. 
The simplest +way to do that is by calling pm_runtime_put_noidle(), but if the driver +wants to schedule an autosuspend right away, for example, it may call +pm_runtime_put_autosuspend() instead for this purpose. Generally, it +just needs to call a function that decrements the device's usage counter +from its probe routine to make runtime PM work for the device. + +It is important to remember that the driver's runtime_suspend() callback +may be executed right after the usage counter has been decremented, because +user space may already have caused the pm_runtime_allow() helper function +unblocking the runtime PM of the device to run via sysfs, so the driver must +be prepared to cope with that. + +The driver itself should not call pm_runtime_allow(), though. Instead, it +should let user space or some platform-specific code do that (user space can +do it via sysfs as stated above), but it must be prepared to handle the +runtime PM of the device correctly as soon as pm_runtime_allow() is called +(which may happen at any time, even before the driver is loaded). + +When the driver's remove callback runs, it has to balance the decrementation +of the device's runtime PM usage counter at the probe time. For this reason, +if it has decremented the counter in its probe callback, it must run +pm_runtime_get_noresume() in its remove callback. [Since the core carries +out a runtime resume of the device and bumps up the device's usage counter +before running the driver's remove callback, the runtime PM of the device +is effectively disabled for the duration of the remove execution and all +runtime PM helper functions incrementing the device's usage counter are +then effectively equivalent to pm_runtime_get_noresume().] + +The runtime PM framework works by processing requests to suspend or resume +devices, or to check if they are idle (in which cases it is reasonable to +subsequently request that they be suspended).
These requests are represented +by work items put into the power management workqueue, pm_wq. Although there +are a few situations in which power management requests are automatically +queued by the PM core (for example, after processing a request to resume a +device the PM core automatically queues a request to check if the device is +idle), device drivers are generally responsible for queuing power management +requests for their devices. For this purpose they should use the runtime PM +helper functions provided by the PM core, discussed in +Documentation/power/runtime_pm.rst. + +Devices can also be suspended and resumed synchronously, without placing a +request into pm_wq. In the majority of cases this also is done by their +drivers that use helper functions provided by the PM core for this purpose. + +For more information on the runtime PM of devices refer to +Documentation/power/runtime_pm.rst. + + +4. Resources +============ + +PCI Local Bus Specification, Rev. 3.0 + +PCI Bus Power Management Interface Specification, Rev. 1.2 + +Advanced Configuration and Power Interface (ACPI) Specification, Rev. 3.0b + +PCI Express Base Specification, Rev. 2.0 + +Documentation/driver-api/pm/devices.rst + +Documentation/power/runtime_pm.rst diff --git a/Documentation/power/pci.txt b/Documentation/power/pci.txt deleted file mode 100644 index 8eaf9ee24d43..000000000000 --- a/Documentation/power/pci.txt +++ /dev/null @@ -1,1094 +0,0 @@ -PCI Power Management - -Copyright (c) 2010 Rafael J. Wysocki , Novell Inc. - -An overview of concepts and the Linux kernel's interfaces related to PCI power -management. Based on previous work by Patrick Mochel -(and others). - -This document only covers the aspects of power management specific to PCI -devices. For general description of the kernel's interfaces related to device -power management refer to Documentation/driver-api/pm/devices.rst and -Documentation/power/runtime_pm.txt. 
- ---------------------------------------------------------------------------- - -1. Hardware and Platform Support for PCI Power Management -2. PCI Subsystem and Device Power Management -3. PCI Device Drivers and Power Management -4. Resources - - -1. Hardware and Platform Support for PCI Power Management -========================================================= - -1.1. Native and Platform-Based Power Management ------------------------------------------------ -In general, power management is a feature allowing one to save energy by putting -devices into states in which they draw less power (low-power states) at the -price of reduced functionality or performance. - -Usually, a device is put into a low-power state when it is underutilized or -completely inactive. However, when it is necessary to use the device once -again, it has to be put back into the "fully functional" state (full-power -state). This may happen when there are some data for the device to handle or -as a result of an external event requiring the device to be active, which may -be signaled by the device itself. - -PCI devices may be put into low-power states in two ways, by using the device -capabilities introduced by the PCI Bus Power Management Interface Specification, -or with the help of platform firmware, such as an ACPI BIOS. In the first -approach, that is referred to as the native PCI power management (native PCI PM) -in what follows, the device power state is changed as a result of writing a -specific value into one of its standard configuration registers. The second -approach requires the platform firmware to provide special methods that may be -used by the kernel to change the device's power state. - -Devices supporting the native PCI PM usually can generate wakeup signals called -Power Management Events (PMEs) to let the kernel know about external events -requiring the device to be active. 
After receiving a PME the kernel is supposed -to put the device that sent it into the full-power state. However, the PCI Bus -Power Management Interface Specification doesn't define any standard method of -delivering the PME from the device to the CPU and the operating system kernel. -It is assumed that the platform firmware will perform this task and therefore, -even though a PCI device is set up to generate PMEs, it also may be necessary to -prepare the platform firmware for notifying the CPU of the PMEs coming from the -device (e.g. by generating interrupts). - -In turn, if the methods provided by the platform firmware are used for changing -the power state of a device, usually the platform also provides a method for -preparing the device to generate wakeup signals. In that case, however, it -often also is necessary to prepare the device for generating PMEs using the -native PCI PM mechanism, because the method provided by the platform depends on -that. - -Thus in many situations both the native and the platform-based power management -mechanisms have to be used simultaneously to obtain the desired result. - -1.2. Native PCI Power Management --------------------------------- -The PCI Bus Power Management Interface Specification (PCI PM Spec) was -introduced between the PCI 2.1 and PCI 2.2 Specifications. It defined a -standard interface for performing various operations related to power -management. - -The implementation of the PCI PM Spec is optional for conventional PCI devices, -but it is mandatory for PCI Express devices. If a device supports the PCI PM -Spec, it has an 8 byte power management capability field in its PCI -configuration space. This field is used to describe and control the standard -features related to the native PCI power management. - -The PCI PM Spec defines 4 operating states for devices (D0-D3) and for buses -(B0-B3). The higher the number, the less power is drawn by the device or bus -in that state. 
However, the higher the number, the longer the latency for -the device or bus to return to the full-power state (D0 or B0, respectively). - -There are two variants of the D3 state defined by the specification. The first -one is D3hot, referred to as the software accessible D3, because devices can be -programmed to go into it. The second one, D3cold, is the state that PCI devices -are in when the supply voltage (Vcc) is removed from them. It is not possible -to program a PCI device to go into D3cold, although there may be a programmable -interface for putting the bus the device is on into a state in which Vcc is -removed from all devices on the bus. - -PCI bus power management, however, is not supported by the Linux kernel at the -time of this writing and therefore it is not covered by this document. - -Note that every PCI device can be in the full-power state (D0) or in D3cold, -regardless of whether or not it implements the PCI PM Spec. In addition to -that, if the PCI PM Spec is implemented by the device, it must support D3hot -as well as D0. The support for the D1 and D2 power states is optional. - -PCI devices supporting the PCI PM Spec can be programmed to go to any of the -supported low-power states (except for D3cold). While in D1-D3hot the -standard configuration registers of the device must be accessible to software -(i.e. the device is required to respond to PCI configuration accesses), although -its I/O and memory spaces are then disabled. This allows the device to be -programmatically put into D0. 
Thus the kernel can switch the device back and -forth between D0 and the supported low-power states (except for D3cold) and the -possible power state transitions the device can undergo are the following: - -+----------------------------+ -| Current State | New State | -+----------------------------+ -| D0 | D1, D2, D3 | -+----------------------------+ -| D1 | D2, D3 | -+----------------------------+ -| D2 | D3 | -+----------------------------+ -| D1, D2, D3 | D0 | -+----------------------------+ - -The transition from D3cold to D0 occurs when the supply voltage is provided to -the device (i.e. power is restored). In that case the device returns to D0 with -a full power-on reset sequence and the power-on defaults are restored to the -device by hardware just as at initial power up. - -PCI devices supporting the PCI PM Spec can be programmed to generate PMEs -while in a low-power state (D1-D3), but they are not required to be capable -of generating PMEs from all supported low-power states. In particular, the -capability of generating PMEs from D3cold is optional and depends on the -presence of additional voltage (3.3Vaux) allowing the device to remain -sufficiently active to generate a wakeup signal. - -1.3. ACPI Device Power Management ---------------------------------- -The platform firmware support for the power management of PCI devices is -system-specific. However, if the system in question is compliant with the -Advanced Configuration and Power Interface (ACPI) Specification, like the -majority of x86-based systems, it is supposed to implement device power -management interfaces defined by the ACPI standard. - -For this purpose the ACPI BIOS provides special functions called "control -methods" that may be executed by the kernel to perform specific tasks, such as -putting a device into a low-power state. These control methods are encoded -using special byte-code language called the ACPI Machine Language (AML) and -stored in the machine's BIOS. 
The kernel loads them from the BIOS and executes -them as needed using an AML interpreter that translates the AML byte code into -computations and memory or I/O space accesses. This way, in theory, a BIOS -writer can provide the kernel with a means to perform actions depending -on the system design in a system-specific fashion. - -ACPI control methods may be divided into global control methods, that are not -associated with any particular devices, and device control methods, that have -to be defined separately for each device supposed to be handled with the help of -the platform. This means, in particular, that ACPI device control methods can -only be used to handle devices that the BIOS writer knew about in advance. The -ACPI methods used for device power management fall into that category. - -The ACPI specification assumes that devices can be in one of four power states -labeled as D0, D1, D2, and D3 that roughly correspond to the native PCI PM -D0-D3 states (although the difference between D3hot and D3cold is not taken -into account by ACPI). Moreover, for each power state of a device there is a -set of power resources that have to be enabled for the device to be put into -that state. These power resources are controlled (i.e. enabled or disabled) -with the help of their own control methods, _ON and _OFF, that have to be -defined individually for each of them. - -To put a device into the ACPI power state Dx (where x is a number between 0 and -3 inclusive) the kernel is supposed to (1) enable the power resources required -by the device in this state using their _ON control methods and (2) execute the -_PSx control method defined for the device. In addition to that, if the device -is going to be put into a low-power state (D1-D3) and is supposed to generate -wakeup signals from that state, the _DSW (or _PSW, replaced with _DSW by ACPI -3.0) control method defined for it has to be executed before _PSx. 
Power -resources that are not required by the device in the target power state and are -not required any more by any other device should be disabled (by executing their -_OFF control methods). If the current power state of the device is D3, it can -only be put into D0 this way. - -However, quite often the power states of devices are changed during a -system-wide transition into a sleep state or back into the working state. ACPI -defines four system sleep states, S1, S2, S3, and S4, and denotes the system -working state as S0. In general, the target system sleep (or working) state -determines the highest power (lowest number) state the device can be put -into and the kernel is supposed to obtain this information by executing the -device's _SxD control method (where x is a number between 0 and 4 inclusive). -If the device is required to wake up the system from the target sleep state, the -lowest power (highest number) state it can be put into is also determined by the -target state of the system. The kernel is then supposed to use the device's -_SxW control method to obtain the number of that state. It also is supposed to -use the device's _PRW control method to learn which power resources need to be -enabled for the device to be able to generate wakeup signals. - -1.4. Wakeup Signaling ---------------------- -Wakeup signals generated by PCI devices, either as native PCI PMEs, or as -a result of the execution of the _DSW (or _PSW) ACPI control method before -putting the device into a low-power state, have to be caught and handled as -appropriate. If they are sent while the system is in the working state -(ACPI S0), they should be translated into interrupts so that the kernel can -put the devices generating them into the full-power state and take care of the -events that triggered them. In turn, if they are sent while the system is -sleeping, they should cause the system's core logic to trigger wakeup. 
- -On ACPI-based systems wakeup signals sent by conventional PCI devices are -converted into ACPI General-Purpose Events (GPEs) which are hardware signals -from the system core logic generated in response to various events that need to -be acted upon. Every GPE is associated with one or more sources of potentially -interesting events. In particular, a GPE may be associated with a PCI device -capable of signaling wakeup. The information on the connections between GPEs -and event sources is recorded in the system's ACPI BIOS from where it can be -read by the kernel. - -If a PCI device known to the system's ACPI BIOS signals wakeup, the GPE -associated with it (if there is one) is triggered. The GPEs associated with PCI -bridges may also be triggered in response to a wakeup signal from one of the -devices below the bridge (this also is the case for root bridges) and, for -example, native PCI PMEs from devices unknown to the system's ACPI BIOS may be -handled this way. - -A GPE may be triggered when the system is sleeping (i.e. when it is in one of -the ACPI S1-S4 states), in which case system wakeup is started by its core logic -(the device that was the source of the signal causing the system wakeup to occur -may be identified later). The GPEs used in such situations are referred to as -wakeup GPEs. - -Usually, however, GPEs are also triggered when the system is in the working -state (ACPI S0) and in that case the system's core logic generates a System -Control Interrupt (SCI) to notify the kernel of the event. Then, the SCI -handler identifies the GPE that caused the interrupt to be generated which, -in turn, allows the kernel to identify the source of the event (that may be -a PCI device signaling wakeup). The GPEs used for notifying the kernel of -events occurring while the system is in the working state are referred to as -runtime GPEs. 
- -Unfortunately, there is no standard way of handling wakeup signals sent by -conventional PCI devices on systems that are not ACPI-based, but there is one -for PCI Express devices. Namely, the PCI Express Base Specification introduced -a native mechanism for converting native PCI PMEs into interrupts generated by -root ports. For conventional PCI devices native PMEs are out-of-band, so they -are routed separately and they need not pass through bridges (in principle they -may be routed directly to the system's core logic), but for PCI Express devices -they are in-band messages that have to pass through the PCI Express hierarchy, -including the root port on the path from the device to the Root Complex. Thus -it was possible to introduce a mechanism by which a root port generates an -interrupt whenever it receives a PME message from one of the devices below it. -The PCI Express Requester ID of the device that sent the PME message is then -recorded in one of the root port's configuration registers from where it may be -read by the interrupt handler allowing the device to be identified. [PME -messages sent by PCI Express endpoints integrated with the Root Complex don't -pass through root ports, but instead they cause a Root Complex Event Collector -(if there is one) to generate interrupts.] - -In principle the native PCI Express PME signaling may also be used on ACPI-based -systems along with the GPEs, but to use it the kernel has to ask the system's -ACPI BIOS to release control of root port configuration registers. The ACPI -BIOS, however, is not required to allow the kernel to control these registers -and if it doesn't do that, the kernel must not modify their contents. Of course -the native PCI Express PME signaling cannot be used by the kernel in that case. - - -2. PCI Subsystem and Device Power Management -============================================ - -2.1. 
Device Power Management Callbacks --------------------------------------- -The PCI Subsystem participates in the power management of PCI devices in a -number of ways. First of all, it provides an intermediate code layer between -the device power management core (PM core) and PCI device drivers. -Specifically, the pm field of the PCI subsystem's struct bus_type object, -pci_bus_type, points to a struct dev_pm_ops object, pci_dev_pm_ops, containing -pointers to several device power management callbacks: - -const struct dev_pm_ops pci_dev_pm_ops = { - .prepare = pci_pm_prepare, - .complete = pci_pm_complete, - .suspend = pci_pm_suspend, - .resume = pci_pm_resume, - .freeze = pci_pm_freeze, - .thaw = pci_pm_thaw, - .poweroff = pci_pm_poweroff, - .restore = pci_pm_restore, - .suspend_noirq = pci_pm_suspend_noirq, - .resume_noirq = pci_pm_resume_noirq, - .freeze_noirq = pci_pm_freeze_noirq, - .thaw_noirq = pci_pm_thaw_noirq, - .poweroff_noirq = pci_pm_poweroff_noirq, - .restore_noirq = pci_pm_restore_noirq, - .runtime_suspend = pci_pm_runtime_suspend, - .runtime_resume = pci_pm_runtime_resume, - .runtime_idle = pci_pm_runtime_idle, -}; - -These callbacks are executed by the PM core in various situations related to -device power management and they, in turn, execute power management callbacks -provided by PCI device drivers. They also perform power management operations -involving some standard configuration registers of PCI devices that device -drivers need not know or care about. - -The structure representing a PCI device, struct pci_dev, contains several fields -that these callbacks operate on: - -struct pci_dev { - ... - pci_power_t current_state; /* Current operating state. */ - int pm_cap; /* PM capability offset in the - configuration space */ - unsigned int pme_support:5; /* Bitmask of states from which PME# - can be generated */ - unsigned int pme_interrupt:1;/* Is native PCIe PME signaling used? 
*/ - unsigned int d1_support:1; /* Low power state D1 is supported */ - unsigned int d2_support:1; /* Low power state D2 is supported */ - unsigned int no_d1d2:1; /* D1 and D2 are forbidden */ - unsigned int wakeup_prepared:1; /* Device prepared for wake up */ - unsigned int d3_delay; /* D3->D0 transition time in ms */ - ... -}; - -They also indirectly use some fields of the struct device that is embedded in -struct pci_dev. - -2.2. Device Initialization --------------------------- -The PCI subsystem's first task related to device power management is to -prepare the device for power management and initialize the fields of struct -pci_dev used for this purpose. This happens in two functions defined in -drivers/pci/pci.c, pci_pm_init() and platform_pci_wakeup_init(). - -The first of these functions checks if the device supports native PCI PM -and if that's the case the offset of its power management capability structure -in the configuration space is stored in the pm_cap field of the device's struct -pci_dev object. Next, the function checks which PCI low-power states are -supported by the device and from which low-power states the device can generate -native PCI PMEs. The power management fields of the device's struct pci_dev and -the struct device embedded in it are updated accordingly and the generation of -PMEs by the device is disabled. - -The second function checks if the device can be prepared to signal wakeup with -the help of the platform firmware, such as the ACPI BIOS. If that is the case, -the function updates the wakeup fields in struct device embedded in the -device's struct pci_dev and uses the firmware-provided method to prevent the -device from signaling wakeup. - -At this point the device is ready for power management. For driverless devices, -however, this functionality is limited to a few basic operations carried out -during system-wide transitions to a sleep state and back to the working state. - -2.3. 
Runtime Device Power Management ------------------------------------- -The PCI subsystem plays a vital role in the runtime power management of PCI -devices. For this purpose it uses the general runtime power management -(runtime PM) framework described in Documentation/power/runtime_pm.txt. -Namely, it provides subsystem-level callbacks: - - pci_pm_runtime_suspend() - pci_pm_runtime_resume() - pci_pm_runtime_idle() - -that are executed by the core runtime PM routines. It also implements the -entire mechanics necessary for handling runtime wakeup signals from PCI devices -in low-power states, which at the time of this writing works for both the native -PCI Express PME signaling and the ACPI GPE-based wakeup signaling described in -Section 1. - -First, a PCI device is put into a low-power state, or suspended, with the help -of pm_schedule_suspend() or pm_runtime_suspend() which for PCI devices call -pci_pm_runtime_suspend() to do the actual job. For this to work, the device's -driver has to provide a pm->runtime_suspend() callback (see below), which is -run by pci_pm_runtime_suspend() as the first action. If the driver's callback -returns successfully, the device's standard configuration registers are saved, -the device is prepared to generate wakeup signals and, finally, it is put into -the target low-power state. - -The low-power state to put the device into is the lowest-power (highest number) -state from which it can signal wakeup. The exact method of signaling wakeup is -system-dependent and is determined by the PCI subsystem on the basis of the -reported capabilities of the device and the platform firmware. To prepare the -device for signaling wakeup and put it into the selected low-power state, the -PCI subsystem can use the platform firmware as well as the device's native PCI -PM capabilities, if supported. 
- -It is expected that the device driver's pm->runtime_suspend() callback will -not attempt to prepare the device for signaling wakeup or to put it into a -low-power state. The driver ought to leave these tasks to the PCI subsystem -that has all of the information necessary to perform them. - -A suspended device is brought back into the "active" state, or resumed, -with the help of pm_request_resume() or pm_runtime_resume() which both call -pci_pm_runtime_resume() for PCI devices. Again, this only works if the device's -driver provides a pm->runtime_resume() callback (see below). However, before -the driver's callback is executed, pci_pm_runtime_resume() brings the device -back into the full-power state, prevents it from signaling wakeup while in that -state and restores its standard configuration registers. Thus the driver's -callback need not worry about the PCI-specific aspects of the device resume. - -Note that generally pci_pm_runtime_resume() may be called in two different -situations. First, it may be called at the request of the device's driver, for -example if there are some data for it to process. Second, it may be called -as a result of a wakeup signal from the device itself (this sometimes is -referred to as "remote wakeup"). Of course, for this purpose the wakeup signal -is handled in one of the ways described in Section 1 and finally converted into -a notification for the PCI subsystem after the source device has been -identified. - -The pci_pm_runtime_idle() function, called for PCI devices by pm_runtime_idle() -and pm_request_idle(), executes the device driver's pm->runtime_idle() -callback, if defined, and if that callback doesn't return error code (or is not -present at all), suspends the device with the help of pm_runtime_suspend(). 
-Sometimes pci_pm_runtime_idle() is called automatically by the PM core (for -example, it is called right after the device has just been resumed), in which -cases it is expected to suspend the device if that makes sense. Usually, -however, the PCI subsystem doesn't really know if the device really can be -suspended, so it lets the device's driver decide by running its -pm->runtime_idle() callback. - -2.4. System-Wide Power Transitions ----------------------------------- -There are a few different types of system-wide power transitions, described in -Documentation/driver-api/pm/devices.rst. Each of them requires devices to be handled -in a specific way and the PM core executes subsystem-level power management -callbacks for this purpose. They are executed in phases such that each phase -involves executing the same subsystem-level callback for every device belonging -to the given subsystem before the next phase begins. These phases always run -after tasks have been frozen. - -2.4.1. System Suspend - -When the system is going into a sleep state in which the contents of memory will -be preserved, such as one of the ACPI sleep states S1-S3, the phases are: - - prepare, suspend, suspend_noirq. - -The following PCI bus type's callbacks, respectively, are used in these phases: - - pci_pm_prepare() - pci_pm_suspend() - pci_pm_suspend_noirq() - -The pci_pm_prepare() routine first puts the device into the "fully functional" -state with the help of pm_runtime_resume(). Then, it executes the device -driver's pm->prepare() callback if defined (i.e. if the driver's struct -dev_pm_ops object is present and the prepare pointer in that object is valid). - -The pci_pm_suspend() routine first checks if the device's driver implements -legacy PCI suspend routines (see Section 3), in which case the driver's legacy -suspend callback is executed, if present, and its result is returned. 
Next, if -the device's driver doesn't provide a struct dev_pm_ops object (containing -pointers to the driver's callbacks), pci_pm_default_suspend() is called, which -simply turns off the device's bus master capability and runs -pcibios_disable_device() to disable it, unless the device is a bridge (PCI -bridges are ignored by this routine). Next, the device driver's pm->suspend() -callback is executed, if defined, and its result is returned if it fails. -Finally, pci_fixup_device() is called to apply hardware suspend quirks related -to the device if necessary. - -Note that the suspend phase is carried out asynchronously for PCI devices, so -the pci_pm_suspend() callback may be executed in parallel for any pair of PCI -devices that don't depend on each other in a known way (i.e. none of the paths -in the device tree from the root bridge to a leaf device contains both of them). - -The pci_pm_suspend_noirq() routine is executed after suspend_device_irqs() has -been called, which means that the device driver's interrupt handler won't be -invoked while this routine is running. It first checks if the device's driver -implements legacy PCI suspends routines (Section 3), in which case the legacy -late suspend routine is called and its result is returned (the standard -configuration registers of the device are saved if the driver's callback hasn't -done that). Second, if the device driver's struct dev_pm_ops object is not -present, the device's standard configuration registers are saved and the routine -returns success. Otherwise the device driver's pm->suspend_noirq() callback is -executed, if present, and its result is returned if it fails. Next, if the -device's standard configuration registers haven't been saved yet (one of the -device driver's callbacks executed before might do that), pci_pm_suspend_noirq() -saves them, prepares the device to signal wakeup (if necessary) and puts it into -a low-power state. 
- -The low-power state to put the device into is the lowest-power (highest number) -state from which it can signal wakeup while the system is in the target sleep -state. Just like in the runtime PM case described above, the mechanism of -signaling wakeup is system-dependent and determined by the PCI subsystem, which -is also responsible for preparing the device to signal wakeup from the system's -target sleep state as appropriate. - -PCI device drivers (that don't implement legacy power management callbacks) are -generally not expected to prepare devices for signaling wakeup or to put them -into low-power states. However, if one of the driver's suspend callbacks -(pm->suspend() or pm->suspend_noirq()) saves the device's standard configuration -registers, pci_pm_suspend_noirq() will assume that the device has been prepared -to signal wakeup and put into a low-power state by the driver (the driver is -then assumed to have used the helper functions provided by the PCI subsystem for -this purpose). PCI device drivers are not encouraged to do that, but in some -rare cases doing that in the driver may be the optimum approach. - -2.4.2. System Resume - -When the system is undergoing a transition from a sleep state in which the -contents of memory have been preserved, such as one of the ACPI sleep states -S1-S3, into the working state (ACPI S0), the phases are: - - resume_noirq, resume, complete. - -The following PCI bus type's callbacks, respectively, are executed in these -phases: - - pci_pm_resume_noirq() - pci_pm_resume() - pci_pm_complete() - -The pci_pm_resume_noirq() routine first puts the device into the full-power -state, restores its standard configuration registers and applies early resume -hardware quirks related to the device, if necessary. 
This is done -unconditionally, regardless of whether or not the device's driver implements -legacy PCI power management callbacks (this way all PCI devices are in the -full-power state and their standard configuration registers have been restored -when their interrupt handlers are invoked for the first time during resume, -which allows the kernel to avoid problems with the handling of shared interrupts -by drivers whose devices are still suspended). If legacy PCI power management -callbacks (see Section 3) are implemented by the device's driver, the legacy -early resume callback is executed and its result is returned. Otherwise, the -device driver's pm->resume_noirq() callback is executed, if defined, and its -result is returned. - -The pci_pm_resume() routine first checks if the device's standard configuration -registers have been restored and restores them if that's not the case (this -is only necessary in the error path during a failing suspend). Next, resume -hardware quirks related to the device are applied, if necessary, and if the -device's driver implements legacy PCI power management callbacks (see -Section 3), the driver's legacy resume callback is executed and its result is -returned. Otherwise, the device's wakeup signaling mechanisms are blocked and -its driver's pm->resume() callback is executed, if defined (the callback's -result is then returned). - -The resume phase is carried out asynchronously for PCI devices, like the -suspend phase described above, which means that if two PCI devices don't depend -on each other in a known way, the pci_pm_resume() routine may be executed for -both of them in parallel. - -The pci_pm_complete() routine only executes the device driver's pm->complete() -callback, if defined. - -2.4.3. System Hibernation - -System hibernation is more complicated than system suspend, because it requires -a system image to be created and written into a persistent storage medium.
The -image is created atomically and all devices are quiesced, or frozen, before that -happens. - -The freezing of devices is carried out after enough memory has been freed (at -the time of this writing the image creation requires at least 50% of system RAM -to be free) in the following three phases: - - prepare, freeze, freeze_noirq - -that correspond to the PCI bus type's callbacks: - - pci_pm_prepare() - pci_pm_freeze() - pci_pm_freeze_noirq() - -This means that the prepare phase is exactly the same as for system suspend. -The other two phases, however, are different. - -The pci_pm_freeze() routine is quite similar to pci_pm_suspend(), but it runs -the device driver's pm->freeze() callback, if defined, instead of pm->suspend(), -and it doesn't apply the suspend-related hardware quirks. It is executed -asynchronously for different PCI devices that don't depend on each other in a -known way. - -The pci_pm_freeze_noirq() routine, in turn, is similar to -pci_pm_suspend_noirq(), but it calls the device driver's pm->freeze_noirq() -routine instead of pm->suspend_noirq(). It also doesn't attempt to prepare the -device for signaling wakeup and put it into a low-power state. Still, it saves -the device's standard configuration registers if they haven't been saved by one -of the driver's callbacks. - -Once the image has been created, it has to be saved. However, at this point all -devices are frozen and they cannot handle I/O, while their ability to handle -I/O is obviously necessary for the image saving. Thus they have to be brought -back to the fully functional state and this is done in the following phases: - - thaw_noirq, thaw, complete - -using the following PCI bus type's callbacks: - - pci_pm_thaw_noirq() - pci_pm_thaw() - pci_pm_complete() - -respectively. - -The first of them, pci_pm_thaw_noirq(), is analogous to pci_pm_resume_noirq(), -but it doesn't put the device into the full power state and doesn't attempt to -restore its standard configuration registers. 
It also executes the device -driver's pm->thaw_noirq() callback, if defined, instead of pm->resume_noirq(). - -The pci_pm_thaw() routine is similar to pci_pm_resume(), but it runs the device -driver's pm->thaw() callback instead of pm->resume(). It is executed -asynchronously for different PCI devices that don't depend on each other in a -known way. - -The complete phase is the same as for system resume. - -After saving the image, devices need to be powered down before the system can -enter the target sleep state (ACPI S4 for ACPI-based systems). This is done in -three phases: - - prepare, poweroff, poweroff_noirq - -where the prepare phase is exactly the same as for system suspend. The other -two phases are analogous to the suspend and suspend_noirq phases, respectively. -The PCI subsystem-level callbacks they correspond to - - pci_pm_poweroff() - pci_pm_poweroff_noirq() - -work in analogy with pci_pm_suspend() and pci_pm_suspend_noirq(), respectively, -although they don't attempt to save the device's standard configuration -registers. - -2.4.4. System Restore - -System restore requires a hibernation image to be loaded into memory and the -pre-hibernation memory contents to be restored before the pre-hibernation system -activity can be resumed. - -As described in Documentation/driver-api/pm/devices.rst, the hibernation image is loaded -into memory by a fresh instance of the kernel, called the boot kernel, which in -turn is loaded and run by a boot loader in the usual way. After the boot kernel -has loaded the image, it needs to replace its own code and data with the code -and data of the "hibernated" kernel stored within the image, called the image -kernel. For this purpose all devices are frozen just like before creating -the image during hibernation, in the - - prepare, freeze, freeze_noirq - -phases described above.
However, the devices affected by these phases are only -those having drivers in the boot kernel; other devices will still be in whatever -state the boot loader left them. - -Should the restoration of the pre-hibernation memory contents fail, the boot -kernel would go through the "thawing" procedure described above, using the -thaw_noirq, thaw, and complete phases (that will only affect the devices having -drivers in the boot kernel), and then continue running normally. - -If the pre-hibernation memory contents are restored successfully, which is the -usual situation, control is passed to the image kernel, which then becomes -responsible for bringing the system back to the working state. To achieve this, -it must restore the devices' pre-hibernation functionality, which is done much -like waking up from the memory sleep state, although it involves different -phases: - - restore_noirq, restore, complete - -The first two of these are analogous to the resume_noirq and resume phases -described above, respectively, and correspond to the following PCI subsystem -callbacks: - - pci_pm_restore_noirq() - pci_pm_restore() - -These callbacks work in analogy with pci_pm_resume_noirq() and pci_pm_resume(), -respectively, but they execute the device driver's pm->restore_noirq() and -pm->restore() callbacks, if available. - -The complete phase is carried out in exactly the same way as during system -resume. - - -3. PCI Device Drivers and Power Management -========================================== - -3.1. Power Management Callbacks -------------------------------- -PCI device drivers participate in power management by providing callbacks to be -executed by the PCI subsystem's power management routines described above and by -controlling the runtime power management of their devices. 
- -At the time of this writing there are two ways to define power management -callbacks for a PCI device driver, the recommended one, based on using a -dev_pm_ops structure described in Documentation/driver-api/pm/devices.rst, and the -"legacy" one, in which the .suspend(), .suspend_late(), .resume_early(), and -.resume() callbacks from struct pci_driver are used. The legacy approach, -however, doesn't allow one to define runtime power management callbacks and is -not really suitable for any new drivers. Therefore it is not covered by this -document (refer to the source code to learn more about it). - -It is recommended that all PCI device drivers define a struct dev_pm_ops object -containing pointers to power management (PM) callbacks that will be executed by -the PCI subsystem's PM routines in various circumstances. A pointer to the -driver's struct dev_pm_ops object has to be assigned to the driver.pm field in -its struct pci_driver object. Once that has happened, the "legacy" PM callbacks -in struct pci_driver are ignored (even if they are not NULL). - -The PM callbacks in struct dev_pm_ops are not mandatory and if they are not -defined (i.e. the respective fields of struct dev_pm_ops are unset) the PCI -subsystem will handle the device in a simplified default manner. If they are -defined, though, they are expected to behave as described in the following -subsections. - -3.1.1. prepare() - -The prepare() callback is executed during system suspend, during hibernation -(when a hibernation image is about to be created), during power-off after -saving a hibernation image and during system restore, when a hibernation image -has just been loaded into memory. - -This callback is only necessary if the driver's device has children that in -general may be registered at any time. 
In that case the role of the prepare() -callback is to prevent new children of the device from being registered until -one of the resume_noirq(), thaw_noirq(), or restore_noirq() callbacks is run. - -In addition to that the prepare() callback may carry out some operations -preparing the device to be suspended, although it should not allocate memory -(if additional memory is required to suspend the device, it has to be -preallocated earlier, for example in a suspend/hibernate notifier as described -in Documentation/driver-api/pm/notifiers.rst). - -3.1.2. suspend() - -The suspend() callback is only executed during system suspend, after prepare() -callbacks have been executed for all devices in the system. - -This callback is expected to quiesce the device and prepare it to be put into a -low-power state by the PCI subsystem. It is not required (in fact it even is -not recommended) that a PCI driver's suspend() callback save the standard -configuration registers of the device, prepare it for waking up the system, or -put it into a low-power state. All of these operations can very well be taken -care of by the PCI subsystem, without the driver's participation. - -However, in some rare case it is convenient to carry out these operations in -a PCI driver. Then, pci_save_state(), pci_prepare_to_sleep(), and -pci_set_power_state() should be used to save the device's standard configuration -registers, to prepare it for system wakeup (if necessary), and to put it into a -low-power state, respectively. Moreover, if the driver calls pci_save_state(), -the PCI subsystem will not execute either pci_prepare_to_sleep(), or -pci_set_power_state() for its device, so the driver is then responsible for -handling the device as appropriate. 
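The rule stated above (once the driver has saved the standard configuration registers, the PCI subsystem leaves wakeup preparation and the power state to the driver) can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct mock_pci_dev and every mock_* function are hypothetical stand-ins invented for illustration, not kernel APIs, and the comments name the real helpers they represent only in the roles the text above gives them.

```c
#include <stdbool.h>

/* Hypothetical user-space stand-in; this is not the kernel's struct pci_dev. */
struct mock_pci_dev {
	bool state_saved;   /* standard configuration registers saved */
	bool wakeup_armed;  /* device prepared to signal wakeup */
	int power_state;    /* 0 = full power, 3 = low power (model values) */
};

/* Usual case: the driver only quiesces the device and leaves the rest alone. */
static int mock_driver_suspend_default(struct mock_pci_dev *dev)
{
	(void)dev;
	return 0;
}

/* Rare case: the driver carries out all three steps itself. */
static int mock_driver_suspend_manual(struct mock_pci_dev *dev)
{
	dev->state_saved = true;   /* role of pci_save_state() */
	dev->wakeup_armed = true;  /* role of pci_prepare_to_sleep() */
	dev->power_state = 3;      /* role of pci_set_power_state() */
	return 0;
}

/* Models the subsystem step: it acts only if the driver has not saved state. */
static void mock_subsystem_suspend_noirq(struct mock_pci_dev *dev)
{
	if (dev->state_saved)
		return; /* the driver is assumed to have handled the device */

	dev->state_saved = true;
	dev->wakeup_armed = true;
	dev->power_state = 3;
}
```

In this model a driver that saves the state but forgets the other two steps leaves the device in full power, which is why the text says the driver is then responsible for handling the device as appropriate.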
- -While the suspend() callback is being executed, the driver's interrupt handler -can be invoked to handle an interrupt from the device, so all suspend-related -operations relying on the driver's ability to handle interrupts should be -carried out in this callback. - -3.1.3. suspend_noirq() - -The suspend_noirq() callback is only executed during system suspend, after -suspend() callbacks have been executed for all devices in the system and -after device interrupts have been disabled by the PM core. - -The difference between suspend_noirq() and suspend() is that the driver's -interrupt handler will not be invoked while suspend_noirq() is running. Thus -suspend_noirq() can carry out operations that would cause race conditions to -arise if they were performed in suspend(). - -3.1.4. freeze() - -The freeze() callback is hibernation-specific and is executed in two situations, -during hibernation, after prepare() callbacks have been executed for all devices -in preparation for the creation of a system image, and during restore, -after a system image has been loaded into memory from persistent storage and the -prepare() callbacks have been executed for all devices. - -The role of this callback is analogous to the role of the suspend() callback -described above. In fact, they only need to be different in the rare cases when -the driver takes the responsibility for putting the device into a low-power -state. - -In those cases the freeze() callback should not prepare the device for system wakeup -or put it into a low-power state. Still, either it or freeze_noirq() should -save the device's standard configuration registers using pci_save_state(). - -3.1.5. freeze_noirq() - -The freeze_noirq() callback is hibernation-specific.
It is executed during -hibernation, after prepare() and freeze() callbacks have been executed for all -devices in preparation for the creation of a system image, and during restore, -after a system image has been loaded into memory and after prepare() and -freeze() callbacks have been executed for all devices. It is always executed -after device interrupts have been disabled by the PM core. - -The role of this callback is analogous to the role of the suspend_noirq() -callback described above and it very rarely is necessary to define -freeze_noirq(). - -The difference between freeze_noirq() and freeze() is analogous to the -difference between suspend_noirq() and suspend(). - -3.1.6. poweroff() - -The poweroff() callback is hibernation-specific. It is executed when the system -is about to be powered off after saving a hibernation image to a persistent -storage. prepare() callbacks are executed for all devices before poweroff() is -called. - -The role of this callback is analogous to the role of the suspend() and freeze() -callbacks described above, although it does not need to save the contents of -the device's registers. In particular, if the driver wants to put the device -into a low-power state itself instead of allowing the PCI subsystem to do that, -the poweroff() callback should use pci_prepare_to_sleep() and -pci_set_power_state() to prepare the device for system wakeup and to put it -into a low-power state, respectively, but it need not save the device's standard -configuration registers. - -3.1.7. poweroff_noirq() - -The poweroff_noirq() callback is hibernation-specific. It is executed after -poweroff() callbacks have been executed for all devices in the system. - -The role of this callback is analogous to the role of the suspend_noirq() and -freeze_noirq() callbacks described above, but it does not need to save the -contents of the device's registers. 
- -The difference between poweroff_noirq() and poweroff() is analogous to the -difference between suspend_noirq() and suspend(). - -3.1.8. resume_noirq() - -The resume_noirq() callback is only executed during system resume, after the -PM core has enabled the non-boot CPUs. The driver's interrupt handler will not -be invoked while resume_noirq() is running, so this callback can carry out -operations that might race with the interrupt handler. - -Since the PCI subsystem unconditionally puts all devices into the full power -state in the resume_noirq phase of system resume and restores their standard -configuration registers, resume_noirq() is usually not necessary. In general -it should only be used for performing operations that would lead to race -conditions if carried out by resume(). - -3.1.9. resume() - -The resume() callback is only executed during system resume, after -resume_noirq() callbacks have been executed for all devices in the system and -device interrupts have been enabled by the PM core. - -This callback is responsible for restoring the pre-suspend configuration of the -device and bringing it back to the fully functional state. The device should be -able to process I/O in a usual way after resume() has returned. - -3.1.10. thaw_noirq() - -The thaw_noirq() callback is hibernation-specific. It is executed after a -system image has been created and the non-boot CPUs have been enabled by the PM -core, in the thaw_noirq phase of hibernation. It also may be executed if the -loading of a hibernation image fails during system restore (it is then executed -after enabling the non-boot CPUs). The driver's interrupt handler will not be -invoked while thaw_noirq() is running. - -The role of this callback is analogous to the role of resume_noirq(). The -difference between these two callbacks is that thaw_noirq() is executed after -freeze() and freeze_noirq(), so in general it does not need to modify the -contents of the device's registers. - -3.1.11. 
thaw() - -The thaw() callback is hibernation-specific. It is executed after thaw_noirq() -callbacks have been executed for all devices in the system and after device -interrupts have been enabled by the PM core. - -This callback is responsible for restoring the pre-freeze configuration of -the device, so that it will work in a usual way after thaw() has returned. - -3.1.12. restore_noirq() - -The restore_noirq() callback is hibernation-specific. It is executed in the -restore_noirq phase of hibernation, when the boot kernel has passed control to -the image kernel and the non-boot CPUs have been enabled by the image kernel's -PM core. - -This callback is analogous to resume_noirq() with the exception that it cannot -make any assumption on the previous state of the device, even if the BIOS (or -generally the platform firmware) is known to preserve that state over a -suspend-resume cycle. - -For the vast majority of PCI device drivers there is no difference between -resume_noirq() and restore_noirq(). - -3.1.13. restore() - -The restore() callback is hibernation-specific. It is executed after -restore_noirq() callbacks have been executed for all devices in the system and -after the PM core has enabled device drivers' interrupt handlers to be invoked. - -This callback is analogous to resume(), just like restore_noirq() is analogous -to resume_noirq(). Consequently, the difference between restore_noirq() and -restore() is analogous to the difference between resume_noirq() and resume(). - -For the vast majority of PCI device drivers there is no difference between -resume() and restore(). - -3.1.14. 
complete() - -The complete() callback is executed in the following situations: - - during system resume, after resume() callbacks have been executed for all - devices, - - during hibernation, before saving the system image, after thaw() callbacks - have been executed for all devices, - - during system restore, when the system is going back to its pre-hibernation - state, after restore() callbacks have been executed for all devices. -It also may be executed if the loading of a hibernation image into memory fails -(in that case it is run after thaw() callbacks have been executed for all -devices that have drivers in the boot kernel). - -This callback is entirely optional, although it may be necessary if the -prepare() callback performs operations that need to be reversed. - -3.1.15. runtime_suspend() - -The runtime_suspend() callback is specific to device runtime power management -(runtime PM). It is executed by the PM core's runtime PM framework when the -device is about to be suspended (i.e. quiesced and put into a low-power state) -at run time. - -This callback is responsible for freezing the device and preparing it to be -put into a low-power state, but it must allow the PCI subsystem to perform all -of the PCI-specific actions necessary for suspending the device. - -3.1.16. runtime_resume() - -The runtime_resume() callback is specific to device runtime PM. It is executed -by the PM core's runtime PM framework when the device is about to be resumed -(i.e. put into the full-power state and programmed to process I/O normally) at -run time. - -This callback is responsible for restoring the normal functionality of the -device after it has been put into the full-power state by the PCI subsystem. -The device is expected to be able to process I/O in the usual way after -runtime_resume() has returned. - -3.1.17. runtime_idle() - -The runtime_idle() callback is specific to device runtime PM. 
It is executed -by the PM core's runtime PM framework whenever it may be desirable to suspend -the device according to the PM core's information. In particular, it is -automatically executed right after runtime_resume() has returned in case the -resume of the device has happened as a result of a spurious event. - -This callback is optional, but if it is not implemented or if it returns 0, the -PCI subsystem will call pm_runtime_suspend() for the device, which in turn will -cause the driver's runtime_suspend() callback to be executed. - -3.1.18. Pointing Multiple Callback Pointers to One Routine - -Although in principle each of the callbacks described in the previous -subsections can be defined as a separate function, it often is convenient to -point two or more members of struct dev_pm_ops to the same routine. There are -a few convenience macros that can be used for this purpose. - -The SIMPLE_DEV_PM_OPS macro declares a struct dev_pm_ops object with one -suspend routine pointed to by the .suspend(), .freeze(), and .poweroff() -members and one resume routine pointed to by the .resume(), .thaw(), and -.restore() members. The other function pointers in this struct dev_pm_ops are -unset. - -The UNIVERSAL_DEV_PM_OPS macro is similar to SIMPLE_DEV_PM_OPS, but it -additionally sets the .runtime_resume() pointer to the same value as -.resume() (and .thaw(), and .restore()) and the .runtime_suspend() pointer to -the same value as .suspend() (and .freeze() and .poweroff()). - -The SET_SYSTEM_SLEEP_PM_OPS can be used inside of a declaration of struct -dev_pm_ops to indicate that one suspend routine is to be pointed to by the -.suspend(), .freeze(), and .poweroff() members and one resume routine is to -be pointed to by the .resume(), .thaw(), and .restore() members. - -3.1.19. 
Driver Flags for Power Management - -The PM core allows device drivers to set flags that influence the handling of -power management for the devices by the core itself and by middle layer code -including the PCI bus type. The flags should be set once at the driver probe -time with the help of the dev_pm_set_driver_flags() function and they should not -be updated directly afterwards. - -The DPM_FLAG_NEVER_SKIP flag prevents the PM core from using the direct-complete -mechanism allowing device suspend/resume callbacks to be skipped if the device -is in runtime suspend when the system suspend starts. That also affects all of -the ancestors of the device, so this flag should only be used if absolutely -necessary. - -The DPM_FLAG_SMART_PREPARE flag instructs the PCI bus type to only return a -positive value from pci_pm_prepare() if the ->prepare callback provided by the -driver of the device returns a positive value. That allows the driver to opt -out from using the direct-complete mechanism dynamically. - -The DPM_FLAG_SMART_SUSPEND flag tells the PCI bus type that from the driver's -perspective the device can be safely left in runtime suspend during system -suspend. That causes pci_pm_suspend(), pci_pm_freeze() and pci_pm_poweroff() -to skip resuming the device from runtime suspend unless there are PCI-specific -reasons for doing that. Also, it causes pci_pm_suspend_late/noirq(), -pci_pm_freeze_late/noirq() and pci_pm_poweroff_late/noirq() to return early -if the device remains in runtime suspend in the beginning of the "late" phase -of the system-wide transition under way. 
Moreover, if the device is in -runtime suspend in pci_pm_resume_noirq() or pci_pm_restore_noirq(), its runtime -power management status will be changed to "active" (as it is going to be put -into D0 going forward), but if it is in runtime suspend in pci_pm_thaw_noirq(), -the function will set the power.direct_complete flag for it (to make the PM core -skip the subsequent "thaw" callbacks for it) and return. - -Setting the DPM_FLAG_LEAVE_SUSPENDED flag means that the driver prefers the -device to be left in suspend after system-wide transitions to the working state. -This flag is checked by the PM core, but the PCI bus type informs the PM core -which devices may be left in suspend from its perspective (that happens during -the "noirq" phase of system-wide suspend and analogous transitions) and next it -uses the dev_pm_may_skip_resume() helper to decide whether or not to return from -pci_pm_resume_noirq() early, as the PM core will skip the remaining resume -callbacks for the device during the transition under way and will set its -runtime PM status to "suspended" if dev_pm_may_skip_resume() returns "true" for -it. - -3.2. Device Runtime Power Management ------------------------------------- -In addition to providing device power management callbacks PCI device drivers -are responsible for controlling the runtime power management (runtime PM) of -their devices. - -The PCI device runtime PM is optional, but it is recommended that PCI device -drivers implement it at least in the cases where there is a reliable way of -verifying that the device is not used (like when the network cable is detached -from an Ethernet adapter or there are no devices attached to a USB controller). - -To support the PCI runtime PM the driver first needs to implement the -runtime_suspend() and runtime_resume() callbacks. 
It also may need to implement -the runtime_idle() callback to prevent the device from being suspended again -every time right after the runtime_resume() callback has returned -(alternatively, the runtime_suspend() callback will have to check if the -device should really be suspended and return -EAGAIN if that is not the case). - -The runtime PM of PCI devices is enabled by default by the PCI core. PCI -device drivers do not need to enable it and should not attempt to do so. -However, it is blocked by pci_pm_init() that runs the pm_runtime_forbid() -helper function. In addition to that, the runtime PM usage counter of -each PCI device is incremented by local_pci_probe() before executing the -probe callback provided by the device's driver. - -If a PCI driver implements the runtime PM callbacks and intends to use the -runtime PM framework provided by the PM core and the PCI subsystem, it needs -to decrement the device's runtime PM usage counter in its probe callback -function. If it doesn't do that, the counter will always be different from -zero for the device and it will never be runtime-suspended. The simplest -way to do that is by calling pm_runtime_put_noidle(), but if the driver -wants to schedule an autosuspend right away, for example, it may call -pm_runtime_put_autosuspend() instead for this purpose. Generally, it -just needs to call a function that decrements the device's usage counter -from its probe routine to make runtime PM work for the device. - -It is important to remember that the driver's runtime_suspend() callback -may be executed right after the usage counter has been decremented, because -user space may already have caused the pm_runtime_allow() helper function -unblocking the runtime PM of the device to run via sysfs, so the driver must -be prepared to cope with that. - -The driver itself should not call pm_runtime_allow(), though.
Instead, it -should let user space or some platform-specific code do that (user space can -do it via sysfs as stated above), but it must be prepared to handle the -runtime PM of the device correctly as soon as pm_runtime_allow() is called -(which may happen at any time, even before the driver is loaded). - -When the driver's remove callback runs, it has to balance the decrementation -of the device's runtime PM usage counter at the probe time. For this reason, -if it has decremented the counter in its probe callback, it must run -pm_runtime_get_noresume() in its remove callback. [Since the core carries -out a runtime resume of the device and bumps up the device's usage counter -before running the driver's remove callback, the runtime PM of the device -is effectively disabled for the duration of the remove execution and all -runtime PM helper functions incrementing the device's usage counter are -then effectively equivalent to pm_runtime_get_noresume().] - -The runtime PM framework works by processing requests to suspend or resume -devices, or to check if they are idle (in which cases it is reasonable to -subsequently request that they be suspended). These requests are represented -by work items put into the power management workqueue, pm_wq. Although there -are a few situations in which power management requests are automatically -queued by the PM core (for example, after processing a request to resume a -device the PM core automatically queues a request to check if the device is -idle), device drivers are generally responsible for queuing power management -requests for their devices. For this purpose they should use the runtime PM -helper functions provided by the PM core, discussed in -Documentation/power/runtime_pm.txt. - -Devices can also be suspended and resumed synchronously, without placing a -request into pm_wq. In the majority of cases this also is done by their -drivers that use helper functions provided by the PM core for this purpose. 
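The usage counter balance between the probe and remove callbacks described earlier in this section can be pictured with a short user-space model. It is a sketch only: struct mock_dev and all mock_* names are hypothetical and exist just for this illustration, and the comments point to the kernel helpers they stand in for. It assumes the counter starts at zero before the core calls probe.

```c
/* Hypothetical user-space model of a device's runtime PM usage counter. */
struct mock_dev {
	int usage_count;
};

/* Stand-in for pm_runtime_get_noresume(): bump the counter only. */
static void mock_get_noresume(struct mock_dev *dev)
{
	dev->usage_count++;
}

/* Stand-in for pm_runtime_put_noidle(): drop the counter only. */
static void mock_put_noidle(struct mock_dev *dev)
{
	dev->usage_count--;
}

/* Driver probe: drop the reference taken by the core so that the
 * device may be runtime-suspended later. */
static int mock_driver_probe(struct mock_dev *dev)
{
	mock_put_noidle(dev);
	return 0;
}

/* Driver remove: balance the decrement done in probe. */
static void mock_driver_remove(struct mock_dev *dev)
{
	mock_get_noresume(dev);
}

/* Models the core incrementing the counter before calling driver probe. */
static int mock_core_probe(struct mock_dev *dev)
{
	mock_get_noresume(dev);
	return mock_driver_probe(dev);
}

/* The device is a runtime suspend candidate only when the counter is zero. */
static int mock_may_runtime_suspend(const struct mock_dev *dev)
{
	return dev->usage_count == 0;
}
```

If mock_driver_probe() skipped the decrement, the counter would stay above zero and mock_may_runtime_suspend() would never return true, which mirrors the "never be runtime-suspended" case in the text.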
- -For more information on the runtime PM of devices refer to -Documentation/power/runtime_pm.txt. - - -4. Resources -============ - -PCI Local Bus Specification, Rev. 3.0 -PCI Bus Power Management Interface Specification, Rev. 1.2 -Advanced Configuration and Power Interface (ACPI) Specification, Rev. 3.0b -PCI Express Base Specification, Rev. 2.0 -Documentation/driver-api/pm/devices.rst -Documentation/power/runtime_pm.txt diff --git a/Documentation/power/pm_qos_interface.rst b/Documentation/power/pm_qos_interface.rst new file mode 100644 index 000000000000..945fc6d760c9 --- /dev/null +++ b/Documentation/power/pm_qos_interface.rst @@ -0,0 +1,225 @@ +=============================== +PM Quality Of Service Interface +=============================== + +This interface provides a kernel and user mode interface for registering +performance expectations by drivers, subsystems and user space applications on +one of the parameters. + +Two different PM QoS frameworks are available: +1. PM QoS classes for cpu_dma_latency, network_latency, network_throughput, +memory_bandwidth. +2. the per-device PM QoS framework provides the API to manage the per-device latency +constraints and PM QoS flags. + +Each parameter has defined units: + + * latency: usec + * timeout: usec + * throughput: kbs (kilo bit / sec) + * memory bandwidth: mbs (mega bit / sec) + + +1. PM QoS framework +=================== + +The infrastructure exposes multiple misc device nodes, one per implemented +parameter. The set of parameters implemented is defined by pm_qos_power_init() +and pm_qos_params.h. This is done because having the available parameters +being runtime configurable or changeable from a driver was seen as too easy to +abuse. + +For each parameter a list of performance requests is maintained along with +an aggregated target value. The aggregated target value is updated with +changes to the request list or elements of the list.
Typically the +aggregated target value is simply the max or min of the request values held +in the parameter list elements. +Note: the aggregated target value is implemented as an atomic variable so that +reading the aggregated value does not require any locking mechanism. + + +From kernel mode the use of this interface is simple: + +void pm_qos_add_request(handle, param_class, target_value): + Will insert an element into the list for that identified PM QoS class with the + target value. Upon change to this list the new target is recomputed and any + registered notifiers are called only if the target value is now different. + Clients of pm_qos need to save the returned handle for future use in other + pm_qos API functions. + +void pm_qos_update_request(handle, new_target_value): + Will update the list element pointed to by the handle with the new target value + and recompute the new aggregated target, calling the notification tree if the + target is changed. + +void pm_qos_remove_request(handle): + Will remove the element. After removal it will update the aggregate target and + call the notification tree if the target was changed as a result of removing + the request. + +int pm_qos_request(param_class): + Returns the aggregated value for a given PM QoS class. + +int pm_qos_request_active(handle): + Returns if the request is still active, i.e. it has not been removed from a + PM QoS class constraints list. + +int pm_qos_add_notifier(param_class, notifier): + Adds a notification callback function to the PM QoS class. The callback is + called when the aggregated value for the PM QoS class is changed. + +int pm_qos_remove_notifier(int param_class, notifier): + Removes the notification callback function for the PM QoS class. + + +From user mode: + +Only processes can register a pm_qos request. 
To provide for automatic +cleanup of a process, the interface requires the process to register its +parameter requests in the following way: + +To register the default pm_qos target for the specific parameter, the process +must open one of /dev/[cpu_dma_latency, network_latency, network_throughput] + +As long as the device node is held open that process has a registered +request on the parameter. + +To change the requested target value the process needs to write an s32 value to +the open device node. Alternatively the user mode program could write a hex +string for the value using 10 char long format e.g. "0x12345678". This +translates to a pm_qos_update_request call. + +To remove the user mode request for a target value simply close the device +node. + + +2. PM QoS per-device latency and flags framework +================================================ + +For each device, there are three lists of PM QoS requests. Two of them are +maintained along with the aggregated targets of resume latency and active +state latency tolerance (in microseconds) and the third one is for PM QoS flags. +Values are updated in response to changes of the request list. + +The target values of resume latency and active state latency tolerance are +simply the minimum of the request values held in the parameter list elements. +The PM QoS flags aggregate value is a gather (bitwise OR) of all list elements' +values. One device PM QoS flag is defined currently: PM_QOS_FLAG_NO_POWER_OFF. + +Note: The aggregated target values are implemented in such a way that reading +the aggregated value does not require any locking mechanism. + + +From kernel mode the use of this interface is the following: + +int dev_pm_qos_add_request(device, handle, type, value): + Will insert an element into the list for that identified device with the + target value. Upon change to this list the new target is recomputed and any + registered notifiers are called only if the target value is now different. 
+ Clients of dev_pm_qos need to save the handle for future use in other + dev_pm_qos API functions. + +int dev_pm_qos_update_request(handle, new_value): + Will update the list element pointed to by the handle with the new target + value and recompute the new aggregated target, calling the notification + trees if the target is changed. + +int dev_pm_qos_remove_request(handle): + Will remove the element. After removal it will update the aggregate target + and call the notification trees if the target was changed as a result of + removing the request. + +s32 dev_pm_qos_read_value(device): + Returns the aggregated value for a given device's constraints list. + +enum pm_qos_flags_status dev_pm_qos_flags(device, mask) + Check PM QoS flags of the given device against the given mask of flags. + The meaning of the return values is as follows: + + PM_QOS_FLAGS_ALL: + All flags from the mask are set + PM_QOS_FLAGS_SOME: + Some flags from the mask are set + PM_QOS_FLAGS_NONE: + No flags from the mask are set + PM_QOS_FLAGS_UNDEFINED: + The device's PM QoS structure has not been initialized + or the list of requests is empty. + +int dev_pm_qos_add_ancestor_request(dev, handle, type, value) + Add a PM QoS request for the first direct ancestor of the given device whose + power.ignore_children flag is unset (for DEV_PM_QOS_RESUME_LATENCY requests) + or whose power.set_latency_tolerance callback pointer is not NULL (for + DEV_PM_QOS_LATENCY_TOLERANCE requests). + +int dev_pm_qos_expose_latency_limit(device, value) + Add a request to the device's PM QoS list of resume latency constraints and + create a sysfs attribute pm_qos_resume_latency_us under the device's power + directory allowing user space to manipulate that request. + +void dev_pm_qos_hide_latency_limit(device) + Drop the request added by dev_pm_qos_expose_latency_limit() from the device's + PM QoS list of resume latency constraints and remove sysfs attribute + pm_qos_resume_latency_us from the device's power directory. 
+ +int dev_pm_qos_expose_flags(device, value) + Add a request to the device's PM QoS list of flags and create sysfs attribute + pm_qos_no_power_off under the device's power directory allowing user space to + change the value of the PM_QOS_FLAG_NO_POWER_OFF flag. + +void dev_pm_qos_hide_flags(device) + Drop the request added by dev_pm_qos_expose_flags() from the device's PM QoS list + of flags and remove sysfs attribute pm_qos_no_power_off from the device's power + directory. + +Notification mechanisms: + +The per-device PM QoS framework has a per-device notification tree. + +int dev_pm_qos_add_notifier(device, notifier): + Adds a notification callback function for the device. + The callback is called when the aggregated value of the device constraints list + is changed (for resume latency device PM QoS only). + +int dev_pm_qos_remove_notifier(device, notifier): + Removes the notification callback function for the device. + + +Active state latency tolerance +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +This device PM QoS type is used to support systems in which hardware may switch +to energy-saving operation modes on the fly. In those systems, if the operation +mode chosen by the hardware attempts to save energy in an overly aggressive way, +it may cause excess latencies to be visible to software, causing it to miss +certain protocol requirements or target frame or sample rates etc. + +If there is a latency tolerance control mechanism for a given device available +to software, the .set_latency_tolerance callback in that device's dev_pm_info +structure should be populated. The routine pointed to by it should implement +whatever is necessary to transfer the effective requirement value to the +hardware. + +Whenever the effective latency tolerance changes for the device, its +.set_latency_tolerance() callback will be executed and the effective value will +be passed to it.
If that value is negative, which means that the list of +latency tolerance requirements for the device is empty, the callback is expected +to switch the underlying hardware latency tolerance control mechanism to an +autonomous mode if available. If that value is PM_QOS_LATENCY_ANY, in turn, and +the hardware supports a special "no requirement" setting, the callback is +expected to use it. That allows software to prevent the hardware from +automatically updating the device's latency tolerance in response to its power +state changes (e.g. during transitions from D3cold to D0), which generally may +be done in the autonomous latency tolerance control mode. + +If .set_latency_tolerance() is present for the device, sysfs attribute +pm_qos_latency_tolerance_us will be present in the device's power directory. +Then, user space can use that attribute to specify its latency tolerance +requirement for the device, if any. Writing "any" to it means "no requirement, +but do not let the hardware control latency tolerance" and writing "auto" to it +allows the hardware to be switched to the autonomous mode if there are no other +requirements from the kernel side in the device's list. + +Kernel code can use the functions described above along with the +DEV_PM_QOS_LATENCY_TOLERANCE device PM QoS type to add, remove and update +latency tolerance requirements for devices. diff --git a/Documentation/power/pm_qos_interface.txt b/Documentation/power/pm_qos_interface.txt deleted file mode 100644 index 19c5f7b1a7ba..000000000000 --- a/Documentation/power/pm_qos_interface.txt +++ /dev/null @@ -1,212 +0,0 @@ -PM Quality Of Service Interface. - -This interface provides a kernel and user mode interface for registering -performance expectations by drivers, subsystems and user space applications on -one of the parameters. - -Two different PM QoS frameworks are available: -1. PM QoS classes for cpu_dma_latency, network_latency, network_throughput, -memory_bandwidth. -2.
the per-device PM QoS framework provides the API to manage the per-device latency -constraints and PM QoS flags. - -Each parameters have defined units: - * latency: usec - * timeout: usec - * throughput: kbs (kilo bit / sec) - * memory bandwidth: mbs (mega bit / sec) - - -1. PM QoS framework - -The infrastructure exposes multiple misc device nodes one per implemented -parameter. The set of parameters implement is defined by pm_qos_power_init() -and pm_qos_params.h. This is done because having the available parameters -being runtime configurable or changeable from a driver was seen as too easy to -abuse. - -For each parameter a list of performance requests is maintained along with -an aggregated target value. The aggregated target value is updated with -changes to the request list or elements of the list. Typically the -aggregated target value is simply the max or min of the request values held -in the parameter list elements. -Note: the aggregated target value is implemented as an atomic variable so that -reading the aggregated value does not require any locking mechanism. - - -From kernel mode the use of this interface is simple: - -void pm_qos_add_request(handle, param_class, target_value): -Will insert an element into the list for that identified PM QoS class with the -target value. Upon change to this list the new target is recomputed and any -registered notifiers are called only if the target value is now different. -Clients of pm_qos need to save the returned handle for future use in other -pm_qos API functions. - -void pm_qos_update_request(handle, new_target_value): -Will update the list element pointed to by the handle with the new target value -and recompute the new aggregated target, calling the notification tree if the -target is changed. - -void pm_qos_remove_request(handle): -Will remove the element. After removal it will update the aggregate target and -call the notification tree if the target was changed as a result of removing -the request. 
- -int pm_qos_request(param_class): -Returns the aggregated value for a given PM QoS class. - -int pm_qos_request_active(handle): -Returns if the request is still active, i.e. it has not been removed from a -PM QoS class constraints list. - -int pm_qos_add_notifier(param_class, notifier): -Adds a notification callback function to the PM QoS class. The callback is -called when the aggregated value for the PM QoS class is changed. - -int pm_qos_remove_notifier(int param_class, notifier): -Removes the notification callback function for the PM QoS class. - - -From user mode: -Only processes can register a pm_qos request. To provide for automatic -cleanup of a process, the interface requires the process to register its -parameter requests in the following way: - -To register the default pm_qos target for the specific parameter, the process -must open one of /dev/[cpu_dma_latency, network_latency, network_throughput] - -As long as the device node is held open that process has a registered -request on the parameter. - -To change the requested target value the process needs to write an s32 value to -the open device node. Alternatively the user mode program could write a hex -string for the value using 10 char long format e.g. "0x12345678". This -translates to a pm_qos_update_request call. - -To remove the user mode request for a target value simply close the device -node. - - -2. PM QoS per-device latency and flags framework - -For each device, there are three lists of PM QoS requests. Two of them are -maintained along with the aggregated targets of resume latency and active -state latency tolerance (in microseconds) and the third one is for PM QoS flags. -Values are updated in response to changes of the request list. - -The target values of resume latency and active state latency tolerance are -simply the minimum of the request values held in the parameter list elements. -The PM QoS flags aggregate value is a gather (bitwise OR) of all list elements' -values. 
One device PM QoS flag is defined currently: PM_QOS_FLAG_NO_POWER_OFF. - -Note: The aggregated target values are implemented in such a way that reading -the aggregated value does not require any locking mechanism. - - -From kernel mode the use of this interface is the following: - -int dev_pm_qos_add_request(device, handle, type, value): -Will insert an element into the list for that identified device with the -target value. Upon change to this list the new target is recomputed and any -registered notifiers are called only if the target value is now different. -Clients of dev_pm_qos need to save the handle for future use in other -dev_pm_qos API functions. - -int dev_pm_qos_update_request(handle, new_value): -Will update the list element pointed to by the handle with the new target value -and recompute the new aggregated target, calling the notification trees if the -target is changed. - -int dev_pm_qos_remove_request(handle): -Will remove the element. After removal it will update the aggregate target and -call the notification trees if the target was changed as a result of removing -the request. - -s32 dev_pm_qos_read_value(device): -Returns the aggregated value for a given device's constraints list. - -enum pm_qos_flags_status dev_pm_qos_flags(device, mask) -Check PM QoS flags of the given device against the given mask of flags. -The meaning of the return values is as follows: - PM_QOS_FLAGS_ALL: All flags from the mask are set - PM_QOS_FLAGS_SOME: Some flags from the mask are set - PM_QOS_FLAGS_NONE: No flags from the mask are set - PM_QOS_FLAGS_UNDEFINED: The device's PM QoS structure has not been - initialized or the list of requests is empty. 
- -int dev_pm_qos_add_ancestor_request(dev, handle, type, value) -Add a PM QoS request for the first direct ancestor of the given device whose -power.ignore_children flag is unset (for DEV_PM_QOS_RESUME_LATENCY requests) -or whose power.set_latency_tolerance callback pointer is not NULL (for -DEV_PM_QOS_LATENCY_TOLERANCE requests). - -int dev_pm_qos_expose_latency_limit(device, value) -Add a request to the device's PM QoS list of resume latency constraints and -create a sysfs attribute pm_qos_resume_latency_us under the device's power -directory allowing user space to manipulate that request. - -void dev_pm_qos_hide_latency_limit(device) -Drop the request added by dev_pm_qos_expose_latency_limit() from the device's -PM QoS list of resume latency constraints and remove sysfs attribute -pm_qos_resume_latency_us from the device's power directory. - -int dev_pm_qos_expose_flags(device, value) -Add a request to the device's PM QoS list of flags and create sysfs attribute -pm_qos_no_power_off under the device's power directory allowing user space to -change the value of the PM_QOS_FLAG_NO_POWER_OFF flag. - -void dev_pm_qos_hide_flags(device) -Drop the request added by dev_pm_qos_expose_flags() from the device's PM QoS list -of flags and remove sysfs attribute pm_qos_no_power_off from the device's power -directory. - -Notification mechanisms: -The per-device PM QoS framework has a per-device notification tree. - -int dev_pm_qos_add_notifier(device, notifier): -Adds a notification callback function for the device. -The callback is called when the aggregated value of the device constraints list -is changed (for resume latency device PM QoS only). - -int dev_pm_qos_remove_notifier(device, notifier): -Removes the notification callback function for the device. - - -Active state latency tolerance - -This device PM QoS type is used to support systems in which hardware may switch -to energy-saving operation modes on the fly. 
In those systems, if the operation -mode chosen by the hardware attempts to save energy in an overly aggressive way, -it may cause excess latencies to be visible to software, causing it to miss -certain protocol requirements or target frame or sample rates etc. - -If there is a latency tolerance control mechanism for a given device available -to software, the .set_latency_tolerance callback in that device's dev_pm_info -structure should be populated. The routine pointed to by it is should implement -whatever is necessary to transfer the effective requirement value to the -hardware. - -Whenever the effective latency tolerance changes for the device, its -.set_latency_tolerance() callback will be executed and the effective value will -be passed to it. If that value is negative, which means that the list of -latency tolerance requirements for the device is empty, the callback is expected -to switch the underlying hardware latency tolerance control mechanism to an -autonomous mode if available. If that value is PM_QOS_LATENCY_ANY, in turn, and -the hardware supports a special "no requirement" setting, the callback is -expected to use it. That allows software to prevent the hardware from -automatically updating the device's latency tolerance in response to its power -state changes (e.g. during transitions from D3cold to D0), which generally may -be done in the autonomous latency tolerance control mode. - -If .set_latency_tolerance() is present for the device, sysfs attribute -pm_qos_latency_tolerance_us will be present in the devivce's power directory. -Then, user space can use that attribute to specify its latency tolerance -requirement for the device, if any. Writing "any" to it means "no requirement, -but do not let the hardware control latency tolerance" and writing "auto" to it -allows the hardware to be switched to the autonomous mode if there are no other -requirements from the kernel side in the device's list. 
- -Kernel code can use the functions described above along with the -DEV_PM_QOS_LATENCY_TOLERANCE device PM QoS type to add, remove and update -latency tolerance requirements for devices. diff --git a/Documentation/power/power_supply_class.rst b/Documentation/power/power_supply_class.rst new file mode 100644 index 000000000000..3f2c3fe38a61 --- /dev/null +++ b/Documentation/power/power_supply_class.rst @@ -0,0 +1,282 @@ +======================== +Linux power supply class +======================== + +Synopsis +~~~~~~~~ +Power supply class used to represent battery, UPS, AC or DC power supply +properties to user-space. + +It defines core set of attributes, which should be applicable to (almost) +every power supply out there. Attributes are available via sysfs and uevent +interfaces. + +Each attribute has well defined meaning, up to unit of measure used. While +the attributes provided are believed to be universally applicable to any +power supply, specific monitoring hardware may not be able to provide them +all, so any of them may be skipped. + +Power supply class is extensible, and allows to define drivers own attributes. +The core attribute set is subject to the standard Linux evolution (i.e. +if it will be found that some attribute is applicable to many power supply +types or their drivers, it can be added to the core set). + +It also integrates with LED framework, for the purpose of providing +typically expected feedback of battery charging/fully charged status and +AC/USB power supply online status. (Note that specific details of the +indication (including whether to use it at all) are fully controllable by +user and/or specific machine defaults, per design principles of LED +framework). + + +Attributes/properties +~~~~~~~~~~~~~~~~~~~~~ +Power supply class has predefined set of attributes, this eliminates code +duplication across drivers. Power supply class insist on reusing its +predefined attributes *and* their units. 
+ +So, userspace gets predictable set of attributes and their units for any +kind of power supply, and can process/present them to a user in consistent +manner. Results for different power supplies and machines are also directly +comparable. + +See drivers/power/supply/ds2760_battery.c and drivers/power/supply/pda_power.c +for the example how to declare and handle attributes. + + +Units +~~~~~ +Quoting include/linux/power_supply.h: + + All voltages, currents, charges, energies, time and temperatures in µV, + µA, µAh, µWh, seconds and tenths of degree Celsius unless otherwise + stated. It's driver's job to convert its raw values to units in which + this class operates. + + +Attributes/properties detailed +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + ++--------------------------------------------------------------------------+ +| **Charge/Energy/Capacity - how to not confuse** | ++--------------------------------------------------------------------------+ +| **Because both "charge" (µAh) and "energy" (µWh) represents "capacity" | +| of battery, this class distinguish these terms. Don't mix them!** | +| | +| - `CHARGE_*` | +| attributes represents capacity in µAh only. | +| - `ENERGY_*` | +| attributes represents capacity in µWh only. | +| - `CAPACITY` | +| attribute represents capacity in *percents*, from 0 to 100. | ++--------------------------------------------------------------------------+ + +Postfixes: + +_AVG + *hardware* averaged value, use it if your hardware is really able to + report averaged values. +_NOW + momentary/instantaneous values. + +STATUS + this attribute represents operating status (charging, full, + discharging (i.e. powering a load), etc.). This corresponds to + `BATTERY_STATUS_*` values, as defined in battery.h. + +CHARGE_TYPE + batteries can typically charge at different rates. + This defines trickle and fast charges. For batteries that + are already charged or discharging, 'n/a' can be displayed (or + 'unknown', if the status is not known). 
+ +AUTHENTIC + indicates the power supply (battery or charger) connected + to the platform is authentic(1) or non authentic(0). + +HEALTH + represents health of the battery, values corresponds to + POWER_SUPPLY_HEALTH_*, defined in battery.h. + +VOLTAGE_OCV + open circuit voltage of the battery. + +VOLTAGE_MAX_DESIGN, VOLTAGE_MIN_DESIGN + design values for maximal and minimal power supply voltages. + Maximal/minimal means values of voltages when battery considered + "full"/"empty" at normal conditions. Yes, there is no direct relation + between voltage and battery capacity, but some dumb + batteries use voltage for very approximated calculation of capacity. + Battery driver also can use this attribute just to inform userspace + about maximal and minimal voltage thresholds of a given battery. + +VOLTAGE_MAX, VOLTAGE_MIN + same as _DESIGN voltage values except that these ones should be used + if hardware could only guess (measure and retain) the thresholds of a + given power supply. + +VOLTAGE_BOOT + Reports the voltage measured during boot + +CURRENT_BOOT + Reports the current measured during boot + +CHARGE_FULL_DESIGN, CHARGE_EMPTY_DESIGN + design charge values, when battery considered full/empty. + +ENERGY_FULL_DESIGN, ENERGY_EMPTY_DESIGN + same as above but for energy. + +CHARGE_FULL, CHARGE_EMPTY + These attributes means "last remembered value of charge when battery + became full/empty". It also could mean "value of charge when battery + considered full/empty at given conditions (temperature, age)". + I.e. these attributes represents real thresholds, not design values. + +ENERGY_FULL, ENERGY_EMPTY + same as above but for energy. + +CHARGE_COUNTER + the current charge counter (in µAh). This could easily + be negative; there is no empty or full value. It is only useful for + relative, time-based measurements. + +PRECHARGE_CURRENT + the maximum charge current during precharge phase of charge cycle + (typically 20% of battery capacity). 
+ +CHARGE_TERM_CURRENT + Charge termination current. The charge cycle terminates when battery + voltage is above recharge threshold, and charge current is below + this setting (typically 10% of battery capacity). + +CONSTANT_CHARGE_CURRENT + constant charge current programmed by charger. + + +CONSTANT_CHARGE_CURRENT_MAX + maximum charge current supported by the power supply object. + +CONSTANT_CHARGE_VOLTAGE + constant charge voltage programmed by charger. +CONSTANT_CHARGE_VOLTAGE_MAX + maximum charge voltage supported by the power supply object. + +INPUT_CURRENT_LIMIT + input current limit programmed by charger. Indicates + the current drawn from a charging source. + +CHARGE_CONTROL_LIMIT + current charge control limit setting +CHARGE_CONTROL_LIMIT_MAX + maximum charge control limit setting + +CALIBRATE + battery or coulomb counter calibration status + +CAPACITY + capacity in percents. +CAPACITY_ALERT_MIN + minimum capacity alert value in percents. +CAPACITY_ALERT_MAX + maximum capacity alert value in percents. +CAPACITY_LEVEL + capacity level. This corresponds to POWER_SUPPLY_CAPACITY_LEVEL_*. + +TEMP + temperature of the power supply. +TEMP_ALERT_MIN + minimum battery temperature alert. +TEMP_ALERT_MAX + maximum battery temperature alert. +TEMP_AMBIENT + ambient temperature. +TEMP_AMBIENT_ALERT_MIN + minimum ambient temperature alert. +TEMP_AMBIENT_ALERT_MAX + maximum ambient temperature alert. +TEMP_MIN + minimum operatable temperature +TEMP_MAX + maximum operatable temperature + +TIME_TO_EMPTY + seconds left for battery to be considered empty + (i.e. while battery powers a load) +TIME_TO_FULL + seconds left for battery to be considered full + (i.e. while battery is charging) + + +Battery <-> external power supply interaction +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Often power supplies are acting as supplies and supplicants at the same +time. Batteries are good example. So, batteries usually care if they're +externally powered or not. 
+ +For that case, power supply class implements notification mechanism for +batteries. + +External power supply (AC) lists supplicants (batteries) names in +"supplied_to" struct member, and each power_supply_changed() call +issued by external power supply will notify supplicants via +external_power_changed callback. + + +Devicetree battery characteristics +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Drivers should call power_supply_get_battery_info() to obtain battery +characteristics from a devicetree battery node, defined in +Documentation/devicetree/bindings/power/supply/battery.txt. This is +implemented in drivers/power/supply/bq27xxx_battery.c. + +Properties in struct power_supply_battery_info and their counterparts in the +battery node have names corresponding to elements in enum power_supply_property, +for naming consistency between sysfs attributes and battery node properties. + + +QA +~~ + +Q: + Where is POWER_SUPPLY_PROP_XYZ attribute? +A: + If you cannot find attribute suitable for your driver needs, feel free + to add it and send patch along with your driver. + + The attributes available currently are the ones currently provided by the + drivers written. + + Good candidates to add in future: model/part#, cycle_time, manufacturer, + etc. + + +Q: + I have some very specific attribute (e.g. battery color), should I add + this attribute to standard ones? +A: + Most likely, no. Such an attribute can be placed in the driver itself, if + it is useful. Of course, if the attribute in question is applicable to a + large set of batteries, provided by many drivers, and/or comes from + some general battery specification/standard, it may be a candidate to + be added to the core attribute set. + + +Q: + Suppose, my battery monitoring chip/firmware does not provide capacity + in percents, but provides charge_{now,full,empty}. Should I calculate + percentage capacity manually, inside the driver, and register CAPACITY + attribute? The same question about time_to_empty/time_to_full.
+A: + Most likely, no. This class is designed to export properties which are + directly measurable by the specific hardware available. + + Inferring not available properties using some heuristics or mathematical + model is not subject of work for a battery driver. Such functionality + should be factored out, and in fact, apm_power, the driver to serve + legacy APM API on top of power supply class, uses a simple heuristic of + approximating remaining battery capacity based on its charge, current, + voltage and so on. But full-fledged battery model is likely not subject + for kernel at all, as it would require floating point calculation to deal + with things like differential equations and Kalman filters. This is + better be handled by batteryd/libbattery, yet to be written. diff --git a/Documentation/power/power_supply_class.txt b/Documentation/power/power_supply_class.txt deleted file mode 100644 index 300d37896e51..000000000000 --- a/Documentation/power/power_supply_class.txt +++ /dev/null @@ -1,231 +0,0 @@ -Linux power supply class -======================== - -Synopsis -~~~~~~~~ -Power supply class used to represent battery, UPS, AC or DC power supply -properties to user-space. - -It defines core set of attributes, which should be applicable to (almost) -every power supply out there. Attributes are available via sysfs and uevent -interfaces. - -Each attribute has well defined meaning, up to unit of measure used. While -the attributes provided are believed to be universally applicable to any -power supply, specific monitoring hardware may not be able to provide them -all, so any of them may be skipped. - -Power supply class is extensible, and allows to define drivers own attributes. -The core attribute set is subject to the standard Linux evolution (i.e. -if it will be found that some attribute is applicable to many power supply -types or their drivers, it can be added to the core set). 
- -It also integrates with LED framework, for the purpose of providing -typically expected feedback of battery charging/fully charged status and -AC/USB power supply online status. (Note that specific details of the -indication (including whether to use it at all) are fully controllable by -user and/or specific machine defaults, per design principles of LED -framework). - - -Attributes/properties -~~~~~~~~~~~~~~~~~~~~~ -Power supply class has predefined set of attributes, this eliminates code -duplication across drivers. Power supply class insist on reusing its -predefined attributes *and* their units. - -So, userspace gets predictable set of attributes and their units for any -kind of power supply, and can process/present them to a user in consistent -manner. Results for different power supplies and machines are also directly -comparable. - -See drivers/power/supply/ds2760_battery.c and drivers/power/supply/pda_power.c -for the example how to declare and handle attributes. - - -Units -~~~~~ -Quoting include/linux/power_supply.h: - - All voltages, currents, charges, energies, time and temperatures in µV, - µA, µAh, µWh, seconds and tenths of degree Celsius unless otherwise - stated. It's driver's job to convert its raw values to units in which - this class operates. - - -Attributes/properties detailed -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -~ ~ ~ ~ ~ ~ ~ Charge/Energy/Capacity - how to not confuse ~ ~ ~ ~ ~ ~ ~ -~ ~ -~ Because both "charge" (µAh) and "energy" (µWh) represents "capacity" ~ -~ of battery, this class distinguish these terms. Don't mix them! ~ -~ ~ -~ CHARGE_* attributes represents capacity in µAh only. ~ -~ ENERGY_* attributes represents capacity in µWh only. ~ -~ CAPACITY attribute represents capacity in *percents*, from 0 to 100. ~ -~ ~ -~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ - -Postfixes: -_AVG - *hardware* averaged value, use it if your hardware is really able to -report averaged values. 
-_NOW - momentary/instantaneous values. - -STATUS - this attribute represents operating status (charging, full, -discharging (i.e. powering a load), etc.). This corresponds to -BATTERY_STATUS_* values, as defined in battery.h. - -CHARGE_TYPE - batteries can typically charge at different rates. -This defines trickle and fast charges. For batteries that -are already charged or discharging, 'n/a' can be displayed (or -'unknown', if the status is not known). - -AUTHENTIC - indicates the power supply (battery or charger) connected -to the platform is authentic(1) or non authentic(0). - -HEALTH - represents health of the battery, values corresponds to -POWER_SUPPLY_HEALTH_*, defined in battery.h. - -VOLTAGE_OCV - open circuit voltage of the battery. - -VOLTAGE_MAX_DESIGN, VOLTAGE_MIN_DESIGN - design values for maximal and -minimal power supply voltages. Maximal/minimal means values of voltages -when battery considered "full"/"empty" at normal conditions. Yes, there is -no direct relation between voltage and battery capacity, but some dumb -batteries use voltage for very approximated calculation of capacity. -Battery driver also can use this attribute just to inform userspace -about maximal and minimal voltage thresholds of a given battery. - -VOLTAGE_MAX, VOLTAGE_MIN - same as _DESIGN voltage values except that -these ones should be used if hardware could only guess (measure and -retain) the thresholds of a given power supply. - -VOLTAGE_BOOT - Reports the voltage measured during boot - -CURRENT_BOOT - Reports the current measured during boot - -CHARGE_FULL_DESIGN, CHARGE_EMPTY_DESIGN - design charge values, when -battery considered full/empty. - -ENERGY_FULL_DESIGN, ENERGY_EMPTY_DESIGN - same as above but for energy. - -CHARGE_FULL, CHARGE_EMPTY - These attributes means "last remembered value -of charge when battery became full/empty". It also could mean "value of -charge when battery considered full/empty at given conditions (temperature, -age)". I.e. 
these attributes represents real thresholds, not design values. - -ENERGY_FULL, ENERGY_EMPTY - same as above but for energy. - -CHARGE_COUNTER - the current charge counter (in µAh). This could easily -be negative; there is no empty or full value. It is only useful for -relative, time-based measurements. - -PRECHARGE_CURRENT - the maximum charge current during precharge phase -of charge cycle (typically 20% of battery capacity). -CHARGE_TERM_CURRENT - Charge termination current. The charge cycle -terminates when battery voltage is above recharge threshold, and charge -current is below this setting (typically 10% of battery capacity). - -CONSTANT_CHARGE_CURRENT - constant charge current programmed by charger. -CONSTANT_CHARGE_CURRENT_MAX - maximum charge current supported by the -power supply object. - -CONSTANT_CHARGE_VOLTAGE - constant charge voltage programmed by charger. -CONSTANT_CHARGE_VOLTAGE_MAX - maximum charge voltage supported by the -power supply object. - -INPUT_CURRENT_LIMIT - input current limit programmed by charger. Indicates -the current drawn from a charging source. - -CHARGE_CONTROL_LIMIT - current charge control limit setting -CHARGE_CONTROL_LIMIT_MAX - maximum charge control limit setting - -CALIBRATE - battery or coulomb counter calibration status - -CAPACITY - capacity in percents. -CAPACITY_ALERT_MIN - minimum capacity alert value in percents. -CAPACITY_ALERT_MAX - maximum capacity alert value in percents. -CAPACITY_LEVEL - capacity level. This corresponds to -POWER_SUPPLY_CAPACITY_LEVEL_*. - -TEMP - temperature of the power supply. -TEMP_ALERT_MIN - minimum battery temperature alert. -TEMP_ALERT_MAX - maximum battery temperature alert. -TEMP_AMBIENT - ambient temperature. -TEMP_AMBIENT_ALERT_MIN - minimum ambient temperature alert. -TEMP_AMBIENT_ALERT_MAX - maximum ambient temperature alert. 
-TEMP_MIN - minimum operatable temperature -TEMP_MAX - maximum operatable temperature - -TIME_TO_EMPTY - seconds left for battery to be considered empty (i.e. -while battery powers a load) -TIME_TO_FULL - seconds left for battery to be considered full (i.e. -while battery is charging) - - -Battery <-> external power supply interaction -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Often power supplies are acting as supplies and supplicants at the same -time. Batteries are good example. So, batteries usually care if they're -externally powered or not. - -For that case, power supply class implements notification mechanism for -batteries. - -External power supply (AC) lists supplicants (batteries) names in -"supplied_to" struct member, and each power_supply_changed() call -issued by external power supply will notify supplicants via -external_power_changed callback. - - -Devicetree battery characteristics -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Drivers should call power_supply_get_battery_info() to obtain battery -characteristics from a devicetree battery node, defined in -Documentation/devicetree/bindings/power/supply/battery.txt. This is -implemented in drivers/power/supply/bq27xxx_battery.c. - -Properties in struct power_supply_battery_info and their counterparts in the -battery node have names corresponding to elements in enum power_supply_property, -for naming consistency between sysfs attributes and battery node properties. - - -QA -~~ -Q: Where is POWER_SUPPLY_PROP_XYZ attribute? -A: If you cannot find attribute suitable for your driver needs, feel free - to add it and send patch along with your driver. - - The attributes available currently are the ones currently provided by the - drivers written. - - Good candidates to add in future: model/part#, cycle_time, manufacturer, - etc. - - -Q: I have some very specific attribute (e.g. battery color), should I add - this attribute to standard ones? -A: Most likely, no. 
Such attribute can be placed in the driver itself, if - it is useful. Of course, if the attribute in question applicable to - large set of batteries, provided by many drivers, and/or comes from - some general battery specification/standard, it may be a candidate to - be added to the core attribute set. - - -Q: Suppose, my battery monitoring chip/firmware does not provides capacity - in percents, but provides charge_{now,full,empty}. Should I calculate - percentage capacity manually, inside the driver, and register CAPACITY - attribute? The same question about time_to_empty/time_to_full. -A: Most likely, no. This class is designed to export properties which are - directly measurable by the specific hardware available. - - Inferring not available properties using some heuristics or mathematical - model is not subject of work for a battery driver. Such functionality - should be factored out, and in fact, apm_power, the driver to serve - legacy APM API on top of power supply class, uses a simple heuristic of - approximating remaining battery capacity based on its charge, current, - voltage and so on. But full-fledged battery model is likely not subject - for kernel at all, as it would require floating point calculation to deal - with things like differential equations and Kalman filters. This is - better be handled by batteryd/libbattery, yet to be written. diff --git a/Documentation/power/powercap/powercap.rst b/Documentation/power/powercap/powercap.rst new file mode 100644 index 000000000000..7ae3b44c7624 --- /dev/null +++ b/Documentation/power/powercap/powercap.rst @@ -0,0 +1,257 @@ +======================= +Power Capping Framework +======================= + +The power capping framework provides a consistent interface between the kernel +and the user space that allows power capping drivers to expose the settings to +user space in a uniform way. 
+ +Terminology +=========== + +The framework exposes power capping devices to user space via sysfs in the +form of a tree of objects. The objects at the root level of the tree represent +'control types', which correspond to different methods of power capping. For +example, the intel-rapl control type represents the Intel "Running Average +Power Limit" (RAPL) technology, whereas the 'idle-injection' control type +corresponds to the use of idle injection for controlling power. + +Power zones represent different parts of the system, which can be controlled and +monitored using the power capping method determined by the control type the +given zone belongs to. They each contain attributes for monitoring power, as +well as controls represented in the form of power constraints. If the parts of +the system represented by different power zones are hierarchical (that is, one +bigger part consists of multiple smaller parts that each have their own power +controls), those power zones may also be organized in a hierarchy with one +parent power zone containing multiple subzones and so on to reflect the power +control topology of the system. In that case, it is possible to apply power +capping to a set of devices together using the parent power zone and if more +fine grained control is required, it can be applied through the subzones. 
+ + +Example sysfs interface tree:: + + /sys/devices/virtual/powercap + └──intel-rapl + ├──intel-rapl:0 + │   ├──constraint_0_name + │   ├──constraint_0_power_limit_uw + │   ├──constraint_0_time_window_us + │   ├──constraint_1_name + │   ├──constraint_1_power_limit_uw + │   ├──constraint_1_time_window_us + │   ├──device -> ../../intel-rapl + │   ├──energy_uj + │   ├──intel-rapl:0:0 + │   │   ├──constraint_0_name + │   │   ├──constraint_0_power_limit_uw + │   │   ├──constraint_0_time_window_us + │   │   ├──constraint_1_name + │   │   ├──constraint_1_power_limit_uw + │   │   ├──constraint_1_time_window_us + │   │   ├──device -> ../../intel-rapl:0 + │   │   ├──energy_uj + │   │   ├──max_energy_range_uj + │   │   ├──name + │   │   ├──enabled + │   │   ├──power + │   │   │   ├──async + │   │   │   [] + │   │   ├──subsystem -> ../../../../../../class/power_cap + │   │   └──uevent + │   ├──intel-rapl:0:1 + │   │   ├──constraint_0_name + │   │   ├──constraint_0_power_limit_uw + │   │   ├──constraint_0_time_window_us + │   │   ├──constraint_1_name + │   │   ├──constraint_1_power_limit_uw + │   │   ├──constraint_1_time_window_us + │   │   ├──device -> ../../intel-rapl:0 + │   │   ├──energy_uj + │   │   ├──max_energy_range_uj + │   │   ├──name + │   │   ├──enabled + │   │   ├──power + │   │   │   ├──async + │   │   │   [] + │   │   ├──subsystem -> ../../../../../../class/power_cap + │   │   └──uevent + │   ├──max_energy_range_uj + │   ├──max_power_range_uw + │   ├──name + │   ├──enabled + │   ├──power + │   │   ├──async + │   │   [] + │   ├──subsystem -> ../../../../../class/power_cap + │   ├──enabled + │   ├──uevent + ├──intel-rapl:1 + │   ├──constraint_0_name + │   ├──constraint_0_power_limit_uw + │   ├──constraint_0_time_window_us + │   ├──constraint_1_name + │   ├──constraint_1_power_limit_uw + │   ├──constraint_1_time_window_us + │   ├──device -> ../../intel-rapl + │   ├──energy_uj + │   ├──intel-rapl:1:0 + │   │   ├──constraint_0_name + │   │   
├──constraint_0_power_limit_uw + │   │   ├──constraint_0_time_window_us + │   │   ├──constraint_1_name + │   │   ├──constraint_1_power_limit_uw + │   │   ├──constraint_1_time_window_us + │   │   ├──device -> ../../intel-rapl:1 + │   │   ├──energy_uj + │   │   ├──max_energy_range_uj + │   │   ├──name + │   │   ├──enabled + │   │   ├──power + │   │   │   ├──async + │   │   │   [] + │   │   ├──subsystem -> ../../../../../../class/power_cap + │   │   └──uevent + │   ├──intel-rapl:1:1 + │   │   ├──constraint_0_name + │   │   ├──constraint_0_power_limit_uw + │   │   ├──constraint_0_time_window_us + │   │   ├──constraint_1_name + │   │   ├──constraint_1_power_limit_uw + │   │   ├──constraint_1_time_window_us + │   │   ├──device -> ../../intel-rapl:1 + │   │   ├──energy_uj + │   │   ├──max_energy_range_uj + │   │   ├──name + │   │   ├──enabled + │   │   ├──power + │   │   │   ├──async + │   │   │   [] + │   │   ├──subsystem -> ../../../../../../class/power_cap + │   │   └──uevent + │   ├──max_energy_range_uj + │   ├──max_power_range_uw + │   ├──name + │   ├──enabled + │   ├──power + │   │   ├──async + │   │   [] + │   ├──subsystem -> ../../../../../class/power_cap + │   ├──uevent + ├──power + │   ├──async + │   [] + ├──subsystem -> ../../../../class/power_cap + ├──enabled + └──uevent + +The above example illustrates a case in which the Intel RAPL technology, +available in Intel® IA-64 and IA-32 Processor Architectures, is used. There is one +control type called intel-rapl which contains two power zones, intel-rapl:0 and +intel-rapl:1, representing CPU packages. Each of these power zones contains +two subzones, intel-rapl:j:0 and intel-rapl:j:1 (j = 0, 1), representing the +"core" and the "uncore" parts of the given CPU package, respectively. 
All of +the zones and subzones contain energy monitoring attributes (energy_uj, +max_energy_range_uj) and constraint attributes (constraint_*) allowing controls +to be applied (the constraints in the 'package' power zones apply to the whole +CPU packages and the subzone constraints only apply to the respective parts of +the given package individually). Since Intel RAPL doesn't provide an instantaneous +power value, there is no power_uw attribute. + +In addition to that, each power zone contains a name attribute, allowing the +part of the system represented by that zone to be identified. +For example:: + + cat /sys/class/power_cap/intel-rapl/intel-rapl:0/name + +package-0 +--------- + +The Intel RAPL technology allows two constraints, short term and long term, +with two different time windows to be applied to each power zone. Thus for +each zone there are 2 attributes representing the constraint names, 2 power +limits and 2 attributes representing the sizes of the time windows. The +constraint_j_* attributes correspond to the jth constraint (j = 0, 1). + +For example:: + + constraint_0_name + constraint_0_power_limit_uw + constraint_0_time_window_us + constraint_1_name + constraint_1_power_limit_uw + constraint_1_time_window_us + +Power Zone Attributes +===================== + +Monitoring attributes +--------------------- + +energy_uj (rw) + Current energy counter in microjoules. Write "0" to reset. + If the counter cannot be reset, then this attribute is read-only. + +max_energy_range_uj (ro) + Range of the above energy counter in microjoules. + +power_uw (ro) + Current power in microwatts. + +max_power_range_uw (ro) + Range of the above power value in microwatts. + +name (ro) + Name of this power zone. + +It is possible that some domains have both power ranges and energy counter ranges; +however, only one is mandatory.
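Since a zone such as intel-rapl may expose only energy_uj and no power_uw, user space typically has to derive power from two energy readings. The following is a minimal user-space sketch of that calculation, not part of the framework itself. The helper names (read_u64_attr, energy_delta_uj, average_power_uw) are hypothetical, and the sketch assumes that the counter wraps back to zero once it passes max_energy_range_uj, which the text above does not state explicitly.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Read one unsigned integer attribute (for example energy_uj) from sysfs.
 * Returns 0 on success and -1 if the file cannot be opened or parsed.
 * Hypothetical helper, named here for illustration only.
 */
static int read_u64_attr(const char *path, uint64_t *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%" SCNu64, out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/*
 * Energy consumed between two energy_uj readings, in microjoules.
 * Assumption: a smaller second reading means the counter wrapped
 * exactly once at max_range_uj (the max_energy_range_uj value).
 */
static uint64_t energy_delta_uj(uint64_t prev_uj, uint64_t now_uj,
                                uint64_t max_range_uj)
{
    if (now_uj >= prev_uj)
        return now_uj - prev_uj;
    return (max_range_uj - prev_uj) + now_uj;
}

/*
 * Average power in microwatts over an interval given in microseconds.
 * One microjoule per microsecond is one watt, hence the 1e6 scale factor.
 * Returns 0 for a zero-length interval instead of dividing by zero.
 */
static uint64_t average_power_uw(uint64_t delta_uj, uint64_t interval_us)
{
    if (interval_us == 0)
        return 0;
    return delta_uj * 1000000ULL / interval_us;
}
```

A caller would read energy_uj twice with a known delay in between, read max_energy_range_uj once, and pass the values through energy_delta_uj() and average_power_uw(). The multiplication stays within 64 bits as long as the energy delta is below roughly 1.8e13 microjoules.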
+ +Constraints +----------- + +constraint_X_power_limit_uw (rw) + Power limit in microwatts, which should be applicable for the + time window specified by "constraint_X_time_window_us". + +constraint_X_time_window_us (rw) + Time window in microseconds. + +constraint_X_name (ro) + An optional name of the constraint. + +constraint_X_max_power_uw (ro) + Maximum allowed power in microwatts. + +constraint_X_min_power_uw (ro) + Minimum allowed power in microwatts. + +constraint_X_max_time_window_us (ro) + Maximum allowed time window in microseconds. + +constraint_X_min_time_window_us (ro) + Minimum allowed time window in microseconds. + +All fields other than power_limit_uw and time_window_us are optional. + +Common zone and control type attributes +--------------------------------------- + +enabled (rw): Enable/Disable controls at zone level or for all zones using +a control type. + +Power Cap Client Driver Interface +================================= + +The API summary: + +Call powercap_register_control_type() to register a control type object. +Call powercap_register_zone() to register a power zone (under a given +control type), either as a top-level power zone or as a subzone of another +power zone registered earlier. +The number of constraints in a power zone and the corresponding callbacks have +to be defined prior to calling powercap_register_zone() to register that zone. + +To free a power zone, call powercap_unregister_zone(). +To free a control type object, call powercap_unregister_control_type(). +Detailed API documentation can be generated using kernel-doc on include/linux/powercap.h.
diff --git a/Documentation/power/powercap/powercap.txt b/Documentation/power/powercap/powercap.txt deleted file mode 100644 index 1e6ef164e07a..000000000000 --- a/Documentation/power/powercap/powercap.txt +++ /dev/null @@ -1,236 +0,0 @@ -Power Capping Framework -================================== - -The power capping framework provides a consistent interface between the kernel -and the user space that allows power capping drivers to expose the settings to -user space in a uniform way. - -Terminology -========================= -The framework exposes power capping devices to user space via sysfs in the -form of a tree of objects. The objects at the root level of the tree represent -'control types', which correspond to different methods of power capping. For -example, the intel-rapl control type represents the Intel "Running Average -Power Limit" (RAPL) technology, whereas the 'idle-injection' control type -corresponds to the use of idle injection for controlling power. - -Power zones represent different parts of the system, which can be controlled and -monitored using the power capping method determined by the control type the -given zone belongs to. They each contain attributes for monitoring power, as -well as controls represented in the form of power constraints. If the parts of -the system represented by different power zones are hierarchical (that is, one -bigger part consists of multiple smaller parts that each have their own power -controls), those power zones may also be organized in a hierarchy with one -parent power zone containing multiple subzones and so on to reflect the power -control topology of the system. In that case, it is possible to apply power -capping to a set of devices together using the parent power zone and if more -fine grained control is required, it can be applied through the subzones. - - -Example sysfs interface tree: - -/sys/devices/virtual/powercap -??? intel-rapl - ??? intel-rapl:0 - ?   ??? constraint_0_name - ?   ??? 
constraint_0_power_limit_uw - ?   ??? constraint_0_time_window_us - ?   ??? constraint_1_name - ?   ??? constraint_1_power_limit_uw - ?   ??? constraint_1_time_window_us - ?   ??? device -> ../../intel-rapl - ?   ??? energy_uj - ?   ??? intel-rapl:0:0 - ?   ?   ??? constraint_0_name - ?   ?   ??? constraint_0_power_limit_uw - ?   ?   ??? constraint_0_time_window_us - ?   ?   ??? constraint_1_name - ?   ?   ??? constraint_1_power_limit_uw - ?   ?   ??? constraint_1_time_window_us - ?   ?   ??? device -> ../../intel-rapl:0 - ?   ?   ??? energy_uj - ?   ?   ??? max_energy_range_uj - ?   ?   ??? name - ?   ?   ??? enabled - ?   ?   ??? power - ?   ?   ?   ??? async - ?   ?   ?   [] - ?   ?   ??? subsystem -> ../../../../../../class/power_cap - ?   ?   ??? uevent - ?   ??? intel-rapl:0:1 - ?   ?   ??? constraint_0_name - ?   ?   ??? constraint_0_power_limit_uw - ?   ?   ??? constraint_0_time_window_us - ?   ?   ??? constraint_1_name - ?   ?   ??? constraint_1_power_limit_uw - ?   ?   ??? constraint_1_time_window_us - ?   ?   ??? device -> ../../intel-rapl:0 - ?   ?   ??? energy_uj - ?   ?   ??? max_energy_range_uj - ?   ?   ??? name - ?   ?   ??? enabled - ?   ?   ??? power - ?   ?   ?   ??? async - ?   ?   ?   [] - ?   ?   ??? subsystem -> ../../../../../../class/power_cap - ?   ?   ??? uevent - ?   ??? max_energy_range_uj - ?   ??? max_power_range_uw - ?   ??? name - ?   ??? enabled - ?   ??? power - ?   ?   ??? async - ?   ?   [] - ?   ??? subsystem -> ../../../../../class/power_cap - ?   ??? enabled - ?   ??? uevent - ??? intel-rapl:1 - ?   ??? constraint_0_name - ?   ??? constraint_0_power_limit_uw - ?   ??? constraint_0_time_window_us - ?   ??? constraint_1_name - ?   ??? constraint_1_power_limit_uw - ?   ??? constraint_1_time_window_us - ?   ??? device -> ../../intel-rapl - ?   ??? energy_uj - ?   ??? intel-rapl:1:0 - ?   ?   ??? constraint_0_name - ?   ?   ??? constraint_0_power_limit_uw - ?   ?   ??? constraint_0_time_window_us - ?   ?   ??? 
constraint_1_name - ?   ?   ??? constraint_1_power_limit_uw - ?   ?   ??? constraint_1_time_window_us - ?   ?   ??? device -> ../../intel-rapl:1 - ?   ?   ??? energy_uj - ?   ?   ??? max_energy_range_uj - ?   ?   ??? name - ?   ?   ??? enabled - ?   ?   ??? power - ?   ?   ?   ??? async - ?   ?   ?   [] - ?   ?   ??? subsystem -> ../../../../../../class/power_cap - ?   ?   ??? uevent - ?   ??? intel-rapl:1:1 - ?   ?   ??? constraint_0_name - ?   ?   ??? constraint_0_power_limit_uw - ?   ?   ??? constraint_0_time_window_us - ?   ?   ??? constraint_1_name - ?   ?   ??? constraint_1_power_limit_uw - ?   ?   ??? constraint_1_time_window_us - ?   ?   ??? device -> ../../intel-rapl:1 - ?   ?   ??? energy_uj - ?   ?   ??? max_energy_range_uj - ?   ?   ??? name - ?   ?   ??? enabled - ?   ?   ??? power - ?   ?   ?   ??? async - ?   ?   ?   [] - ?   ?   ??? subsystem -> ../../../../../../class/power_cap - ?   ?   ??? uevent - ?   ??? max_energy_range_uj - ?   ??? max_power_range_uw - ?   ??? name - ?   ??? enabled - ?   ??? power - ?   ?   ??? async - ?   ?   [] - ?   ??? subsystem -> ../../../../../class/power_cap - ?   ??? uevent - ??? power - ?   ??? async - ?   [] - ??? subsystem -> ../../../../class/power_cap - ??? enabled - ??? uevent - -The above example illustrates a case in which the Intel RAPL technology, -available in Intel® IA-64 and IA-32 Processor Architectures, is used. There is one -control type called intel-rapl which contains two power zones, intel-rapl:0 and -intel-rapl:1, representing CPU packages. Each of these power zones contains -two subzones, intel-rapl:j:0 and intel-rapl:j:1 (j = 0, 1), representing the -"core" and the "uncore" parts of the given CPU package, respectively. 
All of -the zones and subzones contain energy monitoring attributes (energy_uj, -max_energy_range_uj) and constraint attributes (constraint_*) allowing controls -to be applied (the constraints in the 'package' power zones apply to the whole -CPU packages and the subzone constraints only apply to the respective parts of -the given package individually). Since Intel RAPL doesn't provide instantaneous -power value, there is no power_uw attribute. - -In addition to that, each power zone contains a name attribute, allowing the -part of the system represented by that zone to be identified. -For example: - -cat /sys/class/power_cap/intel-rapl/intel-rapl:0/name -package-0 - -The Intel RAPL technology allows two constraints, short term and long term, -with two different time windows to be applied to each power zone. Thus for -each zone there are 2 attributes representing the constraint names, 2 power -limits and 2 attributes representing the sizes of the time windows. Such that, -constraint_j_* attributes correspond to the jth constraint (j = 0,1). - -For example: - constraint_0_name - constraint_0_power_limit_uw - constraint_0_time_window_us - constraint_1_name - constraint_1_power_limit_uw - constraint_1_time_window_us - -Power Zone Attributes -================================= -Monitoring attributes ----------------------- - -energy_uj (rw): Current energy counter in micro joules. Write "0" to reset. -If the counter can not be reset, then this attribute is read only. - -max_energy_range_uj (ro): Range of the above energy counter in micro-joules. - -power_uw (ro): Current power in micro watts. - -max_power_range_uw (ro): Range of the above power value in micro-watts. - -name (ro): Name of this power zone. - -It is possible that some domains have both power ranges and energy counter ranges; -however, only one is mandatory. 
- -Constraints ----------------- -constraint_X_power_limit_uw (rw): Power limit in micro watts, which should be -applicable for the time window specified by "constraint_X_time_window_us". - -constraint_X_time_window_us (rw): Time window in micro seconds. - -constraint_X_name (ro): An optional name of the constraint - -constraint_X_max_power_uw(ro): Maximum allowed power in micro watts. - -constraint_X_min_power_uw(ro): Minimum allowed power in micro watts. - -constraint_X_max_time_window_us(ro): Maximum allowed time window in micro seconds. - -constraint_X_min_time_window_us(ro): Minimum allowed time window in micro seconds. - -Except power_limit_uw and time_window_us other fields are optional. - -Common zone and control type attributes ----------------------------------------- -enabled (rw): Enable/Disable controls at zone level or for all zones using -a control type. - -Power Cap Client Driver Interface -================================== -The API summary: - -Call powercap_register_control_type() to register control type object. -Call powercap_register_zone() to register a power zone (under a given -control type), either as a top-level power zone or as a subzone of another -power zone registered earlier. -The number of constraints in a power zone and the corresponding callbacks have -to be defined prior to calling powercap_register_zone() to register that zone. - -To Free a power zone call powercap_unregister_zone(). -To free a control type object call powercap_unregister_control_type(). -Detailed API can be generated using kernel-doc on include/linux/powercap.h. 
diff --git a/Documentation/power/regulator/consumer.rst b/Documentation/power/regulator/consumer.rst new file mode 100644 index 000000000000..0cd8cc1275a7 --- /dev/null +++ b/Documentation/power/regulator/consumer.rst @@ -0,0 +1,229 @@ +=================================== +Regulator Consumer Driver Interface +=================================== + +This text describes the regulator interface for consumer device drivers. +Please see overview.txt for a description of the terms used in this text. + + +1. Consumer Regulator Access (static & dynamic drivers) +======================================================= + +A consumer driver can get access to its supply regulator by calling :: + + regulator = regulator_get(dev, "Vcc"); + +The consumer passes in its struct device pointer and power supply ID. The core +then finds the correct regulator by consulting a machine-specific lookup table. +If the lookup is successful then this call will return a pointer to the struct +regulator that supplies this consumer. + +To release the regulator the consumer driver should call :: + + regulator_put(regulator); + +Consumers can be supplied by more than one regulator, e.g. a codec consumer with +analog and digital supplies :: + + digital = regulator_get(dev, "Vcc"); /* digital core */ + analog = regulator_get(dev, "Avdd"); /* analog */ + +The regulator access functions regulator_get() and regulator_put() will +usually be called in your device driver's probe() and remove() respectively. + + +2. Regulator Output Enable & Disable (static & dynamic drivers) +=============================================================== + + +A consumer can enable its power supply by calling:: + + int regulator_enable(regulator); + +NOTE: + The supply may already be enabled before regulator_enable() is called. + This may happen if the consumer shares the regulator or the regulator has been + previously enabled by the bootloader or kernel board initialization code.
+ +A consumer can determine if a regulator is enabled by calling:: + + int regulator_is_enabled(regulator); + +This will return a value greater than zero when the regulator is enabled. + + +A consumer can disable its supply when no longer needed by calling:: + + int regulator_disable(regulator); + +NOTE: + This may not disable the supply if it's shared with other consumers. The + regulator will only be disabled when the enabled reference count is zero. + +Finally, a regulator can be forcefully disabled in the case of an emergency:: + + int regulator_force_disable(regulator); + +NOTE: + this will immediately and forcefully shut down the regulator output. All + consumers will be powered off. + + +3. Regulator Voltage Control & Status (dynamic drivers) +======================================================= + +Some consumer drivers need to be able to dynamically change their supply +voltage to match system operating points, e.g. CPUfreq drivers can scale +voltage along with frequency to save power, SD drivers may need to select the +correct card voltage, etc. + +Consumers can control their supply voltage by calling:: + + int regulator_set_voltage(regulator, min_uV, max_uV); + +Where min_uV and max_uV are the minimum and maximum acceptable voltages in +microvolts. + +NOTE: this can be called when the regulator is enabled or disabled. If called +when enabled, then the voltage changes instantly, otherwise the voltage +configuration changes and the voltage is physically set when the regulator is +next enabled. + +The regulator's configured voltage output can be found by calling:: + + int regulator_get_voltage(regulator); + +NOTE: + get_voltage() will return the configured output voltage whether the + regulator is enabled or disabled and should NOT be used to determine regulator + output state. However, this can be used in conjunction with is_enabled() to + determine the regulator's physical output voltage. + + +4.
Regulator Current Limit Control & Status (dynamic drivers) +============================================================= + +Some consumer drivers need to be able to dynamically change their supply +current limit to match system operating points, e.g. an LCD backlight driver can +change the current limit to vary the backlight brightness, USB drivers may want +to set the limit to 500mA when supplying power. + +Consumers can control their supply current limit by calling:: + + int regulator_set_current_limit(regulator, min_uA, max_uA); + +Where min_uA and max_uA are the minimum and maximum acceptable current limits in +microamps. + +NOTE: + this can be called when the regulator is enabled or disabled. If called + when enabled, then the current limit changes instantly, otherwise the current + limit configuration changes and the current limit is physically set when the + regulator is next enabled. + +A regulator's current limit can be found by calling:: + + int regulator_get_current_limit(regulator); + +NOTE: + get_current_limit() will return the current limit whether the regulator + is enabled or disabled and should not be used to determine regulator current + load. + + +5. Regulator Operating Mode Control & Status (dynamic drivers) +============================================================== + +Some consumers can further save system power by changing the operating mode of +their supply regulator to be more efficient when the consumer's operating state +changes, e.g. when the consumer driver is idle and subsequently draws less current. + +Regulator operating mode can be changed indirectly or directly. + +Indirect operating mode control.
+-------------------------------- +Consumer drivers can request a change in their supply regulator's operating mode +by calling:: + + int regulator_set_load(struct regulator *regulator, int load_uA); + +This will cause the core to recalculate the total load on the regulator (based +on all its consumers) and change operating mode (if necessary and permitted) +to best match the current operating load. + +The load_uA value can be determined from the consumer's datasheet, e.g. most +datasheets have tables showing the maximum current consumed in certain +situations. + +Most consumers will use indirect operating mode control since they have no +knowledge of the regulator or whether the regulator is shared with other +consumers. + +Direct operating mode control. +------------------------------ + +Bespoke or tightly coupled drivers may want to directly control regulator +operating mode depending on their operating point. This can be achieved by +calling:: + + int regulator_set_mode(struct regulator *regulator, unsigned int mode); + unsigned int regulator_get_mode(struct regulator *regulator); + +Direct mode will only be used by consumers that *know* about the regulator and +are not sharing the regulator with other consumers. + + +6. Regulator Events +=================== + +Regulators can notify consumers of external events. Events could be received by +consumers under regulator stress or failure conditions. + +Consumers can register interest in regulator events by calling:: + + int regulator_register_notifier(struct regulator *regulator, + struct notifier_block *nb); + +Consumers can unregister interest by calling:: + + int regulator_unregister_notifier(struct regulator *regulator, + struct notifier_block *nb); + +Regulators use the kernel notifier framework to send events to their interested +consumers. + +7.
Regulator Direct Register Access +=================================== + +Some kinds of power management hardware or firmware are designed such that +they need to do low-level hardware access to regulators, with no involvement +from the kernel. Examples of such devices are: + +- clocksource with a voltage-controlled oscillator and control logic to change + the supply voltage over I2C to achieve a desired output clock rate +- thermal management firmware that can issue an arbitrary I2C transaction to + perform system poweroff during overtemperature conditions + +To set up such a device/firmware, various parameters like I2C address of the +regulator, addresses of various regulator registers etc. need to be configured +to it. The regulator framework provides the following helpers for querying +these details. + +Bus-specific details, like I2C addresses or transfer rates are handled by the +regmap framework. To get the regulator's regmap (if supported), use:: + + struct regmap *regulator_get_regmap(struct regulator *regulator); + +To obtain the hardware register offset and bitmask for the regulator's voltage +selector register, use:: + + int regulator_get_hardware_vsel_register(struct regulator *regulator, + unsigned *vsel_reg, + unsigned *vsel_mask); + +To convert a regulator framework voltage selector code (used by +regulator_list_voltage) to a hardware-specific voltage selector that can be +directly written to the voltage selector register, use:: + + int regulator_list_hardware_vsel(struct regulator *regulator, + unsigned selector); diff --git a/Documentation/power/regulator/consumer.txt b/Documentation/power/regulator/consumer.txt deleted file mode 100644 index e51564c1a140..000000000000 --- a/Documentation/power/regulator/consumer.txt +++ /dev/null @@ -1,218 +0,0 @@ -Regulator Consumer Driver Interface -=================================== - -This text describes the regulator interface for consumer device drivers. 
-Please see overview.txt for a description of the terms used in this text. - - -1. Consumer Regulator Access (static & dynamic drivers) -======================================================= - -A consumer driver can get access to its supply regulator by calling :- - -regulator = regulator_get(dev, "Vcc"); - -The consumer passes in its struct device pointer and power supply ID. The core -then finds the correct regulator by consulting a machine specific lookup table. -If the lookup is successful then this call will return a pointer to the struct -regulator that supplies this consumer. - -To release the regulator the consumer driver should call :- - -regulator_put(regulator); - -Consumers can be supplied by more than one regulator e.g. codec consumer with -analog and digital supplies :- - -digital = regulator_get(dev, "Vcc"); /* digital core */ -analog = regulator_get(dev, "Avdd"); /* analog */ - -The regulator access functions regulator_get() and regulator_put() will -usually be called in your device drivers probe() and remove() respectively. - - -2. Regulator Output Enable & Disable (static & dynamic drivers) -==================================================================== - -A consumer can enable its power supply by calling:- - -int regulator_enable(regulator); - -NOTE: The supply may already be enabled before regulator_enabled() is called. -This may happen if the consumer shares the regulator or the regulator has been -previously enabled by bootloader or kernel board initialization code. - -A consumer can determine if a regulator is enabled by calling :- - -int regulator_is_enabled(regulator); - -This will return > zero when the regulator is enabled. - - -A consumer can disable its supply when no longer needed by calling :- - -int regulator_disable(regulator); - -NOTE: This may not disable the supply if it's shared with other consumers. The -regulator will only be disabled when the enabled reference count is zero. 
- -Finally, a regulator can be forcefully disabled in the case of an emergency :- - -int regulator_force_disable(regulator); - -NOTE: this will immediately and forcefully shutdown the regulator output. All -consumers will be powered off. - - -3. Regulator Voltage Control & Status (dynamic drivers) -====================================================== - -Some consumer drivers need to be able to dynamically change their supply -voltage to match system operating points. e.g. CPUfreq drivers can scale -voltage along with frequency to save power, SD drivers may need to select the -correct card voltage, etc. - -Consumers can control their supply voltage by calling :- - -int regulator_set_voltage(regulator, min_uV, max_uV); - -Where min_uV and max_uV are the minimum and maximum acceptable voltages in -microvolts. - -NOTE: this can be called when the regulator is enabled or disabled. If called -when enabled, then the voltage changes instantly, otherwise the voltage -configuration changes and the voltage is physically set when the regulator is -next enabled. - -The regulators configured voltage output can be found by calling :- - -int regulator_get_voltage(regulator); - -NOTE: get_voltage() will return the configured output voltage whether the -regulator is enabled or disabled and should NOT be used to determine regulator -output state. However this can be used in conjunction with is_enabled() to -determine the regulator physical output voltage. - - -4. Regulator Current Limit Control & Status (dynamic drivers) -=========================================================== - -Some consumer drivers need to be able to dynamically change their supply -current limit to match system operating points. e.g. LCD backlight driver can -change the current limit to vary the backlight brightness, USB drivers may want -to set the limit to 500mA when supplying power. 
- -Consumers can control their supply current limit by calling :- - -int regulator_set_current_limit(regulator, min_uA, max_uA); - -Where min_uA and max_uA are the minimum and maximum acceptable current limit in -microamps. - -NOTE: this can be called when the regulator is enabled or disabled. If called -when enabled, then the current limit changes instantly, otherwise the current -limit configuration changes and the current limit is physically set when the -regulator is next enabled. - -A regulators current limit can be found by calling :- - -int regulator_get_current_limit(regulator); - -NOTE: get_current_limit() will return the current limit whether the regulator -is enabled or disabled and should not be used to determine regulator current -load. - - -5. Regulator Operating Mode Control & Status (dynamic drivers) -============================================================= - -Some consumers can further save system power by changing the operating mode of -their supply regulator to be more efficient when the consumers operating state -changes. e.g. consumer driver is idle and subsequently draws less current - -Regulator operating mode can be changed indirectly or directly. - -Indirect operating mode control. --------------------------------- -Consumer drivers can request a change in their supply regulator operating mode -by calling :- - -int regulator_set_load(struct regulator *regulator, int load_uA); - -This will cause the core to recalculate the total load on the regulator (based -on all its consumers) and change operating mode (if necessary and permitted) -to best match the current operating load. - -The load_uA value can be determined from the consumer's datasheet. e.g. most -datasheets have tables showing the maximum current consumed in certain -situations. - -Most consumers will use indirect operating mode control since they have no -knowledge of the regulator or whether the regulator is shared with other -consumers. - -Direct operating mode control. 
------------------------------- -Bespoke or tightly coupled drivers may want to directly control regulator -operating mode depending on their operating point. This can be achieved by -calling :- - -int regulator_set_mode(struct regulator *regulator, unsigned int mode); -unsigned int regulator_get_mode(struct regulator *regulator); - -Direct mode will only be used by consumers that *know* about the regulator and -are not sharing the regulator with other consumers. - - -6. Regulator Events -=================== -Regulators can notify consumers of external events. Events could be received by -consumers under regulator stress or failure conditions. - -Consumers can register interest in regulator events by calling :- - -int regulator_register_notifier(struct regulator *regulator, - struct notifier_block *nb); - -Consumers can unregister interest by calling :- - -int regulator_unregister_notifier(struct regulator *regulator, - struct notifier_block *nb); - -Regulators use the kernel notifier framework to send event to their interested -consumers. - -7. Regulator Direct Register Access -=================================== -Some kinds of power management hardware or firmware are designed such that -they need to do low-level hardware access to regulators, with no involvement -from the kernel. Examples of such devices are: - -- clocksource with a voltage-controlled oscillator and control logic to change - the supply voltage over I2C to achieve a desired output clock rate -- thermal management firmware that can issue an arbitrary I2C transaction to - perform system poweroff during overtemperature conditions - -To set up such a device/firmware, various parameters like I2C address of the -regulator, addresses of various regulator registers etc. need to be configured -to it. The regulator framework provides the following helpers for querying -these details. - -Bus-specific details, like I2C addresses or transfer rates are handled by the -regmap framework. 
To get the regulator's regmap (if supported), use :- - -struct regmap *regulator_get_regmap(struct regulator *regulator); - -To obtain the hardware register offset and bitmask for the regulator's voltage -selector register, use :- - -int regulator_get_hardware_vsel_register(struct regulator *regulator, - unsigned *vsel_reg, - unsigned *vsel_mask); - -To convert a regulator framework voltage selector code (used by -regulator_list_voltage) to a hardware-specific voltage selector that can be -directly written to the voltage selector register, use :- - -int regulator_list_hardware_vsel(struct regulator *regulator, - unsigned selector); diff --git a/Documentation/power/regulator/design.rst b/Documentation/power/regulator/design.rst new file mode 100644 index 000000000000..3b09c6841dc4 --- /dev/null +++ b/Documentation/power/regulator/design.rst @@ -0,0 +1,38 @@ +========================== +Regulator API design notes +========================== + +This document provides a brief, partially structured, overview of some +of the design considerations which impact the regulator API design. + +Safety +------ + + - Errors in regulator configuration can have very serious consequences + for the system, potentially including lasting hardware damage. + - It is not possible to automatically determine the power configuration + of the system - software-equivalent variants of the same chip may + have different power requirements, and not all components with power + requirements are visible to software. + +.. note:: + + The API should make no changes to the hardware state unless it has + specific knowledge that these changes are safe to perform on this + particular system. + +Consumer use cases +------------------ + + - The overwhelming majority of devices in a system will have no + requirement to do any runtime configuration of their power beyond + being able to turn it on or off. + + - Many of the power supplies in the system will be shared between many + different consumers. + +.. 
note:: + + The consumer API should be structured so that these use cases are + very easy to handle and so that consumers will work with shared + supplies without any additional effort. diff --git a/Documentation/power/regulator/design.txt b/Documentation/power/regulator/design.txt deleted file mode 100644 index fdd919b96830..000000000000 --- a/Documentation/power/regulator/design.txt +++ /dev/null @@ -1,33 +0,0 @@ -Regulator API design notes -========================== - -This document provides a brief, partially structured, overview of some -of the design considerations which impact the regulator API design. - -Safety ------- - - - Errors in regulator configuration can have very serious consequences - for the system, potentially including lasting hardware damage. - - It is not possible to automatically determine the power configuration - of the system - software-equivalent variants of the same chip may - have different power requirements, and not all components with power - requirements are visible to software. - - => The API should make no changes to the hardware state unless it has - specific knowledge that these changes are safe to perform on this - particular system. - -Consumer use cases ------------------- - - - The overwhelming majority of devices in a system will have no - requirement to do any runtime configuration of their power beyond - being able to turn it on or off. - - - Many of the power supplies in the system will be shared between many - different consumers. - - => The consumer API should be structured so that these use cases are - very easy to handle and so that consumers will work with shared - supplies without any additional effort. 
diff --git a/Documentation/power/regulator/machine.rst b/Documentation/power/regulator/machine.rst new file mode 100644 index 000000000000..22fffefaa3ad --- /dev/null +++ b/Documentation/power/regulator/machine.rst @@ -0,0 +1,97 @@ +================================== +Regulator Machine Driver Interface +================================== + +The regulator machine driver interface is intended for board/machine specific +initialisation code to configure the regulator subsystem. + +Consider the following machine:: + + Regulator-1 -+-> Regulator-2 --> [Consumer A @ 1.8 - 2.0V] + | + +-> [Consumer B @ 3.3V] + +The drivers for consumers A & B must be mapped to the correct regulator in +order to control their power supplies. This mapping can be achieved in machine +initialisation code by creating a struct regulator_consumer_supply for +each regulator:: + + struct regulator_consumer_supply { + const char *dev_name; /* consumer dev_name() */ + const char *supply; /* consumer supply - e.g. "vcc" */ + }; + +e.g. for the machine above:: + + static struct regulator_consumer_supply regulator1_consumers[] = { + REGULATOR_SUPPLY("Vcc", "consumer B"), + }; + + static struct regulator_consumer_supply regulator2_consumers[] = { + REGULATOR_SUPPLY("Vcc", "consumer A"), + }; + +This maps Regulator-1 to the 'Vcc' supply for Consumer B and maps Regulator-2 +to the 'Vcc' supply for Consumer A. + +Constraints can now be registered by defining a struct regulator_init_data +for each regulator power domain. 
This structure also maps the consumers +to their supply regulators:: + + static struct regulator_init_data regulator1_data = { + .constraints = { + .name = "Regulator-1", + .min_uV = 3300000, + .max_uV = 3300000, + .valid_modes_mask = REGULATOR_MODE_NORMAL, + }, + .num_consumer_supplies = ARRAY_SIZE(regulator1_consumers), + .consumer_supplies = regulator1_consumers, + }; + +The name field should be set to something that is usefully descriptive +for the board for configuration of supplies for other regulators and +for use in logging and other diagnostic output. Normally the name +used for the supply rail in the schematic is a good choice. If no +name is provided then the subsystem will choose one. + +Regulator-1 supplies power to Regulator-2. This relationship must be registered +with the core so that Regulator-1 is also enabled when Consumer A enables its +supply (Regulator-2). The supply regulator is set by the supply_regulator +field below and co:: + + static struct regulator_init_data regulator2_data = { + .supply_regulator = "Regulator-1", + .constraints = { + .min_uV = 1800000, + .max_uV = 2000000, + .valid_ops_mask = REGULATOR_CHANGE_VOLTAGE, + .valid_modes_mask = REGULATOR_MODE_NORMAL, + }, + .num_consumer_supplies = ARRAY_SIZE(regulator2_consumers), + .consumer_supplies = regulator2_consumers, + }; + +Finally the regulator devices must be registered in the usual manner:: + + static struct platform_device regulator_devices[] = { + { + .name = "regulator", + .id = DCDC_1, + .dev = { + .platform_data = &regulator1_data, + }, + }, + { + .name = "regulator", + .id = DCDC_2, + .dev = { + .platform_data = &regulator2_data, + }, + }, + }; + /* register regulator 1 device */ + platform_device_register(&regulator_devices[0]); + + /* register regulator 2 device */ + platform_device_register(&regulator_devices[1]); diff --git a/Documentation/power/regulator/machine.txt b/Documentation/power/regulator/machine.txt deleted file mode 100644 index eff4dcaaa252..000000000000 --- 
a/Documentation/power/regulator/machine.txt +++ /dev/null @@ -1,96 +0,0 @@ -Regulator Machine Driver Interface -=================================== - -The regulator machine driver interface is intended for board/machine specific -initialisation code to configure the regulator subsystem. - -Consider the following machine :- - - Regulator-1 -+-> Regulator-2 --> [Consumer A @ 1.8 - 2.0V] - | - +-> [Consumer B @ 3.3V] - -The drivers for consumers A & B must be mapped to the correct regulator in -order to control their power supplies. This mapping can be achieved in machine -initialisation code by creating a struct regulator_consumer_supply for -each regulator. - -struct regulator_consumer_supply { - const char *dev_name; /* consumer dev_name() */ - const char *supply; /* consumer supply - e.g. "vcc" */ -}; - -e.g. for the machine above - -static struct regulator_consumer_supply regulator1_consumers[] = { - REGULATOR_SUPPLY("Vcc", "consumer B"), -}; - -static struct regulator_consumer_supply regulator2_consumers[] = { - REGULATOR_SUPPLY("Vcc", "consumer A"), -}; - -This maps Regulator-1 to the 'Vcc' supply for Consumer B and maps Regulator-2 -to the 'Vcc' supply for Consumer A. - -Constraints can now be registered by defining a struct regulator_init_data -for each regulator power domain. This structure also maps the consumers -to their supply regulators :- - -static struct regulator_init_data regulator1_data = { - .constraints = { - .name = "Regulator-1", - .min_uV = 3300000, - .max_uV = 3300000, - .valid_modes_mask = REGULATOR_MODE_NORMAL, - }, - .num_consumer_supplies = ARRAY_SIZE(regulator1_consumers), - .consumer_supplies = regulator1_consumers, -}; - -The name field should be set to something that is usefully descriptive -for the board for configuration of supplies for other regulators and -for use in logging and other diagnostic output. Normally the name -used for the supply rail in the schematic is a good choice. 
If no -name is provided then the subsystem will choose one. - -Regulator-1 supplies power to Regulator-2. This relationship must be registered -with the core so that Regulator-1 is also enabled when Consumer A enables its -supply (Regulator-2). The supply regulator is set by the supply_regulator -field below and co:- - -static struct regulator_init_data regulator2_data = { - .supply_regulator = "Regulator-1", - .constraints = { - .min_uV = 1800000, - .max_uV = 2000000, - .valid_ops_mask = REGULATOR_CHANGE_VOLTAGE, - .valid_modes_mask = REGULATOR_MODE_NORMAL, - }, - .num_consumer_supplies = ARRAY_SIZE(regulator2_consumers), - .consumer_supplies = regulator2_consumers, -}; - -Finally the regulator devices must be registered in the usual manner. - -static struct platform_device regulator_devices[] = { - { - .name = "regulator", - .id = DCDC_1, - .dev = { - .platform_data = &regulator1_data, - }, - }, - { - .name = "regulator", - .id = DCDC_2, - .dev = { - .platform_data = &regulator2_data, - }, - }, -}; -/* register regulator 1 device */ -platform_device_register(&regulator_devices[0]); - -/* register regulator 2 device */ -platform_device_register(&regulator_devices[1]); diff --git a/Documentation/power/regulator/overview.rst b/Documentation/power/regulator/overview.rst new file mode 100644 index 000000000000..ee494c70a7c4 --- /dev/null +++ b/Documentation/power/regulator/overview.rst @@ -0,0 +1,178 @@ +============================================= +Linux voltage and current regulator framework +============================================= + +About +===== + +This framework is designed to provide a standard kernel interface to control +voltage and current regulators. + +The intention is to allow systems to dynamically control regulator power output +in order to save power and prolong battery life. This applies to both voltage +regulators (where voltage output is controllable) and current sinks (where +current limit is controllable). + +(C) 2008 Wolfson Microelectronics PLC. 
+ +Author: Liam Girdwood + + +Nomenclature +============ + +Some terms used in this document: + + - Regulator + - Electronic device that supplies power to other devices. + Most regulators can enable and disable their output while + some can control their output voltage and or current. + + Input Voltage -> Regulator -> Output Voltage + + + - PMIC + - Power Management IC. An IC that contains numerous + regulators and often contains other subsystems. + + + - Consumer + - Electronic device that is supplied power by a regulator. + Consumers can be classified into two types:- + + Static: consumer does not change its supply voltage or + current limit. It only needs to enable or disable its + power supply. Its supply voltage is set by the hardware, + bootloader, firmware or kernel board initialisation code. + + Dynamic: consumer needs to change its supply voltage or + current limit to meet operation demands. + + + - Power Domain + - Electronic circuit that is supplied its input power by the + output power of a regulator, switch or by another power + domain. + + The supply regulator may be behind a switch(s). i.e.:: + + Regulator -+-> Switch-1 -+-> Switch-2 --> [Consumer A] + | | + | +-> [Consumer B], [Consumer C] + | + +-> [Consumer D], [Consumer E] + + That is one regulator and three power domains: + + - Domain 1: Switch-1, Consumers D & E. + - Domain 2: Switch-2, Consumers B & C. + - Domain 3: Consumer A. + + and this represents a "supplies" relationship: + + Domain-1 --> Domain-2 --> Domain-3. + + A power domain may have regulators that are supplied power + by other regulators. i.e.:: + + Regulator-1 -+-> Regulator-2 -+-> [Consumer A] + | + +-> [Consumer B] + + This gives us two regulators and two power domains: + + - Domain 1: Regulator-2, Consumer B. + - Domain 2: Consumer A. + + and a "supplies" relationship: + + Domain-1 --> Domain-2 + + + - Constraints + - Constraints are used to define power levels for performance + and hardware protection. 
Constraints exist at three levels: + + Regulator Level: This is defined by the regulator hardware + operating parameters and is specified in the regulator + datasheet. i.e. + + - voltage output is in the range 800mV -> 3500mV. + - regulator current output limit is 20mA @ 5V but is + 10mA @ 10V. + + Power Domain Level: This is defined in software by kernel + level board initialisation code. It is used to constrain a + power domain to a particular power range. i.e. + + - Domain-1 voltage is 3300mV + - Domain-2 voltage is 1400mV -> 1600mV + - Domain-3 current limit is 0mA -> 20mA. + + Consumer Level: This is defined by consumer drivers + dynamically setting voltage or current limit levels. + + e.g. a consumer backlight driver asks for a current increase + from 5mA to 10mA to increase LCD illumination. This passes + to through the levels as follows :- + + Consumer: need to increase LCD brightness. Lookup and + request next current mA value in brightness table (the + consumer driver could be used on several different + personalities based upon the same reference device). + + Power Domain: is the new current limit within the domain + operating limits for this domain and system state (e.g. + battery power, USB power) + + Regulator Domains: is the new current limit within the + regulator operating parameters for input/output voltage. + + If the regulator request passes all the constraint tests + then the new regulator value is applied. + + +Design +====== + +The framework is designed and targeted at SoC based devices but may also be +relevant to non SoC devices and is split into the following four interfaces:- + + + 1. Consumer driver interface. + + This uses a similar API to the kernel clock interface in that consumer + drivers can get and put a regulator (like they can with clocks atm) and + get/set voltage, current limit, mode, enable and disable. This should + allow consumers complete control over their supply voltage and current + limit. 
This also compiles out if not in use so drivers can be reused in + systems with no regulator based power control. + + See Documentation/power/regulator/consumer.rst + + 2. Regulator driver interface. + + This allows regulator drivers to register their regulators and provide + operations to the core. It also has a notifier call chain for propagating + regulator events to clients. + + See Documentation/power/regulator/regulator.rst + + 3. Machine interface. + + This interface is for machine specific code and allows the creation of + voltage/current domains (with constraints) for each regulator. It can + provide regulator constraints that will prevent device damage through + overvoltage or overcurrent caused by buggy client drivers. It also + allows the creation of a regulator tree whereby some regulators are + supplied by others (similar to a clock tree). + + See Documentation/power/regulator/machine.rst + + 4. Userspace ABI. + + The framework also exports a lot of useful voltage/current/opmode data to + userspace via sysfs. This could be used to help monitor device power + consumption and status. + + See Documentation/ABI/testing/sysfs-class-regulator diff --git a/Documentation/power/regulator/overview.txt b/Documentation/power/regulator/overview.txt deleted file mode 100644 index 721b4739ec32..000000000000 --- a/Documentation/power/regulator/overview.txt +++ /dev/null @@ -1,171 +0,0 @@ -Linux voltage and current regulator framework -============================================= - -About -===== - -This framework is designed to provide a standard kernel interface to control -voltage and current regulators. - -The intention is to allow systems to dynamically control regulator power output -in order to save power and prolong battery life. This applies to both voltage -regulators (where voltage output is controllable) and current sinks (where -current limit is controllable). - -(C) 2008 Wolfson Microelectronics PLC. 
-Author: Liam Girdwood - - -Nomenclature -============ - -Some terms used in this document:- - - o Regulator - Electronic device that supplies power to other devices. - Most regulators can enable and disable their output while - some can control their output voltage and or current. - - Input Voltage -> Regulator -> Output Voltage - - - o PMIC - Power Management IC. An IC that contains numerous regulators - and often contains other subsystems. - - - o Consumer - Electronic device that is supplied power by a regulator. - Consumers can be classified into two types:- - - Static: consumer does not change its supply voltage or - current limit. It only needs to enable or disable its - power supply. Its supply voltage is set by the hardware, - bootloader, firmware or kernel board initialisation code. - - Dynamic: consumer needs to change its supply voltage or - current limit to meet operation demands. - - - o Power Domain - Electronic circuit that is supplied its input power by the - output power of a regulator, switch or by another power - domain. - - The supply regulator may be behind a switch(s). i.e. - - Regulator -+-> Switch-1 -+-> Switch-2 --> [Consumer A] - | | - | +-> [Consumer B], [Consumer C] - | - +-> [Consumer D], [Consumer E] - - That is one regulator and three power domains: - - Domain 1: Switch-1, Consumers D & E. - Domain 2: Switch-2, Consumers B & C. - Domain 3: Consumer A. - - and this represents a "supplies" relationship: - - Domain-1 --> Domain-2 --> Domain-3. - - A power domain may have regulators that are supplied power - by other regulators. i.e. - - Regulator-1 -+-> Regulator-2 -+-> [Consumer A] - | - +-> [Consumer B] - - This gives us two regulators and two power domains: - - Domain 1: Regulator-2, Consumer B. - Domain 2: Consumer A. - - and a "supplies" relationship: - - Domain-1 --> Domain-2 - - - o Constraints - Constraints are used to define power levels for performance - and hardware protection. 
Constraints exist at three levels: - - Regulator Level: This is defined by the regulator hardware - operating parameters and is specified in the regulator - datasheet. i.e. - - - voltage output is in the range 800mV -> 3500mV. - - regulator current output limit is 20mA @ 5V but is - 10mA @ 10V. - - Power Domain Level: This is defined in software by kernel - level board initialisation code. It is used to constrain a - power domain to a particular power range. i.e. - - - Domain-1 voltage is 3300mV - - Domain-2 voltage is 1400mV -> 1600mV - - Domain-3 current limit is 0mA -> 20mA. - - Consumer Level: This is defined by consumer drivers - dynamically setting voltage or current limit levels. - - e.g. a consumer backlight driver asks for a current increase - from 5mA to 10mA to increase LCD illumination. This passes - to through the levels as follows :- - - Consumer: need to increase LCD brightness. Lookup and - request next current mA value in brightness table (the - consumer driver could be used on several different - personalities based upon the same reference device). - - Power Domain: is the new current limit within the domain - operating limits for this domain and system state (e.g. - battery power, USB power) - - Regulator Domains: is the new current limit within the - regulator operating parameters for input/output voltage. - - If the regulator request passes all the constraint tests - then the new regulator value is applied. - - -Design -====== - -The framework is designed and targeted at SoC based devices but may also be -relevant to non SoC devices and is split into the following four interfaces:- - - - 1. Consumer driver interface. - - This uses a similar API to the kernel clock interface in that consumer - drivers can get and put a regulator (like they can with clocks atm) and - get/set voltage, current limit, mode, enable and disable. This should - allow consumers complete control over their supply voltage and current - limit. 
This also compiles out if not in use so drivers can be reused in - systems with no regulator based power control. - - See Documentation/power/regulator/consumer.txt - - 2. Regulator driver interface. - - This allows regulator drivers to register their regulators and provide - operations to the core. It also has a notifier call chain for propagating - regulator events to clients. - - See Documentation/power/regulator/regulator.txt - - 3. Machine interface. - - This interface is for machine specific code and allows the creation of - voltage/current domains (with constraints) for each regulator. It can - provide regulator constraints that will prevent device damage through - overvoltage or overcurrent caused by buggy client drivers. It also - allows the creation of a regulator tree whereby some regulators are - supplied by others (similar to a clock tree). - - See Documentation/power/regulator/machine.txt - - 4. Userspace ABI. - - The framework also exports a lot of useful voltage/current/opmode data to - userspace via sysfs. This could be used to help monitor device power - consumption and status. - - See Documentation/ABI/testing/sysfs-class-regulator diff --git a/Documentation/power/regulator/regulator.rst b/Documentation/power/regulator/regulator.rst new file mode 100644 index 000000000000..794b3256fbb9 --- /dev/null +++ b/Documentation/power/regulator/regulator.rst @@ -0,0 +1,32 @@ +========================== +Regulator Driver Interface +========================== + +The regulator driver interface is relatively simple and designed to allow +regulator drivers to register their services with the core framework. + + +Registration +============ + +Drivers can register a regulator by calling:: + + struct regulator_dev *regulator_register(struct regulator_desc *regulator_desc, + const struct regulator_config *config); + +This will register the regulator's capabilities and operations to the regulator +core. 
+ +Regulators can be unregistered by calling:: + + void regulator_unregister(struct regulator_dev *rdev); + + +Regulator Events +================ + +Regulators can send events (e.g. overtemperature, undervoltage, etc) to +consumer drivers by calling:: + + int regulator_notifier_call_chain(struct regulator_dev *rdev, + unsigned long event, void *data); diff --git a/Documentation/power/regulator/regulator.txt b/Documentation/power/regulator/regulator.txt deleted file mode 100644 index b17e5833ce21..000000000000 --- a/Documentation/power/regulator/regulator.txt +++ /dev/null @@ -1,30 +0,0 @@ -Regulator Driver Interface -========================== - -The regulator driver interface is relatively simple and designed to allow -regulator drivers to register their services with the core framework. - - -Registration -============ - -Drivers can register a regulator by calling :- - -struct regulator_dev *regulator_register(struct regulator_desc *regulator_desc, - const struct regulator_config *config); - -This will register the regulator's capabilities and operations to the regulator -core. - -Regulators can be unregistered by calling :- - -void regulator_unregister(struct regulator_dev *rdev); - - -Regulator Events -================ -Regulators can send events (e.g. overtemperature, undervoltage, etc) to -consumer drivers by calling :- - -int regulator_notifier_call_chain(struct regulator_dev *rdev, - unsigned long event, void *data); diff --git a/Documentation/power/runtime_pm.rst b/Documentation/power/runtime_pm.rst new file mode 100644 index 000000000000..2c2ec99b5088 --- /dev/null +++ b/Documentation/power/runtime_pm.rst @@ -0,0 +1,940 @@ +================================================== +Runtime Power Management Framework for I/O Devices +================================================== + +(C) 2009-2011 Rafael J. Wysocki , Novell Inc. + +(C) 2010 Alan Stern + +(C) 2014 Intel Corp., Rafael J. Wysocki + +1. 
Introduction +=============== + +Support for runtime power management (runtime PM) of I/O devices is provided +at the power management core (PM core) level by means of: + +* The power management workqueue pm_wq in which bus types and device drivers can + put their PM-related work items. It is strongly recommended that pm_wq be + used for queuing all work items related to runtime PM, because this allows + them to be synchronized with system-wide power transitions (suspend to RAM, + hibernation and resume from system sleep states). pm_wq is declared in + include/linux/pm_runtime.h and defined in kernel/power/main.c. + +* A number of runtime PM fields in the 'power' member of 'struct device' (which + is of the type 'struct dev_pm_info', defined in include/linux/pm.h) that can + be used for synchronizing runtime PM operations with one another. + +* Three device runtime PM callbacks in 'struct dev_pm_ops' (defined in + include/linux/pm.h). + +* A set of helper functions defined in drivers/base/power/runtime.c that can be + used for carrying out runtime PM operations in such a way that the + synchronization between them is taken care of by the PM core. Bus types and + device drivers are encouraged to use these functions. + +The runtime PM callbacks present in 'struct dev_pm_ops', the device runtime PM +fields of 'struct dev_pm_info' and the core helper functions provided for +runtime PM are described below. + +2. Device Runtime PM Callbacks +============================== + +There are three device runtime PM callbacks defined in 'struct dev_pm_ops':: + + struct dev_pm_ops { + ... + int (*runtime_suspend)(struct device *dev); + int (*runtime_resume)(struct device *dev); + int (*runtime_idle)(struct device *dev); + ... + }; + +The ->runtime_suspend(), ->runtime_resume() and ->runtime_idle() callbacks +are executed by the PM core for the device's subsystem that may be either of +the following: + + 1. 
PM domain of the device, if the device's PM domain object, dev->pm_domain, + is present. + + 2. Device type of the device, if both dev->type and dev->type->pm are present. + + 3. Device class of the device, if both dev->class and dev->class->pm are + present. + + 4. Bus type of the device, if both dev->bus and dev->bus->pm are present. + +If the subsystem chosen by applying the above rules doesn't provide the relevant +callback, the PM core will invoke the corresponding driver callback stored in +dev->driver->pm directly (if present). + +The PM core always checks which callback to use in the order given above, so the +priority order of callbacks from high to low is: PM domain, device type, class +and bus type. Moreover, the high-priority one will always take precedence over +a low-priority one. The PM domain, bus type, device type and class callbacks +are referred to as subsystem-level callbacks in what follows. + +By default, the callbacks are always invoked in process context with interrupts +enabled. However, the pm_runtime_irq_safe() helper function can be used to tell +the PM core that it is safe to run the ->runtime_suspend(), ->runtime_resume() +and ->runtime_idle() callbacks for the given device in atomic context with +interrupts disabled. This implies that the callback routines in question must +not block or sleep, but it also means that the synchronous helper functions +listed at the end of Section 4 may be used for that device within an interrupt +handler or generally in an atomic context. + +The subsystem-level suspend callback, if present, is _entirely_ _responsible_ +for handling the suspend of the device as appropriate, which may, but need not +include executing the device driver's own ->runtime_suspend() callback (from the +PM core's point of view it is not necessary to implement a ->runtime_suspend() +callback in a device driver as long as the subsystem-level suspend callback +knows what to do to handle the device). 
+ + * Once the subsystem-level suspend callback (or the driver suspend callback, + if invoked directly) has completed successfully for the given device, the PM + core regards the device as suspended, which need not mean that it has been + put into a low power state. It is supposed to mean, however, that the + device will not process data and will not communicate with the CPU(s) and + RAM until the appropriate resume callback is executed for it. The runtime + PM status of a device after successful execution of the suspend callback is + 'suspended'. + + * If the suspend callback returns -EBUSY or -EAGAIN, the device's runtime PM + status remains 'active', which means that the device _must_ be fully + operational afterwards. + + * If the suspend callback returns an error code different from -EBUSY and + -EAGAIN, the PM core regards this as a fatal error and will refuse to run + the helper functions described in Section 4 for the device until its status + is directly set to either 'active', or 'suspended' (the PM core provides + special helper functions for this purpose). + +In particular, if the driver requires remote wakeup capability (i.e. hardware +mechanism allowing the device to request a change of its power state, such as +PCI PME) for proper functioning and device_can_wakeup() returns 'false' for the +device, then ->runtime_suspend() should return -EBUSY. On the other hand, if +device_can_wakeup() returns 'true' for the device and the device is put into a +low-power state during the execution of the suspend callback, it is expected +that remote wakeup will be enabled for the device. Generally, remote wakeup +should be enabled for all input devices put into low-power states at run time. 
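The return-code rules for the suspend callback listed above can be illustrated with a small userspace model. This is only a sketch: `struct suspend_model`, `suspend_model_apply()` and `suspend_model_helpers_usable()` are hypothetical names invented for this illustration and are not kernel APIs. The model assumes that any non-zero result other than -EBUSY and -EAGAIN counts as a fatal error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy userspace model; none of these names exist in the kernel. */
enum suspend_model_status { SM_ACTIVE, SM_SUSPENDED };

struct suspend_model {
	enum suspend_model_status status; /* stands in for the runtime PM status */
	int runtime_error;                /* stands in for power.runtime_error */
};

/*
 * Apply the return value of a suspend callback to the modeled device.
 * Returns true if the device ended up 'suspended'.
 */
bool suspend_model_apply(struct suspend_model *dev, int ret)
{
	/* The suspend callback is only run for 'active' devices. */
	assert(dev->status == SM_ACTIVE);

	if (ret == 0) {
		dev->status = SM_SUSPENDED;
		return true;
	}

	/* -EBUSY and -EAGAIN leave the status 'active'; a retry is allowed. */
	if (ret == -EBUSY || ret == -EAGAIN)
		return false;

	/* Any other error is fatal and is remembered by the model. */
	dev->runtime_error = ret;
	return false;
}

/* The helper functions are usable only while no fatal error is recorded. */
bool suspend_model_helpers_usable(const struct suspend_model *dev)
{
	return dev->runtime_error == 0;
}
```

In this model a device that returned, say, -EIO stays unusable until its status is set explicitly, which mirrors the role of the special helper functions mentioned above.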
+ +The subsystem-level resume callback, if present, is **entirely responsible** for +handling the resume of the device as appropriate, which may, but need not +include executing the device driver's own ->runtime_resume() callback (from the +PM core's point of view it is not necessary to implement a ->runtime_resume() +callback in a device driver as long as the subsystem-level resume callback knows +what to do to handle the device). + + * Once the subsystem-level resume callback (or the driver resume callback, if + invoked directly) has completed successfully, the PM core regards the device + as fully operational, which means that the device _must_ be able to complete + I/O operations as needed. The runtime PM status of the device is then + 'active'. + + * If the resume callback returns an error code, the PM core regards this as a + fatal error and will refuse to run the helper functions described in Section + 4 for the device, until its status is directly set to either 'active', or + 'suspended' (by means of special helper functions provided by the PM core + for this purpose). + +The idle callback (a subsystem-level one, if present, or the driver one) is +executed by the PM core whenever the device appears to be idle, which is +indicated to the PM core by two counters, the device's usage counter and the +counter of 'active' children of the device. + + * If any of these counters is decreased using a helper function provided by + the PM core and it turns out to be equal to zero, the other counter is + checked. If that counter also is equal to zero, the PM core executes the + idle callback with the device as its argument. + +The action performed by the idle callback is totally dependent on the subsystem +(or driver) in question, but the expected and recommended action is to check +if the device can be suspended (i.e. if all of the conditions necessary for +suspending the device are satisfied) and to queue up a suspend request for the +device in that case. 
If there is no idle callback, or if the callback returns +0, then the PM core will attempt to carry out a runtime suspend of the device, +also respecting devices configured for autosuspend. In essence this means a +call to pm_runtime_autosuspend() (do note that drivers need to update the +device last busy mark, pm_runtime_mark_last_busy(), to control the delay under +this circumstance). To prevent this (for example, if the callback routine has +started a delayed suspend), the routine must return a non-zero value. Negative +error return codes are ignored by the PM core. + +The helper functions provided by the PM core, described in Section 4, guarantee +that the following constraints are met with respect to runtime PM callbacks for +one device: + +(1) The callbacks are mutually exclusive (e.g. it is forbidden to execute + ->runtime_suspend() in parallel with ->runtime_resume() or with another + instance of ->runtime_suspend() for the same device) with the exception that + ->runtime_suspend() or ->runtime_resume() can be executed in parallel with + ->runtime_idle() (although ->runtime_idle() will not be started while any + of the other callbacks is being executed for the same device). + +(2) ->runtime_idle() and ->runtime_suspend() can only be executed for 'active' + devices (i.e. the PM core will only execute ->runtime_idle() or + ->runtime_suspend() for the devices the runtime PM status of which is + 'active'). + +(3) ->runtime_idle() and ->runtime_suspend() can only be executed for a device + the usage counter of which is equal to zero _and_ either the counter of + 'active' children of which is equal to zero, or the 'power.ignore_children' + flag of which is set. + +(4) ->runtime_resume() can only be executed for 'suspended' devices (i.e. the + PM core will only execute ->runtime_resume() for the devices the runtime + PM status of which is 'suspended').
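Constraints (2) to (4) above amount to simple predicates over the device state. The following userspace sketch expresses them that way. `struct gate_model` and both functions are hypothetical names made up for this illustration. Constraint (1), mutual exclusion, is a locking matter and is not modeled here.

```c
#include <stdbool.h>

/* Toy userspace model; none of these names exist in the kernel. */
enum gate_status { GATE_ACTIVE, GATE_SUSPENDED };

struct gate_model {
	enum gate_status status; /* runtime PM status */
	int usage_count;         /* usage counter of the device */
	int child_count;         /* counter of 'active' children */
	bool ignore_children;    /* stands in for power.ignore_children */
};

/* Constraints (2) and (3): when idle or suspend callbacks may run. */
bool gate_may_idle_or_suspend(const struct gate_model *dev)
{
	if (dev->status != GATE_ACTIVE)
		return false;
	if (dev->usage_count != 0)
		return false;
	return dev->child_count == 0 || dev->ignore_children;
}

/* Constraint (4): the resume callback only runs for 'suspended' devices. */
bool gate_may_resume(const struct gate_model *dev)
{
	return dev->status == GATE_SUSPENDED;
}
```

Note that an active child blocks the parent's suspend in this model unless `ignore_children` is set, regardless of the parent's own usage counter being zero.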
+ +Additionally, the helper functions provided by the PM core obey the following +rules: + + * If ->runtime_suspend() is about to be executed or there's a pending request + to execute it, ->runtime_idle() will not be executed for the same device. + + * A request to execute or to schedule the execution of ->runtime_suspend() + will cancel any pending requests to execute ->runtime_idle() for the same + device. + + * If ->runtime_resume() is about to be executed or there's a pending request + to execute it, the other callbacks will not be executed for the same device. + + * A request to execute ->runtime_resume() will cancel any pending or + scheduled requests to execute the other callbacks for the same device, + except for scheduled autosuspends. + +3. Runtime PM Device Fields +=========================== + +The following device runtime PM fields are present in 'struct dev_pm_info', as +defined in include/linux/pm.h: + + `struct timer_list suspend_timer;` + - timer used for scheduling (delayed) suspend and autosuspend requests + + `unsigned long timer_expires;` + - timer expiration time, in jiffies (if this is different from zero, the + timer is running and will expire at that time, otherwise the timer is not + running) + + `struct work_struct work;` + - work structure used for queuing up requests (i.e. work items in pm_wq) + + `wait_queue_head_t wait_queue;` + - wait queue used if any of the helper functions needs to wait for another + one to complete + + `spinlock_t lock;` + - lock used for synchronization + + `atomic_t usage_count;` + - the usage counter of the device + + `atomic_t child_count;` + - the count of 'active' children of the device + + `unsigned int ignore_children;` + - if set, the value of child_count is ignored (but still updated) + + `unsigned int disable_depth;` + - used for disabling the helper functions (they work normally if this is + equal to zero); the initial value of it is 1 (i.e. 
runtime PM is + initially disabled for all devices) + + `int runtime_error;` + - if set, there was a fatal error (one of the callbacks returned error code + as described in Section 2), so the helper functions will not work until + this flag is cleared; this is the error code returned by the failing + callback + + `unsigned int idle_notification;` + - if set, ->runtime_idle() is being executed + + `unsigned int request_pending;` + - if set, there's a pending request (i.e. a work item queued up into pm_wq) + + `enum rpm_request request;` + - type of request that's pending (valid if request_pending is set) + + `unsigned int deferred_resume;` + - set if ->runtime_resume() is about to be run while ->runtime_suspend() is + being executed for that device and it is not practical to wait for the + suspend to complete; means "start a resume as soon as you've suspended" + + `enum rpm_status runtime_status;` + - the runtime PM status of the device; this field's initial value is + RPM_SUSPENDED, which means that each device is initially regarded by the + PM core as 'suspended', regardless of its real hardware status + + `unsigned int runtime_auto;` + - if set, indicates that the user space has allowed the device driver to + power manage the device at run time via the /sys/devices/.../power/control + interface; it may only be modified with the help of the pm_runtime_allow() + and pm_runtime_forbid() helper functions + + `unsigned int no_callbacks;` + - indicates that the device does not use the runtime PM callbacks (see + Section 8); it may be modified only by the pm_runtime_no_callbacks() + helper function + + `unsigned int irq_safe;` + - indicates that the ->runtime_suspend() and ->runtime_resume() callbacks + will be invoked with the spinlock held and interrupts disabled + + `unsigned int use_autosuspend;` + - indicates that the device's driver supports delayed autosuspend (see + Section 9); it may be modified only by the + pm_runtime{_dont}_use_autosuspend() helper
functions + + `unsigned int timer_autosuspends;` + - indicates that the PM core should attempt to carry out an autosuspend + when the timer expires rather than a normal suspend + + `int autosuspend_delay;` + - the delay time (in milliseconds) to be used for autosuspend + + `unsigned long last_busy;` + - the time (in jiffies) when the pm_runtime_mark_last_busy() helper + function was last called for this device; used in calculating inactivity + periods for autosuspend + +All of the above fields are members of the 'power' member of 'struct device'. + +4. Runtime PM Device Helper Functions +===================================== + +The following runtime PM helper functions are defined in +drivers/base/power/runtime.c and include/linux/pm_runtime.h: + + `void pm_runtime_init(struct device *dev);` + - initialize the device runtime PM fields in 'struct dev_pm_info' + + `void pm_runtime_remove(struct device *dev);` + - make sure that the runtime PM of the device will be disabled after + removing the device from device hierarchy + + `int pm_runtime_idle(struct device *dev);` + - execute the subsystem-level idle callback for the device; returns an + error code on failure, where -EINPROGRESS means that ->runtime_idle() is + already being executed; if there is no callback or the callback returns 0 + then run pm_runtime_autosuspend(dev) and return its result + + `int pm_runtime_suspend(struct device *dev);` + - execute the subsystem-level suspend callback for the device; returns 0 on + success, 1 if the device's runtime PM status was already 'suspended', or + error code on failure, where -EAGAIN or -EBUSY means it is safe to attempt + to suspend the device again in future and -EACCES means that + 'power.disable_depth' is different from 0 + + `int pm_runtime_autosuspend(struct device *dev);` + - same as pm_runtime_suspend() except that the autosuspend delay is taken + into account; if pm_runtime_autosuspend_expiration() says the delay has + not yet expired then an autosuspend
is scheduled for the appropriate time + and 0 is returned + + `int pm_runtime_resume(struct device *dev);` + - execute the subsystem-level resume callback for the device; returns 0 on + success, 1 if the device's runtime PM status was already 'active' or + error code on failure, where -EAGAIN means it may be safe to attempt to + resume the device again in future, but 'power.runtime_error' should be + checked additionally, and -EACCES means that 'power.disable_depth' is + different from 0 + + `int pm_request_idle(struct device *dev);` + - submit a request to execute the subsystem-level idle callback for the + device (the request is represented by a work item in pm_wq); returns 0 on + success or error code if the request has not been queued up + + `int pm_request_autosuspend(struct device *dev);` + - schedule the execution of the subsystem-level suspend callback for the + device when the autosuspend delay has expired; if the delay has already + expired then the work item is queued up immediately + + `int pm_schedule_suspend(struct device *dev, unsigned int delay);` + - schedule the execution of the subsystem-level suspend callback for the + device in future, where 'delay' is the time to wait before queuing up a + suspend work item in pm_wq, in milliseconds (if 'delay' is zero, the work + item is queued up immediately); returns 0 on success, 1 if the device's PM + runtime status was already 'suspended', or error code if the request + hasn't been scheduled (or queued up if 'delay' is 0); if the execution of + ->runtime_suspend() is already scheduled and not yet expired, the new + value of 'delay' will be used as the time to wait + + `int pm_request_resume(struct device *dev);` + - submit a request to execute the subsystem-level resume callback for the + device (the request is represented by a work item in pm_wq); returns 0 on + success, 1 if the device's runtime PM status was already 'active', or + error code if the request hasn't been queued up + + `void 
pm_runtime_get_noresume(struct device *dev);` + - increment the device's usage counter + + `int pm_runtime_get(struct device *dev);` + - increment the device's usage counter, run pm_request_resume(dev) and + return its result + + `int pm_runtime_get_sync(struct device *dev);` + - increment the device's usage counter, run pm_runtime_resume(dev) and + return its result + + `int pm_runtime_get_if_in_use(struct device *dev);` + - return -EINVAL if 'power.disable_depth' is nonzero; otherwise, if the + runtime PM status is RPM_ACTIVE and the runtime PM usage counter is + nonzero, increment the counter and return 1; otherwise return 0 without + changing the counter + + `void pm_runtime_put_noidle(struct device *dev);` + - decrement the device's usage counter + + `int pm_runtime_put(struct device *dev);` + - decrement the device's usage counter; if the result is 0 then run + pm_request_idle(dev) and return its result + + `int pm_runtime_put_autosuspend(struct device *dev);` + - decrement the device's usage counter; if the result is 0 then run + pm_request_autosuspend(dev) and return its result + + `int pm_runtime_put_sync(struct device *dev);` + - decrement the device's usage counter; if the result is 0 then run + pm_runtime_idle(dev) and return its result + + `int pm_runtime_put_sync_suspend(struct device *dev);` + - decrement the device's usage counter; if the result is 0 then run + pm_runtime_suspend(dev) and return its result + + `int pm_runtime_put_sync_autosuspend(struct device *dev);` + - decrement the device's usage counter; if the result is 0 then run + pm_runtime_autosuspend(dev) and return its result + + `void pm_runtime_enable(struct device *dev);` + - decrement the device's 'power.disable_depth' field; if that field is equal + to zero, the runtime PM helper functions can execute subsystem-level + callbacks described in Section 2 for the device + + `int pm_runtime_disable(struct device *dev);` + - increment the device's 'power.disable_depth' field (if the value 
of that + field was previously zero, this prevents subsystem-level runtime PM + callbacks from being run for the device), make sure that all of the + pending runtime PM operations on the device are either completed or + canceled; returns 1 if there was a resume request pending and it was + necessary to execute the subsystem-level resume callback for the device + to satisfy that request, otherwise 0 is returned + + `int pm_runtime_barrier(struct device *dev);` + - check if there's a resume request pending for the device and resume it + (synchronously) in that case, cancel any other pending runtime PM requests + regarding it and wait for all runtime PM operations on it in progress to + complete; returns 1 if there was a resume request pending and it was + necessary to execute the subsystem-level resume callback for the device to + satisfy that request, otherwise 0 is returned + + `void pm_suspend_ignore_children(struct device *dev, bool enable);` + - set/unset the power.ignore_children flag of the device + + `int pm_runtime_set_active(struct device *dev);` + - clear the device's 'power.runtime_error' flag, set the device's runtime + PM status to 'active' and update its parent's counter of 'active' + children as appropriate (it is only valid to use this function if + 'power.runtime_error' is set or 'power.disable_depth' is greater than + zero); it will fail and return error code if the device has a parent + which is not active and the 'power.ignore_children' flag of which is unset + + `void pm_runtime_set_suspended(struct device *dev);` + - clear the device's 'power.runtime_error' flag, set the device's runtime + PM status to 'suspended' and update its parent's counter of 'active' + children as appropriate (it is only valid to use this function if + 'power.runtime_error' is set or 'power.disable_depth' is greater than + zero) + + `bool pm_runtime_active(struct device *dev);` + - return true if the device's runtime PM status is 'active' or its + 'power.disable_depth' 
field is not equal to zero, or false otherwise + + `bool pm_runtime_suspended(struct device *dev);` + - return true if the device's runtime PM status is 'suspended' and its + 'power.disable_depth' field is equal to zero, or false otherwise + + `bool pm_runtime_status_suspended(struct device *dev);` + - return true if the device's runtime PM status is 'suspended' + + `void pm_runtime_allow(struct device *dev);` + - set the power.runtime_auto flag for the device and decrease its usage + counter (used by the /sys/devices/.../power/control interface to + effectively allow the device to be power managed at run time) + + `void pm_runtime_forbid(struct device *dev);` + - unset the power.runtime_auto flag for the device and increase its usage + counter (used by the /sys/devices/.../power/control interface to + effectively prevent the device from being power managed at run time) + + `void pm_runtime_no_callbacks(struct device *dev);` + - set the power.no_callbacks flag for the device and remove the runtime + PM attributes from /sys/devices/.../power (or prevent them from being + added when the device is registered) + + `void pm_runtime_irq_safe(struct device *dev);` + - set the power.irq_safe flag for the device, causing the runtime-PM + callbacks to be invoked with interrupts off + + `bool pm_runtime_is_irq_safe(struct device *dev);` + - return true if power.irq_safe flag was set for the device, causing + the runtime-PM callbacks to be invoked with interrupts off + + `void pm_runtime_mark_last_busy(struct device *dev);` + - set the power.last_busy field to the current time + + `void pm_runtime_use_autosuspend(struct device *dev);` + - set the power.use_autosuspend flag, enabling autosuspend delays; call + pm_runtime_get_sync if the flag was previously cleared and + power.autosuspend_delay is negative + + `void pm_runtime_dont_use_autosuspend(struct device *dev);` + - clear the power.use_autosuspend flag, disabling autosuspend delays; + decrement the device's usage counter 
if the flag was previously set and + power.autosuspend_delay is negative; call pm_runtime_idle + + `void pm_runtime_set_autosuspend_delay(struct device *dev, int delay);` + - set the power.autosuspend_delay value to 'delay' (expressed in + milliseconds); if 'delay' is negative then runtime suspends are + prevented; if power.use_autosuspend is set, pm_runtime_get_sync may be + called or the device's usage counter may be decremented and + pm_runtime_idle called depending on if power.autosuspend_delay is + changed to or from a negative value; if power.use_autosuspend is clear, + pm_runtime_idle is called + + `unsigned long pm_runtime_autosuspend_expiration(struct device *dev);` + - calculate the time when the current autosuspend delay period will expire, + based on power.last_busy and power.autosuspend_delay; if the delay time + is 1000 ms or larger then the expiration time is rounded up to the + nearest second; returns 0 if the delay period has already expired or + power.use_autosuspend isn't set, otherwise returns the expiration time + in jiffies + +It is safe to execute the following helper functions from interrupt context: + +- pm_request_idle() +- pm_request_autosuspend() +- pm_schedule_suspend() +- pm_request_resume() +- pm_runtime_get_noresume() +- pm_runtime_get() +- pm_runtime_put_noidle() +- pm_runtime_put() +- pm_runtime_put_autosuspend() +- pm_runtime_enable() +- pm_suspend_ignore_children() +- pm_runtime_set_active() +- pm_runtime_set_suspended() +- pm_runtime_suspended() +- pm_runtime_mark_last_busy() +- pm_runtime_autosuspend_expiration() + +If pm_runtime_irq_safe() has been called for a device then the following helper +functions may also be used in interrupt context: + +- pm_runtime_idle() +- pm_runtime_suspend() +- pm_runtime_autosuspend() +- pm_runtime_resume() +- pm_runtime_get_sync() +- pm_runtime_put_sync() +- pm_runtime_put_sync_suspend() +- pm_runtime_put_sync_autosuspend() + +5. 
Runtime PM Initialization, Device Probing and Removal +======================================================== + +Initially, the runtime PM is disabled for all devices, which means that the +majority of the runtime PM helper functions described in Section 4 will return +-EAGAIN until pm_runtime_enable() is called for the device. + +In addition to that, the initial runtime PM status of all devices is +'suspended', but it need not reflect the actual physical state of the device. +Thus, if the device is initially active (i.e. it is able to process I/O), its +runtime PM status must be changed to 'active', with the help of +pm_runtime_set_active(), before pm_runtime_enable() is called for the device. + +However, if the device has a parent and the parent's runtime PM is enabled, +calling pm_runtime_set_active() for the device will affect the parent, unless +the parent's 'power.ignore_children' flag is set. Namely, in that case the +parent won't be able to suspend at run time, using the PM core's helper +functions, as long as the child's status is 'active', even if the child's +runtime PM is still disabled (i.e. pm_runtime_enable() hasn't been called for +the child yet or pm_runtime_disable() has been called for it). For this reason, +once pm_runtime_set_active() has been called for the device, pm_runtime_enable() +should be called for it too as soon as reasonably possible or its runtime PM +status should be changed back to 'suspended' with the help of +pm_runtime_set_suspended(). + +If the default initial runtime PM status of the device (i.e. 'suspended') +reflects the actual state of the device, its bus type's or its driver's +->probe() callback will likely need to wake it up using one of the PM core's +helper functions described in Section 4. In that case, pm_runtime_resume() +should be used. Of course, for this purpose the device's runtime PM has to be +enabled earlier by calling pm_runtime_enable(). 
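The initialization rules described so far in this section can be pictured with a small userspace model. This is a sketch under stated assumptions, not kernel code: `struct init_model` and the `init_model_*()` functions are hypothetical names, the model ignores 'power.runtime_error' and the parent's 'power.ignore_children' flag, and the -EAGAIN value returned when the status is set while runtime PM is enabled is an assumption of the sketch.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy userspace model; none of these names exist in the kernel. */
enum init_status { INIT_SUSPENDED, INIT_ACTIVE };

struct init_model {
	unsigned int disable_depth;  /* non-zero means runtime PM is disabled */
	enum init_status status;     /* runtime PM status of the device */
	int parent_active_children;  /* parent's counter of 'active' children */
};

/* Initial state: runtime PM disabled, status 'suspended'. */
void init_model_reset(struct init_model *m)
{
	m->disable_depth = 1;
	m->status = INIT_SUSPENDED;
	m->parent_active_children = 0;
}

/* Set the status to 'active'; only valid while runtime PM is disabled. */
int init_model_set_active(struct init_model *m)
{
	if (m->disable_depth == 0)
		return -EAGAIN; /* assumed error value for this sketch */

	if (m->status != INIT_ACTIVE) {
		m->status = INIT_ACTIVE;
		m->parent_active_children++; /* the parent sees an active child */
	}
	return 0;
}

/* Counterpart of enabling runtime PM: drop one level of disabling. */
void init_model_enable(struct init_model *m)
{
	if (m->disable_depth > 0)
		m->disable_depth--;
}

/* The parent may suspend only while it has no 'active' children. */
bool init_model_parent_may_suspend(const struct init_model *m)
{
	return m->parent_active_children == 0;
}
```

The model shows why the status change and the enable step should stay close together: between the two calls the parent is already blocked from suspending, although the child's runtime PM is still disabled.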
+ +Note, if the device may execute pm_runtime calls during the probe (such as +if it registers with a subsystem that may call back in) then the +pm_runtime_get_sync() call paired with a pm_runtime_put() call will be +appropriate to ensure that the device is not put back to sleep during the +probe. This can happen with systems such as the network device layer. + +It may be desirable to suspend the device once ->probe() has finished. +Therefore the driver core uses the asynchronous pm_request_idle() to submit a +request to execute the subsystem-level idle callback for the device at that +time. A driver that makes use of the runtime autosuspend feature may want to +update the last busy mark before returning from ->probe(). + +Moreover, the driver core prevents runtime PM callbacks from racing with the bus +notifier callback in __device_release_driver(), which is necessary, because the +notifier is used by some subsystems to carry out operations affecting the +runtime PM functionality. It does so by calling pm_runtime_get_sync() before +driver_sysfs_remove() and the BUS_NOTIFY_UNBIND_DRIVER notifications. This +resumes the device if it's in the suspended state and prevents it from +being suspended again while those routines are being executed. + +To allow bus types and drivers to put devices into the suspended state by +calling pm_runtime_suspend() from their ->remove() routines, the driver core +executes pm_runtime_put_sync() after running the BUS_NOTIFY_UNBIND_DRIVER +notifications in __device_release_driver(). This requires bus types and +drivers to make their ->remove() callbacks avoid races with runtime PM directly, +but it also allows more flexibility in the handling of devices during the +removal of their drivers. + +In their ->remove() callbacks, drivers should undo the runtime PM changes done +in ->probe(). Usually this means calling pm_runtime_disable(), +pm_runtime_dont_use_autosuspend() etc.
+ +The user space can effectively disallow the driver of the device to power manage +it at run time by changing the value of its /sys/devices/.../power/control +attribute to "on", which causes pm_runtime_forbid() to be called. In principle, +this mechanism may also be used by the driver to effectively turn off the +runtime power management of the device until the user space turns it on. +Namely, during the initialization the driver can make sure that the runtime PM +status of the device is 'active' and call pm_runtime_forbid(). It should be +noted, however, that if the user space has already intentionally changed the +value of /sys/devices/.../power/control to "auto" to allow the driver to power +manage the device at run time, the driver may confuse it by using +pm_runtime_forbid() this way. + +6. Runtime PM and System Sleep +============================== + +Runtime PM and system sleep (i.e., system suspend and hibernation, also known +as suspend-to-RAM and suspend-to-disk) interact with each other in a couple of +ways. If a device is active when a system sleep starts, everything is +straightforward. But what should happen if the device is already suspended? + +The device may have different wake-up settings for runtime PM and system sleep. +For example, remote wake-up may be enabled for runtime suspend but disallowed +for system sleep (device_may_wakeup(dev) returns 'false'). When this happens, +the subsystem-level system suspend callback is responsible for changing the +device's wake-up setting (it may leave that to the device driver's system +suspend routine). It may be necessary to resume the device and suspend it again +in order to do so. The same is true if the driver uses different power levels +or other settings for runtime suspend and system sleep. + +During system resume, the simplest approach is to bring all devices back to full +power, even if they had been suspended before the system suspend began. 
There +are several reasons for this, including: + + * The device might need to switch power levels, wake-up settings, etc. + + * Remote wake-up events might have been lost by the firmware. + + * The device's children may need the device to be at full power in order + to resume themselves. + + * The driver's idea of the device state may not agree with the device's + physical state. This can happen during resume from hibernation. + + * The device might need to be reset. + + * Even though the device was suspended, if its usage counter was > 0 then most + likely it would need a runtime resume in the near future anyway. + +If the device had been suspended before the system suspend began and it's +brought back to full power during resume, then its runtime PM status will have +to be updated to reflect the actual post-system sleep status. The way to do +this is: + + - pm_runtime_disable(dev); + - pm_runtime_set_active(dev); + - pm_runtime_enable(dev); + +The PM core always increments the runtime usage counter before calling the +->suspend() callback and decrements it after calling the ->resume() callback. +Hence disabling runtime PM temporarily like this will not cause any runtime +suspend attempts to be permanently lost. If the usage count goes to zero +following the return of the ->resume() callback, the ->runtime_idle() callback +will be invoked as usual. + +On some systems, however, system sleep is not entered through a global firmware +or hardware operation. Instead, all hardware components are put into low-power +states directly by the kernel in a coordinated way. Then, the system sleep +state effectively follows from the states the hardware components end up in +and the system is woken up from that state by a hardware interrupt or a similar +mechanism entirely under the kernel's control. As a result, the kernel never +gives control away and the states of all devices during resume are precisely +known to it. 
If that is the case and none of the situations listed above takes +place (in particular, if the system is not waking up from hibernation), it may +be more efficient to leave the devices that had been suspended before the system +suspend began in the suspended state. + +To this end, the PM core provides a mechanism allowing some coordination between +different levels of device hierarchy. Namely, if a system suspend .prepare() +callback returns a positive number for a device, that indicates to the PM core +that the device appears to be runtime-suspended and its state is fine, so it +may be left in runtime suspend provided that all of its descendants are also +left in runtime suspend. If that happens, the PM core will not execute any +system suspend and resume callbacks for all of those devices, except for the +complete callback, which is then entirely responsible for handling the device +as appropriate. This only applies to system suspend transitions that are not +related to hibernation (see Documentation/driver-api/pm/devices.rst for more +information). + +The PM core does its best to reduce the probability of race conditions between +the runtime PM and system suspend/resume (and hibernation) callbacks by carrying +out the following operations: + + * During system suspend pm_runtime_get_noresume() is called for every device + right before executing the subsystem-level .prepare() callback for it and + pm_runtime_barrier() is called for every device right before executing the + subsystem-level .suspend() callback for it. In addition to that the PM core + calls __pm_runtime_disable() with 'false' as the second argument for every + device right before executing the subsystem-level .suspend_late() callback + for it. + + * During system resume pm_runtime_enable() and pm_runtime_put() are called for + every device right after executing the subsystem-level .resume_early() + callback and right after executing the subsystem-level .complete() callback + for it, respectively. 
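The status update sequence given earlier in this section (disable, set active, enable) can be modelled in user space as follows. This is only a sketch under stated assumptions: `struct device` and the three pm_runtime_*() functions are simplified stand-ins for the kernel helpers of the same names, the -1 error value is the model's own, the power.runtime_error case is ignored, and foo_resume() is a hypothetical system resume callback.

```c
enum rpm_status { RPM_ACTIVE, RPM_SUSPENDED };

/* Simplified stand-in for the runtime PM fields of struct device. */
struct device {
	int disable_depth;		/* models power.disable_depth */
	enum rpm_status runtime_status;	/* models power.runtime_status */
};

static void pm_runtime_disable(struct device *dev)
{
	dev->disable_depth++;
}

static void pm_runtime_enable(struct device *dev)
{
	if (dev->disable_depth > 0)
		dev->disable_depth--;
}

/*
 * Model: the status may be set directly only while runtime PM is
 * disabled.  Returns 0 on success, -1 (model-specific) otherwise.
 */
static int pm_runtime_set_active(struct device *dev)
{
	if (dev->disable_depth == 0)
		return -1;
	dev->runtime_status = RPM_ACTIVE;
	return 0;
}

/* Hypothetical system resume callback for a device now at full power. */
int foo_resume(struct device *dev)
{
	int ret;

	/* ... bring the hardware back to full power ... */
	pm_runtime_disable(dev);
	ret = pm_runtime_set_active(dev);
	pm_runtime_enable(dev);
	return ret;
}
```

Calling pm_runtime_set_active() without the surrounding disable/enable pair fails in this model, which is why the three calls appear together.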
+ +7. Generic subsystem callbacks +============================== + +Subsystems may wish to conserve code space by using the set of generic power +management callbacks provided by the PM core, defined in +drivers/base/power/generic_ops.c: + + `int pm_generic_runtime_suspend(struct device *dev);` + - invoke the ->runtime_suspend() callback provided by the driver of this + device and return its result, or return 0 if not defined + + `int pm_generic_runtime_resume(struct device *dev);` + - invoke the ->runtime_resume() callback provided by the driver of this + device and return its result, or return 0 if not defined + + `int pm_generic_suspend(struct device *dev);` + - if the device has not been suspended at run time, invoke the ->suspend() + callback provided by its driver and return its result, or return 0 if not + defined + + `int pm_generic_suspend_noirq(struct device *dev);` + - if pm_runtime_suspended(dev) returns "false", invoke the ->suspend_noirq() + callback provided by the device's driver and return its result, or return + 0 if not defined + + `int pm_generic_resume(struct device *dev);` + - invoke the ->resume() callback provided by the driver of this device and, + if successful, change the device's runtime PM status to 'active' + + `int pm_generic_resume_noirq(struct device *dev);` + - invoke the ->resume_noirq() callback provided by the driver of this device + + `int pm_generic_freeze(struct device *dev);` + - if the device has not been suspended at run time, invoke the ->freeze() + callback provided by its driver and return its result, or return 0 if not + defined + + `int pm_generic_freeze_noirq(struct device *dev);` + - if pm_runtime_suspended(dev) returns "false", invoke the ->freeze_noirq() + callback provided by the device's driver and return its result, or return + 0 if not defined + + `int pm_generic_thaw(struct device *dev);` + - if the device has not been suspended at run time, invoke the ->thaw() + callback provided by its driver and return its result, or return 0 if not + 
defined + + `int pm_generic_thaw_noirq(struct device *dev);` + - if pm_runtime_suspended(dev) returns "false", invoke the ->thaw_noirq() + callback provided by the device's driver and return its result, or return + 0 if not defined + + `int pm_generic_poweroff(struct device *dev);` + - if the device has not been suspended at run time, invoke the ->poweroff() + callback provided by its driver and return its result, or return 0 if not + defined + + `int pm_generic_poweroff_noirq(struct device *dev);` + - if pm_runtime_suspended(dev) returns "false", run the ->poweroff_noirq() + callback provided by the device's driver and return its result, or return + 0 if not defined + + `int pm_generic_restore(struct device *dev);` + - invoke the ->restore() callback provided by the driver of this device and, + if successful, change the device's runtime PM status to 'active' + + `int pm_generic_restore_noirq(struct device *dev);` + - invoke the ->restore_noirq() callback provided by the device's driver + +These functions are the defaults used by the PM core, if a subsystem doesn't +provide its own callbacks for ->runtime_idle(), ->runtime_suspend(), +->runtime_resume(), ->suspend(), ->suspend_noirq(), ->resume(), +->resume_noirq(), ->freeze(), ->freeze_noirq(), ->thaw(), ->thaw_noirq(), +->poweroff(), ->poweroff_noirq(), ->restore(), ->restore_noirq() in the +subsystem-level dev_pm_ops structure. + +Device drivers that wish to use the same function as a system suspend, freeze, +poweroff and runtime suspend callback, and similarly for system resume, thaw, +restore, and runtime resume, can achieve this with the help of the +UNIVERSAL_DEV_PM_OPS macro defined in include/linux/pm.h (possibly setting its +last argument to NULL). + +8. "No-Callback" Devices +======================== + +Some "devices" are only logical sub-devices of their parent and cannot be +power-managed on their own. (The prototype example is a USB interface. 
Entire +USB devices can go into low-power mode or send wake-up requests, but neither is +possible for individual interfaces.) The drivers for these devices have no +need of runtime PM callbacks; if the callbacks did exist, ->runtime_suspend() +and ->runtime_resume() would always return 0 without doing anything else and +->runtime_idle() would always call pm_runtime_suspend(). + +Subsystems can tell the PM core about these devices by calling +pm_runtime_no_callbacks(). This should be done after the device structure is +initialized and before it is registered (although after device registration is +also okay). The routine will set the device's power.no_callbacks flag and +prevent the non-debugging runtime PM sysfs attributes from being created. + +When power.no_callbacks is set, the PM core will not invoke the +->runtime_idle(), ->runtime_suspend(), or ->runtime_resume() callbacks. +Instead it will assume that suspends and resumes always succeed and that idle +devices should be suspended. + +As a consequence, the PM core will never directly inform the device's subsystem +or driver about runtime power changes. Instead, the driver for the device's +parent must take responsibility for telling the device's driver when the +parent's power state changes. + +9. Autosuspend, or automatically-delayed suspends +================================================= + +Changing a device's power state isn't free; it requires both time and energy. +A device should be put in a low-power state only when there's some reason to +think it will remain in that state for a substantial time. A common heuristic +says that a device which hasn't been used for a while is liable to remain +unused; following this advice, drivers should not allow devices to be suspended +at runtime until they have been inactive for some minimum period. Even when +the heuristic ends up being non-optimal, it will still prevent devices from +"bouncing" too rapidly between low-power and full-power states. 
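The inactivity heuristic described above can be illustrated with a small user-space model. It is a sketch, not the PM core's implementation: `struct foo_dev` and both functions are hypothetical names introduced for this example, and times are plain millisecond counters supplied by the caller.

```c
#include <stdbool.h>

/* Hypothetical per-device bookkeeping; all times are in milliseconds. */
struct foo_dev {
	unsigned long last_busy;		/* time of the most recent I/O */
	unsigned long autosuspend_delay;	/* required idle period */
};

/* Record that the device was just used. */
void foo_mark_last_busy(struct foo_dev *foo, unsigned long now)
{
	foo->last_busy = now;
}

/*
 * Return true once the device has been idle for at least the configured
 * delay.  Unsigned subtraction keeps the comparison correct even if the
 * time counter wraps around.
 */
bool foo_inactive_long_enough(const struct foo_dev *foo, unsigned long now)
{
	return now - foo->last_busy >= foo->autosuspend_delay;
}
```

Every new use of the device moves last_busy forward, so a device that keeps being used never satisfies the check and does not bounce between power states.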
+ +The term "autosuspend" is an historical remnant. It doesn't mean that the +device is automatically suspended (the subsystem or driver still has to call +the appropriate PM routines); rather it means that runtime suspends will +automatically be delayed until the desired period of inactivity has elapsed. + +Inactivity is determined based on the power.last_busy field. Drivers should +call pm_runtime_mark_last_busy() to update this field after carrying out I/O, +typically just before calling pm_runtime_put_autosuspend(). The desired length +of the inactivity period is a matter of policy. Subsystems can set this length +initially by calling pm_runtime_set_autosuspend_delay(), but after device +registration the length should be controlled by user space, using the +/sys/devices/.../power/autosuspend_delay_ms attribute. + +In order to use autosuspend, subsystems or drivers must call +pm_runtime_use_autosuspend() (preferably before registering the device), and +thereafter they should use the various `*_autosuspend()` helper functions +instead of the non-autosuspend counterparts:: + + Instead of: pm_runtime_suspend use: pm_runtime_autosuspend; + Instead of: pm_schedule_suspend use: pm_request_autosuspend; + Instead of: pm_runtime_put use: pm_runtime_put_autosuspend; + Instead of: pm_runtime_put_sync use: pm_runtime_put_sync_autosuspend. + +Drivers may also continue to use the non-autosuspend helper functions; they +will behave normally, which means sometimes taking the autosuspend delay into +account (see pm_runtime_idle). + +Under some circumstances a driver or subsystem may want to prevent a device +from autosuspending immediately, even though the usage counter is zero and the +autosuspend delay time has expired. 
If the ->runtime_suspend() callback +returns -EAGAIN or -EBUSY, and if the next autosuspend delay expiration time is +in the future (as it normally would be if the callback invoked +pm_runtime_mark_last_busy()), the PM core will automatically reschedule the +autosuspend. The ->runtime_suspend() callback can't do this rescheduling +itself because no suspend requests of any kind are accepted while the device is +suspending (i.e., while the callback is running). + +The implementation is well suited for asynchronous use in interrupt contexts. +However, such use inevitably involves races, because the PM core can't +synchronize ->runtime_suspend() callbacks with the arrival of I/O requests. +This synchronization must be handled by the driver, using its private lock. +Here is a schematic pseudo-code example:: + + foo_read_or_write(struct foo_priv *foo, void *data) + { + lock(&foo->private_lock); + add_request_to_io_queue(foo, data); + if (foo->num_pending_requests++ == 0) + pm_runtime_get(&foo->dev); + if (!foo->is_suspended) + foo_process_next_request(foo); + unlock(&foo->private_lock); + } + + foo_io_completion(struct foo_priv *foo, void *req) + { + lock(&foo->private_lock); + if (--foo->num_pending_requests == 0) { + pm_runtime_mark_last_busy(&foo->dev); + pm_runtime_put_autosuspend(&foo->dev); + } else { + foo_process_next_request(foo); + } + unlock(&foo->private_lock); + /* Send req result back to the user ... */ + } + + int foo_runtime_suspend(struct device *dev) + { + struct foo_priv *foo = container_of(dev, ...); + int ret = 0; + + lock(&foo->private_lock); + if (foo->num_pending_requests > 0) { + ret = -EBUSY; + } else { + /* ... suspend the device ... */ + foo->is_suspended = 1; + } + unlock(&foo->private_lock); + return ret; + } + + int foo_runtime_resume(struct device *dev) + { + struct foo_priv *foo = container_of(dev, ...); + + lock(&foo->private_lock); + /* ... resume the device ... 
*/ + foo->is_suspended = 0; + pm_runtime_mark_last_busy(&foo->dev); + if (foo->num_pending_requests > 0) + foo_process_next_request(foo); + unlock(&foo->private_lock); + return 0; + } + +The important point is that after foo_io_completion() asks for an autosuspend, +the foo_runtime_suspend() callback may race with foo_read_or_write(). +Therefore foo_runtime_suspend() has to check whether there are any pending I/O +requests (while holding the private lock) before allowing the suspend to +proceed. + +In addition, the power.autosuspend_delay field can be changed by user space at +any time. If a driver cares about this, it can call +pm_runtime_autosuspend_expiration() from within the ->runtime_suspend() +callback while holding its private lock. If the function returns a nonzero +value then the delay has not yet expired and the callback should return +-EAGAIN. diff --git a/Documentation/power/runtime_pm.txt b/Documentation/power/runtime_pm.txt deleted file mode 100644 index 937e33c46211..000000000000 --- a/Documentation/power/runtime_pm.txt +++ /dev/null @@ -1,928 +0,0 @@ -Runtime Power Management Framework for I/O Devices - -(C) 2009-2011 Rafael J. Wysocki , Novell Inc. -(C) 2010 Alan Stern -(C) 2014 Intel Corp., Rafael J. Wysocki - -1. Introduction - -Support for runtime power management (runtime PM) of I/O devices is provided -at the power management core (PM core) level by means of: - -* The power management workqueue pm_wq in which bus types and device drivers can - put their PM-related work items. It is strongly recommended that pm_wq be - used for queuing all work items related to runtime PM, because this allows - them to be synchronized with system-wide power transitions (suspend to RAM, - hibernation and resume from system sleep states). pm_wq is declared in - include/linux/pm_runtime.h and defined in kernel/power/main.c. 
- -* A number of runtime PM fields in the 'power' member of 'struct device' (which - is of the type 'struct dev_pm_info', defined in include/linux/pm.h) that can - be used for synchronizing runtime PM operations with one another. - -* Three device runtime PM callbacks in 'struct dev_pm_ops' (defined in - include/linux/pm.h). - -* A set of helper functions defined in drivers/base/power/runtime.c that can be - used for carrying out runtime PM operations in such a way that the - synchronization between them is taken care of by the PM core. Bus types and - device drivers are encouraged to use these functions. - -The runtime PM callbacks present in 'struct dev_pm_ops', the device runtime PM -fields of 'struct dev_pm_info' and the core helper functions provided for -runtime PM are described below. - -2. Device Runtime PM Callbacks - -There are three device runtime PM callbacks defined in 'struct dev_pm_ops': - -struct dev_pm_ops { - ... - int (*runtime_suspend)(struct device *dev); - int (*runtime_resume)(struct device *dev); - int (*runtime_idle)(struct device *dev); - ... -}; - -The ->runtime_suspend(), ->runtime_resume() and ->runtime_idle() callbacks -are executed by the PM core for the device's subsystem that may be either of -the following: - - 1. PM domain of the device, if the device's PM domain object, dev->pm_domain, - is present. - - 2. Device type of the device, if both dev->type and dev->type->pm are present. - - 3. Device class of the device, if both dev->class and dev->class->pm are - present. - - 4. Bus type of the device, if both dev->bus and dev->bus->pm are present. - -If the subsystem chosen by applying the above rules doesn't provide the relevant -callback, the PM core will invoke the corresponding driver callback stored in -dev->driver->pm directly (if present). - -The PM core always checks which callback to use in the order given above, so the -priority order of callbacks from high to low is: PM domain, device type, class -and bus type. 
Moreover, the high-priority one will always take precedence over -a low-priority one. The PM domain, bus type, device type and class callbacks -are referred to as subsystem-level callbacks in what follows. - -By default, the callbacks are always invoked in process context with interrupts -enabled. However, the pm_runtime_irq_safe() helper function can be used to tell -the PM core that it is safe to run the ->runtime_suspend(), ->runtime_resume() -and ->runtime_idle() callbacks for the given device in atomic context with -interrupts disabled. This implies that the callback routines in question must -not block or sleep, but it also means that the synchronous helper functions -listed at the end of Section 4 may be used for that device within an interrupt -handler or generally in an atomic context. - -The subsystem-level suspend callback, if present, is _entirely_ _responsible_ -for handling the suspend of the device as appropriate, which may, but need not -include executing the device driver's own ->runtime_suspend() callback (from the -PM core's point of view it is not necessary to implement a ->runtime_suspend() -callback in a device driver as long as the subsystem-level suspend callback -knows what to do to handle the device). - - * Once the subsystem-level suspend callback (or the driver suspend callback, - if invoked directly) has completed successfully for the given device, the PM - core regards the device as suspended, which need not mean that it has been - put into a low power state. It is supposed to mean, however, that the - device will not process data and will not communicate with the CPU(s) and - RAM until the appropriate resume callback is executed for it. The runtime - PM status of a device after successful execution of the suspend callback is - 'suspended'. - - * If the suspend callback returns -EBUSY or -EAGAIN, the device's runtime PM - status remains 'active', which means that the device _must_ be fully - operational afterwards. 
- - * If the suspend callback returns an error code different from -EBUSY and - -EAGAIN, the PM core regards this as a fatal error and will refuse to run - the helper functions described in Section 4 for the device until its status - is directly set to either 'active', or 'suspended' (the PM core provides - special helper functions for this purpose). - -In particular, if the driver requires remote wakeup capability (i.e. hardware -mechanism allowing the device to request a change of its power state, such as -PCI PME) for proper functioning and device_can_wakeup() returns 'false' for the -device, then ->runtime_suspend() should return -EBUSY. On the other hand, if -device_can_wakeup() returns 'true' for the device and the device is put into a -low-power state during the execution of the suspend callback, it is expected -that remote wakeup will be enabled for the device. Generally, remote wakeup -should be enabled for all input devices put into low-power states at run time. - -The subsystem-level resume callback, if present, is _entirely_ _responsible_ for -handling the resume of the device as appropriate, which may, but need not -include executing the device driver's own ->runtime_resume() callback (from the -PM core's point of view it is not necessary to implement a ->runtime_resume() -callback in a device driver as long as the subsystem-level resume callback knows -what to do to handle the device). - - * Once the subsystem-level resume callback (or the driver resume callback, if - invoked directly) has completed successfully, the PM core regards the device - as fully operational, which means that the device _must_ be able to complete - I/O operations as needed. The runtime PM status of the device is then - 'active'. 
- - * If the resume callback returns an error code, the PM core regards this as a - fatal error and will refuse to run the helper functions described in Section - 4 for the device, until its status is directly set to either 'active', or - 'suspended' (by means of special helper functions provided by the PM core - for this purpose). - -The idle callback (a subsystem-level one, if present, or the driver one) is -executed by the PM core whenever the device appears to be idle, which is -indicated to the PM core by two counters, the device's usage counter and the -counter of 'active' children of the device. - - * If any of these counters is decreased using a helper function provided by - the PM core and it turns out to be equal to zero, the other counter is - checked. If that counter also is equal to zero, the PM core executes the - idle callback with the device as its argument. - -The action performed by the idle callback is totally dependent on the subsystem -(or driver) in question, but the expected and recommended action is to check -if the device can be suspended (i.e. if all of the conditions necessary for -suspending the device are satisfied) and to queue up a suspend request for the -device in that case. If there is no idle callback, or if the callback returns -0, then the PM core will attempt to carry out a runtime suspend of the device, -also respecting devices configured for autosuspend. In essence this means a -call to pm_runtime_autosuspend() (do note that drivers needs to update the -device last busy mark, pm_runtime_mark_last_busy(), to control the delay under -this circumstance). To prevent this (for example, if the callback routine has -started a delayed suspend), the routine must return a non-zero value. Negative -error return codes are ignored by the PM core. 
- -The helper functions provided by the PM core, described in Section 4, guarantee -that the following constraints are met with respect to runtime PM callbacks for -one device: - -(1) The callbacks are mutually exclusive (e.g. it is forbidden to execute - ->runtime_suspend() in parallel with ->runtime_resume() or with another - instance of ->runtime_suspend() for the same device) with the exception that - ->runtime_suspend() or ->runtime_resume() can be executed in parallel with - ->runtime_idle() (although ->runtime_idle() will not be started while any - of the other callbacks is being executed for the same device). - -(2) ->runtime_idle() and ->runtime_suspend() can only be executed for 'active' - devices (i.e. the PM core will only execute ->runtime_idle() or - ->runtime_suspend() for the devices the runtime PM status of which is - 'active'). - -(3) ->runtime_idle() and ->runtime_suspend() can only be executed for a device - the usage counter of which is equal to zero _and_ either the counter of - 'active' children of which is equal to zero, or the 'power.ignore_children' - flag of which is set. - -(4) ->runtime_resume() can only be executed for 'suspended' devices (i.e. the - PM core will only execute ->runtime_resume() for the devices the runtime - PM status of which is 'suspended'). - -Additionally, the helper functions provided by the PM core obey the following -rules: - - * If ->runtime_suspend() is about to be executed or there's a pending request - to execute it, ->runtime_idle() will not be executed for the same device. - - * A request to execute or to schedule the execution of ->runtime_suspend() - will cancel any pending requests to execute ->runtime_idle() for the same - device. - - * If ->runtime_resume() is about to be executed or there's a pending request - to execute it, the other callbacks will not be executed for the same device. 
- - * A request to execute ->runtime_resume() will cancel any pending or - scheduled requests to execute the other callbacks for the same device, - except for scheduled autosuspends. - -3. Runtime PM Device Fields - -The following device runtime PM fields are present in 'struct dev_pm_info', as -defined in include/linux/pm.h: - - struct timer_list suspend_timer; - - timer used for scheduling (delayed) suspend and autosuspend requests - - unsigned long timer_expires; - - timer expiration time, in jiffies (if this is different from zero, the - timer is running and will expire at that time, otherwise the timer is not - running) - - struct work_struct work; - - work structure used for queuing up requests (i.e. work items in pm_wq) - - wait_queue_head_t wait_queue; - - wait queue used if any of the helper functions needs to wait for another - one to complete - - spinlock_t lock; - - lock used for synchronization - - atomic_t usage_count; - - the usage counter of the device - - atomic_t child_count; - - the count of 'active' children of the device - - unsigned int ignore_children; - - if set, the value of child_count is ignored (but still updated) - - unsigned int disable_depth; - - used for disabling the helper functions (they work normally if this is - equal to zero); the initial value of it is 1 (i.e. runtime PM is - initially disabled for all devices) - - int runtime_error; - - if set, there was a fatal error (one of the callbacks returned error code - as described in Section 2), so the helper functions will not work until - this flag is cleared; this is the error code returned by the failing - callback - - unsigned int idle_notification; - - if set, ->runtime_idle() is being executed - - unsigned int request_pending; - - if set, there's a pending request (i.e. 
a work item queued up into pm_wq) - - enum rpm_request request; - - type of request that's pending (valid if request_pending is set) - - unsigned int deferred_resume; - - set if ->runtime_resume() is about to be run while ->runtime_suspend() is - being executed for that device and it is not practical to wait for the - suspend to complete; means "start a resume as soon as you've suspended" - - enum rpm_status runtime_status; - - the runtime PM status of the device; this field's initial value is - RPM_SUSPENDED, which means that each device is initially regarded by the - PM core as 'suspended', regardless of its real hardware status - - unsigned int runtime_auto; - - if set, indicates that the user space has allowed the device driver to - power manage the device at run time via the /sys/devices/.../power/control - interface; it may only be modified with the help of the pm_runtime_allow() - and pm_runtime_forbid() helper functions - - unsigned int no_callbacks; - - indicates that the device does not use the runtime PM callbacks (see - Section 8); it may be modified only by the pm_runtime_no_callbacks() - helper function - - unsigned int irq_safe; - - indicates that the ->runtime_suspend() and ->runtime_resume() callbacks - will be invoked with the spinlock held and interrupts disabled - - unsigned int use_autosuspend; - - indicates that the device's driver supports delayed autosuspend (see - Section 9); it may be modified only by the - pm_runtime{_dont}_use_autosuspend() helper functions - - unsigned int timer_autosuspends; - - indicates that the PM core should attempt to carry out an autosuspend - when the timer expires rather than a normal suspend - - int autosuspend_delay; - - the delay time (in milliseconds) to be used for autosuspend - - unsigned long last_busy; - - the time (in jiffies) when the pm_runtime_mark_last_busy() helper - function was last called for this device; used in calculating inactivity - periods for autosuspend - -All of the above fields are 
members of the 'power' member of 'struct device'. - -4. Runtime PM Device Helper Functions - -The following runtime PM helper functions are defined in -drivers/base/power/runtime.c and include/linux/pm_runtime.h: - - void pm_runtime_init(struct device *dev); - - initialize the device runtime PM fields in 'struct dev_pm_info' - - void pm_runtime_remove(struct device *dev); - - make sure that the runtime PM of the device will be disabled after - removing the device from device hierarchy - - int pm_runtime_idle(struct device *dev); - - execute the subsystem-level idle callback for the device; returns an - error code on failure, where -EINPROGRESS means that ->runtime_idle() is - already being executed; if there is no callback or the callback returns 0 - then run pm_runtime_autosuspend(dev) and return its result - - int pm_runtime_suspend(struct device *dev); - - execute the subsystem-level suspend callback for the device; returns 0 on - success, 1 if the device's runtime PM status was already 'suspended', or - error code on failure, where -EAGAIN or -EBUSY means it is safe to attempt - to suspend the device again in future and -EACCES means that - 'power.disable_depth' is different from 0 - - int pm_runtime_autosuspend(struct device *dev); - - same as pm_runtime_suspend() except that the autosuspend delay is taken - into account; if pm_runtime_autosuspend_expiration() says the delay has - not yet expired then an autosuspend is scheduled for the appropriate time - and 0 is returned - - int pm_runtime_resume(struct device *dev); - - execute the subsystem-level resume callback for the device; returns 0 on - success, 1 if the device's runtime PM status was already 'active' or - error code on failure, where -EAGAIN means it may be safe to attempt to - resume the device again in future, but 'power.runtime_error' should be - checked additionally, and -EACCES means that 'power.disable_depth' is - different from 0 - - int pm_request_idle(struct device *dev); - - submit a 
request to execute the subsystem-level idle callback for the - device (the request is represented by a work item in pm_wq); returns 0 on - success or error code if the request has not been queued up - - int pm_request_autosuspend(struct device *dev); - - schedule the execution of the subsystem-level suspend callback for the - device when the autosuspend delay has expired; if the delay has already - expired then the work item is queued up immediately - - int pm_schedule_suspend(struct device *dev, unsigned int delay); - - schedule the execution of the subsystem-level suspend callback for the - device in future, where 'delay' is the time to wait before queuing up a - suspend work item in pm_wq, in milliseconds (if 'delay' is zero, the work - item is queued up immediately); returns 0 on success, 1 if the device's PM - runtime status was already 'suspended', or error code if the request - hasn't been scheduled (or queued up if 'delay' is 0); if the execution of - ->runtime_suspend() is already scheduled and not yet expired, the new - value of 'delay' will be used as the time to wait - - int pm_request_resume(struct device *dev); - - submit a request to execute the subsystem-level resume callback for the - device (the request is represented by a work item in pm_wq); returns 0 on - success, 1 if the device's runtime PM status was already 'active', or - error code if the request hasn't been queued up - - void pm_runtime_get_noresume(struct device *dev); - - increment the device's usage counter - - int pm_runtime_get(struct device *dev); - - increment the device's usage counter, run pm_request_resume(dev) and - return its result - - int pm_runtime_get_sync(struct device *dev); - - increment the device's usage counter, run pm_runtime_resume(dev) and - return its result - - int pm_runtime_get_if_in_use(struct device *dev); - - return -EINVAL if 'power.disable_depth' is nonzero; otherwise, if the - runtime PM status is RPM_ACTIVE and the runtime PM usage counter is - nonzero, 
increment the counter and return 1; otherwise return 0 without - changing the counter - - void pm_runtime_put_noidle(struct device *dev); - - decrement the device's usage counter - - int pm_runtime_put(struct device *dev); - - decrement the device's usage counter; if the result is 0 then run - pm_request_idle(dev) and return its result - - int pm_runtime_put_autosuspend(struct device *dev); - - decrement the device's usage counter; if the result is 0 then run - pm_request_autosuspend(dev) and return its result - - int pm_runtime_put_sync(struct device *dev); - - decrement the device's usage counter; if the result is 0 then run - pm_runtime_idle(dev) and return its result - - int pm_runtime_put_sync_suspend(struct device *dev); - - decrement the device's usage counter; if the result is 0 then run - pm_runtime_suspend(dev) and return its result - - int pm_runtime_put_sync_autosuspend(struct device *dev); - - decrement the device's usage counter; if the result is 0 then run - pm_runtime_autosuspend(dev) and return its result - - void pm_runtime_enable(struct device *dev); - - decrement the device's 'power.disable_depth' field; if that field is equal - to zero, the runtime PM helper functions can execute subsystem-level - callbacks described in Section 2 for the device - - int pm_runtime_disable(struct device *dev); - - increment the device's 'power.disable_depth' field (if the value of that - field was previously zero, this prevents subsystem-level runtime PM - callbacks from being run for the device), make sure that all of the - pending runtime PM operations on the device are either completed or - canceled; returns 1 if there was a resume request pending and it was - necessary to execute the subsystem-level resume callback for the device - to satisfy that request, otherwise 0 is returned - - int pm_runtime_barrier(struct device *dev); - - check if there's a resume request pending for the device and resume it - (synchronously) in that case, cancel any other pending 
runtime PM requests - regarding it and wait for all runtime PM operations on it in progress to - complete; returns 1 if there was a resume request pending and it was - necessary to execute the subsystem-level resume callback for the device to - satisfy that request, otherwise 0 is returned - - void pm_suspend_ignore_children(struct device *dev, bool enable); - - set/unset the power.ignore_children flag of the device - - int pm_runtime_set_active(struct device *dev); - - clear the device's 'power.runtime_error' flag, set the device's runtime - PM status to 'active' and update its parent's counter of 'active' - children as appropriate (it is only valid to use this function if - 'power.runtime_error' is set or 'power.disable_depth' is greater than - zero); it will fail and return error code if the device has a parent - which is not active and the 'power.ignore_children' flag of which is unset - - void pm_runtime_set_suspended(struct device *dev); - - clear the device's 'power.runtime_error' flag, set the device's runtime - PM status to 'suspended' and update its parent's counter of 'active' - children as appropriate (it is only valid to use this function if - 'power.runtime_error' is set or 'power.disable_depth' is greater than - zero) - - bool pm_runtime_active(struct device *dev); - - return true if the device's runtime PM status is 'active' or its - 'power.disable_depth' field is not equal to zero, or false otherwise - - bool pm_runtime_suspended(struct device *dev); - - return true if the device's runtime PM status is 'suspended' and its - 'power.disable_depth' field is equal to zero, or false otherwise - - bool pm_runtime_status_suspended(struct device *dev); - - return true if the device's runtime PM status is 'suspended' - - void pm_runtime_allow(struct device *dev); - - set the power.runtime_auto flag for the device and decrease its usage - counter (used by the /sys/devices/.../power/control interface to - effectively allow the device to be power managed at 
run time) - - void pm_runtime_forbid(struct device *dev); - - unset the power.runtime_auto flag for the device and increase its usage - counter (used by the /sys/devices/.../power/control interface to - effectively prevent the device from being power managed at run time) - - void pm_runtime_no_callbacks(struct device *dev); - - set the power.no_callbacks flag for the device and remove the runtime - PM attributes from /sys/devices/.../power (or prevent them from being - added when the device is registered) - - void pm_runtime_irq_safe(struct device *dev); - - set the power.irq_safe flag for the device, causing the runtime-PM - callbacks to be invoked with interrupts off - - bool pm_runtime_is_irq_safe(struct device *dev); - - return true if power.irq_safe flag was set for the device, causing - the runtime-PM callbacks to be invoked with interrupts off - - void pm_runtime_mark_last_busy(struct device *dev); - - set the power.last_busy field to the current time - - void pm_runtime_use_autosuspend(struct device *dev); - - set the power.use_autosuspend flag, enabling autosuspend delays; call - pm_runtime_get_sync if the flag was previously cleared and - power.autosuspend_delay is negative - - void pm_runtime_dont_use_autosuspend(struct device *dev); - - clear the power.use_autosuspend flag, disabling autosuspend delays; - decrement the device's usage counter if the flag was previously set and - power.autosuspend_delay is negative; call pm_runtime_idle - - void pm_runtime_set_autosuspend_delay(struct device *dev, int delay); - - set the power.autosuspend_delay value to 'delay' (expressed in - milliseconds); if 'delay' is negative then runtime suspends are - prevented; if power.use_autosuspend is set, pm_runtime_get_sync may be - called or the device's usage counter may be decremented and - pm_runtime_idle called depending on if power.autosuspend_delay is - changed to or from a negative value; if power.use_autosuspend is clear, - pm_runtime_idle is called - - unsigned 
long pm_runtime_autosuspend_expiration(struct device *dev); - - calculate the time when the current autosuspend delay period will expire, - based on power.last_busy and power.autosuspend_delay; if the delay time - is 1000 ms or larger then the expiration time is rounded up to the - nearest second; returns 0 if the delay period has already expired or - power.use_autosuspend isn't set, otherwise returns the expiration time - in jiffies - -It is safe to execute the following helper functions from interrupt context: - -pm_request_idle() -pm_request_autosuspend() -pm_schedule_suspend() -pm_request_resume() -pm_runtime_get_noresume() -pm_runtime_get() -pm_runtime_put_noidle() -pm_runtime_put() -pm_runtime_put_autosuspend() -pm_runtime_enable() -pm_suspend_ignore_children() -pm_runtime_set_active() -pm_runtime_set_suspended() -pm_runtime_suspended() -pm_runtime_mark_last_busy() -pm_runtime_autosuspend_expiration() - -If pm_runtime_irq_safe() has been called for a device then the following helper -functions may also be used in interrupt context: - -pm_runtime_idle() -pm_runtime_suspend() -pm_runtime_autosuspend() -pm_runtime_resume() -pm_runtime_get_sync() -pm_runtime_put_sync() -pm_runtime_put_sync_suspend() -pm_runtime_put_sync_autosuspend() - -5. Runtime PM Initialization, Device Probing and Removal - -Initially, the runtime PM is disabled for all devices, which means that the -majority of the runtime PM helper functions described in Section 4 will return --EAGAIN until pm_runtime_enable() is called for the device. - -In addition to that, the initial runtime PM status of all devices is -'suspended', but it need not reflect the actual physical state of the device. -Thus, if the device is initially active (i.e. it is able to process I/O), its -runtime PM status must be changed to 'active', with the help of -pm_runtime_set_active(), before pm_runtime_enable() is called for the device. 
- -However, if the device has a parent and the parent's runtime PM is enabled, -calling pm_runtime_set_active() for the device will affect the parent, unless -the parent's 'power.ignore_children' flag is set. Namely, in that case the -parent won't be able to suspend at run time, using the PM core's helper -functions, as long as the child's status is 'active', even if the child's -runtime PM is still disabled (i.e. pm_runtime_enable() hasn't been called for -the child yet or pm_runtime_disable() has been called for it). For this reason, -once pm_runtime_set_active() has been called for the device, pm_runtime_enable() -should be called for it too as soon as reasonably possible or its runtime PM -status should be changed back to 'suspended' with the help of -pm_runtime_set_suspended(). - -If the default initial runtime PM status of the device (i.e. 'suspended') -reflects the actual state of the device, its bus type's or its driver's -->probe() callback will likely need to wake it up using one of the PM core's -helper functions described in Section 4. In that case, pm_runtime_resume() -should be used. Of course, for this purpose the device's runtime PM has to be -enabled earlier by calling pm_runtime_enable(). - -Note, if the device may execute pm_runtime calls during the probe (such as -if it is registered with a subsystem that may call back in) then the -pm_runtime_get_sync() call paired with a pm_runtime_put() call will be -appropriate to ensure that the device is not put back to sleep during the -probe. This can happen with systems such as the network device layer. - -It may be desirable to suspend the device once ->probe() has finished. -Therefore the driver core uses the asynchronous pm_request_idle() to submit a -request to execute the subsystem-level idle callback for the device at that -time. A driver that makes use of the runtime autosuspend feature may want to -update the last busy mark before returning from ->probe().
- -Moreover, the driver core prevents runtime PM callbacks from racing with the bus -notifier callback in __device_release_driver(), which is necessary, because the -notifier is used by some subsystems to carry out operations affecting the -runtime PM functionality. It does so by calling pm_runtime_get_sync() before -driver_sysfs_remove() and the BUS_NOTIFY_UNBIND_DRIVER notifications. This -resumes the device if it's in the suspended state and prevents it from -being suspended again while those routines are being executed. - -To allow bus types and drivers to put devices into the suspended state by -calling pm_runtime_suspend() from their ->remove() routines, the driver core -executes pm_runtime_put_sync() after running the BUS_NOTIFY_UNBIND_DRIVER -notifications in __device_release_driver(). This requires bus types and -drivers to make their ->remove() callbacks avoid races with runtime PM directly, -but it also allows more flexibility in the handling of devices during the -removal of their drivers. - -In their ->remove() callbacks, drivers should undo the runtime PM changes done -in ->probe(). Usually this means calling pm_runtime_disable(), -pm_runtime_dont_use_autosuspend() etc. - -The user space can effectively disallow the driver of the device to power manage -it at run time by changing the value of its /sys/devices/.../power/control -attribute to "on", which causes pm_runtime_forbid() to be called. In principle, -this mechanism may also be used by the driver to effectively turn off the -runtime power management of the device until the user space turns it on. -Namely, during the initialization the driver can make sure that the runtime PM -status of the device is 'active' and call pm_runtime_forbid().
It should be -noted, however, that if the user space has already intentionally changed the -value of /sys/devices/.../power/control to "auto" to allow the driver to power -manage the device at run time, the driver may confuse it by using -pm_runtime_forbid() this way. - -6. Runtime PM and System Sleep - -Runtime PM and system sleep (i.e., system suspend and hibernation, also known -as suspend-to-RAM and suspend-to-disk) interact with each other in a couple of -ways. If a device is active when a system sleep starts, everything is -straightforward. But what should happen if the device is already suspended? - -The device may have different wake-up settings for runtime PM and system sleep. -For example, remote wake-up may be enabled for runtime suspend but disallowed -for system sleep (device_may_wakeup(dev) returns 'false'). When this happens, -the subsystem-level system suspend callback is responsible for changing the -device's wake-up setting (it may leave that to the device driver's system -suspend routine). It may be necessary to resume the device and suspend it again -in order to do so. The same is true if the driver uses different power levels -or other settings for runtime suspend and system sleep. - -During system resume, the simplest approach is to bring all devices back to full -power, even if they had been suspended before the system suspend began. There -are several reasons for this, including: - - * The device might need to switch power levels, wake-up settings, etc. - - * Remote wake-up events might have been lost by the firmware. - - * The device's children may need the device to be at full power in order - to resume themselves. - - * The driver's idea of the device state may not agree with the device's - physical state. This can happen during resume from hibernation. - - * The device might need to be reset. 
- - * Even though the device was suspended, if its usage counter was > 0 then most - likely it would need a runtime resume in the near future anyway. - -If the device had been suspended before the system suspend began and it's -brought back to full power during resume, then its runtime PM status will have -to be updated to reflect the actual post-system sleep status. The way to do -this is: - - pm_runtime_disable(dev); - pm_runtime_set_active(dev); - pm_runtime_enable(dev); - -The PM core always increments the runtime usage counter before calling the -->suspend() callback and decrements it after calling the ->resume() callback. -Hence disabling runtime PM temporarily like this will not cause any runtime -suspend attempts to be permanently lost. If the usage count goes to zero -following the return of the ->resume() callback, the ->runtime_idle() callback -will be invoked as usual. - -On some systems, however, system sleep is not entered through a global firmware -or hardware operation. Instead, all hardware components are put into low-power -states directly by the kernel in a coordinated way. Then, the system sleep -state effectively follows from the states the hardware components end up in -and the system is woken up from that state by a hardware interrupt or a similar -mechanism entirely under the kernel's control. As a result, the kernel never -gives control away and the states of all devices during resume are precisely -known to it. If that is the case and none of the situations listed above takes -place (in particular, if the system is not waking up from hibernation), it may -be more efficient to leave the devices that had been suspended before the system -suspend began in the suspended state. - -To this end, the PM core provides a mechanism allowing some coordination between -different levels of device hierarchy. 
Namely, if a system suspend .prepare() -callback returns a positive number for a device, that indicates to the PM core -that the device appears to be runtime-suspended and its state is fine, so it -may be left in runtime suspend provided that all of its descendants are also -left in runtime suspend. If that happens, the PM core will not execute any -system suspend and resume callbacks for all of those devices, except for the -complete callback, which is then entirely responsible for handling the device -as appropriate. This only applies to system suspend transitions that are not -related to hibernation (see Documentation/driver-api/pm/devices.rst for more -information). - -The PM core does its best to reduce the probability of race conditions between -the runtime PM and system suspend/resume (and hibernation) callbacks by carrying -out the following operations: - - * During system suspend pm_runtime_get_noresume() is called for every device - right before executing the subsystem-level .prepare() callback for it and - pm_runtime_barrier() is called for every device right before executing the - subsystem-level .suspend() callback for it. In addition to that the PM core - calls __pm_runtime_disable() with 'false' as the second argument for every - device right before executing the subsystem-level .suspend_late() callback - for it. - - * During system resume pm_runtime_enable() and pm_runtime_put() are called for - every device right after executing the subsystem-level .resume_early() - callback and right after executing the subsystem-level .complete() callback - for it, respectively. - -7. 
Generic subsystem callbacks - -Subsystems may wish to conserve code space by using the set of generic power -management callbacks provided by the PM core, defined in -drivers/base/power/generic_ops.c: - - int pm_generic_runtime_suspend(struct device *dev); - - invoke the ->runtime_suspend() callback provided by the driver of this - device and return its result, or return 0 if not defined - - int pm_generic_runtime_resume(struct device *dev); - - invoke the ->runtime_resume() callback provided by the driver of this - device and return its result, or return 0 if not defined - - int pm_generic_suspend(struct device *dev); - - if the device has not been suspended at run time, invoke the ->suspend() - callback provided by its driver and return its result, or return 0 if not - defined - - int pm_generic_suspend_noirq(struct device *dev); - - if pm_runtime_suspended(dev) returns "false", invoke the ->suspend_noirq() - callback provided by the device's driver and return its result, or return - 0 if not defined - - int pm_generic_resume(struct device *dev); - - invoke the ->resume() callback provided by the driver of this device and, - if successful, change the device's runtime PM status to 'active' - - int pm_generic_resume_noirq(struct device *dev); - - invoke the ->resume_noirq() callback provided by the driver of this device - - int pm_generic_freeze(struct device *dev); - - if the device has not been suspended at run time, invoke the ->freeze() - callback provided by its driver and return its result, or return 0 if not - defined - - int pm_generic_freeze_noirq(struct device *dev); - - if pm_runtime_suspended(dev) returns "false", invoke the ->freeze_noirq() - callback provided by the device's driver and return its result, or return - 0 if not defined - - int pm_generic_thaw(struct device *dev); - - if the device has not been suspended at run time, invoke the ->thaw() - callback provided by its driver and return its result, or return 0 if not - defined - - int
pm_generic_thaw_noirq(struct device *dev); - - if pm_runtime_suspended(dev) returns "false", invoke the ->thaw_noirq() - callback provided by the device's driver and return its result, or return - 0 if not defined - - int pm_generic_poweroff(struct device *dev); - - if the device has not been suspended at run time, invoke the ->poweroff() - callback provided by its driver and return its result, or return 0 if not - defined - - int pm_generic_poweroff_noirq(struct device *dev); - - if pm_runtime_suspended(dev) returns "false", run the ->poweroff_noirq() - callback provided by the device's driver and return its result, or return - 0 if not defined - - int pm_generic_restore(struct device *dev); - - invoke the ->restore() callback provided by the driver of this device and, - if successful, change the device's runtime PM status to 'active' - - int pm_generic_restore_noirq(struct device *dev); - - invoke the ->restore_noirq() callback provided by the device's driver - -These functions are the defaults used by the PM core, if a subsystem doesn't -provide its own callbacks for ->runtime_idle(), ->runtime_suspend(), -->runtime_resume(), ->suspend(), ->suspend_noirq(), ->resume(), -->resume_noirq(), ->freeze(), ->freeze_noirq(), ->thaw(), ->thaw_noirq(), -->poweroff(), ->poweroff_noirq(), ->restore(), ->restore_noirq() in the -subsystem-level dev_pm_ops structure. - -Device drivers that wish to use the same function as a system suspend, freeze, -poweroff and runtime suspend callback, and similarly for system resume, thaw, -restore, and runtime resume, can achieve this with the help of the -UNIVERSAL_DEV_PM_OPS macro defined in include/linux/pm.h (possibly setting its -last argument to NULL). - -8. "No-Callback" Devices - -Some "devices" are only logical sub-devices of their parent and cannot be -power-managed on their own. (The prototype example is a USB interface. 
Entire -USB devices can go into low-power mode or send wake-up requests, but neither is -possible for individual interfaces.) The drivers for these devices have no -need of runtime PM callbacks; if the callbacks did exist, ->runtime_suspend() -and ->runtime_resume() would always return 0 without doing anything else and -->runtime_idle() would always call pm_runtime_suspend(). - -Subsystems can tell the PM core about these devices by calling -pm_runtime_no_callbacks(). This should be done after the device structure is -initialized and before it is registered (although after device registration is -also okay). The routine will set the device's power.no_callbacks flag and -prevent the non-debugging runtime PM sysfs attributes from being created. - -When power.no_callbacks is set, the PM core will not invoke the -->runtime_idle(), ->runtime_suspend(), or ->runtime_resume() callbacks. -Instead it will assume that suspends and resumes always succeed and that idle -devices should be suspended. - -As a consequence, the PM core will never directly inform the device's subsystem -or driver about runtime power changes. Instead, the driver for the device's -parent must take responsibility for telling the device's driver when the -parent's power state changes. - -9. Autosuspend, or automatically-delayed suspends - -Changing a device's power state isn't free; it requires both time and energy. -A device should be put in a low-power state only when there's some reason to -think it will remain in that state for a substantial time. A common heuristic -says that a device which hasn't been used for a while is liable to remain -unused; following this advice, drivers should not allow devices to be suspended -at runtime until they have been inactive for some minimum period. Even when -the heuristic ends up being non-optimal, it will still prevent devices from -"bouncing" too rapidly between low-power and full-power states. - -The term "autosuspend" is an historical remnant. 
It doesn't mean that the -device is automatically suspended (the subsystem or driver still has to call -the appropriate PM routines); rather it means that runtime suspends will -automatically be delayed until the desired period of inactivity has elapsed. - -Inactivity is determined based on the power.last_busy field. Drivers should -call pm_runtime_mark_last_busy() to update this field after carrying out I/O, -typically just before calling pm_runtime_put_autosuspend(). The desired length -of the inactivity period is a matter of policy. Subsystems can set this length -initially by calling pm_runtime_set_autosuspend_delay(), but after device -registration the length should be controlled by user space, using the -/sys/devices/.../power/autosuspend_delay_ms attribute. - -In order to use autosuspend, subsystems or drivers must call -pm_runtime_use_autosuspend() (preferably before registering the device), and -thereafter they should use the various *_autosuspend() helper functions instead -of the non-autosuspend counterparts: - - Instead of: pm_runtime_suspend use: pm_runtime_autosuspend; - Instead of: pm_schedule_suspend use: pm_request_autosuspend; - Instead of: pm_runtime_put use: pm_runtime_put_autosuspend; - Instead of: pm_runtime_put_sync use: pm_runtime_put_sync_autosuspend. - -Drivers may also continue to use the non-autosuspend helper functions; they -will behave normally, which means sometimes taking the autosuspend delay into -account (see pm_runtime_idle). - -Under some circumstances a driver or subsystem may want to prevent a device -from autosuspending immediately, even though the usage counter is zero and the -autosuspend delay time has expired. If the ->runtime_suspend() callback -returns -EAGAIN or -EBUSY, and if the next autosuspend delay expiration time is -in the future (as it normally would be if the callback invoked -pm_runtime_mark_last_busy()), the PM core will automatically reschedule the -autosuspend. 
The ->runtime_suspend() callback can't do this rescheduling -itself because no suspend requests of any kind are accepted while the device is -suspending (i.e., while the callback is running). - -The implementation is well suited for asynchronous use in interrupt contexts. -However, such use inevitably involves races, because the PM core can't -synchronize ->runtime_suspend() callbacks with the arrival of I/O requests. -This synchronization must be handled by the driver, using its private lock. -Here is a schematic pseudo-code example: - - foo_read_or_write(struct foo_priv *foo, void *data) - { - lock(&foo->private_lock); - add_request_to_io_queue(foo, data); - if (foo->num_pending_requests++ == 0) - pm_runtime_get(&foo->dev); - if (!foo->is_suspended) - foo_process_next_request(foo); - unlock(&foo->private_lock); - } - - foo_io_completion(struct foo_priv *foo, void *req) - { - lock(&foo->private_lock); - if (--foo->num_pending_requests == 0) { - pm_runtime_mark_last_busy(&foo->dev); - pm_runtime_put_autosuspend(&foo->dev); - } else { - foo_process_next_request(foo); - } - unlock(&foo->private_lock); - /* Send req result back to the user ... */ - } - - int foo_runtime_suspend(struct device *dev) - { - struct foo_priv *foo = container_of(dev, ...); - int ret = 0; - - lock(&foo->private_lock); - if (foo->num_pending_requests > 0) { - ret = -EBUSY; - } else { - /* ... suspend the device ... */ - foo->is_suspended = 1; - } - unlock(&foo->private_lock); - return ret; - } - - int foo_runtime_resume(struct device *dev) - { - struct foo_priv *foo = container_of(dev, ...); - - lock(&foo->private_lock); - /* ... resume the device ... */ - foo->is_suspended = 0; - pm_runtime_mark_last_busy(&foo->dev); - if (foo->num_pending_requests > 0) - foo_process_next_request(foo); - unlock(&foo->private_lock); - return 0; - } - -The important point is that after foo_io_completion() asks for an autosuspend, -the foo_runtime_suspend() callback may race with foo_read_or_write().
-Therefore foo_runtime_suspend() has to check whether there are any pending I/O -requests (while holding the private lock) before allowing the suspend to -proceed. - -In addition, the power.autosuspend_delay field can be changed by user space at -any time. If a driver cares about this, it can call -pm_runtime_autosuspend_expiration() from within the ->runtime_suspend() -callback while holding its private lock. If the function returns a nonzero -value then the delay has not yet expired and the callback should return --EAGAIN. diff --git a/Documentation/power/s2ram.rst b/Documentation/power/s2ram.rst new file mode 100644 index 000000000000..d739aa7c742c --- /dev/null +++ b/Documentation/power/s2ram.rst @@ -0,0 +1,87 @@ +======================== +How to get s2ram working +======================== + +2006 Linus Torvalds +2006 Pavel Machek + +1) Check suspend.sf.net, program s2ram there has long whitelist of + "known ok" machines, along with tricks to use on each one. + +2) If that does not help, try reading tricks.txt and + video.txt. Perhaps problem is as simple as broken module, and + simple module unload can fix it. + +3) You can use Linus' TRACE_RESUME infrastructure, described below. + +Using TRACE_RESUME +~~~~~~~~~~~~~~~~~~ + +I've been working at making the machines I have able to STR, and almost +always it's a driver that is buggy. Thank God for the suspend/resume +debugging - the thing that Chuck tried to disable. That's often the _only_ +way to debug these things, and it's actually pretty powerful (but +time-consuming - having to insert TRACE_RESUME() markers into the device +driver that doesn't resume and recompile and reboot). 
+ +Anyway, the way to debug this for people who are interested (have a +machine that doesn't boot) is: + + - enable PM_DEBUG, and PM_TRACE + + - use a script like this:: + + #!/bin/sh + sync + echo 1 > /sys/power/pm_trace + echo mem > /sys/power/state + + to suspend + + - if it doesn't come back up (which is usually the problem), reboot by + holding the power button down, and look at the dmesg output for things + like:: + + Magic number: 4:156:725 + hash matches drivers/base/power/resume.c:28 + hash matches device 0000:01:00.0 + + which means that the last trace event was just before trying to resume + device 0000:01:00.0. Then figure out what driver is controlling that + device (lspci and /sys/devices/pci* is your friend), and see if you can + fix it, disable it, or trace into its resume function. + + If no device matches the hash (or any matches appear to be false positives), + the culprit may be a device from a loadable kernel module that is not loaded + until after the hash is checked. You can check the hash against the current + devices again after more modules are loaded using sysfs:: + + cat /sys/power/pm_trace_dev_match + +For example, the above happens to be the VGA device on my EVO, which I +used to run with "radeonfb" (it's an ATI Radeon mobility). It turns out +that "radeonfb" simply cannot resume that device - it tries to set the +PLL's, and it just _hangs_. Using the regular VGA console and letting X +resume it instead works fine. + +NOTE +==== +pm_trace uses the system's Real Time Clock (RTC) to save the magic number. +Reason for this is that the RTC is the only reliably available piece of +hardware during resume operations where a value can be set that will +survive a reboot. + +pm_trace is not compatible with asynchronous suspend, so it turns +asynchronous suspend off (which may work around timing or +ordering-sensitive bugs). 
+ +Consequence is that after a resume (even if it is successful) your system +clock will have a value corresponding to the magic number instead of the +correct date/time! It is therefore advisable to use a program like ntp-date +or rdate to reset the correct date/time from an external time source when +using this trace option. + +As the clock keeps ticking it is also essential that the reboot is done +quickly after the resume failure. The trace option does not use the seconds +or the low order bits of the minutes of the RTC, but a too long delay will +corrupt the magic value. diff --git a/Documentation/power/s2ram.txt b/Documentation/power/s2ram.txt deleted file mode 100644 index 4685aee197fd..000000000000 --- a/Documentation/power/s2ram.txt +++ /dev/null @@ -1,85 +0,0 @@ - How to get s2ram working - ~~~~~~~~~~~~~~~~~~~~~~~~ - 2006 Linus Torvalds - 2006 Pavel Machek - -1) Check suspend.sf.net, program s2ram there has long whitelist of - "known ok" machines, along with tricks to use on each one. - -2) If that does not help, try reading tricks.txt and - video.txt. Perhaps problem is as simple as broken module, and - simple module unload can fix it. - -3) You can use Linus' TRACE_RESUME infrastructure, described below. - - Using TRACE_RESUME - ~~~~~~~~~~~~~~~~~~ - -I've been working at making the machines I have able to STR, and almost -always it's a driver that is buggy. Thank God for the suspend/resume -debugging - the thing that Chuck tried to disable. That's often the _only_ -way to debug these things, and it's actually pretty powerful (but -time-consuming - having to insert TRACE_RESUME() markers into the device -driver that doesn't resume and recompile and reboot). 
- -Anyway, the way to debug this for people who are interested (have a -machine that doesn't boot) is: - - - enable PM_DEBUG, and PM_TRACE - - - use a script like this: - - #!/bin/sh - sync - echo 1 > /sys/power/pm_trace - echo mem > /sys/power/state - - to suspend - - - if it doesn't come back up (which is usually the problem), reboot by - holding the power button down, and look at the dmesg output for things - like - - Magic number: 4:156:725 - hash matches drivers/base/power/resume.c:28 - hash matches device 0000:01:00.0 - - which means that the last trace event was just before trying to resume - device 0000:01:00.0. Then figure out what driver is controlling that - device (lspci and /sys/devices/pci* is your friend), and see if you can - fix it, disable it, or trace into its resume function. - - If no device matches the hash (or any matches appear to be false positives), - the culprit may be a device from a loadable kernel module that is not loaded - until after the hash is checked. You can check the hash against the current - devices again after more modules are loaded using sysfs: - - cat /sys/power/pm_trace_dev_match - -For example, the above happens to be the VGA device on my EVO, which I -used to run with "radeonfb" (it's an ATI Radeon mobility). It turns out -that "radeonfb" simply cannot resume that device - it tries to set the -PLL's, and it just _hangs_. Using the regular VGA console and letting X -resume it instead works fine. - -NOTE -==== -pm_trace uses the system's Real Time Clock (RTC) to save the magic number. -Reason for this is that the RTC is the only reliably available piece of -hardware during resume operations where a value can be set that will -survive a reboot. - -pm_trace is not compatible with asynchronous suspend, so it turns -asynchronous suspend off (which may work around timing or -ordering-sensitive bugs). 
- -Consequence is that after a resume (even if it is successful) your system -clock will have a value corresponding to the magic number instead of the -correct date/time! It is therefore advisable to use a program like ntp-date -or rdate to reset the correct date/time from an external time source when -using this trace option. - -As the clock keeps ticking it is also essential that the reboot is done -quickly after the resume failure. The trace option does not use the seconds -or the low order bits of the minutes of the RTC, but a too long delay will -corrupt the magic value. diff --git a/Documentation/power/suspend-and-cpuhotplug.rst b/Documentation/power/suspend-and-cpuhotplug.rst new file mode 100644 index 000000000000..7ac8e1f549f4 --- /dev/null +++ b/Documentation/power/suspend-and-cpuhotplug.rst @@ -0,0 +1,286 @@ +==================================================================== +Interaction of Suspend code (S3) with the CPU hotplug infrastructure +==================================================================== + +(C) 2011 - 2014 Srivatsa S. Bhat + + +I. Differences between CPU hotplug and Suspend-to-RAM +====================================================== + +How does the regular CPU hotplug code differ from how the Suspend-to-RAM +infrastructure uses it internally? And where do they share common code? + +Well, a picture is worth a thousand words... So ASCII art follows :-) + +[This depicts the current design in the kernel, and focusses only on the +interactions involving the freezer and CPU hotplug and also tries to explain +the locking involved. It outlines the notifications involved as well. +But please note that here, only the call paths are illustrated, with the aim +of describing where they take different paths and where they share code. +What happens when regular CPU hotplug and Suspend-to-RAM race with each other +is not depicted here.] 
+ +On a high level, the suspend-resume cycle goes like this:: + + |Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw | + |tasks | | cpus | | | | cpus | |tasks| + + +More details follow:: + + Suspend call path + ----------------- + + Write 'mem' to + /sys/power/state + sysfs file + | + v + Acquire system_transition_mutex lock + | + v + Send PM_SUSPEND_PREPARE + notifications + | + v + Freeze tasks + | + | + v + disable_nonboot_cpus() + /* start */ + | + v + Acquire cpu_add_remove_lock + | + v + Iterate over CURRENTLY + online CPUs + | + | + | ---------- + v | L + ======> _cpu_down() | + | [This takes cpuhotplug.lock | + Common | before taking down the CPU | + code | and releases it when done] | O + | While it is at it, notifications | + | are sent when notable events occur, | + ======> by running all registered callbacks. | + | | O + | | + | | + v | + Note down these cpus in | P + frozen_cpus mask ---------- + | + v + Disable regular cpu hotplug + by increasing cpu_hotplug_disabled + | + v + Release cpu_add_remove_lock + | + v + /* disable_nonboot_cpus() complete */ + | + v + Do suspend + + + +Resuming back is likewise, with the counterparts being (in the order of +execution during resume): + +* enable_nonboot_cpus() which involves:: + + | Acquire cpu_add_remove_lock + | Decrease cpu_hotplug_disabled, thereby enabling regular cpu hotplug + | Call _cpu_up() [for all those cpus in the frozen_cpus mask, in a loop] + | Release cpu_add_remove_lock + v + +* thaw tasks +* send PM_POST_SUSPEND notifications +* Release system_transition_mutex lock. + + +It is to be noted here that the system_transition_mutex lock is acquired at the very +beginning, when we are just starting out to suspend, and then released only +after the entire cycle is complete (i.e., suspend + resume). 
+ +:: + + + + Regular CPU hotplug call path + ----------------------------- + + Write 0 (or 1) to + /sys/devices/system/cpu/cpu*/online + sysfs file + | + | + v + cpu_down() + | + v + Acquire cpu_add_remove_lock + | + v + If cpu_hotplug_disabled > 0 + return gracefully + | + | + v + ======> _cpu_down() + | [This takes cpuhotplug.lock + Common | before taking down the CPU + code | and releases it when done] + | While it is at it, notifications + | are sent when notable events occur, + ======> by running all registered callbacks. + | + | + v + Release cpu_add_remove_lock + [That's it!, for + regular CPU hotplug] + + + +So, as can be seen from the two diagrams (the parts marked as "Common code"), +regular CPU hotplug and the suspend code path converge at the _cpu_down() and +_cpu_up() functions. They differ in the arguments passed to these functions, +in that during regular CPU hotplug, 0 is passed for the 'tasks_frozen' +argument. But during suspend, since the tasks are already frozen by the time +the non-boot CPUs are offlined or onlined, the _cpu_*() functions are called +with the 'tasks_frozen' argument set to 1. +[See below for some known issues regarding this.] + + +Important files and functions/entry points: +------------------------------------------- + +- kernel/power/process.c : freeze_processes(), thaw_processes() +- kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish() +- kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](), [disable|enable]_nonboot_cpus() + + + +II. What are the issues involved in CPU hotplug? +------------------------------------------------ + +There are some interesting situations involving CPU hotplug and microcode +update on the CPUs, as discussed below: + +[Please bear in mind that the kernel requests the microcode images from +userspace, using the request_firmware() function defined in +drivers/base/firmware_loader/main.c] + + +a. 
When all the CPUs are identical: + + This is the most common situation and it is quite straightforward: we want + to apply the same microcode revision to each of the CPUs. + To give an example of x86, the collect_cpu_info() function defined in + arch/x86/kernel/microcode_core.c helps in discovering the type of the CPU + and thereby in applying the correct microcode revision to it. + But note that the kernel does not maintain a common microcode image for all + the CPUs, in order to handle case 'b' described below. + + +b. When some of the CPUs are different than the rest: + + In this case since we probably need to apply different microcode revisions + to different CPUs, the kernel maintains a copy of the correct microcode + image for each CPU (after appropriate CPU type/model discovery using + functions such as collect_cpu_info()). + + +c. When a CPU is physically hot-unplugged and a new (and possibly different + type of) CPU is hot-plugged into the system: + + In the current design of the kernel, whenever a CPU is taken offline during + a regular CPU hotplug operation, upon receiving the CPU_DEAD notification + (which is sent by the CPU hotplug code), the microcode update driver's + callback for that event reacts by freeing the kernel's copy of the + microcode image for that CPU. + + Hence, when a new CPU is brought online, since the kernel finds that it + doesn't have the microcode image, it does the CPU type/model discovery + afresh and then requests the userspace for the appropriate microcode image + for that CPU, which is subsequently applied. + + For example, in x86, the mc_cpu_callback() function (which is the microcode + update driver's callback registered for CPU hotplug events) calls + microcode_update_cpu() which would call microcode_init_cpu() in this case, + instead of microcode_resume_cpu() when it finds that the kernel doesn't + have a valid microcode image.
This ensures that the CPU type/model + discovery is performed and the right microcode is applied to the CPU after + getting it from userspace. + + +d. Handling microcode update during suspend/hibernate: + + Strictly speaking, during a CPU hotplug operation which does not involve + physically removing or inserting CPUs, the CPUs are not actually powered + off during a CPU offline. They are just put to the lowest C-states possible. + Hence, in such a case, it is not really necessary to re-apply microcode + when the CPUs are brought back online, since they wouldn't have lost the + image during the CPU offline operation. + + This is the usual scenario encountered during a resume after a suspend. + However, in the case of hibernation, since all the CPUs are completely + powered off, during restore it becomes necessary to apply the microcode + images to all the CPUs. + + [Note that we don't expect someone to physically pull out nodes and insert + nodes with a different type of CPUs in-between a suspend-resume or a + hibernate/restore cycle.] + + In the current design of the kernel however, during a CPU offline operation + as part of the suspend/hibernate cycle (cpuhp_tasks_frozen is set), + the existing copy of microcode image in the kernel is not freed up. + And during the CPU online operations (during resume/restore), since the + kernel finds that it already has copies of the microcode images for all the + CPUs, it just applies them to the CPUs, avoiding any re-discovery of CPU + type/model and the need for validating whether the microcode revisions are + right for the CPUs or not (due to the above assumption that physical CPU + hotplug will not be done in-between suspend/resume or hibernate/restore + cycles). + + +III. Known problems +=================== + +Are there any known problems when regular CPU hotplug and suspend race +with each other? + +Yes, they are listed below: + +1. 
When invoking regular CPU hotplug, the 'tasks_frozen' argument passed to + the _cpu_down() and _cpu_up() functions is *always* 0. + This might not reflect the true current state of the system, since the + tasks could have been frozen by an out-of-band event such as a suspend + operation in progress. Hence, the cpuhp_tasks_frozen variable will not + reflect the frozen state and the CPU hotplug callbacks which evaluate + that variable might execute the wrong code path. + +2. If a regular CPU hotplug stress test happens to race with the freezer due + to a suspend operation in progress at the same time, then we could hit the + situation described below: + + * A regular cpu online operation continues its journey from userspace + into the kernel, since the freezing has not yet begun. + * Then freezer gets to work and freezes userspace. + * If cpu online has not yet completed the microcode update stuff by now, + it will now start waiting on the frozen userspace in the + TASK_UNINTERRUPTIBLE state, in order to get the microcode image. + * Now the freezer continues and tries to freeze the remaining tasks. But + due to this wait mentioned above, the freezer won't be able to freeze + the cpu online hotplug task and hence freezing of tasks fails. + + As a result of this task freezing failure, the suspend operation gets + aborted. diff --git a/Documentation/power/suspend-and-cpuhotplug.txt b/Documentation/power/suspend-and-cpuhotplug.txt deleted file mode 100644 index a8751b8df10e..000000000000 --- a/Documentation/power/suspend-and-cpuhotplug.txt +++ /dev/null @@ -1,274 +0,0 @@ -Interaction of Suspend code (S3) with the CPU hotplug infrastructure - - (C) 2011 - 2014 Srivatsa S. Bhat - - -I. How does the regular CPU hotplug code differ from how the Suspend-to-RAM - infrastructure uses it internally? And where do they share common code? - -Well, a picture is worth a thousand words... 
So ASCII art follows :-) - -[This depicts the current design in the kernel, and focusses only on the -interactions involving the freezer and CPU hotplug and also tries to explain -the locking involved. It outlines the notifications involved as well. -But please note that here, only the call paths are illustrated, with the aim -of describing where they take different paths and where they share code. -What happens when regular CPU hotplug and Suspend-to-RAM race with each other -is not depicted here.] - -On a high level, the suspend-resume cycle goes like this: - -|Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw | -|tasks | | cpus | | | | cpus | |tasks| - - -More details follow: - - Suspend call path - ----------------- - - Write 'mem' to - /sys/power/state - sysfs file - | - v - Acquire system_transition_mutex lock - | - v - Send PM_SUSPEND_PREPARE - notifications - | - v - Freeze tasks - | - | - v - disable_nonboot_cpus() - /* start */ - | - v - Acquire cpu_add_remove_lock - | - v - Iterate over CURRENTLY - online CPUs - | - | - | ---------- - v | L - ======> _cpu_down() | - | [This takes cpuhotplug.lock | - Common | before taking down the CPU | - code | and releases it when done] | O - | While it is at it, notifications | - | are sent when notable events occur, | - ======> by running all registered callbacks. 
| - | | O - | | - | | - v | - Note down these cpus in | P - frozen_cpus mask ---------- - | - v - Disable regular cpu hotplug - by increasing cpu_hotplug_disabled - | - v - Release cpu_add_remove_lock - | - v - /* disable_nonboot_cpus() complete */ - | - v - Do suspend - - - -Resuming back is likewise, with the counterparts being (in the order of -execution during resume): -* enable_nonboot_cpus() which involves: - | Acquire cpu_add_remove_lock - | Decrease cpu_hotplug_disabled, thereby enabling regular cpu hotplug - | Call _cpu_up() [for all those cpus in the frozen_cpus mask, in a loop] - | Release cpu_add_remove_lock - v - -* thaw tasks -* send PM_POST_SUSPEND notifications -* Release system_transition_mutex lock. - - -It is to be noted here that the system_transition_mutex lock is acquired at the very -beginning, when we are just starting out to suspend, and then released only -after the entire cycle is complete (i.e., suspend + resume). - - - - Regular CPU hotplug call path - ----------------------------- - - Write 0 (or 1) to - /sys/devices/system/cpu/cpu*/online - sysfs file - | - | - v - cpu_down() - | - v - Acquire cpu_add_remove_lock - | - v - If cpu_hotplug_disabled > 0 - return gracefully - | - | - v - ======> _cpu_down() - | [This takes cpuhotplug.lock - Common | before taking down the CPU - code | and releases it when done] - | While it is at it, notifications - | are sent when notable events occur, - ======> by running all registered callbacks. - | - | - v - Release cpu_add_remove_lock - [That's it!, for - regular CPU hotplug] - - - -So, as can be seen from the two diagrams (the parts marked as "Common code"), -regular CPU hotplug and the suspend code path converge at the _cpu_down() and -_cpu_up() functions. They differ in the arguments passed to these functions, -in that during regular CPU hotplug, 0 is passed for the 'tasks_frozen' -argument. 
But during suspend, since the tasks are already frozen by the time -the non-boot CPUs are offlined or onlined, the _cpu_*() functions are called -with the 'tasks_frozen' argument set to 1. -[See below for some known issues regarding this.] - - -Important files and functions/entry points: ------------------------------------------- - -kernel/power/process.c : freeze_processes(), thaw_processes() -kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish() -kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](), [disable|enable]_nonboot_cpus() - - - -II. What are the issues involved in CPU hotplug? - ------------------------------------------- - -There are some interesting situations involving CPU hotplug and microcode -update on the CPUs, as discussed below: - -[Please bear in mind that the kernel requests the microcode images from -userspace, using the request_firmware() function defined in -drivers/base/firmware_loader/main.c] - - -a. When all the CPUs are identical: - - This is the most common situation and it is quite straightforward: we want - to apply the same microcode revision to each of the CPUs. - To give an example of x86, the collect_cpu_info() function defined in - arch/x86/kernel/microcode_core.c helps in discovering the type of the CPU - and thereby in applying the correct microcode revision to it. - But note that the kernel does not maintain a common microcode image for all - the CPUs, in order to handle case 'b' described below. - - -b. When some of the CPUs are different than the rest: - - In this case since we probably need to apply different microcode revisions - to different CPUs, the kernel maintains a copy of the correct microcode - image for each CPU (after appropriate CPU type/model discovery using - functions such as collect_cpu_info()). - - -c.
When a CPU is physically hot-unplugged and a new (and possibly different - type of) CPU is hot-plugged into the system: - - In the current design of the kernel, whenever a CPU is taken offline during - a regular CPU hotplug operation, upon receiving the CPU_DEAD notification - (which is sent by the CPU hotplug code), the microcode update driver's - callback for that event reacts by freeing the kernel's copy of the - microcode image for that CPU. - - Hence, when a new CPU is brought online, since the kernel finds that it - doesn't have the microcode image, it does the CPU type/model discovery - afresh and then requests the userspace for the appropriate microcode image - for that CPU, which is subsequently applied. - - For example, in x86, the mc_cpu_callback() function (which is the microcode - update driver's callback registered for CPU hotplug events) calls - microcode_update_cpu() which would call microcode_init_cpu() in this case, - instead of microcode_resume_cpu() when it finds that the kernel doesn't - have a valid microcode image. This ensures that the CPU type/model - discovery is performed and the right microcode is applied to the CPU after - getting it from userspace. - - -d. Handling microcode update during suspend/hibernate: - - Strictly speaking, during a CPU hotplug operation which does not involve - physically removing or inserting CPUs, the CPUs are not actually powered - off during a CPU offline. They are just put to the lowest C-states possible. - Hence, in such a case, it is not really necessary to re-apply microcode - when the CPUs are brought back online, since they wouldn't have lost the - image during the CPU offline operation. - - This is the usual scenario encountered during a resume after a suspend. - However, in the case of hibernation, since all the CPUs are completely - powered off, during restore it becomes necessary to apply the microcode - images to all the CPUs. 
- - [Note that we don't expect someone to physically pull out nodes and insert - nodes with a different type of CPUs in-between a suspend-resume or a - hibernate/restore cycle.] - - In the current design of the kernel however, during a CPU offline operation - as part of the suspend/hibernate cycle (cpuhp_tasks_frozen is set), - the existing copy of microcode image in the kernel is not freed up. - And during the CPU online operations (during resume/restore), since the - kernel finds that it already has copies of the microcode images for all the - CPUs, it just applies them to the CPUs, avoiding any re-discovery of CPU - type/model and the need for validating whether the microcode revisions are - right for the CPUs or not (due to the above assumption that physical CPU - hotplug will not be done in-between suspend/resume or hibernate/restore - cycles). - - -III. Are there any known problems when regular CPU hotplug and suspend race - with each other? - -Yes, they are listed below: - -1. When invoking regular CPU hotplug, the 'tasks_frozen' argument passed to - the _cpu_down() and _cpu_up() functions is *always* 0. - This might not reflect the true current state of the system, since the - tasks could have been frozen by an out-of-band event such as a suspend - operation in progress. Hence, the cpuhp_tasks_frozen variable will not - reflect the frozen state and the CPU hotplug callbacks which evaluate - that variable might execute the wrong code path. - -2. If a regular CPU hotplug stress test happens to race with the freezer due - to a suspend operation in progress at the same time, then we could hit the - situation described below: - - * A regular cpu online operation continues its journey from userspace - into the kernel, since the freezing has not yet begun. - * Then freezer gets to work and freezes userspace. 
- * If cpu online has not yet completed the microcode update stuff by now, - it will now start waiting on the frozen userspace in the - TASK_UNINTERRUPTIBLE state, in order to get the microcode image. - * Now the freezer continues and tries to freeze the remaining tasks. But - due to this wait mentioned above, the freezer won't be able to freeze - the cpu online hotplug task and hence freezing of tasks fails. - - As a result of this task freezing failure, the suspend operation gets - aborted. diff --git a/Documentation/power/suspend-and-interrupts.rst b/Documentation/power/suspend-and-interrupts.rst new file mode 100644 index 000000000000..4cda6617709a --- /dev/null +++ b/Documentation/power/suspend-and-interrupts.rst @@ -0,0 +1,137 @@ +==================================== +System Suspend and Device Interrupts +==================================== + +Copyright (C) 2014 Intel Corp. +Author: Rafael J. Wysocki + + +Suspending and Resuming Device IRQs +----------------------------------- + +Device interrupt request lines (IRQs) are generally disabled during system +suspend after the "late" phase of suspending devices (that is, after all of the +->prepare, ->suspend and ->suspend_late callbacks have been executed for all +devices). That is done by suspend_device_irqs(). + +The rationale for doing so is that after the "late" phase of device suspend +there is no legitimate reason why any interrupts from suspended devices should +trigger and if any devices have not been suspended properly yet, it is better to +block interrupts from them anyway. Also, in the past we had problems with +interrupt handlers for shared IRQs that device drivers implementing them were +not prepared for interrupts triggering after their devices had been suspended. +In some cases they would attempt to access, for example, memory address spaces +of suspended devices and cause unpredictable behavior to ensue as a result. 
+Unfortunately, such problems are very difficult to debug and the introduction +of suspend_device_irqs(), along with the "noirq" phase of device suspend and +resume, was the only practical way to mitigate them. + +Device IRQs are re-enabled during system resume, right before the "early" phase +of resuming devices (that is, before starting to execute ->resume_early +callbacks for devices). The function doing that is resume_device_irqs(). + + +The IRQF_NO_SUSPEND Flag +------------------------ + +There are interrupts that can legitimately trigger during the entire system +suspend-resume cycle, including the "noirq" phases of suspending and resuming +devices as well as during the time when nonboot CPUs are taken offline and +brought back online. That applies to timer interrupts in the first place, +but also to IPIs and to some other special-purpose interrupts. + +The IRQF_NO_SUSPEND flag is used to indicate that to the IRQ subsystem when +requesting a special-purpose interrupt. It causes suspend_device_irqs() to +leave the corresponding IRQ enabled so as to allow the interrupt to work as +expected during the suspend-resume cycle, but does not guarantee that the +interrupt will wake the system from a suspended state -- for such cases it is +necessary to use enable_irq_wake(). + +Note that the IRQF_NO_SUSPEND flag affects the entire IRQ and not just one +user of it. Thus, if the IRQ is shared, all of the interrupt handlers installed +for it will be executed as usual after suspend_device_irqs(), even if the +IRQF_NO_SUSPEND flag was not passed to request_irq() (or equivalent) by some of +the IRQ's users. For this reason, using IRQF_NO_SUSPEND and IRQF_SHARED at the +same time should be avoided. 
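The sharing rule described in the IRQF_NO_SUSPEND section above can be illustrated with a small user-space model. This is only a sketch based on the text of that section, not kernel code: the `MODEL_*` flag values, the `model_action` and `model_irq` types and all `model_*` functions are hypothetical names made up for the illustration, and the real bookkeeping in the IRQ core is more involved.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values for this model only; not the kernel's definitions. */
#define MODEL_IRQF_SHARED     0x1u
#define MODEL_IRQF_NO_SUSPEND 0x2u

/* One user (handler) installed on an IRQ line. */
struct model_action {
	const char *name;
	unsigned int flags;
};

/* One IRQ line with all of its installed users. */
struct model_irq {
	const struct model_action *actions;
	size_t nr_actions;
	bool enabled;
};

/* The flag applies to the whole IRQ: one user requesting it is enough. */
static bool model_irq_no_suspend(const struct model_irq *irq)
{
	for (size_t i = 0; i < irq->nr_actions; i++)
		if (irq->actions[i].flags & MODEL_IRQF_NO_SUSPEND)
			return true;
	return false;
}

/* Models suspend_device_irqs() for a single line. */
static void model_suspend_device_irqs(struct model_irq *irq)
{
	if (!model_irq_no_suspend(irq))
		irq->enabled = false;
}

/* How many handlers would run if the line fired now. */
static size_t model_handlers_run(const struct model_irq *irq)
{
	return irq->enabled ? irq->nr_actions : 0;
}
```

In this model, an IRQ shared by one user that passed the no-suspend flag and one that did not stays enabled after model_suspend_device_irqs(), and model_handlers_run() counts both handlers. That is the situation the section warns about when it advises against combining IRQF_NO_SUSPEND with IRQF_SHARED.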
+ + +System Wakeup Interrupts, enable_irq_wake() and disable_irq_wake() +------------------------------------------------------------------ + +System wakeup interrupts generally need to be configured to wake up the system +from sleep states, especially if they are used for different purposes (e.g. as +I/O interrupts) in the working state. + +That may involve turning on a special signal handling logic within the platform +(such as an SoC) so that signals from a given line are routed in a different way +during system sleep so as to trigger a system wakeup when needed. For example, +the platform may include a dedicated interrupt controller used specifically for +handling system wakeup events. Then, if a given interrupt line is supposed to +wake up the system from sleep states, the corresponding input of that interrupt +controller needs to be enabled to receive signals from the line in question. +After wakeup, it generally is better to disable that input to prevent the +dedicated controller from triggering interrupts unnecessarily. + +The IRQ subsystem provides two helper functions to be used by device drivers for +those purposes. Namely, enable_irq_wake() turns on the platform's logic for +handling the given IRQ as a system wakeup interrupt line and disable_irq_wake() +turns that logic off. + +Calling enable_irq_wake() causes suspend_device_irqs() to treat the given IRQ +in a special way. Namely, the IRQ remains enabled, but on the first interrupt +it will be disabled, marked as pending and "suspended" so that it will be +re-enabled by resume_device_irqs() during the subsequent system resume. Also +the PM core is notified about the event which causes the system suspend in +progress to be aborted (that doesn't have to happen immediately, but at one +of the points where the suspend thread looks for pending wakeup events).
+ +This way every interrupt from a wakeup interrupt source will either cause the +system suspend currently in progress to be aborted or wake up the system if +already suspended. However, after suspend_device_irqs() interrupt handlers are +not executed for system wakeup IRQs. They are only executed for IRQF_NO_SUSPEND +IRQs at that time, but those IRQs should not be configured for system wakeup +using enable_irq_wake(). + + +Interrupts and Suspend-to-Idle +------------------------------ + +Suspend-to-idle (also known as the "freeze" sleep state) is a relatively new +system sleep state that works by idling all of the processors and waiting for +interrupts right after the "noirq" phase of suspending devices. + +Of course, this means that all of the interrupts with the IRQF_NO_SUSPEND flag +set will bring CPUs out of idle while in that state, but they will not cause the +IRQ subsystem to trigger a system wakeup. + +System wakeup interrupts, in turn, will trigger wakeup from suspend-to-idle in +analogy with what they do in the full system suspend case. The only difference +is that the wakeup from suspend-to-idle is signaled using the usual working +state interrupt delivery mechanisms and doesn't require the platform to use +any special interrupt handling logic for it to work. + + +IRQF_NO_SUSPEND and enable_irq_wake() +------------------------------------- + +There are very few valid reasons to use both enable_irq_wake() and the +IRQF_NO_SUSPEND flag on the same IRQ, and it is never valid to use both for the +same device. + +First of all, if the IRQ is not shared, the rules for handling IRQF_NO_SUSPEND +interrupts (interrupt handlers are invoked after suspend_device_irqs()) are +directly at odds with the rules for handling system wakeup interrupts (interrupt +handlers are not invoked after suspend_device_irqs()). 
+ +Second, both enable_irq_wake() and IRQF_NO_SUSPEND apply to entire IRQs and not +to individual interrupt handlers, so sharing an IRQ between a system wakeup +interrupt source and an IRQF_NO_SUSPEND interrupt source does not generally +make sense. + +In rare cases an IRQ can be shared between a wakeup device driver and an +IRQF_NO_SUSPEND user. In order for this to be safe, the wakeup device driver +must be able to discern spurious IRQs from genuine wakeup events (signalling +the latter to the core with pm_system_wakeup()), must use enable_irq_wake() to +ensure that the IRQ will function as a wakeup source, and must request the IRQ +with IRQF_COND_SUSPEND to tell the core that it meets these requirements. If +these requirements are not met, it is not valid to use IRQF_COND_SUSPEND. diff --git a/Documentation/power/suspend-and-interrupts.txt b/Documentation/power/suspend-and-interrupts.txt deleted file mode 100644 index 8afb29a8604a..000000000000 --- a/Documentation/power/suspend-and-interrupts.txt +++ /dev/null @@ -1,135 +0,0 @@ -System Suspend and Device Interrupts - -Copyright (C) 2014 Intel Corp. -Author: Rafael J. Wysocki - - -Suspending and Resuming Device IRQs ------------------------------------ - -Device interrupt request lines (IRQs) are generally disabled during system -suspend after the "late" phase of suspending devices (that is, after all of the -->prepare, ->suspend and ->suspend_late callbacks have been executed for all -devices). That is done by suspend_device_irqs(). - -The rationale for doing so is that after the "late" phase of device suspend -there is no legitimate reason why any interrupts from suspended devices should -trigger and if any devices have not been suspended properly yet, it is better to -block interrupts from them anyway. Also, in the past we had problems with -interrupt handlers for shared IRQs that device drivers implementing them were -not prepared for interrupts triggering after their devices had been suspended. 
-In some cases they would attempt to access, for example, memory address spaces -of suspended devices and cause unpredictable behavior to ensue as a result. -Unfortunately, such problems are very difficult to debug and the introduction -of suspend_device_irqs(), along with the "noirq" phase of device suspend and -resume, was the only practical way to mitigate them. - -Device IRQs are re-enabled during system resume, right before the "early" phase -of resuming devices (that is, before starting to execute ->resume_early -callbacks for devices). The function doing that is resume_device_irqs(). - - -The IRQF_NO_SUSPEND Flag ------------------------- - -There are interrupts that can legitimately trigger during the entire system -suspend-resume cycle, including the "noirq" phases of suspending and resuming -devices as well as during the time when nonboot CPUs are taken offline and -brought back online. That applies to timer interrupts in the first place, -but also to IPIs and to some other special-purpose interrupts. - -The IRQF_NO_SUSPEND flag is used to indicate that to the IRQ subsystem when -requesting a special-purpose interrupt. It causes suspend_device_irqs() to -leave the corresponding IRQ enabled so as to allow the interrupt to work as -expected during the suspend-resume cycle, but does not guarantee that the -interrupt will wake the system from a suspended state -- for such cases it is -necessary to use enable_irq_wake(). - -Note that the IRQF_NO_SUSPEND flag affects the entire IRQ and not just one -user of it. Thus, if the IRQ is shared, all of the interrupt handlers installed -for it will be executed as usual after suspend_device_irqs(), even if the -IRQF_NO_SUSPEND flag was not passed to request_irq() (or equivalent) by some of -the IRQ's users. For this reason, using IRQF_NO_SUSPEND and IRQF_SHARED at the -same time should be avoided. 
- - -System Wakeup Interrupts, enable_irq_wake() and disable_irq_wake() ------------------------------------------------------------------- - -System wakeup interrupts generally need to be configured to wake up the system -from sleep states, especially if they are used for different purposes (e.g. as -I/O interrupts) in the working state. - -That may involve turning on a special signal handling logic within the platform -(such as an SoC) so that signals from a given line are routed in a different way -during system sleep so as to trigger a system wakeup when needed. For example, -the platform may include a dedicated interrupt controller used specifically for -handling system wakeup events. Then, if a given interrupt line is supposed to -wake up the system from sleep states, the corresponding input of that interrupt -controller needs to be enabled to receive signals from the line in question. -After wakeup, it generally is better to disable that input to prevent the -dedicated controller from triggering interrupts unnecessarily. - -The IRQ subsystem provides two helper functions to be used by device drivers for -those purposes. Namely, enable_irq_wake() turns on the platform's logic for -handling the given IRQ as a system wakeup interrupt line and disable_irq_wake() -turns that logic off. - -Calling enable_irq_wake() causes suspend_device_irqs() to treat the given IRQ -in a special way. Namely, the IRQ remains enabled, but on the first interrupt -it will be disabled, marked as pending and "suspended" so that it will be -re-enabled by resume_device_irqs() during the subsequent system resume. Also -the PM core is notified about the event which causes the system suspend in -progress to be aborted (that doesn't have to happen immediately, but at one -of the points where the suspend thread looks for pending wakeup events).
- -This way every interrupt from a wakeup interrupt source will either cause the -system suspend currently in progress to be aborted or wake up the system if -already suspended. However, after suspend_device_irqs() interrupt handlers are -not executed for system wakeup IRQs. They are only executed for IRQF_NO_SUSPEND -IRQs at that time, but those IRQs should not be configured for system wakeup -using enable_irq_wake(). - - -Interrupts and Suspend-to-Idle ------------------------------- - -Suspend-to-idle (also known as the "freeze" sleep state) is a relatively new -system sleep state that works by idling all of the processors and waiting for -interrupts right after the "noirq" phase of suspending devices. - -Of course, this means that all of the interrupts with the IRQF_NO_SUSPEND flag -set will bring CPUs out of idle while in that state, but they will not cause the -IRQ subsystem to trigger a system wakeup. - -System wakeup interrupts, in turn, will trigger wakeup from suspend-to-idle in -analogy with what they do in the full system suspend case. The only difference -is that the wakeup from suspend-to-idle is signaled using the usual working -state interrupt delivery mechanisms and doesn't require the platform to use -any special interrupt handling logic for it to work. - - -IRQF_NO_SUSPEND and enable_irq_wake() -------------------------------------- - -There are very few valid reasons to use both enable_irq_wake() and the -IRQF_NO_SUSPEND flag on the same IRQ, and it is never valid to use both for the -same device. - -First of all, if the IRQ is not shared, the rules for handling IRQF_NO_SUSPEND -interrupts (interrupt handlers are invoked after suspend_device_irqs()) are -directly at odds with the rules for handling system wakeup interrupts (interrupt -handlers are not invoked after suspend_device_irqs()). 
- -Second, both enable_irq_wake() and IRQF_NO_SUSPEND apply to entire IRQs and not -to individual interrupt handlers, so sharing an IRQ between a system wakeup -interrupt source and an IRQF_NO_SUSPEND interrupt source does not generally -make sense. - -In rare cases an IRQ can be shared between a wakeup device driver and an -IRQF_NO_SUSPEND user. In order for this to be safe, the wakeup device driver -must be able to discern spurious IRQs from genuine wakeup events (signalling -the latter to the core with pm_system_wakeup()), must use enable_irq_wake() to -ensure that the IRQ will function as a wakeup source, and must request the IRQ -with IRQF_COND_SUSPEND to tell the core that it meets these requirements. If -these requirements are not met, it is not valid to use IRQF_COND_SUSPEND. diff --git a/Documentation/power/swsusp-and-swap-files.rst b/Documentation/power/swsusp-and-swap-files.rst new file mode 100644 index 000000000000..a33a2919dbe4 --- /dev/null +++ b/Documentation/power/swsusp-and-swap-files.rst @@ -0,0 +1,63 @@ +=============================================== +Using swap files with software suspend (swsusp) +=============================================== + + (C) 2006 Rafael J. Wysocki + +The Linux kernel handles swap files almost in the same way as it handles swap +partitions and there are only two differences between these two types of swap +areas: +(1) swap files need not be contiguous, +(2) the header of a swap file is not in the first block of the partition that +holds it. From the swsusp's point of view (1) is not a problem, because it is +already taken care of by the swap-handling code, but (2) has to be taken into +consideration. + +In principle the location of a swap file's header may be determined with the +help of appropriate filesystem driver. Unfortunately, however, it requires the +filesystem holding the swap file to be mounted, and if this filesystem is +journaled, it cannot be mounted during resume from disk. 
For this reason to +identify a swap file swsusp uses the name of the partition that holds the file +and the offset from the beginning of the partition at which the swap file's +header is located. For convenience, this offset is expressed in +units. + +In order to use a swap file with swsusp, you need to: + +1) Create the swap file and make it active, eg.:: + + # dd if=/dev/zero of= bs=1024 count= + # mkswap + # swapon + +2) Use an application that will bmap the swap file with the help of the +FIBMAP ioctl and determine the location of the file's swap header, as the +offset, in units, from the beginning of the partition which +holds the swap file. + +3) Add the following parameters to the kernel command line:: + + resume= resume_offset= + +where is the partition on which the swap file is located +and is the offset of the swap header determined by the +application in 2) (of course, this step may be carried out automatically +by the same application that determines the swap file's header offset using the +FIBMAP ioctl) + +OR + +Use a userland suspend application that will set the partition and offset +with the help of the SNAPSHOT_SET_SWAP_AREA ioctl described in +Documentation/power/userland-swsusp.rst (this is the only method to suspend +to a swap file allowing the resume to be initiated from an initrd or initramfs +image). + +Now, swsusp will use the swap file in the same way in which it would use a swap +partition. In particular, the swap file has to be active (ie. be present in +/proc/swaps) so that it can be used for suspending. + +Note that if the swap file used for suspending is deleted and recreated, +the location of its header need not be the same as before. Thus every time +this happens the value of the "resume_offset=" kernel command line parameter +has to be updated. 
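Reviewer note on step 3 of the new swsusp-and-swap-files.rst: the sketch below shows one way the two command-line parameters could be assembled once the header offset is known. It is not part of the patch. The function name `resume_params`, the device `/dev/sda2`, and the byte offset are hypothetical, and it assumes the offset unit elided above is the page size, as in the upstream kernel document. The byte offset itself still has to come from a FIBMAP-based tool as described in step 2.

```shell
#!/bin/sh
# Hypothetical helper (not from the patch): build the kernel command-line
# fragment for resuming from a swap file.
#   $1 = partition that holds the swap file (example value only)
#   $2 = offset of the swap header from the start of the partition, in bytes
#   $3 = page size in bytes; defaults to the running system's page size
# Assumption: resume_offset= is expressed in page-size units.
resume_params() {
    dev=$1
    byte_off=$2
    page=${3:-$(getconf PAGESIZE)}

    # A swap header that is not page aligned cannot be expressed in
    # page-size units, so refuse instead of silently rounding.
    if [ $((byte_off % page)) -ne 0 ]; then
        echo "offset $byte_off is not a multiple of page size $page" >&2
        return 1
    fi

    echo "resume=$dev resume_offset=$((byte_off / page))"
}
```

For example, `resume_params /dev/sda2 40960 4096` prints `resume=/dev/sda2 resume_offset=10`. Passing the page size explicitly keeps the calculation reproducible on a machine other than the one that will resume. As the note at the end of the document says, the value has to be recomputed whenever the swap file is deleted and recreated.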
diff --git a/Documentation/power/swsusp-and-swap-files.txt b/Documentation/power/swsusp-and-swap-files.txt deleted file mode 100644 index f281886de490..000000000000 --- a/Documentation/power/swsusp-and-swap-files.txt +++ /dev/null @@ -1,60 +0,0 @@ -Using swap files with software suspend (swsusp) - (C) 2006 Rafael J. Wysocki - -The Linux kernel handles swap files almost in the same way as it handles swap -partitions and there are only two differences between these two types of swap -areas: -(1) swap files need not be contiguous, -(2) the header of a swap file is not in the first block of the partition that -holds it. From the swsusp's point of view (1) is not a problem, because it is -already taken care of by the swap-handling code, but (2) has to be taken into -consideration. - -In principle the location of a swap file's header may be determined with the -help of appropriate filesystem driver. Unfortunately, however, it requires the -filesystem holding the swap file to be mounted, and if this filesystem is -journaled, it cannot be mounted during resume from disk. For this reason to -identify a swap file swsusp uses the name of the partition that holds the file -and the offset from the beginning of the partition at which the swap file's -header is located. For convenience, this offset is expressed in -units. - -In order to use a swap file with swsusp, you need to: - -1) Create the swap file and make it active, eg. - -# dd if=/dev/zero of= bs=1024 count= -# mkswap -# swapon - -2) Use an application that will bmap the swap file with the help of the -FIBMAP ioctl and determine the location of the file's swap header, as the -offset, in units, from the beginning of the partition which -holds the swap file. 
- -3) Add the following parameters to the kernel command line: - -resume= resume_offset= - -where is the partition on which the swap file is located -and is the offset of the swap header determined by the -application in 2) (of course, this step may be carried out automatically -by the same application that determines the swap file's header offset using the -FIBMAP ioctl) - -OR - -Use a userland suspend application that will set the partition and offset -with the help of the SNAPSHOT_SET_SWAP_AREA ioctl described in -Documentation/power/userland-swsusp.txt (this is the only method to suspend -to a swap file allowing the resume to be initiated from an initrd or initramfs -image). - -Now, swsusp will use the swap file in the same way in which it would use a swap -partition. In particular, the swap file has to be active (ie. be present in -/proc/swaps) so that it can be used for suspending. - -Note that if the swap file used for suspending is deleted and recreated, -the location of its header need not be the same as before. Thus every time -this happens the value of the "resume_offset=" kernel command line parameter -has to be updated. diff --git a/Documentation/power/swsusp-dmcrypt.rst b/Documentation/power/swsusp-dmcrypt.rst new file mode 100644 index 000000000000..426df59172cd --- /dev/null +++ b/Documentation/power/swsusp-dmcrypt.rst @@ -0,0 +1,140 @@ +======================================= +How to use dm-crypt and swsusp together +======================================= + +Author: Andreas Steinmetz + + + +Some prerequisites: +You know how dm-crypt works. If not, visit the following web page: +http://www.saout.de/misc/dm-crypt/ +You have read Documentation/power/swsusp.rst and understand it. +You did read Documentation/admin-guide/initrd.rst and know how an initrd works. +You know how to create or how to modify an initrd. 
+ +Now your system is properly set up, your disk is encrypted except for +the swap device(s) and the boot partition which may contain a mini +system for crypto setup and/or rescue purposes. You may even have +an initrd that does your current crypto setup already. + +At this point you want to encrypt your swap, too. Still you want to +be able to suspend using swsusp. This, however, means that you +have to be able to either enter a passphrase or that you read +the key(s) from an external device like a pcmcia flash disk +or an usb stick prior to resume. So you need an initrd, that sets +up dm-crypt and then asks swsusp to resume from the encrypted +swap device. + +The most important thing is that you set up dm-crypt in such +a way that the swap device you suspend to/resume from has +always the same major/minor within the initrd as well as +within your running system. The easiest way to achieve this is +to always set up this swap device first with dmsetup, so that +it will always look like the following:: + + brw------- 1 root root 254, 0 Jul 28 13:37 /dev/mapper/swap0 + +Now set up your kernel to use /dev/mapper/swap0 as the default +resume partition, so your kernel .config contains:: + + CONFIG_PM_STD_PARTITION="/dev/mapper/swap0" + +Prepare your boot loader to use the initrd you will create or +modify. For lilo the simplest setup looks like the following +lines:: + + image=/boot/vmlinuz + initrd=/boot/initrd.gz + label=linux + append="root=/dev/ram0 init=/linuxrc rw" + +Finally you need to create or modify your initrd. Lets assume +you create an initrd that reads the required dm-crypt setup +from a pcmcia flash disk card. The card is formatted with an ext2 +fs which resides on /dev/hde1 when the card is inserted. The +card contains at least the encrypted swap setup in a file +named "swapkey". 
/etc/fstab of your initrd contains something +like the following:: + + /dev/hda1 /mnt ext3 ro 0 0 + none /proc proc defaults,noatime,nodiratime 0 0 + none /sys sysfs defaults,noatime,nodiratime 0 0 + +/dev/hda1 contains an unencrypted mini system that sets up all +of your crypto devices, again by reading the setup from the +pcmcia flash disk. What follows now is a /linuxrc for your +initrd that allows you to resume from encrypted swap and that +continues boot with your mini system on /dev/hda1 if resume +does not happen:: + + #!/bin/sh + PATH=/sbin:/bin:/usr/sbin:/usr/bin + mount /proc + mount /sys + mapped=0 + noresume=`grep -c noresume /proc/cmdline` + if [ "$*" != "" ] + then + noresume=1 + fi + dmesg -n 1 + /sbin/cardmgr -q + for i in 1 2 3 4 5 6 7 8 9 0 + do + if [ -f /proc/ide/hde/media ] + then + usleep 500000 + mount -t ext2 -o ro /dev/hde1 /mnt + if [ -f /mnt/swapkey ] + then + dmsetup create swap0 /mnt/swapkey > /dev/null 2>&1 && mapped=1 + fi + umount /mnt + break + fi + usleep 500000 + done + killproc /sbin/cardmgr + dmesg -n 6 + if [ $mapped = 1 ] + then + if [ $noresume != 0 ] + then + mkswap /dev/mapper/swap0 > /dev/null 2>&1 + fi + echo 254:0 > /sys/power/resume + dmsetup remove swap0 + fi + umount /sys + mount /mnt + umount /proc + cd /mnt + pivot_root . mnt + mount /proc + umount -l /mnt + umount /proc + exec chroot . /sbin/init $* < dev/console > dev/console 2>&1 + +Please don't mind the weird loop above, busybox's msh doesn't know +the let statement. Now, what is happening in the script? +First we have to decide if we want to try to resume, or not. +We will not resume if booting with "noresume" or any parameters +for init like "single" or "emergency" as boot parameters. + +Then we need to set up dmcrypt with the setup data from the +pcmcia flash disk. If this succeeds we need to reset the swap +device if we don't want to resume. The line "echo 254:0 > /sys/power/resume" +then attempts to resume from the first device mapper device. 
+Note that it is important to set the device in /sys/power/resume, +regardless if resuming or not, otherwise later suspend will fail. +If resume starts, script execution terminates here. + +Otherwise we just remove the encrypted swap device and leave it to the +mini system on /dev/hda1 to set the whole crypto up (it is up to +you to modify this to your taste). + +What then follows is the well known process to change the root +file system and continue booting from there. I prefer to unmount +the initrd prior to continue booting but it is up to you to modify +this. diff --git a/Documentation/power/swsusp-dmcrypt.txt b/Documentation/power/swsusp-dmcrypt.txt deleted file mode 100644 index b802fbfd95ef..000000000000 --- a/Documentation/power/swsusp-dmcrypt.txt +++ /dev/null @@ -1,138 +0,0 @@ -Author: Andreas Steinmetz - - -How to use dm-crypt and swsusp together: -======================================== - -Some prerequisites: -You know how dm-crypt works. If not, visit the following web page: -http://www.saout.de/misc/dm-crypt/ -You have read Documentation/power/swsusp.txt and understand it. -You did read Documentation/admin-guide/initrd.rst and know how an initrd works. -You know how to create or how to modify an initrd. - -Now your system is properly set up, your disk is encrypted except for -the swap device(s) and the boot partition which may contain a mini -system for crypto setup and/or rescue purposes. You may even have -an initrd that does your current crypto setup already. - -At this point you want to encrypt your swap, too. Still you want to -be able to suspend using swsusp. This, however, means that you -have to be able to either enter a passphrase or that you read -the key(s) from an external device like a pcmcia flash disk -or an usb stick prior to resume. So you need an initrd, that sets -up dm-crypt and then asks swsusp to resume from the encrypted -swap device. 
- -The most important thing is that you set up dm-crypt in such -a way that the swap device you suspend to/resume from has -always the same major/minor within the initrd as well as -within your running system. The easiest way to achieve this is -to always set up this swap device first with dmsetup, so that -it will always look like the following: - -brw------- 1 root root 254, 0 Jul 28 13:37 /dev/mapper/swap0 - -Now set up your kernel to use /dev/mapper/swap0 as the default -resume partition, so your kernel .config contains: - -CONFIG_PM_STD_PARTITION="/dev/mapper/swap0" - -Prepare your boot loader to use the initrd you will create or -modify. For lilo the simplest setup looks like the following -lines: - -image=/boot/vmlinuz -initrd=/boot/initrd.gz -label=linux -append="root=/dev/ram0 init=/linuxrc rw" - -Finally you need to create or modify your initrd. Lets assume -you create an initrd that reads the required dm-crypt setup -from a pcmcia flash disk card. The card is formatted with an ext2 -fs which resides on /dev/hde1 when the card is inserted. The -card contains at least the encrypted swap setup in a file -named "swapkey". /etc/fstab of your initrd contains something -like the following: - -/dev/hda1 /mnt ext3 ro 0 0 -none /proc proc defaults,noatime,nodiratime 0 0 -none /sys sysfs defaults,noatime,nodiratime 0 0 - -/dev/hda1 contains an unencrypted mini system that sets up all -of your crypto devices, again by reading the setup from the -pcmcia flash disk. 
What follows now is a /linuxrc for your -initrd that allows you to resume from encrypted swap and that -continues boot with your mini system on /dev/hda1 if resume -does not happen: - -#!/bin/sh -PATH=/sbin:/bin:/usr/sbin:/usr/bin -mount /proc -mount /sys -mapped=0 -noresume=`grep -c noresume /proc/cmdline` -if [ "$*" != "" ] -then - noresume=1 -fi -dmesg -n 1 -/sbin/cardmgr -q -for i in 1 2 3 4 5 6 7 8 9 0 -do - if [ -f /proc/ide/hde/media ] - then - usleep 500000 - mount -t ext2 -o ro /dev/hde1 /mnt - if [ -f /mnt/swapkey ] - then - dmsetup create swap0 /mnt/swapkey > /dev/null 2>&1 && mapped=1 - fi - umount /mnt - break - fi - usleep 500000 -done -killproc /sbin/cardmgr -dmesg -n 6 -if [ $mapped = 1 ] -then - if [ $noresume != 0 ] - then - mkswap /dev/mapper/swap0 > /dev/null 2>&1 - fi - echo 254:0 > /sys/power/resume - dmsetup remove swap0 -fi -umount /sys -mount /mnt -umount /proc -cd /mnt -pivot_root . mnt -mount /proc -umount -l /mnt -umount /proc -exec chroot . /sbin/init $* < dev/console > dev/console 2>&1 - -Please don't mind the weird loop above, busybox's msh doesn't know -the let statement. Now, what is happening in the script? -First we have to decide if we want to try to resume, or not. -We will not resume if booting with "noresume" or any parameters -for init like "single" or "emergency" as boot parameters. - -Then we need to set up dmcrypt with the setup data from the -pcmcia flash disk. If this succeeds we need to reset the swap -device if we don't want to resume. The line "echo 254:0 > /sys/power/resume" -then attempts to resume from the first device mapper device. -Note that it is important to set the device in /sys/power/resume, -regardless if resuming or not, otherwise later suspend will fail. -If resume starts, script execution terminates here. - -Otherwise we just remove the encrypted swap device and leave it to the -mini system on /dev/hda1 to set the whole crypto up (it is up to -you to modify this to your taste). 
- -What then follows is the well known process to change the root -file system and continue booting from there. I prefer to unmount -the initrd prior to continue booting but it is up to you to modify -this. diff --git a/Documentation/power/swsusp.rst b/Documentation/power/swsusp.rst new file mode 100644 index 000000000000..d000312f6965 --- /dev/null +++ b/Documentation/power/swsusp.rst @@ -0,0 +1,501 @@ +============ +Swap suspend +============ + +Some warnings, first. + +.. warning:: + + **BIG FAT WARNING** + + If you touch anything on disk between suspend and resume... + ...kiss your data goodbye. + + If you do resume from initrd after your filesystems are mounted... + ...bye bye root partition. + + [this is actually same case as above] + + If you have unsupported ( ) devices using DMA, you may have some + problems. If your disk driver does not support suspend... (IDE does), + it may cause some problems, too. If you change kernel command line + between suspend and resume, it may do something wrong. If you change + your hardware while system is suspended... well, it was not good idea; + but it will probably only crash. + + ( ) suspend/resume support is needed to make it safe. + + If you have any filesystems on USB devices mounted before software suspend, + they won't be accessible after resume and you may lose data, as though + you have unplugged the USB devices with mounted filesystems on them; + see the FAQ below for details. (This is not true for more traditional + power states like "standby", which normally don't turn USB off.) + +Swap partition: + You need to append resume=/dev/your_swap_partition to kernel command + line or specify it using /sys/power/resume. + +Swap file: + If using a swapfile you can also specify a resume offset using + resume_offset= on the kernel command line or specify it + in /sys/power/resume_offset. 
+ +After preparing then you suspend by:: + + echo shutdown > /sys/power/disk; echo disk > /sys/power/state + +- If you feel ACPI works pretty well on your system, you might try:: + + echo platform > /sys/power/disk; echo disk > /sys/power/state + +- If you would like to write hibernation image to swap and then suspend + to RAM (provided your platform supports it), you can try:: + + echo suspend > /sys/power/disk; echo disk > /sys/power/state + +- If you have SATA disks, you'll need recent kernels with SATA suspend + support. For suspend and resume to work, make sure your disk drivers + are built into kernel -- not modules. [There's way to make + suspend/resume with modular disk drivers, see FAQ, but you probably + should not do that.] + +If you want to limit the suspend image size to N bytes, do:: + + echo N > /sys/power/image_size + +before suspend (it is limited to around 2/5 of available RAM by default). + +- The resume process checks for the presence of the resume device, + if found, it then checks the contents for the hibernation image signature. + If both are found, it resumes the hibernation image. + +- The resume process may be triggered in two ways: + + 1) During lateinit: If resume=/dev/your_swap_partition is specified on + the kernel command line, lateinit runs the resume process. If the + resume device has not been probed yet, the resume process fails and + bootup continues. + 2) Manually from an initrd or initramfs: May be run from + the init script by using the /sys/power/resume file. It is vital + that this be done prior to remounting any filesystems (even as + read-only) otherwise data may be corrupted. + +Article about goals and implementation of Software Suspend for Linux +==================================================================== + +Author: Gábor Kuti +Last revised: 2003-10-20 by Pavel Machek + +Idea and goals to achieve +------------------------- + +Nowadays it is common in several laptops that they have a suspend button. 
It +saves the state of the machine to a filesystem or to a partition and switches +to standby mode. Later resuming the machine the saved state is loaded back to +ram and the machine can continue its work. It has two real benefits. First we +save ourselves the time machine goes down and later boots up, energy costs +are real high when running from batteries. The other gain is that we don't have +to interrupt our programs so processes that are calculating something for a long +time shouldn't need to be written interruptible. + +swsusp saves the state of the machine into active swaps and then reboots or +powerdowns. You must explicitly specify the swap partition to resume from with +`resume=` kernel option. If signature is found it loads and restores saved +state. If the option `noresume` is specified as a boot parameter, it skips +the resuming. If the option `hibernate=nocompress` is specified as a boot +parameter, it saves hibernation image without compression. + +In the meantime while the system is suspended you should not add/remove any +of the hardware, write to the filesystems, etc. + +Sleep states summary +==================== + +There are three different interfaces you can use, /proc/acpi should +work like this: + +In a really perfect world:: + + echo 1 > /proc/acpi/sleep # for standby + echo 2 > /proc/acpi/sleep # for suspend to ram + echo 3 > /proc/acpi/sleep # for suspend to ram, but with more power conservative + echo 4 > /proc/acpi/sleep # for suspend to disk + echo 5 > /proc/acpi/sleep # for shutdown unfriendly the system + +and perhaps:: + + echo 4b > /proc/acpi/sleep # for suspend to disk via s4bios + +Frequently Asked Questions +========================== + +Q: + well, suspending a server is IMHO a really stupid thing, + but... (Diego Zuccato): + +A: + You bought new UPS for your server. How do you install it without + bringing machine down? Suspend to disk, rearrange power cables, + resume. + + You have your server on UPS. 
Power died, and UPS is indicating 30 + seconds to failure. What do you do? Suspend to disk. + + +Q: + Maybe I'm missing something, but why don't the regular I/O paths work? + +A: + We do use the regular I/O paths. However we cannot restore the data + to its original location as we load it. That would create an + inconsistent kernel state which would certainly result in an oops. + Instead, we load the image into unused memory and then atomically copy + it back to it original location. This implies, of course, a maximum + image size of half the amount of memory. + + There are two solutions to this: + + * require half of memory to be free during suspend. That way you can + read "new" data onto free spots, then cli and copy + + * assume we had special "polling" ide driver that only uses memory + between 0-640KB. That way, I'd have to make sure that 0-640KB is free + during suspending, but otherwise it would work... + + suspend2 shares this fundamental limitation, but does not include user + data and disk caches into "used memory" by saving them in + advance. That means that the limitation goes away in practice. + +Q: + Does linux support ACPI S4? + +A: + Yes. That's what echo platform > /sys/power/disk does. + +Q: + What is 'suspend2'? + +A: + suspend2 is 'Software Suspend 2', a forked implementation of + suspend-to-disk which is available as separate patches for 2.4 and 2.6 + kernels from swsusp.sourceforge.net. It includes support for SMP, 4GB + highmem and preemption. It also has a extensible architecture that + allows for arbitrary transformations on the image (compression, + encryption) and arbitrary backends for writing the image (eg to swap + or an NFS share[Work In Progress]). Questions regarding suspend2 + should be sent to the mailing list available through the suspend2 + website, and not to the Linux Kernel Mailing List. We are working + toward merging suspend2 into the mainline kernel. + +Q: + What is the freezing of tasks and why are we using it? 
+ +A: + The freezing of tasks is a mechanism by which user space processes and some + kernel threads are controlled during hibernation or system-wide suspend (on some + architectures). See freezing-of-tasks.txt for details. + +Q: + What is the difference between "platform" and "shutdown"? + +A: + shutdown: + save state in linux, then tell bios to powerdown + + platform: + save state in linux, then tell bios to powerdown and blink + "suspended led" + + "platform" is actually right thing to do where supported, but + "shutdown" is most reliable (except on ACPI systems). + +Q: + I do not understand why you have such strong objections to idea of + selective suspend. + +A: + Do selective suspend during runtime power management, that's okay. But + it's useless for suspend-to-disk. (And I do not see how you could use + it for suspend-to-ram, I hope you do not want that). + + Lets see, so you suggest to + + * SUSPEND all but swap device and parents + * Snapshot + * Write image to disk + * SUSPEND swap device and parents + * Powerdown + + Oh no, that does not work, if swap device or its parents uses DMA, + you've corrupted data. You'd have to do + + * SUSPEND all but swap device and parents + * FREEZE swap device and parents + * Snapshot + * UNFREEZE swap device and parents + * Write + * SUSPEND swap device and parents + + Which means that you still need that FREEZE state, and you get more + complicated code. (And I have not yet introduce details like system + devices). + +Q: + There don't seem to be any generally useful behavioral + distinctions between SUSPEND and FREEZE. + +A: + Doing SUSPEND when you are asked to do FREEZE is always correct, + but it may be unnecessarily slow. If you want your driver to stay simple, + slowness may not matter to you. It can always be fixed later. + + For devices like disk it does matter, you do not want to spindown for + FREEZE. + +Q: + After resuming, system is paging heavily, leading to very bad interactivity. 
+ +A: + Try running:: + + cat /proc/[0-9]*/maps | grep / | sed 's:.* /:/:' | sort -u | while read file + do + test -f "$file" && cat "$file" > /dev/null + done + + after resume. swapoff -a; swapon -a may also be useful. + +Q: + What happens to devices during swsusp? They seem to be resumed + during system suspend? + +A: + That's correct. We need to resume them if we want to write image to + disk. Whole sequence goes like + + **Suspend part** + + running system, user asks for suspend-to-disk + + user processes are stopped + + suspend(PMSG_FREEZE): devices are frozen so that they don't interfere + with state snapshot + + state snapshot: copy of whole used memory is taken with interrupts disabled + + resume(): devices are woken up so that we can write image to swap + + write image to swap + + suspend(PMSG_SUSPEND): suspend devices so that we can power off + + turn the power off + + **Resume part** + + (is actually pretty similar) + + running system, user asks for suspend-to-disk + + user processes are stopped (in common case there are none, + but with resume-from-initrd, no one knows) + + read image from disk + + suspend(PMSG_FREEZE): devices are frozen so that they don't interfere + with image restoration + + image restoration: rewrite memory with image + + resume(): devices are woken up so that system can continue + + thaw all user processes + +Q: + What is this 'Encrypt suspend image' for? + +A: + First of all: it is not a replacement for dm-crypt encrypted swap. + It cannot protect your computer while it is suspended. Instead it does + protect from leaking sensitive data after resume from suspend. + + Think of the following: you suspend while an application is running + that keeps sensitive data in memory. The application itself prevents + the data from being swapped out. Suspend, however, must write these + data to swap to be able to resume later on. Without suspend encryption + your sensitive data are then stored in plaintext on disk. 
This means + that after resume your sensitive data are accessible to all + applications having direct access to the swap device which was used + for suspend. If you don't need swap after resume these data can remain + on disk virtually forever. Thus it can happen that your system gets + broken in weeks later and sensitive data which you thought were + encrypted and protected are retrieved and stolen from the swap device. + To prevent this situation you should use 'Encrypt suspend image'. + + During suspend a temporary key is created and this key is used to + encrypt the data written to disk. When, during resume, the data was + read back into memory the temporary key is destroyed which simply + means that all data written to disk during suspend are then + inaccessible so they can't be stolen later on. The only thing that + you must then take care of is that you call 'mkswap' for the swap + partition used for suspend as early as possible during regular + boot. This asserts that any temporary key from an oopsed suspend or + from a failed or aborted resume is erased from the swap device. + + As a rule of thumb use encrypted swap to protect your data while your + system is shut down or suspended. Additionally use the encrypted + suspend image to prevent sensitive data from being stolen after + resume. + +Q: + Can I suspend to a swap file? + +A: + Generally, yes, you can. However, it requires you to use the "resume=" and + "resume_offset=" kernel command line parameters, so the resume from a swap file + cannot be initiated from an initrd or initramfs image. See + swsusp-and-swap-files.txt for details. + +Q: + Is there a maximum system RAM size that is supported by swsusp? + +A: + It should work okay with highmem. + +Q: + Does swsusp (to disk) use only one swap partition or can it use + multiple swap partitions (aggregate them into one logical space)? + +A: + Only one swap partition, sorry. 
+ +Q: + If my application(s) causes lots of memory & swap space to be used + (over half of the total system RAM), is it correct that it is likely + to be useless to try to suspend to disk while that app is running? + +A: + No, it should work okay, as long as your app does not mlock() + it. Just prepare big enough swap partition. + +Q: + What information is useful for debugging suspend-to-disk problems? + +A: + Well, last messages on the screen are always useful. If something + is broken, it is usually some kernel driver, therefore trying with as + little as possible modules loaded helps a lot. I also prefer people to + suspend from console, preferably without X running. Booting with + init=/bin/bash, then swapon and starting suspend sequence manually + usually does the trick. Then it is good idea to try with latest + vanilla kernel. + +Q: + How can distributions ship a swsusp-supporting kernel with modular + disk drivers (especially SATA)? + +A: + Well, it can be done, load the drivers, then do echo into + /sys/power/resume file from initrd. Be sure not to mount + anything, not even read-only mount, or you are going to lose your + data. + +Q: + How do I make suspend more verbose? + +A: + If you want to see any non-error kernel messages on the virtual + terminal the kernel switches to during suspend, you have to set the + kernel console loglevel to at least 4 (KERN_WARNING), for example by + doing:: + + # save the old loglevel + read LOGLEVEL DUMMY < /proc/sys/kernel/printk + # set the loglevel so we see the progress bar. + # if the level is higher than needed, we leave it alone. + if [ $LOGLEVEL -lt 5 ]; then + echo 5 > /proc/sys/kernel/printk + fi + + IMG_SZ=0 + read IMG_SZ < /sys/power/image_size + echo -n disk > /sys/power/state + RET=$? + # + # the logic here is: + # if image_size > 0 (without kernel support, IMG_SZ will be zero), + # then try again with image_size set to zero. 
+ if [ $RET -ne 0 -a $IMG_SZ -ne 0 ]; then # try again with minimal image size + echo 0 > /sys/power/image_size + echo -n disk > /sys/power/state + RET=$? + fi + + # restore previous loglevel + echo $LOGLEVEL > /proc/sys/kernel/printk + exit $RET + +Q: + Is this true that if I have a mounted filesystem on a USB device and + I suspend to disk, I can lose data unless the filesystem has been mounted + with "sync"? + +A: + That's right ... if you disconnect that device, you may lose data. + In fact, even with "-o sync" you can lose data if your programs have + information in buffers they haven't written out to a disk you disconnect, + or if you disconnect before the device finished saving data you wrote. + + Software suspend normally powers down USB controllers, which is equivalent + to disconnecting all USB devices attached to your system. + + Your system might well support low-power modes for its USB controllers + while the system is asleep, maintaining the connection, using true sleep + modes like "suspend-to-RAM" or "standby". (Don't write "disk" to the + /sys/power/state file; write "standby" or "mem".) We've not seen any + hardware that can use these modes through software suspend, although in + theory some systems might support "platform" modes that won't break the + USB connections. + + Remember that it's always a bad idea to unplug a disk drive containing a + mounted filesystem. That's true even when your system is asleep! The + safest thing is to unmount all filesystems on removable media (such USB, + Firewire, CompactFlash, MMC, external SATA, or even IDE hotplug bays) + before suspending; then remount them after resuming. + + There is a work-around for this problem. For more information, see + Documentation/driver-api/usb/persist.rst. + +Q: + Can I suspend-to-disk using a swap partition under LVM? + +A: + Yes and No. You can suspend successfully, but the kernel will not be able + to resume on its own. 
You need an initramfs that can recognize the resume + situation, activate the logical volume containing the swap volume (but not + touch any filesystems!), and eventually call:: + + echo -n "$major:$minor" > /sys/power/resume + + where $major and $minor are the respective major and minor device numbers of + the swap volume. + + uswsusp works with LVM, too. See http://suspend.sourceforge.net/ + +Q: + I upgraded the kernel from 2.6.15 to 2.6.16. Both kernels were + compiled with the similar configuration files. Anyway I found that + suspend to disk (and resume) is much slower on 2.6.16 compared to + 2.6.15. Any idea for why that might happen or how can I speed it up? + +A: + This is because the size of the suspend image is now greater than + for 2.6.15 (by saving more data we can get more responsive system + after resume). + + There's the /sys/power/image_size knob that controls the size of the + image. If you set it to 0 (eg. by echo 0 > /sys/power/image_size as + root), the 2.6.15 behavior should be restored. If it is still too + slow, take a look at suspend.sf.net -- userland suspend is faster and + supports LZF compression to speed it up further. diff --git a/Documentation/power/swsusp.txt b/Documentation/power/swsusp.txt deleted file mode 100644 index 236d1fb13640..000000000000 --- a/Documentation/power/swsusp.txt +++ /dev/null @@ -1,446 +0,0 @@ -Some warnings, first. - - * BIG FAT WARNING ********************************************************* - * - * If you touch anything on disk between suspend and resume... - * ...kiss your data goodbye. - * - * If you do resume from initrd after your filesystems are mounted... - * ...bye bye root partition. - * [this is actually same case as above] - * - * If you have unsupported (*) devices using DMA, you may have some - * problems. If your disk driver does not support suspend... (IDE does), - * it may cause some problems, too. If you change kernel command line - * between suspend and resume, it may do something wrong. 
If you change - * your hardware while system is suspended... well, it was not good idea; - * but it will probably only crash. - * - * (*) suspend/resume support is needed to make it safe. - * - * If you have any filesystems on USB devices mounted before software suspend, - * they won't be accessible after resume and you may lose data, as though - * you have unplugged the USB devices with mounted filesystems on them; - * see the FAQ below for details. (This is not true for more traditional - * power states like "standby", which normally don't turn USB off.) - -Swap partition: -You need to append resume=/dev/your_swap_partition to kernel command -line or specify it using /sys/power/resume. - -Swap file: -If using a swapfile you can also specify a resume offset using -resume_offset= on the kernel command line or specify it -in /sys/power/resume_offset. - -After preparing then you suspend by - -echo shutdown > /sys/power/disk; echo disk > /sys/power/state - -. If you feel ACPI works pretty well on your system, you might try - -echo platform > /sys/power/disk; echo disk > /sys/power/state - -. If you would like to write hibernation image to swap and then suspend -to RAM (provided your platform supports it), you can try - -echo suspend > /sys/power/disk; echo disk > /sys/power/state - -. If you have SATA disks, you'll need recent kernels with SATA suspend -support. For suspend and resume to work, make sure your disk drivers -are built into kernel -- not modules. [There's way to make -suspend/resume with modular disk drivers, see FAQ, but you probably -should not do that.] - -If you want to limit the suspend image size to N bytes, do - -echo N > /sys/power/image_size - -before suspend (it is limited to around 2/5 of available RAM by default). - -. The resume process checks for the presence of the resume device, -if found, it then checks the contents for the hibernation image signature. -If both are found, it resumes the hibernation image. - -. 
The resume process may be triggered in two ways: - 1) During lateinit: If resume=/dev/your_swap_partition is specified on - the kernel command line, lateinit runs the resume process. If the - resume device has not been probed yet, the resume process fails and - bootup continues. - 2) Manually from an initrd or initramfs: May be run from - the init script by using the /sys/power/resume file. It is vital - that this be done prior to remounting any filesystems (even as - read-only) otherwise data may be corrupted. - -Article about goals and implementation of Software Suspend for Linux -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Author: Gábor Kuti -Last revised: 2003-10-20 by Pavel Machek - -Idea and goals to achieve - -Nowadays it is common in several laptops that they have a suspend button. It -saves the state of the machine to a filesystem or to a partition and switches -to standby mode. Later resuming the machine the saved state is loaded back to -ram and the machine can continue its work. It has two real benefits. First we -save ourselves the time machine goes down and later boots up, energy costs -are real high when running from batteries. The other gain is that we don't have to -interrupt our programs so processes that are calculating something for a long -time shouldn't need to be written interruptible. - -swsusp saves the state of the machine into active swaps and then reboots or -powerdowns. You must explicitly specify the swap partition to resume from with -``resume='' kernel option. If signature is found it loads and restores saved -state. If the option ``noresume'' is specified as a boot parameter, it skips -the resuming. If the option ``hibernate=nocompress'' is specified as a boot -parameter, it saves hibernation image without compression. - -In the meantime while the system is suspended you should not add/remove any -of the hardware, write to the filesystems, etc. 
- -Sleep states summary -==================== - -There are three different interfaces you can use, /proc/acpi should -work like this: - -In a really perfect world: -echo 1 > /proc/acpi/sleep # for standby -echo 2 > /proc/acpi/sleep # for suspend to ram -echo 3 > /proc/acpi/sleep # for suspend to ram, but with more power conservative -echo 4 > /proc/acpi/sleep # for suspend to disk -echo 5 > /proc/acpi/sleep # for shutdown unfriendly the system - -and perhaps -echo 4b > /proc/acpi/sleep # for suspend to disk via s4bios - -Frequently Asked Questions -========================== - -Q: well, suspending a server is IMHO a really stupid thing, -but... (Diego Zuccato): - -A: You bought new UPS for your server. How do you install it without -bringing machine down? Suspend to disk, rearrange power cables, -resume. - -You have your server on UPS. Power died, and UPS is indicating 30 -seconds to failure. What do you do? Suspend to disk. - - -Q: Maybe I'm missing something, but why don't the regular I/O paths work? - -A: We do use the regular I/O paths. However we cannot restore the data -to its original location as we load it. That would create an -inconsistent kernel state which would certainly result in an oops. -Instead, we load the image into unused memory and then atomically copy -it back to it original location. This implies, of course, a maximum -image size of half the amount of memory. - -There are two solutions to this: - -* require half of memory to be free during suspend. That way you can -read "new" data onto free spots, then cli and copy - -* assume we had special "polling" ide driver that only uses memory -between 0-640KB. That way, I'd have to make sure that 0-640KB is free -during suspending, but otherwise it would work... - -suspend2 shares this fundamental limitation, but does not include user -data and disk caches into "used memory" by saving them in -advance. That means that the limitation goes away in practice. - -Q: Does linux support ACPI S4? - -A: Yes. 
That's what echo platform > /sys/power/disk does. - -Q: What is 'suspend2'? - -A: suspend2 is 'Software Suspend 2', a forked implementation of -suspend-to-disk which is available as separate patches for 2.4 and 2.6 -kernels from swsusp.sourceforge.net. It includes support for SMP, 4GB -highmem and preemption. It also has a extensible architecture that -allows for arbitrary transformations on the image (compression, -encryption) and arbitrary backends for writing the image (eg to swap -or an NFS share[Work In Progress]). Questions regarding suspend2 -should be sent to the mailing list available through the suspend2 -website, and not to the Linux Kernel Mailing List. We are working -toward merging suspend2 into the mainline kernel. - -Q: What is the freezing of tasks and why are we using it? - -A: The freezing of tasks is a mechanism by which user space processes and some -kernel threads are controlled during hibernation or system-wide suspend (on some -architectures). See freezing-of-tasks.txt for details. - -Q: What is the difference between "platform" and "shutdown"? - -A: - -shutdown: save state in linux, then tell bios to powerdown - -platform: save state in linux, then tell bios to powerdown and blink - "suspended led" - -"platform" is actually right thing to do where supported, but -"shutdown" is most reliable (except on ACPI systems). - -Q: I do not understand why you have such strong objections to idea of -selective suspend. - -A: Do selective suspend during runtime power management, that's okay. But -it's useless for suspend-to-disk. (And I do not see how you could use -it for suspend-to-ram, I hope you do not want that). - -Lets see, so you suggest to - -* SUSPEND all but swap device and parents -* Snapshot -* Write image to disk -* SUSPEND swap device and parents -* Powerdown - -Oh no, that does not work, if swap device or its parents uses DMA, -you've corrupted data. 
You'd have to do - -* SUSPEND all but swap device and parents -* FREEZE swap device and parents -* Snapshot -* UNFREEZE swap device and parents -* Write -* SUSPEND swap device and parents - -Which means that you still need that FREEZE state, and you get more -complicated code. (And I have not yet introduce details like system -devices). - -Q: There don't seem to be any generally useful behavioral -distinctions between SUSPEND and FREEZE. - -A: Doing SUSPEND when you are asked to do FREEZE is always correct, -but it may be unnecessarily slow. If you want your driver to stay simple, -slowness may not matter to you. It can always be fixed later. - -For devices like disk it does matter, you do not want to spindown for -FREEZE. - -Q: After resuming, system is paging heavily, leading to very bad interactivity. - -A: Try running - -cat /proc/[0-9]*/maps | grep / | sed 's:.* /:/:' | sort -u | while read file -do - test -f "$file" && cat "$file" > /dev/null -done - -after resume. swapoff -a; swapon -a may also be useful. - -Q: What happens to devices during swsusp? They seem to be resumed -during system suspend? - -A: That's correct. We need to resume them if we want to write image to -disk. 
Whole sequence goes like - - Suspend part - ~~~~~~~~~~~~ - running system, user asks for suspend-to-disk - - user processes are stopped - - suspend(PMSG_FREEZE): devices are frozen so that they don't interfere - with state snapshot - - state snapshot: copy of whole used memory is taken with interrupts disabled - - resume(): devices are woken up so that we can write image to swap - - write image to swap - - suspend(PMSG_SUSPEND): suspend devices so that we can power off - - turn the power off - - Resume part - ~~~~~~~~~~~ - (is actually pretty similar) - - running system, user asks for suspend-to-disk - - user processes are stopped (in common case there are none, but with resume-from-initrd, no one knows) - - read image from disk - - suspend(PMSG_FREEZE): devices are frozen so that they don't interfere - with image restoration - - image restoration: rewrite memory with image - - resume(): devices are woken up so that system can continue - - thaw all user processes - -Q: What is this 'Encrypt suspend image' for? - -A: First of all: it is not a replacement for dm-crypt encrypted swap. -It cannot protect your computer while it is suspended. Instead it does -protect from leaking sensitive data after resume from suspend. - -Think of the following: you suspend while an application is running -that keeps sensitive data in memory. The application itself prevents -the data from being swapped out. Suspend, however, must write these -data to swap to be able to resume later on. Without suspend encryption -your sensitive data are then stored in plaintext on disk. This means -that after resume your sensitive data are accessible to all -applications having direct access to the swap device which was used -for suspend. If you don't need swap after resume these data can remain -on disk virtually forever. Thus it can happen that your system gets -broken in weeks later and sensitive data which you thought were -encrypted and protected are retrieved and stolen from the swap device. 
-To prevent this situation you should use 'Encrypt suspend image'. - -During suspend a temporary key is created and this key is used to -encrypt the data written to disk. When, during resume, the data was -read back into memory the temporary key is destroyed which simply -means that all data written to disk during suspend are then -inaccessible so they can't be stolen later on. The only thing that -you must then take care of is that you call 'mkswap' for the swap -partition used for suspend as early as possible during regular -boot. This asserts that any temporary key from an oopsed suspend or -from a failed or aborted resume is erased from the swap device. - -As a rule of thumb use encrypted swap to protect your data while your -system is shut down or suspended. Additionally use the encrypted -suspend image to prevent sensitive data from being stolen after -resume. - -Q: Can I suspend to a swap file? - -A: Generally, yes, you can. However, it requires you to use the "resume=" and -"resume_offset=" kernel command line parameters, so the resume from a swap file -cannot be initiated from an initrd or initramfs image. See -swsusp-and-swap-files.txt for details. - -Q: Is there a maximum system RAM size that is supported by swsusp? - -A: It should work okay with highmem. - -Q: Does swsusp (to disk) use only one swap partition or can it use -multiple swap partitions (aggregate them into one logical space)? - -A: Only one swap partition, sorry. - -Q: If my application(s) causes lots of memory & swap space to be used -(over half of the total system RAM), is it correct that it is likely -to be useless to try to suspend to disk while that app is running? - -A: No, it should work okay, as long as your app does not mlock() -it. Just prepare big enough swap partition. - -Q: What information is useful for debugging suspend-to-disk problems? - -A: Well, last messages on the screen are always useful. 
If something -is broken, it is usually some kernel driver, therefore trying with as -little as possible modules loaded helps a lot. I also prefer people to -suspend from console, preferably without X running. Booting with -init=/bin/bash, then swapon and starting suspend sequence manually -usually does the trick. Then it is good idea to try with latest -vanilla kernel. - -Q: How can distributions ship a swsusp-supporting kernel with modular -disk drivers (especially SATA)? - -A: Well, it can be done, load the drivers, then do echo into -/sys/power/resume file from initrd. Be sure not to mount -anything, not even read-only mount, or you are going to lose your -data. - -Q: How do I make suspend more verbose? - -A: If you want to see any non-error kernel messages on the virtual -terminal the kernel switches to during suspend, you have to set the -kernel console loglevel to at least 4 (KERN_WARNING), for example by -doing - - # save the old loglevel - read LOGLEVEL DUMMY < /proc/sys/kernel/printk - # set the loglevel so we see the progress bar. - # if the level is higher than needed, we leave it alone. - if [ $LOGLEVEL -lt 5 ]; then - echo 5 > /proc/sys/kernel/printk - fi - - IMG_SZ=0 - read IMG_SZ < /sys/power/image_size - echo -n disk > /sys/power/state - RET=$? - # - # the logic here is: - # if image_size > 0 (without kernel support, IMG_SZ will be zero), - # then try again with image_size set to zero. - if [ $RET -ne 0 -a $IMG_SZ -ne 0 ]; then # try again with minimal image size - echo 0 > /sys/power/image_size - echo -n disk > /sys/power/state - RET=$? - fi - - # restore previous loglevel - echo $LOGLEVEL > /proc/sys/kernel/printk - exit $RET - -Q: Is this true that if I have a mounted filesystem on a USB device and -I suspend to disk, I can lose data unless the filesystem has been mounted -with "sync"? - -A: That's right ... if you disconnect that device, you may lose data. 
-In fact, even with "-o sync" you can lose data if your programs have -information in buffers they haven't written out to a disk you disconnect, -or if you disconnect before the device finished saving data you wrote. - -Software suspend normally powers down USB controllers, which is equivalent -to disconnecting all USB devices attached to your system. - -Your system might well support low-power modes for its USB controllers -while the system is asleep, maintaining the connection, using true sleep -modes like "suspend-to-RAM" or "standby". (Don't write "disk" to the -/sys/power/state file; write "standby" or "mem".) We've not seen any -hardware that can use these modes through software suspend, although in -theory some systems might support "platform" modes that won't break the -USB connections. - -Remember that it's always a bad idea to unplug a disk drive containing a -mounted filesystem. That's true even when your system is asleep! The -safest thing is to unmount all filesystems on removable media (such USB, -Firewire, CompactFlash, MMC, external SATA, or even IDE hotplug bays) -before suspending; then remount them after resuming. - -There is a work-around for this problem. For more information, see -Documentation/driver-api/usb/persist.rst. - -Q: Can I suspend-to-disk using a swap partition under LVM? - -A: Yes and No. You can suspend successfully, but the kernel will not be able -to resume on its own. You need an initramfs that can recognize the resume -situation, activate the logical volume containing the swap volume (but not -touch any filesystems!), and eventually call - -echo -n "$major:$minor" > /sys/power/resume - -where $major and $minor are the respective major and minor device numbers of -the swap volume. - -uswsusp works with LVM, too. See http://suspend.sourceforge.net/ - -Q: I upgraded the kernel from 2.6.15 to 2.6.16. Both kernels were -compiled with the similar configuration files. 
Anyway I found that -suspend to disk (and resume) is much slower on 2.6.16 compared to -2.6.15. Any idea for why that might happen or how can I speed it up? - -A: This is because the size of the suspend image is now greater than -for 2.6.15 (by saving more data we can get more responsive system -after resume). - -There's the /sys/power/image_size knob that controls the size of the -image. If you set it to 0 (eg. by echo 0 > /sys/power/image_size as -root), the 2.6.15 behavior should be restored. If it is still too -slow, take a look at suspend.sf.net -- userland suspend is faster and -supports LZF compression to speed it up further. diff --git a/Documentation/power/tricks.rst b/Documentation/power/tricks.rst new file mode 100644 index 000000000000..ca787f142c3f --- /dev/null +++ b/Documentation/power/tricks.rst @@ -0,0 +1,29 @@ +================ +swsusp/S3 tricks +================ + +Pavel Machek + +If you want to trick swsusp/S3 into working, you might want to try: + +* go with minimal config, turn off drivers like USB, AGP you don't + really need + +* turn off APIC and preempt + +* use ext2. At least it has working fsck. [If something seems to go + wrong, force fsck when you have a chance] + +* turn off modules + +* use vga text console, shut down X. [If you really want X, you might + want to try vesafb later] + +* try running as few processes as possible, preferably go to single + user mode. + +* due to video issues, swsusp should be easier to get working than + S3. Try that first. + +When you make it work, try to find out what exactly was it that broke +suspend, and preferably fix that. 
diff --git a/Documentation/power/tricks.txt b/Documentation/power/tricks.txt deleted file mode 100644 index a1b8f7249f4c..000000000000 --- a/Documentation/power/tricks.txt +++ /dev/null @@ -1,27 +0,0 @@ - swsusp/S3 tricks - ~~~~~~~~~~~~~~~~ -Pavel Machek - -If you want to trick swsusp/S3 into working, you might want to try: - -* go with minimal config, turn off drivers like USB, AGP you don't - really need - -* turn off APIC and preempt - -* use ext2. At least it has working fsck. [If something seems to go - wrong, force fsck when you have a chance] - -* turn off modules - -* use vga text console, shut down X. [If you really want X, you might - want to try vesafb later] - -* try running as few processes as possible, preferably go to single - user mode. - -* due to video issues, swsusp should be easier to get working than - S3. Try that first. - -When you make it work, try to find out what exactly was it that broke -suspend, and preferably fix that. diff --git a/Documentation/power/userland-swsusp.rst b/Documentation/power/userland-swsusp.rst new file mode 100644 index 000000000000..a0fa51bb1a4d --- /dev/null +++ b/Documentation/power/userland-swsusp.rst @@ -0,0 +1,191 @@ +===================================================== +Documentation for userland software suspend interface +===================================================== + + (C) 2006 Rafael J. Wysocki + +First, the warnings at the beginning of swsusp.txt still apply. + +Second, you should read the FAQ in swsusp.txt _now_ if you have not +done it already. + +Now, to use the userland interface for software suspend you need special +utilities that will read/write the system memory snapshot from/to the +kernel. Such utilities are available, for example, from +. You may want to have a look at them if you +are going to develop your own suspend/resume utilities. 
+ +The interface consists of a character device providing the open(), +release(), read(), and write() operations as well as several ioctl() +commands defined in include/linux/suspend_ioctls.h . The major and minor +numbers of the device are, respectively, 10 and 231, and they can +be read from /sys/class/misc/snapshot/dev. + +The device can be open either for reading or for writing. If open for +reading, it is considered to be in the suspend mode. Otherwise it is +assumed to be in the resume mode. The device cannot be open for simultaneous +reading and writing. It is also impossible to have the device open more than +once at a time. + +Even opening the device has side effects. Data structures are +allocated, and PM_HIBERNATION_PREPARE / PM_RESTORE_PREPARE chains are +called. + +The ioctl() commands recognized by the device are: + +SNAPSHOT_FREEZE + freeze user space processes (the current process is + not frozen); this is required for SNAPSHOT_CREATE_IMAGE + and SNAPSHOT_ATOMIC_RESTORE to succeed + +SNAPSHOT_UNFREEZE + thaw user space processes frozen by SNAPSHOT_FREEZE + +SNAPSHOT_CREATE_IMAGE + create a snapshot of the system memory; the + last argument of ioctl() should be a pointer to an int variable, + the value of which will indicate whether the call returned after + creating the snapshot (1) or after restoring the system memory state + from it (0) (after resume the system finds itself finishing the + SNAPSHOT_CREATE_IMAGE ioctl() again); after the snapshot + has been created the read() operation can be used to transfer + it out of the kernel + +SNAPSHOT_ATOMIC_RESTORE + restore the system memory state from the + uploaded snapshot image; before calling it you should transfer + the system memory snapshot back to the kernel using the write() + operation; this call will not succeed if the snapshot + image is not available to the kernel + +SNAPSHOT_FREE + free memory allocated for the snapshot image + +SNAPSHOT_PREF_IMAGE_SIZE + set the preferred maximum size of 
the image + (the kernel will do its best to ensure the image size will not exceed + this number, but if it turns out to be impossible, the kernel will + create the smallest image possible) + +SNAPSHOT_GET_IMAGE_SIZE + return the actual size of the hibernation image + +SNAPSHOT_AVAIL_SWAP_SIZE + return the amount of available swap in bytes (the + last argument should be a pointer to an unsigned int variable that will + contain the result if the call is successful). + +SNAPSHOT_ALLOC_SWAP_PAGE + allocate a swap page from the resume partition + (the last argument should be a pointer to a loff_t variable that + will contain the swap page offset if the call is successful) + +SNAPSHOT_FREE_SWAP_PAGES + free all swap pages allocated by + SNAPSHOT_ALLOC_SWAP_PAGE + +SNAPSHOT_SET_SWAP_AREA + set the resume partition and the offset (in + units) from the beginning of the partition at which the swap header is + located (the last ioctl() argument should point to a struct + resume_swap_area, as defined in kernel/power/suspend_ioctls.h, + containing the resume device specification and the offset); for swap + partitions the offset is always 0, but it is different from zero for + swap files (see Documentation/power/swsusp-and-swap-files.rst for + details). + +SNAPSHOT_PLATFORM_SUPPORT + enable/disable the hibernation platform support, + depending on the argument value (enable, if the argument is nonzero) + +SNAPSHOT_POWER_OFF + make the kernel transition the system to the hibernation + state (eg. ACPI S4) using the platform (eg. ACPI) driver + +SNAPSHOT_S2RAM + suspend to RAM; using this call causes the kernel to + immediately enter the suspend-to-RAM state, so this call must always + be preceded by the SNAPSHOT_FREEZE call and it is also necessary + to use the SNAPSHOT_UNFREEZE call after the system wakes up. 
This call + is needed to implement the suspend-to-both mechanism in which the + suspend image is first created, as though the system had been suspended + to disk, and then the system is suspended to RAM (this makes it possible + to resume the system from RAM if there's enough battery power or restore + its state on the basis of the saved suspend image otherwise) + +The device's read() operation can be used to transfer the snapshot image from +the kernel. It has the following limitations: + +- you cannot read() more than one virtual memory page at a time +- read()s across page boundaries are impossible (ie. if you read() 1/2 of + a page in the previous call, you will only be able to read() + **at most** 1/2 of the page in the next call) + +The device's write() operation is used for uploading the system memory snapshot +into the kernel. It has the same limitations as the read() operation. + +The release() operation frees all memory allocated for the snapshot image +and all swap pages allocated with SNAPSHOT_ALLOC_SWAP_PAGE (if any). +Thus it is not necessary to use either SNAPSHOT_FREE or +SNAPSHOT_FREE_SWAP_PAGES before closing the device (in fact it will also +unfreeze user space processes frozen by SNAPSHOT_UNFREEZE if they are +still frozen when the device is being closed). + +Currently it is assumed that the userland utilities reading/writing the +snapshot image from/to the kernel will use a swap partition, called the resume +partition, or a swap file as storage space (if a swap file is used, the resume +partition is the partition that holds this file). However, this is not really +required, as they can use, for example, a special (blank) suspend partition or +a file on a partition that is unmounted before SNAPSHOT_CREATE_IMAGE and +mounted afterwards. + +These utilities MUST NOT make any assumptions regarding the ordering of +data within the snapshot image. 
The contents of the image are entirely owned +by the kernel and its structure may be changed in future kernel releases. + +The snapshot image MUST be written to the kernel unaltered (ie. all of the image +data, metadata and header MUST be written in _exactly_ the same amount, form +and order in which they have been read). Otherwise, the behavior of the +resumed system may be totally unpredictable. + +While executing SNAPSHOT_ATOMIC_RESTORE the kernel checks if the +structure of the snapshot image is consistent with the information stored +in the image header. If any inconsistencies are detected, +SNAPSHOT_ATOMIC_RESTORE will not succeed. Still, this is not a fool-proof +mechanism and the userland utilities using the interface SHOULD use additional +means, such as checksums, to ensure the integrity of the snapshot image. + +The suspending and resuming utilities MUST lock themselves in memory, +preferably using mlockall(), before calling SNAPSHOT_FREEZE. + +The suspending utility MUST check the value stored by SNAPSHOT_CREATE_IMAGE +in the memory location pointed to by the last argument of ioctl() and proceed +in accordance with it: + +1. If the value is 1 (ie. the system memory snapshot has just been + created and the system is ready for saving it): + + (a) The suspending utility MUST NOT close the snapshot device + _unless_ the whole suspend procedure is to be cancelled, in + which case, if the snapshot image has already been saved, the + suspending utility SHOULD destroy it, preferably by zapping + its header. If the suspend is not to be cancelled, the + system MUST be powered off or rebooted after the snapshot + image has been saved. + (b) The suspending utility SHOULD NOT attempt to perform any + file system operations (including reads) on the file systems + that were mounted before SNAPSHOT_CREATE_IMAGE has been + called. However, it MAY mount a file system that was not + mounted at that time and perform some operations on it (eg. 
+ use it for saving the image). + +2. If the value is 0 (ie. the system state has just been restored from + the snapshot image), the suspending utility MUST close the snapshot + device. Afterwards it will be treated as a regular userland process, + so it need not exit. + +The resuming utility SHOULD NOT attempt to mount any file systems that could +be mounted before suspend and SHOULD NOT attempt to perform any operations +involving such file systems. + +For details, please refer to the source code. diff --git a/Documentation/power/userland-swsusp.txt b/Documentation/power/userland-swsusp.txt deleted file mode 100644 index bbfcd1bbedc5..000000000000 --- a/Documentation/power/userland-swsusp.txt +++ /dev/null @@ -1,170 +0,0 @@ -Documentation for userland software suspend interface - (C) 2006 Rafael J. Wysocki - -First, the warnings at the beginning of swsusp.txt still apply. - -Second, you should read the FAQ in swsusp.txt _now_ if you have not -done it already. - -Now, to use the userland interface for software suspend you need special -utilities that will read/write the system memory snapshot from/to the -kernel. Such utilities are available, for example, from -. You may want to have a look at them if you -are going to develop your own suspend/resume utilities. - -The interface consists of a character device providing the open(), -release(), read(), and write() operations as well as several ioctl() -commands defined in include/linux/suspend_ioctls.h . The major and minor -numbers of the device are, respectively, 10 and 231, and they can -be read from /sys/class/misc/snapshot/dev. - -The device can be open either for reading or for writing. If open for -reading, it is considered to be in the suspend mode. Otherwise it is -assumed to be in the resume mode. The device cannot be open for simultaneous -reading and writing. It is also impossible to have the device open more than -once at a time. - -Even opening the device has side effects. 
Data structures are -allocated, and PM_HIBERNATION_PREPARE / PM_RESTORE_PREPARE chains are -called. - -The ioctl() commands recognized by the device are: - -SNAPSHOT_FREEZE - freeze user space processes (the current process is - not frozen); this is required for SNAPSHOT_CREATE_IMAGE - and SNAPSHOT_ATOMIC_RESTORE to succeed - -SNAPSHOT_UNFREEZE - thaw user space processes frozen by SNAPSHOT_FREEZE - -SNAPSHOT_CREATE_IMAGE - create a snapshot of the system memory; the - last argument of ioctl() should be a pointer to an int variable, - the value of which will indicate whether the call returned after - creating the snapshot (1) or after restoring the system memory state - from it (0) (after resume the system finds itself finishing the - SNAPSHOT_CREATE_IMAGE ioctl() again); after the snapshot - has been created the read() operation can be used to transfer - it out of the kernel - -SNAPSHOT_ATOMIC_RESTORE - restore the system memory state from the - uploaded snapshot image; before calling it you should transfer - the system memory snapshot back to the kernel using the write() - operation; this call will not succeed if the snapshot - image is not available to the kernel - -SNAPSHOT_FREE - free memory allocated for the snapshot image - -SNAPSHOT_PREF_IMAGE_SIZE - set the preferred maximum size of the image - (the kernel will do its best to ensure the image size will not exceed - this number, but if it turns out to be impossible, the kernel will - create the smallest image possible) - -SNAPSHOT_GET_IMAGE_SIZE - return the actual size of the hibernation image - -SNAPSHOT_AVAIL_SWAP_SIZE - return the amount of available swap in bytes (the - last argument should be a pointer to an unsigned int variable that will - contain the result if the call is successful). 
- -SNAPSHOT_ALLOC_SWAP_PAGE - allocate a swap page from the resume partition - (the last argument should be a pointer to a loff_t variable that - will contain the swap page offset if the call is successful) - -SNAPSHOT_FREE_SWAP_PAGES - free all swap pages allocated by - SNAPSHOT_ALLOC_SWAP_PAGE - -SNAPSHOT_SET_SWAP_AREA - set the resume partition and the offset (in - units) from the beginning of the partition at which the swap header is - located (the last ioctl() argument should point to a struct - resume_swap_area, as defined in kernel/power/suspend_ioctls.h, - containing the resume device specification and the offset); for swap - partitions the offset is always 0, but it is different from zero for - swap files (see Documentation/power/swsusp-and-swap-files.txt for - details). - -SNAPSHOT_PLATFORM_SUPPORT - enable/disable the hibernation platform support, - depending on the argument value (enable, if the argument is nonzero) - -SNAPSHOT_POWER_OFF - make the kernel transition the system to the hibernation - state (eg. ACPI S4) using the platform (eg. ACPI) driver - -SNAPSHOT_S2RAM - suspend to RAM; using this call causes the kernel to - immediately enter the suspend-to-RAM state, so this call must always - be preceded by the SNAPSHOT_FREEZE call and it is also necessary - to use the SNAPSHOT_UNFREEZE call after the system wakes up. This call - is needed to implement the suspend-to-both mechanism in which the - suspend image is first created, as though the system had been suspended - to disk, and then the system is suspended to RAM (this makes it possible - to resume the system from RAM if there's enough battery power or restore - its state on the basis of the saved suspend image otherwise) - -The device's read() operation can be used to transfer the snapshot image from -the kernel. It has the following limitations: -- you cannot read() more than one virtual memory page at a time -- read()s across page boundaries are impossible (ie. 
if you read() 1/2 of - a page in the previous call, you will only be able to read() - _at_ _most_ 1/2 of the page in the next call) - -The device's write() operation is used for uploading the system memory snapshot -into the kernel. It has the same limitations as the read() operation. - -The release() operation frees all memory allocated for the snapshot image -and all swap pages allocated with SNAPSHOT_ALLOC_SWAP_PAGE (if any). -Thus it is not necessary to use either SNAPSHOT_FREE or -SNAPSHOT_FREE_SWAP_PAGES before closing the device (in fact it will also -unfreeze user space processes frozen by SNAPSHOT_UNFREEZE if they are -still frozen when the device is being closed). - -Currently it is assumed that the userland utilities reading/writing the -snapshot image from/to the kernel will use a swap partition, called the resume -partition, or a swap file as storage space (if a swap file is used, the resume -partition is the partition that holds this file). However, this is not really -required, as they can use, for example, a special (blank) suspend partition or -a file on a partition that is unmounted before SNAPSHOT_CREATE_IMAGE and -mounted afterwards. - -These utilities MUST NOT make any assumptions regarding the ordering of -data within the snapshot image. The contents of the image are entirely owned -by the kernel and its structure may be changed in future kernel releases. - -The snapshot image MUST be written to the kernel unaltered (ie. all of the image -data, metadata and header MUST be written in _exactly_ the same amount, form -and order in which they have been read). Otherwise, the behavior of the -resumed system may be totally unpredictable. - -While executing SNAPSHOT_ATOMIC_RESTORE the kernel checks if the -structure of the snapshot image is consistent with the information stored -in the image header. If any inconsistencies are detected, -SNAPSHOT_ATOMIC_RESTORE will not succeed. 
Still, this is not a fool-proof -mechanism and the userland utilities using the interface SHOULD use additional -means, such as checksums, to ensure the integrity of the snapshot image. - -The suspending and resuming utilities MUST lock themselves in memory, -preferably using mlockall(), before calling SNAPSHOT_FREEZE. - -The suspending utility MUST check the value stored by SNAPSHOT_CREATE_IMAGE -in the memory location pointed to by the last argument of ioctl() and proceed -in accordance with it: -1. If the value is 1 (ie. the system memory snapshot has just been - created and the system is ready for saving it): - (a) The suspending utility MUST NOT close the snapshot device - _unless_ the whole suspend procedure is to be cancelled, in - which case, if the snapshot image has already been saved, the - suspending utility SHOULD destroy it, preferably by zapping - its header. If the suspend is not to be cancelled, the - system MUST be powered off or rebooted after the snapshot - image has been saved. - (b) The suspending utility SHOULD NOT attempt to perform any - file system operations (including reads) on the file systems - that were mounted before SNAPSHOT_CREATE_IMAGE has been - called. However, it MAY mount a file system that was not - mounted at that time and perform some operations on it (eg. - use it for saving the image). -2. If the value is 0 (ie. the system state has just been restored from - the snapshot image), the suspending utility MUST close the snapshot - device. Afterwards it will be treated as a regular userland process, - so it need not exit. - -The resuming utility SHOULD NOT attempt to mount any file systems that could -be mounted before suspend and SHOULD NOT attempt to perform any operations -involving such file systems. - -For details, please refer to the source code. 
diff --git a/Documentation/power/video.rst b/Documentation/power/video.rst new file mode 100644 index 000000000000..337a2ba9f32f --- /dev/null +++ b/Documentation/power/video.rst @@ -0,0 +1,213 @@ +=========================== +Video issues with S3 resume +=========================== + +2003-2006, Pavel Machek + +During S3 resume, hardware needs to be reinitialized. For most +devices, this is easy, and kernel driver knows how to do +it. Unfortunately there's one exception: video card. Those are usually +initialized by BIOS, and kernel does not have enough information to +boot video card. (Kernel usually does not even contain video card +driver -- vesafb and vgacon are widely used). + +This is not problem for swsusp, because during swsusp resume, BIOS is +run normally so video card is normally initialized. It should not be +problem for S1 standby, because hardware should retain its state over +that. + +We either have to run video BIOS during early resume, or interpret it +using vbetool later, or maybe nothing is necessary on particular +system because video state is preserved. Unfortunately different +methods work on different systems, and no known method suits all of +them. + +Userland application called s2ram has been developed; it contains long +whitelist of systems, and automatically selects working method for a +given system. It can be downloaded from CVS at +www.sf.net/projects/suspend . If you get a system that is not in the +whitelist, please try to find a working solution, and submit whitelist +entry so that work does not need to be repeated. + +Currently, VBE_SAVE method (6 below) works on most +systems. Unfortunately, vbetool only runs after userland is resumed, +so it makes debugging of early resume problems +hard/impossible. Methods that do not rely on userland are preferable. + +Details +~~~~~~~ + +There are a few types of systems where video works after S3 resume: + +(1) systems where video state is preserved over S3. 
+ +(2) systems where it is possible to call the video BIOS during S3 + resume. Unfortunately, it is not correct to call the video BIOS at + that point, but it happens to work on some machines. Use + acpi_sleep=s3_bios. + +(3) systems that initialize video card into vga text mode and where + the BIOS works well enough to be able to set video mode. Use + acpi_sleep=s3_mode on these. + +(4) on some systems s3_bios kicks video into text mode, and + acpi_sleep=s3_bios,s3_mode is needed. + +(5) radeon systems, where X can soft-boot your video card. You'll need + a new enough X, and a plain text console (no vesafb or radeonfb). See + http://www.doesi.gmxhome.de/linux/tm800s3/s3.html for more information. + Alternatively, you should use vbetool (6) instead. + +(6) other radeon systems, where vbetool is enough to bring system back + to life. It needs text console to be working. Do vbetool vbestate + save > /tmp/delme; echo 3 > /proc/acpi/sleep; vbetool post; vbetool + vbestate restore < /tmp/delme; setfont , and your video + should work. + +(7) on some systems, it is possible to boot most of kernel, and then + POSTing bios works. Ole Rohne has patch to do just that at + http://dev.gentoo.org/~marineam/patch-radeonfb-2.6.11-rc2-mm2. + +(8) on some systems, you can use the video_post utility and or + do echo 3 > /sys/power/state && /usr/sbin/video_post - which will + initialize the display in console mode. If you are in X, you can switch + to a virtual terminal and back to X using CTRL+ALT+F1 - CTRL+ALT+F7 to get + the display working in graphical mode again. + +Now, if you pass acpi_sleep=something, and it does not work with your +bios, you'll get a hard crash during resume. Be careful. Also it is +safest to do your experiments with plain old VGA console. The vesafb +and radeonfb (etc) drivers have a tendency to crash the machine during +resume. + +You may have a system where none of above works. 
At that point you +either invent another ugly hack that works, or write proper driver for +your video card (good luck getting docs :-(). Maybe suspending from X +(proper X, knowing your hardware, not XF68_FBcon) might have better +chance of working. + +Table of known working notebooks: + + +=============================== =============================================== +Model hack (or "how to do it") +=============================== =============================================== +Acer Aspire 1406LC ole's late BIOS init (7), turn off DRI +Acer TM 230 s3_bios (2) +Acer TM 242FX vbetool (6) +Acer TM C110 video_post (8) +Acer TM C300 vga=normal (only suspend on console, not in X), + vbetool (6) or video_post (8) +Acer TM 4052LCi s3_bios (2) +Acer TM 636Lci s3_bios,s3_mode (4) +Acer TM 650 (Radeon M7) vga=normal plus boot-radeon (5) gets text + console back +Acer TM 660 ??? [#f1]_ +Acer TM 800 vga=normal, X patches, see webpage (5) + or vbetool (6) +Acer TM 803 vga=normal, X patches, see webpage (5) + or vbetool (6) +Acer TM 803LCi vga=normal, vbetool (6) +Arima W730a vbetool needed (6) +Asus L2400D s3_mode (3) [#f2]_ (S1 also works OK) +Asus L3350M (SiS 740) (6) +Asus L3800C (Radeon M7) s3_bios (2) (S1 also works OK) +Asus M6887Ne vga=normal, s3_bios (2), use radeon driver + instead of fglrx in x.org +Athlon64 desktop prototype s3_bios (2) +Compal CL-50 ??? [#f1]_ +Compaq Armada E500 - P3-700 none (1) (S1 also works OK) +Compaq Evo N620c vga=normal, s3_bios (2) +Dell 600m, ATI R250 Lf none (1), but needs xorg-x11-6.8.1.902-1 +Dell D600, ATI RV250 vga=normal and X, or try vbestate (6) +Dell D610 vga=normal and X (possibly vbestate (6) too, + but not tested) +Dell Inspiron 4000 ??? [#f1]_ +Dell Inspiron 500m ??? [#f1]_ +Dell Inspiron 510m ??? +Dell Inspiron 5150 vbetool needed (6) +Dell Inspiron 600m ??? [#f1]_ +Dell Inspiron 8200 ??? [#f1]_ +Dell Inspiron 8500 ??? [#f1]_ +Dell Inspiron 8600 ??? 
[#f1]_ +eMachines athlon64 machines vbetool needed (6) (someone please get + me model #s) +HP NC6000 s3_bios, may not use radeonfb (2); + or vbetool (6) +HP NX7000 ??? [#f1]_ +HP Pavilion ZD7000 vbetool post needed, need open-source nv + driver for X +HP Omnibook XE3 athlon version none (1) +HP Omnibook XE3GC none (1), video is S3 Savage/IX-MV +HP Omnibook XE3L-GF vbetool (6) +HP Omnibook 5150 none (1), (S1 also works OK) +IBM TP T20, model 2647-44G none (1), video is S3 Inc. 86C270-294 + Savage/IX-MV, vesafb gets "interesting" + but X work. +IBM TP A31 / Type 2652-M5G s3_mode (3) [works ok with + BIOS 1.04 2002-08-23, but not at all with + BIOS 1.11 2004-11-05 :-(] +IBM TP R32 / Type 2658-MMG none (1) +IBM TP R40 2722B3G ??? [#f1]_ +IBM TP R50p / Type 1832-22U s3_bios (2) +IBM TP R51 none (1) +IBM TP T30 236681A ??? [#f1]_ +IBM TP T40 / Type 2373-MU4 none (1) +IBM TP T40p none (1) +IBM TP R40p s3_bios (2) +IBM TP T41p s3_bios (2), switch to X after resume +IBM TP T42 s3_bios (2) +IBM ThinkPad T42p (2373-GTG) s3_bios (2) +IBM TP X20 ??? [#f1]_ +IBM TP X30 s3_bios, s3_mode (4) +IBM TP X31 / Type 2672-XXH none (1), use radeontool + (http://fdd.com/software/radeon/) to + turn off backlight. +IBM TP X32 none (1), but backlight is on and video is + trashed after long suspend. s3_bios, + s3_mode (4) works too. Perhaps that gets + better results? +IBM Thinkpad X40 Type 2371-7JG s3_bios,s3_mode (4) +IBM TP 600e none(1), but a switch to console and + back to X is needed +Medion MD4220 ??? [#f1]_ +Samsung P35 vbetool needed (6) +Sharp PC-AR10 (ATI rage) none (1), backlight does not switch off +Sony Vaio PCG-C1VRX/K s3_bios (2) +Sony Vaio PCG-F403 ??? [#f1]_ +Sony Vaio PCG-GRT995MP none (1), works with 'nv' X driver +Sony Vaio PCG-GR7/K none (1), but needs radeonfb, use + radeontool (http://fdd.com/software/radeon/) + to turn off backlight. +Sony Vaio PCG-N505SN ??? 
[#f1]_ +Sony Vaio vgn-s260 X or boot-radeon can init it (5) +Sony Vaio vgn-S580BH vga=normal, but suspend from X. Console will + be blank unless you return to X. +Sony Vaio vgn-FS115B s3_bios (2),s3_mode (4) +Toshiba Libretto L5 none (1) +Toshiba Libretto 100CT/110CT vbetool (6) +Toshiba Portege 3020CT s3_mode (3) +Toshiba Satellite 4030CDT s3_mode (3) (S1 also works OK) +Toshiba Satellite 4080XCDT s3_mode (3) (S1 also works OK) +Toshiba Satellite 4090XCDT ??? [#f1]_ +Toshiba Satellite P10-554 s3_bios,s3_mode (4)[#f3]_ +Toshiba M30 (2) xor X with nvidia driver using internal AGP +Uniwill 244IIO ??? [#f1]_ +=============================== =============================================== + +Known working desktop systems +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +=================== ============================= ======================== +Mainboard Graphics card hack (or "how to do it") +=================== ============================= ======================== +Asus A7V8X nVidia RIVA TNT2 model 64 s3_bios,s3_mode (4) +=================== ============================= ======================== + + +.. [#f1] from https://wiki.ubuntu.com/HoaryPMResults, not sure + which options to use. If you know, please tell me. + +.. [#f2] To be tested with a newer kernel. + +.. [#f3] Not with SMP kernel, UP only. diff --git a/Documentation/power/video.txt b/Documentation/power/video.txt deleted file mode 100644 index 3e6272bc4472..000000000000 --- a/Documentation/power/video.txt +++ /dev/null @@ -1,185 +0,0 @@ - - Video issues with S3 resume - ~~~~~~~~~~~~~~~~~~~~~~~~~~~ - 2003-2006, Pavel Machek - -During S3 resume, hardware needs to be reinitialized. For most -devices, this is easy, and kernel driver knows how to do -it. Unfortunately there's one exception: video card. Those are usually -initialized by BIOS, and kernel does not have enough information to -boot video card. (Kernel usually does not even contain video card -driver -- vesafb and vgacon are widely used). 
- -This is not problem for swsusp, because during swsusp resume, BIOS is -run normally so video card is normally initialized. It should not be -problem for S1 standby, because hardware should retain its state over -that. - -We either have to run video BIOS during early resume, or interpret it -using vbetool later, or maybe nothing is necessary on particular -system because video state is preserved. Unfortunately different -methods work on different systems, and no known method suits all of -them. - -Userland application called s2ram has been developed; it contains long -whitelist of systems, and automatically selects working method for a -given system. It can be downloaded from CVS at -www.sf.net/projects/suspend . If you get a system that is not in the -whitelist, please try to find a working solution, and submit whitelist -entry so that work does not need to be repeated. - -Currently, VBE_SAVE method (6 below) works on most -systems. Unfortunately, vbetool only runs after userland is resumed, -so it makes debugging of early resume problems -hard/impossible. Methods that do not rely on userland are preferable. - -Details -~~~~~~~ - -There are a few types of systems where video works after S3 resume: - -(1) systems where video state is preserved over S3. - -(2) systems where it is possible to call the video BIOS during S3 - resume. Unfortunately, it is not correct to call the video BIOS at - that point, but it happens to work on some machines. Use - acpi_sleep=s3_bios. - -(3) systems that initialize video card into vga text mode and where - the BIOS works well enough to be able to set video mode. Use - acpi_sleep=s3_mode on these. - -(4) on some systems s3_bios kicks video into text mode, and - acpi_sleep=s3_bios,s3_mode is needed. - -(5) radeon systems, where X can soft-boot your video card. You'll need - a new enough X, and a plain text console (no vesafb or radeonfb). See - http://www.doesi.gmxhome.de/linux/tm800s3/s3.html for more information. 
- Alternatively, you should use vbetool (6) instead. - -(6) other radeon systems, where vbetool is enough to bring system back - to life. It needs text console to be working. Do vbetool vbestate - save > /tmp/delme; echo 3 > /proc/acpi/sleep; vbetool post; vbetool - vbestate restore < /tmp/delme; setfont , and your video - should work. - -(7) on some systems, it is possible to boot most of kernel, and then - POSTing bios works. Ole Rohne has patch to do just that at - http://dev.gentoo.org/~marineam/patch-radeonfb-2.6.11-rc2-mm2. - -(8) on some systems, you can use the video_post utility and or - do echo 3 > /sys/power/state && /usr/sbin/video_post - which will - initialize the display in console mode. If you are in X, you can switch - to a virtual terminal and back to X using CTRL+ALT+F1 - CTRL+ALT+F7 to get - the display working in graphical mode again. - -Now, if you pass acpi_sleep=something, and it does not work with your -bios, you'll get a hard crash during resume. Be careful. Also it is -safest to do your experiments with plain old VGA console. The vesafb -and radeonfb (etc) drivers have a tendency to crash the machine during -resume. - -You may have a system where none of above works. At that point you -either invent another ugly hack that works, or write proper driver for -your video card (good luck getting docs :-(). Maybe suspending from X -(proper X, knowing your hardware, not XF68_FBcon) might have better -chance of working. 
- -Table of known working notebooks: - -Model hack (or "how to do it") ------------------------------------------------------------------------------- -Acer Aspire 1406LC ole's late BIOS init (7), turn off DRI -Acer TM 230 s3_bios (2) -Acer TM 242FX vbetool (6) -Acer TM C110 video_post (8) -Acer TM C300 vga=normal (only suspend on console, not in X), vbetool (6) or video_post (8) -Acer TM 4052LCi s3_bios (2) -Acer TM 636Lci s3_bios,s3_mode (4) -Acer TM 650 (Radeon M7) vga=normal plus boot-radeon (5) gets text console back -Acer TM 660 ??? (*) -Acer TM 800 vga=normal, X patches, see webpage (5) or vbetool (6) -Acer TM 803 vga=normal, X patches, see webpage (5) or vbetool (6) -Acer TM 803LCi vga=normal, vbetool (6) -Arima W730a vbetool needed (6) -Asus L2400D s3_mode (3)(***) (S1 also works OK) -Asus L3350M (SiS 740) (6) -Asus L3800C (Radeon M7) s3_bios (2) (S1 also works OK) -Asus M6887Ne vga=normal, s3_bios (2), use radeon driver instead of fglrx in x.org -Athlon64 desktop prototype s3_bios (2) -Compal CL-50 ??? (*) -Compaq Armada E500 - P3-700 none (1) (S1 also works OK) -Compaq Evo N620c vga=normal, s3_bios (2) -Dell 600m, ATI R250 Lf none (1), but needs xorg-x11-6.8.1.902-1 -Dell D600, ATI RV250 vga=normal and X, or try vbestate (6) -Dell D610 vga=normal and X (possibly vbestate (6) too, but not tested) -Dell Inspiron 4000 ??? (*) -Dell Inspiron 500m ??? (*) -Dell Inspiron 510m ??? -Dell Inspiron 5150 vbetool needed (6) -Dell Inspiron 600m ??? (*) -Dell Inspiron 8200 ??? (*) -Dell Inspiron 8500 ??? (*) -Dell Inspiron 8600 ??? (*) -eMachines athlon64 machines vbetool needed (6) (someone please get me model #s) -HP NC6000 s3_bios, may not use radeonfb (2); or vbetool (6) -HP NX7000 ??? 
(*) -HP Pavilion ZD7000 vbetool post needed, need open-source nv driver for X -HP Omnibook XE3 athlon version none (1) -HP Omnibook XE3GC none (1), video is S3 Savage/IX-MV -HP Omnibook XE3L-GF vbetool (6) -HP Omnibook 5150 none (1), (S1 also works OK) -IBM TP T20, model 2647-44G none (1), video is S3 Inc. 86C270-294 Savage/IX-MV, vesafb gets "interesting" but X work. -IBM TP A31 / Type 2652-M5G s3_mode (3) [works ok with BIOS 1.04 2002-08-23, but not at all with BIOS 1.11 2004-11-05 :-(] -IBM TP R32 / Type 2658-MMG none (1) -IBM TP R40 2722B3G ??? (*) -IBM TP R50p / Type 1832-22U s3_bios (2) -IBM TP R51 none (1) -IBM TP T30 236681A ??? (*) -IBM TP T40 / Type 2373-MU4 none (1) -IBM TP T40p none (1) -IBM TP R40p s3_bios (2) -IBM TP T41p s3_bios (2), switch to X after resume -IBM TP T42 s3_bios (2) -IBM ThinkPad T42p (2373-GTG) s3_bios (2) -IBM TP X20 ??? (*) -IBM TP X30 s3_bios, s3_mode (4) -IBM TP X31 / Type 2672-XXH none (1), use radeontool (http://fdd.com/software/radeon/) to turn off backlight. -IBM TP X32 none (1), but backlight is on and video is trashed after long suspend. s3_bios,s3_mode (4) works too. Perhaps that gets better results? -IBM Thinkpad X40 Type 2371-7JG s3_bios,s3_mode (4) -IBM TP 600e none(1), but a switch to console and back to X is needed -Medion MD4220 ??? (*) -Samsung P35 vbetool needed (6) -Sharp PC-AR10 (ATI rage) none (1), backlight does not switch off -Sony Vaio PCG-C1VRX/K s3_bios (2) -Sony Vaio PCG-F403 ??? (*) -Sony Vaio PCG-GRT995MP none (1), works with 'nv' X driver -Sony Vaio PCG-GR7/K none (1), but needs radeonfb, use radeontool (http://fdd.com/software/radeon/) to turn off backlight. -Sony Vaio PCG-N505SN ??? (*) -Sony Vaio vgn-s260 X or boot-radeon can init it (5) -Sony Vaio vgn-S580BH vga=normal, but suspend from X. Console will be blank unless you return to X. 
-Sony Vaio vgn-FS115B s3_bios (2),s3_mode (4) -Toshiba Libretto L5 none (1) -Toshiba Libretto 100CT/110CT vbetool (6) -Toshiba Portege 3020CT s3_mode (3) -Toshiba Satellite 4030CDT s3_mode (3) (S1 also works OK) -Toshiba Satellite 4080XCDT s3_mode (3) (S1 also works OK) -Toshiba Satellite 4090XCDT ??? (*) -Toshiba Satellite P10-554 s3_bios,s3_mode (4)(****) -Toshiba M30 (2) xor X with nvidia driver using internal AGP -Uniwill 244IIO ??? (*) - -Known working desktop systems -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Mainboard Graphics card hack (or "how to do it") ------------------------------------------------------------------------------- -Asus A7V8X nVidia RIVA TNT2 model 64 s3_bios,s3_mode (4) - - -(*) from https://wiki.ubuntu.com/HoaryPMResults, not sure - which options to use. If you know, please tell me. - -(***) To be tested with a newer kernel. - -(****) Not with SMP kernel, UP only. diff --git a/Documentation/process/submitting-drivers.rst b/Documentation/process/submitting-drivers.rst index 58bc047e7b95..1acaa14903d6 100644 --- a/Documentation/process/submitting-drivers.rst +++ b/Documentation/process/submitting-drivers.rst @@ -117,7 +117,7 @@ PM support: implemented") error. You should also try to make sure that your driver uses as little power as possible when it's not doing anything. For the driver testing instructions see - Documentation/power/drivers-testing.txt and for a relatively + Documentation/power/drivers-testing.rst and for a relatively complete overview of the power management issues related to drivers see :ref:`Documentation/driver-api/pm/devices.rst `. diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt index 197d81f4b836..d97207b9accb 100644 --- a/Documentation/scheduler/sched-energy.txt +++ b/Documentation/scheduler/sched-energy.txt @@ -22,7 +22,7 @@ the highest. The actual EM used by EAS is _not_ maintained by the scheduler, but by a dedicated framework. 
For details about this framework and what it provides, -please refer to its documentation (see Documentation/power/energy-model.txt). +please refer to its documentation (see Documentation/power/energy-model.rst). 2. Background and Terminology @@ -81,7 +81,7 @@ through the arch_scale_cpu_capacity() callback. The rest of platform knowledge used by EAS is directly read from the Energy Model (EM) framework. The EM of a platform is composed of a power cost table -per 'performance domain' in the system (see Documentation/power/energy-model.txt +per 'performance domain' in the system (see Documentation/power/energy-model.rst for futher details about performance domains). The scheduler manages references to the EM objects in the topology code when the @@ -352,7 +352,7 @@ could be amended in the future if proven otherwise. EAS uses the EM of a platform to estimate the impact of scheduling decisions on energy. So, your platform must provide power cost tables to the EM framework in order to make EAS start. To do so, please refer to documentation of the -independent EM framework in Documentation/power/energy-model.txt. +independent EM framework in Documentation/power/energy-model.rst. Please also note that the scheduling domains need to be re-built after the EM has been registered in order to start EAS. diff --git a/Documentation/trace/coresight-cpu-debug.txt b/Documentation/trace/coresight-cpu-debug.txt index f07e38094b40..1a660a39e3c0 100644 --- a/Documentation/trace/coresight-cpu-debug.txt +++ b/Documentation/trace/coresight-cpu-debug.txt @@ -151,7 +151,7 @@ At the runtime you can disable idle states with below methods: It is possible to disable CPU idle states by way of the PM QoS subsystem, more specifically by using the "/dev/cpu_dma_latency" -interface (see Documentation/power/pm_qos_interface.txt for more +interface (see Documentation/power/pm_qos_interface.rst for more details). 
As specified in the PM QoS documentation the requested parameter will stay in effect until the file descriptor is released. For example: diff --git a/Documentation/translations/zh_CN/process/submitting-drivers.rst b/Documentation/translations/zh_CN/process/submitting-drivers.rst index 72c6cd935821..f1c3906c69a8 100644 --- a/Documentation/translations/zh_CN/process/submitting-drivers.rst +++ b/Documentation/translations/zh_CN/process/submitting-drivers.rst @@ -97,7 +97,7 @@ Linux 2.6: 函数定义成返回 -ENOSYS(功能未实现)错误。你还应该尝试确 保你的驱动在什么都不干的情况下将耗电降到最低。要获得驱动 程序测试的指导,请参阅 - Documentation/power/drivers-testing.txt。有关驱动程序电 + Documentation/power/drivers-testing.rst。有关驱动程序电 源管理问题相对全面的概述,请参阅 Documentation/driver-api/pm/devices.rst。 diff --git a/MAINTAINERS b/MAINTAINERS index 9c382053ce6a..5a6137df3f0e 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -6446,7 +6446,7 @@ M: "Rafael J. Wysocki" M: Pavel Machek L: linux-pm@vger.kernel.org S: Supported -F: Documentation/power/freezing-of-tasks.txt +F: Documentation/power/freezing-of-tasks.rst F: include/linux/freezer.h F: kernel/freezer.c @@ -11764,7 +11764,7 @@ S: Maintained T: git git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm.git F: drivers/opp/ F: include/linux/pm_opp.h -F: Documentation/power/opp.txt +F: Documentation/power/opp.rst F: Documentation/devicetree/bindings/opp/ OPL4 DRIVER diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 2bbbd4d1ba31..77a724771dbb 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -2447,7 +2447,7 @@ menuconfig APM machines with more than one CPU. In order to use APM, you will need supporting software. For location - and more information, read + and more information, read and the Battery Powered Linux mini-HOWTO, available from . diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index 066fd2a12851..10d040e2e807 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -1175,7 +1175,7 @@ struct skl_wm_params { * to be disabled. 
This shouldn't happen and we'll print some error messages in * case it happens. * - * For more, read the Documentation/power/runtime_pm.txt. + * For more, read the Documentation/power/runtime_pm.rst. */ struct i915_runtime_pm { atomic_t wakeref_count; diff --git a/drivers/opp/Kconfig b/drivers/opp/Kconfig index a7fbb93f302c..1f64a3d46c8a 100644 --- a/drivers/opp/Kconfig +++ b/drivers/opp/Kconfig @@ -10,4 +10,4 @@ config PM_OPP OPP layer organizes the data internally using device pointers representing individual voltage domains and provides SOC implementations a ready to use framework to manage OPPs. - For more information, read + For more information, read diff --git a/drivers/power/supply/power_supply_core.c b/drivers/power/supply/power_supply_core.c index f7033ecf6d0b..11f9c875b028 100644 --- a/drivers/power/supply/power_supply_core.c +++ b/drivers/power/supply/power_supply_core.c @@ -607,7 +607,7 @@ int power_supply_get_battery_info(struct power_supply *psy, /* The property and field names below must correspond to elements * in enum power_supply_property. For reasoning, see - * Documentation/power/power_supply_class.txt. + * Documentation/power/power_supply_class.rst. */ of_property_read_u32(battery_np, "energy-full-design-microwatt-hours", diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index c7eef32e7739..5b8328a99b2a 100644 --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -52,7 +52,7 @@ * irq line disabled until the threaded handler has been run. * IRQF_NO_SUSPEND - Do not disable this IRQ during suspend. Does not guarantee * that this interrupt will wake the system from a suspended - * state. See Documentation/power/suspend-and-interrupts.txt + * state. 
See Documentation/power/suspend-and-interrupts.rst * IRQF_FORCE_RESUME - Force enable it on resume even if IRQF_NO_SUSPEND is set * IRQF_NO_THREAD - Interrupt cannot be threaded * IRQF_EARLY_RESUME - Resume IRQ early during syscore instead of at device diff --git a/include/linux/pci.h b/include/linux/pci.h index b74b2a4e6df2..3d9a167ca5c3 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -807,7 +807,7 @@ struct module; * @suspend_late: Put device into low power state. * @resume_early: Wake device from low power state. * @resume: Wake device from low power state. - * (Please see Documentation/power/pci.txt for descriptions + * (Please see Documentation/power/pci.rst for descriptions * of PCI Power Management and the related functions.) * @shutdown: Hook into reboot_notifier_list (kernel/sys.c). * Intended to stop any idling DMA operations. diff --git a/include/linux/pm.h b/include/linux/pm.h index 66c19a65a514..c14ad8bc1a41 100644 --- a/include/linux/pm.h +++ b/include/linux/pm.h @@ -284,7 +284,7 @@ typedef struct pm_message { * actions to be performed by a device driver's callbacks generally depend on * the platform and subsystem the device belongs to. * - * Refer to Documentation/power/runtime_pm.txt for more information about the + * Refer to Documentation/power/runtime_pm.rst for more information about the * role of the @runtime_suspend(), @runtime_resume() and @runtime_idle() * callbacks in device runtime power management. */ diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig index 9bbaaab14b36..7a4dda9e5309 100644 --- a/kernel/power/Kconfig +++ b/kernel/power/Kconfig @@ -65,7 +65,7 @@ config HIBERNATION need to run mkswap against the swap partition used for the suspend. It also works with swap files to a limited extent (for details see - ). + ). 
Right now you may boot without resuming and resume later but in the meantime you cannot use the swap partition(s)/file(s) involved in @@ -74,7 +74,7 @@ config HIBERNATION MOUNT any journaled filesystems mounted before the suspend or they will get corrupted in a nasty way. - For more information take a look at . + For more information take a look at . config ARCH_SAVE_PAGE_KEYS bool @@ -255,7 +255,7 @@ config APM_EMULATION notification of APM "events" (e.g. battery status change). In order to use APM, you will need supporting software. For location - and more information, read + and more information, read and the Battery Powered Linux mini-HOWTO, available from . diff --git a/net/wireless/Kconfig b/net/wireless/Kconfig index 41722046b937..0cd26289bfbc 100644 --- a/net/wireless/Kconfig +++ b/net/wireless/Kconfig @@ -165,7 +165,7 @@ config CFG80211_DEFAULT_PS If this causes your applications to misbehave you should fix your applications instead -- they need to register their network - latency requirement, see Documentation/power/pm_qos_interface.txt. + latency requirement, see Documentation/power/pm_qos_interface.rst. config CFG80211_DEBUGFS bool "cfg80211 DebugFS entries" -- cgit v1.2.3 From 6ad805b82dc5fc0ffd2de1d1f0de47214a050278 Mon Sep 17 00:00:00 2001 From: Yang Yingliang Date: Sat, 15 Jun 2019 17:41:29 +0800 Subject: doc: fix documentation about UIO_MEM_LOGICAL using After commit d4fc5069a394 ("mm: switch s_mem and slab_cache in struct page") page->mapping will be re-used by slab allocations and page->mapping->host will be used in balance_dirty_pages_ratelimited() as an inode member but it's not an inode in fact and leads an oops. 
[ 159.906493] Unable to handle kernel paging request at virtual address ffff200012d90be8
[ 159.908029] Mem abort info:
[ 159.908552] ESR = 0x96000007
[ 159.909138] Exception class = DABT (current EL), IL = 32 bits
[ 159.910155] SET = 0, FnV = 0
[ 159.910690] EA = 0, S1PTW = 0
[ 159.911241] Data abort info:
[ 159.911846] ISV = 0, ISS = 0x00000007
[ 159.912567] CM = 0, WnR = 0
[ 159.913105] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000042acd000
[ 159.914269] [ffff200012d90be8] pgd=000000043ffff003, pud=000000043fffe003, pmd=000000043fffa003, pte=0000000000000000
[ 159.916280] Internal error: Oops: 96000007 [#1] SMP
[ 159.917195] Dumping ftrace buffer:
[ 159.917845] (ftrace buffer empty)
[ 159.918521] Modules linked in: uio_dev(OE)
[ 159.919276] CPU: 1 PID: 295 Comm: uio_test Tainted: G OE 5.2.0-rc4+ #46
[ 159.920859] Hardware name: linux,dummy-virt (DT)
[ 159.921815] pstate: 60000005 (nZCv daif -PAN -UAO)
[ 159.922809] pc : balance_dirty_pages_ratelimited+0x68/0xc38
[ 159.923965] lr : fault_dirty_shared_page.isra.8+0xe4/0x100
[ 159.925134] sp : ffff800368a77ae0
[ 159.925824] x29: ffff800368a77ae0 x28: 1ffff0006d14ce1a
[ 159.926906] x27: ffff800368a670d0 x26: ffff800368a67120
[ 159.927985] x25: 1ffff0006d10f5fe x24: ffff200012d90be8
[ 159.929089] x23: ffff200013732000 x22: ffff80036ec03200
[ 159.930172] x21: ffff200012d90bc0 x20: 1fffe400025b217d
[ 159.931253] x19: ffff80036ec03200 x18: 0000000000000000
[ 159.932348] x17: 0000000000000000 x16: 0ffffe0000010208
[ 159.933439] x15: 0000000000000000 x14: 0000000000000000
[ 159.934518] x13: 0000000000000000 x12: 0000000000000000
[ 159.935596] x11: 1fffefc001b452c0 x10: ffff0fc001b452c0
[ 159.936697] x9 : dfff200000000000 x8 : dfff200000000001
[ 159.937781] x7 : ffff7e000da29607 x6 : ffff0fc001b452c1
[ 159.938859] x5 : ffff0fc001b452c1 x4 : ffff0fc001b452c1
[ 159.939944] x3 : ffff200010523ad4 x2 : 1fffe400026e659b
[ 159.941065] x1 : dfff200000000000 x0 : ffff200013732cd8
[ 159.942205] Call trace:
[ 159.942732]
balance_dirty_pages_ratelimited+0x68/0xc38
[ 159.943797] fault_dirty_shared_page.isra.8+0xe4/0x100
[ 159.944867] do_fault+0x608/0x1250
[ 159.945571] __handle_mm_fault+0x93c/0xfb8
[ 159.946412] handle_mm_fault+0x1c0/0x360
[ 159.947224] do_page_fault+0x358/0x8d0
[ 159.947993] do_translation_fault+0xf8/0x124
[ 159.948884] do_mem_abort+0x70/0x190
[ 159.949624] el0_da+0x24/0x28

According another commit 5e901d0b15c0 ("scsi: qedi: Fix bad pte call
trace when iscsiuio is stopped."), using kmalloc also cause other problem.

But the documentation about UIO_MEM_LOGICAL allows using kmalloc(), remove
and don't allow using kmalloc() in documentation.

Signed-off-by: Yang Yingliang
Signed-off-by: Greg Kroah-Hartman
---
 Documentation/driver-api/uio-howto.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

(limited to 'Documentation/driver-api')

diff --git a/Documentation/driver-api/uio-howto.rst b/Documentation/driver-api/uio-howto.rst
index 25f50eace28b..8fecfa11d4ff 100644
--- a/Documentation/driver-api/uio-howto.rst
+++ b/Documentation/driver-api/uio-howto.rst
@@ -276,8 +276,8 @@ fields of ``struct uio_mem``:
 -  ``int memtype``: Required if the mapping is used. Set this to
    ``UIO_MEM_PHYS`` if you you have physical memory on your card to be
    mapped. Use ``UIO_MEM_LOGICAL`` for logical memory (e.g. allocated
-   with :c:func:`kmalloc()`). There's also ``UIO_MEM_VIRTUAL`` for
-   virtual memory.
+   with :c:func:`__get_free_pages()` but not kmalloc()). There's also
+   ``UIO_MEM_VIRTUAL`` for virtual memory.
 
 -  ``phys_addr_t addr``: Required if the mapping is used. Fill in the
    address of your memory block. This address is the one that appears in
-- 
cgit v1.2.3

From 4489f161b739f01ab60a58784f6ef7de9d7a1352 Mon Sep 17 00:00:00 2001
From: Mauro Carvalho Chehab
Date: Tue, 18 Jun 2019 17:53:27 -0300
Subject: docs: driver-model: convert docs to ReST and rename to *.rst

Convert the various documents at the driver-model, preparing
them to be part of the driver-api book.

The conversion is actually:
  - add blank lines and identation in order to identify paragraphs;
  - fix tables markups;
  - add some lists markups;
  - mark literal blocks;
  - adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab
Acked-by: Jeff Kirsher # ice
Signed-off-by: Greg Kroah-Hartman
---
 Documentation/driver-api/gpio/driver.rst       |   2 +-
 Documentation/driver-model/binding.rst         |  98 ++++++
 Documentation/driver-model/binding.txt         |  98 ------
 Documentation/driver-model/bus.rst             | 146 ++++++++
 Documentation/driver-model/bus.txt             | 143 --------
 Documentation/driver-model/class.rst           | 149 ++++++++
 Documentation/driver-model/class.txt           | 147 --------
 Documentation/driver-model/design-patterns.rst | 116 +++++++
 Documentation/driver-model/design-patterns.txt | 116 -------
 Documentation/driver-model/device.rst          | 109 ++++++
 Documentation/driver-model/device.txt          | 106 ------
 Documentation/driver-model/devres.rst          | 414 +++++++++++++++++++++++
 Documentation/driver-model/devres.txt          | 412 -----------------------
 Documentation/driver-model/driver.rst          | 223 ++++++++++++
 Documentation/driver-model/driver.txt          | 215 ------------
 Documentation/driver-model/index.rst           |  26 ++
 Documentation/driver-model/overview.rst        | 124 +++++++
 Documentation/driver-model/overview.txt        | 123 -------
 Documentation/driver-model/platform.rst        | 246 ++++++++++++++
 Documentation/driver-model/platform.txt        | 244 --------------
 Documentation/driver-model/porting.rst         | 448 +++++++++++++++++++++++++
 Documentation/driver-model/porting.txt         | 447 ------------------------
 Documentation/eisa.txt                         |   4 +-
 Documentation/hwmon/submitting-patches.rst     |   2 +-
 drivers/base/platform.c                        |   2 +-
 drivers/gpio/gpio-cs5535.c                     |   2 +-
 drivers/net/ethernet/intel/ice/ice_main.c      |   2 +-
 scripts/coccinelle/free/devm_free.cocci        |   2 +-
 28 files changed, 2107 insertions(+), 2059 deletions(-)
 create mode 100644
Documentation/driver-model/binding.rst
 delete mode 100644 Documentation/driver-model/binding.txt
 create mode 100644 Documentation/driver-model/bus.rst
 delete mode 100644 Documentation/driver-model/bus.txt
 create mode 100644 Documentation/driver-model/class.rst
 delete mode 100644 Documentation/driver-model/class.txt
 create mode 100644 Documentation/driver-model/design-patterns.rst
 delete mode 100644 Documentation/driver-model/design-patterns.txt
 create mode 100644 Documentation/driver-model/device.rst
 delete mode 100644 Documentation/driver-model/device.txt
 create mode 100644 Documentation/driver-model/devres.rst
 delete mode 100644 Documentation/driver-model/devres.txt
 create mode 100644 Documentation/driver-model/driver.rst
 delete mode 100644 Documentation/driver-model/driver.txt
 create mode 100644 Documentation/driver-model/index.rst
 create mode 100644 Documentation/driver-model/overview.rst
 delete mode 100644 Documentation/driver-model/overview.txt
 create mode 100644 Documentation/driver-model/platform.rst
 delete mode 100644 Documentation/driver-model/platform.txt
 create mode 100644 Documentation/driver-model/porting.rst
 delete mode 100644 Documentation/driver-model/porting.txt

(limited to 'Documentation/driver-api')

diff --git a/Documentation/driver-api/gpio/driver.rst b/Documentation/driver-api/gpio/driver.rst
index 1ce7fcd0f989..f931597fe7be 100644
--- a/Documentation/driver-api/gpio/driver.rst
+++ b/Documentation/driver-api/gpio/driver.rst
@@ -399,7 +399,7 @@ symbol:
   will pass the struct gpio_chip* for the chip to all IRQ callbacks, so the
   callbacks need to embed the gpio_chip in its state container and obtain a
   pointer to the container using container_of().
-  (See Documentation/driver-model/design-patterns.txt)
+  (See Documentation/driver-model/design-patterns.rst)
 
 - gpiochip_irqchip_add_nested(): adds a nested cascaded irqchip to a gpiochip,
   as discussed above regarding different types of cascaded irqchips.
The diff --git a/Documentation/driver-model/binding.rst b/Documentation/driver-model/binding.rst new file mode 100644 index 000000000000..7ea1d7a41e1d --- /dev/null +++ b/Documentation/driver-model/binding.rst @@ -0,0 +1,98 @@ +============== +Driver Binding +============== + +Driver binding is the process of associating a device with a device +driver that can control it. Bus drivers have typically handled this +because there have been bus-specific structures to represent the +devices and the drivers. With generic device and device driver +structures, most of the binding can take place using common code. + + +Bus +~~~ + +The bus type structure contains a list of all devices that are on that bus +type in the system. When device_register is called for a device, it is +inserted into the end of this list. The bus object also contains a +list of all drivers of that bus type. When driver_register is called +for a driver, it is inserted at the end of this list. These are the +two events which trigger driver binding. + + +device_register +~~~~~~~~~~~~~~~ + +When a new device is added, the bus's list of drivers is iterated over +to find one that supports it. In order to determine that, the device +ID of the device must match one of the device IDs that the driver +supports. The format and semantics for comparing IDs is bus-specific. +Instead of trying to derive a complex state machine and matching +algorithm, it is up to the bus driver to provide a callback to compare +a device against the IDs of a driver. The bus returns 1 if a match was +found; 0 otherwise. + +int match(struct device * dev, struct device_driver * drv); + +If a match is found, the device's driver field is set to the driver +and the driver's probe callback is called. This gives the driver a +chance to verify that it really does support the hardware, and that +it's in a working state. 
+ +Device Class +~~~~~~~~~~~~ + +Upon the successful completion of probe, the device is registered with +the class to which it belongs. Device drivers belong to one and only one +class, and that is set in the driver's devclass field. +devclass_add_device is called to enumerate the device within the class +and actually register it with the class, which happens with the +class's register_dev callback. + + +Driver +~~~~~~ + +When a driver is attached to a device, the device is inserted into the +driver's list of devices. + + +sysfs +~~~~~ + +A symlink is created in the bus's 'devices' directory that points to +the device's directory in the physical hierarchy. + +A symlink is created in the driver's 'devices' directory that points +to the device's directory in the physical hierarchy. + +A directory for the device is created in the class's directory. A +symlink is created in that directory that points to the device's +physical location in the sysfs tree. + +A symlink can be created (though this isn't done yet) in the device's +physical directory to either its class directory, or the class's +top-level directory. One can also be created to point to its driver's +directory also. + + +driver_register +~~~~~~~~~~~~~~~ + +The process is almost identical for when a new driver is added. +The bus's list of devices is iterated over to find a match. Devices +that already have a driver are skipped. All the devices are iterated +over, to bind as many devices as possible to the driver. + + +Removal +~~~~~~~ + +When a device is removed, the reference count for it will eventually +go to 0. When it does, the remove callback of the driver is called. It +is removed from the driver's list of devices and the reference count +of the driver is decremented. All symlinks between the two are removed. + +When a driver is removed, the list of devices that it supports is +iterated over, and the driver's remove callback is called for each +one. 
The device is removed from that list and the symlinks removed. diff --git a/Documentation/driver-model/binding.txt b/Documentation/driver-model/binding.txt deleted file mode 100644 index abfc8e290d53..000000000000 --- a/Documentation/driver-model/binding.txt +++ /dev/null @@ -1,98 +0,0 @@ - -Driver Binding - -Driver binding is the process of associating a device with a device -driver that can control it. Bus drivers have typically handled this -because there have been bus-specific structures to represent the -devices and the drivers. With generic device and device driver -structures, most of the binding can take place using common code. - - -Bus -~~~ - -The bus type structure contains a list of all devices that are on that bus -type in the system. When device_register is called for a device, it is -inserted into the end of this list. The bus object also contains a -list of all drivers of that bus type. When driver_register is called -for a driver, it is inserted at the end of this list. These are the -two events which trigger driver binding. - - -device_register -~~~~~~~~~~~~~~~ - -When a new device is added, the bus's list of drivers is iterated over -to find one that supports it. In order to determine that, the device -ID of the device must match one of the device IDs that the driver -supports. The format and semantics for comparing IDs is bus-specific. -Instead of trying to derive a complex state machine and matching -algorithm, it is up to the bus driver to provide a callback to compare -a device against the IDs of a driver. The bus returns 1 if a match was -found; 0 otherwise. - -int match(struct device * dev, struct device_driver * drv); - -If a match is found, the device's driver field is set to the driver -and the driver's probe callback is called. This gives the driver a -chance to verify that it really does support the hardware, and that -it's in a working state. 
- -Device Class -~~~~~~~~~~~~ - -Upon the successful completion of probe, the device is registered with -the class to which it belongs. Device drivers belong to one and only one -class, and that is set in the driver's devclass field. -devclass_add_device is called to enumerate the device within the class -and actually register it with the class, which happens with the -class's register_dev callback. - - -Driver -~~~~~~ - -When a driver is attached to a device, the device is inserted into the -driver's list of devices. - - -sysfs -~~~~~ - -A symlink is created in the bus's 'devices' directory that points to -the device's directory in the physical hierarchy. - -A symlink is created in the driver's 'devices' directory that points -to the device's directory in the physical hierarchy. - -A directory for the device is created in the class's directory. A -symlink is created in that directory that points to the device's -physical location in the sysfs tree. - -A symlink can be created (though this isn't done yet) in the device's -physical directory to either its class directory, or the class's -top-level directory. One can also be created to point to its driver's -directory also. - - -driver_register -~~~~~~~~~~~~~~~ - -The process is almost identical for when a new driver is added. -The bus's list of devices is iterated over to find a match. Devices -that already have a driver are skipped. All the devices are iterated -over, to bind as many devices as possible to the driver. - - -Removal -~~~~~~~ - -When a device is removed, the reference count for it will eventually -go to 0. When it does, the remove callback of the driver is called. It -is removed from the driver's list of devices and the reference count -of the driver is decremented. All symlinks between the two are removed. - -When a driver is removed, the list of devices that it supports is -iterated over, and the driver's remove callback is called for each -one. 
The device is removed from that list and the symlinks removed. - diff --git a/Documentation/driver-model/bus.rst b/Documentation/driver-model/bus.rst new file mode 100644 index 000000000000..016b15a6e8ea --- /dev/null +++ b/Documentation/driver-model/bus.rst @@ -0,0 +1,146 @@ +========= +Bus Types +========= + +Definition +~~~~~~~~~~ +See the kerneldoc for the struct bus_type. + +int bus_register(struct bus_type * bus); + + +Declaration +~~~~~~~~~~~ + +Each bus type in the kernel (PCI, USB, etc) should declare one static +object of this type. They must initialize the name field, and may +optionally initialize the match callback:: + + struct bus_type pci_bus_type = { + .name = "pci", + .match = pci_bus_match, + }; + +The structure should be exported to drivers in a header file: + +extern struct bus_type pci_bus_type; + + +Registration +~~~~~~~~~~~~ + +When a bus driver is initialized, it calls bus_register. This +initializes the rest of the fields in the bus object and inserts it +into a global list of bus types. Once the bus object is registered, +the fields in it are usable by the bus driver. + + +Callbacks +~~~~~~~~~ + +match(): Attaching Drivers to Devices +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The format of device ID structures and the semantics for comparing +them are inherently bus-specific. Drivers typically declare an array +of device IDs of devices they support that reside in a bus-specific +driver structure. + +The purpose of the match callback is to give the bus an opportunity to +determine if a particular driver supports a particular device by +comparing the device IDs the driver supports with the device ID of a +particular device, without sacrificing bus-specific functionality or +type-safety. + +When a driver is registered with the bus, the bus's list of devices is +iterated over, and the match callback is called for each device that +does not have a driver associated with it. 
+ + + +Device and Driver Lists +~~~~~~~~~~~~~~~~~~~~~~~ + +The lists of devices and drivers are intended to replace the local +lists that many buses keep. They are lists of struct devices and +struct device_drivers, respectively. Bus drivers are free to use the +lists as they please, but conversion to the bus-specific type may be +necessary. + +The LDM core provides helper functions for iterating over each list:: + + int bus_for_each_dev(struct bus_type * bus, struct device * start, + void * data, + int (*fn)(struct device *, void *)); + + int bus_for_each_drv(struct bus_type * bus, struct device_driver * start, + void * data, int (*fn)(struct device_driver *, void *)); + +These helpers iterate over the respective list, and call the callback +for each device or driver in the list. All list accesses are +synchronized by taking the bus's lock (read currently). The reference +count on each object in the list is incremented before the callback is +called; it is decremented after the next object has been obtained. The +lock is not held when calling the callback. + + +sysfs +~~~~~~~~ +There is a top-level directory named 'bus'. 
+ +Each bus gets a directory in the bus directory, along with two default +directories:: + + /sys/bus/pci/ + |-- devices + `-- drivers + +Drivers registered with the bus get a directory in the bus's drivers +directory:: + + /sys/bus/pci/ + |-- devices + `-- drivers + |-- Intel ICH + |-- Intel ICH Joystick + |-- agpgart + `-- e100 + +Each device that is discovered on a bus of that type gets a symlink in +the bus's devices directory to the device's directory in the physical +hierarchy:: + + /sys/bus/pci/ + |-- devices + | |-- 00:00.0 -> ../../../root/pci0/00:00.0 + | |-- 00:01.0 -> ../../../root/pci0/00:01.0 + | `-- 00:02.0 -> ../../../root/pci0/00:02.0 + `-- drivers + + +Exporting Attributes +~~~~~~~~~~~~~~~~~~~~ + +:: + + struct bus_attribute { + struct attribute attr; + ssize_t (*show)(struct bus_type *, char * buf); + ssize_t (*store)(struct bus_type *, const char * buf, size_t count); + }; + +Bus drivers can export attributes using the BUS_ATTR_RW macro that works +similarly to the DEVICE_ATTR_RW macro for devices. For example, a +definition like this:: + + static BUS_ATTR_RW(debug); + +is equivalent to declaring:: + + static bus_attribute bus_attr_debug; + +This can then be used to add and remove the attribute from the bus's +sysfs directory using:: + + int bus_create_file(struct bus_type *, struct bus_attribute *); + void bus_remove_file(struct bus_type *, struct bus_attribute *); diff --git a/Documentation/driver-model/bus.txt b/Documentation/driver-model/bus.txt deleted file mode 100644 index c247b488a567..000000000000 --- a/Documentation/driver-model/bus.txt +++ /dev/null @@ -1,143 +0,0 @@ - -Bus Types - -Definition -~~~~~~~~~~ -See the kerneldoc for the struct bus_type. - -int bus_register(struct bus_type * bus); - - -Declaration -~~~~~~~~~~~ - -Each bus type in the kernel (PCI, USB, etc) should declare one static -object of this type. They must initialize the name field, and may -optionally initialize the match callback. 
- -struct bus_type pci_bus_type = { - .name = "pci", - .match = pci_bus_match, -}; - -The structure should be exported to drivers in a header file: - -extern struct bus_type pci_bus_type; - - -Registration -~~~~~~~~~~~~ - -When a bus driver is initialized, it calls bus_register. This -initializes the rest of the fields in the bus object and inserts it -into a global list of bus types. Once the bus object is registered, -the fields in it are usable by the bus driver. - - -Callbacks -~~~~~~~~~ - -match(): Attaching Drivers to Devices -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The format of device ID structures and the semantics for comparing -them are inherently bus-specific. Drivers typically declare an array -of device IDs of devices they support that reside in a bus-specific -driver structure. - -The purpose of the match callback is to give the bus an opportunity to -determine if a particular driver supports a particular device by -comparing the device IDs the driver supports with the device ID of a -particular device, without sacrificing bus-specific functionality or -type-safety. - -When a driver is registered with the bus, the bus's list of devices is -iterated over, and the match callback is called for each device that -does not have a driver associated with it. - - - -Device and Driver Lists -~~~~~~~~~~~~~~~~~~~~~~~ - -The lists of devices and drivers are intended to replace the local -lists that many buses keep. They are lists of struct devices and -struct device_drivers, respectively. Bus drivers are free to use the -lists as they please, but conversion to the bus-specific type may be -necessary. - -The LDM core provides helper functions for iterating over each list. 
- -int bus_for_each_dev(struct bus_type * bus, struct device * start, void * data, - int (*fn)(struct device *, void *)); - -int bus_for_each_drv(struct bus_type * bus, struct device_driver * start, - void * data, int (*fn)(struct device_driver *, void *)); - -These helpers iterate over the respective list, and call the callback -for each device or driver in the list. All list accesses are -synchronized by taking the bus's lock (read currently). The reference -count on each object in the list is incremented before the callback is -called; it is decremented after the next object has been obtained. The -lock is not held when calling the callback. - - -sysfs -~~~~~~~~ -There is a top-level directory named 'bus'. - -Each bus gets a directory in the bus directory, along with two default -directories: - - /sys/bus/pci/ - |-- devices - `-- drivers - -Drivers registered with the bus get a directory in the bus's drivers -directory: - - /sys/bus/pci/ - |-- devices - `-- drivers - |-- Intel ICH - |-- Intel ICH Joystick - |-- agpgart - `-- e100 - -Each device that is discovered on a bus of that type gets a symlink in -the bus's devices directory to the device's directory in the physical -hierarchy: - - /sys/bus/pci/ - |-- devices - | |-- 00:00.0 -> ../../../root/pci0/00:00.0 - | |-- 00:01.0 -> ../../../root/pci0/00:01.0 - | `-- 00:02.0 -> ../../../root/pci0/00:02.0 - `-- drivers - - -Exporting Attributes -~~~~~~~~~~~~~~~~~~~~ -struct bus_attribute { - struct attribute attr; - ssize_t (*show)(struct bus_type *, char * buf); - ssize_t (*store)(struct bus_type *, const char * buf, size_t count); -}; - -Bus drivers can export attributes using the BUS_ATTR_RW macro that works -similarly to the DEVICE_ATTR_RW macro for devices. 
For example, a -definition like this: - -static BUS_ATTR_RW(debug); - -is equivalent to declaring: - -static bus_attribute bus_attr_debug; - -This can then be used to add and remove the attribute from the bus's -sysfs directory using: - -int bus_create_file(struct bus_type *, struct bus_attribute *); -void bus_remove_file(struct bus_type *, struct bus_attribute *); - - diff --git a/Documentation/driver-model/class.rst b/Documentation/driver-model/class.rst new file mode 100644 index 000000000000..fff55b80e86a --- /dev/null +++ b/Documentation/driver-model/class.rst @@ -0,0 +1,149 @@ +============== +Device Classes +============== + +Introduction +~~~~~~~~~~~~ +A device class describes a type of device, like an audio or network +device. The following device classes have been identified: + + + + +Each device class defines a set of semantics and a programming interface +that devices of that class adhere to. Device drivers are the +implementation of that programming interface for a particular device on +a particular bus. + +Device classes are agnostic with respect to what bus a device resides +on. + + +Programming Interface +~~~~~~~~~~~~~~~~~~~~~ +The device class structure looks like:: + + + typedef int (*devclass_add)(struct device *); + typedef void (*devclass_remove)(struct device *); + +See the kerneldoc for the struct class. + +A typical device class definition would look like:: + + struct device_class input_devclass = { + .name = "input", + .add_device = input_add_device, + .remove_device = input_remove_device, + }; + +Each device class structure should be exported in a header file so it +can be used by drivers, extensions and interfaces. + +Device classes are registered and unregistered with the core using:: + + int devclass_register(struct device_class * cls); + void devclass_unregister(struct device_class * cls); + + +Devices +~~~~~~~ +As devices are bound to drivers, they are added to the device class +that the driver belongs to. 
Before the driver model core, this would +typically happen during the driver's probe() callback, once the device +has been initialized. It now happens after the probe() callback +finishes from the core. + +The device is enumerated in the class. Each time a device is added to +the class, the class's devnum field is incremented and assigned to the +device. The field is never decremented, so if the device is removed +from the class and re-added, it will receive a different enumerated +value. + +The class is allowed to create a class-specific structure for the +device and store it in the device's class_data pointer. + +There is no list of devices in the device class. Each driver has a +list of devices that it supports. The device class has a list of +drivers of that particular class. To access all of the devices in the +class, iterate over the device lists of each driver in the class. + + +Device Drivers +~~~~~~~~~~~~~~ +Device drivers are added to device classes when they are registered +with the core. A driver specifies the class it belongs to by setting +the struct device_driver::devclass field. + + +sysfs directory structure +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +There is a top-level sysfs directory named 'class'. 
+ +Each class gets a directory in the class directory, along with two +default subdirectories:: + + class/ + `-- input + |-- devices + `-- drivers + + +Drivers registered with the class get a symlink in the drivers/ directory +that points to the driver's directory (under its bus directory):: + + class/ + `-- input + |-- devices + `-- drivers + `-- usb:usb_mouse -> ../../../bus/drivers/usb_mouse/ + + +Each device gets a symlink in the devices/ directory that points to the +device's directory in the physical hierarchy:: + + class/ + `-- input + |-- devices + | `-- 1 -> ../../../root/pci0/00:1f.0/usb_bus/00:1f.2-1:0/ + `-- drivers + + +Exporting Attributes +~~~~~~~~~~~~~~~~~~~~ + +:: + + struct devclass_attribute { + struct attribute attr; + ssize_t (*show)(struct device_class *, char * buf, size_t count, loff_t off); + ssize_t (*store)(struct device_class *, const char * buf, size_t count, loff_t off); + }; + +Class drivers can export attributes using the DEVCLASS_ATTR macro that works +similarly to the DEVICE_ATTR macro for devices. For example, a definition +like this:: + + static DEVCLASS_ATTR(debug,0644,show_debug,store_debug); + +is equivalent to declaring:: + + static devclass_attribute devclass_attr_debug; + +The bus driver can add and remove the attribute from the class's +sysfs directory using:: + + int devclass_create_file(struct device_class *, struct devclass_attribute *); + void devclass_remove_file(struct device_class *, struct devclass_attribute *); + +In the example above, the file will be named 'debug' in placed in the +class's directory in sysfs. + + +Interfaces +~~~~~~~~~~ +There may exist multiple mechanisms for accessing the same device of a +particular class type. Device interfaces describe these mechanisms. + +When a device is added to a device class, the core attempts to add it +to every interface that is registered with the device class. 
diff --git a/Documentation/driver-model/class.txt b/Documentation/driver-model/class.txt deleted file mode 100644 index 1fefc480a80b..000000000000 --- a/Documentation/driver-model/class.txt +++ /dev/null @@ -1,147 +0,0 @@ - -Device Classes - - -Introduction -~~~~~~~~~~~~ -A device class describes a type of device, like an audio or network -device. The following device classes have been identified: - - - - -Each device class defines a set of semantics and a programming interface -that devices of that class adhere to. Device drivers are the -implementation of that programming interface for a particular device on -a particular bus. - -Device classes are agnostic with respect to what bus a device resides -on. - - -Programming Interface -~~~~~~~~~~~~~~~~~~~~~ -The device class structure looks like: - - -typedef int (*devclass_add)(struct device *); -typedef void (*devclass_remove)(struct device *); - -See the kerneldoc for the struct class. - -A typical device class definition would look like: - -struct device_class input_devclass = { - .name = "input", - .add_device = input_add_device, - .remove_device = input_remove_device, -}; - -Each device class structure should be exported in a header file so it -can be used by drivers, extensions and interfaces. - -Device classes are registered and unregistered with the core using: - -int devclass_register(struct device_class * cls); -void devclass_unregister(struct device_class * cls); - - -Devices -~~~~~~~ -As devices are bound to drivers, they are added to the device class -that the driver belongs to. Before the driver model core, this would -typically happen during the driver's probe() callback, once the device -has been initialized. It now happens after the probe() callback -finishes from the core. - -The device is enumerated in the class. Each time a device is added to -the class, the class's devnum field is incremented and assigned to the -device. 
The field is never decremented, so if the device is removed -from the class and re-added, it will receive a different enumerated -value. - -The class is allowed to create a class-specific structure for the -device and store it in the device's class_data pointer. - -There is no list of devices in the device class. Each driver has a -list of devices that it supports. The device class has a list of -drivers of that particular class. To access all of the devices in the -class, iterate over the device lists of each driver in the class. - - -Device Drivers -~~~~~~~~~~~~~~ -Device drivers are added to device classes when they are registered -with the core. A driver specifies the class it belongs to by setting -the struct device_driver::devclass field. - - -sysfs directory structure -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -There is a top-level sysfs directory named 'class'. - -Each class gets a directory in the class directory, along with two -default subdirectories: - - class/ - `-- input - |-- devices - `-- drivers - - -Drivers registered with the class get a symlink in the drivers/ directory -that points to the driver's directory (under its bus directory): - - class/ - `-- input - |-- devices - `-- drivers - `-- usb:usb_mouse -> ../../../bus/drivers/usb_mouse/ - - -Each device gets a symlink in the devices/ directory that points to the -device's directory in the physical hierarchy: - - class/ - `-- input - |-- devices - | `-- 1 -> ../../../root/pci0/00:1f.0/usb_bus/00:1f.2-1:0/ - `-- drivers - - -Exporting Attributes -~~~~~~~~~~~~~~~~~~~~ -struct devclass_attribute { - struct attribute attr; - ssize_t (*show)(struct device_class *, char * buf, size_t count, loff_t off); - ssize_t (*store)(struct device_class *, const char * buf, size_t count, loff_t off); -}; - -Class drivers can export attributes using the DEVCLASS_ATTR macro that works -similarly to the DEVICE_ATTR macro for devices. 
For example, a definition -like this: - -static DEVCLASS_ATTR(debug,0644,show_debug,store_debug); - -is equivalent to declaring: - -static devclass_attribute devclass_attr_debug; - -The bus driver can add and remove the attribute from the class's -sysfs directory using: - -int devclass_create_file(struct device_class *, struct devclass_attribute *); -void devclass_remove_file(struct device_class *, struct devclass_attribute *); - -In the example above, the file will be named 'debug' in placed in the -class's directory in sysfs. - - -Interfaces -~~~~~~~~~~ -There may exist multiple mechanisms for accessing the same device of a -particular class type. Device interfaces describe these mechanisms. - -When a device is added to a device class, the core attempts to add it -to every interface that is registered with the device class. - diff --git a/Documentation/driver-model/design-patterns.rst b/Documentation/driver-model/design-patterns.rst new file mode 100644 index 000000000000..41eb8f41f7dd --- /dev/null +++ b/Documentation/driver-model/design-patterns.rst @@ -0,0 +1,116 @@ +============================= +Device Driver Design Patterns +============================= + +This document describes a few common design patterns found in device drivers. +It is likely that subsystem maintainers will ask driver developers to +conform to these design patterns. + +1. State Container +2. container_of() + + +1. State Container +~~~~~~~~~~~~~~~~~~ + +While the kernel contains a few device drivers that assume that they will +only be probed() once on a certain system (singletons), it is custom to assume +that the device the driver binds to will appear in several instances. This +means that the probe() function and all callbacks need to be reentrant. + +The most common way to achieve this is to use the state container design +pattern. It usually has this form:: + + struct foo { + spinlock_t lock; /* Example member */ + (...) + }; + + static int foo_probe(...) 
+ { + struct foo *foo; + + foo = devm_kzalloc(dev, sizeof(*foo), GFP_KERNEL); + if (!foo) + return -ENOMEM; + spin_lock_init(&foo->lock); + (...) + } + +This will create an instance of struct foo in memory every time probe() is +called. This is our state container for this instance of the device driver. +Of course it is then necessary to always pass this instance of the +state around to all functions that need access to the state and its members. + +For example, if the driver is registering an interrupt handler, you would +pass around a pointer to struct foo like this:: + + static irqreturn_t foo_handler(int irq, void *arg) + { + struct foo *foo = arg; + (...) + } + + static int foo_probe(...) + { + struct foo *foo; + + (...) + ret = request_irq(irq, foo_handler, 0, "foo", foo); + } + +This way you always get a pointer back to the correct instance of foo in +your interrupt handler. + + +2. container_of() +~~~~~~~~~~~~~~~~~ + +Continuing on the above example we add an offloaded work:: + + struct foo { + spinlock_t lock; + struct workqueue_struct *wq; + struct work_struct offload; + (...) + }; + + static void foo_work(struct work_struct *work) + { + struct foo *foo = container_of(work, struct foo, offload); + + (...) + } + + static irqreturn_t foo_handler(int irq, void *arg) + { + struct foo *foo = arg; + + queue_work(foo->wq, &foo->offload); + (...) + } + + static int foo_probe(...) + { + struct foo *foo; + + foo->wq = create_singlethread_workqueue("foo-wq"); + INIT_WORK(&foo->offload, foo_work); + (...) + } + +The design pattern is the same for an hrtimer or something similar that will +return a single argument which is a pointer to a struct member in the +callback. + +container_of() is a macro defined in + +What container_of() does is to obtain a pointer to the containing struct from +a pointer to a member by a simple subtraction using the offsetof() macro from +standard C, which allows something similar to object oriented behaviours. 
+Notice that the contained member must not be a pointer, but an actual member +for this to work. + +We can see here that we avoid having global pointers to our struct foo * +instance this way, while still keeping the number of parameters passed to the +work function to a single pointer. diff --git a/Documentation/driver-model/design-patterns.txt b/Documentation/driver-model/design-patterns.txt deleted file mode 100644 index ba7b2df64904..000000000000 --- a/Documentation/driver-model/design-patterns.txt +++ /dev/null @@ -1,116 +0,0 @@ - -Device Driver Design Patterns -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -This document describes a few common design patterns found in device drivers. -It is likely that subsystem maintainers will ask driver developers to -conform to these design patterns. - -1. State Container -2. container_of() - - -1. State Container -~~~~~~~~~~~~~~~~~~ - -While the kernel contains a few device drivers that assume that they will -only be probed() once on a certain system (singletons), it is custom to assume -that the device the driver binds to will appear in several instances. This -means that the probe() function and all callbacks need to be reentrant. - -The most common way to achieve this is to use the state container design -pattern. It usually has this form: - -struct foo { - spinlock_t lock; /* Example member */ - (...) -}; - -static int foo_probe(...) -{ - struct foo *foo; - - foo = devm_kzalloc(dev, sizeof(*foo), GFP_KERNEL); - if (!foo) - return -ENOMEM; - spin_lock_init(&foo->lock); - (...) -} - -This will create an instance of struct foo in memory every time probe() is -called. This is our state container for this instance of the device driver. -Of course it is then necessary to always pass this instance of the -state around to all functions that need access to the state and its members. 
- -For example, if the driver is registering an interrupt handler, you would -pass around a pointer to struct foo like this: - -static irqreturn_t foo_handler(int irq, void *arg) -{ - struct foo *foo = arg; - (...) -} - -static int foo_probe(...) -{ - struct foo *foo; - - (...) - ret = request_irq(irq, foo_handler, 0, "foo", foo); -} - -This way you always get a pointer back to the correct instance of foo in -your interrupt handler. - - -2. container_of() -~~~~~~~~~~~~~~~~~ - -Continuing on the above example we add an offloaded work: - -struct foo { - spinlock_t lock; - struct workqueue_struct *wq; - struct work_struct offload; - (...) -}; - -static void foo_work(struct work_struct *work) -{ - struct foo *foo = container_of(work, struct foo, offload); - - (...) -} - -static irqreturn_t foo_handler(int irq, void *arg) -{ - struct foo *foo = arg; - - queue_work(foo->wq, &foo->offload); - (...) -} - -static int foo_probe(...) -{ - struct foo *foo; - - foo->wq = create_singlethread_workqueue("foo-wq"); - INIT_WORK(&foo->offload, foo_work); - (...) -} - -The design pattern is the same for an hrtimer or something similar that will -return a single argument which is a pointer to a struct member in the -callback. - -container_of() is a macro defined in - -What container_of() does is to obtain a pointer to the containing struct from -a pointer to a member by a simple subtraction using the offsetof() macro from -standard C, which allows something similar to object oriented behaviours. -Notice that the contained member must not be a pointer, but an actual member -for this to work. - -We can see here that we avoid having global pointers to our struct foo * -instance this way, while still keeping the number of parameters passed to the -work function to a single pointer. 
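Illustration for the design-patterns document above: the state container and container_of() patterns can be tried outside the kernel. The following is a minimal userspace sketch, not kernel code. It assumes a simplified container_of() built only on offsetof() (the kernel macro also performs type checking), and struct work, run_work() and container_of_demo() are hypothetical stand-ins for struct work_struct, the workqueue and a driver with two probed instances.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace container_of(): subtract the member's offset
 * from the member pointer to get back to the enclosing struct. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct work_struct: it only carries a callback. */
struct work {
	void (*func)(struct work *work);
};

/* State container: one instance per device, no global pointers. */
struct foo {
	int id;
	int handled;
	struct work offload;	/* an embedded member, not a pointer */
};

/* The callback receives only a pointer to the embedded member. */
static void foo_work(struct work *work)
{
	struct foo *foo = container_of(work, struct foo, offload);

	foo->handled++;
}

static void foo_init(struct foo *foo, int id)
{
	foo->id = id;
	foo->handled = 0;
	foo->offload.func = foo_work;
}

/* Stand-in for the workqueue: invokes the callback with the member pointer. */
static void run_work(struct work *work)
{
	work->func(work);
}

/* Runs the work of instance a once and of instance b twice.
 * Returns a.handled * 10 + b.handled. */
int container_of_demo(void)
{
	struct foo a, b;

	foo_init(&a, 1);
	foo_init(&b, 2);

	run_work(&a.offload);
	run_work(&b.offload);
	run_work(&b.offload);

	/* The recovered pointer is the enclosing instance itself. */
	assert(container_of(&a.offload, struct foo, offload) == &a);

	return a.handled * 10 + b.handled;
}
```

Each callback updates only the instance whose member it was handed, which is why the embedded member has to be an actual struct member: a pointer member would live at an unrelated address and the subtraction would not lead back to struct foo.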
diff --git a/Documentation/driver-model/device.rst b/Documentation/driver-model/device.rst new file mode 100644 index 000000000000..2b868d49d349 --- /dev/null +++ b/Documentation/driver-model/device.rst @@ -0,0 +1,109 @@ +========================== +The Basic Device Structure +========================== + +See the kerneldoc for the struct device. + + +Programming Interface +~~~~~~~~~~~~~~~~~~~~~ +The bus driver that discovers the device uses this to register the +device with the core:: + + int device_register(struct device * dev); + +The bus should initialize the following fields: + + - parent + - name + - bus_id + - bus + +A device is removed from the core when its reference count goes to +0. The reference count can be adjusted using:: + + struct device * get_device(struct device * dev); + void put_device(struct device * dev); + +get_device() will return a pointer to the struct device passed to it +if the reference is not already 0 (if it's in the process of being +removed already). + +A driver can access the lock in the device structure using:: + + void lock_device(struct device * dev); + void unlock_device(struct device * dev); + + +Attributes +~~~~~~~~~~ + +:: + + struct device_attribute { + struct attribute attr; + ssize_t (*show)(struct device *dev, struct device_attribute *attr, + char *buf); + ssize_t (*store)(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count); + }; + +Attributes of devices can be exported by a device driver through sysfs. + +Please see Documentation/filesystems/sysfs.txt for more information +on how sysfs works. + +As explained in Documentation/kobject.txt, device attributes must be +created before the KOBJ_ADD uevent is generated. The only way to realize +that is by defining an attribute group. 
+ +Attributes are declared using a macro called DEVICE_ATTR:: + + #define DEVICE_ATTR(name,mode,show,store) + +Example::: + + static DEVICE_ATTR(type, 0444, show_type, NULL); + static DEVICE_ATTR(power, 0644, show_power, store_power); + +This declares two structures of type struct device_attribute with respective +names 'dev_attr_type' and 'dev_attr_power'. These two attributes can be +organized as follows into a group:: + + static struct attribute *dev_attrs[] = { + &dev_attr_type.attr, + &dev_attr_power.attr, + NULL, + }; + + static struct attribute_group dev_attr_group = { + .attrs = dev_attrs, + }; + + static const struct attribute_group *dev_attr_groups[] = { + &dev_attr_group, + NULL, + }; + +This array of groups can then be associated with a device by setting the +group pointer in struct device before device_register() is invoked:: + + dev->groups = dev_attr_groups; + device_register(dev); + +The device_register() function will use the 'groups' pointer to create the +device attributes and the device_unregister() function will use this pointer +to remove the device attributes. + +Word of warning: While the kernel allows device_create_file() and +device_remove_file() to be called on a device at any time, userspace has +strict expectations on when attributes get created. When a new device is +registered in the kernel, a uevent is generated to notify userspace (like +udev) that a new device is available. If attributes are added after the +device is registered, then userspace won't get notified and userspace will +not know about the new attributes. + +This is important for device driver that need to publish additional +attributes for a device at driver probe time. If the device driver simply +calls device_create_file() on the device structure passed to it, then +userspace will never be notified of the new attributes. 
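Illustration for the reference counting rules in device.rst above: a device goes away when its count reaches 0, and get_device() only hands out a reference while the count is still nonzero. The following is a single-threaded userspace sketch of that rule under stated assumptions. It is not how the kernel implements it; struct toy_device, toy_get(), toy_put() and toy_refcount_demo() are hypothetical names, and the plain int counter has none of the atomicity a real implementation needs.

```c
#include <stddef.h>

/* Hypothetical toy device: a plain counter plus a release callback. */
struct toy_device {
	int refcount;
	int released;
	void (*release)(struct toy_device *dev);
};

static void toy_release(struct toy_device *dev)
{
	dev->released = 1;
}

/* Take a reference; refuse if the device is already on its way out. */
struct toy_device *toy_get(struct toy_device *dev)
{
	if (!dev || dev->refcount == 0)
		return NULL;
	dev->refcount++;
	return dev;
}

/* Drop a reference; the last put triggers the release callback. */
void toy_put(struct toy_device *dev)
{
	if (dev && dev->refcount > 0 && --dev->refcount == 0)
		dev->release(dev);
}

/* Returns 1 if the sequence behaves as described in the text above. */
int toy_refcount_demo(void)
{
	struct toy_device dev = {
		.refcount = 1,
		.released = 0,
		.release = toy_release,
	};
	struct toy_device *ref = toy_get(&dev);	/* count is now 2 */
	int ok = (ref == &dev);

	toy_put(ref);				/* count is now 1 */
	ok = ok && !dev.released;

	toy_put(&dev);				/* count hits 0, release runs */
	ok = ok && dev.released;

	/* No new reference can be taken once the count is 0. */
	ok = ok && (toy_get(&dev) == NULL);

	return ok;
}
```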
diff --git a/Documentation/driver-model/device.txt b/Documentation/driver-model/device.txt deleted file mode 100644 index 2403eb856187..000000000000 --- a/Documentation/driver-model/device.txt +++ /dev/null @@ -1,106 +0,0 @@ - -The Basic Device Structure -~~~~~~~~~~~~~~~~~~~~~~~~~~ - -See the kerneldoc for the struct device. - - -Programming Interface -~~~~~~~~~~~~~~~~~~~~~ -The bus driver that discovers the device uses this to register the -device with the core: - -int device_register(struct device * dev); - -The bus should initialize the following fields: - - - parent - - name - - bus_id - - bus - -A device is removed from the core when its reference count goes to -0. The reference count can be adjusted using: - -struct device * get_device(struct device * dev); -void put_device(struct device * dev); - -get_device() will return a pointer to the struct device passed to it -if the reference is not already 0 (if it's in the process of being -removed already). - -A driver can access the lock in the device structure using: - -void lock_device(struct device * dev); -void unlock_device(struct device * dev); - - -Attributes -~~~~~~~~~~ -struct device_attribute { - struct attribute attr; - ssize_t (*show)(struct device *dev, struct device_attribute *attr, - char *buf); - ssize_t (*store)(struct device *dev, struct device_attribute *attr, - const char *buf, size_t count); -}; - -Attributes of devices can be exported by a device driver through sysfs. - -Please see Documentation/filesystems/sysfs.txt for more information -on how sysfs works. - -As explained in Documentation/kobject.txt, device attributes must be -created before the KOBJ_ADD uevent is generated. The only way to realize -that is by defining an attribute group. 
- -Attributes are declared using a macro called DEVICE_ATTR: - -#define DEVICE_ATTR(name,mode,show,store) - -Example: - -static DEVICE_ATTR(type, 0444, show_type, NULL); -static DEVICE_ATTR(power, 0644, show_power, store_power); - -This declares two structures of type struct device_attribute with respective -names 'dev_attr_type' and 'dev_attr_power'. These two attributes can be -organized as follows into a group: - -static struct attribute *dev_attrs[] = { - &dev_attr_type.attr, - &dev_attr_power.attr, - NULL, -}; - -static struct attribute_group dev_attr_group = { - .attrs = dev_attrs, -}; - -static const struct attribute_group *dev_attr_groups[] = { - &dev_attr_group, - NULL, -}; - -This array of groups can then be associated with a device by setting the -group pointer in struct device before device_register() is invoked: - - dev->groups = dev_attr_groups; - device_register(dev); - -The device_register() function will use the 'groups' pointer to create the -device attributes and the device_unregister() function will use this pointer -to remove the device attributes. - -Word of warning: While the kernel allows device_create_file() and -device_remove_file() to be called on a device at any time, userspace has -strict expectations on when attributes get created. When a new device is -registered in the kernel, a uevent is generated to notify userspace (like -udev) that a new device is available. If attributes are added after the -device is registered, then userspace won't get notified and userspace will -not know about the new attributes. - -This is important for device driver that need to publish additional -attributes for a device at driver probe time. If the device driver simply -calls device_create_file() on the device structure passed to it, then -userspace will never be notified of the new attributes. 
diff --git a/Documentation/driver-model/devres.rst b/Documentation/driver-model/devres.rst new file mode 100644 index 000000000000..4ac99122b5f1 --- /dev/null +++ b/Documentation/driver-model/devres.rst @@ -0,0 +1,414 @@ +================================ +Devres - Managed Device Resource +================================ + +Tejun Heo + +First draft 10 January 2007 + +.. contents + + 1. Intro : Huh? Devres? + 2. Devres : Devres in a nutshell + 3. Devres Group : Group devres'es and release them together + 4. Details : Life time rules, calling context, ... + 5. Overhead : How much do we have to pay for this? + 6. List of managed interfaces: Currently implemented managed interfaces + + +1. Intro +-------- + +devres came up while trying to convert libata to use iomap. Each +iomapped address should be kept and unmapped on driver detach. For +example, a plain SFF ATA controller (that is, good old PCI IDE) in +native mode makes use of 5 PCI BARs and all of them should be +maintained. + +As with many other device drivers, libata low level drivers have +sufficient bugs in ->remove and ->probe failure path. Well, yes, +that's probably because libata low level driver developers are lazy +bunch, but aren't all low level driver developers? After spending a +day fiddling with braindamaged hardware with no document or +braindamaged document, if it's finally working, well, it's working. + +For one reason or another, low level drivers don't receive as much +attention or testing as core code, and bugs on driver detach or +initialization failure don't happen often enough to be noticeable. +Init failure path is worse because it's much less travelled while +needs to handle multiple entry points. + +So, many low level drivers end up leaking resources on driver detach +and having half broken failure path implementation in ->probe() which +would leak resources or even cause oops when failure occurs. iomap +adds more to this mix. So do msi and msix. + + +2. 
Devres +--------- + +devres is basically linked list of arbitrarily sized memory areas +associated with a struct device. Each devres entry is associated with +a release function. A devres can be released in several ways. No +matter what, all devres entries are released on driver detach. On +release, the associated release function is invoked and then the +devres entry is freed. + +Managed interface is created for resources commonly used by device +drivers using devres. For example, coherent DMA memory is acquired +using dma_alloc_coherent(). The managed version is called +dmam_alloc_coherent(). It is identical to dma_alloc_coherent() except +for the DMA memory allocated using it is managed and will be +automatically released on driver detach. Implementation looks like +the following:: + + struct dma_devres { + size_t size; + void *vaddr; + dma_addr_t dma_handle; + }; + + static void dmam_coherent_release(struct device *dev, void *res) + { + struct dma_devres *this = res; + + dma_free_coherent(dev, this->size, this->vaddr, this->dma_handle); + } + + dmam_alloc_coherent(dev, size, dma_handle, gfp) + { + struct dma_devres *dr; + void *vaddr; + + dr = devres_alloc(dmam_coherent_release, sizeof(*dr), gfp); + ... + + /* alloc DMA memory as usual */ + vaddr = dma_alloc_coherent(...); + ... + + /* record size, vaddr, dma_handle in dr */ + dr->vaddr = vaddr; + ... + + devres_add(dev, dr); + + return vaddr; + } + +If a driver uses dmam_alloc_coherent(), the area is guaranteed to be +freed whether initialization fails half-way or the device gets +detached. If most resources are acquired using managed interface, a +driver can have much simpler init and exit code. Init path basically +looks like the following:: + + my_init_one() + { + struct mydev *d; + + d = devm_kzalloc(dev, sizeof(*d), GFP_KERNEL); + if (!d) + return -ENOMEM; + + d->ring = dmam_alloc_coherent(...); + if (!d->ring) + return -ENOMEM; + + if (check something) + return -EINVAL; + ... 
+ + return register_to_upper_layer(d); + } + +And exit path:: + + my_remove_one() + { + unregister_from_upper_layer(d); + shutdown_my_hardware(); + } + +As shown above, low level drivers can be simplified a lot by using +devres. Complexity is shifted from less maintained low level drivers +to better maintained higher layer. Also, as init failure path is +shared with exit path, both can get more testing. + +Note though that when converting current calls or assignments to +managed devm_* versions it is up to you to check if internal operations +like allocating memory, have failed. Managed resources pertains to the +freeing of these resources *only* - all other checks needed are still +on you. In some cases this may mean introducing checks that were not +necessary before moving to the managed devm_* calls. + + +3. Devres group +--------------- + +Devres entries can be grouped using devres group. When a group is +released, all contained normal devres entries and properly nested +groups are released. One usage is to rollback series of acquired +resources on failure. For example:: + + if (!devres_open_group(dev, NULL, GFP_KERNEL)) + return -ENOMEM; + + acquire A; + if (failed) + goto err; + + acquire B; + if (failed) + goto err; + ... + + devres_remove_group(dev, NULL); + return 0; + + err: + devres_release_group(dev, NULL); + return err_code; + +As resource acquisition failure usually means probe failure, constructs +like above are usually useful in midlayer driver (e.g. libata core +layer) where interface function shouldn't have side effect on failure. +For LLDs, just returning error code suffices in most cases. + +Each group is identified by `void *id`. It can either be explicitly +specified by @id argument to devres_open_group() or automatically +created by passing NULL as @id as in the above example. In both +cases, devres_open_group() returns the group's id. The returned id +can be passed to other devres functions to select the target group. 
+If NULL is given to those functions, the latest open group is +selected. + +For example, you can do something like the following:: + + int my_midlayer_create_something() + { + if (!devres_open_group(dev, my_midlayer_create_something, GFP_KERNEL)) + return -ENOMEM; + + ... + + devres_close_group(dev, my_midlayer_create_something); + return 0; + } + + void my_midlayer_destroy_something() + { + devres_release_group(dev, my_midlayer_create_something); + } + + +4. Details +---------- + +Lifetime of a devres entry begins on devres allocation and finishes +when it is released or destroyed (removed and freed) - no reference +counting. + +devres core guarantees atomicity to all basic devres operations and +has support for single-instance devres types (atomic +lookup-and-add-if-not-found). Other than that, synchronizing +concurrent accesses to allocated devres data is caller's +responsibility. This is usually non-issue because bus ops and +resource allocations already do the job. + +For an example of single-instance devres type, read pcim_iomap_table() +in lib/devres.c. + +All devres interface functions can be called without context if the +right gfp mask is given. + + +5. Overhead +----------- + +Each devres bookkeeping info is allocated together with requested data +area. With debug option turned off, bookkeeping info occupies 16 +bytes on 32bit machines and 24 bytes on 64bit (three pointers rounded +up to ull alignment). If singly linked list is used, it can be +reduced to two pointers (8 bytes on 32bit, 16 bytes on 64bit). + +Each devres group occupies 8 pointers. It can be reduced to 6 if +singly linked list is used. + +Memory space overhead on ahci controller with two ports is between 300 +and 400 bytes on 32bit machine after naive conversion (we can +certainly invest a bit more effort into libata core layer). + + +6. 
List of managed interfaces +----------------------------- + +CLOCK + devm_clk_get() + devm_clk_get_optional() + devm_clk_put() + devm_clk_hw_register() + devm_of_clk_add_hw_provider() + devm_clk_hw_register_clkdev() + +DMA + dmaenginem_async_device_register() + dmam_alloc_coherent() + dmam_alloc_attrs() + dmam_free_coherent() + dmam_pool_create() + dmam_pool_destroy() + +DRM + devm_drm_dev_init() + +GPIO + devm_gpiod_get() + devm_gpiod_get_index() + devm_gpiod_get_index_optional() + devm_gpiod_get_optional() + devm_gpiod_put() + devm_gpiod_unhinge() + devm_gpiochip_add_data() + devm_gpio_request() + devm_gpio_request_one() + devm_gpio_free() + +I2C + devm_i2c_new_dummy_device() + +IIO + devm_iio_device_alloc() + devm_iio_device_free() + devm_iio_device_register() + devm_iio_device_unregister() + devm_iio_kfifo_allocate() + devm_iio_kfifo_free() + devm_iio_triggered_buffer_setup() + devm_iio_triggered_buffer_cleanup() + devm_iio_trigger_alloc() + devm_iio_trigger_free() + devm_iio_trigger_register() + devm_iio_trigger_unregister() + devm_iio_channel_get() + devm_iio_channel_release() + devm_iio_channel_get_all() + devm_iio_channel_release_all() + +INPUT + devm_input_allocate_device() + +IO region + devm_release_mem_region() + devm_release_region() + devm_release_resource() + devm_request_mem_region() + devm_request_region() + devm_request_resource() + +IOMAP + devm_ioport_map() + devm_ioport_unmap() + devm_ioremap() + devm_ioremap_nocache() + devm_ioremap_wc() + devm_ioremap_resource() : checks resource, requests memory region, ioremaps + devm_iounmap() + pcim_iomap() + pcim_iomap_regions() : do request_region() and iomap() on multiple BARs + pcim_iomap_table() : array of mapped addresses indexed by BAR + pcim_iounmap() + +IRQ + devm_free_irq() + devm_request_any_context_irq() + devm_request_irq() + devm_request_threaded_irq() + devm_irq_alloc_descs() + devm_irq_alloc_desc() + devm_irq_alloc_desc_at() + devm_irq_alloc_desc_from() + devm_irq_alloc_descs_from() + 
devm_irq_alloc_generic_chip() + devm_irq_setup_generic_chip() + devm_irq_sim_init() + +LED + devm_led_classdev_register() + devm_led_classdev_unregister() + +MDIO + devm_mdiobus_alloc() + devm_mdiobus_alloc_size() + devm_mdiobus_free() + +MEM + devm_free_pages() + devm_get_free_pages() + devm_kasprintf() + devm_kcalloc() + devm_kfree() + devm_kmalloc() + devm_kmalloc_array() + devm_kmemdup() + devm_kstrdup() + devm_kvasprintf() + devm_kzalloc() + +MFD + devm_mfd_add_devices() + +MUX + devm_mux_chip_alloc() + devm_mux_chip_register() + devm_mux_control_get() + +PER-CPU MEM + devm_alloc_percpu() + devm_free_percpu() + +PCI + devm_pci_alloc_host_bridge() : managed PCI host bridge allocation + devm_pci_remap_cfgspace() : ioremap PCI configuration space + devm_pci_remap_cfg_resource() : ioremap PCI configuration space resource + pcim_enable_device() : after success, all PCI ops become managed + pcim_pin_device() : keep PCI device enabled after release + +PHY + devm_usb_get_phy() + devm_usb_put_phy() + +PINCTRL + devm_pinctrl_get() + devm_pinctrl_put() + devm_pinctrl_register() + devm_pinctrl_unregister() + +POWER + devm_reboot_mode_register() + devm_reboot_mode_unregister() + +PWM + devm_pwm_get() + devm_pwm_put() + +REGULATOR + devm_regulator_bulk_get() + devm_regulator_get() + devm_regulator_put() + devm_regulator_register() + +RESET + devm_reset_control_get() + devm_reset_controller_register() + +SERDEV + devm_serdev_device_open() + +SLAVE DMA ENGINE + devm_acpi_dma_controller_register() + +SPI + devm_spi_register_master() + +WATCHDOG + devm_watchdog_register_device() diff --git a/Documentation/driver-model/devres.txt b/Documentation/driver-model/devres.txt deleted file mode 100644 index 69c7fa7f616c..000000000000 --- a/Documentation/driver-model/devres.txt +++ /dev/null @@ -1,412 +0,0 @@ -Devres - Managed Device Resource -================================ - -Tejun Heo - -First draft 10 January 2007 - - -1. Intro : Huh? Devres? -2. Devres : Devres in a nutshell -3. 
Devres Group : Group devres'es and release them together -4. Details : Life time rules, calling context, ... -5. Overhead : How much do we have to pay for this? -6. List of managed interfaces : Currently implemented managed interfaces - - - 1. Intro - -------- - -devres came up while trying to convert libata to use iomap. Each -iomapped address should be kept and unmapped on driver detach. For -example, a plain SFF ATA controller (that is, good old PCI IDE) in -native mode makes use of 5 PCI BARs and all of them should be -maintained. - -As with many other device drivers, libata low level drivers have -sufficient bugs in ->remove and ->probe failure path. Well, yes, -that's probably because libata low level driver developers are lazy -bunch, but aren't all low level driver developers? After spending a -day fiddling with braindamaged hardware with no document or -braindamaged document, if it's finally working, well, it's working. - -For one reason or another, low level drivers don't receive as much -attention or testing as core code, and bugs on driver detach or -initialization failure don't happen often enough to be noticeable. -Init failure path is worse because it's much less travelled while -needs to handle multiple entry points. - -So, many low level drivers end up leaking resources on driver detach -and having half broken failure path implementation in ->probe() which -would leak resources or even cause oops when failure occurs. iomap -adds more to this mix. So do msi and msix. - - - 2. Devres - --------- - -devres is basically linked list of arbitrarily sized memory areas -associated with a struct device. Each devres entry is associated with -a release function. A devres can be released in several ways. No -matter what, all devres entries are released on driver detach. On -release, the associated release function is invoked and then the -devres entry is freed. - -Managed interface is created for resources commonly used by device -drivers using devres. 
For example, coherent DMA memory is acquired -using dma_alloc_coherent(). The managed version is called -dmam_alloc_coherent(). It is identical to dma_alloc_coherent() except -for the DMA memory allocated using it is managed and will be -automatically released on driver detach. Implementation looks like -the following. - - struct dma_devres { - size_t size; - void *vaddr; - dma_addr_t dma_handle; - }; - - static void dmam_coherent_release(struct device *dev, void *res) - { - struct dma_devres *this = res; - - dma_free_coherent(dev, this->size, this->vaddr, this->dma_handle); - } - - dmam_alloc_coherent(dev, size, dma_handle, gfp) - { - struct dma_devres *dr; - void *vaddr; - - dr = devres_alloc(dmam_coherent_release, sizeof(*dr), gfp); - ... - - /* alloc DMA memory as usual */ - vaddr = dma_alloc_coherent(...); - ... - - /* record size, vaddr, dma_handle in dr */ - dr->vaddr = vaddr; - ... - - devres_add(dev, dr); - - return vaddr; - } - -If a driver uses dmam_alloc_coherent(), the area is guaranteed to be -freed whether initialization fails half-way or the device gets -detached. If most resources are acquired using managed interface, a -driver can have much simpler init and exit code. Init path basically -looks like the following. - - my_init_one() - { - struct mydev *d; - - d = devm_kzalloc(dev, sizeof(*d), GFP_KERNEL); - if (!d) - return -ENOMEM; - - d->ring = dmam_alloc_coherent(...); - if (!d->ring) - return -ENOMEM; - - if (check something) - return -EINVAL; - ... - - return register_to_upper_layer(d); - } - -And exit path, - - my_remove_one() - { - unregister_from_upper_layer(d); - shutdown_my_hardware(); - } - -As shown above, low level drivers can be simplified a lot by using -devres. Complexity is shifted from less maintained low level drivers -to better maintained higher layer. Also, as init failure path is -shared with exit path, both can get more testing. 
- -Note though that when converting current calls or assignments to -managed devm_* versions it is up to you to check if internal operations -like allocating memory, have failed. Managed resources pertains to the -freeing of these resources *only* - all other checks needed are still -on you. In some cases this may mean introducing checks that were not -necessary before moving to the managed devm_* calls. - - - 3. Devres group - --------------- - -Devres entries can be grouped using devres group. When a group is -released, all contained normal devres entries and properly nested -groups are released. One usage is to rollback series of acquired -resources on failure. For example, - - if (!devres_open_group(dev, NULL, GFP_KERNEL)) - return -ENOMEM; - - acquire A; - if (failed) - goto err; - - acquire B; - if (failed) - goto err; - ... - - devres_remove_group(dev, NULL); - return 0; - - err: - devres_release_group(dev, NULL); - return err_code; - -As resource acquisition failure usually means probe failure, constructs -like above are usually useful in midlayer driver (e.g. libata core -layer) where interface function shouldn't have side effect on failure. -For LLDs, just returning error code suffices in most cases. - -Each group is identified by void *id. It can either be explicitly -specified by @id argument to devres_open_group() or automatically -created by passing NULL as @id as in the above example. In both -cases, devres_open_group() returns the group's id. The returned id -can be passed to other devres functions to select the target group. -If NULL is given to those functions, the latest open group is -selected. - -For example, you can do something like the following. - - int my_midlayer_create_something() - { - if (!devres_open_group(dev, my_midlayer_create_something, GFP_KERNEL)) - return -ENOMEM; - - ... 
- - devres_close_group(dev, my_midlayer_create_something); - return 0; - } - - void my_midlayer_destroy_something() - { - devres_release_group(dev, my_midlayer_create_something); - } - - - 4. Details - ---------- - -Lifetime of a devres entry begins on devres allocation and finishes -when it is released or destroyed (removed and freed) - no reference -counting. - -devres core guarantees atomicity to all basic devres operations and -has support for single-instance devres types (atomic -lookup-and-add-if-not-found). Other than that, synchronizing -concurrent accesses to allocated devres data is caller's -responsibility. This is usually non-issue because bus ops and -resource allocations already do the job. - -For an example of single-instance devres type, read pcim_iomap_table() -in lib/devres.c. - -All devres interface functions can be called without context if the -right gfp mask is given. - - - 5. Overhead - ----------- - -Each devres bookkeeping info is allocated together with requested data -area. With debug option turned off, bookkeeping info occupies 16 -bytes on 32bit machines and 24 bytes on 64bit (three pointers rounded -up to ull alignment). If singly linked list is used, it can be -reduced to two pointers (8 bytes on 32bit, 16 bytes on 64bit). - -Each devres group occupies 8 pointers. It can be reduced to 6 if -singly linked list is used. - -Memory space overhead on ahci controller with two ports is between 300 -and 400 bytes on 32bit machine after naive conversion (we can -certainly invest a bit more effort into libata core layer). - - - 6. 
List of managed interfaces - ----------------------------- - -CLOCK - devm_clk_get() - devm_clk_get_optional() - devm_clk_put() - devm_clk_hw_register() - devm_of_clk_add_hw_provider() - devm_clk_hw_register_clkdev() - -DMA - dmaenginem_async_device_register() - dmam_alloc_coherent() - dmam_alloc_attrs() - dmam_free_coherent() - dmam_pool_create() - dmam_pool_destroy() - -DRM - devm_drm_dev_init() - -GPIO - devm_gpiod_get() - devm_gpiod_get_index() - devm_gpiod_get_index_optional() - devm_gpiod_get_optional() - devm_gpiod_put() - devm_gpiod_unhinge() - devm_gpiochip_add_data() - devm_gpio_request() - devm_gpio_request_one() - devm_gpio_free() - -I2C - devm_i2c_new_dummy_device() - -IIO - devm_iio_device_alloc() - devm_iio_device_free() - devm_iio_device_register() - devm_iio_device_unregister() - devm_iio_kfifo_allocate() - devm_iio_kfifo_free() - devm_iio_triggered_buffer_setup() - devm_iio_triggered_buffer_cleanup() - devm_iio_trigger_alloc() - devm_iio_trigger_free() - devm_iio_trigger_register() - devm_iio_trigger_unregister() - devm_iio_channel_get() - devm_iio_channel_release() - devm_iio_channel_get_all() - devm_iio_channel_release_all() - -INPUT - devm_input_allocate_device() - -IO region - devm_release_mem_region() - devm_release_region() - devm_release_resource() - devm_request_mem_region() - devm_request_region() - devm_request_resource() - -IOMAP - devm_ioport_map() - devm_ioport_unmap() - devm_ioremap() - devm_ioremap_nocache() - devm_ioremap_wc() - devm_ioremap_resource() : checks resource, requests memory region, ioremaps - devm_iounmap() - pcim_iomap() - pcim_iomap_regions() : do request_region() and iomap() on multiple BARs - pcim_iomap_table() : array of mapped addresses indexed by BAR - pcim_iounmap() - -IRQ - devm_free_irq() - devm_request_any_context_irq() - devm_request_irq() - devm_request_threaded_irq() - devm_irq_alloc_descs() - devm_irq_alloc_desc() - devm_irq_alloc_desc_at() - devm_irq_alloc_desc_from() - devm_irq_alloc_descs_from() - 
devm_irq_alloc_generic_chip() - devm_irq_setup_generic_chip() - devm_irq_sim_init() - -LED - devm_led_classdev_register() - devm_led_classdev_unregister() - -MDIO - devm_mdiobus_alloc() - devm_mdiobus_alloc_size() - devm_mdiobus_free() - -MEM - devm_free_pages() - devm_get_free_pages() - devm_kasprintf() - devm_kcalloc() - devm_kfree() - devm_kmalloc() - devm_kmalloc_array() - devm_kmemdup() - devm_kstrdup() - devm_kvasprintf() - devm_kzalloc() - -MFD - devm_mfd_add_devices() - -MUX - devm_mux_chip_alloc() - devm_mux_chip_register() - devm_mux_control_get() - -PER-CPU MEM - devm_alloc_percpu() - devm_free_percpu() - -PCI - devm_pci_alloc_host_bridge() : managed PCI host bridge allocation - devm_pci_remap_cfgspace() : ioremap PCI configuration space - devm_pci_remap_cfg_resource() : ioremap PCI configuration space resource - pcim_enable_device() : after success, all PCI ops become managed - pcim_pin_device() : keep PCI device enabled after release - -PHY - devm_usb_get_phy() - devm_usb_put_phy() - -PINCTRL - devm_pinctrl_get() - devm_pinctrl_put() - devm_pinctrl_register() - devm_pinctrl_unregister() - -POWER - devm_reboot_mode_register() - devm_reboot_mode_unregister() - -PWM - devm_pwm_get() - devm_pwm_put() - -REGULATOR - devm_regulator_bulk_get() - devm_regulator_get() - devm_regulator_put() - devm_regulator_register() - -RESET - devm_reset_control_get() - devm_reset_controller_register() - -SERDEV - devm_serdev_device_open() - -SLAVE DMA ENGINE - devm_acpi_dma_controller_register() - -SPI - devm_spi_register_master() - -WATCHDOG - devm_watchdog_register_device() diff --git a/Documentation/driver-model/driver.rst b/Documentation/driver-model/driver.rst new file mode 100644 index 000000000000..11d281506a04 --- /dev/null +++ b/Documentation/driver-model/driver.rst @@ -0,0 +1,223 @@ +============== +Device Drivers +============== + +See the kerneldoc for the struct device_driver. + + +Allocation +~~~~~~~~~~ + +Device drivers are statically allocated structures. 
Though there may +be multiple devices in a system that a driver supports, struct +device_driver represents the driver as a whole (not a particular +device instance). + +Initialization +~~~~~~~~~~~~~~ + +The driver must initialize at least the name and bus fields. It should +also initialize the devclass field (when it arrives), so it may obtain +the proper linkage internally. It should also initialize as many of +the callbacks as possible, though each is optional. + +Declaration +~~~~~~~~~~~ + +As stated above, struct device_driver objects are statically +allocated. Below is an example declaration of the eepro100 +driver. This declaration is hypothetical only; it relies on the driver +being converted completely to the new model:: + + static struct device_driver eepro100_driver = { + .name = "eepro100", + .bus = &pci_bus_type, + + .probe = eepro100_probe, + .remove = eepro100_remove, + .suspend = eepro100_suspend, + .resume = eepro100_resume, + }; + +Most drivers will not be able to be converted completely to the new +model because the bus they belong to has a bus-specific structure with +bus-specific fields that cannot be generalized. + +The most common examples of this are device ID structures. A driver +typically defines an array of device IDs that it supports. The format +of these structures and the semantics for comparing device IDs are +completely bus-specific. Defining them as bus-specific entities would +sacrifice type-safety, so we keep bus-specific structures around. + +Bus-specific drivers should include a generic struct device_driver in +the definition of the bus-specific driver. 
Like this:: + + struct pci_driver { + const struct pci_device_id *id_table; + struct device_driver driver; + }; + +A definition that included bus-specific fields would look like +(using the eepro100 driver again):: + + static struct pci_driver eepro100_driver = { + .id_table = eepro100_pci_tbl, + .driver = { + .name = "eepro100", + .bus = &pci_bus_type, + .probe = eepro100_probe, + .remove = eepro100_remove, + .suspend = eepro100_suspend, + .resume = eepro100_resume, + }, + }; + +Some may find the syntax of embedded struct initialization awkward or +even a bit ugly. So far, it's the best way we've found to do what we want... + +Registration +~~~~~~~~~~~~ + +:: + + int driver_register(struct device_driver *drv); + +The driver registers the structure on startup. For drivers that have +no bus-specific fields (i.e. don't have a bus-specific driver +structure), they would use driver_register and pass a pointer to their +struct device_driver object. + +Most drivers, however, will have a bus-specific structure and will +need to register with the bus using something like pci_register_driver. + +It is important that drivers register their driver structure as early as +possible. Registration with the core initializes several fields in the +struct device_driver object, including the reference count and the +lock. These fields are assumed to be valid at all times and may be +used by the device model core or the bus driver. + + +Transition Bus Drivers +~~~~~~~~~~~~~~~~~~~~~~ + +By defining wrapper functions, the transition to the new model can be +made easier. Drivers can ignore the generic structure altogether and +let the bus wrapper fill in the fields. For the callbacks, the bus can +define generic callbacks that forward the call to the bus-specific +callbacks of the drivers. + +This solution is intended to be only temporary. In order to get class +information in the driver, the drivers must be modified anyway. 
Since +converting drivers to the new model should reduce some infrastructural +complexity and code size, it is recommended that they are converted as +class information is added. + +Access +~~~~~~ + +Once the object has been registered, it may access the common fields of +the object, like the lock and the list of devices:: + + int driver_for_each_dev(struct device_driver *drv, void *data, + int (*callback)(struct device *dev, void *data)); + +The devices field is a list of all the devices that have been bound to +the driver. The LDM core provides a helper function to operate on all +the devices a driver controls. This helper locks the driver on each +node access, and does proper reference counting on each device as it +accesses it. + + +sysfs +~~~~~ + +When a driver is registered, a sysfs directory is created in its +bus's directory. In this directory, the driver can export an interface +to userspace to control operation of the driver on a global basis; +e.g. toggling debugging output in the driver. + +A future feature of this directory will be a 'devices' directory. This +directory will contain symlinks to the directories of devices it +supports. + + + +Callbacks +~~~~~~~~~ + +:: + + int (*probe) (struct device *dev); + +The probe() entry is called in task context, with the bus's rwsem locked +and the driver partially bound to the device. Drivers commonly use +container_of() to convert "dev" to a bus-specific type, both in probe() +and other routines. That type often provides device resource data, such +as pci_dev.resource[] or platform_device.resources, which is used in +addition to dev->platform_data to initialize the driver. + +This callback holds the driver-specific logic to bind the driver to a +given device. That includes verifying that the device is present, that +it's a version the driver can handle, that driver data structures can +be allocated and initialized, and that any hardware can be initialized. 
+Drivers often store a pointer to their state with dev_set_drvdata(). +When the driver has successfully bound itself to that device, then probe() +returns zero and the driver model code will finish its part of binding +the driver to that device. + +A driver's probe() may return a negative errno value to indicate that +the driver did not bind to this device, in which case it should have +released all resources it allocated:: + + int (*remove) (struct device *dev); + +remove is called to unbind a driver from a device. This may be +called if a device is physically removed from the system, if the +driver module is being unloaded, during a reboot sequence, or +in other cases. + +It is up to the driver to determine if the device is present or +not. It should free any resources allocated specifically for the +device; i.e. anything in the device's driver_data field. + +If the device is still present, it should quiesce the device and place +it into a supported low-power state:: + + int (*suspend) (struct device *dev, pm_message_t state); + +suspend is called to put the device in a low power state:: + + int (*resume) (struct device *dev); + +Resume is used to bring a device back from a low power state. + + +Attributes +~~~~~~~~~~ + +:: + + struct driver_attribute { + struct attribute attr; + ssize_t (*show)(struct device_driver *driver, char *buf); + ssize_t (*store)(struct device_driver *, const char *buf, size_t count); + }; + +Device drivers can export attributes via their sysfs directories. +Drivers can declare attributes using a DRIVER_ATTR_RW and DRIVER_ATTR_RO +macro that works identically to the DEVICE_ATTR_RW and DEVICE_ATTR_RO +macros. 
+ +Example:: + + DRIVER_ATTR_RW(debug); + +This is equivalent to declaring:: + + struct driver_attribute driver_attr_debug; + +This can then be used to add and remove the attribute from the +driver's directory using:: + + int driver_create_file(struct device_driver *, const struct driver_attribute *); + void driver_remove_file(struct device_driver *, const struct driver_attribute *); diff --git a/Documentation/driver-model/driver.txt b/Documentation/driver-model/driver.txt deleted file mode 100644 index d661e6f7e6a0..000000000000 --- a/Documentation/driver-model/driver.txt +++ /dev/null @@ -1,215 +0,0 @@ - -Device Drivers - -See the kerneldoc for the struct device_driver. - - -Allocation -~~~~~~~~~~ - -Device drivers are statically allocated structures. Though there may -be multiple devices in a system that a driver supports, struct -device_driver represents the driver as a whole (not a particular -device instance). - -Initialization -~~~~~~~~~~~~~~ - -The driver must initialize at least the name and bus fields. It should -also initialize the devclass field (when it arrives), so it may obtain -the proper linkage internally. It should also initialize as many of -the callbacks as possible, though each is optional. - -Declaration -~~~~~~~~~~~ - -As stated above, struct device_driver objects are statically -allocated. Below is an example declaration of the eepro100 -driver. This declaration is hypothetical only; it relies on the driver -being converted completely to the new model. - -static struct device_driver eepro100_driver = { - .name = "eepro100", - .bus = &pci_bus_type, - - .probe = eepro100_probe, - .remove = eepro100_remove, - .suspend = eepro100_suspend, - .resume = eepro100_resume, -}; - -Most drivers will not be able to be converted completely to the new -model because the bus they belong to has a bus-specific structure with -bus-specific fields that cannot be generalized. - -The most common example of this are device ID structures. 
A driver -typically defines an array of device IDs that it supports. The format -of these structures and the semantics for comparing device IDs are -completely bus-specific. Defining them as bus-specific entities would -sacrifice type-safety, so we keep bus-specific structures around. - -Bus-specific drivers should include a generic struct device_driver in -the definition of the bus-specific driver. Like this: - -struct pci_driver { - const struct pci_device_id *id_table; - struct device_driver driver; -}; - -A definition that included bus-specific fields would look like -(using the eepro100 driver again): - -static struct pci_driver eepro100_driver = { - .id_table = eepro100_pci_tbl, - .driver = { - .name = "eepro100", - .bus = &pci_bus_type, - .probe = eepro100_probe, - .remove = eepro100_remove, - .suspend = eepro100_suspend, - .resume = eepro100_resume, - }, -}; - -Some may find the syntax of embedded struct initialization awkward or -even a bit ugly. So far, it's the best way we've found to do what we want... - -Registration -~~~~~~~~~~~~ - -int driver_register(struct device_driver * drv); - -The driver registers the structure on startup. For drivers that have -no bus-specific fields (i.e. don't have a bus-specific driver -structure), they would use driver_register and pass a pointer to their -struct device_driver object. - -Most drivers, however, will have a bus-specific structure and will -need to register with the bus using something like pci_driver_register. - -It is important that drivers register their driver structure as early as -possible. Registration with the core initializes several fields in the -struct device_driver object, including the reference count and the -lock. These fields are assumed to be valid at all times and may be -used by the device model core or the bus driver. - - -Transition Bus Drivers -~~~~~~~~~~~~~~~~~~~~~~ - -By defining wrapper functions, the transition to the new model can be -made easier. 
Drivers can ignore the generic structure altogether and -let the bus wrapper fill in the fields. For the callbacks, the bus can -define generic callbacks that forward the call to the bus-specific -callbacks of the drivers. - -This solution is intended to be only temporary. In order to get class -information in the driver, the drivers must be modified anyway. Since -converting drivers to the new model should reduce some infrastructural -complexity and code size, it is recommended that they are converted as -class information is added. - -Access -~~~~~~ - -Once the object has been registered, it may access the common fields of -the object, like the lock and the list of devices. - -int driver_for_each_dev(struct device_driver * drv, void * data, - int (*callback)(struct device * dev, void * data)); - -The devices field is a list of all the devices that have been bound to -the driver. The LDM core provides a helper function to operate on all -the devices a driver controls. This helper locks the driver on each -node access, and does proper reference counting on each device as it -accesses it. - - -sysfs -~~~~~ - -When a driver is registered, a sysfs directory is created in its -bus's directory. In this directory, the driver can export an interface -to userspace to control operation of the driver on a global basis; -e.g. toggling debugging output in the driver. - -A future feature of this directory will be a 'devices' directory. This -directory will contain symlinks to the directories of devices it -supports. - - - -Callbacks -~~~~~~~~~ - - int (*probe) (struct device * dev); - -The probe() entry is called in task context, with the bus's rwsem locked -and the driver partially bound to the device. Drivers commonly use -container_of() to convert "dev" to a bus-specific type, both in probe() -and other routines. 
That type often provides device resource data, such -as pci_dev.resource[] or platform_device.resources, which is used in -addition to dev->platform_data to initialize the driver. - -This callback holds the driver-specific logic to bind the driver to a -given device. That includes verifying that the device is present, that -it's a version the driver can handle, that driver data structures can -be allocated and initialized, and that any hardware can be initialized. -Drivers often store a pointer to their state with dev_set_drvdata(). -When the driver has successfully bound itself to that device, then probe() -returns zero and the driver model code will finish its part of binding -the driver to that device. - -A driver's probe() may return a negative errno value to indicate that -the driver did not bind to this device, in which case it should have -released all resources it allocated. - - int (*remove) (struct device * dev); - -remove is called to unbind a driver from a device. This may be -called if a device is physically removed from the system, if the -driver module is being unloaded, during a reboot sequence, or -in other cases. - -It is up to the driver to determine if the device is present or -not. It should free any resources allocated specifically for the -device; i.e. anything in the device's driver_data field. - -If the device is still present, it should quiesce the device and place -it into a supported low-power state. - - int (*suspend) (struct device * dev, pm_message_t state); - -suspend is called to put the device in a low power state. - - int (*resume) (struct device * dev); - -Resume is used to bring a device back from a low power state. - - -Attributes -~~~~~~~~~~ -struct driver_attribute { - struct attribute attr; - ssize_t (*show)(struct device_driver *driver, char *buf); - ssize_t (*store)(struct device_driver *, const char * buf, size_t count); -}; - -Device drivers can export attributes via their sysfs directories. 
-Drivers can declare attributes using a DRIVER_ATTR_RW and DRIVER_ATTR_RO -macro that works identically to the DEVICE_ATTR_RW and DEVICE_ATTR_RO -macros. - -Example: - -DRIVER_ATTR_RW(debug); - -This is equivalent to declaring: - -struct driver_attribute driver_attr_debug; - -This can then be used to add and remove the attribute from the -driver's directory using: - -int driver_create_file(struct device_driver *, const struct driver_attribute *); -void driver_remove_file(struct device_driver *, const struct driver_attribute *); diff --git a/Documentation/driver-model/index.rst b/Documentation/driver-model/index.rst new file mode 100644 index 000000000000..9f85d579ce56 --- /dev/null +++ b/Documentation/driver-model/index.rst @@ -0,0 +1,26 @@ +:orphan: + +============ +Driver Model +============ + +.. toctree:: + :maxdepth: 1 + + binding + bus + class + design-patterns + device + devres + driver + overview + platform + porting + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/driver-model/overview.rst b/Documentation/driver-model/overview.rst new file mode 100644 index 000000000000..d4d1e9b40e0c --- /dev/null +++ b/Documentation/driver-model/overview.rst @@ -0,0 +1,124 @@ +============================= +The Linux Kernel Device Model +============================= + +Patrick Mochel + +Drafted 26 August 2002 +Updated 31 January 2006 + + +Overview +~~~~~~~~ + +The Linux Kernel Driver Model is a unification of all the disparate driver +models that were previously used in the kernel. It is intended to augment the +bus-specific drivers for bridges and devices by consolidating a set of data +and operations into globally accessible data structures. + +Traditional driver models implemented some sort of tree-like structure +(sometimes just a list) for the devices they control. There wasn't any +uniformity across the different bus types. 
+ +The current driver model provides a common, uniform data model for describing +a bus and the devices that can appear under the bus. The unified bus +model includes a set of common attributes which all buses carry, and a set +of common callbacks, such as device discovery during bus probing, bus +shutdown, bus power management, etc. + +The common device and bridge interface reflects the goals of the modern +computer: namely the ability to do seamless device "plug and play", power +management, and hot plug. In particular, the model dictated by Intel and +Microsoft (namely ACPI) ensures that almost every device on almost any bus +on an x86-compatible system can work within this paradigm. Of course, +not every bus is able to support all such operations, although most +buses support most of those operations. + + +Downstream Access +~~~~~~~~~~~~~~~~~ + +Common data fields have been moved out of individual bus layers into a common +data structure. These fields must still be accessed by the bus layers, +and sometimes by the device-specific drivers. + +Other bus layers are encouraged to do what has been done for the PCI layer. +struct pci_dev now looks like this:: + + struct pci_dev { + ... + + struct device dev; /* Generic device interface */ + ... + }; + +Note first that the struct device dev within the struct pci_dev is +statically allocated. This means only one allocation on device discovery. + +Note also that struct device dev is not necessarily defined at the +front of the pci_dev structure. This is to make people think about what +they're doing when switching between the bus driver and the global driver, +and to discourage meaningless and incorrect casts between the two. + +The PCI bus layer freely accesses the fields of struct device. It knows about +the structure of struct pci_dev, and it should know the structure of struct +device. 
Individual PCI device drivers that have been converted to the current +driver model generally do not and should not touch the fields of struct device, +unless there is a compelling reason to do so. + +The above abstraction prevents unnecessary pain during transitional phases. +If it were not done this way, then when a field was renamed or removed, every +downstream driver would break. On the other hand, if only the bus layer +(and not the device layer) accesses the struct device, it is only the bus +layer that needs to change. + + +User Interface +~~~~~~~~~~~~~~ + +By virtue of having a complete hierarchical view of all the devices in the +system, exporting a complete hierarchical view to userspace becomes relatively +easy. This has been accomplished by implementing a special purpose virtual +file system named sysfs. + +Almost all mainstream Linux distros mount this filesystem automatically; you +can see some variation of the following in the output of the "mount" command:: + + $ mount + ... + none on /sys type sysfs (rw,noexec,nosuid,nodev) + ... + $ + +The auto-mounting of sysfs is typically accomplished by an entry similar to +the following in the /etc/fstab file:: + + none /sys sysfs defaults 0 0 + +or something similar in the /lib/init/fstab file on Debian-based systems:: + + none /sys sysfs nodev,noexec,nosuid 0 0 + +If sysfs is not automatically mounted, you can always do it manually with:: + + # mount -t sysfs sysfs /sys + +Whenever a device is inserted into the tree, a directory is created for it. +This directory may be populated at each layer of discovery - the global layer, +the bus layer, or the device layer. + +The global layer currently creates two files - 'name' and 'power'. The +former only reports the name of the device. The latter reports the +current power state of the device. It will also be used to set the current +power state. + +The bus layer may also create files for the devices it finds while probing the +bus. 
For example, the PCI layer currently creates 'irq' and 'resource' files +for each PCI device. + +A device-specific driver may also export files in its directory to expose +device-specific data or tunable interfaces. + +More information about the sysfs directory layout can be found in +the other documents in this directory and in the file +Documentation/filesystems/sysfs.txt. diff --git a/Documentation/driver-model/overview.txt b/Documentation/driver-model/overview.txt deleted file mode 100644 index 6a8f9a8075d8..000000000000 --- a/Documentation/driver-model/overview.txt +++ /dev/null @@ -1,123 +0,0 @@ -The Linux Kernel Device Model - -Patrick Mochel - -Drafted 26 August 2002 -Updated 31 January 2006 - - -Overview -~~~~~~~~ - -The Linux Kernel Driver Model is a unification of all the disparate driver -models that were previously used in the kernel. It is intended to augment the -bus-specific drivers for bridges and devices by consolidating a set of data -and operations into globally accessible data structures. - -Traditional driver models implemented some sort of tree-like structure -(sometimes just a list) for the devices they control. There wasn't any -uniformity across the different bus types. - -The current driver model provides a common, uniform data model for describing -a bus and the devices that can appear under the bus. The unified bus -model includes a set of common attributes which all busses carry, and a set -of common callbacks, such as device discovery during bus probing, bus -shutdown, bus power management, etc. - -The common device and bridge interface reflects the goals of the modern -computer: namely the ability to do seamless device "plug and play", power -management, and hot plug. In particular, the model dictated by Intel and -Microsoft (namely ACPI) ensures that almost every device on almost any bus -on an x86-compatible system can work within this paradigm. 
Of course, -not every bus is able to support all such operations, although most -buses support most of those operations. - - -Downstream Access -~~~~~~~~~~~~~~~~~ - -Common data fields have been moved out of individual bus layers into a common -data structure. These fields must still be accessed by the bus layers, -and sometimes by the device-specific drivers. - -Other bus layers are encouraged to do what has been done for the PCI layer. -struct pci_dev now looks like this: - -struct pci_dev { - ... - - struct device dev; /* Generic device interface */ - ... -}; - -Note first that the struct device dev within the struct pci_dev is -statically allocated. This means only one allocation on device discovery. - -Note also that that struct device dev is not necessarily defined at the -front of the pci_dev structure. This is to make people think about what -they're doing when switching between the bus driver and the global driver, -and to discourage meaningless and incorrect casts between the two. - -The PCI bus layer freely accesses the fields of struct device. It knows about -the structure of struct pci_dev, and it should know the structure of struct -device. Individual PCI device drivers that have been converted to the current -driver model generally do not and should not touch the fields of struct device, -unless there is a compelling reason to do so. - -The above abstraction prevents unnecessary pain during transitional phases. -If it were not done this way, then when a field was renamed or removed, every -downstream driver would break. On the other hand, if only the bus layer -(and not the device layer) accesses the struct device, it is only the bus -layer that needs to change. - - -User Interface -~~~~~~~~~~~~~~ - -By virtue of having a complete hierarchical view of all the devices in the -system, exporting a complete hierarchical view to userspace becomes relatively -easy. 
This has been accomplished by implementing a special purpose virtual -file system named sysfs. - -Almost all mainstream Linux distros mount this filesystem automatically; you -can see some variation of the following in the output of the "mount" command: - -$ mount -... -none on /sys type sysfs (rw,noexec,nosuid,nodev) -... -$ - -The auto-mounting of sysfs is typically accomplished by an entry similar to -the following in the /etc/fstab file: - -none /sys sysfs defaults 0 0 - -or something similar in the /lib/init/fstab file on Debian-based systems: - -none /sys sysfs nodev,noexec,nosuid 0 0 - -If sysfs is not automatically mounted, you can always do it manually with: - -# mount -t sysfs sysfs /sys - -Whenever a device is inserted into the tree, a directory is created for it. -This directory may be populated at each layer of discovery - the global layer, -the bus layer, or the device layer. - -The global layer currently creates two files - 'name' and 'power'. The -former only reports the name of the device. The latter reports the -current power state of the device. It will also be used to set the current -power state. - -The bus layer may also create files for the devices it finds while probing the -bus. For example, the PCI layer currently creates 'irq' and 'resource' files -for each PCI device. - -A device-specific driver may also export files in its directory to expose -device-specific data or tunable interfaces. - -More information about the sysfs directory layout can be found in -the other documents in this directory and in the file -Documentation/filesystems/sysfs.txt. 
- diff --git a/Documentation/driver-model/platform.rst b/Documentation/driver-model/platform.rst new file mode 100644 index 000000000000..334dd4071ae4 --- /dev/null +++ b/Documentation/driver-model/platform.rst @@ -0,0 +1,246 @@ +============================ +Platform Devices and Drivers +============================ + +See for the driver model interface to the +platform bus: platform_device, and platform_driver. This pseudo-bus +is used to connect devices on busses with minimal infrastructure, +like those used to integrate peripherals on many system-on-chip +processors, or some "legacy" PC interconnects; as opposed to large +formally specified ones like PCI or USB. + + +Platform devices +~~~~~~~~~~~~~~~~ +Platform devices are devices that typically appear as autonomous +entities in the system. This includes legacy port-based devices and +host bridges to peripheral buses, and most controllers integrated +into system-on-chip platforms. What they usually have in common +is direct addressing from a CPU bus. Rarely, a platform_device will +be connected through a segment of some other kind of bus; but its +registers will still be directly addressable. + +Platform devices are given a name, used in driver binding, and a +list of resources such as addresses and IRQs:: + + struct platform_device { + const char *name; + u32 id; + struct device dev; + u32 num_resources; + struct resource *resource; + }; + + +Platform drivers +~~~~~~~~~~~~~~~~ +Platform drivers follow the standard driver model convention, where +discovery/enumeration is handled outside the drivers, and drivers +provide probe() and remove() methods. 
They support power management +and shutdown notifications using the standard conventions:: + + struct platform_driver { + int (*probe)(struct platform_device *); + int (*remove)(struct platform_device *); + void (*shutdown)(struct platform_device *); + int (*suspend)(struct platform_device *, pm_message_t state); + int (*suspend_late)(struct platform_device *, pm_message_t state); + int (*resume_early)(struct platform_device *); + int (*resume)(struct platform_device *); + struct device_driver driver; + }; + +Note that probe() should in general verify that the specified device hardware +actually exists; sometimes platform setup code can't be sure. The probing +can use device resources, including clocks, and device platform_data. + +Platform drivers register themselves the normal way:: + + int platform_driver_register(struct platform_driver *drv); + +Or, in common situations where the device is known not to be hot-pluggable, +the probe() routine can live in an init section to reduce the driver's +runtime memory footprint:: + + int platform_driver_probe(struct platform_driver *drv, + int (*probe)(struct platform_device *)) + +Kernel modules can be composed of several platform drivers. The platform core +provides helpers to register and unregister an array of drivers:: + + int __platform_register_drivers(struct platform_driver * const *drivers, + unsigned int count, struct module *owner); + void platform_unregister_drivers(struct platform_driver * const *drivers, + unsigned int count); + +If one of the drivers fails to register, all drivers registered up to that +point will be unregistered in reverse order. 
Note that there is a convenience +macro that passes THIS_MODULE as the owner parameter:: + + #define platform_register_drivers(drivers, count) + + +Device Enumeration +~~~~~~~~~~~~~~~~~~ +As a rule, platform-specific (and often board-specific) setup code will +register platform devices:: + + int platform_device_register(struct platform_device *pdev); + + int platform_add_devices(struct platform_device **pdevs, int ndev); + +The general rule is to register only those devices that actually exist, +but in some cases extra devices might be registered. For example, a kernel +might be configured to work with an external network adapter that might not +be populated on all boards, or likewise to work with an integrated controller +that some boards might not hook up to any peripherals. + +In some cases, boot firmware will export tables describing the devices +that are populated on a given board. Without such tables, often the +only way for system setup code to set up the correct devices is to build +a kernel for a specific target board. Such board-specific kernels are +common with embedded and custom systems development. + +In many cases, the memory and IRQ resources associated with the platform +device are not enough to let the device's driver work. Board setup code +will often provide additional information in the device's platform_data +field. + +Embedded systems frequently need one or more clocks for platform devices, +which are normally kept off until they're actively needed (to save power). +System setup also associates those clocks with the device, so that +calls to clk_get(&pdev->dev, clock_name) return them as needed. + + +Legacy Drivers: Device Probing +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Some drivers are not fully converted to the driver model, because they take +on a non-driver role: the driver registers its platform device, rather than +leaving that for system infrastructure.
Such drivers can't be hotplugged +or coldplugged, since those mechanisms require device creation to be in a +different system component than the driver. + +The only "good" reason for this is to handle older system designs which, like +original IBM PCs, rely on error-prone "probe-the-hardware" models for hardware +configuration. Newer systems have largely abandoned that model, in favor of +bus-level support for dynamic configuration (PCI, USB), or device tables +provided by the boot firmware (e.g. PNPACPI on x86). There are too many +conflicting options about what might be where, and even educated guesses by +an operating system will be wrong often enough to make trouble. + +This style of driver is discouraged. If you're updating such a driver, +please try to move the device enumeration to a more appropriate location, +outside the driver. This will usually be cleanup, since such drivers +tend to already have "normal" modes, such as ones using device nodes that +were created by PNP or by platform device setup. + +Nonetheless, there are some APIs to support such legacy drivers. Avoid +using these calls except with such hotplug-deficient drivers:: + + struct platform_device *platform_device_alloc( + const char *name, int id); + +You can use platform_device_alloc() to dynamically allocate a device, which +you will then initialize with resources and platform_device_register(). +A better solution is usually:: + + struct platform_device *platform_device_register_simple( + const char *name, int id, + struct resource *res, unsigned int nres); + +You can use platform_device_register_simple() as a one-step call to allocate +and register a device. + + +Device Naming and Driver Binding +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The platform_device.dev.bus_id is the canonical name for the devices. +It's built from two components: + + * platform_device.name ... which is also used for driver matching. + + * platform_device.id ...
the device instance number, or else "-1" + to indicate there's only one. + +These are concatenated, so name/id "serial"/0 indicates bus_id "serial.0", and +"serial/3" indicates bus_id "serial.3"; both would use the platform_driver +named "serial". While "my_rtc"/-1 would be bus_id "my_rtc" (no instance id) +and use the platform_driver called "my_rtc". + +Driver binding is performed automatically by the driver core, invoking +driver probe() after finding a match between device and driver. If the +probe() succeeds, the driver and device are bound as usual. There are +three different ways to find such a match: + + - Whenever a device is registered, the drivers for that bus are + checked for matches. Platform devices should be registered very + early during system boot. + + - When a driver is registered using platform_driver_register(), all + unbound devices on that bus are checked for matches. Drivers + usually register later during booting, or by module loading. + + - Registering a driver using platform_driver_probe() works just like + using platform_driver_register(), except that the driver won't + be probed later if another device registers. (Which is OK, since + this interface is only for use with non-hotpluggable devices.) + + +Early Platform Devices and Drivers +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The early platform interfaces provide platform data to platform device +drivers early on during the system boot. The code is built on top of the +early_param() command line parsing and can be executed very early on. + +Example: "earlyprintk" class early serial console in 6 steps + +1. Registering early platform device data +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The architecture code registers platform device data using the function +early_platform_add_devices(). In the case of early serial console this +should be hardware configuration for the serial port. Devices registered +at this point will later on be matched against early platform drivers. + +2. 
Parsing kernel command line +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The architecture code calls parse_early_param() to parse the kernel +command line. This will execute all matching early_param() callbacks. +User specified early platform devices will be registered at this point. +For the early serial console case the user can specify port on the +kernel command line as "earlyprintk=serial.0" where "earlyprintk" is +the class string, "serial" is the name of the platform driver and +0 is the platform device id. If the id is -1 then the dot and the +id can be omitted. + +3. Installing early platform drivers belonging to a certain class +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The architecture code may optionally force registration of all early +platform drivers belonging to a certain class using the function +early_platform_driver_register_all(). User specified devices from +step 2 have priority over these. This step is omitted by the serial +driver example since the early serial driver code should be disabled +unless the user has specified port on the kernel command line. + +4. Early platform driver registration +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Compiled-in platform drivers making use of early_platform_init() are +automatically registered during step 2 or 3. The serial driver example +should use early_platform_init("earlyprintk", &platform_driver). + +5. Probing of early platform drivers belonging to a certain class +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The architecture code calls early_platform_driver_probe() to match +registered early platform devices associated with a certain class with +registered early platform drivers. Matched devices will get probed(). +This step can be executed at any point during the early boot. As soon +as possible may be good for the serial port case. + +6. 
Inside the early platform driver probe() +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The driver code needs to take special care during early boot, especially +when it comes to memory allocation and interrupt registration. The code +in the probe() function can use is_early_platform_device() to check if +it is called at early platform device or at the regular platform device +time. The early serial driver performs register_console() at this point. + +For further information, see . diff --git a/Documentation/driver-model/platform.txt b/Documentation/driver-model/platform.txt deleted file mode 100644 index 9d9e47dfc013..000000000000 --- a/Documentation/driver-model/platform.txt +++ /dev/null @@ -1,244 +0,0 @@ -Platform Devices and Drivers -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -See for the driver model interface to the -platform bus: platform_device, and platform_driver. This pseudo-bus -is used to connect devices on busses with minimal infrastructure, -like those used to integrate peripherals on many system-on-chip -processors, or some "legacy" PC interconnects; as opposed to large -formally specified ones like PCI or USB. - - -Platform devices -~~~~~~~~~~~~~~~~ -Platform devices are devices that typically appear as autonomous -entities in the system. This includes legacy port-based devices and -host bridges to peripheral buses, and most controllers integrated -into system-on-chip platforms. What they usually have in common -is direct addressing from a CPU bus. Rarely, a platform_device will -be connected through a segment of some other kind of bus; but its -registers will still be directly addressable. - -Platform devices are given a name, used in driver binding, and a -list of resources such as addresses and IRQs. 
- -struct platform_device { - const char *name; - u32 id; - struct device dev; - u32 num_resources; - struct resource *resource; -}; - - -Platform drivers -~~~~~~~~~~~~~~~~ -Platform drivers follow the standard driver model convention, where -discovery/enumeration is handled outside the drivers, and drivers -provide probe() and remove() methods. They support power management -and shutdown notifications using the standard conventions. - -struct platform_driver { - int (*probe)(struct platform_device *); - int (*remove)(struct platform_device *); - void (*shutdown)(struct platform_device *); - int (*suspend)(struct platform_device *, pm_message_t state); - int (*suspend_late)(struct platform_device *, pm_message_t state); - int (*resume_early)(struct platform_device *); - int (*resume)(struct platform_device *); - struct device_driver driver; -}; - -Note that probe() should in general verify that the specified device hardware -actually exists; sometimes platform setup code can't be sure. The probing -can use device resources, including clocks, and device platform_data. - -Platform drivers register themselves the normal way: - - int platform_driver_register(struct platform_driver *drv); - -Or, in common situations where the device is known not to be hot-pluggable, -the probe() routine can live in an init section to reduce the driver's -runtime memory footprint: - - int platform_driver_probe(struct platform_driver *drv, - int (*probe)(struct platform_device *)) - -Kernel modules can be composed of several platform drivers. The platform core -provides helpers to register and unregister an array of drivers: - - int __platform_register_drivers(struct platform_driver * const *drivers, - unsigned int count, struct module *owner); - void platform_unregister_drivers(struct platform_driver * const *drivers, - unsigned int count); - -If one of the drivers fails to register, all drivers registered up to that -point will be unregistered in reverse order. 
Note that there is a convenience -macro that passes THIS_MODULE as owner parameter: - - #define platform_register_drivers(drivers, count) - - -Device Enumeration -~~~~~~~~~~~~~~~~~~ -As a rule, platform specific (and often board-specific) setup code will -register platform devices: - - int platform_device_register(struct platform_device *pdev); - - int platform_add_devices(struct platform_device **pdevs, int ndev); - -The general rule is to register only those devices that actually exist, -but in some cases extra devices might be registered. For example, a kernel -might be configured to work with an external network adapter that might not -be populated on all boards, or likewise to work with an integrated controller -that some boards might not hook up to any peripherals. - -In some cases, boot firmware will export tables describing the devices -that are populated on a given board. Without such tables, often the -only way for system setup code to set up the correct devices is to build -a kernel for a specific target board. Such board-specific kernels are -common with embedded and custom systems development. - -In many cases, the memory and IRQ resources associated with the platform -device are not enough to let the device's driver work. Board setup code -will often provide additional information using the device's platform_data -field to hold additional information. - -Embedded systems frequently need one or more clocks for platform devices, -which are normally kept off until they're actively needed (to save power). -System setup also associates those clocks with the device, so that that -calls to clk_get(&pdev->dev, clock_name) return them as needed. - - -Legacy Drivers: Device Probing -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Some drivers are not fully converted to the driver model, because they take -on a non-driver role: the driver registers its platform device, rather than -leaving that for system infrastructure. 
Such drivers can't be hotplugged -or coldplugged, since those mechanisms require device creation to be in a -different system component than the driver. - -The only "good" reason for this is to handle older system designs which, like -original IBM PCs, rely on error-prone "probe-the-hardware" models for hardware -configuration. Newer systems have largely abandoned that model, in favor of -bus-level support for dynamic configuration (PCI, USB), or device tables -provided by the boot firmware (e.g. PNPACPI on x86). There are too many -conflicting options about what might be where, and even educated guesses by -an operating system will be wrong often enough to make trouble. - -This style of driver is discouraged. If you're updating such a driver, -please try to move the device enumeration to a more appropriate location, -outside the driver. This will usually be cleanup, since such drivers -tend to already have "normal" modes, such as ones using device nodes that -were created by PNP or by platform device setup. - -None the less, there are some APIs to support such legacy drivers. Avoid -using these calls except with such hotplug-deficient drivers. - - struct platform_device *platform_device_alloc( - const char *name, int id); - -You can use platform_device_alloc() to dynamically allocate a device, which -you will then initialize with resources and platform_device_register(). -A better solution is usually: - - struct platform_device *platform_device_register_simple( - const char *name, int id, - struct resource *res, unsigned int nres); - -You can use platform_device_register_simple() as a one-step call to allocate -and register a device. - - -Device Naming and Driver Binding -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The platform_device.dev.bus_id is the canonical name for the devices. -It's built from two components: - - * platform_device.name ... which is also used to for driver matching. - - * platform_device.id ... 
the device instance number, or else "-1" - to indicate there's only one. - -These are concatenated, so name/id "serial"/0 indicates bus_id "serial.0", and -"serial/3" indicates bus_id "serial.3"; both would use the platform_driver -named "serial". While "my_rtc"/-1 would be bus_id "my_rtc" (no instance id) -and use the platform_driver called "my_rtc". - -Driver binding is performed automatically by the driver core, invoking -driver probe() after finding a match between device and driver. If the -probe() succeeds, the driver and device are bound as usual. There are -three different ways to find such a match: - - - Whenever a device is registered, the drivers for that bus are - checked for matches. Platform devices should be registered very - early during system boot. - - - When a driver is registered using platform_driver_register(), all - unbound devices on that bus are checked for matches. Drivers - usually register later during booting, or by module loading. - - - Registering a driver using platform_driver_probe() works just like - using platform_driver_register(), except that the driver won't - be probed later if another device registers. (Which is OK, since - this interface is only for use with non-hotpluggable devices.) - - -Early Platform Devices and Drivers -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The early platform interfaces provide platform data to platform device -drivers early on during the system boot. The code is built on top of the -early_param() command line parsing and can be executed very early on. - -Example: "earlyprintk" class early serial console in 6 steps - -1. Registering early platform device data -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The architecture code registers platform device data using the function -early_platform_add_devices(). In the case of early serial console this -should be hardware configuration for the serial port. Devices registered -at this point will later on be matched against early platform drivers. - -2. 
Parsing kernel command line -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The architecture code calls parse_early_param() to parse the kernel -command line. This will execute all matching early_param() callbacks. -User specified early platform devices will be registered at this point. -For the early serial console case the user can specify port on the -kernel command line as "earlyprintk=serial.0" where "earlyprintk" is -the class string, "serial" is the name of the platform driver and -0 is the platform device id. If the id is -1 then the dot and the -id can be omitted. - -3. Installing early platform drivers belonging to a certain class -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The architecture code may optionally force registration of all early -platform drivers belonging to a certain class using the function -early_platform_driver_register_all(). User specified devices from -step 2 have priority over these. This step is omitted by the serial -driver example since the early serial driver code should be disabled -unless the user has specified port on the kernel command line. - -4. Early platform driver registration -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Compiled-in platform drivers making use of early_platform_init() are -automatically registered during step 2 or 3. The serial driver example -should use early_platform_init("earlyprintk", &platform_driver). - -5. Probing of early platform drivers belonging to a certain class -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The architecture code calls early_platform_driver_probe() to match -registered early platform devices associated with a certain class with -registered early platform drivers. Matched devices will get probed(). -This step can be executed at any point during the early boot. As soon -as possible may be good for the serial port case. - -6. 
Inside the early platform driver probe() -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The driver code needs to take special care during early boot, especially -when it comes to memory allocation and interrupt registration. The code -in the probe() function can use is_early_platform_device() to check if -it is called at early platform device or at the regular platform device -time. The early serial driver performs register_console() at this point. - -For further information, see . diff --git a/Documentation/driver-model/porting.rst b/Documentation/driver-model/porting.rst new file mode 100644 index 000000000000..ae4bf843c1d6 --- /dev/null +++ b/Documentation/driver-model/porting.rst @@ -0,0 +1,448 @@ +======================================= +Porting Drivers to the New Driver Model +======================================= + +Patrick Mochel + +7 January 2003 + + +Overview + +Please refer to `Documentation/driver-model/*.rst` for definitions of +various driver types and concepts. + +Most of the work of porting device drivers to the new model happens +at the bus driver layer. This was intentional, to minimize the +negative effect on kernel drivers, and to allow a gradual transition +of bus drivers. + +In a nutshell, the driver model consists of a set of objects that can +be embedded in larger, bus-specific objects. Fields in these generic +objects can replace fields in the bus-specific objects. + +The generic objects must be registered with the driver model core. By +doing so, they will be exported via the sysfs filesystem. sysfs can be +mounted by doing:: + + # mount -t sysfs sysfs /sys + + + +The Process + +Step 0: Read include/linux/device.h for object and function definitions. + +Step 1: Registering the bus driver. + + +- Define a struct bus_type for the bus driver:: + + struct bus_type pci_bus_type = { + .name = "pci", + }; + + +- Register the bus type.
+ + This should be done in the initialization function for the bus type, + which is usually the module_init(), or equivalent, function:: + + static int __init pci_driver_init(void) + { + return bus_register(&pci_bus_type); + } + + subsys_initcall(pci_driver_init); + + + The bus type may be unregistered (if the bus driver may be compiled + as a module) by doing:: + + bus_unregister(&pci_bus_type); + + +- Export the bus type for others to use. + + Other code may wish to reference the bus type, so declare it in a + shared header file and export the symbol. + +From include/linux/pci.h:: + + extern struct bus_type pci_bus_type; + + +From the file the above code appears in:: + + EXPORT_SYMBOL(pci_bus_type); + + + +- This will cause the bus to show up in /sys/bus/pci/ with two + subdirectories: 'devices' and 'drivers':: + + # tree -d /sys/bus/pci/ + /sys/bus/pci/ + |-- devices + `-- drivers + + + +Step 2: Registering Devices. + +struct device represents a single device. It mainly contains metadata +describing the relationship the device has to other entities. + + +- Embed a struct device in the bus-specific device type:: + + + struct pci_dev { + ... + struct device dev; /* Generic device interface */ + ... + }; + + It is recommended that the generic device not be the first item in + the struct to discourage programmers from doing mindless casts + between the object types. Instead macros, or inline functions, + should be created to convert from the generic object type:: + + + #define to_pci_dev(n) container_of(n, struct pci_dev, dev) + + or + + static inline struct pci_dev * to_pci_dev(struct device * dev) + { + return container_of(dev, struct pci_dev, dev); + } + + This allows the compiler to verify type-safety of the operations + that are performed (which is Good). + + +- Initialize the device on registration. + + When devices are discovered or registered with the bus type, the + bus driver should initialize the generic device.
The most important + things to initialize are the bus_id, parent, and bus fields. + + The bus_id is an ASCII string that contains the device's address on + the bus. The format of this string is bus-specific. This is + necessary for representing devices in sysfs. + + parent is the physical parent of the device. It is important that + the bus driver sets this field correctly. + + The driver model maintains an ordered list of devices that it uses + for power management. This list must be ordered to guarantee that + devices are shut down before their physical parents, and vice versa. + The order of this list is determined by the parent of registered + devices. + + Also, the location of the device's sysfs directory depends on a + device's parent. sysfs exports a directory structure that mirrors + the device hierarchy. Accurately setting the parent guarantees that + sysfs will accurately represent the hierarchy. + + The device's bus field is a pointer to the bus type the device + belongs to. This should be set to the bus_type that was declared + and initialized before. + + Optionally, the bus driver may set the device's name and release + fields. + + The name field is an ASCII string describing the device, like + + "ATI Technologies Inc Radeon QD" + + The release field is a callback that the driver model core calls + when the device has been removed, and all references to it have + been released. More on this in a moment. + + +- Register the device. + + Once the generic device has been initialized, it can be registered + with the driver model core by doing:: + + device_register(&dev->dev); + + It can later be unregistered by doing:: + + device_unregister(&dev->dev); + + This should happen on buses that support hotpluggable devices. + If a bus driver unregisters a device, it should not immediately free + it. It should instead wait for the driver model core to call the + device's release method, then free the bus-specific object.
+ (There may be other code that is currently referencing the device + structure, and it would be rude to free the device while that is + happening). + + + When the device is registered, a directory in sysfs is created. + The PCI tree in sysfs looks like:: + + /sys/devices/pci0/ + |-- 00:00.0 + |-- 00:01.0 + | `-- 01:00.0 + |-- 00:02.0 + | `-- 02:1f.0 + | `-- 03:00.0 + |-- 00:1e.0 + | `-- 04:04.0 + |-- 00:1f.0 + |-- 00:1f.1 + | |-- ide0 + | | |-- 0.0 + | | `-- 0.1 + | `-- ide1 + | `-- 1.0 + |-- 00:1f.2 + |-- 00:1f.3 + `-- 00:1f.5 + + Also, symlinks are created in the bus's 'devices' directory + that point to the device's directory in the physical hierarchy:: + + /sys/bus/pci/devices/ + |-- 00:00.0 -> ../../../devices/pci0/00:00.0 + |-- 00:01.0 -> ../../../devices/pci0/00:01.0 + |-- 00:02.0 -> ../../../devices/pci0/00:02.0 + |-- 00:1e.0 -> ../../../devices/pci0/00:1e.0 + |-- 00:1f.0 -> ../../../devices/pci0/00:1f.0 + |-- 00:1f.1 -> ../../../devices/pci0/00:1f.1 + |-- 00:1f.2 -> ../../../devices/pci0/00:1f.2 + |-- 00:1f.3 -> ../../../devices/pci0/00:1f.3 + |-- 00:1f.5 -> ../../../devices/pci0/00:1f.5 + |-- 01:00.0 -> ../../../devices/pci0/00:01.0/01:00.0 + |-- 02:1f.0 -> ../../../devices/pci0/00:02.0/02:1f.0 + |-- 03:00.0 -> ../../../devices/pci0/00:02.0/02:1f.0/03:00.0 + `-- 04:04.0 -> ../../../devices/pci0/00:1e.0/04:04.0 + + + +Step 3: Registering Drivers. + +struct device_driver is a simple driver structure that contains a set +of operations that the driver model core may call. + + +- Embed a struct device_driver in the bus-specific driver. + + Just like with devices, do something like:: + + struct pci_driver { + ... + struct device_driver driver; + }; + + +- Initialize the generic driver structure. + + When the driver registers with the bus (e.g. doing pci_register_driver()), + initialize the necessary fields of the driver: the name and bus + fields. + + +- Register the driver. 
+ + After the generic driver has been initialized, call:: + + driver_register(&drv->driver); + + to register the driver with the core. + + When the driver is unregistered from the bus, unregister it from the + core by doing:: + + driver_unregister(&drv->driver); + + Note that this will block until all references to the driver have + gone away. Normally, there will not be any. + + +- Sysfs representation. + + Drivers are exported via sysfs in their bus's 'driver's directory. + For example:: + + /sys/bus/pci/drivers/ + |-- 3c59x + |-- Ensoniq AudioPCI + |-- agpgart-amdk7 + |-- e100 + `-- serial + + +Step 4: Define Generic Methods for Drivers. + +struct device_driver defines a set of operations that the driver model +core calls. Most of these operations are probably similar to +operations the bus already defines for drivers, but taking different +parameters. + +It would be difficult and tedious to force every driver on a bus to +simultaneously convert their drivers to generic format. Instead, the +bus driver should define single instances of the generic methods that +forward call to the bus-specific drivers. For instance:: + + + static int pci_device_remove(struct device * dev) + { + struct pci_dev * pci_dev = to_pci_dev(dev); + struct pci_driver * drv = pci_dev->driver; + + if (drv) { + if (drv->remove) + drv->remove(pci_dev); + pci_dev->driver = NULL; + } + return 0; + } + + +The generic driver should be initialized with these methods before it +is registered:: + + /* initialize common driver fields */ + drv->driver.name = drv->name; + drv->driver.bus = &pci_bus_type; + drv->driver.probe = pci_device_probe; + drv->driver.resume = pci_device_resume; + drv->driver.suspend = pci_device_suspend; + drv->driver.remove = pci_device_remove; + + /* register with core */ + driver_register(&drv->driver); + + +Ideally, the bus should only initialize the fields if they are not +already set. This allows the drivers to implement their own generic +methods. 
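Illustration only, not part of the patch. The forwarding pattern of Step 4 can be tried outside the kernel. The sketch below is a userspace approximation in C. It assumes heavily simplified stand-ins for `struct device`, `struct device_driver`, `struct pci_dev` and `struct pci_driver`, re-implements `container_of()` with `offsetof()`, and introduces the hypothetical names `demo_remove()` and `demo_forwarding()`. It shows a single generic `remove` method that recovers the bus-specific device with `to_pci_dev()` and forwards the call. The generic method is installed only when the driver has not already set one, as the text recommends.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace stand-ins for the kernel types. Everything below is a
 * hypothetical simplification, not the real kernel definitions.
 */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct device {
	const char *name;
};

struct device_driver {
	const char *name;
	int (*remove)(struct device *dev);	/* generic method */
};

struct pci_dev;

struct pci_driver {
	const char *name;
	void (*remove)(struct pci_dev *pdev);	/* bus-specific method */
	struct device_driver driver;		/* embedded generic driver */
};

struct pci_dev {
	int removed;			/* counter bumped by demo_remove() */
	struct pci_driver *driver;	/* bound bus-specific driver */
	struct device dev;		/* generic device, deliberately not first */
};

#define to_pci_dev(n) container_of(n, struct pci_dev, dev)

/* Single generic method that forwards to the bus-specific one. */
static int pci_device_remove(struct device *dev)
{
	struct pci_dev *pci_dev = to_pci_dev(dev);
	struct pci_driver *drv = pci_dev->driver;

	if (drv) {
		if (drv->remove)
			drv->remove(pci_dev);
		pci_dev->driver = NULL;
	}
	return 0;
}

/* Hypothetical bus-specific callback, used only by the demo. */
static void demo_remove(struct pci_dev *pdev)
{
	pdev->removed++;
}

/*
 * Bind a fake device to a fake driver, then remove it through the
 * generic path. Returns how often the bus-specific remove() ran.
 */
int demo_forwarding(void)
{
	struct pci_driver drv = { .name = "demo", .remove = demo_remove };
	struct pci_dev pdev = { .removed = 0, .driver = &drv };

	/* Only fill in the generic method if the driver left it unset. */
	if (!drv.driver.remove)
		drv.driver.remove = pci_device_remove;

	drv.driver.remove(&pdev.dev);
	assert(pdev.driver == NULL);	/* the device is unbound afterwards */
	return pdev.removed;
}
```

Calling `demo_forwarding()` returns 1, since the bus-specific `remove` runs exactly once. Because the embedded `struct device` is not the first member, a plain cast would yield the wrong pointer, so the conversion has to go through `to_pci_dev()`.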
+ + +Step 5: Support generic driver binding. + +The model assumes that a device or driver can be dynamically +registered with the bus at any time. When registration happens, +devices must be bound to a driver, or drivers must be bound to all +devices that it supports. + +A driver typically contains a list of device IDs that it supports. The +bus driver compares these IDs to the IDs of devices registered with it. +The format of the device IDs, and the semantics for comparing them are +bus-specific, so the generic model does attempt to generalize them. + +Instead, a bus may supply a method in struct bus_type that does the +comparison:: + + int (*match)(struct device * dev, struct device_driver * drv); + +match should return positive value if the driver supports the device, +and zero otherwise. It may also return error code (for example +-EPROBE_DEFER) if determining that given driver supports the device is +not possible. + +When a device is registered, the bus's list of drivers is iterated +over. bus->match() is called for each one until a match is found. + +When a driver is registered, the bus's list of devices is iterated +over. bus->match() is called for each device that is not already +claimed by a driver. + +When a device is successfully bound to a driver, device->driver is +set, the device is added to a per-driver list of devices, and a +symlink is created in the driver's sysfs directory that points to the +device's physical directory:: + + /sys/bus/pci/drivers/ + |-- 3c59x + | `-- 00:0b.0 -> ../../../../devices/pci0/00:0b.0 + |-- Ensoniq AudioPCI + |-- agpgart-amdk7 + | `-- 00:00.0 -> ../../../../devices/pci0/00:00.0 + |-- e100 + | `-- 00:0c.0 -> ../../../../devices/pci0/00:0c.0 + `-- serial + + +This driver binding should replace the existing driver binding +mechanism the bus currently uses. + + +Step 6: Supply a hotplug callback. + +Whenever a device is registered with the driver model core, the +userspace program /sbin/hotplug is called to notify userspace. 
+Users can define actions to perform when a device is inserted or +removed. + +The driver model core passes several arguments to userspace via +environment variables, including + +- ACTION: set to 'add' or 'remove' +- DEVPATH: set to the device's physical path in sysfs. + +A bus driver may also supply additional parameters for userspace to +consume. To do this, a bus must implement the 'hotplug' method in +struct bus_type:: + + int (*hotplug) (struct device *dev, char **envp, + int num_envp, char *buffer, int buffer_size); + +This is called immediately before /sbin/hotplug is executed. + + +Step 7: Cleaning up the bus driver. + +The generic bus, device, and driver structures provide several fields +that can replace those defined privately to the bus driver. + +- Device list. + +struct bus_type contains a list of all devices registered with the bus +type. This includes all devices on all instances of that bus type. +An internal list that the bus uses may be removed, in favor of using +this one. + +The core provides an iterator to access these devices:: + + int bus_for_each_dev(struct bus_type * bus, struct device * start, + void * data, int (*fn)(struct device *, void *)); + + +- Driver list. + +struct bus_type also contains a list of all drivers registered with +it. An internal list of drivers that the bus driver maintains may +be removed in favor of using the generic one. + +The drivers may be iterated over, like devices:: + + int bus_for_each_drv(struct bus_type * bus, struct device_driver * start, + void * data, int (*fn)(struct device_driver *, void *)); + + +Please see drivers/base/bus.c for more information. + + +- rwsem + +struct bus_type contains an rwsem that protects all core accesses to +the device and driver lists. This can be used by the bus driver +internally, and should be used when accessing the device or driver +lists the bus maintains. + + +- Device and driver fields. 
+ +Some of the fields in struct device and struct device_driver duplicate +fields in the bus-specific representations of these objects. Feel free +to remove the bus-specific ones and favor the generic ones. Note +though, that this will likely mean fixing up all the drivers that +reference the bus-specific fields (though those should all be 1-line +changes). diff --git a/Documentation/driver-model/porting.txt b/Documentation/driver-model/porting.txt deleted file mode 100644 index 453053f1661f..000000000000 --- a/Documentation/driver-model/porting.txt +++ /dev/null @@ -1,447 +0,0 @@ - -Porting Drivers to the New Driver Model - -Patrick Mochel - -7 January 2003 - - -Overview - -Please refer to Documentation/driver-model/*.txt for definitions of -various driver types and concepts. - -Most of the work of porting devices drivers to the new model happens -at the bus driver layer. This was intentional, to minimize the -negative effect on kernel drivers, and to allow a gradual transition -of bus drivers. - -In a nutshell, the driver model consists of a set of objects that can -be embedded in larger, bus-specific objects. Fields in these generic -objects can replace fields in the bus-specific objects. - -The generic objects must be registered with the driver model core. By -doing so, they will exported via the sysfs filesystem. sysfs can be -mounted by doing - - # mount -t sysfs sysfs /sys - - - -The Process - -Step 0: Read include/linux/device.h for object and function definitions. - -Step 1: Registering the bus driver. - - -- Define a struct bus_type for the bus driver. - -struct bus_type pci_bus_type = { - .name = "pci", -}; - - -- Register the bus type. - This should be done in the initialization function for the bus type, - which is usually the module_init(), or equivalent, function. 
- -static int __init pci_driver_init(void) -{ - return bus_register(&pci_bus_type); -} - -subsys_initcall(pci_driver_init); - - - The bus type may be unregistered (if the bus driver may be compiled - as a module) by doing: - - bus_unregister(&pci_bus_type); - - -- Export the bus type for others to use. - - Other code may wish to reference the bus type, so declare it in a - shared header file and export the symbol. - -From include/linux/pci.h: - -extern struct bus_type pci_bus_type; - - -From file the above code appears in: - -EXPORT_SYMBOL(pci_bus_type); - - - -- This will cause the bus to show up in /sys/bus/pci/ with two - subdirectories: 'devices' and 'drivers'. - -# tree -d /sys/bus/pci/ -/sys/bus/pci/ -|-- devices -`-- drivers - - - -Step 2: Registering Devices. - -struct device represents a single device. It mainly contains metadata -describing the relationship the device has to other entities. - - -- Embed a struct device in the bus-specific device type. - - -struct pci_dev { - ... - struct device dev; /* Generic device interface */ - ... -}; - - It is recommended that the generic device not be the first item in - the struct to discourage programmers from doing mindless casts - between the object types. Instead macros, or inline functions, - should be created to convert from the generic object type. - - -#define to_pci_dev(n) container_of(n, struct pci_dev, dev) - -or - -static inline struct pci_dev * to_pci_dev(struct kobject * kobj) -{ - return container_of(n, struct pci_dev, dev); -} - - This allows the compiler to verify type-safety of the operations - that are performed (which is Good). - - -- Initialize the device on registration. - - When devices are discovered or registered with the bus type, the - bus driver should initialize the generic device. The most important - things to initialize are the bus_id, parent, and bus fields. - - The bus_id is an ASCII string that contains the device's address on - the bus. The format of this string is bus-specific. 
This is - necessary for representing devices in sysfs. - - parent is the physical parent of the device. It is important that - the bus driver sets this field correctly. - - The driver model maintains an ordered list of devices that it uses - for power management. This list must be in order to guarantee that - devices are shutdown before their physical parents, and vice versa. - The order of this list is determined by the parent of registered - devices. - - Also, the location of the device's sysfs directory depends on a - device's parent. sysfs exports a directory structure that mirrors - the device hierarchy. Accurately setting the parent guarantees that - sysfs will accurately represent the hierarchy. - - The device's bus field is a pointer to the bus type the device - belongs to. This should be set to the bus_type that was declared - and initialized before. - - Optionally, the bus driver may set the device's name and release - fields. - - The name field is an ASCII string describing the device, like - - "ATI Technologies Inc Radeon QD" - - The release field is a callback that the driver model core calls - when the device has been removed, and all references to it have - been released. More on this in a moment. - - -- Register the device. - - Once the generic device has been initialized, it can be registered - with the driver model core by doing: - - device_register(&dev->dev); - - It can later be unregistered by doing: - - device_unregister(&dev->dev); - - This should happen on buses that support hotpluggable devices. - If a bus driver unregisters a device, it should not immediately free - it. It should instead wait for the driver model core to call the - device's release method, then free the bus-specific object. - (There may be other code that is currently referencing the device - structure, and it would be rude to free the device while that is - happening). - - - When the device is registered, a directory in sysfs is created. 
- The PCI tree in sysfs looks like: - -/sys/devices/pci0/ -|-- 00:00.0 -|-- 00:01.0 -| `-- 01:00.0 -|-- 00:02.0 -| `-- 02:1f.0 -| `-- 03:00.0 -|-- 00:1e.0 -| `-- 04:04.0 -|-- 00:1f.0 -|-- 00:1f.1 -| |-- ide0 -| | |-- 0.0 -| | `-- 0.1 -| `-- ide1 -| `-- 1.0 -|-- 00:1f.2 -|-- 00:1f.3 -`-- 00:1f.5 - - Also, symlinks are created in the bus's 'devices' directory - that point to the device's directory in the physical hierarchy. - -/sys/bus/pci/devices/ -|-- 00:00.0 -> ../../../devices/pci0/00:00.0 -|-- 00:01.0 -> ../../../devices/pci0/00:01.0 -|-- 00:02.0 -> ../../../devices/pci0/00:02.0 -|-- 00:1e.0 -> ../../../devices/pci0/00:1e.0 -|-- 00:1f.0 -> ../../../devices/pci0/00:1f.0 -|-- 00:1f.1 -> ../../../devices/pci0/00:1f.1 -|-- 00:1f.2 -> ../../../devices/pci0/00:1f.2 -|-- 00:1f.3 -> ../../../devices/pci0/00:1f.3 -|-- 00:1f.5 -> ../../../devices/pci0/00:1f.5 -|-- 01:00.0 -> ../../../devices/pci0/00:01.0/01:00.0 -|-- 02:1f.0 -> ../../../devices/pci0/00:02.0/02:1f.0 -|-- 03:00.0 -> ../../../devices/pci0/00:02.0/02:1f.0/03:00.0 -`-- 04:04.0 -> ../../../devices/pci0/00:1e.0/04:04.0 - - - -Step 3: Registering Drivers. - -struct device_driver is a simple driver structure that contains a set -of operations that the driver model core may call. - - -- Embed a struct device_driver in the bus-specific driver. - - Just like with devices, do something like: - -struct pci_driver { - ... - struct device_driver driver; -}; - - -- Initialize the generic driver structure. - - When the driver registers with the bus (e.g. doing pci_register_driver()), - initialize the necessary fields of the driver: the name and bus - fields. - - -- Register the driver. - - After the generic driver has been initialized, call - - driver_register(&drv->driver); - - to register the driver with the core. - - When the driver is unregistered from the bus, unregister it from the - core by doing: - - driver_unregister(&drv->driver); - - Note that this will block until all references to the driver have - gone away. 
Normally, there will not be any. - - -- Sysfs representation. - - Drivers are exported via sysfs in their bus's 'driver's directory. - For example: - -/sys/bus/pci/drivers/ -|-- 3c59x -|-- Ensoniq AudioPCI -|-- agpgart-amdk7 -|-- e100 -`-- serial - - -Step 4: Define Generic Methods for Drivers. - -struct device_driver defines a set of operations that the driver model -core calls. Most of these operations are probably similar to -operations the bus already defines for drivers, but taking different -parameters. - -It would be difficult and tedious to force every driver on a bus to -simultaneously convert their drivers to generic format. Instead, the -bus driver should define single instances of the generic methods that -forward call to the bus-specific drivers. For instance: - - -static int pci_device_remove(struct device * dev) -{ - struct pci_dev * pci_dev = to_pci_dev(dev); - struct pci_driver * drv = pci_dev->driver; - - if (drv) { - if (drv->remove) - drv->remove(pci_dev); - pci_dev->driver = NULL; - } - return 0; -} - - -The generic driver should be initialized with these methods before it -is registered. - - /* initialize common driver fields */ - drv->driver.name = drv->name; - drv->driver.bus = &pci_bus_type; - drv->driver.probe = pci_device_probe; - drv->driver.resume = pci_device_resume; - drv->driver.suspend = pci_device_suspend; - drv->driver.remove = pci_device_remove; - - /* register with core */ - driver_register(&drv->driver); - - -Ideally, the bus should only initialize the fields if they are not -already set. This allows the drivers to implement their own generic -methods. - - -Step 5: Support generic driver binding. - -The model assumes that a device or driver can be dynamically -registered with the bus at any time. When registration happens, -devices must be bound to a driver, or drivers must be bound to all -devices that it supports. - -A driver typically contains a list of device IDs that it supports. 
The -bus driver compares these IDs to the IDs of devices registered with it. -The format of the device IDs, and the semantics for comparing them are -bus-specific, so the generic model does attempt to generalize them. - -Instead, a bus may supply a method in struct bus_type that does the -comparison: - - int (*match)(struct device * dev, struct device_driver * drv); - -match should return positive value if the driver supports the device, -and zero otherwise. It may also return error code (for example --EPROBE_DEFER) if determining that given driver supports the device is -not possible. - -When a device is registered, the bus's list of drivers is iterated -over. bus->match() is called for each one until a match is found. - -When a driver is registered, the bus's list of devices is iterated -over. bus->match() is called for each device that is not already -claimed by a driver. - -When a device is successfully bound to a driver, device->driver is -set, the device is added to a per-driver list of devices, and a -symlink is created in the driver's sysfs directory that points to the -device's physical directory: - -/sys/bus/pci/drivers/ -|-- 3c59x -| `-- 00:0b.0 -> ../../../../devices/pci0/00:0b.0 -|-- Ensoniq AudioPCI -|-- agpgart-amdk7 -| `-- 00:00.0 -> ../../../../devices/pci0/00:00.0 -|-- e100 -| `-- 00:0c.0 -> ../../../../devices/pci0/00:0c.0 -`-- serial - - -This driver binding should replace the existing driver binding -mechanism the bus currently uses. - - -Step 6: Supply a hotplug callback. - -Whenever a device is registered with the driver model core, the -userspace program /sbin/hotplug is called to notify userspace. -Users can define actions to perform when a device is inserted or -removed. - -The driver model core passes several arguments to userspace via -environment variables, including - -- ACTION: set to 'add' or 'remove' -- DEVPATH: set to the device's physical path in sysfs. 
- -A bus driver may also supply additional parameters for userspace to -consume. To do this, a bus must implement the 'hotplug' method in -struct bus_type: - - int (*hotplug) (struct device *dev, char **envp, - int num_envp, char *buffer, int buffer_size); - -This is called immediately before /sbin/hotplug is executed. - - -Step 7: Cleaning up the bus driver. - -The generic bus, device, and driver structures provide several fields -that can replace those defined privately to the bus driver. - -- Device list. - -struct bus_type contains a list of all devices registered with the bus -type. This includes all devices on all instances of that bus type. -An internal list that the bus uses may be removed, in favor of using -this one. - -The core provides an iterator to access these devices. - -int bus_for_each_dev(struct bus_type * bus, struct device * start, - void * data, int (*fn)(struct device *, void *)); - - -- Driver list. - -struct bus_type also contains a list of all drivers registered with -it. An internal list of drivers that the bus driver maintains may -be removed in favor of using the generic one. - -The drivers may be iterated over, like devices: - -int bus_for_each_drv(struct bus_type * bus, struct device_driver * start, - void * data, int (*fn)(struct device_driver *, void *)); - - -Please see drivers/base/bus.c for more information. - - -- rwsem - -struct bus_type contains an rwsem that protects all core accesses to -the device and driver lists. This can be used by the bus driver -internally, and should be used when accessing the device or driver -lists the bus maintains. - - -- Device and driver fields. - -Some of the fields in struct device and struct device_driver duplicate -fields in the bus-specific representations of these objects. Feel free -to remove the bus-specific ones and favor the generic ones. 
Note
-though, that this will likely mean fixing up all the drivers that
-reference the bus-specific fields (though those should all be 1-line
-changes).
-
diff --git a/Documentation/eisa.txt b/Documentation/eisa.txt
index 2806e5544e43..f388545a85a7 100644
--- a/Documentation/eisa.txt
+++ b/Documentation/eisa.txt
@@ -103,7 +103,7 @@ id_table        an array of NULL terminated EISA id strings,
                 (driver_data).
 
 driver          a generic driver, such as described in
-                Documentation/driver-model/driver.txt. Only .name,
+                Documentation/driver-model/driver.rst. Only .name,
                 .probe and .remove members are mandatory.
 =============== ====================================================
 
@@ -152,7 +152,7 @@ state    set of flags indicating the state of the device. Current flags
          are EISA_CONFIG_ENABLED and EISA_CONFIG_FORCED.
 res      set of four 256 bytes I/O regions allocated to this device
 dma_mask DMA mask set from the parent device.
-dev      generic device (see Documentation/driver-model/device.txt)
+dev      generic device (see Documentation/driver-model/device.rst)
 ======== ============================================================
 
 You can get the 'struct eisa_device' from 'struct device' using the
diff --git a/Documentation/hwmon/submitting-patches.rst b/Documentation/hwmon/submitting-patches.rst
index f9796b9d9db6..d5b05d3e54ba 100644
--- a/Documentation/hwmon/submitting-patches.rst
+++ b/Documentation/hwmon/submitting-patches.rst
@@ -89,7 +89,7 @@ increase the chances of your change being accepted.
   console. Excessive logging can seriously affect system performance.
 
 * Use devres functions whenever possible to allocate resources. For rationale
-  and supported functions, please see Documentation/driver-model/devres.txt.
+  and supported functions, please see Documentation/driver-model/devres.rst.
   If a function is not supported by devres, consider using devm_add_action().
 
 * If the driver has a detect function, make sure it is silent.
Debug messages
diff --git a/drivers/base/platform.c b/drivers/base/platform.c
index 4d1729853d1a..713903290385 100644
--- a/drivers/base/platform.c
+++ b/drivers/base/platform.c
@@ -5,7 +5,7 @@
  * Copyright (c) 2002-3 Patrick Mochel
  * Copyright (c) 2002-3 Open Source Development Labs
  *
- * Please see Documentation/driver-model/platform.txt for more
+ * Please see Documentation/driver-model/platform.rst for more
  * information.
  */
 
diff --git a/drivers/gpio/gpio-cs5535.c b/drivers/gpio/gpio-cs5535.c
index 6314225dbed0..3611a0571667 100644
--- a/drivers/gpio/gpio-cs5535.c
+++ b/drivers/gpio/gpio-cs5535.c
@@ -41,7 +41,7 @@ MODULE_PARM_DESC(mask, "GPIO channel mask.");
 
 /*
  * FIXME: convert this singleton driver to use the state container
- * design pattern, see Documentation/driver-model/design-patterns.txt
+ * design pattern, see Documentation/driver-model/design-patterns.rst
  */
 static struct cs5535_gpio_chip {
 	struct gpio_chip chip;
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 7843abf4d44d..98a62072d81e 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -2237,7 +2237,7 @@ ice_probe(struct pci_dev *pdev, const struct pci_device_id __always_unused *ent)
 	struct ice_hw *hw;
 	int err;
 
-	/* this driver uses devres, see Documentation/driver-model/devres.txt */
+	/* this driver uses devres, see Documentation/driver-model/devres.rst */
 	err = pcim_enable_device(pdev);
 	if (err)
 		return err;
diff --git a/scripts/coccinelle/free/devm_free.cocci b/scripts/coccinelle/free/devm_free.cocci
index b2a2cf8bf81f..e32236a979a8 100644
--- a/scripts/coccinelle/free/devm_free.cocci
+++ b/scripts/coccinelle/free/devm_free.cocci
@@ -2,7 +2,7 @@
 /// functions. Values allocated using the devm_functions are freed when
 /// the device is detached, and thus the use of the standard freeing
 /// function would cause a double free.
-/// See Documentation/driver-model/devres.txt for more information.
+/// See Documentation/driver-model/devres.rst for more information.
 ///
 /// A difficulty of detecting this problem is that the standard freeing
 /// function might be called from a different function than the one
--
cgit v1.2.3

From ac499fba98c3c65078fd84fa0a62cd6f6d5837ed Mon Sep 17 00:00:00 2001
From: Mauro Carvalho Chehab
Date: Sat, 29 Jun 2019 07:36:46 -0300
Subject: docs: ipmb: place it at driver-api and convert to ReST

No new doc should be added at the main Documentation/ directory.
Instead, new docs should be added as ReST files, within the
Kernel documentation body.

Fixes: 51bd6f291583 ("Add support for IPMB driver")
Signed-off-by: Mauro Carvalho Chehab
Message-Id:
Signed-off-by: Corey Minyard
---
 Documentation/IPMB.txt             | 103 ------------------------------------
 Documentation/driver-api/index.rst |   1 +
 Documentation/driver-api/ipmb.rst  | 105 +++++++++++++++++++++++++++++++++++++
 3 files changed, 106 insertions(+), 103 deletions(-)
 delete mode 100644 Documentation/IPMB.txt
 create mode 100644 Documentation/driver-api/ipmb.rst

(limited to 'Documentation/driver-api')

diff --git a/Documentation/IPMB.txt b/Documentation/IPMB.txt
deleted file mode 100644
index a6ed8b68bd0f..000000000000
--- a/Documentation/IPMB.txt
+++ /dev/null
@@ -1,103 +0,0 @@
-==============================
-IPMB Driver for a Satellite MC
-==============================
-
-The Intelligent Platform Management Bus or IPMB, is an
-I2C bus that provides a standardized interconnection between
-different boards within a chassis. This interconnection is
-between the baseboard management (BMC) and chassis electronics.
-IPMB is also associated with the messaging protocol through the
-IPMB bus.
-
-The devices using the IPMB are usually management
-controllers that perform management functions such as servicing
-the front panel interface, monitoring the baseboard,
-hot-swapping disk drivers in the system chassis, etc...
- -When an IPMB is implemented in the system, the BMC serves as -a controller to give system software access to the IPMB. The BMC -sends IPMI requests to a device (usually a Satellite Management -Controller or Satellite MC) via IPMB and the device -sends a response back to the BMC. - -For more information on IPMB and the format of an IPMB message, -refer to the IPMB and IPMI specifications. - -IPMB driver for Satellite MC ----------------------------- - -ipmb-dev-int - This is the driver needed on a Satellite MC to -receive IPMB messages from a BMC and send a response back. -This driver works with the I2C driver and a userspace -program such as OpenIPMI: - -1) It is an I2C slave backend driver. So, it defines a callback -function to set the Satellite MC as an I2C slave. -This callback function handles the received IPMI requests. - -2) It defines the read and write functions to enable a user -space program (such as OpenIPMI) to communicate with the kernel. - - -Load the IPMB driver --------------------- - -The driver needs to be loaded at boot time or manually first. -First, make sure you have the following in your config file: -CONFIG_IPMB_DEVICE_INTERFACE=y - -1) If you want the driver to be loaded at boot time: - -a) Add this entry to your ACPI table, under the appropriate SMBus: - -Device (SMB0) // Example SMBus host controller -{ - Name (_HID, "") // Vendor-Specific HID - Name (_UID, 0) // Unique ID of particular host controller - : - : - Device (IPMB) - { - Name (_HID, "IPMB0001") // IPMB device interface - Name (_UID, 0) // Unique device identifier - } -} - -b) Example for device tree: - -&i2c2 { - status = "okay"; - - ipmb@10 { - compatible = "ipmb-dev"; - reg = <0x10>; - }; -}; - -2) Manually from Linux: -modprobe ipmb-dev-int - - -Instantiate the device ----------------------- - -After loading the driver, you can instantiate the device as -described in 'Documentation/i2c/instantiating-devices'. 
-If you have multiple BMCs, each connected to your Satellite MC via -a different I2C bus, you can instantiate a device for each of -those BMCs. -The name of the instantiated device contains the I2C bus number -associated with it as follows: - -BMC1 ------ IPMB/I2C bus 1 ---------| /dev/ipmb-1 - Satellite MC -BMC1 ------ IPMB/I2C bus 2 ---------| /dev/ipmb-2 - -For instance, you can instantiate the ipmb-dev-int device from -user space at the 7 bit address 0x10 on bus 2: - - # echo ipmb-dev 0x1010 > /sys/bus/i2c/devices/i2c-2/new_device - -This will create the device file /dev/ipmb-2, which can be accessed -by the user space program. The device needs to be instantiated -before running the user space program. diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index d26308af6036..b189bd3013ff 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -34,6 +34,7 @@ available subsections can be seen below. pci/index spi i2c + ipmb i3c/index hsi edac diff --git a/Documentation/driver-api/ipmb.rst b/Documentation/driver-api/ipmb.rst new file mode 100644 index 000000000000..7e2265144157 --- /dev/null +++ b/Documentation/driver-api/ipmb.rst @@ -0,0 +1,105 @@ +============================== +IPMB Driver for a Satellite MC +============================== + +The Intelligent Platform Management Bus or IPMB, is an +I2C bus that provides a standardized interconnection between +different boards within a chassis. This interconnection is +between the baseboard management (BMC) and chassis electronics. +IPMB is also associated with the messaging protocol through the +IPMB bus. + +The devices using the IPMB are usually management +controllers that perform management functions such as servicing +the front panel interface, monitoring the baseboard, +hot-swapping disk drivers in the system chassis, etc... 
+ +When an IPMB is implemented in the system, the BMC serves as +a controller to give system software access to the IPMB. The BMC +sends IPMI requests to a device (usually a Satellite Management +Controller or Satellite MC) via IPMB and the device +sends a response back to the BMC. + +For more information on IPMB and the format of an IPMB message, +refer to the IPMB and IPMI specifications. + +IPMB driver for Satellite MC +---------------------------- + +ipmb-dev-int - This is the driver needed on a Satellite MC to +receive IPMB messages from a BMC and send a response back. +This driver works with the I2C driver and a userspace +program such as OpenIPMI: + +1) It is an I2C slave backend driver. So, it defines a callback + function to set the Satellite MC as an I2C slave. + This callback function handles the received IPMI requests. + +2) It defines the read and write functions to enable a user + space program (such as OpenIPMI) to communicate with the kernel. + + +Load the IPMB driver +-------------------- + +The driver needs to be loaded at boot time or manually first. +First, make sure you have the following in your config file: +CONFIG_IPMB_DEVICE_INTERFACE=y + +1) If you want the driver to be loaded at boot time: + +a) Add this entry to your ACPI table, under the appropriate SMBus:: + + Device (SMB0) // Example SMBus host controller + { + Name (_HID, "") // Vendor-Specific HID + Name (_UID, 0) // Unique ID of particular host controller + : + : + Device (IPMB) + { + Name (_HID, "IPMB0001") // IPMB device interface + Name (_UID, 0) // Unique device identifier + } + } + +b) Example for device tree:: + + &i2c2 { + status = "okay"; + + ipmb@10 { + compatible = "ipmb-dev"; + reg = <0x10>; + }; + }; + +2) Manually from Linux:: + + modprobe ipmb-dev-int + + +Instantiate the device +---------------------- + +After loading the driver, you can instantiate the device as +described in 'Documentation/i2c/instantiating-devices'. 
+If you have multiple BMCs, each connected to your Satellite MC via
+a different I2C bus, you can instantiate a device for each of
+those BMCs.
+
+The name of the instantiated device contains the I2C bus number
+associated with it as follows::
+
+  BMC1 ------ IPMB/I2C bus 1 ---------|   /dev/ipmb-1
+                                Satellite MC
+  BMC1 ------ IPMB/I2C bus 2 ---------|   /dev/ipmb-2
+
+For instance, you can instantiate the ipmb-dev-int device from
+user space at the 7 bit address 0x10 on bus 2::
+
+  # echo ipmb-dev 0x1010 > /sys/bus/i2c/devices/i2c-2/new_device
+
+This will create the device file /dev/ipmb-2, which can be accessed
+by the user space program. The device needs to be instantiated
+before running the user space program.
--
cgit v1.2.3

From 01f14c52591dd9028b93d0641136a34b388b773d Mon Sep 17 00:00:00 2001
From: Geert Uytterhoeven
Date: Mon, 1 Jul 2019 16:10:05 +0200
Subject: Documentation: gpio: Fix reference to gpiod_get_array()

The function is called gpiod_get_array(), not gpiod_array_get().

Fixes: 77588c14ac868cae ("gpiolib: Pass array info to get/set array functions")
Signed-off-by: Geert Uytterhoeven
Link: https://lore.kernel.org/r/20190701141005.24631-1-geert+renesas@glider.be
Signed-off-by: Linus Walleij
---
 Documentation/driver-api/gpio/consumer.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'Documentation/driver-api')

diff --git a/Documentation/driver-api/gpio/consumer.rst b/Documentation/driver-api/gpio/consumer.rst
index 23d68c321c5c..9559aa3cbcef 100644
--- a/Documentation/driver-api/gpio/consumer.rst
+++ b/Documentation/driver-api/gpio/consumer.rst
@@ -364,7 +364,7 @@ accessed sequentially.
The functions take three arguments: * array_size - the number of array elements * desc_array - an array of GPIO descriptors - * array_info - optional information obtained from gpiod_array_get() + * array_info - optional information obtained from gpiod_get_array() * value_bitmap - a bitmap to store the GPIOs' values (get) or a bitmap of values to assign to the GPIOs (set) -- cgit v1.2.3 From 9dcb98a29b6e81394fa33ca984f3aaad4d0d1393 Mon Sep 17 00:00:00 2001 From: "Hook, Gary" Date: Mon, 24 Jun 2019 18:35:01 +0000 Subject: Documentation: dmaengine: clean up description of dmatest usage Fix the formatting of the multi-channel test usage example. Call out the note about parameter ordering and add detail on the settings of parameters for the new version of dmatest. Fixes: f80f9988a26d7 ("dmaengine: Documentation: Add documentation for multi chan testing") Signed-off-by: Gary R Hook Signed-off-by: Vinod Koul --- Documentation/driver-api/dmaengine/dmatest.rst | 21 +++++++++++++-------- 1 file changed, 13 insertions(+), 8 deletions(-) (limited to 'Documentation/driver-api') diff --git a/Documentation/driver-api/dmaengine/dmatest.rst b/Documentation/driver-api/dmaengine/dmatest.rst index e78d070bb468..ee268d445d38 100644 --- a/Documentation/driver-api/dmaengine/dmatest.rst +++ b/Documentation/driver-api/dmaengine/dmatest.rst @@ -44,7 +44,8 @@ Example of usage:: dmatest.timeout=2000 dmatest.iterations=1 dmatest.channel=dma0chan0 dmatest.run=1 -Example of multi-channel test usage: +Example of multi-channel test usage (new in the 5.0 kernel):: + % modprobe dmatest % echo 2000 > /sys/module/dmatest/parameters/timeout % echo 1 > /sys/module/dmatest/parameters/iterations @@ -53,15 +54,18 @@ Example of multi-channel test usage: % echo dma0chan2 > /sys/module/dmatest/parameters/channel % echo 1 > /sys/module/dmatest/parameters/run -Note: the channel parameter should always be the last parameter set prior to -running the test (setting run=1), this is because upon setting the channel 
-parameter, that specific channel is requested using the dmaengine and a thread -is created with the existing parameters. This thread is set as pending -and will be executed once run is set to 1. Any parameters set after the thread -is created are not applied. +.. note:: + For all tests, starting in the 5.0 kernel, either single- or multi-channel, + the channel parameter(s) must be set after all other parameters. It is at + that time that the existing parameter values are acquired for use by the + thread(s). All other parameters are shared. Therefore, if changes are made + to any of the other parameters, and an additional channel specified, the + (shared) parameters used for all threads will use the new values. + After the channels are specified, each thread is set as pending. All threads + begin execution when the run parameter is set to 1. .. hint:: - available channel list could be extracted by running the following command:: + A list of available channels can be found by running the following command:: % ls -1 /sys/class/dma/ @@ -204,6 +208,7 @@ Releasing Channels Channels can be freed by setting run to 0. Example:: + % echo dma0chan1 > /sys/module/dmatest/parameters/channel dmatest: Added 1 threads using dma0chan1 % cat /sys/class/dma/dma0chan1/in_use -- cgit v1.2.3 From d2bdd48a652bd0f7a5c78f3e418b4529fc469e1f Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 18 Jun 2019 16:03:23 -0300 Subject: docs: rapidio: add it to the driver API This is actually a subsystem description, which contains both kAPI and uAPI. While it should ideally be split, let's place it at driver-api, as most things are related to kAPI and driver-specific info. 
Signed-off-by: Mauro Carvalho Chehab --- Documentation/admin-guide/index.rst | 1 + Documentation/admin-guide/rapidio.rst | 107 +++++++ Documentation/driver-api/index.rst | 2 +- Documentation/driver-api/rapidio.rst | 107 ------- Documentation/driver-api/rapidio/index.rst | 13 + Documentation/driver-api/rapidio/mport_cdev.rst | 110 +++++++ Documentation/driver-api/rapidio/rapidio.rst | 362 ++++++++++++++++++++++++ Documentation/driver-api/rapidio/rio_cm.rst | 135 +++++++++ Documentation/driver-api/rapidio/sysfs.rst | 7 + Documentation/driver-api/rapidio/tsi721.rst | 112 ++++++++ Documentation/rapidio/index.rst | 15 - Documentation/rapidio/mport_cdev.rst | 110 ------- Documentation/rapidio/rapidio.rst | 362 ------------------------ Documentation/rapidio/rio_cm.rst | 135 --------- Documentation/rapidio/sysfs.rst | 7 - Documentation/rapidio/tsi721.rst | 112 -------- drivers/rapidio/Kconfig | 2 +- 17 files changed, 849 insertions(+), 850 deletions(-) create mode 100644 Documentation/admin-guide/rapidio.rst delete mode 100644 Documentation/driver-api/rapidio.rst create mode 100644 Documentation/driver-api/rapidio/index.rst create mode 100644 Documentation/driver-api/rapidio/mport_cdev.rst create mode 100644 Documentation/driver-api/rapidio/rapidio.rst create mode 100644 Documentation/driver-api/rapidio/rio_cm.rst create mode 100644 Documentation/driver-api/rapidio/sysfs.rst create mode 100644 Documentation/driver-api/rapidio/tsi721.rst delete mode 100644 Documentation/rapidio/index.rst delete mode 100644 Documentation/rapidio/mport_cdev.rst delete mode 100644 Documentation/rapidio/rapidio.rst delete mode 100644 Documentation/rapidio/rio_cm.rst delete mode 100644 Documentation/rapidio/sysfs.rst delete mode 100644 Documentation/rapidio/tsi721.rst (limited to 'Documentation/driver-api') diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index 24fbe0568eff..8853c95ef0d4 100644 --- a/Documentation/admin-guide/index.rst +++ 
b/Documentation/admin-guide/index.rst @@ -61,6 +61,7 @@ configure specific aspects of kernel behavior to your liking. parport md module-signing + rapidio sysrq unicode vga-softcursor diff --git a/Documentation/admin-guide/rapidio.rst b/Documentation/admin-guide/rapidio.rst new file mode 100644 index 000000000000..71ff658ab78e --- /dev/null +++ b/Documentation/admin-guide/rapidio.rst @@ -0,0 +1,107 @@ +======================= +RapidIO Subsystem Guide +======================= + +:Author: Matt Porter + +Introduction +============ + +RapidIO is a high speed switched fabric interconnect with features aimed +at the embedded market. RapidIO provides support for memory-mapped I/O +as well as message-based transactions over the switched fabric network. +RapidIO has a standardized discovery mechanism not unlike the PCI bus +standard that allows simple detection of devices in a network. + +This documentation is provided for developers intending to support +RapidIO on new architectures, write new drivers, or to understand the +subsystem internals. + +Known Bugs and Limitations +========================== + +Bugs +---- + +None. ;) + +Limitations +----------- + +1. Access/management of RapidIO memory regions is not supported + +2. Multiple host enumeration is not supported + +RapidIO driver interface +======================== + +Drivers are provided a set of calls in order to interface with the +subsystem to gather info on devices, request/map memory region +resources, and manage mailboxes/doorbells. + +Functions +--------- + +.. kernel-doc:: include/linux/rio_drv.h + :internal: + +.. kernel-doc:: drivers/rapidio/rio-driver.c + :export: + +.. kernel-doc:: drivers/rapidio/rio.c + :export: + +Internals +========= + +This chapter contains the autogenerated documentation of the RapidIO +subsystem. + +Structures +---------- + +.. kernel-doc:: include/linux/rio.h + :internal: + +Enumeration and Discovery +------------------------- + +.. 
kernel-doc:: drivers/rapidio/rio-scan.c + :internal: + +Driver functionality +-------------------- + +.. kernel-doc:: drivers/rapidio/rio.c + :internal: + +.. kernel-doc:: drivers/rapidio/rio-access.c + :internal: + +Device model support +-------------------- + +.. kernel-doc:: drivers/rapidio/rio-driver.c + :internal: + +PPC32 support +------------- + +.. kernel-doc:: arch/powerpc/sysdev/fsl_rio.c + :internal: + +Credits +======= + +The following people have contributed to the RapidIO subsystem directly +or indirectly: + +1. Matt Porter\ mporter@kernel.crashing.org + +2. Randy Vinson\ rvinson@mvista.com + +3. Dan Malek\ dan@embeddedalley.com + +The following people have contributed to this document: + +1. Matt Porter\ mporter@kernel.crashing.org diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index 6cd750a03ea0..d665cd9ab95f 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -45,7 +45,7 @@ available subsections can be seen below. miscellaneous mei/index w1 - rapidio + rapidio/index s390-drivers vme 80211/index diff --git a/Documentation/driver-api/rapidio.rst b/Documentation/driver-api/rapidio.rst deleted file mode 100644 index 71ff658ab78e..000000000000 --- a/Documentation/driver-api/rapidio.rst +++ /dev/null @@ -1,107 +0,0 @@ -======================= -RapidIO Subsystem Guide -======================= - -:Author: Matt Porter - -Introduction -============ - -RapidIO is a high speed switched fabric interconnect with features aimed -at the embedded market. RapidIO provides support for memory-mapped I/O -as well as message-based transactions over the switched fabric network. -RapidIO has a standardized discovery mechanism not unlike the PCI bus -standard that allows simple detection of devices in a network. - -This documentation is provided for developers intending to support -RapidIO on new architectures, write new drivers, or to understand the -subsystem internals. 
- -Known Bugs and Limitations -========================== - -Bugs ----- - -None. ;) - -Limitations ------------ - -1. Access/management of RapidIO memory regions is not supported - -2. Multiple host enumeration is not supported - -RapidIO driver interface -======================== - -Drivers are provided a set of calls in order to interface with the -subsystem to gather info on devices, request/map memory region -resources, and manage mailboxes/doorbells. - -Functions ---------- - -.. kernel-doc:: include/linux/rio_drv.h - :internal: - -.. kernel-doc:: drivers/rapidio/rio-driver.c - :export: - -.. kernel-doc:: drivers/rapidio/rio.c - :export: - -Internals -========= - -This chapter contains the autogenerated documentation of the RapidIO -subsystem. - -Structures ----------- - -.. kernel-doc:: include/linux/rio.h - :internal: - -Enumeration and Discovery -------------------------- - -.. kernel-doc:: drivers/rapidio/rio-scan.c - :internal: - -Driver functionality --------------------- - -.. kernel-doc:: drivers/rapidio/rio.c - :internal: - -.. kernel-doc:: drivers/rapidio/rio-access.c - :internal: - -Device model support --------------------- - -.. kernel-doc:: drivers/rapidio/rio-driver.c - :internal: - -PPC32 support -------------- - -.. kernel-doc:: arch/powerpc/sysdev/fsl_rio.c - :internal: - -Credits -======= - -The following people have contributed to the RapidIO subsystem directly -or indirectly: - -1. Matt Porter\ mporter@kernel.crashing.org - -2. Randy Vinson\ rvinson@mvista.com - -3. Dan Malek\ dan@embeddedalley.com - -The following people have contributed to this document: - -1. Matt Porter\ mporter@kernel.crashing.org diff --git a/Documentation/driver-api/rapidio/index.rst b/Documentation/driver-api/rapidio/index.rst new file mode 100644 index 000000000000..4c5e51a05134 --- /dev/null +++ b/Documentation/driver-api/rapidio/index.rst @@ -0,0 +1,13 @@ +=========================== +The Linux RapidIO Subsystem +=========================== + +.. 
toctree:: + :maxdepth: 1 + + rapidio + sysfs + + tsi721 + mport_cdev + rio_cm diff --git a/Documentation/driver-api/rapidio/mport_cdev.rst b/Documentation/driver-api/rapidio/mport_cdev.rst new file mode 100644 index 000000000000..df77a7f7be7d --- /dev/null +++ b/Documentation/driver-api/rapidio/mport_cdev.rst @@ -0,0 +1,110 @@ +================================================================== +RapidIO subsystem mport character device driver (rio_mport_cdev.c) +================================================================== + +1. Overview +=========== + +This device driver is the result of collaboration within the RapidIO.org +Software Task Group (STG) between Texas Instruments, Freescale, +Prodrive Technologies, Nokia Networks, BAE and IDT. Additional input was +received from other members of RapidIO.org. The objective was to create a +character mode driver interface which exposes the capabilities of RapidIO +devices directly to applications, in a manner that allows the numerous and +varied RapidIO implementations to interoperate. + +This driver (MPORT_CDEV) provides access to basic RapidIO subsystem operations +for user-space applications. Most RapidIO operations are supported through +'ioctl' system calls. + +When loaded, this device driver creates filesystem nodes named rio_mportX in the /dev +directory for each registered RapidIO mport device. 'X' in the node name matches +the unique port ID assigned to each local mport device. + +Using the available set of ioctl commands, user-space applications can perform +the following RapidIO bus and subsystem operations: + +- Reads and writes from/to configuration registers of mport devices + (RIO_MPORT_MAINT_READ_LOCAL/RIO_MPORT_MAINT_WRITE_LOCAL) +- Reads and writes from/to configuration registers of remote RapidIO devices. + These operations are defined as RapidIO Maintenance reads/writes in the RIO spec. 
+ (RIO_MPORT_MAINT_READ_REMOTE/RIO_MPORT_MAINT_WRITE_REMOTE) +- Set RapidIO Destination ID for mport devices (RIO_MPORT_MAINT_HDID_SET) +- Set RapidIO Component Tag for mport devices (RIO_MPORT_MAINT_COMPTAG_SET) +- Query logical index of mport devices (RIO_MPORT_MAINT_PORT_IDX_GET) +- Query capabilities and RapidIO link configuration of mport devices + (RIO_MPORT_GET_PROPERTIES) +- Enable/Disable reporting of RapidIO doorbell events to user-space applications + (RIO_ENABLE_DOORBELL_RANGE/RIO_DISABLE_DOORBELL_RANGE) +- Enable/Disable reporting of RIO port-write events to user-space applications + (RIO_ENABLE_PORTWRITE_RANGE/RIO_DISABLE_PORTWRITE_RANGE) +- Query/Control type of events reported through this driver: doorbells, + port-writes or both (RIO_SET_EVENT_MASK/RIO_GET_EVENT_MASK) +- Configure/Map mport's outbound requests window(s) for specific size, + RapidIO destination ID, hopcount and request type + (RIO_MAP_OUTBOUND/RIO_UNMAP_OUTBOUND) +- Configure/Map mport's inbound requests window(s) for specific size, + RapidIO base address and local memory base address + (RIO_MAP_INBOUND/RIO_UNMAP_INBOUND) +- Allocate/Free contiguous DMA coherent memory buffer for DMA data transfers + to/from remote RapidIO devices (RIO_ALLOC_DMA/RIO_FREE_DMA) +- Initiate DMA data transfers to/from remote RapidIO devices (RIO_TRANSFER). + Supports blocking, asynchronous and posted (a.k.a 'fire-and-forget') data + transfer modes. +- Check/Wait for completion of asynchronous DMA data transfer + (RIO_WAIT_FOR_ASYNC) +- Manage device objects supported by RapidIO subsystem (RIO_DEV_ADD/RIO_DEV_DEL). + This allows implementation of various RapidIO fabric enumeration algorithms + as user-space applications while using remaining functionality provided by + kernel RapidIO subsystem. + +2. 
Hardware Compatibility +========================= + +This device driver uses standard interfaces defined by the kernel RapidIO subsystem +and therefore can be used with any mport device driver registered by the RapidIO +subsystem, within the limitations set by the available mport implementation. + +At this moment the most common limitation is the availability of a RapidIO-specific +DMA engine framework for a specific mport device. Users should verify the available +functionality of their platform when planning to use this driver: + +- IDT Tsi721 PCIe-to-RapidIO bridge device and its mport device driver are fully + compatible with this driver. +- The Freescale SoC 'fsl_rio' mport driver does not implement RapidIO-specific + DMA engine support, and therefore DMA data transfers through the mport_cdev driver + are not available. + +3. Module parameters +==================== + +- 'dma_timeout' + - DMA transfer completion timeout (in msec, default value 3000). + This parameter sets a maximum completion wait time for SYNC mode DMA + transfer requests and for RIO_WAIT_FOR_ASYNC ioctl requests. + +- 'dbg_level' + - This parameter controls the amount of debug information + generated by this device driver. This parameter is formed by a set of + bit masks that correspond to the specific functional blocks. + For mask definitions see 'drivers/rapidio/devices/rio_mport_cdev.c'. + This parameter can be changed dynamically. + Use CONFIG_RAPIDIO_DEBUG=y to enable debug output at the top level. + +4. Known problems +================= + + None. + +5. User-space Applications and API +================================== + +API library and applications that use this device driver are available from +RapidIO.org. + +6. TODO List +============ + +- Add support for sending/receiving "raw" RapidIO messaging packets. +- Add memory mapped DMA data transfers as an option when RapidIO-specific DMA + is not available. 
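The Overview of mport_cdev.rst above says the driver creates one /dev node named rio_mportX per registered mport, where X is the port ID. A minimal sketch of how a user-space tool might recover those port IDs from directory entries follows. The helper names parse_mport_ids and local_mport_ids are hypothetical and are not part of any RapidIO library; only the rio_mportX naming comes from the text.

```python
import os
import re

# Node names have the form "rio_mport<X>", where X is the mport's port ID.
_MPORT_NODE = re.compile(r"rio_mport(\d+)")


def parse_mport_ids(names):
    """Return the sorted port IDs found in an iterable of file names.

    Names that do not match the rio_mportX pattern are ignored.
    """
    ids = []
    for name in names:
        match = _MPORT_NODE.fullmatch(name)
        if match:
            ids.append(int(match.group(1)))
    return sorted(ids)


def local_mport_ids(dev_dir="/dev"):
    """List the port IDs of the mport nodes present under dev_dir."""
    return parse_mport_ids(os.listdir(dev_dir))
```

Keeping the parsing separate from the directory listing lets the name handling be checked on a machine without RapidIO hardware.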
diff --git a/Documentation/driver-api/rapidio/rapidio.rst b/Documentation/driver-api/rapidio/rapidio.rst new file mode 100644 index 000000000000..fb8942d3ba85 --- /dev/null +++ b/Documentation/driver-api/rapidio/rapidio.rst @@ -0,0 +1,362 @@ +============ +Introduction +============ + +The RapidIO standard is a packet-based fabric interconnect standard designed for +use in embedded systems. Development of the RapidIO standard is directed by the +RapidIO Trade Association (RTA). The current version of the RapidIO specification +is publicly available for download from the RTA web-site [1]. + +This document describes the basics of the Linux RapidIO subsystem and provides +information on its major components. + +1 Overview +========== + +Because the RapidIO subsystem follows the Linux device model it is integrated +into the kernel similarly to other buses by defining RapidIO-specific device and +bus types and registering them within the device model. + +The Linux RapidIO subsystem is architecture independent and therefore defines +architecture-specific interfaces that provide support for common RapidIO +subsystem operations. + +2. Core Components +================== + +A typical RapidIO network is a combination of endpoints and switches. +Each of these components is represented in the subsystem by an associated data +structure. The core logical components of the RapidIO subsystem are defined +in include/linux/rio.h file. + +2.1 Master Port +--------------- + +A master port (or mport) is a RapidIO interface controller that is local to the +processor executing the Linux code. A master port generates and receives RapidIO +packets (transactions). In the RapidIO subsystem each master port is represented +by a rio_mport data structure. This structure contains master port specific +resources such as mailboxes and doorbells. The rio_mport also includes a unique +host device ID that is valid when a master port is configured as an enumerating +host. 
+ +RapidIO master ports are serviced by subsystem specific mport device drivers +that provide functionality defined for this subsystem. To provide a hardware +independent interface for RapidIO subsystem operations, rio_mport structure +includes rio_ops data structure which contains pointers to hardware specific +implementations of RapidIO functions. + +2.2 Device +---------- + +A RapidIO device is any endpoint (other than mport) or switch in the network. +All devices are presented in the RapidIO subsystem by corresponding rio_dev data +structure. Devices form one global device list and per-network device lists +(depending on number of available mports and networks). + +2.3 Switch +---------- + +A RapidIO switch is a special class of device that routes packets between its +ports towards their final destination. The packet destination port within a +switch is defined by an internal routing table. A switch is presented in the +RapidIO subsystem by rio_dev data structure expanded by additional rio_switch +data structure, which contains switch specific information such as copy of the +routing table and pointers to switch specific functions. + +The RapidIO subsystem defines the format and initialization method for subsystem +specific switch drivers that are designed to provide hardware-specific +implementation of common switch management routines. + +2.4 Network +----------- + +A RapidIO network is a combination of interconnected endpoint and switch devices. +Each RapidIO network known to the system is represented by corresponding rio_net +data structure. This structure includes lists of all devices and local master +ports that form the same network. It also contains a pointer to the default +master port that is used to communicate with devices within the network. + +2.5 Device Drivers +------------------ + +RapidIO device-specific drivers follow Linux Kernel Driver Model and are +intended to support specific RapidIO devices attached to the RapidIO network. 
+ +2.6 Subsystem Interfaces +------------------------ + +RapidIO interconnect specification defines features that may be used to provide +one or more common service layers for all participating RapidIO devices. These +common services may act separately from device-specific drivers or be used by +device-specific drivers. Example of such service provider is the RIONET driver +which implements Ethernet-over-RapidIO interface. Because only one driver can be +registered for a device, all common RapidIO services have to be registered as +subsystem interfaces. This allows to have multiple common services attached to +the same device without blocking attachment of a device-specific driver. + +3. Subsystem Initialization +=========================== + +In order to initialize the RapidIO subsystem, a platform must initialize and +register at least one master port within the RapidIO network. To register mport +within the subsystem controller driver's initialization code calls function +rio_register_mport() for each available master port. + +After all active master ports are registered with a RapidIO subsystem, +an enumeration and/or discovery routine may be called automatically or +by user-space command. + +RapidIO subsystem can be configured to be built as a statically linked or +modular component of the kernel (see details below). + +4. Enumeration and Discovery +============================ + +4.1 Overview +------------ + +RapidIO subsystem configuration options allow users to build enumeration and +discovery methods as statically linked components or loadable modules. +An enumeration/discovery method implementation and available input parameters +define how any given method can be attached to available RapidIO mports: +simply to all available mports OR individually to the specified mport device. 
+ +Depending on selected enumeration/discovery build configuration, there are +several methods to initiate an enumeration and/or discovery process: + + (a) Statically linked enumeration and discovery process can be started + automatically during kernel initialization time using corresponding module + parameters. This was the original method used since introduction of RapidIO + subsystem. Now this method relies on enumerator module parameter which is + 'rio-scan.scan' for existing basic enumeration/discovery method. + When automatic start of enumeration/discovery is used a user has to ensure + that all discovering endpoints are started before the enumerating endpoint + and are waiting for enumeration to be completed. + Configuration option CONFIG_RAPIDIO_DISC_TIMEOUT defines time that discovering + endpoint waits for enumeration to be completed. If the specified timeout + expires the discovery process is terminated without obtaining RapidIO network + information. NOTE: a timed out discovery process may be restarted later using + a user-space command as it is described below (if the given endpoint was + enumerated successfully). + + (b) Statically linked enumeration and discovery process can be started by + a command from user space. This initiation method provides more flexibility + for a system startup compared to the option (a) above. After all participating + endpoints have been successfully booted, an enumeration process shall be + started first by issuing a user-space command, after an enumeration is + completed a discovery process can be started on all remaining endpoints. + + (c) Modular enumeration and discovery process can be started by a command from + user space. After an enumeration/discovery module is loaded, a network scan + process can be started by issuing a user-space command. + Similar to the option (b) above, an enumerator has to be started first. 
+ + (d) Modular enumeration and discovery process can be started by a module + initialization routine. In this case an enumerating module shall be loaded + first. + +When a network scan process is started it calls an enumeration or discovery +routine depending on the configured role of a master port: host or agent. + +Enumeration is performed by a master port if it is configured as a host port by +assigning a host destination ID greater than or equal to zero. The host +destination ID can be assigned to a master port using various methods depending +on RapidIO subsystem build configuration: + + (a) For a statically linked RapidIO subsystem core use command line parameter + "rapidio.hdid=" with a list of destination ID assignments in order of mport + device registration. For example, in a system with two RapidIO controllers + the command line parameter "rapidio.hdid=-1,7" will result in assignment of + the host destination ID=7 to the second RapidIO controller, while the first + one will be assigned destination ID=-1. + + (b) If the RapidIO subsystem core is built as a loadable module, in addition + to the method shown above, the host destination ID(s) can be specified using + traditional methods of passing module parameter "hdid=" during its loading: + + - from command line: "modprobe rapidio hdid=-1,7", or + - from modprobe configuration file using configuration command "options", + like in this example: "options rapidio hdid=-1,7". An example of modprobe + configuration file is provided in the section below. + +NOTES: + (i) if "hdid=" parameter is omitted all available mport will be assigned + destination ID = -1; + + (ii) the "hdid=" parameter in systems with multiple mports can have + destination ID assignments omitted from the end of list (default = -1). + +If the host device ID for a specific master port is set to -1, the discovery +process will be performed for it. 
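The host/agent selection rules above can be modelled in a few lines. This is an illustrative sketch only, not kernel code: the function name assign_mport_roles and the returned tuple layout are assumptions made for this example. It encodes what the text states: IDs are assigned in order of mport registration, missing entries default to -1, and an mport enumerates when its host destination ID is greater than or equal to zero and discovers otherwise.

```python
def assign_mport_roles(hdid_param, num_mports):
    """Map an "hdid=" value such as "-1,7" onto mports in registration order.

    Returns a list of (mport_index, host_destination_id, role) tuples,
    where role is "enumeration" for IDs >= 0 and "discovery" for -1.
    """
    # An omitted or empty parameter means no explicit assignments at all.
    if hdid_param:
        ids = [int(tok) for tok in hdid_param.split(",") if tok.strip()]
    else:
        ids = []

    roles = []
    for index in range(num_mports):
        # Entries omitted from the end of the list default to -1.
        hdid = ids[index] if index < len(ids) else -1
        role = "enumeration" if hdid >= 0 else "discovery"
        roles.append((index, hdid, role))
    return roles
```

For the two-controller example in the text, assign_mport_roles("-1,7", 2) marks the first mport for discovery and the second, with host destination ID 7, for enumeration.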
+ +The enumeration and discovery routines use RapidIO maintenance transactions +to access the configuration space of devices. + +NOTE: If RapidIO switch-specific device drivers are built as loadable modules +they must be loaded before the enumeration/discovery process starts. +This requirement is caused by the fact that enumeration/discovery methods invoke +vendor-specific callbacks at early stages. + +4.2 Automatic Start of Enumeration and Discovery +------------------------------------------------ + +The automatic enumeration/discovery start method is applicable only to the built-in +enumeration/discovery RapidIO configuration. To enable automatic +enumeration/discovery start by the existing basic enumerator method, use the boot +command line parameter "rio-scan.scan=1". + +This configuration requires synchronized start of all RapidIO endpoints that +form a network which will be enumerated/discovered. Discovering endpoints have +to be started before an enumeration starts to ensure that all RapidIO +controllers have been initialized and are ready to be discovered. Configuration +parameter CONFIG_RAPIDIO_DISC_TIMEOUT defines the time (in seconds) that +a discovering endpoint will wait for enumeration to be completed. + +When automatic enumeration/discovery start is selected, the basic method's +initialization routine calls rio_init_mports() to perform enumeration or +discovery for all known mport devices. + +Depending on RapidIO network size and configuration, this automatic +enumeration/discovery start method may be difficult to use due to the +requirement for synchronized start of all endpoints. + +4.3 User-space Start of Enumeration and Discovery +------------------------------------------------- + +User-space start of enumeration and discovery can be used with built-in and +modular build configurations. For user-space controlled start, the RapidIO subsystem +creates the sysfs write-only attribute file '/sys/bus/rapidio/scan'. 
To initiate +an enumeration or discovery process on a specific mport device, a user needs to +write mport_ID (not RapidIO destination ID) into that file. The mport_ID is a +sequential number (0 ... RIO_MAX_MPORTS) assigned during mport device +registration. For example, for a machine with a single RapidIO controller, the mport_ID +for that controller will always be 0. + +To initiate RapidIO enumeration/discovery on all available mports a user may +write '-1' (or RIO_MPORT_ANY) into the scan attribute file. + +4.4 Basic Enumeration Method +---------------------------- + +This is the original enumeration/discovery method, which has been available since +the first release of RapidIO subsystem code. The enumeration process is +implemented according to the enumeration algorithm outlined in the RapidIO +Interconnect Specification: Annex I [1]. + +This method can be configured as a statically linked or loadable module. +The method's single parameter "scan" triggers the enumeration/discovery +process from the module initialization routine. + +This enumeration/discovery method can be started only once and does not support +unloading if it is built as a module. + +The enumeration process traverses the network using a recursive depth-first +algorithm. When a new device is found, the enumerator takes ownership of that +device by writing into the Host Device ID Lock CSR. It does this to ensure that +the enumerator has exclusive right to enumerate the device. If device ownership +is successfully acquired, the enumerator allocates a new rio_dev structure and +initializes it according to device capabilities. + +If the device is an endpoint, a unique device ID is assigned to it and its value +is written into the device's Base Device ID CSR. + +If the device is a switch, the enumerator allocates an additional rio_switch +structure to store switch specific information. Then the switch's vendor ID and +device ID are queried against a table of known RapidIO switches. 
Each switch +table entry contains a pointer to a switch-specific initialization routine that +initializes pointers to the rest of switch specific operations, and performs +hardware initialization if necessary. A RapidIO switch does not have a unique +device ID; it relies on hopcount and routing for device ID of an attached +endpoint if access to its configuration registers is required. If a switch (or +chain of switches) does not have any endpoint (except enumerator) attached to +it, a fake device ID will be assigned to configure a route to that switch. +In the case of a chain of switches without endpoint, one fake device ID is used +to configure a route through the entire chain and switches are differentiated by +their hopcount value. + +For both endpoints and switches the enumerator writes a unique component tag +into device's Component Tag CSR. That unique value is used by the error +management notification mechanism to identify a device that is reporting an +error management event. + +Enumeration beyond a switch is completed by iterating over each active egress +port of that switch. For each active link, a route to a default device ID +(0xFF for 8-bit systems and 0xFFFF for 16-bit systems) is temporarily written +into the routing table. The algorithm recurs by calling itself with hopcount + 1 +and the default device ID in order to access the device on the active port. + +After the host has completed enumeration of the entire network it releases +devices by clearing device ID locks (calls rio_clear_locks()). For each endpoint +in the system, it sets the Discovered bit in the Port General Control CSR +to indicate that enumeration is completed and agents are allowed to execute +passive discovery of the network. + +The discovery process is performed by agents and is similar to the enumeration +process that is described above. 
However, the discovery process is performed +without changes to the existing routing because agents only gather information +about RapidIO network structure and are building an internal map of discovered +devices. This way each Linux-based component of the RapidIO subsystem has +a complete view of the network. The discovery process can be performed +simultaneously by several agents. After initializing its RapidIO master port +each agent waits for enumeration completion by the host for the configured wait +time period. If this wait time period expires before enumeration is completed, +an agent skips RapidIO discovery and continues with remaining kernel +initialization. + +4.5 Adding New Enumeration/Discovery Method +------------------------------------------- + +RapidIO subsystem code organization allows addition of new enumeration/discovery +methods as new configuration options without significant impact to the core +RapidIO code. + +A new enumeration/discovery method has to be attached to one or more mport +devices before an enumeration/discovery process can be started. Normally, +method's module initialization routine calls rio_register_scan() to attach +an enumerator to a specified mport device (or devices). The basic enumerator +implementation demonstrates this process. + +4.6 Using Loadable RapidIO Switch Drivers +----------------------------------------- + +In the case when RapidIO switch drivers are built as loadable modules a user +must ensure that they are loaded before the enumeration/discovery starts. +This process can be automated by specifying pre- or post- dependencies in the +RapidIO-specific modprobe configuration file as shown in the example below. 
+ +File /etc/modprobe.d/rapidio.conf:: + + # Configure RapidIO subsystem modules + + # Set enumerator host destination ID (overrides kernel command line option) + options rapidio hdid=-1,2 + + # Load RapidIO switch drivers immediately after rapidio core module was loaded + softdep rapidio post: idt_gen2 idtcps tsi57x + + # OR : + + # Load RapidIO switch drivers just before rio-scan enumerator module is loaded + softdep rio-scan pre: idt_gen2 idtcps tsi57x + + -------------------------- + +NOTE: + In the example above, one of "softdep" commands must be removed or + commented out to keep required module loading sequence. + +5. References +============= + +[1] RapidIO Trade Association. RapidIO Interconnect Specifications. + http://www.rapidio.org. + +[2] Rapidio TA. Technology Comparisons. + http://www.rapidio.org/education/technology_comparisons/ + +[3] RapidIO support for Linux. + http://lwn.net/Articles/139118/ + +[4] Matt Porter. RapidIO for Linux. Ottawa Linux Symposium, 2005 + http://www.kernel.org/doc/ols/2005/ols2005v2-pages-43-56.pdf diff --git a/Documentation/driver-api/rapidio/rio_cm.rst b/Documentation/driver-api/rapidio/rio_cm.rst new file mode 100644 index 000000000000..5294430a7a74 --- /dev/null +++ b/Documentation/driver-api/rapidio/rio_cm.rst @@ -0,0 +1,135 @@ +========================================================================== +RapidIO subsystem Channelized Messaging character device driver (rio_cm.c) +========================================================================== + + +1. Overview +=========== + +This device driver is the result of collaboration within the RapidIO.org +Software Task Group (STG) between Texas Instruments, Prodrive Technologies, +Nokia Networks, BAE and IDT. Additional input was received from other members +of RapidIO.org. 
+ +The objective was to create a character mode driver interface which exposes +messaging capabilities of RapidIO endpoint devices (mports) directly +to applications, in a manner that allows the numerous and varied RapidIO +implementations to interoperate. + +This driver (RIO_CM) provides to user-space applications shared access to +RapidIO mailbox messaging resources. + +RapidIO specification (Part 2) defines that endpoint devices may have up to four +messaging mailboxes in case of multi-packet message (up to 4KB) and +up to 64 mailboxes if single-packet messages (up to 256 B) are used. In addition +to protocol definition limitations, a particular hardware implementation can +have reduced number of messaging mailboxes. RapidIO aware applications must +therefore share the messaging resources of a RapidIO endpoint. + +Main purpose of this device driver is to provide RapidIO mailbox messaging +capability to large number of user-space processes by introducing socket-like +operations using a single messaging mailbox. This allows applications to +use the limited RapidIO messaging hardware resources efficiently. + +Most of device driver's operations are supported through 'ioctl' system calls. + +When loaded this device driver creates a single file system node named rio_cm +in /dev directory common for all registered RapidIO mport devices. + +Following ioctl commands are available to user-space applications: + +- RIO_CM_MPORT_GET_LIST: + Returns to caller list of local mport devices that + support messaging operations (number of entries up to RIO_MAX_MPORTS). + Each list entry is combination of mport's index in the system and RapidIO + destination ID assigned to the port. +- RIO_CM_EP_GET_LIST_SIZE: + Returns number of messaging capable remote endpoints + in a RapidIO network associated with the specified mport device. 
+- RIO_CM_EP_GET_LIST: + Returns list of RapidIO destination IDs for messaging + capable remote endpoints (peers) available in a RapidIO network associated + with the specified mport device. +- RIO_CM_CHAN_CREATE: + Creates RapidIO message exchange channel data structure + with channel ID assigned automatically or as requested by a caller. +- RIO_CM_CHAN_BIND: + Binds the specified channel data structure to the specified + mport device. +- RIO_CM_CHAN_LISTEN: + Enables listening for connection requests on the specified + channel. +- RIO_CM_CHAN_ACCEPT: + Accepts a connection request from peer on the specified + channel. If wait timeout for this request is specified by a caller it is + a blocking call. If timeout is set to 0 this is a non-blocking call - ioctl + handler checks for a pending connection request and if one is not available + exits with -EAGAIN error status immediately. +- RIO_CM_CHAN_CONNECT: + Sends a connection request to a remote peer/channel. +- RIO_CM_CHAN_SEND: + Sends a data message through the specified channel. + The handler for this request assumes that message buffer specified by + a caller includes the reserved space for a packet header required by + this driver. +- RIO_CM_CHAN_RECEIVE: + Receives a data message through a connected channel. + If the channel does not have an incoming message ready to return this ioctl + handler will wait for new message until timeout specified by a caller + expires. If timeout value is set to 0, ioctl handler uses a default value + defined by MAX_SCHEDULE_TIMEOUT. +- RIO_CM_CHAN_CLOSE: + Closes a specified channel and frees associated buffers. + If the specified channel is in the CONNECTED state, sends close notification + to the remote peer. + +The ioctl command codes and corresponding data structures intended for use by +user-space applications are defined in 'include/uapi/linux/rio_cm_cdev.h'. + +2.
Hardware Compatibility +========================= + +This device driver uses standard interfaces defined by kernel RapidIO subsystem +and therefore it can be used with any mport device driver registered by RapidIO +subsystem with limitations set by available mport HW implementation of messaging +mailboxes. + +3. Module parameters +==================== + +- 'dbg_level' + - This parameter allows to control amount of debug information + generated by this device driver. This parameter is formed by set of + bit masks that correspond to the specific functional block. + For mask definitions see 'drivers/rapidio/devices/rio_cm.c' + This parameter can be changed dynamically. + Use CONFIG_RAPIDIO_DEBUG=y to enable debug output at the top level. + +- 'cmbox' + - Number of RapidIO mailbox to use (default value is 1). + This parameter allows to set messaging mailbox number that will be used + within entire RapidIO network. It can be used when default mailbox is + used by other device drivers or is not supported by some nodes in the + RapidIO network. + +- 'chstart' + - Start channel number for dynamic assignment. Default value - 256. + Allows to exclude channel numbers below this parameter from dynamic + allocation to avoid conflicts with software components that use + reserved predefined channel numbers. + +4. Known problems +================= + + None. + +5. User-space Applications and API Library +========================================== + +Messaging API library and applications that use this device driver are available +from RapidIO.org. + +6. TODO List +============ + +- Add support for system notification messages (reserved channel 0). 
diff --git a/Documentation/driver-api/rapidio/sysfs.rst b/Documentation/driver-api/rapidio/sysfs.rst new file mode 100644 index 000000000000..540f72683496 --- /dev/null +++ b/Documentation/driver-api/rapidio/sysfs.rst @@ -0,0 +1,7 @@ +============= +Sysfs entries +============= + +The RapidIO sysfs files have moved to: +Documentation/ABI/testing/sysfs-bus-rapidio and +Documentation/ABI/testing/sysfs-class-rapidio diff --git a/Documentation/driver-api/rapidio/tsi721.rst b/Documentation/driver-api/rapidio/tsi721.rst new file mode 100644 index 000000000000..42aea438cd20 --- /dev/null +++ b/Documentation/driver-api/rapidio/tsi721.rst @@ -0,0 +1,112 @@ +========================================================================= +RapidIO subsystem mport driver for IDT Tsi721 PCI Express-to-SRIO bridge. +========================================================================= + +1. Overview +=========== + +This driver implements all currently defined RapidIO mport callback functions. +It supports maintenance read and write operations, inbound and outbound RapidIO +doorbells, inbound maintenance port-writes and RapidIO messaging. + +To generate SRIO maintenance transactions this driver uses one of Tsi721 DMA +channels. This mechanism provides access to larger range of hop counts and +destination IDs without need for changes in outbound window translation. + +RapidIO messaging support uses dedicated messaging channels for each mailbox. +For inbound messages this driver uses destination ID matching to forward messages +into the corresponding message queue. Messaging callbacks are implemented to be +fully compatible with RIONET driver (Ethernet over RapidIO messaging services). + +1. Module parameters: + +- 'dbg_level' + - This parameter allows to control amount of debug information + generated by this device driver. This parameter is formed by set of + bit masks that correspond to the specific + functional block.
+ For mask definitions see 'drivers/rapidio/devices/tsi721.h' + This parameter can be changed dynamically. + Use CONFIG_RAPIDIO_DEBUG=y to enable debug output at the top level. + +- 'dma_desc_per_channel' + - This parameter defines number of hardware buffer + descriptors allocated for each registered Tsi721 DMA channel. + Its default value is 128. + +- 'dma_txqueue_sz' + - DMA transactions queue size. Defines number of pending + transaction requests that can be accepted by each DMA channel. + Default value is 16. + +- 'dma_sel' + - DMA channel selection mask. Bitmask that defines which hardware + DMA channels (0 ... 6) will be registered with DmaEngine core. + If bit is set to 1, the corresponding DMA channel will be registered. + DMA channels not selected by this mask will not be used by this device + driver. Default value is 0x7f (use all channels). + +- 'pcie_mrrs' + - override value for PCIe Maximum Read Request Size (MRRS). + This parameter gives an ability to override MRRS value set during PCIe + configuration process. Tsi721 supports read request sizes up to 4096B. + Value for this parameter must be set as defined by PCIe specification: + 0 = 128B, 1 = 256B, 2 = 512B, 3 = 1024B, 4 = 2048B and 5 = 4096B. + Default value is '-1' (= keep platform setting). + +- 'mbox_sel' + - RIO messaging MBOX selection mask. This is a bitmask that defines + which messaging MBOXes are managed by this device driver. Mask bits 0 - 3 + correspond to MBOX0 - MBOX3. MBOX is under driver's control if the + corresponding bit is set to '1'. Default value is 0x0f (= all). + +2. Known problems +================= + + None. + +3. DMA Engine Support +===================== + +Tsi721 mport driver supports DMA data transfers between local system memory and +remote RapidIO devices. This functionality is implemented according to SLAVE +mode API defined by common Linux kernel DMA Engine framework.
+ +Depending on system requirements RapidIO DMA operations can be included/excluded +by setting CONFIG_RAPIDIO_DMA_ENGINE option. Tsi721 miniport driver uses seven +out of eight available BDMA channels to support DMA data transfers. +One BDMA channel is reserved for generation of maintenance read/write requests. + +If Tsi721 mport driver has been built with RAPIDIO_DMA_ENGINE support included, +this driver will accept DMA-specific module parameter: + + "dma_desc_per_channel" + - defines number of hardware buffer descriptors used by + each BDMA channel of Tsi721 (by default - 128). + +4. Version History + + ===== ==================================================================== + 1.1.0 DMA operations re-worked to support data scatter/gather lists larger + than hardware buffer descriptors ring. + 1.0.0 Initial driver release. + ===== ==================================================================== + +5. License +=========== + + Copyright(c) 2011 Integrated Device Technology, Inc. All rights reserved. + + This program is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by the Free + Software Foundation; either version 2 of the License, or (at your option) + any later version. + + This program is distributed in the hope that it will be useful, but WITHOUT + ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + You should have received a copy of the GNU General Public License along with + this program; if not, write to the Free Software Foundation, Inc., + 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
diff --git a/Documentation/rapidio/index.rst b/Documentation/rapidio/index.rst deleted file mode 100644 index ab7b5541b346..000000000000 --- a/Documentation/rapidio/index.rst +++ /dev/null @@ -1,15 +0,0 @@ -:orphan: - -=========================== -The Linux RapidIO Subsystem -=========================== - -.. toctree:: - :maxdepth: 1 - - rapidio - sysfs - - tsi721 - mport_cdev - rio_cm diff --git a/Documentation/rapidio/mport_cdev.rst b/Documentation/rapidio/mport_cdev.rst deleted file mode 100644 index df77a7f7be7d..000000000000 --- a/Documentation/rapidio/mport_cdev.rst +++ /dev/null @@ -1,110 +0,0 @@ -================================================================== -RapidIO subsystem mport character device driver (rio_mport_cdev.c) -================================================================== - -1. Overview -=========== - -This device driver is the result of collaboration within the RapidIO.org -Software Task Group (STG) between Texas Instruments, Freescale, -Prodrive Technologies, Nokia Networks, BAE and IDT. Additional input was -received from other members of RapidIO.org. The objective was to create a -character mode driver interface which exposes the capabilities of RapidIO -devices directly to applications, in a manner that allows the numerous and -varied RapidIO implementations to interoperate. - -This driver (MPORT_CDEV) provides access to basic RapidIO subsystem operations -for user-space applications. Most of RapidIO operations are supported through -'ioctl' system calls. - -When loaded this device driver creates filesystem nodes named rio_mportX in /dev -directory for each registered RapidIO mport device. 'X' in the node name matches -to unique port ID assigned to each local mport device. 
- -Using available set of ioctl commands user-space applications can perform -following RapidIO bus and subsystem operations: - -- Reads and writes from/to configuration registers of mport devices - (RIO_MPORT_MAINT_READ_LOCAL/RIO_MPORT_MAINT_WRITE_LOCAL) -- Reads and writes from/to configuration registers of remote RapidIO devices. - This operations are defined as RapidIO Maintenance reads/writes in RIO spec. - (RIO_MPORT_MAINT_READ_REMOTE/RIO_MPORT_MAINT_WRITE_REMOTE) -- Set RapidIO Destination ID for mport devices (RIO_MPORT_MAINT_HDID_SET) -- Set RapidIO Component Tag for mport devices (RIO_MPORT_MAINT_COMPTAG_SET) -- Query logical index of mport devices (RIO_MPORT_MAINT_PORT_IDX_GET) -- Query capabilities and RapidIO link configuration of mport devices - (RIO_MPORT_GET_PROPERTIES) -- Enable/Disable reporting of RapidIO doorbell events to user-space applications - (RIO_ENABLE_DOORBELL_RANGE/RIO_DISABLE_DOORBELL_RANGE) -- Enable/Disable reporting of RIO port-write events to user-space applications - (RIO_ENABLE_PORTWRITE_RANGE/RIO_DISABLE_PORTWRITE_RANGE) -- Query/Control type of events reported through this driver: doorbells, - port-writes or both (RIO_SET_EVENT_MASK/RIO_GET_EVENT_MASK) -- Configure/Map mport's outbound requests window(s) for specific size, - RapidIO destination ID, hopcount and request type - (RIO_MAP_OUTBOUND/RIO_UNMAP_OUTBOUND) -- Configure/Map mport's inbound requests window(s) for specific size, - RapidIO base address and local memory base address - (RIO_MAP_INBOUND/RIO_UNMAP_INBOUND) -- Allocate/Free contiguous DMA coherent memory buffer for DMA data transfers - to/from remote RapidIO devices (RIO_ALLOC_DMA/RIO_FREE_DMA) -- Initiate DMA data transfers to/from remote RapidIO devices (RIO_TRANSFER). - Supports blocking, asynchronous and posted (a.k.a 'fire-and-forget') data - transfer modes. 
-- Check/Wait for completion of asynchronous DMA data transfer - (RIO_WAIT_FOR_ASYNC) -- Manage device objects supported by RapidIO subsystem (RIO_DEV_ADD/RIO_DEV_DEL). - This allows implementation of various RapidIO fabric enumeration algorithms - as user-space applications while using remaining functionality provided by - kernel RapidIO subsystem. - -2. Hardware Compatibility -========================= - -This device driver uses standard interfaces defined by kernel RapidIO subsystem -and therefore it can be used with any mport device driver registered by RapidIO -subsystem with limitations set by available mport implementation. - -At this moment the most common limitation is availability of RapidIO-specific -DMA engine framework for specific mport device. Users should verify available -functionality of their platform when planning to use this driver: - -- IDT Tsi721 PCIe-to-RapidIO bridge device and its mport device driver are fully - compatible with this driver. -- Freescale SoCs 'fsl_rio' mport driver does not have implementation for RapidIO - specific DMA engine support and therefore DMA data transfers mport_cdev driver - are not available. - -3. Module parameters -==================== - -- 'dma_timeout' - - DMA transfer completion timeout (in msec, default value 3000). - This parameter set a maximum completion wait time for SYNC mode DMA - transfer requests and for RIO_WAIT_FOR_ASYNC ioctl requests. - -- 'dbg_level' - - This parameter allows to control amount of debug information - generated by this device driver. This parameter is formed by set of - bit masks that correspond to the specific functional blocks. - For mask definitions see 'drivers/rapidio/devices/rio_mport_cdev.c' - This parameter can be changed dynamically. - Use CONFIG_RAPIDIO_DEBUG=y to enable debug output at the top level. - -4. Known problems -================= - - None. - -5. 
User-space Applications and API -================================== - -API library and applications that use this device driver are available from -RapidIO.org. - -6. TODO List -============ - -- Add support for sending/receiving "raw" RapidIO messaging packets. -- Add memory mapped DMA data transfers as an option when RapidIO-specific DMA - is not available. diff --git a/Documentation/rapidio/rapidio.rst b/Documentation/rapidio/rapidio.rst deleted file mode 100644 index fb8942d3ba85..000000000000 --- a/Documentation/rapidio/rapidio.rst +++ /dev/null @@ -1,362 +0,0 @@ -============ -Introduction -============ - -The RapidIO standard is a packet-based fabric interconnect standard designed for -use in embedded systems. Development of the RapidIO standard is directed by the -RapidIO Trade Association (RTA). The current version of the RapidIO specification -is publicly available for download from the RTA web-site [1]. - -This document describes the basics of the Linux RapidIO subsystem and provides -information on its major components. - -1 Overview -========== - -Because the RapidIO subsystem follows the Linux device model it is integrated -into the kernel similarly to other buses by defining RapidIO-specific device and -bus types and registering them within the device model. - -The Linux RapidIO subsystem is architecture independent and therefore defines -architecture-specific interfaces that provide support for common RapidIO -subsystem operations. - -2. Core Components -================== - -A typical RapidIO network is a combination of endpoints and switches. -Each of these components is represented in the subsystem by an associated data -structure. The core logical components of the RapidIO subsystem are defined -in include/linux/rio.h file. - -2.1 Master Port ---------------- - -A master port (or mport) is a RapidIO interface controller that is local to the -processor executing the Linux code. A master port generates and receives RapidIO -packets (transactions). 
In the RapidIO subsystem each master port is represented -by a rio_mport data structure. This structure contains master port specific -resources such as mailboxes and doorbells. The rio_mport also includes a unique -host device ID that is valid when a master port is configured as an enumerating -host. - -RapidIO master ports are serviced by subsystem specific mport device drivers -that provide functionality defined for this subsystem. To provide a hardware -independent interface for RapidIO subsystem operations, rio_mport structure -includes rio_ops data structure which contains pointers to hardware specific -implementations of RapidIO functions. - -2.2 Device ----------- - -A RapidIO device is any endpoint (other than mport) or switch in the network. -All devices are presented in the RapidIO subsystem by corresponding rio_dev data -structure. Devices form one global device list and per-network device lists -(depending on number of available mports and networks). - -2.3 Switch ----------- - -A RapidIO switch is a special class of device that routes packets between its -ports towards their final destination. The packet destination port within a -switch is defined by an internal routing table. A switch is presented in the -RapidIO subsystem by rio_dev data structure expanded by additional rio_switch -data structure, which contains switch specific information such as copy of the -routing table and pointers to switch specific functions. - -The RapidIO subsystem defines the format and initialization method for subsystem -specific switch drivers that are designed to provide hardware-specific -implementation of common switch management routines. - -2.4 Network ------------ - -A RapidIO network is a combination of interconnected endpoint and switch devices. -Each RapidIO network known to the system is represented by corresponding rio_net -data structure. This structure includes lists of all devices and local master -ports that form the same network. 
It also contains a pointer to the default -master port that is used to communicate with devices within the network. - -2.5 Device Drivers ------------------- - -RapidIO device-specific drivers follow Linux Kernel Driver Model and are -intended to support specific RapidIO devices attached to the RapidIO network. - -2.6 Subsystem Interfaces ------------------------- - -RapidIO interconnect specification defines features that may be used to provide -one or more common service layers for all participating RapidIO devices. These -common services may act separately from device-specific drivers or be used by -device-specific drivers. Example of such service provider is the RIONET driver -which implements Ethernet-over-RapidIO interface. Because only one driver can be -registered for a device, all common RapidIO services have to be registered as -subsystem interfaces. This allows to have multiple common services attached to -the same device without blocking attachment of a device-specific driver. - -3. Subsystem Initialization -=========================== - -In order to initialize the RapidIO subsystem, a platform must initialize and -register at least one master port within the RapidIO network. To register mport -within the subsystem controller driver's initialization code calls function -rio_register_mport() for each available master port. - -After all active master ports are registered with a RapidIO subsystem, -an enumeration and/or discovery routine may be called automatically or -by user-space command. - -RapidIO subsystem can be configured to be built as a statically linked or -modular component of the kernel (see details below). - -4. Enumeration and Discovery -============================ - -4.1 Overview ------------- - -RapidIO subsystem configuration options allow users to build enumeration and -discovery methods as statically linked components or loadable modules. 
-An enumeration/discovery method implementation and available input parameters -define how any given method can be attached to available RapidIO mports: -simply to all available mports OR individually to the specified mport device. - -Depending on selected enumeration/discovery build configuration, there are -several methods to initiate an enumeration and/or discovery process: - - (a) Statically linked enumeration and discovery process can be started - automatically during kernel initialization time using corresponding module - parameters. This was the original method used since introduction of RapidIO - subsystem. Now this method relies on enumerator module parameter which is - 'rio-scan.scan' for existing basic enumeration/discovery method. - When automatic start of enumeration/discovery is used a user has to ensure - that all discovering endpoints are started before the enumerating endpoint - and are waiting for enumeration to be completed. - Configuration option CONFIG_RAPIDIO_DISC_TIMEOUT defines time that discovering - endpoint waits for enumeration to be completed. If the specified timeout - expires the discovery process is terminated without obtaining RapidIO network - information. NOTE: a timed out discovery process may be restarted later using - a user-space command as it is described below (if the given endpoint was - enumerated successfully). - - (b) Statically linked enumeration and discovery process can be started by - a command from user space. This initiation method provides more flexibility - for a system startup compared to the option (a) above. After all participating - endpoints have been successfully booted, an enumeration process shall be - started first by issuing a user-space command, after an enumeration is - completed a discovery process can be started on all remaining endpoints. - - (c) Modular enumeration and discovery process can be started by a command from - user space. 
After an enumeration/discovery module is loaded, a network scan - process can be started by issuing a user-space command. - Similar to the option (b) above, an enumerator has to be started first. - - (d) Modular enumeration and discovery process can be started by a module - initialization routine. In this case an enumerating module shall be loaded - first. - -When a network scan process is started it calls an enumeration or discovery -routine depending on the configured role of a master port: host or agent. - -Enumeration is performed by a master port if it is configured as a host port by -assigning a host destination ID greater than or equal to zero. The host -destination ID can be assigned to a master port using various methods depending -on RapidIO subsystem build configuration: - - (a) For a statically linked RapidIO subsystem core use command line parameter - "rapidio.hdid=" with a list of destination ID assignments in order of mport - device registration. For example, in a system with two RapidIO controllers - the command line parameter "rapidio.hdid=-1,7" will result in assignment of - the host destination ID=7 to the second RapidIO controller, while the first - one will be assigned destination ID=-1. - - (b) If the RapidIO subsystem core is built as a loadable module, in addition - to the method shown above, the host destination ID(s) can be specified using - traditional methods of passing module parameter "hdid=" during its loading: - - - from command line: "modprobe rapidio hdid=-1,7", or - - from modprobe configuration file using configuration command "options", - like in this example: "options rapidio hdid=-1,7". An example of modprobe - configuration file is provided in the section below. - -NOTES: - (i) if "hdid=" parameter is omitted all available mport will be assigned - destination ID = -1; - - (ii) the "hdid=" parameter in systems with multiple mports can have - destination ID assignments omitted from the end of list (default = -1). 
- -If the host device ID for a specific master port is set to -1, the discovery -process will be performed for it. - -The enumeration and discovery routines use RapidIO maintenance transactions -to access the configuration space of devices. - -NOTE: If RapidIO switch-specific device drivers are built as loadable modules -they must be loaded before enumeration/discovery process starts. -This requirement is cased by the fact that enumeration/discovery methods invoke -vendor-specific callbacks on early stages. - -4.2 Automatic Start of Enumeration and Discovery ------------------------------------------------- - -Automatic enumeration/discovery start method is applicable only to built-in -enumeration/discovery RapidIO configuration selection. To enable automatic -enumeration/discovery start by existing basic enumerator method set use boot -command line parameter "rio-scan.scan=1". - -This configuration requires synchronized start of all RapidIO endpoints that -form a network which will be enumerated/discovered. Discovering endpoints have -to be started before an enumeration starts to ensure that all RapidIO -controllers have been initialized and are ready to be discovered. Configuration -parameter CONFIG_RAPIDIO_DISC_TIMEOUT defines time (in seconds) which -a discovering endpoint will wait for enumeration to be completed. - -When automatic enumeration/discovery start is selected, basic method's -initialization routine calls rio_init_mports() to perform enumeration or -discovery for all known mport devices. - -Depending on RapidIO network size and configuration this automatic -enumeration/discovery start method may be difficult to use due to the -requirement for synchronized start of all endpoints. - -4.3 User-space Start of Enumeration and Discovery -------------------------------------------------- - -User-space start of enumeration and discovery can be used with built-in and -modular build configurations. 
For user-space controlled start RapidIO subsystem -creates the sysfs write-only attribute file '/sys/bus/rapidio/scan'. To initiate -an enumeration or discovery process on specific mport device, a user needs to -write mport_ID (not RapidIO destination ID) into that file. The mport_ID is a -sequential number (0 ... RIO_MAX_MPORTS) assigned during mport device -registration. For example for machine with single RapidIO controller, mport_ID -for that controller always will be 0. - -To initiate RapidIO enumeration/discovery on all available mports a user may -write '-1' (or RIO_MPORT_ANY) into the scan attribute file. - -4.4 Basic Enumeration Method ----------------------------- - -This is an original enumeration/discovery method which is available since -first release of RapidIO subsystem code. The enumeration process is -implemented according to the enumeration algorithm outlined in the RapidIO -Interconnect Specification: Annex I [1]. - -This method can be configured as statically linked or loadable module. -The method's single parameter "scan" allows to trigger the enumeration/discovery -process from module initialization routine. - -This enumeration/discovery method can be started only once and does not support -unloading if it is built as a module. - -The enumeration process traverses the network using a recursive depth-first -algorithm. When a new device is found, the enumerator takes ownership of that -device by writing into the Host Device ID Lock CSR. It does this to ensure that -the enumerator has exclusive right to enumerate the device. If device ownership -is successfully acquired, the enumerator allocates a new rio_dev structure and -initializes it according to device capabilities. - -If the device is an endpoint, a unique device ID is assigned to it and its value -is written into the device's Base Device ID CSR. - -If the device is a switch, the enumerator allocates an additional rio_switch -structure to store switch specific information. 
Then the switch's vendor ID and -device ID are queried against a table of known RapidIO switches. Each switch -table entry contains a pointer to a switch-specific initialization routine that -initializes pointers to the rest of switch specific operations, and performs -hardware initialization if necessary. A RapidIO switch does not have a unique -device ID; it relies on hopcount and routing for device ID of an attached -endpoint if access to its configuration registers is required. If a switch (or -chain of switches) does not have any endpoint (except enumerator) attached to -it, a fake device ID will be assigned to configure a route to that switch. -In the case of a chain of switches without endpoint, one fake device ID is used -to configure a route through the entire chain and switches are differentiated by -their hopcount value. - -For both endpoints and switches the enumerator writes a unique component tag -into device's Component Tag CSR. That unique value is used by the error -management notification mechanism to identify a device that is reporting an -error management event. - -Enumeration beyond a switch is completed by iterating over each active egress -port of that switch. For each active link, a route to a default device ID -(0xFF for 8-bit systems and 0xFFFF for 16-bit systems) is temporarily written -into the routing table. The algorithm recurs by calling itself with hopcount + 1 -and the default device ID in order to access the device on the active port. - -After the host has completed enumeration of the entire network it releases -devices by clearing device ID locks (calls rio_clear_locks()). For each endpoint -in the system, it sets the Discovered bit in the Port General Control CSR -to indicate that enumeration is completed and agents are allowed to execute -passive discovery of the network. - -The discovery process is performed by agents and is similar to the enumeration -process that is described above. 
However, the discovery process is performed -without changes to the existing routing because agents only gather information -about RapidIO network structure and are building an internal map of discovered -devices. This way each Linux-based component of the RapidIO subsystem has -a complete view of the network. The discovery process can be performed -simultaneously by several agents. After initializing its RapidIO master port -each agent waits for enumeration completion by the host for the configured wait -time period. If this wait time period expires before enumeration is completed, -an agent skips RapidIO discovery and continues with remaining kernel -initialization. - -4.5 Adding New Enumeration/Discovery Method -------------------------------------------- - -RapidIO subsystem code organization allows addition of new enumeration/discovery -methods as new configuration options without significant impact to the core -RapidIO code. - -A new enumeration/discovery method has to be attached to one or more mport -devices before an enumeration/discovery process can be started. Normally, -method's module initialization routine calls rio_register_scan() to attach -an enumerator to a specified mport device (or devices). The basic enumerator -implementation demonstrates this process. - -4.6 Using Loadable RapidIO Switch Drivers ------------------------------------------ - -In the case when RapidIO switch drivers are built as loadable modules a user -must ensure that they are loaded before the enumeration/discovery starts. -This process can be automated by specifying pre- or post- dependencies in the -RapidIO-specific modprobe configuration file as shown in the example below. 
- -File /etc/modprobe.d/rapidio.conf:: - - # Configure RapidIO subsystem modules - - # Set enumerator host destination ID (overrides kernel command line option) - options rapidio hdid=-1,2 - - # Load RapidIO switch drivers immediately after rapidio core module was loaded - softdep rapidio post: idt_gen2 idtcps tsi57x - - # OR : - - # Load RapidIO switch drivers just before rio-scan enumerator module is loaded - softdep rio-scan pre: idt_gen2 idtcps tsi57x - - -------------------------- - -NOTE: - In the example above, one of "softdep" commands must be removed or - commented out to keep required module loading sequence. - -5. References -============= - -[1] RapidIO Trade Association. RapidIO Interconnect Specifications. - http://www.rapidio.org. - -[2] Rapidio TA. Technology Comparisons. - http://www.rapidio.org/education/technology_comparisons/ - -[3] RapidIO support for Linux. - http://lwn.net/Articles/139118/ - -[4] Matt Porter. RapidIO for Linux. Ottawa Linux Symposium, 2005 - http://www.kernel.org/doc/ols/2005/ols2005v2-pages-43-56.pdf diff --git a/Documentation/rapidio/rio_cm.rst b/Documentation/rapidio/rio_cm.rst deleted file mode 100644 index 5294430a7a74..000000000000 --- a/Documentation/rapidio/rio_cm.rst +++ /dev/null @@ -1,135 +0,0 @@ -========================================================================== -RapidIO subsystem Channelized Messaging character device driver (rio_cm.c) -========================================================================== - - -1. Overview -=========== - -This device driver is the result of collaboration within the RapidIO.org -Software Task Group (STG) between Texas Instruments, Prodrive Technologies, -Nokia Networks, BAE and IDT. Additional input was received from other members -of RapidIO.org. 
- -The objective was to create a character mode driver interface which exposes -messaging capabilities of RapidIO endpoint devices (mports) directly -to applications, in a manner that allows the numerous and varied RapidIO -implementations to interoperate. - -This driver (RIO_CM) provides to user-space applications shared access to -RapidIO mailbox messaging resources. - -RapidIO specification (Part 2) defines that endpoint devices may have up to four -messaging mailboxes in case of multi-packet message (up to 4KB) and -up to 64 mailboxes if single-packet messages (up to 256 B) are used. In addition -to protocol definition limitations, a particular hardware implementation can -have reduced number of messaging mailboxes. RapidIO aware applications must -therefore share the messaging resources of a RapidIO endpoint. - -Main purpose of this device driver is to provide RapidIO mailbox messaging -capability to large number of user-space processes by introducing socket-like -operations using a single messaging mailbox. This allows applications to -use the limited RapidIO messaging hardware resources efficiently. - -Most of device driver's operations are supported through 'ioctl' system calls. - -When loaded this device driver creates a single file system node named rio_cm -in /dev directory common for all registered RapidIO mport devices. - -Following ioctl commands are available to user-space applications: - -- RIO_CM_MPORT_GET_LIST: - Returns to caller list of local mport devices that - support messaging operations (number of entries up to RIO_MAX_MPORTS). - Each list entry is combination of mport's index in the system and RapidIO - destination ID assigned to the port. -- RIO_CM_EP_GET_LIST_SIZE: - Returns number of messaging capable remote endpoints - in a RapidIO network associated with the specified mport device. 
-- RIO_CM_EP_GET_LIST: - Returns list of RapidIO destination IDs for messaging - capable remote endpoints (peers) available in a RapidIO network associated - with the specified mport device. -- RIO_CM_CHAN_CREATE: - Creates RapidIO message exchange channel data structure - with channel ID assigned automatically or as requested by a caller. -- RIO_CM_CHAN_BIND: - Binds the specified channel data structure to the specified - mport device. -- RIO_CM_CHAN_LISTEN: - Enables listening for connection requests on the specified - channel. -- RIO_CM_CHAN_ACCEPT: - Accepts a connection request from peer on the specified - channel. If wait timeout for this request is specified by a caller it is - a blocking call. If timeout is set to 0 this is a non-blocking call - ioctl - handler checks for a pending connection request and if one is not available - exits with -EAGAIN error status immediately. -- RIO_CM_CHAN_CONNECT: - Sends a connection request to a remote peer/channel. -- RIO_CM_CHAN_SEND: - Sends a data message through the specified channel. - The handler for this request assumes that message buffer specified by - a caller includes the reserved space for a packet header required by - this driver. -- RIO_CM_CHAN_RECEIVE: - Receives a data message through a connected channel. - If the channel does not have an incoming message ready to return this ioctl - handler will wait for new message until timeout specified by a caller - expires. If timeout value is set to 0, ioctl handler uses a default value - defined by MAX_SCHEDULE_TIMEOUT. -- RIO_CM_CHAN_CLOSE: - Closes a specified channel and frees associated buffers. - If the specified channel is in the CONNECTED state, sends close notification - to the remote peer. - -The ioctl command codes and corresponding data structures intended for use by -user-space applications are defined in 'include/uapi/linux/rio_cm_cdev.h'. - -2.
Hardware Compatibility -========================= - -This device driver uses standard interfaces defined by kernel RapidIO subsystem -and therefore it can be used with any mport device driver registered by RapidIO -subsystem with limitations set by available mport HW implementation of messaging -mailboxes. - -3. Module parameters -==================== - -- 'dbg_level' - - This parameter allows to control amount of debug information - generated by this device driver. This parameter is formed by set of - bit masks that correspond to the specific functional block. - For mask definitions see 'drivers/rapidio/devices/rio_cm.c' - This parameter can be changed dynamically. - Use CONFIG_RAPIDIO_DEBUG=y to enable debug output at the top level. - -- 'cmbox' - - Number of RapidIO mailbox to use (default value is 1). - This parameter allows to set messaging mailbox number that will be used - within entire RapidIO network. It can be used when default mailbox is - used by other device drivers or is not supported by some nodes in the - RapidIO network. - -- 'chstart' - - Start channel number for dynamic assignment. Default value - 256. - Allows to exclude channel numbers below this parameter from dynamic - allocation to avoid conflicts with software components that use - reserved predefined channel numbers. - -4. Known problems -================= - - None. - -5. User-space Applications and API Library -========================================== - -Messaging API library and applications that use this device driver are available -from RapidIO.org. - -6. TODO List -============ - -- Add support for system notification messages (reserved channel 0). 
diff --git a/Documentation/rapidio/sysfs.rst b/Documentation/rapidio/sysfs.rst deleted file mode 100644 index 540f72683496..000000000000 --- a/Documentation/rapidio/sysfs.rst +++ /dev/null @@ -1,7 +0,0 @@ -============= -Sysfs entries -============= - -The RapidIO sysfs files have moved to: -Documentation/ABI/testing/sysfs-bus-rapidio and -Documentation/ABI/testing/sysfs-class-rapidio diff --git a/Documentation/rapidio/tsi721.rst b/Documentation/rapidio/tsi721.rst deleted file mode 100644 index 42aea438cd20..000000000000 --- a/Documentation/rapidio/tsi721.rst +++ /dev/null @@ -1,112 +0,0 @@ -========================================================================= -RapidIO subsystem mport driver for IDT Tsi721 PCI Express-to-SRIO bridge. -========================================================================= - -1. Overview -=========== - -This driver implements all currently defined RapidIO mport callback functions. -It supports maintenance read and write operations, inbound and outbound RapidIO -doorbells, inbound maintenance port-writes and RapidIO messaging. - -To generate SRIO maintenance transactions this driver uses one of Tsi721 DMA -channels. This mechanism provides access to larger range of hop counts and -destination IDs without need for changes in outbound window translation. - -RapidIO messaging support uses dedicated messaging channels for each mailbox. -For inbound messages this driver uses destination ID matching to forward messages -into the corresponding message queue. Messaging callbacks are implemented to be -fully compatible with RIONET driver (Ethernet over RapidIO messaging services). - -1. Module parameters: - -- 'dbg_level' - - This parameter allows to control amount of debug information - generated by this device driver. This parameter is formed by set of - bit masks that correspond to the specific - functional block.
- For mask definitions see 'drivers/rapidio/devices/tsi721.h' - This parameter can be changed dynamically. - Use CONFIG_RAPIDIO_DEBUG=y to enable debug output at the top level. - -- 'dma_desc_per_channel' - - This parameter defines number of hardware buffer - descriptors allocated for each registered Tsi721 DMA channel. - Its default value is 128. - -- 'dma_txqueue_sz' - - DMA transactions queue size. Defines number of pending - transaction requests that can be accepted by each DMA channel. - Default value is 16. - -- 'dma_sel' - - DMA channel selection mask. Bitmask that defines which hardware - DMA channels (0 ... 6) will be registered with DmaEngine core. - If bit is set to 1, the corresponding DMA channel will be registered. - DMA channels not selected by this mask will not be used by this device - driver. Default value is 0x7f (use all channels). - -- 'pcie_mrrs' - - override value for PCIe Maximum Read Request Size (MRRS). - This parameter gives an ability to override MRRS value set during PCIe - configuration process. Tsi721 supports read request sizes up to 4096B. - Value for this parameter must be set as defined by PCIe specification: - 0 = 128B, 1 = 256B, 2 = 512B, 3 = 1024B, 4 = 2048B and 5 = 4096B. - Default value is '-1' (= keep platform setting). - -- 'mbox_sel' - - RIO messaging MBOX selection mask. This is a bitmask that defines - messaging MBOXes are managed by this device driver. Mask bits 0 - 3 - correspond to MBOX0 - MBOX3. MBOX is under driver's control if the - corresponding bit is set to '1'. Default value is 0x0f (= all). - -2. Known problems -================= - - None. - -3. DMA Engine Support -===================== - -Tsi721 mport driver supports DMA data transfers between local system memory and -remote RapidIO devices. This functionality is implemented according to SLAVE -mode API defined by common Linux kernel DMA Engine framework. 
- -Depending on system requirements RapidIO DMA operations can be included/excluded -by setting CONFIG_RAPIDIO_DMA_ENGINE option. Tsi721 miniport driver uses seven -out of eight available BDMA channels to support DMA data transfers. -One BDMA channel is reserved for generation of maintenance read/write requests. - -If Tsi721 mport driver has been built with RAPIDIO_DMA_ENGINE support included, -this driver will accept DMA-specific module parameter: - - "dma_desc_per_channel" - - defines number of hardware buffer descriptors used by - each BDMA channel of Tsi721 (by default - 128). - -4. Version History - - ===== ==================================================================== - 1.1.0 DMA operations re-worked to support data scatter/gather lists larger - than hardware buffer descriptors ring. - 1.0.0 Initial driver release. - ===== ==================================================================== - -5. License -=========== - - Copyright(c) 2011 Integrated Device Technology, Inc. All rights reserved. - - This program is free software; you can redistribute it and/or modify it - under the terms of the GNU General Public License as published by the Free - Software Foundation; either version 2 of the License, or (at your option) - any later version. - - This program is distributed in the hope that it will be useful, but WITHOUT - ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or - FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for - more details. - - You should have received a copy of the GNU General Public License along with - this program; if not, write to the Free Software Foundation, Inc., - 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
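The numeric encodings of the Tsi721 module parameters described in the file above ('pcie_mrrs', 'dma_sel', 'mbox_sel') can be illustrated with a short sketch. This is only an illustration of the documented value ranges and defaults; the helper names are hypothetical and are not part of the driver or of any kernel API.

```python
# Illustrative helpers for the tsi721 module parameter encodings.
# All function names here are hypothetical; they only mirror the
# value ranges and defaults stated in the documentation text.

def mrrs_bytes(pcie_mrrs):
    """Translate a 'pcie_mrrs' value into a read request size in bytes."""
    if pcie_mrrs == -1:
        return None  # documented default: keep the platform setting
    if not 0 <= pcie_mrrs <= 5:
        raise ValueError("pcie_mrrs must be -1 or in the range 0..5")
    return 128 << pcie_mrrs  # 0 = 128B, 1 = 256B, ... 5 = 4096B


def selected_bits(mask, width):
    """Return the bit positions set in 'mask', limited to 'width' bits."""
    return [bit for bit in range(width) if mask & (1 << bit)]


def dma_channels(dma_sel=0x7F):
    """DMA channels (0 ... 6) selected by a 'dma_sel' mask."""
    return selected_bits(dma_sel, 7)


def mboxes(mbox_sel=0x0F):
    """MBOX numbers (0 ... 3) selected by an 'mbox_sel' mask."""
    return selected_bits(mbox_sel, 4)
```

For instance, dma_channels(0x05) selects channels 0 and 2 only, while the default mask 0x7f selects all seven channels available for data transfers.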
diff --git a/drivers/rapidio/Kconfig b/drivers/rapidio/Kconfig index 467e8fa06904..677d1aff61b7 100644 --- a/drivers/rapidio/Kconfig +++ b/drivers/rapidio/Kconfig @@ -86,7 +86,7 @@ config RAPIDIO_CHMAN This option includes RapidIO channelized messaging driver which provides socket-like interface to allow sharing of single RapidIO messaging mailbox between multiple user-space applications. - See "Documentation/rapidio/rio_cm.rst" for driver description. + See "Documentation/driver-api/rapidio/rio_cm.rst" for driver description. config RAPIDIO_MPORT_CDEV tristate "RapidIO /dev mport device driver" -- cgit v1.2.3 From ae4a05027e2f883fb5f822e48d67cacc26bf60e1 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 18 Jun 2019 16:32:31 -0300 Subject: docs: nvdimm: add it to the driver-api book The descriptions here are from Kernel driver's PoV. Signed-off-by: Mauro Carvalho Chehab Acked-by: Dan Williams --- Documentation/driver-api/index.rst | 1 + Documentation/driver-api/nvdimm/btt.rst | 285 +++++++++ Documentation/driver-api/nvdimm/index.rst | 10 + Documentation/driver-api/nvdimm/nvdimm.rst | 887 +++++++++++++++++++++++++++ Documentation/driver-api/nvdimm/security.rst | 143 +++++ Documentation/nvdimm/btt.rst | 285 --------- Documentation/nvdimm/index.rst | 12 - Documentation/nvdimm/nvdimm.rst | 887 --------------------------- Documentation/nvdimm/security.rst | 143 ----- drivers/nvdimm/Kconfig | 2 +- 10 files changed, 1327 insertions(+), 1328 deletions(-) create mode 100644 Documentation/driver-api/nvdimm/btt.rst create mode 100644 Documentation/driver-api/nvdimm/index.rst create mode 100644 Documentation/driver-api/nvdimm/nvdimm.rst create mode 100644 Documentation/driver-api/nvdimm/security.rst delete mode 100644 Documentation/nvdimm/btt.rst delete mode 100644 Documentation/nvdimm/index.rst delete mode 100644 Documentation/nvdimm/nvdimm.rst delete mode 100644 Documentation/nvdimm/security.rst diff --git
a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index d665cd9ab95f..410dd7110772 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -44,6 +44,7 @@ available subsections can be seen below. mtdnand miscellaneous mei/index + nvdimm/index w1 rapidio/index s390-drivers diff --git a/Documentation/driver-api/nvdimm/btt.rst b/Documentation/driver-api/nvdimm/btt.rst new file mode 100644 index 000000000000..2d8269f834bd --- /dev/null +++ b/Documentation/driver-api/nvdimm/btt.rst @@ -0,0 +1,285 @@ +============================= +BTT - Block Translation Table +============================= + + +1. Introduction +=============== + +Persistent memory based storage is able to perform IO at byte (or more +accurately, cache line) granularity. However, we often want to expose such +storage as traditional block devices. The block drivers for persistent memory +will do exactly this. However, they do not provide any atomicity guarantees. +Traditional SSDs typically provide protection against torn sectors in hardware, +using stored energy in capacitors to complete in-flight block writes, or perhaps +in firmware. We don't have this luxury with persistent memory - if a write is in +progress, and we experience a power failure, the block will contain a mix of old +and new data. Applications may not be prepared to handle such a scenario. + +The Block Translation Table (BTT) provides atomic sector update semantics for +persistent memory devices, so that applications that rely on sector writes not +being torn can continue to do so. The BTT manifests itself as a stacked block +device, and reserves a portion of the underlying storage for its metadata. At +the heart of it, is an indirection table that re-maps all the blocks on the +volume. It can be thought of as an extremely simple file system that only +provides atomic sector updates. + + +2. 
Static Layout +================ + +The underlying storage on which a BTT can be laid out is not limited in any way. +The BTT, however, splits the available space into chunks of up to 512 GiB, +called "Arenas". + +Each arena follows the same layout for its metadata, and all references in an +arena are internal to it (with the exception of one field that points to the +next arena). The following depicts the "On-disk" metadata layout:: + + + Backing Store +-------> Arena + +---------------+ | +------------------+ + | | | | Arena info block | + | Arena 0 +---+ | 4K | + | 512G | +------------------+ + | | | | + +---------------+ | | + | | | | + | Arena 1 | | Data Blocks | + | 512G | | | + | | | | + +---------------+ | | + | . | | | + | . | | | + | . | | | + | | | | + | | | | + +---------------+ +------------------+ + | | + | BTT Map | + | | + | | + +------------------+ + | | + | BTT Flog | + | | + +------------------+ + | Info block copy | + | 4K | + +------------------+ + + +3. Theory of Operation +====================== + + +a. The BTT Map +-------------- + +The map is a simple lookup/indirection table that maps an LBA to an internal +block. Each map entry is 32 bits. The two most significant bits are special +flags, and the remaining form the internal block number. + +======== ============================================================= +Bit Description +======== ============================================================= +31 - 30 Error and Zero flags - Used in the following way: + + == == ==================================================== + 31 30 Description + == == ==================================================== + 0 0 Initial state. 
Reads return zeroes; Premap = Postmap + 0 1 Zero state: Reads return zeroes + 1 0 Error state: Reads fail; Writes clear 'E' bit + 1 1 Normal Block – has valid postmap + == == ==================================================== + +29 - 0 Mappings to internal 'postmap' blocks +======== ============================================================= + + +Some of the terminology that will be subsequently used: + +============ ================================================================ +External LBA LBA as made visible to upper layers. +ABA Arena Block Address - Block offset/number within an arena +Premap ABA The block offset into an arena, which was decided upon by range + checking the External LBA +Postmap ABA The block number in the "Data Blocks" area obtained after + indirection from the map +nfree The number of free blocks that are maintained at any given time. + This is the number of concurrent writes that can happen to the + arena. +============ ================================================================ + + +For example, after adding a BTT, we surface a disk of 1024G. We get a read for +the external LBA at 768G. This falls into the second arena, and of the 512G +worth of blocks that this arena contributes, this block is at 256G. Thus, the +premap ABA is 256G. We now refer to the map, and find out the mapping for block +'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64. + + +b. The BTT Flog +--------------- + +The BTT provides sector atomicity by making every write an "allocating write", +i.e. Every write goes to a "free" block. A running list of free blocks is +maintained in the form of the BTT flog. 'Flog' is a combination of the words +"free list" and "log". The flog contains 'nfree' entries, and an entry contains: + +======== ===================================================================== +lba The premap ABA that is being written to +old_map The old postmap ABA - after 'this' write completes, this will be a + free block. 
+new_map The new postmap ABA. The map will be updated to reflect this + lba->postmap_aba mapping, but we log it here in case we have to + recover. +seq Sequence number to mark which of the 2 sections of this flog entry is + valid/newest. It cycles between 01->10->11->01 (binary) under normal + operation, with 00 indicating an uninitialized state. +lba' alternate lba entry +old_map' alternate old postmap entry +new_map' alternate new postmap entry +seq' alternate sequence number. +======== ===================================================================== + +Each of the above fields is 32-bit, making one entry 32 bytes. Entries are also +padded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are +done such that for any entry being written, it: +a. overwrites the 'old' section in the entry based on sequence numbers +b. writes the 'new' section such that the sequence number is written last. + + +c. The concept of lanes +----------------------- + +While 'nfree' describes the number of concurrent IOs an arena can process +concurrently, 'nlanes' is the number of IOs the BTT device as a whole can +process:: + + nlanes = min(nfree, num_cpus) + +A lane number is obtained at the start of any IO, and is used for indexing into +all the on-disk and in-memory data structures for the duration of the IO. If +there are more CPUs than the max number of available lanes, then lanes are +protected by spinlocks. + + +d. In-memory data structure: Read Tracking Table (RTT) +------------------------------------------------------ + +Consider a case where we have two threads, one doing reads and the other, +writes. We can hit a condition where the writer thread grabs a free block to do +a new IO, but the (slow) reader thread is still reading from it. In other words, +the reader consulted a map entry, and started reading the corresponding block.
A +writer started writing to the same external LBA, and finished the write updating +the map for that external LBA to point to its new postmap ABA. At this point the +internal, postmap block that the reader is (still) reading has been inserted +into the list of free blocks. If another write comes in for the same LBA, it can +grab this free block, and start writing to it, causing the reader to read +incorrect data. To prevent this, we introduce the RTT. + +The RTT is a simple, per arena table with 'nfree' entries. Every reader inserts +into rtt[lane_number], the postmap ABA it is reading, and clears it after the +read is complete. Every writer thread, after grabbing a free block, checks the +RTT for its presence. If the postmap free block is in the RTT, it waits till the +reader clears the RTT entry, and only then starts writing to it. + + +e. In-memory data structure: map locks +-------------------------------------- + +Consider a case where two writer threads are writing to the same LBA. There can +be a race in the following sequence of steps:: + + free[lane] = map[premap_aba] + map[premap_aba] = postmap_aba + +Both threads can update their respective free[lane] with the same old, freed +postmap_aba. This has made the layout inconsistent by losing a free entry, and +at the same time, duplicating another free entry for two lanes. + +To solve this, we could have a single map lock (per arena) that has to be taken +before performing the above sequence, but we feel that could be too contentious. +Instead we use an array of (nfree) map_locks that is indexed by +(premap_aba modulo nfree). + + +f. Reconstruction from the Flog +------------------------------- + +On startup, we analyze the BTT flog to create our list of free blocks. We walk +through all the entries, and for each lane, of the set of two possible +'sections', we always look at the most recent one only (based on the sequence +number). The reconstruction rules/steps are simple: + +- Read map[log_entry.lba]. 
+- If log_entry.new matches the map entry, then log_entry.old is free. +- If log_entry.new does not match the map entry, then log_entry.new is free. + (This case can only be caused by power-fails/unsafe shutdowns) + + +g. Summarizing - Read and Write flows +------------------------------------- + +Read: + +1. Convert external LBA to arena number + pre-map ABA +2. Get a lane (and take lane_lock) +3. Read map to get the entry for this pre-map ABA +4. Enter post-map ABA into RTT[lane] +5. If TRIM flag set in map, return zeroes, and end IO (go to step 8) +6. If ERROR flag set in map, end IO with EIO (go to step 8) +7. Read data from this block +8. Remove post-map ABA entry from RTT[lane] +9. Release lane (and lane_lock) + +Write: + +1. Convert external LBA to Arena number + pre-map ABA +2. Get a lane (and take lane_lock) +3. Use lane to index into in-memory free list and obtain a new block, next flog + index, next sequence number +4. Scan the RTT to check if free block is present, and spin/wait if it is. +5. Write data to this free block +6. Read map to get the existing post-map ABA entry for this pre-map ABA +7. Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num] +8. Write new post-map ABA into map. +9. Write old post-map entry into the free list +10. Calculate next sequence number and write into the free list entry +11. Release lane (and lane_lock) + + +4. Error Handling +================= + +An arena would be in an error state if any of the metadata is corrupted +irrecoverably, either due to a bug or a media error. The following conditions +indicate an error: + +- Info block checksum does not match (and recovering from the copy also fails) +- All internal available blocks are not uniquely and entirely addressed by the + sum of mapped blocks and free blocks (from the BTT flog). 
+- Rebuilding free list from the flog reveals missing/duplicate/impossible + entries +- A map entry is out of bounds + +If any of these error conditions are encountered, the arena is put into a read +only state using a flag in the info block. + + +5. Usage +======== + +The BTT can be set up on any disk (namespace) exposed by the libnvdimm subsystem +(pmem, or blk mode). The easiest way to set up such a namespace is using the +'ndctl' utility [1]: + +For example, the ndctl command line to setup a btt with a 4k sector size is:: + + ndctl create-namespace -f -e namespace0.0 -m sector -l 4k + +See ndctl create-namespace --help for more options. + +[1]: https://github.com/pmem/ndctl diff --git a/Documentation/driver-api/nvdimm/index.rst b/Documentation/driver-api/nvdimm/index.rst new file mode 100644 index 000000000000..19dc8ee371dc --- /dev/null +++ b/Documentation/driver-api/nvdimm/index.rst @@ -0,0 +1,10 @@ +=================================== +Non-Volatile Memory Device (NVDIMM) +=================================== + +.. toctree:: + :maxdepth: 1 + + nvdimm + btt + security diff --git a/Documentation/driver-api/nvdimm/nvdimm.rst b/Documentation/driver-api/nvdimm/nvdimm.rst new file mode 100644 index 000000000000..08f855cbb4e6 --- /dev/null +++ b/Documentation/driver-api/nvdimm/nvdimm.rst @@ -0,0 +1,887 @@ +=============================== +LIBNVDIMM: Non-Volatile Devices +=============================== + +libnvdimm - kernel / libndctl - userspace helper library + +linux-nvdimm@lists.01.org + +Version 13 + +.. contents: + + Glossary + Overview + Supporting Documents + Git Trees + LIBNVDIMM PMEM and BLK + Why BLK? 
+ PMEM vs BLK + BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX + Example NVDIMM Platform + LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API + LIBNDCTL: Context + libndctl: instantiate a new library context example + LIBNVDIMM/LIBNDCTL: Bus + libnvdimm: control class device in /sys/class + libnvdimm: bus + libndctl: bus enumeration example + LIBNVDIMM/LIBNDCTL: DIMM (NMEM) + libnvdimm: DIMM (NMEM) + libndctl: DIMM enumeration example + LIBNVDIMM/LIBNDCTL: Region + libnvdimm: region + libndctl: region enumeration example + Why Not Encode the Region Type into the Region Name? + How Do I Determine the Major Type of a Region? + LIBNVDIMM/LIBNDCTL: Namespace + libnvdimm: namespace + libndctl: namespace enumeration example + libndctl: namespace creation example + Why the Term "namespace"? + LIBNVDIMM/LIBNDCTL: Block Translation Table "btt" + libnvdimm: btt layout + libndctl: btt creation example + Summary LIBNDCTL Diagram + + +Glossary +======== + +PMEM: + A system-physical-address range where writes are persistent. A + block device composed of PMEM is capable of DAX. A PMEM address range + may span an interleave of several DIMMs. + +BLK: + A set of one or more programmable memory mapped apertures provided + by a DIMM to access its media. This indirection precludes the + performance benefit of interleaving, but enables DIMM-bounded failure + modes. + +DPA: + DIMM Physical Address, is a DIMM-relative offset. With one DIMM in + the system there would be a 1:1 system-physical-address:DPA association. + Once more DIMMs are added a memory controller interleave must be + decoded to determine the DPA associated with a given + system-physical-address. BLK capacity always has a 1:1 relationship + with a single-DIMM's DPA range. + +DAX: + File system extensions to bypass the page cache and block layer to + mmap persistent memory, from a PMEM block device, directly into a + process address space. 
+ +DSM: + Device Specific Method: ACPI method to control specific + device - in this case the firmware. + +DCR: + NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5. + It defines a vendor-id, device-id, and interface format for a given DIMM. + +BTT: + Block Translation Table: Persistent memory is byte addressable. + Existing software may have an expectation that the power-fail-atomicity + of writes is at least one sector, 512 bytes. The BTT is an indirection + table with atomic update semantics to front a PMEM/BLK block device + driver and present arbitrary atomic sector sizes. + +LABEL: + Metadata stored on a DIMM device that partitions and identifies + (persistently names) storage between PMEM and BLK. It also partitions + BLK storage to host BTTs with different parameters per BLK-partition. + Note that traditional partition tables, GPT/MBR, are layered on top of a + BLK or PMEM device. + + +Overview +======== + +The LIBNVDIMM subsystem provides support for three types of NVDIMMs, namely, +PMEM, BLK, and NVDIMM devices that can simultaneously support both PMEM +and BLK mode access. These three modes of operation are described by +the "NVDIMM Firmware Interface Table" (NFIT) in ACPI 6. While the LIBNVDIMM +implementation is generic and supports pre-NFIT platforms, it was guided +by the superset of capabilities needed to support this ACPI 6 definition +for NVDIMM resources. The bulk of the kernel implementation is in place +to handle the case where DPA accessible via PMEM is aliased with DPA +accessible via BLK. When that occurs a LABEL is needed to reserve DPA +for exclusive access via one mode at a time. 
+ +Supporting Documents +-------------------- + +ACPI 6: + http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf +NVDIMM Namespace: + http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf +DSM Interface Example: + http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf +Driver Writer's Guide: + http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf + +Git Trees +--------- + +LIBNVDIMM: + https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git +LIBNDCTL: + https://github.com/pmem/ndctl.git +PMEM: + https://github.com/01org/prd + + +LIBNVDIMM PMEM and BLK +====================== + +Prior to the arrival of the NFIT, non-volatile memory was described to a +system in various ad-hoc ways. Usually only the bare minimum was +provided, namely, a single system-physical-address range where writes +are expected to be durable after a system power loss. Now, the NFIT +specification standardizes not only the description of PMEM, but also +BLK and platform message-passing entry points for control and +configuration. + +For each NVDIMM access method (PMEM, BLK), LIBNVDIMM provides a block +device driver: + + 1. PMEM (nd_pmem.ko): Drives a system-physical-address range. This + range is contiguous in system memory and may be interleaved (hardware + memory controller striped) across multiple DIMMs. When interleaved the + platform may optionally provide details of which DIMMs are participating + in the interleave. + + Note that while LIBNVDIMM describes system-physical-address ranges that may + alias with BLK access as ND_NAMESPACE_PMEM ranges and those without + alias as ND_NAMESPACE_IO ranges, to the nd_pmem driver there is no + distinction. The different device-types are an implementation detail + that userspace can exploit to implement policies like "only interface + with address ranges from certain DIMMs". 
It is worth noting that when + aliasing is present and a DIMM lacks a label, then no block device can + be created by default as userspace needs to do at least one allocation + of DPA to the PMEM range. In contrast ND_NAMESPACE_IO ranges, once + registered, can be immediately attached to nd_pmem. + + 2. BLK (nd_blk.ko): This driver performs I/O using a set of platform + defined apertures. A set of apertures will access just one DIMM. + Multiple windows (apertures) allow multiple concurrent accesses, much like + tagged-command-queuing, and would likely be used by different threads or + different CPUs. + + The NFIT specification defines a standard format for a BLK-aperture, but + the spec also allows for vendor specific layouts, and non-NFIT BLK + implementations may have other designs for BLK I/O. For this reason + "nd_blk" calls back into platform-specific code to perform the I/O. + + One such implementation is defined in the "Driver Writer's Guide" and "DSM + Interface Example". + + +Why BLK? +======== + +While PMEM provides direct byte-addressable CPU-load/store access to +NVDIMM storage, it does not provide the best system RAS (reliability, +availability, and serviceability) model. An access to a corrupted +system-physical-address causes a CPU exception while an access +to a corrupted address through a BLK-aperture causes that block window +to raise an error status in a register. The latter is more aligned with +the standard error model that host-bus-adapter attached disks present. + +Also, if an administrator ever wants to replace a memory module it is easier to +service a system at DIMM module boundaries. Compare this to PMEM where +data could be interleaved in an opaque hardware specific manner across +several DIMMs. + +PMEM vs BLK +----------- + +BLK-apertures solve these RAS problems, but their presence is also the +major contributing factor to the complexity of the ND subsystem. They +complicate the implementation because PMEM and BLK alias in DPA space. 
+Any given DIMM's DPA-range may contribute to one or more +system-physical-address sets of interleaved DIMMs, *and* may also be +accessed in its entirety through its BLK-aperture. Accessing a DPA +through a system-physical-address while simultaneously accessing the +same DPA through a BLK-aperture has undefined results. For this reason, +DIMMs with this dual interface configuration include a DSM function to +store/retrieve a LABEL. The LABEL effectively partitions the DPA-space +into exclusive system-physical-address and BLK-aperture accessible +regions. For simplicity a DIMM is allowed a PMEM "region" per each +interleave set in which it is a member. The remaining DPA space can be +carved into an arbitrary number of BLK devices with discontiguous +extents. + +BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +One of the few +reasons to allow multiple BLK namespaces per REGION is so that each +BLK-namespace can be configured with a BTT with unique atomic sector +sizes. While a PMEM device can host a BTT the LABEL specification does +not provide for a sector size to be specified for a PMEM namespace. + +This is due to the expectation that the primary usage model for PMEM is +via DAX, and the BTT is incompatible with DAX. However, for the cases +where an application or filesystem still needs atomic sector update +guarantees it can register a BTT on a PMEM device or partition. 
See +LIBNVDIMM/NDCTL: Block Translation Table "btt" + + +Example NVDIMM Platform +======================= + +For the remainder of this document the following diagram will be +referenced for any example sysfs layouts:: + + + (a) (b) DIMM BLK-REGION + +-------------------+--------+--------+--------+ + +------+ | pm0.0 | blk2.0 | pm1.0 | blk2.1 | 0 region2 + | imc0 +--+- - - region0- - - +--------+ +--------+ + +--+---+ | pm0.0 | blk3.0 | pm1.0 | blk3.1 | 1 region3 + | +-------------------+--------v v--------+ + +--+---+ | | + | cpu0 | region1 + +--+---+ | | + | +----------------------------^ ^--------+ + +--+---+ | blk4.0 | pm1.0 | blk4.0 | 2 region4 + | imc1 +--+----------------------------| +--------+ + +------+ | blk5.0 | pm1.0 | blk5.0 | 3 region5 + +----------------------------+--------+--------+ + +In this platform we have four DIMMs and two memory controllers in one +socket. Each unique interface (BLK or PMEM) to DPA space is identified +by a region device with a dynamically assigned id (REGION0 - REGION5). + + 1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A + single PMEM namespace is created in the REGION0-SPA-range that spans most + of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that + interleaved system-physical-address range is reclaimed as BLK-aperture + accessed space starting at DPA-offset (a) into each DIMM. In that + reclaimed space we create two BLK-aperture "namespaces" from REGION2 and + REGION3 where "blk2.0" and "blk3.0" are just human readable names that + could be set to any user-desired name in the LABEL. + + 2. In the last portion of DIMM0 and DIMM1 we have an interleaved + system-physical-address range, REGION1, that spans those two DIMMs as + well as DIMM2 and DIMM3. Some of REGION1 is allocated to a PMEM namespace + named "pm1.0", the rest is reclaimed in 4 BLK-aperture namespaces (for + each DIMM in the interleave set), "blk2.1", "blk3.1", "blk4.0", and + "blk5.0". + + 3. 
The portions of DIMM2 and DIMM3 that do not participate in the REGION1 + interleaved system-physical-address range (i.e. the DPA addresses past + offset (b)) are also included in the "blk4.0" and "blk5.0" namespaces. + Note that this example shows that BLK-aperture namespaces don't need to + be contiguous in DPA-space. + + This bus is provided by the kernel under the device + /sys/devices/platform/nfit_test.0 when CONFIG_NFIT_TEST is enabled and + the nfit_test.ko module is loaded. This tests not only LIBNVDIMM but the + acpi_nfit.ko driver as well. + + +LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API +======================================================== + +What follows is a description of the LIBNVDIMM sysfs layout and a +corresponding object hierarchy diagram as viewed through the LIBNDCTL +API. The example sysfs paths and diagrams are relative to the Example +NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit +test. + +LIBNDCTL: Context +----------------- + +Every API call in the LIBNDCTL library requires a context that holds the +logging parameters and other library instance state. The library is +based on the libabc template: + + https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git + +LIBNDCTL: instantiate a new library context example +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +:: + + struct ndctl_ctx *ctx; + + if (ndctl_new(&ctx) == 0) + return ctx; + else + return NULL; + +LIBNVDIMM/LIBNDCTL: Bus +----------------------- + +A bus has a 1:1 relationship with an NFIT. The current expectation for +ACPI based systems is that there is only ever one platform-global NFIT. +That said, it is trivial to register multiple NFITs; the specification +does not preclude it. The infrastructure supports multiple busses and +we use this capability to test multiple NFIT configurations in the unit +test. 
+ +LIBNVDIMM: control class device in /sys/class +--------------------------------------------- + +This character device accepts DSM messages to be passed to DIMM +identified by its NFIT handle:: + + /sys/class/nd/ndctl0 + |-- dev + |-- device -> ../../../ndbus0 + |-- subsystem -> ../../../../../../../class/nd + + + +LIBNVDIMM: bus +-------------- + +:: + + struct nvdimm_bus *nvdimm_bus_register(struct device *parent, + struct nvdimm_bus_descriptor *nfit_desc); + +:: + + /sys/devices/platform/nfit_test.0/ndbus0 + |-- commands + |-- nd + |-- nfit + |-- nmem0 + |-- nmem1 + |-- nmem2 + |-- nmem3 + |-- power + |-- provider + |-- region0 + |-- region1 + |-- region2 + |-- region3 + |-- region4 + |-- region5 + |-- uevent + `-- wait_probe + +LIBNDCTL: bus enumeration example +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Find the bus handle that describes the bus from Example NVDIMM Platform:: + + static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx, + const char *provider) + { + struct ndctl_bus *bus; + + ndctl_bus_foreach(ctx, bus) + if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0) + return bus; + + return NULL; + } + + bus = get_bus_by_provider(ctx, "nfit_test.0"); + + +LIBNVDIMM/LIBNDCTL: DIMM (NMEM) +------------------------------- + +The DIMM device provides a character device for sending commands to +hardware, and it is a container for LABELs. If the DIMM is defined by +NFIT then an optional 'nfit' attribute sub-directory is available to add +NFIT-specifics. + +Note that the kernel device name for "DIMMs" is "nmemX". The NFIT +describes these devices via "Memory Device to System Physical Address +Range Mapping Structure", and there is no requirement that they actually +be physical DIMMs, so we use a more generic name. 
+ +LIBNVDIMM: DIMM (NMEM) +^^^^^^^^^^^^^^^^^^^^^^ + +:: + + struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data, + const struct attribute_group **groups, unsigned long flags, + unsigned long *dsm_mask); + +:: + + /sys/devices/platform/nfit_test.0/ndbus0 + |-- nmem0 + | |-- available_slots + | |-- commands + | |-- dev + | |-- devtype + | |-- driver -> ../../../../../bus/nd/drivers/nvdimm + | |-- modalias + | |-- nfit + | | |-- device + | | |-- format + | | |-- handle + | | |-- phys_id + | | |-- rev_id + | | |-- serial + | | `-- vendor + | |-- state + | |-- subsystem -> ../../../../../bus/nd + | `-- uevent + |-- nmem1 + [..] + + +LIBNDCTL: DIMM enumeration example +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Note, in this example we are assuming NFIT-defined DIMMs which are +identified by an "nfit_handle", a 32-bit value where: + + - Bit 3:0 DIMM number within the memory channel + - Bit 7:4 memory channel number + - Bit 11:8 memory controller ID + - Bit 15:12 socket ID (within scope of a Node controller if node + controller is present) + - Bit 27:16 Node Controller ID + - Bit 31:28 Reserved + +:: + + static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus, + unsigned int handle) + { + struct ndctl_dimm *dimm; + + ndctl_dimm_foreach(bus, dimm) + if (ndctl_dimm_get_handle(dimm) == handle) + return dimm; + + return NULL; + } + + #define DIMM_HANDLE(n, s, i, c, d) \ + ((((n) & 0xfff) << 16) | (((s) & 0xf) << 12) | (((i) & 0xf) << 8) \ + | (((c) & 0xf) << 4) | ((d) & 0xf)) + + dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0)); + +LIBNVDIMM/LIBNDCTL: Region +-------------------------- + +A generic REGION device is registered for each PMEM range or BLK-aperture +set. Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture +sets on the "nfit_test.0" bus. The primary role of regions is to be a +container of "mappings". A mapping is a tuple of <DIMM, +DPA-start-offset, length>. + +LIBNVDIMM provides a built-in driver for these REGION devices. 
This driver +is responsible for reconciling the aliased DPA mappings across all +regions, parsing the LABEL, if present, and then emitting NAMESPACE +devices with the resolved/exclusive DPA-boundaries for the nd_pmem or +nd_blk device driver to consume. + +In addition to the generic attributes of "mapping"s, "interleave_ways" +and "size" the REGION device also exports some convenience attributes. +"nstype" indicates the integer type of namespace-device this region +emits, "devtype" duplicates the DEVTYPE variable stored by udev at the +'add' event, "modalias" duplicates the MODALIAS variable stored by udev +at the 'add' event, and finally, the optional "spa_index" is provided in +the case where the region is defined by a SPA. + +LIBNVDIMM: region:: + + struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus, + struct nd_region_desc *ndr_desc); + struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus, + struct nd_region_desc *ndr_desc); + +:: + + /sys/devices/platform/nfit_test.0/ndbus0 + |-- region0 + | |-- available_size + | |-- btt0 + | |-- btt_seed + | |-- devtype + | |-- driver -> ../../../../../bus/nd/drivers/nd_region + | |-- init_namespaces + | |-- mapping0 + | |-- mapping1 + | |-- mappings + | |-- modalias + | |-- namespace0.0 + | |-- namespace_seed + | |-- numa_node + | |-- nfit + | | `-- spa_index + | |-- nstype + | |-- set_cookie + | |-- size + | |-- subsystem -> ../../../../../bus/nd + | `-- uevent + |-- region1 + [..] 
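For scripts or small tools that do not link against libndctl, the numeric region attributes listed above ("size", "mappings", "nstype") can also be read straight from sysfs. The following is a minimal sketch and not part of LIBNVDIMM or LIBNDCTL: read_region_attr() is a hypothetical helper name, and it assumes the attribute holds a single integer on one line.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: read one integer attribute (e.g. "size",
 * "mappings", "nstype") from a region directory in sysfs.
 * Returns 0 on success and a negative errno value on failure.
 */
static int read_region_attr(const char *region_dir, const char *attr,
			    unsigned long long *val)
{
	char path[4096], buf[64], *end;
	FILE *f;
	int n;

	n = snprintf(path, sizeof(path), "%s/%s", region_dir, attr);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -ENAMETOOLONG;

	f = fopen(path, "r");
	if (!f)
		return -errno;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -EIO;
	}
	fclose(f);

	/* base 0 accepts both decimal and 0x-prefixed values */
	errno = 0;
	*val = strtoull(buf, &end, 0);
	if (errno || end == buf)
		return -EINVAL;
	return 0;
}
```

A caller would pass a directory such as /sys/devices/platform/nfit_test.0/ndbus0/region0 and an attribute name such as "size". Where libndctl is available it remains the more robust interface.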
+ +LIBNDCTL: region enumeration example +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Sample region retrieval routines based on NFIT-unique data like +"spa_index" (interleave set id) for PMEM and "nfit_handle" (dimm id) for +BLK:: + + static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus, + unsigned int spa_index) + { + struct ndctl_region *region; + + ndctl_region_foreach(bus, region) { + if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM) + continue; + if (ndctl_region_get_spa_index(region) == spa_index) + return region; + } + return NULL; + } + + static struct ndctl_region *get_blk_region_by_dimm_handle(struct ndctl_bus *bus, + unsigned int handle) + { + struct ndctl_region *region; + + ndctl_region_foreach(bus, region) { + struct ndctl_mapping *map; + + if (ndctl_region_get_type(region) != ND_DEVICE_REGION_BLOCK) + continue; + ndctl_mapping_foreach(region, map) { + struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map); + + if (ndctl_dimm_get_handle(dimm) == handle) + return region; + } + } + return NULL; + } + + +Why Not Encode the Region Type into the Region Name? +---------------------------------------------------- + +At first glance it seems since NFIT defines just PMEM and BLK interface +types that we should simply name REGION devices with something derived +from those type names. However, the ND subsystem explicitly keeps the +REGION name generic and expects userspace to always consider the +region-attributes for four reasons: + + 1. There are already more than two REGION and "namespace" types. For + PMEM there are two subtypes. As mentioned previously we have PMEM where + the constituent DIMM devices are known and anonymous PMEM. For BLK + regions the NFIT specification already anticipates vendor specific + implementations. The exact distinction of what a region contains is in + the region-attributes not the region-name or the region-devtype. + + 2. A region with zero child-namespaces is a possible configuration. 
For + example, the NFIT allows for a DCR to be published without a + corresponding BLK-aperture. This equates to a DIMM that can only accept + control/configuration messages, but no i/o through a descendant block + device. Again, this "type" is advertised in the attributes ('mappings' + == 0) and the name does not tell you much. + + 3. What if a third major interface type arises in the future? Outside + of vendor specific implementations, it's not difficult to envision a + third class of interface type beyond BLK and PMEM. With a generic name + for the REGION level of the device-hierarchy old userspace + implementations can still make sense of new kernel advertised + region-types. Userspace can always rely on the generic region + attributes like "mappings", "size", etc and the expected child devices + named "namespace". This generic format of the device-model hierarchy + allows the LIBNVDIMM and LIBNDCTL implementations to be more uniform and + future-proof. + + 4. There are more robust mechanisms for determining the major type of a + region than a device name. See the next section, How Do I Determine the + Major Type of a Region? + +How Do I Determine the Major Type of a Region? +---------------------------------------------- + +Outside of the blanket recommendation of "use libndctl", or simply +looking at the kernel header (/usr/include/linux/ndctl.h) to decode the +"nstype" integer attribute, here are some other options. + +1. module alias lookup +^^^^^^^^^^^^^^^^^^^^^^ + + The whole point of region/namespace device type differentiation is to + decide which block-device driver will attach to a given LIBNVDIMM namespace. + One can simply use the modalias to lookup the resulting module. It's + important to note that this method is robust in the presence of a + vendor-specific driver down the road. If a vendor-specific + implementation wants to supplant the standard nd_blk driver it can with + minimal impact to the rest of LIBNVDIMM. 
+ + In fact, a vendor may also want to have a vendor-specific region-driver + (outside of nd_region). For example, if a vendor defined its own LABEL + format it would need its own region driver to parse that LABEL and emit + the resulting namespaces. The output from module resolution is more + accurate than a region-name or region-devtype. + +2. udev +^^^^^^^ + + The kernel "devtype" is registered in the udev database:: + + # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region0 + P: /devices/platform/nfit_test.0/ndbus0/region0 + E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region0 + E: DEVTYPE=nd_pmem + E: MODALIAS=nd:t2 + E: SUBSYSTEM=nd + + # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region4 + P: /devices/platform/nfit_test.0/ndbus0/region4 + E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region4 + E: DEVTYPE=nd_blk + E: MODALIAS=nd:t3 + E: SUBSYSTEM=nd + + ...and is available as a region attribute, but keep in mind that the + "devtype" does not indicate sub-type variations and scripts should + really be understanding the other attributes. + +3. type specific attributes +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + As it currently stands a BLK-aperture region will never have a + "nfit/spa_index" attribute, but neither will a non-NFIT PMEM region. A + BLK region with a "mappings" value of 0 is, as mentioned above, a DIMM + that does not allow I/O. A PMEM region with a "mappings" value of zero + is a simple system-physical-address range. + + +LIBNVDIMM/LIBNDCTL: Namespace +----------------------------- + +A REGION, after resolving DPA aliasing and LABEL specified boundaries, +surfaces one or more "namespace" devices. The arrival of a "namespace" +device currently triggers either the nd_blk or nd_pmem driver to load +and register a disk/block device. 
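The MODALIAS strings shown in the udev output earlier (nd:t2 for the PMEM region, nd:t3 for the BLK region) can be decoded with a few lines of C. This is only a sketch: nd_modalias_to_type() and nd_type_name() are hypothetical helper names, and the numeric codes are the ND_DEVICE_* values as I understand them from the uapi header linux/ndctl.h, so check them against the header on the target system.

```c
#include <stdio.h>

/*
 * Hypothetical helper: parse a modalias of the form "nd:t<N>" and
 * return N, or -1 if the string does not have that form. A single
 * trailing newline, as found when reading a sysfs attribute, is accepted.
 */
static int nd_modalias_to_type(const char *modalias)
{
	int type, n;
	char trailing;

	n = sscanf(modalias, "nd:t%d%c", &type, &trailing);
	if (n == 1 || (n == 2 && trailing == '\n'))
		return type;
	return -1;
}

/* Assumed mapping of ND_DEVICE_* codes (linux/ndctl.h) to short names */
static const char *nd_type_name(int type)
{
	switch (type) {
	case 1: return "dimm";
	case 2: return "pmem region";
	case 3: return "blk region";
	case 4: return "io namespace";
	case 5: return "pmem namespace";
	case 6: return "blk namespace";
	default: return "unknown";
	}
}
```

As noted above, the device type does not capture sub-type variations, so a tool would still want to consult the other attributes.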
+ +LIBNVDIMM: namespace +^^^^^^^^^^^^^^^^^^^^ + +Here is a sample layout from the three major types of NAMESPACE where +namespace0.0 represents DIMM-info-backed PMEM (note that it has a 'uuid' +attribute), namespace2.0 represents a BLK namespace (note that it has a +'sector_size' attribute), and namespace6.0 represents an anonymous +PMEM namespace (note that it has no 'uuid' attribute because it does not +support a LABEL):: + + /sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0 + |-- alt_name + |-- devtype + |-- dpa_extents + |-- force_raw + |-- modalias + |-- numa_node + |-- resource + |-- size + |-- subsystem -> ../../../../../../bus/nd + |-- type + |-- uevent + `-- uuid + /sys/devices/platform/nfit_test.0/ndbus0/region2/namespace2.0 + |-- alt_name + |-- devtype + |-- dpa_extents + |-- force_raw + |-- modalias + |-- numa_node + |-- sector_size + |-- size + |-- subsystem -> ../../../../../../bus/nd + |-- type + |-- uevent + `-- uuid + /sys/devices/platform/nfit_test.1/ndbus1/region6/namespace6.0 + |-- block + | `-- pmem0 + |-- devtype + |-- driver -> ../../../../../../bus/nd/drivers/pmem + |-- force_raw + |-- modalias + |-- numa_node + |-- resource + |-- size + |-- subsystem -> ../../../../../../bus/nd + |-- type + `-- uevent + +LIBNDCTL: namespace enumeration example +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Namespaces are indexed relative to their parent region, example below. +These indexes are mostly static from boot to boot, but the subsystem makes +no guarantees in this regard. For a static namespace identifier use its +'uuid' attribute. 
+ +:: + + static struct ndctl_namespace + *get_namespace_by_id(struct ndctl_region *region, unsigned int id) + { + struct ndctl_namespace *ndns; + + ndctl_namespace_foreach(region, ndns) + if (ndctl_namespace_get_id(ndns) == id) + return ndns; + + return NULL; + } + +LIBNDCTL: namespace creation example +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Idle namespaces are automatically created by the kernel if a given +region has enough available capacity to create a new namespace. +Namespace instantiation involves finding an idle namespace and +configuring it. For the most part the setting of namespace attributes +can occur in any order; the only constraint is that 'uuid' must be set +before 'size'. This enables the kernel to track DPA allocations +internally with a static identifier:: + + static int configure_namespace(struct ndctl_region *region, + struct ndctl_namespace *ndns, + struct namespace_parameters *parameters) + { + char devname[50]; + + snprintf(devname, sizeof(devname), "namespace%d.%d", + ndctl_region_get_id(region), parameters->id); + + ndctl_namespace_set_alt_name(ndns, devname); + /* 'uuid' must be set prior to setting size! */ + ndctl_namespace_set_uuid(ndns, parameters->uuid); + ndctl_namespace_set_size(ndns, parameters->size); + /* unlike pmem namespaces, blk namespaces have a sector size */ + if (parameters->lbasize) + ndctl_namespace_set_sector_size(ndns, parameters->lbasize); + /* propagate the result of enabling the namespace to the caller */ + return ndctl_namespace_enable(ndns); + } + + +Why the Term "namespace"? +^^^^^^^^^^^^^^^^^^^^^^^^^ + + 1. Why not "volume" for instance? "volume" ran the risk of confusing + ND (libnvdimm subsystem) with a volume manager like device-mapper. + + 2. The term originated to describe the sub-devices that can be created + within an NVME controller (see the nvme specification: + http://www.nvmexpress.org/specifications/), and NFIT namespaces are + meant to parallel the capabilities and configurability of + NVME-namespaces. 
+ + +LIBNVDIMM/LIBNDCTL: Block Translation Table "btt" +------------------------------------------------- + +A BTT (design document: http://pmem.io/2014/09/23/btt.html) is a stacked +block device driver that fronts either the whole block device or a +partition of a block device emitted by either a PMEM or BLK NAMESPACE. + +LIBNVDIMM: btt layout +^^^^^^^^^^^^^^^^^^^^^ + +Every region will start out with at least one BTT device which is the +seed device. To activate it set the "namespace", "uuid", and +"sector_size" attributes and then bind the device to the nd_pmem or +nd_blk driver depending on the region type:: + + /sys/devices/platform/nfit_test.1/ndbus0/region0/btt0/ + |-- namespace + |-- delete + |-- devtype + |-- modalias + |-- numa_node + |-- sector_size + |-- subsystem -> ../../../../../bus/nd + |-- uevent + `-- uuid + +LIBNDCTL: btt creation example +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Similar to namespaces an idle BTT device is automatically created per +region. Each time this "seed" btt device is configured and enabled a new +seed is created. Creating a BTT configuration involves two steps: +finding an idle BTT and assigning it to consume a PMEM or BLK namespace:: + + static struct ndctl_btt *get_idle_btt(struct ndctl_region *region) + { + struct ndctl_btt *btt; + + ndctl_btt_foreach(region, btt) + if (!ndctl_btt_is_enabled(btt) + && !ndctl_btt_is_configured(btt)) + return btt; + + return NULL; + } + + static int configure_btt(struct ndctl_region *region, + struct btt_parameters *parameters) + { + struct ndctl_btt *btt = get_idle_btt(region); + + /* no idle btt seed available in this region */ + if (!btt) + return -ENXIO; + + ndctl_btt_set_uuid(btt, parameters->uuid); + ndctl_btt_set_sector_size(btt, parameters->sector_size); + ndctl_btt_set_namespace(btt, parameters->ndns); + /* turn off raw mode device */ + ndctl_namespace_disable(parameters->ndns); + /* turn on btt access */ + return ndctl_btt_enable(btt); + } + +Once instantiated a new inactive btt seed device will appear underneath +the region. 
+ +Once a "namespace" is removed from a BTT that instance of the BTT device +will be deleted or otherwise reset to default values. This deletion is +only at the device model level. In order to destroy a BTT the "info +block" needs to be destroyed. Note, that to destroy a BTT the media +needs to be written in raw mode. By default, the kernel will autodetect +the presence of a BTT and disable raw mode. This autodetect behavior +can be suppressed by enabling raw mode for the namespace via the +ndctl_namespace_set_raw_mode() API. + + +Summary LIBNDCTL Diagram +------------------------ + +For the given example above, here is the view of the objects as seen by the +LIBNDCTL API:: + + +---+ + |CTX| +---------+ +--------------+ +---------------+ + +-+-+ +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" | + | | +---------+ +--------------+ +---------------+ + +-------+ | | +---------+ +--------------+ +---------------+ + | DIMM0 <-+ | +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" | + +-------+ | | | +---------+ +--------------+ +---------------+ + | DIMM1 <-+ +-v--+ | +---------+ +--------------+ +---------------+ + +-------+ +-+BUS0+---> REGION2 +-+-> NAMESPACE2.0 +--> ND6 "blk2.0" | + | DIMM2 <-+ +----+ | +---------+ | +--------------+ +----------------------+ + +-------+ | | +-> NAMESPACE2.1 +--> ND5 "blk2.1" | BTT2 | + | DIMM3 <-+ | +--------------+ +----------------------+ + +-------+ | +---------+ +--------------+ +---------------+ + +-> REGION3 +-+-> NAMESPACE3.0 +--> ND4 "blk3.0" | + | +---------+ | +--------------+ +----------------------+ + | +-> NAMESPACE3.1 +--> ND3 "blk3.1" | BTT1 | + | +--------------+ +----------------------+ + | +---------+ +--------------+ +---------------+ + +-> REGION4 +---> NAMESPACE4.0 +--> ND2 "blk4.0" | + | +---------+ +--------------+ +---------------+ + | +---------+ +--------------+ +----------------------+ + +-> REGION5 +---> NAMESPACE5.0 +--> ND1 "blk5.0" | BTT0 | + +---------+ +--------------+ +---------------+------+ diff 
--git a/Documentation/driver-api/nvdimm/security.rst b/Documentation/driver-api/nvdimm/security.rst new file mode 100644 index 000000000000..ad9dea099b34 --- /dev/null +++ b/Documentation/driver-api/nvdimm/security.rst @@ -0,0 +1,143 @@ +=============== +NVDIMM Security +=============== + +1. Introduction +--------------- + +With the introduction of Intel Device Specific Methods (DSM) v1.8 +specification [1], security DSMs are introduced. The spec added the following +security DSMs: "get security state", "set passphrase", "disable passphrase", +"unlock unit", "freeze lock", "secure erase", and "overwrite". A security_ops +data structure has been added to struct dimm in order to support the security +operations and generic APIs are exposed to allow vendor neutral operations. + +2. Sysfs Interface +------------------ +The "security" sysfs attribute is provided in the nvdimm sysfs directory. For +example: +/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/security + +The "show" attribute of that attribute will display the security state for +that DIMM. The following states are available: disabled, unlocked, locked, +frozen, and overwrite. If security is not supported, the sysfs attribute +will not be visible. + +The "store" attribute takes several commands when it is being written to +in order to support some of the security functionalities: +update - enable or update passphrase. +disable - disable enabled security and remove key. +freeze - freeze changing of security states. +erase - delete existing user encryption key. +overwrite - wipe the entire nvdimm. +master_update - enable or update master passphrase. +master_erase - delete existing user encryption key. + +3. Key Management +----------------- + +The key is associated to the payload by the DIMM id. For example: +# cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/nfit/id +8089-a2-1740-00000133 +The DIMM id would be provided along with the key payload (passphrase) to +the kernel. 
+ +The security keys are managed on the basis of a single key per DIMM. The +key "passphrase" is expected to be 32 bytes long. This is similar to the ATA +security specification [2]. A key is initially acquired via the request_key() +kernel API call during nvdimm unlock. It is up to the user to make sure that +all the keys are in the kernel user keyring for unlock. + +An nvdimm encrypted-key of format enc32 has the description format of: +nvdimm:<bus-provider-specific-unique-id> + +See file ``Documentation/security/keys/trusted-encrypted.rst`` for creating +encrypted-keys of enc32 format. TPM usage with a master trusted key is +preferred for sealing the encrypted-keys. + +4. Unlocking +------------ +When the DIMMs are being enumerated by the kernel, the kernel will attempt to +retrieve the key from the kernel user keyring. This is the only time +a locked DIMM can be unlocked. Once unlocked, the DIMM will remain unlocked +until reboot. Typically an entity (e.g. a shell script) will inject all the +relevant encrypted-keys into the kernel user keyring during the initramfs phase. +This provides the unlock function access to all the related keys that contain +the passphrase for the respective nvdimms. It is also recommended that the +keys are injected before libnvdimm is loaded by modprobe. + +5. Update +--------- +When doing an update, it is expected that the existing key is removed from +the kernel user keyring and reinjected as a different (old) key. It's irrelevant +what the key description is for the old key since we are only interested in the +keyid when doing the update operation. It is also expected that the new key +is injected with the description format described earlier in this +document. The update command written to the sysfs attribute has +the format: +update <old keyid> <new keyid> + +If there is no old keyid because security is being enabled for the first +time, then a 0 should be passed in. + +6. Freeze +--------- +The freeze operation does not require any keys. 
The security config can be +frozen by a user with root privilege. + +7. Disable +---------- +The security disable command format is: +disable + +A key with the current passphrase payload that is tied to the nvdimm should be +in the kernel user keyring. + +8. Secure Erase +--------------- +The command format for doing a secure erase is: +erase + +A key with the current passphrase payload that is tied to the nvdimm should be +in the kernel user keyring. + +9. Overwrite +------------ +The command format for doing an overwrite is: +overwrite + +Overwrite can be done without a key if security is not enabled. A key serial +of 0 can be passed in to indicate no key. + +The sysfs attribute "security" can be polled to wait on overwrite completion. +Overwrite can last tens of minutes or more depending on nvdimm size. + +An encrypted-key with the current user passphrase that is tied to the nvdimm +should be injected and its keyid should be passed in via sysfs. + +10. Master Update +----------------- +The command format for doing a master update is: +update + +The operating mechanism for master update is identical to update except the +master passphrase key is passed to the kernel. The master passphrase key +is just another encrypted-key. + +This command is only available when security is disabled. + +11. Master Erase +---------------- +The command format for doing a master erase is: +master_erase + +This command has the same operating mechanism as erase except the master +passphrase key is passed to the kernel. The master passphrase key is just +another encrypted-key. + +This command is only available when the master security is enabled, indicated +by the extended security status.
+ +[1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf + +[2]: http://www.t13.org/documents/UploadedDocuments/docs2006/e05179r4-ACS-SecurityClarifications.pdf diff --git a/Documentation/nvdimm/btt.rst b/Documentation/nvdimm/btt.rst deleted file mode 100644 index 2d8269f834bd..000000000000 --- a/Documentation/nvdimm/btt.rst +++ /dev/null @@ -1,285 +0,0 @@ -============================= -BTT - Block Translation Table -============================= - - -1. Introduction -=============== - -Persistent memory based storage is able to perform IO at byte (or more -accurately, cache line) granularity. However, we often want to expose such -storage as traditional block devices. The block drivers for persistent memory -will do exactly this. However, they do not provide any atomicity guarantees. -Traditional SSDs typically provide protection against torn sectors in hardware, -using stored energy in capacitors to complete in-flight block writes, or perhaps -in firmware. We don't have this luxury with persistent memory - if a write is in -progress, and we experience a power failure, the block will contain a mix of old -and new data. Applications may not be prepared to handle such a scenario. - -The Block Translation Table (BTT) provides atomic sector update semantics for -persistent memory devices, so that applications that rely on sector writes not -being torn can continue to do so. The BTT manifests itself as a stacked block -device, and reserves a portion of the underlying storage for its metadata. At -the heart of it, is an indirection table that re-maps all the blocks on the -volume. It can be thought of as an extremely simple file system that only -provides atomic sector updates. - - -2. Static Layout -================ - -The underlying storage on which a BTT can be laid out is not limited in any way. -The BTT, however, splits the available space into chunks of up to 512 GiB, -called "Arenas". 
- -Each arena follows the same layout for its metadata, and all references in an -arena are internal to it (with the exception of one field that points to the -next arena). The following depicts the "On-disk" metadata layout:: - - - Backing Store +-------> Arena - +---------------+ | +------------------+ - | | | | Arena info block | - | Arena 0 +---+ | 4K | - | 512G | +------------------+ - | | | | - +---------------+ | | - | | | | - | Arena 1 | | Data Blocks | - | 512G | | | - | | | | - +---------------+ | | - | . | | | - | . | | | - | . | | | - | | | | - | | | | - +---------------+ +------------------+ - | | - | BTT Map | - | | - | | - +------------------+ - | | - | BTT Flog | - | | - +------------------+ - | Info block copy | - | 4K | - +------------------+ - - -3. Theory of Operation -====================== - - -a. The BTT Map --------------- - -The map is a simple lookup/indirection table that maps an LBA to an internal -block. Each map entry is 32 bits. The two most significant bits are special -flags, and the remaining form the internal block number. - -======== ============================================================= -Bit Description -======== ============================================================= -31 - 30 Error and Zero flags - Used in the following way: - - == == ==================================================== - 31 30 Description - == == ==================================================== - 0 0 Initial state. 
Reads return zeroes; Premap = Postmap - 0 1 Zero state: Reads return zeroes - 1 0 Error state: Reads fail; Writes clear 'E' bit - 1 1 Normal Block – has valid postmap - == == ==================================================== - -29 - 0 Mappings to internal 'postmap' blocks -======== ============================================================= - - -Some of the terminology that will be subsequently used: - -============ ================================================================ -External LBA LBA as made visible to upper layers. -ABA Arena Block Address - Block offset/number within an arena -Premap ABA The block offset into an arena, which was decided upon by range - checking the External LBA -Postmap ABA The block number in the "Data Blocks" area obtained after - indirection from the map -nfree The number of free blocks that are maintained at any given time. - This is the number of concurrent writes that can happen to the - arena. -============ ================================================================ - - -For example, after adding a BTT, we surface a disk of 1024G. We get a read for -the external LBA at 768G. This falls into the second arena, and of the 512G -worth of blocks that this arena contributes, this block is at 256G. Thus, the -premap ABA is 256G. We now refer to the map, and find out the mapping for block -'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64. - - -b. The BTT Flog ---------------- - -The BTT provides sector atomicity by making every write an "allocating write", -i.e. Every write goes to a "free" block. A running list of free blocks is -maintained in the form of the BTT flog. 'Flog' is a combination of the words -"free list" and "log". The flog contains 'nfree' entries, and an entry contains: - -======== ===================================================================== -lba The premap ABA that is being written to -old_map The old postmap ABA - after 'this' write completes, this will be a - free block. 
-new_map The new postmap ABA. The map will be updated to reflect this - lba->postmap_aba mapping, but we log it here in case we have to - recover. -seq Sequence number to mark which of the 2 sections of this flog entry is - valid/newest. It cycles between 01->10->11->01 (binary) under normal - operation, with 00 indicating an uninitialized state. -lba' alternate lba entry -old_map' alternate old postmap entry -new_map' alternate new postmap entry -seq' alternate sequence number. -======== ===================================================================== - -Each of the above fields is 32-bit, making one entry 32 bytes. Entries are also -padded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are -done such that for any entry being written, it: -a. overwrites the 'old' section in the entry based on sequence numbers -b. writes the 'new' section such that the sequence number is written last. - - -c. The concept of lanes ------------------------ - -While 'nfree' describes the number of IOs an arena can process -concurrently, 'nlanes' is the number of IOs the BTT device as a whole can -process:: - - nlanes = min(nfree, num_cpus) - -A lane number is obtained at the start of any IO, and is used for indexing into -all the on-disk and in-memory data structures for the duration of the IO. If -there are more CPUs than the max number of available lanes, then lanes are -protected by spinlocks. - - -d. In-memory data structure: Read Tracking Table (RTT) ------------------------------------------------------- - -Consider a case where we have two threads, one doing reads and the other, -writes. We can hit a condition where the writer thread grabs a free block to do -a new IO, but the (slow) reader thread is still reading from it. In other words, -the reader consulted a map entry, and started reading the corresponding block.
A -writer started writing to the same external LBA, and finished the write updating -the map for that external LBA to point to its new postmap ABA. At this point the -internal, postmap block that the reader is (still) reading has been inserted -into the list of free blocks. If another write comes in for the same LBA, it can -grab this free block, and start writing to it, causing the reader to read -incorrect data. To prevent this, we introduce the RTT. - -The RTT is a simple, per arena table with 'nfree' entries. Every reader inserts -into rtt[lane_number], the postmap ABA it is reading, and clears it after the -read is complete. Every writer thread, after grabbing a free block, checks the -RTT for its presence. If the postmap free block is in the RTT, it waits till the -reader clears the RTT entry, and only then starts writing to it. - - -e. In-memory data structure: map locks --------------------------------------- - -Consider a case where two writer threads are writing to the same LBA. There can -be a race in the following sequence of steps:: - - free[lane] = map[premap_aba] - map[premap_aba] = postmap_aba - -Both threads can update their respective free[lane] with the same old, freed -postmap_aba. This has made the layout inconsistent by losing a free entry, and -at the same time, duplicating another free entry for two lanes. - -To solve this, we could have a single map lock (per arena) that has to be taken -before performing the above sequence, but we feel that could be too contentious. -Instead we use an array of (nfree) map_locks that is indexed by -(premap_aba modulo nfree). - - -f. Reconstruction from the Flog -------------------------------- - -On startup, we analyze the BTT flog to create our list of free blocks. We walk -through all the entries, and for each lane, of the set of two possible -'sections', we always look at the most recent one only (based on the sequence -number). The reconstruction rules/steps are simple: - -- Read map[log_entry.lba]. 
-- If log_entry.new matches the map entry, then log_entry.old is free. -- If log_entry.new does not match the map entry, then log_entry.new is free. - (This case can only be caused by power-fails/unsafe shutdowns) - - -g. Summarizing - Read and Write flows -------------------------------------- - -Read: - -1. Convert external LBA to arena number + pre-map ABA -2. Get a lane (and take lane_lock) -3. Read map to get the entry for this pre-map ABA -4. Enter post-map ABA into RTT[lane] -5. If TRIM flag set in map, return zeroes, and end IO (go to step 8) -6. If ERROR flag set in map, end IO with EIO (go to step 8) -7. Read data from this block -8. Remove post-map ABA entry from RTT[lane] -9. Release lane (and lane_lock) - -Write: - -1. Convert external LBA to Arena number + pre-map ABA -2. Get a lane (and take lane_lock) -3. Use lane to index into in-memory free list and obtain a new block, next flog - index, next sequence number -4. Scan the RTT to check if free block is present, and spin/wait if it is. -5. Write data to this free block -6. Read map to get the existing post-map ABA entry for this pre-map ABA -7. Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num] -8. Write new post-map ABA into map. -9. Write old post-map entry into the free list -10. Calculate next sequence number and write into the free list entry -11. Release lane (and lane_lock) - - -4. Error Handling -================= - -An arena would be in an error state if any of the metadata is corrupted -irrecoverably, either due to a bug or a media error. The following conditions -indicate an error: - -- Info block checksum does not match (and recovering from the copy also fails) -- All internal available blocks are not uniquely and entirely addressed by the - sum of mapped blocks and free blocks (from the BTT flog). 
-- Rebuilding free list from the flog reveals missing/duplicate/impossible - entries -- A map entry is out of bounds - -If any of these error conditions are encountered, the arena is put into a read -only state using a flag in the info block. - - -5. Usage -======== - -The BTT can be set up on any disk (namespace) exposed by the libnvdimm subsystem -(pmem, or blk mode). The easiest way to set up such a namespace is using the -'ndctl' utility [1]: - -For example, the ndctl command line to setup a btt with a 4k sector size is:: - - ndctl create-namespace -f -e namespace0.0 -m sector -l 4k - -See ndctl create-namespace --help for more options. - -[1]: https://github.com/pmem/ndctl diff --git a/Documentation/nvdimm/index.rst b/Documentation/nvdimm/index.rst deleted file mode 100644 index 1a3402d3775e..000000000000 --- a/Documentation/nvdimm/index.rst +++ /dev/null @@ -1,12 +0,0 @@ -:orphan: - -=================================== -Non-Volatile Memory Device (NVDIMM) -=================================== - -.. toctree:: - :maxdepth: 1 - - nvdimm - btt - security diff --git a/Documentation/nvdimm/nvdimm.rst b/Documentation/nvdimm/nvdimm.rst deleted file mode 100644 index 08f855cbb4e6..000000000000 --- a/Documentation/nvdimm/nvdimm.rst +++ /dev/null @@ -1,887 +0,0 @@ -=============================== -LIBNVDIMM: Non-Volatile Devices -=============================== - -libnvdimm - kernel / libndctl - userspace helper library - -linux-nvdimm@lists.01.org - -Version 13 - -.. contents: - - Glossary - Overview - Supporting Documents - Git Trees - LIBNVDIMM PMEM and BLK - Why BLK? 
- PMEM vs BLK - BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX - Example NVDIMM Platform - LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API - LIBNDCTL: Context - libndctl: instantiate a new library context example - LIBNVDIMM/LIBNDCTL: Bus - libnvdimm: control class device in /sys/class - libnvdimm: bus - libndctl: bus enumeration example - LIBNVDIMM/LIBNDCTL: DIMM (NMEM) - libnvdimm: DIMM (NMEM) - libndctl: DIMM enumeration example - LIBNVDIMM/LIBNDCTL: Region - libnvdimm: region - libndctl: region enumeration example - Why Not Encode the Region Type into the Region Name? - How Do I Determine the Major Type of a Region? - LIBNVDIMM/LIBNDCTL: Namespace - libnvdimm: namespace - libndctl: namespace enumeration example - libndctl: namespace creation example - Why the Term "namespace"? - LIBNVDIMM/LIBNDCTL: Block Translation Table "btt" - libnvdimm: btt layout - libndctl: btt creation example - Summary LIBNDCTL Diagram - - -Glossary -======== - -PMEM: - A system-physical-address range where writes are persistent. A - block device composed of PMEM is capable of DAX. A PMEM address range - may span an interleave of several DIMMs. - -BLK: - A set of one or more programmable memory mapped apertures provided - by a DIMM to access its media. This indirection precludes the - performance benefit of interleaving, but enables DIMM-bounded failure - modes. - -DPA: - DIMM Physical Address, is a DIMM-relative offset. With one DIMM in - the system there would be a 1:1 system-physical-address:DPA association. - Once more DIMMs are added a memory controller interleave must be - decoded to determine the DPA associated with a given - system-physical-address. BLK capacity always has a 1:1 relationship - with a single-DIMM's DPA range. - -DAX: - File system extensions to bypass the page cache and block layer to - mmap persistent memory, from a PMEM block device, directly into a - process address space. 
- -DSM: - Device Specific Method: ACPI method to control a specific - device - in this case the firmware. - -DCR: - NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5. - It defines a vendor-id, device-id, and interface format for a given DIMM. - -BTT: - Block Translation Table: Persistent memory is byte addressable. - Existing software may have an expectation that the power-fail-atomicity - of writes is at least one sector, 512 bytes. The BTT is an indirection - table with atomic update semantics to front a PMEM/BLK block device - driver and present arbitrary atomic sector sizes. - -LABEL: - Metadata stored on a DIMM device that partitions and identifies - (persistently names) storage between PMEM and BLK. It also partitions - BLK storage to host BTTs with different parameters per BLK-partition. - Note that traditional partition tables, GPT/MBR, are layered on top of a - BLK or PMEM device. - - -Overview -======== - -The LIBNVDIMM subsystem provides support for three types of NVDIMMs, namely, -PMEM, BLK, and NVDIMM devices that can simultaneously support both PMEM -and BLK mode access. These three modes of operation are described by -the "NVDIMM Firmware Interface Table" (NFIT) in ACPI 6. While the LIBNVDIMM -implementation is generic and supports pre-NFIT platforms, it was guided -by the superset of capabilities needed to support this ACPI 6 definition -for NVDIMM resources. The bulk of the kernel implementation is in place -to handle the case where DPA accessible via PMEM is aliased with DPA -accessible via BLK. When that occurs a LABEL is needed to reserve DPA -for exclusive access via one mode at a time.
- -Supporting Documents --------------------- - -ACPI 6: - http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf -NVDIMM Namespace: - http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf -DSM Interface Example: - http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf -Driver Writer's Guide: - http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf - -Git Trees ---------- - -LIBNVDIMM: - https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git -LIBNDCTL: - https://github.com/pmem/ndctl.git -PMEM: - https://github.com/01org/prd - - -LIBNVDIMM PMEM and BLK -====================== - -Prior to the arrival of the NFIT, non-volatile memory was described to a -system in various ad-hoc ways. Usually only the bare minimum was -provided, namely, a single system-physical-address range where writes -are expected to be durable after a system power loss. Now, the NFIT -specification standardizes not only the description of PMEM, but also -BLK and platform message-passing entry points for control and -configuration. - -For each NVDIMM access method (PMEM, BLK), LIBNVDIMM provides a block -device driver: - - 1. PMEM (nd_pmem.ko): Drives a system-physical-address range. This - range is contiguous in system memory and may be interleaved (hardware - memory controller striped) across multiple DIMMs. When interleaved the - platform may optionally provide details of which DIMMs are participating - in the interleave. - - Note that while LIBNVDIMM describes system-physical-address ranges that may - alias with BLK access as ND_NAMESPACE_PMEM ranges and those without - alias as ND_NAMESPACE_IO ranges, to the nd_pmem driver there is no - distinction. The different device-types are an implementation detail - that userspace can exploit to implement policies like "only interface - with address ranges from certain DIMMs". 
It is worth noting that when - aliasing is present and a DIMM lacks a label, then no block device can - be created by default as userspace needs to do at least one allocation - of DPA to the PMEM range. In contrast ND_NAMESPACE_IO ranges, once - registered, can be immediately attached to nd_pmem. - - 2. BLK (nd_blk.ko): This driver performs I/O using a set of platform - defined apertures. A set of apertures will access just one DIMM. - Multiple windows (apertures) allow multiple concurrent accesses, much like - tagged-command-queuing, and would likely be used by different threads or - different CPUs. - - The NFIT specification defines a standard format for a BLK-aperture, but - the spec also allows for vendor specific layouts, and non-NFIT BLK - implementations may have other designs for BLK I/O. For this reason - "nd_blk" calls back into platform-specific code to perform the I/O. - - One such implementation is defined in the "Driver Writer's Guide" and "DSM - Interface Example". - - -Why BLK? -======== - -While PMEM provides direct byte-addressable CPU-load/store access to -NVDIMM storage, it does not provide the best system RAS (reliability, -availability, and serviceability) model. An access to a corrupted -system-physical-address causes a CPU exception while an access -to a corrupted address through a BLK-aperture causes that block window -to raise an error status in a register. The latter is more aligned with -the standard error model that host-bus-adapter attached disks present. - -Also, if an administrator ever wants to replace a memory module it is easier to -service a system at DIMM module boundaries. Compare this to PMEM where -data could be interleaved in an opaque hardware specific manner across -several DIMMs. - -PMEM vs BLK ------------ - -BLK-apertures solve these RAS problems, but their presence is also the -major contributing factor to the complexity of the ND subsystem. They -complicate the implementation because PMEM and BLK alias in DPA space.
-Any given DIMM's DPA-range may contribute to one or more -system-physical-address sets of interleaved DIMMs, *and* may also be -accessed in its entirety through its BLK-aperture. Accessing a DPA -through a system-physical-address while simultaneously accessing the -same DPA through a BLK-aperture has undefined results. For this reason, -DIMMs with this dual interface configuration include a DSM function to -store/retrieve a LABEL. The LABEL effectively partitions the DPA-space -into exclusive system-physical-address and BLK-aperture accessible -regions. For simplicity a DIMM is allowed a PMEM "region" per each -interleave set in which it is a member. The remaining DPA space can be -carved into an arbitrary number of BLK devices with discontiguous -extents. - -BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -One of the few -reasons to allow multiple BLK namespaces per REGION is so that each -BLK-namespace can be configured with a BTT with unique atomic sector -sizes. While a PMEM device can host a BTT the LABEL specification does -not provide for a sector size to be specified for a PMEM namespace. - -This is due to the expectation that the primary usage model for PMEM is -via DAX, and the BTT is incompatible with DAX. However, for the cases -where an application or filesystem still needs atomic sector update -guarantees it can register a BTT on a PMEM device or partition. 
See -LIBNVDIMM/NDCTL: Block Translation Table "btt" - - -Example NVDIMM Platform -======================= - -For the remainder of this document the following diagram will be -referenced for any example sysfs layouts:: - - - (a) (b) DIMM BLK-REGION - +-------------------+--------+--------+--------+ - +------+ | pm0.0 | blk2.0 | pm1.0 | blk2.1 | 0 region2 - | imc0 +--+- - - region0- - - +--------+ +--------+ - +--+---+ | pm0.0 | blk3.0 | pm1.0 | blk3.1 | 1 region3 - | +-------------------+--------v v--------+ - +--+---+ | | - | cpu0 | region1 - +--+---+ | | - | +----------------------------^ ^--------+ - +--+---+ | blk4.0 | pm1.0 | blk4.0 | 2 region4 - | imc1 +--+----------------------------| +--------+ - +------+ | blk5.0 | pm1.0 | blk5.0 | 3 region5 - +----------------------------+--------+--------+ - -In this platform we have four DIMMs and two memory controllers in one -socket. Each unique interface (BLK or PMEM) to DPA space is identified -by a region device with a dynamically assigned id (REGION0 - REGION5). - - 1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A - single PMEM namespace is created in the REGION0-SPA-range that spans most - of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that - interleaved system-physical-address range is reclaimed as BLK-aperture - accessed space starting at DPA-offset (a) into each DIMM. In that - reclaimed space we create two BLK-aperture "namespaces" from REGION2 and - REGION3 where "blk2.0" and "blk3.0" are just human readable names that - could be set to any user-desired name in the LABEL. - - 2. In the last portion of DIMM0 and DIMM1 we have an interleaved - system-physical-address range, REGION1, that spans those two DIMMs as - well as DIMM2 and DIMM3. Some of REGION1 is allocated to a PMEM namespace - named "pm1.0", the rest is reclaimed in 4 BLK-aperture namespaces (for - each DIMM in the interleave set), "blk2.1", "blk3.1", "blk4.0", and - "blk5.0". - - 3. 
The portions of DIMM2 and DIMM3 that do not participate in the REGION1 - interleaved system-physical-address range (i.e. the DPA address past - offset (b)) are also included in the "blk4.0" and "blk5.0" namespaces. - Note that this example shows that BLK-aperture namespaces don't need to - be contiguous in DPA-space. - - This bus is provided by the kernel under the device - /sys/devices/platform/nfit_test.0 when CONFIG_NFIT_TEST is enabled and - the nfit_test.ko module is loaded. This not only tests LIBNVDIMM but the - acpi_nfit.ko driver as well. - - -LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API -======================================================== - -What follows is a description of the LIBNVDIMM sysfs layout and a -corresponding object hierarchy diagram as viewed through the LIBNDCTL -API. The example sysfs paths and diagrams are relative to the Example -NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit -test. - -LIBNDCTL: Context ------------------ - -Every API call in the LIBNDCTL library requires a context that holds the -logging parameters and other library instance state. The library is -based on the libabc template: - - https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git - -LIBNDCTL: instantiate a new library context example -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -:: - - struct ndctl_ctx *ctx; - - if (ndctl_new(&ctx) == 0) - return ctx; - else - return NULL; - -LIBNVDIMM/LIBNDCTL: Bus ------------------------ - -A bus has a 1:1 relationship with an NFIT. The current expectation for -ACPI based systems is that there is only ever one platform-global NFIT. -That said, it is trivial to register multiple NFITs; the specification -does not preclude it. The infrastructure supports multiple busses and -we use this capability to test multiple NFIT configurations in the unit -test.
- -LIBNVDIMM: control class device in /sys/class ---------------------------------------------- - -This character device accepts DSM messages to be passed to DIMM -identified by its NFIT handle:: - - /sys/class/nd/ndctl0 - |-- dev - |-- device -> ../../../ndbus0 - |-- subsystem -> ../../../../../../../class/nd - - - -LIBNVDIMM: bus --------------- - -:: - - struct nvdimm_bus *nvdimm_bus_register(struct device *parent, - struct nvdimm_bus_descriptor *nfit_desc); - -:: - - /sys/devices/platform/nfit_test.0/ndbus0 - |-- commands - |-- nd - |-- nfit - |-- nmem0 - |-- nmem1 - |-- nmem2 - |-- nmem3 - |-- power - |-- provider - |-- region0 - |-- region1 - |-- region2 - |-- region3 - |-- region4 - |-- region5 - |-- uevent - `-- wait_probe - -LIBNDCTL: bus enumeration example -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Find the bus handle that describes the bus from Example NVDIMM Platform:: - - static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx, - const char *provider) - { - struct ndctl_bus *bus; - - ndctl_bus_foreach(ctx, bus) - if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0) - return bus; - - return NULL; - } - - bus = get_bus_by_provider(ctx, "nfit_test.0"); - - -LIBNVDIMM/LIBNDCTL: DIMM (NMEM) -------------------------------- - -The DIMM device provides a character device for sending commands to -hardware, and it is a container for LABELs. If the DIMM is defined by -NFIT then an optional 'nfit' attribute sub-directory is available to add -NFIT-specifics. - -Note that the kernel device name for "DIMMs" is "nmemX". The NFIT -describes these devices via "Memory Device to System Physical Address -Range Mapping Structure", and there is no requirement that they actually -be physical DIMMs, so we use a more generic name. 
- -LIBNVDIMM: DIMM (NMEM) -^^^^^^^^^^^^^^^^^^^^^^ - -:: - - struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data, - const struct attribute_group **groups, unsigned long flags, - unsigned long *dsm_mask); - -:: - - /sys/devices/platform/nfit_test.0/ndbus0 - |-- nmem0 - | |-- available_slots - | |-- commands - | |-- dev - | |-- devtype - | |-- driver -> ../../../../../bus/nd/drivers/nvdimm - | |-- modalias - | |-- nfit - | | |-- device - | | |-- format - | | |-- handle - | | |-- phys_id - | | |-- rev_id - | | |-- serial - | | `-- vendor - | |-- state - | |-- subsystem -> ../../../../../bus/nd - | `-- uevent - |-- nmem1 - [..] - - -LIBNDCTL: DIMM enumeration example -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Note, in this example we are assuming NFIT-defined DIMMs which are -identified by an "nfit_handle" a 32-bit value where: - - - Bit 3:0 DIMM number within the memory channel - - Bit 7:4 memory channel number - - Bit 11:8 memory controller ID - - Bit 15:12 socket ID (within scope of a Node controller if node - controller is present) - - Bit 27:16 Node Controller ID - - Bit 31:28 Reserved - -:: - - static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus, - unsigned int handle) - { - struct ndctl_dimm *dimm; - - ndctl_dimm_foreach(bus, dimm) - if (ndctl_dimm_get_handle(dimm) == handle) - return dimm; - - return NULL; - } - - #define DIMM_HANDLE(n, s, i, c, d) \ - (((n & 0xfff) << 16) | ((s & 0xf) << 12) | ((i & 0xf) << 8) \ - | ((c & 0xf) << 4) | (d & 0xf)) - - dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0)); - -LIBNVDIMM/LIBNDCTL: Region --------------------------- - -A generic REGION device is registered for each PMEM range or BLK-aperture -set. Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture -sets on the "nfit_test.0" bus. The primary role of regions are to be a -container of "mappings". A mapping is a tuple of . - -LIBNVDIMM provides a built-in driver for these REGION devices. 
This driver -is responsible for reconciling the aliased DPA mappings across all -regions, parsing the LABEL, if present, and then emitting NAMESPACE -devices with the resolved/exclusive DPA-boundaries for the nd_pmem or -nd_blk device driver to consume. - -In addition to the generic attributes of "mapping"s, "interleave_ways" -and "size" the REGION device also exports some convenience attributes. -"nstype" indicates the integer type of namespace-device this region -emits, "devtype" duplicates the DEVTYPE variable stored by udev at the -'add' event, "modalias" duplicates the MODALIAS variable stored by udev -at the 'add' event, and finally, the optional "spa_index" is provided in -the case where the region is defined by a SPA. - -LIBNVDIMM: region:: - - struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus, - struct nd_region_desc *ndr_desc); - struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus, - struct nd_region_desc *ndr_desc); - -:: - - /sys/devices/platform/nfit_test.0/ndbus0 - |-- region0 - | |-- available_size - | |-- btt0 - | |-- btt_seed - | |-- devtype - | |-- driver -> ../../../../../bus/nd/drivers/nd_region - | |-- init_namespaces - | |-- mapping0 - | |-- mapping1 - | |-- mappings - | |-- modalias - | |-- namespace0.0 - | |-- namespace_seed - | |-- numa_node - | |-- nfit - | | `-- spa_index - | |-- nstype - | |-- set_cookie - | |-- size - | |-- subsystem -> ../../../../../bus/nd - | `-- uevent - |-- region1 - [..] 
- -LIBNDCTL: region enumeration example -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Sample region retrieval routines based on NFIT-unique data like -"spa_index" (interleave set id) for PMEM and "nfit_handle" (dimm id) for -BLK:: - - static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus, - unsigned int spa_index) - { - struct ndctl_region *region; - - ndctl_region_foreach(bus, region) { - if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM) - continue; - if (ndctl_region_get_spa_index(region) == spa_index) - return region; - } - return NULL; - } - - static struct ndctl_region *get_blk_region_by_dimm_handle(struct ndctl_bus *bus, - unsigned int handle) - { - struct ndctl_region *region; - - ndctl_region_foreach(bus, region) { - struct ndctl_mapping *map; - - if (ndctl_region_get_type(region) != ND_DEVICE_REGION_BLOCK) - continue; - ndctl_mapping_foreach(region, map) { - struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map); - - if (ndctl_dimm_get_handle(dimm) == handle) - return region; - } - } - return NULL; - } - - -Why Not Encode the Region Type into the Region Name? ----------------------------------------------------- - -At first glance it seems since NFIT defines just PMEM and BLK interface -types that we should simply name REGION devices with something derived -from those type names. However, the ND subsystem explicitly keeps the -REGION name generic and expects userspace to always consider the -region-attributes for four reasons: - - 1. There are already more than two REGION and "namespace" types. For - PMEM there are two subtypes. As mentioned previously we have PMEM where - the constituent DIMM devices are known and anonymous PMEM. For BLK - regions the NFIT specification already anticipates vendor specific - implementations. The exact distinction of what a region contains is in - the region-attributes not the region-name or the region-devtype. - - 2. A region with zero child-namespaces is a possible configuration. 
For - example, the NFIT allows for a DCR to be published without a - corresponding BLK-aperture. This equates to a DIMM that can only accept - control/configuration messages, but no i/o through a descendant block - device. Again, this "type" is advertised in the attributes ('mappings' - == 0) and the name does not tell you much. - - 3. What if a third major interface type arises in the future? Outside - of vendor specific implementations, it's not difficult to envision a - third class of interface type beyond BLK and PMEM. With a generic name - for the REGION level of the device-hierarchy old userspace - implementations can still make sense of new kernel advertised - region-types. Userspace can always rely on the generic region - attributes like "mappings", "size", etc and the expected child devices - named "namespace". This generic format of the device-model hierarchy - allows the LIBNVDIMM and LIBNDCTL implementations to be more uniform and - future-proof. - - 4. There are more robust mechanisms for determining the major type of a - region than a device name. See the next section, How Do I Determine the - Major Type of a Region? - -How Do I Determine the Major Type of a Region? ----------------------------------------------- - -Outside of the blanket recommendation of "use libndctl", or simply -looking at the kernel header (/usr/include/linux/ndctl.h) to decode the -"nstype" integer attribute, here are some other options. - -1. module alias lookup -^^^^^^^^^^^^^^^^^^^^^^ - - The whole point of region/namespace device type differentiation is to - decide which block-device driver will attach to a given LIBNVDIMM namespace. - One can simply use the modalias to lookup the resulting module. It's - important to note that this method is robust in the presence of a - vendor-specific driver down the road. If a vendor-specific - implementation wants to supplant the standard nd_blk driver it can with - minimal impact to the rest of LIBNVDIMM. 
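The first half of that lookup can be sketched as follows, assuming the modalias string has the form "nd:t<N>" (as in the udev output reproduced in this document), optionally followed by the newline that sysfs appends. ``nd_modalias_to_type()`` is a hypothetical helper written for this illustration, not a libndctl or kernel function; mapping the alias to an actual module is left to the system's module tooling.

```c
#include <stdio.h>

/*
 * Hypothetical helper: extract the integer device type from an
 * "nd:t<N>" modalias string. Returns N, or -1 if the string does not
 * have that form. A single trailing newline is tolerated because
 * sysfs attribute reads normally end with one.
 */
static int nd_modalias_to_type(const char *modalias)
{
	int type;
	char tail;
	int n;

	if (!modalias)
		return -1;

	/* %c only matches if something follows the number. */
	n = sscanf(modalias, "nd:t%d%c", &type, &tail);
	if (n == 1 || (n == 2 && tail == '\n'))
		return type < 0 ? -1 : type;

	return -1;
}
```

The returned integer is the same value that the "nstype" attribute and the constants in the ndctl kernel header describe, so a script can compare it against those constants rather than against the region name.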
- - In fact, a vendor may also want to have a vendor-specific region-driver - (outside of nd_region). For example, if a vendor defined its own LABEL - format it would need its own region driver to parse that LABEL and emit - the resulting namespaces. The output from module resolution is more - accurate than a region-name or region-devtype. - -2. udev -^^^^^^^ - - The kernel "devtype" is registered in the udev database:: - - # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region0 - P: /devices/platform/nfit_test.0/ndbus0/region0 - E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region0 - E: DEVTYPE=nd_pmem - E: MODALIAS=nd:t2 - E: SUBSYSTEM=nd - - # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region4 - P: /devices/platform/nfit_test.0/ndbus0/region4 - E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region4 - E: DEVTYPE=nd_blk - E: MODALIAS=nd:t3 - E: SUBSYSTEM=nd - - ...and is available as a region attribute, but keep in mind that the - "devtype" does not indicate sub-type variations and scripts should - really be understanding the other attributes. - -3. type specific attributes -^^^^^^^^^^^^^^^^^^^^^^^^^^^ - - As it currently stands a BLK-aperture region will never have a - "nfit/spa_index" attribute, but neither will a non-NFIT PMEM region. A - BLK region with a "mappings" value of 0 is, as mentioned above, a DIMM - that does not allow I/O. A PMEM region with a "mappings" value of zero - is a simple system-physical-address range. - - -LIBNVDIMM/LIBNDCTL: Namespace ------------------------------ - -A REGION, after resolving DPA aliasing and LABEL specified boundaries, -surfaces one or more "namespace" devices. The arrival of a "namespace" -device currently triggers either the nd_blk or nd_pmem driver to load -and register a disk/block device. 
-
-LIBNVDIMM: namespace
-^^^^^^^^^^^^^^^^^^^^
-
-Here is a sample layout from the three major types of NAMESPACE, where
-namespace0.0 represents DIMM-info-backed PMEM (note that it has a 'uuid'
-attribute), namespace2.0 represents a BLK namespace (note that it has a
-'sector_size' attribute), and namespace6.0 represents an anonymous
-PMEM namespace (note that it has no 'uuid' attribute because it does not
-support a LABEL)::
-
-    /sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0
-    |-- alt_name
-    |-- devtype
-    |-- dpa_extents
-    |-- force_raw
-    |-- modalias
-    |-- numa_node
-    |-- resource
-    |-- size
-    |-- subsystem -> ../../../../../../bus/nd
-    |-- type
-    |-- uevent
-    `-- uuid
-    /sys/devices/platform/nfit_test.0/ndbus0/region2/namespace2.0
-    |-- alt_name
-    |-- devtype
-    |-- dpa_extents
-    |-- force_raw
-    |-- modalias
-    |-- numa_node
-    |-- sector_size
-    |-- size
-    |-- subsystem -> ../../../../../../bus/nd
-    |-- type
-    |-- uevent
-    `-- uuid
-    /sys/devices/platform/nfit_test.1/ndbus1/region6/namespace6.0
-    |-- block
-    |   `-- pmem0
-    |-- devtype
-    |-- driver -> ../../../../../../bus/nd/drivers/pmem
-    |-- force_raw
-    |-- modalias
-    |-- numa_node
-    |-- resource
-    |-- size
-    |-- subsystem -> ../../../../../../bus/nd
-    |-- type
-    `-- uevent
-
-LIBNDCTL: namespace enumeration example
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Namespaces are indexed relative to their parent region, as in the example
-below. These indexes are mostly static from boot to boot, but the subsystem
-makes no guarantees in this regard. For a static namespace identifier use its
-'uuid' attribute.
-
-::
-
-    static struct ndctl_namespace
-    *get_namespace_by_id(struct ndctl_region *region, unsigned int id)
-    {
-            struct ndctl_namespace *ndns;
-
-            ndctl_namespace_foreach(region, ndns)
-                    if (ndctl_namespace_get_id(ndns) == id)
-                            return ndns;
-
-            return NULL;
-    }
-
-LIBNDCTL: namespace creation example
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Idle namespaces are automatically created by the kernel if a given
-region has enough available capacity to create a new namespace.
-Namespace instantiation involves finding an idle namespace and
-configuring it. For the most part the namespace attributes can be set
-in any order; the only constraint is that 'uuid' must be set
-before 'size'. This enables the kernel to track DPA allocations
-internally with a static identifier::
-
-    static int configure_namespace(struct ndctl_region *region,
-                    struct ndctl_namespace *ndns,
-                    struct namespace_parameters *parameters)
-    {
-            char devname[50];
-
-            snprintf(devname, sizeof(devname), "namespace%d.%d",
-                            ndctl_region_get_id(region), parameters->id);
-
-            ndctl_namespace_set_alt_name(ndns, devname);
-            /* 'uuid' must be set prior to setting size! */
-            ndctl_namespace_set_uuid(ndns, parameters->uuid);
-            ndctl_namespace_set_size(ndns, parameters->size);
-            /* unlike pmem namespaces, blk namespaces have a sector size */
-            if (parameters->lbasize)
-                    ndctl_namespace_set_sector_size(ndns, parameters->lbasize);
-            /* propagate the enable result to the caller */
-            return ndctl_namespace_enable(ndns);
-    }
-
-
-Why the Term "namespace"?
-^^^^^^^^^^^^^^^^^^^^^^^^^
-
-    1. Why not "volume" for instance? "volume" ran the risk of confusing
-       ND (libnvdimm subsystem) with a volume manager like device-mapper.
-
-    2. The term originated to describe the sub-devices that can be created
-       within an NVME controller (see the nvme specification:
-       http://www.nvmexpress.org/specifications/), and NFIT namespaces are
-       meant to parallel the capabilities and configurability of
-       NVME-namespaces.
-
-
-LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
--------------------------------------------------
-
-A BTT (design document: http://pmem.io/2014/09/23/btt.html) is a stacked
-block device driver that fronts either the whole block device or a
-partition of a block device emitted by either a PMEM or BLK NAMESPACE.
-
-LIBNVDIMM: btt layout
-^^^^^^^^^^^^^^^^^^^^^
-
-Every region will start out with at least one BTT device which is the
-seed device. To activate it set the "namespace", "uuid", and
-"sector_size" attributes and then bind the device to the nd_pmem or
-nd_blk driver depending on the region type::
-
-    /sys/devices/platform/nfit_test.1/ndbus0/region0/btt0/
-    |-- namespace
-    |-- delete
-    |-- devtype
-    |-- modalias
-    |-- numa_node
-    |-- sector_size
-    |-- subsystem -> ../../../../../bus/nd
-    |-- uevent
-    `-- uuid
-
-LIBNDCTL: btt creation example
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Similar to namespaces, an idle BTT device is automatically created per
-region. Each time this "seed" btt device is configured and enabled a new
-seed is created. Creating a BTT configuration involves two steps:
-finding an idle BTT and assigning it to consume a PMEM or BLK namespace::
-
-    static struct ndctl_btt *get_idle_btt(struct ndctl_region *region)
-    {
-            struct ndctl_btt *btt;
-
-            ndctl_btt_foreach(region, btt)
-                    if (!ndctl_btt_is_enabled(btt)
-                                    && !ndctl_btt_is_configured(btt))
-                            return btt;
-
-            return NULL;
-    }
-
-    static int configure_btt(struct ndctl_region *region,
-                    struct btt_parameters *parameters)
-    {
-            struct ndctl_btt *btt = get_idle_btt(region);
-
-            /* no idle seed device available in this region */
-            if (!btt)
-                    return -ENXIO;
-
-            ndctl_btt_set_uuid(btt, parameters->uuid);
-            ndctl_btt_set_sector_size(btt, parameters->sector_size);
-            ndctl_btt_set_namespace(btt, parameters->ndns);
-            /* turn off raw mode device */
-            ndctl_namespace_disable(parameters->ndns);
-            /* turn on btt access */
-            return ndctl_btt_enable(btt);
-    }
-
-Once instantiated, a new inactive btt seed device will appear underneath
-the region.
- -Once a "namespace" is removed from a BTT that instance of the BTT device -will be deleted or otherwise reset to default values. This deletion is -only at the device model level. In order to destroy a BTT the "info -block" needs to be destroyed. Note, that to destroy a BTT the media -needs to be written in raw mode. By default, the kernel will autodetect -the presence of a BTT and disable raw mode. This autodetect behavior -can be suppressed by enabling raw mode for the namespace via the -ndctl_namespace_set_raw_mode() API. - - -Summary LIBNDCTL Diagram ------------------------- - -For the given example above, here is the view of the objects as seen by the -LIBNDCTL API:: - - +---+ - |CTX| +---------+ +--------------+ +---------------+ - +-+-+ +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" | - | | +---------+ +--------------+ +---------------+ - +-------+ | | +---------+ +--------------+ +---------------+ - | DIMM0 <-+ | +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" | - +-------+ | | | +---------+ +--------------+ +---------------+ - | DIMM1 <-+ +-v--+ | +---------+ +--------------+ +---------------+ - +-------+ +-+BUS0+---> REGION2 +-+-> NAMESPACE2.0 +--> ND6 "blk2.0" | - | DIMM2 <-+ +----+ | +---------+ | +--------------+ +----------------------+ - +-------+ | | +-> NAMESPACE2.1 +--> ND5 "blk2.1" | BTT2 | - | DIMM3 <-+ | +--------------+ +----------------------+ - +-------+ | +---------+ +--------------+ +---------------+ - +-> REGION3 +-+-> NAMESPACE3.0 +--> ND4 "blk3.0" | - | +---------+ | +--------------+ +----------------------+ - | +-> NAMESPACE3.1 +--> ND3 "blk3.1" | BTT1 | - | +--------------+ +----------------------+ - | +---------+ +--------------+ +---------------+ - +-> REGION4 +---> NAMESPACE4.0 +--> ND2 "blk4.0" | - | +---------+ +--------------+ +---------------+ - | +---------+ +--------------+ +----------------------+ - +-> REGION5 +---> NAMESPACE5.0 +--> ND1 "blk5.0" | BTT0 | - +---------+ +--------------+ +---------------+------+ diff 
--git a/Documentation/nvdimm/security.rst b/Documentation/nvdimm/security.rst deleted file mode 100644 index ad9dea099b34..000000000000 --- a/Documentation/nvdimm/security.rst +++ /dev/null @@ -1,143 +0,0 @@ -=============== -NVDIMM Security -=============== - -1. Introduction ---------------- - -With the introduction of Intel Device Specific Methods (DSM) v1.8 -specification [1], security DSMs are introduced. The spec added the following -security DSMs: "get security state", "set passphrase", "disable passphrase", -"unlock unit", "freeze lock", "secure erase", and "overwrite". A security_ops -data structure has been added to struct dimm in order to support the security -operations and generic APIs are exposed to allow vendor neutral operations. - -2. Sysfs Interface ------------------- -The "security" sysfs attribute is provided in the nvdimm sysfs directory. For -example: -/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/security - -The "show" attribute of that attribute will display the security state for -that DIMM. The following states are available: disabled, unlocked, locked, -frozen, and overwrite. If security is not supported, the sysfs attribute -will not be visible. - -The "store" attribute takes several commands when it is being written to -in order to support some of the security functionalities: -update - enable or update passphrase. -disable - disable enabled security and remove key. -freeze - freeze changing of security states. -erase - delete existing user encryption key. -overwrite - wipe the entire nvdimm. -master_update - enable or update master passphrase. -master_erase - delete existing user encryption key. - -3. Key Management ------------------ - -The key is associated to the payload by the DIMM id. For example: -# cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/nfit/id -8089-a2-1740-00000133 -The DIMM id would be provided along with the key payload (passphrase) to -the kernel. 
- -The security keys are managed on the basis of a single key per DIMM. The -key "passphrase" is expected to be 32bytes long. This is similar to the ATA -security specification [2]. A key is initially acquired via the request_key() -kernel API call during nvdimm unlock. It is up to the user to make sure that -all the keys are in the kernel user keyring for unlock. - -A nvdimm encrypted-key of format enc32 has the description format of: -nvdimm: - -See file ``Documentation/security/keys/trusted-encrypted.rst`` for creating -encrypted-keys of enc32 format. TPM usage with a master trusted key is -preferred for sealing the encrypted-keys. - -4. Unlocking ------------- -When the DIMMs are being enumerated by the kernel, the kernel will attempt to -retrieve the key from the kernel user keyring. This is the only time -a locked DIMM can be unlocked. Once unlocked, the DIMM will remain unlocked -until reboot. Typically an entity (i.e. shell script) will inject all the -relevant encrypted-keys into the kernel user keyring during the initramfs phase. -This provides the unlock function access to all the related keys that contain -the passphrase for the respective nvdimms. It is also recommended that the -keys are injected before libnvdimm is loaded by modprobe. - -5. Update ---------- -When doing an update, it is expected that the existing key is removed from -the kernel user keyring and reinjected as different (old) key. It's irrelevant -what the key description is for the old key since we are only interested in the -keyid when doing the update operation. It is also expected that the new key -is injected with the description format described from earlier in this -document. The update command written to the sysfs attribute will be with -the format: -update - -If there is no old keyid due to a security enabling, then a 0 should be -passed in. - -6. Freeze ---------- -The freeze operation does not require any keys. 
The security config can be
-frozen by a user with root privilege.
-
-7. Disable
-----------
-The security disable command format is:
-disable
-
-A key with the current passphrase payload that is tied to the nvdimm should be
-in the kernel user keyring.
-
-8. Secure Erase
----------------
-The command format for doing a secure erase is:
-erase
-
-A key with the current passphrase payload that is tied to the nvdimm should be
-in the kernel user keyring.
-
-9. Overwrite
-------------
-The command format for doing an overwrite is:
-overwrite
-
-Overwrite can be done without a key if security is not enabled. A key serial
-of 0 can be passed in to indicate no key.
-
-The sysfs attribute "security" can be polled to wait on overwrite completion.
-Overwrite can last tens of minutes or more depending on nvdimm size.
-
-An encrypted-key with the current user passphrase that is tied to the nvdimm
-should be injected and its keyid should be passed in via sysfs.
-
-10. Master Update
------------------
-The command format for doing a master update is:
-master_update
-
-The operating mechanism for master update is identical to update except the
-master passphrase key is passed to the kernel. The master passphrase key
-is just another encrypted-key.
-
-This command is only available when security is disabled.
-
-11. Master Erase
-----------------
-The command format for doing a master erase is:
-master_erase
-
-This command has the same operating mechanism as erase except the master
-passphrase key is passed to the kernel. The master passphrase key is just
-another encrypted-key.
-
-This command is only available when the master security is enabled, indicated
-by the extended security status.
- -[1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf - -[2]: http://www.t13.org/documents/UploadedDocuments/docs2006/e05179r4-ACS-SecurityClarifications.pdf diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig index e89c1c332407..a5fde15e91d3 100644 --- a/drivers/nvdimm/Kconfig +++ b/drivers/nvdimm/Kconfig @@ -33,7 +33,7 @@ config BLK_DEV_PMEM Documentation/admin-guide/kernel-parameters.rst). This driver converts these persistent memory ranges into block devices that are capable of DAX (direct-access) file system mappings. See - Documentation/nvdimm/nvdimm.rst for more details. + Documentation/driver-api/nvdimm/nvdimm.rst for more details. Say Y if you want to use an NVDIMM -- cgit v1.2.3 From 43f6c0787c1781b951d686e8302377fcf85ccb8a Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 18 Jun 2019 16:40:16 -0300 Subject: docs: mtd: move it to the driver-api book While I was tempted to move it to admin-guide, as some docs there are more userspace-faced, there are some very technical discussions about memory error correction code from the Kernel implementer's PoV. So, let's place it inside the driver-api book. 
Signed-off-by: Mauro Carvalho Chehab --- Documentation/driver-api/index.rst | 1 + Documentation/driver-api/mtd/index.rst | 10 + Documentation/driver-api/mtd/intel-spi.rst | 90 ++++ Documentation/driver-api/mtd/nand_ecc.rst | 763 +++++++++++++++++++++++++++++ Documentation/driver-api/mtd/spi-nor.rst | 66 +++ Documentation/mtd/index.rst | 12 - Documentation/mtd/intel-spi.rst | 90 ---- Documentation/mtd/nand_ecc.rst | 763 ----------------------------- Documentation/mtd/spi-nor.rst | 66 --- drivers/mtd/nand/raw/nand_ecc.c | 2 +- 10 files changed, 931 insertions(+), 932 deletions(-) create mode 100644 Documentation/driver-api/mtd/index.rst create mode 100644 Documentation/driver-api/mtd/intel-spi.rst create mode 100644 Documentation/driver-api/mtd/nand_ecc.rst create mode 100644 Documentation/driver-api/mtd/spi-nor.rst delete mode 100644 Documentation/mtd/index.rst delete mode 100644 Documentation/mtd/intel-spi.rst delete mode 100644 Documentation/mtd/nand_ecc.rst delete mode 100644 Documentation/mtd/spi-nor.rst (limited to 'Documentation/driver-api') diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index 410dd7110772..7ecc65093493 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -44,6 +44,7 @@ available subsections can be seen below. mtdnand miscellaneous mei/index + mtd/index nvdimm/index w1 rapidio/index diff --git a/Documentation/driver-api/mtd/index.rst b/Documentation/driver-api/mtd/index.rst new file mode 100644 index 000000000000..2e0e7cc4055e --- /dev/null +++ b/Documentation/driver-api/mtd/index.rst @@ -0,0 +1,10 @@ +============================== +Memory Technology Device (MTD) +============================== + +.. 
toctree:: + :maxdepth: 1 + + intel-spi + nand_ecc + spi-nor diff --git a/Documentation/driver-api/mtd/intel-spi.rst b/Documentation/driver-api/mtd/intel-spi.rst new file mode 100644 index 000000000000..0e6d9cd5388d --- /dev/null +++ b/Documentation/driver-api/mtd/intel-spi.rst @@ -0,0 +1,90 @@ +============================== +Upgrading BIOS using intel-spi +============================== + +Many Intel CPUs like Baytrail and Braswell include SPI serial flash host +controller which is used to hold BIOS and other platform specific data. +Since contents of the SPI serial flash is crucial for machine to function, +it is typically protected by different hardware protection mechanisms to +avoid accidental (or on purpose) overwrite of the content. + +Not all manufacturers protect the SPI serial flash, mainly because it +allows upgrading the BIOS image directly from an OS. + +The intel-spi driver makes it possible to read and write the SPI serial +flash, if certain protection bits are not set and locked. If it finds +any of them set, the whole MTD device is made read-only to prevent +partial overwrites. By default the driver exposes SPI serial flash +contents as read-only but it can be changed from kernel command line, +passing "intel-spi.writeable=1". + +Please keep in mind that overwriting the BIOS image on SPI serial flash +might render the machine unbootable and requires special equipment like +Dediprog to revive. You have been warned! + +Below are the steps how to upgrade MinnowBoard MAX BIOS directly from +Linux. + + 1) Download and extract the latest Minnowboard MAX BIOS SPI image + [1]. At the time writing this the latest image is v92. + + 2) Install mtd-utils package [2]. We need this in order to erase the SPI + serial flash. Distros like Debian and Fedora have this prepackaged with + name "mtd-utils". 
+ + 3) Add "intel-spi.writeable=1" to the kernel command line and reboot + the board (you can also reload the driver passing "writeable=1" as + module parameter to modprobe). + + 4) Once the board is up and running again, find the right MTD partition + (it is named as "BIOS"):: + + # cat /proc/mtd + dev: size erasesize name + mtd0: 00800000 00001000 "BIOS" + + So here it will be /dev/mtd0 but it may vary. + + 5) Make backup of the existing image first:: + + # dd if=/dev/mtd0ro of=bios.bak + 16384+0 records in + 16384+0 records out + 8388608 bytes (8.4 MB) copied, 10.0269 s, 837 kB/s + + 6) Verify the backup: + + # sha1sum /dev/mtd0ro bios.bak + fdbb011920572ca6c991377c4b418a0502668b73 /dev/mtd0ro + fdbb011920572ca6c991377c4b418a0502668b73 bios.bak + + The SHA1 sums must match. Otherwise do not continue any further! + + 7) Erase the SPI serial flash. After this step, do not reboot the + board! Otherwise it will not start anymore:: + + # flash_erase /dev/mtd0 0 0 + Erasing 4 Kibyte @ 7ff000 -- 100 % complete + + 8) Once completed without errors you can write the new BIOS image: + + # dd if=MNW2MAX1.X64.0092.R01.1605221712.bin of=/dev/mtd0 + + 9) Verify that the new content of the SPI serial flash matches the new + BIOS image:: + + # sha1sum /dev/mtd0ro MNW2MAX1.X64.0092.R01.1605221712.bin + 9b4df9e4be2057fceec3a5529ec3d950836c87a2 /dev/mtd0ro + 9b4df9e4be2057fceec3a5529ec3d950836c87a2 MNW2MAX1.X64.0092.R01.1605221712.bin + + The SHA1 sums should match. + + 10) Now you can reboot your board and observe the new BIOS starting up + properly. 
+ +References +---------- + +[1] https://firmware.intel.com/sites/default/files/MinnowBoard%2EMAX_%2EX64%2E92%2ER01%2Ezip + +[2] http://www.linux-mtd.infradead.org/ diff --git a/Documentation/driver-api/mtd/nand_ecc.rst b/Documentation/driver-api/mtd/nand_ecc.rst new file mode 100644 index 000000000000..e8d3c53a5056 --- /dev/null +++ b/Documentation/driver-api/mtd/nand_ecc.rst @@ -0,0 +1,763 @@ +========================== +NAND Error-correction Code +========================== + +Introduction +============ + +Having looked at the linux mtd/nand driver and more specific at nand_ecc.c +I felt there was room for optimisation. I bashed the code for a few hours +performing tricks like table lookup removing superfluous code etc. +After that the speed was increased by 35-40%. +Still I was not too happy as I felt there was additional room for improvement. + +Bad! I was hooked. +I decided to annotate my steps in this file. Perhaps it is useful to someone +or someone learns something from it. + + +The problem +=========== + +NAND flash (at least SLC one) typically has sectors of 256 bytes. +However NAND flash is not extremely reliable so some error detection +(and sometimes correction) is needed. + +This is done by means of a Hamming code. I'll try to explain it in +laymans terms (and apologies to all the pro's in the field in case I do +not use the right terminology, my coding theory class was almost 30 +years ago, and I must admit it was not one of my favourites). + +As I said before the ecc calculation is performed on sectors of 256 +bytes. This is done by calculating several parity bits over the rows and +columns. The parity used is even parity which means that the parity bit = 1 +if the data over which the parity is calculated is 1 and the parity bit = 0 +if the data over which the parity is calculated is 0. So the total +number of bits over the data over which the parity is calculated + the +parity bit is even. (see wikipedia if you can't follow this). 
+Parity is often calculated by means of an exclusive or operation, +sometimes also referred to as xor. In C the operator for xor is ^ + +Back to ecc. +Let's give a small figure: + +========= ==== ==== ==== ==== ==== ==== ==== ==== === === === === ==== +byte 0: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp4 ... rp14 +byte 1: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp2 rp4 ... rp14 +byte 2: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp4 ... rp14 +byte 3: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp4 ... rp14 +byte 4: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp5 ... rp14 +... +byte 254: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp5 ... rp15 +byte 255: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp5 ... rp15 + cp1 cp0 cp1 cp0 cp1 cp0 cp1 cp0 + cp3 cp3 cp2 cp2 cp3 cp3 cp2 cp2 + cp5 cp5 cp5 cp5 cp4 cp4 cp4 cp4 +========= ==== ==== ==== ==== ==== ==== ==== ==== === === === === ==== + +This figure represents a sector of 256 bytes. +cp is my abbreviation for column parity, rp for row parity. + +Let's start to explain column parity. + +- cp0 is the parity that belongs to all bit0, bit2, bit4, bit6. + + so the sum of all bit0, bit2, bit4 and bit6 values + cp0 itself is even. + +Similarly cp1 is the sum of all bit1, bit3, bit5 and bit7. + +- cp2 is the parity over bit0, bit1, bit4 and bit5 +- cp3 is the parity over bit2, bit3, bit6 and bit7. +- cp4 is the parity over bit0, bit1, bit2 and bit3. +- cp5 is the parity over bit4, bit5, bit6 and bit7. + +Note that each of cp0 .. cp5 is exactly one bit. + +Row parity actually works almost the same. + +- rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254) +- rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255) +- rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ... + (so handle two bytes, then skip 2 bytes). +- rp3 is covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...) +- for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc. 
+ + so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ...) +- and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, .. + +The story now becomes quite boring. I guess you get the idea. + +- rp6 covers 8 bytes then skips 8 etc +- rp7 skips 8 bytes then covers 8 etc +- rp8 covers 16 bytes then skips 16 etc +- rp9 skips 16 bytes then covers 16 etc +- rp10 covers 32 bytes then skips 32 etc +- rp11 skips 32 bytes then covers 32 etc +- rp12 covers 64 bytes then skips 64 etc +- rp13 skips 64 bytes then covers 64 etc +- rp14 covers 128 bytes then skips 128 +- rp15 skips 128 bytes then covers 128 + +In the end the parity bits are grouped together in three bytes as +follows: + +===== ===== ===== ===== ===== ===== ===== ===== ===== +ECC Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0 +===== ===== ===== ===== ===== ===== ===== ===== ===== +ECC 0 rp07 rp06 rp05 rp04 rp03 rp02 rp01 rp00 +ECC 1 rp15 rp14 rp13 rp12 rp11 rp10 rp09 rp08 +ECC 2 cp5 cp4 cp3 cp2 cp1 cp0 1 1 +===== ===== ===== ===== ===== ===== ===== ===== ===== + +I detected after writing this that ST application note AN1823 +(http://www.st.com/stonline/) gives a much +nicer picture.(but they use line parity as term where I use row parity) +Oh well, I'm graphically challenged, so suffer with me for a moment :-) + +And I could not reuse the ST picture anyway for copyright reasons. + + +Attempt 0 +========= + +Implementing the parity calculation is pretty simple. 
+In C pseudocode:: + + for (i = 0; i < 256; i++) + { + if (i & 0x01) + rp1 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1; + else + rp0 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp0; + if (i & 0x02) + rp3 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp3; + else + rp2 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp2; + if (i & 0x04) + rp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp5; + else + rp4 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp4; + if (i & 0x08) + rp7 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp7; + else + rp6 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp6; + if (i & 0x10) + rp9 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp9; + else + rp8 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp8; + if (i & 0x20) + rp11 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp11; + else + rp10 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp10; + if (i & 0x40) + rp13 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp13; + else + rp12 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp12; + if (i & 0x80) + rp15 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp15; + else + rp14 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp14; + cp0 = bit6 ^ bit4 ^ bit2 ^ bit0 ^ cp0; + cp1 = bit7 ^ bit5 ^ bit3 ^ bit1 ^ cp1; + cp2 = bit5 ^ bit4 ^ bit1 ^ bit0 ^ cp2; + cp3 = bit7 ^ bit6 ^ bit3 ^ bit2 ^ cp3 + cp4 = bit3 ^ bit2 ^ bit1 ^ bit0 ^ cp4 + cp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ cp5 + } + + +Analysis 0 +========== + +C does have bitwise operators but not really operators to do the above +efficiently (and most hardware has no such instructions either). +Therefore without implementing this it was clear that the code above was +not going to bring me a Nobel prize :-) + +Fortunately the exclusive or operation is commutative, so we can combine +the values in any order. 
So instead of calculating all the bits +individually, let us try to rearrange things. +For the column parity this is easy. We can just xor the bytes and in the +end filter out the relevant bits. This is pretty nice as it will bring +all cp calculation out of the for loop. + +Similarly we can first xor the bytes for the various rows. +This leads to: + + +Attempt 1 +========= + +:: + + const char parity[256] = { + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, + 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0 + }; + + void ecc1(const unsigned char *buf, unsigned char *code) + { + int i; + const unsigned char *bp = buf; + unsigned char cur; + unsigned char rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; + unsigned char rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; + unsigned char par; + + par = 0; + rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; + rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; + rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; + rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; + + for (i = 0; i < 256; i++) + { + cur = *bp++; + par ^= cur; + if (i & 0x01) rp1 ^= cur; else rp0 ^= cur; + if (i & 0x02) rp3 ^= cur; else rp2 ^= cur; + if (i & 0x04) rp5 ^= cur; else rp4 ^= cur; + if (i & 0x08) rp7 ^= cur; else rp6 ^= cur; + if (i & 0x10) rp9 ^= cur; else rp8 ^= cur; + if (i & 0x20) rp11 ^= cur; else rp10 ^= cur; + if (i 
& 0x40) rp13 ^= cur; else rp12 ^= cur; + if (i & 0x80) rp15 ^= cur; else rp14 ^= cur; + } + code[0] = + (parity[rp7] << 7) | + (parity[rp6] << 6) | + (parity[rp5] << 5) | + (parity[rp4] << 4) | + (parity[rp3] << 3) | + (parity[rp2] << 2) | + (parity[rp1] << 1) | + (parity[rp0]); + code[1] = + (parity[rp15] << 7) | + (parity[rp14] << 6) | + (parity[rp13] << 5) | + (parity[rp12] << 4) | + (parity[rp11] << 3) | + (parity[rp10] << 2) | + (parity[rp9] << 1) | + (parity[rp8]); + code[2] = + (parity[par & 0xf0] << 7) | + (parity[par & 0x0f] << 6) | + (parity[par & 0xcc] << 5) | + (parity[par & 0x33] << 4) | + (parity[par & 0xaa] << 3) | + (parity[par & 0x55] << 2); + code[0] = ~code[0]; + code[1] = ~code[1]; + code[2] = ~code[2]; + } + +Still pretty straightforward. The last three invert statements are there to +give a checksum of 0xff 0xff 0xff for an empty flash. In an empty flash +all data is 0xff, so the checksum then matches. + +I also introduced the parity lookup. I expected this to be the fastest +way to calculate the parity, but I will investigate alternatives later +on. + + +Analysis 1 +========== + +The code works, but is not terribly efficient. On my system it took +almost 4 times as much time as the linux driver code. But hey, if it was +*that* easy this would have been done long before. +No pain. no gain. + +Fortunately there is plenty of room for improvement. + +In step 1 we moved from bit-wise calculation to byte-wise calculation. +However in C we can also use the unsigned long data type and virtually +every modern microprocessor supports 32 bit operations, so why not try +to write our code in such a way that we process data in 32 bit chunks. + +Of course this means some modification as the row parity is byte by +byte. A quick analysis: +for the column parity we use the par variable. When extending to 32 bits +we can in the end easily calculate rp0 and rp1 from it. 
+(because par now consists of 4 bytes, contributing to rp1, rp0, rp1, rp0 +respectively, from MSB to LSB) +Also rp2 and rp3 can be easily retrieved from par as rp3 covers the +two most significant bytes and rp2 covers the two least significant bytes. + +Note that of course now the loop is executed only 64 times (256/4). +And note that care must be taken wrt byte ordering. The way bytes are +ordered in a long is machine dependent, and might affect us. +Anyway, if there is an issue: this code is developed on x86 (to be +precise: a DELL PC with a D920 Intel CPU) + +And of course the performance might depend on alignment, but I expect +that the I/O buffers in the nand driver are aligned properly (and +otherwise that should be fixed to get maximum performance). + +Let's give it a try... + + +Attempt 2 +========= + +:: + + extern const char parity[256]; + + void ecc2(const unsigned char *buf, unsigned char *code) + { + int i; + const unsigned long *bp = (unsigned long *)buf; + unsigned long cur; + unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; + unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; + unsigned long par; + + par = 0; + rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; + rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; + rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; + rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; + + for (i = 0; i < 64; i++) + { + cur = *bp++; + par ^= cur; + if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; + if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; + if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; + if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; + if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; + if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; + } + /* + we need to adapt the code generation for the fact that rp vars are now + long; also the column parity calculation needs to be changed. 
+ we'll bring rp4 to 15 back to single byte entities by shifting and + xoring + */ + rp4 ^= (rp4 >> 16); rp4 ^= (rp4 >> 8); rp4 &= 0xff; + rp5 ^= (rp5 >> 16); rp5 ^= (rp5 >> 8); rp5 &= 0xff; + rp6 ^= (rp6 >> 16); rp6 ^= (rp6 >> 8); rp6 &= 0xff; + rp7 ^= (rp7 >> 16); rp7 ^= (rp7 >> 8); rp7 &= 0xff; + rp8 ^= (rp8 >> 16); rp8 ^= (rp8 >> 8); rp8 &= 0xff; + rp9 ^= (rp9 >> 16); rp9 ^= (rp9 >> 8); rp9 &= 0xff; + rp10 ^= (rp10 >> 16); rp10 ^= (rp10 >> 8); rp10 &= 0xff; + rp11 ^= (rp11 >> 16); rp11 ^= (rp11 >> 8); rp11 &= 0xff; + rp12 ^= (rp12 >> 16); rp12 ^= (rp12 >> 8); rp12 &= 0xff; + rp13 ^= (rp13 >> 16); rp13 ^= (rp13 >> 8); rp13 &= 0xff; + rp14 ^= (rp14 >> 16); rp14 ^= (rp14 >> 8); rp14 &= 0xff; + rp15 ^= (rp15 >> 16); rp15 ^= (rp15 >> 8); rp15 &= 0xff; + rp3 = (par >> 16); rp3 ^= (rp3 >> 8); rp3 &= 0xff; + rp2 = par & 0xffff; rp2 ^= (rp2 >> 8); rp2 &= 0xff; + par ^= (par >> 16); + rp1 = (par >> 8); rp1 &= 0xff; + rp0 = (par & 0xff); + par ^= (par >> 8); par &= 0xff; + + code[0] = + (parity[rp7] << 7) | + (parity[rp6] << 6) | + (parity[rp5] << 5) | + (parity[rp4] << 4) | + (parity[rp3] << 3) | + (parity[rp2] << 2) | + (parity[rp1] << 1) | + (parity[rp0]); + code[1] = + (parity[rp15] << 7) | + (parity[rp14] << 6) | + (parity[rp13] << 5) | + (parity[rp12] << 4) | + (parity[rp11] << 3) | + (parity[rp10] << 2) | + (parity[rp9] << 1) | + (parity[rp8]); + code[2] = + (parity[par & 0xf0] << 7) | + (parity[par & 0x0f] << 6) | + (parity[par & 0xcc] << 5) | + (parity[par & 0x33] << 4) | + (parity[par & 0xaa] << 3) | + (parity[par & 0x55] << 2); + code[0] = ~code[0]; + code[1] = ~code[1]; + code[2] = ~code[2]; + } + +The parity array is not shown any more. 
Note also that for these +examples I kinda deviated from my regular programming style by allowing +multiple statements on a line, not using { } in then and else blocks +with only a single statement and by using operators like ^= + + +Analysis 2 +========== + +The code (of course) works, and hurray: we are a little bit faster than +the linux driver code (about 15%). But wait, don't cheer too quickly. +There is more to be gained. +If we look at e.g. rp14 and rp15 we see that we either xor our data with +rp14 or with rp15. However we also have par which goes over all data. +This means there is no need to calculate rp14 as it can be calculated from +rp15 through rp14 = par ^ rp15, because par = rp14 ^ rp15; +(or if desired we can avoid calculating rp15 and calculate it from +rp14). That is why some places refer to inverse parity. +Of course the same thing holds for rp4/5, rp6/7, rp8/9, rp10/11 and rp12/13. +Effectively this means we can eliminate the else clause from the if +statements. Also we can optimise the calculation in the end a little bit +by going from long to byte first. Actually we can even avoid the table +lookups + +Attempt 3 +========= + +Odd replaced:: + + if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; + if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; + if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; + if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; + if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; + if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; + +with:: + + if (i & 0x01) rp5 ^= cur; + if (i & 0x02) rp7 ^= cur; + if (i & 0x04) rp9 ^= cur; + if (i & 0x08) rp11 ^= cur; + if (i & 0x10) rp13 ^= cur; + if (i & 0x20) rp15 ^= cur; + +and outside the loop added:: + + rp4 = par ^ rp5; + rp6 = par ^ rp7; + rp8 = par ^ rp9; + rp10 = par ^ rp11; + rp12 = par ^ rp13; + rp14 = par ^ rp15; + +And after that the code takes about 30% more time, although the number of +statements is reduced. This is also reflected in the assembly code. + + +Analysis 3 +========== + +Very weird. 
Guess it has to do with caching or instruction parallelism +or so. I also tried on an eeePC (Celeron, clocked at 900 MHz). Interesting +observation was that this one is only 30% slower (according to time) +executing the code than my 3 GHz D920 processor. + +Well, it was expected not to be easy so maybe instead move to a +different track: let's move back to the code from attempt 2 and do some +loop unrolling. This will eliminate a few if statements. I'll try +different amounts of unrolling to see what works best. + + +Attempt 4 +========= + +Unrolled the loop 1, 2, 3 and 4 times. +For 4 the code starts with:: + + for (i = 0; i < 4; i++) + { + cur = *bp++; + par ^= cur; + rp4 ^= cur; + rp6 ^= cur; + rp8 ^= cur; + rp10 ^= cur; + if (i & 0x1) rp13 ^= cur; else rp12 ^= cur; + if (i & 0x2) rp15 ^= cur; else rp14 ^= cur; + cur = *bp++; + par ^= cur; + rp5 ^= cur; + rp6 ^= cur; + ... + + +Analysis 4 +========== + +Unrolling once gains about 15%. + +Unrolling twice keeps the gain at about 15%. + +Unrolling three times gives a gain of 30% compared to attempt 2. + +Unrolling four times gives a marginal improvement compared to unrolling +three times. + +I decided to proceed with a four times unrolled loop anyway. It was my gut +feeling that in the next steps I would obtain additional gain from it. + +The next step was triggered by the fact that par contains the xor of all +bytes and rp4 and rp5 each contain the xor of half of the bytes. +So in effect par = rp4 ^ rp5. But as xor is commutative we can also say +that rp5 = par ^ rp4. So no need to keep both rp4 and rp5 around. We can +eliminate rp5 (or rp4, but I already foresaw another optimisation). +The same holds for rp6/7, rp8/9, rp10/11, rp12/13 and rp14/15. + + +Attempt 5 +========= + +Effectively, all odd-numbered rp assignments in the loop were removed. +This included the else clause of the if statements. 
+Of course after the loop we need to correct things by adding code like:: + + rp5 = par ^ rp4; + +Also the initial assignments (rp5 = 0; etc) could be removed. +Along the line I also removed the initialisation of rp0/1/2/3. + + +Analysis 5 +========== + +Measurements showed this was a good move. The run-time roughly halved +compared with attempt 4 with 4 times unrolled, and we only require 1/3rd +of the processor time compared to the current code in the linux kernel. + +However, still I thought there was more. I didn't like all the if +statements. Why not keep a running parity and only keep the last if +statement? Time for yet another version! + + +Attempt 6 +========= + +The code within the for loop was changed to:: + + for (i = 0; i < 4; i++) + { + cur = *bp++; tmppar = cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; + + cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= cur; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; + + cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; rp8 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= cur; rp8 ^= cur; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp8 ^= cur; + cur = *bp++; tmppar ^= cur; rp8 ^= cur; + + cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= cur; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; + + par ^= tmppar; + if ((i & 0x1) == 0) rp12 ^= tmppar; + if ((i & 0x2) == 0) rp14 ^= tmppar; + } + +As you can see tmppar is used to accumulate the parity within a for +iteration. In the last 3 statements it is added to par and, if needed, +to rp12 and rp14. + +While making the changes I also found that I could exploit that tmppar +contains the running parity for this iteration. 
So instead of having: +rp4 ^= cur; rp6 ^= cur; +I removed the rp6 ^= cur; statement and did rp6 ^= tmppar; on the next +statement. A similar change was done for rp8 and rp10. + + +Analysis 6 +========== + +Measuring this code again showed a big gain. When executing the original +linux code 1 million times, this took about 1 second on my system. +(using time to measure the performance). After this iteration I was back +to 0.075 sec. Actually I had to decide to start measuring over 10 +million iterations in order not to lose too much accuracy. This one +definitely seemed to be the jackpot! + +There is a little bit more room for improvement though. There are three +places with statements:: + + rp4 ^= cur; rp6 ^= cur; + +It seems more efficient to also maintain a variable rp4_6 in the for +loop; this eliminates 3 statements per loop. Of course after the loop we +need to correct by adding:: + + rp4 ^= rp4_6; + rp6 ^= rp4_6; + +Furthermore there are 4 sequential assignments to rp8. This can be +encoded slightly more efficiently by saving tmppar before those 4 lines +and later do rp8 = rp8 ^ tmppar ^ notrp8; +(where notrp8 is the value of tmppar before those 4 lines). +Again a use of the commutative property of xor. +Time for a new test! 
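A minimal sketch of the rp8 shortcut described above, not the driver code: xoring four words into rp8 one by one should give the same result as xoring in the running parity after those words together with a snapshot of the running parity taken before them, because everything accumulated before the snapshot cancels out. The names xor_each, xor_snapshot and sample_words are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Hypothetical sample data: four consecutive words from a sector. */
static const unsigned long sample_words[4] = {
	0x12345678UL, 0x9abcdef0UL, 0x0f0f0f0fUL, 0xdeadbeefUL
};

/* Straightforward version: xor each of the four words into rp8. */
static unsigned long xor_each(unsigned long rp8, const unsigned long *w)
{
	int k;

	for (k = 0; k < 4; k++)
		rp8 ^= w[k];
	return rp8;
}

/*
 * Snapshot version: tmppar is the running parity accumulated so far.
 * Save it, let it absorb the four words, then fold the difference
 * into rp8 with a single statement.
 */
static unsigned long xor_snapshot(unsigned long rp8, unsigned long tmppar,
				  const unsigned long *w)
{
	unsigned long notrp8 = tmppar;	/* snapshot before the four words */
	int k;

	for (k = 0; k < 4; k++)
		tmppar ^= w[k];		/* the real loop updates tmppar anyway */
	return rp8 ^ tmppar ^ notrp8;	/* earlier contributions cancel out */
}
```

Both helpers should return the same value for any starting rp8 and any starting tmppar, which is why the four separate rp8 updates can be dropped from the loop body.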
+ + +Attempt 7 +========= + +The new code now looks like:: + + for (i = 0; i < 4; i++) + { + cur = *bp++; tmppar = cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; + + cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= cur; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; + + notrp8 = tmppar; + cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= cur; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; + rp8 = rp8 ^ tmppar ^ notrp8; + + cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; + cur = *bp++; tmppar ^= cur; rp6 ^= cur; + cur = *bp++; tmppar ^= cur; rp4 ^= cur; + cur = *bp++; tmppar ^= cur; + + par ^= tmppar; + if ((i & 0x1) == 0) rp12 ^= tmppar; + if ((i & 0x2) == 0) rp14 ^= tmppar; + } + rp4 ^= rp4_6; + rp6 ^= rp4_6; + + +Not a big change, but every penny counts :-) + + +Analysis 7 +========== + +Actually this made things worse. Not very much, but I don't want to move +in the wrong direction. Maybe something to investigate later. Could +have to do with caching again. + +Guess that is what there is to win within the loop. Maybe unrolling one +more time will help. I'll keep the optimisations from 7 for now. + + +Attempt 8 +========= + +Unrolled the loop one more time. + + +Analysis 8 +========== + +This makes things worse. Let's stick with attempt 6 and continue from there. +Although it seems that the code within the loop cannot be optimised +further there is still room to optimise the generation of the ecc codes. +We can simply calculate the total parity. If this is 0 then rp4 = rp5 +etc. If the parity is 1, then rp4 = !rp5; + +But if rp4 = rp5 we do not need rp5 etc. We can just write the even bits +in the result byte and then do something like:: + + code[0] |= (code[0] << 1); + +Let's test this. 
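The idea above can be sketched as follows, assuming the even-numbered parity bits already sit in the even bit positions of the ECC byte (as in the ECC 0 layout, where rp00 is bit 0 and rp01 is bit 1). The helper name fill_odd_bits and the handling of the parity-1 case are assumptions made for this illustration, not code taken from the driver:

```c
#include <assert.h>

/*
 * Hypothetical helper: derive the odd-numbered parity bits of one ECC
 * byte from the even-numbered ones and the total parity of the data.
 *   total_parity == 0: each odd bit equals its even neighbour
 *   total_parity == 1: each odd bit is the inverse of its even neighbour
 */
static unsigned char fill_odd_bits(unsigned char even_bits, int total_parity)
{
	unsigned int code = even_bits & 0x55;	/* keep bits 0, 2, 4, 6 only */

	if (total_parity == 0)
		code |= code << 1;		/* copy even bits into odd slots */
	else
		code |= (code ^ 0x55) << 1;	/* inverted copy into odd slots */
	return (unsigned char)code;
}
```

With total parity 0 this reduces to the code[0] |= (code[0] << 1) statement quoted above; the else branch is only one possible way to cover the rp4 = !rp5 case.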
+ + +Attempt 9 +========= + +Changed the code but again this slightly degrades performance. Tried all +kinds of other things, like having dedicated parity arrays to avoid the +shift after parity[rp7] << 7; No gain. +Changed the lookup using the parity array by using shift operators, e.g. +replacing parity[rp7] << 7 with:: + + rp7 ^= (rp7 << 4); + rp7 ^= (rp7 << 2); + rp7 ^= (rp7 << 1); + rp7 &= 0x80; + +No gain. + +The only marginal change was inverting the parity bits, so we can remove +the last three invert statements. + +Ah well, pity this does not deliver more. Then again 10 million +iterations using the linux driver code takes between 13 and 13.5 +seconds, whereas my code now takes about 0.73 seconds for those 10 +million iterations. So basically I've improved the performance by a +factor 18 on my system. Not that bad. Of course on different hardware +you will get different results. No warranties! + +But of course there is no such thing as a free lunch. The codesize almost +tripled (from 562 bytes to 1434 bytes). Then again, it is not that much. + + +Correcting errors +================= + +For correcting errors I again used the ST application note as a starter, +but I also peeked at the existing code. + +The algorithm itself is pretty straightforward. Just xor the given and +the calculated ecc. If all bytes are 0 there is no problem. If 11 bits +are 1 we have one correctable bit error. If exactly 1 bit is 1, we have an +error in the given ecc code. + +It proved to be fastest to do some table lookups. Performance gain +introduced by this is about a factor 2 on my system when a repair had to +be done, and 1% or so if no repair had to be done. + +Code size increased from 330 bytes to 686 bytes for this function. +(gcc 4.2, -O3) + + +Conclusion +========== + +The gain when calculating the ecc is tremendous. On my development hardware +a speedup of a factor of 18 for ecc calculation was achieved. On a test on an +embedded system with a MIPS core a factor 7 was obtained. 
+ +On a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor +5 (big endian mode, gcc 4.1.2, -O3) + +For correction not much gain could be obtained (as bitflips are rare). Then +again there are also much less cycles spent there. + +It seems there is not much more gain possible in this, at least when +programmed in C. Of course it might be possible to squeeze something more +out of it with an assembler program, but due to pipeline behaviour etc +this is very tricky (at least for intel hw). + +Author: Frans Meulenbroeks + +Copyright (C) 2008 Koninklijke Philips Electronics NV. diff --git a/Documentation/driver-api/mtd/spi-nor.rst b/Documentation/driver-api/mtd/spi-nor.rst new file mode 100644 index 000000000000..f5333e3bf486 --- /dev/null +++ b/Documentation/driver-api/mtd/spi-nor.rst @@ -0,0 +1,66 @@ +================= +SPI NOR framework +================= + +Part I - Why do we need this framework? +--------------------------------------- + +SPI bus controllers (drivers/spi/) only deal with streams of bytes; the bus +controller operates agnostic of the specific device attached. However, some +controllers (such as Freescale's QuadSPI controller) cannot easily handle +arbitrary streams of bytes, but rather are designed specifically for SPI NOR. + +In particular, Freescale's QuadSPI controller must know the NOR commands to +find the right LUT sequence. Unfortunately, the SPI subsystem has no notion of +opcodes, addresses, or data payloads; a SPI controller simply knows to send or +receive bytes (Tx and Rx). Therefore, we must define a new layering scheme under +which the controller driver is aware of the opcodes, addressing, and other +details of the SPI NOR protocol. + +Part II - How does the framework work? +-------------------------------------- + +This framework just adds a new layer between the MTD and the SPI bus driver. +With this new layer, the SPI NOR controller driver does not depend on the +m25p80 code anymore. 
+ +Before this framework, the layer is like:: + + MTD + ------------------------ + m25p80 + ------------------------ + SPI bus driver + ------------------------ + SPI NOR chip + + After this framework, the layer is like: + MTD + ------------------------ + SPI NOR framework + ------------------------ + m25p80 + ------------------------ + SPI bus driver + ------------------------ + SPI NOR chip + + With the SPI NOR controller driver (Freescale QuadSPI), it looks like: + MTD + ------------------------ + SPI NOR framework + ------------------------ + fsl-quadSPI + ------------------------ + SPI NOR chip + +Part III - How can drivers use the framework? +--------------------------------------------- + +The main API is spi_nor_scan(). Before you call the hook, a driver should +initialize the necessary fields for spi_nor{}. Please see +drivers/mtd/spi-nor/spi-nor.c for detail. Please also refer to fsl-quadspi.c +when you want to write a new driver for a SPI NOR controller. +Another API is spi_nor_restore(), this is used to restore the status of SPI +flash chip such as addressing mode. Call it whenever detach the driver from +device or reboot the system. diff --git a/Documentation/mtd/index.rst b/Documentation/mtd/index.rst deleted file mode 100644 index 4fdae418ac97..000000000000 --- a/Documentation/mtd/index.rst +++ /dev/null @@ -1,12 +0,0 @@ -:orphan: - -============================== -Memory Technology Device (MTD) -============================== - -.. toctree:: - :maxdepth: 1 - - intel-spi - nand_ecc - spi-nor diff --git a/Documentation/mtd/intel-spi.rst b/Documentation/mtd/intel-spi.rst deleted file mode 100644 index 0e6d9cd5388d..000000000000 --- a/Documentation/mtd/intel-spi.rst +++ /dev/null @@ -1,90 +0,0 @@ -============================== -Upgrading BIOS using intel-spi -============================== - -Many Intel CPUs like Baytrail and Braswell include SPI serial flash host -controller which is used to hold BIOS and other platform specific data. 
-Since contents of the SPI serial flash is crucial for machine to function, -it is typically protected by different hardware protection mechanisms to -avoid accidental (or on purpose) overwrite of the content. - -Not all manufacturers protect the SPI serial flash, mainly because it -allows upgrading the BIOS image directly from an OS. - -The intel-spi driver makes it possible to read and write the SPI serial -flash, if certain protection bits are not set and locked. If it finds -any of them set, the whole MTD device is made read-only to prevent -partial overwrites. By default the driver exposes SPI serial flash -contents as read-only but it can be changed from kernel command line, -passing "intel-spi.writeable=1". - -Please keep in mind that overwriting the BIOS image on SPI serial flash -might render the machine unbootable and requires special equipment like -Dediprog to revive. You have been warned! - -Below are the steps how to upgrade MinnowBoard MAX BIOS directly from -Linux. - - 1) Download and extract the latest Minnowboard MAX BIOS SPI image - [1]. At the time writing this the latest image is v92. - - 2) Install mtd-utils package [2]. We need this in order to erase the SPI - serial flash. Distros like Debian and Fedora have this prepackaged with - name "mtd-utils". - - 3) Add "intel-spi.writeable=1" to the kernel command line and reboot - the board (you can also reload the driver passing "writeable=1" as - module parameter to modprobe). - - 4) Once the board is up and running again, find the right MTD partition - (it is named as "BIOS"):: - - # cat /proc/mtd - dev: size erasesize name - mtd0: 00800000 00001000 "BIOS" - - So here it will be /dev/mtd0 but it may vary. 
- - 5) Make backup of the existing image first:: - - # dd if=/dev/mtd0ro of=bios.bak - 16384+0 records in - 16384+0 records out - 8388608 bytes (8.4 MB) copied, 10.0269 s, 837 kB/s - - 6) Verify the backup: - - # sha1sum /dev/mtd0ro bios.bak - fdbb011920572ca6c991377c4b418a0502668b73 /dev/mtd0ro - fdbb011920572ca6c991377c4b418a0502668b73 bios.bak - - The SHA1 sums must match. Otherwise do not continue any further! - - 7) Erase the SPI serial flash. After this step, do not reboot the - board! Otherwise it will not start anymore:: - - # flash_erase /dev/mtd0 0 0 - Erasing 4 Kibyte @ 7ff000 -- 100 % complete - - 8) Once completed without errors you can write the new BIOS image: - - # dd if=MNW2MAX1.X64.0092.R01.1605221712.bin of=/dev/mtd0 - - 9) Verify that the new content of the SPI serial flash matches the new - BIOS image:: - - # sha1sum /dev/mtd0ro MNW2MAX1.X64.0092.R01.1605221712.bin - 9b4df9e4be2057fceec3a5529ec3d950836c87a2 /dev/mtd0ro - 9b4df9e4be2057fceec3a5529ec3d950836c87a2 MNW2MAX1.X64.0092.R01.1605221712.bin - - The SHA1 sums should match. - - 10) Now you can reboot your board and observe the new BIOS starting up - properly. - -References ----------- - -[1] https://firmware.intel.com/sites/default/files/MinnowBoard%2EMAX_%2EX64%2E92%2ER01%2Ezip - -[2] http://www.linux-mtd.infradead.org/ diff --git a/Documentation/mtd/nand_ecc.rst b/Documentation/mtd/nand_ecc.rst deleted file mode 100644 index e8d3c53a5056..000000000000 --- a/Documentation/mtd/nand_ecc.rst +++ /dev/null @@ -1,763 +0,0 @@ -========================== -NAND Error-correction Code -========================== - -Introduction -============ - -Having looked at the linux mtd/nand driver and more specific at nand_ecc.c -I felt there was room for optimisation. I bashed the code for a few hours -performing tricks like table lookup removing superfluous code etc. -After that the speed was increased by 35-40%. -Still I was not too happy as I felt there was additional room for improvement. - -Bad! 
I was hooked. -I decided to annotate my steps in this file. Perhaps it is useful to someone -or someone learns something from it. - - -The problem -=========== - -NAND flash (at least SLC one) typically has sectors of 256 bytes. -However NAND flash is not extremely reliable so some error detection -(and sometimes correction) is needed. - -This is done by means of a Hamming code. I'll try to explain it in -laymans terms (and apologies to all the pro's in the field in case I do -not use the right terminology, my coding theory class was almost 30 -years ago, and I must admit it was not one of my favourites). - -As I said before the ecc calculation is performed on sectors of 256 -bytes. This is done by calculating several parity bits over the rows and -columns. The parity used is even parity which means that the parity bit = 1 -if the data over which the parity is calculated is 1 and the parity bit = 0 -if the data over which the parity is calculated is 0. So the total -number of bits over the data over which the parity is calculated + the -parity bit is even. (see wikipedia if you can't follow this). -Parity is often calculated by means of an exclusive or operation, -sometimes also referred to as xor. In C the operator for xor is ^ - -Back to ecc. -Let's give a small figure: - -========= ==== ==== ==== ==== ==== ==== ==== ==== === === === === ==== -byte 0: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp4 ... rp14 -byte 1: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp2 rp4 ... rp14 -byte 2: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp4 ... rp14 -byte 3: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp4 ... rp14 -byte 4: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp5 ... rp14 -... -byte 254: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp5 ... rp15 -byte 255: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp5 ... 
rp15 - cp1 cp0 cp1 cp0 cp1 cp0 cp1 cp0 - cp3 cp3 cp2 cp2 cp3 cp3 cp2 cp2 - cp5 cp5 cp5 cp5 cp4 cp4 cp4 cp4 -========= ==== ==== ==== ==== ==== ==== ==== ==== === === === === ==== - -This figure represents a sector of 256 bytes. -cp is my abbreviation for column parity, rp for row parity. - -Let's start to explain column parity. - -- cp0 is the parity that belongs to all bit0, bit2, bit4, bit6. - - so the sum of all bit0, bit2, bit4 and bit6 values + cp0 itself is even. - -Similarly cp1 is the sum of all bit1, bit3, bit5 and bit7. - -- cp2 is the parity over bit0, bit1, bit4 and bit5 -- cp3 is the parity over bit2, bit3, bit6 and bit7. -- cp4 is the parity over bit0, bit1, bit2 and bit3. -- cp5 is the parity over bit4, bit5, bit6 and bit7. - -Note that each of cp0 .. cp5 is exactly one bit. - -Row parity actually works almost the same. - -- rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254) -- rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255) -- rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ... - (so handle two bytes, then skip 2 bytes). -- rp3 is covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...) -- for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc. - - so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ...) -- and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, .. - -The story now becomes quite boring. I guess you get the idea. 
- -- rp6 covers 8 bytes then skips 8 etc -- rp7 skips 8 bytes then covers 8 etc -- rp8 covers 16 bytes then skips 16 etc -- rp9 skips 16 bytes then covers 16 etc -- rp10 covers 32 bytes then skips 32 etc -- rp11 skips 32 bytes then covers 32 etc -- rp12 covers 64 bytes then skips 64 etc -- rp13 skips 64 bytes then covers 64 etc -- rp14 covers 128 bytes then skips 128 -- rp15 skips 128 bytes then covers 128 - -In the end the parity bits are grouped together in three bytes as -follows: - -===== ===== ===== ===== ===== ===== ===== ===== ===== -ECC Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0 -===== ===== ===== ===== ===== ===== ===== ===== ===== -ECC 0 rp07 rp06 rp05 rp04 rp03 rp02 rp01 rp00 -ECC 1 rp15 rp14 rp13 rp12 rp11 rp10 rp09 rp08 -ECC 2 cp5 cp4 cp3 cp2 cp1 cp0 1 1 -===== ===== ===== ===== ===== ===== ===== ===== ===== - -I detected after writing this that ST application note AN1823 -(http://www.st.com/stonline/) gives a much -nicer picture.(but they use line parity as term where I use row parity) -Oh well, I'm graphically challenged, so suffer with me for a moment :-) - -And I could not reuse the ST picture anyway for copyright reasons. - - -Attempt 0 -========= - -Implementing the parity calculation is pretty simple. 
-In C pseudocode:: - - for (i = 0; i < 256; i++) - { - if (i & 0x01) - rp1 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1; - else - rp0 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp0; - if (i & 0x02) - rp3 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp3; - else - rp2 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp2; - if (i & 0x04) - rp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp5; - else - rp4 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp4; - if (i & 0x08) - rp7 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp7; - else - rp6 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp6; - if (i & 0x10) - rp9 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp9; - else - rp8 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp8; - if (i & 0x20) - rp11 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp11; - else - rp10 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp10; - if (i & 0x40) - rp13 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp13; - else - rp12 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp12; - if (i & 0x80) - rp15 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp15; - else - rp14 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp14; - cp0 = bit6 ^ bit4 ^ bit2 ^ bit0 ^ cp0; - cp1 = bit7 ^ bit5 ^ bit3 ^ bit1 ^ cp1; - cp2 = bit5 ^ bit4 ^ bit1 ^ bit0 ^ cp2; - cp3 = bit7 ^ bit6 ^ bit3 ^ bit2 ^ cp3 - cp4 = bit3 ^ bit2 ^ bit1 ^ bit0 ^ cp4 - cp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ cp5 - } - - -Analysis 0 -========== - -C does have bitwise operators but not really operators to do the above -efficiently (and most hardware has no such instructions either). -Therefore without implementing this it was clear that the code above was -not going to bring me a Nobel prize :-) - -Fortunately the exclusive or operation is commutative, so we can combine -the values in any order. 
So instead of calculating all the bits -individually, let us try to rearrange things. -For the column parity this is easy. We can just xor the bytes and in the -end filter out the relevant bits. This is pretty nice as it will bring -all cp calculation out of the for loop. - -Similarly we can first xor the bytes for the various rows. -This leads to: - - -Attempt 1 -========= - -:: - - const char parity[256] = { - 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, - 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, - 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, - 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, - 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, - 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, - 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, - 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, - 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, - 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, - 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, - 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, - 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, - 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, - 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, - 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0 - }; - - void ecc1(const unsigned char *buf, unsigned char *code) - { - int i; - const unsigned char *bp = buf; - unsigned char cur; - unsigned char rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; - unsigned char rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; - unsigned char par; - - par = 0; - rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; - rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; - rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; - rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; - - for (i = 0; i < 256; i++) - { - cur = *bp++; - par ^= cur; - if (i & 0x01) rp1 ^= cur; else rp0 ^= cur; - if (i & 0x02) rp3 ^= cur; else rp2 ^= cur; - if (i & 0x04) rp5 ^= cur; else rp4 ^= cur; - if (i & 0x08) rp7 ^= cur; else rp6 ^= cur; - if (i & 0x10) rp9 ^= cur; else rp8 ^= cur; - if (i & 0x20) rp11 ^= cur; else rp10 ^= cur; - if (i 
& 0x40) rp13 ^= cur; else rp12 ^= cur; - if (i & 0x80) rp15 ^= cur; else rp14 ^= cur; - } - code[0] = - (parity[rp7] << 7) | - (parity[rp6] << 6) | - (parity[rp5] << 5) | - (parity[rp4] << 4) | - (parity[rp3] << 3) | - (parity[rp2] << 2) | - (parity[rp1] << 1) | - (parity[rp0]); - code[1] = - (parity[rp15] << 7) | - (parity[rp14] << 6) | - (parity[rp13] << 5) | - (parity[rp12] << 4) | - (parity[rp11] << 3) | - (parity[rp10] << 2) | - (parity[rp9] << 1) | - (parity[rp8]); - code[2] = - (parity[par & 0xf0] << 7) | - (parity[par & 0x0f] << 6) | - (parity[par & 0xcc] << 5) | - (parity[par & 0x33] << 4) | - (parity[par & 0xaa] << 3) | - (parity[par & 0x55] << 2); - code[0] = ~code[0]; - code[1] = ~code[1]; - code[2] = ~code[2]; - } - -Still pretty straightforward. The last three invert statements are there to -give a checksum of 0xff 0xff 0xff for an empty flash. In an empty flash -all data is 0xff, so the checksum then matches. - -I also introduced the parity lookup. I expected this to be the fastest -way to calculate the parity, but I will investigate alternatives later -on. - - -Analysis 1 -========== - -The code works, but is not terribly efficient. On my system it took -almost 4 times as much time as the linux driver code. But hey, if it was -*that* easy this would have been done long before. -No pain. no gain. - -Fortunately there is plenty of room for improvement. - -In step 1 we moved from bit-wise calculation to byte-wise calculation. -However in C we can also use the unsigned long data type and virtually -every modern microprocessor supports 32 bit operations, so why not try -to write our code in such a way that we process data in 32 bit chunks. - -Of course this means some modification as the row parity is byte by -byte. A quick analysis: -for the column parity we use the par variable. When extending to 32 bits -we can in the end easily calculate rp0 and rp1 from it. 
-(because par now consists of 4 bytes, contributing to rp1, rp0, rp1, rp0 -respectively, from MSB to LSB) -also rp2 and rp3 can be easily retrieved from par as rp3 covers the -first two MSBs and rp2 covers the last two LSBs. - -Note that of course now the loop is executed only 64 times (256/4). -And note that care must taken wrt byte ordering. The way bytes are -ordered in a long is machine dependent, and might affect us. -Anyway, if there is an issue: this code is developed on x86 (to be -precise: a DELL PC with a D920 Intel CPU) - -And of course the performance might depend on alignment, but I expect -that the I/O buffers in the nand driver are aligned properly (and -otherwise that should be fixed to get maximum performance). - -Let's give it a try... - - -Attempt 2 -========= - -:: - - extern const char parity[256]; - - void ecc2(const unsigned char *buf, unsigned char *code) - { - int i; - const unsigned long *bp = (unsigned long *)buf; - unsigned long cur; - unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; - unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; - unsigned long par; - - par = 0; - rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; - rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; - rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; - rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; - - for (i = 0; i < 64; i++) - { - cur = *bp++; - par ^= cur; - if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; - if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; - if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; - if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; - if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; - if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; - } - /* - we need to adapt the code generation for the fact that rp vars are now - long; also the column parity calculation needs to be changed. 
- we'll bring rp4 to 15 back to single byte entities by shifting and - xoring - */ - rp4 ^= (rp4 >> 16); rp4 ^= (rp4 >> 8); rp4 &= 0xff; - rp5 ^= (rp5 >> 16); rp5 ^= (rp5 >> 8); rp5 &= 0xff; - rp6 ^= (rp6 >> 16); rp6 ^= (rp6 >> 8); rp6 &= 0xff; - rp7 ^= (rp7 >> 16); rp7 ^= (rp7 >> 8); rp7 &= 0xff; - rp8 ^= (rp8 >> 16); rp8 ^= (rp8 >> 8); rp8 &= 0xff; - rp9 ^= (rp9 >> 16); rp9 ^= (rp9 >> 8); rp9 &= 0xff; - rp10 ^= (rp10 >> 16); rp10 ^= (rp10 >> 8); rp10 &= 0xff; - rp11 ^= (rp11 >> 16); rp11 ^= (rp11 >> 8); rp11 &= 0xff; - rp12 ^= (rp12 >> 16); rp12 ^= (rp12 >> 8); rp12 &= 0xff; - rp13 ^= (rp13 >> 16); rp13 ^= (rp13 >> 8); rp13 &= 0xff; - rp14 ^= (rp14 >> 16); rp14 ^= (rp14 >> 8); rp14 &= 0xff; - rp15 ^= (rp15 >> 16); rp15 ^= (rp15 >> 8); rp15 &= 0xff; - rp3 = (par >> 16); rp3 ^= (rp3 >> 8); rp3 &= 0xff; - rp2 = par & 0xffff; rp2 ^= (rp2 >> 8); rp2 &= 0xff; - par ^= (par >> 16); - rp1 = (par >> 8); rp1 &= 0xff; - rp0 = (par & 0xff); - par ^= (par >> 8); par &= 0xff; - - code[0] = - (parity[rp7] << 7) | - (parity[rp6] << 6) | - (parity[rp5] << 5) | - (parity[rp4] << 4) | - (parity[rp3] << 3) | - (parity[rp2] << 2) | - (parity[rp1] << 1) | - (parity[rp0]); - code[1] = - (parity[rp15] << 7) | - (parity[rp14] << 6) | - (parity[rp13] << 5) | - (parity[rp12] << 4) | - (parity[rp11] << 3) | - (parity[rp10] << 2) | - (parity[rp9] << 1) | - (parity[rp8]); - code[2] = - (parity[par & 0xf0] << 7) | - (parity[par & 0x0f] << 6) | - (parity[par & 0xcc] << 5) | - (parity[par & 0x33] << 4) | - (parity[par & 0xaa] << 3) | - (parity[par & 0x55] << 2); - code[0] = ~code[0]; - code[1] = ~code[1]; - code[2] = ~code[2]; - } - -The parity array is not shown any more. 
Note also that for these -examples I kinda deviated from my regular programming style by allowing -multiple statements on a line, not using { } in then and else blocks -with only a single statement and by using operators like ^= - - -Analysis 2 -========== - -The code (of course) works, and hurray: we are a little bit faster than -the linux driver code (about 15%). But wait, don't cheer too quickly. -There is more to be gained. -If we look at e.g. rp14 and rp15 we see that we either xor our data with -rp14 or with rp15. However we also have par which goes over all data. -This means there is no need to calculate rp14 as it can be calculated from -rp15 through rp14 = par ^ rp15, because par = rp14 ^ rp15; -(or if desired we can avoid calculating rp15 and calculate it from -rp14). That is why some places refer to inverse parity. -Of course the same thing holds for rp4/5, rp6/7, rp8/9, rp10/11 and rp12/13. -Effectively this means we can eliminate the else clause from the if -statements. Also we can optimise the calculation in the end a little bit -by going from long to byte first. Actually we can even avoid the table -lookups - -Attempt 3 -========= - -Odd replaced:: - - if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; - if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; - if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; - if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; - if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; - if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; - -with:: - - if (i & 0x01) rp5 ^= cur; - if (i & 0x02) rp7 ^= cur; - if (i & 0x04) rp9 ^= cur; - if (i & 0x08) rp11 ^= cur; - if (i & 0x10) rp13 ^= cur; - if (i & 0x20) rp15 ^= cur; - -and outside the loop added:: - - rp4 = par ^ rp5; - rp6 = par ^ rp7; - rp8 = par ^ rp9; - rp10 = par ^ rp11; - rp12 = par ^ rp13; - rp14 = par ^ rp15; - -And after that the code takes about 30% more time, although the number of -statements is reduced. This is also reflected in the assembly code. - - -Analysis 3 -========== - -Very weird. 
Guess it has to do with caching or instruction parallelism -or so. I also tried on an eeePC (Celeron, clocked at 900 MHz). An interesting -observation was that this one is only 30% slower (according to time) -executing the code than my 3 GHz D920 processor. - -Well, it was expected not to be easy so maybe instead move to a -different track: let's move back to the code from attempt 2 and do some -loop unrolling. This will eliminate a few if statements. I'll try -different amounts of unrolling to see what works best. - - -Attempt 4 -========= - -Unrolled the loop 1, 2, 3 and 4 times. -For 4 the code starts with:: - - for (i = 0; i < 4; i++) - { - cur = *bp++; - par ^= cur; - rp4 ^= cur; - rp6 ^= cur; - rp8 ^= cur; - rp10 ^= cur; - if (i & 0x1) rp13 ^= cur; else rp12 ^= cur; - if (i & 0x2) rp15 ^= cur; else rp14 ^= cur; - cur = *bp++; - par ^= cur; - rp5 ^= cur; - rp6 ^= cur; - ... - - -Analysis 4 -========== - -Unrolling once gains about 15% - -Unrolling twice keeps the gain at about 15% - -Unrolling three times gives a gain of 30% compared to attempt 2. - -Unrolling four times gives a marginal improvement compared to unrolling -three times. - -I decided to proceed with a four times unrolled loop anyway. It was my gut -feeling that in the next steps I would obtain additional gain from it. - -The next step was triggered by the fact that par contains the xor of all -bytes and rp4 and rp5 each contain the xor of half of the bytes. -So in effect par = rp4 ^ rp5. But as xor is commutative we can also say -that rp5 = par ^ rp4. So no need to keep both rp4 and rp5 around. We can -eliminate rp5 (or rp4, but I already foresaw another optimisation). -The same holds for rp6/7, rp8/9, rp10/11, rp12/13 and rp14/15. - - -Attempt 5 -========= - -Effectively all odd-numbered rp assignments in the loop were removed. -This included the else clause of the if statements. 
-Of course after the loop we need to correct things by adding code like:: - - rp5 = par ^ rp4; - -Also the initial assignments (rp5 = 0; etc) could be removed. -Along the way I also removed the initialisation of rp0/1/2/3. - - -Analysis 5 -========== - -Measurements showed this was a good move. The run-time roughly halved -compared with attempt 4 with 4 times unrolled, and we only require 1/3rd -of the processor time compared to the current code in the linux kernel. - -However, still I thought there was more. I didn't like all the if -statements. Why not keep a running parity and only keep the last if -statement? Time for yet another version! - - -Attempt 6 -========= - -The code within the for loop was changed to:: - - for (i = 0; i < 4; i++) - { - cur = *bp++; tmppar = cur; rp4 ^= cur; - cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; - cur = *bp++; tmppar ^= cur; rp4 ^= cur; - cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; - - cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; - cur = *bp++; tmppar ^= cur; rp6 ^= cur; - cur = *bp++; tmppar ^= cur; rp4 ^= cur; - cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; - - cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; rp8 ^= cur; - cur = *bp++; tmppar ^= cur; rp6 ^= cur; rp8 ^= cur; - cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp8 ^= cur; - cur = *bp++; tmppar ^= cur; rp8 ^= cur; - - cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; - cur = *bp++; tmppar ^= cur; rp6 ^= cur; - cur = *bp++; tmppar ^= cur; rp4 ^= cur; - cur = *bp++; tmppar ^= cur; - - par ^= tmppar; - if ((i & 0x1) == 0) rp12 ^= tmppar; - if ((i & 0x2) == 0) rp14 ^= tmppar; - } - -As you can see tmppar is used to accumulate the parity within a for -iteration. In the last 3 statements it is added to par and, if needed, -to rp12 and rp14. - -While making the changes I also found that I could exploit that tmppar -contains the running parity for this iteration. 
So instead of having: -rp4 ^= cur; rp6 ^= cur; -I removed the rp6 ^= cur; statement and did rp6 ^= tmppar; on the next -statement. A similar change was done for rp8 and rp10. - - -Analysis 6 -========== - -Measuring this code again showed a big gain. When executing the original -linux code 1 million times, this took about 1 second on my system. -(using time to measure the performance). After this iteration I was back -to 0.075 sec. Actually I had to decide to start measuring over 10 -million iterations in order not to lose too much accuracy. This one -definitely seemed to be the jackpot! - -There is a little bit more room for improvement though. There are three -places with statements:: - - rp4 ^= cur; rp6 ^= cur; - -It seems more efficient to also maintain a variable rp4_6 in the for -loop. This eliminates 3 statements per loop. Of course after the loop we -need to correct by adding:: - - rp4 ^= rp4_6; - rp6 ^= rp4_6; - -Furthermore there are 4 sequential assignments to rp8. This can be -encoded slightly more efficiently by saving tmppar before those 4 lines -and later do rp8 = rp8 ^ tmppar ^ notrp8; -(where notrp8 is the value of tmppar before those 4 lines). -Again a use of the commutative property of xor. -Time for a new test! 
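The rp8 shortcut described above works because xor is associative and every value is its own inverse: xoring four consecutive words into rp8 gives the same result as xoring in the change of the running parity over those four words. A minimal sketch in C that checks this identity; the function names and the sample values are assumptions made for illustration and are not taken from the driver:

```c
#include <assert.h>
#include <stdint.h>

/* Straightforward form: xor each of the four words into rp8. */
uint32_t rp8_direct(uint32_t rp8, const uint32_t w[4])
{
	int i;

	for (i = 0; i < 4; i++)
		rp8 ^= w[i];
	return rp8;
}

/*
 * Shortcut form: remember the running parity before the four words
 * (notrp8), keep accumulating into tmppar only, then fold the
 * difference into rp8 with a single statement.
 */
uint32_t rp8_shortcut(uint32_t rp8, uint32_t tmppar, const uint32_t w[4])
{
	uint32_t notrp8 = tmppar;
	int i;

	for (i = 0; i < 4; i++)
		tmppar ^= w[i];
	return rp8 ^ tmppar ^ notrp8;
}

/* Hypothetical self-check with arbitrary sample data. */
void check_rp8_shortcut(void)
{
	const uint32_t w[4] = {
		0x12345678u, 0x9abcdef0u, 0x0f0f0f0fu, 0xdeadbeefu
	};

	assert(rp8_direct(0xa5a5a5a5u, w) ==
	       rp8_shortcut(0xa5a5a5a5u, 0x13572468u, w));
}
```

Both forms produce the same rp8 whatever the running parity held before the four words. Whether the shortcut is actually faster is a separate question that only measurement can answer.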
- - -Attempt 7 -========= - -The new code now looks like:: - - for (i = 0; i < 4; i++) - { - cur = *bp++; tmppar = cur; rp4 ^= cur; - cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; - cur = *bp++; tmppar ^= cur; rp4 ^= cur; - cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; - - cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; - cur = *bp++; tmppar ^= cur; rp6 ^= cur; - cur = *bp++; tmppar ^= cur; rp4 ^= cur; - cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; - - notrp8 = tmppar; - cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; - cur = *bp++; tmppar ^= cur; rp6 ^= cur; - cur = *bp++; tmppar ^= cur; rp4 ^= cur; - cur = *bp++; tmppar ^= cur; - rp8 = rp8 ^ tmppar ^ notrp8; - - cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; - cur = *bp++; tmppar ^= cur; rp6 ^= cur; - cur = *bp++; tmppar ^= cur; rp4 ^= cur; - cur = *bp++; tmppar ^= cur; - - par ^= tmppar; - if ((i & 0x1) == 0) rp12 ^= tmppar; - if ((i & 0x2) == 0) rp14 ^= tmppar; - } - rp4 ^= rp4_6; - rp6 ^= rp4_6; - - -Not a big change, but every penny counts :-) - - -Analysis 7 -========== - -Actually this made things worse. Not very much, but I don't want to move -into the wrong direction. Maybe something to investigate later. Could -have to do with caching again. - -Guess that is what there is to win within the loop. Maybe unrolling one -more time will help. I'll keep the optimisations from 7 for now. - - -Attempt 8 -========= - -Unrolled the loop one more time. - - -Analysis 8 -========== - -This makes things worse. Let's stick with attempt 6 and continue from there. -Although it seems that the code within the loop cannot be optimised -further there is still room to optimize the generation of the ecc codes. -We can simply calculate the total parity. If this is 0 then rp4 = rp5 -etc. If the parity is 1, then rp4 = !rp5; - -But if rp4 = rp5 we do not need rp5 etc. We can just write the even bits -in the result byte and then do something like:: - - code[0] |= (code[0] << 1); - -Lets test this. 
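The idea above can be sketched in a few lines of C. If the total parity of the data is 0, each odd-numbered parity bit equals its even-numbered partner; if it is 1, each odd-numbered bit is the inverse of its partner. The helper name fill_odd_bits() and its arguments are assumptions for illustration only. The sketch also assumes the even-numbered parity bits have already been written to bit positions 0, 2, 4 and 6 of the result byte:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Derive the odd-numbered parity bits (positions 1, 3, 5, 7) from the
 * even-numbered ones (positions 0, 2, 4, 6) and the total parity.
 */
uint8_t fill_odd_bits(uint8_t even_bits, int total_parity)
{
	uint8_t code = even_bits & 0x55;	/* keep even positions only */

	if (total_parity == 0)
		code |= (uint8_t)(code << 1);		/* odd bit == even bit */
	else
		code |= (uint8_t)(~(code << 1) & 0xaa);	/* odd bit == !even bit */
	return code;
}

/* Hypothetical self-check: bits 0 and 2 set in the input. */
void check_fill_odd_bits(void)
{
	assert(fill_odd_bits(0x05, 0) == 0x0f);
	assert(fill_odd_bits(0x05, 1) == 0xa5);
}
```

The parity 0 branch is exactly the code[0] |= (code[0] << 1) statement from the text; the parity 1 branch is one possible way to handle the inverted case.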
- - -Attempt 9 -========= - -Changed the code but again this slightly degrades performance. Tried all -kinds of other things, like having dedicated parity arrays to avoid the -shift after parity[rp7] << 7; No gain. -Changed the lookup using the parity array to use shift operators, e.g. -replaced parity[rp7] << 7 with:: - - rp7 ^= (rp7 << 4); - rp7 ^= (rp7 << 2); - rp7 ^= (rp7 << 1); - rp7 &= 0x80; - -No gain. - -The only change that gave a marginal gain was inverting the parity bits, -so we can remove the last three invert statements. - -Ah well, pity this does not deliver more. Then again 10 million -iterations using the linux driver code takes between 13 and 13.5 -seconds, whereas my code now takes about 0.73 seconds for those 10 -million iterations. So basically I've improved the performance by a -factor of 18 on my system. Not that bad. Of course on different hardware -you will get different results. No warranties! - -But of course there is no such thing as a free lunch. The codesize almost -tripled (from 562 bytes to 1434 bytes). Then again, it is not that much. - - -Correcting errors -================= - -For correcting errors I again used the ST application note as a starter, -but I also peeked at the existing code. - -The algorithm itself is pretty straightforward. Just xor the given and -the calculated ecc. If all bytes are 0 there is no problem. If 11 bits -are 1 we have one correctable bit error. If exactly 1 bit is 1, we have an -error in the given ecc code. - -It proved to be fastest to do some table lookups. Performance gain -introduced by this is about a factor of 2 on my system when a repair had to -be done, and 1% or so if no repair had to be done. - -Code size increased from 330 bytes to 686 bytes for this function. -(gcc 4.2, -O3) - - -Conclusion -========== - -The gain when calculating the ecc is tremendous. On my development hardware -a speedup of a factor of 18 for ecc calculation was achieved. On a test on an -embedded system with a MIPS core a factor of 7 was obtained. 
- -On a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor -5 (big endian mode, gcc 4.1.2, -O3) - -For correction not much gain could be obtained (as bitflips are rare). Then -again there are also much less cycles spent there. - -It seems there is not much more gain possible in this, at least when -programmed in C. Of course it might be possible to squeeze something more -out of it with an assembler program, but due to pipeline behaviour etc -this is very tricky (at least for intel hw). - -Author: Frans Meulenbroeks - -Copyright (C) 2008 Koninklijke Philips Electronics NV. diff --git a/Documentation/mtd/spi-nor.rst b/Documentation/mtd/spi-nor.rst deleted file mode 100644 index f5333e3bf486..000000000000 --- a/Documentation/mtd/spi-nor.rst +++ /dev/null @@ -1,66 +0,0 @@ -================= -SPI NOR framework -================= - -Part I - Why do we need this framework? ---------------------------------------- - -SPI bus controllers (drivers/spi/) only deal with streams of bytes; the bus -controller operates agnostic of the specific device attached. However, some -controllers (such as Freescale's QuadSPI controller) cannot easily handle -arbitrary streams of bytes, but rather are designed specifically for SPI NOR. - -In particular, Freescale's QuadSPI controller must know the NOR commands to -find the right LUT sequence. Unfortunately, the SPI subsystem has no notion of -opcodes, addresses, or data payloads; a SPI controller simply knows to send or -receive bytes (Tx and Rx). Therefore, we must define a new layering scheme under -which the controller driver is aware of the opcodes, addressing, and other -details of the SPI NOR protocol. - -Part II - How does the framework work? --------------------------------------- - -This framework just adds a new layer between the MTD and the SPI bus driver. -With this new layer, the SPI NOR controller driver does not depend on the -m25p80 code anymore. 
- -Before this framework, the layer is like:: - - MTD - ------------------------ - m25p80 - ------------------------ - SPI bus driver - ------------------------ - SPI NOR chip - - After this framework, the layer is like: - MTD - ------------------------ - SPI NOR framework - ------------------------ - m25p80 - ------------------------ - SPI bus driver - ------------------------ - SPI NOR chip - - With the SPI NOR controller driver (Freescale QuadSPI), it looks like: - MTD - ------------------------ - SPI NOR framework - ------------------------ - fsl-quadSPI - ------------------------ - SPI NOR chip - -Part III - How can drivers use the framework? ---------------------------------------------- - -The main API is spi_nor_scan(). Before you call the hook, a driver should -initialize the necessary fields for spi_nor{}. Please see -drivers/mtd/spi-nor/spi-nor.c for detail. Please also refer to fsl-quadspi.c -when you want to write a new driver for a SPI NOR controller. -Another API is spi_nor_restore(), this is used to restore the status of SPI -flash chip such as addressing mode. Call it whenever detach the driver from -device or reboot the system. diff --git a/drivers/mtd/nand/raw/nand_ecc.c b/drivers/mtd/nand/raw/nand_ecc.c index f6a7808db818..09fdced659f5 100644 --- a/drivers/mtd/nand/raw/nand_ecc.c +++ b/drivers/mtd/nand/raw/nand_ecc.c @@ -11,7 +11,7 @@ * Thomas Gleixner (tglx@linutronix.de) * * Information on how this algorithm works and how it was developed - * can be found in Documentation/mtd/nand_ecc.rst + * can be found in Documentation/driver-api/mtd/nand_ecc.rst */ #include -- cgit v1.2.3 From e253d2c551ce876a374d533fbcc9e8f31142dcad Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 18 Jun 2019 16:46:30 -0300 Subject: docs: nfc: add it to the driver-api book Most of the descriptions here are oriented to a Kernel developer. 
Signed-off-by: Mauro Carvalho Chehab --- Documentation/driver-api/index.rst | 1 + Documentation/driver-api/nfc/index.rst | 9 + Documentation/driver-api/nfc/nfc-hci.rst | 311 +++++++++++++++++++++++++++++ Documentation/driver-api/nfc/nfc-pn544.rst | 34 ++++ Documentation/nfc/index.rst | 11 - Documentation/nfc/nfc-hci.rst | 311 ----------------------------- Documentation/nfc/nfc-pn544.rst | 34 ---- 7 files changed, 355 insertions(+), 356 deletions(-) create mode 100644 Documentation/driver-api/nfc/index.rst create mode 100644 Documentation/driver-api/nfc/nfc-hci.rst create mode 100644 Documentation/driver-api/nfc/nfc-pn544.rst delete mode 100644 Documentation/nfc/index.rst delete mode 100644 Documentation/nfc/nfc-hci.rst delete mode 100644 Documentation/nfc/nfc-pn544.rst (limited to 'Documentation/driver-api') diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index 7ecc65093493..d6bf4a37cefe 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -56,6 +56,7 @@ available subsections can be seen below. pinctl gpio/index misc_devices + nfc/index dmaengine/index slimbus soundwire/index diff --git a/Documentation/driver-api/nfc/index.rst b/Documentation/driver-api/nfc/index.rst new file mode 100644 index 000000000000..3afb2c0c2e3c --- /dev/null +++ b/Documentation/driver-api/nfc/index.rst @@ -0,0 +1,9 @@ +======================== +Near Field Communication +======================== + +.. 
toctree:: + :maxdepth: 1 + + nfc-hci + nfc-pn544 diff --git a/Documentation/driver-api/nfc/nfc-hci.rst b/Documentation/driver-api/nfc/nfc-hci.rst new file mode 100644 index 000000000000..eb8a1a14e919 --- /dev/null +++ b/Documentation/driver-api/nfc/nfc-hci.rst @@ -0,0 +1,311 @@ +======================== +HCI backend for NFC Core +======================== + +- Author: Eric Lapuyade, Samuel Ortiz +- Contact: eric.lapuyade@intel.com, samuel.ortiz@intel.com + +General +------- + +The HCI layer implements much of the ETSI TS 102 622 V10.2.0 specification. It +enables easy writing of HCI-based NFC drivers. The HCI layer runs as an NFC Core +backend, implementing an abstract nfc device and translating NFC Core API +to HCI commands and events. + +HCI +--- + +HCI registers as an nfc device with NFC Core. Requests coming from userspace are +routed through netlink sockets to NFC Core and then to HCI. From this point, +they are translated in a sequence of HCI commands sent to the HCI layer in the +host controller (the chip). Commands can be executed synchronously (the sending +context blocks waiting for response) or asynchronously (the response is returned +from HCI Rx context). +HCI events can also be received from the host controller. They will be handled +and a translation will be forwarded to NFC Core as needed. There are hooks to +let the HCI driver handle proprietary events or override standard behavior. +HCI uses 2 execution contexts: + +- one for executing commands : nfc_hci_msg_tx_work(). Only one command + can be executing at any given moment. +- one for dispatching received events and commands : nfc_hci_msg_rx_work(). + +HCI Session initialization +-------------------------- + +The Session initialization is an HCI standard which must unfortunately +support proprietary gates. This is the reason why the driver will pass a list +of proprietary gates that must be part of the session. HCI will ensure all +those gates have pipes connected when the hci device is set up. 
+In case the chip supports pre-opened gates and pseudo-static pipes, the driver +can pass that information to HCI core. + +HCI Gates and Pipes +------------------- + +A gate defines the 'port' where some service can be found. In order to access +a service, one must create a pipe to that gate and open it. In this +implementation, pipes are totally hidden. The public API only knows gates. +This is consistent with the driver need to send commands to proprietary gates +without knowing the pipe connected to it. + +Driver interface +---------------- + +A driver is generally written in two parts : the physical link management and +the HCI management. This makes it easier to maintain a driver for a chip that +can be connected using various phy (i2c, spi, ...) + +HCI Management +-------------- + +A driver would normally register itself with HCI and provide the following +entry points:: + + struct nfc_hci_ops { + int (*open)(struct nfc_hci_dev *hdev); + void (*close)(struct nfc_hci_dev *hdev); + int (*hci_ready) (struct nfc_hci_dev *hdev); + int (*xmit) (struct nfc_hci_dev *hdev, struct sk_buff *skb); + int (*start_poll) (struct nfc_hci_dev *hdev, + u32 im_protocols, u32 tm_protocols); + int (*dep_link_up)(struct nfc_hci_dev *hdev, struct nfc_target *target, + u8 comm_mode, u8 *gb, size_t gb_len); + int (*dep_link_down)(struct nfc_hci_dev *hdev); + int (*target_from_gate) (struct nfc_hci_dev *hdev, u8 gate, + struct nfc_target *target); + int (*complete_target_discovered) (struct nfc_hci_dev *hdev, u8 gate, + struct nfc_target *target); + int (*im_transceive) (struct nfc_hci_dev *hdev, + struct nfc_target *target, struct sk_buff *skb, + data_exchange_cb_t cb, void *cb_context); + int (*tm_send)(struct nfc_hci_dev *hdev, struct sk_buff *skb); + int (*check_presence)(struct nfc_hci_dev *hdev, + struct nfc_target *target); + int (*event_received)(struct nfc_hci_dev *hdev, u8 gate, u8 event, + struct sk_buff *skb); + }; + +- open() and close() shall turn the hardware on and off. 
+- hci_ready() is an optional entry point that is called right after the hci + session has been set up. The driver can use it to do additional initialization + that must be performed using HCI commands. +- xmit() shall simply write a frame to the physical link. +- start_poll() is an optional entrypoint that shall set the hardware in polling + mode. This must be implemented only if the hardware uses proprietary gates or a + mechanism slightly different from the HCI standard. +- dep_link_up() is called after a p2p target has been detected, to finish + the p2p connection setup with hardware parameters that need to be passed back + to nfc core. +- dep_link_down() is called to bring the p2p link down. +- target_from_gate() is an optional entrypoint to return the nfc protocols + corresponding to a proprietary gate. +- complete_target_discovered() is an optional entry point to let the driver + perform additional proprietary processing necessary to auto activate the + discovered target. +- im_transceive() must be implemented by the driver if proprietary HCI commands + are required to send data to the tag. Some tag types will require custom + commands, others can be written to using the standard HCI commands. The driver + can check the tag type and either do proprietary processing, or return 1 to ask + for standard processing. The data exchange command itself must be sent + asynchronously. +- tm_send() is called to send data in the case of a p2p connection +- check_presence() is an optional entry point that will be called regularly + by the core to check that an activated tag is still in the field. If this is + not implemented, the core will not be able to push tag_lost events to the user + space +- event_received() is called to handle an event coming from the chip. Driver + can handle the event or return 1 to let HCI attempt standard processing. + +On the rx path, the driver is responsible to push incoming HCP frames to HCI +using nfc_hci_recv_frame(). 
HCI will take care of re-aggregation and handling. +This must be done from a context that can sleep. + +PHY Management +-------------- + +The physical link (i2c, ...) management is defined by the following structure:: + + struct nfc_phy_ops { + int (*write)(void *dev_id, struct sk_buff *skb); + int (*enable)(void *dev_id); + void (*disable)(void *dev_id); + }; + +enable(): + turn the phy on (power on), make it ready to transfer data +disable(): + turn the phy off +write(): + Send a data frame to the chip. Note that to enable higher + layers such as an llc to store the frame for re-emission, this + function must not alter the skb. It must also not return a positive + result (return 0 for success, negative for failure). + +Data coming from the chip shall be sent directly to nfc_hci_recv_frame(). + +LLC +--- + +Communication between the CPU and the chip often requires some link layer +protocol. Those are isolated as modules managed by the HCI layer. There are +currently two modules : nop (raw transfer) and shdlc. +A new llc must implement the following functions:: + + struct nfc_llc_ops { + void *(*init) (struct nfc_hci_dev *hdev, xmit_to_drv_t xmit_to_drv, + rcv_to_hci_t rcv_to_hci, int tx_headroom, + int tx_tailroom, int *rx_headroom, int *rx_tailroom, + llc_failure_t llc_failure); + void (*deinit) (struct nfc_llc *llc); + int (*start) (struct nfc_llc *llc); + int (*stop) (struct nfc_llc *llc); + void (*rcv_from_drv) (struct nfc_llc *llc, struct sk_buff *skb); + int (*xmit_from_hci) (struct nfc_llc *llc, struct sk_buff *skb); + }; + +init(): + allocate and init your private storage +deinit(): + cleanup +start(): + establish the logical connection +stop(): + terminate the logical connection +rcv_from_drv(): + handle data coming from the chip, going to HCI +xmit_from_hci(): + handle data sent by HCI, going to the chip + +The llc must be registered with nfc before it can be used. 
Do that by +calling:: + + nfc_llc_register(const char *name, struct nfc_llc_ops *ops); + +Again, note that the llc does not handle the physical link. It is thus very +easy to mix any physical link with any llc for a given chip driver. + +Included Drivers +---------------- + +An HCI based driver for an NXP PN544, connected through I2C bus, and using +shdlc is included. + +Execution Contexts +------------------ + +The execution contexts are the following: +- IRQ handler (IRQH): +fast, cannot sleep. Sends incoming frames to HCI where they are passed to +the current llc. In case of shdlc, the frame is queued in shdlc rx queue. + +- SHDLC State Machine worker (SMW) + + Only when llc_shdlc is used: handles shdlc rx & tx queues. + + Dispatches HCI cmd responses. + +- HCI Tx Cmd worker (MSGTXWQ) + + Serializes execution of HCI commands. + + Completes execution in case of response timeout. + +- HCI Rx worker (MSGRXWQ) + + Dispatches incoming HCI commands or events. + +- Syscall context from a userspace call (SYSCALL) + + Any entrypoint in HCI called from NFC Core + +Workflow executing an HCI command (using shdlc) +----------------------------------------------- + +Executing an HCI command can easily be performed synchronously using the +following API:: + + int nfc_hci_send_cmd (struct nfc_hci_dev *hdev, u8 gate, u8 cmd, + const u8 *param, size_t param_len, struct sk_buff **skb) + +The API must be invoked from a context that can sleep. Most of the time, this +will be the syscall context. skb will return the result that was received in +the response. + +Internally, execution is asynchronous. So all this API does is to enqueue the +HCI command, set up a local wait queue on the stack, and wait_event() for completion. +The wait is not interruptible because it is guaranteed that the command will +complete after some short timeout anyway. + +MSGTXWQ context will then be scheduled and invoke nfc_hci_msg_tx_work(). 
+This function will dequeue the next pending command and send its HCP fragments +to the lower layer which happens to be shdlc. It will then start a timer to be +able to complete the command with a timeout error if no response arrive. + +SMW context gets scheduled and invokes nfc_shdlc_sm_work(). This function +handles shdlc framing in and out. It uses the driver xmit to send frames and +receives incoming frames in an skb queue filled from the driver IRQ handler. +SHDLC I(nformation) frames payload are HCP fragments. They are aggregated to +form complete HCI frames, which can be a response, command, or event. + +HCI Responses are dispatched immediately from this context to unblock +waiting command execution. Response processing involves invoking the completion +callback that was provided by nfc_hci_msg_tx_work() when it sent the command. +The completion callback will then wake the syscall context. + +It is also possible to execute the command asynchronously using this API:: + + static int nfc_hci_execute_cmd_async(struct nfc_hci_dev *hdev, u8 pipe, u8 cmd, + const u8 *param, size_t param_len, + data_exchange_cb_t cb, void *cb_context) + +The workflow is the same, except that the API call returns immediately, and +the callback will be called with the result from the SMW context. + +Workflow receiving an HCI event or command +------------------------------------------ + +HCI commands or events are not dispatched from SMW context. Instead, they are +queued to HCI rx_queue and will be dispatched from HCI rx worker +context (MSGRXWQ). This is done this way to allow a cmd or event handler +to also execute other commands (for example, handling the +NFC_HCI_EVT_TARGET_DISCOVERED event from PN544 requires to issue an +ANY_GET_PARAMETER to the reader A gate to get information on the target +that was discovered). + +Typically, such an event will be propagated to NFC Core from MSGRXWQ context. 
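The dispatch rule in the two workflow sections above (responses are completed directly from SMW context, while commands and events are queued and handled later from MSGRXWQ) can be modelled in plain C. This is a minimal single-threaded userspace sketch, not the kernel implementation: every name in it (`struct model_rx`, `model_smw_dispatch()`, `model_rx_work()`, `model_handle()`) is hypothetical, and the real code uses sk_buff queues and workqueues instead of a fixed array.

```c
/*
 * Userspace model of the rx dispatch rule; not kernel code.
 * All names here (msg_type, model_*) are hypothetical.
 */
#include <assert.h>
#include <stddef.h>

enum msg_type { MSG_RESPONSE, MSG_COMMAND, MSG_EVENT };

struct msg {
	enum msg_type type;
	int payload;
};

#define RX_QUEUE_CAP 8

struct model_rx {
	struct msg rx_queue[RX_QUEUE_CAP];	/* stands in for HCI rx_queue */
	size_t rx_head, rx_count;
	int cmd_pending;	/* a command is waiting for its response */
	int cmd_result;		/* filled in by the completion path */
	int events_handled;	/* counts handler invocations */
};

/* SMW side: called once per fully re-aggregated HCI frame. */
static void model_smw_dispatch(struct model_rx *h, struct msg m)
{
	if (m.type == MSG_RESPONSE) {
		/* Responses complete the waiting command right away. */
		if (h->cmd_pending) {
			h->cmd_result = m.payload;
			h->cmd_pending = 0;
		}
		return;
	}
	/* Commands and events are only queued here, never handled. */
	assert(h->rx_count < RX_QUEUE_CAP);
	h->rx_queue[(h->rx_head + h->rx_count) % RX_QUEUE_CAP] = m;
	h->rx_count++;
}

/* Handler run from the rx worker; it may issue a command of its own. */
static void model_handle(struct model_rx *h, struct msg m)
{
	(void)m;
	h->cmd_pending = 1;	/* e.g. ask the chip for target details */
	h->events_handled++;
}

/* MSGRXWQ side: drain the queue, run the handlers, return the count. */
static int model_rx_work(struct model_rx *h)
{
	int n = 0;

	while (h->rx_count > 0) {
		struct msg m = h->rx_queue[h->rx_head];

		h->rx_head = (h->rx_head + 1) % RX_QUEUE_CAP;
		h->rx_count--;
		model_handle(h, m);
		n++;
	}
	return n;
}
```

In this model the handler can start a new command because it never runs inside `model_smw_dispatch()`, which must stay free to deliver the response to that command.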
+ +Error management +---------------- + +Errors that occur synchronously with the execution of an NFC Core request are +simply returned as the execution result of the request. These are easy. + +Errors that occur asynchronously (e.g. in a background protocol handling thread) +must be reported such that upper layers don't stay ignorant that something +went wrong below and know that expected events will probably never happen. +Handling of these errors is done as follows: + +- driver (pn544) fails to deliver an incoming frame: it stores the error such + that any subsequent call to the driver will result in this error. Then it + calls the standard nfc_shdlc_recv_frame() with a NULL argument to report the + problem above. shdlc stores a EREMOTEIO sticky status, which will trigger + SMW to report above in turn. + +- SMW is basically a background thread to handle incoming and outgoing shdlc + frames. This thread will also check the shdlc sticky status and report to HCI + when it discovers it is not able to run anymore because of an unrecoverable + error that happened within shdlc or below. If the problem occurs during shdlc + connection, the error is reported through the connect completion. + +- HCI: if an internal HCI error happens (frame is lost), or HCI is reported an + error from a lower layer, HCI will either complete the currently executing + command with that error, or notify NFC Core directly if no command is + executing. + +- NFC Core: when NFC Core is notified of an error from below and polling is + active, it will send a tag discovered event with an empty tag list to the user + space to let it know that the poll operation will never be able to detect a + tag. If polling is not active and the error was sticky, lower levels will + return it at next invocation. 
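The sticky error behaviour listed above can be illustrated with a small userspace model in C. It is only a sketch under stated assumptions: `struct model_phy`, `struct model_hci` and the three `model_*()` functions are hypothetical names and do not exist in the kernel. The sketch shows two of the rules: a driver that has failed keeps returning the stored error on every later call, and HCI either completes the executing command with the error or notifies NFC Core when no command is executing.

```c
/*
 * Userspace model of the sticky error flow; not kernel code.
 * The struct and function names are hypothetical.
 */
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#ifndef EREMOTEIO
#define EREMOTEIO 121	/* Linux value, for non-Linux build hosts */
#endif

struct model_phy {
	int hard_fault;		/* 0, or a negative errno once the phy failed */
};

struct model_hci {
	int cmd_executing;	/* non-zero while a command awaits completion */
	int cmd_result;		/* result handed to the command's caller */
	int core_notified;	/* error passed up to NFC Core, 0 if none */
};

/* Driver side: record the first failure so every later call reports it. */
static void model_phy_fail(struct model_phy *phy, int err)
{
	assert(err < 0);
	if (!phy->hard_fault)
		phy->hard_fault = err;
}

/* Driver write path: 0 on success, negative on failure, never positive. */
static int model_phy_write(struct model_phy *phy, const void *buf, size_t len)
{
	(void)buf;
	(void)len;
	if (phy->hard_fault)
		return phy->hard_fault;
	return 0;
}

/* HCI side: complete the running command, else notify the core directly. */
static void model_hci_failure(struct model_hci *hci, int err)
{
	if (hci->cmd_executing) {
		hci->cmd_result = err;
		hci->cmd_executing = 0;
	} else {
		hci->core_notified = err;
	}
}
```

Calling `model_phy_fail()` once with `-EREMOTEIO` makes every following `model_phy_write()` return `-EREMOTEIO`, which mirrors the "any subsequent call to the driver will result in this error" rule.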
diff --git a/Documentation/driver-api/nfc/nfc-pn544.rst b/Documentation/driver-api/nfc/nfc-pn544.rst new file mode 100644 index 000000000000..6b2d8aae0c4e --- /dev/null +++ b/Documentation/driver-api/nfc/nfc-pn544.rst @@ -0,0 +1,34 @@ +============================================================================ +Kernel driver for the NXP Semiconductors PN544 Near Field Communication chip +============================================================================ + + +General +------- + +The PN544 is an integrated transmission module for contactless +communication. The driver goes under drives/nfc/ and is compiled as a +module named "pn544". + +Host Interfaces: I2C, SPI and HSU, this driver supports currently only I2C. + +Protocols +--------- + +In the normal (HCI) mode and in the firmware update mode read and +write functions behave a bit differently because the message formats +or the protocols are different. + +In the normal (HCI) mode the protocol used is derived from the ETSI +HCI specification. The firmware is updated using a specific protocol, +which is different from HCI. + +HCI messages consist of an eight bit header and the message body. The +header contains the message length. Maximum size for an HCI message is +33. In HCI mode sent messages are tested for a correct +checksum. Firmware update messages have the length in the second (MSB) +and third (LSB) bytes of the message. The maximum FW message length is +1024 bytes. + +For the ETSI HCI specification see +http://www.etsi.org/WebSite/Technologies/ProtocolSpecification.aspx diff --git a/Documentation/nfc/index.rst b/Documentation/nfc/index.rst deleted file mode 100644 index 4f4947fce80d..000000000000 --- a/Documentation/nfc/index.rst +++ /dev/null @@ -1,11 +0,0 @@ -:orphan: - -======================== -Near Field Communication -======================== - -.. 
toctree:: - :maxdepth: 1 - - nfc-hci - nfc-pn544 diff --git a/Documentation/nfc/nfc-hci.rst b/Documentation/nfc/nfc-hci.rst deleted file mode 100644 index eb8a1a14e919..000000000000 --- a/Documentation/nfc/nfc-hci.rst +++ /dev/null @@ -1,311 +0,0 @@ -======================== -HCI backend for NFC Core -======================== - -- Author: Eric Lapuyade, Samuel Ortiz -- Contact: eric.lapuyade@intel.com, samuel.ortiz@intel.com - -General -------- - -The HCI layer implements much of the ETSI TS 102 622 V10.2.0 specification. It -enables easy writing of HCI-based NFC drivers. The HCI layer runs as an NFC Core -backend, implementing an abstract nfc device and translating NFC Core API -to HCI commands and events. - -HCI ---- - -HCI registers as an nfc device with NFC Core. Requests coming from userspace are -routed through netlink sockets to NFC Core and then to HCI. From this point, -they are translated in a sequence of HCI commands sent to the HCI layer in the -host controller (the chip). Commands can be executed synchronously (the sending -context blocks waiting for response) or asynchronously (the response is returned -from HCI Rx context). -HCI events can also be received from the host controller. They will be handled -and a translation will be forwarded to NFC Core as needed. There are hooks to -let the HCI driver handle proprietary events or override standard behavior. -HCI uses 2 execution contexts: - -- one for executing commands : nfc_hci_msg_tx_work(). Only one command - can be executing at any given moment. -- one for dispatching received events and commands : nfc_hci_msg_rx_work(). - -HCI Session initialization --------------------------- - -The Session initialization is an HCI standard which must unfortunately -support proprietary gates. This is the reason why the driver will pass a list -of proprietary gates that must be part of the session. HCI will ensure all -those gates have pipes connected when the hci device is set up. 
-In case the chip supports pre-opened gates and pseudo-static pipes, the driver -can pass that information to HCI core. - -HCI Gates and Pipes -------------------- - -A gate defines the 'port' where some service can be found. In order to access -a service, one must create a pipe to that gate and open it. In this -implementation, pipes are totally hidden. The public API only knows gates. -This is consistent with the driver need to send commands to proprietary gates -without knowing the pipe connected to it. - -Driver interface ----------------- - -A driver is generally written in two parts : the physical link management and -the HCI management. This makes it easier to maintain a driver for a chip that -can be connected using various phy (i2c, spi, ...) - -HCI Management --------------- - -A driver would normally register itself with HCI and provide the following -entry points:: - - struct nfc_hci_ops { - int (*open)(struct nfc_hci_dev *hdev); - void (*close)(struct nfc_hci_dev *hdev); - int (*hci_ready) (struct nfc_hci_dev *hdev); - int (*xmit) (struct nfc_hci_dev *hdev, struct sk_buff *skb); - int (*start_poll) (struct nfc_hci_dev *hdev, - u32 im_protocols, u32 tm_protocols); - int (*dep_link_up)(struct nfc_hci_dev *hdev, struct nfc_target *target, - u8 comm_mode, u8 *gb, size_t gb_len); - int (*dep_link_down)(struct nfc_hci_dev *hdev); - int (*target_from_gate) (struct nfc_hci_dev *hdev, u8 gate, - struct nfc_target *target); - int (*complete_target_discovered) (struct nfc_hci_dev *hdev, u8 gate, - struct nfc_target *target); - int (*im_transceive) (struct nfc_hci_dev *hdev, - struct nfc_target *target, struct sk_buff *skb, - data_exchange_cb_t cb, void *cb_context); - int (*tm_send)(struct nfc_hci_dev *hdev, struct sk_buff *skb); - int (*check_presence)(struct nfc_hci_dev *hdev, - struct nfc_target *target); - int (*event_received)(struct nfc_hci_dev *hdev, u8 gate, u8 event, - struct sk_buff *skb); - }; - -- open() and close() shall turn the hardware on and off. 
-- hci_ready() is an optional entry point that is called right after the hci - session has been set up. The driver can use it to do additional initialization - that must be performed using HCI commands. -- xmit() shall simply write a frame to the physical link. -- start_poll() is an optional entrypoint that shall set the hardware in polling - mode. This must be implemented only if the hardware uses proprietary gates or a - mechanism slightly different from the HCI standard. -- dep_link_up() is called after a p2p target has been detected, to finish - the p2p connection setup with hardware parameters that need to be passed back - to nfc core. -- dep_link_down() is called to bring the p2p link down. -- target_from_gate() is an optional entrypoint to return the nfc protocols - corresponding to a proprietary gate. -- complete_target_discovered() is an optional entry point to let the driver - perform additional proprietary processing necessary to auto activate the - discovered target. -- im_transceive() must be implemented by the driver if proprietary HCI commands - are required to send data to the tag. Some tag types will require custom - commands, others can be written to using the standard HCI commands. The driver - can check the tag type and either do proprietary processing, or return 1 to ask - for standard processing. The data exchange command itself must be sent - asynchronously. -- tm_send() is called to send data in the case of a p2p connection -- check_presence() is an optional entry point that will be called regularly - by the core to check that an activated tag is still in the field. If this is - not implemented, the core will not be able to push tag_lost events to the user - space -- event_received() is called to handle an event coming from the chip. Driver - can handle the event or return 1 to let HCI attempt standard processing. - -On the rx path, the driver is responsible to push incoming HCP frames to HCI -using nfc_hci_recv_frame(). 
HCI will take care of re-aggregation and handling -This must be done from a context that can sleep. - -PHY Management --------------- - -The physical link (i2c, ...) management is defined by the following structure:: - - struct nfc_phy_ops { - int (*write)(void *dev_id, struct sk_buff *skb); - int (*enable)(void *dev_id); - void (*disable)(void *dev_id); - }; - -enable(): - turn the phy on (power on), make it ready to transfer data -disable(): - turn the phy off -write(): - Send a data frame to the chip. Note that to enable higher - layers such as an llc to store the frame for re-emission, this - function must not alter the skb. It must also not return a positive - result (return 0 for success, negative for failure). - -Data coming from the chip shall be sent directly to nfc_hci_recv_frame(). - -LLC ---- - -Communication between the CPU and the chip often requires some link layer -protocol. Those are isolated as modules managed by the HCI layer. There are -currently two modules : nop (raw transfert) and shdlc. -A new llc must implement the following functions:: - - struct nfc_llc_ops { - void *(*init) (struct nfc_hci_dev *hdev, xmit_to_drv_t xmit_to_drv, - rcv_to_hci_t rcv_to_hci, int tx_headroom, - int tx_tailroom, int *rx_headroom, int *rx_tailroom, - llc_failure_t llc_failure); - void (*deinit) (struct nfc_llc *llc); - int (*start) (struct nfc_llc *llc); - int (*stop) (struct nfc_llc *llc); - void (*rcv_from_drv) (struct nfc_llc *llc, struct sk_buff *skb); - int (*xmit_from_hci) (struct nfc_llc *llc, struct sk_buff *skb); - }; - -init(): - allocate and init your private storage -deinit(): - cleanup -start(): - establish the logical connection -stop (): - terminate the logical connection -rcv_from_drv(): - handle data coming from the chip, going to HCI -xmit_from_hci(): - handle data sent by HCI, going to the chip - -The llc must be registered with nfc before it can be used. 
Do that by -calling:: - - nfc_llc_register(const char *name, struct nfc_llc_ops *ops); - -Again, note that the llc does not handle the physical link. It is thus very -easy to mix any physical link with any llc for a given chip driver. - -Included Drivers ----------------- - -An HCI based driver for an NXP PN544, connected through I2C bus, and using -shdlc is included. - -Execution Contexts ------------------- - -The execution contexts are the following: -- IRQ handler (IRQH): -fast, cannot sleep. sends incoming frames to HCI where they are passed to -the current llc. In case of shdlc, the frame is queued in shdlc rx queue. - -- SHDLC State Machine worker (SMW) - - Only when llc_shdlc is used: handles shdlc rx & tx queues. - - Dispatches HCI cmd responses. - -- HCI Tx Cmd worker (MSGTXWQ) - - Serializes execution of HCI commands. - - Completes execution in case of response timeout. - -- HCI Rx worker (MSGRXWQ) - - Dispatches incoming HCI commands or events. - -- Syscall context from a userspace call (SYSCALL) - - Any entrypoint in HCI called from NFC Core - -Workflow executing an HCI command (using shdlc) ------------------------------------------------ - -Executing an HCI command can easily be performed synchronously using the -following API:: - - int nfc_hci_send_cmd (struct nfc_hci_dev *hdev, u8 gate, u8 cmd, - const u8 *param, size_t param_len, struct sk_buff **skb) - -The API must be invoked from a context that can sleep. Most of the time, this -will be the syscall context. skb will return the result that was received in -the response. - -Internally, execution is asynchronous. So all this API does is to enqueue the -HCI command, setup a local wait queue on stack, and wait_event() for completion. -The wait is not interruptible because it is guaranteed that the command will -complete after some short timeout anyway. - -MSGTXWQ context will then be scheduled and invoke nfc_hci_msg_tx_work(). 
-This function will dequeue the next pending command and send its HCP fragments -to the lower layer which happens to be shdlc. It will then start a timer to be -able to complete the command with a timeout error if no response arrive. - -SMW context gets scheduled and invokes nfc_shdlc_sm_work(). This function -handles shdlc framing in and out. It uses the driver xmit to send frames and -receives incoming frames in an skb queue filled from the driver IRQ handler. -SHDLC I(nformation) frames payload are HCP fragments. They are aggregated to -form complete HCI frames, which can be a response, command, or event. - -HCI Responses are dispatched immediately from this context to unblock -waiting command execution. Response processing involves invoking the completion -callback that was provided by nfc_hci_msg_tx_work() when it sent the command. -The completion callback will then wake the syscall context. - -It is also possible to execute the command asynchronously using this API:: - - static int nfc_hci_execute_cmd_async(struct nfc_hci_dev *hdev, u8 pipe, u8 cmd, - const u8 *param, size_t param_len, - data_exchange_cb_t cb, void *cb_context) - -The workflow is the same, except that the API call returns immediately, and -the callback will be called with the result from the SMW context. - -Workflow receiving an HCI event or command ------------------------------------------- - -HCI commands or events are not dispatched from SMW context. Instead, they are -queued to HCI rx_queue and will be dispatched from HCI rx worker -context (MSGRXWQ). This is done this way to allow a cmd or event handler -to also execute other commands (for example, handling the -NFC_HCI_EVT_TARGET_DISCOVERED event from PN544 requires to issue an -ANY_GET_PARAMETER to the reader A gate to get information on the target -that was discovered). - -Typically, such an event will be propagated to NFC Core from MSGRXWQ context. 
- -Error management ----------------- - -Errors that occur synchronously with the execution of an NFC Core request are -simply returned as the execution result of the request. These are easy. - -Errors that occur asynchronously (e.g. in a background protocol handling thread) -must be reported such that upper layers don't stay ignorant that something -went wrong below and know that expected events will probably never happen. -Handling of these errors is done as follows: - -- driver (pn544) fails to deliver an incoming frame: it stores the error such - that any subsequent call to the driver will result in this error. Then it - calls the standard nfc_shdlc_recv_frame() with a NULL argument to report the - problem above. shdlc stores a EREMOTEIO sticky status, which will trigger - SMW to report above in turn. - -- SMW is basically a background thread to handle incoming and outgoing shdlc - frames. This thread will also check the shdlc sticky status and report to HCI - when it discovers it is not able to run anymore because of an unrecoverable - error that happened within shdlc or below. If the problem occurs during shdlc - connection, the error is reported through the connect completion. - -- HCI: if an internal HCI error happens (frame is lost), or HCI is reported an - error from a lower layer, HCI will either complete the currently executing - command with that error, or notify NFC Core directly if no command is - executing. - -- NFC Core: when NFC Core is notified of an error from below and polling is - active, it will send a tag discovered event with an empty tag list to the user - space to let it know that the poll operation will never be able to detect a - tag. If polling is not active and the error was sticky, lower levels will - return it at next invocation. 
diff --git a/Documentation/nfc/nfc-pn544.rst b/Documentation/nfc/nfc-pn544.rst deleted file mode 100644 index 6b2d8aae0c4e..000000000000 --- a/Documentation/nfc/nfc-pn544.rst +++ /dev/null @@ -1,34 +0,0 @@ -============================================================================ -Kernel driver for the NXP Semiconductors PN544 Near Field Communication chip -============================================================================ - - -General -------- - -The PN544 is an integrated transmission module for contactless -communication. The driver goes under drives/nfc/ and is compiled as a -module named "pn544". - -Host Interfaces: I2C, SPI and HSU, this driver supports currently only I2C. - -Protocols ---------- - -In the normal (HCI) mode and in the firmware update mode read and -write functions behave a bit differently because the message formats -or the protocols are different. - -In the normal (HCI) mode the protocol used is derived from the ETSI -HCI specification. The firmware is updated using a specific protocol, -which is different from HCI. - -HCI messages consist of an eight bit header and the message body. The -header contains the message length. Maximum size for an HCI message is -33. In HCI mode sent messages are tested for a correct -checksum. Firmware update messages have the length in the second (MSB) -and third (LSB) bytes of the message. The maximum FW message length is -1024 bytes. - -For the ETSI HCI specification see -http://www.etsi.org/WebSite/Technologies/ProtocolSpecification.aspx -- cgit v1.2.3 From 19024c09c243c5107f738286459a0dd85697b089 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 18 Jun 2019 16:48:15 -0300 Subject: docs: mmc: move it to the driver-api Most of the stuff here is related to the kAPI. 
Signed-off-by: Mauro Carvalho Chehab --- Documentation/driver-api/index.rst | 1 + Documentation/driver-api/mmc/index.rst | 11 +++ Documentation/driver-api/mmc/mmc-async-req.rst | 98 ++++++++++++++++++++++++++ Documentation/driver-api/mmc/mmc-dev-attrs.rst | 91 ++++++++++++++++++++++++ Documentation/driver-api/mmc/mmc-dev-parts.rst | 41 +++++++++++ Documentation/driver-api/mmc/mmc-tools.rst | 37 ++++++++++ Documentation/mmc/index.rst | 13 ---- Documentation/mmc/mmc-async-req.rst | 98 -------------------------- Documentation/mmc/mmc-dev-attrs.rst | 91 ------------------------ Documentation/mmc/mmc-dev-parts.rst | 41 ----------- Documentation/mmc/mmc-tools.rst | 37 ---------- 11 files changed, 279 insertions(+), 280 deletions(-) create mode 100644 Documentation/driver-api/mmc/index.rst create mode 100644 Documentation/driver-api/mmc/mmc-async-req.rst create mode 100644 Documentation/driver-api/mmc/mmc-dev-attrs.rst create mode 100644 Documentation/driver-api/mmc/mmc-dev-parts.rst create mode 100644 Documentation/driver-api/mmc/mmc-tools.rst delete mode 100644 Documentation/mmc/index.rst delete mode 100644 Documentation/mmc/mmc-async-req.rst delete mode 100644 Documentation/mmc/mmc-dev-attrs.rst delete mode 100644 Documentation/mmc/mmc-dev-parts.rst delete mode 100644 Documentation/mmc/mmc-tools.rst (limited to 'Documentation/driver-api') diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index d6bf4a37cefe..25f85d3021aa 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -45,6 +45,7 @@ available subsections can be seen below. 
miscellaneous mei/index mtd/index + mmc/index nvdimm/index w1 rapidio/index diff --git a/Documentation/driver-api/mmc/index.rst b/Documentation/driver-api/mmc/index.rst new file mode 100644 index 000000000000..9aaf64951a8c --- /dev/null +++ b/Documentation/driver-api/mmc/index.rst @@ -0,0 +1,11 @@ +======================== +MMC/SD/SDIO card support +======================== + +.. toctree:: + :maxdepth: 1 + + mmc-dev-attrs + mmc-dev-parts + mmc-async-req + mmc-tools diff --git a/Documentation/driver-api/mmc/mmc-async-req.rst b/Documentation/driver-api/mmc/mmc-async-req.rst new file mode 100644 index 000000000000..0f7197c9c3b5 --- /dev/null +++ b/Documentation/driver-api/mmc/mmc-async-req.rst @@ -0,0 +1,98 @@ +======================== +MMC Asynchronous Request +======================== + +Rationale +========= + +How significant is the cache maintenance overhead? + +It depends. Fast eMMC and multiple cache levels with speculative cache +pre-fetch makes the cache overhead relatively significant. If the DMA +preparations for the next request are done in parallel with the current +transfer, the DMA preparation overhead would not affect the MMC performance. + +The intention of non-blocking (asynchronous) MMC requests is to minimize the +time between when an MMC request ends and another MMC request begins. + +Using mmc_wait_for_req(), the MMC controller is idle while dma_map_sg and +dma_unmap_sg are processing. Using non-blocking MMC requests makes it +possible to prepare the caches for next job in parallel with an active +MMC request. + +MMC block driver +================ + +The mmc_blk_issue_rw_rq() in the MMC block driver is made non-blocking. + +The increase in throughput is proportional to the time it takes to +prepare (major part of preparations are dma_map_sg() and dma_unmap_sg()) +a request and how fast the memory is. The faster the MMC/SD is the +more significant the prepare request time becomes. 
Roughly the expected +performance gain is 5% for large writes and 10% on large reads on a L2 cache +platform. In power save mode, when clocks run on a lower frequency, the DMA +preparation may cost even more. As long as these slower preparations are run +in parallel with the transfer performance won't be affected. + +Details on measurements from IOZone and mmc_test +================================================ + +https://wiki.linaro.org/WorkingGroups/Kernel/Specs/StoragePerfMMC-async-req + +MMC core API extension +====================== + +There is one new public function mmc_start_req(). + +It starts a new MMC command request for a host. The function isn't +truly non-blocking. If there is an ongoing async request it waits +for completion of that request and starts the new one and returns. It +doesn't wait for the new request to complete. If there is no ongoing +request it starts the new request and returns immediately. + +MMC host extensions +=================== + +There are two optional members in the mmc_host_ops -- pre_req() and +post_req() -- that the host driver may implement in order to move work +to before and after the actual mmc_host_ops.request() function is called. + +In the DMA case pre_req() may do dma_map_sg() and prepare the DMA +descriptor, and post_req() runs the dma_unmap_sg(). + +Optimize for the first request +============================== + +The first request in a series of requests can't be prepared in parallel +with the previous transfer, since there is no previous request. + +The argument is_first_req in pre_req() indicates that there is no previous +request. The host driver may optimize for this scenario to minimize +the performance loss. A way to optimize for this is to split the current +request in two chunks, prepare the first chunk and start the request, +and finally prepare the second chunk and start the transfer. 
+ +Pseudocode to handle is_first_req scenario with minimal prepare overhead:: + + if (is_first_req && req->size > threshold) + /* start MMC transfer for the complete transfer size */ + mmc_start_command(MMC_CMD_TRANSFER_FULL_SIZE); + + /* + * Begin to prepare DMA while cmd is being processed by MMC. + * The first chunk of the request should take the same time + * to prepare as the "MMC process command time". + * If prepare time exceeds MMC cmd time + * the transfer is delayed, guesstimate max 4k as first chunk size. + */ + prepare_1st_chunk_for_dma(req); + /* flush pending desc to the DMAC (dmaengine.h) */ + dma_issue_pending(req->dma_desc); + + prepare_2nd_chunk_for_dma(req); + /* + * The second issue_pending should be called before MMC runs out + * of the first chunk. If the MMC runs out of the first data chunk + * before this call, the transfer is delayed. + */ + dma_issue_pending(req->dma_desc); diff --git a/Documentation/driver-api/mmc/mmc-dev-attrs.rst b/Documentation/driver-api/mmc/mmc-dev-attrs.rst new file mode 100644 index 000000000000..4f44b1b730d6 --- /dev/null +++ b/Documentation/driver-api/mmc/mmc-dev-attrs.rst @@ -0,0 +1,91 @@ +================================== +SD and MMC Block Device Attributes +================================== + +These attributes are defined for the block devices associated with the +SD or MMC device. + +The following attributes are read/write. + + ======== =============================================== + force_ro Enforce read-only access even if write protect switch is off. + ======== =============================================== + +SD and MMC Device Attributes +============================ + +All attributes are read-only. 
+ + ====================== =============================================== + cid Card Identification Register + csd Card Specific Data Register + scr SD Card Configuration Register (SD only) + date Manufacturing Date (from CID Register) + fwrev Firmware/Product Revision (from CID Register) + (SD and MMCv1 only) + hwrev Hardware/Product Revision (from CID Register) + (SD and MMCv1 only) + manfid Manufacturer ID (from CID Register) + name Product Name (from CID Register) + oemid OEM/Application ID (from CID Register) + prv Product Revision (from CID Register) + (SD and MMCv4 only) + serial Product Serial Number (from CID Register) + erase_size Erase group size + preferred_erase_size Preferred erase size + raw_rpmb_size_mult RPMB partition size + rel_sectors Reliable write sector count + ocr Operation Conditions Register + dsr Driver Stage Register + cmdq_en Command Queue enabled: + + 1 => enabled, 0 => not enabled + ====================== =============================================== + +Note on Erase Size and Preferred Erase Size: + + "erase_size" is the minimum size, in bytes, of an erase + operation. For MMC, "erase_size" is the erase group size + reported by the card. Note that "erase_size" does not apply + to trim or secure trim operations where the minimum size is + always one 512 byte sector. For SD, "erase_size" is 512 + if the card is block-addressed, 0 otherwise. + + SD/MMC cards can erase an arbitrarily large area up to and + including the whole card. When erasing a large area it may + be desirable to do it in smaller chunks for three reasons: + + 1. A single erase command will make all other I/O on + the card wait. This is not a problem if the whole card + is being erased, but erasing one partition will make + I/O for another partition on the same card wait for the + duration of the erase - which could be a several + minutes. + 2. To be able to inform the user of erase progress. + 3. The erase timeout becomes too large to be very + useful. 
Because the erase timeout contains a margin + which is multiplied by the size of the erase area, + the value can end up being several minutes for large + areas. + + "erase_size" is not the most efficient unit to erase + (especially for SD where it is just one sector), + hence "preferred_erase_size" provides a good chunk + size for erasing large areas. + + For MMC, "preferred_erase_size" is the high-capacity + erase size if a card specifies one, otherwise it is + based on the capacity of the card. + + For SD, "preferred_erase_size" is the allocation unit + size specified by the card. + + "preferred_erase_size" is in bytes. + +Note on raw_rpmb_size_mult: + + "raw_rpmb_size_mult" is a multiple of 128kB block. + + RPMB size in byte is calculated by using the following equation: + + RPMB partition size = 128kB x raw_rpmb_size_mult diff --git a/Documentation/driver-api/mmc/mmc-dev-parts.rst b/Documentation/driver-api/mmc/mmc-dev-parts.rst new file mode 100644 index 000000000000..995922f1f744 --- /dev/null +++ b/Documentation/driver-api/mmc/mmc-dev-parts.rst @@ -0,0 +1,41 @@ +============================ +SD and MMC Device Partitions +============================ + +Device partitions are additional logical block devices present on the +SD/MMC device. + +As of this writing, MMC boot partitions as supported and exposed as +/dev/mmcblkXboot0 and /dev/mmcblkXboot1, where X is the index of the +parent /dev/mmcblkX. + +MMC Boot Partitions +=================== + +Read and write access is provided to the two MMC boot partitions. Due to +the sensitive nature of the boot partition contents, which often store +a bootloader or bootloader configuration tables crucial to booting the +platform, write access is disabled by default to reduce the chance of +accidental bricking. 
+ +To enable write access to /dev/mmcblkXbootY, disable the forced read-only +access with:: + + echo 0 > /sys/block/mmcblkXbootY/force_ro + +To re-enable read-only access:: + + echo 1 > /sys/block/mmcblkXbootY/force_ro + +The boot partitions can also be locked read only until the next power on, +with:: + + echo 1 > /sys/block/mmcblkXbootY/ro_lock_until_next_power_on + +This is a feature of the card and not of the kernel. If the card does +not support boot partition locking, the file will not exist. If the +feature has been disabled on the card, the file will be read-only. + +The boot partitions can also be locked permanently, but this feature is +not accessible through sysfs in order to avoid accidental or malicious +bricking. diff --git a/Documentation/driver-api/mmc/mmc-tools.rst b/Documentation/driver-api/mmc/mmc-tools.rst new file mode 100644 index 000000000000..54406093768b --- /dev/null +++ b/Documentation/driver-api/mmc/mmc-tools.rst @@ -0,0 +1,37 @@ +====================== +MMC tools introduction +====================== + +There is one MMC test tools called mmc-utils, which is maintained by Chris Ball, +you can find it at the below public git repository: + + http://git.kernel.org/cgit/linux/kernel/git/cjb/mmc-utils.git/ + +Functions +========= + +The mmc-utils tools can do the following: + + - Print and parse extcsd data. + - Determine the eMMC writeprotect status. + - Set the eMMC writeprotect status. + - Set the eMMC data sector size to 4KB by disabling emulation. + - Create general purpose partition. + - Enable the enhanced user area. + - Enable write reliability per partition. + - Print the response to STATUS_SEND (CMD13). + - Enable the boot partition. + - Set Boot Bus Conditions. + - Enable the eMMC BKOPS feature. + - Permanently enable the eMMC H/W Reset feature. + - Permanently disable the eMMC H/W Reset feature. + - Send Sanitize command. + - Program authentication key for the device. + - Counter value for the rpmb device will be read to stdout. 
+ - Read from rpmb device to output. + - Write to rpmb device from data file. + - Enable the eMMC cache feature. + - Disable the eMMC cache feature. + - Print and parse CID data. + - Print and parse CSD data. + - Print and parse SCR data. diff --git a/Documentation/mmc/index.rst b/Documentation/mmc/index.rst deleted file mode 100644 index 3305478ddadb..000000000000 --- a/Documentation/mmc/index.rst +++ /dev/null @@ -1,13 +0,0 @@ -:orphan: - -======================== -MMC/SD/SDIO card support -======================== - -.. toctree:: - :maxdepth: 1 - - mmc-dev-attrs - mmc-dev-parts - mmc-async-req - mmc-tools diff --git a/Documentation/mmc/mmc-async-req.rst b/Documentation/mmc/mmc-async-req.rst deleted file mode 100644 index 0f7197c9c3b5..000000000000 --- a/Documentation/mmc/mmc-async-req.rst +++ /dev/null @@ -1,98 +0,0 @@ -======================== -MMC Asynchronous Request -======================== - -Rationale -========= - -How significant is the cache maintenance overhead? - -It depends. Fast eMMC and multiple cache levels with speculative cache -pre-fetch makes the cache overhead relatively significant. If the DMA -preparations for the next request are done in parallel with the current -transfer, the DMA preparation overhead would not affect the MMC performance. - -The intention of non-blocking (asynchronous) MMC requests is to minimize the -time between when an MMC request ends and another MMC request begins. - -Using mmc_wait_for_req(), the MMC controller is idle while dma_map_sg and -dma_unmap_sg are processing. Using non-blocking MMC requests makes it -possible to prepare the caches for next job in parallel with an active -MMC request. - -MMC block driver -================ - -The mmc_blk_issue_rw_rq() in the MMC block driver is made non-blocking. - -The increase in throughput is proportional to the time it takes to -prepare (major part of preparations are dma_map_sg() and dma_unmap_sg()) -a request and how fast the memory is. 
The faster the MMC/SD is the -more significant the prepare request time becomes. Roughly the expected -performance gain is 5% for large writes and 10% on large reads on a L2 cache -platform. In power save mode, when clocks run on a lower frequency, the DMA -preparation may cost even more. As long as these slower preparations are run -in parallel with the transfer performance won't be affected. - -Details on measurements from IOZone and mmc_test -================================================ - -https://wiki.linaro.org/WorkingGroups/Kernel/Specs/StoragePerfMMC-async-req - -MMC core API extension -====================== - -There is one new public function mmc_start_req(). - -It starts a new MMC command request for a host. The function isn't -truly non-blocking. If there is an ongoing async request it waits -for completion of that request and starts the new one and returns. It -doesn't wait for the new request to complete. If there is no ongoing -request it starts the new request and returns immediately. - -MMC host extensions -=================== - -There are two optional members in the mmc_host_ops -- pre_req() and -post_req() -- that the host driver may implement in order to move work -to before and after the actual mmc_host_ops.request() function is called. - -In the DMA case pre_req() may do dma_map_sg() and prepare the DMA -descriptor, and post_req() runs the dma_unmap_sg(). - -Optimize for the first request -============================== - -The first request in a series of requests can't be prepared in parallel -with the previous transfer, since there is no previous request. - -The argument is_first_req in pre_req() indicates that there is no previous -request. The host driver may optimize for this scenario to minimize -the performance loss. A way to optimize for this is to split the current -request in two chunks, prepare the first chunk and start the request, -and finally prepare the second chunk and start the transfer. 
- -Pseudocode to handle is_first_req scenario with minimal prepare overhead:: - - if (is_first_req && req->size > threshold) - /* start MMC transfer for the complete transfer size */ - mmc_start_command(MMC_CMD_TRANSFER_FULL_SIZE); - - /* - * Begin to prepare DMA while cmd is being processed by MMC. - * The first chunk of the request should take the same time - * to prepare as the "MMC process command time". - * If prepare time exceeds MMC cmd time - * the transfer is delayed, guesstimate max 4k as first chunk size. - */ - prepare_1st_chunk_for_dma(req); - /* flush pending desc to the DMAC (dmaengine.h) */ - dma_issue_pending(req->dma_desc); - - prepare_2nd_chunk_for_dma(req); - /* - * The second issue_pending should be called before MMC runs out - * of the first chunk. If the MMC runs out of the first data chunk - * before this call, the transfer is delayed. - */ - dma_issue_pending(req->dma_desc); diff --git a/Documentation/mmc/mmc-dev-attrs.rst b/Documentation/mmc/mmc-dev-attrs.rst deleted file mode 100644 index 4f44b1b730d6..000000000000 --- a/Documentation/mmc/mmc-dev-attrs.rst +++ /dev/null @@ -1,91 +0,0 @@ -================================== -SD and MMC Block Device Attributes -================================== - -These attributes are defined for the block devices associated with the -SD or MMC device. - -The following attributes are read/write. - - ======== =============================================== - force_ro Enforce read-only access even if write protect switch is off. - ======== =============================================== - -SD and MMC Device Attributes -============================ - -All attributes are read-only. 
- - ====================== =============================================== - cid Card Identification Register - csd Card Specific Data Register - scr SD Card Configuration Register (SD only) - date Manufacturing Date (from CID Register) - fwrev Firmware/Product Revision (from CID Register) - (SD and MMCv1 only) - hwrev Hardware/Product Revision (from CID Register) - (SD and MMCv1 only) - manfid Manufacturer ID (from CID Register) - name Product Name (from CID Register) - oemid OEM/Application ID (from CID Register) - prv Product Revision (from CID Register) - (SD and MMCv4 only) - serial Product Serial Number (from CID Register) - erase_size Erase group size - preferred_erase_size Preferred erase size - raw_rpmb_size_mult RPMB partition size - rel_sectors Reliable write sector count - ocr Operation Conditions Register - dsr Driver Stage Register - cmdq_en Command Queue enabled: - - 1 => enabled, 0 => not enabled - ====================== =============================================== - -Note on Erase Size and Preferred Erase Size: - - "erase_size" is the minimum size, in bytes, of an erase - operation. For MMC, "erase_size" is the erase group size - reported by the card. Note that "erase_size" does not apply - to trim or secure trim operations where the minimum size is - always one 512 byte sector. For SD, "erase_size" is 512 - if the card is block-addressed, 0 otherwise. - - SD/MMC cards can erase an arbitrarily large area up to and - including the whole card. When erasing a large area it may - be desirable to do it in smaller chunks for three reasons: - - 1. A single erase command will make all other I/O on - the card wait. This is not a problem if the whole card - is being erased, but erasing one partition will make - I/O for another partition on the same card wait for the - duration of the erase - which could be a several - minutes. - 2. To be able to inform the user of erase progress. - 3. The erase timeout becomes too large to be very - useful. 
Because the erase timeout contains a margin - which is multiplied by the size of the erase area, - the value can end up being several minutes for large - areas. - - "erase_size" is not the most efficient unit to erase - (especially for SD where it is just one sector), - hence "preferred_erase_size" provides a good chunk - size for erasing large areas. - - For MMC, "preferred_erase_size" is the high-capacity - erase size if a card specifies one, otherwise it is - based on the capacity of the card. - - For SD, "preferred_erase_size" is the allocation unit - size specified by the card. - - "preferred_erase_size" is in bytes. - -Note on raw_rpmb_size_mult: - - "raw_rpmb_size_mult" is a multiple of 128kB block. - - RPMB size in byte is calculated by using the following equation: - - RPMB partition size = 128kB x raw_rpmb_size_mult diff --git a/Documentation/mmc/mmc-dev-parts.rst b/Documentation/mmc/mmc-dev-parts.rst deleted file mode 100644 index 995922f1f744..000000000000 --- a/Documentation/mmc/mmc-dev-parts.rst +++ /dev/null @@ -1,41 +0,0 @@ -============================ -SD and MMC Device Partitions -============================ - -Device partitions are additional logical block devices present on the -SD/MMC device. - -As of this writing, MMC boot partitions as supported and exposed as -/dev/mmcblkXboot0 and /dev/mmcblkXboot1, where X is the index of the -parent /dev/mmcblkX. - -MMC Boot Partitions -=================== - -Read and write access is provided to the two MMC boot partitions. Due to -the sensitive nature of the boot partition contents, which often store -a bootloader or bootloader configuration tables crucial to booting the -platform, write access is disabled by default to reduce the chance of -accidental bricking. 
- -To enable write access to /dev/mmcblkXbootY, disable the forced read-only -access with:: - - echo 0 > /sys/block/mmcblkXbootY/force_ro - -To re-enable read-only access:: - - echo 1 > /sys/block/mmcblkXbootY/force_ro - -The boot partitions can also be locked read only until the next power on, -with:: - - echo 1 > /sys/block/mmcblkXbootY/ro_lock_until_next_power_on - -This is a feature of the card and not of the kernel. If the card does -not support boot partition locking, the file will not exist. If the -feature has been disabled on the card, the file will be read-only. - -The boot partitions can also be locked permanently, but this feature is -not accessible through sysfs in order to avoid accidental or malicious -bricking. diff --git a/Documentation/mmc/mmc-tools.rst b/Documentation/mmc/mmc-tools.rst deleted file mode 100644 index 54406093768b..000000000000 --- a/Documentation/mmc/mmc-tools.rst +++ /dev/null @@ -1,37 +0,0 @@ -====================== -MMC tools introduction -====================== - -There is one MMC test tools called mmc-utils, which is maintained by Chris Ball, -you can find it at the below public git repository: - - http://git.kernel.org/cgit/linux/kernel/git/cjb/mmc-utils.git/ - -Functions -========= - -The mmc-utils tools can do the following: - - - Print and parse extcsd data. - - Determine the eMMC writeprotect status. - - Set the eMMC writeprotect status. - - Set the eMMC data sector size to 4KB by disabling emulation. - - Create general purpose partition. - - Enable the enhanced user area. - - Enable write reliability per partition. - - Print the response to STATUS_SEND (CMD13). - - Enable the boot partition. - - Set Boot Bus Conditions. - - Enable the eMMC BKOPS feature. - - Permanently enable the eMMC H/W Reset feature. - - Permanently disable the eMMC H/W Reset feature. - - Send Sanitize command. - - Program authentication key for the device. - - Counter value for the rpmb device will be read to stdout. 
- - Read from rpmb device to output. - - Write to rpmb device from data file. - - Enable the eMMC cache feature. - - Disable the eMMC cache feature. - - Print and parse CID data. - - Print and parse CSD data. - - Print and parse SCR data. -- cgit v1.2.3 From c0b11a50aee643ac40ded5dbcd48189ee0926ee4 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 18 Jun 2019 16:50:07 -0300 Subject: docs: md: move it to the driver-api book The docs there were meant to be read by a Kernel developer. Signed-off-by: Mauro Carvalho Chehab --- Documentation/driver-api/index.rst | 1 + Documentation/driver-api/md/index.rst | 10 + Documentation/driver-api/md/md-cluster.rst | 385 ++++++++++++++++++++++++++++ Documentation/driver-api/md/raid5-cache.rst | 111 ++++++++ Documentation/driver-api/md/raid5-ppl.rst | 47 ++++ Documentation/md/index.rst | 12 - Documentation/md/md-cluster.rst | 385 ---------------------------- Documentation/md/raid5-cache.rst | 111 -------- Documentation/md/raid5-ppl.rst | 47 ---- 9 files changed, 554 insertions(+), 555 deletions(-) create mode 100644 Documentation/driver-api/md/index.rst create mode 100644 Documentation/driver-api/md/md-cluster.rst create mode 100644 Documentation/driver-api/md/raid5-cache.rst create mode 100644 Documentation/driver-api/md/raid5-ppl.rst delete mode 100644 Documentation/md/index.rst delete mode 100644 Documentation/md/md-cluster.rst delete mode 100644 Documentation/md/raid5-cache.rst delete mode 100644 Documentation/md/raid5-ppl.rst (limited to 'Documentation/driver-api') diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index 25f85d3021aa..b5179bf2ada2 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -56,6 +56,7 @@ available subsections can be seen below. 
firmware/index pinctl gpio/index + md/index misc_devices nfc/index dmaengine/index diff --git a/Documentation/driver-api/md/index.rst b/Documentation/driver-api/md/index.rst new file mode 100644 index 000000000000..205080891a1a --- /dev/null +++ b/Documentation/driver-api/md/index.rst @@ -0,0 +1,10 @@ +==== +RAID +==== + +.. toctree:: + :maxdepth: 1 + + md-cluster + raid5-cache + raid5-ppl diff --git a/Documentation/driver-api/md/md-cluster.rst b/Documentation/driver-api/md/md-cluster.rst new file mode 100644 index 000000000000..96eb52cec7eb --- /dev/null +++ b/Documentation/driver-api/md/md-cluster.rst @@ -0,0 +1,385 @@ +========== +MD Cluster +========== + +The cluster MD is a shared-device RAID for a cluster, it supports +two levels: raid1 and raid10 (limited support). + + +1. On-disk format +================= + +Separate write-intent-bitmaps are used for each cluster node. +The bitmaps record all writes that may have been started on that node, +and may not yet have finished. The on-disk layout is:: + + 0 4k 8k 12k + ------------------------------------------------------------------- + | idle | md super | bm super [0] + bits | + | bm bits[0, contd] | bm super[1] + bits | bm bits[1, contd] | + | bm super[2] + bits | bm bits [2, contd] | bm super[3] + bits | + | bm bits [3, contd] | | | + +During "normal" functioning we assume the filesystem ensures that only +one node writes to any given block at a time, so a write request will + + - set the appropriate bit (if not already set) + - commit the write to all mirrors + - schedule the bit to be cleared after a timeout. + +Reads are just handled normally. It is up to the filesystem to ensure +one node doesn't read from a location where another node (or the same +node) is writing. + + +2. 
DLM Locks for management +=========================== + +There are three groups of locks for managing the device: + +2.1 Bitmap lock resource (bm_lockres) +------------------------------------- + + The bm_lockres protects individual node bitmaps. They are named in + the form bitmap000 for node 1, bitmap001 for node 2 and so on. When a + node joins the cluster, it acquires the lock in PW mode and it stays + so during the lifetime the node is part of the cluster. The lock + resource number is based on the slot number returned by the DLM + subsystem. Since DLM starts node count from one and bitmap slots + start from zero, one is subtracted from the DLM slot number to arrive + at the bitmap slot number. + + The LVB of the bitmap lock for a particular node records the range + of sectors that are being re-synced by that node. No other + node may write to those sectors. This is used when a new nodes + joins the cluster. + +2.2 Message passing locks +------------------------- + + Each node has to communicate with other nodes when starting or ending + resync, and for metadata superblock updates. This communication is + managed through three locks: "token", "message", and "ack", together + with the Lock Value Block (LVB) of one of the "message" lock. + +2.3 new-device management +------------------------- + + A single lock: "no-new-dev" is used to co-ordinate the addition of + new devices - this must be synchronized across the array. + Normally all nodes hold a concurrent-read lock on this device. + +3. Communication +================ + + Messages can be broadcast to all nodes, and the sender waits for all + other nodes to acknowledge the message before proceeding. Only one + message can be processed at a time. + +3.1 Message Types +----------------- + + There are six types of messages which are passed: + +3.1.1 METADATA_UPDATED +^^^^^^^^^^^^^^^^^^^^^^ + + informs other nodes that the metadata has + been updated, and the node must re-read the md superblock. 
This is + performed synchronously. It is primarily used to signal device + failure. + +3.1.2 RESYNCING +^^^^^^^^^^^^^^^ + informs other nodes that a resync is initiated or + ended so that each node may suspend or resume the region. Each + RESYNCING message identifies a range of the devices that the + sending node is about to resync. This overrides any previous + notification from that node: only one ranged can be resynced at a + time per-node. + +3.1.3 NEWDISK +^^^^^^^^^^^^^ + + informs other nodes that a device is being added to + the array. Message contains an identifier for that device. See + below for further details. + +3.1.4 REMOVE +^^^^^^^^^^^^ + + A failed or spare device is being removed from the + array. The slot-number of the device is included in the message. + + 3.1.5 RE_ADD: + + A failed device is being re-activated - the assumption + is that it has been determined to be working again. + + 3.1.6 BITMAP_NEEDS_SYNC: + + If a node is stopped locally but the bitmap + isn't clean, then another node is informed to take the ownership of + resync. + +3.2 Communication mechanism +--------------------------- + + The DLM LVB is used to communicate within nodes of the cluster. There + are three resources used for the purpose: + +3.2.1 token +^^^^^^^^^^^ + The resource which protects the entire communication + system. The node having the token resource is allowed to + communicate. + +3.2.2 message +^^^^^^^^^^^^^ + The lock resource which carries the data to communicate. + +3.2.3 ack +^^^^^^^^^ + + The resource, acquiring which means the message has been + acknowledged by all nodes in the cluster. The BAST of the resource + is used to inform the receiving node that a node wants to + communicate. + +The algorithm is: + + 1. receive status - all nodes have concurrent-reader lock on "ack":: + + sender receiver receiver + "ack":CR "ack":CR "ack":CR + + 2. 
sender get EX on "token", + sender get EX on "message":: + + sender receiver receiver + "token":EX "ack":CR "ack":CR + "message":EX + "ack":CR + + Sender checks that it still needs to send a message. Messages + received or other events that happened while waiting for the + "token" may have made this message inappropriate or redundant. + + 3. sender writes LVB + + sender down-convert "message" from EX to CW + + sender try to get EX of "ack" + + :: + + [ wait until all receivers have *processed* the "message" ] + + [ triggered by bast of "ack" ] + receiver get CR on "message" + receiver read LVB + receiver processes the message + [ wait finish ] + receiver releases "ack" + receiver tries to get PR on "message" + + sender receiver receiver + "token":EX "message":CR "message":CR + "message":CW + "ack":EX + + 4. triggered by grant of EX on "ack" (indicating all receivers + have processed message) + + sender down-converts "ack" from EX to CR + + sender releases "message" + + sender releases "token" + + :: + + receiver upconvert to PR on "message" + receiver get CR of "ack" + receiver release "message" + + sender receiver receiver + "ack":CR "ack":CR "ack":CR + + +4. Handling Failures +==================== + +4.1 Node Failure +---------------- + + When a node fails, the DLM informs the cluster with the slot + number. The node starts a cluster recovery thread. The cluster + recovery thread: + + - acquires the bitmap lock of the failed node + - opens the bitmap + - reads the bitmap of the failed node + - copies the set bitmap to local node + - cleans the bitmap of the failed node + - releases bitmap lock of the failed node + - initiates resync of the bitmap on the current node + md_check_recovery is invoked within recover_bitmaps, + then md_check_recovery -> metadata_update_start/finish, + it will lock the communication by lock_comm. + Which means when one node is resyncing it blocks all + other nodes from writing anywhere on the array. 
+ + The resync process is the regular md resync. However, in a clustered + environment when a resync is performed, it needs to tell other nodes + of the areas which are suspended. Before a resync starts, the node + send out RESYNCING with the (lo,hi) range of the area which needs to + be suspended. Each node maintains a suspend_list, which contains the + list of ranges which are currently suspended. On receiving RESYNCING, + the node adds the range to the suspend_list. Similarly, when the node + performing resync finishes, it sends RESYNCING with an empty range to + other nodes and other nodes remove the corresponding entry from the + suspend_list. + + A helper function, ->area_resyncing() can be used to check if a + particular I/O range should be suspended or not. + +4.2 Device Failure +================== + + Device failures are handled and communicated with the metadata update + routine. When a node detects a device failure it does not allow + any further writes to that device until the failure has been + acknowledged by all other nodes. + +5. Adding a new Device +---------------------- + + For adding a new device, it is necessary that all nodes "see" the new + device to be added. For this, the following algorithm is used: + + 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues + ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD) + 2. Node 1 sends a NEWDISK message with uuid and slot number + 3. Other nodes issue kobject_uevent_env with uuid and slot number + (Steps 4,5 could be a udev rule) + 4. In userspace, the node searches for the disk, perhaps + using blkid -t SUB_UUID="" + 5. Other nodes issue either of the following depending on whether + the disk was found: + ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and + disc.number set to slot number) + ioctl(CLUSTERED_DISK_NACK) + 6. Other nodes drop lock on "no-new-devs" (CR) if device is found + 7. Node 1 attempts EX lock on "no-new-dev" + 8. 
If node 1 gets the lock, it sends METADATA_UPDATED after + unmarking the disk as SpareLocal + 9. If not (get "no-new-dev" lock), it fails the operation and sends + METADATA_UPDATED. + 10. Other nodes get the information whether a disk is added or not + by the following METADATA_UPDATED. + +6. Module interface +=================== + + There are 17 call-backs which the md core can make to the cluster + module. Understanding these can give a good overview of the whole + process. + +6.1 join(nodes) and leave() +--------------------------- + + These are called when an array is started with a clustered bitmap, + and when the array is stopped. join() ensures the cluster is + available and initializes the various resources. + Only the first 'nodes' nodes in the cluster can use the array. + +6.2 slot_number() +----------------- + + Reports the slot number advised by the cluster infrastructure. + Range is from 0 to nodes-1. + +6.3 resync_info_update() +------------------------ + + This updates the resync range that is stored in the bitmap lock. + The starting point is updated as the resync progresses. The + end point is always the end of the array. + It does *not* send a RESYNCING message. + +6.4 resync_start(), resync_finish() +----------------------------------- + + These are called when resync/recovery/reshape starts or stops. + They update the resyncing range in the bitmap lock and also + send a RESYNCING message. resync_start reports the whole + array as resyncing, resync_finish reports none of it. + + resync_finish() also sends a BITMAP_NEEDS_SYNC message which + allows some other node to take over. + +6.5 metadata_update_start(), metadata_update_finish(), metadata_update_cancel() +------------------------------------------------------------------------------- + + metadata_update_start is used to get exclusive access to + the metadata. 
If a change is still needed once that access is + gained, metadata_update_finish() will send a METADATA_UPDATE + message to all other nodes, otherwise metadata_update_cancel() + can be used to release the lock. + +6.6 area_resyncing() +-------------------- + + This combines two elements of functionality. + + Firstly, it will check if any node is currently resyncing + anything in a given range of sectors. If any resync is found, + then the caller will avoid writing or read-balancing in that + range. + + Secondly, while node recovery is happening it reports that + all areas are resyncing for READ requests. This avoids races + between the cluster-filesystem and the cluster-RAID handling + a node failure. + +6.7 add_new_disk_start(), add_new_disk_finish(), new_disk_ack() +--------------------------------------------------------------- + + These are used to manage the new-disk protocol described above. + When a new device is added, add_new_disk_start() is called before + it is bound to the array and, if that succeeds, add_new_disk_finish() + is called the device is fully added. + + When a device is added in acknowledgement to a previous + request, or when the device is declared "unavailable", + new_disk_ack() is called. + +6.8 remove_disk() +----------------- + + This is called when a spare or failed device is removed from + the array. It causes a REMOVE message to be send to other nodes. + +6.9 gather_bitmaps() +-------------------- + + This sends a RE_ADD message to all other nodes and then + gathers bitmap information from all bitmaps. This combined + bitmap is then used to recovery the re-added device. + +6.10 lock_all_bitmaps() and unlock_all_bitmaps() +------------------------------------------------ + + These are called when change bitmap to none. 
If a node plans + to clear the cluster raid's bitmap, it need to make sure no other + nodes are using the raid which is achieved by lock all bitmap + locks within the cluster, and also those locks are unlocked + accordingly. + +7. Unsupported features +======================= + +There are somethings which are not supported by cluster MD yet. + +- change array_sectors. diff --git a/Documentation/driver-api/md/raid5-cache.rst b/Documentation/driver-api/md/raid5-cache.rst new file mode 100644 index 000000000000..d7a15f44a7c3 --- /dev/null +++ b/Documentation/driver-api/md/raid5-cache.rst @@ -0,0 +1,111 @@ +================ +RAID 4/5/6 cache +================ + +Raid 4/5/6 could include an extra disk for data cache besides normal RAID +disks. The role of RAID disks isn't changed with the cache disk. The cache disk +caches data to the RAID disks. The cache can be in write-through (supported +since 4.4) or write-back mode (supported since 4.10). mdadm (supported since +3.4) has a new option '--write-journal' to create array with cache. Please +refer to mdadm manual for details. By default (RAID array starts), the cache is +in write-through mode. A user can switch it to write-back mode by:: + + echo "write-back" > /sys/block/md0/md/journal_mode + +And switch it back to write-through mode by:: + + echo "write-through" > /sys/block/md0/md/journal_mode + +In both modes, all writes to the array will hit cache disk first. This means +the cache disk must be fast and sustainable. + +write-through mode +================== + +This mode mainly fixes the 'write hole' issue. For RAID 4/5/6 array, an unclean +shutdown can cause data in some stripes to not be in consistent state, eg, data +and parity don't match. The reason is that a stripe write involves several RAID +disks and it's possible the writes don't hit all RAID disks yet before the +unclean shutdown. We call an array degraded if it has inconsistent data. MD +tries to resync the array to bring it back to normal state. 
But before the +resync completes, any system crash will expose the chance of real data +corruption in the RAID array. This problem is called 'write hole'. + +The write-through cache will cache all data on cache disk first. After the data +is safe on the cache disk, the data will be flushed onto RAID disks. The +two-step write will guarantee MD can recover correct data after unclean +shutdown even the array is degraded. Thus the cache can close the 'write hole'. + +In write-through mode, MD reports IO completion to upper layer (usually +filesystems) after the data is safe on RAID disks, so cache disk failure +doesn't cause data loss. Of course cache disk failure means the array is +exposed to 'write hole' again. + +In write-through mode, the cache disk isn't required to be big. Several +hundreds megabytes are enough. + +write-back mode +=============== + +write-back mode fixes the 'write hole' issue too, since all write data is +cached on cache disk. But the main goal of 'write-back' cache is to speed up +write. If a write crosses all RAID disks of a stripe, we call it full-stripe +write. For non-full-stripe writes, MD must read old data before the new parity +can be calculated. These synchronous reads hurt write throughput. Some writes +which are sequential but not dispatched in the same time will suffer from this +overhead too. Write-back cache will aggregate the data and flush the data to +RAID disks only after the data becomes a full stripe write. This will +completely avoid the overhead, so it's very helpful for some workloads. A +typical workload which does sequential write followed by fsync is an example. + +In write-back mode, MD reports IO completion to upper layer (usually +filesystems) right after the data hits cache disk. The data is flushed to raid +disks later after specific conditions met. So cache disk failure will cause +data loss. + +In write-back mode, MD also caches data in memory. 
The memory cache includes +the same data stored on cache disk, so a power loss doesn't cause data loss. +The memory cache size has performance impact for the array. It's recommended +the size is big. A user can configure the size by:: + + echo "2048" > /sys/block/md0/md/stripe_cache_size + +Too small cache disk will make the write aggregation less efficient in this +mode depending on the workloads. It's recommended to use a cache disk with at +least several gigabytes size in write-back mode. + +The implementation +================== + +The write-through and write-back cache use the same disk format. The cache disk +is organized as a simple write log. The log consists of 'meta data' and 'data' +pairs. The meta data describes the data. It also includes checksum and sequence +ID for recovery identification. Data can be IO data and parity data. Data is +checksumed too. The checksum is stored in the meta data ahead of the data. The +checksum is an optimization because MD can write meta and data freely without +worry about the order. MD superblock has a field pointed to the valid meta data +of log head. + +The log implementation is pretty straightforward. The difficult part is the +order in which MD writes data to cache disk and RAID disks. Specifically, in +write-through mode, MD calculates parity for IO data, writes both IO data and +parity to the log, writes the data and parity to RAID disks after the data and +parity is settled down in log and finally the IO is finished. Read just reads +from raid disks as usual. + +In write-back mode, MD writes IO data to the log and reports IO completion. The +data is also fully cached in memory at that time, which means read must query +memory cache. If some conditions are met, MD will flush the data to RAID disks. +MD will calculate parity for the data and write parity into the log. After this +is finished, MD will write both data and parity into RAID disks, then MD can +release the memory cache. 
The flush conditions could be stripe becomes a full +stripe write, free cache disk space is low or free in-kernel memory cache space +is low. + +After an unclean shutdown, MD does recovery. MD reads all meta data and data +from the log. The sequence ID and checksum will help us detect corrupted meta +data and data. If MD finds a stripe with data and valid parities (1 parity for +raid4/5 and 2 for raid6), MD will write the data and parities to RAID disks. If +parities are incompleted, they are discarded. If part of data is corrupted, +they are discarded too. MD then loads valid data and writes them to RAID disks +in normal way. diff --git a/Documentation/driver-api/md/raid5-ppl.rst b/Documentation/driver-api/md/raid5-ppl.rst new file mode 100644 index 000000000000..357e5515bc55 --- /dev/null +++ b/Documentation/driver-api/md/raid5-ppl.rst @@ -0,0 +1,47 @@ +================== +Partial Parity Log +================== + +Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue +addressed by PPL is that after a dirty shutdown, parity of a particular stripe +may become inconsistent with data on other member disks. If the array is also +in degraded state, there is no way to recalculate parity, because one of the +disks is missing. This can lead to silent data corruption when rebuilding the +array or using it is as degraded - data calculated from parity for array blocks +that have not been touched by a write request during the unclean shutdown can +be incorrect. Such condition is known as the RAID5 Write Hole. Because of +this, md by default does not allow starting a dirty degraded array. + +Partial parity for a write operation is the XOR of stripe data chunks not +modified by this write. It is just enough data needed for recovering from the +write hole. XORing partial parity with the modified chunks produces parity for +the stripe, consistent with its state before the write operation, regardless of +which chunk writes have completed. 
If one of the not modified data disks of +this stripe is missing, this updated parity can be used to recover its +contents. PPL recovery is also performed when starting an array after an +unclean shutdown and all disks are available, eliminating the need to resync +the array. Because of this, using write-intent bitmap and PPL together is not +supported. + +When handling a write request PPL writes partial parity before new data and +parity are dispatched to disks. PPL is a distributed log - it is stored on +array member drives in the metadata area, on the parity drive of a particular +stripe. It does not require a dedicated journaling drive. Write performance is +reduced by up to 30%-40% but it scales with the number of drives in the array +and the journaling drive does not become a bottleneck or a single point of +failure. + +Unlike raid5-cache, the other solution in md for closing the write hole, PPL is +not a true journal. It does not protect from losing in-flight data, only from +silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is +performed for this stripe (parity is not updated). So it is possible to have +arbitrary data in the written part of a stripe if that disk is lost. In such +case the behavior is the same as in plain raid5. + +PPL is available for md version-1 metadata and external (specifically IMSM) +metadata arrays. It can be enabled using mdadm option --consistency-policy=ppl. + +There is a limitation of maximum 64 disks in the array for PPL. It allows to +keep data structures and implementation simple. RAID5 arrays with so many disks +are not likely due to high risk of multiple disks failure. Such restriction +should not be a real life limitation. diff --git a/Documentation/md/index.rst b/Documentation/md/index.rst deleted file mode 100644 index c4db34ed327d..000000000000 --- a/Documentation/md/index.rst +++ /dev/null @@ -1,12 +0,0 @@ -:orphan: - -==== -RAID -==== - -.. 
toctree:: - :maxdepth: 1 - - md-cluster - raid5-cache - raid5-ppl diff --git a/Documentation/md/md-cluster.rst b/Documentation/md/md-cluster.rst deleted file mode 100644 index 96eb52cec7eb..000000000000 --- a/Documentation/md/md-cluster.rst +++ /dev/null @@ -1,385 +0,0 @@ -========== -MD Cluster -========== - -The cluster MD is a shared-device RAID for a cluster, it supports -two levels: raid1 and raid10 (limited support). - - -1. On-disk format -================= - -Separate write-intent-bitmaps are used for each cluster node. -The bitmaps record all writes that may have been started on that node, -and may not yet have finished. The on-disk layout is:: - - 0 4k 8k 12k - ------------------------------------------------------------------- - | idle | md super | bm super [0] + bits | - | bm bits[0, contd] | bm super[1] + bits | bm bits[1, contd] | - | bm super[2] + bits | bm bits [2, contd] | bm super[3] + bits | - | bm bits [3, contd] | | | - -During "normal" functioning we assume the filesystem ensures that only -one node writes to any given block at a time, so a write request will - - - set the appropriate bit (if not already set) - - commit the write to all mirrors - - schedule the bit to be cleared after a timeout. - -Reads are just handled normally. It is up to the filesystem to ensure -one node doesn't read from a location where another node (or the same -node) is writing. - - -2. DLM Locks for management -=========================== - -There are three groups of locks for managing the device: - -2.1 Bitmap lock resource (bm_lockres) -------------------------------------- - - The bm_lockres protects individual node bitmaps. They are named in - the form bitmap000 for node 1, bitmap001 for node 2 and so on. When a - node joins the cluster, it acquires the lock in PW mode and it stays - so during the lifetime the node is part of the cluster. The lock - resource number is based on the slot number returned by the DLM - subsystem. 
Since DLM starts node count from one and bitmap slots - start from zero, one is subtracted from the DLM slot number to arrive - at the bitmap slot number. - - The LVB of the bitmap lock for a particular node records the range - of sectors that are being re-synced by that node. No other - node may write to those sectors. This is used when a new nodes - joins the cluster. - -2.2 Message passing locks -------------------------- - - Each node has to communicate with other nodes when starting or ending - resync, and for metadata superblock updates. This communication is - managed through three locks: "token", "message", and "ack", together - with the Lock Value Block (LVB) of one of the "message" lock. - -2.3 new-device management -------------------------- - - A single lock: "no-new-dev" is used to co-ordinate the addition of - new devices - this must be synchronized across the array. - Normally all nodes hold a concurrent-read lock on this device. - -3. Communication -================ - - Messages can be broadcast to all nodes, and the sender waits for all - other nodes to acknowledge the message before proceeding. Only one - message can be processed at a time. - -3.1 Message Types ------------------ - - There are six types of messages which are passed: - -3.1.1 METADATA_UPDATED -^^^^^^^^^^^^^^^^^^^^^^ - - informs other nodes that the metadata has - been updated, and the node must re-read the md superblock. This is - performed synchronously. It is primarily used to signal device - failure. - -3.1.2 RESYNCING -^^^^^^^^^^^^^^^ - informs other nodes that a resync is initiated or - ended so that each node may suspend or resume the region. Each - RESYNCING message identifies a range of the devices that the - sending node is about to resync. This overrides any previous - notification from that node: only one ranged can be resynced at a - time per-node. - -3.1.3 NEWDISK -^^^^^^^^^^^^^ - - informs other nodes that a device is being added to - the array. 
Message contains an identifier for that device. See - below for further details. - -3.1.4 REMOVE -^^^^^^^^^^^^ - - A failed or spare device is being removed from the - array. The slot-number of the device is included in the message. - - 3.1.5 RE_ADD: - - A failed device is being re-activated - the assumption - is that it has been determined to be working again. - - 3.1.6 BITMAP_NEEDS_SYNC: - - If a node is stopped locally but the bitmap - isn't clean, then another node is informed to take the ownership of - resync. - -3.2 Communication mechanism ---------------------------- - - The DLM LVB is used to communicate within nodes of the cluster. There - are three resources used for the purpose: - -3.2.1 token -^^^^^^^^^^^ - The resource which protects the entire communication - system. The node having the token resource is allowed to - communicate. - -3.2.2 message -^^^^^^^^^^^^^ - The lock resource which carries the data to communicate. - -3.2.3 ack -^^^^^^^^^ - - The resource, acquiring which means the message has been - acknowledged by all nodes in the cluster. The BAST of the resource - is used to inform the receiving node that a node wants to - communicate. - -The algorithm is: - - 1. receive status - all nodes have concurrent-reader lock on "ack":: - - sender receiver receiver - "ack":CR "ack":CR "ack":CR - - 2. sender get EX on "token", - sender get EX on "message":: - - sender receiver receiver - "token":EX "ack":CR "ack":CR - "message":EX - "ack":CR - - Sender checks that it still needs to send a message. Messages - received or other events that happened while waiting for the - "token" may have made this message inappropriate or redundant. - - 3. 
sender writes LVB - - sender down-convert "message" from EX to CW - - sender try to get EX of "ack" - - :: - - [ wait until all receivers have *processed* the "message" ] - - [ triggered by bast of "ack" ] - receiver get CR on "message" - receiver read LVB - receiver processes the message - [ wait finish ] - receiver releases "ack" - receiver tries to get PR on "message" - - sender receiver receiver - "token":EX "message":CR "message":CR - "message":CW - "ack":EX - - 4. triggered by grant of EX on "ack" (indicating all receivers - have processed message) - - sender down-converts "ack" from EX to CR - - sender releases "message" - - sender releases "token" - - :: - - receiver upconvert to PR on "message" - receiver get CR of "ack" - receiver release "message" - - sender receiver receiver - "ack":CR "ack":CR "ack":CR - - -4. Handling Failures -==================== - -4.1 Node Failure ----------------- - - When a node fails, the DLM informs the cluster with the slot - number. The node starts a cluster recovery thread. The cluster - recovery thread: - - - acquires the bitmap lock of the failed node - - opens the bitmap - - reads the bitmap of the failed node - - copies the set bitmap to local node - - cleans the bitmap of the failed node - - releases bitmap lock of the failed node - - initiates resync of the bitmap on the current node - md_check_recovery is invoked within recover_bitmaps, - then md_check_recovery -> metadata_update_start/finish, - it will lock the communication by lock_comm. - Which means when one node is resyncing it blocks all - other nodes from writing anywhere on the array. - - The resync process is the regular md resync. However, in a clustered - environment when a resync is performed, it needs to tell other nodes - of the areas which are suspended. Before a resync starts, the node - send out RESYNCING with the (lo,hi) range of the area which needs to - be suspended. 
Each node maintains a suspend_list, which contains the - list of ranges which are currently suspended. On receiving RESYNCING, - the node adds the range to the suspend_list. Similarly, when the node - performing resync finishes, it sends RESYNCING with an empty range to - other nodes and other nodes remove the corresponding entry from the - suspend_list. - - A helper function, ->area_resyncing() can be used to check if a - particular I/O range should be suspended or not. - -4.2 Device Failure -================== - - Device failures are handled and communicated with the metadata update - routine. When a node detects a device failure it does not allow - any further writes to that device until the failure has been - acknowledged by all other nodes. - -5. Adding a new Device ----------------------- - - For adding a new device, it is necessary that all nodes "see" the new - device to be added. For this, the following algorithm is used: - - 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues - ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD) - 2. Node 1 sends a NEWDISK message with uuid and slot number - 3. Other nodes issue kobject_uevent_env with uuid and slot number - (Steps 4,5 could be a udev rule) - 4. In userspace, the node searches for the disk, perhaps - using blkid -t SUB_UUID="" - 5. Other nodes issue either of the following depending on whether - the disk was found: - ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and - disc.number set to slot number) - ioctl(CLUSTERED_DISK_NACK) - 6. Other nodes drop lock on "no-new-devs" (CR) if device is found - 7. Node 1 attempts EX lock on "no-new-dev" - 8. If node 1 gets the lock, it sends METADATA_UPDATED after - unmarking the disk as SpareLocal - 9. If not (get "no-new-dev" lock), it fails the operation and sends - METADATA_UPDATED. - 10. Other nodes get the information whether a disk is added or not - by the following METADATA_UPDATED. - -6. 
Module interface -=================== - - There are 17 call-backs which the md core can make to the cluster - module. Understanding these can give a good overview of the whole - process. - -6.1 join(nodes) and leave() ---------------------------- - - These are called when an array is started with a clustered bitmap, - and when the array is stopped. join() ensures the cluster is - available and initializes the various resources. - Only the first 'nodes' nodes in the cluster can use the array. - -6.2 slot_number() ------------------ - - Reports the slot number advised by the cluster infrastructure. - Range is from 0 to nodes-1. - -6.3 resync_info_update() ------------------------- - - This updates the resync range that is stored in the bitmap lock. - The starting point is updated as the resync progresses. The - end point is always the end of the array. - It does *not* send a RESYNCING message. - -6.4 resync_start(), resync_finish() ------------------------------------ - - These are called when resync/recovery/reshape starts or stops. - They update the resyncing range in the bitmap lock and also - send a RESYNCING message. resync_start reports the whole - array as resyncing, resync_finish reports none of it. - - resync_finish() also sends a BITMAP_NEEDS_SYNC message which - allows some other node to take over. - -6.5 metadata_update_start(), metadata_update_finish(), metadata_update_cancel() -------------------------------------------------------------------------------- - - metadata_update_start is used to get exclusive access to - the metadata. If a change is still needed once that access is - gained, metadata_update_finish() will send a METADATA_UPDATE - message to all other nodes, otherwise metadata_update_cancel() - can be used to release the lock. - -6.6 area_resyncing() --------------------- - - This combines two elements of functionality. - - Firstly, it will check if any node is currently resyncing - anything in a given range of sectors. 
If any resync is found, - then the caller will avoid writing or read-balancing in that - range. - - Secondly, while node recovery is happening it reports that - all areas are resyncing for READ requests. This avoids races - between the cluster-filesystem and the cluster-RAID handling - a node failure. - -6.7 add_new_disk_start(), add_new_disk_finish(), new_disk_ack() ---------------------------------------------------------------- - - These are used to manage the new-disk protocol described above. - When a new device is added, add_new_disk_start() is called before - it is bound to the array and, if that succeeds, add_new_disk_finish() - is called the device is fully added. - - When a device is added in acknowledgement to a previous - request, or when the device is declared "unavailable", - new_disk_ack() is called. - -6.8 remove_disk() ------------------ - - This is called when a spare or failed device is removed from - the array. It causes a REMOVE message to be send to other nodes. - -6.9 gather_bitmaps() --------------------- - - This sends a RE_ADD message to all other nodes and then - gathers bitmap information from all bitmaps. This combined - bitmap is then used to recovery the re-added device. - -6.10 lock_all_bitmaps() and unlock_all_bitmaps() ------------------------------------------------- - - These are called when change bitmap to none. If a node plans - to clear the cluster raid's bitmap, it need to make sure no other - nodes are using the raid which is achieved by lock all bitmap - locks within the cluster, and also those locks are unlocked - accordingly. - -7. Unsupported features -======================= - -There are somethings which are not supported by cluster MD yet. - -- change array_sectors. 
diff --git a/Documentation/md/raid5-cache.rst b/Documentation/md/raid5-cache.rst deleted file mode 100644 index d7a15f44a7c3..000000000000 --- a/Documentation/md/raid5-cache.rst +++ /dev/null @@ -1,111 +0,0 @@ -================ -RAID 4/5/6 cache -================ - -Raid 4/5/6 could include an extra disk for data cache besides normal RAID -disks. The role of RAID disks isn't changed with the cache disk. The cache disk -caches data to the RAID disks. The cache can be in write-through (supported -since 4.4) or write-back mode (supported since 4.10). mdadm (supported since -3.4) has a new option '--write-journal' to create array with cache. Please -refer to mdadm manual for details. By default (RAID array starts), the cache is -in write-through mode. A user can switch it to write-back mode by:: - - echo "write-back" > /sys/block/md0/md/journal_mode - -And switch it back to write-through mode by:: - - echo "write-through" > /sys/block/md0/md/journal_mode - -In both modes, all writes to the array will hit cache disk first. This means -the cache disk must be fast and sustainable. - -write-through mode -================== - -This mode mainly fixes the 'write hole' issue. For RAID 4/5/6 array, an unclean -shutdown can cause data in some stripes to not be in consistent state, eg, data -and parity don't match. The reason is that a stripe write involves several RAID -disks and it's possible the writes don't hit all RAID disks yet before the -unclean shutdown. We call an array degraded if it has inconsistent data. MD -tries to resync the array to bring it back to normal state. But before the -resync completes, any system crash will expose the chance of real data -corruption in the RAID array. This problem is called 'write hole'. - -The write-through cache will cache all data on cache disk first. After the data -is safe on the cache disk, the data will be flushed onto RAID disks. 
The -two-step write will guarantee MD can recover correct data after unclean -shutdown even the array is degraded. Thus the cache can close the 'write hole'. - -In write-through mode, MD reports IO completion to upper layer (usually -filesystems) after the data is safe on RAID disks, so cache disk failure -doesn't cause data loss. Of course cache disk failure means the array is -exposed to 'write hole' again. - -In write-through mode, the cache disk isn't required to be big. Several -hundreds megabytes are enough. - -write-back mode -=============== - -write-back mode fixes the 'write hole' issue too, since all write data is -cached on cache disk. But the main goal of 'write-back' cache is to speed up -write. If a write crosses all RAID disks of a stripe, we call it full-stripe -write. For non-full-stripe writes, MD must read old data before the new parity -can be calculated. These synchronous reads hurt write throughput. Some writes -which are sequential but not dispatched in the same time will suffer from this -overhead too. Write-back cache will aggregate the data and flush the data to -RAID disks only after the data becomes a full stripe write. This will -completely avoid the overhead, so it's very helpful for some workloads. A -typical workload which does sequential write followed by fsync is an example. - -In write-back mode, MD reports IO completion to upper layer (usually -filesystems) right after the data hits cache disk. The data is flushed to raid -disks later after specific conditions met. So cache disk failure will cause -data loss. - -In write-back mode, MD also caches data in memory. The memory cache includes -the same data stored on cache disk, so a power loss doesn't cause data loss. -The memory cache size has performance impact for the array. It's recommended -the size is big. 
A user can configure the size by:: - - echo "2048" > /sys/block/md0/md/stripe_cache_size - -Too small cache disk will make the write aggregation less efficient in this -mode depending on the workloads. It's recommended to use a cache disk with at -least several gigabytes size in write-back mode. - -The implementation -================== - -The write-through and write-back cache use the same disk format. The cache disk -is organized as a simple write log. The log consists of 'meta data' and 'data' -pairs. The meta data describes the data. It also includes checksum and sequence -ID for recovery identification. Data can be IO data and parity data. Data is -checksumed too. The checksum is stored in the meta data ahead of the data. The -checksum is an optimization because MD can write meta and data freely without -worry about the order. MD superblock has a field pointed to the valid meta data -of log head. - -The log implementation is pretty straightforward. The difficult part is the -order in which MD writes data to cache disk and RAID disks. Specifically, in -write-through mode, MD calculates parity for IO data, writes both IO data and -parity to the log, writes the data and parity to RAID disks after the data and -parity is settled down in log and finally the IO is finished. Read just reads -from raid disks as usual. - -In write-back mode, MD writes IO data to the log and reports IO completion. The -data is also fully cached in memory at that time, which means read must query -memory cache. If some conditions are met, MD will flush the data to RAID disks. -MD will calculate parity for the data and write parity into the log. After this -is finished, MD will write both data and parity into RAID disks, then MD can -release the memory cache. The flush conditions could be stripe becomes a full -stripe write, free cache disk space is low or free in-kernel memory cache space -is low. - -After an unclean shutdown, MD does recovery. 
MD reads all meta data and data -from the log. The sequence ID and checksum will help us detect corrupted meta -data and data. If MD finds a stripe with data and valid parities (1 parity for -raid4/5 and 2 for raid6), MD will write the data and parities to RAID disks. If -parities are incompleted, they are discarded. If part of data is corrupted, -they are discarded too. MD then loads valid data and writes them to RAID disks -in normal way. diff --git a/Documentation/md/raid5-ppl.rst b/Documentation/md/raid5-ppl.rst deleted file mode 100644 index 357e5515bc55..000000000000 --- a/Documentation/md/raid5-ppl.rst +++ /dev/null @@ -1,47 +0,0 @@ -================== -Partial Parity Log -================== - -Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue -addressed by PPL is that after a dirty shutdown, parity of a particular stripe -may become inconsistent with data on other member disks. If the array is also -in degraded state, there is no way to recalculate parity, because one of the -disks is missing. This can lead to silent data corruption when rebuilding the -array or using it is as degraded - data calculated from parity for array blocks -that have not been touched by a write request during the unclean shutdown can -be incorrect. Such condition is known as the RAID5 Write Hole. Because of -this, md by default does not allow starting a dirty degraded array. - -Partial parity for a write operation is the XOR of stripe data chunks not -modified by this write. It is just enough data needed for recovering from the -write hole. XORing partial parity with the modified chunks produces parity for -the stripe, consistent with its state before the write operation, regardless of -which chunk writes have completed. If one of the not modified data disks of -this stripe is missing, this updated parity can be used to recover its -contents. 
PPL recovery is also performed when starting an array after an -unclean shutdown and all disks are available, eliminating the need to resync -the array. Because of this, using write-intent bitmap and PPL together is not -supported. - -When handling a write request PPL writes partial parity before new data and -parity are dispatched to disks. PPL is a distributed log - it is stored on -array member drives in the metadata area, on the parity drive of a particular -stripe. It does not require a dedicated journaling drive. Write performance is -reduced by up to 30%-40% but it scales with the number of drives in the array -and the journaling drive does not become a bottleneck or a single point of -failure. - -Unlike raid5-cache, the other solution in md for closing the write hole, PPL is -not a true journal. It does not protect from losing in-flight data, only from -silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is -performed for this stripe (parity is not updated). So it is possible to have -arbitrary data in the written part of a stripe if that disk is lost. In such -case the behavior is the same as in plain raid5. - -PPL is available for md version-1 metadata and external (specifically IMSM) -metadata arrays. It can be enabled using mdadm option --consistency-policy=ppl. - -There is a limitation of maximum 64 disks in the array for PPL. It allows to -keep data structures and implementation simple. RAID5 arrays with so many disks -are not likely due to high risk of multiple disks failure. Such restriction -should not be a real life limitation. -- cgit v1.2.3 From 9b1f44028ff2e051816517781153e10a2d748dc3 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 18 Jun 2019 17:15:10 -0300 Subject: docs: interconnect.rst: add it to the driver-api guide This is intended for Kernel hackers audience. 
Signed-off-by: Mauro Carvalho Chehab Reviewed-by: Georgi Djakov --- Documentation/driver-api/index.rst | 1 + Documentation/driver-api/interconnect.rst | 93 ++++++++++++++++++++++++++++ Documentation/interconnect/interconnect.rst | 95 ----------------------------- MAINTAINERS | 2 +- 4 files changed, 95 insertions(+), 96 deletions(-) create mode 100644 Documentation/driver-api/interconnect.rst delete mode 100644 Documentation/interconnect/interconnect.rst (limited to 'Documentation/driver-api') diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index b5179bf2ada2..baa77a666e46 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -36,6 +36,7 @@ available subsections can be seen below. i2c ipmb i3c/index + interconnect hsi edac scsi diff --git a/Documentation/driver-api/interconnect.rst b/Documentation/driver-api/interconnect.rst new file mode 100644 index 000000000000..c3e004893796 --- /dev/null +++ b/Documentation/driver-api/interconnect.rst @@ -0,0 +1,93 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================================== +GENERIC SYSTEM INTERCONNECT SUBSYSTEM +===================================== + +Introduction +------------ + +This framework is designed to provide a standard kernel interface to control +the settings of the interconnects on an SoC. These settings can be throughput, +latency and priority between multiple interconnected devices or functional +blocks. This can be controlled dynamically in order to save power or provide +maximum performance. + +The interconnect bus is hardware with configurable parameters, which can be +set on a data path according to the requests received from various drivers. +An example of interconnect buses are the interconnects between various +components or functional blocks in chipsets. There can be multiple interconnects +on an SoC that can be multi-tiered. + +Below is a simplified diagram of a real-world SoC interconnect bus topology. 
+ +:: + + +----------------+ +----------------+ + | HW Accelerator |--->| M NoC |<---------------+ + +----------------+ +----------------+ | + | | +------------+ + +-----+ +-------------+ V +------+ | | + | DDR | | +--------+ | PCIe | | | + +-----+ | | Slaves | +------+ | | + ^ ^ | +--------+ | | C NoC | + | | V V | | + +------------------+ +------------------------+ | | +-----+ + | |-->| |-->| |-->| CPU | + | |-->| |<--| | +-----+ + | Mem NoC | | S NoC | +------------+ + | |<--| |---------+ | + | |<--| |<------+ | | +--------+ + +------------------+ +------------------------+ | | +-->| Slaves | + ^ ^ ^ ^ ^ | | +--------+ + | | | | | | V + +------+ | +-----+ +-----+ +---------+ +----------------+ +--------+ + | CPUs | | | GPU | | DSP | | Masters |-->| P NoC |-->| Slaves | + +------+ | +-----+ +-----+ +---------+ +----------------+ +--------+ + | + +-------+ + | Modem | + +-------+ + +Terminology +----------- + +Interconnect provider is the software definition of the interconnect hardware. +The interconnect providers on the above diagram are M NoC, S NoC, C NoC, P NoC +and Mem NoC. + +Interconnect node is the software definition of the interconnect hardware +port. Each interconnect provider consists of multiple interconnect nodes, +which are connected to other SoC components including other interconnect +providers. The point on the diagram where the CPUs connect to the memory is +called an interconnect node, which belongs to the Mem NoC interconnect provider. + +Interconnect endpoints are the first or the last element of the path. Every +endpoint is a node, but not every node is an endpoint. + +Interconnect path is everything between two endpoints including all the nodes +that have to be traversed to reach from a source to destination node. It may +include multiple master-slave pairs across several interconnect providers. + +Interconnect consumers are the entities which make use of the data paths exposed +by the providers. 
The consumers send requests to providers requesting various +throughput, latency and priority. Usually the consumers are device drivers, that +send request based on their needs. An example for a consumer is a video decoder +that supports various formats and image sizes. + +Interconnect providers +---------------------- + +Interconnect provider is an entity that implements methods to initialize and +configure interconnect bus hardware. The interconnect provider drivers should +be registered with the interconnect provider core. + +.. kernel-doc:: include/linux/interconnect-provider.h + +Interconnect consumers +---------------------- + +Interconnect consumers are the clients which use the interconnect APIs to +get paths between endpoints and set their bandwidth/latency/QoS requirements +for these interconnect paths. These interfaces are not currently +documented. diff --git a/Documentation/interconnect/interconnect.rst b/Documentation/interconnect/interconnect.rst deleted file mode 100644 index 56e331dab70e..000000000000 --- a/Documentation/interconnect/interconnect.rst +++ /dev/null @@ -1,95 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -:orphan: - -===================================== -GENERIC SYSTEM INTERCONNECT SUBSYSTEM -===================================== - -Introduction ------------- - -This framework is designed to provide a standard kernel interface to control -the settings of the interconnects on an SoC. These settings can be throughput, -latency and priority between multiple interconnected devices or functional -blocks. This can be controlled dynamically in order to save power or provide -maximum performance. - -The interconnect bus is hardware with configurable parameters, which can be -set on a data path according to the requests received from various drivers. -An example of interconnect buses are the interconnects between various -components or functional blocks in chipsets. There can be multiple interconnects -on an SoC that can be multi-tiered. 
- -Below is a simplified diagram of a real-world SoC interconnect bus topology. - -:: - - +----------------+ +----------------+ - | HW Accelerator |--->| M NoC |<---------------+ - +----------------+ +----------------+ | - | | +------------+ - +-----+ +-------------+ V +------+ | | - | DDR | | +--------+ | PCIe | | | - +-----+ | | Slaves | +------+ | | - ^ ^ | +--------+ | | C NoC | - | | V V | | - +------------------+ +------------------------+ | | +-----+ - | |-->| |-->| |-->| CPU | - | |-->| |<--| | +-----+ - | Mem NoC | | S NoC | +------------+ - | |<--| |---------+ | - | |<--| |<------+ | | +--------+ - +------------------+ +------------------------+ | | +-->| Slaves | - ^ ^ ^ ^ ^ | | +--------+ - | | | | | | V - +------+ | +-----+ +-----+ +---------+ +----------------+ +--------+ - | CPUs | | | GPU | | DSP | | Masters |-->| P NoC |-->| Slaves | - +------+ | +-----+ +-----+ +---------+ +----------------+ +--------+ - | - +-------+ - | Modem | - +-------+ - -Terminology ------------ - -Interconnect provider is the software definition of the interconnect hardware. -The interconnect providers on the above diagram are M NoC, S NoC, C NoC, P NoC -and Mem NoC. - -Interconnect node is the software definition of the interconnect hardware -port. Each interconnect provider consists of multiple interconnect nodes, -which are connected to other SoC components including other interconnect -providers. The point on the diagram where the CPUs connect to the memory is -called an interconnect node, which belongs to the Mem NoC interconnect provider. - -Interconnect endpoints are the first or the last element of the path. Every -endpoint is a node, but not every node is an endpoint. - -Interconnect path is everything between two endpoints including all the nodes -that have to be traversed to reach from a source to destination node. It may -include multiple master-slave pairs across several interconnect providers. 
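The terminology above (nodes, endpoints, paths, providers) can be illustrated with a toy model. This is only a sketch under stated assumptions: the node names, the `NODES` table and the `find_path` helper are hypothetical, are not part of the kernel interconnect API, and only loosely follow the diagram.

```python
from collections import deque

# Hypothetical topology: node name -> (owning provider, nodes it can send to).
NODES = {
    "cpus":        ("Mem NoC", ["ddr", "mem_noc_out"]),
    "ddr":         ("Mem NoC", []),
    "mem_noc_out": ("Mem NoC", ["s_noc_in"]),
    "s_noc_in":    ("S NoC",   ["s_noc_out"]),
    "s_noc_out":   ("S NoC",   ["pcie"]),
    "pcie":        ("S NoC",   []),
}


def find_path(nodes, src, dst):
    """Breadth-first search from src to dst.

    Returns the list of nodes traversed (both endpoints included),
    or None when dst cannot be reached from src.
    """
    parent = {src: None}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            path = []
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        for nxt in nodes[cur][1]:
            if nxt not in parent:
                parent[nxt] = cur
                queue.append(nxt)
    return None


path = find_path(NODES, "cpus", "pcie")
providers = {NODES[name][0] for name in path}
```

Here `path[0]` and `path[-1]` are the endpoints, every element of `path` is a node, and `providers` shows that a single path may cross several interconnect providers.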
- -Interconnect consumers are the entities which make use of the data paths exposed -by the providers. The consumers send requests to providers requesting various -throughput, latency and priority. Usually the consumers are device drivers, that -send request based on their needs. An example for a consumer is a video decoder -that supports various formats and image sizes. - -Interconnect providers ----------------------- - -Interconnect provider is an entity that implements methods to initialize and -configure interconnect bus hardware. The interconnect provider drivers should -be registered with the interconnect provider core. - -.. kernel-doc:: include/linux/interconnect-provider.h - -Interconnect consumers ----------------------- - -Interconnect consumers are the clients which use the interconnect APIs to -get paths between endpoints and set their bandwidth/latency/QoS requirements -for these interconnect paths. These interfaces are not currently -documented. diff --git a/MAINTAINERS b/MAINTAINERS index b8ce346d5254..49e9a58f4799 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -8326,7 +8326,7 @@ INTERCONNECT API M: Georgi Djakov L: linux-pm@vger.kernel.org S: Maintained -F: Documentation/interconnect/ +F: Documentation/driver-api/interconnect.rst F: Documentation/devicetree/bindings/interconnect/ F: drivers/interconnect/ F: include/dt-bindings/interconnect/ -- cgit v1.2.3 From ec4b78a0e7dd4751423089b7cfd32168f9052377 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 18 Jun 2019 15:00:25 -0300 Subject: docs: early-userspace: move to driver-api guide Those documents describe a kAPI. So, add to the driver-api book. 
Signed-off-by: Mauro Carvalho Chehab --- .../driver-api/early-userspace/buffer-format.rst | 119 ++++++++++++++++ .../early-userspace/early_userspace_support.rst | 154 +++++++++++++++++++++ Documentation/driver-api/early-userspace/index.rst | 16 +++ Documentation/driver-api/index.rst | 1 + Documentation/early-userspace/buffer-format.rst | 119 ---------------- .../early-userspace/early_userspace_support.rst | 154 --------------------- Documentation/early-userspace/index.rst | 18 --- Documentation/filesystems/nfs/nfsroot.txt | 2 +- .../filesystems/ramfs-rootfs-initramfs.txt | 4 +- usr/Kconfig | 2 +- 10 files changed, 294 insertions(+), 295 deletions(-) create mode 100644 Documentation/driver-api/early-userspace/buffer-format.rst create mode 100644 Documentation/driver-api/early-userspace/early_userspace_support.rst create mode 100644 Documentation/driver-api/early-userspace/index.rst delete mode 100644 Documentation/early-userspace/buffer-format.rst delete mode 100644 Documentation/early-userspace/early_userspace_support.rst delete mode 100644 Documentation/early-userspace/index.rst (limited to 'Documentation/driver-api') diff --git a/Documentation/driver-api/early-userspace/buffer-format.rst b/Documentation/driver-api/early-userspace/buffer-format.rst new file mode 100644 index 000000000000..7f74e301fdf3 --- /dev/null +++ b/Documentation/driver-api/early-userspace/buffer-format.rst @@ -0,0 +1,119 @@ +======================= +initramfs buffer format +======================= + +Al Viro, H. Peter Anvin + +Last revision: 2002-01-13 + +Starting with kernel 2.5.x, the old "initial ramdisk" protocol is +getting {replaced/complemented} with the new "initial ramfs" +(initramfs) protocol. The initramfs contents is passed using the same +memory buffer protocol used by the initrd protocol, but the contents +is different. The initramfs buffer contains an archive which is +expanded into a ramfs filesystem; this document details the format of +the initramfs buffer format. 
+ +The initramfs buffer format is based around the "newc" or "crc" CPIO +formats, and can be created with the cpio(1) utility. The cpio +archive can be compressed using gzip(1). One valid version of an +initramfs buffer is thus a single .cpio.gz file. + +The full format of the initramfs buffer is defined by the following +grammar, where:: + + * is used to indicate "0 or more occurrences of" + (|) indicates alternatives + + indicates concatenation + GZIP() indicates the gzip(1) of the operand + ALGN(n) means padding with null bytes to an n-byte boundary + + initramfs := ("\0" | cpio_archive | cpio_gzip_archive)* + + cpio_gzip_archive := GZIP(cpio_archive) + + cpio_archive := cpio_file* + (<nothing> | cpio_trailer) + + cpio_file := ALGN(4) + cpio_header + filename + "\0" + ALGN(4) + data + + cpio_trailer := ALGN(4) + cpio_header + "TRAILER!!!\0" + ALGN(4) + + +In human terms, the initramfs buffer contains a collection of +compressed and/or uncompressed cpio archives (in the "newc" or "crc" +formats); arbitrary amounts of zero bytes (for padding) can be added +between members. + +The cpio "TRAILER!!!" entry (cpio end-of-archive) is optional, but is +not ignored; see "handling of hard links" below.
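As a rough illustration of the top-level grammar rule, the Python sketch below labels whatever starts at a given offset as padding, a gzip-compressed archive, or an uncompressed cpio archive. The name `classify_member` is hypothetical. Recognising gzip by its two-byte magic (0x1f 0x8b) and cpio by the c_magic strings is a simplification, not a description of what the kernel unpacker actually does.

```python
GZIP_MAGIC = b"\x1f\x8b"                # magic bytes at the start of a gzip member
CPIO_MAGICS = (b"070701", b"070702")    # c_magic for the "newc" and "crc" formats


def classify_member(buf: bytes, offset: int = 0) -> str:
    """Label what starts at buf[offset], following
    initramfs := ("\\0" | cpio_archive | cpio_gzip_archive)*
    """
    if offset >= len(buf):
        return "end"
    if buf[offset:offset + 1] == b"\x00":
        return "padding"
    if buf[offset:offset + 2] == GZIP_MAGIC:
        return "cpio_gzip_archive"
    if buf[offset:offset + 6] in CPIO_MAGICS:
        return "cpio_archive"
    return "unknown"
```

A full reader would loop over the buffer, skipping padding one byte at a time and handing each archive to the matching decoder.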
+ +The structure of the cpio_header is as follows (all fields contain +hexadecimal ASCII numbers fully padded with '0' on the left to the +full width of the field, for example, the integer 4780 is represented +by the ASCII string "000012ac"): + +============= ================== ============================================== +Field name Field size Meaning +============= ================== ============================================== +c_magic 6 bytes The string "070701" or "070702" +c_ino 8 bytes File inode number +c_mode 8 bytes File mode and permissions +c_uid 8 bytes File uid +c_gid 8 bytes File gid +c_nlink 8 bytes Number of links +c_mtime 8 bytes Modification time +c_filesize 8 bytes Size of data field +c_maj 8 bytes Major part of file device number +c_min 8 bytes Minor part of file device number +c_rmaj 8 bytes Major part of device node reference +c_rmin 8 bytes Minor part of device node reference +c_namesize 8 bytes Length of filename, including final \0 +c_chksum 8 bytes Checksum of data field if c_magic is 070702; + otherwise zero +============= ================== ============================================== + +The c_mode field matches the contents of st_mode returned by stat(2) +on Linux, and encodes the file type and file permissions. + +The c_filesize should be zero for any file which is not a regular file +or symlink. + +The c_chksum field contains a simple 32-bit unsigned sum of all the +bytes in the data field. cpio(1) refers to this as "crc", which is +clearly incorrect (a cyclic redundancy check is a different and +significantly stronger integrity check), however, this is the +algorithm used. + +If the filename is "TRAILER!!!" this is actually an end-of-archive +marker; the c_filesize for an end-of-archive marker must be zero. + + +Handling of hard links +====================== + +When a nondirectory with c_nlink > 1 is seen, the (c_maj,c_min,c_ino) +tuple is looked up in a tuple buffer. 
If not found, it is entered in +the tuple buffer and the entry is created as usual; if found, a hard +link rather than a second copy of the file is created. It is not +necessary (but permitted) to include a second copy of the file +contents; if the file contents is not included, the c_filesize field +should be set to zero to indicate no data section follows. If data is +present, the previous instance of the file is overwritten; this allows +the data-carrying instance of a file to occur anywhere in the sequence +(GNU cpio is reported to attach the data to the last instance of a +file only.) + +c_filesize must not be zero for a symlink. + +When a "TRAILER!!!" end-of-archive marker is seen, the tuple buffer is +reset. This permits archives which are generated independently to be +concatenated. + +To combine file data from different sources (without having to +regenerate the (c_maj,c_min,c_ino) fields), therefore, either one of +the following techniques can be used: + +a) Separate the different file data sources with a "TRAILER!!!" + end-of-archive marker, or + +b) Make sure c_nlink == 1 for all nondirectory entries. diff --git a/Documentation/driver-api/early-userspace/early_userspace_support.rst b/Documentation/driver-api/early-userspace/early_userspace_support.rst new file mode 100644 index 000000000000..3deefb34046b --- /dev/null +++ b/Documentation/driver-api/early-userspace/early_userspace_support.rst @@ -0,0 +1,154 @@ +======================= +Early userspace support +======================= + +Last update: 2004-12-20 tlh + + +"Early userspace" is a set of libraries and programs that provide +various pieces of functionality that are important enough to be +available while a Linux kernel is coming up, but that don't need to be +run inside the kernel itself. + +It consists of several major infrastructure components: + +- gen_init_cpio, a program that builds a cpio-format archive + containing a root filesystem image. 
This archive is compressed, and + the compressed image is linked into the kernel image. +- initramfs, a chunk of code that unpacks the compressed cpio image + midway through the kernel boot process. +- klibc, a userspace C library, currently packaged separately, that is + optimized for correctness and small size. + +The cpio file format used by initramfs is the "newc" (aka "cpio -H newc") +format, and is documented in the file "buffer-format.txt". There are +two ways to add an early userspace image: specify an existing cpio +archive to be used as the image or have the kernel build process build +the image from specifications. + +CPIO ARCHIVE method +------------------- + +You can create a cpio archive that contains the early userspace image. +Your cpio archive should be specified in CONFIG_INITRAMFS_SOURCE and it +will be used directly. Only a single cpio file may be specified in +CONFIG_INITRAMFS_SOURCE and directory and file names are not allowed in +combination with a cpio archive. + +IMAGE BUILDING method +--------------------- + +The kernel build process can also build an early userspace image from +source parts rather than supplying a cpio archive. This method provides +a way to create images with root-owned files even though the image was +built by an unprivileged user. + +The image is specified as one or more sources in +CONFIG_INITRAMFS_SOURCE. Sources can be either directories or files - +cpio archives are *not* allowed when building from sources. + +A source directory will have it and all of its contents packaged. The +specified directory name will be mapped to '/'. When packaging a +directory, limited user and group ID translation can be performed. +INITRAMFS_ROOT_UID can be set to a user ID that needs to be mapped to +user root (0). INITRAMFS_ROOT_GID can be set to a group ID that needs +to be mapped to group root (0). 
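One plausible reading of the "limited user and group ID translation" described above is sketched below in Python. The helper name `translate_id` is hypothetical and the build scripts may behave differently in detail. The idea is only that an ID equal to the configured INITRAMFS_ROOT_UID (or INITRAMFS_ROOT_GID) becomes 0 and every other ID is left alone.

```python
def translate_id(file_id, root_id=None):
    """Map the configured root_id to 0 (root); keep every other ID unchanged.

    root_id stands in for INITRAMFS_ROOT_UID or INITRAMFS_ROOT_GID;
    None means no translation was configured.
    """
    if root_id is not None and file_id == root_id:
        return 0
    return file_id
```

With this mapping, an unprivileged user with UID 1000 could set the root UID to 1000 so that the files they own are packaged as root-owned.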
+ +A source file must contain directives in the format required by the +usr/gen_init_cpio utility (run 'usr/gen_init_cpio -h' to get the +file format). The directives in the file will be passed directly to +usr/gen_init_cpio. + +When a combination of directories and files is specified then the +initramfs image will be an aggregate of all of them. In this way a user +can create a 'root-image' directory and install all files into it. +Because device-special files cannot be created by an unprivileged user, +special files can be listed in a 'root-files' file. Both 'root-image' +and 'root-files' can be listed in CONFIG_INITRAMFS_SOURCE and a complete +early userspace image can be built by an unprivileged user. + +As a technical note, when directories and files are specified, the +entire CONFIG_INITRAMFS_SOURCE is passed to +usr/gen_initramfs_list.sh. This means that CONFIG_INITRAMFS_SOURCE +can really be interpreted as any legal argument to +gen_initramfs_list.sh. If a directory is specified as an argument then +the contents are scanned, uid/gid translation is performed, and +usr/gen_init_cpio file directives are output. If a file is +specified as an argument to usr/gen_initramfs_list.sh then the +contents of the file are simply copied to the output. All of the output +directives from directory scanning and file contents copying are +processed by usr/gen_init_cpio. + +See also 'usr/gen_initramfs_list.sh -h'. + +Where's this all leading? +========================= + +The klibc distribution contains some of the necessary software to make +early userspace useful. The klibc distribution is currently +maintained separately from the kernel.
+ +You can obtain somewhat infrequent snapshots of klibc from +https://www.kernel.org/pub/linux/libs/klibc/ + +Active users are better off using the klibc git +repository, at http://git.kernel.org/?p=libs/klibc/klibc.git + +The standalone klibc distribution currently provides three components, +in addition to the klibc library: + +- ipconfig, a program that configures network interfaces. It can + configure them statically, or use DHCP to obtain information + dynamically (aka "IP autoconfiguration"). +- nfsmount, a program that can mount an NFS filesystem. +- kinit, the "glue" that uses ipconfig and nfsmount to replace the old + support for IP autoconfig, mount a filesystem over NFS, and continue + system boot using that filesystem as root. + +kinit is built as a single statically linked binary to save space. + +Eventually, several more chunks of kernel functionality will hopefully +move to early userspace: + +- Almost all of init/do_mounts* (the beginning of this is already in + place) +- ACPI table parsing +- Insert unwieldy subsystem that doesn't really need to be in kernel + space here + +If kinit doesn't meet your current needs and you've got bytes to burn, +the klibc distribution includes a small Bourne-compatible shell (ash) +and a number of other utilities, so you can replace kinit and build +custom initramfs images that meet your needs exactly. + +For questions and help, you can sign up for the early userspace +mailing list at http://www.zytor.com/mailman/listinfo/klibc + +How does it work? +================= + +The kernel currently has three ways to mount the root filesystem: + +a) all required device and filesystem drivers compiled into the kernel, no + initrd. init/main.c:init() will call prepare_namespace() to mount the + final root filesystem, based on the root= option and an optional init= to + run an init binary other than those listed at the end of init/main.c:init(). + +b) some device and filesystem drivers built as modules and stored in an + initrd.
The initrd must contain a binary '/linuxrc' which is supposed to + load these driver modules. It is also possible to mount the final root + filesystem via linuxrc and use the pivot_root syscall. The initrd is + mounted and executed via prepare_namespace(). + +c) using initramfs. The call to prepare_namespace() must be skipped. + This means that a binary must do all the work. Said binary can be stored + in the initramfs either by modifying usr/gen_init_cpio.c or via the new + initrd format, a cpio archive. It must be called "/init". This binary + is responsible for doing everything prepare_namespace() would do. + + To maintain backwards compatibility, the /init binary will only run if it + comes via an initramfs cpio archive. If this is not the case, + init/main.c:init() will run prepare_namespace() to mount the final root + and exec one of the predefined init binaries. + +Bryan O'Sullivan diff --git a/Documentation/driver-api/early-userspace/index.rst b/Documentation/driver-api/early-userspace/index.rst new file mode 100644 index 000000000000..6f20c3c560d8 --- /dev/null +++ b/Documentation/driver-api/early-userspace/index.rst @@ -0,0 +1,16 @@ +=============== +Early Userspace +=============== + +.. toctree:: + :maxdepth: 1 + + early_userspace_support + buffer-format + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index baa77a666e46..0f281f4f648f 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -16,6 +16,7 @@ available subsections can be seen below.
basics infrastructure + early-userspace/index pm/index clk device-io diff --git a/Documentation/early-userspace/buffer-format.rst b/Documentation/early-userspace/buffer-format.rst deleted file mode 100644 index 7f74e301fdf3..000000000000 --- a/Documentation/early-userspace/buffer-format.rst +++ /dev/null @@ -1,119 +0,0 @@ -======================= -initramfs buffer format -======================= - -Al Viro, H. Peter Anvin - -Last revision: 2002-01-13 - -Starting with kernel 2.5.x, the old "initial ramdisk" protocol is -getting {replaced/complemented} with the new "initial ramfs" -(initramfs) protocol. The initramfs contents is passed using the same -memory buffer protocol used by the initrd protocol, but the contents -is different. The initramfs buffer contains an archive which is -expanded into a ramfs filesystem; this document details the format of -the initramfs buffer format. - -The initramfs buffer format is based around the "newc" or "crc" CPIO -formats, and can be created with the cpio(1) utility. The cpio -archive can be compressed using gzip(1). One valid version of an -initramfs buffer is thus a single .cpio.gz file. - -The full format of the initramfs buffer is defined by the following -grammar, where:: - - * is used to indicate "0 or more occurrences of" - (|) indicates alternatives - + indicates concatenation - GZIP() indicates the gzip(1) of the operand - ALGN(n) means padding with null bytes to an n-byte boundary - - initramfs := ("\0" | cpio_archive | cpio_gzip_archive)* - - cpio_gzip_archive := GZIP(cpio_archive) - - cpio_archive := cpio_file* + (<nothing> | cpio_trailer) - - cpio_file := ALGN(4) + cpio_header + filename + "\0" + ALGN(4) + data - - cpio_trailer := ALGN(4) + cpio_header + "TRAILER!!!\0" + ALGN(4) - - -In human terms, the initramfs buffer contains a collection of -compressed and/or uncompressed cpio archives (in the "newc" or "crc" -formats); arbitrary amounts zero bytes (for padding) can be added -between members. - -The cpio "TRAILER!!!"
entry (cpio end-of-archive) is optional, but is -not ignored; see "handling of hard links" below. - -The structure of the cpio_header is as follows (all fields contain -hexadecimal ASCII numbers fully padded with '0' on the left to the -full width of the field, for example, the integer 4780 is represented -by the ASCII string "000012ac"): - -============= ================== ============================================== -Field name Field size Meaning -============= ================== ============================================== -c_magic 6 bytes The string "070701" or "070702" -c_ino 8 bytes File inode number -c_mode 8 bytes File mode and permissions -c_uid 8 bytes File uid -c_gid 8 bytes File gid -c_nlink 8 bytes Number of links -c_mtime 8 bytes Modification time -c_filesize 8 bytes Size of data field -c_maj 8 bytes Major part of file device number -c_min 8 bytes Minor part of file device number -c_rmaj 8 bytes Major part of device node reference -c_rmin 8 bytes Minor part of device node reference -c_namesize 8 bytes Length of filename, including final \0 -c_chksum 8 bytes Checksum of data field if c_magic is 070702; - otherwise zero -============= ================== ============================================== - -The c_mode field matches the contents of st_mode returned by stat(2) -on Linux, and encodes the file type and file permissions. - -The c_filesize should be zero for any file which is not a regular file -or symlink. - -The c_chksum field contains a simple 32-bit unsigned sum of all the -bytes in the data field. cpio(1) refers to this as "crc", which is -clearly incorrect (a cyclic redundancy check is a different and -significantly stronger integrity check), however, this is the -algorithm used. - -If the filename is "TRAILER!!!" this is actually an end-of-archive -marker; the c_filesize for an end-of-archive marker must be zero. 
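The header layout and checksum rule described above can be sketched in Python as follows. This is a minimal illustration, not the kernel's parser: the names `FIELDS`, `HEADER_LEN`, `parse_header` and `data_checksum` are hypothetical, and only the field order, field widths and the "simple 32-bit unsigned sum" come from the text.

```python
# The thirteen 8-byte hexadecimal ASCII fields that follow the 6-byte c_magic.
FIELDS = ("c_ino", "c_mode", "c_uid", "c_gid", "c_nlink", "c_mtime",
          "c_filesize", "c_maj", "c_min", "c_rmaj", "c_rmin",
          "c_namesize", "c_chksum")
HEADER_LEN = 6 + 8 * len(FIELDS)  # 6 + 13 * 8 = 110 bytes


def parse_header(raw: bytes) -> dict:
    """Decode one cpio_header into a dict of integers (c_magic stays a string)."""
    if len(raw) < HEADER_LEN:
        raise ValueError("buffer shorter than a cpio header")
    magic = raw[:6].decode("ascii")
    if magic not in ("070701", "070702"):
        raise ValueError("unexpected c_magic: %r" % magic)
    header = {"c_magic": magic}
    for index, name in enumerate(FIELDS):
        start = 6 + 8 * index
        # Each field is zero-padded hexadecimal ASCII, e.g. "000012ac" == 4780.
        header[name] = int(raw[start:start + 8].decode("ascii"), 16)
    return header


def data_checksum(data: bytes) -> int:
    """Simple 32-bit unsigned sum of all data bytes, as used when c_magic is 070702."""
    return sum(data) & 0xFFFFFFFF
```

A caller would read `c_namesize` bytes of filename after the header, pad to a 4-byte boundary, and then read `c_filesize` bytes of data, comparing `data_checksum()` against `c_chksum` only for the "070702" format.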
- - -Handling of hard links -====================== - -When a nondirectory with c_nlink > 1 is seen, the (c_maj,c_min,c_ino) -tuple is looked up in a tuple buffer. If not found, it is entered in -the tuple buffer and the entry is created as usual; if found, a hard -link rather than a second copy of the file is created. It is not -necessary (but permitted) to include a second copy of the file -contents; if the file contents is not included, the c_filesize field -should be set to zero to indicate no data section follows. If data is -present, the previous instance of the file is overwritten; this allows -the data-carrying instance of a file to occur anywhere in the sequence -(GNU cpio is reported to attach the data to the last instance of a -file only.) - -c_filesize must not be zero for a symlink. - -When a "TRAILER!!!" end-of-archive marker is seen, the tuple buffer is -reset. This permits archives which are generated independently to be -concatenated. - -To combine file data from different sources (without having to -regenerate the (c_maj,c_min,c_ino) fields), therefore, either one of -the following techniques can be used: - -a) Separate the different file data sources with a "TRAILER!!!" - end-of-archive marker, or - -b) Make sure c_nlink == 1 for all nondirectory entries. diff --git a/Documentation/early-userspace/early_userspace_support.rst b/Documentation/early-userspace/early_userspace_support.rst deleted file mode 100644 index 3deefb34046b..000000000000 --- a/Documentation/early-userspace/early_userspace_support.rst +++ /dev/null @@ -1,154 +0,0 @@ -======================= -Early userspace support -======================= - -Last update: 2004-12-20 tlh - - -"Early userspace" is a set of libraries and programs that provide -various pieces of functionality that are important enough to be -available while a Linux kernel is coming up, but that don't need to be -run inside the kernel itself. 
- -It consists of several major infrastructure components: - -- gen_init_cpio, a program that builds a cpio-format archive - containing a root filesystem image. This archive is compressed, and - the compressed image is linked into the kernel image. -- initramfs, a chunk of code that unpacks the compressed cpio image - midway through the kernel boot process. -- klibc, a userspace C library, currently packaged separately, that is - optimized for correctness and small size. - -The cpio file format used by initramfs is the "newc" (aka "cpio -H newc") -format, and is documented in the file "buffer-format.txt". There are -two ways to add an early userspace image: specify an existing cpio -archive to be used as the image or have the kernel build process build -the image from specifications. - -CPIO ARCHIVE method -------------------- - -You can create a cpio archive that contains the early userspace image. -Your cpio archive should be specified in CONFIG_INITRAMFS_SOURCE and it -will be used directly. Only a single cpio file may be specified in -CONFIG_INITRAMFS_SOURCE and directory and file names are not allowed in -combination with a cpio archive. - -IMAGE BUILDING method ---------------------- - -The kernel build process can also build an early userspace image from -source parts rather than supplying a cpio archive. This method provides -a way to create images with root-owned files even though the image was -built by an unprivileged user. - -The image is specified as one or more sources in -CONFIG_INITRAMFS_SOURCE. Sources can be either directories or files - -cpio archives are *not* allowed when building from sources. - -A source directory will have it and all of its contents packaged. The -specified directory name will be mapped to '/'. When packaging a -directory, limited user and group ID translation can be performed. -INITRAMFS_ROOT_UID can be set to a user ID that needs to be mapped to -user root (0). 
INITRAMFS_ROOT_GID can be set to a group ID that needs -to be mapped to group root (0). - -A source file must be directives in the format required by the -usr/gen_init_cpio utility (run 'usr/gen_init_cpio -h' to get the -file format). The directives in the file will be passed directly to -usr/gen_init_cpio. - -When a combination of directories and files are specified then the -initramfs image will be an aggregate of all of them. In this way a user -can create a 'root-image' directory and install all files into it. -Because device-special files cannot be created by a unprivileged user, -special files can be listed in a 'root-files' file. Both 'root-image' -and 'root-files' can be listed in CONFIG_INITRAMFS_SOURCE and a complete -early userspace image can be built by an unprivileged user. - -As a technical note, when directories and files are specified, the -entire CONFIG_INITRAMFS_SOURCE is passed to -usr/gen_initramfs_list.sh. This means that CONFIG_INITRAMFS_SOURCE -can really be interpreted as any legal argument to -gen_initramfs_list.sh. If a directory is specified as an argument then -the contents are scanned, uid/gid translation is performed, and -usr/gen_init_cpio file directives are output. If a directory is -specified as an argument to usr/gen_initramfs_list.sh then the -contents of the file are simply copied to the output. All of the output -directives from directory scanning and file contents copying are -processed by usr/gen_init_cpio. - -See also 'usr/gen_initramfs_list.sh -h'. - -Where's this all leading? -========================= - -The klibc distribution contains some of the necessary software to make -early userspace useful. The klibc distribution is currently -maintained separately from the kernel. 
- -You can obtain somewhat infrequent snapshots of klibc from -https://www.kernel.org/pub/linux/libs/klibc/ - -For active users, you are better off using the klibc git -repository, at http://git.kernel.org/?p=libs/klibc/klibc.git - -The standalone klibc distribution currently provides three components, -in addition to the klibc library: - -- ipconfig, a program that configures network interfaces. It can - configure them statically, or use DHCP to obtain information - dynamically (aka "IP autoconfiguration"). -- nfsmount, a program that can mount an NFS filesystem. -- kinit, the "glue" that uses ipconfig and nfsmount to replace the old - support for IP autoconfig, mount a filesystem over NFS, and continue - system boot using that filesystem as root. - -kinit is built as a single statically linked binary to save space. - -Eventually, several more chunks of kernel functionality will hopefully -move to early userspace: - -- Almost all of init/do_mounts* (the beginning of this is already in - place) -- ACPI table parsing -- Insert unwieldy subsystem that doesn't really need to be in kernel - space here - -If kinit doesn't meet your current needs and you've got bytes to burn, -the klibc distribution includes a small Bourne-compatible shell (ash) -and a number of other utilities, so you can replace kinit and build -custom initramfs images that meet your needs exactly. - -For questions and help, you can sign up for the early userspace -mailing list at http://www.zytor.com/mailman/listinfo/klibc - -How does it work? -================= - -The kernel has currently 3 ways to mount the root filesystem: - -a) all required device and filesystem drivers compiled into the kernel, no - initrd. init/main.c:init() will call prepare_namespace() to mount the - final root filesystem, based on the root= option and optional init= to run - some other init binary than listed at the end of init/main.c:init(). - -b) some device and filesystem drivers built as modules and stored in an - initrd. 
The initrd must contain a binary '/linuxrc' which is supposed to - load these driver modules. It is also possible to mount the final root - filesystem via linuxrc and use the pivot_root syscall. The initrd is - mounted and executed via prepare_namespace(). - -c) using initramfs. The call to prepare_namespace() must be skipped. - This means that a binary must do all the work. Said binary can be stored - into initramfs either via modifying usr/gen_init_cpio.c or via the new - initrd format, an cpio archive. It must be called "/init". This binary - is responsible to do all the things prepare_namespace() would do. - - To maintain backwards compatibility, the /init binary will only run if it - comes via an initramfs cpio archive. If this is not the case, - init/main.c:init() will run prepare_namespace() to mount the final root - and exec one of the predefined init binaries. - -Bryan O'Sullivan diff --git a/Documentation/early-userspace/index.rst b/Documentation/early-userspace/index.rst deleted file mode 100644 index 2b8eb6132058..000000000000 --- a/Documentation/early-userspace/index.rst +++ /dev/null @@ -1,18 +0,0 @@ -:orphan: - -=============== -Early Userspace -=============== - -.. toctree:: - :maxdepth: 1 - - early_userspace_support - buffer-format - -.. 
only:: subproject and html - - Indices - ======= - - * :ref:`genindex` diff --git a/Documentation/filesystems/nfs/nfsroot.txt b/Documentation/filesystems/nfs/nfsroot.txt index 4862d3d77e27..ae4332464560 100644 --- a/Documentation/filesystems/nfs/nfsroot.txt +++ b/Documentation/filesystems/nfs/nfsroot.txt @@ -239,7 +239,7 @@ rdinit= A description of the process of mounting the root file system can be found in: - Documentation/early-userspace/early_userspace_support.rst + Documentation/driver-api/early-userspace/early_userspace_support.rst diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.txt b/Documentation/filesystems/ramfs-rootfs-initramfs.txt index fa985909dbca..97d42ccaa92d 100644 --- a/Documentation/filesystems/ramfs-rootfs-initramfs.txt +++ b/Documentation/filesystems/ramfs-rootfs-initramfs.txt @@ -105,7 +105,7 @@ All this differs from the old initrd in several ways: - The old initrd file was a gzipped filesystem image (in some file format, such as ext2, that needed a driver built into the kernel), while the new initramfs archive is a gzipped cpio archive (like tar only simpler, - see cpio(1) and Documentation/early-userspace/buffer-format.rst). The + see cpio(1) and Documentation/driver-api/early-userspace/buffer-format.rst). The kernel's cpio extraction code is not only extremely small, it's also __init text and data that can be discarded during the boot process. @@ -159,7 +159,7 @@ One advantage of the configuration file is that root access is not required to set permissions or create device nodes in the new archive. (Note that those two example "file" entries expect to find files named "init.sh" and "busybox" in a directory called "initramfs", under the linux-2.6.* directory. See -Documentation/early-userspace/early_userspace_support.rst for more details.) +Documentation/driver-api/early-userspace/early_userspace_support.rst for more details.) The kernel does not depend on external cpio tools. 
If you specify a directory instead of a configuration file, the kernel's build infrastructure diff --git a/usr/Kconfig b/usr/Kconfig index 86e37e297278..a6b68503d177 100644 --- a/usr/Kconfig +++ b/usr/Kconfig @@ -18,7 +18,7 @@ config INITRAMFS_SOURCE When multiple directories and files are specified then the initramfs image will be the aggregate of all of them. - See for more details. + See for more details. If you are not sure, leave it blank. -- cgit v1.2.3 From 56198359b64125dd0f9fa991972b61e4bc4fc6b5 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 18 Jun 2019 11:44:24 -0300 Subject: docs: lp855x-driver.rst: add it to the driver-api book The content of this file is intended for backlight Kernel developers. Signed-off-by: Mauro Carvalho Chehab --- Documentation/backlight/lp855x-driver.rst | 83 ---------------------- .../driver-api/backlight/lp855x-driver.rst | 81 +++++++++++++++++++++ Documentation/driver-api/index.rst | 1 + MAINTAINERS | 2 +- 4 files changed, 83 insertions(+), 84 deletions(-) delete mode 100644 Documentation/backlight/lp855x-driver.rst create mode 100644 Documentation/driver-api/backlight/lp855x-driver.rst (limited to 'Documentation/driver-api') diff --git a/Documentation/backlight/lp855x-driver.rst b/Documentation/backlight/lp855x-driver.rst deleted file mode 100644 index 62b7ed847a77..000000000000 --- a/Documentation/backlight/lp855x-driver.rst +++ /dev/null @@ -1,83 +0,0 @@ -:orphan: - -==================== -Kernel driver lp855x -==================== - -Backlight driver for LP855x ICs - -Supported chips: - - Texas Instruments LP8550, LP8551, LP8552, LP8553, LP8555, LP8556 and - LP8557 - -Author: Milo(Woogyom) Kim - -Description ------------ - -* Brightness control - - Brightness can be controlled by the pwm input or the i2c command. - The lp855x driver supports both cases. - -* Device attributes - - 1) bl_ctl_mode - - Backlight control mode. - - Value: pwm based or register based - - 2) chip_id - - The lp855x chip id. 
- - Value: lp8550/lp8551/lp8552/lp8553/lp8555/lp8556/lp8557 - -Platform data for lp855x ------------------------- - -For supporting platform specific data, the lp855x platform data can be used. - -* name: - Backlight driver name. If it is not defined, default name is set. -* device_control: - Value of DEVICE CONTROL register. -* initial_brightness: - Initial value of backlight brightness. -* period_ns: - Platform specific PWM period value. unit is nano. - Only valid when brightness is pwm input mode. -* size_program: - Total size of lp855x_rom_data. -* rom_data: - List of new eeprom/eprom registers. - -Examples -======== - -1) lp8552 platform data: i2c register mode with new eeprom data:: - - #define EEPROM_A5_ADDR 0xA5 - #define EEPROM_A5_VAL 0x4f /* EN_VSYNC=0 */ - - static struct lp855x_rom_data lp8552_eeprom_arr[] = { - {EEPROM_A5_ADDR, EEPROM_A5_VAL}, - }; - - static struct lp855x_platform_data lp8552_pdata = { - .name = "lcd-bl", - .device_control = I2C_CONFIG(LP8552), - .initial_brightness = INITIAL_BRT, - .size_program = ARRAY_SIZE(lp8552_eeprom_arr), - .rom_data = lp8552_eeprom_arr, - }; - -2) lp8556 platform data: pwm input mode with default rom data:: - - static struct lp855x_platform_data lp8556_pdata = { - .device_control = PWM_CONFIG(LP8556), - .initial_brightness = INITIAL_BRT, - .period_ns = 1000000, - }; diff --git a/Documentation/driver-api/backlight/lp855x-driver.rst b/Documentation/driver-api/backlight/lp855x-driver.rst new file mode 100644 index 000000000000..1e0b224fc397 --- /dev/null +++ b/Documentation/driver-api/backlight/lp855x-driver.rst @@ -0,0 +1,81 @@ +==================== +Kernel driver lp855x +==================== + +Backlight driver for LP855x ICs + +Supported chips: + + Texas Instruments LP8550, LP8551, LP8552, LP8553, LP8555, LP8556 and + LP8557 + +Author: Milo(Woogyom) Kim + +Description +----------- + +* Brightness control + + Brightness can be controlled by the pwm input or the i2c command. 
+ The lp855x driver supports both cases. + +* Device attributes + + 1) bl_ctl_mode + + Backlight control mode. + + Value: pwm based or register based + + 2) chip_id + + The lp855x chip id. + + Value: lp8550/lp8551/lp8552/lp8553/lp8555/lp8556/lp8557 + +Platform data for lp855x +------------------------ + +For supporting platform specific data, the lp855x platform data can be used. + +* name: + Backlight driver name. If it is not defined, default name is set. +* device_control: + Value of DEVICE CONTROL register. +* initial_brightness: + Initial value of backlight brightness. +* period_ns: + Platform specific PWM period value. unit is nano. + Only valid when brightness is pwm input mode. +* size_program: + Total size of lp855x_rom_data. +* rom_data: + List of new eeprom/eprom registers. + +Examples +======== + +1) lp8552 platform data: i2c register mode with new eeprom data:: + + #define EEPROM_A5_ADDR 0xA5 + #define EEPROM_A5_VAL 0x4f /* EN_VSYNC=0 */ + + static struct lp855x_rom_data lp8552_eeprom_arr[] = { + {EEPROM_A5_ADDR, EEPROM_A5_VAL}, + }; + + static struct lp855x_platform_data lp8552_pdata = { + .name = "lcd-bl", + .device_control = I2C_CONFIG(LP8552), + .initial_brightness = INITIAL_BRT, + .size_program = ARRAY_SIZE(lp8552_eeprom_arr), + .rom_data = lp8552_eeprom_arr, + }; + +2) lp8556 platform data: pwm input mode with default rom data:: + + static struct lp855x_platform_data lp8556_pdata = { + .device_control = PWM_CONFIG(LP8556), + .initial_brightness = INITIAL_BRT, + .period_ns = 1000000, + }; diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index 0f281f4f648f..b4c993ff7655 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -66,6 +66,7 @@ available subsections can be seen below. soundwire/index fpga/index acpi/index + backlight/lp855x-driver.rst generic-counter .. 
only:: subproject and html
diff --git a/MAINTAINERS b/MAINTAINERS
index 8f496d76bb53..3feb318e1433 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -15964,7 +15964,7 @@ F: sound/soc/codecs/isabelle*
 TI LP855x BACKLIGHT DRIVER
 M: Milo Kim
 S: Maintained
-F: Documentation/backlight/lp855x-driver.rst
+F: Documentation/driver-api/backlight/lp855x-driver.rst
 F: drivers/video/backlight/lp855x_bl.c
 F: include/linux/platform_data/lp855x.h
--
cgit v1.2.3

From fe34c89d25429e079ba67416529514120dd715f8 Mon Sep 17 00:00:00 2001
From: Mauro Carvalho Chehab
Date: Tue, 18 Jun 2019 12:34:59 -0300
Subject: docs: driver-model: move it to the driver-api book

The audience for the Kernel driver-model is clearly Kernel hackers.

Signed-off-by: Mauro Carvalho Chehab
Acked-by: Jeff Kirsher # ice driver changes
---
 Documentation/driver-api/driver-model/binding.rst  |  98 +++++
 Documentation/driver-api/driver-model/bus.rst      | 146 +++++++
 Documentation/driver-api/driver-model/class.rst    | 149 +++++++
 .../driver-api/driver-model/design-patterns.rst    | 116 ++++++
 Documentation/driver-api/driver-model/device.rst   | 109 +++++
 Documentation/driver-api/driver-model/devres.rst   | 414 +++++++++++++++++++
 Documentation/driver-api/driver-model/driver.rst   | 223 ++++++++++
 Documentation/driver-api/driver-model/index.rst    |  24 ++
 Documentation/driver-api/driver-model/overview.rst | 124 ++++++
 Documentation/driver-api/driver-model/platform.rst | 246 +++++++++++
 Documentation/driver-api/driver-model/porting.rst  | 448 +++++++++++++++++++++
 Documentation/driver-api/gpio/driver.rst           |   2 +-
 Documentation/driver-api/index.rst                 |   1 +
 Documentation/driver-model/binding.rst             |  98 -----
 Documentation/driver-model/bus.rst                 | 146 -------
 Documentation/driver-model/class.rst               | 149 -------
 Documentation/driver-model/design-patterns.rst     | 116 ------
 Documentation/driver-model/device.rst              | 109 -----
 Documentation/driver-model/devres.rst              | 414 -------------------
 Documentation/driver-model/driver.rst              | 223 ----------
 Documentation/driver-model/index.rst               |  26 --
 Documentation/driver-model/overview.rst            | 124 ------
 Documentation/driver-model/platform.rst            | 246 -----------
 Documentation/driver-model/porting.rst             | 448 ---------------------
 Documentation/eisa.txt                             |   4 +-
 Documentation/filesystems/sysfs.txt                |   2 +-
 Documentation/hwmon/submitting-patches.rst         |   2 +-
 .../translations/zh_CN/filesystems/sysfs.txt       |   2 +-
 drivers/base/platform.c                            |   2 +-
 drivers/gpio/gpio-cs5535.c                         |   2 +-
 drivers/net/ethernet/intel/ice/ice_main.c          |   2 +-
 drivers/staging/unisys/Documentation/overview.txt  |   4 +-
 include/linux/device.h                             |   2 +-
 include/linux/platform_device.h                    |   2 +-
 scripts/coccinelle/free/devm_free.cocci            |   2 +-
 35 files changed, 2112 insertions(+), 2113 deletions(-)
 create mode 100644 Documentation/driver-api/driver-model/binding.rst
 create mode 100644 Documentation/driver-api/driver-model/bus.rst
 create mode 100644 Documentation/driver-api/driver-model/class.rst
 create mode 100644 Documentation/driver-api/driver-model/design-patterns.rst
 create mode 100644 Documentation/driver-api/driver-model/device.rst
 create mode 100644 Documentation/driver-api/driver-model/devres.rst
 create mode 100644 Documentation/driver-api/driver-model/driver.rst
 create mode 100644 Documentation/driver-api/driver-model/index.rst
 create mode 100644 Documentation/driver-api/driver-model/overview.rst
 create mode 100644 Documentation/driver-api/driver-model/platform.rst
 create mode 100644 Documentation/driver-api/driver-model/porting.rst
 delete mode 100644 Documentation/driver-model/binding.rst
 delete mode 100644 Documentation/driver-model/bus.rst
 delete mode 100644 Documentation/driver-model/class.rst
 delete mode 100644 Documentation/driver-model/design-patterns.rst
 delete mode 100644 Documentation/driver-model/device.rst
 delete mode 100644 Documentation/driver-model/devres.rst
 delete mode 100644 Documentation/driver-model/driver.rst
 delete mode 100644 Documentation/driver-model/index.rst
 delete mode 100644
Documentation/driver-model/overview.rst delete mode 100644 Documentation/driver-model/platform.rst delete mode 100644 Documentation/driver-model/porting.rst (limited to 'Documentation/driver-api') diff --git a/Documentation/driver-api/driver-model/binding.rst b/Documentation/driver-api/driver-model/binding.rst new file mode 100644 index 000000000000..7ea1d7a41e1d --- /dev/null +++ b/Documentation/driver-api/driver-model/binding.rst @@ -0,0 +1,98 @@ +============== +Driver Binding +============== + +Driver binding is the process of associating a device with a device +driver that can control it. Bus drivers have typically handled this +because there have been bus-specific structures to represent the +devices and the drivers. With generic device and device driver +structures, most of the binding can take place using common code. + + +Bus +~~~ + +The bus type structure contains a list of all devices that are on that bus +type in the system. When device_register is called for a device, it is +inserted into the end of this list. The bus object also contains a +list of all drivers of that bus type. When driver_register is called +for a driver, it is inserted at the end of this list. These are the +two events which trigger driver binding. + + +device_register +~~~~~~~~~~~~~~~ + +When a new device is added, the bus's list of drivers is iterated over +to find one that supports it. In order to determine that, the device +ID of the device must match one of the device IDs that the driver +supports. The format and semantics for comparing IDs is bus-specific. +Instead of trying to derive a complex state machine and matching +algorithm, it is up to the bus driver to provide a callback to compare +a device against the IDs of a driver. The bus returns 1 if a match was +found; 0 otherwise. + +int match(struct device * dev, struct device_driver * drv); + +If a match is found, the device's driver field is set to the driver +and the driver's probe callback is called. 
This gives the driver a +chance to verify that it really does support the hardware, and that +it's in a working state. + +Device Class +~~~~~~~~~~~~ + +Upon the successful completion of probe, the device is registered with +the class to which it belongs. Device drivers belong to one and only one +class, and that is set in the driver's devclass field. +devclass_add_device is called to enumerate the device within the class +and actually register it with the class, which happens with the +class's register_dev callback. + + +Driver +~~~~~~ + +When a driver is attached to a device, the device is inserted into the +driver's list of devices. + + +sysfs +~~~~~ + +A symlink is created in the bus's 'devices' directory that points to +the device's directory in the physical hierarchy. + +A symlink is created in the driver's 'devices' directory that points +to the device's directory in the physical hierarchy. + +A directory for the device is created in the class's directory. A +symlink is created in that directory that points to the device's +physical location in the sysfs tree. + +A symlink can be created (though this isn't done yet) in the device's +physical directory to either its class directory, or the class's +top-level directory. One can also be created to point to its driver's +directory also. + + +driver_register +~~~~~~~~~~~~~~~ + +The process is almost identical for when a new driver is added. +The bus's list of devices is iterated over to find a match. Devices +that already have a driver are skipped. All the devices are iterated +over, to bind as many devices as possible to the driver. + + +Removal +~~~~~~~ + +When a device is removed, the reference count for it will eventually +go to 0. When it does, the remove callback of the driver is called. It +is removed from the driver's list of devices and the reference count +of the driver is decremented. All symlinks between the two are removed. 
+ +When a driver is removed, the list of devices that it supports is +iterated over, and the driver's remove callback is called for each +one. The device is removed from that list and the symlinks removed. diff --git a/Documentation/driver-api/driver-model/bus.rst b/Documentation/driver-api/driver-model/bus.rst new file mode 100644 index 000000000000..016b15a6e8ea --- /dev/null +++ b/Documentation/driver-api/driver-model/bus.rst @@ -0,0 +1,146 @@ +========= +Bus Types +========= + +Definition +~~~~~~~~~~ +See the kerneldoc for the struct bus_type. + +int bus_register(struct bus_type * bus); + + +Declaration +~~~~~~~~~~~ + +Each bus type in the kernel (PCI, USB, etc) should declare one static +object of this type. They must initialize the name field, and may +optionally initialize the match callback:: + + struct bus_type pci_bus_type = { + .name = "pci", + .match = pci_bus_match, + }; + +The structure should be exported to drivers in a header file: + +extern struct bus_type pci_bus_type; + + +Registration +~~~~~~~~~~~~ + +When a bus driver is initialized, it calls bus_register. This +initializes the rest of the fields in the bus object and inserts it +into a global list of bus types. Once the bus object is registered, +the fields in it are usable by the bus driver. + + +Callbacks +~~~~~~~~~ + +match(): Attaching Drivers to Devices +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The format of device ID structures and the semantics for comparing +them are inherently bus-specific. Drivers typically declare an array +of device IDs of devices they support that reside in a bus-specific +driver structure. + +The purpose of the match callback is to give the bus an opportunity to +determine if a particular driver supports a particular device by +comparing the device IDs the driver supports with the device ID of a +particular device, without sacrificing bus-specific functionality or +type-safety. 
+ +When a driver is registered with the bus, the bus's list of devices is +iterated over, and the match callback is called for each device that +does not have a driver associated with it. + + + +Device and Driver Lists +~~~~~~~~~~~~~~~~~~~~~~~ + +The lists of devices and drivers are intended to replace the local +lists that many buses keep. They are lists of struct devices and +struct device_drivers, respectively. Bus drivers are free to use the +lists as they please, but conversion to the bus-specific type may be +necessary. + +The LDM core provides helper functions for iterating over each list:: + + int bus_for_each_dev(struct bus_type * bus, struct device * start, + void * data, + int (*fn)(struct device *, void *)); + + int bus_for_each_drv(struct bus_type * bus, struct device_driver * start, + void * data, int (*fn)(struct device_driver *, void *)); + +These helpers iterate over the respective list, and call the callback +for each device or driver in the list. All list accesses are +synchronized by taking the bus's lock (read currently). The reference +count on each object in the list is incremented before the callback is +called; it is decremented after the next object has been obtained. The +lock is not held when calling the callback. + + +sysfs +~~~~~~~~ +There is a top-level directory named 'bus'. 
+ +Each bus gets a directory in the bus directory, along with two default +directories:: + + /sys/bus/pci/ + |-- devices + `-- drivers + +Drivers registered with the bus get a directory in the bus's drivers +directory:: + + /sys/bus/pci/ + |-- devices + `-- drivers + |-- Intel ICH + |-- Intel ICH Joystick + |-- agpgart + `-- e100 + +Each device that is discovered on a bus of that type gets a symlink in +the bus's devices directory to the device's directory in the physical +hierarchy:: + + /sys/bus/pci/ + |-- devices + | |-- 00:00.0 -> ../../../root/pci0/00:00.0 + | |-- 00:01.0 -> ../../../root/pci0/00:01.0 + | `-- 00:02.0 -> ../../../root/pci0/00:02.0 + `-- drivers + + +Exporting Attributes +~~~~~~~~~~~~~~~~~~~~ + +:: + + struct bus_attribute { + struct attribute attr; + ssize_t (*show)(struct bus_type *, char * buf); + ssize_t (*store)(struct bus_type *, const char * buf, size_t count); + }; + +Bus drivers can export attributes using the BUS_ATTR_RW macro that works +similarly to the DEVICE_ATTR_RW macro for devices. For example, a +definition like this:: + + static BUS_ATTR_RW(debug); + +is equivalent to declaring:: + + static bus_attribute bus_attr_debug; + +This can then be used to add and remove the attribute from the bus's +sysfs directory using:: + + int bus_create_file(struct bus_type *, struct bus_attribute *); + void bus_remove_file(struct bus_type *, struct bus_attribute *); diff --git a/Documentation/driver-api/driver-model/class.rst b/Documentation/driver-api/driver-model/class.rst new file mode 100644 index 000000000000..fff55b80e86a --- /dev/null +++ b/Documentation/driver-api/driver-model/class.rst @@ -0,0 +1,149 @@ +============== +Device Classes +============== + +Introduction +~~~~~~~~~~~~ +A device class describes a type of device, like an audio or network +device. The following device classes have been identified: + + + + +Each device class defines a set of semantics and a programming interface +that devices of that class adhere to. 
Device drivers are the +implementation of that programming interface for a particular device on +a particular bus. + +Device classes are agnostic with respect to what bus a device resides +on. + + +Programming Interface +~~~~~~~~~~~~~~~~~~~~~ +The device class structure looks like:: + + + typedef int (*devclass_add)(struct device *); + typedef void (*devclass_remove)(struct device *); + +See the kerneldoc for the struct class. + +A typical device class definition would look like:: + + struct device_class input_devclass = { + .name = "input", + .add_device = input_add_device, + .remove_device = input_remove_device, + }; + +Each device class structure should be exported in a header file so it +can be used by drivers, extensions and interfaces. + +Device classes are registered and unregistered with the core using:: + + int devclass_register(struct device_class * cls); + void devclass_unregister(struct device_class * cls); + + +Devices +~~~~~~~ +As devices are bound to drivers, they are added to the device class +that the driver belongs to. Before the driver model core, this would +typically happen during the driver's probe() callback, once the device +has been initialized. It now happens after the probe() callback +finishes from the core. + +The device is enumerated in the class. Each time a device is added to +the class, the class's devnum field is incremented and assigned to the +device. The field is never decremented, so if the device is removed +from the class and re-added, it will receive a different enumerated +value. + +The class is allowed to create a class-specific structure for the +device and store it in the device's class_data pointer. + +There is no list of devices in the device class. Each driver has a +list of devices that it supports. The device class has a list of +drivers of that particular class. To access all of the devices in the +class, iterate over the device lists of each driver in the class. 
+ + +Device Drivers +~~~~~~~~~~~~~~ +Device drivers are added to device classes when they are registered +with the core. A driver specifies the class it belongs to by setting +the struct device_driver::devclass field. + + +sysfs directory structure +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +There is a top-level sysfs directory named 'class'. + +Each class gets a directory in the class directory, along with two +default subdirectories:: + + class/ + `-- input + |-- devices + `-- drivers + + +Drivers registered with the class get a symlink in the drivers/ directory +that points to the driver's directory (under its bus directory):: + + class/ + `-- input + |-- devices + `-- drivers + `-- usb:usb_mouse -> ../../../bus/drivers/usb_mouse/ + + +Each device gets a symlink in the devices/ directory that points to the +device's directory in the physical hierarchy:: + + class/ + `-- input + |-- devices + | `-- 1 -> ../../../root/pci0/00:1f.0/usb_bus/00:1f.2-1:0/ + `-- drivers + + +Exporting Attributes +~~~~~~~~~~~~~~~~~~~~ + +:: + + struct devclass_attribute { + struct attribute attr; + ssize_t (*show)(struct device_class *, char * buf, size_t count, loff_t off); + ssize_t (*store)(struct device_class *, const char * buf, size_t count, loff_t off); + }; + +Class drivers can export attributes using the DEVCLASS_ATTR macro that works +similarly to the DEVICE_ATTR macro for devices. For example, a definition +like this:: + + static DEVCLASS_ATTR(debug,0644,show_debug,store_debug); + +is equivalent to declaring:: + + static devclass_attribute devclass_attr_debug; + +The bus driver can add and remove the attribute from the class's +sysfs directory using:: + + int devclass_create_file(struct device_class *, struct devclass_attribute *); + void devclass_remove_file(struct device_class *, struct devclass_attribute *); + +In the example above, the file will be named 'debug' in placed in the +class's directory in sysfs. 
+ + +Interfaces +~~~~~~~~~~ +There may exist multiple mechanisms for accessing the same device of a +particular class type. Device interfaces describe these mechanisms. + +When a device is added to a device class, the core attempts to add it +to every interface that is registered with the device class. diff --git a/Documentation/driver-api/driver-model/design-patterns.rst b/Documentation/driver-api/driver-model/design-patterns.rst new file mode 100644 index 000000000000..41eb8f41f7dd --- /dev/null +++ b/Documentation/driver-api/driver-model/design-patterns.rst @@ -0,0 +1,116 @@ +============================= +Device Driver Design Patterns +============================= + +This document describes a few common design patterns found in device drivers. +It is likely that subsystem maintainers will ask driver developers to +conform to these design patterns. + +1. State Container +2. container_of() + + +1. State Container +~~~~~~~~~~~~~~~~~~ + +While the kernel contains a few device drivers that assume that they will +only be probed() once on a certain system (singletons), it is custom to assume +that the device the driver binds to will appear in several instances. This +means that the probe() function and all callbacks need to be reentrant. + +The most common way to achieve this is to use the state container design +pattern. It usually has this form:: + + struct foo { + spinlock_t lock; /* Example member */ + (...) + }; + + static int foo_probe(...) + { + struct foo *foo; + + foo = devm_kzalloc(dev, sizeof(*foo), GFP_KERNEL); + if (!foo) + return -ENOMEM; + spin_lock_init(&foo->lock); + (...) + } + +This will create an instance of struct foo in memory every time probe() is +called. This is our state container for this instance of the device driver. +Of course it is then necessary to always pass this instance of the +state around to all functions that need access to the state and its members. 
+ +For example, if the driver is registering an interrupt handler, you would +pass around a pointer to struct foo like this:: + + static irqreturn_t foo_handler(int irq, void *arg) + { + struct foo *foo = arg; + (...) + } + + static int foo_probe(...) + { + struct foo *foo; + + (...) + ret = request_irq(irq, foo_handler, 0, "foo", foo); + } + +This way you always get a pointer back to the correct instance of foo in +your interrupt handler. + + +2. container_of() +~~~~~~~~~~~~~~~~~ + +Continuing on the above example we add an offloaded work:: + + struct foo { + spinlock_t lock; + struct workqueue_struct *wq; + struct work_struct offload; + (...) + }; + + static void foo_work(struct work_struct *work) + { + struct foo *foo = container_of(work, struct foo, offload); + + (...) + } + + static irqreturn_t foo_handler(int irq, void *arg) + { + struct foo *foo = arg; + + queue_work(foo->wq, &foo->offload); + (...) + } + + static int foo_probe(...) + { + struct foo *foo; + + foo->wq = create_singlethread_workqueue("foo-wq"); + INIT_WORK(&foo->offload, foo_work); + (...) + } + +The design pattern is the same for an hrtimer or something similar that will +return a single argument which is a pointer to a struct member in the +callback. + +container_of() is a macro defined in + +What container_of() does is to obtain a pointer to the containing struct from +a pointer to a member by a simple subtraction using the offsetof() macro from +standard C, which allows something similar to object oriented behaviours. +Notice that the contained member must not be a pointer, but an actual member +for this to work. + +We can see here that we avoid having global pointers to our struct foo * +instance this way, while still keeping the number of parameters passed to the +work function to a single pointer. 
diff --git a/Documentation/driver-api/driver-model/device.rst b/Documentation/driver-api/driver-model/device.rst new file mode 100644 index 000000000000..2b868d49d349 --- /dev/null +++ b/Documentation/driver-api/driver-model/device.rst @@ -0,0 +1,109 @@ +========================== +The Basic Device Structure +========================== + +See the kerneldoc for the struct device. + + +Programming Interface +~~~~~~~~~~~~~~~~~~~~~ +The bus driver that discovers the device uses this to register the +device with the core:: + + int device_register(struct device * dev); + +The bus should initialize the following fields: + + - parent + - name + - bus_id + - bus + +A device is removed from the core when its reference count goes to +0. The reference count can be adjusted using:: + + struct device * get_device(struct device * dev); + void put_device(struct device * dev); + +get_device() will return a pointer to the struct device passed to it +if the reference is not already 0 (if it's in the process of being +removed already). + +A driver can access the lock in the device structure using:: + + void lock_device(struct device * dev); + void unlock_device(struct device * dev); + + +Attributes +~~~~~~~~~~ + +:: + + struct device_attribute { + struct attribute attr; + ssize_t (*show)(struct device *dev, struct device_attribute *attr, + char *buf); + ssize_t (*store)(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count); + }; + +Attributes of devices can be exported by a device driver through sysfs. + +Please see Documentation/filesystems/sysfs.txt for more information +on how sysfs works. + +As explained in Documentation/kobject.txt, device attributes must be +created before the KOBJ_ADD uevent is generated. The only way to realize +that is by defining an attribute group. 
+ +Attributes are declared using a macro called DEVICE_ATTR:: + + #define DEVICE_ATTR(name,mode,show,store) + +Example::: + + static DEVICE_ATTR(type, 0444, show_type, NULL); + static DEVICE_ATTR(power, 0644, show_power, store_power); + +This declares two structures of type struct device_attribute with respective +names 'dev_attr_type' and 'dev_attr_power'. These two attributes can be +organized as follows into a group:: + + static struct attribute *dev_attrs[] = { + &dev_attr_type.attr, + &dev_attr_power.attr, + NULL, + }; + + static struct attribute_group dev_attr_group = { + .attrs = dev_attrs, + }; + + static const struct attribute_group *dev_attr_groups[] = { + &dev_attr_group, + NULL, + }; + +This array of groups can then be associated with a device by setting the +group pointer in struct device before device_register() is invoked:: + + dev->groups = dev_attr_groups; + device_register(dev); + +The device_register() function will use the 'groups' pointer to create the +device attributes and the device_unregister() function will use this pointer +to remove the device attributes. + +Word of warning: While the kernel allows device_create_file() and +device_remove_file() to be called on a device at any time, userspace has +strict expectations on when attributes get created. When a new device is +registered in the kernel, a uevent is generated to notify userspace (like +udev) that a new device is available. If attributes are added after the +device is registered, then userspace won't get notified and userspace will +not know about the new attributes. + +This is important for device driver that need to publish additional +attributes for a device at driver probe time. If the device driver simply +calls device_create_file() on the device structure passed to it, then +userspace will never be notified of the new attributes. 
diff --git a/Documentation/driver-api/driver-model/devres.rst b/Documentation/driver-api/driver-model/devres.rst new file mode 100644 index 000000000000..4ac99122b5f1 --- /dev/null +++ b/Documentation/driver-api/driver-model/devres.rst @@ -0,0 +1,414 @@ +================================ +Devres - Managed Device Resource +================================ + +Tejun Heo + +First draft 10 January 2007 + +.. contents + + 1. Intro : Huh? Devres? + 2. Devres : Devres in a nutshell + 3. Devres Group : Group devres'es and release them together + 4. Details : Life time rules, calling context, ... + 5. Overhead : How much do we have to pay for this? + 6. List of managed interfaces: Currently implemented managed interfaces + + +1. Intro +-------- + +devres came up while trying to convert libata to use iomap. Each +iomapped address should be kept and unmapped on driver detach. For +example, a plain SFF ATA controller (that is, good old PCI IDE) in +native mode makes use of 5 PCI BARs and all of them should be +maintained. + +As with many other device drivers, libata low level drivers have +sufficient bugs in ->remove and ->probe failure path. Well, yes, +that's probably because libata low level driver developers are lazy +bunch, but aren't all low level driver developers? After spending a +day fiddling with braindamaged hardware with no document or +braindamaged document, if it's finally working, well, it's working. + +For one reason or another, low level drivers don't receive as much +attention or testing as core code, and bugs on driver detach or +initialization failure don't happen often enough to be noticeable. +Init failure path is worse because it's much less travelled while +needs to handle multiple entry points. + +So, many low level drivers end up leaking resources on driver detach +and having half broken failure path implementation in ->probe() which +would leak resources or even cause oops when failure occurs. iomap +adds more to this mix. So do msi and msix. + + +2. 
Devres +--------- + +devres is basically linked list of arbitrarily sized memory areas +associated with a struct device. Each devres entry is associated with +a release function. A devres can be released in several ways. No +matter what, all devres entries are released on driver detach. On +release, the associated release function is invoked and then the +devres entry is freed. + +Managed interface is created for resources commonly used by device +drivers using devres. For example, coherent DMA memory is acquired +using dma_alloc_coherent(). The managed version is called +dmam_alloc_coherent(). It is identical to dma_alloc_coherent() except +for the DMA memory allocated using it is managed and will be +automatically released on driver detach. Implementation looks like +the following:: + + struct dma_devres { + size_t size; + void *vaddr; + dma_addr_t dma_handle; + }; + + static void dmam_coherent_release(struct device *dev, void *res) + { + struct dma_devres *this = res; + + dma_free_coherent(dev, this->size, this->vaddr, this->dma_handle); + } + + dmam_alloc_coherent(dev, size, dma_handle, gfp) + { + struct dma_devres *dr; + void *vaddr; + + dr = devres_alloc(dmam_coherent_release, sizeof(*dr), gfp); + ... + + /* alloc DMA memory as usual */ + vaddr = dma_alloc_coherent(...); + ... + + /* record size, vaddr, dma_handle in dr */ + dr->vaddr = vaddr; + ... + + devres_add(dev, dr); + + return vaddr; + } + +If a driver uses dmam_alloc_coherent(), the area is guaranteed to be +freed whether initialization fails half-way or the device gets +detached. If most resources are acquired using managed interface, a +driver can have much simpler init and exit code. Init path basically +looks like the following:: + + my_init_one() + { + struct mydev *d; + + d = devm_kzalloc(dev, sizeof(*d), GFP_KERNEL); + if (!d) + return -ENOMEM; + + d->ring = dmam_alloc_coherent(...); + if (!d->ring) + return -ENOMEM; + + if (check something) + return -EINVAL; + ... 
+ + return register_to_upper_layer(d); + } + +And exit path:: + + my_remove_one() + { + unregister_from_upper_layer(d); + shutdown_my_hardware(); + } + +As shown above, low level drivers can be simplified a lot by using +devres. Complexity is shifted from less maintained low level drivers +to better maintained higher layer. Also, as init failure path is +shared with exit path, both can get more testing. + +Note though that when converting current calls or assignments to +managed devm_* versions it is up to you to check if internal operations +like allocating memory, have failed. Managed resources pertains to the +freeing of these resources *only* - all other checks needed are still +on you. In some cases this may mean introducing checks that were not +necessary before moving to the managed devm_* calls. + + +3. Devres group +--------------- + +Devres entries can be grouped using devres group. When a group is +released, all contained normal devres entries and properly nested +groups are released. One usage is to rollback series of acquired +resources on failure. For example:: + + if (!devres_open_group(dev, NULL, GFP_KERNEL)) + return -ENOMEM; + + acquire A; + if (failed) + goto err; + + acquire B; + if (failed) + goto err; + ... + + devres_remove_group(dev, NULL); + return 0; + + err: + devres_release_group(dev, NULL); + return err_code; + +As resource acquisition failure usually means probe failure, constructs +like above are usually useful in midlayer driver (e.g. libata core +layer) where interface function shouldn't have side effect on failure. +For LLDs, just returning error code suffices in most cases. + +Each group is identified by `void *id`. It can either be explicitly +specified by @id argument to devres_open_group() or automatically +created by passing NULL as @id as in the above example. In both +cases, devres_open_group() returns the group's id. The returned id +can be passed to other devres functions to select the target group. 
+If NULL is given to those functions, the latest open group is +selected. + +For example, you can do something like the following:: + + int my_midlayer_create_something() + { + if (!devres_open_group(dev, my_midlayer_create_something, GFP_KERNEL)) + return -ENOMEM; + + ... + + devres_close_group(dev, my_midlayer_create_something); + return 0; + } + + void my_midlayer_destroy_something() + { + devres_release_group(dev, my_midlayer_create_something); + } + + +4. Details +---------- + +Lifetime of a devres entry begins on devres allocation and finishes +when it is released or destroyed (removed and freed) - no reference +counting. + +devres core guarantees atomicity to all basic devres operations and +has support for single-instance devres types (atomic +lookup-and-add-if-not-found). Other than that, synchronizing +concurrent accesses to allocated devres data is caller's +responsibility. This is usually non-issue because bus ops and +resource allocations already do the job. + +For an example of single-instance devres type, read pcim_iomap_table() +in lib/devres.c. + +All devres interface functions can be called without context if the +right gfp mask is given. + + +5. Overhead +----------- + +Each devres bookkeeping info is allocated together with requested data +area. With debug option turned off, bookkeeping info occupies 16 +bytes on 32bit machines and 24 bytes on 64bit (three pointers rounded +up to ull alignment). If singly linked list is used, it can be +reduced to two pointers (8 bytes on 32bit, 16 bytes on 64bit). + +Each devres group occupies 8 pointers. It can be reduced to 6 if +singly linked list is used. + +Memory space overhead on ahci controller with two ports is between 300 +and 400 bytes on 32bit machine after naive conversion (we can +certainly invest a bit more effort into libata core layer). + + +6. 
List of managed interfaces +----------------------------- + +CLOCK + devm_clk_get() + devm_clk_get_optional() + devm_clk_put() + devm_clk_hw_register() + devm_of_clk_add_hw_provider() + devm_clk_hw_register_clkdev() + +DMA + dmaenginem_async_device_register() + dmam_alloc_coherent() + dmam_alloc_attrs() + dmam_free_coherent() + dmam_pool_create() + dmam_pool_destroy() + +DRM + devm_drm_dev_init() + +GPIO + devm_gpiod_get() + devm_gpiod_get_index() + devm_gpiod_get_index_optional() + devm_gpiod_get_optional() + devm_gpiod_put() + devm_gpiod_unhinge() + devm_gpiochip_add_data() + devm_gpio_request() + devm_gpio_request_one() + devm_gpio_free() + +I2C + devm_i2c_new_dummy_device() + +IIO + devm_iio_device_alloc() + devm_iio_device_free() + devm_iio_device_register() + devm_iio_device_unregister() + devm_iio_kfifo_allocate() + devm_iio_kfifo_free() + devm_iio_triggered_buffer_setup() + devm_iio_triggered_buffer_cleanup() + devm_iio_trigger_alloc() + devm_iio_trigger_free() + devm_iio_trigger_register() + devm_iio_trigger_unregister() + devm_iio_channel_get() + devm_iio_channel_release() + devm_iio_channel_get_all() + devm_iio_channel_release_all() + +INPUT + devm_input_allocate_device() + +IO region + devm_release_mem_region() + devm_release_region() + devm_release_resource() + devm_request_mem_region() + devm_request_region() + devm_request_resource() + +IOMAP + devm_ioport_map() + devm_ioport_unmap() + devm_ioremap() + devm_ioremap_nocache() + devm_ioremap_wc() + devm_ioremap_resource() : checks resource, requests memory region, ioremaps + devm_iounmap() + pcim_iomap() + pcim_iomap_regions() : do request_region() and iomap() on multiple BARs + pcim_iomap_table() : array of mapped addresses indexed by BAR + pcim_iounmap() + +IRQ + devm_free_irq() + devm_request_any_context_irq() + devm_request_irq() + devm_request_threaded_irq() + devm_irq_alloc_descs() + devm_irq_alloc_desc() + devm_irq_alloc_desc_at() + devm_irq_alloc_desc_from() + devm_irq_alloc_descs_from() + 
devm_irq_alloc_generic_chip() + devm_irq_setup_generic_chip() + devm_irq_sim_init() + +LED + devm_led_classdev_register() + devm_led_classdev_unregister() + +MDIO + devm_mdiobus_alloc() + devm_mdiobus_alloc_size() + devm_mdiobus_free() + +MEM + devm_free_pages() + devm_get_free_pages() + devm_kasprintf() + devm_kcalloc() + devm_kfree() + devm_kmalloc() + devm_kmalloc_array() + devm_kmemdup() + devm_kstrdup() + devm_kvasprintf() + devm_kzalloc() + +MFD + devm_mfd_add_devices() + +MUX + devm_mux_chip_alloc() + devm_mux_chip_register() + devm_mux_control_get() + +PER-CPU MEM + devm_alloc_percpu() + devm_free_percpu() + +PCI + devm_pci_alloc_host_bridge() : managed PCI host bridge allocation + devm_pci_remap_cfgspace() : ioremap PCI configuration space + devm_pci_remap_cfg_resource() : ioremap PCI configuration space resource + pcim_enable_device() : after success, all PCI ops become managed + pcim_pin_device() : keep PCI device enabled after release + +PHY + devm_usb_get_phy() + devm_usb_put_phy() + +PINCTRL + devm_pinctrl_get() + devm_pinctrl_put() + devm_pinctrl_register() + devm_pinctrl_unregister() + +POWER + devm_reboot_mode_register() + devm_reboot_mode_unregister() + +PWM + devm_pwm_get() + devm_pwm_put() + +REGULATOR + devm_regulator_bulk_get() + devm_regulator_get() + devm_regulator_put() + devm_regulator_register() + +RESET + devm_reset_control_get() + devm_reset_controller_register() + +SERDEV + devm_serdev_device_open() + +SLAVE DMA ENGINE + devm_acpi_dma_controller_register() + +SPI + devm_spi_register_master() + +WATCHDOG + devm_watchdog_register_device() diff --git a/Documentation/driver-api/driver-model/driver.rst b/Documentation/driver-api/driver-model/driver.rst new file mode 100644 index 000000000000..11d281506a04 --- /dev/null +++ b/Documentation/driver-api/driver-model/driver.rst @@ -0,0 +1,223 @@ +============== +Device Drivers +============== + +See the kerneldoc for the struct device_driver. 
+ + +Allocation +~~~~~~~~~~ + +Device drivers are statically allocated structures. Though there may +be multiple devices in a system that a driver supports, struct +device_driver represents the driver as a whole (not a particular +device instance). + +Initialization +~~~~~~~~~~~~~~ + +The driver must initialize at least the name and bus fields. It should +also initialize the devclass field (when it arrives), so it may obtain +the proper linkage internally. It should also initialize as many of +the callbacks as possible, though each is optional. + +Declaration +~~~~~~~~~~~ + +As stated above, struct device_driver objects are statically +allocated. Below is an example declaration of the eepro100 +driver. This declaration is hypothetical only; it relies on the driver +being converted completely to the new model:: + + static struct device_driver eepro100_driver = { + .name = "eepro100", + .bus = &pci_bus_type, + + .probe = eepro100_probe, + .remove = eepro100_remove, + .suspend = eepro100_suspend, + .resume = eepro100_resume, + }; + +Most drivers will not be able to be converted completely to the new +model because the bus they belong to has a bus-specific structure with +bus-specific fields that cannot be generalized. + +The most common example of this are device ID structures. A driver +typically defines an array of device IDs that it supports. The format +of these structures and the semantics for comparing device IDs are +completely bus-specific. Defining them as bus-specific entities would +sacrifice type-safety, so we keep bus-specific structures around. + +Bus-specific drivers should include a generic struct device_driver in +the definition of the bus-specific driver. 
Like this:: + + struct pci_driver { + const struct pci_device_id *id_table; + struct device_driver driver; + }; + +A definition that included bus-specific fields would look like +(using the eepro100 driver again):: + + static struct pci_driver eepro100_driver = { + .id_table = eepro100_pci_tbl, + .driver = { + .name = "eepro100", + .bus = &pci_bus_type, + .probe = eepro100_probe, + .remove = eepro100_remove, + .suspend = eepro100_suspend, + .resume = eepro100_resume, + }, + }; + +Some may find the syntax of embedded struct initialization awkward or +even a bit ugly. So far, it's the best way we've found to do what we want... + +Registration +~~~~~~~~~~~~ + +:: + + int driver_register(struct device_driver *drv); + +The driver registers the structure on startup. For drivers that have +no bus-specific fields (i.e. don't have a bus-specific driver +structure), they would use driver_register and pass a pointer to their +struct device_driver object. + +Most drivers, however, will have a bus-specific structure and will +need to register with the bus using something like pci_driver_register. + +It is important that drivers register their driver structure as early as +possible. Registration with the core initializes several fields in the +struct device_driver object, including the reference count and the +lock. These fields are assumed to be valid at all times and may be +used by the device model core or the bus driver. + + +Transition Bus Drivers +~~~~~~~~~~~~~~~~~~~~~~ + +By defining wrapper functions, the transition to the new model can be +made easier. Drivers can ignore the generic structure altogether and +let the bus wrapper fill in the fields. For the callbacks, the bus can +define generic callbacks that forward the call to the bus-specific +callbacks of the drivers. + +This solution is intended to be only temporary. In order to get class +information in the driver, the drivers must be modified anyway. 
Since +converting drivers to the new model should reduce some infrastructural +complexity and code size, it is recommended that they are converted as +class information is added. + +Access +~~~~~~ + +Once the object has been registered, it may access the common fields of +the object, like the lock and the list of devices:: + + int driver_for_each_dev(struct device_driver *drv, void *data, + int (*callback)(struct device *dev, void *data)); + +The devices field is a list of all the devices that have been bound to +the driver. The LDM core provides a helper function to operate on all +the devices a driver controls. This helper locks the driver on each +node access, and does proper reference counting on each device as it +accesses it. + + +sysfs +~~~~~ + +When a driver is registered, a sysfs directory is created in its +bus's directory. In this directory, the driver can export an interface +to userspace to control operation of the driver on a global basis; +e.g. toggling debugging output in the driver. + +A future feature of this directory will be a 'devices' directory. This +directory will contain symlinks to the directories of devices it +supports. + + + +Callbacks +~~~~~~~~~ + +:: + + int (*probe) (struct device *dev); + +The probe() entry is called in task context, with the bus's rwsem locked +and the driver partially bound to the device. Drivers commonly use +container_of() to convert "dev" to a bus-specific type, both in probe() +and other routines. That type often provides device resource data, such +as pci_dev.resource[] or platform_device.resources, which is used in +addition to dev->platform_data to initialize the driver. + +This callback holds the driver-specific logic to bind the driver to a +given device. That includes verifying that the device is present, that +it's a version the driver can handle, that driver data structures can +be allocated and initialized, and that any hardware can be initialized. 
+Drivers often store a pointer to their state with dev_set_drvdata(). +When the driver has successfully bound itself to that device, then probe() +returns zero and the driver model code will finish its part of binding +the driver to that device. + +A driver's probe() may return a negative errno value to indicate that +the driver did not bind to this device, in which case it should have +released all resources it allocated:: + + int (*remove) (struct device *dev); + +remove is called to unbind a driver from a device. This may be +called if a device is physically removed from the system, if the +driver module is being unloaded, during a reboot sequence, or +in other cases. + +It is up to the driver to determine if the device is present or +not. It should free any resources allocated specifically for the +device; i.e. anything in the device's driver_data field. + +If the device is still present, it should quiesce the device and place +it into a supported low-power state:: + + int (*suspend) (struct device *dev, pm_message_t state); + +suspend is called to put the device in a low power state:: + + int (*resume) (struct device *dev); + +Resume is used to bring a device back from a low power state. + + +Attributes +~~~~~~~~~~ + +:: + + struct driver_attribute { + struct attribute attr; + ssize_t (*show)(struct device_driver *driver, char *buf); + ssize_t (*store)(struct device_driver *, const char *buf, size_t count); + }; + +Device drivers can export attributes via their sysfs directories. +Drivers can declare attributes using a DRIVER_ATTR_RW and DRIVER_ATTR_RO +macro that works identically to the DEVICE_ATTR_RW and DEVICE_ATTR_RO +macros. 
+ +Example:: + + DRIVER_ATTR_RW(debug); + +This is equivalent to declaring:: + + struct driver_attribute driver_attr_debug; + +This can then be used to add and remove the attribute from the +driver's directory using:: + + int driver_create_file(struct device_driver *, const struct driver_attribute *); + void driver_remove_file(struct device_driver *, const struct driver_attribute *); diff --git a/Documentation/driver-api/driver-model/index.rst b/Documentation/driver-api/driver-model/index.rst new file mode 100644 index 000000000000..755016422269 --- /dev/null +++ b/Documentation/driver-api/driver-model/index.rst @@ -0,0 +1,24 @@ +============ +Driver Model +============ + +.. toctree:: + :maxdepth: 1 + + binding + bus + class + design-patterns + device + devres + driver + overview + platform + porting + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/driver-api/driver-model/overview.rst b/Documentation/driver-api/driver-model/overview.rst new file mode 100644 index 000000000000..d4d1e9b40e0c --- /dev/null +++ b/Documentation/driver-api/driver-model/overview.rst @@ -0,0 +1,124 @@ +============================= +The Linux Kernel Device Model +============================= + +Patrick Mochel + +Drafted 26 August 2002 +Updated 31 January 2006 + + +Overview +~~~~~~~~ + +The Linux Kernel Driver Model is a unification of all the disparate driver +models that were previously used in the kernel. It is intended to augment the +bus-specific drivers for bridges and devices by consolidating a set of data +and operations into globally accessible data structures. + +Traditional driver models implemented some sort of tree-like structure +(sometimes just a list) for the devices they control. There wasn't any +uniformity across the different bus types. + +The current driver model provides a common, uniform data model for describing +a bus and the devices that can appear under the bus. 
The unified bus +model includes a set of common attributes which all buses carry, and a set +of common callbacks, such as device discovery during bus probing, bus +shutdown, bus power management, etc. + +The common device and bridge interface reflects the goals of the modern +computer: namely the ability to do seamless device "plug and play", power +management, and hot plug. In particular, the model dictated by Intel and +Microsoft (namely ACPI) ensures that almost every device on almost any bus +on an x86-compatible system can work within this paradigm. Of course, +not every bus is able to support all such operations, although most +buses support most of those operations. + + +Downstream Access +~~~~~~~~~~~~~~~~~ + +Common data fields have been moved out of individual bus layers into a common +data structure. These fields must still be accessed by the bus layers, +and sometimes by the device-specific drivers. + +Other bus layers are encouraged to do what has been done for the PCI layer. +struct pci_dev now looks like this:: + + struct pci_dev { + ... + + struct device dev; /* Generic device interface */ + ... + }; + +Note first that the struct device dev within the struct pci_dev is +statically allocated. This means only one allocation on device discovery. + +Note also that struct device dev is not necessarily defined at the +front of the pci_dev structure. This is to make people think about what +they're doing when switching between the bus driver and the global driver, +and to discourage meaningless and incorrect casts between the two. + +The PCI bus layer freely accesses the fields of struct device. It knows about +the structure of struct pci_dev, and it should know the structure of struct +device. Individual PCI device drivers that have been converted to the current +driver model generally do not and should not touch the fields of struct device, +unless there is a compelling reason to do so. 
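The point about embedding struct device away from the front of struct pci_dev can be illustrated with a small userspace sketch. Everything below is an assumption made for illustration: the two structures are heavily trimmed stand-ins rather than the kernel definitions, the container_of() macro is a simplified version without the kernel's type checking, and to_pci_dev() and roundtrip_ok() are hypothetical helpers written for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Trimmed, hypothetical stand-ins for the real kernel structures. */
struct device {
	const char *init_name;
};

struct pci_dev {
	unsigned short vendor;
	struct device dev;	/* embedded, deliberately not the first member */
};

/* Recover the bus-specific object from the embedded generic one. */
static struct pci_dev *to_pci_dev(struct device *d)
{
	return container_of(d, struct pci_dev, dev);
}

/* Returns 1 when the generic pointer maps back to its enclosing pci_dev. */
static int roundtrip_ok(void)
{
	struct pci_dev pdev = { .vendor = 0x8086, .dev = { .init_name = "demo" } };
	struct device *generic = &pdev.dev;

	/* A plain cast would be wrong here because dev is not at offset 0. */
	assert(offsetof(struct pci_dev, dev) != 0);
	return to_pci_dev(generic) == &pdev;
}
```

Calling roundtrip_ok() from a small main() is enough to try it out. Going through the member offset rather than a cast keeps the conversion correct wherever the embedded structure sits, which appears to be the discipline the text above is encouraging.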
+ +The above abstraction prevents unnecessary pain during transitional phases. +If it were not done this way, then when a field was renamed or removed, every +downstream driver would break. On the other hand, if only the bus layer +(and not the device layer) accesses the struct device, it is only the bus +layer that needs to change. + + +User Interface +~~~~~~~~~~~~~~ + +By virtue of having a complete hierarchical view of all the devices in the +system, exporting a complete hierarchical view to userspace becomes relatively +easy. This has been accomplished by implementing a special purpose virtual +file system named sysfs. + +Almost all mainstream Linux distros mount this filesystem automatically; you +can see some variation of the following in the output of the "mount" command:: + + $ mount + ... + none on /sys type sysfs (rw,noexec,nosuid,nodev) + ... + $ + +The auto-mounting of sysfs is typically accomplished by an entry similar to +the following in the /etc/fstab file:: + + none /sys sysfs defaults 0 0 + +or something similar in the /lib/init/fstab file on Debian-based systems:: + + none /sys sysfs nodev,noexec,nosuid 0 0 + +If sysfs is not automatically mounted, you can always do it manually with:: + + # mount -t sysfs sysfs /sys + +Whenever a device is inserted into the tree, a directory is created for it. +This directory may be populated at each layer of discovery - the global layer, +the bus layer, or the device layer. + +The global layer currently creates two files - 'name' and 'power'. The +former only reports the name of the device. The latter reports the +current power state of the device. It will also be used to set the current +power state. + +The bus layer may also create files for the devices it finds while probing the +bus. For example, the PCI layer currently creates 'irq' and 'resource' files +for each PCI device. + +A device-specific driver may also export files in its directory to expose +device-specific data or tunable interfaces. 
+ +More information about the sysfs directory layout can be found in +the other documents in this directory and in the file +Documentation/filesystems/sysfs.txt. diff --git a/Documentation/driver-api/driver-model/platform.rst b/Documentation/driver-api/driver-model/platform.rst new file mode 100644 index 000000000000..334dd4071ae4 --- /dev/null +++ b/Documentation/driver-api/driver-model/platform.rst @@ -0,0 +1,246 @@ +============================ +Platform Devices and Drivers +============================ + +See for the driver model interface to the +platform bus: platform_device, and platform_driver. This pseudo-bus +is used to connect devices on busses with minimal infrastructure, +like those used to integrate peripherals on many system-on-chip +processors, or some "legacy" PC interconnects; as opposed to large +formally specified ones like PCI or USB. + + +Platform devices +~~~~~~~~~~~~~~~~ +Platform devices are devices that typically appear as autonomous +entities in the system. This includes legacy port-based devices and +host bridges to peripheral buses, and most controllers integrated +into system-on-chip platforms. What they usually have in common +is direct addressing from a CPU bus. Rarely, a platform_device will +be connected through a segment of some other kind of bus; but its +registers will still be directly addressable. + +Platform devices are given a name, used in driver binding, and a +list of resources such as addresses and IRQs:: + + struct platform_device { + const char *name; + u32 id; + struct device dev; + u32 num_resources; + struct resource *resource; + }; + + +Platform drivers +~~~~~~~~~~~~~~~~ +Platform drivers follow the standard driver model convention, where +discovery/enumeration is handled outside the drivers, and drivers +provide probe() and remove() methods. 
They support power management +and shutdown notifications using the standard conventions:: + + struct platform_driver { + int (*probe)(struct platform_device *); + int (*remove)(struct platform_device *); + void (*shutdown)(struct platform_device *); + int (*suspend)(struct platform_device *, pm_message_t state); + int (*suspend_late)(struct platform_device *, pm_message_t state); + int (*resume_early)(struct platform_device *); + int (*resume)(struct platform_device *); + struct device_driver driver; + }; + +Note that probe() should in general verify that the specified device hardware +actually exists; sometimes platform setup code can't be sure. The probing +can use device resources, including clocks, and device platform_data. + +Platform drivers register themselves the normal way:: + + int platform_driver_register(struct platform_driver *drv); + +Or, in common situations where the device is known not to be hot-pluggable, +the probe() routine can live in an init section to reduce the driver's +runtime memory footprint:: + + int platform_driver_probe(struct platform_driver *drv, + int (*probe)(struct platform_device *)) + +Kernel modules can be composed of several platform drivers. The platform core +provides helpers to register and unregister an array of drivers:: + + int __platform_register_drivers(struct platform_driver * const *drivers, + unsigned int count, struct module *owner); + void platform_unregister_drivers(struct platform_driver * const *drivers, + unsigned int count); + +If one of the drivers fails to register, all drivers registered up to that +point will be unregistered in reverse order. 
Note that there is a convenience +macro that passes THIS_MODULE as the owner parameter:: + + #define platform_register_drivers(drivers, count) + + +Device Enumeration +~~~~~~~~~~~~~~~~~~ +As a rule, platform-specific (and often board-specific) setup code will +register platform devices:: + + int platform_device_register(struct platform_device *pdev); + + int platform_add_devices(struct platform_device **pdevs, int ndev); + +The general rule is to register only those devices that actually exist, +but in some cases extra devices might be registered. For example, a kernel +might be configured to work with an external network adapter that might not +be populated on all boards, or likewise to work with an integrated controller +that some boards might not hook up to any peripherals. + +In some cases, boot firmware will export tables describing the devices +that are populated on a given board. Without such tables, often the +only way for system setup code to set up the correct devices is to build +a kernel for a specific target board. Such board-specific kernels are +common with embedded and custom systems development. + +In many cases, the memory and IRQ resources associated with the platform +device are not enough to let the device's driver work. Board setup code +will often provide additional information using the device's platform_data +field. + +Embedded systems frequently need one or more clocks for platform devices, +which are normally kept off until they're actively needed (to save power). +System setup also associates those clocks with the device, so that +calls to clk_get(&pdev->dev, clock_name) return them as needed. + + +Legacy Drivers: Device Probing +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Some drivers are not fully converted to the driver model, because they take +on a non-driver role: the driver registers its platform device, rather than +leaving that for system infrastructure. 
Such drivers can't be hotplugged +or coldplugged, since those mechanisms require device creation to be in a +different system component than the driver. + +The only "good" reason for this is to handle older system designs which, like +original IBM PCs, rely on error-prone "probe-the-hardware" models for hardware +configuration. Newer systems have largely abandoned that model, in favor of +bus-level support for dynamic configuration (PCI, USB), or device tables +provided by the boot firmware (e.g. PNPACPI on x86). There are too many +conflicting options about what might be where, and even educated guesses by +an operating system will be wrong often enough to make trouble. + +This style of driver is discouraged. If you're updating such a driver, +please try to move the device enumeration to a more appropriate location, +outside the driver. This will usually be cleanup, since such drivers +tend to already have "normal" modes, such as ones using device nodes that +were created by PNP or by platform device setup. + +Nonetheless, there are some APIs to support such legacy drivers. Avoid +using these calls except with such hotplug-deficient drivers:: + + struct platform_device *platform_device_alloc( + const char *name, int id); + +You can use platform_device_alloc() to dynamically allocate a device, which +you will then initialize with resources and platform_device_register(). +A better solution is usually:: + + struct platform_device *platform_device_register_simple( + const char *name, int id, + struct resource *res, unsigned int nres); + +You can use platform_device_register_simple() as a one-step call to allocate +and register a device. + + +Device Naming and Driver Binding +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The platform_device.dev.bus_id is the canonical name for the devices. +It's built from two components: + + * platform_device.name ... which is also used for driver matching. + + * platform_device.id ... 
the device instance number, or else "-1" + to indicate there's only one. + +These are concatenated, so name/id "serial"/0 indicates bus_id "serial.0", and +"serial"/3 indicates bus_id "serial.3"; both would use the platform_driver +named "serial". While "my_rtc"/-1 would be bus_id "my_rtc" (no instance id) +and use the platform_driver called "my_rtc". + +Driver binding is performed automatically by the driver core, invoking +driver probe() after finding a match between device and driver. If the +probe() succeeds, the driver and device are bound as usual. There are +three different ways to find such a match: + + - Whenever a device is registered, the drivers for that bus are + checked for matches. Platform devices should be registered very + early during system boot. + + - When a driver is registered using platform_driver_register(), all + unbound devices on that bus are checked for matches. Drivers + usually register later during booting, or by module loading. + + - Registering a driver using platform_driver_probe() works just like + using platform_driver_register(), except that the driver won't + be probed later if another device registers. (Which is OK, since + this interface is only for use with non-hotpluggable devices.) + + +Early Platform Devices and Drivers +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The early platform interfaces provide platform data to platform device +drivers early on during the system boot. The code is built on top of the +early_param() command line parsing and can be executed very early on. + +Example: "earlyprintk" class early serial console in 6 steps + +1. Registering early platform device data +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The architecture code registers platform device data using the function +early_platform_add_devices(). In the case of early serial console this +should be hardware configuration for the serial port. Devices registered +at this point will later on be matched against early platform drivers. + +2. 
Parsing kernel command line +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The architecture code calls parse_early_param() to parse the kernel +command line. This will execute all matching early_param() callbacks. +User specified early platform devices will be registered at this point. +For the early serial console case the user can specify port on the +kernel command line as "earlyprintk=serial.0" where "earlyprintk" is +the class string, "serial" is the name of the platform driver and +0 is the platform device id. If the id is -1 then the dot and the +id can be omitted. + +3. Installing early platform drivers belonging to a certain class +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The architecture code may optionally force registration of all early +platform drivers belonging to a certain class using the function +early_platform_driver_register_all(). User specified devices from +step 2 have priority over these. This step is omitted by the serial +driver example since the early serial driver code should be disabled +unless the user has specified port on the kernel command line. + +4. Early platform driver registration +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Compiled-in platform drivers making use of early_platform_init() are +automatically registered during step 2 or 3. The serial driver example +should use early_platform_init("earlyprintk", &platform_driver). + +5. Probing of early platform drivers belonging to a certain class +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The architecture code calls early_platform_driver_probe() to match +registered early platform devices associated with a certain class with +registered early platform drivers. Matched devices will get probed(). +This step can be executed at any point during the early boot. As soon +as possible may be good for the serial port case. + +6. 
Inside the early platform driver probe() +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The driver code needs to take special care during early boot, especially +when it comes to memory allocation and interrupt registration. The code +in the probe() function can use is_early_platform_device() to check if +it is called at early platform device or at the regular platform device +time. The early serial driver performs register_console() at this point. + +For further information, see . diff --git a/Documentation/driver-api/driver-model/porting.rst b/Documentation/driver-api/driver-model/porting.rst new file mode 100644 index 000000000000..931ea879af3f --- /dev/null +++ b/Documentation/driver-api/driver-model/porting.rst @@ -0,0 +1,448 @@ +======================================= +Porting Drivers to the New Driver Model +======================================= + +Patrick Mochel + +7 January 2003 + + +Overview + +Please refer to `Documentation/driver-api/driver-model/*.rst` for definitions of +various driver types and concepts. + +Most of the work of porting device drivers to the new model happens +at the bus driver layer. This was intentional, to minimize the +negative effect on kernel drivers, and to allow a gradual transition +of bus drivers. + +In a nutshell, the driver model consists of a set of objects that can +be embedded in larger, bus-specific objects. Fields in these generic +objects can replace fields in the bus-specific objects. + +The generic objects must be registered with the driver model core. By +doing so, they will be exported via the sysfs filesystem. sysfs can be +mounted by doing:: + + # mount -t sysfs sysfs /sys + + + +The Process + +Step 0: Read include/linux/device.h for object and function definitions. + +Step 1: Registering the bus driver. + + +- Define a struct bus_type for the bus driver:: + + struct bus_type pci_bus_type = { + .name = "pci", + }; + + +- Register the bus type. 
+ + This should be done in the initialization function for the bus type, + which is usually the module_init(), or equivalent, function:: + + static int __init pci_driver_init(void) + { + return bus_register(&pci_bus_type); + } + + subsys_initcall(pci_driver_init); + + + The bus type may be unregistered (if the bus driver may be compiled + as a module) by doing:: + + bus_unregister(&pci_bus_type); + + +- Export the bus type for others to use. + + Other code may wish to reference the bus type, so declare it in a + shared header file and export the symbol. + +From include/linux/pci.h:: + + extern struct bus_type pci_bus_type; + + +From file the above code appears in:: + + EXPORT_SYMBOL(pci_bus_type); + + + +- This will cause the bus to show up in /sys/bus/pci/ with two + subdirectories: 'devices' and 'drivers':: + + # tree -d /sys/bus/pci/ + /sys/bus/pci/ + |-- devices + `-- drivers + + + +Step 2: Registering Devices. + +struct device represents a single device. It mainly contains metadata +describing the relationship the device has to other entities. + + +- Embed a struct device in the bus-specific device type:: + + + struct pci_dev { + ... + struct device dev; /* Generic device interface */ + ... + }; + + It is recommended that the generic device not be the first item in + the struct to discourage programmers from doing mindless casts + between the object types. Instead macros, or inline functions, + should be created to convert from the generic object type:: + + + #define to_pci_dev(n) container_of(n, struct pci_dev, dev) + + or + + static inline struct pci_dev * to_pci_dev(struct device * n) + { + return container_of(n, struct pci_dev, dev); + } + + This allows the compiler to verify type-safety of the operations + that are performed (which is Good). + + +- Initialize the device on registration. + + When devices are discovered or registered with the bus type, the + bus driver should initialize the generic device. 
The most important + things to initialize are the bus_id, parent, and bus fields. + + The bus_id is an ASCII string that contains the device's address on + the bus. The format of this string is bus-specific. This is + necessary for representing devices in sysfs. + + parent is the physical parent of the device. It is important that + the bus driver sets this field correctly. + + The driver model maintains an ordered list of devices that it uses + for power management. This list must be in order to guarantee that + devices are shutdown before their physical parents, and vice versa. + The order of this list is determined by the parent of registered + devices. + + Also, the location of the device's sysfs directory depends on a + device's parent. sysfs exports a directory structure that mirrors + the device hierarchy. Accurately setting the parent guarantees that + sysfs will accurately represent the hierarchy. + + The device's bus field is a pointer to the bus type the device + belongs to. This should be set to the bus_type that was declared + and initialized before. + + Optionally, the bus driver may set the device's name and release + fields. + + The name field is an ASCII string describing the device, like + + "ATI Technologies Inc Radeon QD" + + The release field is a callback that the driver model core calls + when the device has been removed, and all references to it have + been released. More on this in a moment. + + +- Register the device. + + Once the generic device has been initialized, it can be registered + with the driver model core by doing:: + + device_register(&dev->dev); + + It can later be unregistered by doing:: + + device_unregister(&dev->dev); + + This should happen on buses that support hotpluggable devices. + If a bus driver unregisters a device, it should not immediately free + it. It should instead wait for the driver model core to call the + device's release method, then free the bus-specific object. 
+ (There may be other code that is currently referencing the device + structure, and it would be rude to free the device while that is + happening). + + + When the device is registered, a directory in sysfs is created. + The PCI tree in sysfs looks like:: + + /sys/devices/pci0/ + |-- 00:00.0 + |-- 00:01.0 + | `-- 01:00.0 + |-- 00:02.0 + | `-- 02:1f.0 + | `-- 03:00.0 + |-- 00:1e.0 + | `-- 04:04.0 + |-- 00:1f.0 + |-- 00:1f.1 + | |-- ide0 + | | |-- 0.0 + | | `-- 0.1 + | `-- ide1 + | `-- 1.0 + |-- 00:1f.2 + |-- 00:1f.3 + `-- 00:1f.5 + + Also, symlinks are created in the bus's 'devices' directory + that point to the device's directory in the physical hierarchy:: + + /sys/bus/pci/devices/ + |-- 00:00.0 -> ../../../devices/pci0/00:00.0 + |-- 00:01.0 -> ../../../devices/pci0/00:01.0 + |-- 00:02.0 -> ../../../devices/pci0/00:02.0 + |-- 00:1e.0 -> ../../../devices/pci0/00:1e.0 + |-- 00:1f.0 -> ../../../devices/pci0/00:1f.0 + |-- 00:1f.1 -> ../../../devices/pci0/00:1f.1 + |-- 00:1f.2 -> ../../../devices/pci0/00:1f.2 + |-- 00:1f.3 -> ../../../devices/pci0/00:1f.3 + |-- 00:1f.5 -> ../../../devices/pci0/00:1f.5 + |-- 01:00.0 -> ../../../devices/pci0/00:01.0/01:00.0 + |-- 02:1f.0 -> ../../../devices/pci0/00:02.0/02:1f.0 + |-- 03:00.0 -> ../../../devices/pci0/00:02.0/02:1f.0/03:00.0 + `-- 04:04.0 -> ../../../devices/pci0/00:1e.0/04:04.0 + + + +Step 3: Registering Drivers. + +struct device_driver is a simple driver structure that contains a set +of operations that the driver model core may call. + + +- Embed a struct device_driver in the bus-specific driver. + + Just like with devices, do something like:: + + struct pci_driver { + ... + struct device_driver driver; + }; + + +- Initialize the generic driver structure. + + When the driver registers with the bus (e.g. doing pci_register_driver()), + initialize the necessary fields of the driver: the name and bus + fields. + + +- Register the driver. 
+ + After the generic driver has been initialized, call:: + + driver_register(&drv->driver); + + to register the driver with the core. + + When the driver is unregistered from the bus, unregister it from the + core by doing:: + + driver_unregister(&drv->driver); + + Note that this will block until all references to the driver have + gone away. Normally, there will not be any. + + +- Sysfs representation. + + Drivers are exported via sysfs in their bus's 'drivers' directory. + For example:: + + /sys/bus/pci/drivers/ + |-- 3c59x + |-- Ensoniq AudioPCI + |-- agpgart-amdk7 + |-- e100 + `-- serial + + +Step 4: Define Generic Methods for Drivers. + +struct device_driver defines a set of operations that the driver model +core calls. Most of these operations are probably similar to +operations the bus already defines for drivers, but taking different +parameters. + +It would be difficult and tedious to force every driver on a bus to +simultaneously convert their drivers to generic format. Instead, the +bus driver should define single instances of the generic methods that +forward calls to the bus-specific drivers. For instance:: + + + static int pci_device_remove(struct device * dev) + { + struct pci_dev * pci_dev = to_pci_dev(dev); + struct pci_driver * drv = pci_dev->driver; + + if (drv) { + if (drv->remove) + drv->remove(pci_dev); + pci_dev->driver = NULL; + } + return 0; + } + + +The generic driver should be initialized with these methods before it +is registered:: + + /* initialize common driver fields */ + drv->driver.name = drv->name; + drv->driver.bus = &pci_bus_type; + drv->driver.probe = pci_device_probe; + drv->driver.resume = pci_device_resume; + drv->driver.suspend = pci_device_suspend; + drv->driver.remove = pci_device_remove; + + /* register with core */ + driver_register(&drv->driver); + + +Ideally, the bus should only initialize the fields if they are not +already set. This allows the drivers to implement their own generic +methods. 
+ + +Step 5: Support generic driver binding. + +The model assumes that a device or driver can be dynamically +registered with the bus at any time. When registration happens, +devices must be bound to a driver, or drivers must be bound to all +devices that it supports. + +A driver typically contains a list of device IDs that it supports. The +bus driver compares these IDs to the IDs of devices registered with it. +The format of the device IDs, and the semantics for comparing them are +bus-specific, so the generic model does not attempt to generalize them. + +Instead, a bus may supply a method in struct bus_type that does the +comparison:: + + int (*match)(struct device * dev, struct device_driver * drv); + +match should return a positive value if the driver supports the device, +and zero otherwise. It may also return an error code (for example +-EPROBE_DEFER) if determining whether a given driver supports the device is +not possible. + +When a device is registered, the bus's list of drivers is iterated +over. bus->match() is called for each one until a match is found. + +When a driver is registered, the bus's list of devices is iterated +over. bus->match() is called for each device that is not already +claimed by a driver. + +When a device is successfully bound to a driver, device->driver is +set, the device is added to a per-driver list of devices, and a +symlink is created in the driver's sysfs directory that points to the +device's physical directory:: + + /sys/bus/pci/drivers/ + |-- 3c59x + | `-- 00:0b.0 -> ../../../../devices/pci0/00:0b.0 + |-- Ensoniq AudioPCI + |-- agpgart-amdk7 + | `-- 00:00.0 -> ../../../../devices/pci0/00:00.0 + |-- e100 + | `-- 00:0c.0 -> ../../../../devices/pci0/00:0c.0 + `-- serial + + +This driver binding should replace the existing driver binding +mechanism the bus currently uses. + + +Step 6: Supply a hotplug callback. + +Whenever a device is registered with the driver model core, the +userspace program /sbin/hotplug is called to notify userspace. 
+Users can define actions to perform when a device is inserted or +removed. + +The driver model core passes several arguments to userspace via +environment variables, including + +- ACTION: set to 'add' or 'remove' +- DEVPATH: set to the device's physical path in sysfs. + +A bus driver may also supply additional parameters for userspace to +consume. To do this, a bus must implement the 'hotplug' method in +struct bus_type:: + + int (*hotplug) (struct device *dev, char **envp, + int num_envp, char *buffer, int buffer_size); + +This is called immediately before /sbin/hotplug is executed. + + +Step 7: Cleaning up the bus driver. + +The generic bus, device, and driver structures provide several fields +that can replace those defined privately to the bus driver. + +- Device list. + +struct bus_type contains a list of all devices registered with the bus +type. This includes all devices on all instances of that bus type. +An internal list that the bus uses may be removed, in favor of using +this one. + +The core provides an iterator to access these devices:: + + int bus_for_each_dev(struct bus_type * bus, struct device * start, + void * data, int (*fn)(struct device *, void *)); + + +- Driver list. + +struct bus_type also contains a list of all drivers registered with +it. An internal list of drivers that the bus driver maintains may +be removed in favor of using the generic one. + +The drivers may be iterated over, like devices:: + + int bus_for_each_drv(struct bus_type * bus, struct device_driver * start, + void * data, int (*fn)(struct device_driver *, void *)); + + +Please see drivers/base/bus.c for more information. + + +- rwsem + +struct bus_type contains an rwsem that protects all core accesses to +the device and driver lists. This can be used by the bus driver +internally, and should be used when accessing the device or driver +lists the bus maintains. + + +- Device and driver fields. 
+ +Some of the fields in struct device and struct device_driver duplicate +fields in the bus-specific representations of these objects. Feel free +to remove the bus-specific ones and favor the generic ones. Note +though, that this will likely mean fixing up all the drivers that +reference the bus-specific fields (though those should all be 1-line +changes). diff --git a/Documentation/driver-api/gpio/driver.rst b/Documentation/driver-api/gpio/driver.rst index 349f2dc33029..921c71a3d683 100644 --- a/Documentation/driver-api/gpio/driver.rst +++ b/Documentation/driver-api/gpio/driver.rst @@ -399,7 +399,7 @@ symbol: will pass the struct gpio_chip* for the chip to all IRQ callbacks, so the callbacks need to embed the gpio_chip in its state container and obtain a pointer to the container using container_of(). - (See Documentation/driver-model/design-patterns.rst) + (See Documentation/driver-api/driver-model/design-patterns.rst) - gpiochip_irqchip_add_nested(): adds a nested cascaded irqchip to a gpiochip, as discussed above regarding different types of cascaded irqchips. The diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index b4c993ff7655..9fb03b7bdeb1 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -14,6 +14,7 @@ available subsections can be seen below. .. toctree:: :maxdepth: 2 + driver-model/index basics infrastructure early-userspace/index diff --git a/Documentation/driver-model/binding.rst b/Documentation/driver-model/binding.rst deleted file mode 100644 index 7ea1d7a41e1d..000000000000 --- a/Documentation/driver-model/binding.rst +++ /dev/null @@ -1,98 +0,0 @@ -============== -Driver Binding -============== - -Driver binding is the process of associating a device with a device -driver that can control it. Bus drivers have typically handled this -because there have been bus-specific structures to represent the -devices and the drivers. 
With generic device and device driver -structures, most of the binding can take place using common code. - - -Bus -~~~ - -The bus type structure contains a list of all devices that are on that bus -type in the system. When device_register is called for a device, it is -inserted into the end of this list. The bus object also contains a -list of all drivers of that bus type. When driver_register is called -for a driver, it is inserted at the end of this list. These are the -two events which trigger driver binding. - - -device_register -~~~~~~~~~~~~~~~ - -When a new device is added, the bus's list of drivers is iterated over -to find one that supports it. In order to determine that, the device -ID of the device must match one of the device IDs that the driver -supports. The format and semantics for comparing IDs is bus-specific. -Instead of trying to derive a complex state machine and matching -algorithm, it is up to the bus driver to provide a callback to compare -a device against the IDs of a driver. The bus returns 1 if a match was -found; 0 otherwise. - -int match(struct device * dev, struct device_driver * drv); - -If a match is found, the device's driver field is set to the driver -and the driver's probe callback is called. This gives the driver a -chance to verify that it really does support the hardware, and that -it's in a working state. - -Device Class -~~~~~~~~~~~~ - -Upon the successful completion of probe, the device is registered with -the class to which it belongs. Device drivers belong to one and only one -class, and that is set in the driver's devclass field. -devclass_add_device is called to enumerate the device within the class -and actually register it with the class, which happens with the -class's register_dev callback. - - -Driver -~~~~~~ - -When a driver is attached to a device, the device is inserted into the -driver's list of devices. 
- - -sysfs -~~~~~ - -A symlink is created in the bus's 'devices' directory that points to -the device's directory in the physical hierarchy. - -A symlink is created in the driver's 'devices' directory that points -to the device's directory in the physical hierarchy. - -A directory for the device is created in the class's directory. A -symlink is created in that directory that points to the device's -physical location in the sysfs tree. - -A symlink can be created (though this isn't done yet) in the device's -physical directory to either its class directory, or the class's -top-level directory. One can also be created to point to its driver's -directory also. - - -driver_register -~~~~~~~~~~~~~~~ - -The process is almost identical for when a new driver is added. -The bus's list of devices is iterated over to find a match. Devices -that already have a driver are skipped. All the devices are iterated -over, to bind as many devices as possible to the driver. - - -Removal -~~~~~~~ - -When a device is removed, the reference count for it will eventually -go to 0. When it does, the remove callback of the driver is called. It -is removed from the driver's list of devices and the reference count -of the driver is decremented. All symlinks between the two are removed. - -When a driver is removed, the list of devices that it supports is -iterated over, and the driver's remove callback is called for each -one. The device is removed from that list and the symlinks removed. diff --git a/Documentation/driver-model/bus.rst b/Documentation/driver-model/bus.rst deleted file mode 100644 index 016b15a6e8ea..000000000000 --- a/Documentation/driver-model/bus.rst +++ /dev/null @@ -1,146 +0,0 @@ -========= -Bus Types -========= - -Definition -~~~~~~~~~~ -See the kerneldoc for the struct bus_type. - -int bus_register(struct bus_type * bus); - - -Declaration -~~~~~~~~~~~ - -Each bus type in the kernel (PCI, USB, etc) should declare one static -object of this type. 
They must initialize the name field, and may -optionally initialize the match callback:: - - struct bus_type pci_bus_type = { - .name = "pci", - .match = pci_bus_match, - }; - -The structure should be exported to drivers in a header file: - -extern struct bus_type pci_bus_type; - - -Registration -~~~~~~~~~~~~ - -When a bus driver is initialized, it calls bus_register. This -initializes the rest of the fields in the bus object and inserts it -into a global list of bus types. Once the bus object is registered, -the fields in it are usable by the bus driver. - - -Callbacks -~~~~~~~~~ - -match(): Attaching Drivers to Devices -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The format of device ID structures and the semantics for comparing -them are inherently bus-specific. Drivers typically declare an array -of device IDs of devices they support that reside in a bus-specific -driver structure. - -The purpose of the match callback is to give the bus an opportunity to -determine if a particular driver supports a particular device by -comparing the device IDs the driver supports with the device ID of a -particular device, without sacrificing bus-specific functionality or -type-safety. - -When a driver is registered with the bus, the bus's list of devices is -iterated over, and the match callback is called for each device that -does not have a driver associated with it. - - - -Device and Driver Lists -~~~~~~~~~~~~~~~~~~~~~~~ - -The lists of devices and drivers are intended to replace the local -lists that many buses keep. They are lists of struct devices and -struct device_drivers, respectively. Bus drivers are free to use the -lists as they please, but conversion to the bus-specific type may be -necessary. 
- -The LDM core provides helper functions for iterating over each list:: - - int bus_for_each_dev(struct bus_type * bus, struct device * start, - void * data, - int (*fn)(struct device *, void *)); - - int bus_for_each_drv(struct bus_type * bus, struct device_driver * start, - void * data, int (*fn)(struct device_driver *, void *)); - -These helpers iterate over the respective list, and call the callback -for each device or driver in the list. All list accesses are -synchronized by taking the bus's lock (read currently). The reference -count on each object in the list is incremented before the callback is -called; it is decremented after the next object has been obtained. The -lock is not held when calling the callback. - - -sysfs -~~~~~~~~ -There is a top-level directory named 'bus'. - -Each bus gets a directory in the bus directory, along with two default -directories:: - - /sys/bus/pci/ - |-- devices - `-- drivers - -Drivers registered with the bus get a directory in the bus's drivers -directory:: - - /sys/bus/pci/ - |-- devices - `-- drivers - |-- Intel ICH - |-- Intel ICH Joystick - |-- agpgart - `-- e100 - -Each device that is discovered on a bus of that type gets a symlink in -the bus's devices directory to the device's directory in the physical -hierarchy:: - - /sys/bus/pci/ - |-- devices - | |-- 00:00.0 -> ../../../root/pci0/00:00.0 - | |-- 00:01.0 -> ../../../root/pci0/00:01.0 - | `-- 00:02.0 -> ../../../root/pci0/00:02.0 - `-- drivers - - -Exporting Attributes -~~~~~~~~~~~~~~~~~~~~ - -:: - - struct bus_attribute { - struct attribute attr; - ssize_t (*show)(struct bus_type *, char * buf); - ssize_t (*store)(struct bus_type *, const char * buf, size_t count); - }; - -Bus drivers can export attributes using the BUS_ATTR_RW macro that works -similarly to the DEVICE_ATTR_RW macro for devices. 
For example, a -definition like this:: - - static BUS_ATTR_RW(debug); - -is equivalent to declaring:: - - static bus_attribute bus_attr_debug; - -This can then be used to add and remove the attribute from the bus's -sysfs directory using:: - - int bus_create_file(struct bus_type *, struct bus_attribute *); - void bus_remove_file(struct bus_type *, struct bus_attribute *); diff --git a/Documentation/driver-model/class.rst b/Documentation/driver-model/class.rst deleted file mode 100644 index fff55b80e86a..000000000000 --- a/Documentation/driver-model/class.rst +++ /dev/null @@ -1,149 +0,0 @@ -============== -Device Classes -============== - -Introduction -~~~~~~~~~~~~ -A device class describes a type of device, like an audio or network -device. The following device classes have been identified: - - - - -Each device class defines a set of semantics and a programming interface -that devices of that class adhere to. Device drivers are the -implementation of that programming interface for a particular device on -a particular bus. - -Device classes are agnostic with respect to what bus a device resides -on. - - -Programming Interface -~~~~~~~~~~~~~~~~~~~~~ -The device class structure looks like:: - - - typedef int (*devclass_add)(struct device *); - typedef void (*devclass_remove)(struct device *); - -See the kerneldoc for the struct class. - -A typical device class definition would look like:: - - struct device_class input_devclass = { - .name = "input", - .add_device = input_add_device, - .remove_device = input_remove_device, - }; - -Each device class structure should be exported in a header file so it -can be used by drivers, extensions and interfaces. - -Device classes are registered and unregistered with the core using:: - - int devclass_register(struct device_class * cls); - void devclass_unregister(struct device_class * cls); - - -Devices -~~~~~~~ -As devices are bound to drivers, they are added to the device class -that the driver belongs to. 
Before the driver model core, this would -typically happen during the driver's probe() callback, once the device -has been initialized. It now happens after the probe() callback -finishes from the core. - -The device is enumerated in the class. Each time a device is added to -the class, the class's devnum field is incremented and assigned to the -device. The field is never decremented, so if the device is removed -from the class and re-added, it will receive a different enumerated -value. - -The class is allowed to create a class-specific structure for the -device and store it in the device's class_data pointer. - -There is no list of devices in the device class. Each driver has a -list of devices that it supports. The device class has a list of -drivers of that particular class. To access all of the devices in the -class, iterate over the device lists of each driver in the class. - - -Device Drivers -~~~~~~~~~~~~~~ -Device drivers are added to device classes when they are registered -with the core. A driver specifies the class it belongs to by setting -the struct device_driver::devclass field. - - -sysfs directory structure -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -There is a top-level sysfs directory named 'class'. 
- -Each class gets a directory in the class directory, along with two -default subdirectories:: - - class/ - `-- input - |-- devices - `-- drivers - - -Drivers registered with the class get a symlink in the drivers/ directory -that points to the driver's directory (under its bus directory):: - - class/ - `-- input - |-- devices - `-- drivers - `-- usb:usb_mouse -> ../../../bus/drivers/usb_mouse/ - - -Each device gets a symlink in the devices/ directory that points to the -device's directory in the physical hierarchy:: - - class/ - `-- input - |-- devices - | `-- 1 -> ../../../root/pci0/00:1f.0/usb_bus/00:1f.2-1:0/ - `-- drivers - - -Exporting Attributes -~~~~~~~~~~~~~~~~~~~~ - -:: - - struct devclass_attribute { - struct attribute attr; - ssize_t (*show)(struct device_class *, char * buf, size_t count, loff_t off); - ssize_t (*store)(struct device_class *, const char * buf, size_t count, loff_t off); - }; - -Class drivers can export attributes using the DEVCLASS_ATTR macro that works -similarly to the DEVICE_ATTR macro for devices. For example, a definition -like this:: - - static DEVCLASS_ATTR(debug,0644,show_debug,store_debug); - -is equivalent to declaring:: - - static devclass_attribute devclass_attr_debug; - -The bus driver can add and remove the attribute from the class's -sysfs directory using:: - - int devclass_create_file(struct device_class *, struct devclass_attribute *); - void devclass_remove_file(struct device_class *, struct devclass_attribute *); - -In the example above, the file will be named 'debug' in placed in the -class's directory in sysfs. - - -Interfaces -~~~~~~~~~~ -There may exist multiple mechanisms for accessing the same device of a -particular class type. Device interfaces describe these mechanisms. - -When a device is added to a device class, the core attempts to add it -to every interface that is registered with the device class. 
diff --git a/Documentation/driver-model/design-patterns.rst b/Documentation/driver-model/design-patterns.rst deleted file mode 100644 index 41eb8f41f7dd..000000000000 --- a/Documentation/driver-model/design-patterns.rst +++ /dev/null @@ -1,116 +0,0 @@ -============================= -Device Driver Design Patterns -============================= - -This document describes a few common design patterns found in device drivers. -It is likely that subsystem maintainers will ask driver developers to -conform to these design patterns. - -1. State Container -2. container_of() - - -1. State Container -~~~~~~~~~~~~~~~~~~ - -While the kernel contains a few device drivers that assume that they will -only be probed() once on a certain system (singletons), it is custom to assume -that the device the driver binds to will appear in several instances. This -means that the probe() function and all callbacks need to be reentrant. - -The most common way to achieve this is to use the state container design -pattern. It usually has this form:: - - struct foo { - spinlock_t lock; /* Example member */ - (...) - }; - - static int foo_probe(...) - { - struct foo *foo; - - foo = devm_kzalloc(dev, sizeof(*foo), GFP_KERNEL); - if (!foo) - return -ENOMEM; - spin_lock_init(&foo->lock); - (...) - } - -This will create an instance of struct foo in memory every time probe() is -called. This is our state container for this instance of the device driver. -Of course it is then necessary to always pass this instance of the -state around to all functions that need access to the state and its members. - -For example, if the driver is registering an interrupt handler, you would -pass around a pointer to struct foo like this:: - - static irqreturn_t foo_handler(int irq, void *arg) - { - struct foo *foo = arg; - (...) - } - - static int foo_probe(...) - { - struct foo *foo; - - (...) 
- ret = request_irq(irq, foo_handler, 0, "foo", foo); - } - -This way you always get a pointer back to the correct instance of foo in -your interrupt handler. - - -2. container_of() -~~~~~~~~~~~~~~~~~ - -Continuing on the above example we add an offloaded work:: - - struct foo { - spinlock_t lock; - struct workqueue_struct *wq; - struct work_struct offload; - (...) - }; - - static void foo_work(struct work_struct *work) - { - struct foo *foo = container_of(work, struct foo, offload); - - (...) - } - - static irqreturn_t foo_handler(int irq, void *arg) - { - struct foo *foo = arg; - - queue_work(foo->wq, &foo->offload); - (...) - } - - static int foo_probe(...) - { - struct foo *foo; - - foo->wq = create_singlethread_workqueue("foo-wq"); - INIT_WORK(&foo->offload, foo_work); - (...) - } - -The design pattern is the same for an hrtimer or something similar that will -return a single argument which is a pointer to a struct member in the -callback. - -container_of() is a macro defined in - -What container_of() does is to obtain a pointer to the containing struct from -a pointer to a member by a simple subtraction using the offsetof() macro from -standard C, which allows something similar to object oriented behaviours. -Notice that the contained member must not be a pointer, but an actual member -for this to work. - -We can see here that we avoid having global pointers to our struct foo * -instance this way, while still keeping the number of parameters passed to the -work function to a single pointer. diff --git a/Documentation/driver-model/device.rst b/Documentation/driver-model/device.rst deleted file mode 100644 index 2b868d49d349..000000000000 --- a/Documentation/driver-model/device.rst +++ /dev/null @@ -1,109 +0,0 @@ -========================== -The Basic Device Structure -========================== - -See the kerneldoc for the struct device. 
- - -Programming Interface -~~~~~~~~~~~~~~~~~~~~~ -The bus driver that discovers the device uses this to register the -device with the core:: - - int device_register(struct device * dev); - -The bus should initialize the following fields: - - - parent - - name - - bus_id - - bus - -A device is removed from the core when its reference count goes to -0. The reference count can be adjusted using:: - - struct device * get_device(struct device * dev); - void put_device(struct device * dev); - -get_device() will return a pointer to the struct device passed to it -if the reference is not already 0 (if it's in the process of being -removed already). - -A driver can access the lock in the device structure using:: - - void lock_device(struct device * dev); - void unlock_device(struct device * dev); - - -Attributes -~~~~~~~~~~ - -:: - - struct device_attribute { - struct attribute attr; - ssize_t (*show)(struct device *dev, struct device_attribute *attr, - char *buf); - ssize_t (*store)(struct device *dev, struct device_attribute *attr, - const char *buf, size_t count); - }; - -Attributes of devices can be exported by a device driver through sysfs. - -Please see Documentation/filesystems/sysfs.txt for more information -on how sysfs works. - -As explained in Documentation/kobject.txt, device attributes must be -created before the KOBJ_ADD uevent is generated. The only way to realize -that is by defining an attribute group. - -Attributes are declared using a macro called DEVICE_ATTR:: - - #define DEVICE_ATTR(name,mode,show,store) - -Example::: - - static DEVICE_ATTR(type, 0444, show_type, NULL); - static DEVICE_ATTR(power, 0644, show_power, store_power); - -This declares two structures of type struct device_attribute with respective -names 'dev_attr_type' and 'dev_attr_power'. 
These two attributes can be -organized as follows into a group:: - - static struct attribute *dev_attrs[] = { - &dev_attr_type.attr, - &dev_attr_power.attr, - NULL, - }; - - static struct attribute_group dev_attr_group = { - .attrs = dev_attrs, - }; - - static const struct attribute_group *dev_attr_groups[] = { - &dev_attr_group, - NULL, - }; - -This array of groups can then be associated with a device by setting the -group pointer in struct device before device_register() is invoked:: - - dev->groups = dev_attr_groups; - device_register(dev); - -The device_register() function will use the 'groups' pointer to create the -device attributes and the device_unregister() function will use this pointer -to remove the device attributes. - -Word of warning: While the kernel allows device_create_file() and -device_remove_file() to be called on a device at any time, userspace has -strict expectations on when attributes get created. When a new device is -registered in the kernel, a uevent is generated to notify userspace (like -udev) that a new device is available. If attributes are added after the -device is registered, then userspace won't get notified and userspace will -not know about the new attributes. - -This is important for device driver that need to publish additional -attributes for a device at driver probe time. If the device driver simply -calls device_create_file() on the device structure passed to it, then -userspace will never be notified of the new attributes. diff --git a/Documentation/driver-model/devres.rst b/Documentation/driver-model/devres.rst deleted file mode 100644 index 4ac99122b5f1..000000000000 --- a/Documentation/driver-model/devres.rst +++ /dev/null @@ -1,414 +0,0 @@ -================================ -Devres - Managed Device Resource -================================ - -Tejun Heo - -First draft 10 January 2007 - -.. contents - - 1. Intro : Huh? Devres? - 2. Devres : Devres in a nutshell - 3. 
Devres Group : Group devres'es and release them together - 4. Details : Life time rules, calling context, ... - 5. Overhead : How much do we have to pay for this? - 6. List of managed interfaces: Currently implemented managed interfaces - - -1. Intro --------- - -devres came up while trying to convert libata to use iomap. Each -iomapped address should be kept and unmapped on driver detach. For -example, a plain SFF ATA controller (that is, good old PCI IDE) in -native mode makes use of 5 PCI BARs and all of them should be -maintained. - -As with many other device drivers, libata low level drivers have -sufficient bugs in ->remove and ->probe failure path. Well, yes, -that's probably because libata low level driver developers are lazy -bunch, but aren't all low level driver developers? After spending a -day fiddling with braindamaged hardware with no document or -braindamaged document, if it's finally working, well, it's working. - -For one reason or another, low level drivers don't receive as much -attention or testing as core code, and bugs on driver detach or -initialization failure don't happen often enough to be noticeable. -Init failure path is worse because it's much less travelled while -needs to handle multiple entry points. - -So, many low level drivers end up leaking resources on driver detach -and having half broken failure path implementation in ->probe() which -would leak resources or even cause oops when failure occurs. iomap -adds more to this mix. So do msi and msix. - - -2. Devres ---------- - -devres is basically linked list of arbitrarily sized memory areas -associated with a struct device. Each devres entry is associated with -a release function. A devres can be released in several ways. No -matter what, all devres entries are released on driver detach. On -release, the associated release function is invoked and then the -devres entry is freed. - -Managed interface is created for resources commonly used by device -drivers using devres. 
For example, coherent DMA memory is acquired -using dma_alloc_coherent(). The managed version is called -dmam_alloc_coherent(). It is identical to dma_alloc_coherent() except -for the DMA memory allocated using it is managed and will be -automatically released on driver detach. Implementation looks like -the following:: - - struct dma_devres { - size_t size; - void *vaddr; - dma_addr_t dma_handle; - }; - - static void dmam_coherent_release(struct device *dev, void *res) - { - struct dma_devres *this = res; - - dma_free_coherent(dev, this->size, this->vaddr, this->dma_handle); - } - - dmam_alloc_coherent(dev, size, dma_handle, gfp) - { - struct dma_devres *dr; - void *vaddr; - - dr = devres_alloc(dmam_coherent_release, sizeof(*dr), gfp); - ... - - /* alloc DMA memory as usual */ - vaddr = dma_alloc_coherent(...); - ... - - /* record size, vaddr, dma_handle in dr */ - dr->vaddr = vaddr; - ... - - devres_add(dev, dr); - - return vaddr; - } - -If a driver uses dmam_alloc_coherent(), the area is guaranteed to be -freed whether initialization fails half-way or the device gets -detached. If most resources are acquired using managed interface, a -driver can have much simpler init and exit code. Init path basically -looks like the following:: - - my_init_one() - { - struct mydev *d; - - d = devm_kzalloc(dev, sizeof(*d), GFP_KERNEL); - if (!d) - return -ENOMEM; - - d->ring = dmam_alloc_coherent(...); - if (!d->ring) - return -ENOMEM; - - if (check something) - return -EINVAL; - ... - - return register_to_upper_layer(d); - } - -And exit path:: - - my_remove_one() - { - unregister_from_upper_layer(d); - shutdown_my_hardware(); - } - -As shown above, low level drivers can be simplified a lot by using -devres. Complexity is shifted from less maintained low level drivers -to better maintained higher layer. Also, as init failure path is -shared with exit path, both can get more testing. 
- -Note though that when converting current calls or assignments to -managed devm_* versions it is up to you to check if internal operations -like allocating memory, have failed. Managed resources pertains to the -freeing of these resources *only* - all other checks needed are still -on you. In some cases this may mean introducing checks that were not -necessary before moving to the managed devm_* calls. - - -3. Devres group ---------------- - -Devres entries can be grouped using devres group. When a group is -released, all contained normal devres entries and properly nested -groups are released. One usage is to rollback series of acquired -resources on failure. For example:: - - if (!devres_open_group(dev, NULL, GFP_KERNEL)) - return -ENOMEM; - - acquire A; - if (failed) - goto err; - - acquire B; - if (failed) - goto err; - ... - - devres_remove_group(dev, NULL); - return 0; - - err: - devres_release_group(dev, NULL); - return err_code; - -As resource acquisition failure usually means probe failure, constructs -like above are usually useful in midlayer driver (e.g. libata core -layer) where interface function shouldn't have side effect on failure. -For LLDs, just returning error code suffices in most cases. - -Each group is identified by `void *id`. It can either be explicitly -specified by @id argument to devres_open_group() or automatically -created by passing NULL as @id as in the above example. In both -cases, devres_open_group() returns the group's id. The returned id -can be passed to other devres functions to select the target group. -If NULL is given to those functions, the latest open group is -selected. - -For example, you can do something like the following:: - - int my_midlayer_create_something() - { - if (!devres_open_group(dev, my_midlayer_create_something, GFP_KERNEL)) - return -ENOMEM; - - ... 
- - devres_close_group(dev, my_midlayer_create_something); - return 0; - } - - void my_midlayer_destroy_something() - { - devres_release_group(dev, my_midlayer_create_something); - } - - -4. Details ----------- - -Lifetime of a devres entry begins on devres allocation and finishes -when it is released or destroyed (removed and freed) - no reference -counting. - -devres core guarantees atomicity to all basic devres operations and -has support for single-instance devres types (atomic -lookup-and-add-if-not-found). Other than that, synchronizing -concurrent accesses to allocated devres data is caller's -responsibility. This is usually non-issue because bus ops and -resource allocations already do the job. - -For an example of single-instance devres type, read pcim_iomap_table() -in lib/devres.c. - -All devres interface functions can be called without context if the -right gfp mask is given. - - -5. Overhead ------------ - -Each devres bookkeeping info is allocated together with requested data -area. With debug option turned off, bookkeeping info occupies 16 -bytes on 32bit machines and 24 bytes on 64bit (three pointers rounded -up to ull alignment). If singly linked list is used, it can be -reduced to two pointers (8 bytes on 32bit, 16 bytes on 64bit). - -Each devres group occupies 8 pointers. It can be reduced to 6 if -singly linked list is used. - -Memory space overhead on ahci controller with two ports is between 300 -and 400 bytes on 32bit machine after naive conversion (we can -certainly invest a bit more effort into libata core layer). - - -6. 
List of managed interfaces ------------------------------ - -CLOCK - devm_clk_get() - devm_clk_get_optional() - devm_clk_put() - devm_clk_hw_register() - devm_of_clk_add_hw_provider() - devm_clk_hw_register_clkdev() - -DMA - dmaenginem_async_device_register() - dmam_alloc_coherent() - dmam_alloc_attrs() - dmam_free_coherent() - dmam_pool_create() - dmam_pool_destroy() - -DRM - devm_drm_dev_init() - -GPIO - devm_gpiod_get() - devm_gpiod_get_index() - devm_gpiod_get_index_optional() - devm_gpiod_get_optional() - devm_gpiod_put() - devm_gpiod_unhinge() - devm_gpiochip_add_data() - devm_gpio_request() - devm_gpio_request_one() - devm_gpio_free() - -I2C - devm_i2c_new_dummy_device() - -IIO - devm_iio_device_alloc() - devm_iio_device_free() - devm_iio_device_register() - devm_iio_device_unregister() - devm_iio_kfifo_allocate() - devm_iio_kfifo_free() - devm_iio_triggered_buffer_setup() - devm_iio_triggered_buffer_cleanup() - devm_iio_trigger_alloc() - devm_iio_trigger_free() - devm_iio_trigger_register() - devm_iio_trigger_unregister() - devm_iio_channel_get() - devm_iio_channel_release() - devm_iio_channel_get_all() - devm_iio_channel_release_all() - -INPUT - devm_input_allocate_device() - -IO region - devm_release_mem_region() - devm_release_region() - devm_release_resource() - devm_request_mem_region() - devm_request_region() - devm_request_resource() - -IOMAP - devm_ioport_map() - devm_ioport_unmap() - devm_ioremap() - devm_ioremap_nocache() - devm_ioremap_wc() - devm_ioremap_resource() : checks resource, requests memory region, ioremaps - devm_iounmap() - pcim_iomap() - pcim_iomap_regions() : do request_region() and iomap() on multiple BARs - pcim_iomap_table() : array of mapped addresses indexed by BAR - pcim_iounmap() - -IRQ - devm_free_irq() - devm_request_any_context_irq() - devm_request_irq() - devm_request_threaded_irq() - devm_irq_alloc_descs() - devm_irq_alloc_desc() - devm_irq_alloc_desc_at() - devm_irq_alloc_desc_from() - devm_irq_alloc_descs_from() - 
devm_irq_alloc_generic_chip() - devm_irq_setup_generic_chip() - devm_irq_sim_init() - -LED - devm_led_classdev_register() - devm_led_classdev_unregister() - -MDIO - devm_mdiobus_alloc() - devm_mdiobus_alloc_size() - devm_mdiobus_free() - -MEM - devm_free_pages() - devm_get_free_pages() - devm_kasprintf() - devm_kcalloc() - devm_kfree() - devm_kmalloc() - devm_kmalloc_array() - devm_kmemdup() - devm_kstrdup() - devm_kvasprintf() - devm_kzalloc() - -MFD - devm_mfd_add_devices() - -MUX - devm_mux_chip_alloc() - devm_mux_chip_register() - devm_mux_control_get() - -PER-CPU MEM - devm_alloc_percpu() - devm_free_percpu() - -PCI - devm_pci_alloc_host_bridge() : managed PCI host bridge allocation - devm_pci_remap_cfgspace() : ioremap PCI configuration space - devm_pci_remap_cfg_resource() : ioremap PCI configuration space resource - pcim_enable_device() : after success, all PCI ops become managed - pcim_pin_device() : keep PCI device enabled after release - -PHY - devm_usb_get_phy() - devm_usb_put_phy() - -PINCTRL - devm_pinctrl_get() - devm_pinctrl_put() - devm_pinctrl_register() - devm_pinctrl_unregister() - -POWER - devm_reboot_mode_register() - devm_reboot_mode_unregister() - -PWM - devm_pwm_get() - devm_pwm_put() - -REGULATOR - devm_regulator_bulk_get() - devm_regulator_get() - devm_regulator_put() - devm_regulator_register() - -RESET - devm_reset_control_get() - devm_reset_controller_register() - -SERDEV - devm_serdev_device_open() - -SLAVE DMA ENGINE - devm_acpi_dma_controller_register() - -SPI - devm_spi_register_master() - -WATCHDOG - devm_watchdog_register_device() diff --git a/Documentation/driver-model/driver.rst b/Documentation/driver-model/driver.rst deleted file mode 100644 index 11d281506a04..000000000000 --- a/Documentation/driver-model/driver.rst +++ /dev/null @@ -1,223 +0,0 @@ -============== -Device Drivers -============== - -See the kerneldoc for the struct device_driver. - - -Allocation -~~~~~~~~~~ - -Device drivers are statically allocated structures. 
Though there may -be multiple devices in a system that a driver supports, struct -device_driver represents the driver as a whole (not a particular -device instance). - -Initialization -~~~~~~~~~~~~~~ - -The driver must initialize at least the name and bus fields. It should -also initialize the devclass field (when it arrives), so it may obtain -the proper linkage internally. It should also initialize as many of -the callbacks as possible, though each is optional. - -Declaration -~~~~~~~~~~~ - -As stated above, struct device_driver objects are statically -allocated. Below is an example declaration of the eepro100 -driver. This declaration is hypothetical only; it relies on the driver -being converted completely to the new model:: - - static struct device_driver eepro100_driver = { - .name = "eepro100", - .bus = &pci_bus_type, - - .probe = eepro100_probe, - .remove = eepro100_remove, - .suspend = eepro100_suspend, - .resume = eepro100_resume, - }; - -Most drivers will not be able to be converted completely to the new -model because the bus they belong to has a bus-specific structure with -bus-specific fields that cannot be generalized. - -The most common example of this are device ID structures. A driver -typically defines an array of device IDs that it supports. The format -of these structures and the semantics for comparing device IDs are -completely bus-specific. Defining them as bus-specific entities would -sacrifice type-safety, so we keep bus-specific structures around. - -Bus-specific drivers should include a generic struct device_driver in -the definition of the bus-specific driver. 
Like this:: - - struct pci_driver { - const struct pci_device_id *id_table; - struct device_driver driver; - }; - -A definition that included bus-specific fields would look like -(using the eepro100 driver again):: - - static struct pci_driver eepro100_driver = { - .id_table = eepro100_pci_tbl, - .driver = { - .name = "eepro100", - .bus = &pci_bus_type, - .probe = eepro100_probe, - .remove = eepro100_remove, - .suspend = eepro100_suspend, - .resume = eepro100_resume, - }, - }; - -Some may find the syntax of embedded struct initialization awkward or -even a bit ugly. So far, it's the best way we've found to do what we want... - -Registration -~~~~~~~~~~~~ - -:: - - int driver_register(struct device_driver *drv); - -The driver registers the structure on startup. For drivers that have -no bus-specific fields (i.e. don't have a bus-specific driver -structure), they would use driver_register and pass a pointer to their -struct device_driver object. - -Most drivers, however, will have a bus-specific structure and will -need to register with the bus using something like pci_driver_register. - -It is important that drivers register their driver structure as early as -possible. Registration with the core initializes several fields in the -struct device_driver object, including the reference count and the -lock. These fields are assumed to be valid at all times and may be -used by the device model core or the bus driver. - - -Transition Bus Drivers -~~~~~~~~~~~~~~~~~~~~~~ - -By defining wrapper functions, the transition to the new model can be -made easier. Drivers can ignore the generic structure altogether and -let the bus wrapper fill in the fields. For the callbacks, the bus can -define generic callbacks that forward the call to the bus-specific -callbacks of the drivers. - -This solution is intended to be only temporary. In order to get class -information in the driver, the drivers must be modified anyway. 
Since -converting drivers to the new model should reduce some infrastructural -complexity and code size, it is recommended that they are converted as -class information is added. - -Access -~~~~~~ - -Once the object has been registered, it may access the common fields of -the object, like the lock and the list of devices:: - - int driver_for_each_dev(struct device_driver *drv, void *data, - int (*callback)(struct device *dev, void *data)); - -The devices field is a list of all the devices that have been bound to -the driver. The LDM core provides a helper function to operate on all -the devices a driver controls. This helper locks the driver on each -node access, and does proper reference counting on each device as it -accesses it. - - -sysfs -~~~~~ - -When a driver is registered, a sysfs directory is created in its -bus's directory. In this directory, the driver can export an interface -to userspace to control operation of the driver on a global basis; -e.g. toggling debugging output in the driver. - -A future feature of this directory will be a 'devices' directory. This -directory will contain symlinks to the directories of devices it -supports. - - - -Callbacks -~~~~~~~~~ - -:: - - int (*probe) (struct device *dev); - -The probe() entry is called in task context, with the bus's rwsem locked -and the driver partially bound to the device. Drivers commonly use -container_of() to convert "dev" to a bus-specific type, both in probe() -and other routines. That type often provides device resource data, such -as pci_dev.resource[] or platform_device.resources, which is used in -addition to dev->platform_data to initialize the driver. - -This callback holds the driver-specific logic to bind the driver to a -given device. That includes verifying that the device is present, that -it's a version the driver can handle, that driver data structures can -be allocated and initialized, and that any hardware can be initialized. 
-Drivers often store a pointer to their state with dev_set_drvdata(). -When the driver has successfully bound itself to that device, then probe() -returns zero and the driver model code will finish its part of binding -the driver to that device. - -A driver's probe() may return a negative errno value to indicate that -the driver did not bind to this device, in which case it should have -released all resources it allocated:: - - int (*remove) (struct device *dev); - -remove is called to unbind a driver from a device. This may be -called if a device is physically removed from the system, if the -driver module is being unloaded, during a reboot sequence, or -in other cases. - -It is up to the driver to determine if the device is present or -not. It should free any resources allocated specifically for the -device; i.e. anything in the device's driver_data field. - -If the device is still present, it should quiesce the device and place -it into a supported low-power state:: - - int (*suspend) (struct device *dev, pm_message_t state); - -suspend is called to put the device in a low power state:: - - int (*resume) (struct device *dev); - -Resume is used to bring a device back from a low power state. - - -Attributes -~~~~~~~~~~ - -:: - - struct driver_attribute { - struct attribute attr; - ssize_t (*show)(struct device_driver *driver, char *buf); - ssize_t (*store)(struct device_driver *, const char *buf, size_t count); - }; - -Device drivers can export attributes via their sysfs directories. -Drivers can declare attributes using a DRIVER_ATTR_RW and DRIVER_ATTR_RO -macro that works identically to the DEVICE_ATTR_RW and DEVICE_ATTR_RO -macros. 
- -Example:: - - DRIVER_ATTR_RW(debug); - -This is equivalent to declaring:: - - struct driver_attribute driver_attr_debug; - -This can then be used to add and remove the attribute from the -driver's directory using:: - - int driver_create_file(struct device_driver *, const struct driver_attribute *); - void driver_remove_file(struct device_driver *, const struct driver_attribute *); diff --git a/Documentation/driver-model/index.rst b/Documentation/driver-model/index.rst deleted file mode 100644 index 9f85d579ce56..000000000000 --- a/Documentation/driver-model/index.rst +++ /dev/null @@ -1,26 +0,0 @@ -:orphan: - -============ -Driver Model -============ - -.. toctree:: - :maxdepth: 1 - - binding - bus - class - design-patterns - device - devres - driver - overview - platform - porting - -.. only:: subproject and html - - Indices - ======= - - * :ref:`genindex` diff --git a/Documentation/driver-model/overview.rst b/Documentation/driver-model/overview.rst deleted file mode 100644 index d4d1e9b40e0c..000000000000 --- a/Documentation/driver-model/overview.rst +++ /dev/null @@ -1,124 +0,0 @@ -============================= -The Linux Kernel Device Model -============================= - -Patrick Mochel - -Drafted 26 August 2002 -Updated 31 January 2006 - - -Overview -~~~~~~~~ - -The Linux Kernel Driver Model is a unification of all the disparate driver -models that were previously used in the kernel. It is intended to augment the -bus-specific drivers for bridges and devices by consolidating a set of data -and operations into globally accessible data structures. - -Traditional driver models implemented some sort of tree-like structure -(sometimes just a list) for the devices they control. There wasn't any -uniformity across the different bus types. - -The current driver model provides a common, uniform data model for describing -a bus and the devices that can appear under the bus. 
The unified bus -model includes a set of common attributes which all busses carry, and a set -of common callbacks, such as device discovery during bus probing, bus -shutdown, bus power management, etc. - -The common device and bridge interface reflects the goals of the modern -computer: namely the ability to do seamless device "plug and play", power -management, and hot plug. In particular, the model dictated by Intel and -Microsoft (namely ACPI) ensures that almost every device on almost any bus -on an x86-compatible system can work within this paradigm. Of course, -not every bus is able to support all such operations, although most -buses support most of those operations. - - -Downstream Access -~~~~~~~~~~~~~~~~~ - -Common data fields have been moved out of individual bus layers into a common -data structure. These fields must still be accessed by the bus layers, -and sometimes by the device-specific drivers. - -Other bus layers are encouraged to do what has been done for the PCI layer. -struct pci_dev now looks like this:: - - struct pci_dev { - ... - - struct device dev; /* Generic device interface */ - ... - }; - -Note first that the struct device dev within the struct pci_dev is -statically allocated. This means only one allocation on device discovery. - -Note also that that struct device dev is not necessarily defined at the -front of the pci_dev structure. This is to make people think about what -they're doing when switching between the bus driver and the global driver, -and to discourage meaningless and incorrect casts between the two. - -The PCI bus layer freely accesses the fields of struct device. It knows about -the structure of struct pci_dev, and it should know the structure of struct -device. Individual PCI device drivers that have been converted to the current -driver model generally do not and should not touch the fields of struct device, -unless there is a compelling reason to do so. 
- -The above abstraction prevents unnecessary pain during transitional phases. -If it were not done this way, then when a field was renamed or removed, every -downstream driver would break. On the other hand, if only the bus layer -(and not the device layer) accesses the struct device, it is only the bus -layer that needs to change. - - -User Interface -~~~~~~~~~~~~~~ - -By virtue of having a complete hierarchical view of all the devices in the -system, exporting a complete hierarchical view to userspace becomes relatively -easy. This has been accomplished by implementing a special purpose virtual -file system named sysfs. - -Almost all mainstream Linux distros mount this filesystem automatically; you -can see some variation of the following in the output of the "mount" command:: - - $ mount - ... - none on /sys type sysfs (rw,noexec,nosuid,nodev) - ... - $ - -The auto-mounting of sysfs is typically accomplished by an entry similar to -the following in the /etc/fstab file:: - - none /sys sysfs defaults 0 0 - -or something similar in the /lib/init/fstab file on Debian-based systems:: - - none /sys sysfs nodev,noexec,nosuid 0 0 - -If sysfs is not automatically mounted, you can always do it manually with:: - - # mount -t sysfs sysfs /sys - -Whenever a device is inserted into the tree, a directory is created for it. -This directory may be populated at each layer of discovery - the global layer, -the bus layer, or the device layer. - -The global layer currently creates two files - 'name' and 'power'. The -former only reports the name of the device. The latter reports the -current power state of the device. It will also be used to set the current -power state. - -The bus layer may also create files for the devices it finds while probing the -bus. For example, the PCI layer currently creates 'irq' and 'resource' files -for each PCI device. - -A device-specific driver may also export files in its directory to expose -device-specific data or tunable interfaces. 
- -More information about the sysfs directory layout can be found in -the other documents in this directory and in the file -Documentation/filesystems/sysfs.txt. diff --git a/Documentation/driver-model/platform.rst b/Documentation/driver-model/platform.rst deleted file mode 100644 index 334dd4071ae4..000000000000 --- a/Documentation/driver-model/platform.rst +++ /dev/null @@ -1,246 +0,0 @@ -============================ -Platform Devices and Drivers -============================ - -See for the driver model interface to the -platform bus: platform_device, and platform_driver. This pseudo-bus -is used to connect devices on busses with minimal infrastructure, -like those used to integrate peripherals on many system-on-chip -processors, or some "legacy" PC interconnects; as opposed to large -formally specified ones like PCI or USB. - - -Platform devices -~~~~~~~~~~~~~~~~ -Platform devices are devices that typically appear as autonomous -entities in the system. This includes legacy port-based devices and -host bridges to peripheral buses, and most controllers integrated -into system-on-chip platforms. What they usually have in common -is direct addressing from a CPU bus. Rarely, a platform_device will -be connected through a segment of some other kind of bus; but its -registers will still be directly addressable. - -Platform devices are given a name, used in driver binding, and a -list of resources such as addresses and IRQs:: - - struct platform_device { - const char *name; - u32 id; - struct device dev; - u32 num_resources; - struct resource *resource; - }; - - -Platform drivers -~~~~~~~~~~~~~~~~ -Platform drivers follow the standard driver model convention, where -discovery/enumeration is handled outside the drivers, and drivers -provide probe() and remove() methods. 
They support power management -and shutdown notifications using the standard conventions:: - - struct platform_driver { - int (*probe)(struct platform_device *); - int (*remove)(struct platform_device *); - void (*shutdown)(struct platform_device *); - int (*suspend)(struct platform_device *, pm_message_t state); - int (*suspend_late)(struct platform_device *, pm_message_t state); - int (*resume_early)(struct platform_device *); - int (*resume)(struct platform_device *); - struct device_driver driver; - }; - -Note that probe() should in general verify that the specified device hardware -actually exists; sometimes platform setup code can't be sure. The probing -can use device resources, including clocks, and device platform_data. - -Platform drivers register themselves the normal way:: - - int platform_driver_register(struct platform_driver *drv); - -Or, in common situations where the device is known not to be hot-pluggable, -the probe() routine can live in an init section to reduce the driver's -runtime memory footprint:: - - int platform_driver_probe(struct platform_driver *drv, - int (*probe)(struct platform_device *)) - -Kernel modules can be composed of several platform drivers. The platform core -provides helpers to register and unregister an array of drivers:: - - int __platform_register_drivers(struct platform_driver * const *drivers, - unsigned int count, struct module *owner); - void platform_unregister_drivers(struct platform_driver * const *drivers, - unsigned int count); - -If one of the drivers fails to register, all drivers registered up to that -point will be unregistered in reverse order. 
Note that there is a convenience -macro that passes THIS_MODULE as owner parameter:: - - #define platform_register_drivers(drivers, count) - - -Device Enumeration -~~~~~~~~~~~~~~~~~~ -As a rule, platform specific (and often board-specific) setup code will -register platform devices:: - - int platform_device_register(struct platform_device *pdev); - - int platform_add_devices(struct platform_device **pdevs, int ndev); - -The general rule is to register only those devices that actually exist, -but in some cases extra devices might be registered. For example, a kernel -might be configured to work with an external network adapter that might not -be populated on all boards, or likewise to work with an integrated controller -that some boards might not hook up to any peripherals. - -In some cases, boot firmware will export tables describing the devices -that are populated on a given board. Without such tables, often the -only way for system setup code to set up the correct devices is to build -a kernel for a specific target board. Such board-specific kernels are -common with embedded and custom systems development. - -In many cases, the memory and IRQ resources associated with the platform -device are not enough to let the device's driver work. Board setup code -will often provide additional information using the device's platform_data -field to hold additional information. - -Embedded systems frequently need one or more clocks for platform devices, -which are normally kept off until they're actively needed (to save power). -System setup also associates those clocks with the device, so that that -calls to clk_get(&pdev->dev, clock_name) return them as needed. - - -Legacy Drivers: Device Probing -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Some drivers are not fully converted to the driver model, because they take -on a non-driver role: the driver registers its platform device, rather than -leaving that for system infrastructure. 
Such drivers can't be hotplugged -or coldplugged, since those mechanisms require device creation to be in a -different system component than the driver. - -The only "good" reason for this is to handle older system designs which, like -original IBM PCs, rely on error-prone "probe-the-hardware" models for hardware -configuration. Newer systems have largely abandoned that model, in favor of -bus-level support for dynamic configuration (PCI, USB), or device tables -provided by the boot firmware (e.g. PNPACPI on x86). There are too many -conflicting options about what might be where, and even educated guesses by -an operating system will be wrong often enough to make trouble. - -This style of driver is discouraged. If you're updating such a driver, -please try to move the device enumeration to a more appropriate location, -outside the driver. This will usually be cleanup, since such drivers -tend to already have "normal" modes, such as ones using device nodes that -were created by PNP or by platform device setup. - -None the less, there are some APIs to support such legacy drivers. Avoid -using these calls except with such hotplug-deficient drivers:: - - struct platform_device *platform_device_alloc( - const char *name, int id); - -You can use platform_device_alloc() to dynamically allocate a device, which -you will then initialize with resources and platform_device_register(). -A better solution is usually:: - - struct platform_device *platform_device_register_simple( - const char *name, int id, - struct resource *res, unsigned int nres); - -You can use platform_device_register_simple() as a one-step call to allocate -and register a device. - - -Device Naming and Driver Binding -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The platform_device.dev.bus_id is the canonical name for the devices. -It's built from two components: - - * platform_device.name ... which is also used to for driver matching. - - * platform_device.id ... 
the device instance number, or else "-1" - to indicate there's only one. - -These are concatenated, so name/id "serial"/0 indicates bus_id "serial.0", and -"serial/3" indicates bus_id "serial.3"; both would use the platform_driver -named "serial". While "my_rtc"/-1 would be bus_id "my_rtc" (no instance id) -and use the platform_driver called "my_rtc". - -Driver binding is performed automatically by the driver core, invoking -driver probe() after finding a match between device and driver. If the -probe() succeeds, the driver and device are bound as usual. There are -three different ways to find such a match: - - - Whenever a device is registered, the drivers for that bus are - checked for matches. Platform devices should be registered very - early during system boot. - - - When a driver is registered using platform_driver_register(), all - unbound devices on that bus are checked for matches. Drivers - usually register later during booting, or by module loading. - - - Registering a driver using platform_driver_probe() works just like - using platform_driver_register(), except that the driver won't - be probed later if another device registers. (Which is OK, since - this interface is only for use with non-hotpluggable devices.) - - -Early Platform Devices and Drivers -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The early platform interfaces provide platform data to platform device -drivers early on during the system boot. The code is built on top of the -early_param() command line parsing and can be executed very early on. - -Example: "earlyprintk" class early serial console in 6 steps - -1. Registering early platform device data -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The architecture code registers platform device data using the function -early_platform_add_devices(). In the case of early serial console this -should be hardware configuration for the serial port. Devices registered -at this point will later on be matched against early platform drivers. - -2. 
Parsing kernel command line -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The architecture code calls parse_early_param() to parse the kernel -command line. This will execute all matching early_param() callbacks. -User specified early platform devices will be registered at this point. -For the early serial console case the user can specify port on the -kernel command line as "earlyprintk=serial.0" where "earlyprintk" is -the class string, "serial" is the name of the platform driver and -0 is the platform device id. If the id is -1 then the dot and the -id can be omitted. - -3. Installing early platform drivers belonging to a certain class -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The architecture code may optionally force registration of all early -platform drivers belonging to a certain class using the function -early_platform_driver_register_all(). User specified devices from -step 2 have priority over these. This step is omitted by the serial -driver example since the early serial driver code should be disabled -unless the user has specified port on the kernel command line. - -4. Early platform driver registration -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Compiled-in platform drivers making use of early_platform_init() are -automatically registered during step 2 or 3. The serial driver example -should use early_platform_init("earlyprintk", &platform_driver). - -5. Probing of early platform drivers belonging to a certain class -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The architecture code calls early_platform_driver_probe() to match -registered early platform devices associated with a certain class with -registered early platform drivers. Matched devices will get probed(). -This step can be executed at any point during the early boot. As soon -as possible may be good for the serial port case. - -6. 
Inside the early platform driver probe() -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The driver code needs to take special care during early boot, especially -when it comes to memory allocation and interrupt registration. The code -in the probe() function can use is_early_platform_device() to check if -it is called at early platform device or at the regular platform device -time. The early serial driver performs register_console() at this point. - -For further information, see . diff --git a/Documentation/driver-model/porting.rst b/Documentation/driver-model/porting.rst deleted file mode 100644 index ae4bf843c1d6..000000000000 --- a/Documentation/driver-model/porting.rst +++ /dev/null @@ -1,448 +0,0 @@ -======================================= -Porting Drivers to the New Driver Model -======================================= - -Patrick Mochel - -7 January 2003 - - -Overview - -Please refer to `Documentation/driver-model/*.rst` for definitions of -various driver types and concepts. - -Most of the work of porting devices drivers to the new model happens -at the bus driver layer. This was intentional, to minimize the -negative effect on kernel drivers, and to allow a gradual transition -of bus drivers. - -In a nutshell, the driver model consists of a set of objects that can -be embedded in larger, bus-specific objects. Fields in these generic -objects can replace fields in the bus-specific objects. - -The generic objects must be registered with the driver model core. By -doing so, they will exported via the sysfs filesystem. sysfs can be -mounted by doing:: - - # mount -t sysfs sysfs /sys - - - -The Process - -Step 0: Read include/linux/device.h for object and function definitions. - -Step 1: Registering the bus driver. - - -- Define a struct bus_type for the bus driver:: - - struct bus_type pci_bus_type = { - .name = "pci", - }; - - -- Register the bus type. 
- - This should be done in the initialization function for the bus type, - which is usually the module_init(), or equivalent, function:: - - static int __init pci_driver_init(void) - { - return bus_register(&pci_bus_type); - } - - subsys_initcall(pci_driver_init); - - - The bus type may be unregistered (if the bus driver may be compiled - as a module) by doing:: - - bus_unregister(&pci_bus_type); - - -- Export the bus type for others to use. - - Other code may wish to reference the bus type, so declare it in a - shared header file and export the symbol. - -From include/linux/pci.h:: - - extern struct bus_type pci_bus_type; - - -From file the above code appears in:: - - EXPORT_SYMBOL(pci_bus_type); - - - -- This will cause the bus to show up in /sys/bus/pci/ with two - subdirectories: 'devices' and 'drivers':: - - # tree -d /sys/bus/pci/ - /sys/bus/pci/ - |-- devices - `-- drivers - - - -Step 2: Registering Devices. - -struct device represents a single device. It mainly contains metadata -describing the relationship the device has to other entities. - - -- Embed a struct device in the bus-specific device type:: - - - struct pci_dev { - ... - struct device dev; /* Generic device interface */ - ... - }; - - It is recommended that the generic device not be the first item in - the struct to discourage programmers from doing mindless casts - between the object types. Instead macros, or inline functions, - should be created to convert from the generic object type:: - - - #define to_pci_dev(n) container_of(n, struct pci_dev, dev) - - or - - static inline struct pci_dev * to_pci_dev(struct kobject * kobj) - { - return container_of(n, struct pci_dev, dev); - } - - This allows the compiler to verify type-safety of the operations - that are performed (which is Good). - - -- Initialize the device on registration. - - When devices are discovered or registered with the bus type, the - bus driver should initialize the generic device. 
The most important - things to initialize are the bus_id, parent, and bus fields. - - The bus_id is an ASCII string that contains the device's address on - the bus. The format of this string is bus-specific. This is - necessary for representing devices in sysfs. - - parent is the physical parent of the device. It is important that - the bus driver sets this field correctly. - - The driver model maintains an ordered list of devices that it uses - for power management. This list must be in order to guarantee that - devices are shutdown before their physical parents, and vice versa. - The order of this list is determined by the parent of registered - devices. - - Also, the location of the device's sysfs directory depends on a - device's parent. sysfs exports a directory structure that mirrors - the device hierarchy. Accurately setting the parent guarantees that - sysfs will accurately represent the hierarchy. - - The device's bus field is a pointer to the bus type the device - belongs to. This should be set to the bus_type that was declared - and initialized before. - - Optionally, the bus driver may set the device's name and release - fields. - - The name field is an ASCII string describing the device, like - - "ATI Technologies Inc Radeon QD" - - The release field is a callback that the driver model core calls - when the device has been removed, and all references to it have - been released. More on this in a moment. - - -- Register the device. - - Once the generic device has been initialized, it can be registered - with the driver model core by doing:: - - device_register(&dev->dev); - - It can later be unregistered by doing:: - - device_unregister(&dev->dev); - - This should happen on buses that support hotpluggable devices. - If a bus driver unregisters a device, it should not immediately free - it. It should instead wait for the driver model core to call the - device's release method, then free the bus-specific object. 
- (There may be other code that is currently referencing the device - structure, and it would be rude to free the device while that is - happening). - - - When the device is registered, a directory in sysfs is created. - The PCI tree in sysfs looks like:: - - /sys/devices/pci0/ - |-- 00:00.0 - |-- 00:01.0 - | `-- 01:00.0 - |-- 00:02.0 - | `-- 02:1f.0 - | `-- 03:00.0 - |-- 00:1e.0 - | `-- 04:04.0 - |-- 00:1f.0 - |-- 00:1f.1 - | |-- ide0 - | | |-- 0.0 - | | `-- 0.1 - | `-- ide1 - | `-- 1.0 - |-- 00:1f.2 - |-- 00:1f.3 - `-- 00:1f.5 - - Also, symlinks are created in the bus's 'devices' directory - that point to the device's directory in the physical hierarchy:: - - /sys/bus/pci/devices/ - |-- 00:00.0 -> ../../../devices/pci0/00:00.0 - |-- 00:01.0 -> ../../../devices/pci0/00:01.0 - |-- 00:02.0 -> ../../../devices/pci0/00:02.0 - |-- 00:1e.0 -> ../../../devices/pci0/00:1e.0 - |-- 00:1f.0 -> ../../../devices/pci0/00:1f.0 - |-- 00:1f.1 -> ../../../devices/pci0/00:1f.1 - |-- 00:1f.2 -> ../../../devices/pci0/00:1f.2 - |-- 00:1f.3 -> ../../../devices/pci0/00:1f.3 - |-- 00:1f.5 -> ../../../devices/pci0/00:1f.5 - |-- 01:00.0 -> ../../../devices/pci0/00:01.0/01:00.0 - |-- 02:1f.0 -> ../../../devices/pci0/00:02.0/02:1f.0 - |-- 03:00.0 -> ../../../devices/pci0/00:02.0/02:1f.0/03:00.0 - `-- 04:04.0 -> ../../../devices/pci0/00:1e.0/04:04.0 - - - -Step 3: Registering Drivers. - -struct device_driver is a simple driver structure that contains a set -of operations that the driver model core may call. - - -- Embed a struct device_driver in the bus-specific driver. - - Just like with devices, do something like:: - - struct pci_driver { - ... - struct device_driver driver; - }; - - -- Initialize the generic driver structure. - - When the driver registers with the bus (e.g. doing pci_register_driver()), - initialize the necessary fields of the driver: the name and bus - fields. - - -- Register the driver. 
- - After the generic driver has been initialized, call:: - - driver_register(&drv->driver); - - to register the driver with the core. - - When the driver is unregistered from the bus, unregister it from the - core by doing:: - - driver_unregister(&drv->driver); - - Note that this will block until all references to the driver have - gone away. Normally, there will not be any. - - -- Sysfs representation. - - Drivers are exported via sysfs in their bus's 'driver's directory. - For example:: - - /sys/bus/pci/drivers/ - |-- 3c59x - |-- Ensoniq AudioPCI - |-- agpgart-amdk7 - |-- e100 - `-- serial - - -Step 4: Define Generic Methods for Drivers. - -struct device_driver defines a set of operations that the driver model -core calls. Most of these operations are probably similar to -operations the bus already defines for drivers, but taking different -parameters. - -It would be difficult and tedious to force every driver on a bus to -simultaneously convert their drivers to generic format. Instead, the -bus driver should define single instances of the generic methods that -forward call to the bus-specific drivers. For instance:: - - - static int pci_device_remove(struct device * dev) - { - struct pci_dev * pci_dev = to_pci_dev(dev); - struct pci_driver * drv = pci_dev->driver; - - if (drv) { - if (drv->remove) - drv->remove(pci_dev); - pci_dev->driver = NULL; - } - return 0; - } - - -The generic driver should be initialized with these methods before it -is registered:: - - /* initialize common driver fields */ - drv->driver.name = drv->name; - drv->driver.bus = &pci_bus_type; - drv->driver.probe = pci_device_probe; - drv->driver.resume = pci_device_resume; - drv->driver.suspend = pci_device_suspend; - drv->driver.remove = pci_device_remove; - - /* register with core */ - driver_register(&drv->driver); - - -Ideally, the bus should only initialize the fields if they are not -already set. This allows the drivers to implement their own generic -methods. 
- - -Step 5: Support generic driver binding. - -The model assumes that a device or driver can be dynamically -registered with the bus at any time. When registration happens, -devices must be bound to a driver, or drivers must be bound to all -devices that it supports. - -A driver typically contains a list of device IDs that it supports. The -bus driver compares these IDs to the IDs of devices registered with it. -The format of the device IDs, and the semantics for comparing them are -bus-specific, so the generic model does attempt to generalize them. - -Instead, a bus may supply a method in struct bus_type that does the -comparison:: - - int (*match)(struct device * dev, struct device_driver * drv); - -match should return positive value if the driver supports the device, -and zero otherwise. It may also return error code (for example --EPROBE_DEFER) if determining that given driver supports the device is -not possible. - -When a device is registered, the bus's list of drivers is iterated -over. bus->match() is called for each one until a match is found. - -When a driver is registered, the bus's list of devices is iterated -over. bus->match() is called for each device that is not already -claimed by a driver. - -When a device is successfully bound to a driver, device->driver is -set, the device is added to a per-driver list of devices, and a -symlink is created in the driver's sysfs directory that points to the -device's physical directory:: - - /sys/bus/pci/drivers/ - |-- 3c59x - | `-- 00:0b.0 -> ../../../../devices/pci0/00:0b.0 - |-- Ensoniq AudioPCI - |-- agpgart-amdk7 - | `-- 00:00.0 -> ../../../../devices/pci0/00:00.0 - |-- e100 - | `-- 00:0c.0 -> ../../../../devices/pci0/00:0c.0 - `-- serial - - -This driver binding should replace the existing driver binding -mechanism the bus currently uses. - - -Step 6: Supply a hotplug callback. - -Whenever a device is registered with the driver model core, the -userspace program /sbin/hotplug is called to notify userspace. 
-Users can define actions to perform when a device is inserted or -removed. - -The driver model core passes several arguments to userspace via -environment variables, including - -- ACTION: set to 'add' or 'remove' -- DEVPATH: set to the device's physical path in sysfs. - -A bus driver may also supply additional parameters for userspace to -consume. To do this, a bus must implement the 'hotplug' method in -struct bus_type:: - - int (*hotplug) (struct device *dev, char **envp, - int num_envp, char *buffer, int buffer_size); - -This is called immediately before /sbin/hotplug is executed. - - -Step 7: Cleaning up the bus driver. - -The generic bus, device, and driver structures provide several fields -that can replace those defined privately to the bus driver. - -- Device list. - -struct bus_type contains a list of all devices registered with the bus -type. This includes all devices on all instances of that bus type. -An internal list that the bus uses may be removed, in favor of using -this one. - -The core provides an iterator to access these devices:: - - int bus_for_each_dev(struct bus_type * bus, struct device * start, - void * data, int (*fn)(struct device *, void *)); - - -- Driver list. - -struct bus_type also contains a list of all drivers registered with -it. An internal list of drivers that the bus driver maintains may -be removed in favor of using the generic one. - -The drivers may be iterated over, like devices:: - - int bus_for_each_drv(struct bus_type * bus, struct device_driver * start, - void * data, int (*fn)(struct device_driver *, void *)); - - -Please see drivers/base/bus.c for more information. - - -- rwsem - -struct bus_type contains an rwsem that protects all core accesses to -the device and driver lists. This can be used by the bus driver -internally, and should be used when accessing the device or driver -lists the bus maintains. - - -- Device and driver fields. 
- -Some of the fields in struct device and struct device_driver duplicate -fields in the bus-specific representations of these objects. Feel free -to remove the bus-specific ones and favor the generic ones. Note -though, that this will likely mean fixing up all the drivers that -reference the bus-specific fields (though those should all be 1-line -changes). diff --git a/Documentation/eisa.txt b/Documentation/eisa.txt index f388545a85a7..c07565ba57da 100644 --- a/Documentation/eisa.txt +++ b/Documentation/eisa.txt @@ -103,7 +103,7 @@ id_table an array of NULL terminated EISA id strings, (driver_data). driver a generic driver, such as described in - Documentation/driver-model/driver.rst. Only .name, + Documentation/driver-api/driver-model/driver.rst. Only .name, .probe and .remove members are mandatory. =============== ==================================================== @@ -152,7 +152,7 @@ state set of flags indicating the state of the device. Current flags are EISA_CONFIG_ENABLED and EISA_CONFIG_FORCED. res set of four 256 bytes I/O regions allocated to this device dma_mask DMA mask set from the parent device. -dev generic device (see Documentation/driver-model/device.rst) +dev generic device (see Documentation/driver-api/driver-model/device.rst) ======== ============================================================ You can get the 'struct eisa_device' from 'struct device' using the diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt index 5b5311f9358d..ddf15b1b0d5a 100644 --- a/Documentation/filesystems/sysfs.txt +++ b/Documentation/filesystems/sysfs.txt @@ -319,7 +319,7 @@ quick way to lookup the sysfs interface for a device from the result of a stat(2) operation. More information can driver-model specific features can be found in -Documentation/driver-model/. +Documentation/driver-api/driver-model/. TODO: Finish this section. 
diff --git a/Documentation/hwmon/submitting-patches.rst b/Documentation/hwmon/submitting-patches.rst index d5b05d3e54ba..452fc28d8e0b 100644 --- a/Documentation/hwmon/submitting-patches.rst +++ b/Documentation/hwmon/submitting-patches.rst @@ -89,7 +89,7 @@ increase the chances of your change being accepted. console. Excessive logging can seriously affect system performance. * Use devres functions whenever possible to allocate resources. For rationale - and supported functions, please see Documentation/driver-model/devres.rst. + and supported functions, please see Documentation/driver-api/driver-model/devres.rst. If a function is not supported by devres, consider using devm_add_action(). * If the driver has a detect function, make sure it is silent. Debug messages diff --git a/Documentation/translations/zh_CN/filesystems/sysfs.txt b/Documentation/translations/zh_CN/filesystems/sysfs.txt index 452271dda141..ee1f37da5b23 100644 --- a/Documentation/translations/zh_CN/filesystems/sysfs.txt +++ b/Documentation/translations/zh_CN/filesystems/sysfs.txt @@ -288,7 +288,7 @@ dev/ 包含两个子目录: char/ 和 block/。在这两个子目录中,有 中相应的设备。/sys/dev 提供一个通过一个 stat(2) 操作结果,查找 设备 sysfs 接口快捷的方法。 -更多有关 driver-model 的特性信息可以在 Documentation/driver-model/ +更多有关 driver-model 的特性信息可以在 Documentation/driver-api/driver-model/ 中找到。 diff --git a/drivers/base/platform.c b/drivers/base/platform.c index 713903290385..506a0175a5a7 100644 --- a/drivers/base/platform.c +++ b/drivers/base/platform.c @@ -5,7 +5,7 @@ * Copyright (c) 2002-3 Patrick Mochel * Copyright (c) 2002-3 Open Source Development Labs * - * Please see Documentation/driver-model/platform.rst for more + * Please see Documentation/driver-api/driver-model/platform.rst for more * information. 
*/ diff --git a/drivers/gpio/gpio-cs5535.c b/drivers/gpio/gpio-cs5535.c index 3611a0571667..53b24e3ae7de 100644 --- a/drivers/gpio/gpio-cs5535.c +++ b/drivers/gpio/gpio-cs5535.c @@ -41,7 +41,7 @@ MODULE_PARM_DESC(mask, "GPIO channel mask."); /* * FIXME: convert this singleton driver to use the state container - * design pattern, see Documentation/driver-model/design-patterns.rst + * design pattern, see Documentation/driver-api/driver-model/design-patterns.rst */ static struct cs5535_gpio_chip { struct gpio_chip chip; diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c index 41c90f2ddb31..63db08d9bafa 100644 --- a/drivers/net/ethernet/intel/ice/ice_main.c +++ b/drivers/net/ethernet/intel/ice/ice_main.c @@ -2286,7 +2286,7 @@ ice_probe(struct pci_dev *pdev, const struct pci_device_id __always_unused *ent) struct ice_hw *hw; int err; - /* this driver uses devres, see Documentation/driver-model/devres.rst */ + /* this driver uses devres, see Documentation/driver-api/driver-model/devres.rst */ err = pcim_enable_device(pdev); if (err) return err; diff --git a/drivers/staging/unisys/Documentation/overview.txt b/drivers/staging/unisys/Documentation/overview.txt index 9ab30af265a5..f8a4144b239c 100644 --- a/drivers/staging/unisys/Documentation/overview.txt +++ b/drivers/staging/unisys/Documentation/overview.txt @@ -15,7 +15,7 @@ normally be unsharable, specifically: * visorinput - keyboard and mouse These drivers conform to the standard Linux bus/device model described -within Documentation/driver-model/, and utilize a driver named visorbus to +within Documentation/driver-api/driver-model/, and utilize a driver named visorbus to present the virtual busses involved. Drivers in the 'visor*' driver set are commonly referred to as "guest drivers" or "client drivers". 
All drivers except visorbus expose a device of a specific usable class to the Linux guest @@ -141,7 +141,7 @@ called automatically by the visorbus driver at appropriate times: ----------------------------------- Because visorbus is a standard Linux bus driver in the model described in -Documentation/driver-model/, the hierarchy of s-Par virtual devices is +Documentation/driver-api/driver-model/, the hierarchy of s-Par virtual devices is published in the sysfs tree beneath /bus/visorbus/, e.g., /sys/bus/visorbus/devices/ might look like: diff --git a/include/linux/device.h b/include/linux/device.h index 5eabfa0c4dee..c330b75c6c57 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -6,7 +6,7 @@ * Copyright (c) 2004-2009 Greg Kroah-Hartman * Copyright (c) 2008-2009 Novell Inc. * - * See Documentation/driver-model/ for more information. + * See Documentation/driver-api/driver-model/ for more information. */ #ifndef _DEVICE_H_ diff --git a/include/linux/platform_device.h b/include/linux/platform_device.h index beb25f277889..9bc36b589827 100644 --- a/include/linux/platform_device.h +++ b/include/linux/platform_device.h @@ -4,7 +4,7 @@ * * Copyright (c) 2001-2003 Patrick Mochel * - * See Documentation/driver-model/ for more information. + * See Documentation/driver-api/driver-model/ for more information. */ #ifndef _PLATFORM_DEVICE_H_ diff --git a/scripts/coccinelle/free/devm_free.cocci b/scripts/coccinelle/free/devm_free.cocci index fefd0331a2de..441799b5359b 100644 --- a/scripts/coccinelle/free/devm_free.cocci +++ b/scripts/coccinelle/free/devm_free.cocci @@ -3,7 +3,7 @@ /// functions. Values allocated using the devm_functions are freed when /// the device is detached, and thus the use of the standard freeing /// function would cause a double free. -/// See Documentation/driver-model/devres.rst for more information. +/// See Documentation/driver-api/driver-model/devres.rst for more information. 
/// /// A difficulty of detecting this problem is that the standard freeing /// function might be called from a different function than the one -- cgit v1.2.3 From baa293e9544bea71361950d071579f0e4d5713ed Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 27 Jun 2019 15:39:22 -0300 Subject: docs: driver-api: add a series of orphaned documents There are lots of documents under Documentation/*.txt and a few other orphan documents elsehwere that belong to the driver-API book. Move them to their right place. Reviewed-by: Cornelia Huck # vfio-related parts Acked-by: Logan Gunthorpe # switchtec Signed-off-by: Mauro Carvalho Chehab --- Documentation/ABI/removed/sysfs-class-rfkill | 2 +- Documentation/ABI/stable/sysfs-class-rfkill | 2 +- Documentation/ABI/testing/sysfs-class-switchtec | 2 +- Documentation/EDID/howto.rst | 58 - Documentation/SM501.txt | 74 - Documentation/admin-guide/kernel-parameters.txt | 2 +- .../admin-guide/laptops/thinkpad-acpi.rst | 6 +- Documentation/bt8xxgpio.txt | 62 - Documentation/connector/connector.rst | 156 -- Documentation/console/console.rst | 152 -- Documentation/dcdbas.txt | 99 -- Documentation/dell_rbu.txt | 128 -- Documentation/driver-api/bt8xxgpio.rst | 62 + Documentation/driver-api/connector.rst | 156 ++ Documentation/driver-api/console.rst | 152 ++ Documentation/driver-api/dcdbas.rst | 99 ++ Documentation/driver-api/dell_rbu.rst | 128 ++ Documentation/driver-api/edid.rst | 58 + Documentation/driver-api/eisa.rst | 230 +++ Documentation/driver-api/index.rst | 26 + Documentation/driver-api/isa.rst | 122 ++ Documentation/driver-api/isapnp.rst | 15 + Documentation/driver-api/lightnvm-pblk.rst | 21 + Documentation/driver-api/men-chameleon-bus.rst | 175 ++ Documentation/driver-api/ntb.rst | 236 +++ Documentation/driver-api/nvmem.rst | 189 ++ Documentation/driver-api/parport-lowlevel.rst | 1832 ++++++++++++++++++++ Documentation/driver-api/pti_intel_mid.rst | 106 ++ Documentation/driver-api/pwm.rst | 165 ++ 
Documentation/driver-api/rfkill.rst | 132 ++ Documentation/driver-api/sgi-ioc4.rst | 49 + Documentation/driver-api/sm501.rst | 74 + Documentation/driver-api/smsc_ece1099.rst | 60 + Documentation/driver-api/switchtec.rst | 102 ++ Documentation/driver-api/sync_file.rst | 86 + Documentation/driver-api/vfio-mediated-device.rst | 414 +++++ Documentation/driver-api/vfio.rst | 520 ++++++ Documentation/driver-api/xillybus.rst | 379 ++++ Documentation/driver-api/zorro.rst | 104 ++ Documentation/eisa.txt | 230 --- Documentation/fb/fbcon.rst | 4 +- Documentation/isa.txt | 122 -- Documentation/isapnp.txt | 15 - Documentation/lightnvm/pblk.txt | 21 - Documentation/men-chameleon-bus.txt | 175 -- Documentation/ntb.txt | 236 --- Documentation/nvmem/nvmem.rst | 189 -- Documentation/parport-lowlevel.txt | 1832 -------------------- Documentation/pti/pti_intel_mid.rst | 106 -- Documentation/pwm.txt | 165 -- Documentation/rfkill.txt | 132 -- Documentation/s390/vfio-ccw.rst | 6 +- Documentation/sgi-ioc4.txt | 49 - Documentation/smsc_ece1099.txt | 60 - Documentation/switchtec.txt | 102 -- Documentation/sync_file.txt | 86 - Documentation/vfio-mediated-device.txt | 414 ----- Documentation/vfio.txt | 520 ------ Documentation/w1/w1.netlink | 2 +- Documentation/xillybus.txt | 379 ---- Documentation/zorro.txt | 104 -- MAINTAINERS | 22 +- drivers/dma-buf/Kconfig | 2 +- drivers/gpio/Kconfig | 2 +- drivers/gpu/drm/Kconfig | 2 +- drivers/pci/switch/Kconfig | 2 +- drivers/platform/x86/Kconfig | 4 +- drivers/platform/x86/dcdbas.c | 2 +- drivers/platform/x86/dell_rbu.c | 2 +- drivers/pnp/isapnp/Kconfig | 2 +- drivers/tty/Kconfig | 2 +- drivers/vfio/Kconfig | 2 +- drivers/vfio/mdev/Kconfig | 2 +- drivers/w1/Kconfig | 2 +- samples/Kconfig | 2 +- 75 files changed, 5730 insertions(+), 5704 deletions(-) delete mode 100644 Documentation/EDID/howto.rst delete mode 100644 Documentation/SM501.txt delete mode 100644 Documentation/bt8xxgpio.txt delete mode 100644 Documentation/connector/connector.rst delete 
mode 100644 Documentation/console/console.rst delete mode 100644 Documentation/dcdbas.txt delete mode 100644 Documentation/dell_rbu.txt create mode 100644 Documentation/driver-api/bt8xxgpio.rst create mode 100644 Documentation/driver-api/connector.rst create mode 100644 Documentation/driver-api/console.rst create mode 100644 Documentation/driver-api/dcdbas.rst create mode 100644 Documentation/driver-api/dell_rbu.rst create mode 100644 Documentation/driver-api/edid.rst create mode 100644 Documentation/driver-api/eisa.rst create mode 100644 Documentation/driver-api/isa.rst create mode 100644 Documentation/driver-api/isapnp.rst create mode 100644 Documentation/driver-api/lightnvm-pblk.rst create mode 100644 Documentation/driver-api/men-chameleon-bus.rst create mode 100644 Documentation/driver-api/ntb.rst create mode 100644 Documentation/driver-api/nvmem.rst create mode 100644 Documentation/driver-api/parport-lowlevel.rst create mode 100644 Documentation/driver-api/pti_intel_mid.rst create mode 100644 Documentation/driver-api/pwm.rst create mode 100644 Documentation/driver-api/rfkill.rst create mode 100644 Documentation/driver-api/sgi-ioc4.rst create mode 100644 Documentation/driver-api/sm501.rst create mode 100644 Documentation/driver-api/smsc_ece1099.rst create mode 100644 Documentation/driver-api/switchtec.rst create mode 100644 Documentation/driver-api/sync_file.rst create mode 100644 Documentation/driver-api/vfio-mediated-device.rst create mode 100644 Documentation/driver-api/vfio.rst create mode 100644 Documentation/driver-api/xillybus.rst create mode 100644 Documentation/driver-api/zorro.rst delete mode 100644 Documentation/eisa.txt delete mode 100644 Documentation/isa.txt delete mode 100644 Documentation/isapnp.txt delete mode 100644 Documentation/lightnvm/pblk.txt delete mode 100644 Documentation/men-chameleon-bus.txt delete mode 100644 Documentation/ntb.txt delete mode 100644 Documentation/nvmem/nvmem.rst delete mode 100644 Documentation/parport-lowlevel.txt 
delete mode 100644 Documentation/pti/pti_intel_mid.rst
delete mode 100644 Documentation/pwm.txt
delete mode 100644 Documentation/rfkill.txt
delete mode 100644 Documentation/sgi-ioc4.txt
delete mode 100644 Documentation/smsc_ece1099.txt
delete mode 100644 Documentation/switchtec.txt
delete mode 100644 Documentation/sync_file.txt
delete mode 100644 Documentation/vfio-mediated-device.txt
delete mode 100644 Documentation/vfio.txt
delete mode 100644 Documentation/xillybus.txt
delete mode 100644 Documentation/zorro.txt

(limited to 'Documentation/driver-api')

diff --git a/Documentation/ABI/removed/sysfs-class-rfkill b/Documentation/ABI/removed/sysfs-class-rfkill
index 3ce6231f20b2..9c08c7f98ffb 100644
--- a/Documentation/ABI/removed/sysfs-class-rfkill
+++ b/Documentation/ABI/removed/sysfs-class-rfkill
@@ -1,6 +1,6 @@
 rfkill - radio frequency (RF) connector kill switch support
 
-For details to this subsystem look at Documentation/rfkill.txt.
+For details to this subsystem look at Documentation/driver-api/rfkill.rst.
 
 What: /sys/class/rfkill/rfkill[0-9]+/claim
 Date: 09-Jul-2007
diff --git a/Documentation/ABI/stable/sysfs-class-rfkill b/Documentation/ABI/stable/sysfs-class-rfkill
index 80151a409d67..5b154f922643 100644
--- a/Documentation/ABI/stable/sysfs-class-rfkill
+++ b/Documentation/ABI/stable/sysfs-class-rfkill
@@ -1,6 +1,6 @@
 rfkill - radio frequency (RF) connector kill switch support
 
-For details to this subsystem look at Documentation/rfkill.txt.
+For details to this subsystem look at Documentation/driver-api/rfkill.rst.
 
 For the deprecated /sys/class/rfkill/*/claim knobs of this interface look in
 Documentation/ABI/removed/sysfs-class-rfkill.
diff --git a/Documentation/ABI/testing/sysfs-class-switchtec b/Documentation/ABI/testing/sysfs-class-switchtec
index 48cb4c15e430..76c7a661a595 100644
--- a/Documentation/ABI/testing/sysfs-class-switchtec
+++ b/Documentation/ABI/testing/sysfs-class-switchtec
@@ -1,6 +1,6 @@
 switchtec - Microsemi Switchtec PCI Switch Management Endpoint
 
-For details on this subsystem look at Documentation/switchtec.txt.
+For details on this subsystem look at Documentation/driver-api/switchtec.rst.
 
 What: /sys/class/switchtec
 Date: 05-Jan-2017
diff --git a/Documentation/EDID/howto.rst b/Documentation/EDID/howto.rst
deleted file mode 100644
index 725fd49a88ca..000000000000
--- a/Documentation/EDID/howto.rst
+++ /dev/null
@@ -1,58 +0,0 @@
-:orphan:
-
-====
-EDID
-====
-
-In the good old days when graphics parameters were configured explicitly
-in a file called xorg.conf, even broken hardware could be managed.
-
-Today, with the advent of Kernel Mode Setting, a graphics board is
-either correctly working because all components follow the standards -
-or the computer is unusable, because the screen remains dark after
-booting or it displays the wrong area. Cases when this happens are:
-- The graphics board does not recognize the monitor.
-- The graphics board is unable to detect any EDID data.
-- The graphics board incorrectly forwards EDID data to the driver.
-- The monitor sends no or bogus EDID data.
-- A KVM sends its own EDID data instead of querying the connected monitor.
-Adding the kernel parameter "nomodeset" helps in most cases, but causes
-restrictions later on.
-
-As a remedy for such situations, the kernel configuration item
-CONFIG_DRM_LOAD_EDID_FIRMWARE was introduced. It allows to provide an
-individually prepared or corrected EDID data set in the /lib/firmware
-directory from where it is loaded via the firmware interface.
The code -(see drivers/gpu/drm/drm_edid_load.c) contains built-in data sets for -commonly used screen resolutions (800x600, 1024x768, 1280x1024, 1600x1200, -1680x1050, 1920x1080) as binary blobs, but the kernel source tree does -not contain code to create these data. In order to elucidate the origin -of the built-in binary EDID blobs and to facilitate the creation of -individual data for a specific misbehaving monitor, commented sources -and a Makefile environment are given here. - -To create binary EDID and C source code files from the existing data -material, simply type "make". - -If you want to create your own EDID file, copy the file 1024x768.S, -replace the settings with your own data and add a new target to the -Makefile. Please note that the EDID data structure expects the timing -values in a different way as compared to the standard X11 format. - -X11: - HTimings: - hdisp hsyncstart hsyncend htotal - VTimings: - vdisp vsyncstart vsyncend vtotal - -EDID:: - - #define XPIX hdisp - #define XBLANK htotal-hdisp - #define XOFFSET hsyncstart-hdisp - #define XPULSE hsyncend-hsyncstart - - #define YPIX vdisp - #define YBLANK vtotal-vdisp - #define YOFFSET vsyncstart-vdisp - #define YPULSE vsyncend-vsyncstart diff --git a/Documentation/SM501.txt b/Documentation/SM501.txt deleted file mode 100644 index 882507453ba4..000000000000 --- a/Documentation/SM501.txt +++ /dev/null @@ -1,74 +0,0 @@ -.. include:: - -============ -SM501 Driver -============ - -:Copyright: |copy| 2006, 2007 Simtec Electronics - -The Silicon Motion SM501 multimedia companion chip is a multifunction device -which may provide numerous interfaces including USB host controller USB gadget, -asynchronous serial ports, audio functions, and a dual display video interface. -The device may be connected by PCI or local bus with varying functions enabled. - -Core ----- - -The core driver in drivers/mfd provides common services for the -drivers which manage the specific hardware blocks. 
These services -include locking for common registers, clock control and resource -management. - -The core registers drivers for both PCI and generic bus based -chips via the platform device and driver system. - -On detection of a device, the core initialises the chip (which may -be specified by the platform data) and then exports the selected -peripheral set as platform devices for the specific drivers. - -The core re-uses the platform device system as the platform device -system provides enough features to support the drivers without the -need to create a new bus-type and the associated code to go with it. - - -Resources ---------- - -Each peripheral has a view of the device which is implicitly narrowed to -the specific set of resources that peripheral requires in order to -function correctly. - -The centralised memory allocation allows the driver to ensure that the -maximum possible resource allocation can be made to the video subsystem -as this is by-far the most resource-sensitive of the on-chip functions. - -The primary issue with memory allocation is that of moving the video -buffers once a display mode is chosen. Indeed when a video mode change -occurs the memory footprint of the video subsystem changes. - -Since video memory is difficult to move without changing the display -(unless sufficient contiguous memory can be provided for the old and new -modes simultaneously) the video driver fully utilises the memory area -given to it by aligning fb0 to the start of the area and fb1 to the end -of it. Any memory left over in the middle is used for the acceleration -functions, which are transient and thus their location is less critical -as it can be moved. - - -Configuration -------------- - -The platform device driver uses a set of platform data to pass -configurations through to the core and the subsidiary drivers -so that there can be support for more than one system carrying -an SM501 built into a single kernel image. 
- -The PCI driver assumes that the PCI card behaves as per the Silicon -Motion reference design. - -There is an errata (AB-5) affecting the selection of the -of the M1XCLK and M1CLK frequencies. These two clocks -must be sourced from the same PLL, although they can then -be divided down individually. If this is not set, then SM501 may -lock and hang the whole system. The driver will refuse to -attach if the PLL selection is different. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 19b1e3bef56c..04f7b537ee51 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -930,7 +930,7 @@ edid/1680x1050.bin, or edid/1920x1080.bin is given and no file with the same name exists. Details and instructions how to build your own EDID data are - available in Documentation/EDID/howto.rst. An EDID + available in Documentation/driver-api/edid.rst. An EDID data set will only be used for a particular connector, if its name and a colon are prepended to the EDID name. Each connector may use a unique EDID data diff --git a/Documentation/admin-guide/laptops/thinkpad-acpi.rst b/Documentation/admin-guide/laptops/thinkpad-acpi.rst index 19d52fc3c5e9..adea0bf2acc5 100644 --- a/Documentation/admin-guide/laptops/thinkpad-acpi.rst +++ b/Documentation/admin-guide/laptops/thinkpad-acpi.rst @@ -643,7 +643,7 @@ Sysfs notes 2010. rfkill controller switch "tpacpi_bluetooth_sw": refer to - Documentation/rfkill.txt for details. + Documentation/driver-api/rfkill.rst for details. Video output control -- /proc/acpi/ibm/video @@ -1406,7 +1406,7 @@ Sysfs notes 2010. rfkill controller switch "tpacpi_wwan_sw": refer to - Documentation/rfkill.txt for details. + Documentation/driver-api/rfkill.rst for details. EXPERIMENTAL: UWB @@ -1426,7 +1426,7 @@ Sysfs notes ^^^^^^^^^^^ rfkill controller switch "tpacpi_uwb_sw": refer to - Documentation/rfkill.txt for details. 
+ Documentation/driver-api/rfkill.rst for details. Adaptive keyboard ----------------- diff --git a/Documentation/bt8xxgpio.txt b/Documentation/bt8xxgpio.txt deleted file mode 100644 index a845feb074de..000000000000 --- a/Documentation/bt8xxgpio.txt +++ /dev/null @@ -1,62 +0,0 @@ -=================================================================== -A driver for a selfmade cheap BT8xx based PCI GPIO-card (bt8xxgpio) -=================================================================== - -For advanced documentation, see http://www.bu3sch.de/btgpio.php - -A generic digital 24-port PCI GPIO card can be built out of an ordinary -Brooktree bt848, bt849, bt878 or bt879 based analog TV tuner card. The -Brooktree chip is used in old analog Hauppauge WinTV PCI cards. You can easily -find them used for low prices on the net. - -The bt8xx chip does have 24 digital GPIO ports. -These ports are accessible via 24 pins on the SMD chip package. - - -How to physically access the GPIO pins -====================================== - -The are several ways to access these pins. One might unsolder the whole chip -and put it on a custom PCI board, or one might only unsolder each individual -GPIO pin and solder that to some tiny wire. As the chip package really is tiny -there are some advanced soldering skills needed in any case. - -The physical pinouts are drawn in the following ASCII art. 
-The GPIO pins are marked with G00-G23:: - - G G G G G G G G G G G G G G G G G G - 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 - 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 - | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - --------------------------------------------------------------------------- - --| ^ ^ |-- - --| pin 86 pin 67 |-- - --| |-- - --| pin 61 > |-- G18 - --| |-- G19 - --| |-- G20 - --| |-- G21 - --| |-- G22 - --| pin 56 > |-- G23 - --| |-- - --| Brooktree 878/879 |-- - --| |-- - --| |-- - --| |-- - --| |-- - --| |-- - --| |-- - --| |-- - --| |-- - --| |-- - --| |-- - --| |-- - --| |-- - --| |-- - --| O |-- - --| |-- - --------------------------------------------------------------------------- - | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - ^ - This is pin 1 - diff --git a/Documentation/connector/connector.rst b/Documentation/connector/connector.rst deleted file mode 100644 index 24e26dc22dbf..000000000000 --- a/Documentation/connector/connector.rst +++ /dev/null @@ -1,156 +0,0 @@ -:orphan: - -================ -Kernel Connector -================ - -Kernel connector - new netlink based userspace <-> kernel space easy -to use communication module. - -The Connector driver makes it easy to connect various agents using a -netlink based network. One must register a callback and an identifier. -When the driver receives a special netlink message with the appropriate -identifier, the appropriate callback will be called. - -From the userspace point of view it's quite straightforward: - - - socket(); - - bind(); - - send(); - - recv(); - -But if kernelspace wants to use the full power of such connections, the -driver writer must create special sockets, must know about struct sk_buff -handling, etc... 
The Connector driver allows any kernelspace agents to use -netlink based networking for inter-process communication in a significantly -easier way:: - - int cn_add_callback(struct cb_id *id, char *name, void (*callback) (struct cn_msg *, struct netlink_skb_parms *)); - void cn_netlink_send_multi(struct cn_msg *msg, u16 len, u32 portid, u32 __group, int gfp_mask); - void cn_netlink_send(struct cn_msg *msg, u32 portid, u32 __group, int gfp_mask); - - struct cb_id - { - __u32 idx; - __u32 val; - }; - -idx and val are unique identifiers which must be registered in the -connector.h header for in-kernel usage. `void (*callback) (void *)` is a -callback function which will be called when a message with above idx.val -is received by the connector core. The argument for that function must -be dereferenced to `struct cn_msg *`:: - - struct cn_msg - { - struct cb_id id; - - __u32 seq; - __u32 ack; - - __u32 len; /* Length of the following data */ - __u8 data[0]; - }; - -Connector interfaces -==================== - - .. kernel-doc:: include/linux/connector.h - - Note: - When registering new callback user, connector core assigns - netlink group to the user which is equal to its id.idx. - -Protocol description -==================== - -The current framework offers a transport layer with fixed headers. The -recommended protocol which uses such a header is as following: - -msg->seq and msg->ack are used to determine message genealogy. When -someone sends a message, they use a locally unique sequence and random -acknowledge number. The sequence number may be copied into -nlmsghdr->nlmsg_seq too. - -The sequence number is incremented with each message sent. - -If you expect a reply to the message, then the sequence number in the -received message MUST be the same as in the original message, and the -acknowledge number MUST be the same + 1. - -If we receive a message and its sequence number is not equal to one we -are expecting, then it is a new message. 
If we receive a message and -its sequence number is the same as one we are expecting, but its -acknowledge is not equal to the sequence number in the original -message + 1, then it is a new message. - -Obviously, the protocol header contains the above id. - -The connector allows event notification in the following form: kernel -driver or userspace process can ask connector to notify it when -selected ids will be turned on or off (registered or unregistered its -callback). It is done by sending a special command to the connector -driver (it also registers itself with id={-1, -1}). - -As example of this usage can be found in the cn_test.c module which -uses the connector to request notification and to send messages. - -Reliability -=========== - -Netlink itself is not a reliable protocol. That means that messages can -be lost due to memory pressure or process' receiving queue overflowed, -so caller is warned that it must be prepared. That is why the struct -cn_msg [main connector's message header] contains u32 seq and u32 ack -fields. - -Userspace usage -=============== - -2.6.14 has a new netlink socket implementation, which by default does not -allow people to send data to netlink groups other than 1. -So, if you wish to use a netlink socket (for example using connector) -with a different group number, the userspace application must subscribe to -that group first. It can be achieved by the following pseudocode:: - - s = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR); - - l_local.nl_family = AF_NETLINK; - l_local.nl_groups = 12345; - l_local.nl_pid = 0; - - if (bind(s, (struct sockaddr *)&l_local, sizeof(struct sockaddr_nl)) == -1) { - perror("bind"); - close(s); - return -1; - } - - { - int on = l_local.nl_groups; - setsockopt(s, 270, 1, &on, sizeof(on)); - } - -Where 270 above is SOL_NETLINK, and 1 is a NETLINK_ADD_MEMBERSHIP socket -option. 
To drop a multicast subscription, one should call the above socket -option with the NETLINK_DROP_MEMBERSHIP parameter which is defined as 0. - -2.6.14 netlink code only allows to select a group which is less or equal to -the maximum group number, which is used at netlink_kernel_create() time. -In case of connector it is CN_NETLINK_USERS + 0xf, so if you want to use -group number 12345, you must increment CN_NETLINK_USERS to that number. -Additional 0xf numbers are allocated to be used by non-in-kernel users. - -Due to this limitation, group 0xffffffff does not work now, so one can -not use add/remove connector's group notifications, but as far as I know, -only cn_test.c test module used it. - -Some work in netlink area is still being done, so things can be changed in -2.6.15 timeframe, if it will happen, documentation will be updated for that -kernel. - -Code samples -============ - -Sample code for a connector test module and user space can be found -in samples/connector/. To build this code, enable CONFIG_CONNECTOR -and CONFIG_SAMPLES. diff --git a/Documentation/console/console.rst b/Documentation/console/console.rst deleted file mode 100644 index b374141b027e..000000000000 --- a/Documentation/console/console.rst +++ /dev/null @@ -1,152 +0,0 @@ -:orphan: - -=============== -Console Drivers -=============== - -The Linux kernel has 2 general types of console drivers. The first type is -assigned by the kernel to all the virtual consoles during the boot process. -This type will be called 'system driver', and only one system driver is allowed -to exist. The system driver is persistent and it can never be unloaded, though -it may become inactive. - -The second type has to be explicitly loaded and unloaded. This will be called -'modular driver' by this document. Multiple modular drivers can coexist at -any time with each driver sharing the console with other drivers including -the system driver. 
However, modular drivers cannot take over the console -that is currently occupied by another modular driver. (Exception: Drivers that -call do_take_over_console() will succeed in the takeover regardless of the type -of driver occupying the consoles.) They can only take over the console that is -occupied by the system driver. In the same token, if the modular driver is -released by the console, the system driver will take over. - -Modular drivers, from the programmer's point of view, have to call:: - - do_take_over_console() - load and bind driver to console layer - give_up_console() - unload driver; it will only work if driver - is fully unbound - -In newer kernels, the following are also available:: - - do_register_con_driver() - do_unregister_con_driver() - -If sysfs is enabled, the contents of /sys/class/vtconsole can be -examined. This shows the console backends currently registered by the -system which are named vtcon where is an integer from 0 to 15. -Thus:: - - ls /sys/class/vtconsole - . .. vtcon0 vtcon1 - -Each directory in /sys/class/vtconsole has 3 files:: - - ls /sys/class/vtconsole/vtcon0 - . .. bind name uevent - -What do these files signify? - - 1. bind - this is a read/write file. It shows the status of the driver if - read, or acts to bind or unbind the driver to the virtual consoles - when written to. The possible values are: - - 0 - - means the driver is not bound and if echo'ed, commands the driver - to unbind - - 1 - - means the driver is bound and if echo'ed, commands the driver to - bind - - 2. name - read-only file. Shows the name of the driver in this format:: - - cat /sys/class/vtconsole/vtcon0/name - (S) VGA+ - - '(S)' stands for a (S)ystem driver, i.e., it cannot be directly - commanded to bind or unbind - - 'VGA+' is the name of the driver - - cat /sys/class/vtconsole/vtcon1/name - (M) frame buffer device - - In this case, '(M)' stands for a (M)odular driver, one that can be - directly commanded to bind or unbind. - - 3. 
uevent - ignore this file - -When unbinding, the modular driver is detached first, and then the system -driver takes over the consoles vacated by the driver. Binding, on the other -hand, will bind the driver to the consoles that are currently occupied by a -system driver. - -NOTE1: - Binding and unbinding must be selected in Kconfig. It's under:: - - Device Drivers -> - Character devices -> - Support for binding and unbinding console drivers - -NOTE2: - If any of the virtual consoles are in KD_GRAPHICS mode, then binding or - unbinding will not succeed. An example of an application that sets the - console to KD_GRAPHICS is X. - -How useful is this feature? This is very useful for console driver -developers. By unbinding the driver from the console layer, one can unload the -driver, make changes, recompile, reload and rebind the driver without any need -for rebooting the kernel. For regular users who may want to switch from -framebuffer console to VGA console and vice versa, this feature also makes -this possible. (NOTE NOTE NOTE: Please read fbcon.txt under Documentation/fb -for more details.) - -Notes for developers -==================== - -do_take_over_console() is now broken up into:: - - do_register_con_driver() - do_bind_con_driver() - private function - -give_up_console() is a wrapper to do_unregister_con_driver(), and a driver must -be fully unbound for this call to succeed. con_is_bound() will check if the -driver is bound or not. - -Guidelines for console driver writers -===================================== - -In order for binding to and unbinding from the console to properly work, -console drivers must follow these guidelines: - -1. All drivers, except system drivers, must call either do_register_con_driver() - or do_take_over_console(). do_register_con_driver() will just add the driver - to the console's internal list. It won't take over the - console. do_take_over_console(), as it name implies, will also take over (or - bind to) the console. - -2. 
All resources allocated during con->con_init() must be released in - con->con_deinit(). - -3. All resources allocated in con->con_startup() must be released when the - driver, which was previously bound, becomes unbound. The console layer - does not have a complementary call to con->con_startup() so it's up to the - driver to check when it's legal to release these resources. Calling - con_is_bound() in con->con_deinit() will help. If the call returned - false(), then it's safe to release the resources. This balance has to be - ensured because con->con_startup() can be called again when a request to - rebind the driver to the console arrives. - -4. Upon exit of the driver, ensure that the driver is totally unbound. If the - condition is satisfied, then the driver must call do_unregister_con_driver() - or give_up_console(). - -5. do_unregister_con_driver() can also be called on conditions which make it - impossible for the driver to service console requests. This can happen - with the framebuffer console that suddenly lost all of its drivers. - -The current crop of console drivers should still work correctly, but binding -and unbinding them may cause problems. With minimal fixes, these drivers can -be made to work correctly. - -Antonino Daplas diff --git a/Documentation/dcdbas.txt b/Documentation/dcdbas.txt deleted file mode 100644 index 309cc57a7c1c..000000000000 --- a/Documentation/dcdbas.txt +++ /dev/null @@ -1,99 +0,0 @@ -=================================== -Dell Systems Management Base Driver -=================================== - -Overview -======== - -The Dell Systems Management Base Driver provides a sysfs interface for -systems management software such as Dell OpenManage to perform system -management interrupts and host control actions (system power cycle or -power off after OS shutdown) on certain Dell systems. 
- -Dell OpenManage requires this driver on the following Dell PowerEdge systems: -300, 1300, 1400, 400SC, 500SC, 1500SC, 1550, 600SC, 1600SC, 650, 1655MC, -700, and 750. Other Dell software such as the open source libsmbios project -is expected to make use of this driver, and it may include the use of this -driver on other Dell systems. - -The Dell libsmbios project aims towards providing access to as much BIOS -information as possible. See http://linux.dell.com/libsmbios/main/ for -more information about the libsmbios project. - - -System Management Interrupt -=========================== - -On some Dell systems, systems management software must access certain -management information via a system management interrupt (SMI). The SMI data -buffer must reside in 32-bit address space, and the physical address of the -buffer is required for the SMI. The driver maintains the memory required for -the SMI and provides a way for the application to generate the SMI. -The driver creates the following sysfs entries for systems management -software to perform these system management interrupts:: - - /sys/devices/platform/dcdbas/smi_data - /sys/devices/platform/dcdbas/smi_data_buf_phys_addr - /sys/devices/platform/dcdbas/smi_data_buf_size - /sys/devices/platform/dcdbas/smi_request - -Systems management software must perform the following steps to execute -a SMI using this driver: - -1) Lock smi_data. -2) Write system management command to smi_data. -3) Write "1" to smi_request to generate a calling interface SMI or - "2" to generate a raw SMI. -4) Read system management command response from smi_data. -5) Unlock smi_data. - - -Host Control Action -=================== - -Dell OpenManage supports a host control feature that allows the administrator -to perform a power cycle or power off of the system after the OS has finished -shutting down. On some Dell systems, this host control feature requires that -a driver perform a SMI after the OS has finished shutting down. 
- -The driver creates the following sysfs entries for systems management software -to schedule the driver to perform a power cycle or power off host control -action after the system has finished shutting down: - -/sys/devices/platform/dcdbas/host_control_action -/sys/devices/platform/dcdbas/host_control_smi_type -/sys/devices/platform/dcdbas/host_control_on_shutdown - -Dell OpenManage performs the following steps to execute a power cycle or -power off host control action using this driver: - -1) Write host control action to be performed to host_control_action. -2) Write type of SMI that driver needs to perform to host_control_smi_type. -3) Write "1" to host_control_on_shutdown to enable host control action. -4) Initiate OS shutdown. - (Driver will perform host control SMI when it is notified that the OS - has finished shutting down.) - - -Host Control SMI Type -===================== - -The following table shows the value to write to host_control_smi_type to -perform a power cycle or power off host control action: - -=================== ===================== -PowerEdge System Host Control SMI Type -=================== ===================== - 300 HC_SMITYPE_TYPE1 - 1300 HC_SMITYPE_TYPE1 - 1400 HC_SMITYPE_TYPE2 - 500SC HC_SMITYPE_TYPE2 - 1500SC HC_SMITYPE_TYPE2 - 1550 HC_SMITYPE_TYPE2 - 600SC HC_SMITYPE_TYPE2 - 1600SC HC_SMITYPE_TYPE2 - 650 HC_SMITYPE_TYPE2 - 1655MC HC_SMITYPE_TYPE2 - 700 HC_SMITYPE_TYPE3 - 750 HC_SMITYPE_TYPE3 -=================== ===================== diff --git a/Documentation/dell_rbu.txt b/Documentation/dell_rbu.txt deleted file mode 100644 index 5d1ce7bcd04d..000000000000 --- a/Documentation/dell_rbu.txt +++ /dev/null @@ -1,128 +0,0 @@ -============================================================= -Usage of the new open sourced rbu (Remote BIOS Update) driver -============================================================= - -Purpose -======= - -Document demonstrating the use of the Dell Remote BIOS Update driver. 
-for updating BIOS images on Dell servers and desktops. - -Scope -===== - -This document discusses the functionality of the rbu driver only. -It does not cover the support needed from applications to enable the BIOS to -update itself with the image downloaded in to the memory. - -Overview -======== - -This driver works with Dell OpenManage or Dell Update Packages for updating -the BIOS on Dell servers (starting from servers sold since 1999), desktops -and notebooks (starting from those sold in 2005). - -Please go to http://support.dell.com register and you can find info on -OpenManage and Dell Update packages (DUP). - -Libsmbios can also be used to update BIOS on Dell systems go to -http://linux.dell.com/libsmbios/ for details. - -Dell_RBU driver supports BIOS update using the monolithic image and packetized -image methods. In case of monolithic the driver allocates a contiguous chunk -of physical pages having the BIOS image. In case of packetized the app -using the driver breaks the image in to packets of fixed sizes and the driver -would place each packet in contiguous physical memory. The driver also -maintains a link list of packets for reading them back. - -If the dell_rbu driver is unloaded all the allocated memory is freed. - -The rbu driver needs to have an application (as mentioned above)which will -inform the BIOS to enable the update in the next system reboot. - -The user should not unload the rbu driver after downloading the BIOS image -or updating. - -The driver load creates the following directories under the /sys file system:: - - /sys/class/firmware/dell_rbu/loading - /sys/class/firmware/dell_rbu/data - /sys/devices/platform/dell_rbu/image_type - /sys/devices/platform/dell_rbu/data - /sys/devices/platform/dell_rbu/packet_size - -The driver supports two types of update mechanism; monolithic and packetized. -These update mechanism depends upon the BIOS currently running on the system. 
-Most of the Dell systems support a monolithic update where the BIOS image is -copied to a single contiguous block of physical memory. - -In case of packet mechanism the single memory can be broken in smaller chunks -of contiguous memory and the BIOS image is scattered in these packets. - -By default the driver uses monolithic memory for the update type. This can be -changed to packets during the driver load time by specifying the load -parameter image_type=packet. This can also be changed later as below:: - - echo packet > /sys/devices/platform/dell_rbu/image_type - -In packet update mode the packet size has to be given before any packets can -be downloaded. It is done as below:: - - echo XXXX > /sys/devices/platform/dell_rbu/packet_size - -In the packet update mechanism, the user needs to create a new file having -packets of data arranged back to back. It can be done as follows -The user creates packets header, gets the chunk of the BIOS image and -places it next to the packetheader; now, the packetheader + BIOS image chunk -added together should match the specified packet_size. This makes one -packet, the user needs to create more such packets out of the entire BIOS -image file and then arrange all these packets back to back in to one single -file. - -This file is then copied to /sys/class/firmware/dell_rbu/data. -Once this file gets to the driver, the driver extracts packet_size data from -the file and spreads it across the physical memory in contiguous packet_sized -space. - -This method makes sure that all the packets get to the driver in a single operation. - -In monolithic update the user simply get the BIOS image (.hdr file) and copies -to the data file as is without any change to the BIOS image itself. - -Do the steps below to download the BIOS image. 
- -1) echo 1 > /sys/class/firmware/dell_rbu/loading -2) cp bios_image.hdr /sys/class/firmware/dell_rbu/data -3) echo 0 > /sys/class/firmware/dell_rbu/loading - -The /sys/class/firmware/dell_rbu/ entries will remain till the following is -done. - -:: - - echo -1 > /sys/class/firmware/dell_rbu/loading - -Until this step is completed the driver cannot be unloaded. - -Also echoing either mono, packet or init in to image_type will free up the -memory allocated by the driver. - -If a user by accident executes steps 1 and 3 above without executing step 2; -it will make the /sys/class/firmware/dell_rbu/ entries disappear. - -The entries can be recreated by doing the following:: - - echo init > /sys/devices/platform/dell_rbu/image_type - -.. note:: echoing init in image_type does not change it original value. - -Also the driver provides /sys/devices/platform/dell_rbu/data readonly file to -read back the image downloaded. - -.. note:: - - After updating the BIOS image a user mode application needs to execute - code which sends the BIOS update request to the BIOS. So on the next reboot - the BIOS knows about the new image downloaded and it updates itself. - Also don't unload the rbu driver if the image has to be updated. - diff --git a/Documentation/driver-api/bt8xxgpio.rst b/Documentation/driver-api/bt8xxgpio.rst new file mode 100644 index 000000000000..a845feb074de --- /dev/null +++ b/Documentation/driver-api/bt8xxgpio.rst @@ -0,0 +1,62 @@ +=================================================================== +A driver for a selfmade cheap BT8xx based PCI GPIO-card (bt8xxgpio) +=================================================================== + +For advanced documentation, see http://www.bu3sch.de/btgpio.php + +A generic digital 24-port PCI GPIO card can be built out of an ordinary +Brooktree bt848, bt849, bt878 or bt879 based analog TV tuner card. The +Brooktree chip is used in old analog Hauppauge WinTV PCI cards. You can easily +find them used for low prices on the net. 
+ +The bt8xx chip does have 24 digital GPIO ports. +These ports are accessible via 24 pins on the SMD chip package. + + +How to physically access the GPIO pins +====================================== + +The are several ways to access these pins. One might unsolder the whole chip +and put it on a custom PCI board, or one might only unsolder each individual +GPIO pin and solder that to some tiny wire. As the chip package really is tiny +there are some advanced soldering skills needed in any case. + +The physical pinouts are drawn in the following ASCII art. +The GPIO pins are marked with G00-G23:: + + G G G G G G G G G G G G G G G G G G + 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 + | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | + --------------------------------------------------------------------------- + --| ^ ^ |-- + --| pin 86 pin 67 |-- + --| |-- + --| pin 61 > |-- G18 + --| |-- G19 + --| |-- G20 + --| |-- G21 + --| |-- G22 + --| pin 56 > |-- G23 + --| |-- + --| Brooktree 878/879 |-- + --| |-- + --| |-- + --| |-- + --| |-- + --| |-- + --| |-- + --| |-- + --| |-- + --| |-- + --| |-- + --| |-- + --| |-- + --| |-- + --| O |-- + --| |-- + --------------------------------------------------------------------------- + | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | + ^ + This is pin 1 + diff --git a/Documentation/driver-api/connector.rst b/Documentation/driver-api/connector.rst new file mode 100644 index 000000000000..c100c7482289 --- /dev/null +++ b/Documentation/driver-api/connector.rst @@ -0,0 +1,156 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================ +Kernel Connector +================ + +Kernel connector - new netlink based userspace <-> kernel space easy +to use communication module. + +The Connector driver makes it easy to connect various agents using a +netlink based network. One must register a callback and an identifier. 
+When the driver receives a special netlink message with the appropriate +identifier, the appropriate callback will be called. + +From the userspace point of view it's quite straightforward: + + - socket(); + - bind(); + - send(); + - recv(); + +But if kernelspace wants to use the full power of such connections, the +driver writer must create special sockets, must know about struct sk_buff +handling, etc... The Connector driver allows any kernelspace agents to use +netlink based networking for inter-process communication in a significantly +easier way:: + + int cn_add_callback(struct cb_id *id, char *name, void (*callback) (struct cn_msg *, struct netlink_skb_parms *)); + void cn_netlink_send_multi(struct cn_msg *msg, u16 len, u32 portid, u32 __group, int gfp_mask); + void cn_netlink_send(struct cn_msg *msg, u32 portid, u32 __group, int gfp_mask); + + struct cb_id + { + __u32 idx; + __u32 val; + }; + +idx and val are unique identifiers which must be registered in the +connector.h header for in-kernel usage. `void (*callback) (void *)` is a +callback function which will be called when a message with above idx.val +is received by the connector core. The argument for that function must +be dereferenced to `struct cn_msg *`:: + + struct cn_msg + { + struct cb_id id; + + __u32 seq; + __u32 ack; + + __u32 len; /* Length of the following data */ + __u8 data[0]; + }; + +Connector interfaces +==================== + + .. kernel-doc:: include/linux/connector.h + + Note: + When registering new callback user, connector core assigns + netlink group to the user which is equal to its id.idx. + +Protocol description +==================== + +The current framework offers a transport layer with fixed headers. The +recommended protocol which uses such a header is as following: + +msg->seq and msg->ack are used to determine message genealogy. When +someone sends a message, they use a locally unique sequence and random +acknowledge number. 
The sequence number may be copied into +nlmsghdr->nlmsg_seq too. + +The sequence number is incremented with each message sent. + +If you expect a reply to the message, then the sequence number in the +received message MUST be the same as in the original message, and the +acknowledge number MUST be the same + 1. + +If we receive a message and its sequence number is not equal to one we +are expecting, then it is a new message. If we receive a message and +its sequence number is the same as one we are expecting, but its +acknowledge is not equal to the sequence number in the original +message + 1, then it is a new message. + +Obviously, the protocol header contains the above id. + +The connector allows event notification in the following form: kernel +driver or userspace process can ask connector to notify it when +selected ids will be turned on or off (registered or unregistered its +callback). It is done by sending a special command to the connector +driver (it also registers itself with id={-1, -1}). + +As example of this usage can be found in the cn_test.c module which +uses the connector to request notification and to send messages. + +Reliability +=========== + +Netlink itself is not a reliable protocol. That means that messages can +be lost due to memory pressure or process' receiving queue overflowed, +so caller is warned that it must be prepared. That is why the struct +cn_msg [main connector's message header] contains u32 seq and u32 ack +fields. + +Userspace usage +=============== + +2.6.14 has a new netlink socket implementation, which by default does not +allow people to send data to netlink groups other than 1. +So, if you wish to use a netlink socket (for example using connector) +with a different group number, the userspace application must subscribe to +that group first. 
It can be achieved by the following pseudocode:: + + s = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR); + + l_local.nl_family = AF_NETLINK; + l_local.nl_groups = 12345; + l_local.nl_pid = 0; + + if (bind(s, (struct sockaddr *)&l_local, sizeof(struct sockaddr_nl)) == -1) { + perror("bind"); + close(s); + return -1; + } + + { + int on = l_local.nl_groups; + setsockopt(s, 270, 1, &on, sizeof(on)); + } + +Where 270 above is SOL_NETLINK, and 1 is a NETLINK_ADD_MEMBERSHIP socket +option. To drop a multicast subscription, one should call the above socket +option with the NETLINK_DROP_MEMBERSHIP parameter which is defined as 0. + +2.6.14 netlink code only allows to select a group which is less or equal to +the maximum group number, which is used at netlink_kernel_create() time. +In case of connector it is CN_NETLINK_USERS + 0xf, so if you want to use +group number 12345, you must increment CN_NETLINK_USERS to that number. +Additional 0xf numbers are allocated to be used by non-in-kernel users. + +Due to this limitation, group 0xffffffff does not work now, so one can +not use add/remove connector's group notifications, but as far as I know, +only cn_test.c test module used it. + +Some work in netlink area is still being done, so things can be changed in +2.6.15 timeframe, if it will happen, documentation will be updated for that +kernel. + +Code samples +============ + +Sample code for a connector test module and user space can be found +in samples/connector/. To build this code, enable CONFIG_CONNECTOR +and CONFIG_SAMPLES. diff --git a/Documentation/driver-api/console.rst b/Documentation/driver-api/console.rst new file mode 100644 index 000000000000..8394ad7747ac --- /dev/null +++ b/Documentation/driver-api/console.rst @@ -0,0 +1,152 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============== +Console Drivers +=============== + +The Linux kernel has 2 general types of console drivers. 
The first type is +assigned by the kernel to all the virtual consoles during the boot process. +This type will be called 'system driver', and only one system driver is allowed +to exist. The system driver is persistent and it can never be unloaded, though +it may become inactive. + +The second type has to be explicitly loaded and unloaded. This will be called +'modular driver' by this document. Multiple modular drivers can coexist at +any time with each driver sharing the console with other drivers including +the system driver. However, modular drivers cannot take over the console +that is currently occupied by another modular driver. (Exception: Drivers that +call do_take_over_console() will succeed in the takeover regardless of the type +of driver occupying the consoles.) They can only take over the console that is +occupied by the system driver. By the same token, if the modular driver is +released by the console, the system driver will take over. + +Modular drivers, from the programmer's point of view, have to call:: + + do_take_over_console() - load and bind driver to console layer + give_up_console() - unload driver; it will only work if driver + is fully unbound + +In newer kernels, the following are also available:: + + do_register_con_driver() + do_unregister_con_driver() + +If sysfs is enabled, the contents of /sys/class/vtconsole can be +examined. This shows the console backends currently registered by the +system which are named vtcon<n> where <n> is an integer from 0 to 15. +Thus:: + + ls /sys/class/vtconsole + . .. vtcon0 vtcon1 + +Each directory in /sys/class/vtconsole has 3 files:: + + ls /sys/class/vtconsole/vtcon0 + . .. bind name uevent + +What do these files signify? + + 1. bind - this is a read/write file. It shows the status of the driver if + read, or acts to bind or unbind the driver to the virtual consoles + when written to.
The possible values are: + + 0 + - means the driver is not bound and if echo'ed, commands the driver + to unbind + + 1 + - means the driver is bound and if echo'ed, commands the driver to + bind + + 2. name - read-only file. Shows the name of the driver in this format:: + + cat /sys/class/vtconsole/vtcon0/name + (S) VGA+ + + '(S)' stands for a (S)ystem driver, i.e., it cannot be directly + commanded to bind or unbind + + 'VGA+' is the name of the driver + + cat /sys/class/vtconsole/vtcon1/name + (M) frame buffer device + + In this case, '(M)' stands for a (M)odular driver, one that can be + directly commanded to bind or unbind. + + 3. uevent - ignore this file + +When unbinding, the modular driver is detached first, and then the system +driver takes over the consoles vacated by the driver. Binding, on the other +hand, will bind the driver to the consoles that are currently occupied by a +system driver. + +NOTE1: + Binding and unbinding must be selected in Kconfig. It's under:: + + Device Drivers -> + Character devices -> + Support for binding and unbinding console drivers + +NOTE2: + If any of the virtual consoles are in KD_GRAPHICS mode, then binding or + unbinding will not succeed. An example of an application that sets the + console to KD_GRAPHICS is X. + +How useful is this feature? This is very useful for console driver +developers. By unbinding the driver from the console layer, one can unload the +driver, make changes, recompile, reload and rebind the driver without any need +for rebooting the kernel. For regular users who may want to switch from +framebuffer console to VGA console and vice versa, this feature also makes +this possible. (NOTE NOTE NOTE: Please read fbcon.txt under Documentation/fb +for more details.) 
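The unbind/rebind cycle described above can be sketched as a pair of shell helpers. This is only an illustration, not part of the patch: the function names and the VTCON_ROOT override are hypothetical (the override exists so the helpers can be tried against a scratch directory), and it assumes the sysfs layout shown above, with bind/unbind support enabled in Kconfig and no console in KD_GRAPHICS mode.

```shell
# Hypothetical helpers; VTCON_ROOT is an assumption that allows testing
# against a scratch directory instead of the real sysfs tree.
VTCON_ROOT="${VTCON_ROOT:-/sys/class/vtconsole}"

# Print every registered console backend with its name and bind state.
list_vtcons() {
    for d in "$VTCON_ROOT"/vtcon*; do
        [ -d "$d" ] || continue
        printf '%s: %s (bind=%s)\n' "${d##*/}" "$(cat "$d/name")" "$(cat "$d/bind")"
    done
}

# set_vtcon_bind VTCON STATE
# Write 0 (unbind) or 1 (bind) to a backend's bind file. Only backends
# whose name starts with "(M)" (modular drivers) are touched, since a
# system driver cannot be directly commanded to bind or unbind.
set_vtcon_bind() {
    dir="$VTCON_ROOT/$1"
    case "$(cat "$dir/name")" in
        "(M)"*) echo "$2" > "$dir/bind" ;;
        *) echo "$1 is not a modular driver" >&2; return 1 ;;
    esac
}
```

A driver developer would typically run something like `set_vtcon_bind vtcon1 0`, reload the module, then `set_vtcon_bind vtcon1 1`; writing to the real sysfs files requires root.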
+ +Notes for developers +==================== + +do_take_over_console() is now broken up into:: + + do_register_con_driver() + do_bind_con_driver() - private function + +give_up_console() is a wrapper to do_unregister_con_driver(), and a driver must +be fully unbound for this call to succeed. con_is_bound() will check if the +driver is bound or not. + +Guidelines for console driver writers +===================================== + +In order for binding to and unbinding from the console to properly work, +console drivers must follow these guidelines: + +1. All drivers, except system drivers, must call either do_register_con_driver() + or do_take_over_console(). do_register_con_driver() will just add the driver + to the console's internal list. It won't take over the + console. do_take_over_console(), as its name implies, will also take over (or + bind to) the console. + +2. All resources allocated during con->con_init() must be released in + con->con_deinit(). + +3. All resources allocated in con->con_startup() must be released when the + driver, which was previously bound, becomes unbound. The console layer + does not have a complementary call to con->con_startup() so it's up to the + driver to check when it's legal to release these resources. Calling + con_is_bound() in con->con_deinit() will help. If the call returns + false, then it's safe to release the resources. This balance has to be + ensured because con->con_startup() can be called again when a request to + rebind the driver to the console arrives. + +4. Upon exit of the driver, ensure that the driver is totally unbound. If the + condition is satisfied, then the driver must call do_unregister_con_driver() + or give_up_console(). + +5. do_unregister_con_driver() can also be called on conditions which make it + impossible for the driver to service console requests. This can happen + with the framebuffer console that suddenly lost all of its drivers.
+ +The current crop of console drivers should still work correctly, but binding +and unbinding them may cause problems. With minimal fixes, these drivers can +be made to work correctly. + +Antonino Daplas diff --git a/Documentation/driver-api/dcdbas.rst b/Documentation/driver-api/dcdbas.rst new file mode 100644 index 000000000000..309cc57a7c1c --- /dev/null +++ b/Documentation/driver-api/dcdbas.rst @@ -0,0 +1,99 @@ +=================================== +Dell Systems Management Base Driver +=================================== + +Overview +======== + +The Dell Systems Management Base Driver provides a sysfs interface for +systems management software such as Dell OpenManage to perform system +management interrupts and host control actions (system power cycle or +power off after OS shutdown) on certain Dell systems. + +Dell OpenManage requires this driver on the following Dell PowerEdge systems: +300, 1300, 1400, 400SC, 500SC, 1500SC, 1550, 600SC, 1600SC, 650, 1655MC, +700, and 750. Other Dell software such as the open source libsmbios project +is expected to make use of this driver, and it may include the use of this +driver on other Dell systems. + +The Dell libsmbios project aims towards providing access to as much BIOS +information as possible. See http://linux.dell.com/libsmbios/main/ for +more information about the libsmbios project. + + +System Management Interrupt +=========================== + +On some Dell systems, systems management software must access certain +management information via a system management interrupt (SMI). The SMI data +buffer must reside in 32-bit address space, and the physical address of the +buffer is required for the SMI. The driver maintains the memory required for +the SMI and provides a way for the application to generate the SMI. 
+The driver creates the following sysfs entries for systems management +software to perform these system management interrupts:: + + /sys/devices/platform/dcdbas/smi_data + /sys/devices/platform/dcdbas/smi_data_buf_phys_addr + /sys/devices/platform/dcdbas/smi_data_buf_size + /sys/devices/platform/dcdbas/smi_request + +Systems management software must perform the following steps to execute +a SMI using this driver: + +1) Lock smi_data. +2) Write system management command to smi_data. +3) Write "1" to smi_request to generate a calling interface SMI or + "2" to generate a raw SMI. +4) Read system management command response from smi_data. +5) Unlock smi_data. + + +Host Control Action +=================== + +Dell OpenManage supports a host control feature that allows the administrator +to perform a power cycle or power off of the system after the OS has finished +shutting down. On some Dell systems, this host control feature requires that +a driver perform a SMI after the OS has finished shutting down. + +The driver creates the following sysfs entries for systems management software +to schedule the driver to perform a power cycle or power off host control +action after the system has finished shutting down: + +/sys/devices/platform/dcdbas/host_control_action +/sys/devices/platform/dcdbas/host_control_smi_type +/sys/devices/platform/dcdbas/host_control_on_shutdown + +Dell OpenManage performs the following steps to execute a power cycle or +power off host control action using this driver: + +1) Write host control action to be performed to host_control_action. +2) Write type of SMI that driver needs to perform to host_control_smi_type. +3) Write "1" to host_control_on_shutdown to enable host control action. +4) Initiate OS shutdown. + (Driver will perform host control SMI when it is notified that the OS + has finished shutting down.) 
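Steps 1 to 3 of the host control sequence above can be sketched as a small shell helper. This is an illustration under stated assumptions rather than anything Dell OpenManage ships: the function name and the DCDBAS_ROOT override are hypothetical, and the numeric values for the action and SMI type are deliberately left to the caller, because this text only gives the symbolic HC_SMITYPE_* names. Step 4 (initiating the OS shutdown) is intentionally not performed.

```shell
# Hypothetical helper; DCDBAS_ROOT is an assumption that allows testing
# against a scratch directory instead of the real sysfs tree.
DCDBAS_ROOT="${DCDBAS_ROOT:-/sys/devices/platform/dcdbas}"

# schedule_host_control ACTION SMI_TYPE
# ACTION and SMI_TYPE must be the values the driver expects for the
# desired action and for the HC_SMITYPE_* of the system in use; they
# are not defined here.
schedule_host_control() {
    if [ "$#" -ne 2 ]; then
        echo "usage: schedule_host_control ACTION SMI_TYPE" >&2
        return 2
    fi
    # 1) host control action, 2) SMI type, 3) enable on shutdown
    echo "$1" > "$DCDBAS_ROOT/host_control_action" &&
    echo "$2" > "$DCDBAS_ROOT/host_control_smi_type" &&
    echo 1 > "$DCDBAS_ROOT/host_control_on_shutdown"
}
```

The writes are chained with `&&` so that host_control_on_shutdown is only armed once both the action and the SMI type were accepted.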
+ + +Host Control SMI Type +===================== + +The following table shows the value to write to host_control_smi_type to +perform a power cycle or power off host control action: + +=================== ===================== +PowerEdge System Host Control SMI Type +=================== ===================== + 300 HC_SMITYPE_TYPE1 + 1300 HC_SMITYPE_TYPE1 + 1400 HC_SMITYPE_TYPE2 + 500SC HC_SMITYPE_TYPE2 + 1500SC HC_SMITYPE_TYPE2 + 1550 HC_SMITYPE_TYPE2 + 600SC HC_SMITYPE_TYPE2 + 1600SC HC_SMITYPE_TYPE2 + 650 HC_SMITYPE_TYPE2 + 1655MC HC_SMITYPE_TYPE2 + 700 HC_SMITYPE_TYPE3 + 750 HC_SMITYPE_TYPE3 +=================== ===================== diff --git a/Documentation/driver-api/dell_rbu.rst b/Documentation/driver-api/dell_rbu.rst new file mode 100644 index 000000000000..5d1ce7bcd04d --- /dev/null +++ b/Documentation/driver-api/dell_rbu.rst @@ -0,0 +1,128 @@ +============================================================= +Usage of the new open sourced rbu (Remote BIOS Update) driver +============================================================= + +Purpose +======= + +Document demonstrating the use of the Dell Remote BIOS Update driver. +for updating BIOS images on Dell servers and desktops. + +Scope +===== + +This document discusses the functionality of the rbu driver only. +It does not cover the support needed from applications to enable the BIOS to +update itself with the image downloaded in to the memory. + +Overview +======== + +This driver works with Dell OpenManage or Dell Update Packages for updating +the BIOS on Dell servers (starting from servers sold since 1999), desktops +and notebooks (starting from those sold in 2005). + +Please go to http://support.dell.com register and you can find info on +OpenManage and Dell Update packages (DUP). + +Libsmbios can also be used to update BIOS on Dell systems go to +http://linux.dell.com/libsmbios/ for details. + +Dell_RBU driver supports BIOS update using the monolithic image and packetized +image methods. 
In case of monolithic the driver allocates a contiguous chunk +of physical pages having the BIOS image. In case of packetized the app +using the driver breaks the image in to packets of fixed sizes and the driver +would place each packet in contiguous physical memory. The driver also +maintains a link list of packets for reading them back. + +If the dell_rbu driver is unloaded all the allocated memory is freed. + +The rbu driver needs to have an application (as mentioned above)which will +inform the BIOS to enable the update in the next system reboot. + +The user should not unload the rbu driver after downloading the BIOS image +or updating. + +The driver load creates the following directories under the /sys file system:: + + /sys/class/firmware/dell_rbu/loading + /sys/class/firmware/dell_rbu/data + /sys/devices/platform/dell_rbu/image_type + /sys/devices/platform/dell_rbu/data + /sys/devices/platform/dell_rbu/packet_size + +The driver supports two types of update mechanism; monolithic and packetized. +These update mechanism depends upon the BIOS currently running on the system. +Most of the Dell systems support a monolithic update where the BIOS image is +copied to a single contiguous block of physical memory. + +In case of packet mechanism the single memory can be broken in smaller chunks +of contiguous memory and the BIOS image is scattered in these packets. + +By default the driver uses monolithic memory for the update type. This can be +changed to packets during the driver load time by specifying the load +parameter image_type=packet. This can also be changed later as below:: + + echo packet > /sys/devices/platform/dell_rbu/image_type + +In packet update mode the packet size has to be given before any packets can +be downloaded. It is done as below:: + + echo XXXX > /sys/devices/platform/dell_rbu/packet_size + +In the packet update mechanism, the user needs to create a new file having +packets of data arranged back to back. 
It can be done as follows +The user creates packets header, gets the chunk of the BIOS image and +places it next to the packetheader; now, the packetheader + BIOS image chunk +added together should match the specified packet_size. This makes one +packet, the user needs to create more such packets out of the entire BIOS +image file and then arrange all these packets back to back in to one single +file. + +This file is then copied to /sys/class/firmware/dell_rbu/data. +Once this file gets to the driver, the driver extracts packet_size data from +the file and spreads it across the physical memory in contiguous packet_sized +space. + +This method makes sure that all the packets get to the driver in a single operation. + +In monolithic update the user simply get the BIOS image (.hdr file) and copies +to the data file as is without any change to the BIOS image itself. + +Do the steps below to download the BIOS image. + +1) echo 1 > /sys/class/firmware/dell_rbu/loading +2) cp bios_image.hdr /sys/class/firmware/dell_rbu/data +3) echo 0 > /sys/class/firmware/dell_rbu/loading + +The /sys/class/firmware/dell_rbu/ entries will remain till the following is +done. + +:: + + echo -1 > /sys/class/firmware/dell_rbu/loading + +Until this step is completed the driver cannot be unloaded. + +Also echoing either mono, packet or init in to image_type will free up the +memory allocated by the driver. + +If a user by accident executes steps 1 and 3 above without executing step 2; +it will make the /sys/class/firmware/dell_rbu/ entries disappear. + +The entries can be recreated by doing the following:: + + echo init > /sys/devices/platform/dell_rbu/image_type + +.. note:: echoing init in image_type does not change it original value. + +Also the driver provides /sys/devices/platform/dell_rbu/data readonly file to +read back the image downloaded. + +.. note:: + + After updating the BIOS image a user mode application needs to execute + code which sends the BIOS update request to the BIOS. 
So on the next reboot + the BIOS knows about the new image downloaded and it updates itself. + Also don't unload the rbu driver if the image has to be updated. + diff --git a/Documentation/driver-api/edid.rst b/Documentation/driver-api/edid.rst new file mode 100644 index 000000000000..b1b5acd501ed --- /dev/null +++ b/Documentation/driver-api/edid.rst @@ -0,0 +1,58 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==== +EDID +==== + +In the good old days when graphics parameters were configured explicitly +in a file called xorg.conf, even broken hardware could be managed. + +Today, with the advent of Kernel Mode Setting, a graphics board is +either correctly working because all components follow the standards - +or the computer is unusable, because the screen remains dark after +booting or it displays the wrong area. Cases when this happens are: +- The graphics board does not recognize the monitor. +- The graphics board is unable to detect any EDID data. +- The graphics board incorrectly forwards EDID data to the driver. +- The monitor sends no or bogus EDID data. +- A KVM sends its own EDID data instead of querying the connected monitor. +Adding the kernel parameter "nomodeset" helps in most cases, but causes +restrictions later on. + +As a remedy for such situations, the kernel configuration item +CONFIG_DRM_LOAD_EDID_FIRMWARE was introduced. It allows to provide an +individually prepared or corrected EDID data set in the /lib/firmware +directory from where it is loaded via the firmware interface. The code +(see drivers/gpu/drm/drm_edid_load.c) contains built-in data sets for +commonly used screen resolutions (800x600, 1024x768, 1280x1024, 1600x1200, +1680x1050, 1920x1080) as binary blobs, but the kernel source tree does +not contain code to create these data. 
In order to elucidate the origin +of the built-in binary EDID blobs and to facilitate the creation of +individual data for a specific misbehaving monitor, commented sources +and a Makefile environment are given here. + +To create binary EDID and C source code files from the existing data +material, simply type "make". + +If you want to create your own EDID file, copy the file 1024x768.S, +replace the settings with your own data and add a new target to the +Makefile. Please note that the EDID data structure expects the timing +values in a different way as compared to the standard X11 format. + +X11: + HTimings: + hdisp hsyncstart hsyncend htotal + VTimings: + vdisp vsyncstart vsyncend vtotal + +EDID:: + + #define XPIX hdisp + #define XBLANK htotal-hdisp + #define XOFFSET hsyncstart-hdisp + #define XPULSE hsyncend-hsyncstart + + #define YPIX vdisp + #define YBLANK vtotal-vdisp + #define YOFFSET vsyncstart-vdisp + #define YPULSE vsyncend-vsyncstart diff --git a/Documentation/driver-api/eisa.rst b/Documentation/driver-api/eisa.rst new file mode 100644 index 000000000000..c07565ba57da --- /dev/null +++ b/Documentation/driver-api/eisa.rst @@ -0,0 +1,230 @@ +================ +EISA bus support +================ + +:Author: Marc Zyngier + +This document groups random notes about porting EISA drivers to the +new EISA/sysfs API. + +Starting from version 2.5.59, the EISA bus is almost given the same +status as other much more mainstream busses such as PCI or USB. This +has been possible through sysfs, which defines a nice enough set of +abstractions to manage busses, devices and drivers. + +Although the new API is quite simple to use, converting existing +drivers to the new infrastructure is not an easy task (mostly because +detection code is generally also used to probe ISA cards). Moreover, +most EISA drivers are among the oldest Linux drivers so, as you can +imagine, some dust has settled here over the years. 
+ +The EISA infrastructure is made up of three parts: + + - The bus code implements most of the generic code. It is shared + among all the architectures that the EISA code runs on. It + implements bus probing (detecting EISA cards available on the bus), + allocates I/O resources, allows fancy naming through sysfs, and + offers interfaces for drivers to register. + + - The bus root driver implements the glue between the bus hardware + and the generic bus code. It is responsible for discovering the + device implementing the bus, and setting it up to be later probed + by the bus code. This can go from something as simple as reserving + an I/O region on x86, to the rather more complex, like the hppa + EISA code. This is the part to implement in order to have EISA + running on a "new" platform. + + - The driver offers the bus a list of devices that it manages, and + implements the necessary callbacks to probe and release devices + whenever told to. + +Every function/structure below lives in <linux/eisa.h>, which depends +heavily on <linux/device.h>. + +Bus root driver +=============== + +:: + + int eisa_root_register (struct eisa_root_device *root); + +The eisa_root_register function is used to declare a device as the +root of an EISA bus.
The eisa_root_device structure holds a reference +to this device, as well as some parameters for probing purposes:: + + struct eisa_root_device { + struct device *dev; /* Pointer to bridge device */ + struct resource *res; + unsigned long bus_base_addr; + int slots; /* Max slot number */ + int force_probe; /* Probe even when no slot 0 */ + u64 dma_mask; /* from bridge device */ + int bus_nr; /* Set by eisa_root_register */ + struct resource eisa_root_res; /* ditto */ + }; + +============= ====================================================== +node used for eisa_root_register internal purpose +dev pointer to the root device +res root device I/O resource +bus_base_addr slot 0 address on this bus +slots max slot number to probe +force_probe Probe even when slot 0 is empty (no EISA mainboard) +dma_mask Default DMA mask. Usually the bridge device dma_mask. +bus_nr unique bus id, set by eisa_root_register +============= ====================================================== + +Driver +====== + +:: + + int eisa_driver_register (struct eisa_driver *edrv); + void eisa_driver_unregister (struct eisa_driver *edrv); + +Clear enough ? + +:: + + struct eisa_device_id { + char sig[EISA_SIG_LEN]; + unsigned long driver_data; + }; + + struct eisa_driver { + const struct eisa_device_id *id_table; + struct device_driver driver; + }; + +=============== ==================================================== +id_table an array of NULL terminated EISA id strings, + followed by an empty string. Each string can + optionally be paired with a driver-dependent value + (driver_data). + +driver a generic driver, such as described in + Documentation/driver-api/driver-model/driver.rst. Only .name, + .probe and .remove members are mandatory. 
+=============== ==================================================== + +An example is the 3c59x driver:: + + static struct eisa_device_id vortex_eisa_ids[] = { + { "TCM5920", EISA_3C592_OFFSET }, + { "TCM5970", EISA_3C597_OFFSET }, + { "" } + }; + + static struct eisa_driver vortex_eisa_driver = { + .id_table = vortex_eisa_ids, + .driver = { + .name = "3c59x", + .probe = vortex_eisa_probe, + .remove = vortex_eisa_remove + } + }; + +Device +====== + +The sysfs framework calls .probe and .remove functions upon device +discovery and removal (note that the .remove function is only called +when the driver is built as a module). + +Both functions are passed a pointer to a 'struct device', which is +encapsulated in a 'struct eisa_device' described as follows:: + + struct eisa_device { + struct eisa_device_id id; + int slot; + int state; + unsigned long base_addr; + struct resource res[EISA_MAX_RESOURCES]; + u64 dma_mask; + struct device dev; /* generic device */ + }; + +======== ============================================================ +id EISA id, as read from device. id.driver_data is set from the + matching driver EISA id. +slot slot number which the device was detected on +state set of flags indicating the state of the device. Current + flags are EISA_CONFIG_ENABLED and EISA_CONFIG_FORCED. +res set of four 256-byte I/O regions allocated to this device +dma_mask DMA mask set from the parent device. +dev generic device (see Documentation/driver-api/driver-model/device.rst) +======== ============================================================ + +You can get the 'struct eisa_device' from 'struct device' using the +'to_eisa_device' macro. + +Misc stuff +========== + +:: + + void eisa_set_drvdata (struct eisa_device *edev, void *data); + +Stores data into the device's driver_data area. + +:: + + void *eisa_get_drvdata (struct eisa_device *edev); + +Gets the pointer previously stored into the device's driver_data area.
+ +:: + + int eisa_get_region_index (void *addr); + +Returns the region number (0 <= x < EISA_MAX_RESOURCES) of a given +address. + +Kernel parameters +================= + +eisa_bus.enable_dev + A comma-separated list of slots to be enabled, even if the firmware + set the card as disabled. The driver must be able to properly + initialize the device in such conditions. + +eisa_bus.disable_dev + A comma-separated list of slots to be disabled, even if the firmware + set the card as enabled. The driver won't be called to handle this + device. + +virtual_root.force_probe + Force the probing code to probe EISA slots even when it cannot find an + EISA compliant mainboard (nothing appears on slot 0). Defaults to 0 + (don't force), and set to 1 (force probing) when either + CONFIG_ALPHA_JENSEN or CONFIG_EISA_VLB_PRIMING is set. + +Random notes +============ + +Converting an EISA driver to the new API mostly involves *deleting* +code (since probing is now in the core EISA code). Unfortunately, most +drivers share their probing routine between ISA and EISA. Special +care must be taken when ripping out the EISA code, so other busses +won't suffer from these surgical strikes... + +You *must not* expect any EISA device to be detected when returning +from eisa_driver_register, since the chances are that the bus has not +yet been probed. In fact, that's what happens most of the time (the +bus root driver usually kicks in rather late in the boot process). +Unfortunately, most drivers are doing the probing by themselves, and +expect to have explored the whole machine when they exit their probe +routine. + +For example, switching your favorite EISA SCSI card to the "hotplug" +model is "the right thing"(tm).
+ +Thanks +====== + +I'd like to thank the following people for their help: + +- Xavier Benigni for lending me a wonderful Alpha Jensen, +- James Bottomley, Jeff Garzik for getting this stuff into the kernel, +- Andries Brouwer for contributing numerous EISA ids, +- Catrin Jones for coping with far too many machines at home. diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index 9fb03b7bdeb1..d1c6513dd20d 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -68,7 +68,33 @@ available subsections can be seen below. fpga/index acpi/index backlight/lp855x-driver.rst + bt8xxgpio + connector + console + dcdbas + dell_rbu + edid + eisa + isa + isapnp generic-counter + lightnvm-pblk + men-chameleon-bus + ntb + nvmem + parport-lowlevel + pti_intel_mid + pwm + rfkill + sgi-ioc4 + sm501 + smsc_ece1099 + switchtec + sync_file + vfio-mediated-device + vfio + xillybus + zorro .. only:: subproject and html diff --git a/Documentation/driver-api/isa.rst b/Documentation/driver-api/isa.rst new file mode 100644 index 000000000000..def4a7b690b5 --- /dev/null +++ b/Documentation/driver-api/isa.rst @@ -0,0 +1,122 @@ +=========== +ISA Drivers +=========== + +The following text is adapted from the commit message of the initial +commit of the ISA bus driver authored by Rene Herman. + +During the recent "isa drivers using platform devices" discussion it was +pointed out that (ALSA) ISA drivers ran into the problem of not having +the option to fail driver load (device registration rather) upon not +finding their hardware due to a probe() error not being passed up +through the driver model. In the course of that, I suggested a separate +ISA bus might be best; Russell King agreed and suggested this bus could +use the .match() method for the actual device discovery. + +The attached does this. 
For this old non (generically) discoverable ISA +hardware only the driver itself can do discovery so as a difference with +the platform_bus, this isa_bus also distributes match() up to the +driver. + +As another difference: these devices only exist in the driver model due +to the driver creating them because it might want to drive them, meaning +that all device creation has been made internal as well. + +The usage model this provides is nice, and has been acked from the ALSA +side by Takashi Iwai and Jaroslav Kysela. The ALSA driver module_init's +now (for oldisa-only drivers) become:: + + static int __init alsa_card_foo_init(void) + { + return isa_register_driver(&snd_foo_isa_driver, SNDRV_CARDS); + } + + static void __exit alsa_card_foo_exit(void) + { + isa_unregister_driver(&snd_foo_isa_driver); + } + +Quite like the other bus models therefore. This removes a lot of +duplicated init code from the ALSA ISA drivers. + +The passed in isa_driver struct is the regular driver struct embedding a +struct device_driver, the normal probe/remove/shutdown/suspend/resume +callbacks, and as indicated that .match callback. + +The "SNDRV_CARDS" you see being passed in is an "unsigned int ndev" +parameter, indicating how many devices to create and call our methods +with. + +The platform_driver callbacks are called with a platform_device param; +the isa_driver callbacks are being called with a ``struct device *dev, +unsigned int id`` pair directly -- with the device creation completely +internal to the bus it's much cleaner to not leak isa_dev's by passing +them in at all. The id is the only thing we ever want other than the +struct device anyways, and it makes for nicer code in the callbacks as +well. + +With this additional .match() callback ISA drivers have all options. If +ALSA would want to keep the old non-load behaviour, it could stick all +of the old .probe in .match, which would only keep them registered after +everything was found to be present and accounted for. 
If it wanted the +behaviour of always loading as it inadvertently did for a bit after the +changeover to platform devices, it could just not provide a .match() and +do everything in .probe() as before. + +If it, as Takashi Iwai already suggested earlier as a way of following +the model from saner buses more closely, wants to load when a later bind +could conceivably succeed, it could use .match() for the prerequisites +(such as checking the user wants the card enabled and that port/irq/dma +values have been passed in) and .probe() for everything else. This is +the nicest model. + +To the code... + +This exports only two functions; isa_{,un}register_driver(). + +isa_register_driver() registers the struct device_driver, and then +loops over the passed in ndev creating devices and registering them. +This causes the bus match method to be called for them, which is:: + + int isa_bus_match(struct device *dev, struct device_driver *driver) + { + struct isa_driver *isa_driver = to_isa_driver(driver); + + if (dev->platform_data == isa_driver) { + if (!isa_driver->match || + isa_driver->match(dev, to_isa_dev(dev)->id)) + return 1; + dev->platform_data = NULL; + } + return 0; + } + +The first thing this does is check if this device is in fact one of this +driver's devices by seeing if the device's platform_data pointer is set +to this driver. Platform devices compare strings, but we don't need to +do that with everything being internal, so isa_register_driver() abuses +dev->platform_data as an isa_driver pointer which we can then check here. +I believe platform_data is available for this, but if rather not, moving +the isa_driver pointer to the private struct isa_dev is of course fine as +well. + +Then, if the driver did not provide a .match, it matches. If it did, +the driver match() method is called to determine a match. + +If it did **not** match, dev->platform_data is reset to indicate this to +isa_register_driver which can then unregister the device again. 
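To make the three outcomes of the bus match method concrete (foreign device, match, rejected by the driver), the following is a small userspace model of isa_bus_match(). It is a sketch under stated assumptions: the structures are cut down to the fields the function touches, `container_of` is re-implemented locally, and `foo_match` is a hypothetical driver callback that accepts only device id 0.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced models of the kernel structures (assumption: illustration only). */
struct device {
	void *platform_data;
};

struct device_driver {
	const char *name;
};

struct isa_driver {
	int (*match)(struct device *dev, unsigned int id);
	struct device_driver driver;
};

struct isa_dev {
	struct device dev;
	unsigned int id;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define to_isa_driver(d) container_of(d, struct isa_driver, driver)
#define to_isa_dev(d)    container_of(d, struct isa_dev, dev)

/* Same logic as the bus match method quoted in the text. */
static int isa_bus_match(struct device *dev, struct device_driver *driver)
{
	struct isa_driver *isa_driver = to_isa_driver(driver);

	assert(dev != NULL && driver != NULL);

	/* Only devices created on behalf of this driver are considered. */
	if (dev->platform_data == isa_driver) {
		if (!isa_driver->match ||
		    isa_driver->match(dev, to_isa_dev(dev)->id))
			return 1;
		/* Rejected: clear the pointer to signal "unregister me". */
		dev->platform_data = NULL;
	}
	return 0;
}

/* Hypothetical driver .match callback: only device id 0 is "present". */
static int foo_match(struct device *dev, unsigned int id)
{
	(void)dev;
	return id == 0;
}
```

In this model a device owned by another driver is left untouched, while a device that its own driver rejects has platform_data cleared, which is the signal the text says isa_register_driver uses to unregister it again.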
+ +If during all this, there's any error, or no devices matched at all +everything is backed out again and the error, or -ENODEV, is returned. + +isa_unregister_driver() just unregisters the matched devices and the +driver itself. + +module_isa_driver is a helper macro for ISA drivers which do not do +anything special in module init/exit. This eliminates a lot of +boilerplate code. Each module may only use this macro once, and calling +it replaces module_init and module_exit. + +max_num_isa_dev is a macro to determine the maximum possible number of +ISA devices which may be registered in the I/O port address space given +the address extent of the ISA devices. diff --git a/Documentation/driver-api/isapnp.rst b/Documentation/driver-api/isapnp.rst new file mode 100644 index 000000000000..8d0840ac847b --- /dev/null +++ b/Documentation/driver-api/isapnp.rst @@ -0,0 +1,15 @@ +========================================================== +ISA Plug & Play support by Jaroslav Kysela +========================================================== + +Interface /proc/isapnp +====================== + +The interface has been removed. See pnp.txt for more details. + +Interface /proc/bus/isapnp +========================== + +This directory allows access to ISA PnP cards and logical devices. +The regular files contain the contents of ISA PnP registers for +a logical device. diff --git a/Documentation/driver-api/lightnvm-pblk.rst b/Documentation/driver-api/lightnvm-pblk.rst new file mode 100644 index 000000000000..1040ed1cec81 --- /dev/null +++ b/Documentation/driver-api/lightnvm-pblk.rst @@ -0,0 +1,21 @@ +pblk: Physical Block Device Target +================================== + +pblk implements a fully associative, host-based FTL that exposes a traditional +block I/O interface. Its primary responsibilities are: + + - Map logical addresses onto physical addresses (4KB granularity) in a + logical-to-physical (L2P) table. 
+ - Maintain the integrity and consistency of the L2P table as well as its + recovery from normal tear down and power outage. + - Deal with controller- and media-specific constraints. + - Handle I/O errors. + - Implement garbage collection. + - Maintain consistency across the I/O stack during synchronization points. + +For more information please refer to: + + http://lightnvm.io + +which maintains updated FAQs, manual pages, technical documentation, tools, +contacts, etc. diff --git a/Documentation/driver-api/men-chameleon-bus.rst b/Documentation/driver-api/men-chameleon-bus.rst new file mode 100644 index 000000000000..1b1f048aa748 --- /dev/null +++ b/Documentation/driver-api/men-chameleon-bus.rst @@ -0,0 +1,175 @@ +================= +MEN Chameleon Bus +================= + +.. Table of Contents + ================= + 1 Introduction + 1.1 Scope of this Document + 1.2 Limitations of the current implementation + 2 Architecture + 2.1 MEN Chameleon Bus + 2.2 Carrier Devices + 2.3 Parser + 3 Resource handling + 3.1 Memory Resources + 3.2 IRQs + 4 Writing an MCB driver + 4.1 The driver structure + 4.2 Probing and attaching + 4.3 Initializing the driver + + +Introduction +============ + +This document describes the architecture and implementation of the MEN +Chameleon Bus (called MCB throughout this document). + +Scope of this Document +---------------------- + +This document is intended to be a short overview of the current +implementation and does by no means describe the complete possibilities of MCB +based devices. + +Limitations of the current implementation +----------------------------------------- + +The current implementation is limited to PCI and PCIe based carrier devices +that only use a single memory resource and share the PCI legacy IRQ. Not +implemented are: + +- Multi-resource MCB devices like the VME Controller or M-Module carrier. 
+- MCB devices that need another MCB device, like SRAM for a DMA Controller's + buffer descriptors or a video controller's video memory. +- A per-carrier IRQ domain for carrier devices that have one (or more) IRQs + per MCB device like PCIe based carriers with MSI or MSI-X support. + +Architecture +============ + +MCB is divided into 3 functional blocks: + +- The MEN Chameleon Bus itself, +- drivers for MCB Carrier Devices and +- the parser for the Chameleon table. + +MEN Chameleon Bus +----------------- + +The MEN Chameleon Bus is an artificial bus system that attaches to a so +called Chameleon FPGA device found on some hardware produced by MEN Mikro +Elektronik GmbH. These devices are multi-function devices implemented in a +single FPGA and usually attached via some sort of PCI or PCIe link. Each +FPGA contains a header section describing the content of the FPGA. The +header lists the device id, PCI BAR, offset from the beginning of the PCI +BAR, size in the FPGA, interrupt number and some other properties currently +not handled by the MCB implementation. + +Carrier Devices +--------------- + +A carrier device is just an abstraction for the real world physical bus the +Chameleon FPGA is attached to. Some IP Core drivers may need to interact with +properties of the carrier device (like querying the IRQ number of a PCI +device). To provide abstraction from the real hardware bus, an MCB carrier +device provides callback methods to translate the driver's MCB function calls +to hardware related function calls. For example a carrier device may +implement the get_irq() method which can be translated into a hardware bus +query for the IRQ number the device should use. + +Parser +------ + +The parser reads the first 512 bytes of a Chameleon device and parses the +Chameleon table. Currently the parser only supports the Chameleon v2 variant +of the Chameleon table but can easily be adapted to support an older or +possible future variant. 
While parsing the table's entries new MCB devices +are allocated and their resources are assigned according to the resource +assignment in the Chameleon table. After resource assignment is finished, the +MCB devices are registered at the MCB and thus at the driver core of the +Linux kernel. + +Resource handling +================= + +The current implementation assigns exactly one memory and one IRQ resource +per MCB device. But this is likely going to change in the future. + +Memory Resources +---------------- + +Each MCB device has exactly one memory resource, which can be requested from +the MCB bus. This memory resource is the physical address of the MCB device +inside the carrier and is intended to be passed to ioremap() and friends. It +is already requested from the kernel by calling request_mem_region(). + +IRQs +---- + +Each MCB device has exactly one IRQ resource, which can be requested from the +MCB bus. If a carrier device driver implements the ->get_irq() callback +method, the IRQ number assigned by the carrier device will be returned, +otherwise the IRQ number inside the Chameleon table will be returned. This +number is suitable to be passed to request_irq(). + +Writing an MCB driver +===================== + +The driver structure +-------------------- + +Each MCB driver has a structure to identify the device driver as well as +device ids which identify the IP Core inside the FPGA. 
The driver structure +also contains callback methods which get executed on driver probe and +removal from the system:: + + static const struct mcb_device_id foo_ids[] = { + { .device = 0x123 }, + { } + }; + MODULE_DEVICE_TABLE(mcb, foo_ids); + + static struct mcb_driver foo_driver = { + .driver = { + .name = "foo-bar", + .owner = THIS_MODULE, + }, + .probe = foo_probe, + .remove = foo_remove, + .id_table = foo_ids, + }; + +Probing and attaching +--------------------- + +When a driver is loaded and the MCB devices it services are found, the MCB +core will call the driver's probe callback method. When the driver is removed +from the system, the MCB core will call the driver's remove callback method:: + + static int foo_probe(struct mcb_device *mdev, const struct mcb_device_id *id); + static void foo_remove(struct mcb_device *mdev); + +Initializing the driver +----------------------- + +When the kernel is booted or your foo driver module is inserted, you have to +perform driver initialization. Usually it is enough to register your driver +module at the MCB core:: + + static int __init foo_init(void) + { + return mcb_register_driver(&foo_driver); + } + module_init(foo_init); + + static void __exit foo_exit(void) + { + mcb_unregister_driver(&foo_driver); + } + module_exit(foo_exit); + +The module_mcb_driver() macro can be used to reduce the above code:: + + module_mcb_driver(foo_driver); diff --git a/Documentation/driver-api/ntb.rst b/Documentation/driver-api/ntb.rst new file mode 100644 index 000000000000..074a423c853c --- /dev/null +++ b/Documentation/driver-api/ntb.rst @@ -0,0 +1,236 @@ +=========== +NTB Drivers +=========== + +NTB (Non-Transparent Bridge) is a type of PCI-Express bridge chip that connects +the separate memory systems of two or more computers to the same PCI-Express +fabric. Existing NTB hardware supports a common feature set: doorbell +registers and memory translation windows, as well as non common features like +scratchpad and message registers. 
Scratchpad registers are read-and-writable +registers that are accessible from either side of the device, so that peers can +exchange a small amount of information at a fixed address. Message registers can +be utilized for the same purpose. Additionally they are provided with +special status bits to make sure the information isn't rewritten by another +peer. Doorbell registers provide a way for peers to send interrupt events. +Memory windows allow translated read and write access to the peer memory. + +NTB Core Driver (ntb) +===================== + +The NTB core driver defines an api wrapping the common feature set, and allows +clients interested in NTB features to discover the NTB devices supported by +hardware drivers. The term "client" is used here to mean an upper layer +component making use of the NTB api. The term "driver," or "hardware driver," +is used here to mean a driver for a specific vendor and model of NTB hardware. + +NTB Client Drivers +================== + +NTB client drivers should register with the NTB core driver. After +registering, the client probe and remove functions will be called appropriately +as ntb hardware, or hardware drivers, are inserted and removed. The +registration uses the Linux Device framework, so it should feel familiar to +anyone who has written a pci driver. + +NTB Typical client driver implementation +---------------------------------------- + +The primary purpose of NTB is to share some piece of memory between at least two +systems. So the NTB device features like Scratchpad/Message registers are +mainly used to perform the proper memory window initialization. Typically +there are two types of memory window interfaces supported by the NTB API: +inbound translation configured on the local ntb port and outbound translation +configured by the peer, on the peer ntb port. 
The first type is +depicted on the next figure:: + + Inbound translation: + + Memory: Local NTB Port: Peer NTB Port: Peer MMIO: + ____________ + | dma-mapped |-ntb_mw_set_trans(addr) | + | memory | _v____________ | ______________ + | (addr) |<======| MW xlat addr |<====| MW base addr |<== memory-mapped IO + |------------| |--------------| | |--------------| + +So typical scenario of the first type memory window initialization looks: +1) allocate a memory region, 2) put translated address to NTB config, +3) somehow notify a peer device of performed initialization, 4) peer device +maps corresponding outbound memory window so to have access to the shared +memory region. + +The second type of interface, that implies the shared windows being +initialized by a peer device, is depicted on the figure:: + + Outbound translation: + + Memory: Local NTB Port: Peer NTB Port: Peer MMIO: + ____________ ______________ + | dma-mapped | | | MW base addr |<== memory-mapped IO + | memory | | |--------------| + | (addr) |<===================| MW xlat addr |<-ntb_peer_mw_set_trans(addr) + |------------| | |--------------| + +Typical scenario of the second type interface initialization would be: +1) allocate a memory region, 2) somehow deliver a translated address to a peer +device, 3) peer puts the translated address to NTB config, 4) peer device maps +outbound memory window so to have access to the shared memory region. + +As one can see the described scenarios can be combined in one portable +algorithm. 
+ + Local device: + 1) Allocate memory for a shared window + 2) Initialize memory window by translated address of the allocated region + (it may fail if local memory window initialization is unsupported) + 3) Send the translated address and memory window index to a peer device + + Peer device: + 1) Initialize memory window with retrieved address of the allocated + by another device memory region (it may fail if peer memory window + initialization is unsupported) + 2) Map outbound memory window + +In accordance with this scenario, the NTB Memory Window API can be used as +follows: + + Local device: + 1) ntb_mw_count(pidx) - retrieve number of memory ranges, which can + be allocated for memory windows between local device and peer device + of port with specified index. + 2) ntb_get_align(pidx, midx) - retrieve parameters restricting the + shared memory region alignment and size. Then memory can be properly + allocated. + 3) Allocate physically contiguous memory region in compliance with + restrictions retrieved in 2). + 4) ntb_mw_set_trans(pidx, midx) - try to set translation address of + the memory window with specified index for the defined peer device + (it may fail if local translated address setting is not supported) + 5) Send translated base address (usually together with memory window + number) to the peer device using, for instance, scratchpad or message + registers. + + Peer device: + 1) ntb_peer_mw_set_trans(pidx, midx) - try to set received from other + device (related to pidx) translated address for specified memory + window. It may fail if retrieved address, for instance, exceeds + maximum possible address or isn't properly aligned. + 2) ntb_peer_mw_get_addr(widx) - retrieve MMIO address to map the memory + window so to have an access to the shared memory. + +Also it is worth to note, that method ntb_mw_count(pidx) should return the +same value as ntb_peer_mw_count() on the peer with port index - pidx. 
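Step 3) of the local-device sequence above (allocating memory "in compliance with restrictions retrieved in 2)") boils down to simple alignment arithmetic. The following userspace sketch shows that arithmetic only. It is not NTB API code: `struct mw_limits`, `mw_buffer_size()` and `mw_addr_ok()` are hypothetical names introduced for illustration, standing in for the address alignment, size alignment and maximum size that the alignment query reports.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the restrictions retrieved in step 2). */
struct mw_limits {
	uint64_t addr_align;	/* required alignment of the translated address */
	uint64_t size_align;	/* required granularity of the window size */
	uint64_t size_max;	/* largest region the window can translate */
};

/* Round the wanted size up to a multiple of size_align.
 * Returns 0 when the rounded size would exceed size_max. */
static uint64_t mw_buffer_size(const struct mw_limits *lim, uint64_t wanted)
{
	uint64_t size;

	assert(lim->size_align != 0);
	size = (wanted + lim->size_align - 1) / lim->size_align * lim->size_align;
	return size <= lim->size_max ? size : 0;
}

/* Check that a candidate translated address honours addr_align. */
static bool mw_addr_ok(const struct mw_limits *lim, uint64_t addr)
{
	assert(lim->addr_align != 0);
	return addr % lim->addr_align == 0;
}
```

A client would run checks of this kind before attempting to set the translation, so that a misaligned or oversized buffer is caught locally rather than by a failing call on either side.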
+ +NTB Transport Client (ntb\_transport) and NTB Netdev (ntb\_netdev) +------------------------------------------------------------------ + +The primary client for NTB is the Transport client, used in tandem with NTB +Netdev. These drivers function together to create a logical link to the peer, +across the ntb, to exchange packets of network data. The Transport client +establishes a logical link to the peer, and creates queue pairs to exchange +messages and data. The NTB Netdev then creates an ethernet device using a +Transport queue pair. Network data is copied between socket buffers and the +Transport queue pair buffer. The Transport client may be used for other things +besides Netdev, however no other applications have yet been written. + +NTB Ping Pong Test Client (ntb\_pingpong) +----------------------------------------- + +The Ping Pong test client serves as a demonstration to exercise the doorbell +and scratchpad registers of NTB hardware, and as an example simple NTB client. +Ping Pong enables the link when started, waits for the NTB link to come up, and +then proceeds to read and write the doorbell scratchpad registers of the NTB. +The peers interrupt each other using a bit mask of doorbell bits, which is +shifted by one in each round, to test the behavior of multiple doorbell bits +and interrupt vectors. The Ping Pong driver also reads the first local +scratchpad, and writes the value plus one to the first peer scratchpad, each +round before writing the peer doorbell register. + +Module Parameters: + +* unsafe - Some hardware has known issues with scratchpad and doorbell + registers. By default, Ping Pong will not attempt to exercise such + hardware. You may override this behavior at your own risk by setting + unsafe=1. +* delay\_ms - Specify the delay between receiving a doorbell + interrupt event and setting the peer doorbell register for the next + round. +* init\_db - Specify the doorbell bits to start new series of rounds. 
A new + series begins once all the doorbell bits have been shifted out of + range. +* dyndbg - It is suggested to specify dyndbg=+p when loading this module, and + then to observe debugging output on the console. + +NTB Tool Test Client (ntb\_tool) +-------------------------------- + +The Tool test client serves for debugging, primarily, ntb hardware and drivers. +The Tool provides access through debugfs for reading, setting, and clearing the +NTB doorbell, and reading and writing scratchpads. + +The Tool does not currently have any module parameters. + +Debugfs Files: + +* *debugfs*/ntb\_tool/*hw*/ + A directory in debugfs will be created for each + NTB device probed by the tool. This directory is shortened to *hw* + below. +* *hw*/db + This file is used to read, set, and clear the local doorbell. Not + all operations may be supported by all hardware. To read the doorbell, + read the file. To set the doorbell, write `s` followed by the bits to + set (eg: `echo 's 0x0101' > db`). To clear the doorbell, write `c` + followed by the bits to clear. +* *hw*/mask + This file is used to read, set, and clear the local doorbell mask. + See *db* for details. +* *hw*/peer\_db + This file is used to read, set, and clear the peer doorbell. + See *db* for details. +* *hw*/peer\_mask + This file is used to read, set, and clear the peer doorbell + mask. See *db* for details. +* *hw*/spad + This file is used to read and write local scratchpads. To read + the values of all scratchpads, read the file. To write values, write a + series of pairs of scratchpad number and value + (eg: `echo '4 0x123 7 0xabc' > spad` + # to set scratchpads `4` and `7` to `0x123` and `0xabc`, respectively). +* *hw*/peer\_spad + This file is used to read and write peer scratchpads. See + *spad* for details. + +NTB Hardware Drivers +==================== + +NTB hardware drivers should register devices with the NTB core driver. After +registering, clients probe and remove functions will be called. 
+ +NTB Intel Hardware Driver (ntb\_hw\_intel) +------------------------------------------ + +The Intel hardware driver supports NTB on Xeon and Atom CPUs. + +Module Parameters: + +* b2b\_mw\_idx + If the peer ntb is to be accessed via a memory window, then use + this memory window to access the peer ntb. A value of zero or positive + starts from the first mw idx, and a negative value starts from the last + mw idx. Both sides MUST set the same value here! The default value is + `-1`. +* b2b\_mw\_share + If the peer ntb is to be accessed via a memory window, and if + the memory window is large enough, still allow the client to use the + second half of the memory window for address translation to the peer. +* xeon\_b2b\_usd\_bar2\_addr64 + If using B2B topology on Xeon hardware, use + this 64 bit address on the bus between the NTB devices for the window + at BAR2, on the upstream side of the link. +* xeon\_b2b\_usd\_bar4\_addr64 - See *xeon\_b2b\_bar2\_addr64*. +* xeon\_b2b\_usd\_bar4\_addr32 - See *xeon\_b2b\_bar2\_addr64*. +* xeon\_b2b\_usd\_bar5\_addr32 - See *xeon\_b2b\_bar2\_addr64*. +* xeon\_b2b\_dsd\_bar2\_addr64 - See *xeon\_b2b\_bar2\_addr64*. +* xeon\_b2b\_dsd\_bar4\_addr64 - See *xeon\_b2b\_bar2\_addr64*. +* xeon\_b2b\_dsd\_bar4\_addr32 - See *xeon\_b2b\_bar2\_addr64*. +* xeon\_b2b\_dsd\_bar5\_addr32 - See *xeon\_b2b\_bar2\_addr64*. diff --git a/Documentation/driver-api/nvmem.rst b/Documentation/driver-api/nvmem.rst new file mode 100644 index 000000000000..d9d958d5c824 --- /dev/null +++ b/Documentation/driver-api/nvmem.rst @@ -0,0 +1,189 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============== +NVMEM Subsystem +=============== + + Srinivas Kandagatla + +This document explains the NVMEM Framework along with the APIs provided, +and how to use it. + +1. Introduction +=============== +*NVMEM* is the abbreviation for Non Volatile Memory layer. 
It is used to +retrieve configuration of SOC or Device specific data from non volatile +memories like eeprom, efuses and so on. + +Before this framework existed, NVMEM drivers like eeprom were stored in +drivers/misc, where they all had to duplicate pretty much the same code to +register a sysfs file, allow in-kernel users to access the content of the +devices they were driving, etc. + +This was also a problem as far as other in-kernel users were involved, since +the solutions used were pretty much different from one driver to another, there +was a rather big abstraction leak. + +This framework aims to solve these problems. It also introduces DT +representation for consumer devices to go get the data they require (MAC +Addresses, SoC/Revision ID, part numbers, and so on) from the NVMEMs. This +framework is based on regmap, so that most of the abstraction available in +regmap can be reused, across multiple types of buses. + +NVMEM Providers ++++++++++++++++ + +NVMEM provider refers to an entity that implements methods to initialize, read +and write the non-volatile memory. + +2. Registering/Unregistering the NVMEM provider +=============================================== + +A NVMEM provider can register with NVMEM core by supplying relevant +nvmem configuration to nvmem_register(), on success core would return a valid +nvmem_device pointer. + +nvmem_unregister(nvmem) is used to unregister a previously registered provider. + +For example, a simple qfprom case:: + + static struct nvmem_config econfig = { + .name = "qfprom", + .owner = THIS_MODULE, + }; + + static int qfprom_probe(struct platform_device *pdev) + { + ... + econfig.dev = &pdev->dev; + nvmem = nvmem_register(&econfig); + ... + } + +It is mandatory that the NVMEM provider has a regmap associated with its +struct device. Failure to do so would return an error code from nvmem_register(). 
+ +Users of board files can define and register nvmem cells using the +nvmem_cell_table struct:: + + static struct nvmem_cell_info foo_nvmem_cells[] = { + { + .name = "macaddr", + .offset = 0x7f00, + .bytes = ETH_ALEN, + } + }; + + static struct nvmem_cell_table foo_nvmem_cell_table = { + .nvmem_name = "i2c-eeprom", + .cells = foo_nvmem_cells, + .ncells = ARRAY_SIZE(foo_nvmem_cells), + }; + + nvmem_add_cell_table(&foo_nvmem_cell_table); + +Additionally it is possible to create nvmem cell lookup entries and register +them with the nvmem framework from machine code as shown in the example below:: + + static struct nvmem_cell_lookup foo_nvmem_lookup = { + .nvmem_name = "i2c-eeprom", + .cell_name = "macaddr", + .dev_id = "foo_mac.0", + .con_id = "mac-address", + }; + + nvmem_add_cell_lookups(&foo_nvmem_lookup, 1); + +NVMEM Consumers ++++++++++++++++ + +NVMEM consumers are the entities which make use of the NVMEM provider to +read from and write to the NVMEM. + +3. NVMEM cell based consumer APIs +================================= + +NVMEM cells are the data entries/fields in the NVMEM. +The NVMEM framework provides 3 APIs to read/write NVMEM cells:: + + struct nvmem_cell *nvmem_cell_get(struct device *dev, const char *name); + struct nvmem_cell *devm_nvmem_cell_get(struct device *dev, const char *name); + + void nvmem_cell_put(struct nvmem_cell *cell); + void devm_nvmem_cell_put(struct device *dev, struct nvmem_cell *cell); + + void *nvmem_cell_read(struct nvmem_cell *cell, ssize_t *len); + int nvmem_cell_write(struct nvmem_cell *cell, void *buf, ssize_t len); + +`*nvmem_cell_get()` apis will get a reference to nvmem cell for a given id, +and nvmem_cell_read/write() can then read or write to the cell. +Once the usage of the cell is finished the consumer should call +`*nvmem_cell_put()` to free all the memory allocated for the cell. + +4. 
Direct NVMEM device based consumer APIs +========================================== + +In some instances it is necessary to directly read/write the NVMEM. +To facilitate such consumers NVMEM framework provides below apis:: + + struct nvmem_device *nvmem_device_get(struct device *dev, const char *name); + struct nvmem_device *devm_nvmem_device_get(struct device *dev, + const char *name); + void nvmem_device_put(struct nvmem_device *nvmem); + int nvmem_device_read(struct nvmem_device *nvmem, unsigned int offset, + size_t bytes, void *buf); + int nvmem_device_write(struct nvmem_device *nvmem, unsigned int offset, + size_t bytes, void *buf); + int nvmem_device_cell_read(struct nvmem_device *nvmem, + struct nvmem_cell_info *info, void *buf); + int nvmem_device_cell_write(struct nvmem_device *nvmem, + struct nvmem_cell_info *info, void *buf); + +Before the consumers can read/write NVMEM directly, it should get hold +of nvmem_controller from one of the `*nvmem_device_get()` api. + +The difference between these apis and cell based apis is that these apis always +take nvmem_device as parameter. + +5. Releasing a reference to the NVMEM +===================================== + +When a consumer no longer needs the NVMEM, it has to release the reference +to the NVMEM it has obtained using the APIs mentioned in the above section. +The NVMEM framework provides 2 APIs to release a reference to the NVMEM:: + + void nvmem_cell_put(struct nvmem_cell *cell); + void devm_nvmem_cell_put(struct device *dev, struct nvmem_cell *cell); + void nvmem_device_put(struct nvmem_device *nvmem); + void devm_nvmem_device_put(struct device *dev, struct nvmem_device *nvmem); + +Both these APIs are used to release a reference to the NVMEM and +devm_nvmem_cell_put and devm_nvmem_device_put destroys the devres associated +with this NVMEM. + +Userspace ++++++++++ + +6. 
Userspace binary interface +============================== + +Userspace can read/write the raw NVMEM file located at:: + + /sys/bus/nvmem/devices/*/nvmem + +ex:: + + hexdump /sys/bus/nvmem/devices/qfprom0/nvmem + + 0000000 0000 0000 0000 0000 0000 0000 0000 0000 + * + 00000a0 db10 2240 0000 e000 0c00 0c00 0000 0c00 + 0000000 0000 0000 0000 0000 0000 0000 0000 0000 + ... + * + 0001000 + +7. DeviceTree Binding +===================== + +See Documentation/devicetree/bindings/nvmem/nvmem.txt diff --git a/Documentation/driver-api/parport-lowlevel.rst b/Documentation/driver-api/parport-lowlevel.rst new file mode 100644 index 000000000000..0633d70ffda7 --- /dev/null +++ b/Documentation/driver-api/parport-lowlevel.rst @@ -0,0 +1,1832 @@ +=============================== +PARPORT interface documentation +=============================== + +:Time-stamp: <2000-02-24 13:30:20 twaugh> + +Described here are the following functions: + +Global functions:: + parport_register_driver + parport_unregister_driver + parport_enumerate + parport_register_device + parport_unregister_device + parport_claim + parport_claim_or_block + parport_release + parport_yield + parport_yield_blocking + parport_wait_peripheral + parport_poll_peripheral + parport_wait_event + parport_negotiate + parport_read + parport_write + parport_open + parport_close + parport_device_id + parport_device_coords + parport_find_class + parport_find_device + parport_set_timeout + +Port functions (can be overridden by low-level drivers): + + SPP:: + port->ops->read_data + port->ops->write_data + port->ops->read_status + port->ops->read_control + port->ops->write_control + port->ops->frob_control + port->ops->enable_irq + port->ops->disable_irq + port->ops->data_forward + port->ops->data_reverse + + EPP:: + port->ops->epp_write_data + port->ops->epp_read_data + port->ops->epp_write_addr + port->ops->epp_read_addr + + ECP:: + port->ops->ecp_write_data + port->ops->ecp_read_data + port->ops->ecp_write_addr + + Other:: + 
port->ops->nibble_read_data + port->ops->byte_read_data + port->ops->compat_write_data + +The parport subsystem comprises ``parport`` (the core port-sharing +code), and a variety of low-level drivers that actually do the port +accesses. Each low-level driver handles a particular style of port +(PC, Amiga, and so on). + +The parport interface to the device driver author can be broken down +into global functions and port functions. + +The global functions are mostly for communicating between the device +driver and the parport subsystem: acquiring a list of available ports, +claiming a port for exclusive use, and so on. They also include +``generic`` functions for doing standard things that will work on any +IEEE 1284-capable architecture. + +The port functions are provided by the low-level drivers, although the +core parport module provides generic ``defaults`` for some routines. +The port functions can be split into three groups: SPP, EPP, and ECP. + +SPP (Standard Parallel Port) functions modify so-called ``SPP`` +registers: data, status, and control. The hardware may not actually +have registers exactly like that, but the PC does and this interface is +modelled after common PC implementations. Other low-level drivers may +be able to emulate most of the functionality. + +EPP (Enhanced Parallel Port) functions are provided for reading and +writing in IEEE 1284 EPP mode, and ECP (Extended Capabilities Port) +functions are used for IEEE 1284 ECP mode. (What about BECP? Does +anyone care?) + +Hardware assistance for EPP and/or ECP transfers may or may not be +available, and if it is available it may or may not be used. If +hardware is not used, the transfer will be software-driven. In order +to cope with peripherals that only tenuously support IEEE 1284, a +low-level driver specific function is provided, for altering 'fudge +factors'. 
+ +Global functions +================ + +parport_register_driver - register a device driver with parport +--------------------------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + struct parport_driver { + const char *name; + void (*attach) (struct parport *); + void (*detach) (struct parport *); + struct parport_driver *next; + }; + int parport_register_driver (struct parport_driver *driver); + +DESCRIPTION +^^^^^^^^^^^ + +In order to be notified about parallel ports when they are detected, +parport_register_driver should be called. Your driver will +immediately be notified of all ports that have already been detected, +and of each new port as low-level drivers are loaded. + +A ``struct parport_driver`` contains the textual name of your driver, +a pointer to a function to handle new ports, and a pointer to a +function to handle ports going away due to a low-level driver +unloading. Ports will only be detached if they are not being used +(i.e. there are no devices registered on them). + +The visible parts of the ``struct parport *`` argument given to +attach/detach are:: + + struct parport + { + struct parport *next; /* next parport in list */ + const char *name; /* port's name */ + unsigned int modes; /* bitfield of hardware modes */ + struct parport_device_info probe_info; + /* IEEE1284 info */ + int number; /* parport index */ + struct parport_operations *ops; + ... + }; + +There are other members of the structure, but they should not be +touched. + +The ``modes`` member summarises the capabilities of the underlying +hardware. It consists of flags which may be bitwise-ored together: + + ============================= =============================================== + PARPORT_MODE_PCSPP IBM PC registers are available, + i.e. functions that act on data, + control and status registers are + probably writing directly to the + hardware. + PARPORT_MODE_TRISTATE The data drivers may be turned off. 
+ This allows the data lines to be used + for reverse (peripheral to host) + transfers. + PARPORT_MODE_COMPAT The hardware can assist with + compatibility-mode (printer) + transfers, i.e. compat_write_block. + PARPORT_MODE_EPP The hardware can assist with EPP + transfers. + PARPORT_MODE_ECP The hardware can assist with ECP + transfers. + PARPORT_MODE_DMA The hardware can use DMA, so you might + want to pass ISA DMA-able memory + (i.e. memory allocated using the + GFP_DMA flag with kmalloc) to the + low-level driver in order to take + advantage of it. + ============================= =============================================== + +There may be other flags in ``modes`` as well. + +The contents of ``modes`` is advisory only. For example, if the +hardware is capable of DMA, and PARPORT_MODE_DMA is in ``modes``, it +doesn't necessarily mean that DMA will always be used when possible. +Similarly, hardware that is capable of assisting ECP transfers won't +necessarily be used. + +RETURN VALUE +^^^^^^^^^^^^ + +Zero on success, otherwise an error code. + +ERRORS +^^^^^^ + +None. (Can it fail? Why return int?) + +EXAMPLE +^^^^^^^ + +:: + + static void lp_attach (struct parport *port) + { + ... + private = kmalloc (...); + dev[count++] = parport_register_device (...); + ... + } + + static void lp_detach (struct parport *port) + { + ... + } + + static struct parport_driver lp_driver = { + "lp", + lp_attach, + lp_detach, + NULL /* always put NULL here */ + }; + + int lp_init (void) + { + ... + if (parport_register_driver (&lp_driver)) { + /* Failed; nothing we can do. */ + return -EIO; + } + ... 
+ } + + +SEE ALSO +^^^^^^^^ + +parport_unregister_driver, parport_register_device, parport_enumerate + + + +parport_unregister_driver - tell parport to forget about this driver +-------------------------------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + struct parport_driver { + const char *name; + void (*attach) (struct parport *); + void (*detach) (struct parport *); + struct parport_driver *next; + }; + void parport_unregister_driver (struct parport_driver *driver); + +DESCRIPTION +^^^^^^^^^^^ + +This tells parport not to notify the device driver of new ports or of +ports going away. Registered devices belonging to that driver are NOT +unregistered: parport_unregister_device must be used for each one. + +EXAMPLE +^^^^^^^ + +:: + + void cleanup_module (void) + { + ... + /* Stop notifications. */ + parport_unregister_driver (&lp_driver); + + /* Unregister devices. */ + for (i = 0; i < NUM_DEVS; i++) + parport_unregister_device (dev[i]); + ... + } + +SEE ALSO +^^^^^^^^ + +parport_register_driver, parport_enumerate + + + +parport_enumerate - retrieve a list of parallel ports (DEPRECATED) +------------------------------------------------------------------ + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + struct parport *parport_enumerate (void); + +DESCRIPTION +^^^^^^^^^^^ + +Retrieve the first of a list of valid parallel ports for this machine. +Successive parallel ports can be found using the ``struct parport +*next`` element of the ``struct parport *`` that is returned. If ``next`` +is NULL, there are no more parallel ports in the list. The number of +ports in the list will not exceed PARPORT_MAX. + +RETURN VALUE +^^^^^^^^^^^^ + +A ``struct parport *`` describing a valid parallel port for the machine, +or NULL if there are none. + +ERRORS +^^^^^^ + +This function can return NULL to indicate that there are no parallel +ports to use. 
+ +EXAMPLE +^^^^^^^ + +:: + + int detect_device (void) + { + struct parport *port; + + for (port = parport_enumerate (); + port != NULL; + port = port->next) { + /* Try to detect a device on the port... */ + ... + } + + ... + } + +NOTES +^^^^^ + +parport_enumerate is deprecated; parport_register_driver should be +used instead. + +SEE ALSO +^^^^^^^^ + +parport_register_driver, parport_unregister_driver + + + +parport_register_device - register to use a port +------------------------------------------------ + +SYNOPSIS +^^^^^^^^ + +:: + + #include <linux/parport.h> + + typedef int (*preempt_func) (void *handle); + typedef void (*wakeup_func) (void *handle); + typedef int (*irq_func) (int irq, void *handle, struct pt_regs *); + + struct pardevice *parport_register_device(struct parport *port, + const char *name, + preempt_func preempt, + wakeup_func wakeup, + irq_func irq, + int flags, + void *handle); + +DESCRIPTION +^^^^^^^^^^^ + +Use this function to register your device driver on a parallel port +(``port``). Once you have done that, you will be able to use +parport_claim and parport_release in order to use the port. + +The ``name`` argument is the name of the device that appears in the /proc +filesystem. The string must be valid for the whole lifetime of the +device (until parport_unregister_device is called). + +This function will register three callbacks into your driver: +``preempt``, ``wakeup`` and ``irq``. Each of these may be NULL in order to +indicate that you do not want a callback. + +When the ``preempt`` function is called, it is because another driver +wishes to use the parallel port. The ``preempt`` function should return +non-zero if the parallel port cannot be released yet -- if zero is +returned, the port is lost to another driver and the port must be +re-claimed before use. + +The ``wakeup`` function is called once another driver has released the +port and no other driver has yet claimed it.
You can claim the +parallel port from within the ``wakeup`` function (in which case the +claim is guaranteed to succeed), or choose not to if you don't need it +now. + +If an interrupt occurs on the parallel port your driver has claimed, +the ``irq`` function will be called. (Write something about shared +interrupts here.) + +The ``handle`` is a pointer to driver-specific data, and is passed to +the callback functions. + +``flags`` may be a bitwise combination of the following flags: + + ===================== ================================================= + Flag Meaning + ===================== ================================================= + PARPORT_DEV_EXCL The device cannot share the parallel port at all. + Use this only when absolutely necessary. + ===================== ================================================= + +The typedefs are not actually defined -- they are only shown in order +to make the function prototype more readable. + +The visible parts of the returned ``struct pardevice`` are:: + + struct pardevice { + struct parport *port; /* Associated port */ + void *private; /* Device driver's 'handle' */ + ... + }; + +RETURN VALUE +^^^^^^^^^^^^ + +A ``struct pardevice *``: a handle to the registered parallel port +device that can be used for parport_claim, parport_release, etc. + +ERRORS +^^^^^^ + +A return value of NULL indicates that there was a problem registering +a device on that port. 
+ +EXAMPLE +^^^^^^^ + +:: + + static int preempt (void *handle) + { + if (busy_right_now) + return 1; + + must_reclaim_port = 1; + return 0; + } + + static void wakeup (void *handle) + { + struct toaster *private = handle; + struct pardevice *dev = private->dev; + if (!dev) return; /* avoid races */ + + if (want_port) + parport_claim (dev); + } + + static int toaster_detect (struct toaster *private, struct parport *port) + { + private->dev = parport_register_device (port, "toaster", preempt, + wakeup, NULL, 0, + private); + if (!private->dev) + /* Couldn't register with parport. */ + return -EIO; + + must_reclaim_port = 0; + busy_right_now = 1; + parport_claim_or_block (private->dev); + ... + /* Don't need the port while the toaster warms up. */ + busy_right_now = 0; + ... + busy_right_now = 1; + if (must_reclaim_port) { + parport_claim_or_block (private->dev); + must_reclaim_port = 0; + } + ... + } + +SEE ALSO +^^^^^^^^ + +parport_unregister_device, parport_claim + + + +parport_unregister_device - finish using a port +----------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include <linux/parport.h> + + void parport_unregister_device (struct pardevice *dev); + +DESCRIPTION +^^^^^^^^^^^ + +This function is the opposite of parport_register_device. After using +parport_unregister_device, ``dev`` is no longer a valid device handle. + +You should not unregister a device that is currently claimed, although +if you do it will be released automatically. + +EXAMPLE +^^^^^^^ + +:: + + ... + kfree (dev->private); /* before we lose the pointer */ + parport_unregister_device (dev); + ...
+ +SEE ALSO +^^^^^^^^ + + +parport_unregister_driver + +parport_claim, parport_claim_or_block - claim the parallel port for a device +---------------------------------------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include <linux/parport.h> + + int parport_claim (struct pardevice *dev); + int parport_claim_or_block (struct pardevice *dev); + +DESCRIPTION +^^^^^^^^^^^ + +These functions attempt to gain control of the parallel port on which +``dev`` is registered. ``parport_claim`` does not block, but +``parport_claim_or_block`` may do. (Put something here about blocking +interruptibly or non-interruptibly.) + +You should not try to claim a port that you have already claimed. + +RETURN VALUE +^^^^^^^^^^^^ + +A return value of zero indicates that the port was successfully +claimed, and the caller now has possession of the parallel port. + +If ``parport_claim_or_block`` blocks before returning successfully, the +return value is positive. + +ERRORS +^^^^^^ + +========== ========================================================== + -EAGAIN The port is unavailable at the moment, but another attempt + to claim it may succeed. +========== ========================================================== + +SEE ALSO +^^^^^^^^ + + +parport_release + +parport_release - release the parallel port +------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include <linux/parport.h> + + void parport_release (struct pardevice *dev); + +DESCRIPTION +^^^^^^^^^^^ + +Once a parallel port device has been claimed, it can be released using +``parport_release``. It cannot fail, but you should not release a +device that you do not have possession of. + +EXAMPLE +^^^^^^^ + +:: + + static size_t write (struct pardevice *dev, const void *buf, + size_t len) + { + ... + written = dev->port->ops->ecp_write_data (dev->port, buf, + len, 0); + parport_release (dev); + ...
+ } + + +SEE ALSO +^^^^^^^^ + +change_mode, parport_claim, parport_claim_or_block, parport_yield + + + +parport_yield, parport_yield_blocking - temporarily release a parallel port +--------------------------------------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include <linux/parport.h> + + int parport_yield (struct pardevice *dev); + int parport_yield_blocking (struct pardevice *dev); + +DESCRIPTION +^^^^^^^^^^^ + +When a driver has control of a parallel port, it may allow another +driver to temporarily ``borrow`` it. ``parport_yield`` does not block; +``parport_yield_blocking`` may do. + +RETURN VALUE +^^^^^^^^^^^^ + +A return value of zero indicates that the caller still owns the port +and the call did not block. + +A positive return value from ``parport_yield_blocking`` indicates that +the caller still owns the port and the call blocked. + +A return value of -EAGAIN indicates that the caller no longer owns the +port, and it must be re-claimed before use. + +ERRORS +^^^^^^ + +========= ========================================================== + -EAGAIN Ownership of the parallel port was given away. +========= ========================================================== + +SEE ALSO +^^^^^^^^ + +parport_release + + + +parport_wait_peripheral - wait for status lines, up to 35ms +----------------------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include <linux/parport.h> + + int parport_wait_peripheral (struct parport *port, + unsigned char mask, + unsigned char val); + +DESCRIPTION +^^^^^^^^^^^ + +Wait for the status lines in mask to match the values in val.
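
As a rough illustration of the mask/val comparison, the sketch below models it in plain userspace C. It is not the kernel implementation: ``status_matches``, ``poll_status``, ``counting_status`` and the ``read_status_fn`` callback type are hypothetical names invented for this example, and the timeout is modelled as a simple poll count rather than the 35ms limit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: returns the current status byte. */
typedef unsigned char (*read_status_fn) (void *ctx);

/*
 * True when the status lines selected by mask have exactly the values
 * given in val.  Bits set in val but not in mask can never match.
 */
static int status_matches (unsigned char status, unsigned char mask,
                           unsigned char val)
{
        return (status & mask) == val;
}

/*
 * Poll up to max_polls times.  Returns 0 on a match and 1 on timeout;
 * the -EINTR (signal pending) case is not modelled in this sketch.
 */
static int poll_status (read_status_fn read_status, void *ctx,
                        unsigned char mask, unsigned char val,
                        int max_polls)
{
        int i;

        assert (read_status != NULL);
        for (i = 0; i < max_polls; i++) {
                if (status_matches (read_status (ctx), mask, val))
                        return 0;
        }
        return 1;
}

/* Demo status source: returns *ctx and then increments it. */
static unsigned char counting_status (void *ctx)
{
        unsigned char *next = ctx;

        return (*next)++;
}
```

With a mask and val that both select a single line, only that line has to reach the requested level; all other status bits are ignored by the comparison.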
+ +RETURN VALUE +^^^^^^^^^^^^ + +======== ========================================================== + -EINTR a signal is pending + 0 the status lines in mask have values in val + 1 timed out while waiting (35ms elapsed) +======== ========================================================== + +SEE ALSO +^^^^^^^^ + +parport_poll_peripheral + + + +parport_poll_peripheral - wait for status lines, in usec +-------------------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include <linux/parport.h> + + int parport_poll_peripheral (struct parport *port, + unsigned char mask, + unsigned char val, + int usec); + +DESCRIPTION +^^^^^^^^^^^ + +Wait for the status lines in mask to match the values in val. + +RETURN VALUE +^^^^^^^^^^^^ + +======== ========================================================== + -EINTR a signal is pending + 0 the status lines in mask have values in val + 1 timed out while waiting (usec microseconds have elapsed) +======== ========================================================== + +SEE ALSO +^^^^^^^^ + +parport_wait_peripheral + + + +parport_wait_event - wait for an event on a port +------------------------------------------------ + +SYNOPSIS +^^^^^^^^ + +:: + + #include <linux/parport.h> + + int parport_wait_event (struct parport *port, signed long timeout); + +DESCRIPTION +^^^^^^^^^^^ + +Wait for an event (e.g. interrupt) on a port. The timeout is in +jiffies. + +RETURN VALUE +^^^^^^^^^^^^ + +======= ========================================================== + 0 success + <0 error (exit as soon as possible) + >0 timed out +======= ========================================================== + +parport_negotiate - perform IEEE 1284 negotiation +------------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include <linux/parport.h> + + int parport_negotiate (struct parport *, int mode); + +DESCRIPTION +^^^^^^^^^^^ + +Perform IEEE 1284 negotiation.
+ +RETURN VALUE +^^^^^^^^^^^^ + +======= ========================================================== + 0 handshake OK; IEEE 1284 peripheral and mode available + -1 handshake failed; peripheral not compliant (or none present) + 1 handshake OK; IEEE 1284 peripheral present but mode not + available +======= ========================================================== + +SEE ALSO +^^^^^^^^ + +parport_read, parport_write + + + +parport_read - read data from device +------------------------------------ + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + ssize_t parport_read (struct parport *, void *buf, size_t len); + +DESCRIPTION +^^^^^^^^^^^ + +Read data from device in current IEEE 1284 transfer mode. This only +works for modes that support reverse data transfer. + +RETURN VALUE +^^^^^^^^^^^^ + +If negative, an error code; otherwise the number of bytes transferred. + +SEE ALSO +^^^^^^^^ + +parport_write, parport_negotiate + + + +parport_write - write data to device +------------------------------------ + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + ssize_t parport_write (struct parport *, const void *buf, size_t len); + +DESCRIPTION +^^^^^^^^^^^ + +Write data to device in current IEEE 1284 transfer mode. This only +works for modes that support forward data transfer. + +RETURN VALUE +^^^^^^^^^^^^ + +If negative, an error code; otherwise the number of bytes transferred. + +SEE ALSO +^^^^^^^^ + +parport_read, parport_negotiate + + + +parport_open - register device for particular device number +----------------------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + struct pardevice *parport_open (int devnum, const char *name, + int (*pf) (void *), + void (*kf) (void *), + void (*irqf) (int, void *, + struct pt_regs *), + int flags, void *handle); + +DESCRIPTION +^^^^^^^^^^^ + +This is like parport_register_device but takes a device number instead +of a pointer to a struct parport. + +RETURN VALUE +^^^^^^^^^^^^ + +See parport_register_device. 
If no device is associated with devnum, +NULL is returned. + +SEE ALSO +^^^^^^^^ + +parport_register_device + + + +parport_close - unregister device for particular device number +-------------------------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + void parport_close (struct pardevice *dev); + +DESCRIPTION +^^^^^^^^^^^ + +This is the equivalent of parport_unregister_device for parport_open. + +SEE ALSO +^^^^^^^^ + +parport_unregister_device, parport_open + + + +parport_device_id - obtain IEEE 1284 Device ID +---------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + ssize_t parport_device_id (int devnum, char *buffer, size_t len); + +DESCRIPTION +^^^^^^^^^^^ + +Obtains the IEEE 1284 Device ID associated with a given device. + +RETURN VALUE +^^^^^^^^^^^^ + +If negative, an error code; otherwise, the number of bytes of buffer +that contain the device ID. The format of the device ID is as +follows:: + + [length][ID] + +The first two bytes indicate the inclusive length of the entire Device +ID, and are in big-endian order. The ID is a sequence of pairs of the +form:: + + key:value; + +NOTES +^^^^^ + +Many devices have ill-formed IEEE 1284 Device IDs. + +SEE ALSO +^^^^^^^^ + +parport_find_class, parport_find_device + + + +parport_device_coords - convert device number to device coordinates +------------------------------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + int parport_device_coords (int devnum, int *parport, int *mux, + int *daisy); + +DESCRIPTION +^^^^^^^^^^^ + +Convert between device number (zero-based) and device coordinates +(port, multiplexor, daisy chain address). + +RETURN VALUE +^^^^^^^^^^^^ + +Zero on success, in which case the coordinates are (``*parport``, ``*mux``, +``*daisy``). 
+ +SEE ALSO +^^^^^^^^ + +parport_open, parport_device_id + + + +parport_find_class - find a device by its class +----------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include <linux/parport.h> + + typedef enum { + PARPORT_CLASS_LEGACY = 0, /* Non-IEEE1284 device */ + PARPORT_CLASS_PRINTER, + PARPORT_CLASS_MODEM, + PARPORT_CLASS_NET, + PARPORT_CLASS_HDC, /* Hard disk controller */ + PARPORT_CLASS_PCMCIA, + PARPORT_CLASS_MEDIA, /* Multimedia device */ + PARPORT_CLASS_FDC, /* Floppy disk controller */ + PARPORT_CLASS_PORTS, + PARPORT_CLASS_SCANNER, + PARPORT_CLASS_DIGCAM, + PARPORT_CLASS_OTHER, /* Anything else */ + PARPORT_CLASS_UNSPEC, /* No CLS field in ID */ + PARPORT_CLASS_SCSIADAPTER + } parport_device_class; + + int parport_find_class (parport_device_class cls, int from); + +DESCRIPTION +^^^^^^^^^^^ + +Find a device by class. The search starts from device number from+1. + +RETURN VALUE +^^^^^^^^^^^^ + +The device number of the next device in that class, or -1 if no such +device exists. + +NOTES +^^^^^ + +Example usage:: + + int devnum = -1; + while ((devnum = parport_find_class (PARPORT_CLASS_DIGCAM, devnum)) != -1) { + struct pardevice *dev = parport_open (devnum, ...); + ... + } + +SEE ALSO +^^^^^^^^ + +parport_find_device, parport_open, parport_device_id + + + +parport_find_device - find a device by vendor and model +------------------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include <linux/parport.h> + + int parport_find_device (const char *mfg, const char *mdl, int from); + +DESCRIPTION +^^^^^^^^^^^ + +Find a device by vendor and model. The search starts from device +number from+1. + +RETURN VALUE +^^^^^^^^^^^^ + +The device number of the next device matching the specifications, or +-1 if no such device exists. + +NOTES +^^^^^ + +Example usage:: + + int devnum = -1; + while ((devnum = parport_find_device ("IOMEGA", "ZIP+", devnum)) != -1) { + struct pardevice *dev = parport_open (devnum, ...); + ...
+ } + +SEE ALSO +^^^^^^^^ + +parport_find_class, parport_open, parport_device_id + + + +parport_set_timeout - set the inactivity timeout +------------------------------------------------ + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + long parport_set_timeout (struct pardevice *dev, long inactivity); + +DESCRIPTION +^^^^^^^^^^^ + +Set the inactivity timeout, in jiffies, for a registered device. The +previous timeout is returned. + +RETURN VALUE +^^^^^^^^^^^^ + +The previous timeout, in jiffies. + +NOTES +^^^^^ + +Some of the port->ops functions for a parport may take time, owing to +delays at the peripheral. After the peripheral has not responded for +``inactivity`` jiffies, a timeout will occur and the blocking function +will return. + +A timeout of 0 jiffies is a special case: the function must do as much +as it can without blocking or leaving the hardware in an unknown +state. If port operations are performed from within an interrupt +handler, for instance, a timeout of 0 jiffies should be used. + +Once set for a registered device, the timeout will remain at the set +value until set again. + +SEE ALSO +^^^^^^^^ + +port->ops->xxx_read/write_yyy + + + + +PORT FUNCTIONS +============== + +The functions in the port->ops structure (struct parport_operations) +are provided by the low-level driver responsible for that port. + +port->ops->read_data - read the data register +--------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + struct parport_operations { + ... + unsigned char (*read_data) (struct parport *port); + ... + }; + +DESCRIPTION +^^^^^^^^^^^ + +If port->modes contains the PARPORT_MODE_TRISTATE flag and the +PARPORT_CONTROL_DIRECTION bit in the control register is set, this +returns the value on the data pins. If port->modes contains the +PARPORT_MODE_TRISTATE flag and the PARPORT_CONTROL_DIRECTION bit is +not set, the return value _may_ be the last value written to the data +register. Otherwise the return value is undefined. 
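
The three cases above can be summarised in a small decision function. This is only a sketch in userspace C: ``classify_read_data``, the ``READ_DATA_*`` enumerators and ``check_read_data_cases`` are hypothetical names, and the two ``DEMO_*`` bit values are placeholders chosen for this example, so real code must use the PARPORT_MODE_TRISTATE and PARPORT_CONTROL_DIRECTION definitions from the kernel headers instead.

```c
#include <assert.h>

/* Placeholder bit values for this sketch only (assumed, not authoritative). */
#define DEMO_MODE_TRISTATE      0x02u
#define DEMO_CONTROL_DIRECTION  0x20u

enum read_data_result {
        READ_DATA_PINS,         /* the value currently on the data pins */
        READ_DATA_MAYBE_LAST,   /* may be the last value written */
        READ_DATA_UNDEFINED     /* nothing can be assumed */
};

/* Decide what a read_data return value means for the given port state. */
static enum read_data_result classify_read_data (unsigned int modes,
                                                 unsigned char control)
{
        if (!(modes & DEMO_MODE_TRISTATE))
                return READ_DATA_UNDEFINED;
        if (control & DEMO_CONTROL_DIRECTION)
                return READ_DATA_PINS;
        return READ_DATA_MAYBE_LAST;
}

/* Exercise each of the three documented cases once. */
static void check_read_data_cases (void)
{
        assert (classify_read_data (DEMO_MODE_TRISTATE,
                                    DEMO_CONTROL_DIRECTION) == READ_DATA_PINS);
        assert (classify_read_data (DEMO_MODE_TRISTATE, 0) == READ_DATA_MAYBE_LAST);
        assert (classify_read_data (0, DEMO_CONTROL_DIRECTION) == READ_DATA_UNDEFINED);
}
```

A driver that needs a reliable read from the data pins would therefore want to confirm the first case before trusting the value.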
+ +SEE ALSO +^^^^^^^^ + +write_data, read_status, write_control + + + +port->ops->write_data - write the data register +----------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + struct parport_operations { + ... + void (*write_data) (struct parport *port, unsigned char d); + ... + }; + +DESCRIPTION +^^^^^^^^^^^ + +Writes to the data register. May have side-effects (a STROBE pulse, +for instance). + +SEE ALSO +^^^^^^^^ + +read_data, read_status, write_control + + + +port->ops->read_status - read the status register +------------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + struct parport_operations { + ... + unsigned char (*read_status) (struct parport *port); + ... + }; + +DESCRIPTION +^^^^^^^^^^^ + +Reads from the status register. This is a bitmask: + +- PARPORT_STATUS_ERROR (printer fault, "nFault") +- PARPORT_STATUS_SELECT (on-line, "Select") +- PARPORT_STATUS_PAPEROUT (no paper, "PError") +- PARPORT_STATUS_ACK (handshake, "nAck") +- PARPORT_STATUS_BUSY (busy, "Busy") + +There may be other bits set. + +SEE ALSO +^^^^^^^^ + +read_data, write_data, write_control + + + +port->ops->read_control - read the control register +--------------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + struct parport_operations { + ... + unsigned char (*read_control) (struct parport *port); + ... + }; + +DESCRIPTION +^^^^^^^^^^^ + +Returns the last value written to the control register (either from +write_control or frob_control). No port access is performed. + +SEE ALSO +^^^^^^^^ + +read_data, write_data, read_status, write_control + + + +port->ops->write_control - write the control register +----------------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + struct parport_operations { + ... + void (*write_control) (struct parport *port, unsigned char s); + ... + }; + +DESCRIPTION +^^^^^^^^^^^ + +Writes to the control register. 
This is a bitmask:: + + _______ + - PARPORT_CONTROL_STROBE (nStrobe) + _______ + - PARPORT_CONTROL_AUTOFD (nAutoFd) + _____ + - PARPORT_CONTROL_INIT (nInit) + _________ + - PARPORT_CONTROL_SELECT (nSelectIn) + +SEE ALSO +^^^^^^^^ + +read_data, write_data, read_status, frob_control + + + +port->ops->frob_control - write control register bits +----------------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + struct parport_operations { + ... + unsigned char (*frob_control) (struct parport *port, + unsigned char mask, + unsigned char val); + ... + }; + +DESCRIPTION +^^^^^^^^^^^ + +This is equivalent to reading from the control register, masking out +the bits in mask, exclusive-or'ing with the bits in val, and writing +the result to the control register. + +As some ports don't allow reads from the control port, a software copy +of its contents is maintained, so frob_control is in fact only one +port access. + +SEE ALSO +^^^^^^^^ + +read_data, write_data, read_status, write_control + + + +port->ops->enable_irq - enable interrupt generation +--------------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + struct parport_operations { + ... + void (*enable_irq) (struct parport *port); + ... + }; + +DESCRIPTION +^^^^^^^^^^^ + +The parallel port hardware is instructed to generate interrupts at +appropriate moments, although those moments are +architecture-specific. For the PC architecture, interrupts are +commonly generated on the rising edge of nAck. + +SEE ALSO +^^^^^^^^ + +disable_irq + + + +port->ops->disable_irq - disable interrupt generation +----------------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + struct parport_operations { + ... + void (*disable_irq) (struct parport *port); + ... + }; + +DESCRIPTION +^^^^^^^^^^^ + +The parallel port hardware is instructed not to generate interrupts. +The interrupt itself is not masked. 
+ +SEE ALSO +^^^^^^^^ + +enable_irq + + + +port->ops->data_forward - enable data drivers +--------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + struct parport_operations { + ... + void (*data_forward) (struct parport *port); + ... + }; + +DESCRIPTION +^^^^^^^^^^^ + +Enables the data line drivers, for 8-bit host-to-peripheral +communications. + +SEE ALSO +^^^^^^^^ + +data_reverse + + + +port->ops->data_reverse - tristate the buffer +--------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + struct parport_operations { + ... + void (*data_reverse) (struct parport *port); + ... + }; + +DESCRIPTION +^^^^^^^^^^^ + +Places the data bus in a high impedance state, if port->modes has the +PARPORT_MODE_TRISTATE bit set. + +SEE ALSO +^^^^^^^^ + +data_forward + + + +port->ops->epp_write_data - write EPP data +------------------------------------------ + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + struct parport_operations { + ... + size_t (*epp_write_data) (struct parport *port, const void *buf, + size_t len, int flags); + ... + }; + +DESCRIPTION +^^^^^^^^^^^ + +Writes data in EPP mode, and returns the number of bytes written. + +The ``flags`` parameter may be one or more of the following, +bitwise-or'ed together: + +======================= ================================================= +PARPORT_EPP_FAST Use fast transfers. Some chips provide 16-bit and + 32-bit registers. However, if a transfer + times out, the return value may be unreliable. +======================= ================================================= + +SEE ALSO +^^^^^^^^ + +epp_read_data, epp_write_addr, epp_read_addr + + + +port->ops->epp_read_data - read EPP data +---------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + struct parport_operations { + ... + size_t (*epp_read_data) (struct parport *port, void *buf, + size_t len, int flags); + ... 
+ }; + +DESCRIPTION +^^^^^^^^^^^ + +Reads data in EPP mode, and returns the number of bytes read. + +The ``flags`` parameter may be one or more of the following, +bitwise-or'ed together: + +======================= ================================================= +PARPORT_EPP_FAST Use fast transfers. Some chips provide 16-bit and + 32-bit registers. However, if a transfer + times out, the return value may be unreliable. +======================= ================================================= + +SEE ALSO +^^^^^^^^ + +epp_write_data, epp_write_addr, epp_read_addr + + + +port->ops->epp_write_addr - write EPP address +--------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + struct parport_operations { + ... + size_t (*epp_write_addr) (struct parport *port, + const void *buf, size_t len, int flags); + ... + }; + +DESCRIPTION +^^^^^^^^^^^ + +Writes EPP addresses (8 bits each), and returns the number written. + +The ``flags`` parameter may be one or more of the following, +bitwise-or'ed together: + +======================= ================================================= +PARPORT_EPP_FAST Use fast transfers. Some chips provide 16-bit and + 32-bit registers. However, if a transfer + times out, the return value may be unreliable. +======================= ================================================= + +(Does PARPORT_EPP_FAST make sense for this function?) + +SEE ALSO +^^^^^^^^ + +epp_write_data, epp_read_data, epp_read_addr + + + +port->ops->epp_read_addr - read EPP address +------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + struct parport_operations { + ... + size_t (*epp_read_addr) (struct parport *port, void *buf, + size_t len, int flags); + ... + }; + +DESCRIPTION +^^^^^^^^^^^ + +Reads EPP addresses (8 bits each), and returns the number read. 
+ +The ``flags`` parameter may be one or more of the following, +bitwise-or'ed together: + +======================= ================================================= +PARPORT_EPP_FAST Use fast transfers. Some chips provide 16-bit and + 32-bit registers. However, if a transfer + times out, the return value may be unreliable. +======================= ================================================= + +(Does PARPORT_EPP_FAST make sense for this function?) + +SEE ALSO +^^^^^^^^ + +epp_write_data, epp_read_data, epp_write_addr + + + +port->ops->ecp_write_data - write a block of ECP data +----------------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + struct parport_operations { + ... + size_t (*ecp_write_data) (struct parport *port, + const void *buf, size_t len, int flags); + ... + }; + +DESCRIPTION +^^^^^^^^^^^ + +Writes a block of ECP data. The ``flags`` parameter is ignored. + +RETURN VALUE +^^^^^^^^^^^^ + +The number of bytes written. + +SEE ALSO +^^^^^^^^ + +ecp_read_data, ecp_write_addr + + + +port->ops->ecp_read_data - read a block of ECP data +--------------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + struct parport_operations { + ... + size_t (*ecp_read_data) (struct parport *port, + void *buf, size_t len, int flags); + ... + }; + +DESCRIPTION +^^^^^^^^^^^ + +Reads a block of ECP data. The ``flags`` parameter is ignored. + +RETURN VALUE +^^^^^^^^^^^^ + +The number of bytes read. NB. There may be more unread data in a +FIFO. Is there a way of stunning the FIFO to prevent this? + +SEE ALSO +^^^^^^^^ + +ecp_write_block, ecp_write_addr + + + +port->ops->ecp_write_addr - write a block of ECP addresses +---------------------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + struct parport_operations { + ... + size_t (*ecp_write_addr) (struct parport *port, + const void *buf, size_t len, int flags); + ... 
+ }; + +DESCRIPTION +^^^^^^^^^^^ + +Writes a block of ECP addresses. The ``flags`` parameter is ignored. + +RETURN VALUE +^^^^^^^^^^^^ + +The number of bytes written. + +NOTES +^^^^^ + +This may use a FIFO, and if so shall not return until the FIFO is empty. + +SEE ALSO +^^^^^^^^ + +ecp_read_data, ecp_write_data + + + +port->ops->nibble_read_data - read a block of data in nibble mode +----------------------------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + struct parport_operations { + ... + size_t (*nibble_read_data) (struct parport *port, + void *buf, size_t len, int flags); + ... + }; + +DESCRIPTION +^^^^^^^^^^^ + +Reads a block of data in nibble mode. The ``flags`` parameter is ignored. + +RETURN VALUE +^^^^^^^^^^^^ + +The number of whole bytes read. + +SEE ALSO +^^^^^^^^ + +byte_read_data, compat_write_data + + + +port->ops->byte_read_data - read a block of data in byte mode +------------------------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + struct parport_operations { + ... + size_t (*byte_read_data) (struct parport *port, + void *buf, size_t len, int flags); + ... + }; + +DESCRIPTION +^^^^^^^^^^^ + +Reads a block of data in byte mode. The ``flags`` parameter is ignored. + +RETURN VALUE +^^^^^^^^^^^^ + +The number of bytes read. + +SEE ALSO +^^^^^^^^ + +nibble_read_data, compat_write_data + + + +port->ops->compat_write_data - write a block of data in compatibility mode +-------------------------------------------------------------------------- + +SYNOPSIS +^^^^^^^^ + +:: + + #include + + struct parport_operations { + ... + size_t (*compat_write_data) (struct parport *port, + const void *buf, size_t len, int flags); + ... + }; + +DESCRIPTION +^^^^^^^^^^^ + +Writes a block of data in compatibility mode. The ``flags`` parameter +is ignored. + +RETURN VALUE +^^^^^^^^^^^^ + +The number of bytes written. 
+ +SEE ALSO +^^^^^^^^ + +nibble_read_data, byte_read_data diff --git a/Documentation/driver-api/pti_intel_mid.rst b/Documentation/driver-api/pti_intel_mid.rst new file mode 100644 index 000000000000..20f1cff42d5f --- /dev/null +++ b/Documentation/driver-api/pti_intel_mid.rst @@ -0,0 +1,106 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +Intel MID PTI +============= + +The Intel MID PTI project is HW implemented in Intel Atom +system-on-a-chip designs based on the Parallel Trace +Interface for MIPI P1149.7 cJTAG standard. The kernel solution +for this platform involves the following files:: + + ./include/linux/pti.h + ./drivers/.../n_tracesink.h + ./drivers/.../n_tracerouter.c + ./drivers/.../n_tracesink.c + ./drivers/.../pti.c + +pti.c is the driver that enables various debugging features +popular on platforms from certain mobile manufacturers. +n_tracerouter.c and n_tracesink.c allow extra system information to +be collected and routed to the pti driver, such as trace +debugging data from a modem. Although n_tracerouter +and n_tracesink are a part of the complete PTI solution, +these two line disciplines can work separately from +pti.c and route any data stream from one /dev/tty node +to another /dev/tty node via kernel-space. This provides +a stable, reliable connection that will not break unless +the user-space application shuts down (plus avoids +kernel->user->kernel context switch overheads of routing +data). + +An example debugging usage for this driver system: + + * Hook /dev/ttyPTI0 to syslogd. Opening this port will also start + a console device to further capture debugging messages to PTI. + * Hook /dev/ttyPTI1 to modem debugging data to write to PTI HW. + This is where n_tracerouter and n_tracesink are used. + * Hook /dev/pti to a user-level debugging application for writing + to PTI HW. + * `Use mipi_` Kernel Driver API in other device drivers for + debugging to PTI by first requesting a PTI write address via + mipi_request_masterchannel(1). 
+ +Below is example pseudo-code on how a 'privileged' application +can hook up n_tracerouter and n_tracesink to any tty on +a system. 'Privileged' means the application has enough +privileges to successfully manipulate the ldisc drivers +but is not just blindly executing as 'root'. Keep in mind +the use of ioctl(,TIOCSETD,) is not specific to the n_tracerouter +and n_tracesink line discipline drivers but is a generic +operation for a program to use a line discipline driver +on a tty port other than the default n_tty:: + + /////////// To hook up n_tracerouter and n_tracesink ///////// + + // Note that n_tracerouter depends on n_tracesink. + #include + #define ONE_TTY "/dev/ttyOne" + #define TWO_TTY "/dev/ttyTwo" + + // needed global to hang onto ldisc connection + static int g_fd_source = -1; + static int g_fd_sink = -1; + + // these two vars used to grab LDISC values from loaded ldisc drivers + // in OS. Look at /proc/tty/ldiscs to get the right numbers from + // the ldiscs loaded in the system. + int source_ldisc_num = -1, sink_ldisc_num = -1; + int retval; + + g_fd_source = open(ONE_TTY, O_RDWR); // must be R/W + g_fd_sink = open(TWO_TTY, O_RDWR); // must be R/W + + if ((g_fd_source < 0) || (g_fd_sink < 0)) { + // doubt you'll want to use these exact error lines of code + printf("Error on open(). errno: %d\n", errno); + return errno; + } + + retval = ioctl(g_fd_sink, TIOCSETD, &sink_ldisc_num); + if (retval < 0) { + printf("Error on ioctl(). errno: %d\n", errno); + return errno; + } + + retval = ioctl(g_fd_source, TIOCSETD, &source_ldisc_num); + if (retval < 0) { + printf("Error on ioctl(). errno: %d\n", errno); + return errno; + } + + /////////// To disconnect n_tracerouter and n_tracesink //////// + + // First make sure data through the ldiscs has stopped. + + // Second, disconnect ldiscs. This provides a + // little cleaner shutdown on tty stack.
+ sink_ldisc_num = 0; + source_ldisc_num = 0; + ioctl(g_fd_sink, TIOCSETD, &sink_ldisc_num); + ioctl(g_fd_source, TIOCSETD, &source_ldisc_num); + + // Third, program closes connection, and cleanup: + close(g_fd_sink); + close(g_fd_source); + g_fd_sink = g_fd_source = -1; diff --git a/Documentation/driver-api/pwm.rst b/Documentation/driver-api/pwm.rst new file mode 100644 index 000000000000..ab62f1bb0366 --- /dev/null +++ b/Documentation/driver-api/pwm.rst @@ -0,0 +1,165 @@ +====================================== +Pulse Width Modulation (PWM) interface +====================================== + +This provides an overview of the Linux PWM interface. + +PWMs are commonly used for controlling LEDs, fans or vibrators in +cell phones. PWMs with a fixed purpose have no need to implement +the Linux PWM API (although they could). However, PWMs are often +found as discrete devices on SoCs which have no fixed purpose. It's +up to the board designer to connect them to LEDs or fans. To provide +this kind of flexibility the generic PWM API exists. + +Identifying PWMs +---------------- + +Users of the legacy PWM API use unique IDs to refer to PWM devices. + +Instead of referring to a PWM device via its unique ID, board setup code +should register a static mapping that can be used to match PWM +consumers to providers, as given in the following example:: + + static struct pwm_lookup board_pwm_lookup[] = { + PWM_LOOKUP("tegra-pwm", 0, "pwm-backlight", NULL, + 50000, PWM_POLARITY_NORMAL), + }; + + static void __init board_init(void) + { + ... + pwm_add_table(board_pwm_lookup, ARRAY_SIZE(board_pwm_lookup)); + ... + } + +Using PWMs +---------- + +Legacy users can request a PWM device using pwm_request() and free it +after usage with pwm_free(). + +New users should use the pwm_get() function and pass to it the consumer +device or a consumer name. pwm_put() is used to free the PWM device. Managed +variants of these functions, devm_pwm_get() and devm_pwm_put(), also exist.
+ +After being requested, a PWM has to be configured using:: + + int pwm_apply_state(struct pwm_device *pwm, struct pwm_state *state); + +This API controls both the PWM period/duty_cycle config and the +enable/disable state. + +The pwm_config(), pwm_enable() and pwm_disable() functions are just wrappers +around pwm_apply_state() and should not be used if the user wants to change +several parameters at once. For example, if you see pwm_config() and +pwm_{enable,disable}() calls in the same function, this probably means you +should switch to pwm_apply_state(). + +The PWM user API also allows one to query the PWM state with pwm_get_state(). + +In addition to the PWM state, the PWM API also exposes PWM arguments, which +are the reference PWM config one should use on this PWM. +PWM arguments are usually platform-specific and allow the PWM user to only +care about the duty cycle relative to the full period (like, duty = 50% of the +period). struct pwm_args contains 2 fields (period and polarity) and should +be used to set the initial PWM config (usually done in the probe function +of the PWM user). PWM arguments are retrieved with pwm_get_args(). + +All consumers should really be reconfiguring the PWM upon resume as +appropriate. This is the only way to ensure that everything is resumed in +the proper order. + +Using PWMs with the sysfs interface +----------------------------------- + +If CONFIG_SYSFS is enabled in your kernel configuration, a simple sysfs +interface is provided to use the PWMs from userspace. It is exposed at +/sys/class/pwm/. Each probed PWM controller/chip will be exported as +pwmchipN, where N is the base of the PWM chip. Inside the directory you +will find: + + npwm + The number of PWM channels this chip supports (read-only). + + export + Exports a PWM channel for use with sysfs (write-only). + + unexport + Unexports a PWM channel from sysfs (write-only). + +The PWM channels are numbered using a per-chip index from 0 to npwm-1.
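As a rough userspace sketch of the export step described above: a program builds the path of the chip's ``export`` attribute and writes the per-chip channel index to it. The helper names pwmchip_attr_path() and pwm_export_channel() are hypothetical, not part of any kernel or library API, and the chip and channel numbers a caller passes are assumptions that depend on the actual system.

```c
#include <stdio.h>

/*
 * Hypothetical helper: format "/sys/class/pwm/pwmchip<chip>/<attr>" into buf.
 * Returns 0 on success, -1 if the path did not fit in the buffer.
 */
int pwmchip_attr_path(char *buf, size_t len, unsigned int chip,
                      const char *attr)
{
    int n = snprintf(buf, len, "/sys/class/pwm/pwmchip%u/%s", chip, attr);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: export one channel by writing its per-chip index
 * (0 .. npwm-1) to the write-only "export" attribute.
 * Returns 0 on success, -1 on any failure.
 */
int pwm_export_channel(unsigned int chip, unsigned int channel)
{
    char path[64];
    FILE *f;
    int ok;

    if (pwmchip_attr_path(path, sizeof(path), chip, "export") < 0)
        return -1;

    f = fopen(path, "w");
    if (!f)
        return -1;

    ok = fprintf(f, "%u\n", channel) > 0;
    /* The kernel may only report a rejected write when the file is flushed. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

A caller would typically read ``npwm`` first to learn the valid channel range, and write the same index to ``unexport`` when it is done with the channel.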
+ +When a PWM channel is exported, a pwmX directory will be created in the +pwmchipN directory it is associated with, where X is the number of the +channel that was exported. The following properties will then be available: + + period + The total period of the PWM signal (read/write). + Value is in nanoseconds and is the sum of the active and inactive + time of the PWM. + + duty_cycle + The active time of the PWM signal (read/write). + Value is in nanoseconds and must be less than the period. + + polarity + Changes the polarity of the PWM signal (read/write). + Writes to this property only work if the PWM chip supports changing + the polarity. The polarity can only be changed if the PWM is not + enabled. Value is the string "normal" or "inversed". + + enable + Enable/disable the PWM signal (read/write). + + - 0 - disabled + - 1 - enabled + +Implementing a PWM driver +------------------------- + +Currently there are two ways to implement PWM drivers. Traditionally +there has only been the barebone API, meaning that each driver has +to implement the pwm_*() functions itself. This means that it's impossible +to have multiple PWM drivers in the system. For this reason it's mandatory +for new drivers to use the generic PWM framework. + +A new PWM controller/chip can be added using pwmchip_add() and removed +again with pwmchip_remove(). pwmchip_add() takes a filled in struct +pwm_chip as argument which provides a description of the PWM chip, the +number of PWM devices provided by the chip and the chip-specific +implementation of the supported PWM operations to the framework. + +When implementing polarity support in a PWM driver, make sure to respect the +signal conventions in the PWM framework. By definition, normal polarity +characterizes a signal that starts high for the duration of the duty cycle and +goes low for the remainder of the period.
Conversely, a signal with inversed +polarity starts low for the duration of the duty cycle and goes high for the +remainder of the period. + +Drivers are encouraged to implement ->apply() instead of the legacy +->enable(), ->disable() and ->config() methods. Doing that should provide +atomicity in the PWM config workflow, which is required when the PWM controls +a critical device (like a regulator). + +The implementation of ->get_state() (a method used to retrieve initial PWM +state) is also encouraged for the same reason: letting the PWM user know +about the current PWM state would allow him to avoid glitches. + +Drivers should not implement any power management. In other words, +consumers should implement it as described in the "Using PWMs" section. + +Locking +------- + +The PWM core list manipulations are protected by a mutex, so pwm_request() +and pwm_free() may not be called from an atomic context. Currently the +PWM core does not enforce any locking to pwm_enable(), pwm_disable() and +pwm_config(), so the calling context is currently driver specific. This +is an issue derived from the former barebone API and should be fixed soon. + +Helpers +------- + +Currently a PWM can only be configured with period_ns and duty_ns. For several +use cases freq_hz and duty_percent might be better. Instead of calculating +this in your driver please consider adding appropriate helpers to the framework. diff --git a/Documentation/driver-api/rfkill.rst b/Documentation/driver-api/rfkill.rst new file mode 100644 index 000000000000..7d3684e81df6 --- /dev/null +++ b/Documentation/driver-api/rfkill.rst @@ -0,0 +1,132 @@ +=============================== +rfkill - RF kill switch support +=============================== + + +.. contents:: + :depth: 2 + +Introduction +============ + +The rfkill subsystem provides a generic interface for disabling any radio +transmitter in the system. When a transmitter is blocked, it shall not +radiate any power. 
+ +The subsystem also provides the ability to react on button presses and +disable all transmitters of a certain type (or all). This is intended for +situations where transmitters need to be turned off, for example on +aircraft. + +The rfkill subsystem has a concept of "hard" and "soft" block, which +differ little in their meaning (block == transmitters off) but rather in +whether they can be changed or not: + + - hard block + read-only radio block that cannot be overridden by software + + - soft block + writable radio block (need not be readable) that is set by + the system software. + +The rfkill subsystem has two parameters, rfkill.default_state and +rfkill.master_switch_mode, which are documented in +admin-guide/kernel-parameters.rst. + + +Implementation details +====================== + +The rfkill subsystem is composed of three main components: + + * the rfkill core, + * the deprecated rfkill-input module (an input layer handler, being + replaced by userspace policy code) and + * the rfkill drivers. + +The rfkill core provides API for kernel drivers to register their radio +transmitter with the kernel, methods for turning it on and off, and letting +the system know about hardware-disabled states that may be implemented on +the device. + +The rfkill core code also notifies userspace of state changes, and provides +ways for userspace to query the current states. See the "Userspace support" +section below. + +When the device is hard-blocked (either by a call to rfkill_set_hw_state() +or from query_hw_block), set_block() will be invoked for additional software +block, but drivers can ignore the method call since they can use the return +value of the function rfkill_set_hw_state() to sync the software state +instead of keeping track of calls to set_block(). In fact, drivers should +use the return value of rfkill_set_hw_state() unless the hardware actually +keeps track of soft and hard block separately. 
+ + +Kernel API +========== + +Drivers for radio transmitters normally implement an rfkill driver. + +Platform drivers might implement input devices if the rfkill button is just +that, a button. If that button influences the hardware then you need to +implement an rfkill driver instead. This also applies if the platform provides +a way to turn on/off the transmitter(s). + +For some platforms, it is possible that the hardware state changes during +suspend/hibernation, in which case it will be necessary to update the rfkill +core with the current state at resume time. + +To create an rfkill driver, driver's Kconfig needs to have:: + + depends on RFKILL || !RFKILL + +to ensure the driver cannot be built-in when rfkill is modular. The !RFKILL +case allows the driver to be built when rfkill is not configured, in which +case all rfkill API can still be used but will be provided by static inlines +which compile to almost nothing. + +Calling rfkill_set_hw_state() when a state change happens is required from +rfkill drivers that control devices that can be hard-blocked unless they also +assign the poll_hw_block() callback (then the rfkill core will poll the +device). Don't do this unless you cannot get the event in any other way. + +rfkill provides per-switch LED triggers, which can be used to drive LEDs +according to the switch state (LED_FULL when blocked, LED_OFF otherwise). + + +Userspace support +================= + +The recommended userspace interface to use is /dev/rfkill, which is a misc +character device that allows userspace to obtain and set the state of rfkill +devices and sets of devices. It also notifies userspace about device addition +and removal. The API is a simple read/write API that is defined in +linux/rfkill.h, with one ioctl that allows turning off the deprecated input +handler in the kernel for the transition period. + +Except for the one ioctl, communication with the kernel is done via read() +and write() of instances of 'struct rfkill_event'. 
In this structure, the +soft and hard block are properly separated (unlike sysfs, see below) and +userspace is able to get a consistent snapshot of all rfkill devices in the +system. Also, it is possible to switch all rfkill drivers (or all drivers of +a specified type) into a state which also updates the default state for +hotplugged devices. + +After an application opens /dev/rfkill, it can read the current state of all +devices. Changes can be obtained by either polling the descriptor for +hotplug or state change events or by listening for uevents emitted by the +rfkill core framework. + +Additionally, each rfkill device is registered in sysfs and emits uevents. + +rfkill devices issue uevents (with an action of "change"), with the following +environment variables set:: + + RFKILL_NAME + RFKILL_STATE + RFKILL_TYPE + +The content of these variables corresponds to the "name", "state" and +"type" sysfs files explained above. + +For further details consult Documentation/ABI/stable/sysfs-class-rfkill. diff --git a/Documentation/driver-api/sgi-ioc4.rst b/Documentation/driver-api/sgi-ioc4.rst new file mode 100644 index 000000000000..72709222d3c0 --- /dev/null +++ b/Documentation/driver-api/sgi-ioc4.rst @@ -0,0 +1,49 @@ +==================================== +SGI IOC4 PCI (multi function) device +==================================== + +The SGI IOC4 PCI device is a bit of a strange beast, so some notes on +it are in order. + +First, even though the IOC4 performs multiple functions, such as an +IDE controller, a serial controller, a PS/2 keyboard/mouse controller, +and an external interrupt mechanism, it's not implemented as a +multifunction device. The consequence of this from a software +standpoint is that all these functions share a single IRQ, and +they can't all register to own the same PCI device ID. 
To make +matters a bit worse, some of the register blocks (and even registers +themselves) present in IOC4 are mixed-purpose between these several +functions, meaning that there's no clear "owning" device driver. + +The solution is to organize the IOC4 driver into several independent +drivers, "ioc4", "sgiioc4", and "ioc4_serial". Note that there is no +PS/2 controller driver as this functionality has never been wired up +on a shipping IO card. + +ioc4 +==== +This is the core (or shim) driver for IOC4. It is responsible for +initializing the basic functionality of the chip, and allocating +the PCI resources that are shared between the IOC4 functions. + +This driver also provides registration functions that the other +IOC4 drivers can call to make their presence known. Each driver +needs to provide a probe and remove function, which are invoked +by the core driver at appropriate times. The interface of these +IOC4 function probe and remove operations isn't precisely the same +as PCI device probe and remove operations, but is logically the +same operation. + +sgiioc4 +======= +This is the IDE driver for IOC4. Its name isn't very descriptive +simply for historical reasons (it used to be the only IOC4 driver +component). There's not much to say about it other than it hooks +up to the ioc4 driver via the appropriate registration, probe, and +remove functions. + +ioc4_serial +=========== +This is the serial driver for IOC4. There's not much to say about it +other than it hooks up to the ioc4 driver via the appropriate registration, +probe, and remove functions. diff --git a/Documentation/driver-api/sm501.rst b/Documentation/driver-api/sm501.rst new file mode 100644 index 000000000000..882507453ba4 --- /dev/null +++ b/Documentation/driver-api/sm501.rst @@ -0,0 +1,74 @@ +.. 
include:: + +============ +SM501 Driver +============ + +:Copyright: |copy| 2006, 2007 Simtec Electronics + +The Silicon Motion SM501 multimedia companion chip is a multifunction device +which may provide numerous interfaces including USB host controller USB gadget, +asynchronous serial ports, audio functions, and a dual display video interface. +The device may be connected by PCI or local bus with varying functions enabled. + +Core +---- + +The core driver in drivers/mfd provides common services for the +drivers which manage the specific hardware blocks. These services +include locking for common registers, clock control and resource +management. + +The core registers drivers for both PCI and generic bus based +chips via the platform device and driver system. + +On detection of a device, the core initialises the chip (which may +be specified by the platform data) and then exports the selected +peripheral set as platform devices for the specific drivers. + +The core re-uses the platform device system as the platform device +system provides enough features to support the drivers without the +need to create a new bus-type and the associated code to go with it. + + +Resources +--------- + +Each peripheral has a view of the device which is implicitly narrowed to +the specific set of resources that peripheral requires in order to +function correctly. + +The centralised memory allocation allows the driver to ensure that the +maximum possible resource allocation can be made to the video subsystem +as this is by-far the most resource-sensitive of the on-chip functions. + +The primary issue with memory allocation is that of moving the video +buffers once a display mode is chosen. Indeed when a video mode change +occurs the memory footprint of the video subsystem changes. 
+ +Since video memory is difficult to move without changing the display +(unless sufficient contiguous memory can be provided for the old and new +modes simultaneously) the video driver fully utilises the memory area +given to it by aligning fb0 to the start of the area and fb1 to the end +of it. Any memory left over in the middle is used for the acceleration +functions, which are transient and thus their location is less critical +as it can be moved. + + +Configuration +------------- + +The platform device driver uses a set of platform data to pass +configurations through to the core and the subsidiary drivers +so that there can be support for more than one system carrying +an SM501 built into a single kernel image. + +The PCI driver assumes that the PCI card behaves as per the Silicon +Motion reference design. + +There is an erratum (AB-5) affecting the selection of the +M1XCLK and M1CLK frequencies. These two clocks +must be sourced from the same PLL, although they can then +be divided down individually. If this is not set, then SM501 may +lock and hang the whole system. The driver will refuse to +attach if the PLL selection is different. diff --git a/Documentation/driver-api/smsc_ece1099.rst b/Documentation/driver-api/smsc_ece1099.rst new file mode 100644 index 000000000000..079277421eaf --- /dev/null +++ b/Documentation/driver-api/smsc_ece1099.rst @@ -0,0 +1,60 @@ +================================================= +Msc Keyboard Scan Expansion/GPIO Expansion device +================================================= + +What is smsc-ece1099? +---------------------- + +The ECE1099 is a 40-Pin 3.3V Keyboard Scan Expansion +or GPIO Expansion device. The device supports a keyboard +scan matrix of 23x8. The device is connected to a Master +via the SMSC BC-Link interface or via the SMBus. +Keypad Scan Input (KSI) and Keypad Scan Output (KSO) signals +are multiplexed with GPIOs.
+ +Interrupt generation +-------------------- + +Interrupts can be generated by an edge detection on a GPIO +pin or an edge detection on one of the bus interface pins. +Interrupts can also be detected on the keyboard scan interface. +The bus interrupt pin (BC_INT# or SMBUS_INT#) is asserted if +any bit in one of the Interrupt Status registers is 1 and +the corresponding Interrupt Mask bit is also 1. + +In order for software to determine which device is the source +of an interrupt, it should first read the Group Interrupt Status Register +to determine which Status register group is a source for the interrupt. +Software should read both the Status register and the associated Mask register, +then AND the two values together. Bits that are 1 in the result of the AND +are active interrupts. Software clears an interrupt by writing a 1 to the +corresponding bit in the Status register. + +Communication Protocol +---------------------- + +- SMbus slave Interface + The host processor communicates with the ECE1099 device + through a series of read/write registers via the SMBus + interface. SMBus is a serial communication protocol between + a computer host and its peripheral devices. The SMBus data + rate is 10KHz minimum to 400 KHz maximum + +- Slave Bus Interface + The ECE1099 device SMBus implementation is a subset of the + SMBus interface to the host. The device is a slave-only SMBus device. + The implementation in the device is a subset of SMBus since it + only supports four protocols. + + The Write Byte, Read Byte, Send Byte, and Receive Byte protocols are the + only valid SMBus protocols for the device. + +- BC-LinkTM Interface + The BC-Link is a proprietary bus that allows communication + between a Master device and a Companion device. The Master + device uses this serial bus to read and write registers + located on the Companion device. The bus comprises three signals, + BC_CLK, BC_DAT and BC_INT#. 
The Master device always provides the + clock, BC_CLK, and the Companion device is the source for an + independent asynchronous interrupt signal, BC_INT#. The ECE1099 + supports BC-Link speeds up to 24MHz. diff --git a/Documentation/driver-api/switchtec.rst b/Documentation/driver-api/switchtec.rst new file mode 100644 index 000000000000..7611fdc53e19 --- /dev/null +++ b/Documentation/driver-api/switchtec.rst @@ -0,0 +1,102 @@ +======================== +Linux Switchtec Support +======================== + +Microsemi's "Switchtec" line of PCI switch devices is already +supported by the kernel with standard PCI switch drivers. However, the +Switchtec device advertises a special management endpoint which +enables some additional functionality. This includes: + +* Packet and Byte Counters +* Firmware Upgrades +* Event and Error logs +* Querying port link status +* Custom user firmware commands + +The switchtec kernel module implements this functionality. + + +Interface +========= + +The primary means of communicating with the Switchtec management firmware is +through the Memory-mapped Remote Procedure Call (MRPC) interface. +Commands are submitted to the interface with a 4-byte command +identifier and up to 1KB of command specific data. The firmware will +respond with a 4-byte return code and up to 1KB of command-specific +data. The interface only processes a single command at a time. + + +Userspace Interface +=================== + +The MRPC interface will be exposed to userspace through a simple char +device: /dev/switchtec#, one for each management endpoint in the system. + +The char device has the following semantics: + +* A write must consist of at least 4 bytes and no more than 1028 bytes. + The first 4 bytes will be interpreted as the Command ID and the + remainder will be used as the input data. A write will send the + command to the firmware to begin processing. + +* Each write must be followed by exactly one read. 
Any double write will + produce an error and any read that doesn't follow a write will + produce an error. + +* A read will block until the firmware completes the command and returns + the 4-byte Command Return Value plus up to 1024 bytes of output + data. (The length will be specified by the size parameter of the read + call -- reading less than 4 bytes will produce an error.) + +* The poll call will also be supported for userspace applications that + need to do other things while waiting for the command to complete. + +The following IOCTLs are also supported by the device: + +* SWITCHTEC_IOCTL_FLASH_INFO - Retrieve firmware length and number + of partitions in the device. + +* SWITCHTEC_IOCTL_FLASH_PART_INFO - Retrieve address and length for + any specified partition in flash. + +* SWITCHTEC_IOCTL_EVENT_SUMMARY - Read a structure of bitmaps + indicating all uncleared events. + +* SWITCHTEC_IOCTL_EVENT_CTL - Get the current count, clear and set flags + for any event. This ioctl takes in a switchtec_ioctl_event_ctl struct + with the event_id, index and flags set (index being the partition or PFF + number for non-global events). It returns whether the event has + occurred, the number of times and any event specific data. The flags + can be used to clear the count or enable and disable actions to + happen when the event occurs. + By using the SWITCHTEC_IOCTL_EVENT_FLAG_EN_POLL flag, + you can set an event to trigger a poll command to return with + POLLPRI. In this way, userspace can wait for events to occur. + +* SWITCHTEC_IOCTL_PFF_TO_PORT and SWITCHTEC_IOCTL_PORT_TO_PFF convert + between PCI Function Framework number (used by the event system) + and Switchtec Logic Port ID and Partition number (which is more + user friendly). + + +Non-Transparent Bridge (NTB) Driver +=================================== + +An NTB hardware driver is provided for the Switchtec hardware in +ntb_hw_switchtec.
Currently, it only supports switches configured with +exactly 2 NT partitions and zero or more non-NT partitions. It also requires +the following configuration settings: + +* Both NT partitions must be able to access each other's GAS spaces. + Thus, the bits in the GAS Access Vector under Management Settings + must be set to support this. +* Kernel configuration MUST include support for NTB (CONFIG_NTB needs + to be set) + +NT EP BAR 2 will be dynamically configured as a Direct Window, and +the configuration file does not need to configure it explicitly. + +Please refer to Documentation/driver-api/ntb.rst in the Linux source tree for an overall +understanding of the Linux NTB stack. ntb_hw_switchtec works as an NTB +Hardware Driver in this stack. diff --git a/Documentation/driver-api/sync_file.rst b/Documentation/driver-api/sync_file.rst new file mode 100644 index 000000000000..496fb2c3b3e6 --- /dev/null +++ b/Documentation/driver-api/sync_file.rst @@ -0,0 +1,86 @@ +=================== +Sync File API Guide +=================== + +:Author: Gustavo Padovan + +This document serves as a guide for device driver writers on what the +sync_file API is, and how drivers can support it. Sync file is the carrier of +the fences (struct dma_fence) that are needed to synchronize between drivers or +across process boundaries. + +The sync_file API is meant to be used to send and receive fence information +to/from userspace. It enables userspace to do explicit fencing, where instead +of attaching a fence to the buffer a producer driver (such as a GPU or V4L +driver) sends the fence related to the buffer to userspace via a sync_file. + +The sync_file then can be sent to the consumer (a DRM driver, for example), which +will not use the buffer for anything before the fence(s) signal, i.e., the +driver that issued the fence is not using/processing the buffer anymore, so it +signals that the buffer is ready to use. And vice-versa for the consumer -> +producer part of the cycle.
+ +Sync files allow userspace awareness of buffer sharing synchronization between +drivers. + +Sync file was originally added in the Android kernel but current Linux Desktop +can benefit a lot from it. + +in-fences and out-fences +------------------------ + +Sync files can go either to or from userspace. When a sync_file is sent from +the driver to userspace we call the fences it contains 'out-fences'. They are +related to a buffer that the driver is processing or is going to process, so +the driver creates an out-fence to be able to notify, through +dma_fence_signal(), when it has finished using (or processing) that buffer. +Out-fences are fences that the driver creates. + +On the other hand if the driver receives fence(s) through a sync_file from +userspace we call these fence(s) 'in-fences'. Receiving in-fences means that +we need to wait for the fence(s) to signal before using any buffer related to +the in-fences. + +Creating Sync Files +------------------- + +When a driver needs to send an out-fence to userspace it creates a sync_file. + +Interface:: + + struct sync_file *sync_file_create(struct dma_fence *fence); + +The caller passes the out-fence and gets back the sync_file. That is just the +first step; next it needs to install an fd on sync_file->file. So it gets an +fd:: + + fd = get_unused_fd_flags(O_CLOEXEC); + +and installs it on sync_file->file:: + + fd_install(fd, sync_file->file); + +The sync_file fd now can be sent to userspace. + +If the creation process fails, or the sync_file needs to be released for any +other reason, fput(sync_file->file) should be used. + +Receiving Sync Files from Userspace +----------------------------------- + +When userspace needs to send an in-fence to the driver it passes the file descriptor +of the Sync File to the kernel. The kernel can then retrieve the fences +from it.
+ +Interface:: + + struct dma_fence *sync_file_get_fence(int fd); + + +The returned reference is owned by the caller and must be disposed of +afterwards using dma_fence_put(). In case of error, a NULL is returned instead. + +References: + +1. struct sync_file in include/linux/sync_file.h +2. All interfaces mentioned above defined in include/linux/sync_file.h diff --git a/Documentation/driver-api/vfio-mediated-device.rst b/Documentation/driver-api/vfio-mediated-device.rst new file mode 100644 index 000000000000..25eb7d5b834b --- /dev/null +++ b/Documentation/driver-api/vfio-mediated-device.rst @@ -0,0 +1,414 @@ +.. include:: + +===================== +VFIO Mediated devices +===================== + +:Copyright: |copy| 2016, NVIDIA CORPORATION. All rights reserved. +:Author: Neo Jia +:Author: Kirti Wankhede + +This program is free software; you can redistribute it and/or modify +it under the terms of the GNU General Public License version 2 as +published by the Free Software Foundation. + + +Virtual Function I/O (VFIO) Mediated devices[1] +=============================================== + +The number of use cases for virtualizing DMA devices that do not have built-in +SR_IOV capability is increasing. Previously, to virtualize such devices, +developers had to create their own management interfaces and APIs, and then +integrate them with user space software. To simplify integration with user space +software, we have identified common requirements and a unified management +interface for such devices. + +The VFIO driver framework provides unified APIs for direct device access. It is +an IOMMU/device-agnostic framework for exposing direct device access to user +space in a secure, IOMMU-protected environment. This framework is used for +multiple devices, such as GPUs, network adapters, and compute accelerators. With +direct device access, virtual machines or user space applications have direct +access to the physical device. This framework is reused for mediated devices. 
+ +The mediated core driver provides a common interface for mediated device +management that can be used by drivers of different devices. This module +provides a generic interface to perform these operations: + +* Create and destroy a mediated device +* Add a mediated device to and remove it from a mediated bus driver +* Add a mediated device to and remove it from an IOMMU group + +The mediated core driver also provides an interface to register a bus driver. +For example, the mediated VFIO mdev driver is designed for mediated devices and +supports VFIO APIs. The mediated bus driver adds a mediated device to and +removes it from a VFIO group. + +The following high-level block diagram shows the main components and interfaces +in the VFIO mediated driver framework. The diagram shows NVIDIA, Intel, and IBM +devices as examples, as these devices are the first devices to use this module:: + + +---------------+ + | | + | +-----------+ | mdev_register_driver() +--------------+ + | | | +<------------------------+ | + | | mdev | | | | + | | bus | +------------------------>+ vfio_mdev.ko |<-> VFIO user + | | driver | | probe()/remove() | | APIs + | | | | +--------------+ + | +-----------+ | + | | + | MDEV CORE | + | MODULE | + | mdev.ko | + | +-----------+ | mdev_register_device() +--------------+ + | | | +<------------------------+ | + | | | | | nvidia.ko |<-> physical + | | | +------------------------>+ | device + | | | | callbacks +--------------+ + | | Physical | | + | | device | | mdev_register_device() +--------------+ + | | interface | |<------------------------+ | + | | | | | i915.ko |<-> physical + | | | +------------------------>+ | device + | | | | callbacks +--------------+ + | | | | + | | | | mdev_register_device() +--------------+ + | | | +<------------------------+ | + | | | | | ccw_device.ko|<-> physical + | | | +------------------------>+ | device + | | | | callbacks +--------------+ + | +-----------+ | + +---------------+ + + +Registration Interfaces 
+======================= + +The mediated core driver provides the following types of registration +interfaces: + +* Registration interface for a mediated bus driver +* Physical device driver interface + +Registration Interface for a Mediated Bus Driver +------------------------------------------------ + +The registration interface for a mediated bus driver provides the following +structure to represent a mediated device's driver:: + + /* + * struct mdev_driver [2] - Mediated device's driver + * @name: driver name + * @probe: called when new device created + * @remove: called when device removed + * @driver: device driver structure + */ + struct mdev_driver { + const char *name; + int (*probe) (struct device *dev); + void (*remove) (struct device *dev); + struct device_driver driver; + }; + +A mediated bus driver for mdev should use this structure in the function calls +to register and unregister itself with the core driver: + +* Register:: + + extern int mdev_register_driver(struct mdev_driver *drv, + struct module *owner); + +* Unregister:: + + extern void mdev_unregister_driver(struct mdev_driver *drv); + +The mediated bus driver is responsible for adding mediated devices to the VFIO +group when devices are bound to the driver and removing mediated devices from +the VFIO when devices are unbound from the driver. + + +Physical Device Driver Interface +-------------------------------- + +The physical device driver interface provides the mdev_parent_ops[3] structure +to define the APIs to manage work in the mediated core driver that is related +to the physical device. 
+ +The structures in the mdev_parent_ops structure are as follows: + +* dev_attr_groups: attributes of the parent device +* mdev_attr_groups: attributes of the mediated device +* supported_config: attributes to define supported configurations + +The functions in the mdev_parent_ops structure are as follows: + +* create: allocate basic resources in a driver for a mediated device +* remove: free resources in a driver when a mediated device is destroyed + +(Note that mdev-core provides no implicit serialization of create/remove +callbacks per mdev parent device, per mdev type, or any other categorization. +Vendor drivers are expected to be fully asynchronous in this respect or +provide their own internal resource protection.) + +The callbacks in the mdev_parent_ops structure are as follows: + +* open: open callback of mediated device +* close: close callback of mediated device +* ioctl: ioctl callback of mediated device +* read : read emulation callback +* write: write emulation callback +* mmap: mmap emulation callback + +A driver should use the mdev_parent_ops structure in the function call to +register itself with the mdev core driver:: + + extern int mdev_register_device(struct device *dev, + const struct mdev_parent_ops *ops); + +However, the mdev_parent_ops structure is not required in the function call +that a driver should use to unregister itself with the mdev core driver:: + + extern void mdev_unregister_device(struct device *dev); + + +Mediated Device Management Interface Through sysfs +================================================== + +The management interface through sysfs enables user space software, such as +libvirt, to query and configure mediated devices in a hardware-agnostic fashion. 
+This management interface provides flexibility to the underlying physical +device's driver to support features such as: + +* Mediated device hot plug +* Multiple mediated devices in a single virtual machine +* Multiple mediated devices from different physical devices + +Links in the mdev_bus Class Directory +------------------------------------- +The /sys/class/mdev_bus/ directory contains links to devices that are registered +with the mdev core driver. + +Directories and files under the sysfs for Each Physical Device +-------------------------------------------------------------- + +:: + + |- [parent physical device] + |--- Vendor-specific-attributes [optional] + |--- [mdev_supported_types] + | |--- [] + | | |--- create + | | |--- name + | | |--- available_instances + | | |--- device_api + | | |--- description + | | |--- [devices] + | |--- [] + | | |--- create + | | |--- name + | | |--- available_instances + | | |--- device_api + | | |--- description + | | |--- [devices] + | |--- [] + | |--- create + | |--- name + | |--- available_instances + | |--- device_api + | |--- description + | |--- [devices] + +* [mdev_supported_types] + + The list of currently supported mediated device types and their details. + + [], device_api, and available_instances are mandatory attributes + that should be provided by vendor driver. + +* [] + + The [] name is created by adding the device driver string as a prefix + to the string provided by the vendor driver. This format of this name is as + follows:: + + sprintf(buf, "%s-%s", dev_driver_string(parent->dev), group->name); + + (or using mdev_parent_dev(mdev) to arrive at the parent device outside + of the core mdev code) + +* device_api + + This attribute should show which device API is being created, for example, + "vfio-pci" for a PCI device. + +* available_instances + + This attribute should show the number of devices of type that can be + created. 
+ +* [device] + + This directory contains links to the devices of type that have been + created. + +* name + + This attribute should show human readable name. This is optional attribute. + +* description + + This attribute should show brief features/description of the type. This is + optional attribute. + +Directories and Files Under the sysfs for Each mdev Device +---------------------------------------------------------- + +:: + + |- [parent phy device] + |--- [$MDEV_UUID] + |--- remove + |--- mdev_type {link to its type} + |--- vendor-specific-attributes [optional] + +* remove (write only) + +Writing '1' to the 'remove' file destroys the mdev device. The vendor driver can +fail the remove() callback if that device is active and the vendor driver +doesn't support hot unplug. + +Example:: + + # echo 1 > /sys/bus/mdev/devices/$mdev_UUID/remove + +Mediated device Hot plug +------------------------ + +Mediated devices can be created and assigned at runtime. The procedure to hot +plug a mediated device is the same as the procedure to hot plug a PCI device. + +Translation APIs for Mediated Devices +===================================== + +The following APIs are provided for translating user pfn to host pfn in a VFIO +driver:: + + extern int vfio_pin_pages(struct device *dev, unsigned long *user_pfn, + int npage, int prot, unsigned long *phys_pfn); + + extern int vfio_unpin_pages(struct device *dev, unsigned long *user_pfn, + int npage); + +These functions call back into the back-end IOMMU module by using the pin_pages +and unpin_pages callbacks of the struct vfio_iommu_driver_ops[4]. Currently +these callbacks are supported in the TYPE1 IOMMU module. To enable them for +other IOMMU backend modules, such as PPC64 sPAPR module, they need to provide +these two callback functions. + +Using the Sample Code +===================== + +mtty.c in samples/vfio-mdev/ directory is a sample driver program to +demonstrate how to use the mediated device framework. 
+ +The sample driver creates an mdev device that simulates a serial port over a PCI +card. + +1. Build and load the mtty.ko module. + + This step creates a dummy device, /sys/devices/virtual/mtty/mtty/ + + Files in this device directory in sysfs are similar to the following:: + + # tree /sys/devices/virtual/mtty/mtty/ + /sys/devices/virtual/mtty/mtty/ + |-- mdev_supported_types + | |-- mtty-1 + | | |-- available_instances + | | |-- create + | | |-- device_api + | | |-- devices + | | `-- name + | `-- mtty-2 + | |-- available_instances + | |-- create + | |-- device_api + | |-- devices + | `-- name + |-- mtty_dev + | `-- sample_mtty_dev + |-- power + | |-- autosuspend_delay_ms + | |-- control + | |-- runtime_active_time + | |-- runtime_status + | `-- runtime_suspended_time + |-- subsystem -> ../../../../class/mtty + `-- uevent + +2. Create a mediated device by using the dummy device that you created in the + previous step:: + + # echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001" > \ + /sys/devices/virtual/mtty/mtty/mdev_supported_types/mtty-2/create + +3. Add parameters to qemu-kvm:: + + -device vfio-pci,\ + sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 + +4. Boot the VM. + + In the Linux guest VM, with no hardware on the host, the device appears + as follows:: + + # lspci -s 00:05.0 -xxvv + 00:05.0 Serial controller: Device 4348:3253 (rev 10) (prog-if 02 [16550]) + Subsystem: Device 4348:3253 + Physical Slot: 5 + Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- + Stepping- SERR- FastB2B- DisINTx- + Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- + SERR- Link[LNKA] -> GSI 10 (level, high) -> IRQ 10 + 0000:00:05.0: ttyS1 at I/O 0xc150 (irq = 10) is a 16550A + 0000:00:05.0: ttyS2 at I/O 0xc158 (irq = 10) is a 16550A + + +5. 
In the Linux guest VM, check the serial ports:: + + # setserial -g /dev/ttyS* + /dev/ttyS0, UART: 16550A, Port: 0x03f8, IRQ: 4 + /dev/ttyS1, UART: 16550A, Port: 0xc150, IRQ: 10 + /dev/ttyS2, UART: 16550A, Port: 0xc158, IRQ: 10 + +6. Using minicom or any terminal emulation program, open port /dev/ttyS1 or + /dev/ttyS2 with hardware flow control disabled. + +7. Type data on the minicom terminal or send data to the terminal emulation + program and read the data. + + Data is looped back from the host's mtty driver. + +8. Destroy the mediated device that you created:: + + # echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001/remove + +References +========== + +1. See Documentation/driver-api/vfio.rst for more information on VFIO. +2. struct mdev_driver in include/linux/mdev.h +3. struct mdev_parent_ops in include/linux/mdev.h +4. struct vfio_iommu_driver_ops in include/linux/vfio.h diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst new file mode 100644 index 000000000000..f1a4d3c3ba0b --- /dev/null +++ b/Documentation/driver-api/vfio.rst @@ -0,0 +1,520 @@ +================================== +VFIO - "Virtual Function I/O" [1]_ +================================== + +Many modern systems now provide DMA and interrupt remapping facilities +to help ensure I/O devices behave within the boundaries they've been +allotted. This includes x86 hardware with AMD-Vi and Intel VT-d, +POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC +systems such as Freescale PAMU. The VFIO driver is an IOMMU/device +agnostic framework for exposing direct device access to userspace, in +a secure, IOMMU protected environment. In other words, this allows +safe [2]_, non-privileged, userspace drivers. + +Why do we want that? Virtual machines often make use of direct device +access ("device assignment") when configured for the highest possible +I/O performance.
From a device and host perspective, this simply +turns the VM into a userspace driver, with the benefits of +significantly reduced latency, higher bandwidth, and direct use of +bare-metal device drivers [3]_. + +Some applications, particularly in the high performance computing +field, also benefit from low-overhead, direct device access from +userspace. Examples include network adapters (often non-TCP/IP based) +and compute accelerators. Prior to VFIO, these drivers had to either +go through the full development cycle to become proper upstream +driver, be maintained out of tree, or make use of the UIO framework, +which has no notion of IOMMU protection, limited interrupt support, +and requires root privileges to access things like PCI configuration +space. + +The VFIO driver framework intends to unify these, replacing both the +KVM PCI specific device assignment code as well as provide a more +secure, more featureful userspace driver environment than UIO. + +Groups, Devices, and IOMMUs +--------------------------- + +Devices are the main target of any I/O driver. Devices typically +create a programming interface made up of I/O access, interrupts, +and DMA. Without going into the details of each of these, DMA is +by far the most critical aspect for maintaining a secure environment +as allowing a device read-write access to system memory imposes the +greatest risk to the overall system integrity. + +To help mitigate this risk, many modern IOMMUs now incorporate +isolation properties into what was, in many cases, an interface only +meant for translation (ie. solving the addressing problems of devices +with limited address spaces). With this, devices can now be isolated +from each other and from arbitrary memory access, thus allowing +things like secure direct assignment of devices into virtual machines. + +This isolation is not always at the granularity of a single device +though. 
Even when an IOMMU is capable of this, properties of devices, +interconnects, and IOMMU topologies can each reduce this isolation. +For instance, an individual device may be part of a larger multi- +function enclosure. While the IOMMU may be able to distinguish +between devices within the enclosure, the enclosure may not require +transactions between devices to reach the IOMMU. Examples of this +could be anything from a multi-function PCI device with backdoors +between functions to a non-PCI-ACS (Access Control Services) capable +bridge allowing redirection without reaching the IOMMU. Topology +can also play a factor in terms of hiding devices. A PCIe-to-PCI +bridge masks the devices behind it, making transaction appear as if +from the bridge itself. Obviously IOMMU design plays a major factor +as well. + +Therefore, while for the most part an IOMMU may have device level +granularity, any system is susceptible to reduced granularity. The +IOMMU API therefore supports a notion of IOMMU groups. A group is +a set of devices which is isolatable from all other devices in the +system. Groups are therefore the unit of ownership used by VFIO. + +While the group is the minimum granularity that must be used to +ensure secure user access, it's not necessarily the preferred +granularity. In IOMMUs which make use of page tables, it may be +possible to share a set of page tables between different groups, +reducing the overhead both to the platform (reduced TLB thrashing, +reduced duplicate page tables), and to the user (programming only +a single set of translations). For this reason, VFIO makes use of +a container class, which may hold one or more groups. A container +is created by simply opening the /dev/vfio/vfio character device. + +On its own, the container provides little functionality, with all +but a couple version and extension query interfaces locked away. +The user needs to add a group into the container for the next level +of functionality. 
To do this, the user first needs to identify the +group associated with the desired device. This can be done using +the sysfs links described in the example below. By unbinding the +device from the host driver and binding it to a VFIO driver, a new +VFIO group will appear for the group as /dev/vfio/$GROUP, where +$GROUP is the IOMMU group number of which the device is a member. +If the IOMMU group contains multiple devices, each will need to +be bound to a VFIO driver before operations on the VFIO group +are allowed (it's also sufficient to only unbind the device from +host drivers if a VFIO driver is unavailable; this will make the +group available, but not that particular device). TBD - interface +for disabling driver probing/locking a device. + +Once the group is ready, it may be added to the container by opening +the VFIO group character device (/dev/vfio/$GROUP) and using the +VFIO_GROUP_SET_CONTAINER ioctl, passing the file descriptor of the +previously opened container file. If desired and if the IOMMU driver +supports sharing the IOMMU context between groups, multiple groups may +be set to the same container. If a group fails to set to a container +with existing groups, a new empty container will need to be used +instead. + +With a group (or groups) attached to a container, the remaining +ioctls become available, enabling access to the VFIO IOMMU interfaces. +Additionally, it now becomes possible to get file descriptors for each +device within a group using an ioctl on the VFIO group file descriptor. + +The VFIO device API includes ioctls for describing the device, the I/O +regions and their read/write/mmap offsets on the device descriptor, as +well as mechanisms for describing and registering interrupt +notifications. 
+ +VFIO Usage Example +------------------ + +Assume user wants to access PCI device 0000:06:0d.0:: + + $ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group + ../../../../kernel/iommu_groups/26 + +This device is therefore in IOMMU group 26. This device is on the +pci bus, therefore the user will make use of vfio-pci to manage the +group:: + + # modprobe vfio-pci + +Binding this device to the vfio-pci driver creates the VFIO group +character devices for this group:: + + $ lspci -n -s 0000:06:0d.0 + 06:0d.0 0401: 1102:0002 (rev 08) + # echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind + # echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id + +Now we need to look at what other devices are in the group to free +it for use by VFIO:: + + $ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices + total 0 + lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:00:1e.0 -> + ../../../../devices/pci0000:00/0000:00:1e.0 + lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.0 -> + ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0 + lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 -> + ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1 + +This device is behind a PCIe-to-PCI bridge [4]_, therefore we also +need to add device 0000:06:0d.1 to the group following the same +procedure as above. Device 0000:00:1e.0 is a bridge that does +not currently have a host driver, therefore it's not required to +bind this device to the vfio-pci driver (vfio-pci does not currently +support PCI bridges). 
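As a rough illustration of the group lookup above, a program can take the target of the iommu_group symlink (normally obtained with readlink(2) on the sysfs path shown earlier) and parse the group number from its last path component. The helper name iommu_group_from_link is hypothetical and is not part of any VFIO header; this is a minimal sketch in C, not the way any particular tool does it.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Extract the IOMMU group number from the target of the iommu_group
 * symlink, e.g. "../../../../kernel/iommu_groups/26".
 * Returns -1 if the last path component is not a plain decimal number.
 * (iommu_group_from_link is a hypothetical helper name.)
 */
static int iommu_group_from_link(const char *target)
{
	const char *base;
	char *end;
	long group;

	assert(target != NULL);

	/* The group number is the final path component. */
	base = strrchr(target, '/');
	base = base ? base + 1 : target;

	/* Reject empty components, signs and leading whitespace. */
	if (*base < '0' || *base > '9')
		return -1;

	group = strtol(base, &end, 10);
	if (*end != '\0' || group > INT_MAX)
		return -1;

	return (int)group;
}
```

The resulting number is what gets appended to /dev/vfio/ to name the group character device.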
+ +The final step is to provide the user with access to the group if +unprivileged operation is desired (note that /dev/vfio/vfio provides +no capabilities on its own and is therefore expected to be set to +mode 0666 by the system):: + + # chown user:user /dev/vfio/26 + +The user now has full access to all the devices and the iommu for this +group and can access them as follows:: + + int container, group, device, i; + struct vfio_group_status group_status = + { .argsz = sizeof(group_status) }; + struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) }; + struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) }; + struct vfio_device_info device_info = { .argsz = sizeof(device_info) }; + + /* Create a new container */ + container = open("/dev/vfio/vfio", O_RDWR); + + if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION) + /* Unknown API version */ + + if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) + /* Doesn't support the IOMMU driver we want. 
*/ + + /* Open the group */ + group = open("/dev/vfio/26", O_RDWR); + + /* Test the group is viable and available */ + ioctl(group, VFIO_GROUP_GET_STATUS, &group_status); + + if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) + /* Group is not viable (i.e., not all devices bound for vfio) */ + + /* Add the group to the container */ + ioctl(group, VFIO_GROUP_SET_CONTAINER, &container); + + /* Enable the IOMMU model we want */ + ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU); + + /* Get additional IOMMU info */ + ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info); + + /* Allocate some space and setup a DMA mapping */ + dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, 0, 0); + dma_map.size = 1024 * 1024; + dma_map.iova = 0; /* 1MB starting at 0x0 from device view */ + dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE; + + ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map); + + /* Get a file descriptor for the device */ + device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0"); + + /* Test and setup the device */ + ioctl(device, VFIO_DEVICE_GET_INFO, &device_info); + + for (i = 0; i < device_info.num_regions; i++) { + struct vfio_region_info reg = { .argsz = sizeof(reg) }; + + reg.index = i; + + ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg); + + /* Setup mappings... read/write offsets, mmaps + * For PCI devices, config space is a region */ + } + + for (i = 0; i < device_info.num_irqs; i++) { + struct vfio_irq_info irq = { .argsz = sizeof(irq) }; + + irq.index = i; + + ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq); + + /* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQS */ + } + + /* Gratuitous device reset and go... */ + ioctl(device, VFIO_DEVICE_RESET); + +VFIO User API +------------------------------------------------------------------------------- + +Please see include/linux/vfio.h for complete API documentation.
+ +VFIO bus driver API +------------------------------------------------------------------------------- + +VFIO bus drivers, such as vfio-pci make use of only a few interfaces +into VFIO core. When devices are bound and unbound to the driver, +the driver should call vfio_add_group_dev() and vfio_del_group_dev() +respectively:: + + extern int vfio_add_group_dev(struct device *dev, + const struct vfio_device_ops *ops, + void *device_data); + + extern void *vfio_del_group_dev(struct device *dev); + +vfio_add_group_dev() indicates to the core to begin tracking the +iommu_group of the specified dev and register the dev as owned by +a VFIO bus driver. The driver provides an ops structure for callbacks +similar to a file operations structure:: + + struct vfio_device_ops { + int (*open)(void *device_data); + void (*release)(void *device_data); + ssize_t (*read)(void *device_data, char __user *buf, + size_t count, loff_t *ppos); + ssize_t (*write)(void *device_data, const char __user *buf, + size_t size, loff_t *ppos); + long (*ioctl)(void *device_data, unsigned int cmd, + unsigned long arg); + int (*mmap)(void *device_data, struct vm_area_struct *vma); + }; + +Each function is passed the device_data that was originally registered +in the vfio_add_group_dev() call above. This allows the bus driver +an easy place to store its opaque, private data. The open/release +callbacks are issued when a new file descriptor is created for a +device (via VFIO_GROUP_GET_DEVICE_FD). The ioctl interface provides +a direct pass through for VFIO_DEVICE_* ioctls. The read/write/mmap +interfaces implement the device region access defined by the device's +own VFIO_DEVICE_GET_REGION_INFO ioctl. 
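From the userspace side, the region access described above amounts to reading or writing the device file descriptor at an offset inside the region. The following is a minimal sketch, assuming region_offset and region_size are the .offset and .size fields of struct vfio_region_info as filled in by VFIO_DEVICE_GET_REGION_INFO; the helper names region_span_ok and region_read are hypothetical and only illustrate the bounds check and the pread() call.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Return 1 if [off, off + len) lies inside a region of region_size bytes. */
static int region_span_ok(uint64_t region_size, uint64_t off, size_t len)
{
	return off <= region_size && len <= region_size - off;
}

/*
 * Read len bytes at byte offset off inside one device region.
 * region_offset and region_size are assumed to come from
 * VFIO_DEVICE_GET_REGION_INFO for that region (hypothetical helper).
 * Returns the pread() result, or -1 with errno set to EINVAL when the
 * requested span falls outside the region.
 * Region offsets can be large, so a 32-bit build would need a 64-bit
 * off_t (for example -D_FILE_OFFSET_BITS=64).
 */
static ssize_t region_read(int device_fd, uint64_t region_offset,
			   uint64_t region_size, void *buf, size_t len,
			   uint64_t off)
{
	assert(buf != NULL);

	if (!region_span_ok(region_size, off, len)) {
		errno = EINVAL;
		return -1;
	}

	/* Region accesses go through the device fd at region_offset + off. */
	return pread(device_fd, buf, len, (off_t)(region_offset + off));
}
```

A write helper would look the same with pwrite(); whether a given region accepts reads, writes or mmap is reported by the region info flags.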
+ + +PPC64 sPAPR implementation note +------------------------------- + +This implementation has some specifics: + +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per + container is supported as an IOMMU table is allocated at the boot time, + one table per a IOMMU group which is a Partitionable Endpoint (PE) + (PE is often a PCI domain but not always). + + Newer systems (POWER8 with IODA2) have improved hardware design which allows + to remove this limitation and have multiple IOMMU groups per a VFIO + container. + +2) The hardware supports so called DMA windows - the PCI address range + within which DMA transfer is allowed, any attempt to access address space + out of the window leads to the whole PE isolation. + +3) PPC64 guests are paravirtualized but not fully emulated. There is an API + to map/unmap pages for DMA, and it normally maps 1..32 pages per call and + currently there is no way to reduce the number of calls. In order to make + things faster, the map/unmap handling has been implemented in real mode + which provides an excellent performance which has limitations such as + inability to do locked pages accounting in real time. + +4) According to sPAPR specification, A Partitionable Endpoint (PE) is an I/O + subtree that can be treated as a unit for the purposes of partitioning and + error recovery. A PE may be a single or multi-function IOA (IO Adapter), a + function of a multi-function IOA, or multiple IOAs (possibly including + switch and bridge structures above the multiple IOAs). PPC64 guests detect + PCI errors and recover from them via EEH RTAS services, which works on the + basis of additional ioctl commands. + + So 4 additional ioctls have been added: + + VFIO_IOMMU_SPAPR_TCE_GET_INFO + returns the size and the start of the DMA window on the PCI bus. + + VFIO_IOMMU_ENABLE + enables the container. The locked pages accounting + is done at this point. 
This lets the user first learn what + the DMA window is and adjust the rlimit before doing any real work. + + VFIO_IOMMU_DISABLE + disables the container. + + VFIO_EEH_PE_OP + provides an API for EEH setup, error detection and recovery. + + The code flow from the example above should be slightly changed:: + + struct vfio_eeh_pe_op pe_op = { .argsz = sizeof(pe_op), .flags = 0 }; + + ..... + /* Add the group to the container */ + ioctl(group, VFIO_GROUP_SET_CONTAINER, &container); + + /* Enable the IOMMU model we want */ + ioctl(container, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU); + + /* Get additional sPAPR IOMMU info */ + struct vfio_iommu_spapr_tce_info spapr_iommu_info; + ioctl(container, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &spapr_iommu_info); + + if (ioctl(container, VFIO_IOMMU_ENABLE)) + /* Cannot enable container, may be low rlimit */ + + /* Allocate some space and setup a DMA mapping */ + dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, 0, 0); + + dma_map.size = 1024 * 1024; + dma_map.iova = 0; /* 1MB starting at 0x0 from device view */ + dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE; + + /* Check here if .iova/.size are within DMA window from spapr_iommu_info */ + ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map); + + /* Get a file descriptor for the device */ + device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0"); + + .... + + /* Gratuitous device reset and go... */ + ioctl(device, VFIO_DEVICE_RESET); + + /* Make sure EEH is supported */ + ioctl(container, VFIO_CHECK_EXTENSION, VFIO_EEH); + + /* Enable the EEH functionality on the device */ + pe_op.op = VFIO_EEH_PE_ENABLE; + ioctl(container, VFIO_EEH_PE_OP, &pe_op); + + /* It is suggested that you create an additional data struct to + * represent the PE, and put child devices belonging to the same + * IOMMU group into the PE instance for later reference.
+ */ + + /* Check the PE's state and make sure it's in functional state */ + pe_op.op = VFIO_EEH_PE_GET_STATE; + ioctl(container, VFIO_EEH_PE_OP, &pe_op); + + /* Save device state using pci_save_state(). + * EEH should be enabled on the specified device. + */ + + .... + + /* Inject EEH error, which is expected to be caused by 32-bits + * config load. + */ + pe_op.op = VFIO_EEH_PE_INJECT_ERR; + pe_op.err.type = EEH_ERR_TYPE_32; + pe_op.err.func = EEH_ERR_FUNC_LD_CFG_ADDR; + pe_op.err.addr = 0ul; + pe_op.err.mask = 0ul; + ioctl(container, VFIO_EEH_PE_OP, &pe_op); + + .... + + /* When 0xFF's returned from reading PCI config space or IO BARs + * of the PCI device. Check the PE's state to see if that has been + * frozen. + */ + ioctl(container, VFIO_EEH_PE_OP, &pe_op); + + /* Waiting for pending PCI transactions to be completed and don't + * produce any more PCI traffic from/to the affected PE until + * recovery is finished. + */ + + /* Enable IO for the affected PE and collect logs. Usually, the + * standard part of PCI config space, AER registers are dumped + * as logs for further analysis. + */ + pe_op.op = VFIO_EEH_PE_UNFREEZE_IO; + ioctl(container, VFIO_EEH_PE_OP, &pe_op); + + /* + * Issue PE reset: hot or fundamental reset. Usually, hot reset + * is enough. However, the firmware of some PCI adapters would + * require fundamental reset. + */ + pe_op.op = VFIO_EEH_PE_RESET_HOT; + ioctl(container, VFIO_EEH_PE_OP, &pe_op); + pe_op.op = VFIO_EEH_PE_RESET_DEACTIVATE; + ioctl(container, VFIO_EEH_PE_OP, &pe_op); + + /* Configure the PCI bridges for the affected PE */ + pe_op.op = VFIO_EEH_PE_CONFIGURE; + ioctl(container, VFIO_EEH_PE_OP, &pe_op); + + /* Restored state we saved at initialization time. pci_restore_state() + * is good enough as an example. + */ + + /* Hopefully, error is recovered successfully. Now, you can resume to + * start PCI traffic to/from the affected PE. + */ + + .... + +5) There is v2 of SPAPR TCE IOMMU. 
It deprecates VFIO_IOMMU_ENABLE/ + VFIO_IOMMU_DISABLE and implements 2 new ioctls: + VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY + (which are unsupported in v1 IOMMU). + + PPC64 paravirtualized guests generate a lot of map/unmap requests, + and the handling of those includes pinning/unpinning pages and updating + mm::locked_vm counter to make sure we do not exceed the rlimit. + The v2 IOMMU splits accounting and pinning into separate operations: + + - VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls + receive a user space address and size of the block to be pinned. + Bisecting is not supported and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY is + expected to be called with the exact address and size used for registering + the memory block. The userspace is not expected to call these often. + The ranges are stored in a linked list in a VFIO container. + + - VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual + IOMMU table and do not do pinning; instead these check that the userspace + address is from a pre-registered range. + + This separation helps in optimizing DMA for guests. + +6) sPAPR specification allows guests to have an additional DMA window(s) on + a PCI bus with a variable page size. Two ioctls have been added to support + this: VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE. + The platform has to support the functionality or an error will be returned + to the userspace. The existing hardware supports up to 2 DMA windows: one is + 2GB long, uses 4K pages and is called "default 32bit window"; the other can + be as big as the entire RAM and use a different page size. It is optional; + guests create it at run time if the guest driver supports 64bit DMA. + + VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and + a number of TCE table levels (if a TCE table is going to be big enough and + the kernel may not be able to allocate enough physically contiguous + memory).
It creates a new window in the available slot and returns the bus + address where the new window starts. Due to a hardware limitation, the user + space cannot choose the location of DMA windows. + + VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window + and removes it. + +------------------------------------------------------------------------------- + +.. [1] VFIO was originally an acronym for "Virtual Function I/O" in its + initial implementation by Tom Lyon while at Cisco. We've since + outgrown the acronym, but it's catchy. + +.. [2] "safe" also depends upon a device being "well behaved". It's + possible for multi-function devices to have backdoors between + functions and even for single function devices to have alternative + access to things like PCI config space through MMIO registers. To + guard against the former we can include additional precautions in the + IOMMU driver to group multi-function PCI devices together + (iommu=group_mf). The latter we can't prevent, but the IOMMU should + still provide isolation. For PCI, SR-IOV Virtual Functions are the + best indicator of "well behaved", as these are designed for + virtualization usage models. + +.. [3] As always there are trade-offs to virtual machine device + assignment that are beyond the scope of VFIO. It's expected that + future IOMMU technologies will reduce some, but maybe not all, of + these trade-offs. + +..
[4] In this case the device is below a PCI bridge, so transactions + from either function of the device are indistinguishable to the iommu:: + + -[0000:00]-+-1e.0-[06]--+-0d.0 + \-0d.1 + + 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90) diff --git a/Documentation/driver-api/xillybus.rst b/Documentation/driver-api/xillybus.rst new file mode 100644 index 000000000000..2446ee303c09 --- /dev/null +++ b/Documentation/driver-api/xillybus.rst @@ -0,0 +1,379 @@ +========================================== +Xillybus driver for generic FPGA interface +========================================== + +:Author: Eli Billauer, Xillybus Ltd. (http://xillybus.com) +:Email: eli.billauer@gmail.com or as advertised on Xillybus' site. + +.. Contents: + + - Introduction + -- Background + -- Xillybus Overview + + - Usage + -- User interface + -- Synchronization + -- Seekable pipes + + - Internals + -- Source code organization + -- Pipe attributes + -- Host never reads from the FPGA + -- Channels, pipes, and the message channel + -- Data streaming + -- Data granularity + -- Probing + -- Buffer allocation + -- The "nonempty" message (supporting poll) + + +Introduction +============ + +Background +---------- + +An FPGA (Field Programmable Gate Array) is a piece of logic hardware, which +can be programmed to become virtually anything that is usually found as a +dedicated chipset: For instance, a display adapter, network interface card, +or even a processor with its peripherals. FPGAs are the LEGO of hardware: +Based upon certain building blocks, you make your own toys the way you like +them. It's usually pointless to reimplement something that is already +available on the market as a chipset, so FPGAs are mostly used when some +special functionality is needed, and the production volume is relatively low +(hence not justifying the development of an ASIC). + +The challenge with FPGAs is that everything is implemented at a very low +level, even lower than assembly language. 
In order to allow FPGA designers to +focus on their specific project, and not reinvent the wheel over and over +again, pre-designed building blocks, IP cores, are often used. These are the +FPGA parallels of library functions. IP cores may implement certain +mathematical functions, a functional unit (e.g. a USB interface), an entire +processor (e.g. ARM) or anything that might come handy. Think of them as a +building block, with electrical wires dangling on the sides for connection to +other blocks. + +One of the daunting tasks in FPGA design is communicating with a fullblown +operating system (actually, with the processor running it): Implementing the +low-level bus protocol and the somewhat higher-level interface with the host +(registers, interrupts, DMA etc.) is a project in itself. When the FPGA's +function is a well-known one (e.g. a video adapter card, or a NIC), it can +make sense to design the FPGA's interface logic specifically for the project. +A special driver is then written to present the FPGA as a well-known interface +to the kernel and/or user space. In that case, there is no reason to treat the +FPGA differently than any device on the bus. + +It's however common that the desired data communication doesn't fit any well- +known peripheral function. Also, the effort of designing an elegant +abstraction for the data exchange is often considered too big. In those cases, +a quicker and possibly less elegant solution is sought: The driver is +effectively written as a user space program, leaving the kernel space part +with just elementary data transport. This still requires designing some +interface logic for the FPGA, and write a simple ad-hoc driver for the kernel. + +Xillybus Overview +----------------- + +Xillybus is an IP core and a Linux driver. Together, they form a kit for +elementary data transport between an FPGA and the host, providing pipe-like +data streams with a straightforward user interface. 
It's intended as a low- +effort solution for mixed FPGA-host projects, for which it makes sense to +have the project-specific part of the driver running in a user-space program. + +Since the communication requirements may vary significantly from one FPGA +project to another (the number of data pipes needed in each direction and +their attributes), there isn't one specific chunk of logic being the Xillybus +IP core. Rather, the IP core is configured and built based upon a +specification given by its end user. + +Xillybus presents independent data streams, which resemble pipes or TCP/IP +communication to the user. At the host side, a character device file is used +just like any pipe file. On the FPGA side, hardware FIFOs are used to stream +the data. This is contrary to a common method of communicating through fixed- +sized buffers (even though such buffers are used by Xillybus under the hood). +There may be more than a hundred of these streams on a single IP core, but +also no more than one, depending on the configuration. + +In order to ease the deployment of the Xillybus IP core, it contains a simple +data structure which completely defines the core's configuration. The Linux +driver fetches this data structure during its initialization process, and sets +up the DMA buffers and character devices accordingly. As a result, a single +driver is used to work out of the box with any Xillybus IP core. + +The data structure just mentioned should not be confused with PCI's +configuration space or the Flattened Device Tree. + +Usage +===== + +User interface +-------------- + +On the host, all interaction with Xillybus is done through /dev/xillybus_* +device files, which are generated automatically as the driver loads. The +names of these files depend on the IP core that is loaded in the FPGA (see +Probing below).
To communicate with the FPGA, open the device file that +corresponds to the hardware FIFO you want to send data to or receive data from, +and use plain write() or read() calls, just like with a regular pipe. In +particular, it makes perfect sense to go:: + + $ cat mydata > /dev/xillybus_thisfifo + + $ cat /dev/xillybus_thatfifo > hisdata + +possibly pressing CTRL-C at some stage, even though the xillybus_* pipes have +the capability to send an EOF (but may not use it). + +The driver and hardware are designed to behave sensibly as pipes, including: + +* Supporting non-blocking I/O (by setting O_NONBLOCK on open() ). + +* Supporting poll() and select(). + +* Being bandwidth efficient under load (using DMA) but also handling small + pieces of data sent across (like TCP/IP) by autoflushing. + +A device file can be read only, write only or bidirectional. Bidirectional +device files are treated like two independent pipes (except for sharing a +"channel" structure in the implementation code). + +Synchronization +--------------- + +Xillybus pipes are configured (on the IP core) to be either synchronous or +asynchronous. For a synchronous pipe, write() returns successfully only after +some data has been submitted and acknowledged by the FPGA. This slows down +bulk data transfers, and is nearly impossible to use with streams that +require data at a constant rate: There is no data transmitted to the FPGA +between write() calls, in particular when the process loses the CPU. + +When a pipe is configured asynchronous, write() returns if there was enough +room in the buffers to store any of the data in the buffers. + +For FPGA to host pipes, asynchronous pipes allow data transfer from the FPGA +as soon as the respective device file is opened, regardless of whether the data +has been requested by a read() call. On synchronous pipes, only the amount +of data requested by a read() call is transmitted.
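Since a xillybus_* device file is read like a regular pipe, a user-space program can use ordinary POSIX calls on it. The following is a minimal sketch and is not taken from the Xillybus sources: read_full() is a hypothetical helper name. It loops because, as with any pipe, a single read() may return fewer bytes than requested.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the Xillybus driver or any library):
 * keep calling read() until len bytes have arrived, EOF is reached, or an
 * error other than EINTR occurs. Returns the number of bytes stored in buf,
 * or -1 on error.
 */
ssize_t read_full(int fd, void *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t rc = read(fd, (char *)buf + done, len - done);

		if (rc < 0) {
			if (errno == EINTR)
				continue;	/* interrupted by a signal, retry */
			return -1;
		}
		if (rc == 0)
			break;			/* EOF */
		done += (size_t)rc;
	}
	return (ssize_t)done;
}
```

A caller would typically obtain the descriptor with open("/dev/xillybus_thatfifo", O_RDONLY), using the illustrative device name from the shell example above, and then pass it to read_full().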
+ +In summary, for synchronous pipes, data between the host and FPGA is +transmitted only to satisfy the read() or write() call currently handled +by the driver, and those calls wait for the transmission to complete before +returning. + +Note that the synchronization attribute has nothing to do with the possibility +that read() or write() completes fewer bytes than requested. There is a +separate configuration flag ("allowpartial") that determines whether such a +partial completion is allowed. + +Seekable pipes +-------------- + +A synchronous pipe can be configured to have the stream's position exposed +to the user logic at the FPGA. Such a pipe is also seekable on the host API. +With this feature, a memory or register interface can be attached on the +FPGA side to the seekable stream. Reading or writing to a certain address in +the attached memory is done by seeking to the desired address, and calling +read() or write() as required. + + +Internals +========= + +Source code organization +------------------------ + +The Xillybus driver consists of a core module, xillybus_core.c, and modules +that depend on the specific bus interface (xillybus_of.c and xillybus_pcie.c). + +The bus specific modules are those probed when a suitable device is found by +the kernel. Since the DMA mapping and synchronization functions, which are bus +dependent by their nature, are used by the core module, a +xilly_endpoint_hardware structure is passed to the core module on +initialization. This structure is populated with pointers to wrapper functions +which execute the DMA-related operations on the bus. + +Pipe attributes +--------------- + +Each pipe has a number of attributes which are set when the FPGA component +(IP core) is built. They are fetched from the IDT (the data structure which +defines the core's configuration, see Probing below) by xilly_setupchannels() +in xillybus_core.c as follows: + +* is_writebuf: The pipe's direction.
A non-zero value means it's an FPGA to + host pipe (the FPGA "writes"). + +* channelnum: The pipe's identification number in communication between the + host and FPGA. + +* format: The underlying data width. See Data Granularity below. + +* allowpartial: A non-zero value means that a read() or write() (whichever + applies) may return with less than the requested number of bytes. The common + choice is a non-zero value, to match standard UNIX behavior. + +* synchronous: A non-zero value means that the pipe is synchronous. See + Synchronization above. + +* bufsize: Each DMA buffer's size. Always a power of two. + +* bufnum: The number of buffers allocated for this pipe. Always a power of two. + +* exclusive_open: A non-zero value forces exclusive opening of the associated + device file. If the device file is bidirectional, and already opened only in + one direction, the opposite direction may be opened once. + +* seekable: A non-zero value indicates that the pipe is seekable. See + Seekable pipes above. + +* supports_nonempty: A non-zero value (which is typical) indicates that the + hardware will send the messages that are necessary to support select() and + poll() for this pipe. + +Host never reads from the FPGA +------------------------------ + +Even though PCI Express is hotpluggable in general, a typical motherboard +doesn't expect a card to go away all of the sudden. But since the PCIe card +is based upon reprogrammable logic, a sudden disappearance from the bus is +quite likely as a result of an accidental reprogramming of the FPGA while the +host is up. In practice, nothing happens immediately in such a situation. But +if the host attempts to read from an address that is mapped to the PCI Express +device, that leads to an immediate freeze of the system on some motherboards, +even though the PCIe standard requires a graceful recovery. + +In order to avoid these freezes, the Xillybus driver refrains completely from +reading from the device's register space. 
All communication from the FPGA to +the host is done through DMA. In particular, the Interrupt Service Routine +doesn't follow the common practice of checking a status register when it's +invoked. Rather, the FPGA prepares a small buffer which contains short +messages, which inform the host what the interrupt was about. + +This mechanism is used on non-PCIe buses as well for the sake of uniformity. + + +Channels, pipes, and the message channel +---------------------------------------- + +Each of the (possibly bidirectional) pipes presented to the user is allocated +a data channel between the FPGA and the host. The distinction between channels +and pipes is necessary only because of channel 0, which is used for interrupt- +related messages from the FPGA, and has no pipe attached to it. + +Data streaming +-------------- + +Even though a non-segmented data stream is presented to the user at both +sides, the implementation relies on a set of DMA buffers which is allocated +for each channel. For the sake of illustration, let's take the FPGA to host +direction: As data streams into the respective channel's interface in the +FPGA, the Xillybus IP core writes it to one of the DMA buffers. When the +buffer is full, the FPGA informs the host about that (appending a +XILLYMSG_OPCODE_RELEASEBUF message to channel 0 and sending an interrupt if +necessary). The host responds by making the data available for reading through +the character device. When all data has been read, the host writes to the +FPGA's buffer control register, allowing the buffer's overwriting. Flow +control mechanisms exist on both sides to prevent underflows and overflows. + +This is not good enough for creating a TCP/IP-like stream: If the data flow +stops momentarily before a DMA buffer is filled, the intuitive expectation is +that the partial data in the buffer will arrive anyhow, despite the buffer not +being completed.
This is implemented by adding a field in the +XILLYMSG_OPCODE_RELEASEBUF message, through which the FPGA informs not just +which buffer is submitted, but how much data it contains. + +But the FPGA will submit a partially filled buffer only if directed to do so +by the host. This situation occurs when the read() method has been blocking +for XILLY_RX_TIMEOUT jiffies (currently 10 ms), after which the host commands +the FPGA to submit a DMA buffer as soon as it can. This timeout mechanism +balances between bus bandwidth efficiency (preventing a lot of partially +filled buffers being sent) and a latency held fairly low for tails of data. + +A similar setting is used in the host to FPGA direction. The handling of +partial DMA buffers is somewhat different, though. The user can tell the +driver to submit all data it has in the buffers to the FPGA, by issuing a +write() with the byte count set to zero. This is similar to a flush request, +but it doesn't block. There is also an autoflushing mechanism, which triggers +an equivalent flush roughly XILLY_RX_TIMEOUT jiffies after the last write(). +This allows the user to be oblivious about the underlying buffering mechanism +and yet enjoy a stream-like interface. + +Note that the issue of partial buffer flushing is irrelevant for pipes having +the "synchronous" attribute nonzero, since synchronous pipes don't allow data +to lay around in the DMA buffers between read() and write() anyhow. + +Data granularity +---------------- + +The data arrives or is sent at the FPGA as 8, 16 or 32 bit wide words, as +configured by the "format" attribute. Whenever possible, the driver attempts +to hide this when the pipe is accessed differently from its natural alignment. +For example, reading single bytes from a pipe with 32 bit granularity works +with no issues. 
Writing single bytes to pipes with 16 or 32 bit granularity +will also work, but the driver can't send partially completed words to the +FPGA, so the transmission of up to one word may be held until it's fully +occupied with user data. + +This somewhat complicates the handling of host to FPGA streams, because +when a buffer is flushed, it may contain up to 3 bytes that don't form a word +in the FPGA, and hence can't be sent. To prevent loss of data, these leftover +bytes need to be moved to the next buffer. The parts in xillybus_core.c +that mention "leftovers" in some way are related to this complication. + +Probing +------- + +As mentioned earlier, the number of pipes that are created when the driver +loads and their attributes depend on the Xillybus IP core in the FPGA. During +the driver's initialization, a blob containing configuration info, the +Interface Description Table (IDT), is sent from the FPGA to the host. The +bootstrap process is done in three phases: + +1. Acquire the length of the IDT, so a buffer can be allocated for it. This + is done by sending a quiesce command to the device, since the acknowledge + for this command contains the IDT's buffer length. + +2. Acquire the IDT itself. + +3. Create the interfaces according to the IDT. + +Buffer allocation +----------------- + +In order to simplify the logic that prevents illegal boundary crossings of +PCIe packets, the following rule applies: If a buffer is smaller than 4kB, +it must not cross a 4kB boundary. Otherwise, it must be 4kB aligned. The +xilly_setupchannels() function allocates these buffers by requesting whole +pages from the kernel, and dividing them into DMA buffers as necessary. Since +all buffers' sizes are powers of two, it's possible to pack any set of such +buffers, with a maximal waste of one page of memory. + +All buffers are allocated when the driver is loaded.
This is necessary, +since large contiguous physical memory segments are sometimes requested, +which are more likely to be available when the system is freshly booted. + +The allocation of buffer memory takes place in the same order as the pipes +appear in the IDT. The driver relies on a rule that the pipes are sorted with +decreasing buffer size in the IDT. If a requested buffer is larger than or +equal to a page, the necessary number of pages is requested from the kernel, +and these are used for this buffer. If the requested buffer is smaller than a +page, one single page is requested from the kernel, and that page is partially +used. Or, if there already is a partially used page at hand, the buffer is +packed into that page. It can be shown that all pages requested from the kernel +(except possibly for the last) are 100% utilized this way. + +The "nonempty" message (supporting poll) +---------------------------------------- + +In order to support the "poll" method (and hence select() ), there is a small +catch regarding the FPGA to host direction: The FPGA may have filled a DMA +buffer with some data, but not submitted that buffer. If the host waited for +the buffer's submission by the FPGA, there would be a possibility that the +FPGA side has sent data, but a select() call would still block, because the +host has not received any notification about this. This is solved with +XILLYMSG_OPCODE_NONEMPTY messages sent by the FPGA when a channel goes from +completely empty to containing some data. + +These messages are used only to support poll() and select(). The IP core can +be configured not to send them, which slightly reduces bandwidth usage.
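The page-packing rule from the "Buffer allocation" section above can be checked with a small stand-alone sketch. This is not the code of xilly_setupchannels(); pack_buffers(), struct pack_result and SKETCH_PAGE_SIZE are hypothetical names, and a 4096-byte page is assumed. The sketch only counts pages and unused bytes for a list of power-of-two buffer sizes given in non-increasing order, as the text says the IDT guarantees.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size, for illustration only */

struct pack_result {
	size_t pages;	/* pages requested from the (imaginary) allocator */
	size_t wasted;	/* bytes requested but not used by any buffer */
};

/*
 * sizes[] is expected to hold powers of two in non-increasing order.
 * Buffers of a page or more get whole pages; smaller buffers are packed
 * into the current partially used page, or into a fresh page if the
 * current one has no room left.
 */
struct pack_result pack_buffers(const size_t *sizes, size_t n)
{
	struct pack_result r = { 0, 0 };
	size_t left = 0;	/* bytes still free in the current partial page */
	size_t i;

	for (i = 0; i < n; i++) {
		if (sizes[i] >= SKETCH_PAGE_SIZE) {
			r.pages += sizes[i] / SKETCH_PAGE_SIZE;
			continue;
		}
		if (left < sizes[i]) {
			r.wasted += left;	/* always 0 for sorted input */
			r.pages++;
			left = SKETCH_PAGE_SIZE;
		}
		left -= sizes[i];
	}
	r.wasted += left;	/* only the last partial page can have a gap */
	return r;
}
```

With sorted power-of-two sizes, the free space in the current page is always a multiple of the next buffer's size, so a new page is opened only when the previous one is completely full. Each small buffer also starts at an offset that is a multiple of its own size, which is why it cannot cross a 4kB boundary.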
diff --git a/Documentation/driver-api/zorro.rst b/Documentation/driver-api/zorro.rst new file mode 100644 index 000000000000..664072b017e3 --- /dev/null +++ b/Documentation/driver-api/zorro.rst @@ -0,0 +1,104 @@ +======================================== +Writing Device Drivers for Zorro Devices +======================================== + +:Author: Written by Geert Uytterhoeven +:Last revised: September 5, 2003 + + +Introduction +------------ + +The Zorro bus is the bus used in the Amiga family of computers. Thanks to +AutoConfig(tm), it's 100% Plug-and-Play. + +There are two types of Zorro buses, Zorro II and Zorro III: + + - The Zorro II address space is 24-bit and lies within the first 16 MB of the + Amiga's address map. + + - Zorro III is a 32-bit extension of Zorro II, which is backwards compatible + with Zorro II. The Zorro III address space lies outside the first 16 MB. + + +Probing for Zorro Devices +------------------------- + +Zorro devices are found by calling ``zorro_find_device()``, which returns a +pointer to the ``next`` Zorro device with the specified Zorro ID. A probe loop +for the board with Zorro ID ``ZORRO_PROD_xxx`` looks like:: + + struct zorro_dev *z = NULL; + + while ((z = zorro_find_device(ZORRO_PROD_xxx, z))) { + if (!zorro_request_region(z->resource.start+MY_START, MY_SIZE, + "My explanation")) + ... + } + +``ZORRO_WILDCARD`` acts as a wildcard and finds any Zorro device. If your driver +supports different types of boards, you can use a construct like:: + + struct zorro_dev *z = NULL; + + while ((z = zorro_find_device(ZORRO_WILDCARD, z))) { + if (z->id != ZORRO_PROD_xxx1 && z->id != ZORRO_PROD_xxx2 && ...) + continue; + if (!zorro_request_region(z->resource.start+MY_START, MY_SIZE, + "My explanation")) + ... + } + + +Zorro Resources +--------------- + +Before you can access a Zorro device's registers, you have to make sure it's +not yet in use. 
This is done using the I/O memory space resource management +functions:: + + request_mem_region() + release_mem_region() + +Shortcuts to claim the whole device's address space are provided as well:: + + zorro_request_device + zorro_release_device + + +Accessing the Zorro Address Space +--------------------------------- + +The address regions in the Zorro device resources are Zorro bus address +regions. Due to the identity bus-physical address mapping on the Zorro bus, +they are CPU physical addresses as well. + +The treatment of these regions depends on the type of Zorro space: + + - Zorro II address space is always mapped and does not have to be mapped + explicitly using z_ioremap(). + + Conversion from bus/physical Zorro II addresses to kernel virtual addresses + and vice versa is done using:: + + virt_addr = ZTWO_VADDR(bus_addr); + bus_addr = ZTWO_PADDR(virt_addr); + + - Zorro III address space must be mapped explicitly using z_ioremap() first + before it can be accessed:: + + virt_addr = z_ioremap(bus_addr, size); + ... + z_iounmap(virt_addr); + + +References +---------- + +#. linux/include/linux/zorro.h +#. linux/include/uapi/linux/zorro.h +#. linux/include/uapi/linux/zorro_ids.h +#. linux/arch/m68k/include/asm/zorro.h +#. linux/drivers/zorro +#. /proc/bus/zorro + diff --git a/Documentation/eisa.txt b/Documentation/eisa.txt deleted file mode 100644 index c07565ba57da..000000000000 --- a/Documentation/eisa.txt +++ /dev/null @@ -1,230 +0,0 @@ -================ -EISA bus support -================ - -:Author: Marc Zyngier - -This document groups random notes about porting EISA drivers to the -new EISA/sysfs API. - -Starting from version 2.5.59, the EISA bus is almost given the same -status as other much more mainstream busses such as PCI or USB. This -has been possible through sysfs, which defines a nice enough set of -abstractions to manage busses, devices and drivers. 
- -Although the new API is quite simple to use, converting existing -drivers to the new infrastructure is not an easy task (mostly because -detection code is generally also used to probe ISA cards). Moreover, -most EISA drivers are among the oldest Linux drivers so, as you can -imagine, some dust has settled here over the years. - -The EISA infrastructure is made up of three parts: - - - The bus code implements most of the generic code. It is shared - among all the architectures that the EISA code runs on. It - implements bus probing (detecting EISA cards available on the bus), - allocates I/O resources, allows fancy naming through sysfs, and - offers interfaces for driver to register. - - - The bus root driver implements the glue between the bus hardware - and the generic bus code. It is responsible for discovering the - device implementing the bus, and setting it up to be latter probed - by the bus code. This can go from something as simple as reserving - an I/O region on x86, to the rather more complex, like the hppa - EISA code. This is the part to implement in order to have EISA - running on an "new" platform. - - - The driver offers the bus a list of devices that it manages, and - implements the necessary callbacks to probe and release devices - whenever told to. - -Every function/structure below lives in , which depends -heavily on . - -Bus root driver -=============== - -:: - - int eisa_root_register (struct eisa_root_device *root); - -The eisa_root_register function is used to declare a device as the -root of an EISA bus. 
The eisa_root_device structure holds a reference -to this device, as well as some parameters for probing purposes:: - - struct eisa_root_device { - struct device *dev; /* Pointer to bridge device */ - struct resource *res; - unsigned long bus_base_addr; - int slots; /* Max slot number */ - int force_probe; /* Probe even when no slot 0 */ - u64 dma_mask; /* from bridge device */ - int bus_nr; /* Set by eisa_root_register */ - struct resource eisa_root_res; /* ditto */ - }; - -============= ====================================================== -node used for eisa_root_register internal purpose -dev pointer to the root device -res root device I/O resource -bus_base_addr slot 0 address on this bus -slots max slot number to probe -force_probe Probe even when slot 0 is empty (no EISA mainboard) -dma_mask Default DMA mask. Usually the bridge device dma_mask. -bus_nr unique bus id, set by eisa_root_register -============= ====================================================== - -Driver -====== - -:: - - int eisa_driver_register (struct eisa_driver *edrv); - void eisa_driver_unregister (struct eisa_driver *edrv); - -Clear enough ? - -:: - - struct eisa_device_id { - char sig[EISA_SIG_LEN]; - unsigned long driver_data; - }; - - struct eisa_driver { - const struct eisa_device_id *id_table; - struct device_driver driver; - }; - -=============== ==================================================== -id_table an array of NULL terminated EISA id strings, - followed by an empty string. Each string can - optionally be paired with a driver-dependent value - (driver_data). - -driver a generic driver, such as described in - Documentation/driver-api/driver-model/driver.rst. Only .name, - .probe and .remove members are mandatory. 
-=============== ==================================================== - -An example is the 3c59x driver:: - - static struct eisa_device_id vortex_eisa_ids[] = { - { "TCM5920", EISA_3C592_OFFSET }, - { "TCM5970", EISA_3C597_OFFSET }, - { "" } - }; - - static struct eisa_driver vortex_eisa_driver = { - .id_table = vortex_eisa_ids, - .driver = { - .name = "3c59x", - .probe = vortex_eisa_probe, - .remove = vortex_eisa_remove - } - }; - -Device -====== - -The sysfs framework calls .probe and .remove functions upon device -discovery and removal (note that the .remove function is only called -when the driver is built as a module). - -Both functions are passed a pointer to a 'struct device', which is -encapsulated in a 'struct eisa_device' described as follows:: - - struct eisa_device { - struct eisa_device_id id; - int slot; - int state; - unsigned long base_addr; - struct resource res[EISA_MAX_RESOURCES]; - u64 dma_mask; - struct device dev; /* generic device */ - }; - -======== ============================================================ -id EISA id, as read from device. id.driver_data is set from the - matching driver EISA id. -slot slot number which the device was detected on -state set of flags indicating the state of the device. Current - flags are EISA_CONFIG_ENABLED and EISA_CONFIG_FORCED. -res set of four 256-byte I/O regions allocated to this device -dma_mask DMA mask set from the parent device. -dev generic device (see Documentation/driver-api/driver-model/device.rst) -======== ============================================================ - -You can get the 'struct eisa_device' from 'struct device' using the -'to_eisa_device' macro. - -Misc stuff -========== - -:: - - void eisa_set_drvdata (struct eisa_device *edev, void *data); - -Stores data into the device's driver_data area. - -:: - - void *eisa_get_drvdata (struct eisa_device *edev); - -Gets the pointer previously stored into the device's driver_data area.
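The relationship between 'struct device', the embedding 'struct eisa_device' and the driver_data pointer can be sketched in plain userspace C. This is only a mock under stated assumptions: every `mock_*` name below is hypothetical, nothing here is the kernel's actual implementation, and the real `to_eisa_device`, `eisa_set_drvdata` and `eisa_get_drvdata` should be taken from the kernel headers.

```c
/* Userspace mock, NOT kernel code: all mock_* names are hypothetical. */
#include <assert.h>
#include <stddef.h>

/* Stand-in for the generic 'struct device'. */
struct mock_device {
	void *driver_data;
};

/* Stand-in for 'struct eisa_device': the generic device is embedded. */
struct mock_eisa_device {
	int slot;
	struct mock_device dev;
};

/* Same idea as to_eisa_device: step back from the embedded member. */
#define to_mock_eisa_device(d) \
	((struct mock_eisa_device *)((char *)(d) - \
		offsetof(struct mock_eisa_device, dev)))

/* Store a driver-private pointer in the device. */
void mock_eisa_set_drvdata(struct mock_eisa_device *edev, void *data)
{
	edev->dev.driver_data = data;
}

/* Fetch the pointer stored by mock_eisa_set_drvdata(). */
void *mock_eisa_get_drvdata(struct mock_eisa_device *edev)
{
	return edev->dev.driver_data;
}

/* Hypothetical per-card state a driver might keep. */
struct mock_card_state {
	int irq;
};

/*
 * A probe-like callback only receives the generic device pointer, so it
 * first recovers the enclosing EISA device, then attaches its state.
 * Returns the slot number purely so the caller can observe the result.
 */
int mock_probe(struct mock_device *dev, struct mock_card_state *state)
{
	struct mock_eisa_device *edev = to_mock_eisa_device(dev);

	mock_eisa_set_drvdata(edev, state);
	return edev->slot;
}
```

A remove-like callback would do the reverse: recover the enclosing device from the generic pointer, fetch the stored state with the get helper, and release it.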
- -:: - - int eisa_get_region_index (void *addr); - -Returns the region number (0 <= x < EISA_MAX_RESOURCES) of a given -address. - -Kernel parameters -================= - -eisa_bus.enable_dev - A comma-separated list of slots to be enabled, even if the firmware - set the card as disabled. The driver must be able to properly - initialize the device in such conditions. - -eisa_bus.disable_dev - A comma-separated list of slots to be disabled, even if the firmware - set the card as enabled. The driver won't be called to handle this - device. - -virtual_root.force_probe - Force the probing code to probe EISA slots even when it cannot find an - EISA compliant mainboard (nothing appears on slot 0). Defaults to 0 - (don't force), and set to 1 (force probing) when either - CONFIG_ALPHA_JENSEN or CONFIG_EISA_VLB_PRIMING is set. - -Random notes -============ - -Converting an EISA driver to the new API mostly involves *deleting* -code (since probing is now in the core EISA code). Unfortunately, most -drivers share their probing routine between ISA and EISA. Special -care must be taken when ripping out the EISA code, so other busses -won't suffer from these surgical strikes... - -You *must not* expect any EISA device to be detected when returning -from eisa_driver_register, since the chances are that the bus has not -yet been probed. In fact, that's what happens most of the time (the -bus root driver usually kicks in rather late in the boot process). -Unfortunately, most drivers are doing the probing by themselves, and -expect to have explored the whole machine when they exit their probe -routine. - -For example, switching your favorite EISA SCSI card to the "hotplug" -model is "the right thing"(tm).
- -Thanks -====== - -I'd like to thank the following people for their help: - -- Xavier Benigni for lending me a wonderful Alpha Jensen, -- James Bottomley, Jeff Garzik for getting this stuff into the kernel, -- Andries Brouwer for contributing numerous EISA ids, -- Catrin Jones for coping with far too many machines at home. diff --git a/Documentation/fb/fbcon.rst b/Documentation/fb/fbcon.rst index 26bc5cdaabab..ebca41785abe 100644 --- a/Documentation/fb/fbcon.rst +++ b/Documentation/fb/fbcon.rst @@ -187,7 +187,7 @@ the hardware. Thus, in a VGA console:: Assuming the VGA driver can be unloaded, one must first unbind the VGA driver from the console layer before unloading the driver. The VGA driver cannot be unloaded if it is still bound to the console layer. (See -Documentation/console/console.rst for more information). +Documentation/driver-api/console.rst for more information). This is more complicated in the case of the framebuffer console (fbcon), because fbcon is an intermediate layer between the console and the drivers:: @@ -204,7 +204,7 @@ fbcon. Thus, there is no need to explicitly unbind the fbdev drivers from fbcon. So, how do we unbind fbcon from the console? Part of the answer is in -Documentation/console/console.rst. To summarize: +Documentation/driver-api/console.rst. To summarize: Echo a value to the bind file that represents the framebuffer console driver. So assuming vtcon1 represents fbcon, then:: diff --git a/Documentation/isa.txt b/Documentation/isa.txt deleted file mode 100644 index def4a7b690b5..000000000000 --- a/Documentation/isa.txt +++ /dev/null @@ -1,122 +0,0 @@ -=========== -ISA Drivers -=========== - -The following text is adapted from the commit message of the initial -commit of the ISA bus driver authored by Rene Herman. 
- -During the recent "isa drivers using platform devices" discussion it was -pointed out that (ALSA) ISA drivers ran into the problem of not having -the option to fail driver load (device registration rather) upon not -finding their hardware due to a probe() error not being passed up -through the driver model. In the course of that, I suggested a separate -ISA bus might be best; Russell King agreed and suggested this bus could -use the .match() method for the actual device discovery. - -The attached does this. For this old non (generically) discoverable ISA -hardware only the driver itself can do discovery so as a difference with -the platform_bus, this isa_bus also distributes match() up to the -driver. - -As another difference: these devices only exist in the driver model due -to the driver creating them because it might want to drive them, meaning -that all device creation has been made internal as well. - -The usage model this provides is nice, and has been acked from the ALSA -side by Takashi Iwai and Jaroslav Kysela. The ALSA driver module_init's -now (for oldisa-only drivers) become:: - - static int __init alsa_card_foo_init(void) - { - return isa_register_driver(&snd_foo_isa_driver, SNDRV_CARDS); - } - - static void __exit alsa_card_foo_exit(void) - { - isa_unregister_driver(&snd_foo_isa_driver); - } - -Quite like the other bus models therefore. This removes a lot of -duplicated init code from the ALSA ISA drivers. - -The passed in isa_driver struct is the regular driver struct embedding a -struct device_driver, the normal probe/remove/shutdown/suspend/resume -callbacks, and as indicated that .match callback. - -The "SNDRV_CARDS" you see being passed in is a "unsigned int ndev" -parameter, indicating how many devices to create and call our methods -with. 
- -The platform_driver callbacks are called with a platform_device param; -the isa_driver callbacks are being called with a ``struct device *dev, -unsigned int id`` pair directly -- with the device creation completely -internal to the bus it's much cleaner to not leak isa_dev's by passing -them in at all. The id is the only thing we ever want other than the -struct device anyways, and it makes for nicer code in the callbacks as -well. - -With this additional .match() callback ISA drivers have all options. If -ALSA would want to keep the old non-load behaviour, it could stick all -of the old .probe in .match, which would only keep them registered after -everything was found to be present and accounted for. If it wanted the -behaviour of always loading as it inadvertently did for a bit after the -changeover to platform devices, it could just not provide a .match() and -do everything in .probe() as before. - -If it, as Takashi Iwai already suggested earlier as a way of following -the model from saner buses more closely, wants to load when a later bind -could conceivably succeed, it could use .match() for the prerequisites -(such as checking the user wants the card enabled and that port/irq/dma -values have been passed in) and .probe() for everything else. This is -the nicest model. - -To the code... - -This exports only two functions: isa_{,un}register_driver(). - -isa_register_driver() registers the struct device_driver, and then -loops over the passed in ndev creating devices and registering them.
-This causes the bus match method to be called for them, which is:: - - int isa_bus_match(struct device *dev, struct device_driver *driver) - { - struct isa_driver *isa_driver = to_isa_driver(driver); - - if (dev->platform_data == isa_driver) { - if (!isa_driver->match || - isa_driver->match(dev, to_isa_dev(dev)->id)) - return 1; - dev->platform_data = NULL; - } - return 0; - } - -The first thing this does is check if this device is in fact one of this -driver's devices by seeing if the device's platform_data pointer is set -to this driver. Platform devices compare strings, but we don't need to -do that with everything being internal, so isa_register_driver() abuses -dev->platform_data as an isa_driver pointer which we can then check here. -I believe platform_data is available for this, but if rather not, moving -the isa_driver pointer to the private struct isa_dev is of course fine as -well. - -Then, if the driver did not provide a .match, it matches. If it did, -the driver match() method is called to determine a match. - -If it did **not** match, dev->platform_data is reset to indicate this to -isa_register_driver which can then unregister the device again. - -If during all this, there's any error, or no devices matched at all -everything is backed out again and the error, or -ENODEV, is returned. - -isa_unregister_driver() just unregisters the matched devices and the -driver itself. - -module_isa_driver is a helper macro for ISA drivers which do not do -anything special in module init/exit. This eliminates a lot of -boilerplate code. Each module may only use this macro once, and calling -it replaces module_init and module_exit. - -max_num_isa_dev is a macro to determine the maximum possible number of -ISA devices which may be registered in the I/O port address space given -the address extent of the ISA devices.
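The registration flow described in the ISA text above (create ndev devices, let the bus match method consult the driver's optional .match callback, drop devices that do not match, and report -ENODEV when nothing matched) can be modelled as a small userspace sketch. It is an assumption-laden mock rather than the kernel code: all `mock_*` names, the fixed-size device array and the `MOCK_MAX_DEVS` limit are invented for illustration, and the real callbacks take a `struct device *` in addition to the id.

```c
/* Userspace mock, NOT kernel code: all mock_* names are hypothetical. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MOCK_MAX_DEVS 8

struct mock_isa_driver;

struct mock_isa_dev {
	unsigned int id;
	const struct mock_isa_driver *platform_data; /* owner, or NULL */
	int registered;
};

struct mock_isa_driver {
	/* Optional: return nonzero if device 'id' should be kept. */
	int (*match)(unsigned int id);
	struct mock_isa_dev devs[MOCK_MAX_DEVS];
};

/* Mirrors the structure of the bus match method quoted above. */
int mock_isa_bus_match(struct mock_isa_dev *dev,
		       const struct mock_isa_driver *drv)
{
	if (dev->platform_data == drv) {
		if (!drv->match || drv->match(dev->id))
			return 1;
		dev->platform_data = NULL; /* signal "no match" to caller */
	}
	return 0;
}

/* Returns 0 if at least one device matched, -ENODEV otherwise. */
int mock_isa_register_driver(struct mock_isa_driver *drv, unsigned int ndev)
{
	unsigned int id, matched = 0;

	if (ndev > MOCK_MAX_DEVS)
		return -EINVAL;

	for (id = 0; id < ndev; id++) {
		struct mock_isa_dev *dev = &drv->devs[id];

		dev->id = id;
		dev->platform_data = drv;
		/* A device that fails to match is not kept registered. */
		dev->registered = mock_isa_bus_match(dev, drv);
		matched += dev->registered;
	}
	return matched ? 0 : -ENODEV;
}

/* Hypothetical match callbacks used to exercise the flow. */
int mock_match_even(unsigned int id)
{
	return (id % 2) == 0;
}

int mock_match_none(unsigned int id)
{
	(void)id;
	return 0;
}
```

With `mock_match_even` and four devices, ids 0 and 2 stay registered while ids 1 and 3 have their platform_data reset; a driver with no match callback keeps every device, and `mock_match_none` makes registration fail with -ENODEV.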
diff --git a/Documentation/isapnp.txt b/Documentation/isapnp.txt deleted file mode 100644 index 8d0840ac847b..000000000000 --- a/Documentation/isapnp.txt +++ /dev/null @@ -1,15 +0,0 @@ -========================================================== -ISA Plug & Play support by Jaroslav Kysela -========================================================== - -Interface /proc/isapnp -====================== - -The interface has been removed. See pnp.txt for more details. - -Interface /proc/bus/isapnp -========================== - -This directory allows access to ISA PnP cards and logical devices. -The regular files contain the contents of ISA PnP registers for -a logical device. diff --git a/Documentation/lightnvm/pblk.txt b/Documentation/lightnvm/pblk.txt deleted file mode 100644 index 1040ed1cec81..000000000000 --- a/Documentation/lightnvm/pblk.txt +++ /dev/null @@ -1,21 +0,0 @@ -pblk: Physical Block Device Target -================================== - -pblk implements a fully associative, host-based FTL that exposes a traditional -block I/O interface. Its primary responsibilities are: - - - Map logical addresses onto physical addresses (4KB granularity) in a - logical-to-physical (L2P) table. - - Maintain the integrity and consistency of the L2P table as well as its - recovery from normal tear down and power outage. - - Deal with controller- and media-specific constrains. - - Handle I/O errors. - - Implement garbage collection. - - Maintain consistency across the I/O stack during synchronization points. - -For more information please refer to: - - http://lightnvm.io - -which maintains updated FAQs, manual pages, technical documentation, tools, -contacts, etc. diff --git a/Documentation/men-chameleon-bus.txt b/Documentation/men-chameleon-bus.txt deleted file mode 100644 index 1b1f048aa748..000000000000 --- a/Documentation/men-chameleon-bus.txt +++ /dev/null @@ -1,175 +0,0 @@ -================= -MEN Chameleon Bus -================= - -.. 
Table of Contents - ================= - 1 Introduction - 1.1 Scope of this Document - 1.2 Limitations of the current implementation - 2 Architecture - 2.1 MEN Chameleon Bus - 2.2 Carrier Devices - 2.3 Parser - 3 Resource handling - 3.1 Memory Resources - 3.2 IRQs - 4 Writing an MCB driver - 4.1 The driver structure - 4.2 Probing and attaching - 4.3 Initializing the driver - - -Introduction -============ - -This document describes the architecture and implementation of the MEN -Chameleon Bus (called MCB throughout this document). - -Scope of this Document ----------------------- - -This document is intended to be a short overview of the current -implementation and by no means describes the complete possibilities of MCB -based devices. - -Limitations of the current implementation ------------------------------------------ - -The current implementation is limited to PCI and PCIe based carrier devices -that only use a single memory resource and share the PCI legacy IRQ. Not -implemented are: - -- Multi-resource MCB devices like the VME Controller or M-Module carrier. -- MCB devices that need another MCB device, like SRAM for a DMA Controller's - buffer descriptors or a video controller's video memory. -- A per-carrier IRQ domain for carrier devices that have one (or more) IRQs - per MCB device like PCIe based carriers with MSI or MSI-X support. - -Architecture -============ - -MCB is divided into 3 functional blocks: - -- The MEN Chameleon Bus itself, -- drivers for MCB Carrier Devices and -- the parser for the Chameleon table. - -MEN Chameleon Bus ------------------ - -The MEN Chameleon Bus is an artificial bus system that attaches to a so -called Chameleon FPGA device found on some hardware produced by MEN Mikro -Elektronik GmbH. These devices are multi-function devices implemented in a -single FPGA and usually attached via some sort of PCI or PCIe link. Each -FPGA contains a header section describing the content of the FPGA.
The -header lists the device id, PCI BAR, offset from the beginning of the PCI -BAR, size in the FPGA, interrupt number and some other properties currently -not handled by the MCB implementation. - -Carrier Devices ---------------- - -A carrier device is just an abstraction for the real world physical bus the -Chameleon FPGA is attached to. Some IP Core drivers may need to interact with -properties of the carrier device (like querying the IRQ number of a PCI -device). To provide abstraction from the real hardware bus, an MCB carrier -device provides callback methods to translate the driver's MCB function calls -to hardware related function calls. For example a carrier device may -implement the get_irq() method which can be translated into a hardware bus -query for the IRQ number the device should use. - -Parser ------- - -The parser reads the first 512 bytes of a Chameleon device and parses the -Chameleon table. Currently the parser only supports the Chameleon v2 variant -of the Chameleon table but can easily be adopted to support an older or -possible future variant. While parsing the table's entries new MCB devices -are allocated and their resources are assigned according to the resource -assignment in the Chameleon table. After resource assignment is finished, the -MCB devices are registered at the MCB and thus at the driver core of the -Linux kernel. - -Resource handling -================= - -The current implementation assigns exactly one memory and one IRQ resource -per MCB device. But this is likely going to change in the future. - -Memory Resources ----------------- - -Each MCB device has exactly one memory resource, which can be requested from -the MCB bus. This memory resource is the physical address of the MCB device -inside the carrier and is intended to be passed to ioremap() and friends. It -is already requested from the kernel by calling request_mem_region(). 
- -IRQs ----- - -Each MCB device has exactly one IRQ resource, which can be requested from the -MCB bus. If a carrier device driver implements the ->get_irq() callback -method, the IRQ number assigned by the carrier device will be returned, -otherwise the IRQ number inside the Chameleon table will be returned. This -number is suitable to be passed to request_irq(). - -Writing an MCB driver -===================== - -The driver structure --------------------- - -Each MCB driver has a structure to identify the device driver as well as -device ids which identify the IP Core inside the FPGA. The driver structure -also contains callback methods which get executed on driver probe and -removal from the system:: - - static const struct mcb_device_id foo_ids[] = { - { .device = 0x123 }, - { } - }; - MODULE_DEVICE_TABLE(mcb, foo_ids); - - static struct mcb_driver foo_driver = { - .driver = { - .name = "foo-bar", - .owner = THIS_MODULE, - }, - .probe = foo_probe, - .remove = foo_remove, - .id_table = foo_ids, - }; - -Probing and attaching ---------------------- - -When a driver is loaded and the MCB devices it services are found, the MCB -core will call the driver's probe callback method. When the driver is removed -from the system, the MCB core will call the driver's remove callback method:: - - static int foo_probe(struct mcb_device *mdev, const struct mcb_device_id *id); - static void foo_remove(struct mcb_device *mdev); - -Initializing the driver ------------------------ - -When the kernel is booted or your foo driver module is inserted, you have to -perform driver initialization.
Usually it is enough to register your driver -module at the MCB core:: - - static int __init foo_init(void) - { - return mcb_register_driver(&foo_driver); - } - module_init(foo_init); - - static void __exit foo_exit(void) - { - mcb_unregister_driver(&foo_driver); - } - module_exit(foo_exit); - -The module_mcb_driver() macro can be used to reduce the above code:: - - module_mcb_driver(foo_driver); diff --git a/Documentation/ntb.txt b/Documentation/ntb.txt deleted file mode 100644 index 074a423c853c..000000000000 --- a/Documentation/ntb.txt +++ /dev/null @@ -1,236 +0,0 @@ -=========== -NTB Drivers -=========== - -NTB (Non-Transparent Bridge) is a type of PCI-Express bridge chip that connects -the separate memory systems of two or more computers to the same PCI-Express -fabric. Existing NTB hardware supports a common feature set: doorbell -registers and memory translation windows, as well as non common features like -scratchpad and message registers. Scratchpad registers are read-and-writable -registers that are accessible from either side of the device, so that peers can -exchange a small amount of information at a fixed address. Message registers can -be utilized for the same purpose. Additionally they are provided with -special status bits to make sure the information isn't rewritten by another -peer. Doorbell registers provide a way for peers to send interrupt events. -Memory windows allow translated read and write access to the peer memory. - -NTB Core Driver (ntb) -===================== - -The NTB core driver defines an api wrapping the common feature set, and allows -clients interested in NTB features to discover the NTB devices supported by -hardware drivers. The term "client" is used here to mean an upper layer -component making use of the NTB api. The term "driver," or "hardware driver," -is used here to mean a driver for a specific vendor and model of NTB hardware.
- -NTB Client Drivers -================== - -NTB client drivers should register with the NTB core driver. After -registering, the client probe and remove functions will be called appropriately -as ntb hardware, or hardware drivers, are inserted and removed. The -registration uses the Linux Device framework, so it should feel familiar to -anyone who has written a pci driver. - -NTB Typical client driver implementation ----------------------------------------- - -The primary purpose of NTB is to share some piece of memory between at least two -systems. So the NTB device features like Scratchpad/Message registers are -mainly used to perform the proper memory window initialization. Typically -there are two types of memory window interfaces supported by the NTB API: -inbound translation configured on the local ntb port and outbound translation -configured by the peer, on the peer ntb port. The first type is -depicted on the next figure:: - - Inbound translation: - - Memory: Local NTB Port: Peer NTB Port: Peer MMIO: - ____________ - | dma-mapped |-ntb_mw_set_trans(addr) | - | memory | _v____________ | ______________ - | (addr) |<======| MW xlat addr |<====| MW base addr |<== memory-mapped IO - |------------| |--------------| | |--------------| - -So a typical scenario of the first type of memory window initialization looks -like this: 1) allocate a memory region, 2) put translated address to NTB config, -3) somehow notify a peer device of performed initialization, 4) peer device -maps corresponding outbound memory window so to have access to the shared -memory region.
- -The second type of interface, that implies the shared windows being -initialized by a peer device, is depicted on the figure:: - - Outbound translation: - - Memory: Local NTB Port: Peer NTB Port: Peer MMIO: - ____________ ______________ - | dma-mapped | | | MW base addr |<== memory-mapped IO - | memory | | |--------------| - | (addr) |<===================| MW xlat addr |<-ntb_peer_mw_set_trans(addr) - |------------| | |--------------| - -Typical scenario of the second type interface initialization would be: -1) allocate a memory region, 2) somehow deliver a translated address to a peer -device, 3) peer puts the translated address to NTB config, 4) peer device maps -outbound memory window so to have access to the shared memory region. - -As one can see the described scenarios can be combined in one portable -algorithm. - - Local device: - 1) Allocate memory for a shared window - 2) Initialize memory window by translated address of the allocated region - (it may fail if local memory window initialization is unsupported) - 3) Send the translated address and memory window index to a peer device - - Peer device: - 1) Initialize memory window with retrieved address of the allocated - by another device memory region (it may fail if peer memory window - initialization is unsupported) - 2) Map outbound memory window - -In accordance with this scenario, the NTB Memory Window API can be used as -follows: - - Local device: - 1) ntb_mw_count(pidx) - retrieve number of memory ranges, which can - be allocated for memory windows between local device and peer device - of port with specified index. - 2) ntb_get_align(pidx, midx) - retrieve parameters restricting the - shared memory region alignment and size. Then memory can be properly - allocated. - 3) Allocate physically contiguous memory region in compliance with - restrictions retrieved in 2). 
- 4) ntb_mw_set_trans(pidx, midx) - try to set translation address of - the memory window with specified index for the defined peer device - (it may fail if local translated address setting is not supported) - 5) Send translated base address (usually together with memory window - number) to the peer device using, for instance, scratchpad or message - registers. - - Peer device: - 1) ntb_peer_mw_set_trans(pidx, midx) - try to set received from other - device (related to pidx) translated address for specified memory - window. It may fail if retrieved address, for instance, exceeds - maximum possible address or isn't properly aligned. - 2) ntb_peer_mw_get_addr(widx) - retrieve MMIO address to map the memory - window so to have an access to the shared memory. - -Also it is worth to note, that method ntb_mw_count(pidx) should return the -same value as ntb_peer_mw_count() on the peer with port index - pidx. - -NTB Transport Client (ntb\_transport) and NTB Netdev (ntb\_netdev) ------------------------------------------------------------------- - -The primary client for NTB is the Transport client, used in tandem with NTB -Netdev. These drivers function together to create a logical link to the peer, -across the ntb, to exchange packets of network data. The Transport client -establishes a logical link to the peer, and creates queue pairs to exchange -messages and data. The NTB Netdev then creates an ethernet device using a -Transport queue pair. Network data is copied between socket buffers and the -Transport queue pair buffer. The Transport client may be used for other things -besides Netdev, however no other applications have yet been written. - -NTB Ping Pong Test Client (ntb\_pingpong) ------------------------------------------ - -The Ping Pong test client serves as a demonstration to exercise the doorbell -and scratchpad registers of NTB hardware, and as an example simple NTB client. 
-Ping Pong enables the link when started, waits for the NTB link to come up, and -then proceeds to read and write the doorbell scratchpad registers of the NTB. -The peers interrupt each other using a bit mask of doorbell bits, which is -shifted by one in each round, to test the behavior of multiple doorbell bits -and interrupt vectors. The Ping Pong driver also reads the first local -scratchpad, and writes the value plus one to the first peer scratchpad, each -round before writing the peer doorbell register. - -Module Parameters: - -* unsafe - Some hardware has known issues with scratchpad and doorbell - registers. By default, Ping Pong will not attempt to exercise such - hardware. You may override this behavior at your own risk by setting - unsafe=1. -* delay\_ms - Specify the delay between receiving a doorbell - interrupt event and setting the peer doorbell register for the next - round. -* init\_db - Specify the doorbell bits to start new series of rounds. A new - series begins once all the doorbell bits have been shifted out of - range. -* dyndbg - It is suggested to specify dyndbg=+p when loading this module, and - then to observe debugging output on the console. - -NTB Tool Test Client (ntb\_tool) --------------------------------- - -The Tool test client serves for debugging, primarily, ntb hardware and drivers. -The Tool provides access through debugfs for reading, setting, and clearing the -NTB doorbell, and reading and writing scratchpads. - -The Tool does not currently have any module parameters. - -Debugfs Files: - -* *debugfs*/ntb\_tool/*hw*/ - A directory in debugfs will be created for each - NTB device probed by the tool. This directory is shortened to *hw* - below. -* *hw*/db - This file is used to read, set, and clear the local doorbell. Not - all operations may be supported by all hardware. To read the doorbell, - read the file. To set the doorbell, write `s` followed by the bits to - set (eg: `echo 's 0x0101' > db`). 
To clear the doorbell, write `c` - followed by the bits to clear. -* *hw*/mask - This file is used to read, set, and clear the local doorbell mask. - See *db* for details. -* *hw*/peer\_db - This file is used to read, set, and clear the peer doorbell. - See *db* for details. -* *hw*/peer\_mask - This file is used to read, set, and clear the peer doorbell - mask. See *db* for details. -* *hw*/spad - This file is used to read and write local scratchpads. To read - the values of all scratchpads, read the file. To write values, write a - series of pairs of scratchpad number and value - (eg: `echo '4 0x123 7 0xabc' > spad` - # to set scratchpads `4` and `7` to `0x123` and `0xabc`, respectively). -* *hw*/peer\_spad - This file is used to read and write peer scratchpads. See - *spad* for details. - -NTB Hardware Drivers -==================== - -NTB hardware drivers should register devices with the NTB core driver. After -registering, clients probe and remove functions will be called. - -NTB Intel Hardware Driver (ntb\_hw\_intel) ------------------------------------------- - -The Intel hardware driver supports NTB on Xeon and Atom CPUs. - -Module Parameters: - -* b2b\_mw\_idx - If the peer ntb is to be accessed via a memory window, then use - this memory window to access the peer ntb. A value of zero or positive - starts from the first mw idx, and a negative value starts from the last - mw idx. Both sides MUST set the same value here! The default value is - `-1`. -* b2b\_mw\_share - If the peer ntb is to be accessed via a memory window, and if - the memory window is large enough, still allow the client to use the - second half of the memory window for address translation to the peer. -* xeon\_b2b\_usd\_bar2\_addr64 - If using B2B topology on Xeon hardware, use - this 64 bit address on the bus between the NTB devices for the window - at BAR2, on the upstream side of the link. -* xeon\_b2b\_usd\_bar4\_addr64 - See *xeon\_b2b\_bar2\_addr64*. 
-* xeon\_b2b\_usd\_bar4\_addr32 - See *xeon\_b2b\_bar2\_addr64*. -* xeon\_b2b\_usd\_bar5\_addr32 - See *xeon\_b2b\_bar2\_addr64*. -* xeon\_b2b\_dsd\_bar2\_addr64 - See *xeon\_b2b\_bar2\_addr64*. -* xeon\_b2b\_dsd\_bar4\_addr64 - See *xeon\_b2b\_bar2\_addr64*. -* xeon\_b2b\_dsd\_bar4\_addr32 - See *xeon\_b2b\_bar2\_addr64*. -* xeon\_b2b\_dsd\_bar5\_addr32 - See *xeon\_b2b\_bar2\_addr64*. diff --git a/Documentation/nvmem/nvmem.rst b/Documentation/nvmem/nvmem.rst deleted file mode 100644 index 3866b6e066d5..000000000000 --- a/Documentation/nvmem/nvmem.rst +++ /dev/null @@ -1,189 +0,0 @@ -:orphan: - -=============== -NVMEM Subsystem -=============== - - Srinivas Kandagatla - -This document explains the NVMEM Framework along with the APIs provided, -and how to use it. - -1. Introduction -=============== -*NVMEM* is the abbreviation for Non Volatile Memory layer. It is used to -retrieve configuration of SOC or Device specific data from non volatile -memories like eeprom, efuses and so on. - -Before this framework existed, NVMEM drivers like eeprom were stored in -drivers/misc, where they all had to duplicate pretty much the same code to -register a sysfs file, allow in-kernel users to access the content of the -devices they were driving, etc. - -This was also a problem as far as other in-kernel users were involved, since -the solutions used were pretty much different from one driver to another, there -was a rather big abstraction leak. - -This framework aims at solve these problems. It also introduces DT -representation for consumer devices to go get the data they require (MAC -Addresses, SoC/Revision ID, part numbers, and so on) from the NVMEMs. This -framework is based on regmap, so that most of the abstraction available in -regmap can be reused, across multiple types of buses. - -NVMEM Providers -+++++++++++++++ - -NVMEM provider refers to an entity that implements methods to initialize, read -and write the non-volatile memory. - -2. 
Registering/Unregistering the NVMEM provider -=============================================== - -A NVMEM provider can register with NVMEM core by supplying relevant -nvmem configuration to nvmem_register(), on success core would return a valid -nvmem_device pointer. - -nvmem_unregister(nvmem) is used to unregister a previously registered provider. - -For example, a simple qfprom case:: - - static struct nvmem_config econfig = { - .name = "qfprom", - .owner = THIS_MODULE, - }; - - static int qfprom_probe(struct platform_device *pdev) - { - ... - econfig.dev = &pdev->dev; - nvmem = nvmem_register(&econfig); - ... - } - -It is mandatory that the NVMEM provider has a regmap associated with its -struct device. Failure to do would return error code from nvmem_register(). - -Users of board files can define and register nvmem cells using the -nvmem_cell_table struct:: - - static struct nvmem_cell_info foo_nvmem_cells[] = { - { - .name = "macaddr", - .offset = 0x7f00, - .bytes = ETH_ALEN, - } - }; - - static struct nvmem_cell_table foo_nvmem_cell_table = { - .nvmem_name = "i2c-eeprom", - .cells = foo_nvmem_cells, - .ncells = ARRAY_SIZE(foo_nvmem_cells), - }; - - nvmem_add_cell_table(&foo_nvmem_cell_table); - -Additionally it is possible to create nvmem cell lookup entries and register -them with the nvmem framework from machine code as shown in the example below:: - - static struct nvmem_cell_lookup foo_nvmem_lookup = { - .nvmem_name = "i2c-eeprom", - .cell_name = "macaddr", - .dev_id = "foo_mac.0", - .con_id = "mac-address", - }; - - nvmem_add_cell_lookups(&foo_nvmem_lookup, 1); - -NVMEM Consumers -+++++++++++++++ - -NVMEM consumers are the entities which make use of the NVMEM provider to -read from and to NVMEM. - -3. NVMEM cell based consumer APIs -================================= - -NVMEM cells are the data entries/fields in the NVMEM. 
-The NVMEM framework provides 3 APIs to read/write NVMEM cells:: - - struct nvmem_cell *nvmem_cell_get(struct device *dev, const char *name); - struct nvmem_cell *devm_nvmem_cell_get(struct device *dev, const char *name); - - void nvmem_cell_put(struct nvmem_cell *cell); - void devm_nvmem_cell_put(struct device *dev, struct nvmem_cell *cell); - - void *nvmem_cell_read(struct nvmem_cell *cell, ssize_t *len); - int nvmem_cell_write(struct nvmem_cell *cell, void *buf, ssize_t len); - -`*nvmem_cell_get()` apis will get a reference to nvmem cell for a given id, -and nvmem_cell_read/write() can then read or write to the cell. -Once the usage of the cell is finished the consumer should call -`*nvmem_cell_put()` to free all the allocation memory for the cell. - -4. Direct NVMEM device based consumer APIs -========================================== - -In some instances it is necessary to directly read/write the NVMEM. -To facilitate such consumers NVMEM framework provides below apis:: - - struct nvmem_device *nvmem_device_get(struct device *dev, const char *name); - struct nvmem_device *devm_nvmem_device_get(struct device *dev, - const char *name); - void nvmem_device_put(struct nvmem_device *nvmem); - int nvmem_device_read(struct nvmem_device *nvmem, unsigned int offset, - size_t bytes, void *buf); - int nvmem_device_write(struct nvmem_device *nvmem, unsigned int offset, - size_t bytes, void *buf); - int nvmem_device_cell_read(struct nvmem_device *nvmem, - struct nvmem_cell_info *info, void *buf); - int nvmem_device_cell_write(struct nvmem_device *nvmem, - struct nvmem_cell_info *info, void *buf); - -Before the consumers can read/write NVMEM directly, it should get hold -of nvmem_controller from one of the `*nvmem_device_get()` api. - -The difference between these apis and cell based apis is that these apis always -take nvmem_device as parameter. - -5. 
Releasing a reference to the NVMEM -===================================== - -When a consumer no longer needs the NVMEM, it has to release the reference -to the NVMEM it has obtained using the APIs mentioned in the above section. -The NVMEM framework provides 2 APIs to release a reference to the NVMEM:: - - void nvmem_cell_put(struct nvmem_cell *cell); - void devm_nvmem_cell_put(struct device *dev, struct nvmem_cell *cell); - void nvmem_device_put(struct nvmem_device *nvmem); - void devm_nvmem_device_put(struct device *dev, struct nvmem_device *nvmem); - -Both these APIs are used to release a reference to the NVMEM and -devm_nvmem_cell_put and devm_nvmem_device_put destroys the devres associated -with this NVMEM. - -Userspace -+++++++++ - -6. Userspace binary interface -============================== - -Userspace can read/write the raw NVMEM file located at:: - - /sys/bus/nvmem/devices/*/nvmem - -ex:: - - hexdump /sys/bus/nvmem/devices/qfprom0/nvmem - - 0000000 0000 0000 0000 0000 0000 0000 0000 0000 - * - 00000a0 db10 2240 0000 e000 0c00 0c00 0000 0c00 - 0000000 0000 0000 0000 0000 0000 0000 0000 0000 - ... - * - 0001000 - -7. 
DeviceTree Binding -===================== - -See Documentation/devicetree/bindings/nvmem/nvmem.txt diff --git a/Documentation/parport-lowlevel.txt b/Documentation/parport-lowlevel.txt deleted file mode 100644 index 0633d70ffda7..000000000000 --- a/Documentation/parport-lowlevel.txt +++ /dev/null @@ -1,1832 +0,0 @@ -=============================== -PARPORT interface documentation -=============================== - -:Time-stamp: <2000-02-24 13:30:20 twaugh> - -Described here are the following functions: - -Global functions:: - parport_register_driver - parport_unregister_driver - parport_enumerate - parport_register_device - parport_unregister_device - parport_claim - parport_claim_or_block - parport_release - parport_yield - parport_yield_blocking - parport_wait_peripheral - parport_poll_peripheral - parport_wait_event - parport_negotiate - parport_read - parport_write - parport_open - parport_close - parport_device_id - parport_device_coords - parport_find_class - parport_find_device - parport_set_timeout - -Port functions (can be overridden by low-level drivers): - - SPP:: - port->ops->read_data - port->ops->write_data - port->ops->read_status - port->ops->read_control - port->ops->write_control - port->ops->frob_control - port->ops->enable_irq - port->ops->disable_irq - port->ops->data_forward - port->ops->data_reverse - - EPP:: - port->ops->epp_write_data - port->ops->epp_read_data - port->ops->epp_write_addr - port->ops->epp_read_addr - - ECP:: - port->ops->ecp_write_data - port->ops->ecp_read_data - port->ops->ecp_write_addr - - Other:: - port->ops->nibble_read_data - port->ops->byte_read_data - port->ops->compat_write_data - -The parport subsystem comprises ``parport`` (the core port-sharing -code), and a variety of low-level drivers that actually do the port -accesses. Each low-level driver handles a particular style of port -(PC, Amiga, and so on). 
- -The parport interface to the device driver author can be broken down -into global functions and port functions. - -The global functions are mostly for communicating between the device -driver and the parport subsystem: acquiring a list of available ports, -claiming a port for exclusive use, and so on. They also include -``generic`` functions for doing standard things that will work on any -IEEE 1284-capable architecture. - -The port functions are provided by the low-level drivers, although the -core parport module provides generic ``defaults`` for some routines. -The port functions can be split into three groups: SPP, EPP, and ECP. - -SPP (Standard Parallel Port) functions modify so-called ``SPP`` -registers: data, status, and control. The hardware may not actually -have registers exactly like that, but the PC does and this interface is -modelled after common PC implementations. Other low-level drivers may -be able to emulate most of the functionality. - -EPP (Enhanced Parallel Port) functions are provided for reading and -writing in IEEE 1284 EPP mode, and ECP (Extended Capabilities Port) -functions are used for IEEE 1284 ECP mode. (What about BECP? Does -anyone care?) - -Hardware assistance for EPP and/or ECP transfers may or may not be -available, and if it is available it may or may not be used. If -hardware is not used, the transfer will be software-driven. In order -to cope with peripherals that only tenuously support IEEE 1284, a -low-level driver specific function is provided, for altering 'fudge -factors'. 
- -Global functions -================ - -parport_register_driver - register a device driver with parport ---------------------------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - struct parport_driver { - const char *name; - void (*attach) (struct parport *); - void (*detach) (struct parport *); - struct parport_driver *next; - }; - int parport_register_driver (struct parport_driver *driver); - -DESCRIPTION -^^^^^^^^^^^ - -In order to be notified about parallel ports when they are detected, -parport_register_driver should be called. Your driver will -immediately be notified of all ports that have already been detected, -and of each new port as low-level drivers are loaded. - -A ``struct parport_driver`` contains the textual name of your driver, -a pointer to a function to handle new ports, and a pointer to a -function to handle ports going away due to a low-level driver -unloading. Ports will only be detached if they are not being used -(i.e. there are no devices registered on them). - -The visible parts of the ``struct parport *`` argument given to -attach/detach are:: - - struct parport - { - struct parport *next; /* next parport in list */ - const char *name; /* port's name */ - unsigned int modes; /* bitfield of hardware modes */ - struct parport_device_info probe_info; - /* IEEE1284 info */ - int number; /* parport index */ - struct parport_operations *ops; - ... - }; - -There are other members of the structure, but they should not be -touched. - -The ``modes`` member summarises the capabilities of the underlying -hardware. It consists of flags which may be bitwise-ored together: - - ============================= =============================================== - PARPORT_MODE_PCSPP IBM PC registers are available, - i.e. functions that act on data, - control and status registers are - probably writing directly to the - hardware. - PARPORT_MODE_TRISTATE The data drivers may be turned off. 
- This allows the data lines to be used - for reverse (peripheral to host) - transfers. - PARPORT_MODE_COMPAT The hardware can assist with - compatibility-mode (printer) - transfers, i.e. compat_write_block. - PARPORT_MODE_EPP The hardware can assist with EPP - transfers. - PARPORT_MODE_ECP The hardware can assist with ECP - transfers. - PARPORT_MODE_DMA The hardware can use DMA, so you might - want to pass ISA DMA-able memory - (i.e. memory allocated using the - GFP_DMA flag with kmalloc) to the - low-level driver in order to take - advantage of it. - ============================= =============================================== - -There may be other flags in ``modes`` as well. - -The contents of ``modes`` is advisory only. For example, if the -hardware is capable of DMA, and PARPORT_MODE_DMA is in ``modes``, it -doesn't necessarily mean that DMA will always be used when possible. -Similarly, hardware that is capable of assisting ECP transfers won't -necessarily be used. - -RETURN VALUE -^^^^^^^^^^^^ - -Zero on success, otherwise an error code. - -ERRORS -^^^^^^ - -None. (Can it fail? Why return int?) - -EXAMPLE -^^^^^^^ - -:: - - static void lp_attach (struct parport *port) - { - ... - private = kmalloc (...); - dev[count++] = parport_register_device (...); - ... - } - - static void lp_detach (struct parport *port) - { - ... - } - - static struct parport_driver lp_driver = { - "lp", - lp_attach, - lp_detach, - NULL /* always put NULL here */ - }; - - int lp_init (void) - { - ... - if (parport_register_driver (&lp_driver)) { - /* Failed; nothing we can do. */ - return -EIO; - } - ... 
- } - - -SEE ALSO -^^^^^^^^ - -parport_unregister_driver, parport_register_device, parport_enumerate - - - -parport_unregister_driver - tell parport to forget about this driver --------------------------------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - struct parport_driver { - const char *name; - void (*attach) (struct parport *); - void (*detach) (struct parport *); - struct parport_driver *next; - }; - void parport_unregister_driver (struct parport_driver *driver); - -DESCRIPTION -^^^^^^^^^^^ - -This tells parport not to notify the device driver of new ports or of -ports going away. Registered devices belonging to that driver are NOT -unregistered: parport_unregister_device must be used for each one. - -EXAMPLE -^^^^^^^ - -:: - - void cleanup_module (void) - { - ... - /* Stop notifications. */ - parport_unregister_driver (&lp_driver); - - /* Unregister devices. */ - for (i = 0; i < NUM_DEVS; i++) - parport_unregister_device (dev[i]); - ... - } - -SEE ALSO -^^^^^^^^ - -parport_register_driver, parport_enumerate - - - -parport_enumerate - retrieve a list of parallel ports (DEPRECATED) ------------------------------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - struct parport *parport_enumerate (void); - -DESCRIPTION -^^^^^^^^^^^ - -Retrieve the first of a list of valid parallel ports for this machine. -Successive parallel ports can be found using the ``struct parport -*next`` element of the ``struct parport *`` that is returned. If ``next`` -is NULL, there are no more parallel ports in the list. The number of -ports in the list will not exceed PARPORT_MAX. - -RETURN VALUE -^^^^^^^^^^^^ - -A ``struct parport *`` describing a valid parallel port for the machine, -or NULL if there are none. - -ERRORS -^^^^^^ - -This function can return NULL to indicate that there are no parallel -ports to use. 
- -EXAMPLE -^^^^^^^ - -:: - - int detect_device (void) - { - struct parport *port; - - for (port = parport_enumerate (); - port != NULL; - port = port->next) { - /* Try to detect a device on the port... */ - ... - } - } - - ... - } - -NOTES -^^^^^ - -parport_enumerate is deprecated; parport_register_driver should be -used instead. - -SEE ALSO -^^^^^^^^ - -parport_register_driver, parport_unregister_driver - - - -parport_register_device - register to use a port ------------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - typedef int (*preempt_func) (void *handle); - typedef void (*wakeup_func) (void *handle); - typedef int (*irq_func) (int irq, void *handle, struct pt_regs *); - - struct pardevice *parport_register_device(struct parport *port, - const char *name, - preempt_func preempt, - wakeup_func wakeup, - irq_func irq, - int flags, - void *handle); - -DESCRIPTION -^^^^^^^^^^^ - -Use this function to register your device driver on a parallel port -(``port``). Once you have done that, you will be able to use -parport_claim and parport_release in order to use the port. - -The (``name``) argument is the name of the device that appears in /proc -filesystem. The string must be valid for the whole lifetime of the -device (until parport_unregister_device is called). - -This function will register three callbacks into your driver: -``preempt``, ``wakeup`` and ``irq``. Each of these may be NULL in order to -indicate that you do not want a callback. - -When the ``preempt`` function is called, it is because another driver -wishes to use the parallel port. The ``preempt`` function should return -non-zero if the parallel port cannot be released yet -- if zero is -returned, the port is lost to another driver and the port must be -re-claimed before use. - -The ``wakeup`` function is called once another driver has released the -port and no other driver has yet claimed it. 
You can claim the -parallel port from within the ``wakeup`` function (in which case the -claim is guaranteed to succeed), or choose not to if you don't need it -now. - -If an interrupt occurs on the parallel port your driver has claimed, -the ``irq`` function will be called. (Write something about shared -interrupts here.) - -The ``handle`` is a pointer to driver-specific data, and is passed to -the callback functions. - -``flags`` may be a bitwise combination of the following flags: - - ===================== ================================================= - Flag Meaning - ===================== ================================================= - PARPORT_DEV_EXCL The device cannot share the parallel port at all. - Use this only when absolutely necessary. - ===================== ================================================= - -The typedefs are not actually defined -- they are only shown in order -to make the function prototype more readable. - -The visible parts of the returned ``struct pardevice`` are:: - - struct pardevice { - struct parport *port; /* Associated port */ - void *private; /* Device driver's 'handle' */ - ... - }; - -RETURN VALUE -^^^^^^^^^^^^ - -A ``struct pardevice *``: a handle to the registered parallel port -device that can be used for parport_claim, parport_release, etc. - -ERRORS -^^^^^^ - -A return value of NULL indicates that there was a problem registering -a device on that port. 
- -EXAMPLE -^^^^^^^ - -:: - - static int preempt (void *handle) - { - if (busy_right_now) - return 1; - - must_reclaim_port = 1; - return 0; - } - - static void wakeup (void *handle) - { - struct toaster *private = handle; - struct pardevice *dev = private->dev; - if (!dev) return; /* avoid races */ - - if (want_port) - parport_claim (dev); - } - - static int toaster_detect (struct toaster *private, struct parport *port) - { - private->dev = parport_register_device (port, "toaster", preempt, - wakeup, NULL, 0, - private); - if (!private->dev) - /* Couldn't register with parport. */ - return -EIO; - - must_reclaim_port = 0; - busy_right_now = 1; - parport_claim_or_block (private->dev); - ... - /* Don't need the port while the toaster warms up. */ - busy_right_now = 0; - ... - busy_right_now = 1; - if (must_reclaim_port) { - parport_claim_or_block (private->dev); - must_reclaim_port = 0; - } - ... - } - -SEE ALSO -^^^^^^^^ - -parport_unregister_device, parport_claim - - - -parport_unregister_device - finish using a port ------------------------------------------------ - -SYNPOPSIS - -:: - - #include - - void parport_unregister_device (struct pardevice *dev); - -DESCRIPTION -^^^^^^^^^^^ - -This function is the opposite of parport_register_device. After using -parport_unregister_device, ``dev`` is no longer a valid device handle. - -You should not unregister a device that is currently claimed, although -if you do it will be released automatically. - -EXAMPLE -^^^^^^^ - -:: - - ... - kfree (dev->private); /* before we lose the pointer */ - parport_unregister_device (dev); - ... 
- -SEE ALSO -^^^^^^^^ - - -parport_unregister_driver - -parport_claim, parport_claim_or_block - claim the parallel port for a device ----------------------------------------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - int parport_claim (struct pardevice *dev); - int parport_claim_or_block (struct pardevice *dev); - -DESCRIPTION -^^^^^^^^^^^ - -These functions attempt to gain control of the parallel port on which -``dev`` is registered. ``parport_claim`` does not block, but -``parport_claim_or_block`` may do. (Put something here about blocking -interruptibly or non-interruptibly.) - -You should not try to claim a port that you have already claimed. - -RETURN VALUE -^^^^^^^^^^^^ - -A return value of zero indicates that the port was successfully -claimed, and the caller now has possession of the parallel port. - -If ``parport_claim_or_block`` blocks before returning successfully, the -return value is positive. - -ERRORS -^^^^^^ - -========== ========================================================== - -EAGAIN The port is unavailable at the moment, but another attempt - to claim it may succeed. -========== ========================================================== - -SEE ALSO -^^^^^^^^ - - -parport_release - -parport_release - release the parallel port -------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - void parport_release (struct pardevice *dev); - -DESCRIPTION -^^^^^^^^^^^ - -Once a parallel port device has been claimed, it can be released using -``parport_release``. It cannot fail, but you should not release a -device that you do not have possession of. - -EXAMPLE -^^^^^^^ - -:: - - static size_t write (struct pardevice *dev, const void *buf, - size_t len) - { - ... - written = dev->port->ops->write_ecp_data (dev->port, buf, - len); - parport_release (dev); - ... 
- } - - -SEE ALSO -^^^^^^^^ - -change_mode, parport_claim, parport_claim_or_block, parport_yield - - - -parport_yield, parport_yield_blocking - temporarily release a parallel port ---------------------------------------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - int parport_yield (struct pardevice *dev) - int parport_yield_blocking (struct pardevice *dev); - -DESCRIPTION -^^^^^^^^^^^ - -When a driver has control of a parallel port, it may allow another -driver to temporarily ``borrow`` it. ``parport_yield`` does not block; -``parport_yield_blocking`` may do. - -RETURN VALUE -^^^^^^^^^^^^ - -A return value of zero indicates that the caller still owns the port -and the call did not block. - -A positive return value from ``parport_yield_blocking`` indicates that -the caller still owns the port and the call blocked. - -A return value of -EAGAIN indicates that the caller no longer owns the -port, and it must be re-claimed before use. - -ERRORS -^^^^^^ - -========= ========================================================== - -EAGAIN Ownership of the parallel port was given away. -========= ========================================================== - -SEE ALSO -^^^^^^^^ - -parport_release - - - -parport_wait_peripheral - wait for status lines, up to 35ms ------------------------------------------------------------ - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - int parport_wait_peripheral (struct parport *port, - unsigned char mask, - unsigned char val); - -DESCRIPTION -^^^^^^^^^^^ - -Wait for the status lines in mask to match the values in val. 
- -RETURN VALUE -^^^^^^^^^^^^ - -======== ========================================================== - -EINTR a signal is pending - 0 the status lines in mask have values in val - 1 timed out while waiting (35ms elapsed) -======== ========================================================== - -SEE ALSO -^^^^^^^^ - -parport_poll_peripheral - - - -parport_poll_peripheral - wait for status lines, in usec --------------------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - int parport_poll_peripheral (struct parport *port, - unsigned char mask, - unsigned char val, - int usec); - -DESCRIPTION -^^^^^^^^^^^ - -Wait for the status lines in mask to match the values in val. - -RETURN VALUE -^^^^^^^^^^^^ - -======== ========================================================== - -EINTR a signal is pending - 0 the status lines in mask have values in val - 1 timed out while waiting (usec microseconds have elapsed) -======== ========================================================== - -SEE ALSO -^^^^^^^^ - -parport_wait_peripheral - - - -parport_wait_event - wait for an event on a port ------------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - int parport_wait_event (struct parport *port, signed long timeout) - -DESCRIPTION -^^^^^^^^^^^ - -Wait for an event (e.g. interrupt) on a port. The timeout is in -jiffies. - -RETURN VALUE -^^^^^^^^^^^^ - -======= ========================================================== - 0 success - <0 error (exit as soon as possible) - >0 timed out -======= ========================================================== - -parport_negotiate - perform IEEE 1284 negotiation -------------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - int parport_negotiate (struct parport *, int mode); - -DESCRIPTION -^^^^^^^^^^^ - -Perform IEEE 1284 negotiation. 
- -RETURN VALUE -^^^^^^^^^^^^ - -======= ========================================================== - 0 handshake OK; IEEE 1284 peripheral and mode available - -1 handshake failed; peripheral not compliant (or none present) - 1 handshake OK; IEEE 1284 peripheral present but mode not - available -======= ========================================================== - -SEE ALSO -^^^^^^^^ - -parport_read, parport_write - - - -parport_read - read data from device ------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - ssize_t parport_read (struct parport *, void *buf, size_t len); - -DESCRIPTION -^^^^^^^^^^^ - -Read data from device in current IEEE 1284 transfer mode. This only -works for modes that support reverse data transfer. - -RETURN VALUE -^^^^^^^^^^^^ - -If negative, an error code; otherwise the number of bytes transferred. - -SEE ALSO -^^^^^^^^ - -parport_write, parport_negotiate - - - -parport_write - write data to device ------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - ssize_t parport_write (struct parport *, const void *buf, size_t len); - -DESCRIPTION -^^^^^^^^^^^ - -Write data to device in current IEEE 1284 transfer mode. This only -works for modes that support forward data transfer. - -RETURN VALUE -^^^^^^^^^^^^ - -If negative, an error code; otherwise the number of bytes transferred. - -SEE ALSO -^^^^^^^^ - -parport_read, parport_negotiate - - - -parport_open - register device for particular device number ------------------------------------------------------------ - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - struct pardevice *parport_open (int devnum, const char *name, - int (*pf) (void *), - void (*kf) (void *), - void (*irqf) (int, void *, - struct pt_regs *), - int flags, void *handle); - -DESCRIPTION -^^^^^^^^^^^ - -This is like parport_register_device but takes a device number instead -of a pointer to a struct parport. - -RETURN VALUE -^^^^^^^^^^^^ - -See parport_register_device. 
If no device is associated with devnum, -NULL is returned. - -SEE ALSO -^^^^^^^^ - -parport_register_device - - - -parport_close - unregister device for particular device number --------------------------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - void parport_close (struct pardevice *dev); - -DESCRIPTION -^^^^^^^^^^^ - -This is the equivalent of parport_unregister_device for parport_open. - -SEE ALSO -^^^^^^^^ - -parport_unregister_device, parport_open - - - -parport_device_id - obtain IEEE 1284 Device ID ----------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - ssize_t parport_device_id (int devnum, char *buffer, size_t len); - -DESCRIPTION -^^^^^^^^^^^ - -Obtains the IEEE 1284 Device ID associated with a given device. - -RETURN VALUE -^^^^^^^^^^^^ - -If negative, an error code; otherwise, the number of bytes of buffer -that contain the device ID. The format of the device ID is as -follows:: - - [length][ID] - -The first two bytes indicate the inclusive length of the entire Device -ID, and are in big-endian order. The ID is a sequence of pairs of the -form:: - - key:value; - -NOTES -^^^^^ - -Many devices have ill-formed IEEE 1284 Device IDs. - -SEE ALSO -^^^^^^^^ - -parport_find_class, parport_find_device - - - -parport_device_coords - convert device number to device coordinates -------------------------------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - int parport_device_coords (int devnum, int *parport, int *mux, - int *daisy); - -DESCRIPTION -^^^^^^^^^^^ - -Convert between device number (zero-based) and device coordinates -(port, multiplexor, daisy chain address). - -RETURN VALUE -^^^^^^^^^^^^ - -Zero on success, in which case the coordinates are (``*parport``, ``*mux``, -``*daisy``). 
- -SEE ALSO -^^^^^^^^ - -parport_open, parport_device_id - - - -parport_find_class - find a device by its class ------------------------------------------------ - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - typedef enum { - PARPORT_CLASS_LEGACY = 0, /* Non-IEEE1284 device */ - PARPORT_CLASS_PRINTER, - PARPORT_CLASS_MODEM, - PARPORT_CLASS_NET, - PARPORT_CLASS_HDC, /* Hard disk controller */ - PARPORT_CLASS_PCMCIA, - PARPORT_CLASS_MEDIA, /* Multimedia device */ - PARPORT_CLASS_FDC, /* Floppy disk controller */ - PARPORT_CLASS_PORTS, - PARPORT_CLASS_SCANNER, - PARPORT_CLASS_DIGCAM, - PARPORT_CLASS_OTHER, /* Anything else */ - PARPORT_CLASS_UNSPEC, /* No CLS field in ID */ - PARPORT_CLASS_SCSIADAPTER - } parport_device_class; - - int parport_find_class (parport_device_class cls, int from); - -DESCRIPTION -^^^^^^^^^^^ - -Find a device by class. The search starts from device number from+1. - -RETURN VALUE -^^^^^^^^^^^^ - -The device number of the next device in that class, or -1 if no such -device exists. - -NOTES -^^^^^ - -Example usage:: - - int devnum = -1; - while ((devnum = parport_find_class (PARPORT_CLASS_DIGCAM, devnum)) != -1) { - struct pardevice *dev = parport_open (devnum, ...); - ... - } - -SEE ALSO -^^^^^^^^ - -parport_find_device, parport_open, parport_device_id - - - -parport_find_device - find a device by its class ------------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - int parport_find_device (const char *mfg, const char *mdl, int from); - -DESCRIPTION -^^^^^^^^^^^ - -Find a device by vendor and model. The search starts from device -number from+1. - -RETURN VALUE -^^^^^^^^^^^^ - -The device number of the next device matching the specifications, or --1 if no such device exists. - -NOTES -^^^^^ - -Example usage:: - - int devnum = -1; - while ((devnum = parport_find_device ("IOMEGA", "ZIP+", devnum)) != -1) { - struct pardevice *dev = parport_open (devnum, ...); - ... 
- } - -SEE ALSO -^^^^^^^^ - -parport_find_class, parport_open, parport_device_id - - - -parport_set_timeout - set the inactivity timeout ------------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - long parport_set_timeout (struct pardevice *dev, long inactivity); - -DESCRIPTION -^^^^^^^^^^^ - -Set the inactivity timeout, in jiffies, for a registered device. The -previous timeout is returned. - -RETURN VALUE -^^^^^^^^^^^^ - -The previous timeout, in jiffies. - -NOTES -^^^^^ - -Some of the port->ops functions for a parport may take time, owing to -delays at the peripheral. After the peripheral has not responded for -``inactivity`` jiffies, a timeout will occur and the blocking function -will return. - -A timeout of 0 jiffies is a special case: the function must do as much -as it can without blocking or leaving the hardware in an unknown -state. If port operations are performed from within an interrupt -handler, for instance, a timeout of 0 jiffies should be used. - -Once set for a registered device, the timeout will remain at the set -value until set again. - -SEE ALSO -^^^^^^^^ - -port->ops->xxx_read/write_yyy - - - - -PORT FUNCTIONS -============== - -The functions in the port->ops structure (struct parport_operations) -are provided by the low-level driver responsible for that port. - -port->ops->read_data - read the data register ---------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - struct parport_operations { - ... - unsigned char (*read_data) (struct parport *port); - ... - }; - -DESCRIPTION -^^^^^^^^^^^ - -If port->modes contains the PARPORT_MODE_TRISTATE flag and the -PARPORT_CONTROL_DIRECTION bit in the control register is set, this -returns the value on the data pins. If port->modes contains the -PARPORT_MODE_TRISTATE flag and the PARPORT_CONTROL_DIRECTION bit is -not set, the return value _may_ be the last value written to the data -register. Otherwise the return value is undefined. 
- -SEE ALSO -^^^^^^^^ - -write_data, read_status, write_control - - - -port->ops->write_data - write the data register ------------------------------------------------ - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - struct parport_operations { - ... - void (*write_data) (struct parport *port, unsigned char d); - ... - }; - -DESCRIPTION -^^^^^^^^^^^ - -Writes to the data register. May have side-effects (a STROBE pulse, -for instance). - -SEE ALSO -^^^^^^^^ - -read_data, read_status, write_control - - - -port->ops->read_status - read the status register -------------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - struct parport_operations { - ... - unsigned char (*read_status) (struct parport *port); - ... - }; - -DESCRIPTION -^^^^^^^^^^^ - -Reads from the status register. This is a bitmask: - -- PARPORT_STATUS_ERROR (printer fault, "nFault") -- PARPORT_STATUS_SELECT (on-line, "Select") -- PARPORT_STATUS_PAPEROUT (no paper, "PError") -- PARPORT_STATUS_ACK (handshake, "nAck") -- PARPORT_STATUS_BUSY (busy, "Busy") - -There may be other bits set. - -SEE ALSO -^^^^^^^^ - -read_data, write_data, write_control - - - -port->ops->read_control - read the control register ---------------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - struct parport_operations { - ... - unsigned char (*read_control) (struct parport *port); - ... - }; - -DESCRIPTION -^^^^^^^^^^^ - -Returns the last value written to the control register (either from -write_control or frob_control). No port access is performed. - -SEE ALSO -^^^^^^^^ - -read_data, write_data, read_status, write_control - - - -port->ops->write_control - write the control register ------------------------------------------------------ - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - struct parport_operations { - ... - void (*write_control) (struct parport *port, unsigned char s); - ... - }; - -DESCRIPTION -^^^^^^^^^^^ - -Writes to the control register. 
This is a bitmask:: - - _______ - - PARPORT_CONTROL_STROBE (nStrobe) - _______ - - PARPORT_CONTROL_AUTOFD (nAutoFd) - _____ - - PARPORT_CONTROL_INIT (nInit) - _________ - - PARPORT_CONTROL_SELECT (nSelectIn) - -SEE ALSO -^^^^^^^^ - -read_data, write_data, read_status, frob_control - - - -port->ops->frob_control - write control register bits ------------------------------------------------------ - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - struct parport_operations { - ... - unsigned char (*frob_control) (struct parport *port, - unsigned char mask, - unsigned char val); - ... - }; - -DESCRIPTION -^^^^^^^^^^^ - -This is equivalent to reading from the control register, masking out -the bits in mask, exclusive-or'ing with the bits in val, and writing -the result to the control register. - -As some ports don't allow reads from the control port, a software copy -of its contents is maintained, so frob_control is in fact only one -port access. - -SEE ALSO -^^^^^^^^ - -read_data, write_data, read_status, write_control - - - -port->ops->enable_irq - enable interrupt generation ---------------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - struct parport_operations { - ... - void (*enable_irq) (struct parport *port); - ... - }; - -DESCRIPTION -^^^^^^^^^^^ - -The parallel port hardware is instructed to generate interrupts at -appropriate moments, although those moments are -architecture-specific. For the PC architecture, interrupts are -commonly generated on the rising edge of nAck. - -SEE ALSO -^^^^^^^^ - -disable_irq - - - -port->ops->disable_irq - disable interrupt generation ------------------------------------------------------ - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - struct parport_operations { - ... - void (*disable_irq) (struct parport *port); - ... - }; - -DESCRIPTION -^^^^^^^^^^^ - -The parallel port hardware is instructed not to generate interrupts. -The interrupt itself is not masked. 
- -SEE ALSO -^^^^^^^^ - -enable_irq - - - -port->ops->data_forward - enable data drivers ---------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - struct parport_operations { - ... - void (*data_forward) (struct parport *port); - ... - }; - -DESCRIPTION -^^^^^^^^^^^ - -Enables the data line drivers, for 8-bit host-to-peripheral -communications. - -SEE ALSO -^^^^^^^^ - -data_reverse - - - -port->ops->data_reverse - tristate the buffer ---------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - struct parport_operations { - ... - void (*data_reverse) (struct parport *port); - ... - }; - -DESCRIPTION -^^^^^^^^^^^ - -Places the data bus in a high impedance state, if port->modes has the -PARPORT_MODE_TRISTATE bit set. - -SEE ALSO -^^^^^^^^ - -data_forward - - - -port->ops->epp_write_data - write EPP data ------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - struct parport_operations { - ... - size_t (*epp_write_data) (struct parport *port, const void *buf, - size_t len, int flags); - ... - }; - -DESCRIPTION -^^^^^^^^^^^ - -Writes data in EPP mode, and returns the number of bytes written. - -The ``flags`` parameter may be one or more of the following, -bitwise-or'ed together: - -======================= ================================================= -PARPORT_EPP_FAST Use fast transfers. Some chips provide 16-bit and - 32-bit registers. However, if a transfer - times out, the return value may be unreliable. -======================= ================================================= - -SEE ALSO -^^^^^^^^ - -epp_read_data, epp_write_addr, epp_read_addr - - - -port->ops->epp_read_data - read EPP data ----------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - struct parport_operations { - ... - size_t (*epp_read_data) (struct parport *port, void *buf, - size_t len, int flags); - ... 
- }; - -DESCRIPTION -^^^^^^^^^^^ - -Reads data in EPP mode, and returns the number of bytes read. - -The ``flags`` parameter may be one or more of the following, -bitwise-or'ed together: - -======================= ================================================= -PARPORT_EPP_FAST Use fast transfers. Some chips provide 16-bit and - 32-bit registers. However, if a transfer - times out, the return value may be unreliable. -======================= ================================================= - -SEE ALSO -^^^^^^^^ - -epp_write_data, epp_write_addr, epp_read_addr - - - -port->ops->epp_write_addr - write EPP address ---------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - struct parport_operations { - ... - size_t (*epp_write_addr) (struct parport *port, - const void *buf, size_t len, int flags); - ... - }; - -DESCRIPTION -^^^^^^^^^^^ - -Writes EPP addresses (8 bits each), and returns the number written. - -The ``flags`` parameter may be one or more of the following, -bitwise-or'ed together: - -======================= ================================================= -PARPORT_EPP_FAST Use fast transfers. Some chips provide 16-bit and - 32-bit registers. However, if a transfer - times out, the return value may be unreliable. -======================= ================================================= - -(Does PARPORT_EPP_FAST make sense for this function?) - -SEE ALSO -^^^^^^^^ - -epp_write_data, epp_read_data, epp_read_addr - - - -port->ops->epp_read_addr - read EPP address -------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - struct parport_operations { - ... - size_t (*epp_read_addr) (struct parport *port, void *buf, - size_t len, int flags); - ... - }; - -DESCRIPTION -^^^^^^^^^^^ - -Reads EPP addresses (8 bits each), and returns the number read. 
- -The ``flags`` parameter may be one or more of the following, -bitwise-or'ed together: - -======================= ================================================= -PARPORT_EPP_FAST Use fast transfers. Some chips provide 16-bit and - 32-bit registers. However, if a transfer - times out, the return value may be unreliable. -======================= ================================================= - -(Does PARPORT_EPP_FAST make sense for this function?) - -SEE ALSO -^^^^^^^^ - -epp_write_data, epp_read_data, epp_write_addr - - - -port->ops->ecp_write_data - write a block of ECP data ------------------------------------------------------ - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - struct parport_operations { - ... - size_t (*ecp_write_data) (struct parport *port, - const void *buf, size_t len, int flags); - ... - }; - -DESCRIPTION -^^^^^^^^^^^ - -Writes a block of ECP data. The ``flags`` parameter is ignored. - -RETURN VALUE -^^^^^^^^^^^^ - -The number of bytes written. - -SEE ALSO -^^^^^^^^ - -ecp_read_data, ecp_write_addr - - - -port->ops->ecp_read_data - read a block of ECP data ---------------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - struct parport_operations { - ... - size_t (*ecp_read_data) (struct parport *port, - void *buf, size_t len, int flags); - ... - }; - -DESCRIPTION -^^^^^^^^^^^ - -Reads a block of ECP data. The ``flags`` parameter is ignored. - -RETURN VALUE -^^^^^^^^^^^^ - -The number of bytes read. NB. There may be more unread data in a -FIFO. Is there a way of stunning the FIFO to prevent this? - -SEE ALSO -^^^^^^^^ - -ecp_write_block, ecp_write_addr - - - -port->ops->ecp_write_addr - write a block of ECP addresses ----------------------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - struct parport_operations { - ... - size_t (*ecp_write_addr) (struct parport *port, - const void *buf, size_t len, int flags); - ... 
- }; - -DESCRIPTION -^^^^^^^^^^^ - -Writes a block of ECP addresses. The ``flags`` parameter is ignored. - -RETURN VALUE -^^^^^^^^^^^^ - -The number of bytes written. - -NOTES -^^^^^ - -This may use a FIFO, and if so shall not return until the FIFO is empty. - -SEE ALSO -^^^^^^^^ - -ecp_read_data, ecp_write_data - - - -port->ops->nibble_read_data - read a block of data in nibble mode ------------------------------------------------------------------ - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - struct parport_operations { - ... - size_t (*nibble_read_data) (struct parport *port, - void *buf, size_t len, int flags); - ... - }; - -DESCRIPTION -^^^^^^^^^^^ - -Reads a block of data in nibble mode. The ``flags`` parameter is ignored. - -RETURN VALUE -^^^^^^^^^^^^ - -The number of whole bytes read. - -SEE ALSO -^^^^^^^^ - -byte_read_data, compat_write_data - - - -port->ops->byte_read_data - read a block of data in byte mode -------------------------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - struct parport_operations { - ... - size_t (*byte_read_data) (struct parport *port, - void *buf, size_t len, int flags); - ... - }; - -DESCRIPTION -^^^^^^^^^^^ - -Reads a block of data in byte mode. The ``flags`` parameter is ignored. - -RETURN VALUE -^^^^^^^^^^^^ - -The number of bytes read. - -SEE ALSO -^^^^^^^^ - -nibble_read_data, compat_write_data - - - -port->ops->compat_write_data - write a block of data in compatibility mode --------------------------------------------------------------------------- - -SYNOPSIS -^^^^^^^^ - -:: - - #include - - struct parport_operations { - ... - size_t (*compat_write_data) (struct parport *port, - const void *buf, size_t len, int flags); - ... - }; - -DESCRIPTION -^^^^^^^^^^^ - -Writes a block of data in compatibility mode. The ``flags`` parameter -is ignored. - -RETURN VALUE -^^^^^^^^^^^^ - -The number of bytes written. 
- -SEE ALSO -^^^^^^^^ - -nibble_read_data, byte_read_data diff --git a/Documentation/pti/pti_intel_mid.rst b/Documentation/pti/pti_intel_mid.rst deleted file mode 100644 index ea05725174cb..000000000000 --- a/Documentation/pti/pti_intel_mid.rst +++ /dev/null @@ -1,106 +0,0 @@ -:orphan: - -============= -Intel MID PTI -============= - -The Intel MID PTI project is HW implemented in Intel Atom -system-on-a-chip designs based on the Parallel Trace -Interface for MIPI P1149.7 cJTAG standard. The kernel solution -for this platform involves the following files:: - - ./include/linux/pti.h - ./drivers/.../n_tracesink.h - ./drivers/.../n_tracerouter.c - ./drivers/.../n_tracesink.c - ./drivers/.../pti.c - -pti.c is the driver that enables various debugging features -popular on platforms from certain mobile manufacturers. -n_tracerouter.c and n_tracesink.c allow extra system information to -be collected and routed to the pti driver, such as trace -debugging data from a modem. Although n_tracerouter -and n_tracesink are a part of the complete PTI solution, -these two line disciplines can work separately from -pti.c and route any data stream from one /dev/tty node -to another /dev/tty node via kernel-space. This provides -a stable, reliable connection that will not break unless -the user-space application shuts down (plus avoids -kernel->user->kernel context switch overheads of routing -data). - -An example debugging usage for this driver system: - - * Hook /dev/ttyPTI0 to syslogd. Opening this port will also start - a console device to further capture debugging messages to PTI. - * Hook /dev/ttyPTI1 to modem debugging data to write to PTI HW. - This is where n_tracerouter and n_tracesink are used. - * Hook /dev/pti to a user-level debugging application for writing - to PTI HW. - * `Use mipi_` Kernel Driver API in other device drivers for - debugging to PTI by first requesting a PTI write address via - mipi_request_masterchannel(1). 
-
-Below is example pseudo-code on how a 'privileged' application
-can hook up n_tracerouter and n_tracesink to any tty on
-a system. 'Privileged' means the application has enough
-privileges to successfully manipulate the ldisc drivers
-but is not just blindly executing as 'root'. Keep in mind
-the use of ioctl(,TIOCSETD,) is not specific to the n_tracerouter
-and n_tracesink line discipline drivers but is a generic
-operation for a program to use a line discipline driver
-on a tty port other than the default n_tty::
-
-        /////////// To hook up n_tracerouter and n_tracesink /////////
-
-        // Note that n_tracerouter depends on n_tracesink.
-        #include <errno.h>
-        #define ONE_TTY "/dev/ttyOne"
-        #define TWO_TTY "/dev/ttyTwo"
-
-        // needed global to hand onto ldisc connection
-        static int g_fd_source = -1;
-        static int g_fd_sink = -1;
-
-        // these two vars used to grab LDISC values from loaded ldisc drivers
-        // in OS. Look at /proc/tty/ldiscs to get the right numbers from
-        // the ldiscs loaded in the system.
-        int source_ldisc_num, sink_ldisc_num = -1;
-        int retval;
-
-        g_fd_source = open(ONE_TTY, O_RDWR); // must be R/W
-        g_fd_sink = open(TWO_TTY, O_RDWR); // must be R/W
-
-        // open() returns -1 on failure; 0 is a valid descriptor
-        if ((g_fd_source < 0) || (g_fd_sink < 0)) {
-                // doubt you'll want to use these exact error lines of code
-                printf("Error on open(). errno: %d\n", errno);
-                return errno;
-        }
-
-        retval = ioctl(g_fd_sink, TIOCSETD, &sink_ldisc_num);
-        if (retval < 0) {
-                printf("Error on ioctl(). errno: %d\n", errno);
-                return errno;
-        }
-
-        retval = ioctl(g_fd_source, TIOCSETD, &source_ldisc_num);
-        if (retval < 0) {
-                printf("Error on ioctl(). errno: %d\n", errno);
-                return errno;
-        }
-
-        /////////// To disconnect n_tracerouter and n_tracesink ////////
-
-        // First make sure data through the ldiscs has stopped.
-
-        // Second, disconnect ldiscs. This provides a
-        // little cleaner shutdown on tty stack.
-        sink_ldisc_num = 0;
-        source_ldisc_num = 0;
-        ioctl(g_fd_sink, TIOCSETD, &sink_ldisc_num);
-        ioctl(g_fd_source, TIOCSETD, &source_ldisc_num);
-
-        // Three, program closes connection, and cleanup:
-        close(g_fd_sink);
-        close(g_fd_source);
-        g_fd_sink = g_fd_source = -1;
diff --git a/Documentation/pwm.txt b/Documentation/pwm.txt
deleted file mode 100644
index ab62f1bb0366..000000000000
--- a/Documentation/pwm.txt
+++ /dev/null
@@ -1,165 +0,0 @@
-======================================
-Pulse Width Modulation (PWM) interface
-======================================
-
-This provides an overview of the Linux PWM interface
-
-PWMs are commonly used for controlling LEDs, fans or vibrators in
-cell phones. PWMs with a fixed purpose have no need to implement
-the Linux PWM API (although they could). However, PWMs are often
-found as discrete devices on SoCs which have no fixed purpose. It's
-up to the board designer to connect them to LEDs or fans. To provide
-this kind of flexibility the generic PWM API exists.
-
-Identifying PWMs
-----------------
-
-Users of the legacy PWM API use unique IDs to refer to PWM devices.
-
-Instead of referring to a PWM device via its unique ID, board setup code
-should register a static mapping that can be used to match PWM
-consumers to providers, as given in the following example::
-
-        static struct pwm_lookup board_pwm_lookup[] = {
-                PWM_LOOKUP("tegra-pwm", 0, "pwm-backlight", NULL,
-                           50000, PWM_POLARITY_NORMAL),
-        };
-
-        static void __init board_init(void)
-        {
-                ...
-                pwm_add_table(board_pwm_lookup, ARRAY_SIZE(board_pwm_lookup));
-                ...
-        }
-
-Using PWMs
-----------
-
-Legacy users can request a PWM device using pwm_request() and free it
-after usage with pwm_free().
-
-New users should use the pwm_get() function and pass to it the consumer
-device or a consumer name. pwm_put() is used to free the PWM device. Managed
-variants of these functions, devm_pwm_get() and devm_pwm_put(), also exist.
-
-After being requested, a PWM has to be configured using::
-
-        int pwm_apply_state(struct pwm_device *pwm, struct pwm_state *state);
-
-This API controls both the PWM period/duty_cycle config and the
-enable/disable state.
-
-The pwm_config(), pwm_enable() and pwm_disable() functions are just wrappers
-around pwm_apply_state() and should not be used if the user wants to change
-several parameters at once. For example, if you see pwm_config() and
-pwm_{enable,disable}() calls in the same function, this probably means you
-should switch to pwm_apply_state().
-
-The PWM user API also allows one to query the PWM state with pwm_get_state().
-
-In addition to the PWM state, the PWM API also exposes PWM arguments, which
-are the reference PWM config one should use on this PWM.
-PWM arguments are usually platform-specific and allow the PWM user to only
-care about the duty cycle relative to the full period (like, duty = 50% of the
-period). struct pwm_args contains 2 fields (period and polarity) and should
-be used to set the initial PWM config (usually done in the probe function
-of the PWM user). PWM arguments are retrieved with pwm_get_args().
-
-All consumers should really be reconfiguring the PWM upon resume as
-appropriate. This is the only way to ensure that everything is resumed in
-the proper order.
-
-Using PWMs with the sysfs interface
------------------------------------
-
-If CONFIG_SYSFS is enabled in your kernel configuration, a simple sysfs
-interface is provided to use the PWMs from userspace. It is exposed at
-/sys/class/pwm/. Each probed PWM controller/chip will be exported as
-pwmchipN, where N is the base of the PWM chip. Inside the directory you
-will find:
-
-  npwm
-    The number of PWM channels this chip supports (read-only).
-
-  export
-    Exports a PWM channel for use with sysfs (write-only).
-
-  unexport
-    Unexports a PWM channel from sysfs (write-only).
-
-The PWM channels are numbered using a per-chip index from 0 to npwm-1.
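
As a rough user-space illustration of the chip-level sysfs files listed above, the following sketch builds the path of a pwmchipN attribute, checks a channel index against npwm, and writes the index to the export file. The function names are hypothetical, and writing the channel number as decimal text is an assumption about the file format rather than something stated in the text above.

```c
#include <stdio.h>

/* Build "/sys/class/pwm/pwmchipN/<attr>"; returns 0, or -1 if buf is too small. */
int pwmchip_attr_path(char *buf, size_t len, unsigned int chip_base,
                      const char *attr)
{
        int n = snprintf(buf, len, "/sys/class/pwm/pwmchip%u/%s",
                         chip_base, attr);

        return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Channels use a per-chip index from 0 to npwm-1. */
int pwm_channel_valid(unsigned int channel, unsigned int npwm)
{
        return channel < npwm;
}

/* Export one channel (assumed format: decimal index); returns 0 or -1. */
int pwm_export(unsigned int chip_base, unsigned int channel)
{
        char path[64];
        FILE *f;

        if (pwmchip_attr_path(path, sizeof(path), chip_base, "export") < 0)
                return -1;
        f = fopen(path, "w");
        if (!f)
                return -1;      /* no such chip, or insufficient privileges */
        fprintf(f, "%u", channel);
        return fclose(f) == 0 ? 0 : -1;
}
```

A caller would typically read npwm first, validate the index with pwm_channel_valid(), and only then call pwm_export().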
- -When a PWM channel is exported a pwmX directory will be created in the -pwmchipN directory it is associated with, where X is the number of the -channel that was exported. The following properties will then be available: - - period - The total period of the PWM signal (read/write). - Value is in nanoseconds and is the sum of the active and inactive - time of the PWM. - - duty_cycle - The active time of the PWM signal (read/write). - Value is in nanoseconds and must be less than the period. - - polarity - Changes the polarity of the PWM signal (read/write). - Writes to this property only work if the PWM chip supports changing - the polarity. The polarity can only be changed if the PWM is not - enabled. Value is the string "normal" or "inversed". - - enable - Enable/disable the PWM signal (read/write). - - - 0 - disabled - - 1 - enabled - -Implementing a PWM driver -------------------------- - -Currently there are two ways to implement pwm drivers. Traditionally -there only has been the barebone API meaning that each driver has -to implement the pwm_*() functions itself. This means that it's impossible -to have multiple PWM drivers in the system. For this reason it's mandatory -for new drivers to use the generic PWM framework. - -A new PWM controller/chip can be added using pwmchip_add() and removed -again with pwmchip_remove(). pwmchip_add() takes a filled in struct -pwm_chip as argument which provides a description of the PWM chip, the -number of PWM devices provided by the chip and the chip-specific -implementation of the supported PWM operations to the framework. - -When implementing polarity support in a PWM driver, make sure to respect the -signal conventions in the PWM framework. By definition, normal polarity -characterizes a signal starts high for the duration of the duty cycle and -goes low for the remainder of the period. 
Conversely, a signal with inversed -polarity starts low for the duration of the duty cycle and goes high for the -remainder of the period. - -Drivers are encouraged to implement ->apply() instead of the legacy -->enable(), ->disable() and ->config() methods. Doing that should provide -atomicity in the PWM config workflow, which is required when the PWM controls -a critical device (like a regulator). - -The implementation of ->get_state() (a method used to retrieve initial PWM -state) is also encouraged for the same reason: letting the PWM user know -about the current PWM state would allow him to avoid glitches. - -Drivers should not implement any power management. In other words, -consumers should implement it as described in the "Using PWMs" section. - -Locking -------- - -The PWM core list manipulations are protected by a mutex, so pwm_request() -and pwm_free() may not be called from an atomic context. Currently the -PWM core does not enforce any locking to pwm_enable(), pwm_disable() and -pwm_config(), so the calling context is currently driver specific. This -is an issue derived from the former barebone API and should be fixed soon. - -Helpers -------- - -Currently a PWM can only be configured with period_ns and duty_ns. For several -use cases freq_hz and duty_percent might be better. Instead of calculating -this in your driver please consider adding appropriate helpers to the framework. diff --git a/Documentation/rfkill.txt b/Documentation/rfkill.txt deleted file mode 100644 index 7d3684e81df6..000000000000 --- a/Documentation/rfkill.txt +++ /dev/null @@ -1,132 +0,0 @@ -=============================== -rfkill - RF kill switch support -=============================== - - -.. contents:: - :depth: 2 - -Introduction -============ - -The rfkill subsystem provides a generic interface for disabling any radio -transmitter in the system. When a transmitter is blocked, it shall not -radiate any power. 
- -The subsystem also provides the ability to react on button presses and -disable all transmitters of a certain type (or all). This is intended for -situations where transmitters need to be turned off, for example on -aircraft. - -The rfkill subsystem has a concept of "hard" and "soft" block, which -differ little in their meaning (block == transmitters off) but rather in -whether they can be changed or not: - - - hard block - read-only radio block that cannot be overridden by software - - - soft block - writable radio block (need not be readable) that is set by - the system software. - -The rfkill subsystem has two parameters, rfkill.default_state and -rfkill.master_switch_mode, which are documented in -admin-guide/kernel-parameters.rst. - - -Implementation details -====================== - -The rfkill subsystem is composed of three main components: - - * the rfkill core, - * the deprecated rfkill-input module (an input layer handler, being - replaced by userspace policy code) and - * the rfkill drivers. - -The rfkill core provides API for kernel drivers to register their radio -transmitter with the kernel, methods for turning it on and off, and letting -the system know about hardware-disabled states that may be implemented on -the device. - -The rfkill core code also notifies userspace of state changes, and provides -ways for userspace to query the current states. See the "Userspace support" -section below. - -When the device is hard-blocked (either by a call to rfkill_set_hw_state() -or from query_hw_block), set_block() will be invoked for additional software -block, but drivers can ignore the method call since they can use the return -value of the function rfkill_set_hw_state() to sync the software state -instead of keeping track of calls to set_block(). In fact, drivers should -use the return value of rfkill_set_hw_state() unless the hardware actually -keeps track of soft and hard block separately. 
- - -Kernel API -========== - -Drivers for radio transmitters normally implement an rfkill driver. - -Platform drivers might implement input devices if the rfkill button is just -that, a button. If that button influences the hardware then you need to -implement an rfkill driver instead. This also applies if the platform provides -a way to turn on/off the transmitter(s). - -For some platforms, it is possible that the hardware state changes during -suspend/hibernation, in which case it will be necessary to update the rfkill -core with the current state at resume time. - -To create an rfkill driver, driver's Kconfig needs to have:: - - depends on RFKILL || !RFKILL - -to ensure the driver cannot be built-in when rfkill is modular. The !RFKILL -case allows the driver to be built when rfkill is not configured, in which -case all rfkill API can still be used but will be provided by static inlines -which compile to almost nothing. - -Calling rfkill_set_hw_state() when a state change happens is required from -rfkill drivers that control devices that can be hard-blocked unless they also -assign the poll_hw_block() callback (then the rfkill core will poll the -device). Don't do this unless you cannot get the event in any other way. - -rfkill provides per-switch LED triggers, which can be used to drive LEDs -according to the switch state (LED_FULL when blocked, LED_OFF otherwise). - - -Userspace support -================= - -The recommended userspace interface to use is /dev/rfkill, which is a misc -character device that allows userspace to obtain and set the state of rfkill -devices and sets of devices. It also notifies userspace about device addition -and removal. The API is a simple read/write API that is defined in -linux/rfkill.h, with one ioctl that allows turning off the deprecated input -handler in the kernel for the transition period. - -Except for the one ioctl, communication with the kernel is done via read() -and write() of instances of 'struct rfkill_event'. 
In this structure, the -soft and hard block are properly separated (unlike sysfs, see below) and -userspace is able to get a consistent snapshot of all rfkill devices in the -system. Also, it is possible to switch all rfkill drivers (or all drivers of -a specified type) into a state which also updates the default state for -hotplugged devices. - -After an application opens /dev/rfkill, it can read the current state of all -devices. Changes can be obtained by either polling the descriptor for -hotplug or state change events or by listening for uevents emitted by the -rfkill core framework. - -Additionally, each rfkill device is registered in sysfs and emits uevents. - -rfkill devices issue uevents (with an action of "change"), with the following -environment variables set:: - - RFKILL_NAME - RFKILL_STATE - RFKILL_TYPE - -The content of these variables corresponds to the "name", "state" and -"type" sysfs files explained above. - -For further details consult Documentation/ABI/stable/sysfs-class-rfkill. diff --git a/Documentation/s390/vfio-ccw.rst b/Documentation/s390/vfio-ccw.rst index 1f6d0b56d53e..1e210c6afa88 100644 --- a/Documentation/s390/vfio-ccw.rst +++ b/Documentation/s390/vfio-ccw.rst @@ -38,7 +38,7 @@ every detail. More information/reference could be found here: qemu/hw/s390x/css.c For vfio mediated device framework: -- Documentation/vfio-mediated-device.txt +- Documentation/driver-api/vfio-mediated-device.rst Motivation of vfio-ccw ---------------------- @@ -322,5 +322,5 @@ Reference 2. ESA/390 Common I/O Device Commands manual (IBM Form. No. SA22-7204) 3. https://en.wikipedia.org/wiki/Channel_I/O 4. Documentation/s390/cds.rst -5. Documentation/vfio.txt -6. Documentation/vfio-mediated-device.txt +5. Documentation/driver-api/vfio.rst +6. 
Documentation/driver-api/vfio-mediated-device.rst diff --git a/Documentation/sgi-ioc4.txt b/Documentation/sgi-ioc4.txt deleted file mode 100644 index 72709222d3c0..000000000000 --- a/Documentation/sgi-ioc4.txt +++ /dev/null @@ -1,49 +0,0 @@ -==================================== -SGI IOC4 PCI (multi function) device -==================================== - -The SGI IOC4 PCI device is a bit of a strange beast, so some notes on -it are in order. - -First, even though the IOC4 performs multiple functions, such as an -IDE controller, a serial controller, a PS/2 keyboard/mouse controller, -and an external interrupt mechanism, it's not implemented as a -multifunction device. The consequence of this from a software -standpoint is that all these functions share a single IRQ, and -they can't all register to own the same PCI device ID. To make -matters a bit worse, some of the register blocks (and even registers -themselves) present in IOC4 are mixed-purpose between these several -functions, meaning that there's no clear "owning" device driver. - -The solution is to organize the IOC4 driver into several independent -drivers, "ioc4", "sgiioc4", and "ioc4_serial". Note that there is no -PS/2 controller driver as this functionality has never been wired up -on a shipping IO card. - -ioc4 -==== -This is the core (or shim) driver for IOC4. It is responsible for -initializing the basic functionality of the chip, and allocating -the PCI resources that are shared between the IOC4 functions. - -This driver also provides registration functions that the other -IOC4 drivers can call to make their presence known. Each driver -needs to provide a probe and remove function, which are invoked -by the core driver at appropriate times. The interface of these -IOC4 function probe and remove operations isn't precisely the same -as PCI device probe and remove operations, but is logically the -same operation. - -sgiioc4 -======= -This is the IDE driver for IOC4. 
Its name isn't very descriptive -simply for historical reasons (it used to be the only IOC4 driver -component). There's not much to say about it other than it hooks -up to the ioc4 driver via the appropriate registration, probe, and -remove functions. - -ioc4_serial -=========== -This is the serial driver for IOC4. There's not much to say about it -other than it hooks up to the ioc4 driver via the appropriate registration, -probe, and remove functions. diff --git a/Documentation/smsc_ece1099.txt b/Documentation/smsc_ece1099.txt deleted file mode 100644 index 079277421eaf..000000000000 --- a/Documentation/smsc_ece1099.txt +++ /dev/null @@ -1,60 +0,0 @@ -================================================= -Msc Keyboard Scan Expansion/GPIO Expansion device -================================================= - -What is smsc-ece1099? ----------------------- - -The ECE1099 is a 40-Pin 3.3V Keyboard Scan Expansion -or GPIO Expansion device. The device supports a keyboard -scan matrix of 23x8. The device is connected to a Master -via the SMSC BC-Link interface or via the SMBus. -Keypad scan Input(KSI) and Keypad Scan Output(KSO) signals -are multiplexed with GPIOs. - -Interrupt generation --------------------- - -Interrupts can be generated by an edge detection on a GPIO -pin or an edge detection on one of the bus interface pins. -Interrupts can also be detected on the keyboard scan interface. -The bus interrupt pin (BC_INT# or SMBUS_INT#) is asserted if -any bit in one of the Interrupt Status registers is 1 and -the corresponding Interrupt Mask bit is also 1. - -In order for software to determine which device is the source -of an interrupt, it should first read the Group Interrupt Status Register -to determine which Status register group is a source for the interrupt. -Software should read both the Status register and the associated Mask register, -then AND the two values together. Bits that are 1 in the result of the AND -are active interrupts. 
Software clears an interrupt by writing a 1 to the -corresponding bit in the Status register. - -Communication Protocol ----------------------- - -- SMbus slave Interface - The host processor communicates with the ECE1099 device - through a series of read/write registers via the SMBus - interface. SMBus is a serial communication protocol between - a computer host and its peripheral devices. The SMBus data - rate is 10KHz minimum to 400 KHz maximum - -- Slave Bus Interface - The ECE1099 device SMBus implementation is a subset of the - SMBus interface to the host. The device is a slave-only SMBus device. - The implementation in the device is a subset of SMBus since it - only supports four protocols. - - The Write Byte, Read Byte, Send Byte, and Receive Byte protocols are the - only valid SMBus protocols for the device. - -- BC-LinkTM Interface - The BC-Link is a proprietary bus that allows communication - between a Master device and a Companion device. The Master - device uses this serial bus to read and write registers - located on the Companion device. The bus comprises three signals, - BC_CLK, BC_DAT and BC_INT#. The Master device always provides the - clock, BC_CLK, and the Companion device is the source for an - independent asynchronous interrupt signal, BC_INT#. The ECE1099 - supports BC-Link speeds up to 24MHz. diff --git a/Documentation/switchtec.txt b/Documentation/switchtec.txt deleted file mode 100644 index 30d6a64e53f7..000000000000 --- a/Documentation/switchtec.txt +++ /dev/null @@ -1,102 +0,0 @@ -======================== -Linux Switchtec Support -======================== - -Microsemi's "Switchtec" line of PCI switch devices is already -supported by the kernel with standard PCI switch drivers. However, the -Switchtec device advertises a special management endpoint which -enables some additional functionality. 
This includes:
-
-* Packet and Byte Counters
-* Firmware Upgrades
-* Event and Error logs
-* Querying port link status
-* Custom user firmware commands
-
-The switchtec kernel module implements this functionality.
-
-
-Interface
-=========
-
-The primary means of communicating with the Switchtec management firmware is
-through the Memory-mapped Remote Procedure Call (MRPC) interface.
-Commands are submitted to the interface with a 4-byte command
-identifier and up to 1KB of command-specific data. The firmware will
-respond with a 4-byte return code and up to 1KB of command-specific
-data. The interface only processes a single command at a time.
-
-
-Userspace Interface
-===================
-
-The MRPC interface will be exposed to userspace through a simple char
-device: /dev/switchtec#, one for each management endpoint in the system.
-
-The char device has the following semantics:
-
-* A write must consist of at least 4 bytes and no more than 1028 bytes.
-  The first 4 bytes will be interpreted as the Command ID and the
-  remainder will be used as the input data. A write will send the
-  command to the firmware to begin processing.
-
-* Each write must be followed by exactly one read. Any double write will
-  produce an error and any read that doesn't follow a write will
-  produce an error.
-
-* A read will block until the firmware completes the command and return
-  the 4-byte Command Return Value plus up to 1024 bytes of output
-  data. (The length will be specified by the size parameter of the read
-  call -- reading less than 4 bytes will produce an error.)
-
-* The poll call will also be supported for userspace applications that
-  need to do other things while waiting for the command to complete.
-
-The following IOCTLs are also supported by the device:
-
-* SWITCHTEC_IOCTL_FLASH_INFO - Retrieve firmware length and number
-  of partitions in the device.
-
-* SWITCHTEC_IOCTL_FLASH_PART_INFO - Retrieve address and length for
-  any specified partition in flash.
- -* SWITCHTEC_IOCTL_EVENT_SUMMARY - Read a structure of bitmaps - indicating all uncleared events. - -* SWITCHTEC_IOCTL_EVENT_CTL - Get the current count, clear and set flags - for any event. This ioctl takes in a switchtec_ioctl_event_ctl struct - with the event_id, index and flags set (index being the partition or PFF - number for non-global events). It returns whether the event has - occurred, the number of times and any event specific data. The flags - can be used to clear the count or enable and disable actions to - happen when the event occurs. - By using the SWITCHTEC_IOCTL_EVENT_FLAG_EN_POLL flag, - you can set an event to trigger a poll command to return with - POLLPRI. In this way, userspace can wait for events to occur. - -* SWITCHTEC_IOCTL_PFF_TO_PORT and SWITCHTEC_IOCTL_PORT_TO_PFF convert - between PCI Function Framework number (used by the event system) - and Switchtec Logic Port ID and Partition number (which is more - user friendly). - - -Non-Transparent Bridge (NTB) Driver -=================================== - -An NTB hardware driver is provided for the Switchtec hardware in -ntb_hw_switchtec. Currently, it only supports switches configured with -exactly 2 NT partitions and zero or more non-NT partitions. It also requires -the following configuration settings: - -* Both NT partitions must be able to access each other's GAS spaces. - Thus, the bits in the GAS Access Vector under Management Settings - must be set to support this. -* Kernel configuration MUST include support for NTB (CONFIG_NTB needs - to be set) - -NT EP BAR 2 will be dynamically configured as a Direct Window, and -the configuration file does not need to configure it explicitly. - -Please refer to Documentation/ntb.txt in Linux source tree for an overall -understanding of the Linux NTB stack. ntb_hw_switchtec works as an NTB -Hardware Driver in this stack. 
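The write/read framing described for /dev/switchtec# above can be sketched in userspace C. This is a minimal illustration, not code from the switchtec driver or any library: the helper names `mrpc_pack_request` and `mrpc_unpack_response` and the `MRPC_*` constants are hypothetical. The text does not state the byte order of the Command ID, so copying it in host byte order is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MRPC_HDR_LEN      4     /* 4-byte Command ID / Command Return Value */
#define MRPC_MAX_DATA_LEN 1024  /* up to 1KB of command-specific data */

/*
 * Hypothetical helper: lay out the buffer for one write() to the char
 * device. Returns the number of bytes to write (4..1028), or 0 if the
 * payload is too large or the output buffer is too small.
 */
static size_t mrpc_pack_request(uint32_t cmd_id, const void *data,
                                size_t data_len, uint8_t *out, size_t out_len)
{
    assert(out != NULL);
    if (data_len > MRPC_MAX_DATA_LEN || out_len < MRPC_HDR_LEN + data_len)
        return 0;

    /* Assumption: the Command ID is passed in host byte order. */
    memcpy(out, &cmd_id, MRPC_HDR_LEN);
    if (data_len)
        memcpy(out + MRPC_HDR_LEN, data, data_len);
    return MRPC_HDR_LEN + data_len;
}

/*
 * Hypothetical helper: split the bytes returned by one read() into the
 * 4-byte return value and the output data. Returns the output data
 * length, or -1 if the buffer is shorter than 4 bytes or longer than
 * 1028 bytes.
 */
static long mrpc_unpack_response(const uint8_t *in, size_t in_len,
                                 uint32_t *ret_code, const uint8_t **out_data)
{
    assert(in != NULL && ret_code != NULL && out_data != NULL);
    if (in_len < MRPC_HDR_LEN || in_len > MRPC_HDR_LEN + MRPC_MAX_DATA_LEN)
        return -1;

    memcpy(ret_code, in, MRPC_HDR_LEN);
    *out_data = in + MRPC_HDR_LEN;
    return (long)(in_len - MRPC_HDR_LEN);
}
```

A caller would pass the packed buffer to a single write() and then issue exactly one read(), since the semantics above reject a double write or a read that does not follow a write.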
diff --git a/Documentation/sync_file.txt b/Documentation/sync_file.txt deleted file mode 100644 index 496fb2c3b3e6..000000000000 --- a/Documentation/sync_file.txt +++ /dev/null @@ -1,86 +0,0 @@ -=================== -Sync File API Guide -=================== - -:Author: Gustavo Padovan - -This document serves as a guide for device drivers writers on what the -sync_file API is, and how drivers can support it. Sync file is the carrier of -the fences(struct dma_fence) that are needed to synchronize between drivers or -across process boundaries. - -The sync_file API is meant to be used to send and receive fence information -to/from userspace. It enables userspace to do explicit fencing, where instead -of attaching a fence to the buffer a producer driver (such as a GPU or V4L -driver) sends the fence related to the buffer to userspace via a sync_file. - -The sync_file then can be sent to the consumer (DRM driver for example), that -will not use the buffer for anything before the fence(s) signals, i.e., the -driver that issued the fence is not using/processing the buffer anymore, so it -signals that the buffer is ready to use. And vice-versa for the consumer -> -producer part of the cycle. - -Sync files allows userspace awareness on buffer sharing synchronization between -drivers. - -Sync file was originally added in the Android kernel but current Linux Desktop -can benefit a lot from it. - -in-fences and out-fences ------------------------- - -Sync files can go either to or from userspace. When a sync_file is sent from -the driver to userspace we call the fences it contains 'out-fences'. They are -related to a buffer that the driver is processing or is going to process, so -the driver creates an out-fence to be able to notify, through -dma_fence_signal(), when it has finished using (or processing) that buffer. -Out-fences are fences that the driver creates. 
- -On the other hand if the driver receives fence(s) through a sync_file from -userspace we call these fence(s) 'in-fences'. Receiving in-fences means that -we need to wait for the fence(s) to signal before using any buffer related to -the in-fences. - -Creating Sync Files -------------------- - -When a driver needs to send an out-fence userspace it creates a sync_file. - -Interface:: - - struct sync_file *sync_file_create(struct dma_fence *fence); - -The caller pass the out-fence and gets back the sync_file. That is just the -first step, next it needs to install an fd on sync_file->file. So it gets an -fd:: - - fd = get_unused_fd_flags(O_CLOEXEC); - -and installs it on sync_file->file:: - - fd_install(fd, sync_file->file); - -The sync_file fd now can be sent to userspace. - -If the creation process fail, or the sync_file needs to be released by any -other reason fput(sync_file->file) should be used. - -Receiving Sync Files from Userspace ------------------------------------ - -When userspace needs to send an in-fence to the driver it passes file descriptor -of the Sync File to the kernel. The kernel can then retrieve the fences -from it. - -Interface:: - - struct dma_fence *sync_file_get_fence(int fd); - - -The returned reference is owned by the caller and must be disposed of -afterwards using dma_fence_put(). In case of error, a NULL is returned instead. - -References: - -1. struct sync_file in include/linux/sync_file.h -2. All interfaces mentioned above defined in include/linux/sync_file.h diff --git a/Documentation/vfio-mediated-device.txt b/Documentation/vfio-mediated-device.txt deleted file mode 100644 index c3f69bcaf96e..000000000000 --- a/Documentation/vfio-mediated-device.txt +++ /dev/null @@ -1,414 +0,0 @@ -.. include:: - -===================== -VFIO Mediated devices -===================== - -:Copyright: |copy| 2016, NVIDIA CORPORATION. All rights reserved. 
-:Author: Neo Jia -:Author: Kirti Wankhede - -This program is free software; you can redistribute it and/or modify -it under the terms of the GNU General Public License version 2 as -published by the Free Software Foundation. - - -Virtual Function I/O (VFIO) Mediated devices[1] -=============================================== - -The number of use cases for virtualizing DMA devices that do not have built-in -SR_IOV capability is increasing. Previously, to virtualize such devices, -developers had to create their own management interfaces and APIs, and then -integrate them with user space software. To simplify integration with user space -software, we have identified common requirements and a unified management -interface for such devices. - -The VFIO driver framework provides unified APIs for direct device access. It is -an IOMMU/device-agnostic framework for exposing direct device access to user -space in a secure, IOMMU-protected environment. This framework is used for -multiple devices, such as GPUs, network adapters, and compute accelerators. With -direct device access, virtual machines or user space applications have direct -access to the physical device. This framework is reused for mediated devices. - -The mediated core driver provides a common interface for mediated device -management that can be used by drivers of different devices. This module -provides a generic interface to perform these operations: - -* Create and destroy a mediated device -* Add a mediated device to and remove it from a mediated bus driver -* Add a mediated device to and remove it from an IOMMU group - -The mediated core driver also provides an interface to register a bus driver. -For example, the mediated VFIO mdev driver is designed for mediated devices and -supports VFIO APIs. The mediated bus driver adds a mediated device to and -removes it from a VFIO group. - -The following high-level block diagram shows the main components and interfaces -in the VFIO mediated driver framework. 
The diagram shows NVIDIA, Intel, and IBM -devices as examples, as these devices are the first devices to use this module:: - - +---------------+ - | | - | +-----------+ | mdev_register_driver() +--------------+ - | | | +<------------------------+ | - | | mdev | | | | - | | bus | +------------------------>+ vfio_mdev.ko |<-> VFIO user - | | driver | | probe()/remove() | | APIs - | | | | +--------------+ - | +-----------+ | - | | - | MDEV CORE | - | MODULE | - | mdev.ko | - | +-----------+ | mdev_register_device() +--------------+ - | | | +<------------------------+ | - | | | | | nvidia.ko |<-> physical - | | | +------------------------>+ | device - | | | | callbacks +--------------+ - | | Physical | | - | | device | | mdev_register_device() +--------------+ - | | interface | |<------------------------+ | - | | | | | i915.ko |<-> physical - | | | +------------------------>+ | device - | | | | callbacks +--------------+ - | | | | - | | | | mdev_register_device() +--------------+ - | | | +<------------------------+ | - | | | | | ccw_device.ko|<-> physical - | | | +------------------------>+ | device - | | | | callbacks +--------------+ - | +-----------+ | - +---------------+ - - -Registration Interfaces -======================= - -The mediated core driver provides the following types of registration -interfaces: - -* Registration interface for a mediated bus driver -* Physical device driver interface - -Registration Interface for a Mediated Bus Driver ------------------------------------------------- - -The registration interface for a mediated bus driver provides the following -structure to represent a mediated device's driver:: - - /* - * struct mdev_driver [2] - Mediated device's driver - * @name: driver name - * @probe: called when new device created - * @remove: called when device removed - * @driver: device driver structure - */ - struct mdev_driver { - const char *name; - int (*probe) (struct device *dev); - void (*remove) (struct device *dev); - struct 
device_driver driver; - }; - -A mediated bus driver for mdev should use this structure in the function calls -to register and unregister itself with the core driver: - -* Register:: - - extern int mdev_register_driver(struct mdev_driver *drv, - struct module *owner); - -* Unregister:: - - extern void mdev_unregister_driver(struct mdev_driver *drv); - -The mediated bus driver is responsible for adding mediated devices to the VFIO -group when devices are bound to the driver and removing mediated devices from -the VFIO when devices are unbound from the driver. - - -Physical Device Driver Interface --------------------------------- - -The physical device driver interface provides the mdev_parent_ops[3] structure -to define the APIs to manage work in the mediated core driver that is related -to the physical device. - -The structures in the mdev_parent_ops structure are as follows: - -* dev_attr_groups: attributes of the parent device -* mdev_attr_groups: attributes of the mediated device -* supported_config: attributes to define supported configurations - -The functions in the mdev_parent_ops structure are as follows: - -* create: allocate basic resources in a driver for a mediated device -* remove: free resources in a driver when a mediated device is destroyed - -(Note that mdev-core provides no implicit serialization of create/remove -callbacks per mdev parent device, per mdev type, or any other categorization. -Vendor drivers are expected to be fully asynchronous in this respect or -provide their own internal resource protection.) 
- -The callbacks in the mdev_parent_ops structure are as follows: - -* open: open callback of mediated device -* close: close callback of mediated device -* ioctl: ioctl callback of mediated device -* read : read emulation callback -* write: write emulation callback -* mmap: mmap emulation callback - -A driver should use the mdev_parent_ops structure in the function call to -register itself with the mdev core driver:: - - extern int mdev_register_device(struct device *dev, - const struct mdev_parent_ops *ops); - -However, the mdev_parent_ops structure is not required in the function call -that a driver should use to unregister itself with the mdev core driver:: - - extern void mdev_unregister_device(struct device *dev); - - -Mediated Device Management Interface Through sysfs -================================================== - -The management interface through sysfs enables user space software, such as -libvirt, to query and configure mediated devices in a hardware-agnostic fashion. -This management interface provides flexibility to the underlying physical -device's driver to support features such as: - -* Mediated device hot plug -* Multiple mediated devices in a single virtual machine -* Multiple mediated devices from different physical devices - -Links in the mdev_bus Class Directory -------------------------------------- -The /sys/class/mdev_bus/ directory contains links to devices that are registered -with the mdev core driver. 
- -Directories and files under the sysfs for Each Physical Device --------------------------------------------------------------- - -:: - - |- [parent physical device] - |--- Vendor-specific-attributes [optional] - |--- [mdev_supported_types] - | |--- [] - | | |--- create - | | |--- name - | | |--- available_instances - | | |--- device_api - | | |--- description - | | |--- [devices] - | |--- [] - | | |--- create - | | |--- name - | | |--- available_instances - | | |--- device_api - | | |--- description - | | |--- [devices] - | |--- [] - | |--- create - | |--- name - | |--- available_instances - | |--- device_api - | |--- description - | |--- [devices] - -* [mdev_supported_types] - - The list of currently supported mediated device types and their details. - - [], device_api, and available_instances are mandatory attributes - that should be provided by vendor driver. - -* [] - - The [] name is created by adding the device driver string as a prefix - to the string provided by the vendor driver. This format of this name is as - follows:: - - sprintf(buf, "%s-%s", dev_driver_string(parent->dev), group->name); - - (or using mdev_parent_dev(mdev) to arrive at the parent device outside - of the core mdev code) - -* device_api - - This attribute should show which device API is being created, for example, - "vfio-pci" for a PCI device. - -* available_instances - - This attribute should show the number of devices of type that can be - created. - -* [device] - - This directory contains links to the devices of type that have been - created. - -* name - - This attribute should show human readable name. This is optional attribute. - -* description - - This attribute should show brief features/description of the type. This is - optional attribute. 
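The type name format quoted above (driver string, a hyphen, then the vendor-provided string) can be reproduced in a small userspace sketch. The helper `mdev_type_id` is hypothetical and exists only for illustration; the kernel builds the name itself with the sprintf() call shown earlier. The inputs "mtty" and "2" in the usage note are illustrative values, chosen to match the mtty-2 type that appears in the sample driver's sysfs tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical userspace helper mirroring the "%s-%s" format.
 * Returns 0 on success, -1 if the result does not fit in buf.
 */
static int mdev_type_id(char *buf, size_t len,
                        const char *driver, const char *type)
{
    int n;

    assert(buf != NULL && driver != NULL && type != NULL);
    n = snprintf(buf, len, "%s-%s", driver, type);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Calling `mdev_type_id(buf, sizeof(buf), "mtty", "2")` fills `buf` with "mtty-2"; a buffer too small for the full name makes the helper return -1 rather than silently truncating.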
- -Directories and Files Under the sysfs for Each mdev Device ----------------------------------------------------------- - -:: - - |- [parent phy device] - |--- [$MDEV_UUID] - |--- remove - |--- mdev_type {link to its type} - |--- vendor-specific-attributes [optional] - -* remove (write only) - -Writing '1' to the 'remove' file destroys the mdev device. The vendor driver can -fail the remove() callback if that device is active and the vendor driver -doesn't support hot unplug. - -Example:: - - # echo 1 > /sys/bus/mdev/devices/$mdev_UUID/remove - -Mediated device Hot plug ------------------------- - -Mediated devices can be created and assigned at runtime. The procedure to hot -plug a mediated device is the same as the procedure to hot plug a PCI device. - -Translation APIs for Mediated Devices -===================================== - -The following APIs are provided for translating user pfn to host pfn in a VFIO -driver:: - - extern int vfio_pin_pages(struct device *dev, unsigned long *user_pfn, - int npage, int prot, unsigned long *phys_pfn); - - extern int vfio_unpin_pages(struct device *dev, unsigned long *user_pfn, - int npage); - -These functions call back into the back-end IOMMU module by using the pin_pages -and unpin_pages callbacks of the struct vfio_iommu_driver_ops[4]. Currently -these callbacks are supported in the TYPE1 IOMMU module. To enable them for -other IOMMU backend modules, such as PPC64 sPAPR module, they need to provide -these two callback functions. - -Using the Sample Code -===================== - -mtty.c in samples/vfio-mdev/ directory is a sample driver program to -demonstrate how to use the mediated device framework. - -The sample driver creates an mdev device that simulates a serial port over a PCI -card. - -1. Build and load the mtty.ko module. 
- - This step creates a dummy device, /sys/devices/virtual/mtty/mtty/ - - Files in this device directory in sysfs are similar to the following:: - - # tree /sys/devices/virtual/mtty/mtty/ - /sys/devices/virtual/mtty/mtty/ - |-- mdev_supported_types - | |-- mtty-1 - | | |-- available_instances - | | |-- create - | | |-- device_api - | | |-- devices - | | `-- name - | `-- mtty-2 - | |-- available_instances - | |-- create - | |-- device_api - | |-- devices - | `-- name - |-- mtty_dev - | `-- sample_mtty_dev - |-- power - | |-- autosuspend_delay_ms - | |-- control - | |-- runtime_active_time - | |-- runtime_status - | `-- runtime_suspended_time - |-- subsystem -> ../../../../class/mtty - `-- uevent - -2. Create a mediated device by using the dummy device that you created in the - previous step:: - - # echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001" > \ - /sys/devices/virtual/mtty/mtty/mdev_supported_types/mtty-2/create - -3. Add parameters to qemu-kvm:: - - -device vfio-pci,\ - sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 - -4. Boot the VM. - - In the Linux guest VM, with no hardware on the host, the device appears - as follows:: - - # lspci -s 00:05.0 -xxvv - 00:05.0 Serial controller: Device 4348:3253 (rev 10) (prog-if 02 [16550]) - Subsystem: Device 4348:3253 - Physical Slot: 5 - Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- - Stepping- SERR- FastB2B- DisINTx- - Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- - SERR- Link[LNKA] -> GSI 10 (level, high) -> IRQ 10 - 0000:00:05.0: ttyS1 at I/O 0xc150 (irq = 10) is a 16550A - 0000:00:05.0: ttyS2 at I/O 0xc158 (irq = 10) is a 16550A - - -5. In the Linux guest VM, check the serial ports:: - - # setserial -g /dev/ttyS* - /dev/ttyS0, UART: 16550A, Port: 0x03f8, IRQ: 4 - /dev/ttyS1, UART: 16550A, Port: 0xc150, IRQ: 10 - /dev/ttyS2, UART: 16550A, Port: 0xc158, IRQ: 10 - -6. 
Using minicom or any terminal emulation program, open port /dev/ttyS1 or - /dev/ttyS2 with hardware flow control disabled. - -7. Type data on the minicom terminal or send data to the terminal emulation - program and read the data. - - Data is loop backed from hosts mtty driver. - -8. Destroy the mediated device that you created:: - - # echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001/remove - -References -========== - -1. See Documentation/vfio.txt for more information on VFIO. -2. struct mdev_driver in include/linux/mdev.h -3. struct mdev_parent_ops in include/linux/mdev.h -4. struct vfio_iommu_driver_ops in include/linux/vfio.h diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt deleted file mode 100644 index f1a4d3c3ba0b..000000000000 --- a/Documentation/vfio.txt +++ /dev/null @@ -1,520 +0,0 @@ -================================== -VFIO - "Virtual Function I/O" [1]_ -================================== - -Many modern system now provide DMA and interrupt remapping facilities -to help ensure I/O devices behave within the boundaries they've been -allotted. This includes x86 hardware with AMD-Vi and Intel VT-d, -POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC -systems such as Freescale PAMU. The VFIO driver is an IOMMU/device -agnostic framework for exposing direct device access to userspace, in -a secure, IOMMU protected environment. In other words, this allows -safe [2]_, non-privileged, userspace drivers. - -Why do we want that? Virtual machines often make use of direct device -access ("device assignment") when configured for the highest possible -I/O performance. From a device and host perspective, this simply -turns the VM into a userspace driver, with the benefits of -significantly reduced latency, higher bandwidth, and direct use of -bare-metal device drivers [3]_. - -Some applications, particularly in the high performance computing -field, also benefit from low-overhead, direct device access from -userspace. 
Examples include network adapters (often non-TCP/IP based) -and compute accelerators. Prior to VFIO, these drivers had to either -go through the full development cycle to become proper upstream -driver, be maintained out of tree, or make use of the UIO framework, -which has no notion of IOMMU protection, limited interrupt support, -and requires root privileges to access things like PCI configuration -space. - -The VFIO driver framework intends to unify these, replacing both the -KVM PCI specific device assignment code as well as provide a more -secure, more featureful userspace driver environment than UIO. - -Groups, Devices, and IOMMUs ---------------------------- - -Devices are the main target of any I/O driver. Devices typically -create a programming interface made up of I/O access, interrupts, -and DMA. Without going into the details of each of these, DMA is -by far the most critical aspect for maintaining a secure environment -as allowing a device read-write access to system memory imposes the -greatest risk to the overall system integrity. - -To help mitigate this risk, many modern IOMMUs now incorporate -isolation properties into what was, in many cases, an interface only -meant for translation (ie. solving the addressing problems of devices -with limited address spaces). With this, devices can now be isolated -from each other and from arbitrary memory access, thus allowing -things like secure direct assignment of devices into virtual machines. - -This isolation is not always at the granularity of a single device -though. Even when an IOMMU is capable of this, properties of devices, -interconnects, and IOMMU topologies can each reduce this isolation. -For instance, an individual device may be part of a larger multi- -function enclosure. While the IOMMU may be able to distinguish -between devices within the enclosure, the enclosure may not require -transactions between devices to reach the IOMMU. 
Examples of this -could be anything from a multi-function PCI device with backdoors -between functions to a non-PCI-ACS (Access Control Services) capable -bridge allowing redirection without reaching the IOMMU. Topology -can also play a factor in terms of hiding devices. A PCIe-to-PCI -bridge masks the devices behind it, making transaction appear as if -from the bridge itself. Obviously IOMMU design plays a major factor -as well. - -Therefore, while for the most part an IOMMU may have device level -granularity, any system is susceptible to reduced granularity. The -IOMMU API therefore supports a notion of IOMMU groups. A group is -a set of devices which is isolatable from all other devices in the -system. Groups are therefore the unit of ownership used by VFIO. - -While the group is the minimum granularity that must be used to -ensure secure user access, it's not necessarily the preferred -granularity. In IOMMUs which make use of page tables, it may be -possible to share a set of page tables between different groups, -reducing the overhead both to the platform (reduced TLB thrashing, -reduced duplicate page tables), and to the user (programming only -a single set of translations). For this reason, VFIO makes use of -a container class, which may hold one or more groups. A container -is created by simply opening the /dev/vfio/vfio character device. - -On its own, the container provides little functionality, with all -but a couple version and extension query interfaces locked away. -The user needs to add a group into the container for the next level -of functionality. To do this, the user first needs to identify the -group associated with the desired device. This can be done using -the sysfs links described in the example below. By unbinding the -device from the host driver and binding it to a VFIO driver, a new -VFIO group will appear for the group as /dev/vfio/$GROUP, where -$GROUP is the IOMMU group number of which the device is a member. 
-If the IOMMU group contains multiple devices, each will need to -be bound to a VFIO driver before operations on the VFIO group -are allowed (it's also sufficient to only unbind the device from -host drivers if a VFIO driver is unavailable; this will make the -group available, but not that particular device). TBD - interface -for disabling driver probing/locking a device. - -Once the group is ready, it may be added to the container by opening -the VFIO group character device (/dev/vfio/$GROUP) and using the -VFIO_GROUP_SET_CONTAINER ioctl, passing the file descriptor of the -previously opened container file. If desired and if the IOMMU driver -supports sharing the IOMMU context between groups, multiple groups may -be set to the same container. If a group fails to set to a container -with existing groups, a new empty container will need to be used -instead. - -With a group (or groups) attached to a container, the remaining -ioctls become available, enabling access to the VFIO IOMMU interfaces. -Additionally, it now becomes possible to get file descriptors for each -device within a group using an ioctl on the VFIO group file descriptor. - -The VFIO device API includes ioctls for describing the device, the I/O -regions and their read/write/mmap offsets on the device descriptor, as -well as mechanisms for describing and registering interrupt -notifications. - -VFIO Usage Example ------------------- - -Assume user wants to access PCI device 0000:06:0d.0:: - - $ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group - ../../../../kernel/iommu_groups/26 - -This device is therefore in IOMMU group 26. 
This device is on the -pci bus, therefore the user will make use of vfio-pci to manage the -group:: - - # modprobe vfio-pci - -Binding this device to the vfio-pci driver creates the VFIO group -character devices for this group:: - - $ lspci -n -s 0000:06:0d.0 - 06:0d.0 0401: 1102:0002 (rev 08) - # echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind - # echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id - -Now we need to look at what other devices are in the group to free -it for use by VFIO:: - - $ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices - total 0 - lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:00:1e.0 -> - ../../../../devices/pci0000:00/0000:00:1e.0 - lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.0 -> - ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0 - lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 -> - ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1 - -This device is behind a PCIe-to-PCI bridge [4]_, therefore we also -need to add device 0000:06:0d.1 to the group following the same -procedure as above. Device 0000:00:1e.0 is a bridge that does -not currently have a host driver, therefore it's not required to -bind this device to the vfio-pci driver (vfio-pci does not currently -support PCI bridges). 
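The mapping from the iommu_group link target to the /dev/vfio/$GROUP character device can be sketched as a small string helper. This is a hedged illustration only: `vfio_group_path` is a hypothetical name, not a VFIO or libc function, and it assumes the link target ends in a decimal group number, as in the readlink output shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: derive the VFIO group device path from the target
 * of the iommu_group symlink, e.g. "../../../../kernel/iommu_groups/26".
 * Returns 0 on success, -1 if the last path component is not a decimal
 * number or the result does not fit in out.
 */
static int vfio_group_path(const char *link_target, char *out, size_t out_len)
{
    const char *base;
    char *end;
    unsigned long group;
    int n;

    assert(link_target != NULL && out != NULL);

    /* Take the component after the last '/', or the whole string. */
    base = strrchr(link_target, '/');
    base = base ? base + 1 : link_target;
    if (*base < '0' || *base > '9')
        return -1;

    group = strtoul(base, &end, 10);
    if (*end != '\0')
        return -1;

    n = snprintf(out, out_len, "/dev/vfio/%lu", group);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

The link target itself can be read with readlink(2) on the device's iommu_group entry in sysfs; note that readlink() does not NUL-terminate its output, so the caller has to terminate the buffer before passing it to this helper.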
- -The final step is to provide the user with access to the group if -unprivileged operation is desired (note that /dev/vfio/vfio provides -no capabilities on its own and is therefore expected to be set to -mode 0666 by the system):: - - # chown user:user /dev/vfio/26 - -The user now has full access to all the devices and the iommu for this -group and can access them as follows:: - - int container, group, device, i; - struct vfio_group_status group_status = - { .argsz = sizeof(group_status) }; - struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) }; - struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) }; - struct vfio_device_info device_info = { .argsz = sizeof(device_info) }; - - /* Create a new container */ - container = open("/dev/vfio/vfio", O_RDWR); - - if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION) - /* Unknown API version */ - - if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) - /* Doesn't support the IOMMU driver we want. 
*/ - - /* Open the group */ - group = open("/dev/vfio/26", O_RDWR); - - /* Test the group is viable and available */ - ioctl(group, VFIO_GROUP_GET_STATUS, &group_status); - - if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) - /* Group is not viable (ie, not all devices bound for vfio) */ - - /* Add the group to the container */ - ioctl(group, VFIO_GROUP_SET_CONTAINER, &container); - - /* Enable the IOMMU model we want */ - ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU); - - /* Get addition IOMMU info */ - ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info); - - /* Allocate some space and setup a DMA mapping */ - dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE, - MAP_PRIVATE | MAP_ANONYMOUS, 0, 0); - dma_map.size = 1024 * 1024; - dma_map.iova = 0; /* 1MB starting at 0x0 from device view */ - dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE; - - ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map); - - /* Get a file descriptor for the device */ - device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0"); - - /* Test and setup the device */ - ioctl(device, VFIO_DEVICE_GET_INFO, &device_info); - - for (i = 0; i < device_info.num_regions; i++) { - struct vfio_region_info reg = { .argsz = sizeof(reg) }; - - reg.index = i; - - ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg); - - /* Setup mappings... read/write offsets, mmaps - * For PCI devices, config space is a region */ - } - - for (i = 0; i < device_info.num_irqs; i++) { - struct vfio_irq_info irq = { .argsz = sizeof(irq) }; - - irq.index = i; - - ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq); - - /* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQS */ - } - - /* Gratuitous device reset and go... */ - ioctl(device, VFIO_DEVICE_RESET); - -VFIO User API -------------------------------------------------------------------------------- - -Please see include/linux/vfio.h for complete API documentation.
- -VFIO bus driver API -------------------------------------------------------------------------------- - -VFIO bus drivers, such as vfio-pci make use of only a few interfaces -into VFIO core. When devices are bound and unbound to the driver, -the driver should call vfio_add_group_dev() and vfio_del_group_dev() -respectively:: - - extern int vfio_add_group_dev(struct device *dev, - const struct vfio_device_ops *ops, - void *device_data); - - extern void *vfio_del_group_dev(struct device *dev); - -vfio_add_group_dev() indicates to the core to begin tracking the -iommu_group of the specified dev and register the dev as owned by -a VFIO bus driver. The driver provides an ops structure for callbacks -similar to a file operations structure:: - - struct vfio_device_ops { - int (*open)(void *device_data); - void (*release)(void *device_data); - ssize_t (*read)(void *device_data, char __user *buf, - size_t count, loff_t *ppos); - ssize_t (*write)(void *device_data, const char __user *buf, - size_t size, loff_t *ppos); - long (*ioctl)(void *device_data, unsigned int cmd, - unsigned long arg); - int (*mmap)(void *device_data, struct vm_area_struct *vma); - }; - -Each function is passed the device_data that was originally registered -in the vfio_add_group_dev() call above. This allows the bus driver -an easy place to store its opaque, private data. The open/release -callbacks are issued when a new file descriptor is created for a -device (via VFIO_GROUP_GET_DEVICE_FD). The ioctl interface provides -a direct pass through for VFIO_DEVICE_* ioctls. The read/write/mmap -interfaces implement the device region access defined by the device's -own VFIO_DEVICE_GET_REGION_INFO ioctl. 
- - -PPC64 sPAPR implementation note -------------------------------- - -This implementation has some specifics: - -1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per - container is supported as an IOMMU table is allocated at the boot time, - one table per a IOMMU group which is a Partitionable Endpoint (PE) - (PE is often a PCI domain but not always). - - Newer systems (POWER8 with IODA2) have improved hardware design which allows - to remove this limitation and have multiple IOMMU groups per a VFIO - container. - -2) The hardware supports so called DMA windows - the PCI address range - within which DMA transfer is allowed, any attempt to access address space - out of the window leads to the whole PE isolation. - -3) PPC64 guests are paravirtualized but not fully emulated. There is an API - to map/unmap pages for DMA, and it normally maps 1..32 pages per call and - currently there is no way to reduce the number of calls. In order to make - things faster, the map/unmap handling has been implemented in real mode - which provides an excellent performance which has limitations such as - inability to do locked pages accounting in real time. - -4) According to sPAPR specification, A Partitionable Endpoint (PE) is an I/O - subtree that can be treated as a unit for the purposes of partitioning and - error recovery. A PE may be a single or multi-function IOA (IO Adapter), a - function of a multi-function IOA, or multiple IOAs (possibly including - switch and bridge structures above the multiple IOAs). PPC64 guests detect - PCI errors and recover from them via EEH RTAS services, which works on the - basis of additional ioctl commands. - - So 4 additional ioctls have been added: - - VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start of the DMA window on the PCI bus. - - VFIO_IOMMU_ENABLE - enables the container. The locked pages accounting - is done at this point. 
This lets user first to know what - the DMA window is and adjust rlimit before doing any real job. - - VFIO_IOMMU_DISABLE - disables the container. - - VFIO_EEH_PE_OP - provides an API for EEH setup, error detection and recovery. - - The code flow from the example above should be slightly changed:: - - struct vfio_eeh_pe_op pe_op = { .argsz = sizeof(pe_op), .flags = 0 }; - - ..... - /* Add the group to the container */ - ioctl(group, VFIO_GROUP_SET_CONTAINER, &container); - - /* Enable the IOMMU model we want */ - ioctl(container, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU) - - /* Get addition sPAPR IOMMU info */ - vfio_iommu_spapr_tce_info spapr_iommu_info; - ioctl(container, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &spapr_iommu_info); - - if (ioctl(container, VFIO_IOMMU_ENABLE)) - /* Cannot enable container, may be low rlimit */ - - /* Allocate some space and setup a DMA mapping */ - dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE, - MAP_PRIVATE | MAP_ANONYMOUS, 0, 0); - - dma_map.size = 1024 * 1024; - dma_map.iova = 0; /* 1MB starting at 0x0 from device view */ - dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE; - - /* Check here is .iova/.size are within DMA window from spapr_iommu_info */ - ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map); - - /* Get a file descriptor for the device */ - device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0"); - - .... - - /* Gratuitous device reset and go... */ - ioctl(device, VFIO_DEVICE_RESET); - - /* Make sure EEH is supported */ - ioctl(container, VFIO_CHECK_EXTENSION, VFIO_EEH); - - /* Enable the EEH functionality on the device */ - pe_op.op = VFIO_EEH_PE_ENABLE; - ioctl(container, VFIO_EEH_PE_OP, &pe_op); - - /* You're suggested to create additional data struct to represent - * PE, and put child devices belonging to same IOMMU group to the - * PE instance for later reference. 
- */ - - /* Check the PE's state and make sure it's in functional state */ - pe_op.op = VFIO_EEH_PE_GET_STATE; - ioctl(container, VFIO_EEH_PE_OP, &pe_op); - - /* Save device state using pci_save_state(). - * EEH should be enabled on the specified device. - */ - - .... - - /* Inject EEH error, which is expected to be caused by 32-bits - * config load. - */ - pe_op.op = VFIO_EEH_PE_INJECT_ERR; - pe_op.err.type = EEH_ERR_TYPE_32; - pe_op.err.func = EEH_ERR_FUNC_LD_CFG_ADDR; - pe_op.err.addr = 0ul; - pe_op.err.mask = 0ul; - ioctl(container, VFIO_EEH_PE_OP, &pe_op); - - .... - - /* When 0xFF's returned from reading PCI config space or IO BARs - * of the PCI device. Check the PE's state to see if that has been - * frozen. - */ - ioctl(container, VFIO_EEH_PE_OP, &pe_op); - - /* Waiting for pending PCI transactions to be completed and don't - * produce any more PCI traffic from/to the affected PE until - * recovery is finished. - */ - - /* Enable IO for the affected PE and collect logs. Usually, the - * standard part of PCI config space, AER registers are dumped - * as logs for further analysis. - */ - pe_op.op = VFIO_EEH_PE_UNFREEZE_IO; - ioctl(container, VFIO_EEH_PE_OP, &pe_op); - - /* - * Issue PE reset: hot or fundamental reset. Usually, hot reset - * is enough. However, the firmware of some PCI adapters would - * require fundamental reset. - */ - pe_op.op = VFIO_EEH_PE_RESET_HOT; - ioctl(container, VFIO_EEH_PE_OP, &pe_op); - pe_op.op = VFIO_EEH_PE_RESET_DEACTIVATE; - ioctl(container, VFIO_EEH_PE_OP, &pe_op); - - /* Configure the PCI bridges for the affected PE */ - pe_op.op = VFIO_EEH_PE_CONFIGURE; - ioctl(container, VFIO_EEH_PE_OP, &pe_op); - - /* Restored state we saved at initialization time. pci_restore_state() - * is good enough as an example. - */ - - /* Hopefully, error is recovered successfully. Now, you can resume to - * start PCI traffic to/from the affected PE. - */ - - .... - -5) There is v2 of SPAPR TCE IOMMU. 
It deprecates VFIO_IOMMU_ENABLE/ - VFIO_IOMMU_DISABLE and implements 2 new ioctls: - VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - (which are unsupported in v1 IOMMU). - - PPC64 paravirtualized guests generate a lot of map/unmap requests, - and the handling of those includes pinning/unpinning pages and updating - mm::locked_vm counter to make sure we do not exceed the rlimit. - The v2 IOMMU splits accounting and pinning into separate operations: - - - VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls - receive a user space address and size of the block to be pinned. - Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to - be called with the exact address and size used for registering - the memory block. The userspace is not expected to call these often. - The ranges are stored in a linked list in a VFIO container. - - - VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual - IOMMU table and do not do pinning; instead these check that the userspace - address is from pre-registered range. - - This separation helps in optimizing DMA for guests. - -6) sPAPR specification allows guests to have an additional DMA window(s) on - a PCI bus with a variable page size. Two ioctls have been added to support - this: VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE. - The platform has to support the functionality or error will be returned to - the userspace. The existing hardware supports up to 2 DMA windows, one is - 2GB long, uses 4K pages and called "default 32bit window"; the other can - be as big as entire RAM, use different page size, it is optional - guests - create those in run-time if the guest driver supports 64bit DMA. - - VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and - a number of TCE table levels (if a TCE table is going to be big enough and - the kernel may not be able to allocate enough of physically contiguous - memory). 
It creates a new window in the available slot and returns the bus - address where the new window starts. Due to hardware limitation, the user - space cannot choose the location of DMA windows. - - VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window - and removes it. - -------------------------------------------------------------------------------- - -.. [1] VFIO was originally an acronym for "Virtual Function I/O" in its - initial implementation by Tom Lyon while as Cisco. We've since - outgrown the acronym, but it's catchy. - -.. [2] "safe" also depends upon a device being "well behaved". It's - possible for multi-function devices to have backdoors between - functions and even for single function devices to have alternative - access to things like PCI config space through MMIO registers. To - guard against the former we can include additional precautions in the - IOMMU driver to group multi-function PCI devices together - (iommu=group_mf). The latter we can't prevent, but the IOMMU should - still provide isolation. For PCI, SR-IOV Virtual Functions are the - best indicator of "well behaved", as these are designed for - virtualization usage models. - -.. [3] As always there are trade-offs to virtual machine device - assignment that are beyond the scope of VFIO. It's expected that - future IOMMU technologies will reduce some, but maybe not all, of - these trade-offs. - -.. [4] In this case the device is below a PCI bridge, so transactions - from either function of the device are indistinguishable to the iommu:: - - -[0000:00]-+-1e.0-[06]--+-0d.0 - \-0d.1 - - 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90) diff --git a/Documentation/w1/w1.netlink b/Documentation/w1/w1.netlink index ef2727192d69..94ad4c420828 100644 --- a/Documentation/w1/w1.netlink +++ b/Documentation/w1/w1.netlink @@ -183,7 +183,7 @@ acknowledge number is set to seq+1. Additional documantion, source code examples. ============================================ -1. 
Documentation/connector +1. Documentation/driver-api/connector.rst 2. http://www.ioremap.net/archive/w1 This archive includes userspace application w1d.c which uses read/write/search commands for all master/slave devices found on the bus. diff --git a/Documentation/xillybus.txt b/Documentation/xillybus.txt deleted file mode 100644 index 2446ee303c09..000000000000 --- a/Documentation/xillybus.txt +++ /dev/null @@ -1,379 +0,0 @@ -========================================== -Xillybus driver for generic FPGA interface -========================================== - -:Author: Eli Billauer, Xillybus Ltd. (http://xillybus.com) -:Email: eli.billauer@gmail.com or as advertised on Xillybus' site. - -.. Contents: - - - Introduction - -- Background - -- Xillybus Overview - - - Usage - -- User interface - -- Synchronization - -- Seekable pipes - - - Internals - -- Source code organization - -- Pipe attributes - -- Host never reads from the FPGA - -- Channels, pipes, and the message channel - -- Data streaming - -- Data granularity - -- Probing - -- Buffer allocation - -- The "nonempty" message (supporting poll) - - -Introduction -============ - -Background ----------- - -An FPGA (Field Programmable Gate Array) is a piece of logic hardware, which -can be programmed to become virtually anything that is usually found as a -dedicated chipset: For instance, a display adapter, network interface card, -or even a processor with its peripherals. FPGAs are the LEGO of hardware: -Based upon certain building blocks, you make your own toys the way you like -them. It's usually pointless to reimplement something that is already -available on the market as a chipset, so FPGAs are mostly used when some -special functionality is needed, and the production volume is relatively low -(hence not justifying the development of an ASIC). - -The challenge with FPGAs is that everything is implemented at a very low -level, even lower than assembly language. 
In order to allow FPGA designers to -focus on their specific project, and not reinvent the wheel over and over -again, pre-designed building blocks, IP cores, are often used. These are the -FPGA parallels of library functions. IP cores may implement certain -mathematical functions, a functional unit (e.g. a USB interface), an entire -processor (e.g. ARM) or anything that might come handy. Think of them as a -building block, with electrical wires dangling on the sides for connection to -other blocks. - -One of the daunting tasks in FPGA design is communicating with a fullblown -operating system (actually, with the processor running it): Implementing the -low-level bus protocol and the somewhat higher-level interface with the host -(registers, interrupts, DMA etc.) is a project in itself. When the FPGA's -function is a well-known one (e.g. a video adapter card, or a NIC), it can -make sense to design the FPGA's interface logic specifically for the project. -A special driver is then written to present the FPGA as a well-known interface -to the kernel and/or user space. In that case, there is no reason to treat the -FPGA differently than any device on the bus. - -It's however common that the desired data communication doesn't fit any well- -known peripheral function. Also, the effort of designing an elegant -abstraction for the data exchange is often considered too big. In those cases, -a quicker and possibly less elegant solution is sought: The driver is -effectively written as a user space program, leaving the kernel space part -with just elementary data transport. This still requires designing some -interface logic for the FPGA, and write a simple ad-hoc driver for the kernel. - -Xillybus Overview ------------------ - -Xillybus is an IP core and a Linux driver. Together, they form a kit for -elementary data transport between an FPGA and the host, providing pipe-like -data streams with a straightforward user interface. 
It's intended as a low- -effort solution for mixed FPGA-host projects, for which it makes sense to -have the project-specific part of the driver running in a user-space program. - -Since the communication requirements may vary significantly from one FPGA -project to another (the number of data pipes needed in each direction and -their attributes), there isn't one specific chunk of logic being the Xillybus -IP core. Rather, the IP core is configured and built based upon a -specification given by its end user. - -Xillybus presents independent data streams, which resemble pipes or TCP/IP -communication to the user. At the host side, a character device file is used -just like any pipe file. On the FPGA side, hardware FIFOs are used to stream -the data. This is contrary to a common method of communicating through fixed- -sized buffers (even though such buffers are used by Xillybus under the hood). -There may be more than a hundred of these streams on a single IP core, but -also no more than one, depending on the configuration. - -In order to ease the deployment of the Xillybus IP core, it contains a simple -data structure which completely defines the core's configuration. The Linux -driver fetches this data structure during its initialization process, and sets -up the DMA buffers and character devices accordingly. As a result, a single -driver is used to work out of the box with any Xillybus IP core. - -The data structure just mentioned should not be confused with PCI's -configuration space or the Flattened Device Tree. - -Usage -===== - -User interface --------------- - -On the host, all interface with Xillybus is done through /dev/xillybus_* -device files, which are generated automatically as the drivers loads. The -names of these files depend on the IP core that is loaded in the FPGA (see -Probing below). 
To communicate with the FPGA, open the device file that -corresponds to the hardware FIFO you want to send data or receive data from, -and use plain write() or read() calls, just like with a regular pipe. In -particular, it makes perfect sense to go:: - - $ cat mydata > /dev/xillybus_thisfifo - - $ cat /dev/xillybus_thatfifo > hisdata - -possibly pressing CTRL-C as some stage, even though the xillybus_* pipes have -the capability to send an EOF (but may not use it). - -The driver and hardware are designed to behave sensibly as pipes, including: - -* Supporting non-blocking I/O (by setting O_NONBLOCK on open() ). - -* Supporting poll() and select(). - -* Being bandwidth efficient under load (using DMA) but also handle small - pieces of data sent across (like TCP/IP) by autoflushing. - -A device file can be read only, write only or bidirectional. Bidirectional -device files are treated like two independent pipes (except for sharing a -"channel" structure in the implementation code). - -Synchronization ---------------- - -Xillybus pipes are configured (on the IP core) to be either synchronous or -asynchronous. For a synchronous pipe, write() returns successfully only after -some data has been submitted and acknowledged by the FPGA. This slows down -bulk data transfers, and is nearly impossible for use with streams that -require data at a constant rate: There is no data transmitted to the FPGA -between write() calls, in particular when the process loses the CPU. - -When a pipe is configured asynchronous, write() returns if there was enough -room in the buffers to store any of the data in the buffers. - -For FPGA to host pipes, asynchronous pipes allow data transfer from the FPGA -as soon as the respective device file is opened, regardless of if the data -has been requested by a read() call. On synchronous pipes, only the amount -of data requested by a read() call is transmitted. 
- -In summary, for synchronous pipes, data between the host and FPGA is -transmitted only to satisfy the read() or write() call currently handled -by the driver, and those calls wait for the transmission to complete before -returning. - -Note that the synchronization attribute has nothing to do with the possibility -that read() or write() completes less bytes than requested. There is a -separate configuration flag ("allowpartial") that determines whether such a -partial completion is allowed. - -Seekable pipes --------------- - -A synchronous pipe can be configured to have the stream's position exposed -to the user logic at the FPGA. Such a pipe is also seekable on the host API. -With this feature, a memory or register interface can be attached on the -FPGA side to the seekable stream. Reading or writing to a certain address in -the attached memory is done by seeking to the desired address, and calling -read() or write() as required. - - -Internals -========= - -Source code organization ------------------------- - -The Xillybus driver consists of a core module, xillybus_core.c, and modules -that depend on the specific bus interface (xillybus_of.c and xillybus_pcie.c). - -The bus specific modules are those probed when a suitable device is found by -the kernel. Since the DMA mapping and synchronization functions, which are bus -dependent by their nature, are used by the core module, a -xilly_endpoint_hardware structure is passed to the core module on -initialization. This structure is populated with pointers to wrapper functions -which execute the DMA-related operations on the bus. - -Pipe attributes ---------------- - -Each pipe has a number of attributes which are set when the FPGA component -(IP core) is built. They are fetched from the IDT (the data structure which -defines the core's configuration, see Probing below) by xilly_setupchannels() -in xillybus_core.c as follows: - -* is_writebuf: The pipe's direction. 
A non-zero value means it's an FPGA to - host pipe (the FPGA "writes"). - -* channelnum: The pipe's identification number in communication between the - host and FPGA. - -* format: The underlying data width. See Data Granularity below. - -* allowpartial: A non-zero value means that a read() or write() (whichever - applies) may return with less than the requested number of bytes. The common - choice is a non-zero value, to match standard UNIX behavior. - -* synchronous: A non-zero value means that the pipe is synchronous. See - Synchronization above. - -* bufsize: Each DMA buffer's size. Always a power of two. - -* bufnum: The number of buffers allocated for this pipe. Always a power of two. - -* exclusive_open: A non-zero value forces exclusive opening of the associated - device file. If the device file is bidirectional, and already opened only in - one direction, the opposite direction may be opened once. - -* seekable: A non-zero value indicates that the pipe is seekable. See - Seekable pipes above. - -* supports_nonempty: A non-zero value (which is typical) indicates that the - hardware will send the messages that are necessary to support select() and - poll() for this pipe. - -Host never reads from the FPGA ------------------------------- - -Even though PCI Express is hotpluggable in general, a typical motherboard -doesn't expect a card to go away all of the sudden. But since the PCIe card -is based upon reprogrammable logic, a sudden disappearance from the bus is -quite likely as a result of an accidental reprogramming of the FPGA while the -host is up. In practice, nothing happens immediately in such a situation. But -if the host attempts to read from an address that is mapped to the PCI Express -device, that leads to an immediate freeze of the system on some motherboards, -even though the PCIe standard requires a graceful recovery. - -In order to avoid these freezes, the Xillybus driver refrains completely from -reading from the device's register space. 
All communication from the FPGA to -the host is done through DMA. In particular, the Interrupt Service Routine -doesn't follow the common practice of checking a status register when it's -invoked. Rather, the FPGA prepares a small buffer which contains short -messages, which inform the host what the interrupt was about. - -This mechanism is used on non-PCIe buses as well for the sake of uniformity. - - -Channels, pipes, and the message channel ----------------------------------------- - -Each of the (possibly bidirectional) pipes presented to the user is allocated -a data channel between the FPGA and the host. The distinction between channels -and pipes is necessary only because of channel 0, which is used for interrupt- -related messages from the FPGA, and has no pipe attached to it. - -Data streaming --------------- - -Even though a non-segmented data stream is presented to the user at both -sides, the implementation relies on a set of DMA buffers which is allocated -for each channel. For the sake of illustration, let's take the FPGA to host -direction: As data streams into the respective channel's interface in the -FPGA, the Xillybus IP core writes it to one of the DMA buffers. When the -buffer is full, the FPGA informs the host about that (appending a -XILLYMSG_OPCODE_RELEASEBUF message channel 0 and sending an interrupt if -necessary). The host responds by making the data available for reading through -the character device. When all data has been read, the host writes on the -the FPGA's buffer control register, allowing the buffer's overwriting. Flow -control mechanisms exist on both sides to prevent underflows and overflows. - -This is not good enough for creating a TCP/IP-like stream: If the data flow -stops momentarily before a DMA buffer is filled, the intuitive expectation is -that the partial data in buffer will arrive anyhow, despite the buffer not -being completed. 
This is implemented by adding a field in the -XILLYMSG_OPCODE_RELEASEBUF message, through which the FPGA informs not just -which buffer is submitted, but how much data it contains. - -But the FPGA will submit a partially filled buffer only if directed to do so -by the host. This situation occurs when the read() method has been blocking -for XILLY_RX_TIMEOUT jiffies (currently 10 ms), after which the host commands -the FPGA to submit a DMA buffer as soon as it can. This timeout mechanism -balances between bus bandwidth efficiency (preventing a lot of partially -filled buffers being sent) and a latency held fairly low for tails of data. - -A similar setting is used in the host to FPGA direction. The handling of -partial DMA buffers is somewhat different, though. The user can tell the -driver to submit all data it has in the buffers to the FPGA, by issuing a -write() with the byte count set to zero. This is similar to a flush request, -but it doesn't block. There is also an autoflushing mechanism, which triggers -an equivalent flush roughly XILLY_RX_TIMEOUT jiffies after the last write(). -This allows the user to be oblivious about the underlying buffering mechanism -and yet enjoy a stream-like interface. - -Note that the issue of partial buffer flushing is irrelevant for pipes having -the "synchronous" attribute nonzero, since synchronous pipes don't allow data -to lay around in the DMA buffers between read() and write() anyhow. - -Data granularity ----------------- - -The data arrives or is sent at the FPGA as 8, 16 or 32 bit wide words, as -configured by the "format" attribute. Whenever possible, the driver attempts -to hide this when the pipe is accessed differently from its natural alignment. -For example, reading single bytes from a pipe with 32 bit granularity works -with no issues. 
Writing single bytes to pipes with 16 or 32 bit granularity -will also work, but the driver can't send partially completed words to the -FPGA, so the transmission of up to one word may be held until it's fully -occupied with user data. - -This somewhat complicates the handling of host to FPGA streams, because -when a buffer is flushed, it may contain up to 3 bytes don't form a word in -the FPGA, and hence can't be sent. To prevent loss of data, these leftover -bytes need to be moved to the next buffer. The parts in xillybus_core.c -that mention "leftovers" in some way are related to this complication. - -Probing -------- - -As mentioned earlier, the number of pipes that are created when the driver -loads and their attributes depend on the Xillybus IP core in the FPGA. During -the driver's initialization, a blob containing configuration info, the -Interface Description Table (IDT), is sent from the FPGA to the host. The -bootstrap process is done in three phases: - -1. Acquire the length of the IDT, so a buffer can be allocated for it. This - is done by sending a quiesce command to the device, since the acknowledge - for this command contains the IDT's buffer length. - -2. Acquire the IDT itself. - -3. Create the interfaces according to the IDT. - -Buffer allocation ------------------ - -In order to simplify the logic that prevents illegal boundary crossings of -PCIe packets, the following rule applies: If a buffer is smaller than 4kB, -it must not cross a 4kB boundary. Otherwise, it must be 4kB aligned. The -xilly_setupchannels() functions allocates these buffers by requesting whole -pages from the kernel, and diving them into DMA buffers as necessary. Since -all buffers' sizes are powers of two, it's possible to pack any set of such -buffers, with a maximal waste of one page of memory. - -All buffers are allocated when the driver is loaded. 
This is necessary, -since large continuous physical memory segments are sometimes requested, -which are more likely to be available when the system is freshly booted. - -The allocation of buffer memory takes place in the same order they appear in -the IDT. The driver relies on a rule that the pipes are sorted with decreasing -buffer size in the IDT. If a requested buffer is larger or equal to a page, -the necessary number of pages is requested from the kernel, and these are -used for this buffer. If the requested buffer is smaller than a page, one -single page is requested from the kernel, and that page is partially used. -Or, if there already is a partially used page at hand, the buffer is packed -into that page. It can be shown that all pages requested from the kernel -(except possibly for the last) are 100% utilized this way. - -The "nonempty" message (supporting poll) ----------------------------------------- - -In order to support the "poll" method (and hence select() ), there is a small -catch regarding the FPGA to host direction: The FPGA may have filled a DMA -buffer with some data, but not submitted that buffer. If the host waited for -the buffer's submission by the FPGA, there would be a possibility that the -FPGA side has sent data, but a select() call would still block, because the -host has not received any notification about this. This is solved with -XILLYMSG_OPCODE_NONEMPTY messages sent by the FPGA when a channel goes from -completely empty to containing some data. - -These messages are used only to support poll() and select(). The IP core can -be configured not to send them for a slight reduction of bandwidth. 
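Note on the removed xillybus.txt text above: from user space, the "nonempty" messages mean that poll() and select() on an FPGA to host device file behave as they do on an ordinary pipe. The following is a minimal sketch of that usage, not code from the driver. The helper name read_when_ready() and the READ_TIMED_OUT constant are made up for illustration, and the function works on any readable file descriptor, so it assumes nothing Xillybus-specific beyond what the text states.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical return code for this sketch: nothing arrived in time. */
#define READ_TIMED_OUT (-2)

/*
 * Wait up to timeout_ms for fd to become readable, then read once.
 * Returns the number of bytes read (0 on EOF), READ_TIMED_OUT if the
 * timeout expired, or -1 on error with errno set.
 */
static ssize_t read_when_ready(int fd, void *buf, size_t len, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret;

	/* Retry if a signal interrupts the wait (the timeout restarts). */
	do {
		ret = poll(&pfd, 1, timeout_ms);
	} while (ret < 0 && errno == EINTR);

	if (ret < 0)
		return -1;
	if (ret == 0)
		return READ_TIMED_OUT;

	/* May return fewer than len bytes, as with any pipe. */
	return read(fd, buf, len);
}
```

A caller would typically obtain fd with open() on one of the /dev/xillybus_* files (for instance the /dev/xillybus_thatfifo name used in the example earlier in that file; real names depend on the IP core) and call the helper in a loop. If the IP core was configured not to send the "nonempty" messages, this pattern should not be relied upon for that pipe.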
diff --git a/Documentation/zorro.txt b/Documentation/zorro.txt deleted file mode 100644 index 664072b017e3..000000000000 --- a/Documentation/zorro.txt +++ /dev/null @@ -1,104 +0,0 @@ -======================================== -Writing Device Drivers for Zorro Devices -======================================== - -:Author: Written by Geert Uytterhoeven -:Last revised: September 5, 2003 - - -Introduction ------------- - -The Zorro bus is the bus used in the Amiga family of computers. Thanks to -AutoConfig(tm), it's 100% Plug-and-Play. - -There are two types of Zorro buses, Zorro II and Zorro III: - - - The Zorro II address space is 24-bit and lies within the first 16 MB of the - Amiga's address map. - - - Zorro III is a 32-bit extension of Zorro II, which is backwards compatible - with Zorro II. The Zorro III address space lies outside the first 16 MB. - - -Probing for Zorro Devices -------------------------- - -Zorro devices are found by calling ``zorro_find_device()``, which returns a -pointer to the ``next`` Zorro device with the specified Zorro ID. A probe loop -for the board with Zorro ID ``ZORRO_PROD_xxx`` looks like:: - - struct zorro_dev *z = NULL; - - while ((z = zorro_find_device(ZORRO_PROD_xxx, z))) { - if (!zorro_request_region(z->resource.start+MY_START, MY_SIZE, - "My explanation")) - ... - } - -``ZORRO_WILDCARD`` acts as a wildcard and finds any Zorro device. If your driver -supports different types of boards, you can use a construct like:: - - struct zorro_dev *z = NULL; - - while ((z = zorro_find_device(ZORRO_WILDCARD, z))) { - if (z->id != ZORRO_PROD_xxx1 && z->id != ZORRO_PROD_xxx2 && ...) - continue; - if (!zorro_request_region(z->resource.start+MY_START, MY_SIZE, - "My explanation")) - ... - } - - -Zorro Resources ---------------- - -Before you can access a Zorro device's registers, you have to make sure it's -not yet in use. 
This is done using the I/O memory space resource management -functions:: - - request_mem_region() - release_mem_region() - -Shortcuts to claim the whole device's address space are provided as well:: - - zorro_request_device - zorro_release_device - - -Accessing the Zorro Address Space ---------------------------------- - -The address regions in the Zorro device resources are Zorro bus address -regions. Due to the identity bus-physical address mapping on the Zorro bus, -they are CPU physical addresses as well. - -The treatment of these regions depends on the type of Zorro space: - - - Zorro II address space is always mapped and does not have to be mapped - explicitly using z_ioremap(). - - Conversion from bus/physical Zorro II addresses to kernel virtual addresses - and vice versa is done using:: - - virt_addr = ZTWO_VADDR(bus_addr); - bus_addr = ZTWO_PADDR(virt_addr); - - - Zorro III address space must be mapped explicitly using z_ioremap() first - before it can be accessed:: - - virt_addr = z_ioremap(bus_addr, size); - ... - z_iounmap(virt_addr); - - -References ----------- - -#. linux/include/linux/zorro.h -#. linux/include/uapi/linux/zorro.h -#. linux/include/uapi/linux/zorro_ids.h -#. linux/arch/m68k/include/asm/zorro.h -#. linux/drivers/zorro -#. 
/proc/bus/zorro - diff --git a/MAINTAINERS b/MAINTAINERS index 570572627fd1..d1a0a817dd92 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -4642,7 +4642,7 @@ DELL SYSTEMS MANAGEMENT BASE DRIVER (dcdbas) M: Stuart Hayes L: platform-driver-x86@vger.kernel.org S: Maintained -F: Documentation/dcdbas.txt +F: Documentation/driver-api/dcdbas.rst F: drivers/platform/x86/dcdbas.* DELL WMI NOTIFICATIONS DRIVER @@ -8462,7 +8462,7 @@ F: drivers/irqchip/ ISA M: William Breathitt Gray S: Maintained -F: Documentation/isa.txt +F: Documentation/driver-api/isa.rst F: drivers/base/isa.c F: include/linux/isa.h @@ -8477,7 +8477,7 @@ F: drivers/media/radio/radio-isa* ISAPNP M: Jaroslav Kysela S: Maintained -F: Documentation/isapnp.txt +F: Documentation/driver-api/isapnp.rst F: drivers/pnp/isapnp/ F: include/linux/isapnp.h @@ -10353,7 +10353,7 @@ M: Johannes Thumshirn S: Maintained F: drivers/mcb/ F: include/linux/mcb.h -F: Documentation/men-chameleon-bus.txt +F: Documentation/driver-api/men-chameleon-bus.rst MEN F21BMC (Board Management Controller) M: Andreas Werner @@ -12070,7 +12070,7 @@ F: drivers/parport/ F: include/linux/parport*.h F: drivers/char/ppdev.c F: include/uapi/linux/ppdev.h -F: Documentation/parport*.txt +F: Documentation/driver-api/parport*.rst PARAVIRT_OPS INTERFACE M: Juergen Gross @@ -12245,7 +12245,7 @@ M: Kurt Schwemmer M: Logan Gunthorpe L: linux-pci@vger.kernel.org S: Maintained -F: Documentation/switchtec.txt +F: Documentation/driver-api/switchtec.rst F: Documentation/ABI/testing/sysfs-class-switchtec F: drivers/pci/switch/switchtec* F: include/uapi/linux/switchtec_ioctl.h @@ -13006,7 +13006,7 @@ M: Thierry Reding L: linux-pwm@vger.kernel.org S: Maintained T: git git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm.git -F: Documentation/pwm.txt +F: Documentation/driver-api/pwm.rst F: Documentation/devicetree/bindings/pwm/ F: include/linux/pwm.h F: drivers/pwm/ @@ -13620,7 +13620,7 @@ W: http://wireless.kernel.org/ T: git 
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git T: git git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next.git S: Maintained -F: Documentation/rfkill.txt +F: Documentation/driver-api/rfkill.rst F: Documentation/ABI/stable/sysfs-class-rfkill F: net/rfkill/ F: include/linux/rfkill.h @@ -15343,7 +15343,7 @@ F: drivers/dma-buf/dma-fence* F: drivers/dma-buf/sw_sync.c F: include/linux/sync_file.h F: include/uapi/linux/sync_file.h -F: Documentation/sync_file.txt +F: Documentation/driver-api/sync_file.rst T: git git://anongit.freedesktop.org/drm/drm-misc SYNOPSYS ARC ARCHITECTURE @@ -16839,7 +16839,7 @@ R: Cornelia Huck L: kvm@vger.kernel.org T: git git://github.com/awilliam/linux-vfio.git S: Maintained -F: Documentation/vfio.txt +F: Documentation/driver-api/vfio.rst F: drivers/vfio/ F: include/linux/vfio.h F: include/uapi/linux/vfio.h @@ -16848,7 +16848,7 @@ VFIO MEDIATED DEVICE DRIVERS M: Kirti Wankhede L: kvm@vger.kernel.org S: Maintained -F: Documentation/vfio-mediated-device.txt +F: Documentation/driver-api/vfio-mediated-device.rst F: drivers/vfio/mdev/ F: include/linux/mdev.h F: samples/vfio-mdev/ diff --git a/drivers/dma-buf/Kconfig b/drivers/dma-buf/Kconfig index d5f915830b68..b6a9c2f1bc41 100644 --- a/drivers/dma-buf/Kconfig +++ b/drivers/dma-buf/Kconfig @@ -15,7 +15,7 @@ config SYNC_FILE associated with a buffer. When a job is submitted to the GPU a fence is attached to the buffer and is transferred via userspace, using Sync Files fds, to the DRM driver for example. More details at - Documentation/sync_file.txt. + Documentation/driver-api/sync_file.rst. config SW_SYNC bool "Sync File Validation Framework" diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig index e4fee216d5a4..079cca438466 100644 --- a/drivers/gpio/Kconfig +++ b/drivers/gpio/Kconfig @@ -1301,7 +1301,7 @@ config GPIO_BT8XX The card needs to be physically altered for using it as a GPIO card. 
For more information on how to build a GPIO card from a BT8xx TV card, see the documentation file at - Documentation/bt8xxgpio.txt + Documentation/driver-api/bt8xxgpio.rst If unsure, say N. diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig index e20e2956f620..9f49de00777e 100644 --- a/drivers/gpu/drm/Kconfig +++ b/drivers/gpu/drm/Kconfig @@ -141,7 +141,7 @@ config DRM_LOAD_EDID_FIRMWARE monitor are unable to provide appropriate EDID data. Since this feature is provided as a workaround for broken hardware, the default case is N. Details and instructions how to build your own - EDID data are given in Documentation/EDID/howto.rst. + EDID data are given in Documentation/driver-api/edid.rst. config DRM_DP_CEC bool "Enable DisplayPort CEC-Tunneling-over-AUX HDMI support" diff --git a/drivers/pci/switch/Kconfig b/drivers/pci/switch/Kconfig index aee28a5bb98f..d370f4ce0492 100644 --- a/drivers/pci/switch/Kconfig +++ b/drivers/pci/switch/Kconfig @@ -9,7 +9,7 @@ config PCI_SW_SWITCHTEC Enables support for the management interface for the MicroSemi Switchtec series of PCIe switches. Supports userspace access to submit MRPC commands to the switch via /dev/switchtecX - devices. See for more + devices. See for more information. endmenu diff --git a/drivers/platform/x86/Kconfig b/drivers/platform/x86/Kconfig index 5f580580a8e0..1b67bb578f9f 100644 --- a/drivers/platform/x86/Kconfig +++ b/drivers/platform/x86/Kconfig @@ -118,7 +118,7 @@ config DCDBAS Interrupts (SMIs) and Host Control Actions (system power cycle or power off after OS shutdown) on certain Dell systems. - See for more details on the driver + See for more details on the driver and the Dell systems on which Dell systems management software makes use of this driver. @@ -259,7 +259,7 @@ config DELL_RBU DELL system. Note you need a Dell OpenManage or Dell Update package (DUP) supporting application to communicate with the BIOS regarding the new image for the image update to take effect. 
- See for more details on the driver. + See for more details on the driver. config FUJITSU_LAPTOP diff --git a/drivers/platform/x86/dcdbas.c b/drivers/platform/x86/dcdbas.c index 12cf9475ac85..84f4cc839cc3 100644 --- a/drivers/platform/x86/dcdbas.c +++ b/drivers/platform/x86/dcdbas.c @@ -7,7 +7,7 @@ * and Host Control Actions (power cycle or power off after OS shutdown) on * Dell systems. * - * See Documentation/dcdbas.txt for more information. + * See Documentation/driver-api/dcdbas.rst for more information. * * Copyright (C) 1995-2006 Dell Inc. */ diff --git a/drivers/platform/x86/dell_rbu.c b/drivers/platform/x86/dell_rbu.c index a58fc10293ee..3691391fea6b 100644 --- a/drivers/platform/x86/dell_rbu.c +++ b/drivers/platform/x86/dell_rbu.c @@ -24,7 +24,7 @@ * on every time the packet data is written. This driver requires an * application to break the BIOS image in to fixed sized packet chunks. * - * See Documentation/dell_rbu.txt for more info. + * See Documentation/driver-api/dell_rbu.rst for more info. */ #include #include diff --git a/drivers/pnp/isapnp/Kconfig b/drivers/pnp/isapnp/Kconfig index 4b58a3dcb52b..d0479a563123 100644 --- a/drivers/pnp/isapnp/Kconfig +++ b/drivers/pnp/isapnp/Kconfig @@ -7,6 +7,6 @@ config ISAPNP depends on ISA || COMPILE_TEST help Say Y here if you would like support for ISA Plug and Play devices. - Some information is in . + Some information is in . If unsure, say Y. diff --git a/drivers/tty/Kconfig b/drivers/tty/Kconfig index 1cb50f19d58c..ee51b9514225 100644 --- a/drivers/tty/Kconfig +++ b/drivers/tty/Kconfig @@ -93,7 +93,7 @@ config VT_HW_CONSOLE_BINDING select the console driver that will serve as the backend for the virtual terminals. - See for more + See for more information. For framebuffer console users, please refer to . 
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig index e5a7a454fe17..fd17db9b432f 100644 --- a/drivers/vfio/Kconfig +++ b/drivers/vfio/Kconfig @@ -25,7 +25,7 @@ menuconfig VFIO select VFIO_IOMMU_TYPE1 if (X86 || S390 || ARM || ARM64) help VFIO provides a framework for secure userspace device drivers. - See Documentation/vfio.txt for more details. + See Documentation/driver-api/vfio.rst for more details. If you don't know what to do here, say N. diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig index ba94a076887f..5da27f2100f9 100644 --- a/drivers/vfio/mdev/Kconfig +++ b/drivers/vfio/mdev/Kconfig @@ -6,7 +6,7 @@ config VFIO_MDEV default n help Provides a framework to virtualize devices. - See Documentation/vfio-mediated-device.txt for more details. + See Documentation/driver-api/vfio-mediated-device.rst for more details. If you don't know what do here, say N. diff --git a/drivers/w1/Kconfig b/drivers/w1/Kconfig index 160053c0baea..3e7ad7b232fe 100644 --- a/drivers/w1/Kconfig +++ b/drivers/w1/Kconfig @@ -19,7 +19,7 @@ config W1_CON default y ---help--- This allows to communicate with userspace using connector. For more - information see . + information see . There are three types of messages between w1 core and userspace: 1. Events. They are generated each time new master or slave device found either due to automatic or requested search. diff --git a/samples/Kconfig b/samples/Kconfig index 155da47dc6a4..c8dacb4dda80 100644 --- a/samples/Kconfig +++ b/samples/Kconfig @@ -99,7 +99,7 @@ config SAMPLE_CONNECTOR When enabled, this builds both a sample kernel module for the connector interface and a user space tool to communicate with it. 
- See also Documentation/connector/connector.rst + See also Documentation/driver-api/connector.rst config SAMPLE_HIDRAW bool "hidraw sample" -- cgit v1.2.3 From fb8c5327b3c6c78b74a27a3c42e4f32b2cc30a04 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 13 Jun 2019 14:40:42 -0300 Subject: docs: driver-api: add xilinx driver API documentation The current file there (emmi) provides a description of the driver uAPI and kAPI. Signed-off-by: Mauro Carvalho Chehab --- Documentation/driver-api/index.rst | 1 + Documentation/driver-api/xilinx/eemi.rst | 67 +++++++++++++++++++++++++++++++ Documentation/driver-api/xilinx/index.rst | 16 ++++++++ Documentation/xilinx/eemi.rst | 67 ------------------------------- Documentation/xilinx/index.rst | 17 -------- 5 files changed, 84 insertions(+), 84 deletions(-) create mode 100644 Documentation/driver-api/xilinx/eemi.rst create mode 100644 Documentation/driver-api/xilinx/index.rst delete mode 100644 Documentation/xilinx/eemi.rst delete mode 100644 Documentation/xilinx/index.rst (limited to 'Documentation/driver-api') diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index d1c6513dd20d..77322753c1bc 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -93,6 +93,7 @@ available subsections can be seen below. sync_file vfio-mediated-device vfio + xilinx/index xillybus zorro diff --git a/Documentation/driver-api/xilinx/eemi.rst b/Documentation/driver-api/xilinx/eemi.rst new file mode 100644 index 000000000000..9dcbc6f18d75 --- /dev/null +++ b/Documentation/driver-api/xilinx/eemi.rst @@ -0,0 +1,67 @@ +==================================== +Xilinx Zynq MPSoC EEMI Documentation +==================================== + +Xilinx Zynq MPSoC Firmware Interface +------------------------------------- +The zynqmp-firmware node describes the interface to platform firmware. +ZynqMP has an interface to communicate with secure firmware. 
Firmware +driver provides an interface to firmware APIs. Interface APIs can be +used by any driver to communicate with PMC(Platform Management Controller). + +Embedded Energy Management Interface (EEMI) +---------------------------------------------- +The embedded energy management interface is used to allow software +components running across different processing clusters on a chip or +device to communicate with a power management controller (PMC) on a +device to issue or respond to power management requests. + +EEMI ops is a structure containing all eemi APIs supported by Zynq MPSoC. +The zynqmp-firmware driver maintain all EEMI APIs in zynqmp_eemi_ops +structure. Any driver who want to communicate with PMC using EEMI APIs +can call zynqmp_pm_get_eemi_ops(). + +Example of EEMI ops:: + + /* zynqmp-firmware driver maintain all EEMI APIs */ + struct zynqmp_eemi_ops { + int (*get_api_version)(u32 *version); + int (*query_data)(struct zynqmp_pm_query_data qdata, u32 *out); + }; + + static const struct zynqmp_eemi_ops eemi_ops = { + .get_api_version = zynqmp_pm_get_api_version, + .query_data = zynqmp_pm_query_data, + }; + +Example of EEMI ops usage:: + + static const struct zynqmp_eemi_ops *eemi_ops; + u32 ret_payload[PAYLOAD_ARG_CNT]; + int ret; + + eemi_ops = zynqmp_pm_get_eemi_ops(); + if (IS_ERR(eemi_ops)) + return PTR_ERR(eemi_ops); + + ret = eemi_ops->query_data(qdata, ret_payload); + +IOCTL +------ +IOCTL API is for device control and configuration. It is not a system +IOCTL but it is an EEMI API. This API can be used by master to control +any device specific configuration. IOCTL definitions can be platform +specific. This API also manage shared device configuration. + +The following IOCTL IDs are valid for device control: +- IOCTL_SET_PLL_FRAC_MODE 8 +- IOCTL_GET_PLL_FRAC_MODE 9 +- IOCTL_SET_PLL_FRAC_DATA 10 +- IOCTL_GET_PLL_FRAC_DATA 11 + +Refer EEMI API guide [0] for IOCTL specific parameters and other EEMI APIs. 
+ +References +---------- +[0] Embedded Energy Management Interface (EEMI) API guide: + https://www.xilinx.com/support/documentation/user_guides/ug1200-eemi-api.pdf diff --git a/Documentation/driver-api/xilinx/index.rst b/Documentation/driver-api/xilinx/index.rst new file mode 100644 index 000000000000..13f7589ed442 --- /dev/null +++ b/Documentation/driver-api/xilinx/index.rst @@ -0,0 +1,16 @@ + +=========== +Xilinx FPGA +=========== + +.. toctree:: + :maxdepth: 1 + + eemi + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/xilinx/eemi.rst b/Documentation/xilinx/eemi.rst deleted file mode 100644 index 9dcbc6f18d75..000000000000 --- a/Documentation/xilinx/eemi.rst +++ /dev/null @@ -1,67 +0,0 @@ -==================================== -Xilinx Zynq MPSoC EEMI Documentation -==================================== - -Xilinx Zynq MPSoC Firmware Interface -------------------------------------- -The zynqmp-firmware node describes the interface to platform firmware. -ZynqMP has an interface to communicate with secure firmware. Firmware -driver provides an interface to firmware APIs. Interface APIs can be -used by any driver to communicate with PMC(Platform Management Controller). - -Embedded Energy Management Interface (EEMI) ----------------------------------------------- -The embedded energy management interface is used to allow software -components running across different processing clusters on a chip or -device to communicate with a power management controller (PMC) on a -device to issue or respond to power management requests. - -EEMI ops is a structure containing all eemi APIs supported by Zynq MPSoC. -The zynqmp-firmware driver maintain all EEMI APIs in zynqmp_eemi_ops -structure. Any driver who want to communicate with PMC using EEMI APIs -can call zynqmp_pm_get_eemi_ops(). 
- -Example of EEMI ops:: - - /* zynqmp-firmware driver maintain all EEMI APIs */ - struct zynqmp_eemi_ops { - int (*get_api_version)(u32 *version); - int (*query_data)(struct zynqmp_pm_query_data qdata, u32 *out); - }; - - static const struct zynqmp_eemi_ops eemi_ops = { - .get_api_version = zynqmp_pm_get_api_version, - .query_data = zynqmp_pm_query_data, - }; - -Example of EEMI ops usage:: - - static const struct zynqmp_eemi_ops *eemi_ops; - u32 ret_payload[PAYLOAD_ARG_CNT]; - int ret; - - eemi_ops = zynqmp_pm_get_eemi_ops(); - if (IS_ERR(eemi_ops)) - return PTR_ERR(eemi_ops); - - ret = eemi_ops->query_data(qdata, ret_payload); - -IOCTL ------- -IOCTL API is for device control and configuration. It is not a system -IOCTL but it is an EEMI API. This API can be used by master to control -any device specific configuration. IOCTL definitions can be platform -specific. This API also manage shared device configuration. - -The following IOCTL IDs are valid for device control: -- IOCTL_SET_PLL_FRAC_MODE 8 -- IOCTL_GET_PLL_FRAC_MODE 9 -- IOCTL_SET_PLL_FRAC_DATA 10 -- IOCTL_GET_PLL_FRAC_DATA 11 - -Refer EEMI API guide [0] for IOCTL specific parameters and other EEMI APIs. - -References ----------- -[0] Embedded Energy Management Interface (EEMI) API guide: - https://www.xilinx.com/support/documentation/user_guides/ug1200-eemi-api.pdf diff --git a/Documentation/xilinx/index.rst b/Documentation/xilinx/index.rst deleted file mode 100644 index 01cc1a0714df..000000000000 --- a/Documentation/xilinx/index.rst +++ /dev/null @@ -1,17 +0,0 @@ -:orphan: - -=========== -Xilinx FPGA -=========== - -.. toctree:: - :maxdepth: 1 - - eemi - -.. 
only:: subproject and html - - Indices - ======= - - * :ref:`genindex` -- cgit v1.2.3 From c92992fc609fe99d926855eb1945f38ef4ad8e6c Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 22 Apr 2019 16:49:11 -0300 Subject: docs: driver-api: add remaining converted dirs to it There are a number of driver-specific descriptions that contain a mix of userspace and kernelspace documentation. Just like we did with other similar subsystems, add them at the driver-api groupset, but don't move the directories. Signed-off-by: Mauro Carvalho Chehab --- Documentation/driver-api/index.rst | 2 ++ Documentation/driver-api/pps.rst | 2 +- Documentation/driver-api/ptp.rst | 2 +- Documentation/index.rst | 3 +++ Documentation/mic/index.rst | 2 -- Documentation/phy/samsung-usb2.rst | 2 -- Documentation/scheduler/index.rst | 2 -- 7 files changed, 7 insertions(+), 8 deletions(-) (limited to 'Documentation/driver-api') diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index 77322753c1bc..1dde9692075c 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -83,6 +83,8 @@ available subsections can be seen below. ntb nvmem parport-lowlevel + pps + ptp pti_intel_mid pwm rfkill diff --git a/Documentation/driver-api/pps.rst b/Documentation/driver-api/pps.rst index 1456d2c32ebd..2d6b99766ee8 100644 --- a/Documentation/driver-api/pps.rst +++ b/Documentation/driver-api/pps.rst @@ -1,4 +1,4 @@ -:orphan: +.. SPDX-License-Identifier: GPL-2.0 ====================== PPS - Pulse Per Second diff --git a/Documentation/driver-api/ptp.rst b/Documentation/driver-api/ptp.rst index b6e65d66d37a..a15192e32347 100644 --- a/Documentation/driver-api/ptp.rst +++ b/Documentation/driver-api/ptp.rst @@ -1,4 +1,4 @@ -:orphan: +.. 
SPDX-License-Identifier: GPL-2.0 =========================================== PTP hardware clock infrastructure for Linux diff --git a/Documentation/index.rst b/Documentation/index.rst index dcdaaff71633..041ffe442960 100644 --- a/Documentation/index.rst +++ b/Documentation/index.rst @@ -110,6 +110,9 @@ needed). bpf/index usb/index misc-devices/index + mic/index + phy/samsung-usb2 + scheduler/index Architecture-specific documentation ----------------------------------- diff --git a/Documentation/mic/index.rst b/Documentation/mic/index.rst index 082fa8f6a260..3a8d06367ef1 100644 --- a/Documentation/mic/index.rst +++ b/Documentation/mic/index.rst @@ -1,5 +1,3 @@ -:orphan: - ============================================= Intel Many Integrated Core (MIC) architecture ============================================= diff --git a/Documentation/phy/samsung-usb2.rst b/Documentation/phy/samsung-usb2.rst index 98b5952fcb97..c48c8b9797b9 100644 --- a/Documentation/phy/samsung-usb2.rst +++ b/Documentation/phy/samsung-usb2.rst @@ -1,5 +1,3 @@ -:orphan: - ==================================== Samsung USB 2.0 PHY adaptation layer ==================================== diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst index 058be77a4c34..69074e5de9c4 100644 --- a/Documentation/scheduler/index.rst +++ b/Documentation/scheduler/index.rst @@ -1,5 +1,3 @@ -:orphan: - =============== Linux Scheduler =============== -- cgit v1.2.3 From 65388dad1bbb51a4eb6cc91b9fa865b57646fb67 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 27 Jun 2019 16:31:35 -0300 Subject: docs: serial: move it to the driver-api The contents of this directory is mostly driver-api stuff. 
Signed-off-by: Mauro Carvalho Chehab --- Documentation/driver-api/index.rst | 1 + Documentation/driver-api/serial/cyclades_z.rst | 11 + Documentation/driver-api/serial/driver.rst | 549 ++++++++++++++++++ Documentation/driver-api/serial/index.rst | 32 ++ Documentation/driver-api/serial/moxa-smartio.rst | 615 +++++++++++++++++++++ Documentation/driver-api/serial/n_gsm.rst | 103 ++++ Documentation/driver-api/serial/rocket.rst | 185 +++++++ Documentation/driver-api/serial/serial-iso7816.rst | 90 +++ Documentation/driver-api/serial/serial-rs485.rst | 103 ++++ Documentation/driver-api/serial/tty.rst | 328 +++++++++++ Documentation/serial/cyclades_z.rst | 11 - Documentation/serial/driver.rst | 549 ------------------ Documentation/serial/index.rst | 32 -- Documentation/serial/moxa-smartio.rst | 615 --------------------- Documentation/serial/n_gsm.rst | 103 ---- Documentation/serial/rocket.rst | 185 ------- Documentation/serial/serial-iso7816.rst | 90 --- Documentation/serial/serial-rs485.rst | 103 ---- Documentation/serial/tty.rst | 328 ----------- MAINTAINERS | 6 +- drivers/tty/Kconfig | 4 +- drivers/tty/serial/ucc_uart.c | 2 +- include/linux/serial_core.h | 2 +- 23 files changed, 2024 insertions(+), 2023 deletions(-) create mode 100644 Documentation/driver-api/serial/cyclades_z.rst create mode 100644 Documentation/driver-api/serial/driver.rst create mode 100644 Documentation/driver-api/serial/index.rst create mode 100644 Documentation/driver-api/serial/moxa-smartio.rst create mode 100644 Documentation/driver-api/serial/n_gsm.rst create mode 100644 Documentation/driver-api/serial/rocket.rst create mode 100644 Documentation/driver-api/serial/serial-iso7816.rst create mode 100644 Documentation/driver-api/serial/serial-rs485.rst create mode 100644 Documentation/driver-api/serial/tty.rst delete mode 100644 Documentation/serial/cyclades_z.rst delete mode 100644 Documentation/serial/driver.rst delete mode 100644 Documentation/serial/index.rst delete mode 100644 
Documentation/serial/moxa-smartio.rst delete mode 100644 Documentation/serial/n_gsm.rst delete mode 100644 Documentation/serial/rocket.rst delete mode 100644 Documentation/serial/serial-iso7816.rst delete mode 100644 Documentation/serial/serial-rs485.rst delete mode 100644 Documentation/serial/tty.rst (limited to 'Documentation/driver-api') diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index 1dde9692075c..cf39b8f9d0f9 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -88,6 +88,7 @@ available subsections can be seen below. pti_intel_mid pwm rfkill + serial/index sgi-ioc4 sm501 smsc_ece1099 diff --git a/Documentation/driver-api/serial/cyclades_z.rst b/Documentation/driver-api/serial/cyclades_z.rst new file mode 100644 index 000000000000..532ff67e2f1c --- /dev/null +++ b/Documentation/driver-api/serial/cyclades_z.rst @@ -0,0 +1,11 @@ +================ +Cyclades-Z notes +================ + +The Cyclades-Z must have firmware loaded onto the card before it will +operate. This operation should be performed during system startup, + +The firmware, loader program and the latest device driver code are +available from Cyclades at + + ftp://ftp.cyclades.com/pub/cyclades/cyclades-z/linux/ diff --git a/Documentation/driver-api/serial/driver.rst b/Documentation/driver-api/serial/driver.rst new file mode 100644 index 000000000000..31bd4e16fb1f --- /dev/null +++ b/Documentation/driver-api/serial/driver.rst @@ -0,0 +1,549 @@ +==================== +Low Level Serial API +==================== + + +This document is meant as a brief overview of some aspects of the new serial +driver. It is not complete, any questions you have should be directed to + + +The reference implementation is contained within amba-pl011.c. 
+ + + +Low Level Serial Hardware Driver +-------------------------------- + +The low level serial hardware driver is responsible for supplying port +information (defined by uart_port) and a set of control methods (defined +by uart_ops) to the core serial driver. The low level driver is also +responsible for handling interrupts for the port, and providing any +console support. + + +Console Support +--------------- + +The serial core provides a few helper functions. This includes identifing +the correct port structure (via uart_get_console) and decoding command line +arguments (uart_parse_options). + +There is also a helper function (uart_console_write) which performs a +character by character write, translating newlines to CRLF sequences. +Driver writers are recommended to use this function rather than implementing +their own version. + + +Locking +------- + +It is the responsibility of the low level hardware driver to perform the +necessary locking using port->lock. There are some exceptions (which +are described in the uart_ops listing below.) + +There are two locks. A per-port spinlock, and an overall semaphore. + +From the core driver perspective, the port->lock locks the following +data:: + + port->mctrl + port->icount + port->state->xmit.head (circ_buf->head) + port->state->xmit.tail (circ_buf->tail) + +The low level driver is free to use this lock to provide any additional +locking. + +The port_sem semaphore is used to protect against ports being added/ +removed or reconfigured at inappropriate times. Since v2.6.27, this +semaphore has been the 'mutex' member of the tty_port struct, and +commonly referred to as the port mutex. + + +uart_ops +-------- + +The uart_ops structure is the main interface between serial_core and the +hardware specific driver. It contains all the methods to control the +hardware. + + tx_empty(port) + This function tests whether the transmitter fifo and shifter + for the port described by 'port' is empty. 
If it is empty, + this function should return TIOCSER_TEMT, otherwise return 0. + If the port does not support this operation, then it should + return TIOCSER_TEMT. + + Locking: none. + + Interrupts: caller dependent. + + This call must not sleep + + set_mctrl(port, mctrl) + This function sets the modem control lines for port described + by 'port' to the state described by mctrl. The relevant bits + of mctrl are: + + - TIOCM_RTS RTS signal. + - TIOCM_DTR DTR signal. + - TIOCM_OUT1 OUT1 signal. + - TIOCM_OUT2 OUT2 signal. + - TIOCM_LOOP Set the port into loopback mode. + + If the appropriate bit is set, the signal should be driven + active. If the bit is clear, the signal should be driven + inactive. + + Locking: port->lock taken. + + Interrupts: locally disabled. + + This call must not sleep + + get_mctrl(port) + Returns the current state of modem control inputs. The state + of the outputs should not be returned, since the core keeps + track of their state. The state information should include: + + - TIOCM_CAR state of DCD signal + - TIOCM_CTS state of CTS signal + - TIOCM_DSR state of DSR signal + - TIOCM_RI state of RI signal + + The bit is set if the signal is currently driven active. If + the port does not support CTS, DCD or DSR, the driver should + indicate that the signal is permanently active. If RI is + not available, the signal should not be indicated as active. + + Locking: port->lock taken. + + Interrupts: locally disabled. + + This call must not sleep + + stop_tx(port) + Stop transmitting characters. This might be due to the CTS + line becoming inactive or the tty layer indicating we want + to stop transmission due to an XOFF character. + + The driver should stop transmitting characters as soon as + possible. + + Locking: port->lock taken. + + Interrupts: locally disabled. + + This call must not sleep + + start_tx(port) + Start transmitting characters. + + Locking: port->lock taken. + + Interrupts: locally disabled. 
+ + This call must not sleep + + throttle(port) + Notify the serial driver that input buffers for the line discipline are + close to full, and it should somehow signal that no more characters + should be sent to the serial port. + This will be called only if hardware assisted flow control is enabled. + + Locking: serialized with .unthrottle() and termios modification by the + tty layer. + + unthrottle(port) + Notify the serial driver that characters can now be sent to the serial + port without fear of overrunning the input buffers of the line + disciplines. + + This will be called only if hardware assisted flow control is enabled. + + Locking: serialized with .throttle() and termios modification by the + tty layer. + + send_xchar(port,ch) + Transmit a high priority character, even if the port is stopped. + This is used to implement XON/XOFF flow control and tcflow(). If + the serial driver does not implement this function, the tty core + will append the character to the circular buffer and then call + start_tx() / stop_tx() to flush the data out. + + Do not transmit if ch == '\0' (__DISABLED_CHAR). + + Locking: none. + + Interrupts: caller dependent. + + stop_rx(port) + Stop receiving characters; the port is in the process of + being closed. + + Locking: port->lock taken. + + Interrupts: locally disabled. + + This call must not sleep + + enable_ms(port) + Enable the modem status interrupts. + + This method may be called multiple times. Modem status + interrupts should be disabled when the shutdown method is + called. + + Locking: port->lock taken. + + Interrupts: locally disabled. + + This call must not sleep + + break_ctl(port,ctl) + Control the transmission of a break signal. If ctl is + nonzero, the break signal should be transmitted. The signal + should be terminated when another call is made with a zero + ctl. + + Locking: caller holds tty_port->mutex + + startup(port) + Grab any interrupt resources and initialise any low level driver + state. 
Enable the port for reception. It should not activate + RTS nor DTR; this will be done via a separate call to set_mctrl. + + This method will only be called when the port is initially opened. + + Locking: port_sem taken. + + Interrupts: globally disabled. + + shutdown(port) + Disable the port, disable any break condition that may be in + effect, and free any interrupt resources. It should not disable + RTS nor DTR; this will have already been done via a separate + call to set_mctrl. + + Drivers must not access port->state once this call has completed. + + This method will only be called when there are no more users of + this port. + + Locking: port_sem taken. + + Interrupts: caller dependent. + + flush_buffer(port) + Flush any write buffers, reset any DMA state and stop any + ongoing DMA transfers. + + This will be called whenever the port->state->xmit circular + buffer is cleared. + + Locking: port->lock taken. + + Interrupts: locally disabled. + + This call must not sleep + + set_termios(port,termios,oldtermios) + Change the port parameters, including word length, parity, stop + bits. Update read_status_mask and ignore_status_mask to indicate + the types of events we are interested in receiving. Relevant + termios->c_cflag bits are: + + CSIZE + - word size + CSTOPB + - 2 stop bits + PARENB + - parity enable + PARODD + - odd parity (when PARENB is in force) + CREAD + - enable reception of characters (if not set, + still receive characters from the port, but + throw them away. + CRTSCTS + - if set, enable CTS status change reporting + CLOCAL + - if not set, enable modem status change + reporting. + + Relevant termios->c_iflag bits are: + + INPCK + - enable frame and parity error events to be + passed to the TTY layer. + BRKINT / PARMRK + - both of these enable break events to be + passed to the TTY layer. + + IGNPAR + - ignore parity and framing errors + IGNBRK + - ignore break errors, If IGNPAR is also + set, ignore overrun errors as well. 
+ + The interaction of the iflag bits is as follows (parity error + given as an example): + + =============== ======= ====== ============================= + Parity error INPCK IGNPAR + =============== ======= ====== ============================= + n/a 0 n/a character received, marked as + TTY_NORMAL + None 1 n/a character received, marked as + TTY_NORMAL + Yes 1 0 character received, marked as + TTY_PARITY + Yes 1 1 character discarded + =============== ======= ====== ============================= + + Other flags may be used (eg, xon/xoff characters) if your + hardware supports hardware "soft" flow control. + + Locking: caller holds tty_port->mutex + + Interrupts: caller dependent. + + This call must not sleep + + set_ldisc(port,termios) + Notifier for discipline change. See Documentation/driver-api/serial/tty.rst. + + Locking: caller holds tty_port->mutex + + pm(port,state,oldstate) + Perform any power management related activities on the specified + port. State indicates the new state (defined by + enum uart_pm_state), oldstate indicates the previous state. + + This function should not be used to grab any resources. + + This will be called when the port is initially opened and finally + closed, except when the port is also the system console. This + will occur even if CONFIG_PM is not set. + + Locking: none. + + Interrupts: caller dependent. + + type(port) + Return a pointer to a string constant describing the specified + port, or return NULL, in which case the string 'unknown' is + substituted. + + Locking: none. + + Interrupts: caller dependent. + + release_port(port) + Release any memory and IO region resources currently in use by + the port. + + Locking: none. + + Interrupts: caller dependent. + + request_port(port) + Request any memory and IO region resources required by the port. + If any fail, no resources should be registered when this function + returns, and it should return -EBUSY on failure. + + Locking: none. + + Interrupts: caller dependent. 
+ + config_port(port,type) + Perform any autoconfiguration steps required for the port. `type` + contains a bit mask of the required configuration. UART_CONFIG_TYPE + indicates that the port requires detection and identification. + port->type should be set to the type found, or PORT_UNKNOWN if + no port was detected. + + UART_CONFIG_IRQ indicates autoconfiguration of the interrupt signal, + which should be probed using standard kernel autoprobing techniques. + This is not necessary on platforms where ports have interrupts + internally hard wired (eg, system on a chip implementations). + + Locking: none. + + Interrupts: caller dependent. + + verify_port(port,serinfo) + Verify the new serial port information contained within serinfo is + suitable for this port type. + + Locking: none. + + Interrupts: caller dependent. + + ioctl(port,cmd,arg) + Perform any port specific IOCTLs. IOCTL commands must be defined + using the standard numbering system found in + + Locking: none. + + Interrupts: caller dependent. + + poll_init(port) + Called by kgdb to perform the minimal hardware initialization needed + to support poll_put_char() and poll_get_char(). Unlike ->startup() + this should not request interrupts. + + Locking: tty_mutex and tty_port->mutex taken. + + Interrupts: n/a. + + poll_put_char(port,ch) + Called by kgdb to write a single character directly to the serial + port. It can and should block until there is space in the TX FIFO. + + Locking: none. + + Interrupts: caller dependent. + + This call must not sleep + + poll_get_char(port) + Called by kgdb to read a single character directly from the serial + port. If data is available, it should be returned; otherwise + the function should return NO_POLL_CHAR immediately. + + Locking: none. + + Interrupts: caller dependent. 
+ + This call must not sleep + +Other functions +--------------- + +uart_update_timeout(port,cflag,baud) + Update the FIFO drain timeout, port->timeout, according to the + number of bits, parity, stop bits and baud rate. + + Locking: caller is expected to take port->lock + + Interrupts: n/a + +uart_get_baud_rate(port,termios,old,min,max) + Return the numeric baud rate for the specified termios, taking + account of the special 38400 baud "kludge". The B0 baud rate + is mapped to 9600 baud. + + If the baud rate is not within min..max, then if old is non-NULL, + the original baud rate will be tried. If that exceeds the + min..max constraint, 9600 baud will be returned. termios will + be updated to the baud rate in use. + + Note: min..max must always allow 9600 baud to be selected. + + Locking: caller dependent. + + Interrupts: n/a + +uart_get_divisor(port,baud) + Return the divisor (baud_base / baud) for the specified baud + rate, appropriately rounded. + + If 38400 baud and custom divisor is selected, return the + custom divisor instead. + + Locking: caller dependent. + + Interrupts: n/a + +uart_match_port(port1,port2) + This utility function can be used to determine whether two + uart_port structures describe the same port. + + Locking: n/a + + Interrupts: n/a + +uart_write_wakeup(port) + A driver is expected to call this function when the number of + characters in the transmit buffer have dropped below a threshold. + + Locking: port->lock should be held. + + Interrupts: n/a + +uart_register_driver(drv) + Register a uart driver with the core driver. We in turn register + with the tty layer, and initialise the core driver per-port state. + + drv->port should be NULL, and the per-port structures should be + registered using uart_add_one_port after this call has succeeded. + + Locking: none + + Interrupts: enabled + +uart_unregister_driver() + Remove all references to a driver from the core driver. 
The low + level driver must have removed all its ports via + uart_remove_one_port() if it registered them with uart_add_one_port(). + + Locking: none + + Interrupts: enabled + +**uart_suspend_port()** + +**uart_resume_port()** + +**uart_add_one_port()** + +**uart_remove_one_port()** + +Other notes +----------- + +It is intended some day to drop the 'unused' entries from uart_port, and +allow low level drivers to register their own individual uart_port's with +the core. This will allow drivers to use uart_port as a pointer to a +structure containing both the uart_port entry and their own extensions, +thus:: + + struct my_port { + struct uart_port port; + int my_stuff; + }; + +Modem control lines via GPIO +---------------------------- + +Some helpers are provided in order to set/get modem control lines via GPIO. + +mctrl_gpio_init(port, idx): + This will get the {cts,rts,...}-gpios from device tree if they are + present and request them, set direction etc, and return an + allocated structure. `devm_*` functions are used, so there's no need + to call mctrl_gpio_free(). + As this sets up the irq handling, make sure not to handle changes to the + gpio input lines in your driver, too. + +mctrl_gpio_free(dev, gpios): + This will free the gpios requested in mctrl_gpio_init(). + As `devm_*` functions are used, there's generally no need to call + this function. + +mctrl_gpio_to_gpiod(gpios, gidx): + This returns the gpio_desc structure associated with the modem line + index. + +mctrl_gpio_set(gpios, mctrl): + This will set the gpios according to the mctrl state. + +mctrl_gpio_get(gpios, mctrl): + This will update mctrl with the gpios values. + +mctrl_gpio_enable_ms(gpios): + Enables irqs and handling of changes to the ms lines. + +mctrl_gpio_disable_ms(gpios): + Disables irqs and handling of changes to the ms lines. 
diff --git a/Documentation/driver-api/serial/index.rst b/Documentation/driver-api/serial/index.rst new file mode 100644 index 000000000000..33ad10d05b26 --- /dev/null +++ b/Documentation/driver-api/serial/index.rst @@ -0,0 +1,32 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================== +Support for Serial devices +========================== + +.. toctree:: + :maxdepth: 1 + + + driver + tty + +Serial drivers +============== + +.. toctree:: + :maxdepth: 1 + + cyclades_z + moxa-smartio + n_gsm + rocket + serial-iso7816 + serial-rs485 + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/driver-api/serial/moxa-smartio.rst b/Documentation/driver-api/serial/moxa-smartio.rst new file mode 100644 index 000000000000..156100f17c3f --- /dev/null +++ b/Documentation/driver-api/serial/moxa-smartio.rst @@ -0,0 +1,615 @@ +============================================================= +MOXA Smartio/Industio Family Device Driver Installation Guide +============================================================= + +.. note:: + + This file is outdated. It needs some care in order to make it + updated to Kernel 5.0 and upper + +Copyright (C) 2008, Moxa Inc. + +Date: 01/21/2008 + +.. Content + + 1. Introduction + 2. System Requirement + 3. Installation + 3.1 Hardware installation + 3.2 Driver files + 3.3 Device naming convention + 3.4 Module driver configuration + 3.5 Static driver configuration for Linux kernel 2.4.x and 2.6.x. + 3.6 Custom configuration + 3.7 Verify driver installation + 4. Utilities + 5. Setserial + 6. Troubleshooting + +1. Introduction +^^^^^^^^^^^^^^^ + + The Smartio/Industio/UPCI family Linux driver supports following multiport + boards. 
+ + - 2 ports multiport board + CP-102U, CP-102UL, CP-102UF + CP-132U-I, CP-132UL, + CP-132, CP-132I, CP132S, CP-132IS, + CI-132, CI-132I, CI-132IS, + (C102H, C102HI, C102HIS, C102P, CP-102, CP-102S) + + - 4 ports multiport board + CP-104EL, + CP-104UL, CP-104JU, + CP-134U, CP-134U-I, + C104H/PCI, C104HS/PCI, + CP-114, CP-114I, CP-114S, CP-114IS, CP-114UL, + C104H, C104HS, + CI-104J, CI-104JS, + CI-134, CI-134I, CI-134IS, + (C114HI, CT-114I, C104P), + POS-104UL, + CB-114, + CB-134I + + - 8 ports multiport board + CP-118EL, CP-168EL, + CP-118U, CP-168U, + C168H/PCI, + C168H, C168HS, + (C168P), + CB-108 + + This driver and installation procedure have been developed upon Linux Kernel + 2.4.x and 2.6.x. This driver supports Intel x86 hardware platform. In order + to maintain compatibility, this version has also been properly tested with + RedHat, Mandrake, Fedora and S.u.S.E Linux. However, if compatibility problem + occurs, please contact Moxa at support@moxa.com.tw. + + In addition to device driver, useful utilities are also provided in this + version. They are: + + - msdiag + Diagnostic program for displaying installed Moxa + Smartio/Industio boards. + - msmon + Monitor program to observe data count and line status signals. + - msterm A simple terminal program which is useful in testing serial + ports. + - io-irq.exe + Configuration program to setup ISA boards. Please note that + this program can only be executed under DOS. + + All the drivers and utilities are published in form of source code under + GNU General Public License in this version. Please refer to GNU General + Public License announcement in each source code file for more detail. + + In Moxa's Web sites, you may always find latest driver at http://www.moxa.com/. + + This version of driver can be installed as Loadable Module (Module driver) + or built-in into kernel (Static driver). You may refer to following + installation procedure for suitable one. 
Before you install the driver, + please refer to hardware installation procedure in the User's Manual. + + We assume the user should be familiar with following documents. + + - Serial-HOWTO + - Kernel-HOWTO + +2. System Requirement +^^^^^^^^^^^^^^^^^^^^^ + + - Hardware platform: Intel x86 machine + - Kernel version: 2.4.x or 2.6.x + - gcc version 2.72 or later + - Maximum 4 boards can be installed in combination + +3. Installation +^^^^^^^^^^^^^^^ + +3.1 Hardware installation +========================= + + There are two types of buses, ISA and PCI, for Smartio/Industio + family multiport board. + +ISA board +--------- + + You'll have to configure CAP address, I/O address, Interrupt Vector + as well as IRQ before installing this driver. Please refer to hardware + installation procedure in User's Manual before proceed any further. + Please make sure the JP1 is open after the ISA board is set properly. + +PCI/UPCI board +-------------- + + You may need to adjust IRQ usage in BIOS to avoid from IRQ conflict + with other ISA devices. Please refer to hardware installation + procedure in User's Manual in advance. + +PCI IRQ Sharing +--------------- + + Each port within the same multiport board shares the same IRQ. Up to + 4 Moxa Smartio/Industio PCI Family multiport boards can be installed + together on one system and they can share the same IRQ. + + +3.2 Driver files +================ + + The driver file may be obtained from ftp, CD-ROM or floppy disk. The + first step, anyway, is to copy driver file "mxser.tgz" into specified + directory. e.g. /moxa. The execute commands as below:: + + # cd / + # mkdir moxa + # cd /moxa + # tar xvf /dev/fd0 + +or:: + + # cd / + # mkdir moxa + # cd /moxa + # cp /mnt/cdrom//mxser.tgz . + # tar xvfz mxser.tgz + + +3.3 Device naming convention +============================ + + You may find all the driver and utilities files in /moxa/mxser. + Following installation procedure depends on the model you'd like to + run the driver. 
If you prefer module driver, please refer to 3.4. + If static driver is required, please refer to 3.5. + +Dialin and callout port +----------------------- + + This driver remains traditional serial device properties. There are + two special file name for each serial port. One is dial-in port + which is named "ttyMxx". For callout port, the naming convention + is "cumxx". + +Device naming when more than 2 boards installed +----------------------------------------------- + + Naming convention for each Smartio/Industio multiport board is + pre-defined as below. + + ============ =============== ============== + Board Num. Dial-in Port Callout port + 1st board ttyM0 - ttyM7 cum0 - cum7 + 2nd board ttyM8 - ttyM15 cum8 - cum15 + 3rd board ttyM16 - ttyM23 cum16 - cum23 + 4th board ttyM24 - ttym31 cum24 - cum31 + ============ =============== ============== + +.. note:: + + Under Kernel 2.6 and upper, the cum Device is Obsolete. So use ttyM* + device instead. + +Board sequence +-------------- + + This driver will activate ISA boards according to the parameter set + in the driver. After all specified ISA board activated, PCI board + will be installed in the system automatically driven. + Therefore the board number is sorted by the CAP address of ISA boards. + For PCI boards, their sequence will be after ISA boards and C168H/PCI + has higher priority than C104H/PCI boards. + +3.4 Module driver configuration +=============================== + + Module driver is easiest way to install. If you prefer static driver + installation, please skip this paragraph. + + + ------------- Prepare to use the MOXA driver -------------------- + +3.4.1 Create tty device with correct major number +------------------------------------------------- + + Before using MOXA driver, your system must have the tty devices + which are created with driver's major number. We offer one shell + script "msmknod" to simplify the procedure. + This step is only needed to be executed once. 
But you still + need to do this procedure when: + + a. You change the driver's major number. Please refer to the "3.6" + section. + b. The total number of installed MOXA boards changes, for example because you + add or remove one MOXA board. + c. You want to change the tty name. This requires modifying the + shell script "msmknod". + + The procedure is:: + + # cd /moxa/mxser/driver + # ./msmknod + + This shell script will require the major number for dial-in + device and callout device to create tty device. You also need + to specify the total installed MOXA board number. Default major + numbers for dial-in device and callout device are 30, 35. If + you need to change to other numbers, please refer to section "3.6" + for a more detailed procedure. + Msmknod will delete any special files occupying the same device + naming. + +3.4.2 Build the MOXA driver and utilities +----------------------------------------- + + Before using the MOXA driver and utilities, you need to compile + all the source code. This step only needs to be executed once, + but you must re-compile the source code if you modify the source + code. For example, if you change the driver's major number (see + section "3.6"), then you need to do this step again. + + Find "Makefile" in /moxa/mxser, then run + + # make clean; make install + + .. note:: + + For Red Hat 9, Red Hat Enterprise Linux AS3/ES3/WS3 & Fedora Core1: + # make clean; make installsp1 + + For Red Hat Enterprise Linux AS4/ES4/WS4: + # make clean; make installsp2 + + The driver files "mxser.o" and utilities will be properly compiled + and copied to system directories respectively. + +------------- Load MOXA driver-------------------- + +3.4.3 Load the MOXA driver +-------------------------- + + :: + + # modprobe mxser + + will activate the module driver. You may run "lsmod" to check + if "mxser" is activated. If the MOXA board is an ISA board, the + CAP address argument is needed. Please refer to section "3.4.5" for more + information. 
+ +------------- Load MOXA driver on boot -------------------- + +3.4.4 Load the mxser driver +--------------------------- + + + For the above description, you may manually execute + "modprobe mxser" to activate this driver and run + "rmmod mxser" to remove it. + + However, it's better to have a boot time configuration to + eliminate manual operation. Boot time configuration can be + achieved by rc file. We offer one "rc.mxser" file to simplify + the procedure under "moxa/mxser/driver". + + But if you use ISA board, please modify the "modprobe ..." command + to add the argument (see "3.4.5" section). After modifying the + rc.mxser, please try to execute "/moxa/mxser/driver/rc.mxser" + manually to make sure the modification is ok. If any error + encountered, please try to modify again. If the modification is + completed, follow the below step. + + Run following command for setting rc files:: + + # cd /moxa/mxser/driver + # cp ./rc.mxser /etc/rc.d + # cd /etc/rc.d + + Check "rc.serial" is existed or not. If "rc.serial" doesn't exist, + create it by vi, run "chmod 755 rc.serial" to change the permission. + + Add "/etc/rc.d/rc.mxser" in last line. + + Reboot and check if moxa.o activated by "lsmod" command. + +3.4.5. specify CAP address +-------------------------- + + If you'd like to drive Smartio/Industio ISA boards in the system, + you'll have to add parameter to specify CAP address of given + board while activating "mxser.o". The format for parameters are + as follows.:: + + modprobe mxser ioaddr=0x???,0x???,0x???,0x??? + | | | | + | | | +- 4th ISA board + | | +------ 3rd ISA board + | +------------ 2nd ISA board + +-------------------1st ISA board + +3.5 Static driver configuration for Linux kernel 2.4.x and 2.6.x +================================================================ + + Note: + To use static driver, you must install the linux kernel + source package. 
+ +3.5.1 Backup the built-in driver in the kernel +---------------------------------------------- + + :: + + # cd /usr/src/linux/drivers/char + # mv mxser.c mxser.c.old + + For Red Hat 7.x user, you need to create link: + # cd /usr/src + # ln -s linux-2.4 linux + +3.5.2 Create link +----------------- + :: + + # cd /usr/src/linux/drivers/char + # ln -s /moxa/mxser/driver/mxser.c mxser.c + +3.5.3 Add CAP address list for ISA boards. +------------------------------------------ + + For PCI boards user, please skip this step. + + In module mode, the CAP address for ISA board is given by + parameter. In static driver configuration, you'll have to + assign it within driver's source code. If you will not + install any ISA boards, you may skip to next portion. + The instructions to modify driver source code are as + below. + + a. run:: + + # cd /moxa/mxser/driver + # vi mxser.c + + b. Find the array mxserBoardCAP[] as below:: + + static int mxserBoardCAP[] = {0x00, 0x00, 0x00, 0x00}; + + c. Change the address within this array using vi. For + example, to driver 2 ISA boards with CAP address + 0x280 and 0x180 as 1st and 2nd board. Just to change + the source code as follows:: + + static int mxserBoardCAP[] = {0x280, 0x180, 0x00, 0x00}; + +3.5.4 Setup kernel configuration +-------------------------------- + + Configure the kernel:: + + # cd /usr/src/linux + # make menuconfig + + You will go into a menu-driven system. Please select [Character + devices][Non-standard serial port support], enable the [Moxa + SmartIO support] driver with "[*]" for built-in (not "[M]"), then + select [Exit] to exit this program. + +3.5.5 Rebuild kernel +-------------------- + + The following are for Linux kernel rebuilding, for your + reference only. + + For appropriate details, please refer to the Linux document: + + a. 
Run the following commands:: + + cd /usr/src/linux + make clean # take a few minutes + make dep # take a few minutes + make bzImage # take probably 10-20 minutes + make install # copy boot image to correct position + + f. Please make sure the boot kernel (vmlinuz) is in the + correct position. + g. If you use 'lilo' utility, you should check /etc/lilo.conf + 'image' item specified the path which is the 'vmlinuz' path, + or you will load wrong (or old) boot kernel image (vmlinuz). + After checking /etc/lilo.conf, please run "lilo". + + Note that if the result of "make bzImage" is ERROR, then you have to + go back to Linux configuration Setup. Type "make menuconfig" in + directory /usr/src/linux. + + +3.5.6 Make tty device and special file +-------------------------------------- + + :: + # cd /moxa/mxser/driver + # ./msmknod + +3.5.7 Make utility +------------------ + + :: + + # cd /moxa/mxser/utility + # make clean; make install + +3.5.8 Reboot +------------ + + + +3.6 Custom configuration +======================== + + Although this driver already provides you default configuration, you + still can change the device name and major number. The instruction to + change these parameters are shown as below. + +a. Change Device name + + If you'd like to use other device names instead of default naming + convention, all you have to do is to modify the internal code + within the shell script "msmknod". First, you have to open "msmknod" + by vi. Locate each line contains "ttyM" and "cum" and change them + to the device name you desired. "msmknod" creates the device names + you need next time executed. + +b. Change Major number + + If major number 30 and 35 had been occupied, you may have to select + 2 free major numbers for this driver. There are 3 steps to change + major numbers. + +3.6.1 Find free major numbers +----------------------------- + + In /proc/devices, you may find all the major numbers occupied + in the system. Please select 2 major numbers that are available. 
+ e.g. 40, 45. + +3.6.2 Create special files +-------------------------- + + Run /moxa/mxser/driver/msmknod to create special files with + specified major numbers. + +3.6.3 Modify driver with new major number +----------------------------------------- + + Run vi to open /moxa/mxser/driver/mxser.c. Locate the line + contains "MXSERMAJOR". Change the content as below:: + + #define MXSERMAJOR 40 + #define MXSERCUMAJOR 45 + + 3.6.4 Run "make clean; make install" in /moxa/mxser/driver. + +3.7 Verify driver installation +============================== + + You may refer to /var/log/messages to check the latest status + log reported by this driver whenever it's activated. + +4. Utilities +^^^^^^^^^^^^ + + There are 3 utilities contained in this driver. They are msdiag, msmon and + msterm. These 3 utilities are released in form of source code. They should + be compiled into executable file and copied into /usr/bin. + + Before using these utilities, please load driver (refer 3.4 & 3.5) and + make sure you had run the "msmknod" utility. + +msdiag - Diagnostic +=================== + + This utility provides the function to display what Moxa Smartio/Industio + board found by driver in the system. + +msmon - Port Monitoring +======================= + + This utility gives the user a quick view about all the MOXA ports' + activities. One can easily learn each port's total received/transmitted + (Rx/Tx) character count since the time when the monitoring is started. + + Rx/Tx throughputs per second are also reported in interval basis (e.g. + the last 5 seconds) and in average basis (since the time the monitoring + is started). You can reset all ports' count by key. <+> <-> + (plus/minus) keys to change the displaying time interval. Press + on the port, that cursor stay, to view the port's communication + parameters, signal status, and input/output queue. 
+ +msterm - Terminal Emulation +=========================== + + This utility provides data sending and receiving ability of all tty ports, + especially for MOXA ports. It is quite useful for testing simple + application, for example, sending AT command to a modem connected to the + port or used as a terminal for login purpose. Note that this is only a + dumb terminal emulation without handling full screen operation. + +5. Setserial +^^^^^^^^^^^^ + + Supported Setserial parameters are listed as below. + + ============== ========================================================= + uart set UART type(16450-->disable FIFO, 16550A-->enable FIFO) + close_delay set the amount of time(in 1/100 of a second) that DTR + should be kept low while being closed. + closing_wait set the amount of time(in 1/100 of a second) that the + serial port should wait for data to be drained while + being closed, before the receiver is disable. + spd_hi Use 57.6kb when the application requests 38.4kb. + spd_vhi Use 115.2kb when the application requests 38.4kb. + spd_shi Use 230.4kb when the application requests 38.4kb. + spd_warp Use 460.8kb when the application requests 38.4kb. + spd_normal Use 38.4kb when the application requests 38.4kb. + spd_cust Use the custom divisor to set the speed when the + application requests 38.4kb. + divisor This option set the custom division. + baud_base This option set the base baud rate. + ============== ========================================================= + +6. Troubleshooting +^^^^^^^^^^^^^^^^^^ + + The boot time error messages and solutions are stated as clearly as + possible. If all the possible solutions fail, please contact our technical + support team to get more help. + + + Error msg: + More than 4 Moxa Smartio/Industio family boards found. Fifth board + and after are ignored. + + Solution: + To avoid this problem, please unplug fifth and after board, because Moxa + driver supports up to 4 boards. + + Error msg: + Request_irq fail, IRQ(?) 
may be conflict with another device. + + Solution: + Other PCI or ISA devices occupy the assigned IRQ. If you are not sure + which device causes the situation, please check /proc/interrupts to find + a free IRQ and assign it to the Moxa board. + + Error msg: + Board #: C1xx Series(CAP=xxx) interrupt number invalid. + + Solution: + Each port within the same multiport board shares the same IRQ. Please set + one non-zero IRQ for each Moxa board. + + Error msg: + No interrupt vector be set for Moxa ISA board(CAP=xxx). + + Solution: + A Moxa ISA board needs an interrupt vector. Please refer to the user's manual + "Hardware Installation" chapter to set the interrupt vector. + + Error msg: + Couldn't install MOXA Smartio/Industio family driver! + + Solution: + Loading the Moxa driver failed; the major number may conflict with other devices. + Please refer to section 3.6 to select a free major number for + the Moxa driver. + + Error msg: + Couldn't install MOXA Smartio/Industio family callout driver! + + Solution: + Loading the Moxa callout driver failed; the callout device major number may + conflict with other devices. Please refer to section 3.6 to + select a free callout device major number for the Moxa driver. diff --git a/Documentation/driver-api/serial/n_gsm.rst b/Documentation/driver-api/serial/n_gsm.rst new file mode 100644 index 000000000000..f3ad9fd26408 --- /dev/null +++ b/Documentation/driver-api/serial/n_gsm.rst @@ -0,0 +1,103 @@ +============================== +GSM 0710 tty multiplexor HOWTO +============================== + +This line discipline implements the GSM 07.10 multiplexing protocol +detailed in the following 3GPP document: + + http://www.3gpp.org/ftp/Specs/archive/07_series/07.10/0710-720.zip + +This document gives some hints on how to use this driver with GPRS and 3G +modems connected to a physical serial port. + +How to use it +------------- +1.
initialize the modem in 0710 mux mode (usually AT+CMUX= command) through + its serial port. Depending on the modem used, you can pass more or fewer + parameters to this command, +2. switch the serial line to using the n_gsm line discipline by using + TIOCSETD ioctl, +3. configure the mux using GSMIOC_GETCONF / GSMIOC_SETCONF ioctl, + +Major parts of the initialization program : +(a good starting point is util-linux-ng/sys-utils/ldattach.c):: + + #include <linux/gsmmux.h> + #define N_GSM0710 21 /* GSM 0710 Mux */ + #define DEFAULT_SPEED B115200 + #define SERIAL_PORT "/dev/ttyS0" + + int fd, ldisc = N_GSM0710; + struct gsm_config c; + struct termios configuration; + + /* open the serial port connected to the modem */ + fd = open(SERIAL_PORT, O_RDWR | O_NOCTTY | O_NDELAY); + + /* configure the serial port : speed, flow control ... */ + + /* send the AT commands to switch the modem to CMUX mode + and check that it's successful (should return OK) */ + write(fd, "AT+CMUX=0\r", 10); + + /* experience showed that some modems need some time before + being able to answer to the first MUX packet so a delay + may be needed here in some case */ + sleep(3); + + /* use n_gsm line discipline */ + ioctl(fd, TIOCSETD, &ldisc); + + /* get n_gsm configuration */ + ioctl(fd, GSMIOC_GETCONF, &c); + /* we are initiator and need encoding 0 (basic) */ + c.initiator = 1; + c.encapsulation = 0; + /* our modem defaults to a maximum size of 127 bytes */ + c.mru = 127; + c.mtu = 127; + /* set the new configuration */ + ioctl(fd, GSMIOC_SETCONF, &c); + + /* and wait for ever to keep the line discipline enabled */ + daemon(0,0); + pause(); + +4. create the devices corresponding to the "virtual" serial ports (take care, + each modem has its configuration and some DLC have dedicated functions, + for example GPS), starting with minor 1 (DLC0 is reserved for the management + of the mux):: + + MAJOR=`cat /proc/devices |grep gsmtty | awk '{print $1}'` + for i in `seq 1 4`; do + mknod /dev/ttygsm$i c $MAJOR $i + done + +5.
use these devices as plain serial ports. + + for example, it's possible: + + - and to use gnokii to send / receive SMS on ttygsm1 + - to use ppp to establish a datalink on ttygsm2 + +6. first close all virtual ports before closing the physical port. + + Note that after closing the physical port the modem is still in multiplexing + mode. This may prevent a successful re-opening of the port later. To avoid + this situation either reset the modem if your hardware allows that or send + a disconnect command frame manually before initializing the multiplexing mode + for the second time. The byte sequence for the disconnect command frame is:: + + 0xf9, 0x03, 0xef, 0x03, 0xc3, 0x16, 0xf9. + +Additional Documentation +------------------------ +More practical details on the protocol and how it's supported by industrial +modems can be found in the following documents : + +- http://www.telit.com/module/infopool/download.php?id=616 +- http://www.u-blox.com/images/downloads/Product_Docs/LEON-G100-G200-MuxImplementation_ApplicationNote_%28GSM%20G1-CS-10002%29.pdf +- http://www.sierrawireless.com/Support/Downloads/AirPrime/WMP_Series/~/media/Support_Downloads/AirPrime/Application_notes/CMUX_Feature_Application_Note-Rev004.ashx +- http://wm.sim.com/sim/News/photo/2010721161442.pdf + +11-03-08 - Eric Bénard - diff --git a/Documentation/driver-api/serial/rocket.rst b/Documentation/driver-api/serial/rocket.rst new file mode 100644 index 000000000000..23761eae4282 --- /dev/null +++ b/Documentation/driver-api/serial/rocket.rst @@ -0,0 +1,185 @@ +================================================ +Comtrol(tm) RocketPort(R)/RocketModem(TM) Series +================================================ + +Device Driver for the Linux Operating System +============================================ + +Product overview +---------------- + +This driver provides a loadable kernel driver for the Comtrol RocketPort +and RocketModem PCI boards. 
These boards provide, 2, 4, 8, 16, or 32 +high-speed serial ports or modems. This driver supports up to a combination +of four RocketPort or RocketModems boards in one machine simultaneously. +This file assumes that you are using the RocketPort driver which is +integrated into the kernel sources. + +The driver can also be installed as an external module using the usual +"make;make install" routine. This external module driver, obtainable +from the Comtrol website listed below, is useful for updating the driver +or installing it into kernels which do not have the driver configured +into them. Installations instructions for the external module +are in the included README and HW_INSTALL files. + +RocketPort ISA and RocketModem II PCI boards currently are only supported by +this driver in module form. + +The RocketPort ISA board requires I/O ports to be configured by the DIP +switches on the board. See the section "ISA Rocketport Boards" below for +information on how to set the DIP switches. + +You pass the I/O port to the driver using the following module parameters: + +board1: + I/O port for the first ISA board +board2: + I/O port for the second ISA board +board3: + I/O port for the third ISA board +board4: + I/O port for the fourth ISA board + +There is a set of utilities and scripts provided with the external driver +(downloadable from http://www.comtrol.com) that ease the configuration and +setup of the ISA cards. + +The RocketModem II PCI boards require firmware to be loaded into the card +before it will function. The driver has only been tested as a module for this +board. + +Installation Procedures +----------------------- + +RocketPort/RocketModem PCI cards require no driver configuration, they are +automatically detected and configured. + +The RocketPort driver can be installed as a module (recommended) or built +into the kernel. 
This is selected, as for other drivers, through the `make config` +command from the root of the Linux source tree during the kernel build process. + +The RocketPort/RocketModem serial ports installed by this driver are assigned +device major number 46, and will be named /dev/ttyRx, where x is the port number +starting at zero (e.g. /dev/ttyR0, /dev/ttyR1, ...). If you have multiple cards +installed in the system, the mapping of port names to serial ports is displayed +in the system log at /var/log/messages. + +If installed as a module, the module must be loaded. This can be done +manually by entering "modprobe rocket". To have the module loaded automatically +upon system boot, edit a `/etc/modprobe.d/*.conf` file and add the line +"alias char-major-46 rocket". + +In order to use the ports, their device names (nodes) must be created with mknod. +This is only required once; the system will retain the names once created. To +create the RocketPort/RocketModem device names, use the command +"mknod /dev/ttyRx c 46 x" where x is the port number starting at zero. + +For example:: + + > mknod /dev/ttyR0 c 46 0 + > mknod /dev/ttyR1 c 46 1 + > mknod /dev/ttyR2 c 46 2 + +The Linux script MAKEDEV will create the first 16 ttyRx device names (nodes) +for you:: + + >/dev/MAKEDEV ttyR + +ISA Rocketport Boards +--------------------- + +You must assign and configure the I/O addresses used by the ISA Rocketport +card before installing and using it. This is done by setting a set of DIP +switches on the Rocketport board. + + +Setting the I/O address +----------------------- + +Before installing RocketPort(R) or RocketPort RA boards, you must find +a range of I/O addresses for it to use. The first RocketPort card +requires a 68-byte contiguous block of I/O addresses, starting at one +of the following: 0x100h, 0x140h, 0x180h, 0x200h, 0x240h, 0x280h, +0x300h, 0x340h, 0x380h. This I/O address must be reflected in the DIP +switches of *all* of the Rocketport cards.
+ +The second, third, and fourth RocketPort cards require a 64-byte +contiguous block of I/O addresses, starting at one of the following +I/O addresses: 0x100h, 0x140h, 0x180h, 0x1C0h, 0x200h, 0x240h, 0x280h, +0x2C0h, 0x300h, 0x340h, 0x380h, 0x3C0h. The I/O address used by the +second, third, and fourth Rocketport cards (if present) are set via +software control. The DIP switch settings for the I/O address must be +set to the value of the first Rocketport cards. + +In order to distinguish each of the card from the others, each card +must have a unique board ID set on the dip switches. The first +Rocketport board must be set with the DIP switches corresponding to +the first board, the second board must be set with the DIP switches +corresponding to the second board, etc. IMPORTANT: The board ID is +the only place where the DIP switch settings should differ between the +various Rocketport boards in a system. + +The I/O address range used by any of the RocketPort cards must not +conflict with any other cards in the system, including other +RocketPort cards. Below, you will find a list of commonly used I/O +address ranges which may be in use by other devices in your system. +On a Linux system, "cat /proc/ioports" will also be helpful in +identifying what I/O addresses are being used by devices on your +system. + +Remember, the FIRST RocketPort uses 68 I/O addresses. So, if you set it +for 0x100, it will occupy 0x100 to 0x143. This would mean that you +CAN NOT set the second, third or fourth board for address 0x140 since +the first 4 bytes of that range are used by the first board. You would +need to set the second, third, or fourth board to one of the next available +blocks such as 0x180. 
+ +RocketPort and RocketPort RA SW1 Settings:: + + +-------------------------------+ + | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | + +-------+-------+---------------+ + | Unused| Card | I/O Port Block| + +-------------------------------+ + + DIP Switches DIP Switches + 7 8 6 5 + =================== =================== + On On UNUSED, MUST BE ON. On On First Card <==== Default + On Off Second Card + Off On Third Card + Off Off Fourth Card + + DIP Switches I/O Address Range + 4 3 2 1 Used by the First Card + ===================================== + On Off On Off 100-143 + On Off Off On 140-183 + On Off Off Off 180-1C3 <==== Default + Off On On Off 200-243 + Off On Off On 240-283 + Off On Off Off 280-2C3 + Off Off On Off 300-343 + Off Off Off On 340-383 + Off Off Off Off 380-3C3 + +Reporting Bugs +-------------- + +For technical support, please provide the following +information: Driver version, kernel release, distribution of +kernel, and type of board you are using. Error messages, log +printouts, and port configuration details are especially helpful. + +USA: + :Phone: (612) 494-4100 + :FAX: (612) 494-4199 + :email: support@comtrol.com + +Comtrol Europe: + :Phone: +44 (0) 1 869 323-220 + :FAX: +44 (0) 1 869 323-211 + :email: support@comtrol.co.uk + +Web: http://www.comtrol.com +FTP: ftp.comtrol.com diff --git a/Documentation/driver-api/serial/serial-iso7816.rst b/Documentation/driver-api/serial/serial-iso7816.rst new file mode 100644 index 000000000000..d990143de0c6 --- /dev/null +++ b/Documentation/driver-api/serial/serial-iso7816.rst @@ -0,0 +1,90 @@ +============================= +ISO7816 Serial Communications +============================= + +1. Introduction +=============== + + ISO/IEC7816 is a series of standards specifying integrated circuit cards (ICC) + also known as smart cards. + +2.
Hardware-related considerations +================================== + + Some CPUs/UARTs (e.g., Microchip AT91) contain a built-in mode capable of + handling communication with a smart card. + + For these microcontrollers, the Linux driver should be made capable of + working in both modes, and proper ioctls (see later) should be made + available at user-level to allow switching from one mode to the other, and + vice versa. + +3. Data Structures Already Available in the Kernel +================================================== + + The Linux kernel provides the serial_iso7816 structure (see [1]) to handle + ISO7816 communications. This data structure is used to set and configure + ISO7816 parameters in ioctls. + + Any driver for devices capable of working both as RS232 and ISO7816 should + implement the iso7816_config callback in the uart_port structure. The + serial_core calls iso7816_config to do the device-specific part in response + to TIOCGISO7816 and TIOCSISO7816 ioctls (see below). The iso7816_config + callback receives a pointer to struct serial_iso7816. + +4. Usage from user-level +======================== + + From user-level, ISO7816 configuration can be read or set using the previous + ioctls. For instance, to set ISO7816 you can use the following code:: + + #include <linux/serial.h> + + /* Include definition for ISO7816 ioctls: TIOCSISO7816 and TIOCGISO7816 */ + #include <sys/ioctl.h> + + /* Open your specific device (e.g., /dev/mydevice): */ + int fd = open ("/dev/mydevice", O_RDWR); + if (fd < 0) { + /* Error handling. See errno.
*/ + } + + struct serial_iso7816 iso7816conf; + + /* Reserved fields have to be zeroed */ + memset(&iso7816conf, 0, sizeof(iso7816conf)); + + /* Enable ISO7816 mode: */ + iso7816conf.flags |= SER_ISO7816_ENABLED; + + /* Select the protocol: */ + /* T=0 */ + iso7816conf.flags |= SER_ISO7816_T(0); + /* or T=1 */ + iso7816conf.flags |= SER_ISO7816_T(1); + + /* Set the guard time: */ + iso7816conf.tg = 2; + + /* Set the clock frequency: */ + iso7816conf.clk = 3571200; + + /* Set transmission factors: */ + iso7816conf.sc_fi = 372; + iso7816conf.sc_di = 1; + + if (ioctl(fd, TIOCSISO7816, &iso7816conf) < 0) { + /* Error handling. See errno. */ + } + + /* Use read() and write() syscalls here... */ + + /* Close the device when finished: */ + if (close (fd) < 0) { + /* Error handling. See errno. */ + } + +5. References +============= + + [1] include/uapi/linux/serial.h diff --git a/Documentation/driver-api/serial/serial-rs485.rst b/Documentation/driver-api/serial/serial-rs485.rst new file mode 100644 index 000000000000..6bc824f948f9 --- /dev/null +++ b/Documentation/driver-api/serial/serial-rs485.rst @@ -0,0 +1,103 @@ +=========================== +RS485 Serial Communications +=========================== + +1. Introduction +=============== + + EIA-485, also known as TIA/EIA-485 or RS-485, is a standard defining the + electrical characteristics of drivers and receivers for use in balanced + digital multipoint systems. + This standard is widely used for communications in industrial automation + because it can be used effectively over long distances and in electrically + noisy environments. + +2. Hardware-related Considerations +================================== + + Some CPUs/UARTs (e.g., Atmel AT91 or 16C950 UART) contain a built-in + half-duplex mode capable of automatically controlling line direction by + toggling RTS or DTR signals.
That can be used to control external + half-duplex hardware like an RS485 transceiver or any RS232-connected + half-duplex devices like some modems. + + For these microcontrollers, the Linux driver should be made capable of + working in both modes, and proper ioctls (see later) should be made + available at user-level to allow switching from one mode to the other, and + vice versa. + +3. Data Structures Already Available in the Kernel +================================================== + + The Linux kernel provides the serial_rs485 structure (see [1]) to handle + RS485 communications. This data structure is used to set and configure RS485 + parameters in the platform data and in ioctls. + + The device tree can also provide RS485 boot time parameters (see [2] + for bindings). The driver is in charge of filling this data structure from + the values given by the device tree. + + Any driver for devices capable of working both as RS232 and RS485 should + implement the rs485_config callback in the uart_port structure. The + serial_core calls rs485_config to do the device-specific part in response + to TIOCSRS485 and TIOCGRS485 ioctls (see below). The rs485_config callback + receives a pointer to struct serial_rs485. + +4. Usage from user-level +======================== + + From user-level, RS485 configuration can be read or set using the previous + ioctls. For instance, to set RS485 you can use the following code:: + + #include <linux/serial.h> + + /* Include definition for RS485 ioctls: TIOCGRS485 and TIOCSRS485 */ + #include <sys/ioctl.h> + + /* Open your specific device (e.g., /dev/mydevice): */ + int fd = open ("/dev/mydevice", O_RDWR); + if (fd < 0) { + /* Error handling. See errno.
*/ + } + + struct serial_rs485 rs485conf; + + /* Enable RS485 mode: */ + rs485conf.flags |= SER_RS485_ENABLED; + + /* Set logical level for RTS pin equal to 1 when sending: */ + rs485conf.flags |= SER_RS485_RTS_ON_SEND; + /* or, set logical level for RTS pin equal to 0 when sending: */ + rs485conf.flags &= ~(SER_RS485_RTS_ON_SEND); + + /* Set logical level for RTS pin equal to 1 after sending: */ + rs485conf.flags |= SER_RS485_RTS_AFTER_SEND; + /* or, set logical level for RTS pin equal to 0 after sending: */ + rs485conf.flags &= ~(SER_RS485_RTS_AFTER_SEND); + + /* Set rts delay before send, if needed: */ + rs485conf.delay_rts_before_send = ...; + + /* Set rts delay after send, if needed: */ + rs485conf.delay_rts_after_send = ...; + + /* Set this flag if you want to receive data even while sending data */ + rs485conf.flags |= SER_RS485_RX_DURING_TX; + + if (ioctl (fd, TIOCSRS485, &rs485conf) < 0) { + /* Error handling. See errno. */ + } + + /* Use read() and write() syscalls here... */ + + /* Close the device when finished: */ + if (close (fd) < 0) { + /* Error handling. See errno. */ + } + +5. References +============= + + [1] include/uapi/linux/serial.h + + [2] Documentation/devicetree/bindings/serial/rs485.txt diff --git a/Documentation/driver-api/serial/tty.rst b/Documentation/driver-api/serial/tty.rst new file mode 100644 index 000000000000..dd972caacf3e --- /dev/null +++ b/Documentation/driver-api/serial/tty.rst @@ -0,0 +1,328 @@ +================= +The Lockronomicon +================= + +Your guide to the ancient and twisted locking policies of the tty layer and +the warped logic behind them. Beware all ye who read on. + + +Line Discipline +--------------- + +Line disciplines are registered with tty_register_ldisc() passing the +discipline number and the ldisc structure. At the point of registration the +discipline must be ready to use and it is possible it will get used before +the call returns success. 
If the call returns an error then it won't get +called. Do not re-use ldisc numbers as they are part of the userspace ABI +and writing over an existing ldisc will cause demons to eat your computer. +After the return the ldisc data has been copied so you may free your own +copy of the structure. You must not re-register over the top of the line +discipline even with the same data or your computer again will be eaten by +demons. + +In order to remove a line discipline call tty_unregister_ldisc(). +In ancient times this always worked. In modern times the function will +return -EBUSY if the ldisc is currently in use. Since the ldisc referencing +code manages the module counts this should not usually be a concern. + +Heed this warning: the reference count field of the registered copies of the +tty_ldisc structure in the ldisc table counts the number of lines using this +discipline. The reference count of the tty_ldisc structure within a tty +counts the number of active users of the ldisc at this instant. In effect it +counts the number of threads of execution within an ldisc method (plus those +about to enter and exit although this detail matters not). + +Line Discipline Methods +----------------------- + +TTY side interfaces +^^^^^^^^^^^^^^^^^^^ + +======================= ======================================================= +open() Called when the line discipline is attached to + the terminal. No other call into the line + discipline for this tty will occur until it + completes successfully. Should initialize any + state needed by the ldisc, and set receive_room + in the tty_struct to the maximum amount of data + the line discipline is willing to accept from the + driver with a single call to receive_buf(). + Returning an error will prevent the ldisc from + being attached. Can sleep. + +close() This is called on a terminal when the line + discipline is being unplugged. At the point of + execution no further users will enter the + ldisc code for this tty. Can sleep. 
+ +hangup() Called when the tty line is hung up. + The line discipline should cease I/O to the tty. + No further calls into the ldisc code will occur. + The return value is ignored. Can sleep. + +read() (optional) A process requests reading data from + the line. Multiple read calls may occur in parallel + and the ldisc must deal with serialization issues. + If not defined, the process will receive an EIO + error. May sleep. + +write() (optional) A process requests writing data to the + line. Multiple write calls are serialized by the + tty layer for the ldisc. If not defined, the + process will receive an EIO error. May sleep. + +flush_buffer() (optional) May be called at any point between + open and close, and instructs the line discipline + to empty its input buffer. + +set_termios() (optional) Called on termios structure changes. + The caller passes the old termios data and the + current data is in the tty. Called under the + termios semaphore so allowed to sleep. Serialized + against itself only. + +poll() (optional) Check the status for the poll/select + calls. Multiple poll calls may occur in parallel. + May sleep. + +ioctl() (optional) Called when an ioctl is handed to the + tty layer that might be for the ldisc. Multiple + ioctl calls may occur in parallel. May sleep. + +compat_ioctl() (optional) Called when a 32 bit ioctl is handed + to the tty layer that might be for the ldisc. + Multiple ioctl calls may occur in parallel. + May sleep. +======================= ======================================================= + +Driver Side Interfaces +^^^^^^^^^^^^^^^^^^^^^^ + +======================= ======================================================= +receive_buf() (optional) Called by the low-level driver to hand + a buffer of received bytes to the ldisc for + processing. The number of bytes is guaranteed not + to exceed the current value of tty->receive_room. + All bytes must be processed. 
+ +receive_buf2() (optional) Called by the low-level driver to hand + a buffer of received bytes to the ldisc for + processing. Returns the number of bytes processed. + + If both receive_buf() and receive_buf2() are + defined, receive_buf2() should be preferred. + +write_wakeup() May be called at any point between open and close. + The TTY_DO_WRITE_WAKEUP flag indicates if a call + is needed but always races versus calls. Thus the + ldisc must be careful about setting order and to + handle unexpected calls. Must not sleep. + + The driver is forbidden from calling this directly + from the ->write call from the ldisc as the ldisc + is permitted to call the driver write method from + this function. In such a situation defer it. + +dcd_change() Report to the tty line the current DCD pin status + changes and the relative timestamp. The timestamp + cannot be NULL. +======================= ======================================================= + + +Driver Access +^^^^^^^^^^^^^ + +Line discipline methods can call the following methods of the underlying +hardware driver through the function pointers within the tty->driver +structure: + +======================= ======================================================= +write() Write a block of characters to the tty device. + Returns the number of characters accepted. The + character buffer passed to this method is already + in kernel space. + +put_char() Queues a character for writing to the tty device. + If there is no room in the queue, the character is + ignored. + +flush_chars() (Optional) If defined, must be called after + queueing characters with put_char() in order to + start transmission. + +write_room() Returns the numbers of characters the tty driver + will accept for queueing to be written. + +ioctl() Invoke device specific ioctl. + Expects data pointers to refer to userspace. + Returns ENOIOCTLCMD for unrecognized ioctl numbers. 
+ +set_termios() Notify the tty driver that the device's termios + settings have changed. New settings are in + tty->termios. Previous settings should be passed in + the "old" argument. + + The API is defined such that the driver should return + the actual modes selected. This means that the + driver function is responsible for modifying any + bits in the request it cannot fulfill to indicate + the actual modes being used. A device with no + hardware capability for change (e.g. a USB dongle or + virtual port) can provide NULL for this method. + +throttle() Notify the tty driver that input buffers for the + line discipline are close to full, and it should + somehow signal that no more characters should be + sent to the tty. + +unthrottle() Notify the tty driver that characters can now be + sent to the tty without fear of overrunning the + input buffers of the line disciplines. + +stop() Ask the tty driver to stop outputting characters + to the tty device. + +start() Ask the tty driver to resume sending characters + to the tty device. + +hangup() Ask the tty driver to hang up the tty device. + +break_ctl() (Optional) Ask the tty driver to turn on or off + BREAK status on the RS-232 port. If state is -1, + then the BREAK status should be turned on; if + state is 0, then BREAK should be turned off. + If this routine is not implemented, use ioctls + TIOCSBRK / TIOCCBRK instead. + +wait_until_sent() Waits until the device has written out all of the + characters in its transmitter FIFO. + +send_xchar() Send a high-priority XON/XOFF character to the device. +======================= ======================================================= + + +Flags +^^^^^ + +Line discipline methods have access to tty->flags field containing the +following interesting flags: + +======================= ======================================================= +TTY_THROTTLED Driver input is throttled. 
The ldisc should call + tty->driver->unthrottle() in order to resume + reception when it is ready to process more data. + +TTY_DO_WRITE_WAKEUP If set, causes the driver to call the ldisc's + write_wakeup() method in order to resume + transmission when it can accept more data + to transmit. + +TTY_IO_ERROR If set, causes all subsequent userspace read/write + calls on the tty to fail, returning -EIO. + +TTY_OTHER_CLOSED Device is a pty and the other side has closed. + +TTY_NO_WRITE_SPLIT Prevent driver from splitting up writes into + smaller chunks. +======================= ======================================================= + + +Locking +^^^^^^^ + +Callers to the line discipline functions from the tty layer are required to +take line discipline locks. The same is true of calls from the driver side +but not yet enforced. + +Three calls are now provided:: + + ldisc = tty_ldisc_ref(tty); + +takes a handle to the line discipline in the tty and returns it. If no ldisc +is currently attached or the ldisc is being closed and re-opened at this +point then NULL is returned. While this handle is held the ldisc will not +change or go away:: + + tty_ldisc_deref(ldisc) + +Returns the ldisc reference and allows the ldisc to be closed. Returning the +reference takes away your right to call the ldisc functions until you take +a new reference:: + + ldisc = tty_ldisc_ref_wait(tty); + +Performs the same function as tty_ldisc_ref except that it will wait for an +ldisc change to complete and then return a reference to the new ldisc. + +While these functions are slightly slower than the old code they should have +minimal impact as most receive logic uses the flip buffers and they only +need to take a reference when they push bits up through the driver. + +A caution: The ldisc->open(), ldisc->close() and driver->set_ldisc +functions are called with the ldisc unavailable. Thus tty_ldisc_ref will +fail in this situation if used within these functions. 
Ldisc and driver +code calling its own functions must be careful in this case. + + +Driver Interface +---------------- + +======================= ======================================================= +open() Called when a device is opened. May sleep + +close() Called when a device is closed. At the point of + return from this call the driver must make no + further ldisc calls of any kind. May sleep + +write() Called to write bytes to the device. May not + sleep. May occur in parallel in special cases. + Because this includes panic paths drivers generally + shouldn't try and do clever locking here. + +put_char() Stuff a single character onto the queue. The + driver is guaranteed following up calls to + flush_chars. + +flush_chars() Ask the kernel to write put_char queue + +write_room() Return the number of characters that can be stuffed + into the port buffers without overflow (or less). + The ldisc is responsible for being intelligent + about multi-threading of write_room/write calls + +ioctl() Called when an ioctl may be for the driver + +set_termios() Called on termios change, serialized against + itself by a semaphore. May sleep. + +set_ldisc() Notifier for discipline change. At the point this + is done the discipline is not yet usable. Can now + sleep (I think) + +throttle() Called by the ldisc to ask the driver to do flow + control. Serialization including with unthrottle + is the job of the ldisc layer. + +unthrottle() Called by the ldisc to ask the driver to stop flow + control. + +stop() Ldisc notifier to the driver to stop output. As with + throttle the serializations with start() are down + to the ldisc layer. + +start() Ldisc notifier to the driver to start output. + +hangup() Ask the tty driver to cause a hangup initiated + from the host side. [Can sleep ??] + +break_ctl() Send RS232 break. Can sleep. Can get called in + parallel, driver must serialize (for now), and + with write calls. 
+ +wait_until_sent() Wait for characters to exit the hardware queue + of the driver. Can sleep + +send_xchar() Send XON/XOFF and if possible jump the queue with + it in order to get fast flow control responses. + Cannot sleep ?? +======================= ======================================================= diff --git a/Documentation/serial/cyclades_z.rst b/Documentation/serial/cyclades_z.rst deleted file mode 100644 index 532ff67e2f1c..000000000000 --- a/Documentation/serial/cyclades_z.rst +++ /dev/null @@ -1,11 +0,0 @@ -================ -Cyclades-Z notes -================ - -The Cyclades-Z must have firmware loaded onto the card before it will -operate. This operation should be performed during system startup, - -The firmware, loader program and the latest device driver code are -available from Cyclades at - - ftp://ftp.cyclades.com/pub/cyclades/cyclades-z/linux/ diff --git a/Documentation/serial/driver.rst b/Documentation/serial/driver.rst deleted file mode 100644 index 4537119bf624..000000000000 --- a/Documentation/serial/driver.rst +++ /dev/null @@ -1,549 +0,0 @@ -==================== -Low Level Serial API -==================== - - -This document is meant as a brief overview of some aspects of the new serial -driver. It is not complete, any questions you have should be directed to - - -The reference implementation is contained within amba-pl011.c. - - - -Low Level Serial Hardware Driver --------------------------------- - -The low level serial hardware driver is responsible for supplying port -information (defined by uart_port) and a set of control methods (defined -by uart_ops) to the core serial driver. The low level driver is also -responsible for handling interrupts for the port, and providing any -console support. - - -Console Support ---------------- - -The serial core provides a few helper functions. This includes identifing -the correct port structure (via uart_get_console) and decoding command line -arguments (uart_parse_options). 
- -There is also a helper function (uart_console_write) which performs a -character by character write, translating newlines to CRLF sequences. -Driver writers are recommended to use this function rather than implementing -their own version. - - -Locking -------- - -It is the responsibility of the low level hardware driver to perform the -necessary locking using port->lock. There are some exceptions (which -are described in the uart_ops listing below.) - -There are two locks. A per-port spinlock, and an overall semaphore. - -From the core driver perspective, the port->lock locks the following -data:: - - port->mctrl - port->icount - port->state->xmit.head (circ_buf->head) - port->state->xmit.tail (circ_buf->tail) - -The low level driver is free to use this lock to provide any additional -locking. - -The port_sem semaphore is used to protect against ports being added/ -removed or reconfigured at inappropriate times. Since v2.6.27, this -semaphore has been the 'mutex' member of the tty_port struct, and -commonly referred to as the port mutex. - - -uart_ops --------- - -The uart_ops structure is the main interface between serial_core and the -hardware specific driver. It contains all the methods to control the -hardware. - - tx_empty(port) - This function tests whether the transmitter fifo and shifter - for the port described by 'port' is empty. If it is empty, - this function should return TIOCSER_TEMT, otherwise return 0. - If the port does not support this operation, then it should - return TIOCSER_TEMT. - - Locking: none. - - Interrupts: caller dependent. - - This call must not sleep - - set_mctrl(port, mctrl) - This function sets the modem control lines for port described - by 'port' to the state described by mctrl. The relevant bits - of mctrl are: - - - TIOCM_RTS RTS signal. - - TIOCM_DTR DTR signal. - - TIOCM_OUT1 OUT1 signal. - - TIOCM_OUT2 OUT2 signal. - - TIOCM_LOOP Set the port into loopback mode. 
- - If the appropriate bit is set, the signal should be driven - active. If the bit is clear, the signal should be driven - inactive. - - Locking: port->lock taken. - - Interrupts: locally disabled. - - This call must not sleep - - get_mctrl(port) - Returns the current state of modem control inputs. The state - of the outputs should not be returned, since the core keeps - track of their state. The state information should include: - - - TIOCM_CAR state of DCD signal - - TIOCM_CTS state of CTS signal - - TIOCM_DSR state of DSR signal - - TIOCM_RI state of RI signal - - The bit is set if the signal is currently driven active. If - the port does not support CTS, DCD or DSR, the driver should - indicate that the signal is permanently active. If RI is - not available, the signal should not be indicated as active. - - Locking: port->lock taken. - - Interrupts: locally disabled. - - This call must not sleep - - stop_tx(port) - Stop transmitting characters. This might be due to the CTS - line becoming inactive or the tty layer indicating we want - to stop transmission due to an XOFF character. - - The driver should stop transmitting characters as soon as - possible. - - Locking: port->lock taken. - - Interrupts: locally disabled. - - This call must not sleep - - start_tx(port) - Start transmitting characters. - - Locking: port->lock taken. - - Interrupts: locally disabled. - - This call must not sleep - - throttle(port) - Notify the serial driver that input buffers for the line discipline are - close to full, and it should somehow signal that no more characters - should be sent to the serial port. - This will be called only if hardware assisted flow control is enabled. - - Locking: serialized with .unthrottle() and termios modification by the - tty layer. - - unthrottle(port) - Notify the serial driver that characters can now be sent to the serial - port without fear of overrunning the input buffers of the line - disciplines. 
- - This will be called only if hardware assisted flow control is enabled. - - Locking: serialized with .throttle() and termios modification by the - tty layer. - - send_xchar(port,ch) - Transmit a high priority character, even if the port is stopped. - This is used to implement XON/XOFF flow control and tcflow(). If - the serial driver does not implement this function, the tty core - will append the character to the circular buffer and then call - start_tx() / stop_tx() to flush the data out. - - Do not transmit if ch == '\0' (__DISABLED_CHAR). - - Locking: none. - - Interrupts: caller dependent. - - stop_rx(port) - Stop receiving characters; the port is in the process of - being closed. - - Locking: port->lock taken. - - Interrupts: locally disabled. - - This call must not sleep - - enable_ms(port) - Enable the modem status interrupts. - - This method may be called multiple times. Modem status - interrupts should be disabled when the shutdown method is - called. - - Locking: port->lock taken. - - Interrupts: locally disabled. - - This call must not sleep - - break_ctl(port,ctl) - Control the transmission of a break signal. If ctl is - nonzero, the break signal should be transmitted. The signal - should be terminated when another call is made with a zero - ctl. - - Locking: caller holds tty_port->mutex - - startup(port) - Grab any interrupt resources and initialise any low level driver - state. Enable the port for reception. It should not activate - RTS nor DTR; this will be done via a separate call to set_mctrl. - - This method will only be called when the port is initially opened. - - Locking: port_sem taken. - - Interrupts: globally disabled. - - shutdown(port) - Disable the port, disable any break condition that may be in - effect, and free any interrupt resources. It should not disable - RTS nor DTR; this will have already been done via a separate - call to set_mctrl. - - Drivers must not access port->state once this call has completed. 
- - This method will only be called when there are no more users of - this port. - - Locking: port_sem taken. - - Interrupts: caller dependent. - - flush_buffer(port) - Flush any write buffers, reset any DMA state and stop any - ongoing DMA transfers. - - This will be called whenever the port->state->xmit circular - buffer is cleared. - - Locking: port->lock taken. - - Interrupts: locally disabled. - - This call must not sleep - - set_termios(port,termios,oldtermios) - Change the port parameters, including word length, parity, stop - bits. Update read_status_mask and ignore_status_mask to indicate - the types of events we are interested in receiving. Relevant - termios->c_cflag bits are: - - CSIZE - - word size - CSTOPB - - 2 stop bits - PARENB - - parity enable - PARODD - - odd parity (when PARENB is in force) - CREAD - - enable reception of characters (if not set, - still receive characters from the port, but - throw them away. - CRTSCTS - - if set, enable CTS status change reporting - CLOCAL - - if not set, enable modem status change - reporting. - - Relevant termios->c_iflag bits are: - - INPCK - - enable frame and parity error events to be - passed to the TTY layer. - BRKINT / PARMRK - - both of these enable break events to be - passed to the TTY layer. - - IGNPAR - - ignore parity and framing errors - IGNBRK - - ignore break errors, If IGNPAR is also - set, ignore overrun errors as well. 
- - The interaction of the iflag bits is as follows (parity error - given as an example): - - =============== ======= ====== ============================= - Parity error INPCK IGNPAR - =============== ======= ====== ============================= - n/a 0 n/a character received, marked as - TTY_NORMAL - None 1 n/a character received, marked as - TTY_NORMAL - Yes 1 0 character received, marked as - TTY_PARITY - Yes 1 1 character discarded - =============== ======= ====== ============================= - - Other flags may be used (eg, xon/xoff characters) if your - hardware supports hardware "soft" flow control. - - Locking: caller holds tty_port->mutex - - Interrupts: caller dependent. - - This call must not sleep - - set_ldisc(port,termios) - Notifier for discipline change. See Documentation/serial/tty.rst. - - Locking: caller holds tty_port->mutex - - pm(port,state,oldstate) - Perform any power management related activities on the specified - port. State indicates the new state (defined by - enum uart_pm_state), oldstate indicates the previous state. - - This function should not be used to grab any resources. - - This will be called when the port is initially opened and finally - closed, except when the port is also the system console. This - will occur even if CONFIG_PM is not set. - - Locking: none. - - Interrupts: caller dependent. - - type(port) - Return a pointer to a string constant describing the specified - port, or return NULL, in which case the string 'unknown' is - substituted. - - Locking: none. - - Interrupts: caller dependent. - - release_port(port) - Release any memory and IO region resources currently in use by - the port. - - Locking: none. - - Interrupts: caller dependent. - - request_port(port) - Request any memory and IO region resources required by the port. - If any fail, no resources should be registered when this function - returns, and it should return -EBUSY on failure. - - Locking: none. - - Interrupts: caller dependent. 
- - config_port(port,type) - Perform any autoconfiguration steps required for the port. `type` - contains a bit mask of the required configuration. UART_CONFIG_TYPE - indicates that the port requires detection and identification. - port->type should be set to the type found, or PORT_UNKNOWN if - no port was detected. - - UART_CONFIG_IRQ indicates autoconfiguration of the interrupt signal, - which should be probed using standard kernel autoprobing techniques. - This is not necessary on platforms where ports have interrupts - internally hard wired (eg, system on a chip implementations). - - Locking: none. - - Interrupts: caller dependent. - - verify_port(port,serinfo) - Verify the new serial port information contained within serinfo is - suitable for this port type. - - Locking: none. - - Interrupts: caller dependent. - - ioctl(port,cmd,arg) - Perform any port specific IOCTLs. IOCTL commands must be defined - using the standard numbering system found in - - Locking: none. - - Interrupts: caller dependent. - - poll_init(port) - Called by kgdb to perform the minimal hardware initialization needed - to support poll_put_char() and poll_get_char(). Unlike ->startup() - this should not request interrupts. - - Locking: tty_mutex and tty_port->mutex taken. - - Interrupts: n/a. - - poll_put_char(port,ch) - Called by kgdb to write a single character directly to the serial - port. It can and should block until there is space in the TX FIFO. - - Locking: none. - - Interrupts: caller dependent. - - This call must not sleep - - poll_get_char(port) - Called by kgdb to read a single character directly from the serial - port. If data is available, it should be returned; otherwise - the function should return NO_POLL_CHAR immediately. - - Locking: none. - - Interrupts: caller dependent. 
- - This call must not sleep - -Other functions ---------------- - -uart_update_timeout(port,cflag,baud) - Update the FIFO drain timeout, port->timeout, according to the - number of bits, parity, stop bits and baud rate. - - Locking: caller is expected to take port->lock - - Interrupts: n/a - -uart_get_baud_rate(port,termios,old,min,max) - Return the numeric baud rate for the specified termios, taking - account of the special 38400 baud "kludge". The B0 baud rate - is mapped to 9600 baud. - - If the baud rate is not within min..max, then if old is non-NULL, - the original baud rate will be tried. If that exceeds the - min..max constraint, 9600 baud will be returned. termios will - be updated to the baud rate in use. - - Note: min..max must always allow 9600 baud to be selected. - - Locking: caller dependent. - - Interrupts: n/a - -uart_get_divisor(port,baud) - Return the divisor (baud_base / baud) for the specified baud - rate, appropriately rounded. - - If 38400 baud and custom divisor is selected, return the - custom divisor instead. - - Locking: caller dependent. - - Interrupts: n/a - -uart_match_port(port1,port2) - This utility function can be used to determine whether two - uart_port structures describe the same port. - - Locking: n/a - - Interrupts: n/a - -uart_write_wakeup(port) - A driver is expected to call this function when the number of - characters in the transmit buffer have dropped below a threshold. - - Locking: port->lock should be held. - - Interrupts: n/a - -uart_register_driver(drv) - Register a uart driver with the core driver. We in turn register - with the tty layer, and initialise the core driver per-port state. - - drv->port should be NULL, and the per-port structures should be - registered using uart_add_one_port after this call has succeeded. - - Locking: none - - Interrupts: enabled - -uart_unregister_driver() - Remove all references to a driver from the core driver. 
The low - level driver must have removed all its ports via the - uart_remove_one_port() if it registered them with uart_add_one_port(). - - Locking: none - - Interrupts: enabled - -**uart_suspend_port()** - -**uart_resume_port()** - -**uart_add_one_port()** - -**uart_remove_one_port()** - -Other notes ------------ - -It is intended some day to drop the 'unused' entries from uart_port, and -allow low level drivers to register their own individual uart_port's with -the core. This will allow drivers to use uart_port as a pointer to a -structure containing both the uart_port entry with their own extensions, -thus:: - - struct my_port { - struct uart_port port; - int my_stuff; - }; - -Modem control lines via GPIO ----------------------------- - -Some helpers are provided in order to set/get modem control lines via GPIO. - -mctrl_gpio_init(port, idx): - This will get the {cts,rts,...}-gpios from device tree if they are - present and request them, set direction etc, and return an - allocated structure. `devm_*` functions are used, so there's no need - to call mctrl_gpio_free(). - As this sets up the irq handling make sure to not handle changes to the - gpio input lines in your driver, too. - -mctrl_gpio_free(dev, gpios): - This will free the requested gpios in mctrl_gpio_init(). - As `devm_*` functions are used, there's generally no need to call - this function. - -mctrl_gpio_to_gpiod(gpios, gidx) - This returns the gpio_desc structure associated to the modem line - index. - -mctrl_gpio_set(gpios, mctrl): - This will sets the gpios according to the mctrl state. - -mctrl_gpio_get(gpios, mctrl): - This will update mctrl with the gpios values. - -mctrl_gpio_enable_ms(gpios): - Enables irqs and handling of changes to the ms lines. - -mctrl_gpio_disable_ms(gpios): - Disables irqs and handling of changes to the ms lines. 
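Returning to the uart_get_divisor() entry under "Other functions" above, the rounding it describes can be sketched in plain userspace C. This is only an illustration, not the serial core's code: the function name `sketch_get_divisor` and its flat parameter list are hypothetical, and it assumes the divisor is the UART clock divided by 16 times the baud rate, rounded to the nearest integer, with the custom divisor taking over when 38400 baud and a custom speed are both selected.

```c
/*
 * Hypothetical userspace sketch of the divisor calculation described
 * for uart_get_divisor(). Names and parameters are illustrative only.
 *
 * uartclk        - UART input clock in Hz (baud_base is uartclk / 16)
 * baud           - requested baud rate, must be non-zero
 * spd_cust       - non-zero if the custom-divisor speed flag is selected
 * custom_divisor - divisor to use for the 38400 baud "kludge"
 */
unsigned int sketch_get_divisor(unsigned int uartclk, unsigned int baud,
                                int spd_cust, unsigned int custom_divisor)
{
    unsigned int denom = 16u * baud;

    /* 38400 baud with a custom divisor selected: use it verbatim. */
    if (baud == 38400u && spd_cust)
        return custom_divisor;

    /* Round to the nearest integer by adding half the denominator. */
    return (uartclk + denom / 2u) / denom;
}
```

With a 1.8432 MHz clock this gives a divisor of 12 for 9600 baud and 1 for 115200 baud.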
diff --git a/Documentation/serial/index.rst b/Documentation/serial/index.rst deleted file mode 100644 index d0ba22ea23bf..000000000000 --- a/Documentation/serial/index.rst +++ /dev/null @@ -1,32 +0,0 @@ -:orphan: - -========================== -Support for Serial devices -========================== - -.. toctree:: - :maxdepth: 1 - - - driver - tty - -Serial drivers -============== - -.. toctree:: - :maxdepth: 1 - - cyclades_z - moxa-smartio - n_gsm - rocket - serial-iso7816 - serial-rs485 - -.. only:: subproject and html - - Indices - ======= - - * :ref:`genindex` diff --git a/Documentation/serial/moxa-smartio.rst b/Documentation/serial/moxa-smartio.rst deleted file mode 100644 index 156100f17c3f..000000000000 --- a/Documentation/serial/moxa-smartio.rst +++ /dev/null @@ -1,615 +0,0 @@ -============================================================= -MOXA Smartio/Industio Family Device Driver Installation Guide -============================================================= - -.. note:: - - This file is outdated. It needs some care in order to make it - updated to Kernel 5.0 and upper - -Copyright (C) 2008, Moxa Inc. - -Date: 01/21/2008 - -.. Content - - 1. Introduction - 2. System Requirement - 3. Installation - 3.1 Hardware installation - 3.2 Driver files - 3.3 Device naming convention - 3.4 Module driver configuration - 3.5 Static driver configuration for Linux kernel 2.4.x and 2.6.x. - 3.6 Custom configuration - 3.7 Verify driver installation - 4. Utilities - 5. Setserial - 6. Troubleshooting - -1. Introduction -^^^^^^^^^^^^^^^ - - The Smartio/Industio/UPCI family Linux driver supports following multiport - boards. 
- - - 2 ports multiport board - CP-102U, CP-102UL, CP-102UF - CP-132U-I, CP-132UL, - CP-132, CP-132I, CP132S, CP-132IS, - CI-132, CI-132I, CI-132IS, - (C102H, C102HI, C102HIS, C102P, CP-102, CP-102S) - - - 4 ports multiport board - CP-104EL, - CP-104UL, CP-104JU, - CP-134U, CP-134U-I, - C104H/PCI, C104HS/PCI, - CP-114, CP-114I, CP-114S, CP-114IS, CP-114UL, - C104H, C104HS, - CI-104J, CI-104JS, - CI-134, CI-134I, CI-134IS, - (C114HI, CT-114I, C104P), - POS-104UL, - CB-114, - CB-134I - - - 8 ports multiport board - CP-118EL, CP-168EL, - CP-118U, CP-168U, - C168H/PCI, - C168H, C168HS, - (C168P), - CB-108 - - This driver and installation procedure have been developed upon Linux Kernel - 2.4.x and 2.6.x. This driver supports Intel x86 hardware platform. In order - to maintain compatibility, this version has also been properly tested with - RedHat, Mandrake, Fedora and S.u.S.E Linux. However, if compatibility problem - occurs, please contact Moxa at support@moxa.com.tw. - - In addition to device driver, useful utilities are also provided in this - version. They are: - - - msdiag - Diagnostic program for displaying installed Moxa - Smartio/Industio boards. - - msmon - Monitor program to observe data count and line status signals. - - msterm A simple terminal program which is useful in testing serial - ports. - - io-irq.exe - Configuration program to setup ISA boards. Please note that - this program can only be executed under DOS. - - All the drivers and utilities are published in form of source code under - GNU General Public License in this version. Please refer to GNU General - Public License announcement in each source code file for more detail. - - In Moxa's Web sites, you may always find latest driver at http://www.moxa.com/. - - This version of driver can be installed as Loadable Module (Module driver) - or built-in into kernel (Static driver). You may refer to following - installation procedure for suitable one. 
Before you install the driver, - please refer to hardware installation procedure in the User's Manual. - - We assume the user should be familiar with following documents. - - - Serial-HOWTO - - Kernel-HOWTO - -2. System Requirement -^^^^^^^^^^^^^^^^^^^^^ - - - Hardware platform: Intel x86 machine - - Kernel version: 2.4.x or 2.6.x - - gcc version 2.72 or later - - Maximum 4 boards can be installed in combination - -3. Installation -^^^^^^^^^^^^^^^ - -3.1 Hardware installation -========================= - - There are two types of buses, ISA and PCI, for Smartio/Industio - family multiport board. - -ISA board ---------- - - You'll have to configure CAP address, I/O address, Interrupt Vector - as well as IRQ before installing this driver. Please refer to hardware - installation procedure in User's Manual before proceed any further. - Please make sure the JP1 is open after the ISA board is set properly. - -PCI/UPCI board --------------- - - You may need to adjust IRQ usage in BIOS to avoid from IRQ conflict - with other ISA devices. Please refer to hardware installation - procedure in User's Manual in advance. - -PCI IRQ Sharing ---------------- - - Each port within the same multiport board shares the same IRQ. Up to - 4 Moxa Smartio/Industio PCI Family multiport boards can be installed - together on one system and they can share the same IRQ. - - -3.2 Driver files -================ - - The driver file may be obtained from ftp, CD-ROM or floppy disk. The - first step, anyway, is to copy driver file "mxser.tgz" into specified - directory. e.g. /moxa. The execute commands as below:: - - # cd / - # mkdir moxa - # cd /moxa - # tar xvf /dev/fd0 - -or:: - - # cd / - # mkdir moxa - # cd /moxa - # cp /mnt/cdrom//mxser.tgz . - # tar xvfz mxser.tgz - - -3.3 Device naming convention -============================ - - You may find all the driver and utilities files in /moxa/mxser. - Following installation procedure depends on the model you'd like to - run the driver. 
If you prefer module driver, please refer to 3.4. - If static driver is required, please refer to 3.5. - -Dialin and callout port ------------------------ - - This driver remains traditional serial device properties. There are - two special file name for each serial port. One is dial-in port - which is named "ttyMxx". For callout port, the naming convention - is "cumxx". - -Device naming when more than 2 boards installed ------------------------------------------------ - - Naming convention for each Smartio/Industio multiport board is - pre-defined as below. - - ============ =============== ============== - Board Num. Dial-in Port Callout port - 1st board ttyM0 - ttyM7 cum0 - cum7 - 2nd board ttyM8 - ttyM15 cum8 - cum15 - 3rd board ttyM16 - ttyM23 cum16 - cum23 - 4th board ttyM24 - ttym31 cum24 - cum31 - ============ =============== ============== - -.. note:: - - Under Kernel 2.6 and upper, the cum Device is Obsolete. So use ttyM* - device instead. - -Board sequence --------------- - - This driver will activate ISA boards according to the parameter set - in the driver. After all specified ISA board activated, PCI board - will be installed in the system automatically driven. - Therefore the board number is sorted by the CAP address of ISA boards. - For PCI boards, their sequence will be after ISA boards and C168H/PCI - has higher priority than C104H/PCI boards. - -3.4 Module driver configuration -=============================== - - Module driver is easiest way to install. If you prefer static driver - installation, please skip this paragraph. - - - ------------- Prepare to use the MOXA driver -------------------- - -3.4.1 Create tty device with correct major number -------------------------------------------------- - - Before using MOXA driver, your system must have the tty devices - which are created with driver's major number. We offer one shell - script "msmknod" to simplify the procedure. - This step is only needed to be executed once. 
But you still - need to do this procedure when: - - a. You change the driver's major number. Please refer the "3.7" - section. - b. Your total installed MOXA boards number is changed. Maybe you - add/delete one MOXA board. - c. You want to change the tty name. This needs to modify the - shell script "msmknod" - - The procedure is:: - - # cd /moxa/mxser/driver - # ./msmknod - - This shell script will require the major number for dial-in - device and callout device to create tty device. You also need - to specify the total installed MOXA board number. Default major - numbers for dial-in device and callout device are 30, 35. If - you need to change to other number, please refer section "3.7" - for more detailed procedure. - Msmknod will delete any special files occupying the same device - naming. - -3.4.2 Build the MOXA driver and utilities ------------------------------------------ - - Before using the MOXA driver and utilities, you need compile the - all the source code. This step is only need to be executed once. - But you still re-compile the source code if you modify the source - code. For example, if you change the driver's major number (see - "3.7" section), then you need to do this step again. - - Find "Makefile" in /moxa/mxser, then run - - # make clean; make install - - ..note:: - - For Red Hat 9, Red Hat Enterprise Linux AS3/ES3/WS3 & Fedora Core1: - # make clean; make installsp1 - - For Red Hat Enterprise Linux AS4/ES4/WS4: - # make clean; make installsp2 - - The driver files "mxser.o" and utilities will be properly compiled - and copied to system directories respectively. - -------------- Load MOXA driver-------------------- - -3.4.3 Load the MOXA driver --------------------------- - - :: - - # modprobe mxser - - will activate the module driver. You may run "lsmod" to check - if "mxser" is activated. If the MOXA board is ISA board, the - is needed. Please refer to section "3.4.5" for more - information. 
- -------------- Load MOXA driver on boot -------------------- - -3.4.4 Load the mxser driver ---------------------------- - - - For the above description, you may manually execute - "modprobe mxser" to activate this driver and run - "rmmod mxser" to remove it. - - However, it's better to have a boot time configuration to - eliminate manual operation. Boot time configuration can be - achieved by rc file. We offer one "rc.mxser" file to simplify - the procedure under "moxa/mxser/driver". - - But if you use ISA board, please modify the "modprobe ..." command - to add the argument (see "3.4.5" section). After modifying the - rc.mxser, please try to execute "/moxa/mxser/driver/rc.mxser" - manually to make sure the modification is ok. If any error - encountered, please try to modify again. If the modification is - completed, follow the below step. - - Run following command for setting rc files:: - - # cd /moxa/mxser/driver - # cp ./rc.mxser /etc/rc.d - # cd /etc/rc.d - - Check "rc.serial" is existed or not. If "rc.serial" doesn't exist, - create it by vi, run "chmod 755 rc.serial" to change the permission. - - Add "/etc/rc.d/rc.mxser" in last line. - - Reboot and check if moxa.o activated by "lsmod" command. - -3.4.5. specify CAP address --------------------------- - - If you'd like to drive Smartio/Industio ISA boards in the system, - you'll have to add parameter to specify CAP address of given - board while activating "mxser.o". The format for parameters are - as follows.:: - - modprobe mxser ioaddr=0x???,0x???,0x???,0x??? - | | | | - | | | +- 4th ISA board - | | +------ 3rd ISA board - | +------------ 2nd ISA board - +-------------------1st ISA board - -3.5 Static driver configuration for Linux kernel 2.4.x and 2.6.x -================================================================ - - Note: - To use static driver, you must install the linux kernel - source package. 
- -3.5.1 Backup the built-in driver in the kernel ----------------------------------------------- - - :: - - # cd /usr/src/linux/drivers/char - # mv mxser.c mxser.c.old - - For Red Hat 7.x user, you need to create link: - # cd /usr/src - # ln -s linux-2.4 linux - -3.5.2 Create link ------------------ - :: - - # cd /usr/src/linux/drivers/char - # ln -s /moxa/mxser/driver/mxser.c mxser.c - -3.5.3 Add CAP address list for ISA boards. ------------------------------------------- - - For PCI boards user, please skip this step. - - In module mode, the CAP address for ISA board is given by - parameter. In static driver configuration, you'll have to - assign it within driver's source code. If you will not - install any ISA boards, you may skip to next portion. - The instructions to modify driver source code are as - below. - - a. run:: - - # cd /moxa/mxser/driver - # vi mxser.c - - b. Find the array mxserBoardCAP[] as below:: - - static int mxserBoardCAP[] = {0x00, 0x00, 0x00, 0x00}; - - c. Change the address within this array using vi. For - example, to driver 2 ISA boards with CAP address - 0x280 and 0x180 as 1st and 2nd board. Just to change - the source code as follows:: - - static int mxserBoardCAP[] = {0x280, 0x180, 0x00, 0x00}; - -3.5.4 Setup kernel configuration --------------------------------- - - Configure the kernel:: - - # cd /usr/src/linux - # make menuconfig - - You will go into a menu-driven system. Please select [Character - devices][Non-standard serial port support], enable the [Moxa - SmartIO support] driver with "[*]" for built-in (not "[M]"), then - select [Exit] to exit this program. - -3.5.5 Rebuild kernel --------------------- - - The following are for Linux kernel rebuilding, for your - reference only. - - For appropriate details, please refer to the Linux document: - - a. 
Run the following commands:: - - cd /usr/src/linux - make clean # take a few minutes - make dep # take a few minutes - make bzImage # take probably 10-20 minutes - make install # copy boot image to correct position - - f. Please make sure the boot kernel (vmlinuz) is in the - correct position. - g. If you use 'lilo' utility, you should check /etc/lilo.conf - 'image' item specified the path which is the 'vmlinuz' path, - or you will load wrong (or old) boot kernel image (vmlinuz). - After checking /etc/lilo.conf, please run "lilo". - - Note that if the result of "make bzImage" is ERROR, then you have to - go back to Linux configuration Setup. Type "make menuconfig" in - directory /usr/src/linux. - - -3.5.6 Make tty device and special file --------------------------------------- - - :: - # cd /moxa/mxser/driver - # ./msmknod - -3.5.7 Make utility ------------------- - - :: - - # cd /moxa/mxser/utility - # make clean; make install - -3.5.8 Reboot ------------- - - - -3.6 Custom configuration -======================== - - Although this driver already provides you default configuration, you - still can change the device name and major number. The instruction to - change these parameters are shown as below. - -a. Change Device name - - If you'd like to use other device names instead of default naming - convention, all you have to do is to modify the internal code - within the shell script "msmknod". First, you have to open "msmknod" - by vi. Locate each line contains "ttyM" and "cum" and change them - to the device name you desired. "msmknod" creates the device names - you need next time executed. - -b. Change Major number - - If major number 30 and 35 had been occupied, you may have to select - 2 free major numbers for this driver. There are 3 steps to change - major numbers. - -3.6.1 Find free major numbers ------------------------------ - - In /proc/devices, you may find all the major numbers occupied - in the system. Please select 2 major numbers that are available. 
- e.g. 40, 45. - -3.6.2 Create special files --------------------------- - - Run /moxa/mxser/driver/msmknod to create special files with - specified major numbers. - -3.6.3 Modify driver with new major number ------------------------------------------ - - Run vi to open /moxa/mxser/driver/mxser.c. Locate the line - contains "MXSERMAJOR". Change the content as below:: - - #define MXSERMAJOR 40 - #define MXSERCUMAJOR 45 - - 3.6.4 Run "make clean; make install" in /moxa/mxser/driver. - -3.7 Verify driver installation -============================== - - You may refer to /var/log/messages to check the latest status - log reported by this driver whenever it's activated. - -4. Utilities -^^^^^^^^^^^^ - - There are 3 utilities contained in this driver. They are msdiag, msmon and - msterm. These 3 utilities are released in form of source code. They should - be compiled into executable file and copied into /usr/bin. - - Before using these utilities, please load driver (refer 3.4 & 3.5) and - make sure you had run the "msmknod" utility. - -msdiag - Diagnostic -=================== - - This utility provides the function to display what Moxa Smartio/Industio - board found by driver in the system. - -msmon - Port Monitoring -======================= - - This utility gives the user a quick view about all the MOXA ports' - activities. One can easily learn each port's total received/transmitted - (Rx/Tx) character count since the time when the monitoring is started. - - Rx/Tx throughputs per second are also reported in interval basis (e.g. - the last 5 seconds) and in average basis (since the time the monitoring - is started). You can reset all ports' count by key. <+> <-> - (plus/minus) keys to change the displaying time interval. Press - on the port, that cursor stay, to view the port's communication - parameters, signal status, and input/output queue. 
- -msterm - Terminal Emulation -=========================== - - This utility provides data sending and receiving ability of all tty ports, - especially for MOXA ports. It is quite useful for testing simple - application, for example, sending AT command to a modem connected to the - port or used as a terminal for login purpose. Note that this is only a - dumb terminal emulation without handling full screen operation. - -5. Setserial -^^^^^^^^^^^^ - - Supported Setserial parameters are listed as below. - - ============== ========================================================= - uart set UART type(16450-->disable FIFO, 16550A-->enable FIFO) - close_delay set the amount of time(in 1/100 of a second) that DTR - should be kept low while being closed. - closing_wait set the amount of time(in 1/100 of a second) that the - serial port should wait for data to be drained while - being closed, before the receiver is disable. - spd_hi Use 57.6kb when the application requests 38.4kb. - spd_vhi Use 115.2kb when the application requests 38.4kb. - spd_shi Use 230.4kb when the application requests 38.4kb. - spd_warp Use 460.8kb when the application requests 38.4kb. - spd_normal Use 38.4kb when the application requests 38.4kb. - spd_cust Use the custom divisor to set the speed when the - application requests 38.4kb. - divisor This option set the custom division. - baud_base This option set the base baud rate. - ============== ========================================================= - -6. Troubleshooting -^^^^^^^^^^^^^^^^^^ - - The boot time error messages and solutions are stated as clearly as - possible. If all the possible solutions fail, please contact our technical - support team to get more help. - - - Error msg: - More than 4 Moxa Smartio/Industio family boards found. Fifth board - and after are ignored. - - Solution: - To avoid this problem, please unplug fifth and after board, because Moxa - driver supports up to 4 boards. - - Error msg: - Request_irq fail, IRQ(?) 
may be conflict with another device. - - Solution: - Other PCI or ISA devices occupy the assigned IRQ. If you are not sure - which device causes the situation, please check /proc/interrupts to find - free IRQ and simply change another free IRQ for Moxa board. - - Error msg: - Board #: C1xx Series(CAP=xxx) interrupt number invalid. - - Solution: - Each port within the same multiport board shares the same IRQ. Please set - one IRQ (IRQ doesn't equal to zero) for one Moxa board. - - Error msg: - No interrupt vector be set for Moxa ISA board(CAP=xxx). - - Solution: - Moxa ISA board needs an interrupt vector.Please refer to user's manual - "Hardware Installation" chapter to set interrupt vector. - - Error msg: - Couldn't install MOXA Smartio/Industio family driver! - - Solution: - Load Moxa driver fail, the major number may conflict with other devices. - Please refer to previous section 3.7 to change a free major number for - Moxa driver. - - Error msg: - Couldn't install MOXA Smartio/Industio family callout driver! - - Solution: - Load Moxa callout driver fail, the callout device major number may - conflict with other devices. Please refer to previous section 3.7 to - change a free callout device major number for Moxa driver. diff --git a/Documentation/serial/n_gsm.rst b/Documentation/serial/n_gsm.rst deleted file mode 100644 index f3ad9fd26408..000000000000 --- a/Documentation/serial/n_gsm.rst +++ /dev/null @@ -1,103 +0,0 @@ -============================== -GSM 0710 tty multiplexor HOWTO -============================== - -This line discipline implements the GSM 07.10 multiplexing protocol -detailed in the following 3GPP document: - - http://www.3gpp.org/ftp/Specs/archive/07_series/07.10/0710-720.zip - -This document give some hints on how to use this driver with GPRS and 3G -modems connected to a physical serial port. - -How to use it -------------- -1. initialize the modem in 0710 mux mode (usually AT+CMUX= command) through - its serial port. 
Depending on the modem used, you can pass more or less - parameters to this command, -2. switch the serial line to using the n_gsm line discipline by using - TIOCSETD ioctl, -3. configure the mux using GSMIOC_GETCONF / GSMIOC_SETCONF ioctl, - -Major parts of the initialization program : -(a good starting point is util-linux-ng/sys-utils/ldattach.c):: - - #include - #define N_GSM0710 21 /* GSM 0710 Mux */ - #define DEFAULT_SPEED B115200 - #define SERIAL_PORT /dev/ttyS0 - - int ldisc = N_GSM0710; - struct gsm_config c; - struct termios configuration; - - /* open the serial port connected to the modem */ - fd = open(SERIAL_PORT, O_RDWR | O_NOCTTY | O_NDELAY); - - /* configure the serial port : speed, flow control ... */ - - /* send the AT commands to switch the modem to CMUX mode - and check that it's successful (should return OK) */ - write(fd, "AT+CMUX=0\r", 10); - - /* experience showed that some modems need some time before - being able to answer to the first MUX packet so a delay - may be needed here in some case */ - sleep(3); - - /* use n_gsm line discipline */ - ioctl(fd, TIOCSETD, &ldisc); - - /* get n_gsm configuration */ - ioctl(fd, GSMIOC_GETCONF, &c); - /* we are initiator and need encoding 0 (basic) */ - c.initiator = 1; - c.encapsulation = 0; - /* our modem defaults to a maximum size of 127 bytes */ - c.mru = 127; - c.mtu = 127; - /* set the new configuration */ - ioctl(fd, GSMIOC_SETCONF, &c); - - /* and wait for ever to keep the line discipline enabled */ - daemon(0,0); - pause(); - -4. create the devices corresponding to the "virtual" serial ports (take care, - each modem has its configuration and some DLC have dedicated functions, - for example GPS), starting with minor 1 (DLC0 is reserved for the management - of the mux):: - - MAJOR=`cat /proc/devices |grep gsmtty | awk '{print $1}` - for i in `seq 1 4`; do - mknod /dev/ttygsm$i c $MAJOR $i - done - -5. use these devices as plain serial ports. 
- - for example, it's possible: - - - and to use gnokii to send / receive SMS on ttygsm1 - - to use ppp to establish a datalink on ttygsm2 - -6. first close all virtual ports before closing the physical port. - - Note that after closing the physical port the modem is still in multiplexing - mode. This may prevent a successful re-opening of the port later. To avoid - this situation either reset the modem if your hardware allows that or send - a disconnect command frame manually before initializing the multiplexing mode - for the second time. The byte sequence for the disconnect command frame is:: - - 0xf9, 0x03, 0xef, 0x03, 0xc3, 0x16, 0xf9. - -Additional Documentation ------------------------- -More practical details on the protocol and how it's supported by industrial -modems can be found in the following documents : - -- http://www.telit.com/module/infopool/download.php?id=616 -- http://www.u-blox.com/images/downloads/Product_Docs/LEON-G100-G200-MuxImplementation_ApplicationNote_%28GSM%20G1-CS-10002%29.pdf -- http://www.sierrawireless.com/Support/Downloads/AirPrime/WMP_Series/~/media/Support_Downloads/AirPrime/Application_notes/CMUX_Feature_Application_Note-Rev004.ashx -- http://wm.sim.com/sim/News/photo/2010721161442.pdf - -11-03-08 - Eric Bénard - diff --git a/Documentation/serial/rocket.rst b/Documentation/serial/rocket.rst deleted file mode 100644 index 23761eae4282..000000000000 --- a/Documentation/serial/rocket.rst +++ /dev/null @@ -1,185 +0,0 @@ -================================================ -Comtrol(tm) RocketPort(R)/RocketModem(TM) Series -================================================ - -Device Driver for the Linux Operating System -============================================ - -Product overview ----------------- - -This driver provides a loadable kernel driver for the Comtrol RocketPort -and RocketModem PCI boards. These boards provide, 2, 4, 8, 16, or 32 -high-speed serial ports or modems. 
This driver supports up to a combination -of four RocketPort or RocketModems boards in one machine simultaneously. -This file assumes that you are using the RocketPort driver which is -integrated into the kernel sources. - -The driver can also be installed as an external module using the usual -"make;make install" routine. This external module driver, obtainable -from the Comtrol website listed below, is useful for updating the driver -or installing it into kernels which do not have the driver configured -into them. Installations instructions for the external module -are in the included README and HW_INSTALL files. - -RocketPort ISA and RocketModem II PCI boards currently are only supported by -this driver in module form. - -The RocketPort ISA board requires I/O ports to be configured by the DIP -switches on the board. See the section "ISA Rocketport Boards" below for -information on how to set the DIP switches. - -You pass the I/O port to the driver using the following module parameters: - -board1: - I/O port for the first ISA board -board2: - I/O port for the second ISA board -board3: - I/O port for the third ISA board -board4: - I/O port for the fourth ISA board - -There is a set of utilities and scripts provided with the external driver -(downloadable from http://www.comtrol.com) that ease the configuration and -setup of the ISA cards. - -The RocketModem II PCI boards require firmware to be loaded into the card -before it will function. The driver has only been tested as a module for this -board. - -Installation Procedures ------------------------ - -RocketPort/RocketModem PCI cards require no driver configuration, they are -automatically detected and configured. - -The RocketPort driver can be installed as a module (recommended) or built -into the kernel. This is selected, as for other drivers, through the `make config` -command from the root of the Linux source tree during the kernel build process. 
- -The RocketPort/RocketModem serial ports installed by this driver are assigned -device major number 46, and will be named /dev/ttyRx, where x is the port number -starting at zero (ex. /dev/ttyR0, /devttyR1, ...). If you have multiple cards -installed in the system, the mapping of port names to serial ports is displayed -in the system log at /var/log/messages. - -If installed as a module, the module must be loaded. This can be done -manually by entering "modprobe rocket". To have the module loaded automatically -upon system boot, edit a `/etc/modprobe.d/*.conf` file and add the line -"alias char-major-46 rocket". - -In order to use the ports, their device names (nodes) must be created with mknod. -This is only required once, the system will retain the names once created. To -create the RocketPort/RocketModem device names, use the command -"mknod /dev/ttyRx c 46 x" where x is the port number starting at zero. - -For example:: - - > mknod /dev/ttyR0 c 46 0 - > mknod /dev/ttyR1 c 46 1 - > mknod /dev/ttyR2 c 46 2 - -The Linux script MAKEDEV will create the first 16 ttyRx device names (nodes) -for you:: - - >/dev/MAKEDEV ttyR - -ISA Rocketport Boards ---------------------- - -You must assign and configure the I/O addresses used by the ISA Rocketport -card before installing and using it. This is done by setting a set of DIP -switches on the Rocketport board. - - -Setting the I/O address ------------------------ - -Before installing RocketPort(R) or RocketPort RA boards, you must find -a range of I/O addresses for it to use. The first RocketPort card -requires a 68-byte contiguous block of I/O addresses, starting at one -of the following: 0x100h, 0x140h, 0x180h, 0x200h, 0x240h, 0x280h, -0x300h, 0x340h, 0x380h. This I/O address must be reflected in the DIP -switches of *all* of the Rocketport cards. 
- -The second, third, and fourth RocketPort cards require a 64-byte -contiguous block of I/O addresses, starting at one of the following -I/O addresses: 0x100h, 0x140h, 0x180h, 0x1C0h, 0x200h, 0x240h, 0x280h, -0x2C0h, 0x300h, 0x340h, 0x380h, 0x3C0h. The I/O address used by the -second, third, and fourth Rocketport cards (if present) are set via -software control. The DIP switch settings for the I/O address must be -set to the value of the first Rocketport cards. - -In order to distinguish each of the card from the others, each card -must have a unique board ID set on the dip switches. The first -Rocketport board must be set with the DIP switches corresponding to -the first board, the second board must be set with the DIP switches -corresponding to the second board, etc. IMPORTANT: The board ID is -the only place where the DIP switch settings should differ between the -various Rocketport boards in a system. - -The I/O address range used by any of the RocketPort cards must not -conflict with any other cards in the system, including other -RocketPort cards. Below, you will find a list of commonly used I/O -address ranges which may be in use by other devices in your system. -On a Linux system, "cat /proc/ioports" will also be helpful in -identifying what I/O addresses are being used by devices on your -system. - -Remember, the FIRST RocketPort uses 68 I/O addresses. So, if you set it -for 0x100, it will occupy 0x100 to 0x143. This would mean that you -CAN NOT set the second, third or fourth board for address 0x140 since -the first 4 bytes of that range are used by the first board. You would -need to set the second, third, or fourth board to one of the next available -blocks such as 0x180. 
- -RocketPort and RocketPort RA SW1 Settings:: - - +-------------------------------+ - | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | - +-------+-------+---------------+ - | Unused| Card | I/O Port Block| - +-------------------------------+ - - DIP Switches DIP Switches - 7 8 6 5 - =================== =================== - On On UNUSED, MUST BE ON. On On First Card <==== Default - On Off Second Card - Off On Third Card - Off Off Fourth Card - - DIP Switches I/O Address Range - 4 3 2 1 Used by the First Card - ===================================== - On Off On Off 100-143 - On Off Off On 140-183 - On Off Off Off 180-1C3 <==== Default - Off On On Off 200-243 - Off On Off On 240-283 - Off On Off Off 280-2C3 - Off Off On Off 300-343 - Off Off Off On 340-383 - Off Off Off Off 380-3C3 - -Reporting Bugs --------------- - -For technical support, please provide the following -information: Driver version, kernel release, distribution of -kernel, and type of board you are using. Error messages and log -printouts port configuration details are especially helpful. - -USA: - :Phone: (612) 494-4100 - :FAX: (612) 494-4199 - :email: support@comtrol.com - -Comtrol Europe: - :Phone: +44 (0) 1 869 323-220 - :FAX: +44 (0) 1 869 323-211 - :email: support@comtrol.co.uk - -Web: http://www.comtrol.com -FTP: ftp.comtrol.com diff --git a/Documentation/serial/serial-iso7816.rst b/Documentation/serial/serial-iso7816.rst deleted file mode 100644 index d990143de0c6..000000000000 --- a/Documentation/serial/serial-iso7816.rst +++ /dev/null @@ -1,90 +0,0 @@ -============================= -ISO7816 Serial Communications -============================= - -1. Introduction -=============== - - ISO/IEC7816 is a series of standards specifying integrated circuit cards (ICC) - also known as smart cards. - -2. Hardware-related considerations -================================== - - Some CPUs/UARTs (e.g., Microchip AT91) contain a built-in mode capable of - handling communication with a smart card. 
- - For these microcontrollers, the Linux driver should be made capable of - working in both modes, and proper ioctls (see later) should be made - available at user-level to allow switching from one mode to the other, and - vice versa. - -3. Data Structures Already Available in the Kernel -================================================== - - The Linux kernel provides the serial_iso7816 structure (see [1]) to handle - ISO7816 communications. This data structure is used to set and configure - ISO7816 parameters in ioctls. - - Any driver for devices capable of working both as RS232 and ISO7816 should - implement the iso7816_config callback in the uart_port structure. The - serial_core calls iso7816_config to do the device specific part in response - to TIOCGISO7816 and TIOCSISO7816 ioctls (see below). The iso7816_config - callback receives a pointer to struct serial_iso7816. - -4. Usage from user-level -======================== - - From user-level, ISO7816 configuration can be get/set using the previous - ioctls. For instance, to set ISO7816 you can use the following code:: - - #include - - /* Include definition for ISO7816 ioctls: TIOCSISO7816 and TIOCGISO7816 */ - #include - - /* Open your specific device (e.g., /dev/mydevice): */ - int fd = open ("/dev/mydevice", O_RDWR); - if (fd < 0) { - /* Error handling. See errno. */ - } - - struct serial_iso7816 iso7816conf; - - /* Reserved fields as to be zeroed */ - memset(&iso7816conf, 0, sizeof(iso7816conf)); - - /* Enable ISO7816 mode: */ - iso7816conf.flags |= SER_ISO7816_ENABLED; - - /* Select the protocol: */ - /* T=0 */ - iso7816conf.flags |= SER_ISO7816_T(0); - /* or T=1 */ - iso7816conf.flags |= SER_ISO7816_T(1); - - /* Set the guard time: */ - iso7816conf.tg = 2; - - /* Set the clock frequency*/ - iso7816conf.clk = 3571200; - - /* Set transmission factors: */ - iso7816conf.sc_fi = 372; - iso7816conf.sc_di = 1; - - if (ioctl(fd_usart, TIOCSISO7816, &iso7816conf) < 0) { - /* Error handling. See errno. 
*/ - } - - /* Use read() and write() syscalls here... */ - - /* Close the device when finished: */ - if (close (fd) < 0) { - /* Error handling. See errno. */ - } - -5. References -============= - - [1] include/uapi/linux/serial.h diff --git a/Documentation/serial/serial-rs485.rst b/Documentation/serial/serial-rs485.rst deleted file mode 100644 index 6bc824f948f9..000000000000 --- a/Documentation/serial/serial-rs485.rst +++ /dev/null @@ -1,103 +0,0 @@ -=========================== -RS485 Serial Communications -=========================== - -1. Introduction -=============== - - EIA-485, also known as TIA/EIA-485 or RS-485, is a standard defining the - electrical characteristics of drivers and receivers for use in balanced - digital multipoint systems. - This standard is widely used for communications in industrial automation - because it can be used effectively over long distances and in electrically - noisy environments. - -2. Hardware-related Considerations -================================== - - Some CPUs/UARTs (e.g., Atmel AT91 or 16C950 UART) contain a built-in - half-duplex mode capable of automatically controlling line direction by - toggling RTS or DTR signals. That can be used to control external - half-duplex hardware like an RS485 transceiver or any RS232-connected - half-duplex devices like some modems. - - For these microcontrollers, the Linux driver should be made capable of - working in both modes, and proper ioctls (see later) should be made - available at user-level to allow switching from one mode to the other, and - vice versa. - -3. Data Structures Already Available in the Kernel -================================================== - - The Linux kernel provides the serial_rs485 structure (see [1]) to handle - RS485 communications. This data structure is used to set and configure RS485 - parameters in the platform data and in ioctls. - - The device tree can also provide RS485 boot time parameters (see [2] - for bindings). 
The driver is in charge of filling this data structure from - the values given by the device tree. - - Any driver for devices capable of working both as RS232 and RS485 should - implement the rs485_config callback in the uart_port structure. The - serial_core calls rs485_config to do the device specific part in response - to TIOCSRS485 and TIOCGRS485 ioctls (see below). The rs485_config callback - receives a pointer to struct serial_rs485. - -4. Usage from user-level -======================== - - From user-level, RS485 configuration can be get/set using the previous - ioctls. For instance, to set RS485 you can use the following code:: - - #include - - /* Include definition for RS485 ioctls: TIOCGRS485 and TIOCSRS485 */ - #include - - /* Open your specific device (e.g., /dev/mydevice): */ - int fd = open ("/dev/mydevice", O_RDWR); - if (fd < 0) { - /* Error handling. See errno. */ - } - - struct serial_rs485 rs485conf; - - /* Enable RS485 mode: */ - rs485conf.flags |= SER_RS485_ENABLED; - - /* Set logical level for RTS pin equal to 1 when sending: */ - rs485conf.flags |= SER_RS485_RTS_ON_SEND; - /* or, set logical level for RTS pin equal to 0 when sending: */ - rs485conf.flags &= ~(SER_RS485_RTS_ON_SEND); - - /* Set logical level for RTS pin equal to 1 after sending: */ - rs485conf.flags |= SER_RS485_RTS_AFTER_SEND; - /* or, set logical level for RTS pin equal to 0 after sending: */ - rs485conf.flags &= ~(SER_RS485_RTS_AFTER_SEND); - - /* Set rts delay before send, if needed: */ - rs485conf.delay_rts_before_send = ...; - - /* Set rts delay after send, if needed: */ - rs485conf.delay_rts_after_send = ...; - - /* Set this flag if you want to receive data even while sending data */ - rs485conf.flags |= SER_RS485_RX_DURING_TX; - - if (ioctl (fd, TIOCSRS485, &rs485conf) < 0) { - /* Error handling. See errno. */ - } - - /* Use read() and write() syscalls here... */ - - /* Close the device when finished: */ - if (close (fd) < 0) { - /* Error handling. See errno. 
*/ - } - -5. References -============= - - [1] include/uapi/linux/serial.h - - [2] Documentation/devicetree/bindings/serial/rs485.txt diff --git a/Documentation/serial/tty.rst b/Documentation/serial/tty.rst deleted file mode 100644 index dd972caacf3e..000000000000 --- a/Documentation/serial/tty.rst +++ /dev/null @@ -1,328 +0,0 @@ -================= -The Lockronomicon -================= - -Your guide to the ancient and twisted locking policies of the tty layer and -the warped logic behind them. Beware all ye who read on. - - -Line Discipline ---------------- - -Line disciplines are registered with tty_register_ldisc() passing the -discipline number and the ldisc structure. At the point of registration the -discipline must be ready to use and it is possible it will get used before -the call returns success. If the call returns an error then it won't get -called. Do not re-use ldisc numbers as they are part of the userspace ABI -and writing over an existing ldisc will cause demons to eat your computer. -After the return the ldisc data has been copied so you may free your own -copy of the structure. You must not re-register over the top of the line -discipline even with the same data or your computer again will be eaten by -demons. - -In order to remove a line discipline call tty_unregister_ldisc(). -In ancient times this always worked. In modern times the function will -return -EBUSY if the ldisc is currently in use. Since the ldisc referencing -code manages the module counts this should not usually be a concern. - -Heed this warning: the reference count field of the registered copies of the -tty_ldisc structure in the ldisc table counts the number of lines using this -discipline. The reference count of the tty_ldisc structure within a tty -counts the number of active users of the ldisc at this instant. In effect it -counts the number of threads of execution within an ldisc method (plus those -about to enter and exit although this detail matters not). 
- -Line Discipline Methods ------------------------ - -TTY side interfaces -^^^^^^^^^^^^^^^^^^^ - -======================= ======================================================= -open() Called when the line discipline is attached to - the terminal. No other call into the line - discipline for this tty will occur until it - completes successfully. Should initialize any - state needed by the ldisc, and set receive_room - in the tty_struct to the maximum amount of data - the line discipline is willing to accept from the - driver with a single call to receive_buf(). - Returning an error will prevent the ldisc from - being attached. Can sleep. - -close() This is called on a terminal when the line - discipline is being unplugged. At the point of - execution no further users will enter the - ldisc code for this tty. Can sleep. - -hangup() Called when the tty line is hung up. - The line discipline should cease I/O to the tty. - No further calls into the ldisc code will occur. - The return value is ignored. Can sleep. - -read() (optional) A process requests reading data from - the line. Multiple read calls may occur in parallel - and the ldisc must deal with serialization issues. - If not defined, the process will receive an EIO - error. May sleep. - -write() (optional) A process requests writing data to the - line. Multiple write calls are serialized by the - tty layer for the ldisc. If not defined, the - process will receive an EIO error. May sleep. - -flush_buffer() (optional) May be called at any point between - open and close, and instructs the line discipline - to empty its input buffer. - -set_termios() (optional) Called on termios structure changes. - The caller passes the old termios data and the - current data is in the tty. Called under the - termios semaphore so allowed to sleep. Serialized - against itself only. - -poll() (optional) Check the status for the poll/select - calls. Multiple poll calls may occur in parallel. - May sleep. 
- -ioctl() (optional) Called when an ioctl is handed to the - tty layer that might be for the ldisc. Multiple - ioctl calls may occur in parallel. May sleep. - -compat_ioctl() (optional) Called when a 32 bit ioctl is handed - to the tty layer that might be for the ldisc. - Multiple ioctl calls may occur in parallel. - May sleep. -======================= ======================================================= - -Driver Side Interfaces -^^^^^^^^^^^^^^^^^^^^^^ - -======================= ======================================================= -receive_buf() (optional) Called by the low-level driver to hand - a buffer of received bytes to the ldisc for - processing. The number of bytes is guaranteed not - to exceed the current value of tty->receive_room. - All bytes must be processed. - -receive_buf2() (optional) Called by the low-level driver to hand - a buffer of received bytes to the ldisc for - processing. Returns the number of bytes processed. - - If both receive_buf() and receive_buf2() are - defined, receive_buf2() should be preferred. - -write_wakeup() May be called at any point between open and close. - The TTY_DO_WRITE_WAKEUP flag indicates if a call - is needed but always races versus calls. Thus the - ldisc must be careful about setting order and to - handle unexpected calls. Must not sleep. - - The driver is forbidden from calling this directly - from the ->write call from the ldisc as the ldisc - is permitted to call the driver write method from - this function. In such a situation defer it. - -dcd_change() Report to the tty line the current DCD pin status - changes and the relative timestamp. The timestamp - cannot be NULL. 
-======================= ======================================================= - - -Driver Access -^^^^^^^^^^^^^ - -Line discipline methods can call the following methods of the underlying -hardware driver through the function pointers within the tty->driver -structure: - -======================= ======================================================= -write() Write a block of characters to the tty device. - Returns the number of characters accepted. The - character buffer passed to this method is already - in kernel space. - -put_char() Queues a character for writing to the tty device. - If there is no room in the queue, the character is - ignored. - -flush_chars() (Optional) If defined, must be called after - queueing characters with put_char() in order to - start transmission. - -write_room() Returns the numbers of characters the tty driver - will accept for queueing to be written. - -ioctl() Invoke device specific ioctl. - Expects data pointers to refer to userspace. - Returns ENOIOCTLCMD for unrecognized ioctl numbers. - -set_termios() Notify the tty driver that the device's termios - settings have changed. New settings are in - tty->termios. Previous settings should be passed in - the "old" argument. - - The API is defined such that the driver should return - the actual modes selected. This means that the - driver function is responsible for modifying any - bits in the request it cannot fulfill to indicate - the actual modes being used. A device with no - hardware capability for change (e.g. a USB dongle or - virtual port) can provide NULL for this method. - -throttle() Notify the tty driver that input buffers for the - line discipline are close to full, and it should - somehow signal that no more characters should be - sent to the tty. - -unthrottle() Notify the tty driver that characters can now be - sent to the tty without fear of overrunning the - input buffers of the line disciplines. 
- -stop() Ask the tty driver to stop outputting characters - to the tty device. - -start() Ask the tty driver to resume sending characters - to the tty device. - -hangup() Ask the tty driver to hang up the tty device. - -break_ctl() (Optional) Ask the tty driver to turn on or off - BREAK status on the RS-232 port. If state is -1, - then the BREAK status should be turned on; if - state is 0, then BREAK should be turned off. - If this routine is not implemented, use ioctls - TIOCSBRK / TIOCCBRK instead. - -wait_until_sent() Waits until the device has written out all of the - characters in its transmitter FIFO. - -send_xchar() Send a high-priority XON/XOFF character to the device. -======================= ======================================================= - - -Flags -^^^^^ - -Line discipline methods have access to tty->flags field containing the -following interesting flags: - -======================= ======================================================= -TTY_THROTTLED Driver input is throttled. The ldisc should call - tty->driver->unthrottle() in order to resume - reception when it is ready to process more data. - -TTY_DO_WRITE_WAKEUP If set, causes the driver to call the ldisc's - write_wakeup() method in order to resume - transmission when it can accept more data - to transmit. - -TTY_IO_ERROR If set, causes all subsequent userspace read/write - calls on the tty to fail, returning -EIO. - -TTY_OTHER_CLOSED Device is a pty and the other side has closed. - -TTY_NO_WRITE_SPLIT Prevent driver from splitting up writes into - smaller chunks. -======================= ======================================================= - - -Locking -^^^^^^^ - -Callers to the line discipline functions from the tty layer are required to -take line discipline locks. The same is true of calls from the driver side -but not yet enforced. - -Three calls are now provided:: - - ldisc = tty_ldisc_ref(tty); - -takes a handle to the line discipline in the tty and returns it. 
If no ldisc -is currently attached or the ldisc is being closed and re-opened at this -point then NULL is returned. While this handle is held the ldisc will not -change or go away:: - - tty_ldisc_deref(ldisc) - -Returns the ldisc reference and allows the ldisc to be closed. Returning the -reference takes away your right to call the ldisc functions until you take -a new reference:: - - ldisc = tty_ldisc_ref_wait(tty); - -Performs the same function as tty_ldisc_ref except that it will wait for an -ldisc change to complete and then return a reference to the new ldisc. - -While these functions are slightly slower than the old code they should have -minimal impact as most receive logic uses the flip buffers and they only -need to take a reference when they push bits up through the driver. - -A caution: The ldisc->open(), ldisc->close() and driver->set_ldisc -functions are called with the ldisc unavailable. Thus tty_ldisc_ref will -fail in this situation if used within these functions. Ldisc and driver -code calling its own functions must be careful in this case. - - -Driver Interface ----------------- - -======================= ======================================================= -open() Called when a device is opened. May sleep - -close() Called when a device is closed. At the point of - return from this call the driver must make no - further ldisc calls of any kind. May sleep - -write() Called to write bytes to the device. May not - sleep. May occur in parallel in special cases. - Because this includes panic paths drivers generally - shouldn't try and do clever locking here. - -put_char() Stuff a single character onto the queue. The - driver is guaranteed following up calls to - flush_chars. - -flush_chars() Ask the kernel to write put_char queue - -write_room() Return the number of characters that can be stuffed - into the port buffers without overflow (or less). 
- The ldisc is responsible for being intelligent - about multi-threading of write_room/write calls - -ioctl() Called when an ioctl may be for the driver - -set_termios() Called on termios change, serialized against - itself by a semaphore. May sleep. - -set_ldisc() Notifier for discipline change. At the point this - is done the discipline is not yet usable. Can now - sleep (I think) - -throttle() Called by the ldisc to ask the driver to do flow - control. Serialization including with unthrottle - is the job of the ldisc layer. - -unthrottle() Called by the ldisc to ask the driver to stop flow - control. - -stop() Ldisc notifier to the driver to stop output. As with - throttle the serializations with start() are down - to the ldisc layer. - -start() Ldisc notifier to the driver to start output. - -hangup() Ask the tty driver to cause a hangup initiated - from the host side. [Can sleep ??] - -break_ctl() Send RS232 break. Can sleep. Can get called in - parallel, driver must serialize (for now), and - with write calls. - -wait_until_sent() Wait for characters to exit the hardware queue - of the driver. Can sleep - -send_xchar() Send XON/XOFF and if possible jump the queue with - it in order to get fast flow control responses. - Cannot sleep ?? -======================= ======================================================= diff --git a/MAINTAINERS b/MAINTAINERS index d1a0a817dd92..4f88bca37c55 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -10767,7 +10767,7 @@ F: include/uapi/linux/meye.h MOXA SMARTIO/INDUSTIO/INTELLIO SERIAL CARD M: Jiri Slaby S: Maintained -F: Documentation/serial/moxa-smartio.rst +F: Documentation/driver-api/serial/moxa-smartio.rst F: drivers/tty/mxser.* MR800 AVERMEDIA USB FM RADIO DRIVER @@ -13689,7 +13689,7 @@ ROCKETPORT DRIVER P: Comtrol Corp. 
W: http://www.comtrol.com S: Maintained -F: Documentation/serial/rocket.rst +F: Documentation/driver-api/serial/rocket.rst F: drivers/tty/rocket* ROCKETPORT EXPRESS/INFINITY DRIVER @@ -16228,7 +16228,7 @@ M: Greg Kroah-Hartman M: Jiri Slaby S: Supported T: git git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty.git -F: Documentation/serial/ +F: Documentation/driver-api/serial/ F: drivers/tty/ F: drivers/tty/serial/serial_core.c F: include/linux/serial_core.h diff --git a/drivers/tty/Kconfig b/drivers/tty/Kconfig index ee51b9514225..c7623f99ac0f 100644 --- a/drivers/tty/Kconfig +++ b/drivers/tty/Kconfig @@ -175,7 +175,7 @@ config ROCKETPORT This driver supports Comtrol RocketPort and RocketModem PCI boards. These boards provide 2, 4, 8, 16, or 32 high-speed serial ports or modems. For information about the RocketPort/RocketModem boards - and this driver read . + and this driver read . To compile this driver as a module, choose M here: the module will be called rocket. @@ -193,7 +193,7 @@ config CYCLADES your Linux box, for instance in order to become a dial-in server. For information about the Cyclades-Z card, read - . + . To compile this driver as a module, choose M here: the module will be called cyclades. 
diff --git a/drivers/tty/serial/ucc_uart.c b/drivers/tty/serial/ucc_uart.c index 6e3c66ab0e62..a0555ae2b1ef 100644 --- a/drivers/tty/serial/ucc_uart.c +++ b/drivers/tty/serial/ucc_uart.c @@ -1081,7 +1081,7 @@ static int qe_uart_verify_port(struct uart_port *port, } /* UART operations * - * Details on these functions can be found in Documentation/serial/driver.rst + * Details on these functions can be found in Documentation/driver-api/serial/driver.rst */ static const struct uart_ops qe_uart_pops = { .tx_empty = qe_uart_tx_empty, diff --git a/include/linux/serial_core.h b/include/linux/serial_core.h index 05b179015d6c..2b78cc734719 100644 --- a/include/linux/serial_core.h +++ b/include/linux/serial_core.h @@ -32,7 +32,7 @@ struct device; /* * This structure describes all the operations that can be done on the - * physical hardware. See Documentation/serial/driver.rst for details. + * physical hardware. See Documentation/driver-api/serial/driver.rst for details. */ struct uart_ops { unsigned int (*tx_empty)(struct uart_port *); -- cgit v1.2.3 From 4745dc8abb0a0a9851c07265eea01d844886d5c8 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 27 Jun 2019 16:36:04 -0300 Subject: docs: phy: place documentation under driver-api This subsystem-specific documentation belongs to the driver-api. 
Signed-off-by: Mauro Carvalho Chehab --- .../devicetree/bindings/phy/phy-bindings.txt | 2 +- .../devicetree/bindings/phy/phy-pxa-usb.txt | 2 +- Documentation/driver-api/index.rst | 1 + Documentation/driver-api/phy/index.rst | 16 ++ Documentation/driver-api/phy/phy.rst | 197 +++++++++++++++++++++ Documentation/driver-api/phy/samsung-usb2.rst | 137 ++++++++++++++ Documentation/index.rst | 1 - Documentation/phy.txt | 197 --------------------- Documentation/phy/samsung-usb2.rst | 137 -------------- MAINTAINERS | 2 +- 10 files changed, 354 insertions(+), 338 deletions(-) create mode 100644 Documentation/driver-api/phy/index.rst create mode 100644 Documentation/driver-api/phy/phy.rst create mode 100644 Documentation/driver-api/phy/samsung-usb2.rst delete mode 100644 Documentation/phy.txt delete mode 100644 Documentation/phy/samsung-usb2.rst (limited to 'Documentation/driver-api') diff --git a/Documentation/devicetree/bindings/phy/phy-bindings.txt b/Documentation/devicetree/bindings/phy/phy-bindings.txt index a403b81d0679..c4eb38902533 100644 --- a/Documentation/devicetree/bindings/phy/phy-bindings.txt +++ b/Documentation/devicetree/bindings/phy/phy-bindings.txt @@ -1,5 +1,5 @@ This document explains only the device tree data binding. For general -information about PHY subsystem refer to Documentation/phy.txt +information about PHY subsystem refer to Documentation/driver-api/phy/phy.rst PHY device node =============== diff --git a/Documentation/devicetree/bindings/phy/phy-pxa-usb.txt b/Documentation/devicetree/bindings/phy/phy-pxa-usb.txt index 93fc09c12954..d80e36a77ec5 100644 --- a/Documentation/devicetree/bindings/phy/phy-pxa-usb.txt +++ b/Documentation/devicetree/bindings/phy/phy-pxa-usb.txt @@ -15,4 +15,4 @@ Example: }; This document explains the device tree binding. 
For general -information about PHY subsystem refer to Documentation/phy.txt +information about PHY subsystem refer to Documentation/driver-api/phy/phy.rst diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index cf39b8f9d0f9..eff22db0ed14 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -85,6 +85,7 @@ available subsections can be seen below. parport-lowlevel pps ptp + phy/index pti_intel_mid pwm rfkill diff --git a/Documentation/driver-api/phy/index.rst b/Documentation/driver-api/phy/index.rst new file mode 100644 index 000000000000..fce9ffae2812 --- /dev/null +++ b/Documentation/driver-api/phy/index.rst @@ -0,0 +1,16 @@ +===================== +Generic PHY Framework +===================== + +.. toctree:: + + phy + samsung-usb2 + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` + diff --git a/Documentation/driver-api/phy/phy.rst b/Documentation/driver-api/phy/phy.rst new file mode 100644 index 000000000000..457c3e0f86d6 --- /dev/null +++ b/Documentation/driver-api/phy/phy.rst @@ -0,0 +1,197 @@ +============= +PHY subsystem +============= + +:Author: Kishon Vijay Abraham I + +This document explains the Generic PHY Framework along with the APIs provided, +and how-to-use. + +Introduction +============ + +*PHY* is the abbreviation for physical layer. It is used to connect a device +to the physical medium e.g., the USB controller has a PHY to provide functions +such as serialization, de-serialization, encoding, decoding and is responsible +for obtaining the required data transmission rate. Note that some USB +controllers have PHY functionality embedded into it and others use an external +PHY. Other peripherals that use PHY include Wireless LAN, Ethernet, +SATA etc. + +The intention of creating this framework is to bring the PHY drivers spread +all over the Linux kernel to drivers/phy to increase code re-use and for +better code maintainability. 
+ +This framework will be of use only to devices that use external PHY (PHY +functionality is not embedded within the controller). + +Registering/Unregistering the PHY provider +========================================== + +PHY provider refers to an entity that implements one or more PHY instances. +For the simple case where the PHY provider implements only a single instance of +the PHY, the framework provides its own implementation of of_xlate in +of_phy_simple_xlate. If the PHY provider implements multiple instances, it +should provide its own implementation of of_xlate. of_xlate is used only for +dt boot case. + +:: + + #define of_phy_provider_register(dev, xlate) \ + __of_phy_provider_register((dev), NULL, THIS_MODULE, (xlate)) + + #define devm_of_phy_provider_register(dev, xlate) \ + __devm_of_phy_provider_register((dev), NULL, THIS_MODULE, + (xlate)) + +of_phy_provider_register and devm_of_phy_provider_register macros can be used to +register the phy_provider and it takes device and of_xlate as +arguments. For the dt boot case, all PHY providers should use one of the above +2 macros to register the PHY provider. + +Often the device tree nodes associated with a PHY provider will contain a set +of children that each represent a single PHY. Some bindings may nest the child +nodes within extra levels for context and extensibility, in which case the low +level of_phy_provider_register_full() and devm_of_phy_provider_register_full() +macros can be used to override the node containing the children. 
+ +:: + + #define of_phy_provider_register_full(dev, children, xlate) \ + __of_phy_provider_register(dev, children, THIS_MODULE, xlate) + + #define devm_of_phy_provider_register_full(dev, children, xlate) \ + __devm_of_phy_provider_register_full(dev, children, + THIS_MODULE, xlate) + + void devm_of_phy_provider_unregister(struct device *dev, + struct phy_provider *phy_provider); + void of_phy_provider_unregister(struct phy_provider *phy_provider); + +devm_of_phy_provider_unregister and of_phy_provider_unregister can be used to +unregister the PHY. + +Creating the PHY +================ + +The PHY driver should create the PHY in order for other peripheral controllers +to make use of it. The PHY framework provides 2 APIs to create the PHY. + +:: + + struct phy *phy_create(struct device *dev, struct device_node *node, + const struct phy_ops *ops); + struct phy *devm_phy_create(struct device *dev, + struct device_node *node, + const struct phy_ops *ops); + +The PHY drivers can use one of the above 2 APIs to create the PHY by passing +the device pointer and phy ops. +phy_ops is a set of function pointers for performing PHY operations such as +init, exit, power_on and power_off. + +In order to dereference the private data (in phy_ops), the phy provider driver +can use phy_set_drvdata() after creating the PHY and use phy_get_drvdata() in +phy_ops to get back the private data. + +4. Getting a reference to the PHY + +Before the controller can make use of the PHY, it has to get a reference to +it. This framework provides the following APIs to get a reference to the PHY.
+ +:: + + struct phy *phy_get(struct device *dev, const char *string); + struct phy *phy_optional_get(struct device *dev, const char *string); + struct phy *devm_phy_get(struct device *dev, const char *string); + struct phy *devm_phy_optional_get(struct device *dev, + const char *string); + struct phy *devm_of_phy_get_by_index(struct device *dev, + struct device_node *np, + int index); + +phy_get, phy_optional_get, devm_phy_get and devm_phy_optional_get can +be used to get the PHY. In the case of dt boot, the string arguments +should contain the phy name as given in the dt data and in the case of +non-dt boot, it should contain the label of the PHY. The two +devm_phy_get associates the device with the PHY using devres on +successful PHY get. On driver detach, release function is invoked on +the devres data and devres data is freed. phy_optional_get and +devm_phy_optional_get should be used when the phy is optional. These +two functions will never return -ENODEV, but instead returns NULL when +the phy cannot be found.Some generic drivers, such as ehci, may use multiple +phys and for such drivers referencing phy(s) by name(s) does not make sense. In +this case, devm_of_phy_get_by_index can be used to get a phy reference based on +the index. + +It should be noted that NULL is a valid phy reference. All phy +consumer calls on the NULL phy become NOPs. That is the release calls, +the phy_init() and phy_exit() calls, and phy_power_on() and +phy_power_off() calls are all NOP when applied to a NULL phy. The NULL +phy is useful in devices for handling optional phy devices. + +Releasing a reference to the PHY +================================ + +When the controller no longer needs the PHY, it has to release the reference +to the PHY it has obtained using the APIs mentioned in the above section. The +PHY framework provides 2 APIs to release a reference to the PHY. 
+ +:: + + void phy_put(struct phy *phy); + void devm_phy_put(struct device *dev, struct phy *phy); + +Both these APIs are used to release a reference to the PHY and devm_phy_put +destroys the devres associated with this PHY. + +Destroying the PHY +================== + +When the driver that created the PHY is unloaded, it should destroy the PHY it +created using one of the following 2 APIs:: + + void phy_destroy(struct phy *phy); + void devm_phy_destroy(struct device *dev, struct phy *phy); + +Both these APIs destroy the PHY and devm_phy_destroy destroys the devres +associated with this PHY. + +PM Runtime +========== + +This subsystem is pm runtime enabled. So while creating the PHY, +pm_runtime_enable of the phy device created by this subsystem is called and +while destroying the PHY, pm_runtime_disable is called. Note that the phy +device created by this subsystem will be a child of the device that calls +phy_create (PHY provider device). + +So pm_runtime_get_sync of the phy_device created by this subsystem will invoke +pm_runtime_get_sync of PHY provider device because of parent-child relationship. +It should also be noted that phy_power_on and phy_power_off performs +phy_pm_runtime_get_sync and phy_pm_runtime_put respectively. +There are exported APIs like phy_pm_runtime_get, phy_pm_runtime_get_sync, +phy_pm_runtime_put, phy_pm_runtime_put_sync, phy_pm_runtime_allow and +phy_pm_runtime_forbid for performing PM operations. + +PHY Mappings +============ + +In order to get reference to a PHY without help from DeviceTree, the framework +offers lookups which can be compared to clkdev that allow clk structures to be +bound to devices. A lookup can be made be made during runtime when a handle to +the struct phy already exists. 
+ +The framework offers the following API for registering and unregistering the +lookups:: + + int phy_create_lookup(struct phy *phy, const char *con_id, + const char *dev_id); + void phy_remove_lookup(struct phy *phy, const char *con_id, + const char *dev_id); + +DeviceTree Binding +================== + +The documentation for PHY dt binding can be found @ +Documentation/devicetree/bindings/phy/phy-bindings.txt diff --git a/Documentation/driver-api/phy/samsung-usb2.rst b/Documentation/driver-api/phy/samsung-usb2.rst new file mode 100644 index 000000000000..c48c8b9797b9 --- /dev/null +++ b/Documentation/driver-api/phy/samsung-usb2.rst @@ -0,0 +1,137 @@ +==================================== +Samsung USB 2.0 PHY adaptation layer +==================================== + +1. Description +-------------- + +The architecture of the USB 2.0 PHY module in Samsung SoCs is similar +among many SoCs. In spite of the similarities it proved difficult to +create a one driver that would fit all these PHY controllers. Often +the differences were minor and were found in particular bits of the +registers of the PHY. In some rare cases the order of register writes or +the PHY powering up process had to be altered. This adaptation layer is +a compromise between having separate drivers and having a single driver +with added support for many special cases. + +2. Files description +-------------------- + +- phy-samsung-usb2.c + This is the main file of the adaptation layer. This file contains + the probe function and provides two callbacks to the Generic PHY + Framework. This two callbacks are used to power on and power off the + phy. They carry out the common work that has to be done on all version + of the PHY module. Depending on which SoC was chosen they execute SoC + specific callbacks. The specific SoC version is selected by choosing + the appropriate compatible string. In addition, this file contains + struct of_device_id definitions for particular SoCs. 
+ +- phy-samsung-usb2.h + This is the include file. It declares the structures used by this + driver. In addition it should contain extern declarations for + structures that describe particular SoCs. + +3. Supporting SoCs +------------------ + +To support a new SoC a new file should be added to the drivers/phy +directory. Each SoC's configuration is stored in an instance of the +struct samsung_usb2_phy_config:: + + struct samsung_usb2_phy_config { + const struct samsung_usb2_common_phy *phys; + int (*rate_to_clk)(unsigned long, u32 *); + unsigned int num_phys; + bool has_mode_switch; + }; + +The num_phys is the number of phys handled by the driver. `*phys` is an +array that contains the configuration for each phy. The has_mode_switch +property is a boolean flag that determines whether the SoC has USB host +and device on a single pair of pins. If so, a special register has to +be modified to change the internal routing of these pins between a USB +device or host module. + +For example the configuration for Exynos 4210 is following:: + + const struct samsung_usb2_phy_config exynos4210_usb2_phy_config = { + .has_mode_switch = 0, + .num_phys = EXYNOS4210_NUM_PHYS, + .phys = exynos4210_phys, + .rate_to_clk = exynos4210_rate_to_clk, + } + +- `int (*rate_to_clk)(unsigned long, u32 *)` + + The rate_to_clk callback is to convert the rate of the clock + used as the reference clock for the PHY module to the value + that should be written in the hardware register. 
+ +The exynos4210_phys configuration array is as follows:: + + static const struct samsung_usb2_common_phy exynos4210_phys[] = { + { + .label = "device", + .id = EXYNOS4210_DEVICE, + .power_on = exynos4210_power_on, + .power_off = exynos4210_power_off, + }, + { + .label = "host", + .id = EXYNOS4210_HOST, + .power_on = exynos4210_power_on, + .power_off = exynos4210_power_off, + }, + { + .label = "hsic0", + .id = EXYNOS4210_HSIC0, + .power_on = exynos4210_power_on, + .power_off = exynos4210_power_off, + }, + { + .label = "hsic1", + .id = EXYNOS4210_HSIC1, + .power_on = exynos4210_power_on, + .power_off = exynos4210_power_off, + }, + {}, + }; + +- `int (*power_on)(struct samsung_usb2_phy_instance *);` + `int (*power_off)(struct samsung_usb2_phy_instance *);` + + These two callbacks are used to power on and power off the phy + by modifying appropriate registers. + +Final change to the driver is adding appropriate compatible value to the +phy-samsung-usb2.c file. In case of Exynos 4210 the following lines were +added to the struct of_device_id samsung_usb2_phy_of_match[] array:: + + #ifdef CONFIG_PHY_EXYNOS4210_USB2 + { + .compatible = "samsung,exynos4210-usb2-phy", + .data = &exynos4210_usb2_phy_config, + }, + #endif + +To add further flexibility to the driver the Kconfig file enables to +include support for selected SoCs in the compiled driver. The Kconfig +entry for Exynos 4210 is following:: + + config PHY_EXYNOS4210_USB2 + bool "Support for Exynos 4210" + depends on PHY_SAMSUNG_USB2 + depends on CPU_EXYNOS4210 + help + Enable USB PHY support for Exynos 4210. This option requires that + Samsung USB 2.0 PHY driver is enabled and means that support for this + particular SoC is compiled in the driver. In case of Exynos 4210 four + phys are available - device, host, HSCI0 and HSCI1. + +The newly created file that supports the new SoC has to be also added to the +Makefile. 
In case of Exynos 4210 the added line is following:: + + obj-$(CONFIG_PHY_EXYNOS4210_USB2) += phy-exynos4210-usb2.o + +After completing these steps the support for the new SoC should be ready. diff --git a/Documentation/index.rst b/Documentation/index.rst index 041ffe442960..dbfec00ba535 100644 --- a/Documentation/index.rst +++ b/Documentation/index.rst @@ -111,7 +111,6 @@ needed). usb/index misc-devices/index mic/index - phy/samsung-usb2 scheduler/index Architecture-specific documentation diff --git a/Documentation/phy.txt b/Documentation/phy.txt deleted file mode 100644 index 457c3e0f86d6..000000000000 --- a/Documentation/phy.txt +++ /dev/null @@ -1,197 +0,0 @@ -============= -PHY subsystem -============= - -:Author: Kishon Vijay Abraham I - -This document explains the Generic PHY Framework along with the APIs provided, -and how-to-use. - -Introduction -============ - -*PHY* is the abbreviation for physical layer. It is used to connect a device -to the physical medium e.g., the USB controller has a PHY to provide functions -such as serialization, de-serialization, encoding, decoding and is responsible -for obtaining the required data transmission rate. Note that some USB -controllers have PHY functionality embedded into it and others use an external -PHY. Other peripherals that use PHY include Wireless LAN, Ethernet, -SATA etc. - -The intention of creating this framework is to bring the PHY drivers spread -all over the Linux kernel to drivers/phy to increase code re-use and for -better code maintainability. - -This framework will be of use only to devices that use external PHY (PHY -functionality is not embedded within the controller). - -Registering/Unregistering the PHY provider -========================================== - -PHY provider refers to an entity that implements one or more PHY instances. 
-For the simple case where the PHY provider implements only a single instance of -the PHY, the framework provides its own implementation of of_xlate in -of_phy_simple_xlate. If the PHY provider implements multiple instances, it -should provide its own implementation of of_xlate. of_xlate is used only for -dt boot case. - -:: - - #define of_phy_provider_register(dev, xlate) \ - __of_phy_provider_register((dev), NULL, THIS_MODULE, (xlate)) - - #define devm_of_phy_provider_register(dev, xlate) \ - __devm_of_phy_provider_register((dev), NULL, THIS_MODULE, - (xlate)) - -of_phy_provider_register and devm_of_phy_provider_register macros can be used to -register the phy_provider and it takes device and of_xlate as -arguments. For the dt boot case, all PHY providers should use one of the above -2 macros to register the PHY provider. - -Often the device tree nodes associated with a PHY provider will contain a set -of children that each represent a single PHY. Some bindings may nest the child -nodes within extra levels for context and extensibility, in which case the low -level of_phy_provider_register_full() and devm_of_phy_provider_register_full() -macros can be used to override the node containing the children. - -:: - - #define of_phy_provider_register_full(dev, children, xlate) \ - __of_phy_provider_register(dev, children, THIS_MODULE, xlate) - - #define devm_of_phy_provider_register_full(dev, children, xlate) \ - __devm_of_phy_provider_register_full(dev, children, - THIS_MODULE, xlate) - - void devm_of_phy_provider_unregister(struct device *dev, - struct phy_provider *phy_provider); - void of_phy_provider_unregister(struct phy_provider *phy_provider); - -devm_of_phy_provider_unregister and of_phy_provider_unregister can be used to -unregister the PHY. - -Creating the PHY -================ - -The PHY driver should create the PHY in order for other peripheral controllers -to make use of it. The PHY framework provides 2 APIs to create the PHY. 
- -:: - - struct phy *phy_create(struct device *dev, struct device_node *node, - const struct phy_ops *ops); - struct phy *devm_phy_create(struct device *dev, - struct device_node *node, - const struct phy_ops *ops); - -The PHY drivers can use one of the above 2 APIs to create the PHY by passing -the device pointer and phy ops. -phy_ops is a set of function pointers for performing PHY operations such as -init, exit, power_on and power_off. - -Inorder to dereference the private data (in phy_ops), the phy provider driver -can use phy_set_drvdata() after creating the PHY and use phy_get_drvdata() in -phy_ops to get back the private data. - -4. Getting a reference to the PHY - -Before the controller can make use of the PHY, it has to get a reference to -it. This framework provides the following APIs to get a reference to the PHY. - -:: - - struct phy *phy_get(struct device *dev, const char *string); - struct phy *phy_optional_get(struct device *dev, const char *string); - struct phy *devm_phy_get(struct device *dev, const char *string); - struct phy *devm_phy_optional_get(struct device *dev, - const char *string); - struct phy *devm_of_phy_get_by_index(struct device *dev, - struct device_node *np, - int index); - -phy_get, phy_optional_get, devm_phy_get and devm_phy_optional_get can -be used to get the PHY. In the case of dt boot, the string arguments -should contain the phy name as given in the dt data and in the case of -non-dt boot, it should contain the label of the PHY. The two -devm_phy_get associates the device with the PHY using devres on -successful PHY get. On driver detach, release function is invoked on -the devres data and devres data is freed. phy_optional_get and -devm_phy_optional_get should be used when the phy is optional. 
These -two functions will never return -ENODEV, but instead returns NULL when -the phy cannot be found.Some generic drivers, such as ehci, may use multiple -phys and for such drivers referencing phy(s) by name(s) does not make sense. In -this case, devm_of_phy_get_by_index can be used to get a phy reference based on -the index. - -It should be noted that NULL is a valid phy reference. All phy -consumer calls on the NULL phy become NOPs. That is the release calls, -the phy_init() and phy_exit() calls, and phy_power_on() and -phy_power_off() calls are all NOP when applied to a NULL phy. The NULL -phy is useful in devices for handling optional phy devices. - -Releasing a reference to the PHY -================================ - -When the controller no longer needs the PHY, it has to release the reference -to the PHY it has obtained using the APIs mentioned in the above section. The -PHY framework provides 2 APIs to release a reference to the PHY. - -:: - - void phy_put(struct phy *phy); - void devm_phy_put(struct device *dev, struct phy *phy); - -Both these APIs are used to release a reference to the PHY and devm_phy_put -destroys the devres associated with this PHY. - -Destroying the PHY -================== - -When the driver that created the PHY is unloaded, it should destroy the PHY it -created using one of the following 2 APIs:: - - void phy_destroy(struct phy *phy); - void devm_phy_destroy(struct device *dev, struct phy *phy); - -Both these APIs destroy the PHY and devm_phy_destroy destroys the devres -associated with this PHY. - -PM Runtime -========== - -This subsystem is pm runtime enabled. So while creating the PHY, -pm_runtime_enable of the phy device created by this subsystem is called and -while destroying the PHY, pm_runtime_disable is called. Note that the phy -device created by this subsystem will be a child of the device that calls -phy_create (PHY provider device). 
- -So pm_runtime_get_sync of the phy_device created by this subsystem will invoke -pm_runtime_get_sync of PHY provider device because of parent-child relationship. -It should also be noted that phy_power_on and phy_power_off performs -phy_pm_runtime_get_sync and phy_pm_runtime_put respectively. -There are exported APIs like phy_pm_runtime_get, phy_pm_runtime_get_sync, -phy_pm_runtime_put, phy_pm_runtime_put_sync, phy_pm_runtime_allow and -phy_pm_runtime_forbid for performing PM operations. - -PHY Mappings -============ - -In order to get reference to a PHY without help from DeviceTree, the framework -offers lookups which can be compared to clkdev that allow clk structures to be -bound to devices. A lookup can be made be made during runtime when a handle to -the struct phy already exists. - -The framework offers the following API for registering and unregistering the -lookups:: - - int phy_create_lookup(struct phy *phy, const char *con_id, - const char *dev_id); - void phy_remove_lookup(struct phy *phy, const char *con_id, - const char *dev_id); - -DeviceTree Binding -================== - -The documentation for PHY dt binding can be found @ -Documentation/devicetree/bindings/phy/phy-bindings.txt diff --git a/Documentation/phy/samsung-usb2.rst b/Documentation/phy/samsung-usb2.rst deleted file mode 100644 index c48c8b9797b9..000000000000 --- a/Documentation/phy/samsung-usb2.rst +++ /dev/null @@ -1,137 +0,0 @@ -==================================== -Samsung USB 2.0 PHY adaptation layer -==================================== - -1. Description --------------- - -The architecture of the USB 2.0 PHY module in Samsung SoCs is similar -among many SoCs. In spite of the similarities it proved difficult to -create a one driver that would fit all these PHY controllers. Often -the differences were minor and were found in particular bits of the -registers of the PHY. In some rare cases the order of register writes or -the PHY powering up process had to be altered. 
This adaptation layer is -a compromise between having separate drivers and having a single driver -with added support for many special cases. - -2. Files description --------------------- - -- phy-samsung-usb2.c - This is the main file of the adaptation layer. This file contains - the probe function and provides two callbacks to the Generic PHY - Framework. This two callbacks are used to power on and power off the - phy. They carry out the common work that has to be done on all version - of the PHY module. Depending on which SoC was chosen they execute SoC - specific callbacks. The specific SoC version is selected by choosing - the appropriate compatible string. In addition, this file contains - struct of_device_id definitions for particular SoCs. - -- phy-samsung-usb2.h - This is the include file. It declares the structures used by this - driver. In addition it should contain extern declarations for - structures that describe particular SoCs. - -3. Supporting SoCs ------------------- - -To support a new SoC a new file should be added to the drivers/phy -directory. Each SoC's configuration is stored in an instance of the -struct samsung_usb2_phy_config:: - - struct samsung_usb2_phy_config { - const struct samsung_usb2_common_phy *phys; - int (*rate_to_clk)(unsigned long, u32 *); - unsigned int num_phys; - bool has_mode_switch; - }; - -The num_phys is the number of phys handled by the driver. `*phys` is an -array that contains the configuration for each phy. The has_mode_switch -property is a boolean flag that determines whether the SoC has USB host -and device on a single pair of pins. If so, a special register has to -be modified to change the internal routing of these pins between a USB -device or host module. 
- -For example the configuration for Exynos 4210 is following:: - - const struct samsung_usb2_phy_config exynos4210_usb2_phy_config = { - .has_mode_switch = 0, - .num_phys = EXYNOS4210_NUM_PHYS, - .phys = exynos4210_phys, - .rate_to_clk = exynos4210_rate_to_clk, - } - -- `int (*rate_to_clk)(unsigned long, u32 *)` - - The rate_to_clk callback is to convert the rate of the clock - used as the reference clock for the PHY module to the value - that should be written in the hardware register. - -The exynos4210_phys configuration array is as follows:: - - static const struct samsung_usb2_common_phy exynos4210_phys[] = { - { - .label = "device", - .id = EXYNOS4210_DEVICE, - .power_on = exynos4210_power_on, - .power_off = exynos4210_power_off, - }, - { - .label = "host", - .id = EXYNOS4210_HOST, - .power_on = exynos4210_power_on, - .power_off = exynos4210_power_off, - }, - { - .label = "hsic0", - .id = EXYNOS4210_HSIC0, - .power_on = exynos4210_power_on, - .power_off = exynos4210_power_off, - }, - { - .label = "hsic1", - .id = EXYNOS4210_HSIC1, - .power_on = exynos4210_power_on, - .power_off = exynos4210_power_off, - }, - {}, - }; - -- `int (*power_on)(struct samsung_usb2_phy_instance *);` - `int (*power_off)(struct samsung_usb2_phy_instance *);` - - These two callbacks are used to power on and power off the phy - by modifying appropriate registers. - -Final change to the driver is adding appropriate compatible value to the -phy-samsung-usb2.c file. In case of Exynos 4210 the following lines were -added to the struct of_device_id samsung_usb2_phy_of_match[] array:: - - #ifdef CONFIG_PHY_EXYNOS4210_USB2 - { - .compatible = "samsung,exynos4210-usb2-phy", - .data = &exynos4210_usb2_phy_config, - }, - #endif - -To add further flexibility to the driver the Kconfig file enables to -include support for selected SoCs in the compiled driver. 
The Kconfig -entry for Exynos 4210 is following:: - - config PHY_EXYNOS4210_USB2 - bool "Support for Exynos 4210" - depends on PHY_SAMSUNG_USB2 - depends on CPU_EXYNOS4210 - help - Enable USB PHY support for Exynos 4210. This option requires that - Samsung USB 2.0 PHY driver is enabled and means that support for this - particular SoC is compiled in the driver. In case of Exynos 4210 four - phys are available - device, host, HSCI0 and HSCI1. - -The newly created file that supports the new SoC has to be also added to the -Makefile. In case of Exynos 4210 the added line is following:: - - obj-$(CONFIG_PHY_EXYNOS4210_USB2) += phy-exynos4210-usb2.o - -After completing these steps the support for the new SoC should be ready. diff --git a/MAINTAINERS b/MAINTAINERS index 4f88bca37c55..6571653ecb40 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14083,7 +14083,7 @@ M: Sylwester Nawrocki L: linux-kernel@vger.kernel.org S: Supported F: Documentation/devicetree/bindings/phy/samsung-phy.txt -F: Documentation/phy/samsung-usb2.rst +F: Documentation/driver-api/phy/samsung-usb2.rst F: drivers/phy/samsung/phy-exynos4210-usb2.c F: drivers/phy/samsung/phy-exynos4x12-usb2.c F: drivers/phy/samsung/phy-exynos5250-usb2.c -- cgit v1.2.3 From 652a49bc68ce3cf0355bde357b3998bd63e73915 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 18 Jun 2019 15:03:13 -0300 Subject: docs: add a memory-devices subdir to driver-api There are two docs describing memory device drivers. Add both to this new chapter of the driver-api. 
Signed-off-by: Mauro Carvalho Chehab --- Documentation/bus-devices/ti-gpmc.rst | 179 --------------------- Documentation/driver-api/index.rst | 1 + Documentation/driver-api/memory-devices/index.rst | 16 ++ .../driver-api/memory-devices/ti-emif.rst | 64 ++++++++ .../driver-api/memory-devices/ti-gpmc.rst | 179 +++++++++++++++++++++ Documentation/memory-devices/ti-emif.rst | 64 -------- 6 files changed, 260 insertions(+), 243 deletions(-) delete mode 100644 Documentation/bus-devices/ti-gpmc.rst create mode 100644 Documentation/driver-api/memory-devices/index.rst create mode 100644 Documentation/driver-api/memory-devices/ti-emif.rst create mode 100644 Documentation/driver-api/memory-devices/ti-gpmc.rst delete mode 100644 Documentation/memory-devices/ti-emif.rst diff --git a/Documentation/bus-devices/ti-gpmc.rst b/Documentation/bus-devices/ti-gpmc.rst deleted file mode 100644 index 87c366e418be..000000000000 --- a/Documentation/bus-devices/ti-gpmc.rst +++ /dev/null @@ -1,179 +0,0 @@ -:orphan: - -======================================== -GPMC (General Purpose Memory Controller) -======================================== - -GPMC is an unified memory controller dedicated to interfacing external -memory devices like - - * Asynchronous SRAM like memories and application specific integrated - circuit devices. - * Asynchronous, synchronous, and page mode burst NOR flash devices - NAND flash - * Pseudo-SRAM devices - -GPMC is found on Texas Instruments SoC's (OMAP based) -IP details: http://www.ti.com/lit/pdf/spruh73 section 7.1 - - -GPMC generic timing calculation: -================================ - -GPMC has certain timings that has to be programmed for proper -functioning of the peripheral, while peripheral has another set of -timings. To have peripheral work with gpmc, peripheral timings has to -be translated to the form gpmc can understand. The way it has to be -translated depends on the connected peripheral.
Also there is a -dependency for certain gpmc timings on gpmc clock frequency. Hence a -generic timing routine was developed to achieve above requirements. - -Generic routine provides a generic method to calculate gpmc timings -from gpmc peripheral timings. struct gpmc_device_timings fields has to -be updated with timings from the datasheet of the peripheral that is -connected to gpmc. A few of the peripheral timings can be fed either -in time or in cycles, provision to handle this scenario has been -provided (refer struct gpmc_device_timings definition). It may so -happen that timing as specified by peripheral datasheet is not present -in timing structure, in this scenario, try to correlate peripheral -timing to the one available. If that doesn't work, try to add a new -field as required by peripheral, educate generic timing routine to -handle it, make sure that it does not break any of the existing. -Then there may be cases where peripheral datasheet doesn't mention -certain fields of struct gpmc_device_timings, zero those entries. - -Generic timing routine has been verified to work properly on -multiple onenand's and tusb6010 peripherals. - -A word of caution: generic timing routine has been developed based -on understanding of gpmc timings, peripheral timings, available -custom timing routines, a kind of reverse engineering without -most of the datasheets & hardware (to be exact none of those supported -in mainline having custom timing routine) and by simulation. - -gpmc timing dependency on peripheral timings: - -[<gpmc_timing>: <peripheral timing1>, <peripheral timing2> ...] - -1. common - -cs_on: - t_ceasu -adv_on: - t_avdasu, t_ceavd - -2. sync common - -sync_clk: - clk -page_burst_access: - t_bacc -clk_activation: - t_ces, t_avds - -3. read async muxed - -adv_rd_off: - t_avdp_r -oe_on: - t_oeasu, t_aavdh -access: - t_iaa, t_oe, t_ce, t_aa -rd_cycle: - t_rd_cycle, t_cez_r, t_oez - -4.
read async non-muxed - -adv_rd_off: - t_avdp_r -oe_on: - t_oeasu -access: - t_iaa, t_oe, t_ce, t_aa -rd_cycle: - t_rd_cycle, t_cez_r, t_oez - -5. read sync muxed - -adv_rd_off: - t_avdp_r, t_avdh -oe_on: - t_oeasu, t_ach, cyc_aavdh_oe -access: - t_iaa, cyc_iaa, cyc_oe -rd_cycle: - t_cez_r, t_oez, t_ce_rdyz - -6. read sync non-muxed - -adv_rd_off: - t_avdp_r -oe_on: - t_oeasu -access: - t_iaa, cyc_iaa, cyc_oe -rd_cycle: - t_cez_r, t_oez, t_ce_rdyz - -7. write async muxed - -adv_wr_off: - t_avdp_w -we_on, wr_data_mux_bus: - t_weasu, t_aavdh, cyc_aavhd_we -we_off: - t_wpl -cs_wr_off: - t_wph -wr_cycle: - t_cez_w, t_wr_cycle - -8. write async non-muxed - -adv_wr_off: - t_avdp_w -we_on, wr_data_mux_bus: - t_weasu -we_off: - t_wpl -cs_wr_off: - t_wph -wr_cycle: - t_cez_w, t_wr_cycle - -9. write sync muxed - -adv_wr_off: - t_avdp_w, t_avdh -we_on, wr_data_mux_bus: - t_weasu, t_rdyo, t_aavdh, cyc_aavhd_we -we_off: - t_wpl, cyc_wpl -cs_wr_off: - t_wph -wr_cycle: - t_cez_w, t_ce_rdyz - -10. write sync non-muxed - -adv_wr_off: - t_avdp_w -we_on, wr_data_mux_bus: - t_weasu, t_rdyo -we_off: - t_wpl, cyc_wpl -cs_wr_off: - t_wph -wr_cycle: - t_cez_w, t_ce_rdyz - - -Note: - Many of gpmc timings are dependent on other gpmc timings (a few - gpmc timings purely dependent on other gpmc timings, a reason that - some of the gpmc timings are missing above), and it will result in - indirect dependency of peripheral timings to gpmc timings other than - mentioned above, refer timing routine for more details. To know what - these peripheral timings correspond to, please see explanations in - struct gpmc_device_timings definition. And for gpmc timings refer - IP details (link above). diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index eff22db0ed14..d12a80f386a6 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -79,6 +79,7 @@ available subsections can be seen below. 
isapnp generic-counter lightnvm-pblk + memory-devices/index men-chameleon-bus ntb nvmem diff --git a/Documentation/driver-api/memory-devices/index.rst b/Documentation/driver-api/memory-devices/index.rst new file mode 100644 index 000000000000..87549828f6ab --- /dev/null +++ b/Documentation/driver-api/memory-devices/index.rst @@ -0,0 +1,16 @@ +========================= +Memory Controller drivers +========================= + +.. toctree:: + :maxdepth: 1 + + ti-emif + ti-gpmc + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/driver-api/memory-devices/ti-emif.rst b/Documentation/driver-api/memory-devices/ti-emif.rst new file mode 100644 index 000000000000..dea2ad9bcd7e --- /dev/null +++ b/Documentation/driver-api/memory-devices/ti-emif.rst @@ -0,0 +1,64 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============================== +TI EMIF SDRAM Controller Driver +=============================== + +Author +====== +Aneesh V + +Location +======== +driver/memory/emif.c + +Supported SoCs: +=============== +TI OMAP44xx +TI OMAP54xx + +Menuconfig option: +================== +Device Drivers + Memory devices + Texas Instruments EMIF driver + +Description +=========== +This driver is for the EMIF module available in Texas Instruments +SoCs. EMIF is an SDRAM controller that, based on its revision, +supports one or more of DDR2, DDR3, and LPDDR2 SDRAM protocols. +This driver takes care of only LPDDR2 memories presently. 
The +functions of the driver includes re-configuring AC timing +parameters and other settings during frequency, voltage and +temperature changes + +Platform Data (see include/linux/platform_data/emif_plat.h) +=========================================================== +DDR device details and other board dependent and SoC dependent +information can be passed through platform data (struct emif_platform_data) + +- DDR device details: 'struct ddr_device_info' +- Device AC timings: 'struct lpddr2_timings' and 'struct lpddr2_min_tck' +- Custom configurations: customizable policy options through + 'struct emif_custom_configs' +- IP revision +- PHY type + +Interface to the external world +=============================== +EMIF driver registers notifiers for voltage and frequency changes +affecting EMIF and takes appropriate actions when these are invoked. + +- freq_pre_notify_handling() +- freq_post_notify_handling() +- volt_notify_handling() + +Debugfs +======= +The driver creates two debugfs entries per device. + +- regcache_dump : dump of register values calculated and saved for all + frequencies used so far. +- mr4 : last polled value of MR4 register in the LPDDR2 device. MR4 + indicates the current temperature level of the device. diff --git a/Documentation/driver-api/memory-devices/ti-gpmc.rst b/Documentation/driver-api/memory-devices/ti-gpmc.rst new file mode 100644 index 000000000000..33efcb81f080 --- /dev/null +++ b/Documentation/driver-api/memory-devices/ti-gpmc.rst @@ -0,0 +1,179 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================================== +GPMC (General Purpose Memory Controller) +======================================== + +GPMC is an unified memory controller dedicated to interfacing external +memory devices like + + * Asynchronous SRAM like memories and application specific integrated + circuit devices. 
+ * Asynchronous, synchronous, and page mode burst NOR flash devices + NAND flash + * Pseudo-SRAM devices + +GPMC is found on Texas Instruments SoC's (OMAP based) +IP details: http://www.ti.com/lit/pdf/spruh73 section 7.1 + + +GPMC generic timing calculation: +================================ + +GPMC has certain timings that has to be programmed for proper +functioning of the peripheral, while peripheral has another set of +timings. To have peripheral work with gpmc, peripheral timings has to +be translated to the form gpmc can understand. The way it has to be +translated depends on the connected peripheral. Also there is a +dependency for certain gpmc timings on gpmc clock frequency. Hence a +generic timing routine was developed to achieve above requirements. + +Generic routine provides a generic method to calculate gpmc timings +from gpmc peripheral timings. struct gpmc_device_timings fields has to +be updated with timings from the datasheet of the peripheral that is +connected to gpmc. A few of the peripheral timings can be fed either +in time or in cycles, provision to handle this scenario has been +provided (refer struct gpmc_device_timings definition). It may so +happen that timing as specified by peripheral datasheet is not present +in timing structure, in this scenario, try to correlate peripheral +timing to the one available. If that doesn't work, try to add a new +field as required by peripheral, educate generic timing routine to +handle it, make sure that it does not break any of the existing. +Then there may be cases where peripheral datasheet doesn't mention +certain fields of struct gpmc_device_timings, zero those entries. + +Generic timing routine has been verified to work properly on +multiple onenand's and tusb6010 peripherals. 
+ +A word of caution: generic timing routine has been developed based +on understanding of gpmc timings, peripheral timings, available +custom timing routines, a kind of reverse engineering without +most of the datasheets & hardware (to be exact none of those supported +in mainline having custom timing routine) and by simulation. + +gpmc timing dependency on peripheral timings: + +[<gpmc_timing>: <peripheral timing1>, <peripheral timing2> ...] + +1. common + +cs_on: + t_ceasu +adv_on: + t_avdasu, t_ceavd + +2. sync common + +sync_clk: + clk +page_burst_access: + t_bacc +clk_activation: + t_ces, t_avds + +3. read async muxed + +adv_rd_off: + t_avdp_r +oe_on: + t_oeasu, t_aavdh +access: + t_iaa, t_oe, t_ce, t_aa +rd_cycle: + t_rd_cycle, t_cez_r, t_oez + +4. read async non-muxed + +adv_rd_off: + t_avdp_r +oe_on: + t_oeasu +access: + t_iaa, t_oe, t_ce, t_aa +rd_cycle: + t_rd_cycle, t_cez_r, t_oez + +5. read sync muxed + +adv_rd_off: + t_avdp_r, t_avdh +oe_on: + t_oeasu, t_ach, cyc_aavdh_oe +access: + t_iaa, cyc_iaa, cyc_oe +rd_cycle: + t_cez_r, t_oez, t_ce_rdyz + +6. read sync non-muxed + +adv_rd_off: + t_avdp_r +oe_on: + t_oeasu +access: + t_iaa, cyc_iaa, cyc_oe +rd_cycle: + t_cez_r, t_oez, t_ce_rdyz + +7. write async muxed + +adv_wr_off: + t_avdp_w +we_on, wr_data_mux_bus: + t_weasu, t_aavdh, cyc_aavhd_we +we_off: + t_wpl +cs_wr_off: + t_wph +wr_cycle: + t_cez_w, t_wr_cycle + +8. write async non-muxed + +adv_wr_off: + t_avdp_w +we_on, wr_data_mux_bus: + t_weasu +we_off: + t_wpl +cs_wr_off: + t_wph +wr_cycle: + t_cez_w, t_wr_cycle + +9. write sync muxed + +adv_wr_off: + t_avdp_w, t_avdh +we_on, wr_data_mux_bus: + t_weasu, t_rdyo, t_aavdh, cyc_aavhd_we +we_off: + t_wpl, cyc_wpl +cs_wr_off: + t_wph +wr_cycle: + t_cez_w, t_ce_rdyz + +10.
write sync non-muxed + +adv_wr_off: + t_avdp_w +we_on, wr_data_mux_bus: + t_weasu, t_rdyo +we_off: + t_wpl, cyc_wpl +cs_wr_off: + t_wph +wr_cycle: + t_cez_w, t_ce_rdyz + + +Note: + Many of gpmc timings are dependent on other gpmc timings (a few + gpmc timings purely dependent on other gpmc timings, a reason that + some of the gpmc timings are missing above), and it will result in + indirect dependency of peripheral timings to gpmc timings other than + mentioned above, refer timing routine for more details. To know what + these peripheral timings correspond to, please see explanations in + struct gpmc_device_timings definition. And for gpmc timings refer + IP details (link above). diff --git a/Documentation/memory-devices/ti-emif.rst b/Documentation/memory-devices/ti-emif.rst deleted file mode 100644 index c9242294e63c..000000000000 --- a/Documentation/memory-devices/ti-emif.rst +++ /dev/null @@ -1,64 +0,0 @@ -:orphan: - -=============================== -TI EMIF SDRAM Controller Driver -=============================== - -Author -====== -Aneesh V - -Location -======== -driver/memory/emif.c - -Supported SoCs: -=============== -TI OMAP44xx -TI OMAP54xx - -Menuconfig option: -================== -Device Drivers - Memory devices - Texas Instruments EMIF driver - -Description -=========== -This driver is for the EMIF module available in Texas Instruments -SoCs. EMIF is an SDRAM controller that, based on its revision, -supports one or more of DDR2, DDR3, and LPDDR2 SDRAM protocols. -This driver takes care of only LPDDR2 memories presently. 
The -functions of the driver includes re-configuring AC timing -parameters and other settings during frequency, voltage and -temperature changes - -Platform Data (see include/linux/platform_data/emif_plat.h) -=========================================================== -DDR device details and other board dependent and SoC dependent -information can be passed through platform data (struct emif_platform_data) - -- DDR device details: 'struct ddr_device_info' -- Device AC timings: 'struct lpddr2_timings' and 'struct lpddr2_min_tck' -- Custom configurations: customizable policy options through - 'struct emif_custom_configs' -- IP revision -- PHY type - -Interface to the external world -=============================== -EMIF driver registers notifiers for voltage and frequency changes -affecting EMIF and takes appropriate actions when these are invoked. - -- freq_pre_notify_handling() -- freq_post_notify_handling() -- volt_notify_handling() - -Debugfs -======= -The driver creates two debugfs entries per device. - -- regcache_dump : dump of register values calculated and saved for all - frequencies used so far. -- mr4 : last polled value of MR4 register in the LPDDR2 device. MR4 - indicates the current temperature level of the device. -- cgit v1.2.3 From 7e042736faab9457dd754668b9db2a1113cd322b Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Fri, 28 Jun 2019 07:13:34 -0300 Subject: docs: add SPDX tags to new index files All those new files I added are under GPL v2.0 license. Add the corresponding SPDX headers to them. 
Signed-off-by: Mauro Carvalho Chehab --- Documentation/admin-guide/blockdev/drbd/figures.rst | 2 ++ Documentation/admin-guide/blockdev/index.rst | 2 ++ Documentation/admin-guide/laptops/index.rst | 1 + Documentation/admin-guide/namespaces/index.rst | 2 ++ Documentation/admin-guide/perf/index.rst | 2 ++ Documentation/arm/index.rst | 2 ++ Documentation/arm/nwfpe/index.rst | 2 ++ Documentation/arm/omap/index.rst | 2 ++ Documentation/arm/sa1100/index.rst | 2 ++ Documentation/arm/samsung-s3c24xx/index.rst | 2 ++ Documentation/arm/samsung/index.rst | 2 ++ Documentation/driver-api/early-userspace/index.rst | 2 ++ Documentation/driver-api/md/index.rst | 2 ++ Documentation/driver-api/memory-devices/index.rst | 2 ++ Documentation/driver-api/mmc/index.rst | 2 ++ Documentation/driver-api/mtd/index.rst | 2 ++ Documentation/driver-api/nfc/index.rst | 2 ++ Documentation/driver-api/nvdimm/index.rst | 2 ++ Documentation/driver-api/phy/index.rst | 2 ++ Documentation/driver-api/rapidio/index.rst | 2 ++ Documentation/ia64/index.rst | 2 ++ 21 files changed, 41 insertions(+) (limited to 'Documentation/driver-api') diff --git a/Documentation/admin-guide/blockdev/drbd/figures.rst b/Documentation/admin-guide/blockdev/drbd/figures.rst index 3e3fd4b8a478..bd9a4901fe46 100644 --- a/Documentation/admin-guide/blockdev/drbd/figures.rst +++ b/Documentation/admin-guide/blockdev/drbd/figures.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + .. The here included files are intended to help understand the implementation Data flows that Relate some functions, and write packets diff --git a/Documentation/admin-guide/blockdev/index.rst b/Documentation/admin-guide/blockdev/index.rst index 20a738d9d047..b903cf152091 100644 --- a/Documentation/admin-guide/blockdev/index.rst +++ b/Documentation/admin-guide/blockdev/index.rst @@ -1,3 +1,5 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + =========================== The Linux RapidIO Subsystem =========================== diff --git a/Documentation/admin-guide/laptops/index.rst b/Documentation/admin-guide/laptops/index.rst index 6b554e39863b..cd9a1c2695fd 100644 --- a/Documentation/admin-guide/laptops/index.rst +++ b/Documentation/admin-guide/laptops/index.rst @@ -1,3 +1,4 @@ +.. SPDX-License-Identifier: GPL-2.0 ============== Laptop Drivers diff --git a/Documentation/admin-guide/namespaces/index.rst b/Documentation/admin-guide/namespaces/index.rst index 713ec4949fa7..384f2e0f33d2 100644 --- a/Documentation/admin-guide/namespaces/index.rst +++ b/Documentation/admin-guide/namespaces/index.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ========== Namespaces ========== diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst index 9d445451ea18..ee4bfd2a740f 100644 --- a/Documentation/admin-guide/perf/index.rst +++ b/Documentation/admin-guide/perf/index.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + =========================== Performance monitor support =========================== diff --git a/Documentation/arm/index.rst b/Documentation/arm/index.rst index 9c2f781f4685..5fc072dd0c5e 100644 --- a/Documentation/arm/index.rst +++ b/Documentation/arm/index.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ================ ARM Architecture ================ diff --git a/Documentation/arm/nwfpe/index.rst b/Documentation/arm/nwfpe/index.rst index 21fa8ce192ae..3c4d2f9aa10e 100644 --- a/Documentation/arm/nwfpe/index.rst +++ b/Documentation/arm/nwfpe/index.rst @@ -1,3 +1,5 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + =================================== NetWinder's floating point emulator =================================== diff --git a/Documentation/arm/omap/index.rst b/Documentation/arm/omap/index.rst index f1e9c11d9f9b..8b365b212e49 100644 --- a/Documentation/arm/omap/index.rst +++ b/Documentation/arm/omap/index.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ======= TI OMAP ======= diff --git a/Documentation/arm/sa1100/index.rst b/Documentation/arm/sa1100/index.rst index fb2385b3accf..68c2a280a745 100644 --- a/Documentation/arm/sa1100/index.rst +++ b/Documentation/arm/sa1100/index.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ==================== Intel StrongARM 1100 ==================== diff --git a/Documentation/arm/samsung-s3c24xx/index.rst b/Documentation/arm/samsung-s3c24xx/index.rst index 6c7b241cbf37..5b8a7f9398d8 100644 --- a/Documentation/arm/samsung-s3c24xx/index.rst +++ b/Documentation/arm/samsung-s3c24xx/index.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ========================== Samsung S3C24XX SoC Family ========================== diff --git a/Documentation/arm/samsung/index.rst b/Documentation/arm/samsung/index.rst index f54d95734362..8142cce3d23e 100644 --- a/Documentation/arm/samsung/index.rst +++ b/Documentation/arm/samsung/index.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + =========== Samsung SoC =========== diff --git a/Documentation/driver-api/early-userspace/index.rst b/Documentation/driver-api/early-userspace/index.rst index 6f20c3c560d8..149c1822f06d 100644 --- a/Documentation/driver-api/early-userspace/index.rst +++ b/Documentation/driver-api/early-userspace/index.rst @@ -1,3 +1,5 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + =============== Early Userspace =============== diff --git a/Documentation/driver-api/md/index.rst b/Documentation/driver-api/md/index.rst index 205080891a1a..18f54a7d7d6e 100644 --- a/Documentation/driver-api/md/index.rst +++ b/Documentation/driver-api/md/index.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ==== RAID ==== diff --git a/Documentation/driver-api/memory-devices/index.rst b/Documentation/driver-api/memory-devices/index.rst index 87549828f6ab..28101458cda5 100644 --- a/Documentation/driver-api/memory-devices/index.rst +++ b/Documentation/driver-api/memory-devices/index.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ========================= Memory Controller drivers ========================= diff --git a/Documentation/driver-api/mmc/index.rst b/Documentation/driver-api/mmc/index.rst index 9aaf64951a8c..7339736ac774 100644 --- a/Documentation/driver-api/mmc/index.rst +++ b/Documentation/driver-api/mmc/index.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ======================== MMC/SD/SDIO card support ======================== diff --git a/Documentation/driver-api/mtd/index.rst b/Documentation/driver-api/mtd/index.rst index 2e0e7cc4055e..436ba5a851d7 100644 --- a/Documentation/driver-api/mtd/index.rst +++ b/Documentation/driver-api/mtd/index.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ============================== Memory Technology Device (MTD) ============================== diff --git a/Documentation/driver-api/nfc/index.rst b/Documentation/driver-api/nfc/index.rst index 3afb2c0c2e3c..b6e9eedbff29 100644 --- a/Documentation/driver-api/nfc/index.rst +++ b/Documentation/driver-api/nfc/index.rst @@ -1,3 +1,5 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + ======================== Near Field Communication ======================== diff --git a/Documentation/driver-api/nvdimm/index.rst b/Documentation/driver-api/nvdimm/index.rst index 19dc8ee371dc..a4f8f98aeb94 100644 --- a/Documentation/driver-api/nvdimm/index.rst +++ b/Documentation/driver-api/nvdimm/index.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + =================================== Non-Volatile Memory Device (NVDIMM) =================================== diff --git a/Documentation/driver-api/phy/index.rst b/Documentation/driver-api/phy/index.rst index fce9ffae2812..69ba1216de72 100644 --- a/Documentation/driver-api/phy/index.rst +++ b/Documentation/driver-api/phy/index.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ===================== Generic PHY Framework ===================== diff --git a/Documentation/driver-api/rapidio/index.rst b/Documentation/driver-api/rapidio/index.rst index 4c5e51a05134..a41b4242d16f 100644 --- a/Documentation/driver-api/rapidio/index.rst +++ b/Documentation/driver-api/rapidio/index.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + =========================== The Linux RapidIO Subsystem =========================== diff --git a/Documentation/ia64/index.rst b/Documentation/ia64/index.rst index ef99475f672b..0436e1034115 100644 --- a/Documentation/ia64/index.rst +++ b/Documentation/ia64/index.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ================== IA-64 Architecture ================== -- cgit v1.2.3 From eddeed127b06ea2542dc18f2fe37d383b6369fec Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Sat, 6 Jul 2019 13:38:56 -0300 Subject: docs: don't use nested tables Nested tables aren't supported for pdf output on Sphinx 1.7.9: admin-guide/laptops/sonypi:: nested tables are not yet implemented. admin-guide/laptops/toshiba_haps:: nested tables are not yet implemented. driver-api/nvdimm/btt:: nested tables are not yet implemented. 
s390/debugging390:: nested tables are not yet implemented. Signed-off-by: Mauro Carvalho Chehab Acked-by: Andy Shevchenko # laptops --- Documentation/admin-guide/laptops/sonypi.rst | 28 ++++++++++------------ Documentation/admin-guide/laptops/toshiba_haps.rst | 8 +++---- Documentation/driver-api/nvdimm/btt.rst | 2 +- Documentation/s390/debugging390.rst | 2 +- 4 files changed, 19 insertions(+), 21 deletions(-) (limited to 'Documentation/driver-api') diff --git a/Documentation/admin-guide/laptops/sonypi.rst b/Documentation/admin-guide/laptops/sonypi.rst index 2a1975ed7ee4..c6eaaf48f7c1 100644 --- a/Documentation/admin-guide/laptops/sonypi.rst +++ b/Documentation/admin-guide/laptops/sonypi.rst @@ -53,7 +53,7 @@ module or sonypi.<param>=<value> on the kernel boot line when sonypi is statically linked into the kernel). Those options are: =============== ======================================================= - minor: minor number of the misc device /dev/sonypi, + minor: minor number of the misc device /dev/sonypi, default is -1 (automatic allocation, see /proc/misc or kernel logs) @@ -89,24 +89,22 @@ statically linked into the kernel). Those options are: set to 0xffffffff, meaning that all possible events will be tried.
You can use the following bits to construct your own event mask (from - drivers/char/sonypi.h): - - ======================== ====== - SONYPI_JOGGER_MASK 0x0001 - SONYPI_CAPTURE_MASK 0x0002 - SONYPI_FNKEY_MASK 0x0004 - SONYPI_BLUETOOTH_MASK 0x0008 - SONYPI_PKEY_MASK 0x0010 - SONYPI_BACK_MASK 0x0020 - SONYPI_HELP_MASK 0x0040 - SONYPI_LID_MASK 0x0080 - SONYPI_ZOOM_MASK 0x0100 - SONYPI_THUMBPHRASE_MASK 0x0200 + drivers/char/sonypi.h):: + + SONYPI_JOGGER_MASK 0x0001 + SONYPI_CAPTURE_MASK 0x0002 + SONYPI_FNKEY_MASK 0x0004 + SONYPI_BLUETOOTH_MASK 0x0008 + SONYPI_PKEY_MASK 0x0010 + SONYPI_BACK_MASK 0x0020 + SONYPI_HELP_MASK 0x0040 + SONYPI_LID_MASK 0x0080 + SONYPI_ZOOM_MASK 0x0100 + SONYPI_THUMBPHRASE_MASK 0x0200 SONYPI_MEYE_MASK 0x0400 SONYPI_MEMORYSTICK_MASK 0x0800 SONYPI_BATTERY_MASK 0x1000 SONYPI_WIRELESS_MASK 0x2000 - ======================== ====== useinput: if set (which is the default) two input devices are created, one which interprets the jogdial events as diff --git a/Documentation/admin-guide/laptops/toshiba_haps.rst b/Documentation/admin-guide/laptops/toshiba_haps.rst index 11dfc428c080..d28b6c3f2849 100644 --- a/Documentation/admin-guide/laptops/toshiba_haps.rst +++ b/Documentation/admin-guide/laptops/toshiba_haps.rst @@ -75,11 +75,11 @@ The sysfs files under /sys/devices/LNXSYSTM:00/LNXSYBUS:00/TOS620A:00/ are: protection_level The protection_level is readable and writeable, and provides a way to let userspace query the current protection level, as well as set the desired protection level, the - available protection levels are: + available protection levels are:: - ============ ======= ========== ======== - 0 - Disabled 1 - Low 2 - Medium 3 - High - ============ ======= ========== ======== + ============ ======= ========== ======== + 0 - Disabled 1 - Low 2 - Medium 3 - High + ============ ======= ========== ======== reset_protection The reset_protection entry is writeable only, being "1" the only parameter it accepts, it is used to trigger diff --git 
a/Documentation/driver-api/nvdimm/btt.rst b/Documentation/driver-api/nvdimm/btt.rst index 2d8269f834bd..107395c042ae 100644 --- a/Documentation/driver-api/nvdimm/btt.rst +++ b/Documentation/driver-api/nvdimm/btt.rst @@ -83,7 +83,7 @@ flags, and the remaining form the internal block number. ======== ============================================================= Bit Description ======== ============================================================= -31 - 30 Error and Zero flags - Used in the following way: +31 - 30 Error and Zero flags - Used in the following way:: == == ==================================================== 31 30 Description diff --git a/Documentation/s390/debugging390.rst b/Documentation/s390/debugging390.rst index d49305fd5e1a..73ad0b06c666 100644 --- a/Documentation/s390/debugging390.rst +++ b/Documentation/s390/debugging390.rst @@ -170,7 +170,7 @@ currently running at. | +----------------+-------------------------------------------------+ | | 32 | Basic Addressing Mode | | | | | -| | | Used to set addressing mode | +| | | Used to set addressing mode:: | | | | | | | | +---------+----------+----------+ | | | | | PSW 31 | PSW 32 | | | -- cgit v1.2.3