Diffstat (limited to 'Documentation/core-api')
-rw-r--r-- | Documentation/core-api/genalloc.rst | 28
-rw-r--r-- | Documentation/core-api/genericirq.rst | 52
-rw-r--r-- | Documentation/core-api/index.rst | 7
-rw-r--r-- | Documentation/core-api/ioctl.rst | 253
-rw-r--r-- | Documentation/core-api/kernel-api.rst | 11
-rw-r--r-- | Documentation/core-api/memory-allocation.rst | 54
-rw-r--r-- | Documentation/core-api/mm-api.rst | 2
-rw-r--r-- | Documentation/core-api/packing.rst | 166
-rw-r--r-- | Documentation/core-api/padata.rst | 169
-rw-r--r-- | Documentation/core-api/pin_user_pages.rst | 232
-rw-r--r-- | Documentation/core-api/printk-formats.rst | 76
-rw-r--r-- | Documentation/core-api/refcount-vs-atomic.rst | 36
-rw-r--r-- | Documentation/core-api/symbol-namespaces.rst | 157
-rw-r--r-- | Documentation/core-api/xarray.rst | 70
14 files changed, 1181 insertions, 132 deletions
diff --git a/Documentation/core-api/genalloc.rst b/Documentation/core-api/genalloc.rst index 6b38a39fab24..a5af2cbf58a5 100644 --- a/Documentation/core-api/genalloc.rst +++ b/Documentation/core-api/genalloc.rst @@ -23,7 +23,7 @@ begins with the creation of a pool using one of: .. kernel-doc:: lib/genalloc.c :functions: devm_gen_pool_create -A call to :c:func:`gen_pool_create` will create a pool. The granularity of +A call to gen_pool_create() will create a pool. The granularity of allocations is set with min_alloc_order; it is a log-base-2 number like those used by the page allocator, but it refers to bytes rather than pages. So, if min_alloc_order is passed as 3, then all allocations will be a @@ -32,7 +32,7 @@ required to track the memory in the pool. The nid parameter specifies which NUMA node should be used for the allocation of the housekeeping structures; it can be -1 if the caller doesn't care. -The "managed" interface :c:func:`devm_gen_pool_create` ties the pool to a +The "managed" interface devm_gen_pool_create() ties the pool to a specific device. Among other things, it will automatically clean up the pool when the given device is destroyed. @@ -53,32 +53,32 @@ to the pool. That can be done with one of: :functions: gen_pool_add .. kernel-doc:: lib/genalloc.c - :functions: gen_pool_add_virt + :functions: gen_pool_add_owner -A call to :c:func:`gen_pool_add` will place the size bytes of memory +A call to gen_pool_add() will place the size bytes of memory starting at addr (in the kernel's virtual address space) into the given pool, once again using nid as the node ID for ancillary memory allocations. -The :c:func:`gen_pool_add_virt` variant associates an explicit physical +The gen_pool_add_virt() variant associates an explicit physical address with the memory; this is only necessary if the pool will be used for DMA allocations. The functions for allocating memory from the pool (and putting it back) are: -.. kernel-doc:: lib/genalloc.c +.. kernel-doc:: include/linux/genalloc.h :functions: gen_pool_alloc .. kernel-doc:: lib/genalloc.c :functions: gen_pool_dma_alloc .. kernel-doc:: lib/genalloc.c - :functions: gen_pool_free + :functions: gen_pool_free_owner -As one would expect, :c:func:`gen_pool_alloc` will allocate size< bytes -from the given pool. The :c:func:`gen_pool_dma_alloc` variant allocates +As one would expect, gen_pool_alloc() will allocate size< bytes +from the given pool. The gen_pool_dma_alloc() variant allocates memory for use with DMA operations, returning the associated physical address in the space pointed to by dma. This will only work if the memory -was added with :c:func:`gen_pool_add_virt`. Note that this function +was added with gen_pool_add_virt(). Note that this function departs from the usual genpool pattern of using unsigned long values to represent kernel addresses; it returns a void * instead. @@ -89,14 +89,14 @@ return. If that sort of control is needed, the following functions will be of interest: .. kernel-doc:: lib/genalloc.c - :functions: gen_pool_alloc_algo + :functions: gen_pool_alloc_algo_owner .. kernel-doc:: lib/genalloc.c :functions: gen_pool_set_algo -Allocations with :c:func:`gen_pool_alloc_algo` specify an algorithm to be +Allocations with gen_pool_alloc_algo() specify an algorithm to be used to choose the memory to be allocated; the default algorithm can be set -with :c:func:`gen_pool_set_algo`. The data value is passed to the +with gen_pool_set_algo(). The data value is passed to the algorithm; most ignore it, but it is occasionally needed. 
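The calls above fit together roughly as in the following sketch; the device, the memory region handed to the pool and the allocation sizes are hypothetical, and error handling is abbreviated::

  #include <linux/device.h>
  #include <linux/err.h>
  #include <linux/genalloc.h>

  static int example_pool_setup(struct device *dev, void *vaddr, size_t size)
  {
          struct gen_pool *pool;
          unsigned long chunk;

          /* 8-byte granularity (min_alloc_order = 3), any NUMA node (-1) */
          pool = devm_gen_pool_create(dev, 3, -1, "example-pool");
          if (IS_ERR(pool))
                  return PTR_ERR(pool);

          /* hand the memory region over to the pool */
          if (gen_pool_add(pool, (unsigned long)vaddr, size, -1))
                  return -ENOMEM;

          /* allocate 64 bytes from the pool, then give them back */
          chunk = gen_pool_alloc(pool, 64);
          if (!chunk)
                  return -ENOMEM;
          gen_pool_free(pool, chunk, 64);
          return 0;
  }

Being a "managed" pool, it is cleaned up automatically when the device goes away.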
One can, naturally, write a special-purpose algorithm, but there is a fair set already available: @@ -129,7 +129,7 @@ writing of special-purpose memory allocators in the future. :functions: gen_pool_for_each_chunk .. kernel-doc:: lib/genalloc.c - :functions: addr_in_gen_pool + :functions: gen_pool_has_addr .. kernel-doc:: lib/genalloc.c :functions: gen_pool_avail diff --git a/Documentation/core-api/genericirq.rst b/Documentation/core-api/genericirq.rst index 4da67b65cecf..8f06d885c310 100644 --- a/Documentation/core-api/genericirq.rst +++ b/Documentation/core-api/genericirq.rst @@ -26,7 +26,7 @@ Rationale ========= The original implementation of interrupt handling in Linux uses the -:c:func:`__do_IRQ` super-handler, which is able to deal with every type of +__do_IRQ() super-handler, which is able to deal with every type of interrupt logic. Originally, Russell King identified different types of handlers to build @@ -43,7 +43,7 @@ During the implementation we identified another type: - Fast EOI type -In the SMP world of the :c:func:`__do_IRQ` super-handler another type was +In the SMP world of the __do_IRQ() super-handler another type was identified: - Per CPU type @@ -83,7 +83,7 @@ IRQ-flow implementation for 'level type' interrupts and add a (sub)architecture specific 'edge type' implementation. To make the transition to the new model easier and prevent the breakage -of existing implementations, the :c:func:`__do_IRQ` super-handler is still +of existing implementations, the __do_IRQ() super-handler is still available. This leads to a kind of duality for the time being. Over time the new model should be used in more and more architectures, as it enables smaller and cleaner IRQ subsystems. It's deprecated for three @@ -116,7 +116,7 @@ status information and pointers to the interrupt flow method and the interrupt chip structure which are assigned to this interrupt. Whenever an interrupt triggers, the low-level architecture code calls -into the generic interrupt code by calling :c:func:`desc->handle_irq`. This +into the generic interrupt code by calling desc->handle_irq(). This high-level IRQ handling function only uses desc->irq_data.chip primitives referenced by the assigned chip descriptor structure. @@ -125,27 +125,29 @@ High-level Driver API The high-level Driver API consists of following functions: -- :c:func:`request_irq` +- request_irq() -- :c:func:`free_irq` +- request_threaded_irq() -- :c:func:`disable_irq` +- free_irq() -- :c:func:`enable_irq` +- disable_irq() -- :c:func:`disable_irq_nosync` (SMP only) +- enable_irq() -- :c:func:`synchronize_irq` (SMP only) +- disable_irq_nosync() (SMP only) -- :c:func:`irq_set_irq_type` +- synchronize_irq() (SMP only) -- :c:func:`irq_set_irq_wake` +- irq_set_irq_type() -- :c:func:`irq_set_handler_data` +- irq_set_irq_wake() -- :c:func:`irq_set_chip` +- irq_set_handler_data() -- :c:func:`irq_set_chip_data` +- irq_set_chip() + +- irq_set_chip_data() See the autogenerated function documentation for details. 
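For reference, a minimal driver fragment using the high-level API listed above might look like the sketch below; the device structure and names are invented for illustration::

  #include <linux/interrupt.h>

  struct example_dev {
          void __iomem *regs;
          int irq;
  };

  static irqreturn_t example_irq_handler(int irq, void *dev_id)
  {
          struct example_dev *edev = dev_id;

          /* acknowledge and handle the device interrupt here */
          (void)edev;
          return IRQ_HANDLED;
  }

  static int example_setup_irq(struct example_dev *edev)
  {
          /* released again on teardown with free_irq(edev->irq, edev) */
          return request_irq(edev->irq, example_irq_handler, 0,
                             "example", edev);
  }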
@@ -154,19 +156,19 @@ High-level IRQ flow handlers The generic layer provides a set of pre-defined irq-flow methods: -- :c:func:`handle_level_irq` +- handle_level_irq() -- :c:func:`handle_edge_irq` +- handle_edge_irq() -- :c:func:`handle_fasteoi_irq` +- handle_fasteoi_irq() -- :c:func:`handle_simple_irq` +- handle_simple_irq() -- :c:func:`handle_percpu_irq` +- handle_percpu_irq() -- :c:func:`handle_edge_eoi_irq` +- handle_edge_eoi_irq() -- :c:func:`handle_bad_irq` +- handle_bad_irq() The interrupt flow handlers (either pre-defined or architecture specific) are assigned to specific interrupts by the architecture either @@ -325,14 +327,14 @@ Delayed interrupt disable This per interrupt selectable feature, which was introduced by Russell King in the ARM interrupt implementation, does not mask an interrupt at -the hardware level when :c:func:`disable_irq` is called. The interrupt is kept +the hardware level when disable_irq() is called. The interrupt is kept enabled and is masked in the flow handler when an interrupt event happens. This prevents losing edge interrupts on hardware which does not store an edge interrupt event while the interrupt is disabled at the hardware level. When an interrupt arrives while the IRQ_DISABLED flag is set, then the interrupt is masked at the hardware level and the IRQ_PENDING bit is set. When the interrupt is re-enabled by -:c:func:`enable_irq` the pending bit is checked and if it is set, the interrupt +enable_irq() the pending bit is checked and if it is set, the interrupt is resent either via hardware or by a software resend mechanism. (It's necessary to enable CONFIG_HARDIRQS_SW_RESEND when you want to use the delayed interrupt disable feature and your hardware is not capable @@ -369,7 +371,7 @@ handler(s) to use these basic units of low-level functionality. __do_IRQ entry point ==================== -The original implementation :c:func:`__do_IRQ` was an alternative entry point +The original implementation __do_IRQ() was an alternative entry point for all types of interrupts. It no longer exists. This handler turned out to be not suitable for all interrupt hardware diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index da0ed972d224..a501dc1c90d0 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -25,11 +25,13 @@ Core utilities librs genalloc errseq + packing printk-formats circular-buffers generic-radix-tree memory-allocation mm-api + pin_user_pages gfp_mask-from-fs-io timekeeping boot-time-mm @@ -37,6 +39,9 @@ Core utilities protection-keys ../RCU/index gcc-plugins + symbol-namespaces + padata + ioctl Interfaces for kernel debugging @@ -48,7 +53,7 @@ Interfaces for kernel debugging debug-objects tracepoint -.. only:: subproject +.. only:: subproject and html Indices ======= diff --git a/Documentation/core-api/ioctl.rst b/Documentation/core-api/ioctl.rst new file mode 100644 index 000000000000..c455db0e1627 --- /dev/null +++ b/Documentation/core-api/ioctl.rst @@ -0,0 +1,253 @@ +====================== +ioctl based interfaces +====================== + +ioctl() is the most common way for applications to interface +with device drivers. It is flexible and easily extended by adding new +commands and can be passed through character devices, block devices as +well as sockets and other special file descriptors. 
+ +However, it is also very easy to get ioctl command definitions wrong, +and hard to fix them later without breaking existing applications, +so this documentation tries to help developers get it right. + +Command number definitions +========================== + +The command number, or request number, is the second argument passed to +the ioctl system call. While this can be any 32-bit number that uniquely +identifies an action for a particular driver, there are a number of +conventions around defining them. + +``include/uapi/asm-generic/ioctl.h`` provides four macros for defining +ioctl commands that follow modern conventions: ``_IO``, ``_IOR``, +``_IOW``, and ``_IOWR``. These should be used for all new commands, +with the correct parameters: + +_IO/_IOR/_IOW/_IOWR + The macro name specifies how the argument will be used. It may be a + pointer to data to be passed into the kernel (_IOW), out of the kernel + (_IOR), or both (_IOWR). _IO can indicate either commands with no + argument or those passing an integer value instead of a pointer. + It is recommended to only use _IO for commands without arguments, + and use pointers for passing data. + +type + An 8-bit number, often a character literal, specific to a subsystem + or driver, and listed in :doc:`../userspace-api/ioctl/ioctl-number` + +nr + An 8-bit number identifying the specific command, unique for a give + value of 'type' + +data_type + The name of the data type pointed to by the argument, the command number + encodes the ``sizeof(data_type)`` value in a 13-bit or 14-bit integer, + leading to a limit of 8191 bytes for the maximum size of the argument. + Note: do not pass sizeof(data_type) type into _IOR/_IOW/IOWR, as that + will lead to encoding sizeof(sizeof(data_type)), i.e. sizeof(size_t). + _IO does not have a data_type parameter. + + +Interface versions +================== + +Some subsystems use version numbers in data structures to overload +commands with different interpretations of the argument. + +This is generally a bad idea, since changes to existing commands tend +to break existing applications. + +A better approach is to add a new ioctl command with a new number. The +old command still needs to be implemented in the kernel for compatibility, +but this can be a wrapper around the new implementation. + +Return code +=========== + +ioctl commands can return negative error codes as documented in errno(3); +these get turned into errno values in user space. On success, the return +code should be zero. It is also possible but not recommended to return +a positive 'long' value. + +When the ioctl callback is called with an unknown command number, the +handler returns either -ENOTTY or -ENOIOCTLCMD, which also results in +-ENOTTY being returned from the system call. Some subsystems return +-ENOSYS or -EINVAL here for historic reasons, but this is wrong. + +Prior to Linux 5.5, compat_ioctl handlers were required to return +-ENOIOCTLCMD in order to use the fallback conversion into native +commands. As all subsystems are now responsible for handling compat +mode themselves, this is no longer needed, but it may be important to +consider when backporting bug fixes to older kernels. + +Timestamps +========== + +Traditionally, timestamps and timeout values are passed as ``struct +timespec`` or ``struct timeval``, but these are problematic because of +incompatible definitions of these structures in user space after the +move to 64-bit time_t. 
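As an aside, the command number conventions described earlier come together in a purely hypothetical UAPI header fragment like the one below; the magic letter, structure and command numbers are invented for illustration::

  #include <linux/ioctl.h>
  #include <linux/types.h>

  struct example_params {
          __u64 buffer;   /* user pointer carried as a fixed-width integer */
          __u32 length;
          __u32 flags;
  };

  /* 'E' is a made-up type value; a real driver needs a slot in ioctl-number.rst */
  #define EXAMPLE_IOC_MAGIC       'E'

  #define EXAMPLE_IOC_RESET       _IO(EXAMPLE_IOC_MAGIC, 0)
  #define EXAMPLE_IOC_SET_PARAMS  _IOW(EXAMPLE_IOC_MAGIC, 1, struct example_params)
  #define EXAMPLE_IOC_GET_PARAMS  _IOR(EXAMPLE_IOC_MAGIC, 2, struct example_params)

The discussion of timestamp handling continues below.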
+ +The ``struct __kernel_timespec`` type can be used instead to be embedded +in other data structures when separate second/nanosecond values are +desired, or passed to user space directly. This is still not ideal though, +as the structure matches neither the kernel's timespec64 nor the user +space timespec exactly. The get_timespec64() and put_timespec64() helper +functions can be used to ensure that the layout remains compatible with +user space and the padding is treated correctly. + +As it is cheap to convert seconds to nanoseconds, but the opposite +requires an expensive 64-bit division, a simple __u64 nanosecond value +can be simpler and more efficient. + +Timeout values and timestamps should ideally use CLOCK_MONOTONIC time, +as returned by ktime_get_ns() or ktime_get_ts64(). Unlike +CLOCK_REALTIME, this makes the timestamps immune from jumping backwards +or forwards due to leap second adjustments and clock_settime() calls. + +ktime_get_real_ns() can be used for CLOCK_REALTIME timestamps that +need to be persistent across a reboot or between multiple machines. + +32-bit compat mode +================== + +In order to support 32-bit user space running on a 64-bit machine, each +subsystem or driver that implements an ioctl callback handler must also +implement the corresponding compat_ioctl handler. + +As long as all the rules for data structures are followed, this is as +easy as setting the .compat_ioctl pointer to a helper function such as +compat_ptr_ioctl() or blkdev_compat_ptr_ioctl(). + +compat_ptr() +------------ + +On the s390 architecture, 31-bit user space has ambiguous representations +for data pointers, with the upper bit being ignored. When running such +a process in compat mode, the compat_ptr() helper must be used to +clear the upper bit of a compat_uptr_t and turn it into a valid 64-bit +pointer. On other architectures, this macro only performs a cast to a +``void __user *`` pointer. + +In an compat_ioctl() callback, the last argument is an unsigned long, +which can be interpreted as either a pointer or a scalar depending on +the command. If it is a scalar, then compat_ptr() must not be used, to +ensure that the 64-bit kernel behaves the same way as a 32-bit kernel +for arguments with the upper bit set. + +The compat_ptr_ioctl() helper can be used in place of a custom +compat_ioctl file operation for drivers that only take arguments that +are pointers to compatible data structures. + +Structure layout +---------------- + +Compatible data structures have the same layout on all architectures, +avoiding all problematic members: + +* ``long`` and ``unsigned long`` are the size of a register, so + they can be either 32-bit or 64-bit wide and cannot be used in portable + data structures. Fixed-length replacements are ``__s32``, ``__u32``, + ``__s64`` and ``__u64``. + +* Pointers have the same problem, in addition to requiring the + use of compat_ptr(). The best workaround is to use ``__u64`` + in place of pointers, which requires a cast to ``uintptr_t`` in user + space, and the use of u64_to_user_ptr() in the kernel to convert + it back into a user pointer. + +* On the x86-32 (i386) architecture, the alignment of 64-bit variables + is only 32-bit, but they are naturally aligned on most other + architectures including x86-64. 
This means a structure like:: + + struct foo { + __u32 a; + __u64 b; + __u32 c; + }; + + has four bytes of padding between a and b on x86-64, plus another four + bytes of padding at the end, but no padding on i386, and it needs a + compat_ioctl conversion handler to translate between the two formats. + + To avoid this problem, all structures should have their members + naturally aligned, or explicit reserved fields added in place of the + implicit padding. The ``pahole`` tool can be used for checking the + alignment. + +* On ARM OABI user space, structures are padded to multiples of 32-bit, + making some structs incompatible with modern EABI kernels if they + do not end on a 32-bit boundary. + +* On the m68k architecture, struct members are not guaranteed to have an + alignment greater than 16-bit, which is a problem when relying on + implicit padding. + +* Bitfields and enums generally work as one would expect them to, + but some properties of them are implementation-defined, so it is better + to avoid them completely in ioctl interfaces. + +* ``char`` members can be either signed or unsigned, depending on + the architecture, so the __u8 and __s8 types should be used for 8-bit + integer values, though char arrays are clearer for fixed-length strings. + +Information leaks +================= + +Uninitialized data must not be copied back to user space, as this can +cause an information leak, which can be used to defeat kernel address +space layout randomization (KASLR), helping in an attack. + +For this reason (and for compat support) it is best to avoid any +implicit padding in data structures. Where there is implicit padding +in an existing structure, kernel drivers must be careful to fully +initialize an instance of the structure before copying it to user +space. This is usually done by calling memset() before assigning to +individual members. + +Subsystem abstractions +====================== + +While some device drivers implement their own ioctl function, most +subsystems implement the same command for multiple drivers. Ideally the +subsystem has an .ioctl() handler that copies the arguments from and +to user space, passing them into subsystem specific callback functions +through normal kernel pointers. + +This helps in various ways: + +* Applications written for one driver are more likely to work for + another one in the same subsystem if there are no subtle differences + in the user space ABI. + +* The complexity of user space access and data structure layout is done + in one place, reducing the potential for implementation bugs. + +* It is more likely to be reviewed by experienced developers + that can spot problems in the interface when the ioctl is shared + between multiple drivers than when it is only used in a single driver. + +Alternatives to ioctl +===================== + +There are many cases in which ioctl is not the best solution for a +problem. Alternatives include: + +* System calls are a better choice for a system-wide feature that + is not tied to a physical device or constrained by the file system + permissions of a character device node + +* netlink is the preferred way of configuring any network related + objects through sockets. + +* debugfs is used for ad-hoc interfaces for debugging functionality + that does not need to be exposed as a stable interface to applications. + +* sysfs is a good way to expose the state of an in-kernel object + that is not tied to a file descriptor. 
+ +* configfs can be used for more complex configuration than sysfs + +* A custom file system can provide extra flexibility with a simple + user interface but adds a lot of complexity to the implementation. diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst index 08af5caf036d..4ac53a1363f6 100644 --- a/Documentation/core-api/kernel-api.rst +++ b/Documentation/core-api/kernel-api.rst @@ -42,6 +42,9 @@ String Manipulation .. kernel-doc:: lib/string.c :export: +.. kernel-doc:: include/linux/string.h + :internal: + .. kernel-doc:: mm/util.c :functions: kstrdup kstrdup_const kstrndup kmemdup kmemdup_nul memdup_user vmemdup_user strndup_user memdup_user_nul @@ -54,7 +57,13 @@ The Linux kernel provides more basic utility functions. Bit Operations -------------- -.. kernel-doc:: include/asm-generic/bitops-instrumented.h +.. kernel-doc:: include/asm-generic/bitops/instrumented-atomic.h + :internal: + +.. kernel-doc:: include/asm-generic/bitops/instrumented-non-atomic.h + :internal: + +.. kernel-doc:: include/asm-generic/bitops/instrumented-lock.h :internal: Bitmap Operations diff --git a/Documentation/core-api/memory-allocation.rst b/Documentation/core-api/memory-allocation.rst index 7744aa3bf2e0..4aa82ddd01b8 100644 --- a/Documentation/core-api/memory-allocation.rst +++ b/Documentation/core-api/memory-allocation.rst @@ -88,39 +88,41 @@ Selecting memory allocator ========================== The most straightforward way to allocate memory is to use a function -from the :c:func:`kmalloc` family. And, to be on the safe size it's -best to use routines that set memory to zero, like -:c:func:`kzalloc`. If you need to allocate memory for an array, there -are :c:func:`kmalloc_array` and :c:func:`kcalloc` helpers. +from the kmalloc() family. And, to be on the safe side it's best to use +routines that set memory to zero, like kzalloc(). If you need to +allocate memory for an array, there are kmalloc_array() and kcalloc() +helpers. The helpers struct_size(), array_size() and array3_size() can +be used to safely calculate object sizes without overflowing. The maximal size of a chunk that can be allocated with `kmalloc` is limited. The actual limit depends on the hardware and the kernel configuration, but it is a good practice to use `kmalloc` for objects smaller than page size. -For large allocations you can use :c:func:`vmalloc` and -:c:func:`vzalloc`, or directly request pages from the page -allocator. The memory allocated by `vmalloc` and related functions is -not physically contiguous. +The address of a chunk allocated with `kmalloc` is aligned to at least +ARCH_KMALLOC_MINALIGN bytes. For sizes which are a power of two, the +alignment is also guaranteed to be at least the respective size. + +For large allocations you can use vmalloc() and vzalloc(), or directly +request pages from the page allocator. The memory allocated by `vmalloc` +and related functions is not physically contiguous. If you are not sure whether the allocation size is too large for -`kmalloc`, it is possible to use :c:func:`kvmalloc` and its -derivatives. It will try to allocate memory with `kmalloc` and if the -allocation fails it will be retried with `vmalloc`. There are -restrictions on which GFP flags can be used with `kvmalloc`; please -see :c:func:`kvmalloc_node` reference documentation. Note that -`kvmalloc` may return memory that is not physically contiguous. +`kmalloc`, it is possible to use kvmalloc() and its derivatives. 
It will +try to allocate memory with `kmalloc` and if the allocation fails it +will be retried with `vmalloc`. There are restrictions on which GFP +flags can be used with `kvmalloc`; please see kvmalloc_node() reference +documentation. Note that `kvmalloc` may return memory that is not +physically contiguous. If you need to allocate many identical objects you can use the slab -cache allocator. The cache should be set up with -:c:func:`kmem_cache_create` or :c:func:`kmem_cache_create_usercopy` -before it can be used. The second function should be used if a part of -the cache might be copied to the userspace. After the cache is -created :c:func:`kmem_cache_alloc` and its convenience wrappers can -allocate memory from that cache. - -When the allocated memory is no longer needed it must be freed. You -can use :c:func:`kvfree` for the memory allocated with `kmalloc`, -`vmalloc` and `kvmalloc`. The slab caches should be freed with -:c:func:`kmem_cache_free`. And don't forget to destroy the cache with -:c:func:`kmem_cache_destroy`. +cache allocator. The cache should be set up with kmem_cache_create() or +kmem_cache_create_usercopy() before it can be used. The second function +should be used if a part of the cache might be copied to the userspace. +After the cache is created kmem_cache_alloc() and its convenience +wrappers can allocate memory from that cache. + +When the allocated memory is no longer needed it must be freed. You can +use kvfree() for the memory allocated with `kmalloc`, `vmalloc` and +`kvmalloc`. The slab caches should be freed with kmem_cache_free(). And +don't forget to destroy the cache with kmem_cache_destroy(). diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst index 128e8a721c1e..be726986ff75 100644 --- a/Documentation/core-api/mm-api.rst +++ b/Documentation/core-api/mm-api.rst @@ -11,7 +11,7 @@ User Space Memory Access .. kernel-doc:: arch/x86/lib/usercopy_32.c :export: -.. kernel-doc:: mm/util.c +.. kernel-doc:: mm/gup.c :functions: get_user_pages_fast .. _mm-api-gfp-flags: diff --git a/Documentation/core-api/packing.rst b/Documentation/core-api/packing.rst new file mode 100644 index 000000000000..d8c341fe383e --- /dev/null +++ b/Documentation/core-api/packing.rst @@ -0,0 +1,166 @@ +================================================ +Generic bitfield packing and unpacking functions +================================================ + +Problem statement +----------------- + +When working with hardware, one has to choose between several approaches of +interfacing with it. +One can memory-map a pointer to a carefully crafted struct over the hardware +device's memory region, and access its fields as struct members (potentially +declared as bitfields). But writing code this way would make it less portable, +due to potential endianness mismatches between the CPU and the hardware device. +Additionally, one has to pay close attention when translating register +definitions from the hardware documentation into bit field indices for the +structs. Also, some hardware (typically networking equipment) tends to group +its register fields in ways that violate any reasonable word boundaries +(sometimes even 64 bit ones). This creates the inconvenience of having to +define "high" and "low" portions of register fields within the struct. +A more robust alternative to struct field definitions would be to extract the +required fields by shifting the appropriate number of bits. 
But this would +still not protect from endianness mismatches, except if all memory accesses +were performed byte-by-byte. Also the code can easily get cluttered, and the +high-level idea might get lost among the many bit shifts required. +Many drivers take the bit-shifting approach and then attempt to reduce the +clutter with tailored macros, but more often than not these macros take +shortcuts that still prevent the code from being truly portable. + +The solution +------------ + +This API deals with 2 basic operations: + + - Packing a CPU-usable number into a memory buffer (with hardware + constraints/quirks) + - Unpacking a memory buffer (which has hardware constraints/quirks) + into a CPU-usable number. + +The API offers an abstraction over said hardware constraints and quirks, +over CPU endianness and therefore between possible mismatches between +the two. + +The basic unit of these API functions is the u64. From the CPU's +perspective, bit 63 always means bit offset 7 of byte 7, albeit only +logically. The question is: where do we lay this bit out in memory? + +The following examples cover the memory layout of a packed u64 field. +The byte offsets in the packed buffer are always implicitly 0, 1, ... 7. +What the examples show is where the logical bytes and bits sit. + +1. Normally (no quirks), we would do it like this: + +:: + + 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 + 7 6 5 4 + 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 + 3 2 1 0 + +That is, the MSByte (7) of the CPU-usable u64 sits at memory offset 0, and the +LSByte (0) of the u64 sits at memory offset 7. +This corresponds to what most folks would regard to as "big endian", where +bit i corresponds to the number 2^i. This is also referred to in the code +comments as "logical" notation. + + +2. If QUIRK_MSB_ON_THE_RIGHT is set, we do it like this: + +:: + + 56 57 58 59 60 61 62 63 48 49 50 51 52 53 54 55 40 41 42 43 44 45 46 47 32 33 34 35 36 37 38 39 + 7 6 5 4 + 24 25 26 27 28 29 30 31 16 17 18 19 20 21 22 23 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 + 3 2 1 0 + +That is, QUIRK_MSB_ON_THE_RIGHT does not affect byte positioning, but +inverts bit offsets inside a byte. + + +3. If QUIRK_LITTLE_ENDIAN is set, we do it like this: + +:: + + 39 38 37 36 35 34 33 32 47 46 45 44 43 42 41 40 55 54 53 52 51 50 49 48 63 62 61 60 59 58 57 56 + 4 5 6 7 + 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8 23 22 21 20 19 18 17 16 31 30 29 28 27 26 25 24 + 0 1 2 3 + +Therefore, QUIRK_LITTLE_ENDIAN means that inside the memory region, every +byte from each 4-byte word is placed at its mirrored position compared to +the boundary of that word. + +4. If QUIRK_MSB_ON_THE_RIGHT and QUIRK_LITTLE_ENDIAN are both set, we do it + like this: + +:: + + 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 + 4 5 6 7 + 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 + 0 1 2 3 + + +5. If just QUIRK_LSW32_IS_FIRST is set, we do it like this: + +:: + + 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 + 3 2 1 0 + 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 + 7 6 5 4 + +In this case the 8 byte memory region is interpreted as follows: first +4 bytes correspond to the least significant 4-byte word, next 4 bytes to +the more significant 4-byte word. + + +6. 
If QUIRK_LSW32_IS_FIRST and QUIRK_MSB_ON_THE_RIGHT are set, we do it like + this: + +:: + + 24 25 26 27 28 29 30 31 16 17 18 19 20 21 22 23 8 9 10 11 12 13 14 15 0 1 2 3 4 5 6 7 + 3 2 1 0 + 56 57 58 59 60 61 62 63 48 49 50 51 52 53 54 55 40 41 42 43 44 45 46 47 32 33 34 35 36 37 38 39 + 7 6 5 4 + + +7. If QUIRK_LSW32_IS_FIRST and QUIRK_LITTLE_ENDIAN are set, it looks like + this: + +:: + + 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8 23 22 21 20 19 18 17 16 31 30 29 28 27 26 25 24 + 0 1 2 3 + 39 38 37 36 35 34 33 32 47 46 45 44 43 42 41 40 55 54 53 52 51 50 49 48 63 62 61 60 59 58 57 56 + 4 5 6 7 + + +8. If QUIRK_LSW32_IS_FIRST, QUIRK_LITTLE_ENDIAN and QUIRK_MSB_ON_THE_RIGHT + are set, it looks like this: + +:: + + 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 + 0 1 2 3 + 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 + 4 5 6 7 + + +We always think of our offsets as if there were no quirk, and we translate +them afterwards, before accessing the memory region. + +Intended use +------------ + +Drivers that opt to use this API first need to identify which of the above 3 +quirk combinations (for a total of 8) match what the hardware documentation +describes. Then they should wrap the packing() function, creating a new +xxx_packing() that calls it using the proper QUIRK_* one-hot bits set. + +The packing() function returns an int-encoded error code, which protects the +programmer against incorrect API use. The errors are not expected to occur +durring runtime, therefore it is reasonable for xxx_packing() to return void +and simply swallow those errors. Optionally it can dump stack or print the +error description. diff --git a/Documentation/core-api/padata.rst b/Documentation/core-api/padata.rst new file mode 100644 index 000000000000..9a24c111781d --- /dev/null +++ b/Documentation/core-api/padata.rst @@ -0,0 +1,169 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================================= +The padata parallel execution mechanism +======================================= + +:Date: December 2019 + +Padata is a mechanism by which the kernel can farm jobs out to be done in +parallel on multiple CPUs while retaining their ordering. It was developed for +use with the IPsec code, which needs to be able to perform encryption and +decryption on large numbers of packets without reordering those packets. The +crypto developers made a point of writing padata in a sufficiently general +fashion that it could be put to other uses as well. + +Usage +===== + +Initializing +------------ + +The first step in using padata is to set up a padata_instance structure for +overall control of how jobs are to be run:: + + #include <linux/padata.h> + + struct padata_instance *padata_alloc_possible(const char *name); + +'name' simply identifies the instance. + +There are functions for enabling and disabling the instance:: + + int padata_start(struct padata_instance *pinst); + void padata_stop(struct padata_instance *pinst); + +These functions are setting or clearing the "PADATA_INIT" flag; if that flag is +not set, other functions will refuse to work. padata_start() returns zero on +success (flag set) or -EINVAL if the padata cpumask contains no active CPU +(flag not set). padata_stop() clears the flag and blocks until the padata +instance is unused. 
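Put together, the early setup steps might look like the sketch below; the instance name and error handling are illustrative, and the padata_shell allocation described next completes the initialization::

  #include <linux/padata.h>

  static struct padata_instance *example_pinst;

  static int example_padata_init(void)
  {
          example_pinst = padata_alloc_possible("example");
          if (!example_pinst)
                  return -ENOMEM;

          /* sets PADATA_INIT; fails if the cpumask contains no active CPU */
          return padata_start(example_pinst);
  }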
+ +Finally, complete padata initialization by allocating a padata_shell:: + + struct padata_shell *padata_alloc_shell(struct padata_instance *pinst); + +A padata_shell is used to submit a job to padata and allows a series of such +jobs to be serialized independently. A padata_instance may have one or more +padata_shells associated with it, each allowing a separate series of jobs. + +Modifying cpumasks +------------------ + +The CPUs used to run jobs can be changed in two ways, programatically with +padata_set_cpumask() or via sysfs. The former is defined:: + + int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type, + cpumask_var_t cpumask); + +Here cpumask_type is one of PADATA_CPU_PARALLEL or PADATA_CPU_SERIAL, where a +parallel cpumask describes which processors will be used to execute jobs +submitted to this instance in parallel and a serial cpumask defines which +processors are allowed to be used as the serialization callback processor. +cpumask specifies the new cpumask to use. + +There may be sysfs files for an instance's cpumasks. For example, pcrypt's +live in /sys/kernel/pcrypt/<instance-name>. Within an instance's directory +there are two files, parallel_cpumask and serial_cpumask, and either cpumask +may be changed by echoing a bitmask into the file, for example:: + + echo f > /sys/kernel/pcrypt/pencrypt/parallel_cpumask + +Reading one of these files shows the user-supplied cpumask, which may be +different from the 'usable' cpumask. + +Padata maintains two pairs of cpumasks internally, the user-supplied cpumasks +and the 'usable' cpumasks. (Each pair consists of a parallel and a serial +cpumask.) The user-supplied cpumasks default to all possible CPUs on instance +allocation and may be changed as above. The usable cpumasks are always a +subset of the user-supplied cpumasks and contain only the online CPUs in the +user-supplied masks; these are the cpumasks padata actually uses. So it is +legal to supply a cpumask to padata that contains offline CPUs. Once an +offline CPU in the user-supplied cpumask comes online, padata is going to use +it. + +Changing the CPU masks are expensive operations, so it should not be done with +great frequency. + +Running A Job +------------- + +Actually submitting work to the padata instance requires the creation of a +padata_priv structure, which represents one job:: + + struct padata_priv { + /* Other stuff here... */ + void (*parallel)(struct padata_priv *padata); + void (*serial)(struct padata_priv *padata); + }; + +This structure will almost certainly be embedded within some larger +structure specific to the work to be done. Most of its fields are private to +padata, but the structure should be zeroed at initialisation time, and the +parallel() and serial() functions should be provided. Those functions will +be called in the process of getting the work done as we will see +momentarily. + +The submission of the job is done with:: + + int padata_do_parallel(struct padata_shell *ps, + struct padata_priv *padata, int *cb_cpu); + +The ps and padata structures must be set up as described above; cb_cpu +points to the preferred CPU to be used for the final callback when the job is +done; it must be in the current instance's CPU mask (if not the cb_cpu pointer +is updated to point to the CPU actually chosen). The return value from +padata_do_parallel() is zero on success, indicating that the job is in +progress. 
-EBUSY means that somebody, somewhere else is messing with the +instance's CPU mask, while -EINVAL is a complaint about cb_cpu not being in the +serial cpumask, no online CPUs in the parallel or serial cpumasks, or a stopped +instance. + +Each job submitted to padata_do_parallel() will, in turn, be passed to +exactly one call to the above-mentioned parallel() function, on one CPU, so +true parallelism is achieved by submitting multiple jobs. parallel() runs with +software interrupts disabled and thus cannot sleep. The parallel() +function gets the padata_priv structure pointer as its lone parameter; +information about the actual work to be done is probably obtained by using +container_of() to find the enclosing structure. + +Note that parallel() has no return value; the padata subsystem assumes that +parallel() will take responsibility for the job from this point. The job +need not be completed during this call, but, if parallel() leaves work +outstanding, it should be prepared to be called again with a new job before +the previous one completes. + +Serializing Jobs +---------------- + +When a job does complete, parallel() (or whatever function actually finishes +the work) should inform padata of the fact with a call to:: + + void padata_do_serial(struct padata_priv *padata); + +At some point in the future, padata_do_serial() will trigger a call to the +serial() function in the padata_priv structure. That call will happen on +the CPU requested in the initial call to padata_do_parallel(); it, too, is +run with local software interrupts disabled. +Note that this call may be deferred for a while since the padata code takes +pains to ensure that jobs are completed in the order in which they were +submitted. + +Destroying +---------- + +Cleaning up a padata instance predictably involves calling the three free +functions that correspond to the allocation in reverse:: + + void padata_free_shell(struct padata_shell *ps); + void padata_stop(struct padata_instance *pinst); + void padata_free(struct padata_instance *pinst); + +It is the user's responsibility to ensure all outstanding jobs are complete +before any of the above are called. + +Interface +========= + +.. kernel-doc:: include/linux/padata.h +.. kernel-doc:: kernel/padata.c diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst new file mode 100644 index 000000000000..1d490155ecd7 --- /dev/null +++ b/Documentation/core-api/pin_user_pages.rst @@ -0,0 +1,232 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================================================== +pin_user_pages() and related calls +==================================================== + +.. contents:: :local: + +Overview +======== + +This document describes the following functions:: + + pin_user_pages() + pin_user_pages_fast() + pin_user_pages_remote() + +Basic description of FOLL_PIN +============================= + +FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*() +("gup") family of functions. FOLL_PIN has significant interactions and +interdependencies with FOLL_LONGTERM, so both are covered here. + +FOLL_PIN is internal to gup, meaning that it should not appear at the gup call +sites. This allows the associated wrapper functions (pin_user_pages*() and +others) to set the correct combination of these flags, and to check for problems +as well. + +FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites. 
+This is in order to avoid creating a large number of wrapper functions to cover +all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the +pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so +that's a natural dividing line, and a good point to make separate wrapper calls. +In other words, use pin_user_pages*() for DMA-pinned pages, and +get_user_pages*() for other cases. There are four cases described later on in +this document, to further clarify that concept. + +FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However, +multiple threads and call sites are free to pin the same struct pages, via both +FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the +other, not the struct page(s). + +The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN +uses a different reference counting technique. + +FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is, +FOLL_LONGTERM is a specific case, more restrictive case of FOLL_PIN. + +Which flags are set by each wrapper +=================================== + +For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup +flags the caller provides. The caller is required to pass in a non-null struct +pages* array, and the function then pin pages by incrementing each by a special +value. For now, that value is +1, just like get_user_pages*().:: + + Function + -------- + pin_user_pages FOLL_PIN is always set internally by this function. + pin_user_pages_fast FOLL_PIN is always set internally by this function. + pin_user_pages_remote FOLL_PIN is always set internally by this function. + +For these get_user_pages*() functions, FOLL_GET might not even be specified. +Behavior is a little more complex than above. If FOLL_GET was *not* specified, +but the caller passed in a non-null struct pages* array, then the function +sets FOLL_GET for you, and proceeds to pin pages by incrementing the refcount +of each page by +1.:: + + Function + -------- + get_user_pages FOLL_GET is sometimes set internally by this function. + get_user_pages_fast FOLL_GET is sometimes set internally by this function. + get_user_pages_remote FOLL_GET is sometimes set internally by this function. + +Tracking dma-pinned pages +========================= + +Some of the key design constraints, and solutions, for tracking dma-pinned +pages: + +* An actual reference count, per struct page, is required. This is because + multiple processes may pin and unpin a page. + +* False positives (reporting that a page is dma-pinned, when in fact it is not) + are acceptable, but false negatives are not. + +* struct page may not be increased in size for this, and all fields are already + used. + +* Given the above, we can overload the page->_refcount field by using, sort of, + the upper bits in that field for a dma-pinned count. "Sort of", means that, + rather than dividing page->_refcount into bit fields, we simple add a medium- + large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024: 10 bits) to + page->_refcount. This provides fuzzy behavior: if a page has get_page() called + on it 1024 times, then it will appear to have a single dma-pinned count. + And again, that's acceptable. + +This also leads to limitations: there are only 31-10==21 bits available for a +counter that increments 10 bits at a time. + +TODO: for 1GB and larger huge pages, this is cutting it close. 
That's because +when pin_user_pages() follows such pages, it increments the head page by "1" +(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for +pin_user_pages()) for each tail page. So if you have a 1GB huge page: + +* There are 256K (18 bits) worth of 4 KB tail pages. +* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is, + 10 bits at a time) +* There are 21 - 18 == 3 bits available to count. Except that there aren't, + because you need to allow for a few normal get_page() calls on the head page, + as well. Fortunately, the approach of using addition, rather than "hard" + bitfields, within page->_refcount, allows for sharing these bits gracefully. + But we're still looking at about 8 references. + +This, however, is a missing feature more than anything else, because it's easily +solved by addressing an obvious inefficiency in the original get_user_pages() +approach of retrieving pages: stop treating all the pages as if they were +PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of +this, so some work is required. Once that's in place, this limitation mostly +disappears from view, because there will be ample refcounting range available. + +* Callers must specifically request "dma-pinned tracking of pages". In other + words, just calling get_user_pages() will not suffice; a new set of functions, + pin_user_page() and related, must be used. + +FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags +========================================================== + +Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing +these categories: + +CASE 1: Direct IO (DIO) +----------------------- +There are GUP references to pages that are serving +as DIO buffers. These buffers are needed for a relatively short time (so they +are not "long term"). No special synchronization with page_mkclean() or +munmap() is provided. Therefore, flags to set at the call site are: :: + + FOLL_PIN + +...but rather than setting FOLL_PIN directly, call sites should use one of +the pin_user_pages*() routines that set FOLL_PIN. + +CASE 2: RDMA +------------ +There are GUP references to pages that are serving as DMA +buffers. These buffers are needed for a long time ("long term"). No special +synchronization with page_mkclean() or munmap() is provided. Therefore, flags +to set at the call site are: :: + + FOLL_PIN | FOLL_LONGTERM + +NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's +because DAX pages do not have a separate page cache, and so "pinning" implies +locking down file system blocks, which is not (yet) supported in that way. + +CASE 3: Hardware with page faulting support +------------------------------------------- +Here, a well-written driver doesn't normally need to pin pages at all. However, +if the driver does choose to do so, it can register MMU notifiers for the range, +and will be called back upon invalidation. Either way (avoiding page pinning, or +using MMU notifiers to unpin upon request), there is proper synchronization with +both filesystem and mm (page_mkclean(), munmap(), etc). + +Therefore, neither flag needs to be set. + +In this case, ideally, neither get_user_pages() nor pin_user_pages() should be +called. Instead, the software should be written so that it does not pin pages. +This allows mm and filesystems to operate more efficiently and reliably. 
+ +CASE 4: Pinning for struct page manipulation only +------------------------------------------------- +Here, normal GUP calls are sufficient, so neither flag needs to be set. + +page_dma_pinned(): the whole point of pinning +============================================= + +The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able +to query, "is this page DMA-pinned?" That allows code such as page_mkclean() +(and file system writeback code in general) to make informed decisions about +what to do when a page cannot be unmapped due to such pins. + +What to do in those cases is the subject of a years-long series of discussions +and debates (see the References at the end of this document). It's a TODO item +here: fill in the details once that's worked out. Meanwhile, it's safe to say +that having this available: :: + + static inline bool page_dma_pinned(struct page *page) + +...is a prerequisite to solving the long-running gup+DMA problem. + +Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM +=================================================================== + +Another way of thinking about these flags is as a progression of restrictions: +FOLL_GET is for struct page manipulation, without affecting the data that the +struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for +short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is +a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more +restrictive case that has FOLL_PIN as a prerequisite: this is for pages that +will be pinned longterm, and whose data will be accessed. + +Unit testing +============ +This file:: + + tools/testing/selftests/vm/gup_benchmark.c + +has the following new calls to exercise the new pin*() wrapper functions: + +* PIN_FAST_BENCHMARK (./gup_benchmark -a) +* PIN_BENCHMARK (./gup_benchmark -b) + +You can monitor how many total dma-pinned pages have been acquired and released +since the system was booted, via two new /proc/vmstat entries: :: + + /proc/vmstat/nr_foll_pin_requested + /proc/vmstat/nr_foll_pin_requested + +Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is +because there is a noticeable performance drop in unpin_user_page(), when they +are activated. 
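As a brief sketch of the CASE 1 (short-term pin) pattern described above, a caller might look roughly like this; the flags, buffer handling and error paths are simplified for illustration::

  #include <linux/mm.h>

  static int example_pin_for_dio(unsigned long uaddr, struct page **pages,
                                 int nr_pages)
  {
          int pinned;

          /* FOLL_PIN is set internally by the pin_user_pages*() wrappers */
          pinned = pin_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);
          if (pinned < 0)
                  return pinned;

          /* ... perform the short-term DMA to/from the pinned pages ... */

          unpin_user_pages(pages, pinned);
          return 0;
  }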
+ +References +========== + +* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_ +* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_ +* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_ + +John Hubbard, October, 2019 diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst index c6224d039bcb..8ebe46b1af39 100644 --- a/Documentation/core-api/printk-formats.rst +++ b/Documentation/core-api/printk-formats.rst @@ -13,10 +13,10 @@ Integer types If variable is of Type, use printk format specifier: ------------------------------------------------------------ - char %hhd or %hhx - unsigned char %hhu or %hhx - short int %hd or %hx - unsigned short int %hu or %hx + char %d or %x + unsigned char %u or %x + short int %d or %x + unsigned short int %u or %x int %d or %x unsigned int %u or %x long %ld or %lx @@ -25,10 +25,10 @@ Integer types unsigned long long %llu or %llx size_t %zu or %zx ssize_t %zd or %zx - s8 %hhd or %hhx - u8 %hhu or %hhx - s16 %hd or %hx - u16 %hu or %hx + s8 %d or %x + u8 %u or %x + s16 %d or %x + u16 %u or %x s32 %d or %x u32 %u or %x s64 %lld or %llx @@ -79,6 +79,18 @@ has the added benefit of providing a unique identifier. On 64-bit machines the first 32 bits are zeroed. The kernel will print ``(ptrval)`` until it gathers enough entropy. If you *really* want the address see %px below. +Error Pointers +-------------- + +:: + + %pe -ENOSPC + +For printing error pointers (i.e. a pointer for which IS_ERR() is true) +as a symbolic error name. Error values for which no symbolic name is +known are printed in decimal, while a non-ERR_PTR passed as the +argument to %pe gets treated as ordinary %p. + Symbols/Function Pointers ------------------------- @@ -86,8 +98,6 @@ Symbols/Function Pointers %pS versatile_init+0x0/0x110 %ps versatile_init - %pF versatile_init+0x0/0x110 - %pf versatile_init %pSR versatile_init+0x9/0x110 (with __builtin_extract_return_addr() translation) %pB prev_fn_of_versatile_init+0x88/0x88 @@ -97,14 +107,6 @@ The ``S`` and ``s`` specifiers are used for printing a pointer in symbolic format. They result in the symbol name with (S) or without (s) offsets. If KALLSYMS are disabled then the symbol address is printed instead. -Note, that the ``F`` and ``f`` specifiers are identical to ``S`` (``s``) -and thus deprecated. We have ``F`` and ``f`` because on ia64, ppc64 and -parisc64 function pointers are indirect and, in fact, are function -descriptors, which require additional dereferencing before we can lookup -the symbol. As of now, ``S`` and ``s`` perform dereferencing on those -platforms (when needed), so ``F`` and ``f`` exist for compatibility -reasons only. - The ``B`` specifier results in the symbol name with offsets and should be used when printing stack backtraces. The specifier takes into consideration the effect of compiler optimisations which may occur @@ -135,6 +137,20 @@ equivalent to %lx (or %lu). %px is preferred because it is more uniquely grep'able. If in the future we need to modify the way the kernel handles printing pointers we will be better equipped to find the call sites. +Pointer Differences +------------------- + +:: + + %td 2560 + %tx a00 + +For printing the pointer differences, use the %t modifier for ptrdiff_t. + +Example:: + + printk("test: difference between pointers: %td\n", ptr2 - ptr1); + Struct Resources ---------------- @@ -428,6 +444,30 @@ Examples:: Passed by reference. 
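A short, self-contained illustration of the %pe specifier documented above; the function and message are hypothetical::

  #include <linux/err.h>
  #include <linux/printk.h>

  static void example_report(void *ret)
  {
          /* ret is expected to be an ERR_PTR() value here */
          if (IS_ERR(ret))
                  pr_warn("example: operation failed: %pe\n", ret);
  }

Calling example_report(ERR_PTR(-ENOSPC)) prints "-ENOSPC".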
+Fwnode handles +-------------- + +:: + + %pfw[fP] + +For printing information on fwnode handles. The default is to print the full +node name, including the path. The modifiers are functionally equivalent to +%pOF above. + + - f - full name of the node, including the path + - P - the name of the node including an address (if there is one) + +Examples (ACPI):: + + %pfwf \_SB.PCI0.CIO2.port@1.endpoint@0 - Full node name + %pfwP endpoint@0 - Node name + +Examples (OF):: + + %pfwf /ocp@68000000/i2c@48072000/camera@10/port/endpoint - Full name + %pfwP endpoint - Node name + Time and date (struct rtc_time) ------------------------------- diff --git a/Documentation/core-api/refcount-vs-atomic.rst b/Documentation/core-api/refcount-vs-atomic.rst index 976e85adffe8..79a009ce11df 100644 --- a/Documentation/core-api/refcount-vs-atomic.rst +++ b/Documentation/core-api/refcount-vs-atomic.rst @@ -35,7 +35,7 @@ atomics & refcounters only provide atomicity and program order (po) relation (on the same CPU). It guarantees that each ``atomic_*()`` and ``refcount_*()`` operation is atomic and instructions are executed in program order on a single CPU. -This is implemented using :c:func:`READ_ONCE`/:c:func:`WRITE_ONCE` and +This is implemented using READ_ONCE()/WRITE_ONCE() and compare-and-swap primitives. A strong (full) memory ordering guarantees that all prior loads and @@ -44,7 +44,7 @@ before any po-later instruction is executed on the same CPU. It also guarantees that all po-earlier stores on the same CPU and all propagated stores from other CPUs must propagate to all other CPUs before any po-later instruction is executed on the original -CPU (A-cumulative property). This is implemented using :c:func:`smp_mb`. +CPU (A-cumulative property). This is implemented using smp_mb(). A RELEASE memory ordering guarantees that all prior loads and stores (all po-earlier instructions) on the same CPU are completed @@ -52,14 +52,14 @@ before the operation. It also guarantees that all po-earlier stores on the same CPU and all propagated stores from other CPUs must propagate to all other CPUs before the release operation (A-cumulative property). This is implemented using -:c:func:`smp_store_release`. +smp_store_release(). An ACQUIRE memory ordering guarantees that all post loads and stores (all po-later instructions) on the same CPU are completed after the acquire operation. It also guarantees that all po-later stores on the same CPU must propagate to all other CPUs after the acquire operation executes. This is implemented using -:c:func:`smp_acquire__after_ctrl_dep`. +smp_acquire__after_ctrl_dep(). 
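These guarantees are what make the usual put/free pattern safe; a minimal sketch with a hypothetical object type::

  #include <linux/refcount.h>
  #include <linux/slab.h>

  struct example_obj {
          refcount_t ref;
          /* payload ... */
  };

  static void example_put(struct example_obj *obj)
  {
          /*
           * RELEASE ordering (plus the control dependency discussed
           * below) ensures all prior stores to *obj are visible before
           * the object is freed.
           */
          if (refcount_dec_and_test(&obj->ref))
                  kfree(obj);
  }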
A control dependency (on success) for refcounters guarantees that if a reference for an object was successfully obtained (reference @@ -78,8 +78,8 @@ case 1) - non-"Read/Modify/Write" (RMW) ops Function changes: - * :c:func:`atomic_set` --> :c:func:`refcount_set` - * :c:func:`atomic_read` --> :c:func:`refcount_read` + * atomic_set() --> refcount_set() + * atomic_read() --> refcount_read() Memory ordering guarantee changes: @@ -91,8 +91,8 @@ case 2) - increment-based ops that return no value Function changes: - * :c:func:`atomic_inc` --> :c:func:`refcount_inc` - * :c:func:`atomic_add` --> :c:func:`refcount_add` + * atomic_inc() --> refcount_inc() + * atomic_add() --> refcount_add() Memory ordering guarantee changes: @@ -103,7 +103,7 @@ case 3) - decrement-based RMW ops that return no value Function changes: - * :c:func:`atomic_dec` --> :c:func:`refcount_dec` + * atomic_dec() --> refcount_dec() Memory ordering guarantee changes: @@ -115,8 +115,8 @@ case 4) - increment-based RMW ops that return a value Function changes: - * :c:func:`atomic_inc_not_zero` --> :c:func:`refcount_inc_not_zero` - * no atomic counterpart --> :c:func:`refcount_add_not_zero` + * atomic_inc_not_zero() --> refcount_inc_not_zero() + * no atomic counterpart --> refcount_add_not_zero() Memory ordering guarantees changes: @@ -131,8 +131,8 @@ case 5) - generic dec/sub decrement-based RMW ops that return a value Function changes: - * :c:func:`atomic_dec_and_test` --> :c:func:`refcount_dec_and_test` - * :c:func:`atomic_sub_and_test` --> :c:func:`refcount_sub_and_test` + * atomic_dec_and_test() --> refcount_dec_and_test() + * atomic_sub_and_test() --> refcount_sub_and_test() Memory ordering guarantees changes: @@ -144,14 +144,14 @@ case 6) other decrement-based RMW ops that return a value Function changes: - * no atomic counterpart --> :c:func:`refcount_dec_if_one` + * no atomic counterpart --> refcount_dec_if_one() * ``atomic_add_unless(&var, -1, 1)`` --> ``refcount_dec_not_one(&var)`` Memory ordering guarantees changes: * fully ordered --> RELEASE ordering + control dependency -.. note:: :c:func:`atomic_add_unless` only provides full order on success. +.. note:: atomic_add_unless() only provides full order on success. case 7) - lock-based RMW @@ -159,10 +159,10 @@ case 7) - lock-based RMW Function changes: - * :c:func:`atomic_dec_and_lock` --> :c:func:`refcount_dec_and_lock` - * :c:func:`atomic_dec_and_mutex_lock` --> :c:func:`refcount_dec_and_mutex_lock` + * atomic_dec_and_lock() --> refcount_dec_and_lock() + * atomic_dec_and_mutex_lock() --> refcount_dec_and_mutex_lock() Memory ordering guarantees changes: * fully ordered --> RELEASE ordering + control dependency + hold - :c:func:`spin_lock` on success + spin_lock() on success diff --git a/Documentation/core-api/symbol-namespaces.rst b/Documentation/core-api/symbol-namespaces.rst new file mode 100644 index 000000000000..9b76337f6756 --- /dev/null +++ b/Documentation/core-api/symbol-namespaces.rst @@ -0,0 +1,157 @@ +================= +Symbol Namespaces +================= + +The following document describes how to use Symbol Namespaces to structure the +export surface of in-kernel symbols exported through the family of +EXPORT_SYMBOL() macros. + +.. 
Table of Contents
+
+   === 1 Introduction
+   === 2 How to define Symbol Namespaces
+   --- 2.1 Using the EXPORT_SYMBOL macros
+   --- 2.2 Using the DEFAULT_SYMBOL_NAMESPACE define
+   === 3 How to use Symbols exported in Namespaces
+   === 4 Loading Modules that use namespaced Symbols
+   === 5 Automatically creating MODULE_IMPORT_NS statements
+
+1. Introduction
+===============
+
+Symbol Namespaces have been introduced as a means to structure the export
+surface of the in-kernel API. They allow subsystem maintainers to partition
+their exported symbols into separate namespaces. That is useful for
+documentation purposes (think of the SUBSYSTEM_DEBUG namespace) as well as for
+limiting the availability of a set of symbols for use in other parts of the
+kernel. As of today, modules that make use of symbols exported into namespaces
+are required to import the namespace. Otherwise the kernel will, depending on
+its configuration, reject loading the module or warn about a missing import.
+
+2. How to define Symbol Namespaces
+==================================
+
+Symbols can be exported into namespaces using different methods. All of them
+change the way EXPORT_SYMBOL and friends are instrumented to create ksymtab
+entries.
+
+2.1 Using the EXPORT_SYMBOL macros
+==================================
+
+In addition to the macros EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL(), which allow
+exporting of kernel symbols to the kernel symbol table, variants of these are
+available to export symbols into a certain namespace: EXPORT_SYMBOL_NS() and
+EXPORT_SYMBOL_NS_GPL(). They take one additional argument: the namespace.
+Please note that due to macro expansion that argument needs to be a
+preprocessor symbol. E.g. to export the symbol `usb_stor_suspend` into the
+namespace `USB_STORAGE`, use::
+
+	EXPORT_SYMBOL_NS(usb_stor_suspend, USB_STORAGE);
+
+The corresponding ksymtab entry struct `kernel_symbol` will have the member
+`namespace` set accordingly. For a symbol exported without a namespace this
+member will be `NULL`. There is no default namespace if none is defined.
+`modpost` and kernel/module.c make use of the namespace at build time or
+module load time, respectively.
+
+2.2 Using the DEFAULT_SYMBOL_NAMESPACE define
+=============================================
+
+Defining namespaces for all symbols of a subsystem can be very verbose and may
+become hard to maintain. Therefore a default define (DEFAULT_SYMBOL_NAMESPACE)
+has been provided which, if set, becomes the default for all EXPORT_SYMBOL()
+and EXPORT_SYMBOL_GPL() macro expansions that do not specify a namespace.
+
+There are multiple ways of specifying this define; which one to use depends on
+the subsystem and the maintainer's preference. The first option is to define
+the default namespace in the `Makefile` of the subsystem. E.g. to export all
+symbols defined in usb-common into the namespace USB_COMMON, add a line like
+this to drivers/usb/common/Makefile::
+
+	ccflags-y += -DDEFAULT_SYMBOL_NAMESPACE=USB_COMMON
+
+That will affect all EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL() statements. A
+symbol exported with EXPORT_SYMBOL_NS() while this definition is present will
+still be exported into the namespace that is passed as the namespace argument,
+as this argument takes precedence over a default symbol namespace.
+
+A second option is to define the default namespace directly in the compilation
+unit as a preprocessor statement.
The above example would then read::
+
+	#undef DEFAULT_SYMBOL_NAMESPACE
+	#define DEFAULT_SYMBOL_NAMESPACE USB_COMMON
+
+within the corresponding compilation unit before any EXPORT_SYMBOL macro is
+used.
+
+3. How to use Symbols exported in Namespaces
+============================================
+
+In order to use symbols that are exported into namespaces, kernel modules need
+to explicitly import these namespaces. Otherwise the kernel might refuse to
+load the module. The module code is required to use the macro MODULE_IMPORT_NS
+for the namespaces it uses symbols from. E.g. a module using the
+usb_stor_suspend symbol from above needs to import the namespace USB_STORAGE
+using a statement like::
+
+	MODULE_IMPORT_NS(USB_STORAGE);
+
+This will create a `modinfo` tag in the module for each imported namespace.
+As a side effect, the imported namespaces of a module can be inspected with
+modinfo::
+
+	$ modinfo drivers/usb/storage/ums-karma.ko
+	[...]
+	import_ns: USB_STORAGE
+	[...]
+
+
+It is advisable to add the MODULE_IMPORT_NS() statement close to other module
+metadata definitions like MODULE_AUTHOR() or MODULE_LICENSE(). Refer to section
+5 for a way to create missing import statements automatically.
+
+4. Loading Modules that use namespaced Symbols
+==============================================
+
+At module loading time (e.g. `insmod`), the kernel will check each symbol
+referenced from the module for its availability and whether the namespace it
+might be exported to has been imported by the module. The default behaviour of
+the kernel is to reject loading modules that don't specify sufficient imports.
+An error will be logged and loading will fail with EINVAL. In order to
+allow loading of modules that don't satisfy this precondition, a configuration
+option is available: setting MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS=y will
+enable loading regardless, but will emit a warning.
+
+5. Automatically creating MODULE_IMPORT_NS statements
+=====================================================
+
+Missing namespace imports can easily be detected at build time. In fact,
+modpost will emit a warning if a module uses a symbol from a namespace
+without importing it.
+MODULE_IMPORT_NS() statements will usually be added at a definite location
+(along with other module metadata). To make the life of module authors (and
+subsystem maintainers) easier, a script and make target are available to fix
+up missing imports. Fixing missing imports can be done with::
+
+	$ make nsdeps
+
+A typical scenario for module authors would be::
+
+	- write code that depends on a symbol from a namespace that is not
+	  imported
+	- `make`
+	- notice the modpost warning about a missing import
+	- run `make nsdeps` to add the import to the correct code location
+
+For subsystem maintainers introducing a namespace, the steps are very similar.
+Again, `make nsdeps` will eventually add the missing namespace imports for
+in-tree modules::
+
+	- move or add symbols to a namespace (e.g. with EXPORT_SYMBOL_NS())
+	- `make` (preferably with an allmodconfig to cover all in-kernel
+	  modules)
+	- notice the modpost warning about a missing import
+	- run `make nsdeps` to add the import to the correct code location
+
+You can also run nsdeps for external module builds.
A typical usage is:: + + $ make -C <path_to_kernel_src> M=$PWD nsdeps diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst index fcedc5349ace..640934b6f7b4 100644 --- a/Documentation/core-api/xarray.rst +++ b/Documentation/core-api/xarray.rst @@ -25,10 +25,6 @@ good performance with large indices. If your index can be larger than ``ULONG_MAX`` then the XArray is not the data type for you. The most important user of the XArray is the page cache. -Each non-``NULL`` entry in the array has three bits associated with -it called marks. Each mark may be set or cleared independently of -the others. You can iterate over entries which are marked. - Normal pointers may be stored in the XArray directly. They must be 4-byte aligned, which is true for any pointer returned from kmalloc() and alloc_page(). It isn't true for arbitrary user-space pointers, @@ -41,12 +37,11 @@ When you retrieve an entry from the XArray, you can check whether it is a value entry by calling xa_is_value(), and convert it back to an integer by calling xa_to_value(). -Some users want to store tagged pointers instead of using the marks -described above. They can call xa_tag_pointer() to create an -entry with a tag, xa_untag_pointer() to turn a tagged entry -back into an untagged pointer and xa_pointer_tag() to retrieve -the tag of an entry. Tagged pointers use the same bits that are used -to distinguish value entries from normal pointers, so each user must +Some users want to tag the pointers they store in the XArray. You can +call xa_tag_pointer() to create an entry with a tag, xa_untag_pointer() +to turn a tagged entry back into an untagged pointer and xa_pointer_tag() +to retrieve the tag of an entry. Tagged pointers use the same bits that +are used to distinguish value entries from normal pointers, so you must decide whether they want to store value entries or tagged pointers in any particular XArray. @@ -56,10 +51,9 @@ conflict with value entries or internal entries. An unusual feature of the XArray is the ability to create entries which occupy a range of indices. Once stored to, looking up any index in the range will return the same entry as looking up any other index in -the range. Setting a mark on one index will set it on all of them. -Storing to any index will store to all of them. Multi-index entries can -be explicitly split into smaller entries, or storing ``NULL`` into any -entry will cause the XArray to forget about the range. +the range. Storing to any index will store to all of them. Multi-index +entries can be explicitly split into smaller entries, or storing ``NULL`` +into any entry will cause the XArray to forget about the range. Normal API ========== @@ -87,17 +81,11 @@ If you want to only store a new entry to an index if the current entry at that index is ``NULL``, you can use xa_insert() which returns ``-EBUSY`` if the entry is not empty. -You can enquire whether a mark is set on an entry by using -xa_get_mark(). If the entry is not ``NULL``, you can set a mark -on it by using xa_set_mark() and remove the mark from an entry by -calling xa_clear_mark(). You can ask whether any entry in the -XArray has a particular mark set by calling xa_marked(). - You can copy entries out of the XArray into a plain array by calling -xa_extract(). Or you can iterate over the present entries in -the XArray by calling xa_for_each(). You may prefer to use -xa_find() or xa_find_after() to move to the next present -entry in the XArray. +xa_extract(). 
Or you can iterate over the present entries in the XArray
+by calling xa_for_each(), xa_for_each_start() or xa_for_each_range().
+You may prefer to use xa_find() or xa_find_after() to move to the next
+present entry in the XArray.

 Calling xa_store_range() stores the same entry in a range
 of indices.  If you do this, some of the other operations will behave
@@ -124,6 +112,31 @@ xa_destroy().  If the XArray entries are pointers, you may wish to free
 the entries first.  You can do this by iterating over all present
 entries in the XArray using the xa_for_each() iterator.

+Search Marks
+------------
+
+Each entry in the array has three bits associated with it called marks.
+Each mark may be set or cleared independently of the others.  You can
+iterate over marked entries by using the xa_for_each_marked() iterator.
+
+You can enquire whether a mark is set on an entry by using
+xa_get_mark().  If the entry is not ``NULL``, you can set a mark on it
+by using xa_set_mark() and remove the mark from an entry by calling
+xa_clear_mark().  You can ask whether any entry in the XArray has a
+particular mark set by calling xa_marked().  Erasing an entry from the
+XArray causes all marks associated with that entry to be cleared.
+
+Setting or clearing a mark on any index of a multi-index entry will
+affect all indices covered by that entry.  Querying the mark on any
+index will return the same result.
+
+There is no way to iterate over entries which are not marked; the data
+structure does not allow this to be implemented efficiently.  There are
+currently no iterators to search for logical combinations of bits (e.g.
+iterate over all entries which have both ``XA_MARK_1`` and ``XA_MARK_2``
+set, or iterate over all entries which have ``XA_MARK_0`` or ``XA_MARK_2``
+set).  It would be possible to add these if a user arises.
+
 Allocating XArrays
 ------------------

@@ -180,6 +193,8 @@ No lock needed:
 Takes RCU read lock:
  * xa_load()
  * xa_for_each()
+ * xa_for_each_start()
+ * xa_for_each_range()
  * xa_find()
  * xa_find_after()
  * xa_extract()
@@ -419,10 +434,9 @@ you last processed.  If you have interrupts disabled while iterating,
 then it is good manners to pause the iteration and reenable interrupts
 every ``XA_CHECK_SCHED`` entries.

-The xas_get_mark(), xas_set_mark() and
-xas_clear_mark() functions require the xa_state cursor to have
-been moved to the appropriate location in the xarray; they will do
-nothing if you have called xas_pause() or xas_set()
+The xas_get_mark(), xas_set_mark() and xas_clear_mark() functions require
+the xa_state cursor to have been moved to the appropriate location in the
+XArray; they will do nothing if you have called xas_pause() or xas_set()
 immediately before.

 You can call xas_set_update() to have a callback function