diff options
Diffstat (limited to 'Documentation/admin-guide')
63 files changed, 5400 insertions, 350 deletions
diff --git a/Documentation/admin-guide/LSM/SafeSetID.rst b/Documentation/admin-guide/LSM/SafeSetID.rst index 212434ef65ad..7bff07ce4fdd 100644 --- a/Documentation/admin-guide/LSM/SafeSetID.rst +++ b/Documentation/admin-guide/LSM/SafeSetID.rst @@ -56,7 +56,7 @@ setid capabilities from the application completely and refactor the process spawning semantics in the application (e.g. by using a privileged helper program to do process spawning and UID/GID transitions). Unfortunately, there are a number of semantics around process spawning that would be affected by this, such -as fork() calls where the program doesn???t immediately call exec() after the +as fork() calls where the program doesn't immediately call exec() after the fork(), parent processes specifying custom environment variables or command line args for spawned child processes, or inheritance of file handles across a fork()/exec(). Because of this, as solution that uses a privileged helper in @@ -72,7 +72,7 @@ own user namespace, and only approved UIDs/GIDs could be mapped back to the initial system user namespace, affectively preventing privilege escalation. Unfortunately, it is not generally feasible to use user namespaces in isolation, without pairing them with other namespace types, which is not always an option. -Linux checks for capabilities based off of the user namespace that ???owns??? some +Linux checks for capabilities based off of the user namespace that "owns" some entity. For example, Linux has the notion that network namespaces are owned by the user namespace in which they were created. A consequence of this is that capability checks for access to a given network namespace are done by checking diff --git a/Documentation/admin-guide/acpi/fan_performance_states.rst b/Documentation/admin-guide/acpi/fan_performance_states.rst new file mode 100644 index 000000000000..21d233ca50d8 --- /dev/null +++ b/Documentation/admin-guide/acpi/fan_performance_states.rst @@ -0,0 +1,62 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================== +ACPI Fan Performance States +=========================== + +When the optional _FPS object is present under an ACPI device representing a +fan (for example, PNP0C0B or INT3404), the ACPI fan driver creates additional +"state*" attributes in the sysfs directory of the ACPI device in question. +These attributes list properties of fan performance states. + +For more information on _FPS refer to the ACPI specification at: + +http://uefi.org/specifications + +For instance, the contents of the INT3404 ACPI device sysfs directory +may look as follows:: + + $ ls -l /sys/bus/acpi/devices/INT3404:00/ + total 0 +... + -r--r--r-- 1 root root 4096 Dec 13 20:38 state0 + -r--r--r-- 1 root root 4096 Dec 13 20:38 state1 + -r--r--r-- 1 root root 4096 Dec 13 20:38 state10 + -r--r--r-- 1 root root 4096 Dec 13 20:38 state11 + -r--r--r-- 1 root root 4096 Dec 13 20:38 state2 + -r--r--r-- 1 root root 4096 Dec 13 20:38 state3 + -r--r--r-- 1 root root 4096 Dec 13 20:38 state4 + -r--r--r-- 1 root root 4096 Dec 13 20:38 state5 + -r--r--r-- 1 root root 4096 Dec 13 20:38 state6 + -r--r--r-- 1 root root 4096 Dec 13 20:38 state7 + -r--r--r-- 1 root root 4096 Dec 13 20:38 state8 + -r--r--r-- 1 root root 4096 Dec 13 20:38 state9 + -r--r--r-- 1 root root 4096 Dec 13 01:00 status + ... + +where each of the "state*" files represents one performance state of the fan +and contains a colon-separated list of 5 integer numbers (fields) with the +following interpretation:: + +control_percent:trip_point_index:speed_rpm:noise_level_mdb:power_mw + +* ``control_percent``: The percent value to be used to set the fan speed to a + specific level using the _FSL object (0-100). + +* ``trip_point_index``: The active cooling trip point number that corresponds + to this performance state (0-9). + +* ``speed_rpm``: Speed of the fan in rotations per minute. + +* ``noise_level_mdb``: Audible noise emitted by the fan in this state in + millidecibels. + +* ``power_mw``: Power draw of the fan in this state in milliwatts. + +For example:: + + $cat /sys/bus/acpi/devices/INT3404:00/state1 + 25:0:3200:12500:1250 + +When a given field is not populated or its value provided by the platform +firmware is invalid, the "not-defined" string is shown instead of the value. diff --git a/Documentation/admin-guide/acpi/index.rst b/Documentation/admin-guide/acpi/index.rst index 4d13eeea1eca..71277689ad97 100644 --- a/Documentation/admin-guide/acpi/index.rst +++ b/Documentation/admin-guide/acpi/index.rst @@ -12,3 +12,4 @@ the Linux ACPI support. dsdt-override ssdt-overlays cppc_sysfs + fan_performance_states diff --git a/Documentation/admin-guide/auxdisplay/cfag12864b.rst b/Documentation/admin-guide/auxdisplay/cfag12864b.rst new file mode 100644 index 000000000000..18c2865bd322 --- /dev/null +++ b/Documentation/admin-guide/auxdisplay/cfag12864b.rst @@ -0,0 +1,98 @@ +=================================== +cfag12864b LCD Driver Documentation +=================================== + +:License: GPLv2 +:Author & Maintainer: Miguel Ojeda Sandonis +:Date: 2006-10-27 + + + +.. INDEX + + 1. DRIVER INFORMATION + 2. DEVICE INFORMATION + 3. WIRING + 4. USERSPACE PROGRAMMING + +1. Driver Information +--------------------- + +This driver supports a cfag12864b LCD. + + +2. Device Information +--------------------- + +:Manufacturer: Crystalfontz +:Device Name: Crystalfontz 12864b LCD Series +:Device Code: cfag12864b +:Webpage: http://www.crystalfontz.com +:Device Webpage: http://www.crystalfontz.com/products/12864b/ +:Type: LCD (Liquid Crystal Display) +:Width: 128 +:Height: 64 +:Colors: 2 (B/N) +:Controller: ks0108 +:Controllers: 2 +:Pages: 8 each controller +:Addresses: 64 each page +:Data size: 1 byte each address +:Memory size: 2 * 8 * 64 * 1 = 1024 bytes = 1 Kbyte + + +3. Wiring +--------- + +The cfag12864b LCD Series don't have official wiring. + +The common wiring is done to the parallel port as shown:: + + Parallel Port cfag12864b + + Name Pin# Pin# Name + + Strobe ( 1)------------------------------(17) Enable + Data 0 ( 2)------------------------------( 4) Data 0 + Data 1 ( 3)------------------------------( 5) Data 1 + Data 2 ( 4)------------------------------( 6) Data 2 + Data 3 ( 5)------------------------------( 7) Data 3 + Data 4 ( 6)------------------------------( 8) Data 4 + Data 5 ( 7)------------------------------( 9) Data 5 + Data 6 ( 8)------------------------------(10) Data 6 + Data 7 ( 9)------------------------------(11) Data 7 + (10) [+5v]---( 1) Vdd + (11) [GND]---( 2) Ground + (12) [+5v]---(14) Reset + (13) [GND]---(15) Read / Write + Line (14)------------------------------(13) Controller Select 1 + (15) + Init (16)------------------------------(12) Controller Select 2 + Select (17)------------------------------(16) Data / Instruction + Ground (18)---[GND] [+5v]---(19) LED + + Ground (19)---[GND] + Ground (20)---[GND] E A Values: + Ground (21)---[GND] [GND]---[P1]---(18) Vee - R = Resistor = 22 ohm + Ground (22)---[GND] | - P1 = Preset = 10 Kohm + Ground (23)---[GND] ---- S ------( 3) V0 - P2 = Preset = 1 Kohm + Ground (24)---[GND] | | + Ground (25)---[GND] [GND]---[P2]---[R]---(20) LED - + + +4. Userspace Programming +------------------------ + +The cfag12864bfb describes a framebuffer device (/dev/fbX). + +It has a size of 1024 bytes = 1 Kbyte. +Each bit represents one pixel. If the bit is high, the pixel will +turn on. If the pixel is low, the pixel will turn off. + +You can use the framebuffer as a file: fopen, fwrite, fclose... +Although the LCD won't get updated until the next refresh time arrives. + +Also, you can mmap the framebuffer: open & mmap, munmap & close... +which is the best option for most uses. + +Check samples/auxdisplay/cfag12864b-example.c +for a real working userspace complete program with usage examples. diff --git a/Documentation/admin-guide/auxdisplay/index.rst b/Documentation/admin-guide/auxdisplay/index.rst new file mode 100644 index 000000000000..e466f0595248 --- /dev/null +++ b/Documentation/admin-guide/auxdisplay/index.rst @@ -0,0 +1,16 @@ +========================= +Auxiliary Display Support +========================= + +.. toctree:: + :maxdepth: 1 + + ks0108.rst + cfag12864b.rst + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/admin-guide/auxdisplay/ks0108.rst b/Documentation/admin-guide/auxdisplay/ks0108.rst new file mode 100644 index 000000000000..c0b7faf73136 --- /dev/null +++ b/Documentation/admin-guide/auxdisplay/ks0108.rst @@ -0,0 +1,50 @@ +========================================== +ks0108 LCD Controller Driver Documentation +========================================== + +:License: GPLv2 +:Author & Maintainer: Miguel Ojeda Sandonis +:Date: 2006-10-27 + + + +.. INDEX + + 1. DRIVER INFORMATION + 2. DEVICE INFORMATION + 3. WIRING + + +1. Driver Information +--------------------- + +This driver supports the ks0108 LCD controller. + + +2. Device Information +--------------------- + +:Manufacturer: Samsung +:Device Name: KS0108 LCD Controller +:Device Code: ks0108 +:Webpage: - +:Device Webpage: - +:Type: LCD Controller (Liquid Crystal Display Controller) +:Width: 64 +:Height: 64 +:Colors: 2 (B/N) +:Pages: 8 +:Addresses: 64 each page +:Data size: 1 byte each address +:Memory size: 8 * 64 * 1 = 512 bytes + + +3. Wiring +--------- + +The driver supports data parallel port wiring. + +If you aren't building LCD related hardware, you should check +your LCD specific wiring information in the same folder. + +For example, check Documentation/admin-guide/auxdisplay/cfag12864b.rst diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index 6eccf13219ff..27c77d853028 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -1,15 +1,15 @@ ======================================== -zram: Compressed RAM based block devices +zram: Compressed RAM-based block devices ======================================== Introduction ============ -The zram module creates RAM based block devices named /dev/zram<id> +The zram module creates RAM-based block devices named /dev/zram<id> (<id> = 0, 1, ...). Pages written to these disks are compressed and stored in memory itself. These disks allow very fast I/O and compression provides -good amounts of memory savings. Some of the usecases include /tmp storage, -use as swap disks, various caches under /var and maybe many more :) +good amounts of memory savings. Some of the use cases include /tmp storage, +use as swap disks, various caches under /var and maybe many more. :) Statistics for individual zram devices are exported through sysfs nodes at /sys/block/zram<id>/ @@ -43,17 +43,17 @@ The list of possible return codes: ======== ============================================================= -EBUSY an attempt to modify an attribute that cannot be changed once - the device has been initialised. Please reset device first; + the device has been initialised. Please reset device first. -ENOMEM zram was not able to allocate enough memory to fulfil your - needs; + needs. -EINVAL invalid input has been provided. ======== ============================================================= -If you use 'echo', the returned value that is changed by 'echo' utility, +If you use 'echo', the returned value is set by the 'echo' utility, and, in general case, something like:: echo 3 > /sys/block/zram0/max_comp_streams - if [ $? -ne 0 ]; + if [ $? -ne 0 ]; then handle_error fi @@ -65,7 +65,8 @@ should suffice. :: modprobe zram num_devices=4 - This creates 4 devices: /dev/zram{0,1,2,3} + +This creates 4 devices: /dev/zram{0,1,2,3} num_devices parameter is optional and tells zram how many devices should be pre-created. Default: 1. @@ -73,12 +74,12 @@ pre-created. Default: 1. 2) Set max number of compression streams ======================================== -Regardless the value passed to this attribute, ZRAM will always -allocate multiple compression streams - one per online CPUs - thus +Regardless of the value passed to this attribute, ZRAM will always +allocate multiple compression streams - one per online CPU - thus allowing several concurrent compression operations. The number of allocated compression streams goes down when some of the CPUs become offline. There is no single-compression-stream mode anymore, -unless you are running a UP system or has only 1 CPU online. +unless you are running a UP system or have only 1 CPU online. To find out how many streams are currently available:: @@ -89,7 +90,7 @@ To find out how many streams are currently available:: Using comp_algorithm device attribute one can see available and currently selected (shown in square brackets) compression algorithms, -change selected compression algorithm (once the device is initialised +or change the selected compression algorithm (once the device is initialised there is no way to change compression algorithm). Examples:: @@ -167,9 +168,9 @@ Examples:: zram provides a control interface, which enables dynamic (on-demand) device addition and removal. -In order to add a new /dev/zramX device, perform read operation on hot_add -attribute. This will return either new device's device id (meaning that you -can use /dev/zram<id>) or error code. +In order to add a new /dev/zramX device, perform a read operation on the hot_add +attribute. This will return either the new device's device id (meaning that you +can use /dev/zram<id>) or an error code. Example:: @@ -186,8 +187,8 @@ execute:: Per-device statistics are exported as various nodes under /sys/block/zram<id>/ -A brief description of exported device attributes. For more details please -read Documentation/ABI/testing/sysfs-block-zram. +A brief description of exported device attributes follows. For more details +please read Documentation/ABI/testing/sysfs-block-zram. ====================== ====== =============================================== Name access description @@ -245,7 +246,7 @@ whitespace: File /sys/block/zram<id>/mm_stat -The stat file represents device's mm statistics. It consists of a single +The mm_stat file represents the device's mm statistics. It consists of a single line of text and contains the following stats separated by whitespace: ================ ============================================================= @@ -261,7 +262,7 @@ line of text and contains the following stats separated by whitespace: Unit: bytes mem_limit the maximum amount of memory ZRAM can use to store the compressed data - mem_used_max the maximum amount of memory zram have consumed to + mem_used_max the maximum amount of memory zram has consumed to store the data same_pages the number of same element filled pages written to this disk. No memory is allocated for such pages. @@ -271,7 +272,7 @@ line of text and contains the following stats separated by whitespace: File /sys/block/zram<id>/bd_stat -The stat file represents device's backing device statistics. It consists of +The bd_stat file represents a device's backing device statistics. It consists of a single line of text and contains the following stats separated by whitespace: ============== ============================================================= @@ -316,9 +317,9 @@ To use the feature, admin should set up backing device via:: echo /dev/sda5 > /sys/block/zramX/backing_dev before disksize setting. It supports only partition at this moment. -If admin want to use incompressible page writeback, they could do via:: +If admin wants to use incompressible page writeback, they could do via:: - echo huge > /sys/block/zramX/write + echo huge > /sys/block/zramX/writeback To use idle page writeback, first, user need to declare zram pages as idle:: @@ -326,7 +327,7 @@ as idle:: echo all > /sys/block/zramX/idle From now on, any pages on zram are idle pages. The idle mark -will be removed until someone request access of the block. +will be removed until someone requests access of the block. IOW, unless there is access request, those pages are still idle pages. Admin can request writeback of those idle pages at right timing via:: @@ -341,16 +342,16 @@ to guarantee storage health for entire product life. To overcome the concern, zram supports "writeback_limit" feature. The "writeback_limit_enable"'s default value is 0 so that it doesn't limit -any writeback. IOW, if admin want to apply writeback budget, he should +any writeback. IOW, if admin wants to apply writeback budget, he should enable writeback_limit_enable via:: $ echo 1 > /sys/block/zramX/writeback_limit_enable Once writeback_limit_enable is set, zram doesn't allow any writeback -until admin set the budget via /sys/block/zramX/writeback_limit. +until admin sets the budget via /sys/block/zramX/writeback_limit. (If admin doesn't enable writeback_limit_enable, writeback_limit's value -assigned via /sys/block/zramX/writeback_limit is meaninless.) +assigned via /sys/block/zramX/writeback_limit is meaningless.) If admin want to limit writeback as per-day 400M, he could do it like below:: @@ -361,13 +362,13 @@ like below:: /sys/block/zram0/writeback_limit. $ echo 1 > /sys/block/zram0/writeback_limit_enable -If admin want to allow further write again once the bugdet is exausted, +If admins want to allow further write again once the bugdet is exhausted, he could do it like below:: $ echo $((400<<MB_SHIFT>>4K_SHIFT)) > \ /sys/block/zram0/writeback_limit -If admin want to see remaining writeback budget since he set:: +If admin wants to see remaining writeback budget since last set:: $ cat /sys/block/zramX/writeback_limit @@ -375,12 +376,12 @@ If admin want to disable writeback limit, he could do:: $ echo 0 > /sys/block/zramX/writeback_limit_enable -The writeback_limit count will reset whenever you reset zram(e.g., +The writeback_limit count will reset whenever you reset zram (e.g., system reboot, echo 1 > /sys/block/zramX/reset) so keeping how many of writeback happened until you reset the zram to allocate extra writeback budget in next setting is user's job. -If admin want to measure writeback count in a certain period, he could +If admin wants to measure writeback count in a certain period, he could know it via /sys/block/zram0/bd_stat's 3rd column. memory tracking diff --git a/Documentation/admin-guide/bootconfig.rst b/Documentation/admin-guide/bootconfig.rst new file mode 100644 index 000000000000..b342a6796392 --- /dev/null +++ b/Documentation/admin-guide/bootconfig.rst @@ -0,0 +1,190 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. _bootconfig: + +================== +Boot Configuration +================== + +:Author: Masami Hiramatsu <mhiramat@kernel.org> + +Overview +======== + +The boot configuration expands the current kernel command line to support +additional key-value data when booting the kernel in an efficient way. +This allows administrators to pass a structured-Key config file. + +Config File Syntax +================== + +The boot config syntax is a simple structured key-value. Each key consists +of dot-connected-words, and key and value are connected by ``=``. The value +has to be terminated by semi-colon (``;``) or newline (``\n``). +For array value, array entries are separated by comma (``,``). :: + +KEY[.WORD[...]] = VALUE[, VALUE2[...]][;] + +Unlike the kernel command line syntax, spaces are OK around the comma and ``=``. + +Each key word must contain only alphabets, numbers, dash (``-``) or underscore +(``_``). And each value only contains printable characters or spaces except +for delimiters such as semi-colon (``;``), new-line (``\n``), comma (``,``), +hash (``#``) and closing brace (``}``). + +If you want to use those delimiters in a value, you can use either double- +quotes (``"VALUE"``) or single-quotes (``'VALUE'``) to quote it. Note that +you can not escape these quotes. + +There can be a key which doesn't have value or has an empty value. Those keys +are used for checking if the key exists or not (like a boolean). + +Key-Value Syntax +---------------- + +The boot config file syntax allows user to merge partially same word keys +by brace. For example:: + + foo.bar.baz = value1 + foo.bar.qux.quux = value2 + +These can be written also in:: + + foo.bar { + baz = value1 + qux.quux = value2 + } + +Or more shorter, written as following:: + + foo.bar { baz = value1; qux.quux = value2 } + +In both styles, same key words are automatically merged when parsing it +at boot time. So you can append similar trees or key-values. + +Comments +-------- + +The config syntax accepts shell-script style comments. The comments starting +with hash ("#") until newline ("\n") will be ignored. + +:: + + # comment line + foo = value # value is set to foo. + bar = 1, # 1st element + 2, # 2nd element + 3 # 3rd element + +This is parsed as below:: + + foo = value + bar = 1, 2, 3 + +Note that you can not put a comment between value and delimiter(``,`` or +``;``). This means following config has a syntax error :: + + key = 1 # comment + ,2 + + +/proc/bootconfig +================ + +/proc/bootconfig is a user-space interface of the boot config. +Unlike /proc/cmdline, this file shows the key-value style list. +Each key-value pair is shown in each line with following style:: + + KEY[.WORDS...] = "[VALUE]"[,"VALUE2"...] + + +Boot Kernel With a Boot Config +============================== + +Since the boot configuration file is loaded with initrd, it will be added +to the end of the initrd (initramfs) image file. The Linux kernel decodes +the last part of the initrd image in memory to get the boot configuration +data. +Because of this "piggyback" method, there is no need to change or +update the boot loader and the kernel image itself. + +To do this operation, Linux kernel provides "bootconfig" command under +tools/bootconfig, which allows admin to apply or delete the config file +to/from initrd image. You can build it by the following command:: + + # make -C tools/bootconfig + +To add your boot config file to initrd image, run bootconfig as below +(Old data is removed automatically if exists):: + + # tools/bootconfig/bootconfig -a your-config /boot/initrd.img-X.Y.Z + +To remove the config from the image, you can use -d option as below:: + + # tools/bootconfig/bootconfig -d /boot/initrd.img-X.Y.Z + +Then add "bootconfig" on the normal kernel command line to tell the +kernel to look for the bootconfig at the end of the initrd file. + +Config File Limitation +====================== + +Currently the maximum config size size is 32KB and the total key-words (not +key-value entries) must be under 1024 nodes. +Note: this is not the number of entries but nodes, an entry must consume +more than 2 nodes (a key-word and a value). So theoretically, it will be +up to 512 key-value pairs. If keys contains 3 words in average, it can +contain 256 key-value pairs. In most cases, the number of config items +will be under 100 entries and smaller than 8KB, so it would be enough. +If the node number exceeds 1024, parser returns an error even if the file +size is smaller than 32KB. +Anyway, since bootconfig command verifies it when appending a boot config +to initrd image, user can notice it before boot. + + +Bootconfig APIs +=============== + +User can query or loop on key-value pairs, also it is possible to find +a root (prefix) key node and find key-values under that node. + +If you have a key string, you can query the value directly with the key +using xbc_find_value(). If you want to know what keys exist in the boot +config, you can use xbc_for_each_key_value() to iterate key-value pairs. +Note that you need to use xbc_array_for_each_value() for accessing +each array's value, e.g.:: + + vnode = NULL; + xbc_find_value("key.word", &vnode); + if (vnode && xbc_node_is_array(vnode)) + xbc_array_for_each_value(vnode, value) { + printk("%s ", value); + } + +If you want to focus on keys which have a prefix string, you can use +xbc_find_node() to find a node by the prefix string, and iterate +keys under the prefix node with xbc_node_for_each_key_value(). + +But the most typical usage is to get the named value under prefix +or get the named array under prefix as below:: + + root = xbc_find_node("key.prefix"); + value = xbc_node_find_value(root, "option", &vnode); + ... + xbc_node_for_each_array_value(root, "array-option", value, anode) { + ... + } + +This accesses a value of "key.prefix.option" and an array of +"key.prefix.array-option". + +Locking is not needed, since after initialization, the config becomes +read-only. All data and keys must be copied if you need to modify it. + + +Functions and structures +======================== + +.. kernel-doc:: include/linux/bootconfig.h +.. kernel-doc:: lib/bootconfig.c + diff --git a/Documentation/admin-guide/cgroup-v1/blkio-controller.rst b/Documentation/admin-guide/cgroup-v1/blkio-controller.rst index 1d7d962933be..36d43ae7dc13 100644 --- a/Documentation/admin-guide/cgroup-v1/blkio-controller.rst +++ b/Documentation/admin-guide/cgroup-v1/blkio-controller.rst @@ -130,12 +130,6 @@ Proportional weight policy files dev weight 8:16 300 -- blkio.leaf_weight[_device] - - Equivalents of blkio.weight[_device] for the purpose of - deciding how much weight tasks in the given cgroup has while - competing with the cgroup's child cgroups. For details, - please refer to Documentation/block/cfq-iosched.txt. - - blkio.time - disk time allocated to cgroup per device in milliseconds. First two fields specify the major and minor number of the device and diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 41bdc038dad9..0ae4f564c2d6 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -85,8 +85,10 @@ Brief summary of control files. memory.oom_control set/show oom controls. memory.numa_stat show the number of memory usage per numa node - memory.kmem.limit_in_bytes set/show hard limit for kernel memory + This knob is deprecated and shouldn't be + used. It is planned that this be removed in + the foreseeable future. memory.kmem.usage_in_bytes show current kernel memory allocation memory.kmem.failcnt show the number of kernel memory usage hits limits diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 3b29005aa981..3f801461f0f3 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -61,6 +61,8 @@ v1 is available under Documentation/admin-guide/cgroup-v1/. 5-6. Device 5-7. RDMA 5-7-1. RDMA Interface Files + 5-8. HugeTLB + 5.8-1. HugeTLB Interface Files 5-8. Misc 5-8-1. perf_event 5-N. Non-normative information @@ -615,8 +617,8 @@ on an IO device and is an example of this type. Protections ----------- -A cgroup is protected to be allocated upto the configured amount of -the resource if the usages of all its ancestors are under their +A cgroup is protected upto the configured amount of the resource +as long as the usages of all its ancestors are under their protected levels. Protections can be hard guarantees or best effort soft boundaries. Protections can also be over-committed in which case only upto the amount available to the parent is protected among @@ -951,6 +953,13 @@ controller implements weight and absolute bandwidth limit models for normal scheduling policy and absolute bandwidth allocation model for realtime scheduling policy. +In all the above models, cycles distribution is defined only on a temporal +base and it does not account for the frequency at which tasks are executed. +The (optional) utilization clamping support allows to hint the schedutil +cpufreq governor about the minimum desired frequency which should always be +provided by a CPU, as well as the maximum desired frequency, which should not +be exceeded by a CPU. + WARNING: cgroup2 doesn't yet support control of realtime processes and the cpu controller can only be enabled when all RT processes are in the root cgroup. Be aware that system management software may already @@ -1016,6 +1025,33 @@ All time durations are in microseconds. Shows pressure stall information for CPU. See Documentation/accounting/psi.rst for details. + cpu.uclamp.min + A read-write single value file which exists on non-root cgroups. + The default is "0", i.e. no utilization boosting. + + The requested minimum utilization (protection) as a percentage + rational number, e.g. 12.34 for 12.34%. + + This interface allows reading and setting minimum utilization clamp + values similar to the sched_setattr(2). This minimum utilization + value is used to clamp the task specific minimum utilization clamp. + + The requested minimum utilization (protection) is always capped by + the current value for the maximum utilization (limit), i.e. + `cpu.uclamp.max`. + + cpu.uclamp.max + A read-write single value file which exists on non-root cgroups. + The default is "max". i.e. no utilization capping + + The requested maximum utilization (limit) as a percentage rational + number, e.g. 98.76 for 98.76%. + + This interface allows reading and setting maximum utilization clamp + values similar to the sched_setattr(2). This maximum utilization + value is used to clamp the task specific maximum utilization clamp. + + Memory ------ @@ -1062,7 +1098,10 @@ PAGE_SIZE multiple when read back. is within its effective min boundary, the cgroup's memory won't be reclaimed under any conditions. If there is no unprotected reclaimable memory available, OOM killer - is invoked. + is invoked. Above the effective min boundary (or + effective low boundary if it is higher), pages are reclaimed + proportionally to the overage, reducing reclaim pressure for + smaller overages. Effective min boundary is limited by memory.min values of all ancestor cgroups. If there is memory.min overcommitment @@ -1083,8 +1122,12 @@ PAGE_SIZE multiple when read back. Best-effort memory protection. If the memory usage of a cgroup is within its effective low boundary, the cgroup's - memory won't be reclaimed unless memory can be reclaimed - from unprotected cgroups. + memory won't be reclaimed unless there is no reclaimable + memory available in unprotected cgroups. + Above the effective low boundary (or + effective min boundary if it is higher), pages are reclaimed + proportionally to the overage, reducing reclaim pressure for + smaller overages. Effective low boundary is limited by memory.low values of all ancestor cgroups. If there is memory.low overcommitment @@ -1248,7 +1291,12 @@ PAGE_SIZE multiple when read back. inactive_anon, active_anon, inactive_file, active_file, unevictable Amount of memory, swap-backed and filesystem-backed, on the internal memory management lists used by the - page reclaim algorithm + page reclaim algorithm. + + As these represent internal list state (eg. shmem pages are on anon + memory management lists), inactive_foo + active_foo may not be equal to + the value for the foo counter, since the foo counter is type-based, not + list-based. slab_reclaimable Part of "slab" that might be reclaimed, such as @@ -1294,7 +1342,7 @@ PAGE_SIZE multiple when read back. pgdeactivate - Amount of pages moved to the inactive LRU lis + Amount of pages moved to the inactive LRU list pglazyfree @@ -1435,6 +1483,103 @@ IO Interface Files 8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0 8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021 + io.cost.qos + A read-write nested-keyed file with exists only on the root + cgroup. + + This file configures the Quality of Service of the IO cost + model based controller (CONFIG_BLK_CGROUP_IOCOST) which + currently implements "io.weight" proportional control. Lines + are keyed by $MAJ:$MIN device numbers and not ordered. The + line for a given device is populated on the first write for + the device on "io.cost.qos" or "io.cost.model". The following + nested keys are defined. + + ====== ===================================== + enable Weight-based control enable + ctrl "auto" or "user" + rpct Read latency percentile [0, 100] + rlat Read latency threshold + wpct Write latency percentile [0, 100] + wlat Write latency threshold + min Minimum scaling percentage [1, 10000] + max Maximum scaling percentage [1, 10000] + ====== ===================================== + + The controller is disabled by default and can be enabled by + setting "enable" to 1. "rpct" and "wpct" parameters default + to zero and the controller uses internal device saturation + state to adjust the overall IO rate between "min" and "max". + + When a better control quality is needed, latency QoS + parameters can be configured. For example:: + + 8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0 + + shows that on sdb, the controller is enabled, will consider + the device saturated if the 95th percentile of read completion + latencies is above 75ms or write 150ms, and adjust the overall + IO issue rate between 50% and 150% accordingly. + + The lower the saturation point, the better the latency QoS at + the cost of aggregate bandwidth. The narrower the allowed + adjustment range between "min" and "max", the more conformant + to the cost model the IO behavior. Note that the IO issue + base rate may be far off from 100% and setting "min" and "max" + blindly can lead to a significant loss of device capacity or + control quality. "min" and "max" are useful for regulating + devices which show wide temporary behavior changes - e.g. a + ssd which accepts writes at the line speed for a while and + then completely stalls for multiple seconds. + + When "ctrl" is "auto", the parameters are controlled by the + kernel and may change automatically. Setting "ctrl" to "user" + or setting any of the percentile and latency parameters puts + it into "user" mode and disables the automatic changes. The + automatic mode can be restored by setting "ctrl" to "auto". + + io.cost.model + A read-write nested-keyed file with exists only on the root + cgroup. + + This file configures the cost model of the IO cost model based + controller (CONFIG_BLK_CGROUP_IOCOST) which currently + implements "io.weight" proportional control. Lines are keyed + by $MAJ:$MIN device numbers and not ordered. The line for a + given device is populated on the first write for the device on + "io.cost.qos" or "io.cost.model". The following nested keys + are defined. + + ===== ================================ + ctrl "auto" or "user" + model The cost model in use - "linear" + ===== ================================ + + When "ctrl" is "auto", the kernel may change all parameters + dynamically. When "ctrl" is set to "user" or any other + parameters are written to, "ctrl" become "user" and the + automatic changes are disabled. + + When "model" is "linear", the following model parameters are + defined. + + ============= ======================================== + [r|w]bps The maximum sequential IO throughput + [r|w]seqiops The maximum 4k sequential IOs per second + [r|w]randiops The maximum 4k random IOs per second + ============= ======================================== + + From the above, the builtin linear model determines the base + costs of a sequential and random IO and the cost coefficient + for the IO size. While simple, this model can cover most + common device classes acceptably. + + The IO cost model isn't expected to be accurate in absolute + sense and is scaled to the device behavior dynamically. + + If needed, tools/cgroup/iocost_coef_gen.py can be used to + generate device-specific coefficients. + io.weight A read-write flat-keyed file which exists on non-root cgroups. The default is "default 100". @@ -1783,7 +1928,7 @@ Cpuset Interface Files It accepts only the following input values when written to. - "root" - a paritition root + "root" - a partition root "member" - a non-root member of a partition When set to be a partition root, the current cgroup is the @@ -1913,6 +2058,33 @@ RDMA Interface Files mlx4_0 hca_handle=1 hca_object=20 ocrdma1 hca_handle=1 hca_object=23 +HugeTLB +------- + +The HugeTLB controller allows to limit the HugeTLB usage per control group and +enforces the controller limit during page fault. + +HugeTLB Interface Files +~~~~~~~~~~~~~~~~~~~~~~~ + + hugetlb.<hugepagesize>.current + Show current usage for "hugepagesize" hugetlb. It exists for all + the cgroup except root. + + hugetlb.<hugepagesize>.max + Set/show the hard limit of "hugepagesize" hugetlb usage. + The default value is "max". It exists for all the cgroup except root. + + hugetlb.<hugepagesize>.events + A read-only flat-keyed file which exists on non-root cgroups. + + max + The number of allocation failure due to HugeTLB limit + + hugetlb.<hugepagesize>.events.local + Similar to hugetlb.<hugepagesize>.events but the fields in the file + are local to the cgroup i.e. not hierarchical. The file modified event + generated on this file reflects only the local events. Misc ---- @@ -2351,8 +2523,10 @@ system performance due to overreclaim, to the point where the feature becomes self-defeating. The memory.low boundary on the other hand is a top-down allocated -reserve. A cgroup enjoys reclaim protection when it's within its low, -which makes delegation of subtrees possible. +reserve. A cgroup enjoys reclaim protection when it's within its +effective low, which makes delegation of subtrees possible. It also +enjoys having reclaim pressure proportional to its overage when +above its effective low. The original high boundary, the hard limit, is defined as a strict limit that can not budge, even if the OOM killer has to be called. diff --git a/Documentation/admin-guide/cifs/authors.rst b/Documentation/admin-guide/cifs/authors.rst new file mode 100644 index 000000000000..b02d6dd6c070 --- /dev/null +++ b/Documentation/admin-guide/cifs/authors.rst @@ -0,0 +1,69 @@ +======= +Authors +======= + +Original Author +--------------- + +Steve French (sfrench@samba.org) + +The author wishes to express his appreciation and thanks to: +Andrew Tridgell (Samba team) for his early suggestions about smb/cifs VFS +improvements. Thanks to IBM for allowing me time and test resources to pursue +this project, to Jim McDonough from IBM (and the Samba Team) for his help, to +the IBM Linux JFS team for explaining many esoteric Linux filesystem features. +Jeremy Allison of the Samba team has done invaluable work in adding the server +side of the original CIFS Unix extensions and reviewing and implementing +portions of the newer CIFS POSIX extensions into the Samba 3 file server. Thank +Dave Boutcher of IBM Rochester (author of the OS/400 smb/cifs filesystem client) +for proving years ago that very good smb/cifs clients could be done on Unix-like +operating systems. Volker Lendecke, Andrew Tridgell, Urban Widmark, John +Newbigin and others for their work on the Linux smbfs module. Thanks to +the other members of the Storage Network Industry Association CIFS Technical +Workgroup for their work specifying this highly complex protocol and finally +thanks to the Samba team for their technical advice and encouragement. + +Patch Contributors +------------------ + +- Zwane Mwaikambo +- Andi Kleen +- Amrut Joshi +- Shobhit Dayal +- Sergey Vlasov +- Richard Hughes +- Yury Umanets +- Mark Hamzy (for some of the early cifs IPv6 work) +- Domen Puncer +- Jesper Juhl (in particular for lots of whitespace/formatting cleanup) +- Vince Negri and Dave Stahl (for finding an important caching bug) +- Adrian Bunk (kcalloc cleanups) +- Miklos Szeredi +- Kazeon team for various fixes especially for 2.4 version. +- Asser Ferno (Change Notify support) +- Shaggy (Dave Kleikamp) for innumerable small fs suggestions and some good cleanup +- Gunter Kukkukk (testing and suggestions for support of old servers) +- Igor Mammedov (DFS support) +- Jeff Layton (many, many fixes, as well as great work on the cifs Kerberos code) +- Scott Lovenberg +- Pavel Shilovsky (for great work adding SMB2 support, and various SMB3 features) +- Aurelien Aptel (for DFS SMB3 work and some key bug fixes) +- Ronnie Sahlberg (for SMB3 xattr work, bug fixes, and lots of great work on compounding) +- Shirish Pargaonkar (for many ACL patches over the years) +- Sachin Prabhu (many bug fixes, including for reconnect, copy offload and security) +- Paulo Alcantara +- Long Li (some great work on RDMA, SMB Direct) + + +Test case and Bug Report contributors +------------------------------------- +Thanks to those in the community who have submitted detailed bug reports +and debug of problems they have found: Jochen Dolze, David Blaine, +Rene Scharfe, Martin Josefsson, Alexander Wild, Anthony Liguori, +Lars Muller, Urban Widmark, Massimiliano Ferrero, Howard Owen, +Olaf Kirch, Kieron Briggs, Nick Millington and others. Also special +mention to the Stanford Checker (SWAT) which pointed out many minor +bugs in error paths. Valuable suggestions also have come from Al Viro +and Dave Miller. + +And thanks to the IBM LTC and Power test teams and SuSE and Citrix and RedHat testers for finding multiple bugs during excellent stress test runs. diff --git a/Documentation/admin-guide/cifs/changes.rst b/Documentation/admin-guide/cifs/changes.rst new file mode 100644 index 000000000000..71f2ecb62299 --- /dev/null +++ b/Documentation/admin-guide/cifs/changes.rst @@ -0,0 +1,8 @@ +======= +Changes +======= + +See https://wiki.samba.org/index.php/LinuxCIFSKernel for summary +information (that may be easier to read than parsing the output of +"git log fs/cifs") about fixes/improvements to CIFS/SMB2/SMB3 support (changes +to cifs.ko module) by kernel version (and cifs internal module version). diff --git a/Documentation/admin-guide/cifs/index.rst b/Documentation/admin-guide/cifs/index.rst new file mode 100644 index 000000000000..fad5268635f5 --- /dev/null +++ b/Documentation/admin-guide/cifs/index.rst @@ -0,0 +1,21 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==== +CIFS +==== + +.. toctree:: + :maxdepth: 2 + + introduction + usage + todo + changes + authors + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/admin-guide/cifs/introduction.rst b/Documentation/admin-guide/cifs/introduction.rst new file mode 100644 index 000000000000..0b98f672d36f --- /dev/null +++ b/Documentation/admin-guide/cifs/introduction.rst @@ -0,0 +1,53 @@ +============ +Introduction +============ + + This is the client VFS module for the SMB3 NAS protocol as well + as for older dialects such as the Common Internet File System (CIFS) + protocol which was the successor to the Server Message Block + (SMB) protocol, the native file sharing mechanism for most early + PC operating systems. New and improved versions of CIFS are now + called SMB2 and SMB3. Use of SMB3 (and later, including SMB3.1.1) + is strongly preferred over using older dialects like CIFS due to + security reaasons. All modern dialects, including the most recent, + SMB3.1.1 are supported by the CIFS VFS module. The SMB3 protocol + is implemented and supported by all major file servers + such as all modern versions of Windows (including Windows 2016 + Server), as well as by Samba (which provides excellent + CIFS/SMB2/SMB3 server support and tools for Linux and many other + operating systems). Apple systems also support SMB3 well, as + do most Network Attached Storage vendors, so this network + filesystem client can mount to a wide variety of systems. + It also supports mounting to the cloud (for example + Microsoft Azure), including the necessary security features. + + The intent of this module is to provide the most advanced network + file system function for SMB3 compliant servers, including advanced + security features, excellent parallelized high performance i/o, better + POSIX compliance, secure per-user session establishment, encryption, + high performance safe distributed caching (leases/oplocks), optional packet + signing, large files, Unicode support and other internationalization + improvements. Since both Samba server and this filesystem client support + the CIFS Unix extensions (and in the future SMB3 POSIX extensions), + the combination can provide a reasonable alternative to other network and + cluster file systems for fileserving in some Linux to Linux environments, + not just in Linux to Windows (or Linux to Mac) environments. + + This filesystem has a mount utility (mount.cifs) and various user space + tools (including smbinfo and setcifsacl) that can be obtained from + + https://git.samba.org/?p=cifs-utils.git + + or + + git://git.samba.org/cifs-utils.git + + mount.cifs should be installed in the directory with the other mount helpers. + + For more information on the module see the project wiki page at + + https://wiki.samba.org/index.php/LinuxCIFS + + and + + https://wiki.samba.org/index.php/LinuxCIFS_utils diff --git a/Documentation/admin-guide/cifs/todo.rst b/Documentation/admin-guide/cifs/todo.rst new file mode 100644 index 000000000000..084c25f92dcb --- /dev/null +++ b/Documentation/admin-guide/cifs/todo.rst @@ -0,0 +1,133 @@ +==== +TODO +==== + +Version 2.14 December 21, 2018 + +A Partial List of Missing Features +================================== + +Contributions are welcome. There are plenty of opportunities +for visible, important contributions to this module. Here +is a partial list of the known problems and missing features: + +a) SMB3 (and SMB3.1.1) missing optional features: + + - multichannel (started), integration with RDMA + - directory leases (improved metadata caching), started (root dir only) + - T10 copy offload ie "ODX" (copy chunk, and "Duplicate Extents" ioctl + currently the only two server side copy mechanisms supported) + +b) improved sparse file support (fiemap and SEEK_HOLE are implemented + but additional features would be supportable by the protocol). + +c) Directory entry caching relies on a 1 second timer, rather than + using Directory Leases, currently only the root file handle is cached longer + +d) quota support (needs minor kernel change since quota calls + to make it to network filesystems or deviceless filesystems) + +e) Additional use cases can be optimized to use "compounding" (e.g. + open/query/close and open/setinfo/close) to reduce the number of + roundtrips to the server and improve performance. Various cases + (stat, statfs, create, unlink, mkdir) already have been improved by + using compounding but more can be done. In addition we could + significantly reduce redundant opens by using deferred close (with + handle caching leases) and better using reference counters on file + handles. + +f) Finish inotify support so kde and gnome file list windows + will autorefresh (partially complete by Asser). Needs minor kernel + vfs change to support removing D_NOTIFY on a file. + +g) Add GUI tool to configure /proc/fs/cifs settings and for display of + the CIFS statistics (started) + +h) implement support for security and trusted categories of xattrs + (requires minor protocol extension) to enable better support for SELINUX + +i) Add support for tree connect contexts (see MS-SMB2) a new SMB3.1.1 protocol + feature (may be especially useful for virtualization). + +j) Create UID mapping facility so server UIDs can be mapped on a per + mount or a per server basis to client UIDs or nobody if no mapping + exists. Also better integration with winbind for resolving SID owners + +k) Add tools to take advantage of more smb3 specific ioctls and features + (passthrough ioctl/fsctl is now implemented in cifs.ko to allow + sending various SMB3 fsctls and query info and set info calls + directly from user space) Add tools to make setting various non-POSIX + metadata attributes easier from tools (e.g. extending what was done + in smb-info tool). + +l) encrypted file support + +m) improved stats gathering tools (perhaps integration with nfsometer?) + to extend and make easier to use what is currently in /proc/fs/cifs/Stats + +n) Add support for claims based ACLs ("DAC") + +o) mount helper GUI (to simplify the various configuration options on mount) + +p) Add support for witness protocol (perhaps ioctl to cifs.ko from user space + tool listening on witness protocol RPC) to allow for notification of share + move, server failover, and server adapter changes. And also improve other + failover scenarios, e.g. when client knows multiple DFS entries point to + different servers, and the server we are connected to has gone down. + +q) Allow mount.cifs to be more verbose in reporting errors with dialect + or unsupported feature errors. + +r) updating cifs documentation, and user guide. + +s) Addressing bugs found by running a broader set of xfstests in standard + file system xfstest suite. + +t) split cifs and smb3 support into separate modules so legacy (and less + secure) CIFS dialect can be disabled in environments that don't need it + and simplify the code. + +v) POSIX Extensions for SMB3.1.1 (started, create and mkdir support added + so far). + +w) Add support for additional strong encryption types, and additional spnego + authentication mechanisms (see MS-SMB2) + +x) Finish support for SMB3.1.1 compression + +Known Bugs +========== + +See http://bugzilla.samba.org - search on product "CifsVFS" for +current bug list. Also check http://bugzilla.kernel.org (Product = File System, Component = CIFS) + +1) existing symbolic links (Windows reparse points) are recognized but + can not be created remotely. They are implemented for Samba and those that + support the CIFS Unix extensions, although earlier versions of Samba + overly restrict the pathnames. +2) follow_link and readdir code does not follow dfs junctions + but recognizes them + +Misc testing to do +================== +1) check out max path names and max path name components against various server + types. Try nested symlinks (8 deep). Return max path name in stat -f information + +2) Improve xfstest's cifs/smb3 enablement and adapt xfstests where needed to test + cifs/smb3 better + +3) Additional performance testing and optimization using iozone and similar - + there are some easy changes that can be done to parallelize sequential writes, + and when signing is disabled to request larger read sizes (larger than + negotiated size) and send larger write sizes to modern servers. + +4) More exhaustively test against less common servers + +5) Continue to extend the smb3 "buildbot" which does automated xfstesting + against Windows, Samba and Azure currently - to add additional tests and + to allow the buildbot to execute the tests faster. The URL for the + buildbot is: http://smb3-test-rhel-75.southcentralus.cloudapp.azure.com + +6) Address various coverity warnings (most are not bugs per-se, but + the more warnings are addressed, the easier it is to spot real + problems that static analyzers will point out in the future). diff --git a/Documentation/admin-guide/cifs/usage.rst b/Documentation/admin-guide/cifs/usage.rst new file mode 100644 index 000000000000..d3fb67b8a976 --- /dev/null +++ b/Documentation/admin-guide/cifs/usage.rst @@ -0,0 +1,869 @@ +===== +Usage +===== + +This module supports the SMB3 family of advanced network protocols (as well +as older dialects, originally called "CIFS" or SMB1). + +The CIFS VFS module for Linux supports many advanced network filesystem +features such as hierarchical DFS like namespace, hardlinks, locking and more. +It was designed to comply with the SNIA CIFS Technical Reference (which +supersedes the 1992 X/Open SMB Standard) as well as to perform best practice +practical interoperability with Windows 2000, Windows XP, Samba and equivalent +servers. This code was developed in participation with the Protocol Freedom +Information Foundation. CIFS and now SMB3 has now become a defacto +standard for interoperating between Macs and Windows and major NAS appliances. + +Please see +MS-SMB2 (for detailed SMB2/SMB3/SMB3.1.1 protocol specification) +http://protocolfreedom.org/ and +http://samba.org/samba/PFIF/ +for more details. + + +For questions or bug reports please contact: + + smfrench@gmail.com + +See the project page at: https://wiki.samba.org/index.php/LinuxCIFS_utils + +Build instructions +================== + +For Linux: + +1) Download the kernel (e.g. from http://www.kernel.org) + and change directory into the top of the kernel directory tree + (e.g. /usr/src/linux-2.5.73) +2) make menuconfig (or make xconfig) +3) select cifs from within the network filesystem choices +4) save and exit +5) make + + +Installation instructions +========================= + +If you have built the CIFS vfs as module (successfully) simply +type ``make modules_install`` (or if you prefer, manually copy the file to +the modules directory e.g. /lib/modules/2.4.10-4GB/kernel/fs/cifs/cifs.ko). + +If you have built the CIFS vfs into the kernel itself, follow the instructions +for your distribution on how to install a new kernel (usually you +would simply type ``make install``). + +If you do not have the utility mount.cifs (in the Samba 4.x source tree and on +the CIFS VFS web site) copy it to the same directory in which mount helpers +reside (usually /sbin). Although the helper software is not +required, mount.cifs is recommended. Most distros include a ``cifs-utils`` +package that includes this utility so it is recommended to install this. + +Note that running the Winbind pam/nss module (logon service) on all of your +Linux clients is useful in mapping Uids and Gids consistently across the +domain to the proper network user. The mount.cifs mount helper can be +found at cifs-utils.git on git.samba.org + +If cifs is built as a module, then the size and number of network buffers +and maximum number of simultaneous requests to one server can be configured. +Changing these from their defaults is not recommended. By executing modinfo:: + + modinfo kernel/fs/cifs/cifs.ko + +on kernel/fs/cifs/cifs.ko the list of configuration changes that can be made +at module initialization time (by running insmod cifs.ko) can be seen. + +Recommendations +=============== + +To improve security the SMB2.1 dialect or later (usually will get SMB3) is now +the new default. To use old dialects (e.g. to mount Windows XP) use "vers=1.0" +on mount (or vers=2.0 for Windows Vista). Note that the CIFS (vers=1.0) is +much older and less secure than the default dialect SMB3 which includes +many advanced security features such as downgrade attack detection +and encrypted shares and stronger signing and authentication algorithms. +There are additional mount options that may be helpful for SMB3 to get +improved POSIX behavior (NB: can use vers=3.0 to force only SMB3, never 2.1): + + ``mfsymlinks`` and ``cifsacl`` and ``idsfromsid`` + +Allowing User Mounts +==================== + +To permit users to mount and unmount over directories they own is possible +with the cifs vfs. A way to enable such mounting is to mark the mount.cifs +utility as suid (e.g. ``chmod +s /sbin/mount.cifs``). To enable users to +umount shares they mount requires + +1) mount.cifs version 1.4 or later +2) an entry for the share in /etc/fstab indicating that a user may + unmount it e.g.:: + + //server/usersharename /mnt/username cifs user 0 0 + +Note that when the mount.cifs utility is run suid (allowing user mounts), +in order to reduce risks, the ``nosuid`` mount flag is passed in on mount to +disallow execution of an suid program mounted on the remote target. +When mount is executed as root, nosuid is not passed in by default, +and execution of suid programs on the remote target would be enabled +by default. This can be changed, as with nfs and other filesystems, +by simply specifying ``nosuid`` among the mount options. For user mounts +though to be able to pass the suid flag to mount requires rebuilding +mount.cifs with the following flag: CIFS_ALLOW_USR_SUID + +There is a corresponding manual page for cifs mounting in the Samba 3.0 and +later source tree in docs/manpages/mount.cifs.8 + +Allowing User Unmounts +====================== + +To permit users to ummount directories that they have user mounted (see above), +the utility umount.cifs may be used. It may be invoked directly, or if +umount.cifs is placed in /sbin, umount can invoke the cifs umount helper +(at least for most versions of the umount utility) for umount of cifs +mounts, unless umount is invoked with -i (which will avoid invoking a umount +helper). As with mount.cifs, to enable user unmounts umount.cifs must be marked +as suid (e.g. ``chmod +s /sbin/umount.cifs``) or equivalent (some distributions +allow adding entries to a file to the /etc/permissions file to achieve the +equivalent suid effect). For this utility to succeed the target path +must be a cifs mount, and the uid of the current user must match the uid +of the user who mounted the resource. + +Also note that the customary way of allowing user mounts and unmounts is +(instead of using mount.cifs and unmount.cifs as suid) to add a line +to the file /etc/fstab for each //server/share you wish to mount, but +this can become unwieldy when potential mount targets include many +or unpredictable UNC names. + +Samba Considerations +==================== + +Most current servers support SMB2.1 and SMB3 which are more secure, +but there are useful protocol extensions for the older less secure CIFS +dialect, so to get the maximum benefit if mounting using the older dialect +(CIFS/SMB1), we recommend using a server that supports the SNIA CIFS +Unix Extensions standard (e.g. almost any version of Samba ie version +2.2.5 or later) but the CIFS vfs works fine with a wide variety of CIFS servers. +Note that uid, gid and file permissions will display default values if you do +not have a server that supports the Unix extensions for CIFS (such as Samba +2.2.5 or later). To enable the Unix CIFS Extensions in the Samba server, add +the line:: + + unix extensions = yes + +to your smb.conf file on the server. Note that the following smb.conf settings +are also useful (on the Samba server) when the majority of clients are Unix or +Linux:: + + case sensitive = yes + delete readonly = yes + ea support = yes + +Note that server ea support is required for supporting xattrs from the Linux +cifs client, and that EA support is present in later versions of Samba (e.g. +3.0.6 and later (also EA support works in all versions of Windows, at least to +shares on NTFS filesystems). Extended Attribute (xattr) support is an optional +feature of most Linux filesystems which may require enabling via +make menuconfig. Client support for extended attributes (user xattr) can be +disabled on a per-mount basis by specifying ``nouser_xattr`` on mount. + +The CIFS client can get and set POSIX ACLs (getfacl, setfacl) to Samba servers +version 3.10 and later. Setting POSIX ACLs requires enabling both XATTR and +then POSIX support in the CIFS configuration options when building the cifs +module. POSIX ACL support can be disabled on a per mount basic by specifying +``noacl`` on mount. + +Some administrators may want to change Samba's smb.conf ``map archive`` and +``create mask`` parameters from the default. Unless the create mask is changed +newly created files can end up with an unnecessarily restrictive default mode, +which may not be what you want, although if the CIFS Unix extensions are +enabled on the server and client, subsequent setattr calls (e.g. chmod) can +fix the mode. Note that creating special devices (mknod) remotely +may require specifying a mkdev function to Samba if you are not using +Samba 3.0.6 or later. For more information on these see the manual pages +(``man smb.conf``) on the Samba server system. Note that the cifs vfs, +unlike the smbfs vfs, does not read the smb.conf on the client system +(the few optional settings are passed in on mount via -o parameters instead). +Note that Samba 2.2.7 or later includes a fix that allows the CIFS VFS to delete +open files (required for strict POSIX compliance). Windows Servers already +supported this feature. Samba server does not allow symlinks that refer to files +outside of the share, so in Samba versions prior to 3.0.6, most symlinks to +files with absolute paths (ie beginning with slash) such as:: + + ln -s /mnt/foo bar + +would be forbidden. Samba 3.0.6 server or later includes the ability to create +such symlinks safely by converting unsafe symlinks (ie symlinks to server +files that are outside of the share) to a samba specific format on the server +that is ignored by local server applications and non-cifs clients and that will +not be traversed by the Samba server). This is opaque to the Linux client +application using the cifs vfs. Absolute symlinks will work to Samba 3.0.5 or +later, but only for remote clients using the CIFS Unix extensions, and will +be invisbile to Windows clients and typically will not affect local +applications running on the same server as Samba. + +Use instructions +================ + +Once the CIFS VFS support is built into the kernel or installed as a module +(cifs.ko), you can use mount syntax like the following to access Samba or +Mac or Windows servers:: + + mount -t cifs //9.53.216.11/e$ /mnt -o username=myname,password=mypassword + +Before -o the option -v may be specified to make the mount.cifs +mount helper display the mount steps more verbosely. +After -o the following commonly used cifs vfs specific options +are supported:: + + username=<username> + password=<password> + domain=<domain name> + +Other cifs mount options are described below. Use of TCP names (in addition to +ip addresses) is available if the mount helper (mount.cifs) is installed. If +you do not trust the server to which are mounted, or if you do not have +cifs signing enabled (and the physical network is insecure), consider use +of the standard mount options ``noexec`` and ``nosuid`` to reduce the risk of +running an altered binary on your local system (downloaded from a hostile server +or altered by a hostile router). + +Although mounting using format corresponding to the CIFS URL specification is +not possible in mount.cifs yet, it is possible to use an alternate format +for the server and sharename (which is somewhat similar to NFS style mount +syntax) instead of the more widely used UNC format (i.e. \\server\share):: + + mount -t cifs tcp_name_of_server:share_name /mnt -o user=myname,pass=mypasswd + +When using the mount helper mount.cifs, passwords may be specified via alternate +mechanisms, instead of specifying it after -o using the normal ``pass=`` syntax +on the command line: +1) By including it in a credential file. Specify credentials=filename as one +of the mount options. Credential files contain two lines:: + + username=someuser + password=your_password + +2) By specifying the password in the PASSWD environment variable (similarly + the user name can be taken from the USER environment variable). +3) By specifying the password in a file by name via PASSWD_FILE +4) By specifying the password in a file by file descriptor via PASSWD_FD + +If no password is provided, mount.cifs will prompt for password entry + +Restrictions +============ + +Servers must support either "pure-TCP" (port 445 TCP/IP CIFS connections) or RFC +1001/1002 support for "Netbios-Over-TCP/IP." This is not likely to be a +problem as most servers support this. + +Valid filenames differ between Windows and Linux. Windows typically restricts +filenames which contain certain reserved characters (e.g.the character : +which is used to delimit the beginning of a stream name by Windows), while +Linux allows a slightly wider set of valid characters in filenames. Windows +servers can remap such characters when an explicit mapping is specified in +the Server's registry. Samba starting with version 3.10 will allow such +filenames (ie those which contain valid Linux characters, which normally +would be forbidden for Windows/CIFS semantics) as long as the server is +configured for Unix Extensions (and the client has not disabled +/proc/fs/cifs/LinuxExtensionsEnabled). In addition the mount option +``mapposix`` can be used on CIFS (vers=1.0) to force the mapping of +illegal Windows/NTFS/SMB characters to a remap range (this mount parm +is the default for SMB3). This remap (``mapposix``) range is also +compatible with Mac (and "Services for Mac" on some older Windows). + +CIFS VFS Mount Options +====================== +A partial list of the supported mount options follows: + + username + The user name to use when trying to establish + the CIFS session. + password + The user password. If the mount helper is + installed, the user will be prompted for password + if not supplied. + ip + The ip address of the target server + unc + The target server Universal Network Name (export) to + mount. + domain + Set the SMB/CIFS workgroup name prepended to the + username during CIFS session establishment + forceuid + Set the default uid for inodes to the uid + passed in on mount. For mounts to servers + which do support the CIFS Unix extensions, such as a + properly configured Samba server, the server provides + the uid, gid and mode so this parameter should not be + specified unless the server and clients uid and gid + numbering differ. If the server and client are in the + same domain (e.g. running winbind or nss_ldap) and + the server supports the Unix Extensions then the uid + and gid can be retrieved from the server (and uid + and gid would not have to be specified on the mount. + For servers which do not support the CIFS Unix + extensions, the default uid (and gid) returned on lookup + of existing files will be the uid (gid) of the person + who executed the mount (root, except when mount.cifs + is configured setuid for user mounts) unless the ``uid=`` + (gid) mount option is specified. Also note that permission + checks (authorization checks) on accesses to a file occur + at the server, but there are cases in which an administrator + may want to restrict at the client as well. For those + servers which do not report a uid/gid owner + (such as Windows), permissions can also be checked at the + client, and a crude form of client side permission checking + can be enabled by specifying file_mode and dir_mode on + the client. (default) + forcegid + (similar to above but for the groupid instead of uid) (default) + noforceuid + Fill in file owner information (uid) by requesting it from + the server if possible. With this option, the value given in + the uid= option (on mount) will only be used if the server + can not support returning uids on inodes. + noforcegid + (similar to above but for the group owner, gid, instead of uid) + uid + Set the default uid for inodes, and indicate to the + cifs kernel driver which local user mounted. If the server + supports the unix extensions the default uid is + not used to fill in the owner fields of inodes (files) + unless the ``forceuid`` parameter is specified. + gid + Set the default gid for inodes (similar to above). + file_mode + If CIFS Unix extensions are not supported by the server + this overrides the default mode for file inodes. + fsc + Enable local disk caching using FS-Cache (off by default). This + option could be useful to improve performance on a slow link, + heavily loaded server and/or network where reading from the + disk is faster than reading from the server (over the network). + This could also impact scalability positively as the + number of calls to the server are reduced. However, local + caching is not suitable for all workloads for e.g. read-once + type workloads. So, you need to consider carefully your + workload/scenario before using this option. Currently, local + disk caching is functional for CIFS files opened as read-only. + dir_mode + If CIFS Unix extensions are not supported by the server + this overrides the default mode for directory inodes. + port + attempt to contact the server on this tcp port, before + trying the usual ports (port 445, then 139). + iocharset + Codepage used to convert local path names to and from + Unicode. Unicode is used by default for network path + names if the server supports it. If iocharset is + not specified then the nls_default specified + during the local client kernel build will be used. + If server does not support Unicode, this parameter is + unused. + rsize + default read size (usually 16K). The client currently + can not use rsize larger than CIFSMaxBufSize. CIFSMaxBufSize + defaults to 16K and may be changed (from 8K to the maximum + kmalloc size allowed by your kernel) at module install time + for cifs.ko. Setting CIFSMaxBufSize to a very large value + will cause cifs to use more memory and may reduce performance + in some cases. To use rsize greater than 127K (the original + cifs protocol maximum) also requires that the server support + a new Unix Capability flag (for very large read) which some + newer servers (e.g. Samba 3.0.26 or later) do. rsize can be + set from a minimum of 2048 to a maximum of 130048 (127K or + CIFSMaxBufSize, whichever is smaller) + wsize + default write size (default 57344) + maximum wsize currently allowed by CIFS is 57344 (fourteen + 4096 byte pages) + actimeo=n + attribute cache timeout in seconds (default 1 second). + After this timeout, the cifs client requests fresh attribute + information from the server. This option allows to tune the + attribute cache timeout to suit the workload needs. Shorter + timeouts mean better the cache coherency, but increased number + of calls to the server. Longer timeouts mean reduced number + of calls to the server at the expense of less stricter cache + coherency checks (i.e. incorrect attribute cache for a short + period of time). + rw + mount the network share read-write (note that the + server may still consider the share read-only) + ro + mount network share read-only + version + used to distinguish different versions of the + mount helper utility (not typically needed) + sep + if first mount option (after the -o), overrides + the comma as the separator between the mount + parms. e.g.:: + + -o user=myname,password=mypassword,domain=mydom + + could be passed instead with period as the separator by:: + + -o sep=.user=myname.password=mypassword.domain=mydom + + this might be useful when comma is contained within username + or password or domain. This option is less important + when the cifs mount helper cifs.mount (version 1.1 or later) + is used. + nosuid + Do not allow remote executables with the suid bit + program to be executed. This is only meaningful for mounts + to servers such as Samba which support the CIFS Unix Extensions. + If you do not trust the servers in your network (your mount + targets) it is recommended that you specify this option for + greater security. + exec + Permit execution of binaries on the mount. + noexec + Do not permit execution of binaries on the mount. + dev + Recognize block devices on the remote mount. + nodev + Do not recognize devices on the remote mount. + suid + Allow remote files on this mountpoint with suid enabled to + be executed (default for mounts when executed as root, + nosuid is default for user mounts). + credentials + Although ignored by the cifs kernel component, it is used by + the mount helper, mount.cifs. When mount.cifs is installed it + opens and reads the credential file specified in order + to obtain the userid and password arguments which are passed to + the cifs vfs. + guest + Although ignored by the kernel component, the mount.cifs + mount helper will not prompt the user for a password + if guest is specified on the mount options. If no + password is specified a null password will be used. + perm + Client does permission checks (vfs_permission check of uid + and gid of the file against the mode and desired operation), + Note that this is in addition to the normal ACL check on the + target machine done by the server software. + Client permission checking is enabled by default. + noperm + Client does not do permission checks. This can expose + files on this mount to access by other users on the local + client system. It is typically only needed when the server + supports the CIFS Unix Extensions but the UIDs/GIDs on the + client and server system do not match closely enough to allow + access by the user doing the mount, but it may be useful with + non CIFS Unix Extension mounts for cases in which the default + mode is specified on the mount but is not to be enforced on the + client (e.g. perhaps when MultiUserMount is enabled) + Note that this does not affect the normal ACL check on the + target machine done by the server software (of the server + ACL against the user name provided at mount time). + serverino + Use server's inode numbers instead of generating automatically + incrementing inode numbers on the client. Although this will + make it easier to spot hardlinked files (as they will have + the same inode numbers) and inode numbers may be persistent, + note that the server does not guarantee that the inode numbers + are unique if multiple server side mounts are exported under a + single share (since inode numbers on the servers might not + be unique if multiple filesystems are mounted under the same + shared higher level directory). Note that some older + (e.g. pre-Windows 2000) do not support returning UniqueIDs + or the CIFS Unix Extensions equivalent and for those + this mount option will have no effect. Exporting cifs mounts + under nfsd requires this mount option on the cifs mount. + This is now the default if server supports the + required network operation. + noserverino + Client generates inode numbers (rather than using the actual one + from the server). These inode numbers will vary after + unmount or reboot which can confuse some applications, + but not all server filesystems support unique inode + numbers. + setuids + If the CIFS Unix extensions are negotiated with the server + the client will attempt to set the effective uid and gid of + the local process on newly created files, directories, and + devices (create, mkdir, mknod). If the CIFS Unix Extensions + are not negotiated, for newly created files and directories + instead of using the default uid and gid specified on + the mount, cache the new file's uid and gid locally which means + that the uid for the file can change when the inode is + reloaded (or the user remounts the share). + nosetuids + The client will not attempt to set the uid and gid on + on newly created files, directories, and devices (create, + mkdir, mknod) which will result in the server setting the + uid and gid to the default (usually the server uid of the + user who mounted the share). Letting the server (rather than + the client) set the uid and gid is the default. If the CIFS + Unix Extensions are not negotiated then the uid and gid for + new files will appear to be the uid (gid) of the mounter or the + uid (gid) parameter specified on the mount. + netbiosname + When mounting to servers via port 139, specifies the RFC1001 + source name to use to represent the client netbios machine + name when doing the RFC1001 netbios session initialize. + direct + Do not do inode data caching on files opened on this mount. + This precludes mmapping files on this mount. In some cases + with fast networks and little or no caching benefits on the + client (e.g. when the application is doing large sequential + reads bigger than page size without rereading the same data) + this can provide better performance than the default + behavior which caches reads (readahead) and writes + (writebehind) through the local Linux client pagecache + if oplock (caching token) is granted and held. Note that + direct allows write operations larger than page size + to be sent to the server. + strictcache + Use for switching on strict cache mode. In this mode the + client read from the cache all the time it has Oplock Level II, + otherwise - read from the server. All written data are stored + in the cache, but if the client doesn't have Exclusive Oplock, + it writes the data to the server. + rwpidforward + Forward pid of a process who opened a file to any read or write + operation on that file. This prevent applications like WINE + from failing on read and write if we use mandatory brlock style. + acl + Allow setfacl and getfacl to manage posix ACLs if server + supports them. (default) + noacl + Do not allow setfacl and getfacl calls on this mount + user_xattr + Allow getting and setting user xattrs (those attributes whose + name begins with ``user.`` or ``os2.``) as OS/2 EAs (extended + attributes) to the server. This allows support of the + setfattr and getfattr utilities. (default) + nouser_xattr + Do not allow getfattr/setfattr to get/set/list xattrs + mapchars + Translate six of the seven reserved characters (not backslash):: + + *?<>|: + + to the remap range (above 0xF000), which also + allows the CIFS client to recognize files created with + such characters by Windows's POSIX emulation. This can + also be useful when mounting to most versions of Samba + (which also forbids creating and opening files + whose names contain any of these seven characters). + This has no effect if the server does not support + Unicode on the wire. + nomapchars + Do not translate any of these seven characters (default). + nocase + Request case insensitive path name matching (case + sensitive is the default if the server supports it). + (mount option ``ignorecase`` is identical to ``nocase``) + posixpaths + If CIFS Unix extensions are supported, attempt to + negotiate posix path name support which allows certain + characters forbidden in typical CIFS filenames, without + requiring remapping. (default) + noposixpaths + If CIFS Unix extensions are supported, do not request + posix path name support (this may cause servers to + reject creatingfile with certain reserved characters). + nounix + Disable the CIFS Unix Extensions for this mount (tree + connection). This is rarely needed, but it may be useful + in order to turn off multiple settings all at once (ie + posix acls, posix locks, posix paths, symlink support + and retrieving uids/gids/mode from the server) or to + work around a bug in server which implement the Unix + Extensions. + nobrl + Do not send byte range lock requests to the server. + This is necessary for certain applications that break + with cifs style mandatory byte range locks (and most + cifs servers do not yet support requesting advisory + byte range locks). + forcemandatorylock + Even if the server supports posix (advisory) byte range + locking, send only mandatory lock requests. For some + (presumably rare) applications, originally coded for + DOS/Windows, which require Windows style mandatory byte range + locking, they may be able to take advantage of this option, + forcing the cifs client to only send mandatory locks + even if the cifs server would support posix advisory locks. + ``forcemand`` is accepted as a shorter form of this mount + option. + nostrictsync + If this mount option is set, when an application does an + fsync call then the cifs client does not send an SMB Flush + to the server (to force the server to write all dirty data + for this file immediately to disk), although cifs still sends + all dirty (cached) file data to the server and waits for the + server to respond to the write. Since SMB Flush can be + very slow, and some servers may be reliable enough (to risk + delaying slightly flushing the data to disk on the server), + turning on this option may be useful to improve performance for + applications that fsync too much, at a small risk of server + crash. If this mount option is not set, by default cifs will + send an SMB flush request (and wait for a response) on every + fsync call. + nodfs + Disable DFS (global name space support) even if the + server claims to support it. This can help work around + a problem with parsing of DFS paths with Samba server + versions 3.0.24 and 3.0.25. + remount + remount the share (often used to change from ro to rw mounts + or vice versa) + cifsacl + Report mode bits (e.g. on stat) based on the Windows ACL for + the file. (EXPERIMENTAL) + servern + Specify the server 's netbios name (RFC1001 name) to use + when attempting to setup a session to the server. + This is needed for mounting to some older servers (such + as OS/2 or Windows 98 and Windows ME) since they do not + support a default server name. A server name can be up + to 15 characters long and is usually uppercased. + sfu + When the CIFS Unix Extensions are not negotiated, attempt to + create device files and fifos in a format compatible with + Services for Unix (SFU). In addition retrieve bits 10-12 + of the mode via the SETFILEBITS extended attribute (as + SFU does). In the future the bottom 9 bits of the + mode also will be emulated using queries of the security + descriptor (ACL). + mfsymlinks + Enable support for Minshall+French symlinks + (see http://wiki.samba.org/index.php/UNIX_Extensions#Minshall.2BFrench_symlinks) + This option is ignored when specified together with the + 'sfu' option. Minshall+French symlinks are used even if + the server supports the CIFS Unix Extensions. + sign + Must use packet signing (helps avoid unwanted data modification + by intermediate systems in the route). Note that signing + does not work with lanman or plaintext authentication. + seal + Must seal (encrypt) all data on this mounted share before + sending on the network. Requires support for Unix Extensions. + Note that this differs from the sign mount option in that it + causes encryption of data sent over this mounted share but other + shares mounted to the same server are unaffected. + locallease + This option is rarely needed. Fcntl F_SETLEASE is + used by some applications such as Samba and NFSv4 server to + check to see whether a file is cacheable. CIFS has no way + to explicitly request a lease, but can check whether a file + is cacheable (oplocked). Unfortunately, even if a file + is not oplocked, it could still be cacheable (ie cifs client + could grant fcntl leases if no other local processes are using + the file) for cases for example such as when the server does not + support oplocks and the user is sure that the only updates to + the file will be from this client. Specifying this mount option + will allow the cifs client to check for leases (only) locally + for files which are not oplocked instead of denying leases + in that case. (EXPERIMENTAL) + sec + Security mode. Allowed values are: + + none + attempt to connection as a null user (no name) + krb5 + Use Kerberos version 5 authentication + krb5i + Use Kerberos authentication and packet signing + ntlm + Use NTLM password hashing (default) + ntlmi + Use NTLM password hashing with signing (if + /proc/fs/cifs/PacketSigningEnabled on or if + server requires signing also can be the default) + ntlmv2 + Use NTLMv2 password hashing + ntlmv2i + Use NTLMv2 password hashing with packet signing + lanman + (if configured in kernel config) use older + lanman hash + hard + Retry file operations if server is not responding + soft + Limit retries to unresponsive servers (usually only + one retry) before returning an error. (default) + +The mount.cifs mount helper also accepts a few mount options before -o +including: + +=============== =============================================================== + -S take password from stdin (equivalent to setting the environment + variable ``PASSWD_FD=0`` + -V print mount.cifs version + -? display simple usage information +=============== =============================================================== + +With most 2.6 kernel versions of modutils, the version of the cifs kernel +module can be displayed via modinfo. + +Misc /proc/fs/cifs Flags and Debug Info +======================================= + +Informational pseudo-files: + +======================= ======================================================= +DebugData Displays information about active CIFS sessions and + shares, features enabled as well as the cifs.ko + version. +Stats Lists summary resource usage information as well as per + share statistics. +======================= ======================================================= + +Configuration pseudo-files: + +======================= ======================================================= +SecurityFlags Flags which control security negotiation and + also packet signing. Authentication (may/must) + flags (e.g. for NTLM and/or NTLMv2) may be combined with + the signing flags. Specifying two different password + hashing mechanisms (as "must use") on the other hand + does not make much sense. Default flags are:: + + 0x07007 + + (NTLM, NTLMv2 and packet signing allowed). The maximum + allowable flags if you want to allow mounts to servers + using weaker password hashes is 0x37037 (lanman, + plaintext, ntlm, ntlmv2, signing allowed). Some + SecurityFlags require the corresponding menuconfig + options to be enabled (lanman and plaintext require + CONFIG_CIFS_WEAK_PW_HASH for example). Enabling + plaintext authentication currently requires also + enabling lanman authentication in the security flags + because the cifs module only supports sending + laintext passwords using the older lanman dialect + form of the session setup SMB. (e.g. for authentication + using plain text passwords, set the SecurityFlags + to 0x30030):: + + may use packet signing 0x00001 + must use packet signing 0x01001 + may use NTLM (most common password hash) 0x00002 + must use NTLM 0x02002 + may use NTLMv2 0x00004 + must use NTLMv2 0x04004 + may use Kerberos security 0x00008 + must use Kerberos 0x08008 + may use lanman (weak) password hash 0x00010 + must use lanman password hash 0x10010 + may use plaintext passwords 0x00020 + must use plaintext passwords 0x20020 + (reserved for future packet encryption) 0x00040 + +cifsFYI If set to non-zero value, additional debug information + will be logged to the system error log. This field + contains three flags controlling different classes of + debugging entries. The maximum value it can be set + to is 7 which enables all debugging points (default 0). + Some debugging statements are not compiled into the + cifs kernel unless CONFIG_CIFS_DEBUG2 is enabled in the + kernel configuration. cifsFYI may be set to one or + nore of the following flags (7 sets them all):: + + +-----------------------------------------------+------+ + | log cifs informational messages | 0x01 | + +-----------------------------------------------+------+ + | log return codes from cifs entry points | 0x02 | + +-----------------------------------------------+------+ + | log slow responses | 0x04 | + | (ie which take longer than 1 second) | | + | | | + | CONFIG_CIFS_STATS2 must be enabled in .config | | + +-----------------------------------------------+------+ + +traceSMB If set to one, debug information is logged to the + system error log with the start of smb requests + and responses (default 0) +LookupCacheEnable If set to one, inode information is kept cached + for one second improving performance of lookups + (default 1) +LinuxExtensionsEnabled If set to one then the client will attempt to + use the CIFS "UNIX" extensions which are optional + protocol enhancements that allow CIFS servers + to return accurate UID/GID information as well + as support symbolic links. If you use servers + such as Samba that support the CIFS Unix + extensions but do not want to use symbolic link + support and want to map the uid and gid fields + to values supplied at mount (rather than the + actual values, then set this to zero. (default 1) +======================= ======================================================= + +These experimental features and tracing can be enabled by changing flags in +/proc/fs/cifs (after the cifs module has been installed or built into the +kernel, e.g. insmod cifs). To enable a feature set it to 1 e.g. to enable +tracing to the kernel message log type:: + + echo 7 > /proc/fs/cifs/cifsFYI + +cifsFYI functions as a bit mask. Setting it to 1 enables additional kernel +logging of various informational messages. 2 enables logging of non-zero +SMB return codes while 4 enables logging of requests that take longer +than one second to complete (except for byte range lock requests). +Setting it to 4 requires CONFIG_CIFS_STATS2 to be set in kernel configuration +(.config). Setting it to seven enables all three. Finally, tracing +the start of smb requests and responses can be enabled via:: + + echo 1 > /proc/fs/cifs/traceSMB + +Per share (per client mount) statistics are available in /proc/fs/cifs/Stats. +Additional information is available if CONFIG_CIFS_STATS2 is enabled in the +kernel configuration (.config). The statistics returned include counters which +represent the number of attempted and failed (ie non-zero return code from the +server) SMB3 (or cifs) requests grouped by request type (read, write, close etc.). +Also recorded is the total bytes read and bytes written to the server for +that share. Note that due to client caching effects this can be less than the +number of bytes read and written by the application running on the client. +Statistics can be reset to zero by ``echo 0 > /proc/fs/cifs/Stats`` which may be +useful if comparing performance of two different scenarios. + +Also note that ``cat /proc/fs/cifs/DebugData`` will display information about +the active sessions and the shares that are mounted. + +Enabling Kerberos (extended security) works but requires version 1.2 or later +of the helper program cifs.upcall to be present and to be configured in the +/etc/request-key.conf file. The cifs.upcall helper program is from the Samba +project(http://www.samba.org). NTLM and NTLMv2 and LANMAN support do not +require this helper. Note that NTLMv2 security (which does not require the +cifs.upcall helper program), instead of using Kerberos, is sufficient for +some use cases. + +DFS support allows transparent redirection to shares in an MS-DFS name space. +In addition, DFS support for target shares which are specified as UNC +names which begin with host names (rather than IP addresses) requires +a user space helper (such as cifs.upcall) to be present in order to +translate host names to ip address, and the user space helper must also +be configured in the file /etc/request-key.conf. Samba, Windows servers and +many NAS appliances support DFS as a way of constructing a global name +space to ease network configuration and improve reliability. + +To use cifs Kerberos and DFS support, the Linux keyutils package should be +installed and something like the following lines should be added to the +/etc/request-key.conf file:: + + create cifs.spnego * * /usr/local/sbin/cifs.upcall %k + create dns_resolver * * /usr/local/sbin/cifs.upcall %k + +CIFS kernel module parameters +============================= +These module parameters can be specified or modified either during the time of +module loading or during the runtime by using the interface:: + + /proc/module/cifs/parameters/<param> + +i.e.:: + + echo "value" > /sys/module/cifs/parameters/<param> + +================= ========================================================== +1. enable_oplocks Enable or disable oplocks. Oplocks are enabled by default. + [Y/y/1]. To disable use any of [N/n/0]. +================= ========================================================== diff --git a/Documentation/admin-guide/cifs/winucase_convert.pl b/Documentation/admin-guide/cifs/winucase_convert.pl new file mode 100755 index 000000000000..322a9c833f23 --- /dev/null +++ b/Documentation/admin-guide/cifs/winucase_convert.pl @@ -0,0 +1,62 @@ +#!/usr/bin/perl -w +# +# winucase_convert.pl -- convert "Windows 8 Upper Case Mapping Table.txt" to +# a two-level set of C arrays. +# +# Copyright 2013: Jeff Layton <jlayton@redhat.com> +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. +# + +while(<>) { + next if (!/^0x(..)(..)\t0x(....)\t/); + $firstchar = hex($1); + $secondchar = hex($2); + $uppercase = hex($3); + + $top[$firstchar][$secondchar] = $uppercase; +} + +for ($i = 0; $i < 256; $i++) { + next if (!$top[$i]); + + printf("static const wchar_t t2_%2.2x[256] = {", $i); + for ($j = 0; $j < 256; $j++) { + if (($j % 8) == 0) { + print "\n\t"; + } else { + print " "; + } + printf("0x%4.4x,", $top[$i][$j] ? $top[$i][$j] : 0); + } + print "\n};\n\n"; +} + +printf("static const wchar_t *const toplevel[256] = {", $i); +for ($i = 0; $i < 256; $i++) { + if (($i % 8) == 0) { + print "\n\t"; + } elsif ($top[$i]) { + print " "; + } else { + print " "; + } + + if ($top[$i]) { + printf("t2_%2.2x,", $i); + } else { + print "NULL,"; + } +} +print "\n};\n\n"; diff --git a/Documentation/admin-guide/dell_rbu.rst b/Documentation/admin-guide/dell_rbu.rst new file mode 100644 index 000000000000..8d70e1fc9f9d --- /dev/null +++ b/Documentation/admin-guide/dell_rbu.rst @@ -0,0 +1,128 @@ +========================================= +Dell Remote BIOS Update driver (dell_rbu) +========================================= + +Purpose +======= + +Document demonstrating the use of the Dell Remote BIOS Update driver +for updating BIOS images on Dell servers and desktops. + +Scope +===== + +This document discusses the functionality of the rbu driver only. +It does not cover the support needed from applications to enable the BIOS to +update itself with the image downloaded in to the memory. + +Overview +======== + +This driver works with Dell OpenManage or Dell Update Packages for updating +the BIOS on Dell servers (starting from servers sold since 1999), desktops +and notebooks (starting from those sold in 2005). + +Please go to http://support.dell.com register and you can find info on +OpenManage and Dell Update packages (DUP). + +Libsmbios can also be used to update BIOS on Dell systems go to +http://linux.dell.com/libsmbios/ for details. + +Dell_RBU driver supports BIOS update using the monolithic image and packetized +image methods. In case of monolithic the driver allocates a contiguous chunk +of physical pages having the BIOS image. In case of packetized the app +using the driver breaks the image in to packets of fixed sizes and the driver +would place each packet in contiguous physical memory. The driver also +maintains a link list of packets for reading them back. + +If the dell_rbu driver is unloaded all the allocated memory is freed. + +The rbu driver needs to have an application (as mentioned above) which will +inform the BIOS to enable the update in the next system reboot. + +The user should not unload the rbu driver after downloading the BIOS image +or updating. + +The driver load creates the following directories under the /sys file system:: + + /sys/class/firmware/dell_rbu/loading + /sys/class/firmware/dell_rbu/data + /sys/devices/platform/dell_rbu/image_type + /sys/devices/platform/dell_rbu/data + /sys/devices/platform/dell_rbu/packet_size + +The driver supports two types of update mechanism; monolithic and packetized. +These update mechanism depends upon the BIOS currently running on the system. +Most of the Dell systems support a monolithic update where the BIOS image is +copied to a single contiguous block of physical memory. + +In case of packet mechanism the single memory can be broken in smaller chunks +of contiguous memory and the BIOS image is scattered in these packets. + +By default the driver uses monolithic memory for the update type. This can be +changed to packets during the driver load time by specifying the load +parameter image_type=packet. This can also be changed later as below:: + + echo packet > /sys/devices/platform/dell_rbu/image_type + +In packet update mode the packet size has to be given before any packets can +be downloaded. It is done as below:: + + echo XXXX > /sys/devices/platform/dell_rbu/packet_size + +In the packet update mechanism, the user needs to create a new file having +packets of data arranged back to back. It can be done as follows: +The user creates packets header, gets the chunk of the BIOS image and +places it next to the packetheader; now, the packetheader + BIOS image chunk +added together should match the specified packet_size. This makes one +packet, the user needs to create more such packets out of the entire BIOS +image file and then arrange all these packets back to back in to one single +file. + +This file is then copied to /sys/class/firmware/dell_rbu/data. +Once this file gets to the driver, the driver extracts packet_size data from +the file and spreads it across the physical memory in contiguous packet_sized +space. + +This method makes sure that all the packets get to the driver in a single operation. + +In monolithic update the user simply get the BIOS image (.hdr file) and copies +to the data file as is without any change to the BIOS image itself. + +Do the steps below to download the BIOS image. + +1) echo 1 > /sys/class/firmware/dell_rbu/loading +2) cp bios_image.hdr /sys/class/firmware/dell_rbu/data +3) echo 0 > /sys/class/firmware/dell_rbu/loading + +The /sys/class/firmware/dell_rbu/ entries will remain till the following is +done. + +:: + + echo -1 > /sys/class/firmware/dell_rbu/loading + +Until this step is completed the driver cannot be unloaded. + +Also echoing either mono, packet or init in to image_type will free up the +memory allocated by the driver. + +If a user by accident executes steps 1 and 3 above without executing step 2; +it will make the /sys/class/firmware/dell_rbu/ entries disappear. + +The entries can be recreated by doing the following:: + + echo init > /sys/devices/platform/dell_rbu/image_type + +.. note:: echoing init in image_type does not change its original value. + +Also the driver provides /sys/devices/platform/dell_rbu/data readonly file to +read back the image downloaded. + +.. note:: + + After updating the BIOS image a user mode application needs to execute + code which sends the BIOS update request to the BIOS. So on the next reboot + the BIOS knows about the new image downloaded and it updates itself. + Also don't unload the rbu driver if the image has to be updated. + diff --git a/Documentation/admin-guide/device-mapper/dm-clone.rst b/Documentation/admin-guide/device-mapper/dm-clone.rst new file mode 100644 index 000000000000..b43a34c1430a --- /dev/null +++ b/Documentation/admin-guide/device-mapper/dm-clone.rst @@ -0,0 +1,333 @@ +.. SPDX-License-Identifier: GPL-2.0-only + +======== +dm-clone +======== + +Introduction +============ + +dm-clone is a device mapper target which produces a one-to-one copy of an +existing, read-only source device into a writable destination device: It +presents a virtual block device which makes all data appear immediately, and +redirects reads and writes accordingly. + +The main use case of dm-clone is to clone a potentially remote, high-latency, +read-only, archival-type block device into a writable, fast, primary-type device +for fast, low-latency I/O. The cloned device is visible/mountable immediately +and the copy of the source device to the destination device happens in the +background, in parallel with user I/O. + +For example, one could restore an application backup from a read-only copy, +accessible through a network storage protocol (NBD, Fibre Channel, iSCSI, AoE, +etc.), into a local SSD or NVMe device, and start using the device immediately, +without waiting for the restore to complete. + +When the cloning completes, the dm-clone table can be removed altogether and be +replaced, e.g., by a linear table, mapping directly to the destination device. + +The dm-clone target reuses the metadata library used by the thin-provisioning +target. + +Glossary +======== + + Hydration + The process of filling a region of the destination device with data from + the same region of the source device, i.e., copying the region from the + source to the destination device. + +Once a region gets hydrated we redirect all I/O regarding it to the destination +device. + +Design +====== + +Sub-devices +----------- + +The target is constructed by passing three devices to it (along with other +parameters detailed later): + +1. A source device - the read-only device that gets cloned and source of the + hydration. + +2. A destination device - the destination of the hydration, which will become a + clone of the source device. + +3. A small metadata device - it records which regions are already valid in the + destination device, i.e., which regions have already been hydrated, or have + been written to directly, via user I/O. + +The size of the destination device must be at least equal to the size of the +source device. + +Regions +------- + +dm-clone divides the source and destination devices in fixed sized regions. +Regions are the unit of hydration, i.e., the minimum amount of data copied from +the source to the destination device. + +The region size is configurable when you first create the dm-clone device. The +recommended region size is the same as the file system block size, which usually +is 4KB. The region size must be between 8 sectors (4KB) and 2097152 sectors +(1GB) and a power of two. + +Reads and writes from/to hydrated regions are serviced from the destination +device. + +A read to a not yet hydrated region is serviced directly from the source device. + +A write to a not yet hydrated region will be delayed until the corresponding +region has been hydrated and the hydration of the region starts immediately. + +Note that a write request with size equal to region size will skip copying of +the corresponding region from the source device and overwrite the region of the +destination device directly. + +Discards +-------- + +dm-clone interprets a discard request to a range that hasn't been hydrated yet +as a hint to skip hydration of the regions covered by the request, i.e., it +skips copying the region's data from the source to the destination device, and +only updates its metadata. + +If the destination device supports discards, then by default dm-clone will pass +down discard requests to it. + +Background Hydration +-------------------- + +dm-clone copies continuously from the source to the destination device, until +all of the device has been copied. + +Copying data from the source to the destination device uses bandwidth. The user +can set a throttle to prevent more than a certain amount of copying occurring at +any one time. Moreover, dm-clone takes into account user I/O traffic going to +the devices and pauses the background hydration when there is I/O in-flight. + +A message `hydration_threshold <#regions>` can be used to set the maximum number +of regions being copied, the default being 1 region. + +dm-clone employs dm-kcopyd for copying portions of the source device to the +destination device. By default, we issue copy requests of size equal to the +region size. A message `hydration_batch_size <#regions>` can be used to tune the +size of these copy requests. Increasing the hydration batch size results in +dm-clone trying to batch together contiguous regions, so we copy the data in +batches of this many regions. + +When the hydration of the destination device finishes, a dm event will be sent +to user space. + +Updating on-disk metadata +------------------------- + +On-disk metadata is committed every time a FLUSH or FUA bio is written. If no +such requests are made then commits will occur every second. This means the +dm-clone device behaves like a physical disk that has a volatile write cache. If +power is lost you may lose some recent writes. The metadata should always be +consistent in spite of any crash. + +Target Interface +================ + +Constructor +----------- + + :: + + clone <metadata dev> <destination dev> <source dev> <region size> + [<#feature args> [<feature arg>]* [<#core args> [<core arg>]*]] + + ================ ============================================================== + metadata dev Fast device holding the persistent metadata + destination dev The destination device, where the source will be cloned + source dev Read only device containing the data that gets cloned + region size The size of a region in sectors + + #feature args Number of feature arguments passed + feature args no_hydration or no_discard_passdown + + #core args An even number of arguments corresponding to key/value pairs + passed to dm-clone + core args Key/value pairs passed to dm-clone, e.g. `hydration_threshold + 256` + ================ ============================================================== + +Optional feature arguments are: + + ==================== ========================================================= + no_hydration Create a dm-clone instance with background hydration + disabled + no_discard_passdown Disable passing down discards to the destination device + ==================== ========================================================= + +Optional core arguments are: + + ================================ ============================================== + hydration_threshold <#regions> Maximum number of regions being copied from + the source to the destination device at any + one time, during background hydration. + hydration_batch_size <#regions> During background hydration, try to batch + together contiguous regions, so we copy data + from the source to the destination device in + batches of this many regions. + ================================ ============================================== + +Status +------ + + :: + + <metadata block size> <#used metadata blocks>/<#total metadata blocks> + <region size> <#hydrated regions>/<#total regions> <#hydrating regions> + <#feature args> <feature args>* <#core args> <core args>* + <clone metadata mode> + + ======================= ======================================================= + metadata block size Fixed block size for each metadata block in sectors + #used metadata blocks Number of metadata blocks used + #total metadata blocks Total number of metadata blocks + region size Configurable region size for the device in sectors + #hydrated regions Number of regions that have finished hydrating + #total regions Total number of regions to hydrate + #hydrating regions Number of regions currently hydrating + #feature args Number of feature arguments to follow + feature args Feature arguments, e.g. `no_hydration` + #core args Even number of core arguments to follow + core args Key/value pairs for tuning the core, e.g. + `hydration_threshold 256` + clone metadata mode ro if read-only, rw if read-write + + In serious cases where even a read-only mode is deemed + unsafe no further I/O will be permitted and the status + will just contain the string 'Fail'. If the metadata + mode changes, a dm event will be sent to user space. + ======================= ======================================================= + +Messages +-------- + + `disable_hydration` + Disable the background hydration of the destination device. + + `enable_hydration` + Enable the background hydration of the destination device. + + `hydration_threshold <#regions>` + Set background hydration threshold. + + `hydration_batch_size <#regions>` + Set background hydration batch size. + +Examples +======== + +Clone a device containing a file system +--------------------------------------- + +1. Create the dm-clone device. + + :: + + dmsetup create clone --table "0 1048576000 clone $metadata_dev $dest_dev \ + $source_dev 8 1 no_hydration" + +2. Mount the device and trim the file system. dm-clone interprets the discards + sent by the file system and it will not hydrate the unused space. + + :: + + mount /dev/mapper/clone /mnt/cloned-fs + fstrim /mnt/cloned-fs + +3. Enable background hydration of the destination device. + + :: + + dmsetup message clone 0 enable_hydration + +4. When the hydration finishes, we can replace the dm-clone table with a linear + table. + + :: + + dmsetup suspend clone + dmsetup load clone --table "0 1048576000 linear $dest_dev 0" + dmsetup resume clone + + The metadata device is no longer needed and can be safely discarded or reused + for other purposes. + +Known issues +============ + +1. We redirect reads, to not-yet-hydrated regions, to the source device. If + reading the source device has high latency and the user repeatedly reads from + the same regions, this behaviour could degrade performance. We should use + these reads as hints to hydrate the relevant regions sooner. Currently, we + rely on the page cache to cache these regions, so we hopefully don't end up + reading them multiple times from the source device. + +2. Release in-core resources, i.e., the bitmaps tracking which regions are + hydrated, after the hydration has finished. + +3. During background hydration, if we fail to read the source or write to the + destination device, we print an error message, but the hydration process + continues indefinitely, until it succeeds. We should stop the background + hydration after a number of failures and emit a dm event for user space to + notice. + +Why not...? +=========== + +We explored the following alternatives before implementing dm-clone: + +1. Use dm-cache with cache size equal to the source device and implement a new + cloning policy: + + * The resulting cache device is not a one-to-one mirror of the source device + and thus we cannot remove the cache device once cloning completes. + + * dm-cache writes to the source device, which violates our requirement that + the source device must be treated as read-only. + + * Caching is semantically different from cloning. + +2. Use dm-snapshot with a COW device equal to the source device: + + * dm-snapshot stores its metadata in the COW device, so the resulting device + is not a one-to-one mirror of the source device. + + * No background copying mechanism. + + * dm-snapshot needs to commit its metadata whenever a pending exception + completes, to ensure snapshot consistency. In the case of cloning, we don't + need to be so strict and can rely on committing metadata every time a FLUSH + or FUA bio is written, or periodically, like dm-thin and dm-cache do. This + improves the performance significantly. + +3. Use dm-mirror: The mirror target has a background copying/mirroring + mechanism, but it writes to all mirrors, thus violating our requirement that + the source device must be treated as read-only. + +4. Use dm-thin's external snapshot functionality. This approach is the most + promising among all alternatives, as the thinly-provisioned volume is a + one-to-one mirror of the source device and handles reads and writes to + un-provisioned/not-yet-cloned areas the same way as dm-clone does. + + Still: + + * There is no background copying mechanism, though one could be implemented. + + * Most importantly, we want to support arbitrary block devices as the + destination of the cloning process and not restrict ourselves to + thinly-provisioned volumes. Thin-provisioning has an inherent metadata + overhead, for maintaining the thin volume mappings, which significantly + degrades performance. + + Moreover, cloning a device shouldn't force the use of thin-provisioning. On + the other hand, if we wish to use thin provisioning, we can just use a thin + LV as dm-clone's destination device. diff --git a/Documentation/admin-guide/device-mapper/dm-dust.txt b/Documentation/admin-guide/device-mapper/dm-dust.rst index 954d402a1f6a..b6e7e7ead831 100644 --- a/Documentation/admin-guide/device-mapper/dm-dust.txt +++ b/Documentation/admin-guide/device-mapper/dm-dust.rst @@ -31,218 +31,233 @@ configured "bad blocks" will be treated as bad, or bypassed. This allows the pre-writing of test data and metadata prior to simulating a "failure" event where bad sectors start to appear. -Table parameters: ------------------ +Table parameters +---------------- <device_path> <offset> <blksz> Mandatory parameters: - <device_path>: path to the block device. - <offset>: offset to data area from start of device_path - <blksz>: block size in bytes + <device_path>: + Path to the block device. + + <offset>: + Offset to data area from start of device_path + + <blksz>: + Block size in bytes + (minimum 512, maximum 1073741824, must be a power of 2) -Usage instructions: -------------------- +Usage instructions +------------------ -First, find the size (in 512-byte sectors) of the device to be used: +First, find the size (in 512-byte sectors) of the device to be used:: -$ sudo blockdev --getsz /dev/vdb1 -33552384 + $ sudo blockdev --getsz /dev/vdb1 + 33552384 Create the dm-dust device: (For a device with a block size of 512 bytes) -$ sudo dmsetup create dust1 --table '0 33552384 dust /dev/vdb1 0 512' + +:: + + $ sudo dmsetup create dust1 --table '0 33552384 dust /dev/vdb1 0 512' (For a device with a block size of 4096 bytes) -$ sudo dmsetup create dust1 --table '0 33552384 dust /dev/vdb1 0 4096' + +:: + + $ sudo dmsetup create dust1 --table '0 33552384 dust /dev/vdb1 0 4096' Check the status of the read behavior ("bypass" indicates that all I/O -will be passed through to the underlying device): -$ sudo dmsetup status dust1 -0 33552384 dust 252:17 bypass +will be passed through to the underlying device):: + + $ sudo dmsetup status dust1 + 0 33552384 dust 252:17 bypass -$ sudo dd if=/dev/mapper/dust1 of=/dev/null bs=512 count=128 iflag=direct -128+0 records in -128+0 records out + $ sudo dd if=/dev/mapper/dust1 of=/dev/null bs=512 count=128 iflag=direct + 128+0 records in + 128+0 records out -$ sudo dd if=/dev/zero of=/dev/mapper/dust1 bs=512 count=128 oflag=direct -128+0 records in -128+0 records out + $ sudo dd if=/dev/zero of=/dev/mapper/dust1 bs=512 count=128 oflag=direct + 128+0 records in + 128+0 records out -Adding and removing bad blocks: -------------------------------- +Adding and removing bad blocks +------------------------------ At any time (i.e.: whether the device has the "bad block" emulation enabled or disabled), bad blocks may be added or removed from the -device via the "addbadblock" and "removebadblock" messages: +device via the "addbadblock" and "removebadblock" messages:: -$ sudo dmsetup message dust1 0 addbadblock 60 -kernel: device-mapper: dust: badblock added at block 60 + $ sudo dmsetup message dust1 0 addbadblock 60 + kernel: device-mapper: dust: badblock added at block 60 -$ sudo dmsetup message dust1 0 addbadblock 67 -kernel: device-mapper: dust: badblock added at block 67 + $ sudo dmsetup message dust1 0 addbadblock 67 + kernel: device-mapper: dust: badblock added at block 67 -$ sudo dmsetup message dust1 0 addbadblock 72 -kernel: device-mapper: dust: badblock added at block 72 + $ sudo dmsetup message dust1 0 addbadblock 72 + kernel: device-mapper: dust: badblock added at block 72 These bad blocks will be stored in the "bad block list". -While the device is in "bypass" mode, reads and writes will succeed: +While the device is in "bypass" mode, reads and writes will succeed:: -$ sudo dmsetup status dust1 -0 33552384 dust 252:17 bypass + $ sudo dmsetup status dust1 + 0 33552384 dust 252:17 bypass -Enabling block read failures: ------------------------------ +Enabling block read failures +---------------------------- -To enable the "fail read on bad block" behavior, send the "enable" message: +To enable the "fail read on bad block" behavior, send the "enable" message:: -$ sudo dmsetup message dust1 0 enable -kernel: device-mapper: dust: enabling read failures on bad sectors + $ sudo dmsetup message dust1 0 enable + kernel: device-mapper: dust: enabling read failures on bad sectors -$ sudo dmsetup status dust1 -0 33552384 dust 252:17 fail_read_on_bad_block + $ sudo dmsetup status dust1 + 0 33552384 dust 252:17 fail_read_on_bad_block With the device in "fail read on bad block" mode, attempting to read a -block will encounter an "Input/output error": +block will encounter an "Input/output error":: -$ sudo dd if=/dev/mapper/dust1 of=/dev/null bs=512 count=1 skip=67 iflag=direct -dd: error reading '/dev/mapper/dust1': Input/output error -0+0 records in -0+0 records out -0 bytes copied, 0.00040651 s, 0.0 kB/s + $ sudo dd if=/dev/mapper/dust1 of=/dev/null bs=512 count=1 skip=67 iflag=direct + dd: error reading '/dev/mapper/dust1': Input/output error + 0+0 records in + 0+0 records out + 0 bytes copied, 0.00040651 s, 0.0 kB/s ...and writing to the bad blocks will remove the blocks from the list, -therefore emulating the "remap" behavior of hard disk drives: +therefore emulating the "remap" behavior of hard disk drives:: -$ sudo dd if=/dev/zero of=/dev/mapper/dust1 bs=512 count=128 oflag=direct -128+0 records in -128+0 records out + $ sudo dd if=/dev/zero of=/dev/mapper/dust1 bs=512 count=128 oflag=direct + 128+0 records in + 128+0 records out -kernel: device-mapper: dust: block 60 removed from badblocklist by write -kernel: device-mapper: dust: block 67 removed from badblocklist by write -kernel: device-mapper: dust: block 72 removed from badblocklist by write -kernel: device-mapper: dust: block 87 removed from badblocklist by write + kernel: device-mapper: dust: block 60 removed from badblocklist by write + kernel: device-mapper: dust: block 67 removed from badblocklist by write + kernel: device-mapper: dust: block 72 removed from badblocklist by write + kernel: device-mapper: dust: block 87 removed from badblocklist by write -Bad block add/remove error handling: ------------------------------------- +Bad block add/remove error handling +----------------------------------- Attempting to add a bad block that already exists in the list will -result in an "Invalid argument" error, as well as a helpful message: +result in an "Invalid argument" error, as well as a helpful message:: -$ sudo dmsetup message dust1 0 addbadblock 88 -device-mapper: message ioctl on dust1 failed: Invalid argument -kernel: device-mapper: dust: block 88 already in badblocklist + $ sudo dmsetup message dust1 0 addbadblock 88 + device-mapper: message ioctl on dust1 failed: Invalid argument + kernel: device-mapper: dust: block 88 already in badblocklist Attempting to remove a bad block that doesn't exist in the list will -result in an "Invalid argument" error, as well as a helpful message: +result in an "Invalid argument" error, as well as a helpful message:: -$ sudo dmsetup message dust1 0 removebadblock 87 -device-mapper: message ioctl on dust1 failed: Invalid argument -kernel: device-mapper: dust: block 87 not found in badblocklist + $ sudo dmsetup message dust1 0 removebadblock 87 + device-mapper: message ioctl on dust1 failed: Invalid argument + kernel: device-mapper: dust: block 87 not found in badblocklist -Counting the number of bad blocks in the bad block list: --------------------------------------------------------- +Counting the number of bad blocks in the bad block list +------------------------------------------------------- To count the number of bad blocks configured in the device, run the -following message command: +following message command:: -$ sudo dmsetup message dust1 0 countbadblocks + $ sudo dmsetup message dust1 0 countbadblocks A message will print with the number of bad blocks currently -configured on the device: +configured on the device:: -kernel: device-mapper: dust: countbadblocks: 895 badblock(s) found + kernel: device-mapper: dust: countbadblocks: 895 badblock(s) found -Querying for specific bad blocks: ---------------------------------- +Querying for specific bad blocks +-------------------------------- To find out if a specific block is in the bad block list, run the -following message command: +following message command:: -$ sudo dmsetup message dust1 0 queryblock 72 + $ sudo dmsetup message dust1 0 queryblock 72 -The following message will print if the block is in the list: -device-mapper: dust: queryblock: block 72 found in badblocklist +The following message will print if the block is in the list:: -The following message will print if the block is in the list: -device-mapper: dust: queryblock: block 72 not found in badblocklist + device-mapper: dust: queryblock: block 72 found in badblocklist + +The following message will print if the block is not in the list:: + + device-mapper: dust: queryblock: block 72 not found in badblocklist The "queryblock" message command will work in both the "enabled" and "disabled" modes, allowing the verification of whether a block will be treated as "bad" without having to issue I/O to the device, or having to "enable" the bad block emulation. -Clearing the bad block list: ----------------------------- +Clearing the bad block list +--------------------------- To clear the bad block list (without needing to individually run a "removebadblock" message command for every block), run the -following message command: +following message command:: -$ sudo dmsetup message dust1 0 clearbadblocks + $ sudo dmsetup message dust1 0 clearbadblocks -After clearing the bad block list, the following message will appear: +After clearing the bad block list, the following message will appear:: -kernel: device-mapper: dust: clearbadblocks: badblocks cleared + kernel: device-mapper: dust: clearbadblocks: badblocks cleared If there were no bad blocks to clear, the following message will -appear: +appear:: -kernel: device-mapper: dust: clearbadblocks: no badblocks found + kernel: device-mapper: dust: clearbadblocks: no badblocks found -Message commands list: ----------------------- +Message commands list +--------------------- Below is a list of the messages that can be sent to a dust device: -Operations on blocks (requires a <blknum> argument): +Operations on blocks (requires a <blknum> argument):: -addbadblock <blknum> -queryblock <blknum> -removebadblock <blknum> + addbadblock <blknum> + queryblock <blknum> + removebadblock <blknum> ...where <blknum> is a block number within range of the device - (corresponding to the block size of the device.) +(corresponding to the block size of the device.) -Single argument message commands: +Single argument message commands:: -countbadblocks -clearbadblocks -disable -enable -quiet + countbadblocks + clearbadblocks + disable + enable + quiet -Device removal: ---------------- +Device removal +-------------- -When finished, remove the device via the "dmsetup remove" command: +When finished, remove the device via the "dmsetup remove" command:: -$ sudo dmsetup remove dust1 + $ sudo dmsetup remove dust1 -Quiet mode: ------------ +Quiet mode +---------- On test runs with many bad blocks, it may be desirable to avoid excessive logging (from bad blocks added, removed, or "remapped"). -This can be done by enabling "quiet mode" via the following message: +This can be done by enabling "quiet mode" via the following message:: -$ sudo dmsetup message dust1 0 quiet + $ sudo dmsetup message dust1 0 quiet This will suppress log messages from add / remove / removed by write operations. Log messages from "countbadblocks" or "queryblock" message commands will still print in quiet mode. -The status of quiet mode can be seen by running "dmsetup status": +The status of quiet mode can be seen by running "dmsetup status":: -$ sudo dmsetup status dust1 -0 33552384 dust 252:17 fail_read_on_bad_block quiet + $ sudo dmsetup status dust1 + 0 33552384 dust 252:17 fail_read_on_bad_block quiet -To disable quiet mode, send the "quiet" message again: +To disable quiet mode, send the "quiet" message again:: -$ sudo dmsetup message dust1 0 quiet + $ sudo dmsetup message dust1 0 quiet -$ sudo dmsetup status dust1 -0 33552384 dust 252:17 fail_read_on_bad_block verbose + $ sudo dmsetup status dust1 + 0 33552384 dust 252:17 fail_read_on_bad_block verbose (The presence of "verbose" indicates normal logging.) diff --git a/Documentation/admin-guide/device-mapper/dm-integrity.rst b/Documentation/admin-guide/device-mapper/dm-integrity.rst index a30aa91b5fbe..c00f9f11e3f3 100644 --- a/Documentation/admin-guide/device-mapper/dm-integrity.rst +++ b/Documentation/admin-guide/device-mapper/dm-integrity.rst @@ -144,7 +144,7 @@ journal_crypt:algorithm(:key) (the key is optional) Encrypt the journal using given algorithm to make sure that the attacker can't read the journal. You can use a block cipher here (such as "cbc(aes)") or a stream cipher (for example "chacha20", - "salsa20", "ctr(aes)" or "ecb(arc4)"). + "salsa20" or "ctr(aes)"). The journal contains history of last writes to the block device, an attacker reading the journal could see the last sector nubmers @@ -177,6 +177,11 @@ bitmap_flush_interval:number The bitmap flush interval in milliseconds. The metadata buffers are synchronized when this interval expires. +fix_padding + Use a smaller padding of the tag area that is more + space-efficient. If this option is not present, large padding is + used - that is for compatibility with older kernels. + The journal mode (D/J), buffer_sectors, journal_watermark, commit_time can be changed when reloading the target (load an inactive table and swap the diff --git a/Documentation/admin-guide/device-mapper/dm-raid.rst b/Documentation/admin-guide/device-mapper/dm-raid.rst index 2fe255b130fb..695a2ea1d1ae 100644 --- a/Documentation/admin-guide/device-mapper/dm-raid.rst +++ b/Documentation/admin-guide/device-mapper/dm-raid.rst @@ -417,3 +417,7 @@ Version History deadlock/potential data corruption. Update superblock when specific devices are requested via rebuild. Fix RAID leg rebuild errors. + 1.15.0 Fix size extensions not being synchronized in case of new MD bitmap + pages allocated; also fix those not occuring after previous reductions + 1.15.1 Fix argument count and arguments for rebuild/write_mostly/journal_(dev|mode) + on the status line. diff --git a/Documentation/admin-guide/device-mapper/index.rst b/Documentation/admin-guide/device-mapper/index.rst index c77c58b8f67b..ec62fcc8eece 100644 --- a/Documentation/admin-guide/device-mapper/index.rst +++ b/Documentation/admin-guide/device-mapper/index.rst @@ -8,7 +8,9 @@ Device Mapper cache-policies cache delay + dm-clone dm-crypt + dm-dust dm-flakey dm-init dm-integrity diff --git a/Documentation/admin-guide/device-mapper/verity.rst b/Documentation/admin-guide/device-mapper/verity.rst index a4d1c1476d72..bb02caa45289 100644 --- a/Documentation/admin-guide/device-mapper/verity.rst +++ b/Documentation/admin-guide/device-mapper/verity.rst @@ -125,6 +125,13 @@ check_at_most_once blocks, and a hash block will not be verified any more after all the data blocks it covers have been verified anyway. +root_hash_sig_key_desc <key_description> + This is the description of the USER_KEY that the kernel will lookup to get + the pkcs7 signature of the roothash. The pkcs7 signature is used to validate + the root hash during the creation of the device mapper block device. + Verification of roothash depends on the config DM_VERITY_VERIFY_ROOTHASH_SIG + being set in the kernel. + Theory of operation =================== diff --git a/Documentation/admin-guide/devices.txt b/Documentation/admin-guide/devices.txt index e56e00655153..2a97aaec8b12 100644 --- a/Documentation/admin-guide/devices.txt +++ b/Documentation/admin-guide/devices.txt @@ -319,7 +319,7 @@ 182 = /dev/perfctr Performance-monitoring counters 183 = /dev/hwrng Generic random number generator 184 = /dev/cpu/microcode CPU microcode update interface - 186 = /dev/atomicps Atomic shapshot of process state data + 186 = /dev/atomicps Atomic snapshot of process state data 187 = /dev/irnet IrNET device 188 = /dev/smbusbios SMBus BIOS 189 = /dev/ussp_ctl User space serial port control @@ -1647,8 +1647,17 @@ 0 = /dev/comedi0 First comedi device 1 = /dev/comedi1 Second comedi device ... + 47 = /dev/comedi47 48th comedi device - See http://stm.lbl.gov/comedi. + Minors 48 to 255 are reserved for comedi subdevices with + pathnames of the form "/dev/comediX_subdY", where "X" is the + minor number of the associated comedi device and "Y" is the + subdevice number. These subdevice minors are assigned + dynamically, so there is no fixed mapping from subdevice + pathnames to minor numbers. + + See http://www.comedi.org/ for information about the Comedi + project. 98 block User-mode virtual block device 0 = /dev/ubda First user-mode block device diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst index 059ddcbe769d..9443fcef1876 100644 --- a/Documentation/admin-guide/ext4.rst +++ b/Documentation/admin-guide/ext4.rst @@ -92,6 +92,8 @@ Currently Available * efficient new ordered mode in JBD2 and ext4 (avoid using buffer head to force the ordering) * Case-insensitive file name lookups +* file-based encryption support (fscrypt) +* file-based verity support (fsverity) [1] Filesystems with a block size of 1k may see a limit imposed by the directory hash tree having a maximum depth of two. @@ -181,14 +183,17 @@ When mounting an ext4 filesystem, the following option are accepted: system after its metadata has been committed to the journal. commit=nrsec (*) - Ext4 can be told to sync all its data and metadata every 'nrsec' - seconds. The default value is 5 seconds. This means that if you lose - your power, you will lose as much as the latest 5 seconds of work (your - filesystem will not be damaged though, thanks to the journaling). This - default value (or any low value) will hurt performance, but it's good - for data-safety. Setting it to 0 will have the same effect as leaving - it at the default (5 seconds). Setting it to very large values will - improve performance. + This setting limits the maximum age of the running transaction to + 'nrsec' seconds. The default value is 5 seconds. This means that if + you lose your power, you will lose as much as the latest 5 seconds of + metadata changes (your filesystem will not be damaged though, thanks + to the journaling). This default value (or any low value) will hurt + performance, but it's good for data-safety. Setting it to 0 will have + the same effect as leaving it at the default (5 seconds). Setting it + to very large values will improve performance. Note that due to + delayed allocation even older data can be lost on power failure since + writeback of those data begins only after time set in + /proc/sys/vm/dirty_expire_centisecs. barrier=<0|1(*)>, barrier(*), nobarrier This enables/disables the use of write barriers in the jbd code. diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst index 49311f3da6f2..0795e3c2643f 100644 --- a/Documentation/admin-guide/hw-vuln/index.rst +++ b/Documentation/admin-guide/hw-vuln/index.rst @@ -12,3 +12,5 @@ are configurable at compile, boot or run time. spectre l1tf mds + tsx_async_abort + multihit.rst diff --git a/Documentation/admin-guide/hw-vuln/mds.rst b/Documentation/admin-guide/hw-vuln/mds.rst index e3a796c0d3a2..2d19c9f4c1fe 100644 --- a/Documentation/admin-guide/hw-vuln/mds.rst +++ b/Documentation/admin-guide/hw-vuln/mds.rst @@ -265,8 +265,11 @@ time with the option "mds=". The valid arguments for this option are: ============ ============================================================= -Not specifying this option is equivalent to "mds=full". - +Not specifying this option is equivalent to "mds=full". For processors +that are affected by both TAA (TSX Asynchronous Abort) and MDS, +specifying just "mds=off" without an accompanying "tsx_async_abort=off" +will have no effect as the same mitigation is used for both +vulnerabilities. Mitigation selection guide -------------------------- diff --git a/Documentation/admin-guide/hw-vuln/multihit.rst b/Documentation/admin-guide/hw-vuln/multihit.rst new file mode 100644 index 000000000000..ba9988d8bce5 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/multihit.rst @@ -0,0 +1,163 @@ +iTLB multihit +============= + +iTLB multihit is an erratum where some processors may incur a machine check +error, possibly resulting in an unrecoverable CPU lockup, when an +instruction fetch hits multiple entries in the instruction TLB. This can +occur when the page size is changed along with either the physical address +or cache type. A malicious guest running on a virtualized system can +exploit this erratum to perform a denial of service attack. + + +Affected processors +------------------- + +Variations of this erratum are present on most Intel Core and Xeon processor +models. The erratum is not present on: + + - non-Intel processors + + - Some Atoms (Airmont, Bonnell, Goldmont, GoldmontPlus, Saltwell, Silvermont) + + - Intel processors that have the PSCHANGE_MC_NO bit set in the + IA32_ARCH_CAPABILITIES MSR. + + +Related CVEs +------------ + +The following CVE entry is related to this issue: + + ============== ================================================= + CVE-2018-12207 Machine Check Error Avoidance on Page Size Change + ============== ================================================= + + +Problem +------- + +Privileged software, including OS and virtual machine managers (VMM), are in +charge of memory management. A key component in memory management is the control +of the page tables. Modern processors use virtual memory, a technique that creates +the illusion of a very large memory for processors. This virtual space is split +into pages of a given size. Page tables translate virtual addresses to physical +addresses. + +To reduce latency when performing a virtual to physical address translation, +processors include a structure, called TLB, that caches recent translations. +There are separate TLBs for instruction (iTLB) and data (dTLB). + +Under this errata, instructions are fetched from a linear address translated +using a 4 KB translation cached in the iTLB. Privileged software modifies the +paging structure so that the same linear address using large page size (2 MB, 4 +MB, 1 GB) with a different physical address or memory type. After the page +structure modification but before the software invalidates any iTLB entries for +the linear address, a code fetch that happens on the same linear address may +cause a machine-check error which can result in a system hang or shutdown. + + +Attack scenarios +---------------- + +Attacks against the iTLB multihit erratum can be mounted from malicious +guests in a virtualized system. + + +iTLB multihit system information +-------------------------------- + +The Linux kernel provides a sysfs interface to enumerate the current iTLB +multihit status of the system:whether the system is vulnerable and which +mitigations are active. The relevant sysfs file is: + +/sys/devices/system/cpu/vulnerabilities/itlb_multihit + +The possible values in this file are: + +.. list-table:: + + * - Not affected + - The processor is not vulnerable. + * - KVM: Mitigation: Split huge pages + - Software changes mitigate this issue. + * - KVM: Vulnerable + - The processor is vulnerable, but no mitigation enabled + + +Enumeration of the erratum +-------------------------------- + +A new bit has been allocated in the IA32_ARCH_CAPABILITIES (PSCHANGE_MC_NO) msr +and will be set on CPU's which are mitigated against this issue. + + ======================================= =========== =============================== + IA32_ARCH_CAPABILITIES MSR Not present Possibly vulnerable,check model + IA32_ARCH_CAPABILITIES[PSCHANGE_MC_NO] '0' Likely vulnerable,check model + IA32_ARCH_CAPABILITIES[PSCHANGE_MC_NO] '1' Not vulnerable + ======================================= =========== =============================== + + +Mitigation mechanism +------------------------- + +This erratum can be mitigated by restricting the use of large page sizes to +non-executable pages. This forces all iTLB entries to be 4K, and removes +the possibility of multiple hits. + +In order to mitigate the vulnerability, KVM initially marks all huge pages +as non-executable. If the guest attempts to execute in one of those pages, +the page is broken down into 4K pages, which are then marked executable. + +If EPT is disabled or not available on the host, KVM is in control of TLB +flushes and the problematic situation cannot happen. However, the shadow +EPT paging mechanism used by nested virtualization is vulnerable, because +the nested guest can trigger multiple iTLB hits by modifying its own +(non-nested) page tables. For simplicity, KVM will make large pages +non-executable in all shadow paging modes. + +Mitigation control on the kernel command line and KVM - module parameter +------------------------------------------------------------------------ + +The KVM hypervisor mitigation mechanism for marking huge pages as +non-executable can be controlled with a module parameter "nx_huge_pages=". +The kernel command line allows to control the iTLB multihit mitigations at +boot time with the option "kvm.nx_huge_pages=". + +The valid arguments for these options are: + + ========== ================================================================ + force Mitigation is enabled. In this case, the mitigation implements + non-executable huge pages in Linux kernel KVM module. All huge + pages in the EPT are marked as non-executable. + If a guest attempts to execute in one of those pages, the page is + broken down into 4K pages, which are then marked executable. + + off Mitigation is disabled. + + auto Enable mitigation only if the platform is affected and the kernel + was not booted with the "mitigations=off" command line parameter. + This is the default option. + ========== ================================================================ + + +Mitigation selection guide +-------------------------- + +1. No virtualization in use +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + The system is protected by the kernel unconditionally and no further + action is required. + +2. Virtualization with trusted guests +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + If the guest comes from a trusted source, you may assume that the guest will + not attempt to maliciously exploit these errata and no further action is + required. + +3. Virtualization with untrusted guests +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + If the guest comes from an untrusted source, the guest host kernel will need + to apply iTLB multihit mitigation via the kernel command line or kvm + module parameter. diff --git a/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst b/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst new file mode 100644 index 000000000000..af6865b822d2 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst @@ -0,0 +1,279 @@ +.. SPDX-License-Identifier: GPL-2.0 + +TAA - TSX Asynchronous Abort +====================================== + +TAA is a hardware vulnerability that allows unprivileged speculative access to +data which is available in various CPU internal buffers by using asynchronous +aborts within an Intel TSX transactional region. + +Affected processors +------------------- + +This vulnerability only affects Intel processors that support Intel +Transactional Synchronization Extensions (TSX) when the TAA_NO bit (bit 8) +is 0 in the IA32_ARCH_CAPABILITIES MSR. On processors where the MDS_NO bit +(bit 5) is 0 in the IA32_ARCH_CAPABILITIES MSR, the existing MDS mitigations +also mitigate against TAA. + +Whether a processor is affected or not can be read out from the TAA +vulnerability file in sysfs. See :ref:`tsx_async_abort_sys_info`. + +Related CVEs +------------ + +The following CVE entry is related to this TAA issue: + + ============== ===== =================================================== + CVE-2019-11135 TAA TSX Asynchronous Abort (TAA) condition on some + microprocessors utilizing speculative execution may + allow an authenticated user to potentially enable + information disclosure via a side channel with + local access. + ============== ===== =================================================== + +Problem +------- + +When performing store, load or L1 refill operations, processors write +data into temporary microarchitectural structures (buffers). The data in +those buffers can be forwarded to load operations as an optimization. + +Intel TSX is an extension to the x86 instruction set architecture that adds +hardware transactional memory support to improve performance of multi-threaded +software. TSX lets the processor expose and exploit concurrency hidden in an +application due to dynamically avoiding unnecessary synchronization. + +TSX supports atomic memory transactions that are either committed (success) or +aborted. During an abort, operations that happened within the transactional region +are rolled back. An asynchronous abort takes place, among other options, when a +different thread accesses a cache line that is also used within the transactional +region when that access might lead to a data race. + +Immediately after an uncompleted asynchronous abort, certain speculatively +executed loads may read data from those internal buffers and pass it to dependent +operations. This can be then used to infer the value via a cache side channel +attack. + +Because the buffers are potentially shared between Hyper-Threads cross +Hyper-Thread attacks are possible. + +The victim of a malicious actor does not need to make use of TSX. Only the +attacker needs to begin a TSX transaction and raise an asynchronous abort +which in turn potenitally leaks data stored in the buffers. + +More detailed technical information is available in the TAA specific x86 +architecture section: :ref:`Documentation/x86/tsx_async_abort.rst <tsx_async_abort>`. + + +Attack scenarios +---------------- + +Attacks against the TAA vulnerability can be implemented from unprivileged +applications running on hosts or guests. + +As for MDS, the attacker has no control over the memory addresses that can +be leaked. Only the victim is responsible for bringing data to the CPU. As +a result, the malicious actor has to sample as much data as possible and +then postprocess it to try to infer any useful information from it. + +A potential attacker only has read access to the data. Also, there is no direct +privilege escalation by using this technique. + + +.. _tsx_async_abort_sys_info: + +TAA system information +----------------------- + +The Linux kernel provides a sysfs interface to enumerate the current TAA status +of mitigated systems. The relevant sysfs file is: + +/sys/devices/system/cpu/vulnerabilities/tsx_async_abort + +The possible values in this file are: + +.. list-table:: + + * - 'Vulnerable' + - The CPU is affected by this vulnerability and the microcode and kernel mitigation are not applied. + * - 'Vulnerable: Clear CPU buffers attempted, no microcode' + - The system tries to clear the buffers but the microcode might not support the operation. + * - 'Mitigation: Clear CPU buffers' + - The microcode has been updated to clear the buffers. TSX is still enabled. + * - 'Mitigation: TSX disabled' + - TSX is disabled. + * - 'Not affected' + - The CPU is not affected by this issue. + +.. _ucode_needed: + +Best effort mitigation mode +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +If the processor is vulnerable, but the availability of the microcode-based +mitigation mechanism is not advertised via CPUID the kernel selects a best +effort mitigation mode. This mode invokes the mitigation instructions +without a guarantee that they clear the CPU buffers. + +This is done to address virtualization scenarios where the host has the +microcode update applied, but the hypervisor is not yet updated to expose the +CPUID to the guest. If the host has updated microcode the protection takes +effect; otherwise a few CPU cycles are wasted pointlessly. + +The state in the tsx_async_abort sysfs file reflects this situation +accordingly. + + +Mitigation mechanism +-------------------- + +The kernel detects the affected CPUs and the presence of the microcode which is +required. If a CPU is affected and the microcode is available, then the kernel +enables the mitigation by default. + + +The mitigation can be controlled at boot time via a kernel command line option. +See :ref:`taa_mitigation_control_command_line`. + +.. _virt_mechanism: + +Virtualization mitigation +^^^^^^^^^^^^^^^^^^^^^^^^^ + +Affected systems where the host has TAA microcode and TAA is mitigated by +having disabled TSX previously, are not vulnerable regardless of the status +of the VMs. + +In all other cases, if the host either does not have the TAA microcode or +the kernel is not mitigated, the system might be vulnerable. + + +.. _taa_mitigation_control_command_line: + +Mitigation control on the kernel command line +--------------------------------------------- + +The kernel command line allows to control the TAA mitigations at boot time with +the option "tsx_async_abort=". The valid arguments for this option are: + + ============ ============================================================= + off This option disables the TAA mitigation on affected platforms. + If the system has TSX enabled (see next parameter) and the CPU + is affected, the system is vulnerable. + + full TAA mitigation is enabled. If TSX is enabled, on an affected + system it will clear CPU buffers on ring transitions. On + systems which are MDS-affected and deploy MDS mitigation, + TAA is also mitigated. Specifying this option on those + systems will have no effect. + + full,nosmt The same as tsx_async_abort=full, with SMT disabled on + vulnerable CPUs that have TSX enabled. This is the complete + mitigation. When TSX is disabled, SMT is not disabled because + CPU is not vulnerable to cross-thread TAA attacks. + ============ ============================================================= + +Not specifying this option is equivalent to "tsx_async_abort=full". For +processors that are affected by both TAA and MDS, specifying just +"tsx_async_abort=off" without an accompanying "mds=off" will have no +effect as the same mitigation is used for both vulnerabilities. + +The kernel command line also allows to control the TSX feature using the +parameter "tsx=" on CPUs which support TSX control. MSR_IA32_TSX_CTRL is used +to control the TSX feature and the enumeration of the TSX feature bits (RTM +and HLE) in CPUID. + +The valid options are: + + ============ ============================================================= + off Disables TSX on the system. + + Note that this option takes effect only on newer CPUs which are + not vulnerable to MDS, i.e., have MSR_IA32_ARCH_CAPABILITIES.MDS_NO=1 + and which get the new IA32_TSX_CTRL MSR through a microcode + update. This new MSR allows for the reliable deactivation of + the TSX functionality. + + on Enables TSX. + + Although there are mitigations for all known security + vulnerabilities, TSX has been known to be an accelerator for + several previous speculation-related CVEs, and so there may be + unknown security risks associated with leaving it enabled. + + auto Disables TSX if X86_BUG_TAA is present, otherwise enables TSX + on the system. + ============ ============================================================= + +Not specifying this option is equivalent to "tsx=off". + +The following combinations of the "tsx_async_abort" and "tsx" are possible. For +affected platforms tsx=auto is equivalent to tsx=off and the result will be: + + ========= ========================== ========================================= + tsx=on tsx_async_abort=full The system will use VERW to clear CPU + buffers. Cross-thread attacks are still + possible on SMT machines. + tsx=on tsx_async_abort=full,nosmt As above, cross-thread attacks on SMT + mitigated. + tsx=on tsx_async_abort=off The system is vulnerable. + tsx=off tsx_async_abort=full TSX might be disabled if microcode + provides a TSX control MSR. If so, + system is not vulnerable. + tsx=off tsx_async_abort=full,nosmt Ditto + tsx=off tsx_async_abort=off ditto + ========= ========================== ========================================= + + +For unaffected platforms "tsx=on" and "tsx_async_abort=full" does not clear CPU +buffers. For platforms without TSX control (MSR_IA32_ARCH_CAPABILITIES.MDS_NO=0) +"tsx" command line argument has no effect. + +For the affected platforms below table indicates the mitigation status for the +combinations of CPUID bit MD_CLEAR and IA32_ARCH_CAPABILITIES MSR bits MDS_NO +and TSX_CTRL_MSR. + + ======= ========= ============= ======================================== + MDS_NO MD_CLEAR TSX_CTRL_MSR Status + ======= ========= ============= ======================================== + 0 0 0 Vulnerable (needs microcode) + 0 1 0 MDS and TAA mitigated via VERW + 1 1 0 MDS fixed, TAA vulnerable if TSX enabled + because MD_CLEAR has no meaning and + VERW is not guaranteed to clear buffers + 1 X 1 MDS fixed, TAA can be mitigated by + VERW or TSX_CTRL_MSR + ======= ========= ============= ======================================== + +Mitigation selection guide +-------------------------- + +1. Trusted userspace and guests +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +If all user space applications are from a trusted source and do not execute +untrusted code which is supplied externally, then the mitigation can be +disabled. The same applies to virtualized environments with trusted guests. + + +2. Untrusted userspace and guests +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +If there are untrusted applications or guests on the system, enabling TSX +might allow a malicious actor to leak data from the host or from other +processes running on the same physical core. + +If the microcode is available and the TSX is disabled on the host, attacks +are prevented in a virtualized environment as well, even if the VMs do not +explicitly enable the mitigation. + + +.. _taa_default_mitigations: + +Default mitigations +------------------- + +The kernel's default action for vulnerable processors is: + + - Deploy TSX disable mitigation (tsx_async_abort=full tsx=off). diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index 33feab2f4084..f1d0ccffbe72 100644 --- a/Documentation/admin-guide/index.rst +++ b/Documentation/admin-guide/index.rst @@ -57,55 +57,63 @@ configure specific aspects of kernel behavior to your liking. .. toctree:: :maxdepth: 1 - initrd - cgroup-v2 - cgroup-v1/index - serial-console - braille-console - parport - md - module-signing - rapidio - sysrq - unicode - vga-softcursor - binfmt-misc - mono - java - ras - bcache - blockdev/index - ext4 - binderfs - xfs - pm/index - thunderbolt - LSM/index - mm/index - namespaces/index - perf-security acpi/index aoe/index + auxdisplay/index + bcache + binderfs + binfmt-misc + blockdev/index + bootconfig + braille-console btmrvl + cgroup-v1/index + cgroup-v2 + cifs/index clearing-warn-once cpu-load cputopology + dell_rbu device-mapper/index efi-stub + ext4 + nfs/index gpio/index highuid hw_random + initrd iostats + java + jfs kernel-per-CPU-kthreads laptops/index lcd-panel-cgram ldm lockup-watchdogs + LSM/index + md + mm/index + module-signing + mono + namespaces/index numastat + parport + perf-security + pm/index pnp + rapidio + ras rtc + serial-console svga + sysrq + thunderbolt + ufs + unicode + vga-softcursor video-output + wimax/index + xfs .. only:: subproject and html diff --git a/Documentation/admin-guide/iostats.rst b/Documentation/admin-guide/iostats.rst index 5d63b18bd6d1..df5b8345c41d 100644 --- a/Documentation/admin-guide/iostats.rst +++ b/Documentation/admin-guide/iostats.rst @@ -46,81 +46,91 @@ each snapshot of your disk statistics. In 2.4, the statistics fields are those after the device name. In the above example, the first field of statistics would be 446216. By contrast, in 2.6+ if you look at ``/sys/block/hda/stat``, you'll -find just the eleven fields, beginning with 446216. If you look at -``/proc/diskstats``, the eleven fields will be preceded by the major and +find just the 15 fields, beginning with 446216. If you look at +``/proc/diskstats``, the 15 fields will be preceded by the major and minor device numbers, and device name. Each of these formats provides -eleven fields of statistics, each meaning exactly the same things. +15 fields of statistics, each meaning exactly the same things. All fields except field 9 are cumulative since boot. Field 9 should go to zero as I/Os complete; all others only increase (unless they -overflow and wrap). Yes, these are (32-bit or 64-bit) unsigned long -(native word size) numbers, and on a very busy or long-lived system they -may wrap. Applications should be prepared to deal with that; unless -your observations are measured in large numbers of minutes or hours, -they should not wrap twice before you notice them. +overflow and wrap). Wrapping might eventually occur on a very busy +or long-lived system; so applications should be prepared to deal with +it. Regarding wrapping, the types of the fields are either unsigned +int (32 bit) or unsigned long (32-bit or 64-bit, depending on your +machine) as noted per-field below. Unless your observations are very +spread in time, these fields should not wrap twice before you notice it. Each set of stats only applies to the indicated device; if you want system-wide stats you'll have to find all the devices and sum them all up. -Field 1 -- # of reads completed +Field 1 -- # of reads completed (unsigned long) This is the total number of reads completed successfully. -Field 2 -- # of reads merged, field 6 -- # of writes merged +Field 2 -- # of reads merged, field 6 -- # of writes merged (unsigned long) Reads and writes which are adjacent to each other may be merged for efficiency. Thus two 4K reads may become one 8K read before it is ultimately handed to the disk, and so it will be counted (and queued) as only one I/O. This field lets you know how often this was done. -Field 3 -- # of sectors read +Field 3 -- # of sectors read (unsigned long) This is the total number of sectors read successfully. -Field 4 -- # of milliseconds spent reading +Field 4 -- # of milliseconds spent reading (unsigned int) This is the total number of milliseconds spent by all reads (as measured from __make_request() to end_that_request_last()). -Field 5 -- # of writes completed +Field 5 -- # of writes completed (unsigned long) This is the total number of writes completed successfully. -Field 6 -- # of writes merged +Field 6 -- # of writes merged (unsigned long) See the description of field 2. -Field 7 -- # of sectors written +Field 7 -- # of sectors written (unsigned long) This is the total number of sectors written successfully. -Field 8 -- # of milliseconds spent writing +Field 8 -- # of milliseconds spent writing (unsigned int) This is the total number of milliseconds spent by all writes (as measured from __make_request() to end_that_request_last()). -Field 9 -- # of I/Os currently in progress +Field 9 -- # of I/Os currently in progress (unsigned int) The only field that should go to zero. Incremented as requests are given to appropriate struct request_queue and decremented as they finish. -Field 10 -- # of milliseconds spent doing I/Os +Field 10 -- # of milliseconds spent doing I/Os (unsigned int) This field increases so long as field 9 is nonzero. Since 5.0 this field counts jiffies when at least one request was started or completed. If request runs more than 2 jiffies then some I/O time will not be accounted unless there are other requests. -Field 11 -- weighted # of milliseconds spent doing I/Os +Field 11 -- weighted # of milliseconds spent doing I/Os (unsigned int) This field is incremented at each I/O start, I/O completion, I/O merge, or read of these stats by the number of I/Os in progress (field 9) times the number of milliseconds spent doing I/O since the last update of this field. This can provide an easy measure of both I/O completion time and the backlog that may be accumulating. -Field 12 -- # of discards completed +Field 12 -- # of discards completed (unsigned long) This is the total number of discards completed successfully. -Field 13 -- # of discards merged +Field 13 -- # of discards merged (unsigned long) See the description of field 2 -Field 14 -- # of sectors discarded +Field 14 -- # of sectors discarded (unsigned long) This is the total number of sectors discarded successfully. -Field 15 -- # of milliseconds spent discarding +Field 15 -- # of milliseconds spent discarding (unsigned int) This is the total number of milliseconds spent by all discards (as measured from __make_request() to end_that_request_last()). +Field 16 -- # of flush requests completed + This is the total number of flush requests completed successfully. + + Block layer combines flush requests and executes at most one at a time. + This counts flush requests executed by disk. Not tracked for partitions. + +Field 17 -- # of milliseconds spent flushing + This is the total number of milliseconds spent by all flush requests. + To avoid introducing performance bottlenecks, no locks are held while modifying these counters. This implies that minor inaccuracies may be introduced when changes collide, so (for instance) adding up all the diff --git a/Documentation/admin-guide/jfs.rst b/Documentation/admin-guide/jfs.rst new file mode 100644 index 000000000000..9e12d936bc90 --- /dev/null +++ b/Documentation/admin-guide/jfs.rst @@ -0,0 +1,66 @@ +=========================================== +IBM's Journaled File System (JFS) for Linux +=========================================== + +JFS Homepage: http://jfs.sourceforge.net/ + +The following mount options are supported: + +(*) == default + +iocharset=name + Character set to use for converting from Unicode to + ASCII. The default is to do no conversion. Use + iocharset=utf8 for UTF-8 translations. This requires + CONFIG_NLS_UTF8 to be set in the kernel .config file. + iocharset=none specifies the default behavior explicitly. + +resize=value + Resize the volume to <value> blocks. JFS only supports + growing a volume, not shrinking it. This option is only + valid during a remount, when the volume is mounted + read-write. The resize keyword with no value will grow + the volume to the full size of the partition. + +nointegrity + Do not write to the journal. The primary use of this option + is to allow for higher performance when restoring a volume + from backup media. The integrity of the volume is not + guaranteed if the system abnormally abends. + +integrity(*) + Commit metadata changes to the journal. Use this option to + remount a volume where the nointegrity option was + previously specified in order to restore normal behavior. + +errors=continue + Keep going on a filesystem error. +errors=remount-ro(*) + Remount the filesystem read-only on an error. +errors=panic + Panic and halt the machine if an error occurs. + +uid=value + Override on-disk uid with specified value +gid=value + Override on-disk gid with specified value +umask=value + Override on-disk umask with specified octal value. For + directories, the execute bit will be set if the corresponding + read bit is set. + +discard=minlen, discard/nodiscard(*) + This enables/disables the use of discard/TRIM commands. + The discard/TRIM commands are sent to the underlying + block device when blocks are freed. This is useful for SSD + devices and sparse/thinly-provisioned LUNs. The FITRIM ioctl + command is also available together with the nodiscard option. + The value of minlen specifies the minimum blockcount, when + a TRIM command to the block device is considered useful. + When no value is given to the discard option, it defaults to + 64 blocks, which means 256KiB in JFS. + The minlen value of discard overrides the minlen value given + on an FITRIM ioctl(). + +The JFS mailing list can be subscribed to by using the link labeled +"Mail list Subscribe" at our web page http://jfs.sourceforge.net/ diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst index d05d531b4ec9..6d421694d98e 100644 --- a/Documentation/admin-guide/kernel-parameters.rst +++ b/Documentation/admin-guide/kernel-parameters.rst @@ -127,6 +127,7 @@ parameter is applicable:: NET Appropriate network support is enabled. NUMA NUMA support is enabled. NFS Appropriate NFS support is enabled. + OF Devicetree is enabled. OSS OSS sound support is enabled. PV_OPS A paravirtualized kernel is enabled. PARIDE The ParIDE (parallel port IDE) subsystem is enabled. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 47d981a86e2f..dbc22d684627 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -113,7 +113,7 @@ the GPE dispatcher. This facility can be used to prevent such uncontrolled GPE floodings. - Format: <int> + Format: <byte> acpi_no_auto_serialize [HW,ACPI] Disable auto-serialization of AML methods @@ -437,7 +437,11 @@ no delay (0). Format: integer - bootmem_debug [KNL] Enable bootmem allocator debug messages. + bootconfig [KNL] + Extended command line options can be added to an initrd + and this will cause the kernel to look for it. + + See Documentation/admin-guide/bootconfig.rst bert_disable [ACPI] Disable BERT OS support on buggy BIOSes. @@ -513,7 +517,7 @@ 1 -- check protection requested by application. Default value is set via a kernel config option. Value can be changed at runtime via - /selinux/checkreqprot. + /sys/fs/selinux/checkreqprot. cio_ignore= [S390] See Documentation/s390/common_io.rst for details. @@ -809,6 +813,8 @@ enables the feature at boot time. By default, it is disabled and the system will work mostly the same as a kernel built without CONFIG_DEBUG_PAGEALLOC. + Note: to get most of debug_pagealloc error reports, it's + useful to also enable the page_owner functionality. on: enable the feature debugpat [X86] Enable PAT debugging @@ -834,6 +840,18 @@ dump out devices still on the deferred probe list after retrying. + dfltcc= [HW,S390] + Format: { on | off | def_only | inf_only | always } + on: s390 zlib hardware support for compression on + level 1 and decompression (default) + off: No s390 zlib hardware support + def_only: s390 zlib hardware support for deflate + only (compression on level 1) + inf_only: s390 zlib hardware support for inflate + only (decompression) + always: Same as 'on' but ignores the selected compression + level always using hardware support (used for debugging) + dhash_entries= [KNL] Set number of hash buckets for dentry cache. @@ -860,6 +878,10 @@ disable_radix [PPC] Disable RADIX MMU mode on POWER9 + disable_tlbie [PPC] + Disable TLBIE instruction. Currently does not work + with KVM, with HASH MMU, or with coherent accelerators. + disable_cpu_apicid= [X86,APIC,SMP] Format: <int> The number of initial APIC ID for the @@ -977,12 +999,10 @@ earlycon= [KNL] Output early console device and options. - [ARM64] The early console is determined by the - stdout-path property in device tree's chosen node, - or determined by the ACPI SPCR table. - - [X86] When used with no options the early console is - determined by the ACPI SPCR table. + When used with no options, the early console is + determined by stdout-path property in device tree's + chosen node or the ACPI SPCR table if supported by + the platform. cdns,<addr>[,options] Start an early, polled-mode console on a Cadence @@ -1044,6 +1064,10 @@ specified address. The serial port must already be setup and configured. Options are not yet supported. + sbi + Use RISC-V SBI (Supervisor Binary Interface) for early + console. + smh Use ARM semihosting calls for early console. s3c2410,<addr> @@ -1090,6 +1114,12 @@ the framebuffer, pass the 'ram' option so that it is mapped with the correct attributes. + linflex,<addr> + Use early console provided by Freescale LINFlexD UART + serial driver for NXP S32V234 SoCs. A valid base + address must be provided, and the serial port must + already be setup and configured. + earlyprintk= [X86,SH,ARM,M68k,S390] earlyprintk=vga earlyprintk=sclp @@ -1152,15 +1182,26 @@ Format: {"off" | "on" | "skip[mbr]"} efi= [EFI] - Format: { "old_map", "nochunk", "noruntime", "debug" } + Format: { "old_map", "nochunk", "noruntime", "debug", + "nosoftreserve", "disable_early_pci_dma", + "no_disable_early_pci_dma" } old_map [X86-64]: switch to the old ioremap-based EFI - runtime services mapping. 32-bit still uses this one by - default. + runtime services mapping. [Needs CONFIG_X86_UV=y] nochunk: disable reading files in "chunks" in the EFI boot stub, as chunking can cause problems with some firmware implementations. noruntime : disable EFI runtime services support debug: enable misc debug output + nosoftreserve: The EFI_MEMORY_SP (Specific Purpose) + attribute may cause the kernel to reserve the + memory range for a memory mapping driver to + claim. Specify efi=nosoftreserve to disable this + reservation and treat the memory by its base type + (i.e. EFI_CONVENTIONAL_MEMORY / "System RAM"). + disable_early_pci_dma: Disable the busmaster bit on all + PCI bridges while in the EFI boot stub + no_disable_early_pci_dma: Leave the busmaster bit set + on all PCI bridges while in the EFI boot stub efi_no_storage_paranoia [EFI; X86] Using this parameter you can use more than 50% of @@ -1173,15 +1214,21 @@ updating original EFI memory map. Region of memory which aa attribute is added to is from ss to ss+nn. + If efi_fake_mem=2G@4G:0x10000,2G@0x10a0000000:0x10000 is specified, EFI_MEMORY_MORE_RELIABLE(0x10000) attribute is added to range 0x100000000-0x180000000 and 0x10a0000000-0x1120000000. + If efi_fake_mem=8G@9G:0x40000 is specified, the + EFI_MEMORY_SP(0x40000) attribute is added to + range 0x240000000-0x43fffffff. + Using this parameter you can do debugging of EFI memmap - related feature. For example, you can do debugging of + related features. For example, you can do debugging of Address Range Mirroring feature even if your box - doesn't support it. + doesn't support it, or mark specific memory as + "soft reserved". efivar_ssdt= [EFI; X86] Name of an EFI variable that contains an SSDT that is to be dynamically loaded by Linux. If there are @@ -1197,12 +1244,6 @@ See comment before function elanfreq_setup() in arch/x86/kernel/cpu/cpufreq/elanfreq.c. - elevator= [IOSCHED] - Format: { "mq-deadline" | "kyber" | "bfq" } - See Documentation/block/deadline-iosched.rst, - Documentation/block/kyber-iosched.rst and - Documentation/block/bfq-iosched.rst for details. - elfcorehdr=[size[KMG]@]offset[KMG] [IA64,PPC,SH,X86,S390] Specifies physical address of start of kernel core image elf header and optionally the size. Generally @@ -1226,7 +1267,8 @@ 0 -- permissive (log only, no denials). 1 -- enforcing (deny and log). Default value is 0. - Value can be changed at runtime via /selinux/enforce. + Value can be changed at runtime via + /sys/fs/selinux/enforce. erst_disable [ACPI] Disable Error Record Serialization Table (ERST) @@ -1732,6 +1774,11 @@ Note that using this option lowers the security provided by tboot because it makes the system vulnerable to DMA attacks. + nobounce [Default off] + Disable bounce buffer for unstrusted devices such as + the Thunderbolt devices. This will treat the untrusted + devices as the trusted ones, hence might expose security + risks of DMA attacks. intel_idle.max_cstate= [KNL,HW,ACPI,X86] 0 disables intel_idle and fall back on acpi_idle. @@ -1811,7 +1858,7 @@ synchronously. iommu.passthrough= - [ARM64] Configure DMA to bypass the IOMMU by default. + [ARM64, X86] Configure DMA to bypass the IOMMU by default. Format: { "0" | "1" } 0 - Use IOMMU translation for DMA. 1 - Bypass the IOMMU for DMA. @@ -1909,9 +1956,31 @@ <cpu number> begins at 0 and the maximum value is "number of CPUs in system - 1". - The format of <cpu-list> is described above. - + managed_irq + + Isolate from being targeted by managed interrupts + which have an interrupt mask containing isolated + CPUs. The affinity of managed interrupts is + handled by the kernel and cannot be changed via + the /proc/irq/* interfaces. + + This isolation is best effort and only effective + if the automatically assigned interrupt mask of a + device queue contains isolated and housekeeping + CPUs. If housekeeping CPUs are online then such + interrupts are directed to the housekeeping CPU + so that IO submitted on the housekeeping CPU + cannot disturb the isolated CPU. + + If a queue's affinity mask contains only isolated + CPUs then this parameter has no effect on the + interrupt routing decision, though interrupts are + only delivered when tasks running on those + isolated CPUs submit IO. IO submitted on + housekeeping CPUs has no influence on those + queues. + The format of <cpu-list> is described above. iucv= [HW,NET] @@ -2040,6 +2109,25 @@ KVM MMU at runtime. Default is 0 (off) + kvm.nx_huge_pages= + [KVM] Controls the software workaround for the + X86_BUG_ITLB_MULTIHIT bug. + force : Always deploy workaround. + off : Never deploy workaround. + auto : Deploy workaround based on the presence of + X86_BUG_ITLB_MULTIHIT. + + Default is 'auto'. + + If the software workaround is enabled for the host, + guests do need not to enable it for nested guests. + + kvm.nx_huge_pages_recovery_ratio= + [KVM] Controls how many 4KiB pages are periodically zapped + back to huge pages. 0 disables the recovery, otherwise if + the value is N KVM will zap 1/Nth of the 4KiB pages every + minute. The default is 60. + kvm-amd.nested= [KVM,AMD] Allow nested virtualization in KVM/SVM. Default is 1 (enabled) @@ -2261,6 +2349,15 @@ lockd.nlm_udpport=M [NFS] Assign UDP port. Format: <integer> + lockdown= [SECURITY] + { integrity | confidentiality } + Enable the kernel lockdown feature. If set to + integrity, kernel features that allow userland to + modify the running kernel are disabled. If set to + confidentiality, kernel features that allow userland + to extract confidential information from the kernel + are also disabled. + locktorture.nreaders_stress= [KNL] Set the number of locking read-acquisition kthreads. Defaults to being automatically set based on the @@ -2373,7 +2470,7 @@ machvec= [IA-64] Force the use of a particular machine-vector (machvec) in a generic kernel. - Example: machvec=hpzx1_swiotlb + Example: machvec=hpzx1 machtype= [Loongson] Share the same kernel image file between different yeeloong laptop. @@ -2430,6 +2527,12 @@ SMT on vulnerable CPUs off - Unconditionally disable MDS mitigation + On TAA-affected machines, mds=off can be prevented by + an active TAA mitigation as both vulnerabilities are + mitigated with the same mechanism so in order to disable + this mitigation, you need to specify tsx_async_abort=off + too. + Not specifying this option is equivalent to mds=full. @@ -2612,6 +2715,13 @@ ssbd=force-off [ARM64] l1tf=off [X86] mds=off [X86] + tsx_async_abort=off [X86] + kvm.nx_huge_pages=off [X86] + + Exceptions: + This does not have any effect on + kvm.nx_huge_pages when + kvm.nx_huge_pages=force. auto (default) Mitigate all CPU vulnerabilities, but leave SMT @@ -2627,6 +2737,7 @@ be fully mitigated, even if it means losing SMT. Equivalent to: l1tf=flush,nosmt [X86] mds=full,nosmt [X86] + tsx_async_abort=full,nosmt [X86] mminit_loglevel= [KNL] When CONFIG_DEBUG_MEMORY_INIT is set, this @@ -3059,9 +3170,9 @@ [X86,PV_OPS] Disable paravirtualized VMware scheduler clock and use the default one. - no-steal-acc [X86,KVM] Disable paravirtualized steal time accounting. - steal time is computed, but won't influence scheduler - behaviour + no-steal-acc [X86,KVM,ARM64] Disable paravirtualized steal time + accounting. steal time is computed, but won't + influence scheduler behaviour nolapic [X86-32,APIC] Do not enable or use the local APIC. @@ -3170,6 +3281,12 @@ This can be set from sysctl after boot. See Documentation/admin-guide/sysctl/vm.rst for details. + of_devlink [OF, KNL] Create device links between consumer and + supplier devices by scanning the devictree to infer the + consumer/supplier relationships. A consumer device + will not be probed until all the supplier devices have + probed successfully. + ohci1394_dma=early [HW] enable debugging via the ohci1394 driver. See Documentation/debugging-via-ohci1394.txt for more info. @@ -3452,12 +3569,13 @@ specify the device is described above. If <order of align> is not specified, PAGE_SIZE is used as alignment. - PCI-PCI bridge can be specified, if resource + A PCI-PCI bridge can be specified if resource windows need to be expanded. To specify the alignment for several instances of a device, the PCI vendor, device, subvendor, and subdevice may be - specified, e.g., 4096@pci:8086:9c22:103c:198f + specified, e.g., 12@pci:8086:9c22:103c:198f + for 4096-byte alignment. ecrc= Enable/disable PCIe ECRC (transaction layer end-to-end CRC checking). bios: Use BIOS/firmware settings. This is the @@ -3467,8 +3585,15 @@ hpiosize=nn[KMG] The fixed amount of bus space which is reserved for hotplug bridge's IO window. Default size is 256 bytes. + hpmmiosize=nn[KMG] The fixed amount of bus space which is + reserved for hotplug bridge's MMIO window. + Default size is 2 megabytes. + hpmmioprefsize=nn[KMG] The fixed amount of bus space which is + reserved for hotplug bridge's MMIO_PREF window. + Default size is 2 megabytes. hpmemsize=nn[KMG] The fixed amount of bus space which is - reserved for hotplug bridge's memory window. + reserved for hotplug bridge's MMIO and + MMIO_PREF window. Default size is 2 megabytes. hpbussize=nn The minimum amount of additional bus numbers reserved for buses below a hotplug bridge. @@ -3515,6 +3640,8 @@ even if the platform doesn't give the OS permission to use them. This may cause conflicts if the platform also tries to use these services. + dpc-native Use native PCIe service for DPC only. May + cause conflicts if firmware uses AER or DPC. compat Disable native PCIe services (PME, AER, DPC, PCIe hotplug). @@ -3837,12 +3964,13 @@ RCU_BOOST is not set, valid values are 0-99 and the default is zero (non-realtime operation). - rcutree.rcu_nocb_leader_stride= [KNL] - Set the number of NOCB kthread groups, which - defaults to the square root of the number of - CPUs. Larger numbers reduces the wakeup overhead - on the per-CPU grace-period kthreads, but increases - that same overhead on each group's leader. + rcutree.rcu_nocb_gp_stride= [KNL] + Set the number of NOCB callback kthreads in + each group, which defaults to the square root + of the number of CPUs. Larger numbers reduce + the wakeup overhead on the global grace-period + kthread, but increases that same overhead on + each group's NOCB grace-period kthread. rcutree.qhimark= [KNL] Set threshold of queued RCU callbacks beyond which @@ -3895,6 +4023,19 @@ test until boot completes in order to avoid interference. + rcuperf.kfree_rcu_test= [KNL] + Set to measure performance of kfree_rcu() flooding. + + rcuperf.kfree_nthreads= [KNL] + The number of threads running loops of kfree_rcu(). + + rcuperf.kfree_alloc_num= [KNL] + Number of allocations and frees done in an iteration. + + rcuperf.kfree_loops= [KNL] + Number of loops doing rcuperf.kfree_alloc_num number + of allocations and frees. + rcuperf.nreaders= [KNL] Set number of RCU readers. The value -1 selects N, where N is the number of CPUs. A value @@ -4047,6 +4188,10 @@ rcutorture.verbose= [KNL] Enable additional printk() statements. + rcupdate.rcu_cpu_stall_ftrace_dump= [KNL] + Dump ftrace buffer after reporting RCU CPU + stall warning. + rcupdate.rcu_cpu_stall_suppress= [KNL] Suppress RCU CPU stall warning messages. @@ -4090,6 +4235,13 @@ Run specified binary instead of /init from the ramdisk, used for early userspace startup. See initrd. + rdrand= [X86] + force - Override the decision by the kernel to hide the + advertisement of RDRAND support (this affects + certain AMD processors because of buggy BIOS + support, specifically around the suspend/resume + path). + rdt= [HW,X86,RDT] Turn on/off individual RDT features. List is: cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp, @@ -4254,9 +4406,7 @@ See security/selinux/Kconfig help text. 0 -- disable. 1 -- enable. - Default value is set via kernel config option. - If enabled at boot time, /selinux/disable can be used - later to disable prior to initial policy load. + Default value is 1. apparmor= [APPARMOR] Disable or enable AppArmor at boot time Format: { "0" | "1" } @@ -4620,6 +4770,11 @@ /sys/power/pm_test). Only available when CONFIG_PM_DEBUG is set. Default value is 5. + svm= [PPC] + Format: { on | off | y | n | 1 | 0 } + This parameter controls use of the Protected + Execution Facility on pSeries. + swapaccount=[0|1] [KNL] Enable accounting of swap in memory resource controller if no parameter or 1 is given or disable @@ -4806,6 +4961,76 @@ interruptions from clocksource watchdog are not acceptable). + tsx= [X86] Control Transactional Synchronization + Extensions (TSX) feature in Intel processors that + support TSX control. + + This parameter controls the TSX feature. The options are: + + on - Enable TSX on the system. Although there are + mitigations for all known security vulnerabilities, + TSX has been known to be an accelerator for + several previous speculation-related CVEs, and + so there may be unknown security risks associated + with leaving it enabled. + + off - Disable TSX on the system. (Note that this + option takes effect only on newer CPUs which are + not vulnerable to MDS, i.e., have + MSR_IA32_ARCH_CAPABILITIES.MDS_NO=1 and which get + the new IA32_TSX_CTRL MSR through a microcode + update. This new MSR allows for the reliable + deactivation of the TSX functionality.) + + auto - Disable TSX if X86_BUG_TAA is present, + otherwise enable TSX on the system. + + Not specifying this option is equivalent to tsx=off. + + See Documentation/admin-guide/hw-vuln/tsx_async_abort.rst + for more details. + + tsx_async_abort= [X86,INTEL] Control mitigation for the TSX Async + Abort (TAA) vulnerability. + + Similar to Micro-architectural Data Sampling (MDS) + certain CPUs that support Transactional + Synchronization Extensions (TSX) are vulnerable to an + exploit against CPU internal buffers which can forward + information to a disclosure gadget under certain + conditions. + + In vulnerable processors, the speculatively forwarded + data can be used in a cache side channel attack, to + access data to which the attacker does not have direct + access. + + This parameter controls the TAA mitigation. The + options are: + + full - Enable TAA mitigation on vulnerable CPUs + if TSX is enabled. + + full,nosmt - Enable TAA mitigation and disable SMT on + vulnerable CPUs. If TSX is disabled, SMT + is not disabled because CPU is not + vulnerable to cross-thread TAA attacks. + off - Unconditionally disable TAA mitigation + + On MDS-affected machines, tsx_async_abort=off can be + prevented by an active MDS mitigation as both vulnerabilities + are mitigated with the same mechanism so in order to disable + this mitigation, you need to specify mds=off too. + + Not specifying this option is equivalent to + tsx_async_abort=full. On CPUs which are MDS affected + and deploy MDS mitigation, TAA mitigation is not + required and doesn't provide any additional + mitigation. + + For details see: + Documentation/admin-guide/hw-vuln/tsx_async_abort.rst + turbografx.map[2|3]= [HW,JOY] TurboGraFX parallel port interface Format: @@ -4956,13 +5181,13 @@ Flags is a set of characters, each corresponding to a common usb-storage quirk flag as follows: a = SANE_SENSE (collect more than 18 bytes - of sense data); + of sense data, not on uas); b = BAD_SENSE (don't collect more than 18 - bytes of sense data); + bytes of sense data, not on uas); c = FIX_CAPACITY (decrease the reported device capacity by one sector); d = NO_READ_DISC_INFO (don't use - READ_DISC_INFO command); + READ_DISC_INFO command, not on uas); e = NO_READ_CAPACITY_16 (don't use READ_CAPACITY_16 command); f = NO_REPORT_OPCODES (don't use report opcodes @@ -4977,17 +5202,18 @@ j = NO_REPORT_LUNS (don't use report luns command, uas only); l = NOT_LOCKABLE (don't try to lock and - unlock ejectable media); + unlock ejectable media, not on uas); m = MAX_SECTORS_64 (don't transfer more - than 64 sectors = 32 KB at a time); + than 64 sectors = 32 KB at a time, + not on uas); n = INITIAL_READ10 (force a retry of the - initial READ(10) command); + initial READ(10) command, not on uas); o = CAPACITY_OK (accept the capacity - reported by the device); + reported by the device, not on uas); p = WRITE_CACHE (the device cache is ON - by default); + by default, not on uas); r = IGNORE_RESIDUE (the device reports - bogus residue values); + bogus residue values, not on uas); s = SINGLE_LUN (the device has only one Logical Unit); t = NO_ATA_1X (don't allow ATA(12) and ATA(16) @@ -4996,7 +5222,8 @@ w = NO_WP_DETECT (don't test whether the medium is write-protected). y = ALWAYS_SYNC (issue a SYNCHRONIZE_CACHE - even if the device claims no cache) + even if the device claims no cache, + not on uas) Example: quirks=0419:aaf5:rl,0421:0433:rc user_debug= [KNL,ARM] @@ -5260,6 +5487,10 @@ the unplug protocol never -- do not unplug even if version check succeeds + xen_legacy_crash [X86,XEN] + Crash from Xen panic notifier, without executing late + panic() code such as dumping handler. + xen_nopvspin [X86,XEN] Disables the ticketlock slowpath using Xen PV optimizations. @@ -5305,3 +5536,22 @@ A hex value specifying bitmask with supplemental xhci host controller quirks. Meaning of each bit can be consulted in header drivers/usb/host/xhci.h. + + xmon [PPC] + Format: { early | on | rw | ro | off } + Controls if xmon debugger is enabled. Default is off. + Passing only "xmon" is equivalent to "xmon=early". + early Call xmon as early as possible on boot; xmon + debugger is called from setup_arch(). + on xmon debugger hooks will be installed so xmon + is only called on a kernel crash. Default mode, + i.e. either "ro" or "rw" mode, is controlled + with CONFIG_XMON_DEFAULT_RO_MODE. + rw xmon debugger hooks will be installed so xmon + is called only on a kernel crash, mode is write, + meaning SPR registers, memory and, other data + can be written using xmon commands. + ro same as "rw" option above but SPR registers, + memory, and other data can't be written using + xmon commands. + off xmon is disabled. diff --git a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst index 4f18456dd3b1..baeeba8762ae 100644 --- a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst +++ b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst @@ -274,9 +274,7 @@ To reduce its OS jitter, do any of the following: (based on an earlier one from Gilad Ben-Yossef) that reduces or even eliminates vmstat overhead for some workloads at https://lkml.org/lkml/2013/9/4/379. - e. Boot with "elevator=noop" to avoid workqueue use by - the block layer. - f. If running on high-end powerpc servers, build with + e. If running on high-end powerpc servers, build with CONFIG_PPC_RTAS_DAEMON=n. This prevents the RTAS daemon from running on each CPU every second or so. (This will require editing Kconfig files and will defeat @@ -284,12 +282,12 @@ To reduce its OS jitter, do any of the following: due to the rtas_event_scan() function. WARNING: Please check your CPU specifications to make sure that this is safe on your particular system. - g. If running on Cell Processor, build your kernel with + f. If running on Cell Processor, build your kernel with CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from spu_gov_work(). WARNING: Please check your CPU specifications to make sure that this is safe on your particular system. - h. If running on PowerMAC, build your kernel with + g. If running on PowerMAC, build your kernel with CONFIG_PMAC_RACKMETER=n to disable the CPU-meter, avoiding OS jitter from rackmeter_do_timer(). diff --git a/Documentation/admin-guide/laptops/thinkpad-acpi.rst b/Documentation/admin-guide/laptops/thinkpad-acpi.rst index adea0bf2acc5..822907dcc845 100644 --- a/Documentation/admin-guide/laptops/thinkpad-acpi.rst +++ b/Documentation/admin-guide/laptops/thinkpad-acpi.rst @@ -49,6 +49,7 @@ detailed description): - Fan control and monitoring: fan speed, fan enable/disable - WAN enable and disable - UWB enable and disable + - LCD Shadow (PrivacyGuard) enable and disable A compatibility table by model and feature is maintained on the web site, http://ibm-acpi.sf.net/. I appreciate any success or failure @@ -1409,6 +1410,28 @@ Sysfs notes Documentation/driver-api/rfkill.rst for details. +LCD Shadow control +------------------ + +procfs: /proc/acpi/ibm/lcdshadow + +Some newer T480s and T490s ThinkPads provide a feature called +PrivacyGuard. By turning this feature on, the usable vertical and +horizontal viewing angles of the LCD can be limited (as if some privacy +screen was applied manually in front of the display). + +procfs notes +^^^^^^^^^^^^ + +The available commands are:: + + echo '0' >/proc/acpi/ibm/lcdshadow + echo '1' >/proc/acpi/ibm/lcdshadow + +The first command ensures the best viewing angle and the latter one turns +on the feature, restricting the viewing angles. + + EXPERIMENTAL: UWB ----------------- diff --git a/Documentation/admin-guide/nfs/fault_injection.rst b/Documentation/admin-guide/nfs/fault_injection.rst new file mode 100644 index 000000000000..eb029c0c15ce --- /dev/null +++ b/Documentation/admin-guide/nfs/fault_injection.rst @@ -0,0 +1,70 @@ +=================== +NFS Fault Injection +=================== + +Fault injection is a method for forcing errors that may not normally occur, or +may be difficult to reproduce. Forcing these errors in a controlled environment +can help the developer find and fix bugs before their code is shipped in a +production system. Injecting an error on the Linux NFS server will allow us to +observe how the client reacts and if it manages to recover its state correctly. + +NFSD_FAULT_INJECTION must be selected when configuring the kernel to use this +feature. + + +Using Fault Injection +===================== +On the client, mount the fault injection server through NFS v4.0+ and do some +work over NFS (open files, take locks, ...). + +On the server, mount the debugfs filesystem to <debug_dir> and ls +<debug_dir>/nfsd. This will show a list of files that will be used for +injecting faults on the NFS server. As root, write a number n to the file +corresponding to the action you want the server to take. The server will then +process the first n items it finds. So if you want to forget 5 locks, echo '5' +to <debug_dir>/nfsd/forget_locks. A value of 0 will tell the server to forget +all corresponding items. A log message will be created containing the number +of items forgotten (check dmesg). + +Go back to work on the client and check if the client recovered from the error +correctly. + + +Available Faults +================ +forget_clients: + The NFS server keeps a list of clients that have placed a mount call. If + this list is cleared, the server will have no knowledge of who the client + is, forcing the client to reauthenticate with the server. + +forget_openowners: + The NFS server keeps a list of what files are currently opened and who + they were opened by. Clearing this list will force the client to reopen + its files. + +forget_locks: + The NFS server keeps a list of what files are currently locked in the VFS. + Clearing this list will force the client to reclaim its locks (files are + unlocked through the VFS as they are cleared from this list). + +forget_delegations: + A delegation is used to assure the client that a file, or part of a file, + has not changed since the delegation was awarded. Clearing this list will + force the client to reacquire its delegation before accessing the file + again. + +recall_delegations: + Delegations can be recalled by the server when another client attempts to + access a file. This test will notify the client that its delegation has + been revoked, forcing the client to reacquire the delegation before using + the file again. + + +tools/nfs/inject_faults.sh script +================================= +This script has been created to ease the fault injection process. This script +will detect the mounted debugfs directory and write to the files located there +based on the arguments passed by the user. For example, running +`inject_faults.sh forget_locks 1` as root will instruct the server to forget +one lock. Running `inject_faults forget_locks` will instruct the server to +forgetall locks. diff --git a/Documentation/admin-guide/nfs/index.rst b/Documentation/admin-guide/nfs/index.rst new file mode 100644 index 000000000000..6b5a3c90fac5 --- /dev/null +++ b/Documentation/admin-guide/nfs/index.rst @@ -0,0 +1,15 @@ +============= +NFS +============= + +.. toctree:: + :maxdepth: 1 + + nfs-client + nfsroot + nfs-rdma + nfsd-admin-interfaces + nfs-idmapper + pnfs-block-server + pnfs-scsi-server + fault_injection diff --git a/Documentation/admin-guide/nfs/nfs-client.rst b/Documentation/admin-guide/nfs/nfs-client.rst new file mode 100644 index 000000000000..c4b777c7584b --- /dev/null +++ b/Documentation/admin-guide/nfs/nfs-client.rst @@ -0,0 +1,141 @@ +========== +NFS Client +========== + +The NFS client +============== + +The NFS version 2 protocol was first documented in RFC1094 (March 1989). +Since then two more major releases of NFS have been published, with NFSv3 +being documented in RFC1813 (June 1995), and NFSv4 in RFC3530 (April +2003). + +The Linux NFS client currently supports all the above published versions, +and work is in progress on adding support for minor version 1 of the NFSv4 +protocol. + +The purpose of this document is to provide information on some of the +special features of the NFS client that can be configured by system +administrators. + + +The nfs4_unique_id parameter +============================ + +NFSv4 requires clients to identify themselves to servers with a unique +string. File open and lock state shared between one client and one server +is associated with this identity. To support robust NFSv4 state recovery +and transparent state migration, this identity string must not change +across client reboots. + +Without any other intervention, the Linux client uses a string that contains +the local system's node name. System administrators, however, often do not +take care to ensure that node names are fully qualified and do not change +over the lifetime of a client system. Node names can have other +administrative requirements that require particular behavior that does not +work well as part of an nfs_client_id4 string. + +The nfs.nfs4_unique_id boot parameter specifies a unique string that can be +used instead of a system's node name when an NFS client identifies itself to +a server. Thus, if the system's node name is not unique, or it changes, its +nfs.nfs4_unique_id stays the same, preventing collision with other clients +or loss of state during NFS reboot recovery or transparent state migration. + +The nfs.nfs4_unique_id string is typically a UUID, though it can contain +anything that is believed to be unique across all NFS clients. An +nfs4_unique_id string should be chosen when a client system is installed, +just as a system's root file system gets a fresh UUID in its label at +install time. + +The string should remain fixed for the lifetime of the client. It can be +changed safely if care is taken that the client shuts down cleanly and all +outstanding NFSv4 state has expired, to prevent loss of NFSv4 state. + +This string can be stored in an NFS client's grub.conf, or it can be provided +via a net boot facility such as PXE. It may also be specified as an nfs.ko +module parameter. Specifying a uniquifier string is not support for NFS +clients running in containers. + + +The DNS resolver +================ + +NFSv4 allows for one server to refer the NFS client to data that has been +migrated onto another server by means of the special "fs_locations" +attribute. See `RFC3530 Section 6: Filesystem Migration and Replication`_ and +`Implementation Guide for Referrals in NFSv4`_. + +.. _RFC3530 Section 6\: Filesystem Migration and Replication: http://tools.ietf.org/html/rfc3530#section-6 +.. _Implementation Guide for Referrals in NFSv4: http://tools.ietf.org/html/draft-ietf-nfsv4-referrals-00 + +The fs_locations information can take the form of either an ip address and +a path, or a DNS hostname and a path. The latter requires the NFS client to +do a DNS lookup in order to mount the new volume, and hence the need for an +upcall to allow userland to provide this service. + +Assuming that the user has the 'rpc_pipefs' filesystem mounted in the usual +/var/lib/nfs/rpc_pipefs, the upcall consists of the following steps: + + (1) The process checks the dns_resolve cache to see if it contains a + valid entry. If so, it returns that entry and exits. + + (2) If no valid entry exists, the helper script '/sbin/nfs_cache_getent' + (may be changed using the 'nfs.cache_getent' kernel boot parameter) + is run, with two arguments: + - the cache name, "dns_resolve" + - the hostname to resolve + + (3) After looking up the corresponding ip address, the helper script + writes the result into the rpc_pipefs pseudo-file + '/var/lib/nfs/rpc_pipefs/cache/dns_resolve/channel' + in the following (text) format: + + "<ip address> <hostname> <ttl>\n" + + Where <ip address> is in the usual IPv4 (123.456.78.90) or IPv6 + (ffee:ddcc:bbaa:9988:7766:5544:3322:1100, ffee::1100, ...) format. + <hostname> is identical to the second argument of the helper + script, and <ttl> is the 'time to live' of this cache entry (in + units of seconds). + + .. note:: + If <ip address> is invalid, say the string "0", then a negative + entry is created, which will cause the kernel to treat the hostname + as having no valid DNS translation. + + + + +A basic sample /sbin/nfs_cache_getent +===================================== +.. code-block:: sh + + #!/bin/bash + # + ttl=600 + # + cut=/usr/bin/cut + getent=/usr/bin/getent + rpc_pipefs=/var/lib/nfs/rpc_pipefs + # + die() + { + echo "Usage: $0 cache_name entry_name" + exit 1 + } + + [ $# -lt 2 ] && die + cachename="$1" + cache_path=${rpc_pipefs}/cache/${cachename}/channel + + case "${cachename}" in + dns_resolve) + name="$2" + result="$(${getent} hosts ${name} | ${cut} -f1 -d\ )" + [ -z "${result}" ] && result="0" + ;; + *) + die + ;; + esac + echo "${result} ${name} ${ttl}" >${cache_path} diff --git a/Documentation/admin-guide/nfs/nfs-idmapper.rst b/Documentation/admin-guide/nfs/nfs-idmapper.rst new file mode 100644 index 000000000000..58b8e63412d5 --- /dev/null +++ b/Documentation/admin-guide/nfs/nfs-idmapper.rst @@ -0,0 +1,78 @@ +============= +NFS ID Mapper +============= + +Id mapper is used by NFS to translate user and group ids into names, and to +translate user and group names into ids. Part of this translation involves +performing an upcall to userspace to request the information. There are two +ways NFS could obtain this information: placing a call to /sbin/request-key +or by placing a call to the rpc.idmap daemon. + +NFS will attempt to call /sbin/request-key first. If this succeeds, the +result will be cached using the generic request-key cache. This call should +only fail if /etc/request-key.conf is not configured for the id_resolver key +type, see the "Configuring" section below if you wish to use the request-key +method. + +If the call to /sbin/request-key fails (if /etc/request-key.conf is not +configured with the id_resolver key type), then the idmapper will ask the +legacy rpc.idmap daemon for the id mapping. This result will be stored +in a custom NFS idmap cache. + + +Configuring +=========== + +The file /etc/request-key.conf will need to be modified so /sbin/request-key can +direct the upcall. The following line should be added: + +``#OP TYPE DESCRIPTION CALLOUT INFO PROGRAM ARG1 ARG2 ARG3 ...`` +``#====== ======= =============== =============== ===============================`` +``create id_resolver * * /usr/sbin/nfs.idmap %k %d 600`` + + +This will direct all id_resolver requests to the program /usr/sbin/nfs.idmap. +The last parameter, 600, defines how many seconds into the future the key will +expire. This parameter is optional for /usr/sbin/nfs.idmap. When the timeout +is not specified, nfs.idmap will default to 600 seconds. + +id mapper uses for key descriptions:: + + uid: Find the UID for the given user + gid: Find the GID for the given group + user: Find the user name for the given UID + group: Find the group name for the given GID + +You can handle any of these individually, rather than using the generic upcall +program. If you would like to use your own program for a uid lookup then you +would edit your request-key.conf so it look similar to this: + +``#OP TYPE DESCRIPTION CALLOUT INFO PROGRAM ARG1 ARG2 ARG3 ...`` +``#====== ======= =============== =============== ===============================`` +``create id_resolver uid:* * /some/other/program %k %d 600`` +``create id_resolver * * /usr/sbin/nfs.idmap %k %d 600`` + + +Notice that the new line was added above the line for the generic program. +request-key will find the first matching line and corresponding program. In +this case, /some/other/program will handle all uid lookups and +/usr/sbin/nfs.idmap will handle gid, user, and group lookups. + +See Documentation/security/keys/request-key.rst for more information +about the request-key function. + + +nfs.idmap +========= + +nfs.idmap is designed to be called by request-key, and should not be run "by +hand". This program takes two arguments, a serialized key and a key +description. The serialized key is first converted into a key_serial_t, and +then passed as an argument to keyctl_instantiate (both are part of keyutils.h). + +The actual lookups are performed by functions found in nfsidmap.h. nfs.idmap +determines the correct function to call by looking at the first part of the +description string. For example, a uid lookup description will appear as +"uid:user@domain". + +nfs.idmap will return 0 if the key was instantiated, and non-zero otherwise. diff --git a/Documentation/admin-guide/nfs/nfs-rdma.rst b/Documentation/admin-guide/nfs/nfs-rdma.rst new file mode 100644 index 000000000000..ef0f3678b1fb --- /dev/null +++ b/Documentation/admin-guide/nfs/nfs-rdma.rst @@ -0,0 +1,292 @@ +=================== +Setting up NFS/RDMA +=================== + +:Author: + NetApp and Open Grid Computing (May 29, 2008) + +.. warning:: + This document is probably obsolete. + +Overview +======== + +This document describes how to install and setup the Linux NFS/RDMA client +and server software. + +The NFS/RDMA client was first included in Linux 2.6.24. The NFS/RDMA server +was first included in the following release, Linux 2.6.25. + +In our testing, we have obtained excellent performance results (full 10Gbit +wire bandwidth at minimal client CPU) under many workloads. The code passes +the full Connectathon test suite and operates over both Infiniband and iWARP +RDMA adapters. + +Getting Help +============ + +If you get stuck, you can ask questions on the +nfs-rdma-devel@lists.sourceforge.net mailing list. + +Installation +============ + +These instructions are a step by step guide to building a machine for +use with NFS/RDMA. + +- Install an RDMA device + + Any device supported by the drivers in drivers/infiniband/hw is acceptable. + + Testing has been performed using several Mellanox-based IB cards, the + Ammasso AMS1100 iWARP adapter, and the Chelsio cxgb3 iWARP adapter. + +- Install a Linux distribution and tools + + The first kernel release to contain both the NFS/RDMA client and server was + Linux 2.6.25 Therefore, a distribution compatible with this and subsequent + Linux kernel release should be installed. + + The procedures described in this document have been tested with + distributions from Red Hat's Fedora Project (http://fedora.redhat.com/). + +- Install nfs-utils-1.1.2 or greater on the client + + An NFS/RDMA mount point can be obtained by using the mount.nfs command in + nfs-utils-1.1.2 or greater (nfs-utils-1.1.1 was the first nfs-utils + version with support for NFS/RDMA mounts, but for various reasons we + recommend using nfs-utils-1.1.2 or greater). To see which version of + mount.nfs you are using, type: + + .. code-block:: sh + + $ /sbin/mount.nfs -V + + If the version is less than 1.1.2 or the command does not exist, + you should install the latest version of nfs-utils. + + Download the latest package from: http://www.kernel.org/pub/linux/utils/nfs + + Uncompress the package and follow the installation instructions. + + If you will not need the idmapper and gssd executables (you do not need + these to create an NFS/RDMA enabled mount command), the installation + process can be simplified by disabling these features when running + configure: + + .. code-block:: sh + + $ ./configure --disable-gss --disable-nfsv4 + + To build nfs-utils you will need the tcp_wrappers package installed. For + more information on this see the package's README and INSTALL files. + + After building the nfs-utils package, there will be a mount.nfs binary in + the utils/mount directory. This binary can be used to initiate NFS v2, v3, + or v4 mounts. To initiate a v4 mount, the binary must be called + mount.nfs4. The standard technique is to create a symlink called + mount.nfs4 to mount.nfs. + + This mount.nfs binary should be installed at /sbin/mount.nfs as follows: + + .. code-block:: sh + + $ sudo cp utils/mount/mount.nfs /sbin/mount.nfs + + In this location, mount.nfs will be invoked automatically for NFS mounts + by the system mount command. + + .. note:: + mount.nfs and therefore nfs-utils-1.1.2 or greater is only needed + on the NFS client machine. You do not need this specific version of + nfs-utils on the server. Furthermore, only the mount.nfs command from + nfs-utils-1.1.2 is needed on the client. + +- Install a Linux kernel with NFS/RDMA + + The NFS/RDMA client and server are both included in the mainline Linux + kernel version 2.6.25 and later. This and other versions of the Linux + kernel can be found at: https://www.kernel.org/pub/linux/kernel/ + + Download the sources and place them in an appropriate location. + +- Configure the RDMA stack + + Make sure your kernel configuration has RDMA support enabled. Under + Device Drivers -> InfiniBand support, update the kernel configuration + to enable InfiniBand support [NOTE: the option name is misleading. Enabling + InfiniBand support is required for all RDMA devices (IB, iWARP, etc.)]. + + Enable the appropriate IB HCA support (mlx4, mthca, ehca, ipath, etc.) or + iWARP adapter support (amso, cxgb3, etc.). + + If you are using InfiniBand, be sure to enable IP-over-InfiniBand support. + +- Configure the NFS client and server + + Your kernel configuration must also have NFS file system support and/or + NFS server support enabled. These and other NFS related configuration + options can be found under File Systems -> Network File Systems. + +- Build, install, reboot + + The NFS/RDMA code will be enabled automatically if NFS and RDMA + are turned on. The NFS/RDMA client and server are configured via the hidden + SUNRPC_XPRT_RDMA config option that depends on SUNRPC and INFINIBAND. The + value of SUNRPC_XPRT_RDMA will be: + + #. N if either SUNRPC or INFINIBAND are N, in this case the NFS/RDMA client + and server will not be built + + #. M if both SUNRPC and INFINIBAND are on (M or Y) and at least one is M, + in this case the NFS/RDMA client and server will be built as modules + + #. Y if both SUNRPC and INFINIBAND are Y, in this case the NFS/RDMA client + and server will be built into the kernel + + Therefore, if you have followed the steps above and turned no NFS and RDMA, + the NFS/RDMA client and server will be built. + + Build a new kernel, install it, boot it. + +Check RDMA and NFS Setup +======================== + +Before configuring the NFS/RDMA software, it is a good idea to test +your new kernel to ensure that the kernel is working correctly. +In particular, it is a good idea to verify that the RDMA stack +is functioning as expected and standard NFS over TCP/IP and/or UDP/IP +is working properly. + +- Check RDMA Setup + + If you built the RDMA components as modules, load them at + this time. For example, if you are using a Mellanox Tavor/Sinai/Arbel + card: + + .. code-block:: sh + + $ modprobe ib_mthca + $ modprobe ib_ipoib + + If you are using InfiniBand, make sure there is a Subnet Manager (SM) + running on the network. If your IB switch has an embedded SM, you can + use it. Otherwise, you will need to run an SM, such as OpenSM, on one + of your end nodes. + + If an SM is running on your network, you should see the following: + + .. code-block:: sh + + $ cat /sys/class/infiniband/driverX/ports/1/state + 4: ACTIVE + + where driverX is mthca0, ipath5, ehca3, etc. + + To further test the InfiniBand software stack, use IPoIB (this + assumes you have two IB hosts named host1 and host2): + + .. code-block:: sh + + host1$ ip link set dev ib0 up + host1$ ip address add dev ib0 a.b.c.x + host2$ ip link set dev ib0 up + host2$ ip address add dev ib0 a.b.c.y + host1$ ping a.b.c.y + host2$ ping a.b.c.x + + For other device types, follow the appropriate procedures. + +- Check NFS Setup + + For the NFS components enabled above (client and/or server), + test their functionality over standard Ethernet using TCP/IP or UDP/IP. + +NFS/RDMA Setup +============== + +We recommend that you use two machines, one to act as the client and +one to act as the server. + +One time configuration: +----------------------- + +- On the server system, configure the /etc/exports file and start the NFS/RDMA server. + + Exports entries with the following formats have been tested:: + + /vol0 192.168.0.47(fsid=0,rw,async,insecure,no_root_squash) + /vol0 192.168.0.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash) + + The IP address(es) is(are) the client's IPoIB address for an InfiniBand + HCA or the client's iWARP address(es) for an RNIC. + + .. note:: + The "insecure" option must be used because the NFS/RDMA client does + not use a reserved port. + +Each time a machine boots: +-------------------------- + +- Load and configure the RDMA drivers + + For InfiniBand using a Mellanox adapter: + + .. code-block:: sh + + $ modprobe ib_mthca + $ modprobe ib_ipoib + $ ip li set dev ib0 up + $ ip addr add dev ib0 a.b.c.d + + .. note:: + Please use unique addresses for the client and server! + +- Start the NFS server + + If the NFS/RDMA server was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in + kernel config), load the RDMA transport module: + + .. code-block:: sh + + $ modprobe svcrdma + + Regardless of how the server was built (module or built-in), start the + server: + + .. code-block:: sh + + $ /etc/init.d/nfs start + + or + + .. code-block:: sh + + $ service nfs start + + Instruct the server to listen on the RDMA transport: + + .. code-block:: sh + + $ echo rdma 20049 > /proc/fs/nfsd/portlist + +- On the client system + + If the NFS/RDMA client was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in + kernel config), load the RDMA client module: + + .. code-block:: sh + + $ modprobe xprtrdma.ko + + Regardless of how the client was built (module or built-in), use this + command to mount the NFS/RDMA server: + + .. code-block:: sh + + $ mount -o rdma,port=20049 <IPoIB-server-name-or-address>:/<export> /mnt + + To verify that the mount is using RDMA, run "cat /proc/mounts" and check + the "proto" field for the given mount. + + Congratulations! You're using NFS/RDMA! diff --git a/Documentation/admin-guide/nfs/nfsd-admin-interfaces.rst b/Documentation/admin-guide/nfs/nfsd-admin-interfaces.rst new file mode 100644 index 000000000000..c05926f79054 --- /dev/null +++ b/Documentation/admin-guide/nfs/nfsd-admin-interfaces.rst @@ -0,0 +1,40 @@ +================================== +Administrative interfaces for nfsd +================================== + +Note that normally these interfaces are used only by the utilities in +nfs-utils. + +nfsd is controlled mainly by pseudofiles under the "nfsd" filesystem, +which is normally mounted at /proc/fs/nfsd/. + +The server is always started by the first write of a nonzero value to +nfsd/threads. + +Before doing that, NFSD can be told which sockets to listen on by +writing to nfsd/portlist; that write may be: + + - an ascii-encoded file descriptor, which should refer to a + bound (and listening, for tcp) socket, or + - "transportname port", where transportname is currently either + "udp", "tcp", or "rdma". + +If nfsd is started without doing any of these, then it will create one +udp and one tcp listener at port 2049 (see nfsd_init_socks). + +On startup, nfsd and lockd grace periods start. nfsd is shut down by a write of +0 to nfsd/threads. All locks and state are thrown away at that point. + +Between startup and shutdown, the number of threads may be adjusted up +or down by additional writes to nfsd/threads or by writes to +nfsd/pool_threads. + +For more detail about files under nfsd/ and what they control, see +fs/nfsd/nfsctl.c; most of them have detailed comments. + +Implementation notes +==================== + +Note that the rpc server requires the caller to serialize addition and +removal of listening sockets, and startup and shutdown of the server. +For nfsd this is done using nfsd_mutex. diff --git a/Documentation/admin-guide/nfs/nfsroot.rst b/Documentation/admin-guide/nfs/nfsroot.rst new file mode 100644 index 000000000000..82a4fda057f9 --- /dev/null +++ b/Documentation/admin-guide/nfs/nfsroot.rst @@ -0,0 +1,364 @@ +=============================================== +Mounting the root filesystem via NFS (nfsroot) +=============================================== + +:Authors: + Written 1996 by Gero Kuhlmann <gero@gkminix.han.de> + + Updated 1997 by Martin Mares <mj@atrey.karlin.mff.cuni.cz> + + Updated 2006 by Nico Schottelius <nico-kernel-nfsroot@schottelius.org> + + Updated 2006 by Horms <horms@verge.net.au> + + Updated 2018 by Chris Novakovic <chris@chrisn.me.uk> + + + +In order to use a diskless system, such as an X-terminal or printer server for +example, it is necessary for the root filesystem to be present on a non-disk +device. This may be an initramfs (see +Documentation/filesystems/ramfs-rootfs-initramfs.txt), a ramdisk (see +Documentation/admin-guide/initrd.rst) or a filesystem mounted via NFS. The +following text describes on how to use NFS for the root filesystem. For the rest +of this text 'client' means the diskless system, and 'server' means the NFS +server. + + + + +Enabling nfsroot capabilities +============================= + +In order to use nfsroot, NFS client support needs to be selected as +built-in during configuration. Once this has been selected, the nfsroot +option will become available, which should also be selected. + +In the networking options, kernel level autoconfiguration can be selected, +along with the types of autoconfiguration to support. Selecting all of +DHCP, BOOTP and RARP is safe. + + + + +Kernel command line +=================== + +When the kernel has been loaded by a boot loader (see below) it needs to be +told what root fs device to use. And in the case of nfsroot, where to find +both the server and the name of the directory on the server to mount as root. +This can be established using the following kernel command line parameters: + + +root=/dev/nfs + This is necessary to enable the pseudo-NFS-device. Note that it's not a + real device but just a synonym to tell the kernel to use NFS instead of + a real device. + + +nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>] + If the `nfsroot' parameter is NOT given on the command line, + the default ``"/tftpboot/%s"`` will be used. + + <server-ip> Specifies the IP address of the NFS server. + The default address is determined by the ip parameter + (see below). This parameter allows the use of different + servers for IP autoconfiguration and NFS. + + <root-dir> Name of the directory on the server to mount as root. + If there is a "%s" token in the string, it will be + replaced by the ASCII-representation of the client's + IP address. + + <nfs-options> Standard NFS options. All options are separated by commas. + The following defaults are used:: + + port = as given by server portmap daemon + rsize = 4096 + wsize = 4096 + timeo = 7 + retrans = 3 + acregmin = 3 + acregmax = 60 + acdirmin = 30 + acdirmax = 60 + flags = hard, nointr, noposix, cto, ac + + +ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>:<dns0-ip>:<dns1-ip>:<ntp0-ip> + This parameter tells the kernel how to configure IP addresses of devices + and also how to set up the IP routing table. It was originally called + nfsaddrs, but now the boot-time IP configuration works independently of + NFS, so it was renamed to ip and the old name remained as an alias for + compatibility reasons. + + If this parameter is missing from the kernel command line, all fields are + assumed to be empty, and the defaults mentioned below apply. In general + this means that the kernel tries to configure everything using + autoconfiguration. + + The <autoconf> parameter can appear alone as the value to the ip + parameter (without all the ':' characters before). If the value is + "ip=off" or "ip=none", no autoconfiguration will take place, otherwise + autoconfiguration will take place. The most common way to use this + is "ip=dhcp". + + <client-ip> IP address of the client. + Default: Determined using autoconfiguration. + + <server-ip> IP address of the NFS server. + If RARP is used to determine + the client address and this parameter is NOT empty only + replies from the specified server are accepted. + + Only required for NFS root. That is autoconfiguration + will not be triggered if it is missing and NFS root is not + in operation. + + Value is exported to /proc/net/pnp with the prefix "bootserver " + (see below). + + Default: Determined using autoconfiguration. + The address of the autoconfiguration server is used. + + <gw-ip> IP address of a gateway if the server is on a different subnet. + Default: Determined using autoconfiguration. + + <netmask> Netmask for local network interface. + If unspecified the netmask is derived from the client IP address + assuming classful addressing. + + Default: Determined using autoconfiguration. + + <hostname> Name of the client. + If a '.' character is present, anything + before the first '.' is used as the client's hostname, and anything + after it is used as its NIS domain name. May be supplied by + autoconfiguration, but its absence will not trigger autoconfiguration. + If specified and DHCP is used, the user-provided hostname (and NIS + domain name, if present) will be carried in the DHCP request; this + may cause a DNS record to be created or updated for the client. + + Default: Client IP address is used in ASCII notation. + + <device> Name of network device to use. + Default: If the host only has one device, it is used. + Otherwise the device is determined using + autoconfiguration. This is done by sending + autoconfiguration requests out of all devices, + and using the device that received the first reply. + + <autoconf> Method to use for autoconfiguration. + In the case of options + which specify multiple autoconfiguration protocols, + requests are sent using all protocols, and the first one + to reply is used. + + Only autoconfiguration protocols that have been compiled + into the kernel will be used, regardless of the value of + this option:: + + off or none: don't use autoconfiguration + (do static IP assignment instead) + on or any: use any protocol available in the kernel + (default) + dhcp: use DHCP + bootp: use BOOTP + rarp: use RARP + both: use both BOOTP and RARP but not DHCP + (old option kept for backwards compatibility) + + if dhcp is used, the client identifier can be used by following + format "ip=dhcp,client-id-type,client-id-value" + + Default: any + + <dns0-ip> IP address of primary nameserver. + Value is exported to /proc/net/pnp with the prefix "nameserver " + (see below). + + Default: None if not using autoconfiguration; determined + automatically if using autoconfiguration. + + <dns1-ip> IP address of secondary nameserver. + See <dns0-ip>. + + <ntp0-ip> IP address of a Network Time Protocol (NTP) server. + Value is exported to /proc/net/ipconfig/ntp_servers, but is + otherwise unused (see below). + + Default: None if not using autoconfiguration; determined + automatically if using autoconfiguration. + + After configuration (whether manual or automatic) is complete, two files + are created in the following format; lines are omitted if their respective + value is empty following configuration: + + - /proc/net/pnp: + + #PROTO: <DHCP|BOOTP|RARP|MANUAL> (depending on configuration method) + domain <dns-domain> (if autoconfigured, the DNS domain) + nameserver <dns0-ip> (primary name server IP) + nameserver <dns1-ip> (secondary name server IP) + nameserver <dns2-ip> (tertiary name server IP) + bootserver <server-ip> (NFS server IP) + + - /proc/net/ipconfig/ntp_servers: + + <ntp0-ip> (NTP server IP) + <ntp1-ip> (NTP server IP) + <ntp2-ip> (NTP server IP) + + <dns-domain> and <dns2-ip> (in /proc/net/pnp) and <ntp1-ip> and <ntp2-ip> + (in /proc/net/ipconfig/ntp_servers) are requested during autoconfiguration; + they cannot be specified as part of the "ip=" kernel command line parameter. + + Because the "domain" and "nameserver" options are recognised by DNS + resolvers, /etc/resolv.conf is often linked to /proc/net/pnp on systems + that use an NFS root filesystem. + + Note that the kernel will not synchronise the system time with any NTP + servers it discovers; this is the responsibility of a user space process + (e.g. an initrd/initramfs script that passes the IP addresses listed in + /proc/net/ipconfig/ntp_servers to an NTP client before mounting the real + root filesystem if it is on NFS). + + +nfsrootdebug + This parameter enables debugging messages to appear in the kernel + log at boot time so that administrators can verify that the correct + NFS mount options, server address, and root path are passed to the + NFS client. + + +rdinit=<executable file> + To specify which file contains the program that starts system + initialization, administrators can use this command line parameter. + The default value of this parameter is "/init". If the specified + file exists and the kernel can execute it, root filesystem related + kernel command line parameters, including 'nfsroot=', are ignored. + + A description of the process of mounting the root file system can be + found in Documentation/driver-api/early-userspace/early_userspace_support.rst + + +Boot Loader +=========== + +To get the kernel into memory different approaches can be used. +They depend on various facilities being available: + + +- Booting from a floppy using syslinux + + When building kernels, an easy way to create a boot floppy that uses + syslinux is to use the zdisk or bzdisk make targets which use zimage + and bzimage images respectively. Both targets accept the + FDARGS parameter which can be used to set the kernel command line. + + e.g:: + + make bzdisk FDARGS="root=/dev/nfs" + + Note that the user running this command will need to have + access to the floppy drive device, /dev/fd0 + + For more information on syslinux, including how to create bootdisks + for prebuilt kernels, see http://syslinux.zytor.com/ + + .. note:: + Previously it was possible to write a kernel directly to + a floppy using dd, configure the boot device using rdev, and + boot using the resulting floppy. Linux no longer supports this + method of booting. + +- Booting from a cdrom using isolinux + + When building kernels, an easy way to create a bootable cdrom that + uses isolinux is to use the isoimage target which uses a bzimage + image. Like zdisk and bzdisk, this target accepts the FDARGS + parameter which can be used to set the kernel command line. + + e.g:: + + make isoimage FDARGS="root=/dev/nfs" + + The resulting iso image will be arch/<ARCH>/boot/image.iso + This can be written to a cdrom using a variety of tools including + cdrecord. + + e.g:: + + cdrecord dev=ATAPI:1,0,0 arch/x86/boot/image.iso + + For more information on isolinux, including how to create bootdisks + for prebuilt kernels, see http://syslinux.zytor.com/ + +- Using LILO + + When using LILO all the necessary command line parameters may be + specified using the 'append=' directive in the LILO configuration + file. + + However, to use the 'root=' directive you also need to create + a dummy root device, which may be removed after LILO is run. + + e.g:: + + mknod /dev/boot255 c 0 255 + + For information on configuring LILO, please refer to its documentation. + +- Using GRUB + + When using GRUB, kernel parameter are simply appended after the kernel + specification: kernel <kernel> <parameters> + +- Using loadlin + + loadlin may be used to boot Linux from a DOS command prompt without + requiring a local hard disk to mount as root. This has not been + thoroughly tested by the authors of this document, but in general + it should be possible configure the kernel command line similarly + to the configuration of LILO. + + Please refer to the loadlin documentation for further information. + +- Using a boot ROM + + This is probably the most elegant way of booting a diskless client. + With a boot ROM the kernel is loaded using the TFTP protocol. The + authors of this document are not aware of any no commercial boot + ROMs that support booting Linux over the network. However, there + are two free implementations of a boot ROM, netboot-nfs and + etherboot, both of which are available on sunsite.unc.edu, and both + of which contain everything you need to boot a diskless Linux client. + +- Using pxelinux + + Pxelinux may be used to boot linux using the PXE boot loader + which is present on many modern network cards. + + When using pxelinux, the kernel image is specified using + "kernel <relative-path-below /tftpboot>". The nfsroot parameters + are passed to the kernel by adding them to the "append" line. + It is common to use serial console in conjunction with pxeliunx, + see Documentation/admin-guide/serial-console.rst for more information. + + For more information on isolinux, including how to create bootdisks + for prebuilt kernels, see http://syslinux.zytor.com/ + + + + +Credits +======= + + The nfsroot code in the kernel and the RARP support have been written + by Gero Kuhlmann <gero@gkminix.han.de>. + + The rest of the IP layer autoconfiguration code has been written + by Martin Mares <mj@atrey.karlin.mff.cuni.cz>. + + In order to write the initial version of nfsroot I would like to thank + Jens-Uwe Mager <jum@anubis.han.de> for his help. diff --git a/Documentation/admin-guide/nfs/pnfs-block-server.rst b/Documentation/admin-guide/nfs/pnfs-block-server.rst new file mode 100644 index 000000000000..b00a2e705cc4 --- /dev/null +++ b/Documentation/admin-guide/nfs/pnfs-block-server.rst @@ -0,0 +1,42 @@ +=================================== +pNFS block layout server user guide +=================================== + +The Linux NFS server now supports the pNFS block layout extension. In this +case the NFS server acts as Metadata Server (MDS) for pNFS, which in addition +to handling all the metadata access to the NFS export also hands out layouts +to the clients to directly access the underlying block devices that are +shared with the client. + +To use pNFS block layouts with with the Linux NFS server the exported file +system needs to support the pNFS block layouts (currently just XFS), and the +file system must sit on shared storage (typically iSCSI) that is accessible +to the clients in addition to the MDS. As of now the file system needs to +sit directly on the exported volume, striping or concatenation of +volumes on the MDS and clients is not supported yet. + +On the server, pNFS block volume support is automatically if the file system +support it. On the client make sure the kernel has the CONFIG_PNFS_BLOCK +option enabled, the blkmapd daemon from nfs-utils is running, and the +file system is mounted using the NFSv4.1 protocol version (mount -o vers=4.1). + +If the nfsd server needs to fence a non-responding client it calls +/sbin/nfsd-recall-failed with the first argument set to the IP address of +the client, and the second argument set to the device node without the /dev +prefix for the file system to be fenced. Below is an example file that shows +how to translate the device into a serial number from SCSI EVPD 0x80:: + + cat > /sbin/nfsd-recall-failed << EOF + +.. code-block:: sh + + #!/bin/sh + + CLIENT="$1" + DEV="/dev/$2" + EVPD=`sg_inq --page=0x80 ${DEV} | \ + grep "Unit serial number:" | \ + awk -F ': ' '{print $2}'` + + echo "fencing client ${CLIENT} serial ${EVPD}" >> /var/log/pnfsd-fence.log + EOF diff --git a/Documentation/admin-guide/nfs/pnfs-scsi-server.rst b/Documentation/admin-guide/nfs/pnfs-scsi-server.rst new file mode 100644 index 000000000000..d2f6ee558071 --- /dev/null +++ b/Documentation/admin-guide/nfs/pnfs-scsi-server.rst @@ -0,0 +1,24 @@ + +================================== +pNFS SCSI layout server user guide +================================== + +This document describes support for pNFS SCSI layouts in the Linux NFS server. +With pNFS SCSI layouts, the NFS server acts as Metadata Server (MDS) for pNFS, +which in addition to handling all the metadata access to the NFS export, +also hands out layouts to the clients so that they can directly access the +underlying SCSI LUNs that are shared with the client. + +To use pNFS SCSI layouts with with the Linux NFS server, the exported file +system needs to support the pNFS SCSI layouts (currently just XFS), and the +file system must sit on a SCSI LUN that is accessible to the clients in +addition to the MDS. As of now the file system needs to sit directly on the +exported LUN, striping or concatenation of LUNs on the MDS and clients +is not supported yet. + +On a server built with CONFIG_NFSD_SCSI, the pNFS SCSI volume support is +automatically enabled if the file system is exported using the "pnfs" +option and the underlying SCSI device support persistent reservations. +On the client make sure the kernel has the CONFIG_PNFS_BLOCK option +enabled, and the file system is mounted using the NFSv4.1 protocol +version (mount -o vers=4.1). diff --git a/Documentation/admin-guide/perf/imx-ddr.rst b/Documentation/admin-guide/perf/imx-ddr.rst new file mode 100644 index 000000000000..3726a10a03ba --- /dev/null +++ b/Documentation/admin-guide/perf/imx-ddr.rst @@ -0,0 +1,70 @@ +===================================================== +Freescale i.MX8 DDR Performance Monitoring Unit (PMU) +===================================================== + +There are no performance counters inside the DRAM controller, so performance +signals are brought out to the edge of the controller where a set of 4 x 32 bit +counters is implemented. This is controlled by the CSV modes programed in counter +control register which causes a large number of PERF signals to be generated. + +Selection of the value for each counter is done via the config registers. There +is one register for each counter. Counter 0 is special in that it always counts +“time” and when expired causes a lock on itself and the other counters and an +interrupt is raised. If any other counter overflows, it continues counting, and +no interrupt is raised. + +The "format" directory describes format of the config (event ID) and config1 +(AXI filtering) fields of the perf_event_attr structure, see /sys/bus/event_source/ +devices/imx8_ddr0/format/. The "events" directory describes the events types +hardware supported that can be used with perf tool, see /sys/bus/event_source/ +devices/imx8_ddr0/events/. The "caps" directory describes filter features implemented +in DDR PMU, see /sys/bus/events_source/devices/imx8_ddr0/caps/. + + .. code-block:: bash + + perf stat -a -e imx8_ddr0/cycles/ cmd + perf stat -a -e imx8_ddr0/read/,imx8_ddr0/write/ cmd + +AXI filtering is only used by CSV modes 0x41 (axid-read) and 0x42 (axid-write) +to count reading or writing matches filter setting. Filter setting is various +from different DRAM controller implementations, which is distinguished by quirks +in the driver. You also can dump info from userspace, filter in "caps" directory +indicates whether PMU supports AXI ID filter or not; enhanced_filter indicates +whether PMU supports enhanced AXI ID filter or not. Value 0 for un-supported, and +value 1 for supported. + +* With DDR_CAP_AXI_ID_FILTER quirk(filter: 1, enhanced_filter: 0). + Filter is defined with two configuration parts: + --AXI_ID defines AxID matching value. + --AXI_MASKING defines which bits of AxID are meaningful for the matching. + + - 0: corresponding bit is masked. + - 1: corresponding bit is not masked, i.e. used to do the matching. + + AXI_ID and AXI_MASKING are mapped on DPCR1 register in performance counter. + When non-masked bits are matching corresponding AXI_ID bits then counter is + incremented. Perf counter is incremented if + AxID && AXI_MASKING == AXI_ID && AXI_MASKING + + This filter doesn't support filter different AXI ID for axid-read and axid-write + event at the same time as this filter is shared between counters. + + .. code-block:: bash + + perf stat -a -e imx8_ddr0/axid-read,axi_mask=0xMMMM,axi_id=0xDDDD/ cmd + perf stat -a -e imx8_ddr0/axid-write,axi_mask=0xMMMM,axi_id=0xDDDD/ cmd + + .. note:: + + axi_mask is inverted in userspace(i.e. set bits are bits to mask), and + it will be reverted in driver automatically. so that the user can just specify + axi_id to monitor a specific id, rather than having to specify axi_mask. + + .. code-block:: bash + + perf stat -a -e imx8_ddr0/axid-read,axi_id=0x12/ cmd, which will monitor ARID=0x12 + +* With DDR_CAP_AXI_ID_FILTER_ENHANCED quirk(filter: 1, enhanced_filter: 1). + This is an extension to the DDR_CAP_AXI_ID_FILTER quirk which permits + counting the number of bytes (as opposed to the number of bursts) from DDR + read and write transactions concurrently with another set of data counters. diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst index ee4bfd2a740f..47c99f40cc16 100644 --- a/Documentation/admin-guide/perf/index.rst +++ b/Documentation/admin-guide/perf/index.rst @@ -8,6 +8,7 @@ Performance monitor support :maxdepth: 1 hisi-pmu + imx-ddr qcom_l2_pmu qcom_l3_pmu arm-ccn diff --git a/Documentation/admin-guide/perf/thunderx2-pmu.rst b/Documentation/admin-guide/perf/thunderx2-pmu.rst index 08e33675853a..01f158238ae1 100644 --- a/Documentation/admin-guide/perf/thunderx2-pmu.rst +++ b/Documentation/admin-guide/perf/thunderx2-pmu.rst @@ -3,24 +3,26 @@ Cavium ThunderX2 SoC Performance Monitoring Unit (PMU UNCORE) ============================================================= The ThunderX2 SoC PMU consists of independent, system-wide, per-socket -PMUs such as the Level 3 Cache (L3C) and DDR4 Memory Controller (DMC). +PMUs such as the Level 3 Cache (L3C), DDR4 Memory Controller (DMC) and +Cavium Coherent Processor Interconnect (CCPI2). The DMC has 8 interleaved channels and the L3C has 16 interleaved tiles. Events are counted for the default channel (i.e. channel 0) and prorated to the total number of channels/tiles. -The DMC and L3C support up to 4 counters. Counters are independently -programmable and can be started and stopped individually. Each counter -can be set to a different event. Counters are 32-bit and do not support -an overflow interrupt; they are read every 2 seconds. +The DMC and L3C support up to 4 counters, while the CCPI2 supports up to 8 +counters. Counters are independently programmable to different events and +can be started and stopped individually. None of the counters support an +overflow interrupt. DMC and L3C counters are 32-bit and read every 2 seconds. +The CCPI2 counters are 64-bit and assumed not to overflow in normal operation. PMU UNCORE (perf) driver: The thunderx2_pmu driver registers per-socket perf PMUs for the DMC and -L3C devices. Each PMU can be used to count up to 4 events -simultaneously. The PMUs provide a description of their available events -and configuration options under sysfs, see -/sys/devices/uncore_<l3c_S/dmc_S/>; S is the socket id. +L3C devices. Each PMU can be used to count up to 4 (DMC/L3C) or up to 8 +(CCPI2) events simultaneously. The PMUs provide a description of their +available events and configuration options under sysfs, see +/sys/devices/uncore_<l3c_S/dmc_S/ccpi2_S/>; S is the socket id. The driver does not support sampling, therefore "perf record" will not work. Per-task perf sessions are also not supported. diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst index e70b365dbc60..6a06dc473dd6 100644 --- a/Documentation/admin-guide/pm/cpuidle.rst +++ b/Documentation/admin-guide/pm/cpuidle.rst @@ -506,6 +506,9 @@ object corresponding to it, as follows: ``disable`` Whether or not this idle state is disabled. +``default_status`` + The default status of this state, "enabled" or "disabled". + ``latency`` Exit latency of the idle state in microseconds. @@ -629,16 +632,16 @@ class priority list and destroyed. If that happens, the priority list mechanism will be used, again, to determine the new effective value for the whole list and that value will become the new real constraint. -In turn, for each CPU there is only one resume latency PM QoS request -associated with the :file:`power/pm_qos_resume_latency_us` file under +In turn, for each CPU there is one resume latency PM QoS request associated with +the :file:`power/pm_qos_resume_latency_us` file under :file:`/sys/devices/system/cpu/cpu<N>/` in ``sysfs`` and writing to it causes this single PM QoS request to be updated regardless of which user space process does that. In other words, this PM QoS request is shared by the entire user space, so access to the file associated with it needs to be arbitrated to avoid confusion. [Arguably, the only legitimate use of this mechanism in practice is to pin a process to the CPU in question and let it use the -``sysfs`` interface to control the resume latency constraint for it.] It -still only is a request, however. It is a member of a priority list used to +``sysfs`` interface to control the resume latency constraint for it.] It is +still only a request, however. It is an entry in a priority list used to determine the effective value to be set as the resume latency constraint for the CPU in question every time the list of requests is updated this way or another (there may be other requests coming from kernel code in that list). diff --git a/Documentation/admin-guide/pm/intel_idle.rst b/Documentation/admin-guide/pm/intel_idle.rst new file mode 100644 index 000000000000..89309e1b0e48 --- /dev/null +++ b/Documentation/admin-guide/pm/intel_idle.rst @@ -0,0 +1,268 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + +============================================== +``intel_idle`` CPU Idle Time Management Driver +============================================== + +:Copyright: |copy| 2020 Intel Corporation + +:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> + + +General Information +=================== + +``intel_idle`` is a part of the +:doc:`CPU idle time management subsystem <cpuidle>` in the Linux kernel +(``CPUIdle``). It is the default CPU idle time management driver for the +Nehalem and later generations of Intel processors, but the level of support for +a particular processor model in it depends on whether or not it recognizes that +processor model and may also depend on information coming from the platform +firmware. [To understand ``intel_idle`` it is necessary to know how ``CPUIdle`` +works in general, so this is the time to get familiar with :doc:`cpuidle` if you +have not done that yet.] + +``intel_idle`` uses the ``MWAIT`` instruction to inform the processor that the +logical CPU executing it is idle and so it may be possible to put some of the +processor's functional blocks into low-power states. That instruction takes two +arguments (passed in the ``EAX`` and ``ECX`` registers of the target CPU), the +first of which, referred to as a *hint*, can be used by the processor to +determine what can be done (for details refer to Intel Software Developer’s +Manual [1]_). Accordingly, ``intel_idle`` refuses to work with processors in +which the support for the ``MWAIT`` instruction has been disabled (for example, +via the platform firmware configuration menu) or which do not support that +instruction at all. + +``intel_idle`` is not modular, so it cannot be unloaded, which means that the +only way to pass early-configuration-time parameters to it is via the kernel +command line. + + +.. _intel-idle-enumeration-of-states: + +Enumeration of Idle States +========================== + +Each ``MWAIT`` hint value is interpreted by the processor as a license to +reconfigure itself in a certain way in order to save energy. The processor +configurations (with reduced power draw) resulting from that are referred to +as C-states (in the ACPI terminology) or idle states. The list of meaningful +``MWAIT`` hint values and idle states (i.e. low-power configurations of the +processor) corresponding to them depends on the processor model and it may also +depend on the configuration of the platform. + +In order to create a list of available idle states required by the ``CPUIdle`` +subsystem (see :ref:`idle-states-representation` in :doc:`cpuidle`), +``intel_idle`` can use two sources of information: static tables of idle states +for different processor models included in the driver itself and the ACPI tables +of the system. The former are always used if the processor model at hand is +recognized by ``intel_idle`` and the latter are used if that is required for +the given processor model (which is the case for all server processor models +recognized by ``intel_idle``) or if the processor model is not recognized. +[There is a module parameter that can be used to make the driver use the ACPI +tables with any processor model recognized by it; see +`below <intel-idle-parameters_>`_.] + +If the ACPI tables are going to be used for building the list of available idle +states, ``intel_idle`` first looks for a ``_CST`` object under one of the ACPI +objects corresponding to the CPUs in the system (refer to the ACPI specification +[2]_ for the description of ``_CST`` and its output package). Because the +``CPUIdle`` subsystem expects that the list of idle states supplied by the +driver will be suitable for all of the CPUs handled by it and ``intel_idle`` is +registered as the ``CPUIdle`` driver for all of the CPUs in the system, the +driver looks for the first ``_CST`` object returning at least one valid idle +state description and such that all of the idle states included in its return +package are of the FFH (Functional Fixed Hardware) type, which means that the +``MWAIT`` instruction is expected to be used to tell the processor that it can +enter one of them. The return package of that ``_CST`` is then assumed to be +applicable to all of the other CPUs in the system and the idle state +descriptions extracted from it are stored in a preliminary list of idle states +coming from the ACPI tables. [This step is skipped if ``intel_idle`` is +configured to ignore the ACPI tables; see `below <intel-idle-parameters_>`_.] + +Next, the first (index 0) entry in the list of available idle states is +initialized to represent a "polling idle state" (a pseudo-idle state in which +the target CPU continuously fetches and executes instructions), and the +subsequent (real) idle state entries are populated as follows. + +If the processor model at hand is recognized by ``intel_idle``, there is a +(static) table of idle state descriptions for it in the driver. In that case, +the "internal" table is the primary source of information on idle states and the +information from it is copied to the final list of available idle states. If +using the ACPI tables for the enumeration of idle states is not required +(depending on the processor model), all of the listed idle state are enabled by +default (so all of them will be taken into consideration by ``CPUIdle`` +governors during CPU idle state selection). Otherwise, some of the listed idle +states may not be enabled by default if there are no matching entries in the +preliminary list of idle states coming from the ACPI tables. In that case user +space still can enable them later (on a per-CPU basis) with the help of +the ``disable`` idle state attribute in ``sysfs`` (see +:ref:`idle-states-representation` in :doc:`cpuidle`). This basically means that +the idle states "known" to the driver may not be enabled by default if they have +not been exposed by the platform firmware (through the ACPI tables). + +If the given processor model is not recognized by ``intel_idle``, but it +supports ``MWAIT``, the preliminary list of idle states coming from the ACPI +tables is used for building the final list that will be supplied to the +``CPUIdle`` core during driver registration. For each idle state in that list, +the description, ``MWAIT`` hint and exit latency are copied to the corresponding +entry in the final list of idle states. The name of the idle state represented +by it (to be returned by the ``name`` idle state attribute in ``sysfs``) is +"CX_ACPI", where X is the index of that idle state in the final list (note that +the minimum value of X is 1, because 0 is reserved for the "polling" state), and +its target residency is based on the exit latency value. Specifically, for +C1-type idle states the exit latency value is also used as the target residency +(for compatibility with the majority of the "internal" tables of idle states for +various processor models recognized by ``intel_idle``) and for the other idle +state types (C2 and C3) the target residency value is 3 times the exit latency +(again, that is because it reflects the target residency to exit latency ratio +in the majority of cases for the processor models recognized by ``intel_idle``). +All of the idle states in the final list are enabled by default in this case. + + +.. _intel-idle-initialization: + +Initialization +============== + +The initialization of ``intel_idle`` starts with checking if the kernel command +line options forbid the use of the ``MWAIT`` instruction. If that is the case, +an error code is returned right away. + +The next step is to check whether or not the processor model is known to the +driver, which determines the idle states enumeration method (see +`above <intel-idle-enumeration-of-states_>`_), and whether or not the processor +supports ``MWAIT`` (the initialization fails if that is not the case). Then, +the ``MWAIT`` support in the processor is enumerated through ``CPUID`` and the +driver initialization fails if the level of support is not as expected (for +example, if the total number of ``MWAIT`` substates returned is 0). + +Next, if the driver is not configured to ignore the ACPI tables (see +`below <intel-idle-parameters_>`_), the idle states information provided by the +platform firmware is extracted from them. + +Then, ``CPUIdle`` device objects are allocated for all CPUs and the list of +available idle states is created as explained +`above <intel-idle-enumeration-of-states_>`_. + +Finally, ``intel_idle`` is registered with the help of cpuidle_register_driver() +as the ``CPUIdle`` driver for all CPUs in the system and a CPU online callback +for configuring individual CPUs is registered via cpuhp_setup_state(), which +(among other things) causes the callback routine to be invoked for all of the +CPUs present in the system at that time (each CPU executes its own instance of +the callback routine). That routine registers a ``CPUIdle`` device for the CPU +running it (which enables the ``CPUIdle`` subsystem to operate that CPU) and +optionally performs some CPU-specific initialization actions that may be +required for the given processor model. + + +.. _intel-idle-parameters: + +Kernel Command Line Options and Module Parameters +================================================= + +The *x86* architecture support code recognizes three kernel command line +options related to CPU idle time management: ``idle=poll``, ``idle=halt``, +and ``idle=nomwait``. If any of them is present in the kernel command line, the +``MWAIT`` instruction is not allowed to be used, so the initialization of +``intel_idle`` will fail. + +Apart from that there are four module parameters recognized by ``intel_idle`` +itself that can be set via the kernel command line (they cannot be updated via +sysfs, so that is the only way to change their values). + +The ``max_cstate`` parameter value is the maximum idle state index in the list +of idle states supplied to the ``CPUIdle`` core during the registration of the +driver. It is also the maximum number of regular (non-polling) idle states that +can be used by ``intel_idle``, so the enumeration of idle states is terminated +after finding that number of usable idle states (the other idle states that +potentially might have been used if ``max_cstate`` had been greater are not +taken into consideration at all). Setting ``max_cstate`` can prevent +``intel_idle`` from exposing idle states that are regarded as "too deep" for +some reason to the ``CPUIdle`` core, but it does so by making them effectively +invisible until the system is shut down and started again which may not always +be desirable. In practice, it is only really necessary to do that if the idle +states in question cannot be enabled during system startup, because in the +working state of the system the CPU power management quality of service (PM +QoS) feature can be used to prevent ``CPUIdle`` from touching those idle states +even if they have been enumerated (see :ref:`cpu-pm-qos` in :doc:`cpuidle`). +Setting ``max_cstate`` to 0 causes the ``intel_idle`` initialization to fail. + +The ``no_acpi`` and ``use_acpi`` module parameters (recognized by ``intel_idle`` +if the kernel has been configured with ACPI support) can be set to make the +driver ignore the system's ACPI tables entirely or use them for all of the +recognized processor models, respectively (they both are unset by default and +``use_acpi`` has no effect if ``no_acpi`` is set). + +The value of the ``states_off`` module parameter (0 by default) represents a +list of idle states to be disabled by default in the form of a bitmask. + +Namely, the positions of the bits that are set in the ``states_off`` value are +the indices of idle states to be disabled by default (as reflected by the names +of the corresponding idle state directories in ``sysfs``, :file:`state0`, +:file:`state1` ... :file:`state<i>` ..., where ``<i>`` is the index of the given +idle state; see :ref:`idle-states-representation` in :doc:`cpuidle`). + +For example, if ``states_off`` is equal to 3, the driver will disable idle +states 0 and 1 by default, and if it is equal to 8, idle state 3 will be +disabled by default and so on (bit positions beyond the maximum idle state index +are ignored). + +The idle states disabled this way can be enabled (on a per-CPU basis) from user +space via ``sysfs``. + + +.. _intel-idle-core-and-package-idle-states: + +Core and Package Levels of Idle States +====================================== + +Typically, in a processor supporting the ``MWAIT`` instruction there are (at +least) two levels of idle states (or C-states). One level, referred to as +"core C-states", covers individual cores in the processor, whereas the other +level, referred to as "package C-states", covers the entire processor package +and it may also involve other components of the system (GPUs, memory +controllers, I/O hubs etc.). + +Some of the ``MWAIT`` hint values allow the processor to use core C-states only +(most importantly, that is the case for the ``MWAIT`` hint value corresponding +to the ``C1`` idle state), but the majority of them give it a license to put +the target core (i.e. the core containing the logical CPU executing ``MWAIT`` +with the given hint value) into a specific core C-state and then (if possible) +to enter a specific package C-state at the deeper level. For example, the +``MWAIT`` hint value representing the ``C3`` idle state allows the processor to +put the target core into the low-power state referred to as "core ``C3``" (or +``CC3``), which happens if all of the logical CPUs (SMT siblings) in that core +have executed ``MWAIT`` with the ``C3`` hint value (or with a hint value +representing a deeper idle state), and in addition to that (in the majority of +cases) it gives the processor a license to put the entire package (possibly +including some non-CPU components such as a GPU or a memory controller) into the +low-power state referred to as "package ``C3``" (or ``PC3``), which happens if +all of the cores have gone into the ``CC3`` state and (possibly) some additional +conditions are satisfied (for instance, if the GPU is covered by ``PC3``, it may +be required to be in a certain GPU-specific low-power state for ``PC3`` to be +reachable). + +As a rule, there is no simple way to make the processor use core C-states only +if the conditions for entering the corresponding package C-states are met, so +the logical CPU executing ``MWAIT`` with a hint value that is not core-level +only (like for ``C1``) must always assume that this may cause the processor to +enter a package C-state. [That is why the exit latency and target residency +values corresponding to the majority of ``MWAIT`` hint values in the "internal" +tables of idle states in ``intel_idle`` reflect the properties of package +C-states.] If using package C-states is not desirable at all, either +:ref:`PM QoS <cpu-pm-qos>` or the ``max_cstate`` module parameter of +``intel_idle`` described `above <intel-idle-parameters_>`_ must be used to +restrict the range of permissible idle states to the ones with core-level only +``MWAIT`` hint values (like ``C1``). + + +References +========== + +.. [1] *Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2B*, + https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-2b-manual.html + +.. [2] *Advanced Configuration and Power Interface (ACPI) Specification*, + https://uefi.org/specifications diff --git a/Documentation/admin-guide/pm/sleep-states.rst b/Documentation/admin-guide/pm/sleep-states.rst index cd3a28cb81f4..ee55a460c639 100644 --- a/Documentation/admin-guide/pm/sleep-states.rst +++ b/Documentation/admin-guide/pm/sleep-states.rst @@ -153,8 +153,11 @@ for the given CPU architecture includes the low-level code for system resume. Basic ``sysfs`` Interfaces for System Suspend and Hibernation ============================================================= -The following files located in the :file:`/sys/power/` directory can be used by -user space for sleep states control. +The power management subsystem provides userspace with a unified ``sysfs`` +interface for system sleep regardless of the underlying system architecture or +platform. That interface is located in the :file:`/sys/power/` directory +(assuming that ``sysfs`` is mounted at :file:`/sys`) and it consists of the +following attributes (files): ``state`` This file contains a list of strings representing sleep states supported @@ -162,9 +165,9 @@ user space for sleep states control. to start a transition of the system into the sleep state represented by that string. - In particular, the strings "disk", "freeze" and "standby" represent the + In particular, the "disk", "freeze" and "standby" strings represent the :ref:`hibernation <hibernation>`, :ref:`suspend-to-idle <s2idle>` and - :ref:`standby <standby>` sleep states, respectively. The string "mem" + :ref:`standby <standby>` sleep states, respectively. The "mem" string is interpreted in accordance with the contents of the ``mem_sleep`` file described below. @@ -177,7 +180,7 @@ user space for sleep states control. associated with the "mem" string in the ``state`` file described above. The strings that may be present in this file are "s2idle", "shallow" - and "deep". The string "s2idle" always represents :ref:`suspend-to-idle + and "deep". The "s2idle" string always represents :ref:`suspend-to-idle <s2idle>` and, by convention, "shallow" and "deep" represent :ref:`standby <standby>` and :ref:`suspend-to-RAM <s2ram>`, respectively. @@ -185,15 +188,17 @@ user space for sleep states control. Writing one of the listed strings into this file causes the system suspend variant represented by it to be associated with the "mem" string in the ``state`` file. The string representing the suspend variant - currently associated with the "mem" string in the ``state`` file - is listed in square brackets. + currently associated with the "mem" string in the ``state`` file is + shown in square brackets. If the kernel does not support system suspend, this file is not present. ``disk`` - This file contains a list of strings representing different operations - that can be carried out after the hibernation image has been saved. The - possible options are as follows: + This file controls the operating mode of hibernation (Suspend-to-Disk). + Specifically, it tells the kernel what to do after creating a + hibernation image. + + Reading from it returns a list of supported options encoded as: ``platform`` Put the system into a special low-power state (e.g. ACPI S4) to @@ -201,6 +206,11 @@ user space for sleep states control. platform firmware to take a simplified initialization path after wakeup. + It is only available if the platform provides a special + mechanism to put the system to sleep after creating a + hibernation image (platforms with ACPI do that as a rule, for + example). + ``shutdown`` Power off the system. @@ -214,22 +224,53 @@ user space for sleep states control. the hibernation image and continue. Otherwise, use the image to restore the previous state of the system. + It is available if system suspend is supported. + ``test_resume`` Diagnostic operation. Load the image as though the system had just woken up from hibernation and the currently running kernel instance was a restore kernel and follow up with full system resume. - Writing one of the listed strings into this file causes the option + Writing one of the strings listed above into this file causes the option represented by it to be selected. - The currently selected option is shown in square brackets which means + The currently selected option is shown in square brackets, which means that the operation represented by it will be carried out after creating - and saving the image next time hibernation is triggered by writing - ``disk`` to :file:`/sys/power/state`. + and saving the image when hibernation is triggered by writing ``disk`` + to :file:`/sys/power/state`. If the kernel does not support hibernation, this file is not present. +``image_size`` + This file controls the size of hibernation images. + + It can be written a string representing a non-negative integer that will + be used as a best-effort upper limit of the image size, in bytes. The + hibernation core will do its best to ensure that the image size will not + exceed that number, but if that turns out to be impossible to achieve, a + hibernation image will still be created and its size will be as small as + possible. In particular, writing '0' to this file causes the size of + hibernation images to be minimum. + + Reading from it returns the current image size limit, which is set to + around 2/5 of the available RAM size by default. + +``pm_trace`` + This file controls the "PM trace" mechanism saving the last suspend + or resume event point in the RTC memory across reboots. It helps to + debug hard lockups or reboots due to device driver failures that occur + during system suspend or resume (which is more common) more effectively. + + If it contains "1", the fingerprint of each suspend/resume event point + in turn will be stored in the RTC memory (overwriting the actual RTC + information), so it will survive a system crash if one occurs right + after storing it and it can be used later to identify the driver that + caused the crash to happen. + + It contains "0" by default, which may be changed to "1" by writing a + string representing a nonzero integer into it. + According to the above, there are two ways to make the system go into the :ref:`suspend-to-idle <s2idle>` state. The first one is to write "freeze" directly to :file:`/sys/power/state`. The second one is to write "s2idle" to @@ -244,6 +285,7 @@ system go into the :ref:`suspend-to-RAM <s2ram>` state (write "deep" into The default suspend variant (ie. the one to be used without writing anything into :file:`/sys/power/mem_sleep`) is either "deep" (on the majority of systems supporting :ref:`suspend-to-RAM <s2ram>`) or "s2idle", but it can be overridden -by the value of the "mem_sleep_default" parameter in the kernel command line. -On some ACPI-based systems, depending on the information in the ACPI tables, the -default may be "s2idle" even if :ref:`suspend-to-RAM <s2ram>` is supported. +by the value of the ``mem_sleep_default`` parameter in the kernel command line. +On some systems with ACPI, depending on the information in the ACPI tables, the +default may be "s2idle" even if :ref:`suspend-to-RAM <s2ram>` is supported in +principle. diff --git a/Documentation/admin-guide/pm/working-state.rst b/Documentation/admin-guide/pm/working-state.rst index fc298eb1234b..88f717e59a42 100644 --- a/Documentation/admin-guide/pm/working-state.rst +++ b/Documentation/admin-guide/pm/working-state.rst @@ -8,6 +8,7 @@ Working-State Power Management :maxdepth: 2 cpuidle + intel_idle cpufreq intel_pstate intel_epb diff --git a/Documentation/admin-guide/ras.rst b/Documentation/admin-guide/ras.rst index 2b20f5f7380d..0310db624964 100644 --- a/Documentation/admin-guide/ras.rst +++ b/Documentation/admin-guide/ras.rst @@ -330,9 +330,12 @@ There can be multiple csrows and multiple channels. .. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely used to refer to a memory module, although there are other memory - packaging alternatives, like SO-DIMM, SIMM, etc. Along this document, - and inside the EDAC system, the term "dimm" is used for all memory - modules, even when they use a different kind of packaging. + packaging alternatives, like SO-DIMM, SIMM, etc. The UEFI + specification (Version 2.7) defines a memory module in the Common + Platform Error Record (CPER) section to be an SMBIOS Memory Device + (Type 17). Along this document, and inside the EDAC subsystem, the term + "dimm" is used for all memory modules, even when they use a + different kind of packaging. Memory controllers allow for several csrows, with 8 csrows being a typical value. Yet, the actual number of csrows depends on the layout of @@ -349,12 +352,14 @@ controllers. The following example will assume 2 channels: | | ``ch0`` | ``ch1`` | +============+===========+===========+ | ``csrow0`` | DIMM_A0 | DIMM_B0 | - +------------+ | | - | ``csrow1`` | | | + | | rank0 | rank0 | + +------------+ - | - | + | ``csrow1`` | rank1 | rank1 | +------------+-----------+-----------+ | ``csrow2`` | DIMM_A1 | DIMM_B1 | - +------------+ | | - | ``csrow3`` | | | + | | rank0 | rank0 | + +------------+ - | - | + | ``csrow3`` | rank1 | rank1 | +------------+-----------+-----------+ In the above example, there are 4 physical slots on the motherboard @@ -374,11 +379,13 @@ which the memory DIMM is placed. Thus, when 1 DIMM is placed in each Channel, the csrows cross both DIMMs. Memory DIMMs come single or dual "ranked". A rank is a populated csrow. -Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above -will have just one csrow (csrow0). csrow1 will be empty. On the other -hand, when 2 dual ranked DIMMs are similarly placed, then both csrow0 -and csrow1 will be populated. The pattern repeats itself for csrow2 and -csrow3. +In the example above 2 dual ranked DIMMs are similarly placed. Thus, +both csrow0 and csrow1 are populated. On the other hand, when 2 single +ranked DIMMs are placed in slots DIMM_A0 and DIMM_B0, then they will +have just one csrow (csrow0) and csrow1 will be empty. The pattern +repeats itself for csrow2 and csrow3. Also note that some memory +controllers don't have any logic to identify the memory module, see +``rankX`` directories below. The representation of the above is reflected in the directory tree in EDAC's sysfs interface. Starting in directory diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index 032c7cd3cede..def074807cee 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -831,8 +831,8 @@ printk_ratelimit: ================= Some warning messages are rate limited. printk_ratelimit specifies -the minimum length of time between these messages (in jiffies), by -default we allow one every 5 seconds. +the minimum length of time between these messages (in seconds). +The default value is 5 seconds. A value of 0 will disable rate limiting. @@ -845,6 +845,8 @@ seconds, we do allow a burst of messages to pass through. printk_ratelimit_burst specifies the number of messages we can send before ratelimiting kicks in. +The default value is 10 messages. + printk_devkmsg: =============== @@ -1101,7 +1103,7 @@ During initialization the kernel sets this value such that even if the maximum number of threads is created, the thread structures occupy only a part (1/8th) of the available RAM pages. -The minimum value that can be written to threads-max is 20. +The minimum value that can be written to threads-max is 1. The maximum value that can be written to threads-max is given by the constant FUTEX_TID_MASK (0x3fffffff). @@ -1109,10 +1111,6 @@ constant FUTEX_TID_MASK (0x3fffffff). If a value outside of this range is written to threads-max an error EINVAL occurs. -The value written is checked against the available RAM pages. If the -thread structures would occupy too much (more than 1/8th) of the -available RAM pages threads-max is reduced accordingly. - unknown_nmi_panic: ================== diff --git a/Documentation/admin-guide/sysrq.rst b/Documentation/admin-guide/sysrq.rst index 7b9035c01a2e..72b2cfb066f4 100644 --- a/Documentation/admin-guide/sysrq.rst +++ b/Documentation/admin-guide/sysrq.rst @@ -171,22 +171,20 @@ It seems others find it useful as (System Attention Key) which is useful when you want to exit a program that will not let you switch consoles. (For example, X or a svgalib program.) -``reboot(b)`` is good when you're unable to shut down. But you should also -``sync(s)`` and ``umount(u)`` first. +``reboot(b)`` is good when you're unable to shut down, it is an equivalent +of pressing the "reset" button. ``crash(c)`` can be used to manually trigger a crashdump when the system is hung. Note that this just triggers a crash if there is no dump mechanism available. -``sync(s)`` is great when your system is locked up, it allows you to sync your -disks and will certainly lessen the chance of data loss and fscking. Note -that the sync hasn't taken place until you see the "OK" and "Done" appear -on the screen. (If the kernel is really in strife, you may not ever get the -OK or Done message...) +``sync(s)`` is handy before yanking removable medium or after using a rescue +shell that provides no graceful shutdown -- it will ensure your data is +safely written to the disk. Note that the sync hasn't taken place until you see +the "OK" and "Done" appear on the screen. -``umount(u)`` is basically useful in the same ways as ``sync(s)``. I generally -``sync(s)``, ``umount(u)``, then ``reboot(b)`` when my system locks. It's saved -me many a fsck. Again, the unmount (remount read-only) hasn't taken place until -you see the "OK" and "Done" message appear on the screen. +``umount(u)`` can be used to mark filesystems as properly unmounted. From the +running system's point of view, they will be remounted read-only. The remount +isn't complete until you see the "OK" and "Done" message appear on the screen. The loglevels ``0``-``9`` are useful when your console is being flooded with kernel messages you do not want to see. Selecting ``0`` will prevent all but diff --git a/Documentation/admin-guide/thunderbolt.rst b/Documentation/admin-guide/thunderbolt.rst index 898ad78f3cc7..10c4f0ce2ad0 100644 --- a/Documentation/admin-guide/thunderbolt.rst +++ b/Documentation/admin-guide/thunderbolt.rst @@ -1,6 +1,28 @@ -============= - Thunderbolt -============= +.. SPDX-License-Identifier: GPL-2.0 + +====================== + USB4 and Thunderbolt +====================== +USB4 is the public specification based on Thunderbolt 3 protocol with +some differences at the register level among other things. Connection +manager is an entity running on the host router (host controller) +responsible for enumerating routers and establishing tunnels. A +connection manager can be implemented either in firmware or software. +Typically PCs come with a firmware connection manager for Thunderbolt 3 +and early USB4 capable systems. Apple systems on the other hand use +software connection manager and the later USB4 compliant devices follow +the suit. + +The Linux Thunderbolt driver supports both and can detect at runtime which +connection manager implementation is to be used. To be on the safe side the +software connection manager in Linux also advertises security level +``user`` which means PCIe tunneling is disabled by default. The +documentation below applies to both implementations with the exception that +the software connection manager only supports ``user`` security level and +is expected to be accompanied with an IOMMU based DMA protection. + +Security levels and how to use them +----------------------------------- The interface presented here is not meant for end users. Instead there should be a userspace tool that handles all the low-level details, keeps a database of the authorized devices and prompts users for new connections. @@ -18,8 +40,6 @@ This will authorize all devices automatically when they appear. However, keep in mind that this bypasses the security levels and makes the system vulnerable to DMA attacks. -Security levels and how to use them ------------------------------------ Starting with Intel Falcon Ridge Thunderbolt controller there are 4 security levels available. Intel Titan Ridge added one more security level (usbonly). The reason for these is the fact that the connected devices can diff --git a/Documentation/admin-guide/ufs.rst b/Documentation/admin-guide/ufs.rst new file mode 100644 index 000000000000..55d15297f8d7 --- /dev/null +++ b/Documentation/admin-guide/ufs.rst @@ -0,0 +1,68 @@ +========= +Using UFS +========= + +mount -t ufs -o ufstype=type_of_ufs device dir + + +UFS Options +=========== + +ufstype=type_of_ufs + UFS is a file system widely used in different operating systems. + The problem are differences among implementations. Features of + some implementations are undocumented, so its hard to recognize + type of ufs automatically. That's why user must specify type of + ufs manually by mount option ufstype. Possible values are: + + old + old format of ufs + default value, supported as read-only + + 44bsd + used in FreeBSD, NetBSD, OpenBSD + supported as read-write + + ufs2 + used in FreeBSD 5.x + supported as read-write + + 5xbsd + synonym for ufs2 + + sun + used in SunOS (Solaris) + supported as read-write + + sunx86 + used in SunOS for Intel (Solarisx86) + supported as read-write + + hp + used in HP-UX + supported as read-only + + nextstep + used in NextStep + supported as read-only + + nextstep-cd + used for NextStep CDROMs (block_size == 2048) + supported as read-only + + openstep + used in OpenStep + supported as read-only + + +Possible Problems +----------------- + +See next section, if you have any. + + +Bug Reports +----------- + +Any ufs bug report you can send to daniel.pirkl@email.cz or +to dushistov@mail.ru (do not send partition tables bug reports). diff --git a/Documentation/admin-guide/wimax/i2400m.rst b/Documentation/admin-guide/wimax/i2400m.rst new file mode 100644 index 000000000000..194388c0c351 --- /dev/null +++ b/Documentation/admin-guide/wimax/i2400m.rst @@ -0,0 +1,283 @@ +.. include:: <isonum.txt> + +==================================================== +Driver for the Intel Wireless Wimax Connection 2400m +==================================================== + +:Copyright: |copy| 2008 Intel Corporation < linux-wimax@intel.com > + + This provides a driver for the Intel Wireless WiMAX Connection 2400m + and a basic Linux kernel WiMAX stack. + +1. Requirements +=============== + + * Linux installation with Linux kernel 2.6.22 or newer (if building + from a separate tree) + * Intel i2400m Echo Peak or Baxter Peak; this includes the Intel + Wireless WiMAX/WiFi Link 5x50 series. + * build tools: + + + Linux kernel development package for the target kernel; to + build against your currently running kernel, you need to have + the kernel development package corresponding to the running + image installed (usually if your kernel is named + linux-VERSION, the development package is called + linux-dev-VERSION or linux-headers-VERSION). + + GNU C Compiler, make + +2. Compilation and installation +=============================== + +2.1. Compilation of the drivers included in the kernel +------------------------------------------------------ + + Configure the kernel; to enable the WiMAX drivers select Drivers > + Networking Drivers > WiMAX device support. Enable all of them as + modules (easier). + + If USB or SDIO are not enabled in the kernel configuration, the options + to build the i2400m USB or SDIO drivers will not show. Enable said + subsystems and go back to the WiMAX menu to enable the drivers. + + Compile and install your kernel as usual. + +2.2. Compilation of the drivers distributed as an standalone module +------------------------------------------------------------------- + + To compile:: + + $ cd source/directory + $ make + + Once built you can load and unload using the provided load.sh script; + load.sh will load the modules, load.sh u will unload them. + + To install in the default kernel directories (and enable auto loading + when the device is plugged):: + + $ make install + $ depmod -a + + If your kernel development files are located in a non standard + directory or if you want to build for a kernel that is not the + currently running one, set KDIR to the right location:: + + $ make KDIR=/path/to/kernel/dev/tree + + For more information, please contact linux-wimax@intel.com. + +3. Installing the firmware +-------------------------- + + The firmware can be obtained from http://linuxwimax.org or might have + been supplied with your hardware. + + It has to be installed in the target system:: + + $ cp FIRMWAREFILE.sbcf /lib/firmware/i2400m-fw-BUSTYPE-1.3.sbcf + + * NOTE: if your firmware came in an .rpm or .deb file, just install + it as normal, with the rpm (rpm -i FIRMWARE.rpm) or dpkg + (dpkg -i FIRMWARE.deb) commands. No further action is needed. + * BUSTYPE will be usb or sdio, depending on the hardware you have. + Each hardware type comes with its own firmware and will not work + with other types. + +4. Design +========= + + This package contains two major parts: a WiMAX kernel stack and a + driver for the Intel i2400m. + + The WiMAX stack is designed to provide for common WiMAX control + services to current and future WiMAX devices from any vendor; please + see README.wimax for details. + + The i2400m kernel driver is broken up in two main parts: the bus + generic driver and the bus-specific drivers. The bus generic driver + forms the drivercore and contain no knowledge of the actual method we + use to connect to the device. The bus specific drivers are just the + glue to connect the bus-generic driver and the device. Currently only + USB and SDIO are supported. See drivers/net/wimax/i2400m/i2400m.h for + more information. + + The bus generic driver is logically broken up in two parts: OS-glue and + hardware-glue. The OS-glue interfaces with Linux. The hardware-glue + interfaces with the device on using an interface provided by the + bus-specific driver. The reason for this breakup is to be able to + easily reuse the hardware-glue to write drivers for other OSes; note + the hardware glue part is written as a native Linux driver; no + abstraction layers are used, so to port to another OS, the Linux kernel + API calls should be replaced with the target OS's. + +5. Usage +======== + + To load the driver, follow the instructions in the install section; + once the driver is loaded, plug in the device (unless it is permanently + plugged in). The driver will enumerate the device, upload the firmware + and output messages in the kernel log (dmesg, /var/log/messages or + /var/log/kern.log) such as:: + + ... + i2400m_usb 5-4:1.0: firmware interface version 8.0.0 + i2400m_usb 5-4:1.0: WiMAX interface wmx0 (00:1d:e1:01:94:2c) ready + + At this point the device is ready to work. + + Current versions require the Intel WiMAX Network Service in userspace + to make things work. See the network service's README for instructions + on how to scan, connect and disconnect. + +5.1. Module parameters +---------------------- + + Module parameters can be set at kernel or module load time or by + echoing values:: + + $ echo VALUE > /sys/module/MODULENAME/parameters/PARAMETERNAME + + To make changes permanent, for example, for the i2400m module, you can + also create a file named /etc/modprobe.d/i2400m containing:: + + options i2400m idle_mode_disabled=1 + + To find which parameters are supported by a module, run:: + + $ modinfo path/to/module.ko + + During kernel bootup (if the driver is linked in the kernel), specify + the following to the kernel command line:: + + i2400m.PARAMETER=VALUE + +5.1.1. i2400m: idle_mode_disabled +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + The i2400m module supports a parameter to disable idle mode. This + parameter, once set, will take effect only when the device is + reinitialized by the driver (eg: following a reset or a reconnect). + +5.2. Debug operations: debugfs entries +-------------------------------------- + + The driver will register debugfs entries that allow the user to tweak + debug settings. There are three main container directories where + entries are placed, which correspond to the three blocks a i2400m WiMAX + driver has: + + * /sys/kernel/debug/wimax:DEVNAME/ for the generic WiMAX stack + controls + * /sys/kernel/debug/wimax:DEVNAME/i2400m for the i2400m generic + driver controls + * /sys/kernel/debug/wimax:DEVNAME/i2400m-usb (or -sdio) for the + bus-specific i2400m-usb or i2400m-sdio controls). + + Of course, if debugfs is mounted in a directory other than + /sys/kernel/debug, those paths will change. + +5.2.1. Increasing debug output +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + The files named *dl_* indicate knobs for controlling the debug output + of different submodules:: + + # find /sys/kernel/debug/wimax\:wmx0 -name \*dl_\* + /sys/kernel/debug/wimax:wmx0/i2400m-usb/dl_tx + /sys/kernel/debug/wimax:wmx0/i2400m-usb/dl_rx + /sys/kernel/debug/wimax:wmx0/i2400m-usb/dl_notif + /sys/kernel/debug/wimax:wmx0/i2400m-usb/dl_fw + /sys/kernel/debug/wimax:wmx0/i2400m-usb/dl_usb + /sys/kernel/debug/wimax:wmx0/i2400m/dl_tx + /sys/kernel/debug/wimax:wmx0/i2400m/dl_rx + /sys/kernel/debug/wimax:wmx0/i2400m/dl_rfkill + /sys/kernel/debug/wimax:wmx0/i2400m/dl_netdev + /sys/kernel/debug/wimax:wmx0/i2400m/dl_fw + /sys/kernel/debug/wimax:wmx0/i2400m/dl_debugfs + /sys/kernel/debug/wimax:wmx0/i2400m/dl_driver + /sys/kernel/debug/wimax:wmx0/i2400m/dl_control + /sys/kernel/debug/wimax:wmx0/wimax_dl_stack + /sys/kernel/debug/wimax:wmx0/wimax_dl_op_rfkill + /sys/kernel/debug/wimax:wmx0/wimax_dl_op_reset + /sys/kernel/debug/wimax:wmx0/wimax_dl_op_msg + /sys/kernel/debug/wimax:wmx0/wimax_dl_id_table + /sys/kernel/debug/wimax:wmx0/wimax_dl_debugfs + + By reading the file you can obtain the current value of said debug + level; by writing to it, you can set it. + + To increase the debug level of, for example, the i2400m's generic TX + engine, just write:: + + $ echo 3 > /sys/kernel/debug/wimax:wmx0/i2400m/dl_tx + + Increasing numbers yield increasing debug information; for details of + what is printed and the available levels, check the source. The code + uses 0 for disabled and increasing values until 8. + +5.2.2. RX and TX statistics +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + The i2400m/rx_stats and i2400m/tx_stats provide statistics about the + data reception/delivery from the device:: + + $ cat /sys/kernel/debug/wimax:wmx0/i2400m/rx_stats + 45 1 3 34 3104 48 480 + + The numbers reported are: + + * packets/RX-buffer: total, min, max + * RX-buffers: total RX buffers received, accumulated RX buffer size + in bytes, min size received, max size received + + Thus, to find the average buffer size received, divide accumulated + RX-buffer / total RX-buffers. + + To clear the statistics back to 0, write anything to the rx_stats file:: + + $ echo 1 > /sys/kernel/debug/wimax:wmx0/i2400m_rx_stats + + Likewise for TX. + + Note the packets this debug file refers to are not network packet, but + packets in the sense of the device-specific protocol for communication + to the host. See drivers/net/wimax/i2400m/tx.c. + +5.2.3. Tracing messages received from user space +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + To echo messages received from user space into the trace pipe that the + i2400m driver creates, set the debug file i2400m/trace_msg_from_user to + 1:: + + $ echo 1 > /sys/kernel/debug/wimax:wmx0/i2400m/trace_msg_from_user + +5.2.4. Performing a device reset +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + By writing a 0, a 1 or a 2 to the file + /sys/kernel/debug/wimax:wmx0/reset, the driver performs a warm (without + disconnecting from the bus), cold (disconnecting from the bus) or bus + (bus specific) reset on the device. + +5.2.5. Asking the device to enter power saving mode +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + By writing any value to the /sys/kernel/debug/wimax:wmx0 file, the + device will attempt to enter power saving mode. + +6. Troubleshooting +================== + +6.1. Driver complains about ``i2400m-fw-usb-1.2.sbcf: request failed`` +---------------------------------------------------------------------- + + If upon connecting the device, the following is output in the kernel + log:: + + i2400m_usb 5-4:1.0: fw i2400m-fw-usb-1.3.sbcf: request failed: -2 + + This means that the driver cannot locate the firmware file named + /lib/firmware/i2400m-fw-usb-1.2.sbcf. Check that the file is present in + the right location. diff --git a/Documentation/admin-guide/wimax/index.rst b/Documentation/admin-guide/wimax/index.rst new file mode 100644 index 000000000000..fdf7c1f99ff5 --- /dev/null +++ b/Documentation/admin-guide/wimax/index.rst @@ -0,0 +1,19 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============== +WiMAX subsystem +=============== + +.. toctree:: + :maxdepth: 2 + + wimax + + i2400m + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/admin-guide/wimax/wimax.rst b/Documentation/admin-guide/wimax/wimax.rst new file mode 100644 index 000000000000..817ee8ba2732 --- /dev/null +++ b/Documentation/admin-guide/wimax/wimax.rst @@ -0,0 +1,89 @@ +.. include:: <isonum.txt> + +======================== +Linux kernel WiMAX stack +======================== + +:Copyright: |copy| 2008 Intel Corporation < linux-wimax@intel.com > + + This provides a basic Linux kernel WiMAX stack to provide a common + control API for WiMAX devices, usable from kernel and user space. + +1. Design +========= + + The WiMAX stack is designed to provide for common WiMAX control + services to current and future WiMAX devices from any vendor. + + Because currently there is only one and we don't know what would be the + common services, the APIs it currently provides are very minimal. + However, it is done in such a way that it is easily extensible to + accommodate future requirements. + + The stack works by embedding a struct wimax_dev in your device's + control structures. This provides a set of callbacks that the WiMAX + stack will call in order to implement control operations requested by + the user. As well, the stack provides API functions that the driver + calls to notify about changes of state in the device. + + The stack exports the API calls needed to control the device to user + space using generic netlink as a marshalling mechanism. You can access + them using your own code or use the wrappers provided for your + convenience in libwimax (in the wimax-tools package). + + For detailed information on the stack, please see + include/linux/wimax.h. + +2. Usage +======== + + For usage in a driver (registration, API, etc) please refer to the + instructions in the header file include/linux/wimax.h. + + When a device is registered with the WiMAX stack, a set of debugfs + files will appear in /sys/kernel/debug/wimax:wmxX can tweak for + control. + +2.1. Obtaining debug information: debugfs entries +------------------------------------------------- + + The WiMAX stack is compiled, by default, with debug messages that can + be used to diagnose issues. By default, said messages are disabled. + + The drivers will register debugfs entries that allow the user to tweak + debug settings. + + Each driver, when registering with the stack, will cause a debugfs + directory named wimax:DEVICENAME to be created; optionally, it might + create more subentries below it. + +2.1.1. Increasing debug output +------------------------------ + + The files named *dl_* indicate knobs for controlling the debug output + of different submodules of the WiMAX stack:: + + # find /sys/kernel/debug/wimax\:wmx0 -name \*dl_\* + /sys/kernel/debug/wimax:wmx0/wimax_dl_stack + /sys/kernel/debug/wimax:wmx0/wimax_dl_op_rfkill + /sys/kernel/debug/wimax:wmx0/wimax_dl_op_reset + /sys/kernel/debug/wimax:wmx0/wimax_dl_op_msg + /sys/kernel/debug/wimax:wmx0/wimax_dl_id_table + /sys/kernel/debug/wimax:wmx0/wimax_dl_debugfs + /sys/kernel/debug/wimax:wmx0/.... # other driver specific files + + NOTE: + Of course, if debugfs is mounted in a directory other than + /sys/kernel/debug, those paths will change. + + By reading the file you can obtain the current value of said debug + level; by writing to it, you can set it. + + To increase the debug level of, for example, the id-table submodule, + just write: + + $ echo 3 > /sys/kernel/debug/wimax:wmx0/wimax_dl_id_table + + Increasing numbers yield increasing debug information; for details of + what is printed and the available levels, check the source. The code + uses 0 for disabled and increasing values until 8. diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst index e76665a8f2f2..ad911be5b5e9 100644 --- a/Documentation/admin-guide/xfs.rst +++ b/Documentation/admin-guide/xfs.rst @@ -253,7 +253,7 @@ The following sysctls are available for the XFS filesystem: pool. fs.xfs.speculative_prealloc_lifetime - (Units: seconds Min: 1 Default: 300 Max: 86400) + (Units: seconds Min: 1 Default: 300 Max: 86400) The interval at which the background scanning for inodes with unused speculative preallocation runs. The scan removes unused preallocation from clean inodes and releases @@ -337,11 +337,12 @@ None at present. Removed Sysctls =============== +============================= ======= Name Removed - ---- ------- +============================= ======= fs.xfs.xfsbufd_centisec v4.0 fs.xfs.age_buffer_centisecs v4.0 - +============================= ======= Error handling ============== |