diff options
Diffstat (limited to 'Documentation/x86')
-rw-r--r-- | Documentation/x86/amd-memory-encryption.txt | 68 | ||||
-rw-r--r-- | Documentation/x86/early-microcode.txt | 70 | ||||
-rw-r--r-- | Documentation/x86/intel_rdt_ui.txt | 323 | ||||
-rw-r--r-- | Documentation/x86/microcode.txt | 137 | ||||
-rw-r--r-- | Documentation/x86/orc-unwinder.txt | 179 | ||||
-rw-r--r-- | Documentation/x86/protection-keys.txt | 6 | ||||
-rw-r--r-- | Documentation/x86/x86_64/5level-paging.txt | 64 |
7 files changed, 736 insertions, 111 deletions
diff --git a/Documentation/x86/amd-memory-encryption.txt b/Documentation/x86/amd-memory-encryption.txt new file mode 100644 index 000000000000..f512ab718541 --- /dev/null +++ b/Documentation/x86/amd-memory-encryption.txt @@ -0,0 +1,68 @@ +Secure Memory Encryption (SME) is a feature found on AMD processors. + +SME provides the ability to mark individual pages of memory as encrypted using +the standard x86 page tables. A page that is marked encrypted will be +automatically decrypted when read from DRAM and encrypted when written to +DRAM. SME can therefore be used to protect the contents of DRAM from physical +attacks on the system. + +A page is encrypted when a page table entry has the encryption bit set (see +below on how to determine its position). The encryption bit can also be +specified in the cr3 register, allowing the PGD table to be encrypted. Each +successive level of page tables can also be encrypted by setting the encryption +bit in the page table entry that points to the next table. This allows the full +page table hierarchy to be encrypted. Note, this means that just because the +encryption bit is set in cr3, doesn't imply the full hierarchy is encyrpted. +Each page table entry in the hierarchy needs to have the encryption bit set to +achieve that. So, theoretically, you could have the encryption bit set in cr3 +so that the PGD is encrypted, but not set the encryption bit in the PGD entry +for a PUD which results in the PUD pointed to by that entry to not be +encrypted. + +Support for SME can be determined through the CPUID instruction. The CPUID +function 0x8000001f reports information related to SME: + + 0x8000001f[eax]: + Bit[0] indicates support for SME + 0x8000001f[ebx]: + Bits[5:0] pagetable bit number used to activate memory + encryption + Bits[11:6] reduction in physical address space, in bits, when + memory encryption is enabled (this only affects + system physical addresses, not guest physical + addresses) + +If support for SME is present, MSR 0xc00100010 (MSR_K8_SYSCFG) can be used to +determine if SME is enabled and/or to enable memory encryption: + + 0xc0010010: + Bit[23] 0 = memory encryption features are disabled + 1 = memory encryption features are enabled + +Linux relies on BIOS to set this bit if BIOS has determined that the reduction +in the physical address space as a result of enabling memory encryption (see +CPUID information above) will not conflict with the address space resource +requirements for the system. If this bit is not set upon Linux startup then +Linux itself will not set it and memory encryption will not be possible. + +The state of SME in the Linux kernel can be documented as follows: + - Supported: + The CPU supports SME (determined through CPUID instruction). + + - Enabled: + Supported and bit 23 of MSR_K8_SYSCFG is set. + + - Active: + Supported, Enabled and the Linux kernel is actively applying + the encryption bit to page table entries (the SME mask in the + kernel is non-zero). + +SME can also be enabled and activated in the BIOS. If SME is enabled and +activated in the BIOS, then all memory accesses will be encrypted and it will +not be necessary to activate the Linux memory encryption support. If the BIOS +merely enables SME (sets bit 23 of the MSR_K8_SYSCFG), then Linux can activate +memory encryption by default (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y) or +by supplying mem_encrypt=on on the kernel command line. However, if BIOS does +not enable SME, then Linux will not be able to activate memory encryption, even +if configured to do so by default or the mem_encrypt=on command line parameter +is specified. diff --git a/Documentation/x86/early-microcode.txt b/Documentation/x86/early-microcode.txt deleted file mode 100644 index 07749e7f3d50..000000000000 --- a/Documentation/x86/early-microcode.txt +++ /dev/null @@ -1,70 +0,0 @@ -Early load microcode -==================== -By Fenghua Yu <fenghua.yu@intel.com> - -Kernel can update microcode in early phase of boot time. Loading microcode early -can fix CPU issues before they are observed during kernel boot time. - -Microcode is stored in an initrd file. The microcode is read from the initrd -file and loaded to CPUs during boot time. - -The format of the combined initrd image is microcode in cpio format followed by -the initrd image (maybe compressed). Kernel parses the combined initrd image -during boot time. The microcode file in cpio name space is: -on Intel: kernel/x86/microcode/GenuineIntel.bin -on AMD : kernel/x86/microcode/AuthenticAMD.bin - -During BSP boot (before SMP starts), if the kernel finds the microcode file in -the initrd file, it parses the microcode and saves matching microcode in memory. -If matching microcode is found, it will be uploaded in BSP and later on in all -APs. - -The cached microcode patch is applied when CPUs resume from a sleep state. - -There are two legacy user space interfaces to load microcode, either through -/dev/cpu/microcode or through /sys/devices/system/cpu/microcode/reload file -in sysfs. - -In addition to these two legacy methods, the early loading method described -here is the third method with which microcode can be uploaded to a system's -CPUs. - -The following example script shows how to generate a new combined initrd file in -/boot/initrd-3.5.0.ucode.img with original microcode microcode.bin and -original initrd image /boot/initrd-3.5.0.img. - -mkdir initrd -cd initrd -mkdir -p kernel/x86/microcode -cp ../microcode.bin kernel/x86/microcode/GenuineIntel.bin (or AuthenticAMD.bin) -find . | cpio -o -H newc >../ucode.cpio -cd .. -cat ucode.cpio /boot/initrd-3.5.0.img >/boot/initrd-3.5.0.ucode.img - -Builtin microcode -================= - -We can also load builtin microcode supplied through the regular firmware -builtin method CONFIG_FIRMWARE_IN_KERNEL. Only 64-bit is currently -supported. - -Here's an example: - -CONFIG_FIRMWARE_IN_KERNEL=y -CONFIG_EXTRA_FIRMWARE="intel-ucode/06-3a-09 amd-ucode/microcode_amd_fam15h.bin" -CONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware" - -This basically means, you have the following tree structure locally: - -/lib/firmware/ -|-- amd-ucode -... -| |-- microcode_amd_fam15h.bin -... -|-- intel-ucode -... -| |-- 06-3a-09 -... - -so that the build system can find those files and integrate them into -the final kernel image. The early loader finds them and applies them. diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt index c491a1b82de2..4d8848e4e224 100644 --- a/Documentation/x86/intel_rdt_ui.txt +++ b/Documentation/x86/intel_rdt_ui.txt @@ -6,8 +6,8 @@ Fenghua Yu <fenghua.yu@intel.com> Tony Luck <tony.luck@intel.com> Vikas Shivappa <vikas.shivappa@intel.com> -This feature is enabled by the CONFIG_INTEL_RDT_A Kconfig and the -X86 /proc/cpuinfo flag bits "rdt", "cat_l3" and "cdp_l3". +This feature is enabled by the CONFIG_INTEL_RDT Kconfig and the +X86 /proc/cpuinfo flag bits "rdt", "cqm", "cat_l3" and "cdp_l3". To use the feature mount the file system: @@ -17,6 +17,13 @@ mount options are: "cdp": Enable code/data prioritization in L3 cache allocations. +RDT features are orthogonal. A particular system may support only +monitoring, only control, or both monitoring and control. + +The mount succeeds if either of allocation or monitoring is present, but +only those files and directories supported by the system will be created. +For more details on the behavior of the interface during monitoring +and allocation, see the "Resource alloc and monitor groups" section. Info directory -------------- @@ -24,7 +31,12 @@ Info directory The 'info' directory contains information about the enabled resources. Each resource has its own subdirectory. The subdirectory names reflect the resource names. -Cache resource(L3/L2) subdirectory contains the following files: + +Each subdirectory contains the following files with respect to +allocation: + +Cache resource(L3/L2) subdirectory contains the following files +related to allocation: "num_closids": The number of CLOSIDs which are valid for this resource. The kernel uses the smallest number of @@ -36,7 +48,15 @@ Cache resource(L3/L2) subdirectory contains the following files: "min_cbm_bits": The minimum number of consecutive bits which must be set when writing a mask. -Memory bandwitdh(MB) subdirectory contains the following files: +"shareable_bits": Bitmask of shareable resource with other executing + entities (e.g. I/O). User can use this when + setting up exclusive cache partitions. Note that + some platforms support devices that have their + own settings for cache use which can over-ride + these bits. + +Memory bandwitdh(MB) subdirectory contains the following files +with respect to allocation: "min_bandwidth": The minimum memory bandwidth percentage which user can request. @@ -52,48 +72,152 @@ Memory bandwitdh(MB) subdirectory contains the following files: non-linear. This field is purely informational only. -Resource groups ---------------- +If RDT monitoring is available there will be an "L3_MON" directory +with the following files: + +"num_rmids": The number of RMIDs available. This is the + upper bound for how many "CTRL_MON" + "MON" + groups can be created. + +"mon_features": Lists the monitoring events if + monitoring is enabled for the resource. + +"max_threshold_occupancy": + Read/write file provides the largest value (in + bytes) at which a previously used LLC_occupancy + counter can be considered for re-use. + + +Resource alloc and monitor groups +--------------------------------- + Resource groups are represented as directories in the resctrl file -system. The default group is the root directory. Other groups may be -created as desired by the system administrator using the "mkdir(1)" -command, and removed using "rmdir(1)". +system. The default group is the root directory which, immediately +after mounting, owns all the tasks and cpus in the system and can make +full use of all resources. + +On a system with RDT control features additional directories can be +created in the root directory that specify different amounts of each +resource (see "schemata" below). The root and these additional top level +directories are referred to as "CTRL_MON" groups below. + +On a system with RDT monitoring the root directory and other top level +directories contain a directory named "mon_groups" in which additional +directories can be created to monitor subsets of tasks in the CTRL_MON +group that is their ancestor. These are called "MON" groups in the rest +of this document. + +Removing a directory will move all tasks and cpus owned by the group it +represents to the parent. Removing one of the created CTRL_MON groups +will automatically remove all MON groups below it. + +All groups contain the following files: + +"tasks": + Reading this file shows the list of all tasks that belong to + this group. Writing a task id to the file will add a task to the + group. If the group is a CTRL_MON group the task is removed from + whichever previous CTRL_MON group owned the task and also from + any MON group that owned the task. If the group is a MON group, + then the task must already belong to the CTRL_MON parent of this + group. The task is removed from any previous MON group. + + +"cpus": + Reading this file shows a bitmask of the logical CPUs owned by + this group. Writing a mask to this file will add and remove + CPUs to/from this group. As with the tasks file a hierarchy is + maintained where MON groups may only include CPUs owned by the + parent CTRL_MON group. + -There are three files associated with each group: +"cpus_list": + Just like "cpus", only using ranges of CPUs instead of bitmasks. -"tasks": A list of tasks that belongs to this group. Tasks can be - added to a group by writing the task ID to the "tasks" file - (which will automatically remove them from the previous - group to which they belonged). New tasks created by fork(2) - and clone(2) are added to the same group as their parent. - If a pid is not in any sub partition, it is in root partition - (i.e. default partition). -"cpus": A bitmask of logical CPUs assigned to this group. Writing - a new mask can add/remove CPUs from this group. Added CPUs - are removed from their previous group. Removed ones are - given to the default (root) group. You cannot remove CPUs - from the default group. +When control is enabled all CTRL_MON groups will also contain: -"cpus_list": One or more CPU ranges of logical CPUs assigned to this - group. Same rules apply like for the "cpus" file. +"schemata": + A list of all the resources available to this group. + Each resource has its own line and format - see below for details. -"schemata": A list of all the resources available to this group. - Each resource has its own line and format - see below for - details. +When monitoring is enabled all MON groups will also contain: -When a task is running the following rules define which resources -are available to it: +"mon_data": + This contains a set of files organized by L3 domain and by + RDT event. E.g. on a system with two L3 domains there will + be subdirectories "mon_L3_00" and "mon_L3_01". Each of these + directories have one file per event (e.g. "llc_occupancy", + "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these + files provide a read out of the current value of the event for + all tasks in the group. In CTRL_MON groups these files provide + the sum for all tasks in the CTRL_MON group and all tasks in + MON groups. Please see example section for more details on usage. + +Resource allocation rules +------------------------- +When a task is running the following rules define which resources are +available to it: 1) If the task is a member of a non-default group, then the schemata -for that group is used. + for that group is used. 2) Else if the task belongs to the default group, but is running on a -CPU that is assigned to some specific group, then the schemata for -the CPU's group is used. + CPU that is assigned to some specific group, then the schemata for the + CPU's group is used. 3) Otherwise the schemata for the default group is used. +Resource monitoring rules +------------------------- +1) If a task is a member of a MON group, or non-default CTRL_MON group + then RDT events for the task will be reported in that group. + +2) If a task is a member of the default CTRL_MON group, but is running + on a CPU that is assigned to some specific group, then the RDT events + for the task will be reported in that group. + +3) Otherwise RDT events for the task will be reported in the root level + "mon_data" group. + + +Notes on cache occupancy monitoring and control +----------------------------------------------- +When moving a task from one group to another you should remember that +this only affects *new* cache allocations by the task. E.g. you may have +a task in a monitor group showing 3 MB of cache occupancy. If you move +to a new group and immediately check the occupancy of the old and new +groups you will likely see that the old group is still showing 3 MB and +the new group zero. When the task accesses locations still in cache from +before the move, the h/w does not update any counters. On a busy system +you will likely see the occupancy in the old group go down as cache lines +are evicted and re-used while the occupancy in the new group rises as +the task accesses memory and loads into the cache are counted based on +membership in the new group. + +The same applies to cache allocation control. Moving a task to a group +with a smaller cache partition will not evict any cache lines. The +process may continue to use them from the old partition. + +Hardware uses CLOSid(Class of service ID) and an RMID(Resource monitoring ID) +to identify a control group and a monitoring group respectively. Each of +the resource groups are mapped to these IDs based on the kind of group. The +number of CLOSid and RMID are limited by the hardware and hence the creation of +a "CTRL_MON" directory may fail if we run out of either CLOSID or RMID +and creation of "MON" group may fail if we run out of RMIDs. + +max_threshold_occupancy - generic concepts +------------------------------------------ + +Note that an RMID once freed may not be immediately available for use as +the RMID is still tagged the cache lines of the previous user of RMID. +Hence such RMIDs are placed on limbo list and checked back if the cache +occupancy has gone down. If there is a time when system has a lot of +limbo RMIDs but which are not ready to be used, user may see an -EBUSY +during mkdir. + +max_threshold_occupancy is a user configurable value to determine the +occupancy at which an RMID can be freed. Schemata files - general concepts --------------------------------- @@ -143,22 +267,22 @@ SKUs. Using a high bandwidth and a low bandwidth setting on two threads sharing a core will result in both threads being throttled to use the low bandwidth. -L3 details (code and data prioritization disabled) --------------------------------------------------- +L3 schemata file details (code and data prioritization disabled) +---------------------------------------------------------------- With CDP disabled the L3 schemata format is: L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... -L3 details (CDP enabled via mount option to resctrl) ----------------------------------------------------- +L3 schemata file details (CDP enabled via mount option to resctrl) +------------------------------------------------------------------ When CDP is enabled L3 control is split into two separate resources so you can specify independent masks for code and data like this: L3data:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... L3code:<cache_id0>=<cbm>;<cache_id1>=<cbm>;... -L2 details ----------- +L2 schemata file details +------------------------ L2 cache does not support code and data prioritization, so the schemata format is always: @@ -185,6 +309,8 @@ L3CODE:0=fffff;1=fffff;2=fffff;3=fffff L3DATA:0=fffff;1=fffff;2=3c0;3=fffff L3CODE:0=fffff;1=fffff;2=fffff;3=fffff +Examples for RDT allocation usage: + Example 1 --------- On a two socket machine (one L3 cache per socket) with just four bits @@ -410,3 +536,124 @@ void main(void) /* code to read and write directory contents */ resctrl_release_lock(fd); } + +Examples for RDT Monitoring along with allocation usage: + +Reading monitored data +---------------------- +Reading an event file (for ex: mon_data/mon_L3_00/llc_occupancy) would +show the current snapshot of LLC occupancy of the corresponding MON +group or CTRL_MON group. + + +Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group) +--------- +On a two socket machine (one L3 cache per socket) with just four bits +for cache bit masks + +# mount -t resctrl resctrl /sys/fs/resctrl +# cd /sys/fs/resctrl +# mkdir p0 p1 +# echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata +# echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata +# echo 5678 > p1/tasks +# echo 5679 > p1/tasks + +The default resource group is unmodified, so we have access to all parts +of all caches (its schemata file reads "L3:0=f;1=f"). + +Tasks that are under the control of group "p0" may only allocate from the +"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1. +Tasks in group "p1" use the "lower" 50% of cache on both sockets. + +Create monitor groups and assign a subset of tasks to each monitor group. + +# cd /sys/fs/resctrl/p1/mon_groups +# mkdir m11 m12 +# echo 5678 > m11/tasks +# echo 5679 > m12/tasks + +fetch data (data shown in bytes) + +# cat m11/mon_data/mon_L3_00/llc_occupancy +16234000 +# cat m11/mon_data/mon_L3_01/llc_occupancy +14789000 +# cat m12/mon_data/mon_L3_00/llc_occupancy +16789000 + +The parent ctrl_mon group shows the aggregated data. + +# cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy +31234000 + +Example 2 (Monitor a task from its creation) +--------- +On a two socket machine (one L3 cache per socket) + +# mount -t resctrl resctrl /sys/fs/resctrl +# cd /sys/fs/resctrl +# mkdir p0 p1 + +An RMID is allocated to the group once its created and hence the <cmd> +below is monitored from its creation. + +# echo $$ > /sys/fs/resctrl/p1/tasks +# <cmd> + +Fetch the data + +# cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy +31789000 + +Example 3 (Monitor without CAT support or before creating CAT groups) +--------- + +Assume a system like HSW has only CQM and no CAT support. In this case +the resctrl will still mount but cannot create CTRL_MON directories. +But user can create different MON groups within the root group thereby +able to monitor all tasks including kernel threads. + +This can also be used to profile jobs cache size footprint before being +able to allocate them to different allocation groups. + +# mount -t resctrl resctrl /sys/fs/resctrl +# cd /sys/fs/resctrl +# mkdir mon_groups/m01 +# mkdir mon_groups/m02 + +# echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks +# echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks + +Monitor the groups separately and also get per domain data. From the +below its apparent that the tasks are mostly doing work on +domain(socket) 0. + +# cat /sys/fs/resctrl/mon_groups/m01/mon_L3_00/llc_occupancy +31234000 +# cat /sys/fs/resctrl/mon_groups/m01/mon_L3_01/llc_occupancy +34555 +# cat /sys/fs/resctrl/mon_groups/m02/mon_L3_00/llc_occupancy +31234000 +# cat /sys/fs/resctrl/mon_groups/m02/mon_L3_01/llc_occupancy +32789 + + +Example 4 (Monitor real time tasks) +----------------------------------- + +A single socket system which has real time tasks running on cores 4-7 +and non real time tasks on other cpus. We want to monitor the cache +occupancy of the real time threads on these cores. + +# mount -t resctrl resctrl /sys/fs/resctrl +# cd /sys/fs/resctrl +# mkdir p1 + +Move the cpus 4-7 over to p1 +# echo f0 > p0/cpus + +View the llc occupancy snapshot + +# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy +11234000 diff --git a/Documentation/x86/microcode.txt b/Documentation/x86/microcode.txt new file mode 100644 index 000000000000..f57e1b45e628 --- /dev/null +++ b/Documentation/x86/microcode.txt @@ -0,0 +1,137 @@ + The Linux Microcode Loader + +Authors: Fenghua Yu <fenghua.yu@intel.com> + Borislav Petkov <bp@suse.de> + +The kernel has a x86 microcode loading facility which is supposed to +provide microcode loading methods in the OS. Potential use cases are +updating the microcode on platforms beyond the OEM End-Of-Life support, +and updating the microcode on long-running systems without rebooting. + +The loader supports three loading methods: + +1. Early load microcode +======================= + +The kernel can update microcode very early during boot. Loading +microcode early can fix CPU issues before they are observed during +kernel boot time. + +The microcode is stored in an initrd file. During boot, it is read from +it and loaded into the CPU cores. + +The format of the combined initrd image is microcode in (uncompressed) +cpio format followed by the (possibly compressed) initrd image. The +loader parses the combined initrd image during boot. + +The microcode files in cpio name space are: + +on Intel: kernel/x86/microcode/GenuineIntel.bin +on AMD : kernel/x86/microcode/AuthenticAMD.bin + +During BSP (BootStrapping Processor) boot (pre-SMP), the kernel +scans the microcode file in the initrd. If microcode matching the +CPU is found, it will be applied in the BSP and later on in all APs +(Application Processors). + +The loader also saves the matching microcode for the CPU in memory. +Thus, the cached microcode patch is applied when CPUs resume from a +sleep state. + +Here's a crude example how to prepare an initrd with microcode (this is +normally done automatically by the distribution, when recreating the +initrd, so you don't really have to do it yourself. It is documented +here for future reference only). + +--- + #!/bin/bash + + if [ -z "$1" ]; then + echo "You need to supply an initrd file" + exit 1 + fi + + INITRD="$1" + + DSTDIR=kernel/x86/microcode + TMPDIR=/tmp/initrd + + rm -rf $TMPDIR + + mkdir $TMPDIR + cd $TMPDIR + mkdir -p $DSTDIR + + if [ -d /lib/firmware/amd-ucode ]; then + cat /lib/firmware/amd-ucode/microcode_amd*.bin > $DSTDIR/AuthenticAMD.bin + fi + + if [ -d /lib/firmware/intel-ucode ]; then + cat /lib/firmware/intel-ucode/* > $DSTDIR/GenuineIntel.bin + fi + + find . | cpio -o -H newc >../ucode.cpio + cd .. + mv $INITRD $INITRD.orig + cat ucode.cpio $INITRD.orig > $INITRD + + rm -rf $TMPDIR +--- + +The system needs to have the microcode packages installed into +/lib/firmware or you need to fixup the paths above if yours are +somewhere else and/or you've downloaded them directly from the processor +vendor's site. + +2. Late loading +=============== + +There are two legacy user space interfaces to load microcode, either through +/dev/cpu/microcode or through /sys/devices/system/cpu/microcode/reload file +in sysfs. + +The /dev/cpu/microcode method is deprecated because it needs a special +userspace tool for that. + +The easier method is simply installing the microcode packages your distro +supplies and running: + +# echo 1 > /sys/devices/system/cpu/microcode/reload + +as root. + +The loading mechanism looks for microcode blobs in +/lib/firmware/{intel-ucode,amd-ucode}. The default distro installation +packages already put them there. + +3. Builtin microcode +==================== + +The loader supports also loading of a builtin microcode supplied through +the regular firmware builtin method CONFIG_FIRMWARE_IN_KERNEL. Only +64-bit is currently supported. + +Here's an example: + +CONFIG_FIRMWARE_IN_KERNEL=y +CONFIG_EXTRA_FIRMWARE="intel-ucode/06-3a-09 amd-ucode/microcode_amd_fam15h.bin" +CONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware" + +This basically means, you have the following tree structure locally: + +/lib/firmware/ +|-- amd-ucode +... +| |-- microcode_amd_fam15h.bin +... +|-- intel-ucode +... +| |-- 06-3a-09 +... + +so that the build system can find those files and integrate them into +the final kernel image. The early loader finds them and applies them. + +Needless to say, this method is not the most flexible one because it +requires rebuilding the kernel each time updated microcode from the CPU +vendor is available. diff --git a/Documentation/x86/orc-unwinder.txt b/Documentation/x86/orc-unwinder.txt new file mode 100644 index 000000000000..af0c9a4c65a6 --- /dev/null +++ b/Documentation/x86/orc-unwinder.txt @@ -0,0 +1,179 @@ +ORC unwinder +============ + +Overview +-------- + +The kernel CONFIG_ORC_UNWINDER option enables the ORC unwinder, which is +similar in concept to a DWARF unwinder. The difference is that the +format of the ORC data is much simpler than DWARF, which in turn allows +the ORC unwinder to be much simpler and faster. + +The ORC data consists of unwind tables which are generated by objtool. +They contain out-of-band data which is used by the in-kernel ORC +unwinder. Objtool generates the ORC data by first doing compile-time +stack metadata validation (CONFIG_STACK_VALIDATION). After analyzing +all the code paths of a .o file, it determines information about the +stack state at each instruction address in the file and outputs that +information to the .orc_unwind and .orc_unwind_ip sections. + +The per-object ORC sections are combined at link time and are sorted and +post-processed at boot time. The unwinder uses the resulting data to +correlate instruction addresses with their stack states at run time. + + +ORC vs frame pointers +--------------------- + +With frame pointers enabled, GCC adds instrumentation code to every +function in the kernel. The kernel's .text size increases by about +3.2%, resulting in a broad kernel-wide slowdown. Measurements by Mel +Gorman [1] have shown a slowdown of 5-10% for some workloads. + +In contrast, the ORC unwinder has no effect on text size or runtime +performance, because the debuginfo is out of band. So if you disable +frame pointers and enable the ORC unwinder, you get a nice performance +improvement across the board, and still have reliable stack traces. + +Ingo Molnar says: + + "Note that it's not just a performance improvement, but also an + instruction cache locality improvement: 3.2% .text savings almost + directly transform into a similarly sized reduction in cache + footprint. That can transform to even higher speedups for workloads + whose cache locality is borderline." + +Another benefit of ORC compared to frame pointers is that it can +reliably unwind across interrupts and exceptions. Frame pointer based +unwinds can sometimes skip the caller of the interrupted function, if it +was a leaf function or if the interrupt hit before the frame pointer was +saved. + +The main disadvantage of the ORC unwinder compared to frame pointers is +that it needs more memory to store the ORC unwind tables: roughly 2-4MB +depending on the kernel config. + + +ORC vs DWARF +------------ + +ORC debuginfo's advantage over DWARF itself is that it's much simpler. +It gets rid of the complex DWARF CFI state machine and also gets rid of +the tracking of unnecessary registers. This allows the unwinder to be +much simpler, meaning fewer bugs, which is especially important for +mission critical oops code. + +The simpler debuginfo format also enables the unwinder to be much faster +than DWARF, which is important for perf and lockdep. In a basic +performance test by Jiri Slaby [2], the ORC unwinder was about 20x +faster than an out-of-tree DWARF unwinder. (Note: That measurement was +taken before some performance tweaks were added, which doubled +performance, so the speedup over DWARF may be closer to 40x.) + +The ORC data format does have a few downsides compared to DWARF. ORC +unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig kernel) +than DWARF-based eh_frame tables. + +Another potential downside is that, as GCC evolves, it's conceivable +that the ORC data may end up being *too* simple to describe the state of +the stack for certain optimizations. But IMO this is unlikely because +GCC saves the frame pointer for any unusual stack adjustments it does, +so I suspect we'll really only ever need to keep track of the stack +pointer and the frame pointer between call frames. But even if we do +end up having to track all the registers DWARF tracks, at least we will +still be able to control the format, e.g. no complex state machines. + + +ORC unwind table generation +--------------------------- + +The ORC data is generated by objtool. With the existing compile-time +stack metadata validation feature, objtool already follows all code +paths, and so it already has all the information it needs to be able to +generate ORC data from scratch. So it's an easy step to go from stack +validation to ORC data generation. + +It should be possible to instead generate the ORC data with a simple +tool which converts DWARF to ORC data. However, such a solution would +be incomplete due to the kernel's extensive use of asm, inline asm, and +special sections like exception tables. + +That could be rectified by manually annotating those special code paths +using GNU assembler .cfi annotations in .S files, and homegrown +annotations for inline asm in .c files. But asm annotations were tried +in the past and were found to be unmaintainable. They were often +incorrect/incomplete and made the code harder to read and keep updated. +And based on looking at glibc code, annotating inline asm in .c files +might be even worse. + +Objtool still needs a few annotations, but only in code which does +unusual things to the stack like entry code. And even then, far fewer +annotations are needed than what DWARF would need, so they're much more +maintainable than DWARF CFI annotations. + +So the advantages of using objtool to generate ORC data are that it +gives more accurate debuginfo, with very few annotations. It also +insulates the kernel from toolchain bugs which can be very painful to +deal with in the kernel since we often have to workaround issues in +older versions of the toolchain for years. + +The downside is that the unwinder now becomes dependent on objtool's +ability to reverse engineer GCC code flow. If GCC optimizations become +too complicated for objtool to follow, the ORC data generation might +stop working or become incomplete. (It's worth noting that livepatch +already has such a dependency on objtool's ability to follow GCC code +flow.) + +If newer versions of GCC come up with some optimizations which break +objtool, we may need to revisit the current implementation. Some +possible solutions would be asking GCC to make the optimizations more +palatable, or having objtool use DWARF as an additional input, or +creating a GCC plugin to assist objtool with its analysis. But for now, +objtool follows GCC code quite well. + + +Unwinder implementation details +------------------------------- + +Objtool generates the ORC data by integrating with the compile-time +stack metadata validation feature, which is described in detail in +tools/objtool/Documentation/stack-validation.txt. After analyzing all +the code paths of a .o file, it creates an array of orc_entry structs, +and a parallel array of instruction addresses associated with those +structs, and writes them to the .orc_unwind and .orc_unwind_ip sections +respectively. + +The ORC data is split into the two arrays for performance reasons, to +make the searchable part of the data (.orc_unwind_ip) more compact. The +arrays are sorted in parallel at boot time. + +Performance is further improved by the use of a fast lookup table which +is created at runtime. The fast lookup table associates a given address +with a range of indices for the .orc_unwind table, so that only a small +subset of the table needs to be searched. + + +Etymology +--------- + +Orcs, fearsome creatures of medieval folklore, are the Dwarves' natural +enemies. Similarly, the ORC unwinder was created in opposition to the +complexity and slowness of DWARF. + +"Although Orcs rarely consider multiple solutions to a problem, they do +excel at getting things done because they are creatures of action, not +thought." [3] Similarly, unlike the esoteric DWARF unwinder, the +veracious ORC unwinder wastes no time or siloconic effort decoding +variable-length zero-extended unsigned-integer byte-coded +state-machine-based debug information entries. + +Similar to how Orcs frequently unravel the well-intentioned plans of +their adversaries, the ORC unwinder frequently unravels stacks with +brutal, unyielding efficiency. + +ORC stands for Oops Rewind Capability. + + +[1] https://lkml.kernel.org/r/20170602104048.jkkzssljsompjdwy@suse.de +[2] https://lkml.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@suse.cz +[3] http://dustin.wikidot.com/half-orcs-and-orcs diff --git a/Documentation/x86/protection-keys.txt b/Documentation/x86/protection-keys.txt index b64304540821..fa46dcb347bc 100644 --- a/Documentation/x86/protection-keys.txt +++ b/Documentation/x86/protection-keys.txt @@ -34,7 +34,7 @@ with a key. In this example WRPKRU is wrapped by a C function called pkey_set(). int real_prot = PROT_READ|PROT_WRITE; - pkey = pkey_alloc(0, PKEY_DENY_WRITE); + pkey = pkey_alloc(0, PKEY_DISABLE_WRITE); ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey); ... application runs here @@ -42,9 +42,9 @@ called pkey_set(). Now, if the application needs to update the data at 'ptr', it can gain access, do the update, then remove its write access: - pkey_set(pkey, 0); // clear PKEY_DENY_WRITE + pkey_set(pkey, 0); // clear PKEY_DISABLE_WRITE *ptr = foo; // assign something - pkey_set(pkey, PKEY_DENY_WRITE); // set PKEY_DENY_WRITE again + pkey_set(pkey, PKEY_DISABLE_WRITE); // set PKEY_DISABLE_WRITE again Now when it frees the memory, it will also free the pkey since it is no longer in use: diff --git a/Documentation/x86/x86_64/5level-paging.txt b/Documentation/x86/x86_64/5level-paging.txt new file mode 100644 index 000000000000..087251a0d99c --- /dev/null +++ b/Documentation/x86/x86_64/5level-paging.txt @@ -0,0 +1,64 @@ +== Overview == + +Original x86-64 was limited by 4-level paing to 256 TiB of virtual address +space and 64 TiB of physical address space. We are already bumping into +this limit: some vendors offers servers with 64 TiB of memory today. + +To overcome the limitation upcoming hardware will introduce support for +5-level paging. It is a straight-forward extension of the current page +table structure adding one more layer of translation. + +It bumps the limits to 128 PiB of virtual address space and 4 PiB of +physical address space. This "ought to be enough for anybody" ©. + +QEMU 2.9 and later support 5-level paging. + +Virtual memory layout for 5-level paging is described in +Documentation/x86/x86_64/mm.txt + +== Enabling 5-level paging == + +CONFIG_X86_5LEVEL=y enables the feature. + +So far, a kernel compiled with the option enabled will be able to boot +only on machines that supports the feature -- see for 'la57' flag in +/proc/cpuinfo. + +The plan is to implement boot-time switching between 4- and 5-level paging +in the future. + +== User-space and large virtual address space == + +On x86, 5-level paging enables 56-bit userspace virtual address space. +Not all user space is ready to handle wide addresses. It's known that +at least some JIT compilers use higher bits in pointers to encode their +information. It collides with valid pointers with 5-level paging and +leads to crashes. + +To mitigate this, we are not going to allocate virtual address space +above 47-bit by default. + +But userspace can ask for allocation from full address space by +specifying hint address (with or without MAP_FIXED) above 47-bits. + +If hint address set above 47-bit, but MAP_FIXED is not specified, we try +to look for unmapped area by specified address. If it's already +occupied, we look for unmapped area in *full* address space, rather than +from 47-bit window. + +A high hint address would only affect the allocation in question, but not +any future mmap()s. + +Specifying high hint address on older kernel or on machine without 5-level +paging support is safe. The hint will be ignored and kernel will fall back +to allocation from 47-bit address space. + +This approach helps to easily make application's memory allocator aware +about large address space without manually tracking allocated virtual +address space. + +One important case we need to handle here is interaction with MPX. +MPX (without MAWA extension) cannot handle addresses above 47-bit, so we +need to make sure that MPX cannot be enabled we already have VMA above +the boundary and forbid creating such VMAs once MPX is enabled. + |