From 5bbe3547aa3ba5242366a322a28996872301b703 Mon Sep 17 00:00:00 2001 From: Eric B Munson Date: Wed, 15 Apr 2015 16:13:20 -0700 Subject: mm: allow compaction of unevictable pages Currently, pages which are marked as unevictable are protected from compaction, but not from other types of migration. The POSIX real time extension explicitly states that mlock() will prevent a major page fault, but the spirit of this is that mlock() should give a process the ability to control sources of latency, including minor page faults. However, the mlock manpage only explicitly says that a locked page will not be written to swap and this can cause some confusion. The compaction code today does not give a developer who wants to avoid swap but wants to have large contiguous areas available any method to achieve this state. This patch introduces a sysctl for controlling compaction behavior with respect to the unevictable lru. Users who demand no page faults after a page is present can set compact_unevictable_allowed to 0 and users who need the large contiguous areas can enable compaction on locked memory by leaving the default value of 1. To illustrate this problem I wrote a quick test program that mmaps a large number of 1MB files filled with random data. These maps are created locked and read only. Then every other mmap is unmapped and I attempt to allocate huge pages to the static huge page pool. When the compact_unevictable_allowed sysctl is 0, I cannot allocate hugepages after fragmenting memory. When the value is set to 1, allocations succeed. Signed-off-by: Eric B Munson Acked-by: Michal Hocko Acked-by: Vlastimil Babka Acked-by: Christoph Lameter Acked-by: David Rientjes Acked-by: Rik van Riel Cc: Vlastimil Babka Cc: Thomas Gleixner Cc: Christoph Lameter Cc: Peter Zijlstra Cc: Mel Gorman Cc: David Rientjes Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/vm.txt | 11 +++++++++++ 1 file changed, 11 insertions(+) (limited to 'Documentation') diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 902b4574acfb..9832ec52f859 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -21,6 +21,7 @@ Currently, these files are in /proc/sys/vm: - admin_reserve_kbytes - block_dump - compact_memory +- compact_unevictable_allowed - dirty_background_bytes - dirty_background_ratio - dirty_bytes @@ -106,6 +107,16 @@ huge pages although processes will also directly compact memory as required. ============================================================== +compact_unevictable_allowed + +Available only when CONFIG_COMPACTION is set. When set to 1, compaction is +allowed to examine the unevictable lru (mlocked pages) for pages to compact. +This should be used on systems where stalls for minor page faults are an +acceptable trade for large contiguous free memory. Set to 0 to prevent +compaction from moving pages that are unevictable. Default value is 1. + +============================================================== + dirty_background_bytes Contains the amount of dirty memory at which the background kernel -- cgit v1.2.1 From 922c0551a795dccadeb1dadc756d93fe3e303180 Mon Sep 17 00:00:00 2001 From: Eric B Munson Date: Wed, 15 Apr 2015 16:13:23 -0700 Subject: Documentation/vm/unevictable-lru.txt: document interaction between compaction and the unevictable LRU The memory compaction code uses the migration code to do most of the work in compaction. However, the compaction code interacts with the unevictable LRU differently than migration code and this difference should be noted in the documentation. [akpm@linux-foundation.org: identify /proc/sys/vm/compact_unevictable directly] Signed-off-by: Eric B Munson Cc: Michal Hocko Cc: Vlastimil Babka Cc: Christoph Lameter Cc: David Rientjes Cc: Rik van Riel Cc: Thomas Gleixner Cc: Peter Zijlstra Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/unevictable-lru.txt | 12 ++++++++++++ 1 file changed, 12 insertions(+) (limited to 'Documentation') diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt index 86cb4624fc5a..3be0bfc4738d 100644 --- a/Documentation/vm/unevictable-lru.txt +++ b/Documentation/vm/unevictable-lru.txt @@ -22,6 +22,7 @@ CONTENTS - Filtering special vmas. - munlock()/munlockall() system call handling. - Migrating mlocked pages. + - Compacting mlocked pages. - mmap(MAP_LOCKED) system call handling. - munmap()/exit()/exec() system call handling. - try_to_unmap(). @@ -450,6 +451,17 @@ list because of a race between munlock and migration, page migration uses the putback_lru_page() function to add migrated pages back to the LRU. +COMPACTING MLOCKED PAGES +------------------------ + +The unevictable LRU can be scanned for compactable regions and the default +behavior is to do so. /proc/sys/vm/compact_unevictable_allowed controls +this behavior (see Documentation/sysctl/vm.txt). Once scanning of the +unevictable LRU is enabled, the work of compaction is mostly handled by +the page migration code and the same work flow as described in MIGRATING +MLOCKED PAGES will apply. + + mmap(MAP_LOCKED) SYSTEM CALL HANDLING ------------------------------------- -- cgit v1.2.1 From 8c9b97033547834404a58ea88da7226ed5167726 Mon Sep 17 00:00:00 2001 From: Mike Kravetz Date: Wed, 15 Apr 2015 16:13:45 -0700 Subject: hugetlbfs: document min_size mount option and cleanup Add min_size mount option to the hugetlbfs documentation. Also, add the missing pagesize option and mention that size can be specified as bytes or a percentage of huge page pool. Signed-off-by: Mike Kravetz Cc: Davidlohr Bueso Cc: Aneesh Kumar Cc: Joonsoo Kim Cc: Andi Kleen Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/hugetlbpage.txt | 31 ++++++++++++++++++++++--------- 1 file changed, 22 insertions(+), 9 deletions(-) (limited to 'Documentation') diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt index f2d3a100fe38..b32b9cd50c0e 100644 --- a/Documentation/vm/hugetlbpage.txt +++ b/Documentation/vm/hugetlbpage.txt @@ -267,21 +267,34 @@ call, then it is required that system administrator mount a file system of type hugetlbfs: mount -t hugetlbfs \ - -o uid=,gid=,mode=,size=,nr_inodes= \ - none /mnt/huge + -o uid=,gid=,mode=,pagesize=,size=,\ + min_size=,nr_inodes= none /mnt/huge This command mounts a (pseudo) filesystem of type hugetlbfs on the directory /mnt/huge. Any files created on /mnt/huge uses huge pages. The uid and gid options sets the owner and group of the root of the file system. By default the uid and gid of the current process are taken. The mode option sets the mode of root of file system to value & 01777. This value is given in octal. -By default the value 0755 is picked. The size option sets the maximum value of -memory (huge pages) allowed for that filesystem (/mnt/huge). The size is -rounded down to HPAGE_SIZE. The option nr_inodes sets the maximum number of -inodes that /mnt/huge can use. If the size or nr_inodes option is not -provided on command line then no limits are set. For size and nr_inodes -options, you can use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. For -example, size=2K has the same meaning as size=2048. +By default the value 0755 is picked. If the paltform supports multiple huge +page sizes, the pagesize option can be used to specify the huge page size and +associated pool. pagesize is specified in bytes. If pagesize is not specified +the paltform's default huge page size and associated pool will be used. The +size option sets the maximum value of memory (huge pages) allowed for that +filesystem (/mnt/huge). The size option can be specified in bytes, or as a +percentage of the specified huge page pool (nr_hugepages). The size is +rounded down to HPAGE_SIZE boundary. The min_size option sets the minimum +value of memory (huge pages) allowed for the filesystem. min_size can be +specified in the same way as size, either bytes or a percentage of the +huge page pool. At mount time, the number of huge pages specified by +min_size are reserved for use by the filesystem. If there are not enough +free huge pages available, the mount will fail. As huge pages are allocated +to the filesystem and freed, the reserve count is adjusted so that the sum +of allocated and reserved huge pages is always at least min_size. The option +nr_inodes sets the maximum number of inodes that /mnt/huge can use. If the +size, min_size or nr_inodes option is not provided on command line then +no limits are set. For pagesize, size, min_size and nr_inodes options, you +can use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. For example, size=2K +has the same meaning as size=2048. While read system calls are supported on files that reside on hugetlb file systems, write system calls are not. -- cgit v1.2.1 From 80d6b94bd69a7a49b52bf503ef6a841f43cf5bbb Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Wed, 15 Apr 2015 16:14:26 -0700 Subject: mm, doc: cleanup and clarify munmap behavior for hugetlb memory munmap(2) of hugetlb memory requires a length that is hugepage aligned, otherwise it may fail. Add this to the documentation. This also cleans up the documentation and separates it into logical units: one part refers to MAP_HUGETLB and another part refers to requirements for shared memory segments. Signed-off-by: David Rientjes Cc: Jonathan Corbet Cc: Davide Libenzi Cc: Luiz Capitulino Cc: Shuah Khan Acked-by: Hugh Dickins Cc: Andrea Arcangeli Cc: Joern Engel Cc: Jianguo Wu Cc: Eric B Munson Cc: Michael Ellerman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/hugetlbpage.txt | 24 ++++++++++++++++-------- 1 file changed, 16 insertions(+), 8 deletions(-) (limited to 'Documentation') diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt index b32b9cd50c0e..030977fb8d2d 100644 --- a/Documentation/vm/hugetlbpage.txt +++ b/Documentation/vm/hugetlbpage.txt @@ -302,15 +302,23 @@ file systems, write system calls are not. Regular chown, chgrp, and chmod commands (with right permissions) could be used to change the file attributes on hugetlbfs. -Also, it is important to note that no such mount command is required if the +Also, it is important to note that no such mount command is required if applications are going to use only shmat/shmget system calls or mmap with -MAP_HUGETLB. Users who wish to use hugetlb page via shared memory segment -should be a member of a supplementary group and system admin needs to -configure that gid into /proc/sys/vm/hugetlb_shm_group. It is possible for -same or different applications to use any combination of mmaps and shm* -calls, though the mount of filesystem will be required for using mmap calls -without MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see -map_hugetlb.c. +MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see map_hugetlb +below. + +Users who wish to use hugetlb memory via shared memory segment should be a +member of a supplementary group and system admin needs to configure that gid +into /proc/sys/vm/hugetlb_shm_group. It is possible for same or different +applications to use any combination of mmaps and shm* calls, though the mount of +filesystem will be required for using mmap calls without MAP_HUGETLB. + +Syscalls that operate on memory backed by hugetlb pages only have their lengths +aligned to the native page size of the processor; they will normally fail with +errno set to EINVAL or exclude hugetlb pages that extend beyond the length if +not hugepage aligned. For example, munmap(2) will fail if memory is backed by +a hugetlb page and the length is smaller than the hugepage size. + Examples ======== -- cgit v1.2.1 From dd9061846a3ba01b0fa45423aaa087e4a69187fa Mon Sep 17 00:00:00 2001 From: Boaz Harrosh Date: Wed, 15 Apr 2015 16:15:11 -0700 Subject: mm: new pfn_mkwrite same as page_mkwrite for VM_PFNMAP This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs) to get notified when access is a write to a read-only PFN. This can happen if we mmap() a file then first mmap-read from it to page-in a read-only PFN, than we mmap-write to the same page. We need this functionality to fix a DAX bug, where in the scenario above we fail to set ctime/mtime though we modified the file. An xfstest is attached to this patchset that shows the failure and the fix. (A DAX patch will follow) This functionality is extra important for us, because upon dirtying of a pmem page we also want to RDMA the page to a remote cluster node. We define a new pfn_mkwrite and do not reuse page_mkwrite because 1 - The name ;-) 2 - But mainly because it would take a very long and tedious audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP users. To make sure they do not now CRASH. For example current DAX code (which this is for) would crash. If we would want to reuse page_mkwrite, We will need to first patch all users, so to not-crash-on-no-page. Then enable this patch. But even if I did that I would not sleep so well at night. Adding a new vector is the safest thing to do, and is not that expensive. an extra pointer at a static function vector per driver. Also the new vector is better for performance, because else we Will call all current Kernel vectors, so to: check-ha-no-page-do-nothing and return. No need to call it from do_shared_fault because do_wp_page is called to change pte permissions anyway. Signed-off-by: Yigal Korman Signed-off-by: Boaz Harrosh Acked-by: Kirill A. Shutemov Cc: Matthew Wilcox Cc: Jan Kara Cc: Hugh Dickins Cc: Mel Gorman Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/Locking | 8 ++++++++ 1 file changed, 8 insertions(+) (limited to 'Documentation') diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index f91926f2f482..8bb8a7ee0f99 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -525,6 +525,7 @@ prototypes: void (*close)(struct vm_area_struct*); int (*fault)(struct vm_area_struct*, struct vm_fault *); int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *); + int (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *); int (*access)(struct vm_area_struct *, unsigned long, void*, int, int); locking rules: @@ -534,6 +535,7 @@ close: yes fault: yes can return with page locked map_pages: yes page_mkwrite: yes can return with page locked +pfn_mkwrite: yes access: yes ->fault() is called when a previously not present pte is about @@ -560,6 +562,12 @@ the page has been truncated, the filesystem should not look up a new page like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which will cause the VM to retry the fault. + ->pfn_mkwrite() is the same as page_mkwrite but when the pte is +VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is +VM_FAULT_NOPAGE. Or one of the VM_FAULT_ERROR types. The default behavior +after this call is to make the pte read-write, unless pfn_mkwrite returns +an error. + ->access() is called when get_user_pages() fails in access_process_vm(), typically used to debug a process through /proc/pid/mem or ptrace. This function is needed only for -- cgit v1.2.1 From 4e3ba87845420e0bfa21e6c4f7f81897aed38f8c Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Wed, 15 Apr 2015 16:15:36 -0700 Subject: zram: support compaction Now that zsmalloc supports compaction, zram can use it. For the first step, this patch exports compact knob via sysfs so user can do compaction via "echo 1 > /sys/block/zram0/compact". Signed-off-by: Minchan Kim Cc: Juneho Choi Cc: Gunho Lee Cc: Luigi Semenzato Cc: Dan Streetman Cc: Seth Jennings Cc: Nitin Gupta Cc: Jerome Marchand Cc: Sergey Senozhatsky Cc: Joonsoo Kim Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/ABI/testing/sysfs-block-zram | 15 +++++++++++++++ 1 file changed, 15 insertions(+) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram index a6148eaf91e5..bede9028a5a0 100644 --- a/Documentation/ABI/testing/sysfs-block-zram +++ b/Documentation/ABI/testing/sysfs-block-zram @@ -141,3 +141,18 @@ Description: amount of memory ZRAM can use to store the compressed data. The limit could be changed in run time and "0" means disable the limit. No limit is the initial state. Unit: bytes + +What: /sys/block/zram/compact +Date: August 2015 +Contact: Minchan Kim +Description: + The compact file is write-only and trigger compaction for + allocator zrm uses. The allocator moves some objects so that + it could free fragment space. + +What: /sys/block/zram/num_migrated +Date: August 2015 +Contact: Minchan Kim +Description: + The compact file is read-only and shows how many object + migrated by compaction. -- cgit v1.2.1 From d02be50dba649b4246e0c1c4b7cb5d8a8d49de9a Mon Sep 17 00:00:00 2001 From: Minchan Kim Date: Wed, 15 Apr 2015 16:15:46 -0700 Subject: zsmalloc: zsmalloc documentation Create zsmalloc doc which explains design concept and stat information. Signed-off-by: Minchan Kim Cc: Juneho Choi Cc: Gunho Lee Cc: Luigi Semenzato Cc: Dan Streetman Cc: Seth Jennings Cc: Nitin Gupta Cc: Jerome Marchand Cc: Sergey Senozhatsky Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/zsmalloc.txt | 70 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 70 insertions(+) create mode 100644 Documentation/vm/zsmalloc.txt (limited to 'Documentation') diff --git a/Documentation/vm/zsmalloc.txt b/Documentation/vm/zsmalloc.txt new file mode 100644 index 000000000000..64ed63c4f69d --- /dev/null +++ b/Documentation/vm/zsmalloc.txt @@ -0,0 +1,70 @@ +zsmalloc +-------- + +This allocator is designed for use with zram. Thus, the allocator is +supposed to work well under low memory conditions. In particular, it +never attempts higher order page allocation which is very likely to +fail under memory pressure. On the other hand, if we just use single +(0-order) pages, it would suffer from very high fragmentation -- +any object of size PAGE_SIZE/2 or larger would occupy an entire page. +This was one of the major issues with its predecessor (xvmalloc). + +To overcome these issues, zsmalloc allocates a bunch of 0-order pages +and links them together using various 'struct page' fields. These linked +pages act as a single higher-order page i.e. an object can span 0-order +page boundaries. The code refers to these linked pages as a single entity +called zspage. + +For simplicity, zsmalloc can only allocate objects of size up to PAGE_SIZE +since this satisfies the requirements of all its current users (in the +worst case, page is incompressible and is thus stored "as-is" i.e. in +uncompressed form). For allocation requests larger than this size, failure +is returned (see zs_malloc). + +Additionally, zs_malloc() does not return a dereferenceable pointer. +Instead, it returns an opaque handle (unsigned long) which encodes actual +location of the allocated object. The reason for this indirection is that +zsmalloc does not keep zspages permanently mapped since that would cause +issues on 32-bit systems where the VA region for kernel space mappings +is very small. So, before using the allocating memory, the object has to +be mapped using zs_map_object() to get a usable pointer and subsequently +unmapped using zs_unmap_object(). + +stat +---- + +With CONFIG_ZSMALLOC_STAT, we could see zsmalloc internal information via +/sys/kernel/debug/zsmalloc/. Here is a sample of stat output: + +# cat /sys/kernel/debug/zsmalloc/zram0/classes + + class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage + .. + .. + 9 176 0 1 186 129 8 4 + 10 192 1 0 2880 2872 135 3 + 11 208 0 1 819 795 42 2 + 12 224 0 1 219 159 12 4 + .. + .. + + +class: index +size: object size zspage stores +almost_empty: the number of ZS_ALMOST_EMPTY zspages(see below) +almost_full: the number of ZS_ALMOST_FULL zspages(see below) +obj_allocated: the number of objects allocated +obj_used: the number of objects allocated to the user +pages_used: the number of pages allocated for the class +pages_per_zspage: the number of 0-order pages to make a zspage + +We assign a zspage to ZS_ALMOST_EMPTY fullness group when: + n <= N / f, where +n = number of allocated objects +N = total number of objects zspage can store +f = fullness_threshold_frac(ie, 4 at the moment) + +Similarly, we assign zspage to: + ZS_ALMOST_FULL when n > N / f + ZS_EMPTY when n == 0 + ZS_FULL when n == N -- cgit v1.2.1 From 10447b60bee52f026bdbc5fe2aca52d0492fc91d Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Wed, 15 Apr 2015 16:15:52 -0700 Subject: zram: remove `num_migrated' device attr This patch introduces rework to zram stats. We have per-stat sysfs nodes, and it makes things a bit hard to use in user space: it doesn't give an immediate stats 'snapshot', it requires user space to use more syscalls - open, read, close for every stat file, with appropriate error checks on every step, etc. First, zram now accounts block layer statistics, available in /sys/block/zram/stat and /proc/diskstats files. So some new stats are available (see Documentation/block/stat.txt), besides, zram's activities now can be monitored by sysstat's iostat or similar tools. Example: cat /sys/block/zram0/stat 248 0 1984 0 251029 0 2008232 5120 0 5116 5116 Second, group currently exported on per-stat basis nodes into two categories (files): -- zram/io_stat accumulates device's IO stats, that are not accounted by block layer, and contains: failed_reads failed_writes invalid_io notify_free Example: cat /sys/block/zram0/io_stat 0 0 0 652572 -- zram/mm_stat accumulates zram mm stats and contains: orig_data_size compr_data_size mem_used_total mem_limit mem_used_max zero_pages num_migrated Example: cat /sys/block/zram0/mm_stat 434634752 270288572 279158784 0 579895296 15060 0 per-stat sysfs nodes are now considered to be deprecated and we plan to remove them (and clean up some of the existing stat code) in two years (as of now, there is no warning printed to syslog about deprecated stats being used). User space is advised to use the above mentioned 3 files. This patch (of 7): Remove sysfs `num_migrated' attribute. We are moving away from per-stat device attrs towards 3 stat files that will accumulate io and mm stats in a format similar to block layer statistics in /sys/block//stat. That will be easier to use in user space, and reduce the number of syscalls needed to read zram device statistics. `num_migrated' will return back in zram/mm_stat file. Signed-off-by: Sergey Senozhatsky Acked-by: Minchan Kim Cc: Nitin Gupta Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/ABI/testing/sysfs-block-zram | 7 ------- 1 file changed, 7 deletions(-) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram index bede9028a5a0..91ad7071b9e8 100644 --- a/Documentation/ABI/testing/sysfs-block-zram +++ b/Documentation/ABI/testing/sysfs-block-zram @@ -149,10 +149,3 @@ Description: The compact file is write-only and trigger compaction for allocator zrm uses. The allocator moves some objects so that it could free fragment space. - -What: /sys/block/zram/num_migrated -Date: August 2015 -Contact: Minchan Kim -Description: - The compact file is read-only and shows how many object - migrated by compaction. -- cgit v1.2.1 From 77ba015f9d5c584226a634753e9b318cb272cd41 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Wed, 15 Apr 2015 16:16:00 -0700 Subject: zram: describe device attrs in documentation Briefly describe exported device stat attrs in zram documentation. We will eventually get rid of per-stat sysfs nodes and, thus, clean up Documentation/ABI/testing/sysfs-block-zram file, which is the only source of information about device sysfs nodes. Add `num_migrated' description, since there is no independent `num_migrated' sysfs node (and no corresponding sysfs-block-zram entry), it will be exported via zram/mm_stat file. At this point we can provide minimal description, because sysfs-block-zram still contains detailed information. Signed-off-by: Sergey Senozhatsky Acked-by: Minchan Kim Cc: Nitin Gupta Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/blockdev/zram.txt | 48 +++++++++++++++++++++++++++++------------ 1 file changed, 34 insertions(+), 14 deletions(-) (limited to 'Documentation') diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt index 7fcf9c6592ec..971765ed5ac1 100644 --- a/Documentation/blockdev/zram.txt +++ b/Documentation/blockdev/zram.txt @@ -98,20 +98,40 @@ size of the disk when not in use so a huge zram is wasteful. mount /dev/zram1 /tmp 7) Stats: - Per-device statistics are exported as various nodes under - /sys/block/zram/ - disksize - num_reads - num_writes - failed_reads - failed_writes - invalid_io - notify_free - zero_pages - orig_data_size - compr_data_size - mem_used_total - mem_used_max +Per-device statistics are exported as various nodes under /sys/block/zram/ + +A brief description of exported device attritbutes. For more details please +read Documentation/ABI/testing/sysfs-block-zram. + +Name access description +---- ------ ----------- +disksize RW show and set the device's disk size +initstate RO shows the initialization state of the device +reset WO trigger device reset +num_reads RO the number of reads +failed_reads RO the number of failed reads +num_write RO the number of writes +failed_writes RO the number of failed writes +invalid_io RO the number of non-page-size-aligned I/O requests +max_comp_streams RW the number of possible concurrent compress operations +comp_algorithm RW show and change the compression algorithm +notify_free RO the number of notifications to free pages (either + slot free notifications or REQ_DISCARD requests) +zero_pages RO the number of zero filled pages written to this disk +orig_data_size RO uncompressed size of data stored in this disk +compr_data_size RO compressed size of data stored in this disk +mem_used_total RO the amount of memory allocated for this disk +mem_used_max RW the maximum amount memory zram have consumed to + store compressed data +mem_limit RW the maximum amount of memory ZRAM can use to store + the compressed data +num_migrated RO the number of objects migrated migrated by compaction + + +File /sys/block/zram/stat + +Represents block layer statistics. Read Documentation/block/stat.txt for +details. 8) Deactivate: swapoff /dev/zram0 -- cgit v1.2.1 From 2f6a3bed7347ee94fe57b3501fddaa646a26d890 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Wed, 15 Apr 2015 16:16:03 -0700 Subject: zram: export new 'io_stat' sysfs attrs Per-device `zram/io_stat' file provides accumulated I/O statistics of particular zram device in a format similar to block layer statistics. The file consists of a single line and represents the following stats (separated by whitespace): failed_reads failed_writes invalid_io notify_free Signed-off-by: Sergey Senozhatsky Acked-by: Minchan Kim Cc: Nitin Gupta Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/ABI/testing/sysfs-block-zram | 9 +++++++++ Documentation/blockdev/zram.txt | 11 +++++++++++ 2 files changed, 20 insertions(+) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram index 91ad7071b9e8..a7f622f9bcf6 100644 --- a/Documentation/ABI/testing/sysfs-block-zram +++ b/Documentation/ABI/testing/sysfs-block-zram @@ -149,3 +149,12 @@ Description: The compact file is write-only and trigger compaction for allocator zrm uses. The allocator moves some objects so that it could free fragment space. + +What: /sys/block/zram/io_stat +Date: August 2015 +Contact: Sergey Senozhatsky +Description: + The io_stat file is read-only and accumulates device's I/O + statistics not accounted by block layer. For example, + failed_reads, failed_writes, etc. File format is similar to + block layer statistics file format. diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt index 971765ed5ac1..9610be3e9d36 100644 --- a/Documentation/blockdev/zram.txt +++ b/Documentation/blockdev/zram.txt @@ -133,6 +133,17 @@ File /sys/block/zram/stat Represents block layer statistics. Read Documentation/block/stat.txt for details. +File /sys/block/zram/io_stat + +The stat file represents device's I/O statistics not accounted by block +layer and, thus, not available in zram/stat file. It consists of a +single line of text and contains the following stats separated by +whitespace: + failed_reads + failed_writes + invalid_io + notify_free + 8) Deactivate: swapoff /dev/zram0 umount /dev/zram1 -- cgit v1.2.1 From 4f2109f60881585dc04fa0b5657a60556576625c Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Wed, 15 Apr 2015 16:16:06 -0700 Subject: zram: export new 'mm_stat' sysfs attrs Per-device `zram/mm_stat' file provides mm statistics of a particular zram device in a format similar to block layer statistics. The file consists of a single line and represents the following stats (separated by whitespace): orig_data_size compr_data_size mem_used_total mem_limit mem_used_max zero_pages num_migrated Signed-off-by: Sergey Senozhatsky Acked-by: Minchan Kim Cc: Nitin Gupta Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/ABI/testing/sysfs-block-zram | 8 ++++++++ Documentation/blockdev/zram.txt | 12 ++++++++++++ 2 files changed, 20 insertions(+) (limited to 'Documentation') diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram index a7f622f9bcf6..2e69e83bf510 100644 --- a/Documentation/ABI/testing/sysfs-block-zram +++ b/Documentation/ABI/testing/sysfs-block-zram @@ -158,3 +158,11 @@ Description: statistics not accounted by block layer. For example, failed_reads, failed_writes, etc. File format is similar to block layer statistics file format. + +What: /sys/block/zram/mm_stat +Date: August 2015 +Contact: Sergey Senozhatsky +Description: + The mm_stat file is read-only and represents device's mm + statistics (orig_data_size, compr_data_size, etc.) in a format + similar to block layer statistics file format. diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt index 9610be3e9d36..7920f4026d36 100644 --- a/Documentation/blockdev/zram.txt +++ b/Documentation/blockdev/zram.txt @@ -144,6 +144,18 @@ whitespace: invalid_io notify_free +File /sys/block/zram/mm_stat + +The stat file represents device's mm statistics. It consists of a single +line of text and contains the following stats separated by whitespace: + orig_data_size + compr_data_size + mem_used_total + mem_limit + mem_used_max + zero_pages + num_migrated + 8) Deactivate: swapoff /dev/zram0 umount /dev/zram1 -- cgit v1.2.1 From 8f7d282c717acaae25245c61b6b60e8995ec4ef4 Mon Sep 17 00:00:00 2001 From: Sergey Senozhatsky Date: Wed, 15 Apr 2015 16:16:09 -0700 Subject: zram: deprecate zram attrs sysfs nodes Add Documentation/ABI/obsolete/sysfs-block-zram file and list obsolete and deprecated attributes there. The patch also adds additional information to zram documentation and describes the basic strategy: - the existing RW nodes will be downgraded to WO nodes (in 4.11) - deprecated RO sysfs nodes will eventually be removed (in 4.11) Users will be additionally notified about deprecated attr usage by pr_warn_once() (added to every deprecated attr _show()), as suggested by Minchan Kim. User space is advised to use zram/stat, zram/io_stat and zram/mm_stat files. Signed-off-by: Sergey Senozhatsky Reported-by: Minchan Kim Cc: Nitin Gupta Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/ABI/obsolete/sysfs-block-zram | 119 ++++++++++++++++++++++++++++ Documentation/blockdev/zram.txt | 16 ++++ 2 files changed, 135 insertions(+) create mode 100644 Documentation/ABI/obsolete/sysfs-block-zram (limited to 'Documentation') diff --git a/Documentation/ABI/obsolete/sysfs-block-zram b/Documentation/ABI/obsolete/sysfs-block-zram new file mode 100644 index 000000000000..720ea92cfb2e --- /dev/null +++ b/Documentation/ABI/obsolete/sysfs-block-zram @@ -0,0 +1,119 @@ +What: /sys/block/zram/num_reads +Date: August 2015 +Contact: Sergey Senozhatsky +Description: + The num_reads file is read-only and specifies the number of + reads (failed or successful) done on this device. + Now accessible via zram/stat node. + +What: /sys/block/zram/num_writes +Date: August 2015 +Contact: Sergey Senozhatsky +Description: + The num_writes file is read-only and specifies the number of + writes (failed or successful) done on this device. + Now accessible via zram/stat node. + +What: /sys/block/zram/invalid_io +Date: August 2015 +Contact: Sergey Senozhatsky +Description: + The invalid_io file is read-only and specifies the number of + non-page-size-aligned I/O requests issued to this device. + Now accessible via zram/io_stat node. + +What: /sys/block/zram/failed_reads +Date: August 2015 +Contact: Sergey Senozhatsky +Description: + The failed_reads file is read-only and specifies the number of + failed reads happened on this device. + Now accessible via zram/io_stat node. + +What: /sys/block/zram/failed_writes +Date: August 2015 +Contact: Sergey Senozhatsky +Description: + The failed_writes file is read-only and specifies the number of + failed writes happened on this device. + Now accessible via zram/io_stat node. + +What: /sys/block/zram/notify_free +Date: August 2015 +Contact: Sergey Senozhatsky +Description: + The notify_free file is read-only. Depending on device usage + scenario it may account a) the number of pages freed because + of swap slot free notifications or b) the number of pages freed + because of REQ_DISCARD requests sent by bio. The former ones + are sent to a swap block device when a swap slot is freed, which + implies that this disk is being used as a swap disk. The latter + ones are sent by filesystem mounted with discard option, + whenever some data blocks are getting discarded. + Now accessible via zram/io_stat node. + +What: /sys/block/zram/zero_pages +Date: August 2015 +Contact: Sergey Senozhatsky +Description: + The zero_pages file is read-only and specifies number of zero + filled pages written to this disk. No memory is allocated for + such pages. + Now accessible via zram/mm_stat node. + +What: /sys/block/zram/orig_data_size +Date: August 2015 +Contact: Sergey Senozhatsky +Description: + The orig_data_size file is read-only and specifies uncompressed + size of data stored in this disk. This excludes zero-filled + pages (zero_pages) since no memory is allocated for them. + Unit: bytes + Now accessible via zram/mm_stat node. + +What: /sys/block/zram/compr_data_size +Date: August 2015 +Contact: Sergey Senozhatsky +Description: + The compr_data_size file is read-only and specifies compressed + size of data stored in this disk. So, compression ratio can be + calculated using orig_data_size and this statistic. + Unit: bytes + Now accessible via zram/mm_stat node. + +What: /sys/block/zram/mem_used_total +Date: August 2015 +Contact: Sergey Senozhatsky +Description: + The mem_used_total file is read-only and specifies the amount + of memory, including allocator fragmentation and metadata + overhead, allocated for this disk. So, allocator space + efficiency can be calculated using compr_data_size and this + statistic. + Unit: bytes + Now accessible via zram/mm_stat node. + +What: /sys/block/zram/mem_used_max +Date: August 2015 +Contact: Sergey Senozhatsky +Description: + The mem_used_max file is read/write and specifies the amount + of maximum memory zram have consumed to store compressed data. + For resetting the value, you should write "0". Otherwise, + you could see -EINVAL. + Unit: bytes + Downgraded to write-only node: so it's possible to set new + value only; its current value is stored in zram/mm_stat + node. + +What: /sys/block/zram/mem_limit +Date: August 2015 +Contact: Sergey Senozhatsky +Description: + The mem_limit file is read/write and specifies the maximum + amount of memory ZRAM can use to store the compressed data. + The limit could be changed in run time and "0" means disable + the limit. No limit is the initial state. Unit: bytes + Downgraded to write-only node: so it's possible to set new + value only; its current value is stored in zram/mm_stat + node. diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt index 7920f4026d36..48a183e29988 100644 --- a/Documentation/blockdev/zram.txt +++ b/Documentation/blockdev/zram.txt @@ -128,6 +128,22 @@ mem_limit RW the maximum amount of memory ZRAM can use to store num_migrated RO the number of objects migrated migrated by compaction +WARNING +======= +per-stat sysfs attributes are considered to be deprecated. +The basic strategy is: +-- the existing RW nodes will be downgraded to WO nodes (in linux 4.11) +-- deprecated RO sysfs nodes will eventually be removed (in linux 4.11) + +The list of deprecated attributes can be found here: +Documentation/ABI/obsolete/sysfs-block-zram + +Basically, every attribute that has its own read accessible sysfs node +(e.g. num_reads) *AND* is accessible via one of the stat files (zram/stat +or zram/io_stat or zram/mm_stat) is considered to be deprecated. + +User space is advised to use the following files to read the device statistics. + File /sys/block/zram/stat Represents block layer statistics. Read Documentation/block/stat.txt for -- cgit v1.2.1 From 7330660ed2e8d8d4c65b90cea62d8f1ed49c0104 Mon Sep 17 00:00:00 2001 From: Geert Uytterhoeven Date: Wed, 15 Apr 2015 16:17:14 -0700 Subject: lib/vsprintf: document %p parameters passed by reference This patch series improves the documentation for printk() formats, and adds support for printing clocks. The latter has always been a hassle if you wanted to support both the common and legacy clock frameworks. - '%pC' and '%pCn' print the name (Common Clock Framework) or address (legacy clock framework) of a clock, - '%pCr' prints the current clock rate. This patch (of 3): Make sure all %p extensions that take parameters by references are documented to do so. Signed-off-by: Geert Uytterhoeven Cc: Jonathan Corbet Cc: Mike Turquette Cc: Stephen Boyd Cc: Tetsuo Handa Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/printk-formats.txt | 15 +++++++++++++++ 1 file changed, 15 insertions(+) (limited to 'Documentation') diff --git a/Documentation/printk-formats.txt b/Documentation/printk-formats.txt index 5a615c14f75d..71438f3eb0c0 100644 --- a/Documentation/printk-formats.txt +++ b/Documentation/printk-formats.txt @@ -54,6 +54,7 @@ Struct Resources: For printing struct resources. The 'R' and 'r' specifiers result in a printed resource with ('R') or without ('r') a decoded flags member. + Passed by reference. Physical addresses types phys_addr_t: @@ -132,6 +133,8 @@ MAC/FDDI addresses: specifier to use reversed byte order suitable for visual interpretation of Bluetooth addresses which are in the little endian order. + Passed by reference. + IPv4 addresses: %pI4 1.2.3.4 @@ -146,6 +149,8 @@ IPv4 addresses: host, network, big or little endian order addresses respectively. Where no specifier is provided the default network/big endian order is used. + Passed by reference. + IPv6 addresses: %pI6 0001:0002:0003:0004:0005:0006:0007:0008 @@ -160,6 +165,8 @@ IPv6 addresses: print a compressed IPv6 address as described by http://tools.ietf.org/html/rfc5952 + Passed by reference. + IPv4/IPv6 addresses (generic, with port, flowinfo, scope): %pIS 1.2.3.4 or 0001:0002:0003:0004:0005:0006:0007:0008 @@ -186,6 +193,8 @@ IPv4/IPv6 addresses (generic, with port, flowinfo, scope): specifiers can be used as well and are ignored in case of an IPv6 address. + Passed by reference. + Further examples: %pISfc 1.2.3.4 or [1:2:3:4:5:6:7:8]/123456789 @@ -207,6 +216,8 @@ UUID/GUID addresses: Where no additional specifiers are used the default little endian order with lower case hex characters will be printed. + Passed by reference. + dentry names: %pd{,2,3,4} %pD{,2,3,4} @@ -216,6 +227,8 @@ dentry names: equivalent of %s dentry->d_name.name we used to use, %pd prints n last components. %pD does the same thing for struct file. + Passed by reference. + struct va_format: %pV @@ -231,6 +244,8 @@ struct va_format: Do not use this feature without some mechanism to verify the correctness of the format string and va_list arguments. + Passed by reference. + u64 SHOULD be printed with %llu/%llx: printk("%llu", u64_var); -- cgit v1.2.1 From e8a7ba5f5c8c0c94c0c7f1bcd53c0289560c7446 Mon Sep 17 00:00:00 2001 From: Geert Uytterhoeven Date: Wed, 15 Apr 2015 16:17:17 -0700 Subject: lib/vsprintf: Move integer format types to the top Move the format types for 64-bit integers and configurable size integers to the top, so they're next to the other integer format types. While at it, add the missing format types for s32 and u32. Signed-off-by: Geert Uytterhoeven Cc: Jonathan Corbet Cc: Mike Turquette Cc: Stephen Boyd Cc: Tetsuo Handa Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/printk-formats.txt | 32 +++++++++++++++----------------- 1 file changed, 15 insertions(+), 17 deletions(-) (limited to 'Documentation') diff --git a/Documentation/printk-formats.txt b/Documentation/printk-formats.txt index 71438f3eb0c0..56804e40cb18 100644 --- a/Documentation/printk-formats.txt +++ b/Documentation/printk-formats.txt @@ -8,6 +8,21 @@ If variable is of Type, use printk format specifier: unsigned long long %llu or %llx size_t %zu or %zx ssize_t %zd or %zx + s32 %d or %x + u32 %u or %x + s64 %lld or %llx + u64 %llu or %llx + +If is dependent on a config option for its size (e.g., sector_t, +blkcnt_t) or is architecture-dependent for its size (e.g., tcflag_t), use a +format specifier of its largest possible type and explicitly cast to it. +Example: + + printk("test: sector number/total blocks: %llu/%llu\n", + (unsigned long long)sector, (unsigned long long)blockcount); + +Reminder: sizeof() result is of type size_t. + Raw pointer value SHOULD be printed with %p. The kernel supports the following extended format specifiers for pointer types: @@ -246,23 +261,6 @@ struct va_format: Passed by reference. -u64 SHOULD be printed with %llu/%llx: - - printk("%llu", u64_var); - -s64 SHOULD be printed with %lld/%llx: - - printk("%lld", s64_var); - -If is dependent on a config option for its size (e.g., sector_t, -blkcnt_t) or is architecture-dependent for its size (e.g., tcflag_t), use a -format specifier of its largest possible type and explicitly cast to it. -Example: - - printk("test: sector number/total blocks: %llu/%llu\n", - (unsigned long long)sector, (unsigned long long)blockcount); - -Reminder: sizeof() result is of type size_t. Thank you for your cooperation and attention. -- cgit v1.2.1 From 900cca2944254edd2d54dc181629314d3177a4af Mon Sep 17 00:00:00 2001 From: Geert Uytterhoeven Date: Wed, 15 Apr 2015 16:17:20 -0700 Subject: lib/vsprintf: add %pC{,n,r} format specifiers for clocks Add format specifiers for printing struct clk: - '%pC' or '%pCn': name (Common Clock Framework) or address (legacy clock framework) of the clock, - '%pCr': rate of the clock. [akpm@linux-foundation.org: omit code if !CONFIG_HAVE_CLK] Signed-off-by: Geert Uytterhoeven Cc: Jonathan Corbet Cc: Mike Turquette Cc: Stephen Boyd Cc: Tetsuo Handa Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/printk-formats.txt | 12 ++++++++++++ 1 file changed, 12 insertions(+) (limited to 'Documentation') diff --git a/Documentation/printk-formats.txt b/Documentation/printk-formats.txt index 56804e40cb18..cb6a596072bb 100644 --- a/Documentation/printk-formats.txt +++ b/Documentation/printk-formats.txt @@ -261,6 +261,18 @@ struct va_format: Passed by reference. +struct clk: + + %pC pll1 + %pCn pll1 + %pCr 1560000000 + + For printing struct clk structures. '%pC' and '%pCn' print the name + (Common Clock Framework) or address (legacy clock framework) of the + structure; '%pCr' prints the current clock rate. + + Passed by reference. + Thank you for your cooperation and attention. -- cgit v1.2.1