diff options
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- | Documentation/filesystems/00-INDEX | 2 | ||||
-rw-r--r-- | Documentation/filesystems/Locking | 4 | ||||
-rw-r--r-- | Documentation/filesystems/logfs.txt | 241 | ||||
-rw-r--r-- | Documentation/filesystems/overlayfs.txt | 26 | ||||
-rw-r--r-- | Documentation/filesystems/porting | 4 | ||||
-rw-r--r-- | Documentation/filesystems/sysfs-pci.txt | 2 | ||||
-rw-r--r-- | Documentation/filesystems/vfs.txt | 11 | ||||
-rw-r--r-- | Documentation/filesystems/xfs.txt | 12 |
8 files changed, 41 insertions, 261 deletions
diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX index f66e748fc5e4..b7bd6c9009cc 100644 --- a/Documentation/filesystems/00-INDEX +++ b/Documentation/filesystems/00-INDEX @@ -87,8 +87,6 @@ jfs.txt - info and mount options for the JFS filesystem. locks.txt - info on file locking implementations, flock() vs. fcntl(), etc. -logfs.txt - - info on the LogFS flash filesystem. mandatory-locking.txt - info on the Linux implementation of Sys V mandatory file locking. ncpfs.txt diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index 1b5f15653b1b..ace63cd7af8c 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -20,7 +20,7 @@ prototypes: void (*d_iput)(struct dentry *, struct inode *); char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen); struct vfsmount *(*d_automount)(struct path *path); - int (*d_manage)(struct dentry *, bool); + int (*d_manage)(const struct path *, bool); struct dentry *(*d_real)(struct dentry *, const struct inode *, unsigned int); @@ -556,7 +556,7 @@ till "end_pgoff". ->map_pages() is called with page table locked and must not block. If it's not possible to reach a page without blocking, filesystem should skip it. Filesystem should use do_set_pte() to setup page table entry. Pointer to entry associated with the page is passed in -"pte" field in fault_env structure. Pointers to entries for other offsets +"pte" field in vm_fault structure. Pointers to entries for other offsets should be calculated relative to "pte". ->page_mkwrite() is called when a previously read-only pte is diff --git a/Documentation/filesystems/logfs.txt b/Documentation/filesystems/logfs.txt deleted file mode 100644 index bca42c22a143..000000000000 --- a/Documentation/filesystems/logfs.txt +++ /dev/null @@ -1,241 +0,0 @@ - -The LogFS Flash Filesystem -========================== - -Specification -============= - -Superblocks ------------ - -Two superblocks exist at the beginning and end of the filesystem. -Each superblock is 256 Bytes large, with another 3840 Bytes reserved -for future purposes, making a total of 4096 Bytes. - -Superblock locations may differ for MTD and block devices. On MTD the -first non-bad block contains a superblock in the first 4096 Bytes and -the last non-bad block contains a superblock in the last 4096 Bytes. -On block devices, the first 4096 Bytes of the device contain the first -superblock and the last aligned 4096 Byte-block contains the second -superblock. - -For the most part, the superblocks can be considered read-only. They -are written only to correct errors detected within the superblocks, -move the journal and change the filesystem parameters through tunefs. -As a result, the superblock does not contain any fields that require -constant updates, like the amount of free space, etc. - -Segments --------- - -The space in the device is split up into equal-sized segments. -Segments are the primary write unit of LogFS. Within each segments, -writes happen from front (low addresses) to back (high addresses. If -only a partial segment has been written, the segment number, the -current position within and optionally a write buffer are stored in -the journal. - -Segments are erased as a whole. Therefore Garbage Collection may be -required to completely free a segment before doing so. - -Journal --------- - -The journal contains all global information about the filesystem that -is subject to frequent change. At mount time, it has to be scanned -for the most recent commit entry, which contains a list of pointers to -all currently valid entries. - -Object Store ------------- - -All space except for the superblocks and journal is part of the object -store. Each segment contains a segment header and a number of -objects, each consisting of the object header and the payload. -Objects are either inodes, directory entries (dentries), file data -blocks or indirect blocks. - -Levels ------- - -Garbage collection (GC) may fail if all data is written -indiscriminately. One requirement of GC is that data is separated -roughly according to the distance between the tree root and the data. -Effectively that means all file data is on level 0, indirect blocks -are on levels 1, 2, 3 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect blocks, -respectively. Inode file data is on level 6 for the inodes and 7-11 -for indirect blocks. - -Each segment contains objects of a single level only. As a result, -each level requires its own separate segment to be open for writing. - -Inode File ----------- - -All inodes are stored in a special file, the inode file. Single -exception is the inode file's inode (master inode) which for obvious -reasons is stored in the journal instead. Instead of data blocks, the -leaf nodes of the inode files are inodes. - -Aliases -------- - -Writes in LogFS are done by means of a wandering tree. A naïve -implementation would require that for each write or a block, all -parent blocks are written as well, since the block pointers have -changed. Such an implementation would not be very efficient. - -In LogFS, the block pointer changes are cached in the journal by means -of alias entries. Each alias consists of its logical address - inode -number, block index, level and child number (index into block) - and -the changed data. Any 8-byte word can be changes in this manner. - -Currently aliases are used for block pointers, file size, file used -bytes and the height of an inodes indirect tree. - -Segment Aliases ---------------- - -Related to regular aliases, these are used to handle bad blocks. -Initially, bad blocks are handled by moving the affected segment -content to a spare segment and noting this move in the journal with a -segment alias, a simple (to, from) tupel. GC will later empty this -segment and the alias can be removed again. This is used on MTD only. - -Vim ---- - -By cleverly predicting the life time of data, it is possible to -separate long-living data from short-living data and thereby reduce -the GC overhead later. Each type of distinc life expectency (vim) can -have a separate segment open for writing. Each (level, vim) tupel can -be open just once. If an open segment with unknown vim is encountered -at mount time, it is closed and ignored henceforth. - -Indirect Tree -------------- - -Inodes in LogFS are similar to FFS-style filesystems with direct and -indirect block pointers. One difference is that LogFS uses a single -indirect pointer that can be either a 1x, 2x, etc. indirect pointer. -A height field in the inode defines the height of the indirect tree -and thereby the indirection of the pointer. - -Another difference is the addressing of indirect blocks. In LogFS, -the first 16 pointers in the first indirect block are left empty, -corresponding to the 16 direct pointers in the inode. In ext2 (maybe -others as well) the first pointer in the first indirect block -corresponds to logical block 12, skipping the 12 direct pointers. -So where ext2 is using arithmetic to better utilize space, LogFS keeps -arithmetic simple and uses compression to save space. - -Compression ------------ - -Both file data and metadata can be compressed. Compression for file -data can be enabled with chattr +c and disabled with chattr -c. Doing -so has no effect on existing data, but new data will be stored -accordingly. New inodes will inherit the compression flag of the -parent directory. - -Metadata is always compressed. However, the space accounting ignores -this and charges for the uncompressed size. Failing to do so could -result in GC failures when, after moving some data, indirect blocks -compress worse than previously. Even on a 100% full medium, GC may -not consume any extra space, so the compression gains are lost space -to the user. - -However, they are not lost space to the filesystem internals. By -cheating the user for those bytes, the filesystem gained some slack -space and GC will run less often and faster. - -Garbage Collection and Wear Leveling ------------------------------------- - -Garbage collection is invoked whenever the number of free segments -falls below a threshold. The best (known) candidate is picked based -on the least amount of valid data contained in the segment. All -remaining valid data is copied elsewhere, thereby invalidating it. - -The GC code also checks for aliases and writes then back if their -number gets too large. - -Wear leveling is done by occasionally picking a suboptimal segment for -garbage collection. If a stale segments erase count is significantly -lower than the active segments' erase counts, it will be picked. Wear -leveling is rate limited, so it will never monopolize the device for -more than one segment worth at a time. - -Values for "occasionally", "significantly lower" are compile time -constants. - -Hashed directories ------------------- - -To satisfy efficient lookup(), directory entries are hashed and -located based on the hash. In order to both support large directories -and not be overly inefficient for small directories, several hash -tables of increasing size are used. For each table, the hash value -modulo the table size gives the table index. - -Tables sizes are chosen to limit the number of indirect blocks with a -fully populated table to 0, 1, 2 or 3 respectively. So the first -table contains 16 entries, the second 512-16, etc. - -The last table is special in several ways. First its size depends on -the effective 32bit limit on telldir/seekdir cookies. Since logfs -uses the upper half of the address space for indirect blocks, the size -is limited to 2^31. Secondly the table contains hash buckets with 16 -entries each. - -Using single-entry buckets would result in birthday "attacks". At -just 2^16 used entries, hash collisions would be likely (P >= 0.5). -My math skills are insufficient to do the combinatorics for the 17x -collisions necessary to overflow a bucket, but testing showed that in -10,000 runs the lowest directory fill before a bucket overflow was -188,057,130 entries with an average of 315,149,915 entries. So for -directory sizes of up to a million, bucket overflows should be -virtually impossible under normal circumstances. - -With carefully chosen filenames, it is obviously possible to cause an -overflow with just 21 entries (4 higher tables + 16 entries + 1). So -there may be a security concern if a malicious user has write access -to a directory. - -Open For Discussion -=================== - -Device Address Space --------------------- - -A device address space is used for caching. Both block devices and -MTD provide functions to either read a single page or write a segment. -Partial segments may be written for data integrity, but where possible -complete segments are written for performance on simple block device -flash media. - -Meta Inodes ------------ - -Inodes are stored in the inode file, which is just a regular file for -most purposes. At umount time, however, the inode file needs to -remain open until all dirty inodes are written. So -generic_shutdown_super() may not close this inode, but shouldn't -complain about remaining inodes due to the inode file either. Same -goes for mapping inode of the device address space. - -Currently logfs uses a hack that essentially copies part of fs/inode.c -code over. A general solution would be preferred. - -Indirect block mapping ----------------------- - -With compression, the block device (or mapping inode) cannot be used -to cache indirect blocks. Some other place is required. Currently -logfs uses the top half of each inode's address space. The low 8TB -(on 32bit) are filled with file data, the high 8TB are used for -indirect blocks. - -One problem is that 16TB files created on 64bit systems actually have -data in the top 8TB. But files >16TB would cause problems anyway, so -only the limit has changed. diff --git a/Documentation/filesystems/overlayfs.txt b/Documentation/filesystems/overlayfs.txt index bcbf9710e4af..634d03e20c2d 100644 --- a/Documentation/filesystems/overlayfs.txt +++ b/Documentation/filesystems/overlayfs.txt @@ -66,7 +66,7 @@ At mount time, the two directories given as mount options "lowerdir" and "upperdir" are combined into a merged directory: mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,\ -workdir=/work /merged + workdir=/work /merged The "workdir" needs to be an empty directory on the same filesystem as upperdir. @@ -118,6 +118,7 @@ programs. seek offsets are assigned sequentially when the directories are read. Thus if + - read part of a directory - remember an offset, and close the directory - re-open the directory some time later @@ -130,6 +131,23 @@ directory. Readdir on directories that are not merged is simply handled by the underlying directory (upper or lower). +renaming directories +-------------------- + +When renaming a directory that is on the lower layer or merged (i.e. the +directory was not created on the upper layer to start with) overlayfs can +handle it in two different ways: + +1. return EXDEV error: this error is returned by rename(2) when trying to + move a file or directory across filesystem boundaries. Hence + applications are usually prepared to hande this error (mv(1) for example + recursively copies the directory tree). This is the default behavior. + +2. If the "redirect_dir" feature is enabled, then the directory will be + copied up (but not the contents). Then the "trusted.overlay.redirect" + extended attribute is set to the path of the original location from the + root of the overlay. Finally the directory is moved to the new + location. Non-directories --------------- @@ -185,13 +203,13 @@ filesystem, so both st_dev and st_ino of the file may change. Any open files referring to this inode will access the old data. -Any file locks (and leases) obtained before copy_up will not apply -to the copied up file. - If a file with multiple hard links is copied up, then this will "break" the link. Changes will not be propagated to other names referring to the same inode. +Unless "redirect_dir" feature is enabled, rename(2) on a lower or merged +directory will fail with EXDEV. + Changes to underlying filesystems --------------------------------- diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting index bdd025ceb763..95280079c0b3 100644 --- a/Documentation/filesystems/porting +++ b/Documentation/filesystems/porting @@ -596,3 +596,7 @@ in your dentry operations instead. [mandatory] ->rename() has an added flags argument. Any flags not handled by the filesystem should result in EINVAL being returned. +-- +[recommended] + ->readlink is optional for symlinks. Don't set, unless filesystem needs + to fake something for readlink(2). diff --git a/Documentation/filesystems/sysfs-pci.txt b/Documentation/filesystems/sysfs-pci.txt index 74eaac26f8b8..6ea1ceda6f52 100644 --- a/Documentation/filesystems/sysfs-pci.txt +++ b/Documentation/filesystems/sysfs-pci.txt @@ -17,6 +17,7 @@ that support it. For example, a given bus might look like this: | |-- resource0 | |-- resource1 | |-- resource2 + | |-- revision | |-- rom | |-- subsystem_device | |-- subsystem_vendor @@ -41,6 +42,7 @@ files, each with their own function. resource PCI resource host addresses (ascii, ro) resource0..N PCI resource N, if present (binary, mmap, rw[1]) resource0_wc..N_wc PCI WC map resource N, if prefetchable (binary, mmap) + revision PCI revision (ascii, ro) rom PCI ROM resource, if present (binary, ro) subsystem_device PCI subsystem device (ascii, ro) subsystem_vendor PCI subsystem vendor (ascii, ro) diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index b5039a00caaf..b968084eeac1 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -451,9 +451,6 @@ otherwise noted. exist; this is checked by the VFS. Unlike plain rename, source and target may be of different type. - readlink: called by the readlink(2) system call. Only required if - you want to support reading symbolic links - get_link: called by the VFS to follow a symbolic link to the inode it points to. Only required if you want to support symbolic links. This method returns the symlink body @@ -468,6 +465,12 @@ otherwise noted. argument. If request can't be handled without leaving RCU mode, have it return ERR_PTR(-ECHILD). + readlink: this is now just an override for use by readlink(2) for the + cases when ->get_link uses nd_jump_link() or object is not in + fact a symlink. Normally filesystems should only implement + ->get_link for symlinks and readlink(2) will automatically use + that. + permission: called by the VFS to check for access rights on a POSIX-like filesystem. @@ -948,7 +951,7 @@ struct dentry_operations { void (*d_iput)(struct dentry *, struct inode *); char *(*d_dname)(struct dentry *, char *, int); struct vfsmount *(*d_automount)(struct path *); - int (*d_manage)(struct dentry *, bool); + int (*d_manage)(const struct path *, bool); struct dentry *(*d_real)(struct dentry *, const struct inode *, unsigned int); }; diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt index c2d44e6e117b..3b9b5c149f32 100644 --- a/Documentation/filesystems/xfs.txt +++ b/Documentation/filesystems/xfs.txt @@ -51,13 +51,6 @@ default behaviour. CRC enabled filesystems always use the attr2 format, and so will reject the noattr2 mount option if it is set. - barrier (*) - nobarrier - Enables/disables the use of block layer write barriers for - writes into the journal and for data integrity operations. - This allows for drive level write caching to be enabled, for - devices that support write barriers. - discard nodiscard (*) Enable/disable the issuing of commands to let the block @@ -228,7 +221,10 @@ default behaviour. Deprecated Mount Options ======================== -None at present. + Name Removal Schedule + ---- ---------------- + barrier no earlier than v4.15 + nobarrier no earlier than v4.15 Removed Mount Options |