diff options
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- | Documentation/filesystems/btrfs.txt | 261 | ||||
-rw-r--r-- | Documentation/filesystems/nfs/pnfs-scsi-server.txt | 23 | ||||
-rw-r--r-- | Documentation/filesystems/ocfs2-online-filecheck.txt | 94 | ||||
-rw-r--r-- | Documentation/filesystems/orangefs.txt | 406 | ||||
-rw-r--r-- | Documentation/filesystems/vfat.txt | 7 |
5 files changed, 538 insertions, 253 deletions
diff --git a/Documentation/filesystems/btrfs.txt b/Documentation/filesystems/btrfs.txt index c772b47e7ef0..f9dad22d95ce 100644 --- a/Documentation/filesystems/btrfs.txt +++ b/Documentation/filesystems/btrfs.txt @@ -1,20 +1,10 @@ - BTRFS ===== -Btrfs is a copy on write filesystem for Linux aimed at -implementing advanced features while focusing on fault tolerance, -repair and easy administration. Initially developed by Oracle, Btrfs -is licensed under the GPL and open for contribution from anyone. - -Linux has a wealth of filesystems to choose from, but we are facing a -number of challenges with scaling to the large storage subsystems that -are becoming common in today's data centers. Filesystems need to scale -in their ability to address and manage large storage, and also in -their ability to detect, repair and tolerate errors in the data stored -on disk. Btrfs is under heavy development, and is not suitable for -any uses other than benchmarking and review. The Btrfs disk format is -not yet finalized. +Btrfs is a copy on write filesystem for Linux aimed at implementing advanced +features while focusing on fault tolerance, repair and easy administration. +Jointly developed by several companies, licensed under the GPL and open for +contribution from anyone. The main Btrfs features include: @@ -28,243 +18,14 @@ The main Btrfs features include: * Checksums on data and metadata (multiple algorithms available) * Compression * Integrated multiple device support, with several raid algorithms - * Online filesystem check (not yet implemented) - * Very fast offline filesystem check - * Efficient incremental backup and FS mirroring (not yet implemented) + * Offline filesystem check + * Efficient incremental backup and FS mirroring * Online filesystem defragmentation +For more information please refer to the wiki -Mount Options -============= - -When mounting a btrfs filesystem, the following option are accepted. -Options with (*) are default options and will not show in the mount options. - - alloc_start=<bytes> - Debugging option to force all block allocations above a certain - byte threshold on each block device. The value is specified in - bytes, optionally with a K, M, or G suffix, case insensitive. - Default is 1MB. - - noautodefrag(*) - autodefrag - Disable/enable auto defragmentation. - Auto defragmentation detects small random writes into files and queue - them up for the defrag process. Works best for small files; - Not well suited for large database workloads. - - check_int - check_int_data - check_int_print_mask=<value> - These debugging options control the behavior of the integrity checking - module (the BTRFS_FS_CHECK_INTEGRITY config option required). - - check_int enables the integrity checker module, which examines all - block write requests to ensure on-disk consistency, at a large - memory and CPU cost. - - check_int_data includes extent data in the integrity checks, and - implies the check_int option. - - check_int_print_mask takes a bitmask of BTRFSIC_PRINT_MASK_* values - as defined in fs/btrfs/check-integrity.c, to control the integrity - checker module behavior. - - See comments at the top of fs/btrfs/check-integrity.c for more info. - - commit=<seconds> - Set the interval of periodic commit, 30 seconds by default. Higher - values defer data being synced to permanent storage with obvious - consequences when the system crashes. The upper bound is not forced, - but a warning is printed if it's more than 300 seconds (5 minutes). - - compress - compress=<type> - compress-force - compress-force=<type> - Control BTRFS file data compression. Type may be specified as "zlib" - "lzo" or "no" (for no compression, used for remounting). If no type - is specified, zlib is used. If compress-force is specified, - all files will be compressed, whether or not they compress well. - If compression is enabled, nodatacow and nodatasum are disabled. - - degraded - Allow mounts to continue with missing devices. A read-write mount may - fail with too many devices missing, for example if a stripe member - is completely missing. - - device=<devicepath> - Specify a device during mount so that ioctls on the control device - can be avoided. Especially useful when trying to mount a multi-device - setup as root. May be specified multiple times for multiple devices. - - nodiscard(*) - discard - Disable/enable discard mount option. - Discard issues frequent commands to let the block device reclaim space - freed by the filesystem. - This is useful for SSD devices, thinly provisioned - LUNs and virtual machine images, but may have a significant - performance impact. (The fstrim command is also available to - initiate batch trims from userspace). - - noenospc_debug(*) - enospc_debug - Disable/enable debugging option to be more verbose in some ENOSPC conditions. - - fatal_errors=<action> - Action to take when encountering a fatal error: - "bug" - BUG() on a fatal error. This is the default. - "panic" - panic() on a fatal error. - - noflushoncommit(*) - flushoncommit - The 'flushoncommit' mount option forces any data dirtied by a write in a - prior transaction to commit as part of the current commit. This makes - the committed state a fully consistent view of the file system from the - application's perspective (i.e., it includes all completed file system - operations). This was previously the behavior only when a snapshot is - created. - - inode_cache - Enable free inode number caching. Defaults to off due to an overflow - problem when the free space crcs don't fit inside a single page. - - max_inline=<bytes> - Specify the maximum amount of space, in bytes, that can be inlined in - a metadata B-tree leaf. The value is specified in bytes, optionally - with a K, M, or G suffix, case insensitive. In practice, this value - is limited by the root sector size, with some space unavailable due - to leaf headers. For a 4k sector size, max inline data is ~3900 bytes. - - metadata_ratio=<value> - Specify that 1 metadata chunk should be allocated after every <value> - data chunks. Off by default. - - acl(*) - noacl - Enable/disable support for Posix Access Control Lists (ACLs). See the - acl(5) manual page for more information about ACLs. - - barrier(*) - nobarrier - Enable/disable the use of block layer write barriers. Write barriers - ensure that certain IOs make it through the device cache and are on - persistent storage. If disabled on a device with a volatile - (non-battery-backed) write-back cache, nobarrier option will lead to - filesystem corruption on a system crash or power loss. - - datacow(*) - nodatacow - Enable/disable data copy-on-write for newly created files. - Nodatacow implies nodatasum, and disables all compression. - - datasum(*) - nodatasum - Enable/disable data checksumming for newly created files. - Datasum implies datacow. - - treelog(*) - notreelog - Enable/disable the tree logging used for fsync and O_SYNC writes. - - recovery - Enable autorecovery attempts if a bad tree root is found at mount time. - Currently this scans a list of several previous tree roots and tries to - use the first readable. - - rescan_uuid_tree - Force check and rebuild procedure of the UUID tree. This should not - normally be needed. - - skip_balance - Skip automatic resume of interrupted balance operation after mount. - May be resumed with "btrfs balance resume." - - space_cache (*) - Enable the on-disk freespace cache. - nospace_cache - Disable freespace cache loading without clearing the cache. - clear_cache - Force clearing and rebuilding of the disk space cache if something - has gone wrong. - - ssd - nossd - ssd_spread - Options to control ssd allocation schemes. By default, BTRFS will - enable or disable ssd allocation heuristics depending on whether a - rotational or non-rotational disk is in use. The ssd and nossd options - can override this autodetection. - - The ssd_spread mount option attempts to allocate into big chunks - of unused space, and may perform better on low-end ssds. ssd_spread - implies ssd, enabling all other ssd heuristics as well. - - subvol=<path> - Mount subvolume at <path> rather than the root subvolume. <path> is - relative to the top level subvolume. - - subvolid=<ID> - Mount subvolume specified by an ID number rather than the root subvolume. - This allows mounting of subvolumes which are not in the root of the mounted - filesystem. - You can use "btrfs subvolume list" to see subvolume ID numbers. - - subvolrootid=<objectid> (deprecated) - Mount subvolume specified by <objectid> rather than the root subvolume. - This allows mounting of subvolumes which are not in the root of the mounted - filesystem. - You can use "btrfs subvolume show " to see the object ID for a subvolume. - - thread_pool=<number> - The number of worker threads to allocate. The default number is equal - to the number of CPUs + 2, or 8, whichever is smaller. - - user_subvol_rm_allowed - Allow subvolumes to be deleted by a non-root user. Use with caution. - -MAILING LIST -============ - -There is a Btrfs mailing list hosted on vger.kernel.org. You can -find details on how to subscribe here: - -http://vger.kernel.org/vger-lists.html#linux-btrfs - -Mailing list archives are available from gmane: - -http://dir.gmane.org/gmane.comp.file-systems.btrfs - - - -IRC -=== - -Discussion of Btrfs also occurs on the #btrfs channel of the Freenode -IRC network. - - - - UTILITIES - ========= - -Userspace tools for creating and manipulating Btrfs file systems are -available from the git repository at the following location: - - http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-progs.git - git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git - -These include the following tools: - -* mkfs.btrfs: create a filesystem - -* btrfs: a single tool to manage the filesystems, refer to the manpage for more details - -* 'btrfsck' or 'btrfs check': do a consistency check of the filesystem - -Other tools for specific tasks: - -* btrfs-convert: in-place conversion from ext2/3/4 filesystems + https://btrfs.wiki.kernel.org -* btrfs-image: dump filesystem metadata for debugging +that maintains information about administration tasks, frequently asked +questions, use cases, mount options, comprehensible changelogs, features, +manual pages, source code repositories, contacts etc. diff --git a/Documentation/filesystems/nfs/pnfs-scsi-server.txt b/Documentation/filesystems/nfs/pnfs-scsi-server.txt new file mode 100644 index 000000000000..5bef7268bd9f --- /dev/null +++ b/Documentation/filesystems/nfs/pnfs-scsi-server.txt @@ -0,0 +1,23 @@ + +pNFS SCSI layout server user guide +================================== + +This document describes support for pNFS SCSI layouts in the Linux NFS server. +With pNFS SCSI layouts, the NFS server acts as Metadata Server (MDS) for pNFS, +which in addition to handling all the metadata access to the NFS export, +also hands out layouts to the clients so that they can directly access the +underlying SCSI LUNs that are shared with the client. + +To use pNFS SCSI layouts with with the Linux NFS server, the exported file +system needs to support the pNFS SCSI layouts (currently just XFS), and the +file system must sit on a SCSI LUN that is accessible to the clients in +addition to the MDS. As of now the file system needs to sit directly on the +exported LUN, striping or concatenation of LUNs on the MDS and clients +is not supported yet. + +On a server built with CONFIG_NFSD_SCSI, the pNFS SCSI volume support is +automatically enabled if the file system is exported using the "pnfs" +option and the underlying SCSI device support persistent reservations. +On the client make sure the kernel has the CONFIG_PNFS_BLOCK option +enabled, and the file system is mounted using the NFSv4.1 protocol +version (mount -o vers=4.1). diff --git a/Documentation/filesystems/ocfs2-online-filecheck.txt b/Documentation/filesystems/ocfs2-online-filecheck.txt new file mode 100644 index 000000000000..1ab07860430d --- /dev/null +++ b/Documentation/filesystems/ocfs2-online-filecheck.txt @@ -0,0 +1,94 @@ + OCFS2 online file check + ----------------------- + +This document will describe OCFS2 online file check feature. + +Introduction +============ +OCFS2 is often used in high-availaibility systems. However, OCFS2 usually +converts the filesystem to read-only when encounters an error. This may not be +necessary, since turning the filesystem read-only would affect other running +processes as well, decreasing availability. +Then, a mount option (errors=continue) is introduced, which would return the +-EIO errno to the calling process and terminate furhter processing so that the +filesystem is not corrupted further. The filesystem is not converted to +read-only, and the problematic file's inode number is reported in the kernel +log. The user can try to check/fix this file via online filecheck feature. + +Scope +===== +This effort is to check/fix small issues which may hinder day-to-day operations +of a cluster filesystem by turning the filesystem read-only. The scope of +checking/fixing is at the file level, initially for regular files and eventually +to all files (including system files) of the filesystem. + +In case of directory to file links is incorrect, the directory inode is +reported as erroneous. + +This feature is not suited for extravagant checks which involve dependency of +other components of the filesystem, such as but not limited to, checking if the +bits for file blocks in the allocation has been set. In case of such an error, +the offline fsck should/would be recommended. + +Finally, such an operation/feature should not be automated lest the filesystem +may end up with more damage than before the repair attempt. So, this has to +be performed using user interaction and consent. + +User interface +============== +When there are errors in the OCFS2 filesystem, they are usually accompanied +by the inode number which caused the error. This inode number would be the +input to check/fix the file. + +There is a sysfs directory for each OCFS2 file system mounting: + + /sys/fs/ocfs2/<devname>/filecheck + +Here, <devname> indicates the name of OCFS2 volumn device which has been already +mounted. The file above would accept inode numbers. This could be used to +communicate with kernel space, tell which file(inode number) will be checked or +fixed. Currently, three operations are supported, which includes checking +inode, fixing inode and setting the size of result record history. + +1. If you want to know what error exactly happened to <inode> before fixing, do + + # echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/check + # cat /sys/fs/ocfs2/<devname>/filecheck/check + +The output is like this: + INO DONE ERROR +39502 1 GENERATION + +<INO> lists the inode numbers. +<DONE> indicates whether the operation has been finished. +<ERROR> says what kind of errors was found. For the detailed error numbers, +please refer to the file linux/fs/ocfs2/filecheck.h. + +2. If you determine to fix this inode, do + + # echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/fix + # cat /sys/fs/ocfs2/<devname>/filecheck/fix + +The output is like this: + INO DONE ERROR +39502 1 SUCCESS + +This time, the <ERROR> column indicates whether this fix is successful or not. + +3. The record cache is used to store the history of check/fix results. It's +defalut size is 10, and can be adjust between the range of 10 ~ 100. You can +adjust the size like this: + + # echo "<size>" > /sys/fs/ocfs2/<devname>/filecheck/set + +Fixing stuff +============ +On receivng the inode, the filesystem would read the inode and the +file metadata. In case of errors, the filesystem would fix the errors +and report the problems it fixed in the kernel log. As a precautionary measure, +the inode must first be checked for errors before performing a final fix. + +The inode and the result history will be maintained temporarily in a +small linked list buffer which would contain the last (N) inodes +fixed/checked, the detailed errors which were fixed/checked are printed in the +kernel log. diff --git a/Documentation/filesystems/orangefs.txt b/Documentation/filesystems/orangefs.txt new file mode 100644 index 000000000000..e1a0056a365f --- /dev/null +++ b/Documentation/filesystems/orangefs.txt @@ -0,0 +1,406 @@ +ORANGEFS +======== + +OrangeFS is an LGPL userspace scale-out parallel storage system. It is ideal +for large storage problems faced by HPC, BigData, Streaming Video, +Genomics, Bioinformatics. + +Orangefs, originally called PVFS, was first developed in 1993 by +Walt Ligon and Eric Blumer as a parallel file system for Parallel +Virtual Machine (PVM) as part of a NASA grant to study the I/O patterns +of parallel programs. + +Orangefs features include: + + * Distributes file data among multiple file servers + * Supports simultaneous access by multiple clients + * Stores file data and metadata on servers using local file system + and access methods + * Userspace implementation is easy to install and maintain + * Direct MPI support + * Stateless + + +MAILING LIST +============ + +http://beowulf-underground.org/mailman/listinfo/pvfs2-users + + +DOCUMENTATION +============= + +http://www.orangefs.org/documentation/ + + +USERSPACE FILESYSTEM SOURCE +=========================== + +http://www.orangefs.org/download + +Orangefs versions prior to 2.9.3 would not be compatible with the +upstream version of the kernel client. + + +BUILDING THE USERSPACE FILESYSTEM ON A SINGLE SERVER +==================================================== + +When Orangefs is upstream, "--with-kernel" shouldn't be needed, but +until then the path to where the kernel with the Orangefs kernel client +patch was built is needed to ensure that pvfs2-client-core (the bridge +between kernel space and user space) will build properly. You can omit +--prefix if you don't care that things are sprinkled around in +/usr/local. + +./configure --prefix=/opt/ofs --with-kernel=/path/to/orangefs/kernel + +make + +make install + +Create an orangefs config file: +/opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf + + for "Enter hostnames", use the hostname, don't let it default to + localhost. + +create a pvfs2tab file in /etc: +cat /etc/pvfs2tab +tcp://myhostname:3334/orangefs /mymountpoint pvfs2 defaults,noauto 0 0 + +create the mount point you specified in the tab file if needed: +mkdir /mymountpoint + +bootstrap the server: +/opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf -f + +start the server: +/opt/osf/sbin/pvfs2-server /etc/pvfs2.conf + +Now the server is running. At this point you might like to +prove things are working with: + +/opt/osf/bin/pvfs2-ls /mymountpoint + +You might not want to enforce selinux, it doesn't seem to matter by +linux 3.11... + +If stuff seems to be working, turn on the client core: +/opt/osf/sbin/pvfs2-client -p /opt/osf/sbin/pvfs2-client-core + +Mount your filesystem. +mount -t pvfs2 tcp://myhostname:3334/orangefs /mymountpoint + + +OPTIONS +======= + +The following mount options are accepted: + + acl + Allow the use of Access Control Lists on files and directories. + + intr + Some operations between the kernel client and the user space + filesystem can be interruptible, such as changes in debug levels + and the setting of tunable parameters. + + local_lock + Enable posix locking from the perspective of "this" kernel. The + default file_operations lock action is to return ENOSYS. Posix + locking kicks in if the filesystem is mounted with -o local_lock. + Distributed locking is being worked on for the future. + + +DEBUGGING +========= + +If you want the debug (GOSSIP) statements in a particular +source file (inode.c for example) go to syslog: + + echo inode > /sys/kernel/debug/orangefs/kernel-debug + +No debugging (the default): + + echo none > /sys/kernel/debug/orangefs/kernel-debug + +Debugging from several source files: + + echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug + +All debugging: + + echo all > /sys/kernel/debug/orangefs/kernel-debug + +Get a list of all debugging keywords: + + cat /sys/kernel/debug/orangefs/debug-help + + +PROTOCOL BETWEEN KERNEL MODULE AND USERSPACE +============================================ + +Orangefs is a user space filesystem and an associated kernel module. +We'll just refer to the user space part of Orangefs as "userspace" +from here on out. Orangefs descends from PVFS, and userspace code +still uses PVFS for function and variable names. Userspace typedefs +many of the important structures. Function and variable names in +the kernel module have been transitioned to "orangefs", and The Linux +Coding Style avoids typedefs, so kernel module structures that +correspond to userspace structures are not typedefed. + +The kernel module implements a pseudo device that userspace +can read from and write to. Userspace can also manipulate the +kernel module through the pseudo device with ioctl. + +THE BUFMAP: + +At startup userspace allocates two page-size-aligned (posix_memalign) +mlocked memory buffers, one is used for IO and one is used for readdir +operations. The IO buffer is 41943040 bytes and the readdir buffer is +4194304 bytes. Each buffer contains logical chunks, or partitions, and +a pointer to each buffer is added to its own PVFS_dev_map_desc structure +which also describes its total size, as well as the size and number of +the partitions. + +A pointer to the IO buffer's PVFS_dev_map_desc structure is sent to a +mapping routine in the kernel module with an ioctl. The structure is +copied from user space to kernel space with copy_from_user and is used +to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which +then contains: + + * refcnt - a reference counter + * desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's + partition size, which represents the filesystem's block size and + is used for s_blocksize in super blocks. + * desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COUNT (10) - the number of + partitions in the IO buffer. + * desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks. + * total_size - the total size of the IO buffer. + * page_count - the number of 4096 byte pages in the IO buffer. + * page_array - a pointer to page_count * (sizeof(struct page*)) bytes + of kcalloced memory. This memory is used as an array of pointers + to each of the pages in the IO buffer through a call to get_user_pages. + * desc_array - a pointer to desc_count * (sizeof(struct orangefs_bufmap_desc)) + bytes of kcalloced memory. This memory is further intialized: + + user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc + structure. user_desc->ptr points to the IO buffer. + + pages_per_desc = bufmap->desc_size / PAGE_SIZE + offset = 0 + + bufmap->desc_array[0].page_array = &bufmap->page_array[offset] + bufmap->desc_array[0].array_count = pages_per_desc = 1024 + bufmap->desc_array[0].uaddr = (user_desc->ptr) + (0 * 1024 * 4096) + offset += 1024 + . + . + . + bufmap->desc_array[9].page_array = &bufmap->page_array[offset] + bufmap->desc_array[9].array_count = pages_per_desc = 1024 + bufmap->desc_array[9].uaddr = (user_desc->ptr) + + (9 * 1024 * 4096) + offset += 1024 + + * buffer_index_array - a desc_count sized array of ints, used to + indicate which of the IO buffer's partitions are available to use. + * buffer_index_lock - a spinlock to protect buffer_index_array during update. + * readdir_index_array - a five (ORANGEFS_READDIR_DEFAULT_DESC_COUNT) element + int array used to indicate which of the readdir buffer's partitions are + available to use. + * readdir_index_lock - a spinlock to protect readdir_index_array during + update. + +OPERATIONS: + +The kernel module builds an "op" (struct orangefs_kernel_op_s) when it +needs to communicate with userspace. Part of the op contains the "upcall" +which expresses the request to userspace. Part of the op eventually +contains the "downcall" which expresses the results of the request. + +The slab allocator is used to keep a cache of op structures handy. + +At init time the kernel module defines and initializes a request list +and an in_progress hash table to keep track of all the ops that are +in flight at any given time. + +Ops are stateful: + + * unknown - op was just initialized + * waiting - op is on request_list (upward bound) + * inprogr - op is in progress (waiting for downcall) + * serviced - op has matching downcall; ok + * purged - op has to start a timer since client-core + exited uncleanly before servicing op + * given up - submitter has given up waiting for it + +When some arbitrary userspace program needs to perform a +filesystem operation on Orangefs (readdir, I/O, create, whatever) +an op structure is initialized and tagged with a distinguishing ID +number. The upcall part of the op is filled out, and the op is +passed to the "service_operation" function. + +Service_operation changes the op's state to "waiting", puts +it on the request list, and signals the Orangefs file_operations.poll +function through a wait queue. Userspace is polling the pseudo-device +and thus becomes aware of the upcall request that needs to be read. + +When the Orangefs file_operations.read function is triggered, the +request list is searched for an op that seems ready-to-process. +The op is removed from the request list. The tag from the op and +the filled-out upcall struct are copy_to_user'ed back to userspace. + +If any of these (and some additional protocol) copy_to_users fail, +the op's state is set to "waiting" and the op is added back to +the request list. Otherwise, the op's state is changed to "in progress", +and the op is hashed on its tag and put onto the end of a list in the +in_progress hash table at the index the tag hashed to. + +When userspace has assembled the response to the upcall, it +writes the response, which includes the distinguishing tag, back to +the pseudo device in a series of io_vecs. This triggers the Orangefs +file_operations.write_iter function to find the op with the associated +tag and remove it from the in_progress hash table. As long as the op's +state is not "canceled" or "given up", its state is set to "serviced". +The file_operations.write_iter function returns to the waiting vfs, +and back to service_operation through wait_for_matching_downcall. + +Service operation returns to its caller with the op's downcall +part (the response to the upcall) filled out. + +The "client-core" is the bridge between the kernel module and +userspace. The client-core is a daemon. The client-core has an +associated watchdog daemon. If the client-core is ever signaled +to die, the watchdog daemon restarts the client-core. Even though +the client-core is restarted "right away", there is a period of +time during such an event that the client-core is dead. A dead client-core +can't be triggered by the Orangefs file_operations.poll function. +Ops that pass through service_operation during a "dead spell" can timeout +on the wait queue and one attempt is made to recycle them. Obviously, +if the client-core stays dead too long, the arbitrary userspace processes +trying to use Orangefs will be negatively affected. Waiting ops +that can't be serviced will be removed from the request list and +have their states set to "given up". In-progress ops that can't +be serviced will be removed from the in_progress hash table and +have their states set to "given up". + +Readdir and I/O ops are atypical with respect to their payloads. + + - readdir ops use the smaller of the two pre-allocated pre-partitioned + memory buffers. The readdir buffer is only available to userspace. + The kernel module obtains an index to a free partition before launching + a readdir op. Userspace deposits the results into the indexed partition + and then writes them to back to the pvfs device. + + - io (read and write) ops use the larger of the two pre-allocated + pre-partitioned memory buffers. The IO buffer is accessible from + both userspace and the kernel module. The kernel module obtains an + index to a free partition before launching an io op. The kernel module + deposits write data into the indexed partition, to be consumed + directly by userspace. Userspace deposits the results of read + requests into the indexed partition, to be consumed directly + by the kernel module. + +Responses to kernel requests are all packaged in pvfs2_downcall_t +structs. Besides a few other members, pvfs2_downcall_t contains a +union of structs, each of which is associated with a particular +response type. + +The several members outside of the union are: + - int32_t type - type of operation. + - int32_t status - return code for the operation. + - int64_t trailer_size - 0 unless readdir operation. + - char *trailer_buf - initialized to NULL, used during readdir operations. + +The appropriate member inside the union is filled out for any +particular response. + + PVFS2_VFS_OP_FILE_IO + fill a pvfs2_io_response_t + + PVFS2_VFS_OP_LOOKUP + fill a PVFS_object_kref + + PVFS2_VFS_OP_CREATE + fill a PVFS_object_kref + + PVFS2_VFS_OP_SYMLINK + fill a PVFS_object_kref + + PVFS2_VFS_OP_GETATTR + fill in a PVFS_sys_attr_s (tons of stuff the kernel doesn't need) + fill in a string with the link target when the object is a symlink. + + PVFS2_VFS_OP_MKDIR + fill a PVFS_object_kref + + PVFS2_VFS_OP_STATFS + fill a pvfs2_statfs_response_t with useless info <g>. It is hard for + us to know, in a timely fashion, these statistics about our + distributed network filesystem. + + PVFS2_VFS_OP_FS_MOUNT + fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref + except its members are in a different order and "__pad1" is replaced + with "id". + + PVFS2_VFS_OP_GETXATTR + fill a pvfs2_getxattr_response_t + + PVFS2_VFS_OP_LISTXATTR + fill a pvfs2_listxattr_response_t + + PVFS2_VFS_OP_PARAM + fill a pvfs2_param_response_t + + PVFS2_VFS_OP_PERF_COUNT + fill a pvfs2_perf_count_response_t + + PVFS2_VFS_OP_FSKEY + file a pvfs2_fs_key_response_t + + PVFS2_VFS_OP_READDIR + jamb everything needed to represent a pvfs2_readdir_response_t into + the readdir buffer descriptor specified in the upcall. + +Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests +made by the kernel side. + +A buffer_list containing: + - a pointer to the prepared response to the request from the + kernel (struct pvfs2_downcall_t). + - and also, in the case of a readdir request, a pointer to a + buffer containing descriptors for the objects in the target + directory. +... is sent to the function (PINT_dev_write_list) which performs +the writev. + +PINT_dev_write_list has a local iovec array: struct iovec io_array[10]; + +The first four elements of io_array are initialized like this for all +responses: + + io_array[0].iov_base = address of local variable "proto_ver" (int32_t) + io_array[0].iov_len = sizeof(int32_t) + + io_array[1].iov_base = address of global variable "pdev_magic" (int32_t) + io_array[1].iov_len = sizeof(int32_t) + + io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t) + io_array[2].iov_len = sizeof(int64_t) + + io_array[3].iov_base = address of out_downcall member (pvfs2_downcall_t) + of global variable vfs_request (vfs_request_t) + io_array[3].iov_len = sizeof(pvfs2_downcall_t) + +Readdir responses initialize the fifth element io_array like this: + + io_array[4].iov_base = contents of member trailer_buf (char *) + from out_downcall member of global variable + vfs_request + io_array[4].iov_len = contents of member trailer_size (PVFS_size) + from out_downcall member of global variable + vfs_request + + diff --git a/Documentation/filesystems/vfat.txt b/Documentation/filesystems/vfat.txt index 223c32171dcc..cf51360e3a9f 100644 --- a/Documentation/filesystems/vfat.txt +++ b/Documentation/filesystems/vfat.txt @@ -56,9 +56,10 @@ iocharset=<name> -- Character set to use for converting between the you should consider the following option instead. utf8=<bool> -- UTF-8 is the filesystem safe version of Unicode that - is used by the console. It can be enabled for the - filesystem with this option. If 'uni_xlate' gets set, - UTF-8 gets disabled. + is used by the console. It can be enabled or disabled + for the filesystem with this option. + If 'uni_xlate' gets set, UTF-8 gets disabled. + By default, FAT_DEFAULT_UTF8 setting is used. uni_xlate=<bool> -- Translate unhandled Unicode characters to special escaped sequences. This would let you backup and |