author | Miklos Szeredi <miklos@szeredi.hu> | 2005-09-09 13:10:27 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@g5.osdl.org> | 2005-09-09 14:03:44 -0700 |
commit | 334f485df85ac7736ebe14940bf0a059c5f26d7d | |
tree | 754e5528289048a7104f4c1b431cebc1df16e2ce /Documentation/filesystems | |
parent | d8a5ba45457e4a22aa39c939121efd7bb6c76672 | |
[PATCH] FUSE - device functions
This adds the FUSE device handling functions.
This contains the following files:
o dev.c
- fuse device operations (read, write, release, poll)
- registers misc device
- support for sending requests to userspace
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- | Documentation/filesystems/fuse.txt | 341 |
1 files changed, 341 insertions, 0 deletions
diff --git a/Documentation/filesystems/fuse.txt b/Documentation/filesystems/fuse.txt
new file mode 100644
index 000000000000..83f96cf56960
--- /dev/null
+++ b/Documentation/filesystems/fuse.txt
@@ -0,0 +1,341 @@
+Definitions
+~~~~~~~~~~~
+
+Userspace filesystem:
+
+  A filesystem in which data and metadata are provided by an ordinary
+  userspace process. The filesystem can be accessed normally through
+  the kernel interface.
+
+Filesystem daemon:
+
+  The process(es) providing the data and metadata of the filesystem.
+
+Non-privileged mount (or user mount):
+
+  A userspace filesystem mounted by a non-privileged (non-root) user.
+  The filesystem daemon is running with the privileges of the mounting
+  user. NOTE: this is not the same as mounts allowed with the "user"
+  option in /etc/fstab, which is not discussed here.
+
+Mount owner:
+
+  The user who does the mounting.
+
+User:
+
+  The user who is performing filesystem operations.
+
+What is FUSE?
+~~~~~~~~~~~~~
+
+FUSE is a userspace filesystem framework. It consists of a kernel
+module (fuse.ko), a userspace library (libfuse.*) and a mount utility
+(fusermount).
+
+One of the most important features of FUSE is allowing secure,
+non-privileged mounts. This opens up new possibilities for the use of
+filesystems. A good example is sshfs: a secure network filesystem
+using the sftp protocol.
+
+The userspace library and utilities are available from the FUSE
+homepage:
+
+  http://fuse.sourceforge.net/
+
+Mount options
+~~~~~~~~~~~~~
+
+'fd=N'
+
+  The file descriptor to use for communication between the userspace
+  filesystem and the kernel. The file descriptor must have been
+  obtained by opening the FUSE device ('/dev/fuse').
+
+'rootmode=M'
+
+  The file mode of the filesystem's root in octal representation.
+
+'user_id=N'
+
+  The numeric user id of the mount owner.
+
+'group_id=N'
+
+  The numeric group id of the mount owner.
+
+'default_permissions'
+
+  By default FUSE doesn't check file access permissions; the
+  filesystem is free to implement its access policy or leave it to
+  the underlying file access mechanism (e.g. in case of network
+  filesystems). This option enables permission checking, restricting
+  access based on file mode. This option is usually useful
+  together with the 'allow_other' mount option.
+
+'allow_other'
+
+  This option overrides the security measure restricting file access
+  to the user mounting the filesystem. This option is by default only
+  allowed to root, but this restriction can be removed with a
+  (userspace) configuration option.
+
+'kernel_cache'
+
+  This option disables flushing the cache of the file contents on
+  every open(). This should only be enabled on filesystems where the
+  file data is never changed externally (not through the mounted FUSE
+  filesystem). Thus it is not suitable for network filesystems and
+  other "intermediate" filesystems.
+
+  NOTE: if neither this option nor 'direct_io' is specified, data
+  is still cached after the open(), so a read() system call will not
+  always initiate a read operation.
+
+'direct_io'
+
+  This option disables the use of page cache (file content cache) in
+  the kernel for this filesystem. This has several effects:
+
+  - Each read() or write() system call will initiate one or more
+    read or write operations; data will not be cached in the
+    kernel.
+
+  - The return value of the read() and write() system calls will
+    correspond to the return values of the read and write
+    operations. This is useful for example if the file size is not
+    known in advance (before reading it).
+
+'max_read=N'
+
+  With this option the maximum size of read operations can be set.
+  The default is infinite. Note that the size of read requests is
+  limited anyway to 32 pages (which is 128kbyte on i386).
+
+How do non-privileged mounts work?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Since the mount() system call is a privileged operation, a helper
+program (fusermount) is needed, which is installed setuid root.
+
+The implication of providing non-privileged mounts is that the mount
+owner must not be able to use this capability to compromise the
+system. Obvious requirements arising from this are:
+
+ A) mount owner should not be able to get elevated privileges with the
+    help of the mounted filesystem
+
+ B) mount owner should not get illegitimate access to information from
+    other users' and the super user's processes
+
+ C) mount owner should not be able to induce undesired behavior in
+    other users' or the super user's processes
+
+How are requirements fulfilled?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ A) The mount owner could gain elevated privileges by either:
+
+    1) creating a filesystem containing a device file, then opening
+       this device
+
+    2) creating a filesystem containing a suid or sgid application,
+       then executing this application
+
+    The solution is to disallow opening device files and to ignore
+    setuid and setgid bits when executing programs. To ensure this,
+    fusermount always adds "nosuid" and "nodev" to the mount options
+    for non-privileged mounts.
+
+ B) If another user is accessing files or directories in the
+    filesystem, the filesystem daemon serving requests can record the
+    exact sequence and timing of operations performed. This
+    information is otherwise inaccessible to the mount owner, so this
+    counts as an information leak.
+
+    The solution to this problem will be presented in point 2) of C).
+
+ C) There are several ways in which the mount owner can induce
+    undesired behavior in other users' processes, such as:
+
+     1) mounting a filesystem over a file or directory which the mount
+        owner would otherwise not be able to modify (or could only
+        make limited modifications).
+
+        This is solved in fusermount, by checking the access
+        permissions on the mountpoint and only allowing the mount if
+        the mount owner can do unlimited modification (has write
+        access to the mountpoint, and mountpoint is not a "sticky"
+        directory).
+
+     2) Even if 1) is solved the mount owner can change the behavior
+        of other users' processes.
+
+         i) It can slow down or indefinitely delay the execution of a
+            filesystem operation creating a DoS against the user or the
+            whole system. For example a suid application locking a
+            system file, and then accessing a file on the mount owner's
+            filesystem could be stopped, thus causing the system
+            file to be locked forever.
+
+         ii) It can present files or directories of unlimited length, or
+             directory structures of unlimited depth, possibly causing a
+             system process to eat up diskspace, memory or other
+             resources, again causing DoS.
+
+        The solution to this as well as B) is to deny filesystem
+        access to processes that the mount owner could not otherwise
+        monitor or manipulate. Since a mount owner who can ptrace a
+        process can do all of the above to it without using a FUSE
+        mount, the same criteria as used in ptrace can be used to
+        check if a process is allowed to access the filesystem or
+        not.
+
+        Note that the ptrace check is not strictly necessary to
+        prevent C/2/i; it is enough to check if the mount owner has
+        enough privilege to send a signal to the process accessing the
+        filesystem, since SIGSTOP can be used to get a similar effect.
+
+I think these limitations are unacceptable?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If a sysadmin trusts the users enough, or can ensure through other
+measures that system processes will never enter non-privileged
+mounts, they can relax the last limitation with a "user_allow_other"
+config option. If this config option is set, the mounting user can
+add the "allow_other" mount option, which disables the check for other
+users' processes.
+
+Kernel - userspace interface
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following diagram shows how a filesystem operation (in this
+example unlink) is performed in FUSE.
+
+NOTE: everything in this description is greatly simplified
+
+ |  "rm /mnt/fuse/file"              |  FUSE filesystem daemon
+ |                                    |
+ |                                    |  >sys_read()
+ |                                    |    >fuse_dev_read()
+ |                                    |      >request_wait()
+ |                                    |        [sleep on fc->waitq]
+ |                                    |
+ |  >sys_unlink()                     |
+ |    >fuse_unlink()                  |
+ |      [get request from             |
+ |       fc->unused_list]             |
+ |      >request_send()               |
+ |        [queue req on fc->pending]  |
+ |        [wake up fc->waitq]         |        [woken up]
+ |        >request_wait_answer()      |
+ |          [sleep on req->waitq]     |
+ |                                    |      <request_wait()
+ |                                    |      [remove req from fc->pending]
+ |                                    |      [copy req to read buffer]
+ |                                    |      [add req to fc->processing]
+ |                                    |    <fuse_dev_read()
+ |                                    |  <sys_read()
+ |                                    |
+ |                                    |  [perform unlink]
+ |                                    |
+ |                                    |  >sys_write()
+ |                                    |    >fuse_dev_write()
+ |                                    |      [look up req in fc->processing]
+ |                                    |      [remove from fc->processing]
+ |                                    |      [copy write buffer to req]
+ |          [woken up]                |      [wake up req->waitq]
+ |                                    |    <fuse_dev_write()
+ |                                    |  <sys_write()
+ |        <request_wait_answer()      |
+ |      <request_send()               |
+ |      [add request to               |
+ |       fc->unused_list]             |
+ |    <fuse_unlink()                  |
+ |  <sys_unlink()                     |
+
+There are a couple of ways in which to deadlock a FUSE filesystem.
+Since we are talking about unprivileged userspace programs,
+something must be done about these.
+
+Scenario 1 - Simple deadlock
+-----------------------------
+
+ |  "rm /mnt/fuse/file"              |  FUSE filesystem daemon
+ |                                    |
+ |  >sys_unlink("/mnt/fuse/file")     |
+ |    [acquire inode semaphore        |
+ |     for "file"]                    |
+ |    >fuse_unlink()                  |
+ |      [sleep on req->waitq]         |
+ |                                    |  <sys_read()
+ |                                    |  >sys_unlink("/mnt/fuse/file")
+ |                                    |    [acquire inode semaphore
+ |                                    |     for "file"]
+ |                                    |    *DEADLOCK*
+
+The solution for this is to allow requests to be interrupted while
+they are in userspace:
+
+ |      [interrupted by signal]       |
+ |    <fuse_unlink()                  |
+ |    [release semaphore]             |  [semaphore acquired]
+ |  <sys_unlink()                     |
+ |                                    |  >fuse_unlink()
+ |                                    |    [queue req on fc->pending]
+ |                                    |    [wake up fc->waitq]
+ |                                    |    [sleep on req->waitq]
+
+If the filesystem daemon is single threaded, this will stop here,
+since there's no other thread to dequeue and execute the request.
+In this case the solution is to kill the FUSE daemon as well. If
+there are multiple serving threads, you just have to kill them as
+long as any remain.
+
+Moral: a filesystem which deadlocks can soon find itself dead.
+
+Scenario 2 - Tricky deadlock
+----------------------------
+
+This one needs a carefully crafted filesystem. It's a variation on
+the above, only the call back to the filesystem is not explicit,
+but is caused by a pagefault.
+
+ |  Kamikaze filesystem thread 1      |  Kamikaze filesystem thread 2
+ |                                    |
+ |  [fd = open("/mnt/fuse/file")]     |  [request served normally]
+ |  [mmap fd to 'addr']               |
+ |  [close fd]                        |  [FLUSH triggers 'magic' flag]
+ |  [read a byte from addr]           |
+ |    >do_page_fault()                |
+ |      [find or create page]         |
+ |      [lock page]                   |
+ |      >fuse_readpage()              |
+ |        [queue READ request]        |
+ |        [sleep on req->waitq]       |
+ |                                    |  [read request to buffer]
+ |                                    |  [create reply header before addr]
+ |                                    |  >sys_write(addr - headerlength)
+ |                                    |    >fuse_dev_write()
+ |                                    |      [look up req in fc->processing]
+ |                                    |      [remove from fc->processing]
+ |                                    |      [copy write buffer to req]
+ |                                    |        >do_page_fault()
+ |                                    |         [find or create page]
+ |                                    |         [lock page]
+ |                                    |      * DEADLOCK *
+
+Solution is again to let the request be interrupted (not
+elaborated further).
+
+An additional problem is that while the write buffer is being
+copied to the request, the request must not be interrupted. This
+is because the destination address of the copy may not be valid
+after the request is interrupted.
+
+This is solved by doing the copy atomically, and allowing
+interruption while the page(s) belonging to the write buffer are
+faulted with get_user_pages(). The 'req->locked' flag indicates
+when the copy is taking place, and interruption is delayed until
+this flag is unset.
+
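
Note on the 'req->locked' rule described at the end of fuse.txt: the
following is only a toy, single-threaded model of "interruption is delayed
until this flag is unset", written in Python for readability. It is not
the kernel code from dev.c. The class ToyRequest and its methods
begin_copy(), end_copy() and interrupt() are hypothetical names invented
for this sketch; only the idea of a 'locked' flag comes from the text
above.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a request; not the kernel structure."""

    locked: bool = False       # True while the buffer copy is in progress
    interrupted: bool = False  # an interruption has been requested
    finished: bool = False     # the interruption has been acted upon

    def begin_copy(self) -> bool:
        """Start the copy; refuse if the request was already interrupted."""
        if self.interrupted:
            return False
        self.locked = True
        return True

    def end_copy(self) -> None:
        """End the copy and act on any interruption that arrived meanwhile."""
        self.locked = False
        if self.interrupted:
            self.finished = True

    def interrupt(self) -> None:
        """Record the interruption; act on it now only if no copy is running."""
        self.interrupted = True
        if not self.locked:
            self.finished = True


# Usage: an interruption arriving during the copy is deferred until the end.
req = ToyRequest()
req.begin_copy()
req.interrupt()   # recorded, but not acted upon while 'locked' is set
req.end_copy()    # the deferred interruption takes effect here
```

In this model the destination buffer stays valid for the whole copy because
interrupt() never completes the request while 'locked' is set, which is the
property the last two paragraphs of the document ask for. Real kernel code
would additionally need locking around the flags, which this sketch omits.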