path: root/fs
Commit message (Author, Age, Files, Lines)
* NFSv4: remove duplicate clientid in struct nfs_client (Andy Adamson, 2011-03-10, 1 file, -2/+2)
    Signed-off-by: Andy Adamson <andros@netapp.com>
    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* NFSv4.1: Retry CREATE_SESSION on NFS4ERR_DELAY (Ricardo Labiaga, 2011-03-10, 1 file, -1/+11)
    Fix a bug where we retry the EXCHANGEID call even though we already have a valid clientid. Instead, delay and retry the CREATE_SESSION call.

    Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* (try3-resend) Fix nfs_compat_user_ino64 so it doesn't cause problems if bit 31 or 63 are set in fileid (Frank Filz, 2011-03-10, 1 file, -1/+6)
    The problem was the use of an int32, which is sign-extended when converted to a uint64, resulting in a fileid that doesn't fit in 32 bits even though the intent of the function is to fit the fileid into 32 bits.

    Signed-off-by: Frank Filz <ffilzlnx@us.ibm.com>
    Reviewed-by: Jeff Layton <jlayton@redhat.com>
    [Trond: Added an include for compat.h]
    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
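The sign-extension problem described above can be reproduced in user space. The following is a minimal sketch, not the kernel function itself: `fold_fileid_signed` and `fold_fileid_unsigned` are hypothetical names, and the XOR fold of the high and low halves is an assumption made for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical buggy variant: the folded value passes through a signed
 * 32-bit temporary. If bit 31 is set, the value is negative, and widening
 * it to 64 bits sign-extends it. (The narrowing cast to int32_t is
 * implementation-defined in C, but wraps on mainstream compilers.)
 */
static uint64_t fold_fileid_signed(uint64_t fileid)
{
    int32_t ino = (int32_t)(uint32_t)(fileid ^ (fileid >> 32));
    return (uint64_t)ino; /* upper 32 bits become all ones if ino < 0 */
}

/* Hypothetical fixed variant: an unsigned temporary zero-extends instead. */
static uint64_t fold_fileid_unsigned(uint64_t fileid)
{
    uint32_t ino = (uint32_t)(fileid ^ (fileid >> 32));
    return ino; /* always fits in 32 bits */
}
```

With a fileid of `0x80000000`, the signed variant yields `0xFFFFFFFF80000000`, while the unsigned variant keeps `0x80000000`.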
* nfs: fix compilation warning (Jovi Zhang, 2011-03-10, 1 file, -1/+1)
    This commit fixes the following compilation warning:

      linux-2.6/fs/nfs/nfs4proc.c:3265: warning: comparison of distinct pointer types lacks a cast

    Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* nfs: add kmalloc return value check in decode_and_add_ds (Stanislav Fomichev, 2011-03-10, 1 file, -0/+4)
    Add a kmalloc return value check in decode_and_add_ds.

    Signed-off-by: Stanislav Fomichev <kernel@fomichev.me>
    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* nfs: close NFSv4 COMMIT vs. CLOSE race (Jeff Layton, 2011-03-10, 1 file, -0/+2)
    I've been adding in more artificial delays in the NFSv4 commit and close codepaths to uncover races. The kernel I'm testing has the patch to close the race in __rpc_wait_for_completion_task that's in Trond's cthon2011 branch.

    The reproducer I've been using does this in a loop:

      mkdir("DIR");
      fd = open("DIR/FILE", O_WRONLY|O_CREAT|O_EXCL, 0644);
      write(fd, "abcdefg", 7);
      close(fd);
      unlink("DIR/FILE");
      rmdir("DIR");

    The above reproducer shouldn't result in any silly-renaming. However, when I add a "msleep(100)" just after the nfs_commit_clear_lock call in nfs_commit_release, I can almost always force one to occur. If I can force it to occur with that, then it can happen without that delay given the right timing.

    nfs_commit_inode waits for the NFS_INO_COMMIT bit to clear when called with FLUSH_SYNC set. nfs_commit_rpcsetup on the other hand does not wait for the task to complete before putting its reference to it, so the last reference gets put in rpc_release task and gets queued to a workqueue.

    In this situation, the last open context reference may be put by the COMMIT release instead of the close() syscall. The close() syscall returns too quickly and the unlink runs while the d_count is still high since the COMMIT release hasn't put its dentry reference yet.

    Fix this by having nfs_commit_rpcsetup wait for the RPC call to complete before putting the task reference when FLUSH_SYNC is set. With this, the last reference is put by the process that's initiating the FLUSH_SYNC commit and the race is closed.

    Signed-off-by: Jeff Layton <jlayton@redhat.com>
    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
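The reproducer loop in the message is abbreviated (for example, `mkdir()` is shown without a mode). A compilable version of one iteration might look like the sketch below. The function name `reproduce_once`, the directory mode `0755`, and the error handling are assumptions added for illustration; the sequence of calls follows the message.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* One iteration of the reproducer: returns 0 on success, -1 if a step fails. */
static int reproduce_once(const char *dir)
{
    char path[4096];
    int fd;
    int n = snprintf(path, sizeof(path), "%s/FILE", dir);

    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;
    if (mkdir(dir, 0755) != 0)
        return -1;

    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0) {
        rmdir(dir);
        return -1;
    }
    if (write(fd, "abcdefg", 7) != 7) {
        close(fd);
        unlink(path);
        rmdir(dir);
        return -1;
    }
    if (close(fd) != 0 || unlink(path) != 0)
        return -1;

    /*
     * If a silly-renamed .nfs* file is still present in the directory,
     * rmdir() can fail because the directory is not empty.
     */
    return rmdir(dir) == 0 ? 0 : -1;
}
```

To exercise the race, the function would be called in a loop with `dir` pointing into an NFSv4 mount; on a local filesystem every iteration should simply succeed.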
* SUNRPC: Close a race in __rpc_wait_for_completion_task() (Trond Myklebust, 2011-03-10, 2 files, -3/+3)
    Although they run as rpciod background tasks, under normal operation (i.e. no SIGKILL), functions like nfs_sillyrename(), nfs4_proc_unlck() and nfs4_do_close() want to be fully synchronous. This means that when we exit, we want all references to the rpc_task to be gone, and we want any dentry references etc. held by that task to be released.

    For this reason these functions call __rpc_wait_for_completion_task(), followed by rpc_put_task() in the expectation that the latter will be releasing the last reference to the rpc_task, and thus ensuring that the callback_ops->rpc_release() has been called synchronously.

    This patch fixes a race which exists due to the fact that rpciod calls rpc_complete_task() (in order to wake up the callers of __rpc_wait_for_completion_task()) and then subsequently calls rpc_put_task() without ensuring that these two steps are done atomically.

    In order to avoid adding new spin locks, the patch uses the existing waitqueue spin lock to order the rpc_task reference count releases between the waiting process and rpciod.

    The common case where nobody is waiting for completion is optimised for by checking if the RPC_TASK_ASYNC flag is cleared and/or if the rpc_task reference count is 1: in those cases we drop trying to grab the spin lock, and immediately free up the rpc_task.

    Those few processes that need to put the rpc_task from inside an asynchronous context and that do not care about ordering are given a new helper: rpc_put_task_async().

    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client (Linus Torvalds, 2011-03-05, 3 files, -27/+4)
|\
    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
      ceph: no .snap inside of snapped namespace
      libceph: fix msgr standby handling
      libceph: fix msgr keepalive flag
      libceph: fix msgr backoff
      libceph: retry after authorization failure
      libceph: fix handling of short returns from get_user_pages
      ceph: do not clear I_COMPLETE from d_release
      ceph: do not set I_COMPLETE
      Revert "ceph: keep reference to parent inode on ceph_dentry"
| * ceph: no .snap inside of snapped namespace (Sage Weil, 2011-03-04, 1 file, -0/+1)
    Otherwise you can do things like

      # mkdir .snap/foo
      # cd .snap/foo/.snap
      # ls
      <badness>

    Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: do not clear I_COMPLETE from d_release (Sage Weil, 2011-03-03, 1 file, -21/+1)
    First, this was racy anyway: d_release isn't called until well after the dentry is unhashed. Second, this runs afoul of the recent dcache change that clears d_parent prior to calling d_release (949854d0), causing a NULL pointer dereference.

    Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: do not set I_COMPLETE (Sage Weil, 2011-03-03, 2 files, -2/+2)
    Do not set the I_COMPLETE flag on directories until we resolve races with dcache pruning.

    Signed-off-by: Sage Weil <sage@newdream.net>
| * Revert "ceph: keep reference to parent inode on ceph_dentry" (Sage Weil, 2011-03-03, 2 files, -5/+1)
    This reverts commit 97d79b403ef03f729883246208ef5d8a2ebc4d68.

    This fails to account for d_parent changes due to rename or disconnected dentries due to submounts or NFS reexports.

    Signed-off-by: Sage Weil <sage@newdream.net>
* | nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v3) (Neil Horman, 2011-03-04, 1 file, -2/+42)
    The "bad_page()" page allocator sanity check was reported recently (call chain as follows):

      bad_page+0x69/0x91
      free_hot_cold_page+0x81/0x144
      skb_release_data+0x5f/0x98
      __kfree_skb+0x11/0x1a
      tcp_ack+0x6a3/0x1868
      tcp_rcv_established+0x7a6/0x8b9
      tcp_v4_do_rcv+0x2a/0x2fa
      tcp_v4_rcv+0x9a2/0x9f6
      do_timer+0x2df/0x52c
      ip_local_deliver+0x19d/0x263
      ip_rcv+0x539/0x57c
      netif_receive_skb+0x470/0x49f
      :virtio_net:virtnet_poll+0x46b/0x5c5
      net_rx_action+0xac/0x1b3
      __do_softirq+0x89/0x133
      call_softirq+0x1c/0x28
      do_softirq+0x2c/0x7d
      do_IRQ+0xec/0xf5
      default_idle+0x0/0x50
      ret_from_intr+0x0/0xa
      default_idle+0x29/0x50
      cpu_idle+0x95/0xb8
      start_kernel+0x220/0x225
      _sinittext+0x22f/0x236

    It occurs because an skb with a fraglist was freed from the tcp retransmit queue when it was acked, but a page on that fraglist had PG_slab set (indicating it was allocated from the slab allocator), which means the free path above can't safely free it via put_page.

    We tracked this back to an nfsv4 setacl operation, in which the nfs code attempted to convert the passed-in buffer to an array of pages in __nfs4_proc_set_acl, which gets used by the skb->frags list in xs_sendpages. __nfs4_proc_set_acl just converts each page in the buffer to a page struct via virt_to_page, but the vfs allocates the buffer via kmalloc, meaning the PG_slab bit is set. We can't create a buffer with kmalloc and free it later in the tcp ack path with put_page, so we need to either:

      1) ensure that when we create the list of pages, no page struct has PG_slab set, or
      2) not use a page list to send this data

    Given that these buffers can be multiple pages and arbitrarily sized, I think (1) is the right way to go. I've written the below patch to allocate a page from the buddy allocator directly and copy the data over to it. This ensures that we have a put_page free-able page for every entry that winds up on an skb frag list, so it can be safely freed when the frame is acked. We do a put_page on each entry after the rpc_call_sync call so as to drop our own reference count to the page, leaving only the ref count taken by tcp_sendpages. This way the data will be properly freed when the ack comes in.

    Successfully tested by myself to solve the above oops.

    Note, as this is the result of a setacl operation that exceeded a page of data, I think this amounts to a local DoS triggerable by an unprivileged user, so I'm CCing security on this as well.

    Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
    CC: Trond Myklebust <Trond.Myklebust@netapp.com>
    CC: security@kernel.org
    CC: Jeff Layton <jlayton@redhat.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge branch 'i_nlink' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 (Linus Torvalds, 2011-03-03, 8 files, -76/+28)
|\ \
    * 'i_nlink' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
      hfs: fix rename() over non-empty directory
      udf: fix i_nlink limit
      fix reiserfs mkdir() breakage
      exofs: i_nlink races in rename()
      nilfs2: i_nlink races in rename()
      minix: i_nlink races in rename()
      ufs: i_nlink races in rename()
      sysv: i_nlink races in rename()
| * | hfs: fix rename() over non-empty directory (Al Viro, 2011-03-03, 1 file, -37/+13)
    Merge hfs_unlink() and hfs_rmdir(), while we are at it.

    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | udf: fix i_nlink limit (Al Viro, 2011-03-03, 1 file, -6/+5)
    (256 << sizeof(x)) - 1 is not the maximal possible value of x... In reality, the maximal allowed value for UDF FileLinkCount is 65535.

    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
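The arithmetic behind this remark is easy to check in user space. The sketch below assumes a 16-bit link count field, which is consistent with the stated maximum of 65535; the function names are hypothetical.

```c
#include <limits.h>
#include <stdint.h>

/* The expression quoted in the commit message, applied to a 16-bit x. */
static unsigned long shifted_limit(void)
{
    uint16_t x = 0;
    /* sizeof(x) is 2 (bytes), so this is (256 << 2) - 1, not the maximum. */
    return (256UL << sizeof(x)) - 1;
}

/* The largest value a 16-bit unsigned field can actually hold. */
static unsigned long full_width_limit(void)
{
    /* Shift by the width in bits rather than the size in bytes. */
    return (1UL << (CHAR_BIT * sizeof(uint16_t))) - 1;
}
```

Here `shifted_limit()` evaluates to 1023, while `full_width_limit()` gives 65535, matching `UINT16_MAX`.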
| * | fix reiserfs mkdir() breakage (Al Viro, 2011-03-03, 1 file, -1/+1)
    If a directory has so many subdirectories that its link count is set to 1 (i.e. "can't tell accurately") and reiserfs_new_inode() fails, we shouldn't decrement the parent's link count in the cleanup path; that's what DEC_DIR_INODE_NLINK() is for. As it is, we end up with the parent suddenly getting zero i_nlink, with very unpleasant effects.

    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | exofs: i_nlink races in rename() (Al Viro, 2011-03-03, 1 file, -6/+2)
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | nilfs2: i_nlink races in rename() (Al Viro, 2011-03-03, 1 file, -7/+1)
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | minix: i_nlink races in rename() (Al Viro, 2011-03-03, 1 file, -6/+2)
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | ufs: i_nlink races in rename() (Al Viro, 2011-03-03, 1 file, -7/+2)
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | sysv: i_nlink races in rename() (Al Viro, 2011-03-03, 1 file, -6/+2)
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs (Linus Torvalds, 2011-03-03, 1 file, -3/+8)
|\ \ \
    * 'for-linus' of git://oss.sgi.com/xfs/xfs:
      xfs: zero proper structure size for geometry calls
| * | | xfs: zero proper structure size for geometry calls (Alex Elder, 2011-03-01, 1 file, -3/+8)
    Commit 493f3358cb289ccf716c5a14fa5bb52ab75943e5 added this call to xfs_fs_geometry() in order to avoid passing kernel stack data back to user space:

      + memset(geo, 0, sizeof(*geo));

    Unfortunately, one of the callers of that function passes the address of a smaller data type, cast to fit the type that xfs_fs_geometry() requires. As a result, this can happen:

      Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: f87aca93
      Pid: 262, comm: xfs_fsr Not tainted 2.6.38-rc6-493f3358cb2+ #1
      Call Trace:
       [<c12991ac>] ? panic+0x50/0x150
       [<c102ed71>] ? __stack_chk_fail+0x10/0x18
       [<f87aca93>] ? xfs_ioc_fsgeometry_v1+0x56/0x5d [xfs]

    Fix this by fixing that one caller to pass the right type and then copy out the subset it is interested in.

    Note: This patch is an alternative to one originally proposed by Eric Sandeen.

    Reported-by: Jeffrey Hundstad <jeffrey.hundstad@mnsu.edu>
    Signed-off-by: Alex Elder <aelder@sgi.com>
    Reviewed-by: Eric Sandeen <sandeen@redhat.com>
    Tested-by: Jeffrey Hundstad <jeffrey.hundstad@mnsu.edu>
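The pattern behind the fix (fill a full-size structure, then copy out only the leading subset the caller wants) can be sketched in user space as follows. The structures `geom_v1` and `geom_full`, their fields, and the function names are hypothetical stand-ins, not the XFS definitions. The sketch also assumes the small structure is a layout-compatible prefix of the large one.

```c
#include <string.h>

/* Hypothetical small (v1) and full geometry structures. */
struct geom_v1 {
    unsigned int blocksize;
    unsigned int agcount;
};

struct geom_full {
    unsigned int blocksize;
    unsigned int agcount;
    unsigned int logsunit;
    unsigned int dirblocksize;
};

/* Zeroes and fills the FULL structure, so callers must pass a full one. */
static void fill_geometry(struct geom_full *geo)
{
    memset(geo, 0, sizeof(*geo));
    geo->blocksize = 4096;
    geo->agcount = 4;
    geo->logsunit = 1;
    geo->dirblocksize = 4096;
}

/*
 * Safe v1 caller: use a full-size temporary and copy out only the prefix.
 * Passing (struct geom_full *)out directly would let the memset() above
 * write past the end of the smaller object.
 */
static void get_geometry_v1(struct geom_v1 *out)
{
    struct geom_full full;

    fill_geometry(&full);
    memcpy(out, &full, sizeof(*out));
}
```

The temporary costs a few bytes of stack, but it keeps the `memset()` size and the object size in agreement.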
* | | | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2 (Linus Torvalds, 2011-03-03, 1 file, -1/+2)
|\ \ \ \
| |_|/ /
|/| | |
    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
      nilfs2: fix regression that i-flag is not set on changeless checkpoints
| * | | nilfs2: fix regression that i-flag is not set on changeless checkpoints (Ryusuke Konishi, 2011-03-02, 1 file, -1/+2)
| |/ /
    According to the report from Jiro SEKIBA titled "regression in 2.6.37?" (Message-Id: <8739n8vs1f.wl%jir@sekiba.com>), on 2.6.37 and later kernels, the lscp command no longer displays the "i" flag on checkpoints that snapshot operations or garbage collection created.

    This is a regression of the nilfs2 checkpointing function, and it's critical since it broke the behavior of a part of nilfs2 applications. For instance, the snapshot manager of TimeBrowse ends up creating meaningless snapshots continuously; snapshot creation triggers another checkpoint, but applications cannot distinguish whether the new checkpoint contains meaningful changes or not without the i-flag.

    This patch fixes the regression and brings that application behavior back to normal.

    Reported-by: Jiro SEKIBA <jir@unicus.jp>
    Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
    Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
    Tested-by: Jiro SEKIBA <jir@unicus.jp>
    Cc: stable <stable@kernel.org> [2.6.37]
* | | Merge branch 'devicetree/merge' of git://git.secretlab.ca/git/linux-2.6 (Linus Torvalds, 2011-03-02, 1 file, -1/+1)
|\ \ \
    * 'devicetree/merge' of git://git.secretlab.ca/git/linux-2.6:
      of/promtree: allow DT device matching by fixing 'name' brokenness (v5)
      x86: OLPC: have prom_early_alloc BUG rather than return NULL
      of/flattree: Drop an uninteresting message to pr_debug level
      of: Add missing of_address.h to xilinx ehci driver
| * | | of/flattree: Drop an uninteresting message to pr_debug level (Paul Bolle, 2011-03-02, 1 file, -1/+1)
| |/ /
    This message looks like an error (which it isn't) when booting with a flattened device tree. Remove the message from normal kernel builds.

    Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
    Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
* | | ext2: Fix link count corruption under heavy link+rename load (Josh Hunt, 2011-03-02, 1 file, -7/+2)
|/ /
    vfs_rename_other() does not lock the renamed inode with i_mutex. Thus changing i_nlink in a non-atomic manner (which happens in ext2_rename()) can corrupt it, as reported and analyzed by Josh.

    In fact, there is no good reason to mess with i_nlink of the moved file. We did it presumably to simulate linking into the new directory and unlinking from the old one. But the practical effect of this is disputable, because fsck can possibly treat the file as being properly linked into both directories without reporting any error, which is confusing.

    So we just stop the increment-decrement games with i_nlink, which also fixes the corruption.

    CC: stable@kernel.org
    CC: Al Viro <viro@ZenIV.linux.org.uk>
    Signed-off-by: Josh Hunt <johunt@akamai.com>
    Signed-off-by: Jan Kara <jack@suse.cz>
* | fs/block_dev.c: fix new kernel-doc warning (Randy Dunlap, 2011-02-28, 1 file, -0/+1)
    Fix new kernel-doc warning in fs/block_dev.c:

      Warning(fs/block_dev.c:937): No description found for parameter 'kill_dirty'

    Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse (Linus Torvalds, 2011-02-28, 3 files, -10/+55)
|\ \
    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
      fuse: fix truncate after open
      fuse: fix hang of single threaded fuseblk filesystem
| * | fuse: fix truncate after open (Miklos Szeredi, 2011-02-25, 1 file, -2/+5)
    Commit e1181ee6 "vfs: pass struct file to do_truncate on O_TRUNC opens" broke the behavior of open(O_TRUNC|O_RDONLY) in fuse. Fuse assumed that when called from open, a truncate() will be done, not an ftruncate().

    Fix by restoring the old behavior, based on the ATTR_OPEN flag.

    Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
| * | fuse: fix hang of single threaded fuseblk filesystem (Miklos Szeredi, 2011-02-25, 2 files, -8/+50)
    Single threaded NTFS-3G could get stuck if a delayed RELEASE reply triggered a DESTROY request via path_put().

    Fix this by

      a) making RELEASE requests synchronous, whenever possible, on fuseblk filesystems
      b) if not possible (triggered by an asynchronous read/write) then do the path_put() in a separate thread with schedule_work().

    Reported-by: Oliver Neukum <oneukum@suse.de>
    Cc: stable@kernel.org
    Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* | | Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2 (Linus Torvalds, 2011-02-28, 3 files, -13/+28)
|\ \ \
    * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
      ocfs2: Check heartbeat mode for kernel stacks only
      Ocfs2/refcounttree: Fix a bug for refcounttree to writeback clusters in a right number.
      ocfs2: Fix estimate of necessary credits for mkdir
| * | | ocfs2: Check heartbeat mode for kernel stacks only (Mark Fasheh, 2011-02-20, 1 file, -8/+20)
    Commit 2c442719e90a44a6982c033d69df4aae4b167cfa added some checks for proper heartbeat mode when the o2cb stack is running. Unfortunately, it didn't take into account that a userspace stack could be running. Fix this by only doing the check if o2cb is in use. This patch allows userspace stacks to mount the fs again.

    Cc: stable@kernel.org
    Signed-off-by: Mark Fasheh <mfasheh@suse.com>
    Signed-off-by: Joel Becker <jlbec@evilplan.org>
| * | | Ocfs2/refcounttree: Fix a bug for refcounttree to writeback clusters in a right number. (Tristan Ye, 2011-02-20, 1 file, -2/+5)
    The current refcounttree code didn't actually write the new pages out in write-back mode, because it always passed a cluster count of zero to 'ocfs2_cow_sync_writeback'. The patch passes the proper count instead.

    Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
    Cc: stable@kernel.org
    Signed-off-by: Joel Becker <jlbec@evilplan.org>
| * | | ocfs2: Fix estimate of necessary credits for mkdir (Jan Kara, 2011-02-20, 1 file, -3/+3)
    In the rare case that the INLINE_DATA, INDEX_DIR, QUOTA and XATTR features are disabled, and both the allocation of the directory inode and the allocation of the first directory block need to relink an allocation group, there may not be enough credits reserved in the transaction. Fix the estimate.

    CC: Mark Fasheh <mfasheh@suse.de>
    Signed-off-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Joel Becker <jlbec@evilplan.org>
* | | | aio: fix race between io_destroy() and io_submit() (Jan Kara, 2011-02-25, 1 file, -0/+17)
    A race can occur when io_submit() races with io_destroy():

      CPU1                                              CPU2
      io_submit()
        do_io_submit()
          ...
          ctx = lookup_ioctx(ctx_id);
                                                        io_destroy()
          Now do_io_submit() holds the last reference to ctx.
          ...
          queue new AIO
          put_ioctx(ctx) - frees ctx with active AIOs

    We solve this issue by checking whether ctx is being destroyed in the AIO submission path after adding the new AIO to ctx. Then we are guaranteed that either io_destroy() waits for the new AIO or we see that ctx is being destroyed and bail out.

    Cc: Nick Piggin <npiggin@kernel.dk>
    Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
    Signed-off-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | aio: fix rcu ioctx lookup (Nick Piggin, 2011-02-25, 1 file, -11/+24)
    aio-dio-invalidate-failure GPFs in aio_put_req from io_submit.

    lookup_ioctx doesn't implement the rcu lookup pattern properly. rcu_read_lock does not prevent the refcount going to zero, so we might take a refcount on a zero count ioctx.

    Fix the bug by atomically testing for zero refcount before incrementing.

    [jack@suse.cz: added comment into the code]
    Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
    Signed-off-by: Nick Piggin <npiggin@kernel.dk>
    Signed-off-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
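The "test for zero, then increment, atomically" step can be sketched in user space with C11 atomics. This is an illustration of the general pattern only: `struct ctx` and `ctx_try_get` are hypothetical names, and the kernel has its own primitive for this (`atomic_inc_not_zero()`) rather than the compare-and-swap loop written out here.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference-counted object found through a lockless lookup. */
struct ctx {
    atomic_int users;
};

/*
 * Take a reference only if the count is still non-zero. A plain increment
 * could resurrect an object whose last reference was already dropped.
 */
static bool ctx_try_get(struct ctx *c)
{
    int old = atomic_load(&c->users);

    while (old != 0) {
        /* On failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&c->users, &old, old + 1))
            return true;
    }
    return false; /* object is on its way to being freed */
}
```

A lookup that gets `false` back should treat the object as not found.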
* | | | ldm: corrupted partition table can cause kernel oops (Timo Warns, 2011-02-25, 1 file, -0/+5)
    The kernel automatically evaluates partition tables of storage devices. The code for evaluating LDM partitions (in fs/partitions/ldm.c) contains a bug that causes a kernel oops on certain corrupted LDM partitions. A kernel subsystem seems to crash, because, after the oops, the kernel no longer recognizes newly connected storage devices.

    The patch changes ldm_parse_vmdb() to validate the value of vblk_size.

    Signed-off-by: Timo Warns <warns@pre-sense.de>
    Cc: Eugene Teo <eugeneteo@kernel.sg>
    Acked-by: Richard Russon <ldm@flatcap.org>
    Cc: Harvey Harrison <harvey.harrison@gmail.com>
    Cc: <stable@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | epoll: prevent creating circular epoll structures (Davide Libenzi, 2011-02-25, 1 file, -0/+95)
    In several places, an epoll fd can call another file's ->f_op->poll() method with ep->mtx held. This is in general unsafe, because that other file could itself be an epoll fd that contains the original epoll fd.

    The code defends against this possibility in its own ->poll() method using ep_call_nested, but there are several other unsafe calls to ->poll elsewhere that can be made to deadlock. For example, the following simple program causes the call in ep_insert to recursively call the original fd's ->poll, leading to deadlock:

      #include <unistd.h>
      #include <sys/epoll.h>

      int main(void) {
          int e1, e2, p[2];
          struct epoll_event evt = {
              .events = EPOLLIN
          };

          e1 = epoll_create(1);
          e2 = epoll_create(2);
          pipe(p);

          epoll_ctl(e2, EPOLL_CTL_ADD, e1, &evt);
          epoll_ctl(e1, EPOLL_CTL_ADD, p[0], &evt);
          write(p[1], p, sizeof p);
          epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt);

          return 0;
      }

    On insertion, check whether the inserted file is itself a struct epoll, and if so, do a recursive walk to detect whether inserting this file would create a loop of epoll structures, which could lead to deadlock.

    [nelhage@ksplice.com: Use epmutex to serialize concurrent inserts]
    Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
    Signed-off-by: Nelson Elhage <nelhage@ksplice.com>
    Reported-by: Nelson Elhage <nelhage@ksplice.com>
    Tested-by: Nelson Elhage <nelhage@ksplice.com>
    Cc: <stable@kernel.org> [2.6.34+, possibly earlier]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
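The loop check described above amounts to a reachability test on a "watches" graph: before adding the edge "A watches B", verify that B cannot already reach A. The following user-space sketch models that idea with a small adjacency matrix. It is not the kernel implementation; `struct graph`, `reaches`, `add_watch`, and the fixed `MAX_NODES` bound are all assumptions made for illustration.

```c
#include <stdbool.h>

#define MAX_NODES 8

/* edge[a][b] is true when node a watches node b. */
struct graph {
    bool edge[MAX_NODES][MAX_NODES];
};

/* Depth-first search: can 'from' reach 'target' by following watch edges? */
static bool reaches(const struct graph *g, int from, int target, bool *visited)
{
    if (from == target)
        return true;
    if (visited[from])
        return false;
    visited[from] = true;

    for (int next = 0; next < MAX_NODES; next++) {
        if (g->edge[from][next] && reaches(g, next, target, visited))
            return true;
    }
    return false;
}

/* Add "watcher watches watched" unless that would close a loop. */
static bool add_watch(struct graph *g, int watcher, int watched)
{
    bool visited[MAX_NODES] = { false };

    if (reaches(g, watched, watcher, visited))
        return false; /* insertion would create a cycle */
    g->edge[watcher][watched] = true;
    return true;
}
```

Mapping the example program onto this model (e1 as node 0, e2 as node 1, the pipe as node 2), the first two insertions succeed and the third, adding e2 to e1, is refused because e2 already watches e1.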
* | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable (Linus Torvalds, 2011-02-25, 10 files, -57/+282)
|\ \ \ \
    * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
      Btrfs: fix fiemap bugs with delalloc
      Btrfs: set FMODE_EXCL in btrfs_device->mode
      Btrfs: make btrfs_rm_device() fail gracefully
      Btrfs: Avoid accessing unmapped kernel address
      Btrfs: Fix BTRFS_IOC_SUBVOL_SETFLAGS ioctl
      Btrfs: allow balance to explicitly allocate chunks as it relocates
      Btrfs: put ENOSPC debugging under a mount option
| * | | | Btrfs: fix fiemap bugs with delalloc (Chris Mason, 2011-02-23, 3 files, -42/+224)
    The Btrfs fiemap code wasn't properly returning delalloc extents, so applications that trust fiemap to decide if there are holes in the file see holes instead of delalloc.

    This reworks the btrfs fiemap code, adding a get_extent helper that searches for delalloc ranges and also adding a helper for extent_fiemap that skips past holes in the file.

    Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * | | | Btrfs: set FMODE_EXCL in btrfs_device->mode (Ilya Dryomov, 2011-02-16, 1 file, -1/+1)
    This fixes a bug introduced in d4d77629, where a device added online (and therefore initialized via btrfs_init_new_device()) would be left with a positive bdev->bd_holders after unmount. Since d4d77629 we no longer OR FMODE_EXCL explicitly on blkdev_put(), so set it in btrfs_device->mode.

    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * | | | Btrfs: make btrfs_rm_device() fail gracefully (Ilya Dryomov, 2011-02-16, 1 file, -2/+9)
    If the shrinking done as part of online device removal fails, add that device back to the allocation list and increment the rw_devices counter. This fixes two bugs:

      1) we could have a perfectly good device out of the alloc list for no good reason;
      2) in a btrfs consisting of two devices, a failure in btrfs_rm_device() could lead to a situation where it was impossible to remove any of the devices because of the "unable to remove the only writeable device" error.

    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * | | | Btrfs: Avoid accessing unmapped kernel address (Li Zefan, 2011-02-16, 1 file, -7/+14)
    When decompressing a chunk of data, we copy the data out to a working buffer if it is stored in more than one page; otherwise we use the mapped page directly to avoid a memory copy.

    In the latter case, there is a corner case in which we end up accessing the kernel address after we've unmapped the page.

    Reported-by: Juan Francisco Cantero Hurtado <iam@juanfra.info>
    Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
    Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * | | | Btrfs: Fix BTRFS_IOC_SUBVOL_SETFLAGS ioctl (Li Zefan, 2011-02-16, 1 file, -2/+5)
    - Check user-specified flags correctly
    - Check the inode ownership
    - Search root item in root tree but not fs tree

    Reported-by: Dan Rosenberg <drosenberg@vsecurity.com>
    Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
    Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * | | | Btrfs: allow balance to explicitly allocate chunks as it relocates (Chris Mason, 2011-02-16, 3 files, -1/+21)
    Btrfs device shrinking and balancing ends up reallocating all the blocks in order to allow COW to move them to new destinations. It is somewhat awkward in terms of ENOSPC because most of the enospc code is built around the idea that some operation on a reference counted tree triggers allocations in the non-reference counted trees.

    This commit changes the balancing code to deal with enospc by trying to allocate a new chunk. If that allocation succeeds, we go ahead and retry whatever failed due to enospc.

    Signed-off-by: Chris Mason <chris.mason@oracle.com>
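The "allocate a chunk, then retry" control flow can be sketched generically as follows. This is a toy model, not Btrfs code: `struct space`, `relocate_block`, `alloc_chunk`, the four-blocks-per-chunk figure, and the single-retry policy are all assumptions chosen to keep the example small.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy model of free space: blocks available now, plus chunks in reserve. */
struct space {
    int free_blocks;
    int spare_chunks;
};

/* Hypothetical operation that consumes one block or fails with -ENOSPC. */
static int relocate_block(struct space *s)
{
    if (s->free_blocks == 0)
        return -ENOSPC;
    s->free_blocks--;
    return 0;
}

/* Hypothetical chunk allocation: turns one spare chunk into four blocks. */
static bool alloc_chunk(struct space *s)
{
    if (s->spare_chunks == 0)
        return false;
    s->spare_chunks--;
    s->free_blocks += 4;
    return true;
}

/* On -ENOSPC, try to allocate a new chunk and retry the operation once. */
static int relocate_with_retry(struct space *s)
{
    int ret = relocate_block(s);

    if (ret == -ENOSPC && alloc_chunk(s))
        ret = relocate_block(s);
    return ret;
}
```

If no chunk can be allocated, the original `-ENOSPC` is returned unchanged, so callers still see the failure.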
| * | | | Btrfs: put ENOSPC debugging under a mount option (Chris Mason, 2011-02-16, 3 files, -2/+8)
    ENOSPC in btrfs is getting to the point where the extra debugging isn't required. I've put it under mount -o enospc_debug just in case someone is having difficult problems.

    Signed-off-by: Chris Mason <chris.mason@oracle.com>
* | | | | Merge branch 'for-linus' of git://neil.brown.name/md (Linus Torvalds, 2011-02-25, 3 files, -8/+15)
|\ \ \ \ \
    * 'for-linus' of git://neil.brown.name/md:
      md: Fix - again - partition detection when array becomes active
      Fix over-zealous flush_disk when changing device size.
      md: avoid spinlock problem in blk_throtl_exit
      md: correctly handle probe of an 'mdp' device.
      md: don't set_capacity before array is active.
      md: Fix raid1->raid0 takeover