summaryrefslogtreecommitdiffstats
path: root/net/sunrpc/sched.c
Commit message (Collapse)AuthorAgeFilesLines
* SUNRPC: Separate buffer pointers for RPC Call and Reply messagesChuck Lever2016-09-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For xprtrdma, the RPC Call and Reply buffers are involved in real I/O operations. To start with, the DMA direction of the I/O for a Call is opposite that of a Reply. In the current arrangement, the Reply buffer address is on a four-byte alignment just past the call buffer. Would be friendlier on some platforms if that was at a DMA cache alignment instead. Because the current arrangement allocates a single memory region which contains both buffers, the RPC Reply buffer often contains a page boundary in it when the Call buffer is large enough (which is frequent). It would be a little nicer for setting up DMA operations (and possible registration of the Reply buffer) if the two buffers were separated, well-aligned, and contained as few page boundaries as possible. Now, I could just pad out the single memory region used for the pair of buffers. But frequently that would mean a lot of unused space to ensure the Reply buffer did not have a page boundary. Add a separate pointer to rpc_rqst that points right to the RPC Reply buffer. This makes no difference to xprtsock, but it will help xprtrdma in subsequent patches. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* SUNRPC: Generalize the RPC buffer release APIChuck Lever2016-09-191-6/+4
| | | | | | | | | | | | | | | | | xprtrdma needs to allocate the Call and Reply buffers separately. TBH, the reliance on using a single buffer for the pair of XDR buffers is transport implementation-specific. Instead of passing just the rq_buffer into the buf_free method, pass the task structure and let buf_free take care of freeing both XDR buffers at once. There's a micro-optimization here. In the common case, both xprt_release and the transport's buf_free method were checking if rq_buffer was NULL. Now the check is done only once per RPC. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* SUNRPC: Generalize the RPC buffer allocation APIChuck Lever2016-09-191-9/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xprtrdma needs to allocate the Call and Reply buffers separately. TBH, the reliance on using a single buffer for the pair of XDR buffers is transport implementation-specific. Transports that want to allocate separate Call and Reply buffers will ignore the "size" argument anyway. Don't bother passing it. The buf_alloc method can't return two pointers. Instead, make the method's return value an error code, and set the rq_buffer pointer in the method itself. This gives call_allocate an opportunity to terminate an RPC instead of looping forever when a permanent problem occurs. If a request is just bogus, or the transport is in a state where it can't allocate resources for any request, there needs to be a way to kill the RPC right there and not loop. This immediately fixes a rare problem in the backchannel send path, which loops if the server happens to send a CB request whose call+reply size is larger than a page (which it shouldn't do yet). One more issue: looks like xprt_inject_disconnect was incorrectly placed in the failure path in call_allocate. It needs to be in the success path, as it is for other call-sites. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* SUNRPC: Reduce latency when send queue is congestedTrond Myklebust2016-06-131-10/+33
| | | | | | | Use the low latency transport workqueue to process the task that is next in line on the xprt->sending queue. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* SUNRPC: RPC transport queue must be low latencyTrond Myklebust2016-06-131-4/+20
| | | | | | | | | | | | rpciod can easily get congested due to the long list of queued rpc_tasks. Having the receive queue wait in turn for those tasks to complete can therefore be a bottleneck. Address the problem by separating the workqueues into: - rpciod: manages rpc_tasks - xprtiod: manages transport related work. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* SUNRPC: Allow caller to specify the transport to useTrond Myklebust2016-02-051-0/+2
| | | | | | | This is needed in order to allow the NFSv4.1 backchannel and BIND_CONN_TO_SESSION function to work. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* sched/wait: Fix the signal handling fixPeter Zijlstra2015-12-131-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | Jan Stancek reported that I wrecked things for him by fixing things for Vladimir :/ His report was due to an UNINTERRUPTIBLE wait getting -EINTR, which should not be possible, however my previous patch made this possible by unconditionally checking signal_pending(). We cannot use current->state as was done previously, because the instruction after the store to that variable it can be changed. We must instead pass the initial state along and use that. Fixes: 68985633bccb ("sched/wait: Fix signal handling in bit wait helpers") Reported-by: Jan Stancek <jstancek@redhat.com> Reported-by: Chris Mason <clm@fb.com> Tested-by: Jan Stancek <jstancek@redhat.com> Tested-by: Vladimir Murzin <vladimir.murzin@arm.com> Tested-by: Chris Mason <clm@fb.com> Reviewed-by: Paul Turner <pjt@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: tglx@linutronix.de Cc: Oleg Nesterov <oleg@redhat.com> Cc: hpa@zytor.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'nfs-for-4.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2015-09-251-8/+4
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client bugfixes from Trond Myklebust: "Highlights include: Stable patches: - fix v4.2 SEEK on files over 2 gigs - Fix a layout segment reference leak when pNFS I/O falls back to inband I/O. - Fix recovery of recalled read delegations Bugfixes: - Fix a case where NFSv4 fails to send CLOSE after a server reboot - Fix sunrpc to wait for connections to complete before retrying - Fix sunrpc races between transport connect/disconnect and shutdown - Fix an infinite loop when layoutget fail with BAD_STATEID - nfs/filelayout: Fix NULL reference caused by double freeing of fh_array - Fix a bogus WARN_ON_ONCE() in O_DIRECT when layout commit_through_mds is set - Fix layoutreturn/close ordering issues" * tag 'nfs-for-4.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS41: make close wait for layoutreturn NFS: Skip checking ds_cinfo.buckets when lseg's commit_through_mds is set NFSv4.x/pnfs: Don't try to recover stateids twice in layoutget NFSv4: Recovery of recalled read delegations is broken NFS: Fix an infinite loop when layoutget fail with BAD_STATEID NFS: Do cleanup before resetting pageio read/write to mds SUNRPC: xs_sock_mark_closed() does not need to trigger socket autoclose SUNRPC: Lock the transport layer on shutdown nfs/filelayout: Fix NULL reference caused by double freeing of fh_array SUNRPC: Ensure that we wait for connections to complete before retrying SUNRPC: drop null test before destroy functions nfs: fix v4.2 SEEK on files over 2 gigs SUNRPC: Fix races between socket connection and destroy code nfs: fix pg_test page count calculation Failing to send a CLOSE if file is opened WRONLY and server reboots on a 4.x mount
| * SUNRPC: drop null test before destroy functionsJulia Lawall2015-09-171-8/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove unneeded NULL test. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x; @@ -if (x != NULL) \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* | userfaultfd: revert "userfaultfd: waitqueue: add nr wake parameter to ↵Andrea Arcangeli2015-09-221-1/+1
|/ | | | | | | | | | | | | | | | | | | | | | | __wake_up_locked_key" This reverts commit 51360155eccb907ff8635bd10fc7de876408c2e0 and adapts fs/userfaultfd.c to use the old version of that function. It didn't look robust to call __wake_up_common with "nr == 1" when we absolutely require wakeall semantics, but we've full control of what we insert in the two waitqueue heads of the blocked userfaults. No exclusive waitqueue risks to be inserted into those two waitqueue heads so we can as well stick to "nr == 1" of the old code and we can rely purely on the fact no waitqueue inserted in one of the two waitqueue heads we must enforce as wakeall, has wait->flags WQ_FLAG_EXCLUSIVE set. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Thierry Reding <treding@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_keyAndrea Arcangeli2015-09-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | userfaultfd needs to wake all waitqueues (pass 0 as nr parameter), instead of the current hardcoded 1 (that would wake just the first waitqueue in the head list). Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* SUNRPC: fix build-warning due to format missmatchNicholas Mc Guire2015-03-131-2/+2
| | | | | | | | | fix build-warning introduced by commit: f0eede10fd4 ("SUNRPC: use jiffies_to_msecs for converting jiffies") which did not fixup the format properly (my bad). Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* SUNRPC: use jiffies_to_msecs for converting jiffiesNicholas Mc Guire2015-03-121-1/+1
| | | | | | | | Use jiffies_to_msecs for converting jiffies as it handles all of the corner cases reliably and also helps readability. Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* SUNRPC: Allow waiting on memory allocationTrond Myklebust2015-01-241-2/+2
| | | | | | We should be safe now, as long as we don't do GFP_IO or higher allocations Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* SUNRPC: Adjust rpciod workqueue parametersTrond Myklebust2015-01-241-1/+2
| | | | | | | | | | Increase the concurrency level for rpciod threads to allow for allocations etc that happen in the RPCSEC_GSS layer. Also note that the NFSv4 byte range locks may now need to allocate memory from inside rpciod. Add the WQ_HIGHPRI flag to improve latency guarantees while we're at it. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* sunrpc: eliminate RPC_TRACEPOINTSJeff Layton2014-11-241-1/+1
| | | | | | | | It's always set to the same value as CONFIG_TRACEPOINTS, so we can just use that instead. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* sunrpc: eliminate RPC_DEBUGJeff Layton2014-11-241-2/+2
| | | | | | | It's always set to whatever CONFIG_SUNRPC_DEBUG is, so just use that. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()NeilBrown2014-09-251-2/+0
| | | | | | | | | | | | | | | | | | | Now that nfs_release_page() doesn't block indefinitely, other deadlock avoidance mechanisms aren't needed. - it doesn't hurt for kswapd to block occasionally. If it doesn't want to block it would clear __GFP_WAIT. The current_is_kswapd() was only added to avoid deadlocks and we have a new approach for that. - memory allocation in the SUNRPC layer can very rarely try to ->releasepage() a page it is trying to handle. The deadlock is removed as nfs_release_page() doesn't block indefinitely. So we don't need to set PF_FSTRANS for sunrpc network operations any more. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* sched: Allow wait_on_bit_action() functions to support a timeoutNeilBrown2014-07-161-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is currently not possible for various wait_on_bit functions to implement a timeout. While the "action" function that is called to do the waiting could certainly use schedule_timeout(), there is no way to carry forward the remaining timeout after a false wake-up. As false-wakeups a clearly possible at least due to possible hash collisions in bit_waitqueue(), this is a real problem. The 'action' function is currently passed a pointer to the word containing the bit being waited on. No current action functions use this pointer. So changing it to something else will be a little noisy but will have no immediate effect. This patch changes the 'action' function to take a pointer to the "struct wait_bit_key", which contains a pointer to the word containing the bit so nothing is really lost. It also adds a 'private' field to "struct wait_bit_key", which is initialized to zero. An action function can now implement a timeout with something like static int timed_out_waiter(struct wait_bit_key *key) { unsigned long waited; if (key->private == 0) { key->private = jiffies; if (key->private == 0) key->private -= 1; } waited = jiffies - key->private; if (waited > 10 * HZ) return -EAGAIN; schedule_timeout(waited - 10 * HZ); return 0; } If any other need for context in a waiter were found it would be easy to use ->private for some other purpose, or even extend "struct wait_bit_key". My particular need is to support timeouts in nfs_release_page() to avoid deadlocks with loopback mounted NFS. While wait_on_bit_timeout() would be a cleaner interface, it will not meet my need. I need the timeout to be sensitive to the state of the connection with the server, which could change. So I need to use an 'action' interface. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Steve French <sfrench@samba.org> Cc: David Howells <dhowells@redhat.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140707051604.28027.41257.stgit@notabene.brown Signed-off-by: Ingo Molnar <mingo@kernel.org>
* net, sunrpc: suppress allocation warning in rpc_malloc()David Rientjes2014-05-291-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rpc_malloc() allocates with GFP_NOWAIT without making any attempt at reclaim so it easily fails when low on memory. This ends up spamming the kernel log: SLAB: Unable to allocate memory on node 0 (gfp=0x4000) cache: kmalloc-8192, object size: 8192, order: 1 node 0: slabs: 207/207, objs: 207/207, free: 0 rekonq: page allocation failure: order:1, mode:0x204000 CPU: 2 PID: 14321 Comm: rekonq Tainted: G O 3.15.0-rc3-12.gfc9498b-desktop+ #6 Hardware name: System manufacturer System Product Name/M4A785TD-V EVO, BIOS 2105 07/23/2010 0000000000000000 ffff880010ff17d0 ffffffff815e693c 0000000000204000 ffff880010ff1858 ffffffff81137bd2 0000000000000000 0000001000000000 ffff88011ffebc38 0000000000000001 0000000000204000 ffff88011ffea000 Call Trace: [<ffffffff815e693c>] dump_stack+0x4d/0x6f [<ffffffff81137bd2>] warn_alloc_failed+0xd2/0x140 [<ffffffff8113be19>] __alloc_pages_nodemask+0x7e9/0xa30 [<ffffffff811824a8>] kmem_getpages+0x58/0x140 [<ffffffff81183de6>] fallback_alloc+0x1d6/0x210 [<ffffffff81183be3>] ____cache_alloc_node+0x123/0x150 [<ffffffff81185953>] __kmalloc+0x203/0x490 [<ffffffffa06b0ee2>] rpc_malloc+0x32/0xa0 [sunrpc] [<ffffffffa06a6999>] call_allocate+0xb9/0x170 [sunrpc] [<ffffffffa06b19d8>] __rpc_execute+0x88/0x460 [sunrpc] [<ffffffffa06b2da9>] rpc_execute+0x59/0xc0 [sunrpc] [<ffffffffa06a932b>] rpc_run_task+0x6b/0x90 [sunrpc] [<ffffffffa077b5c1>] nfs4_call_sync_sequence+0x51/0x80 [nfsv4] [<ffffffffa077d45d>] _nfs4_do_setattr+0x1ed/0x280 [nfsv4] [<ffffffffa0782a72>] nfs4_do_setattr+0x72/0x180 [nfsv4] [<ffffffffa078334c>] nfs4_proc_setattr+0xbc/0x140 [nfsv4] [<ffffffffa074a7e8>] nfs_setattr+0xd8/0x240 [nfs] [<ffffffff811baa71>] notify_change+0x231/0x380 [<ffffffff8119cf5c>] chmod_common+0xfc/0x120 [<ffffffff8119df80>] SyS_chmod+0x40/0x90 [<ffffffff815f4cfd>] system_call_fastpath+0x1a/0x1f ... If the allocation fails, simply return NULL and avoid spamming the kernel log. Reported-by: Marc Dietrich <marvin24@gmx.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* SUNRPC: Don't let rpc_delay() clobber non-timeout errorsTrond Myklebust2014-03-201-1/+2
| | | | Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* SUNRPC: Ensure rpc_task->tk_pid is available for tracepointsTrond Myklebust2013-09-041-1/+1
| | | | Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* Merge tag 'nfs-for-3.11-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2013-07-091-16/+3
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client updates from Trond Myklebust: "Feature highlights include: - Add basic client support for NFSv4.2 - Add basic client support for Labeled NFS (selinux for NFSv4.2) - Fix the use of credentials in NFSv4.1 stateful operations, and add support for NFSv4.1 state protection. Bugfix highlights: - Fix another NFSv4 open state recovery race - Fix an NFSv4.1 back channel session regression - Various rpc_pipefs races - Fix another issue with NFSv3 auth negotiation Please note that Labeled NFS does require some additional support from the security subsystem. The relevant changesets have all been reviewed and acked by James Morris." * tag 'nfs-for-3.11-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (54 commits) NFS: Set NFS_CS_MIGRATION for NFSv4 mounts NFSv4.1 Refactor nfs4_init_session and nfs4_init_channel_attrs nfs: have NFSv3 try server-specified auth flavors in turn nfs: have nfs_mount fake up a auth_flavs list when the server didn't provide it nfs: move server_authlist into nfs_try_mount_request nfs: refactor "need_mount" code out of nfs_try_mount SUNRPC: PipeFS MOUNT notification optimization for dying clients SUNRPC: split client creation routine into setup and registration SUNRPC: fix races on PipeFS UMOUNT notifications SUNRPC: fix races on PipeFS MOUNT notifications NFSv4.1 use pnfs_device maxcount for the objectlayout gdia_maxcount NFSv4.1 use pnfs_device maxcount for the blocklayout gdia_maxcount NFSv4.1 Fix gdia_maxcount calculation to fit in ca_maxresponsesize NFS: Improve legacy idmapping fallback NFSv4.1 end back channel session draining NFS: Apply v4.1 capabilities to v4.2 NFSv4.1: Clean up layout segment comparison helper names NFSv4.1: layout segment comparison helpers should take 'const' parameters NFSv4: Move the DNS resolver into the NFSv4 module rpc_pipefs: only set rpc_dentry_ops if d_op isn't already set ...
| * SUNRPC: Remove redundant call to rpc_set_running() in __rpc_execute()Trond Myklebust2013-06-061-1/+0
| | | | | | | | | | | | | | | | The RPC_TASK_RUNNING flag will always have been set in rpc_make_runnable() once we get past the test for out_of_line_wait_on_bit() returning ERESTARTSYS. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
| * SUNRPC: Remove unused function rpc_queue_emptyTrond Myklebust2013-06-061-14/+0
| | | | | | | | Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
| * SUNRPC: Fix a potential race in rpc_executeTrond Myklebust2013-06-061-1/+3
| | | | | | | | | | | | | | | | | | | | | | If the rpc_task is asynchronous, it could theoretically finish executing on the workqueue it was assigned by rpc_make_runnable() before we get round to testing RPC_IS_ASYNC() in rpc_execute. In practice, however, all the existing callers hold a reference to the rpc_task, so this can't happen today... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* | Merge branch 'freezer'Rafael J. Wysocki2013-06-281-1/+1
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * freezer: af_unix: use freezable blocking calls in read sigtimedwait: use freezable blocking call nanosleep: use freezable blocking call futex: use freezable blocking call select: use freezable blocking call epoll: use freezable blocking call binder: use freezable blocking calls freezer: add new freezable helpers using freezer_do_not_count() freezer: convert freezable helpers to static inline where possible freezer: convert freezable helpers to freezer_do_not_count() freezer: skip waking up tasks with PF_FREEZER_SKIP set freezer: shorten freezer sleep time using exponential backoff lockdep: check that no locks held at freeze time lockdep: remove task argument from debug_check_no_locks_held freezer: add unsafe versions of freezable helpers for CIFS freezer: add unsafe versions of freezable helpers for NFS
| * freezer: add unsafe versions of freezable helpers for NFSColin Cross2013-05-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NFS calls the freezable helpers with locks held, which is unsafe and will cause lockdep warnings when 6aa9707 "lockdep: check that no locks held at freeze time" is reapplied (it was reverted in dbf520a). NFS shouldn't be doing this, but it has long-running syscalls that must hold a lock but also shouldn't block suspend. Until NFS freeze handling is rewritten to use a signal to exit out of the critical section, add new *_unsafe versions of the helpers that will not run the lockdep test when 6aa9707 is reapplied, and call them from NFS. In practice the likley result of holding the lock while freezing is that a second task blocked on the lock will never freeze, aborting suspend, but it is possible to manufacture a case using the cgroup freezer, the lock, and the suspend freezer to create a deadlock. Silencing the lockdep warning here will allow problems to be found in other drivers that may have a more serious deadlock risk, and prevent new problems from being added. Signed-off-by: Colin Cross <ccross@android.com> Acked-by: Pavel Machek <pavel@ucw.cz> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
* | SUNRPC: Prevent an rpc_task wakeup raceTrond Myklebust2013-05-221-1/+7
|/ | | | | | | | | | | The lockless RPC_IS_QUEUED() test in __rpc_execute means that we need to be careful about ordering the calls to rpc_test_and_set_running(task) and rpc_clear_queued(task). If we get the order wrong, then we may end up testing the RPC_TASK_RUNNING flag after __rpc_execute() has looped and changed the state of the rpc_task. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org
* SUNRPC: Add barriers to ensure read ordering in rpc_wake_up_task_queue_lockedTrond Myklebust2013-03-251-2/+7
| | | | | | | | | | | We need to be careful when testing task->tk_waitqueue in rpc_wake_up_task_queue_locked, because it can be changed while we are holding the queue->lock. By adding appropriate memory barriers, we can ensure that it is safe to test task->tk_waitqueue for equality if the RPC_TASK_QUEUED bit is set. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org
* SUNRPC: When changing the queue priority, ensure that we change the ownerTrond Myklebust2013-01-301-1/+17
| | | | | | | | | | | | | | This fixes a livelock in the xprt->sending queue where we end up never making progress on lower priority tasks because sleep_on_priority() keeps adding new tasks with the same owner to the head of the queue, and priority bumps mean that we keep resetting the queue->owner to whatever task is at the head of the queue. Regression introduced by commit c05eecf636101dd4347b2d8fa457626bf0088e0a (SUNRPC: Don't allow low priority tasks to pre-empt higher priority ones). Reported-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* SUNRPC: Ensure we release the socket write lock if the rpc_task exits earlyTrond Myklebust2013-01-081-2/+1
| | | | | | | | | | | | | | | | | If the rpc_task exits while holding the socket write lock before it has allocated an rpc slot, then the usual mechanism for releasing the write lock in xprt_release() is defeated. The problem occurs if the call to xprt_lock_write() initially fails, so that the rpc_task is put on the xprt->sending wait queue. If the task exits after being assigned the lock by __xprt_lock_write_func, but before it has retried the call to xprt_lock_and_alloc_slot(), then it calls xprt_release() while holding the write lock, but will immediately exit due to the test for task->tk_rqstp != NULL. Reported-by: Chris Perl <chris.perl@gmail.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org [>= 3.1]
* SUNRPC: Ensure that we free the rpc_task after cleanups are doneTrond Myklebust2013-01-041-4/+23
| | | | | | | | | | | | This patch ensures that we free the rpc_task after the cleanup callbacks are done in order to avoid a deadlock problem that can be triggered if the callback needs to wait for another workqueue item to complete. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Weston Andros Adamson <dros@netapp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Bruce Fields <bfields@fieldses.org> Cc: stable@vger.kernel.org
* SUNRPC: Don't allow low priority tasks to pre-empt higher priority onesTrond Myklebust2012-12-061-22/+22
| | | | | | | | | | | | Currently, the priority queues attempt to be 'fair' to lower priority tasks by scheduling them after a certain number of higher priority tasks have run. The problem is that both the transport send queue and the NFSv4.1 session slot queue have strong ordering requirements. This patch therefore removes the fairness code in favour of strong ordering of task priorities. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* NFSv4.1: Don't mess with task priorities in nfs41_setup_sequenceTrond Myklebust2012-12-061-0/+1
| | | | | | | We want to preserve the rpc_task priority for things like writebacks, that may have differing levels of urgency. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* SUNRPC: remove BUG_ON in rpc_release_taskWeston Andros Adamson2012-11-041-1/+1
| | | | | | | Replace BUG_ON() with WARN_ON_ONCE(). Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* SUNRPC: remove BUG_ONs checking RPC_IS_QUEUEDWeston Andros Adamson2012-11-041-2/+6
| | | | | | | Replace two BUG_ON() calls with WARN_ON_ONCE() and early returns. Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* SUNRPC: remove BUG_ON from __rpc_sleep_on_priorityWeston Andros Adamson2012-11-041-1/+1
| | | | | | | Replace BUG_ON() with WARN_ON_ONCE(). Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* SUNRPC: remove BUG_ON from rpc_sleep_on*Weston Andros Adamson2012-11-041-2/+12
| | | | | | | Replace BUG_ON() with WARN_ON_ONCE() and clean up after inactive task. Signed-off-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* SUNRPC: Limit the rpciod workqueue concurrencyTrond Myklebust2012-09-281-1/+1
| | | | | | | We shouldn't need more than 1 worker thread per cpu, since rpciod is designed to run without sleeping in most cases. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* Merge branch 'akpm' (Andrew's patch-bomb)Linus Torvalds2012-07-311-2/+5
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge Andrew's second set of patches: - MM - a few random fixes - a couple of RTC leftovers * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits) rtc/rtc-88pm80x: remove unneed devm_kfree rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables tmpfs: distribute interleave better across nodes mm: remove redundant initialization mm: warn if pg_data_t isn't initialized with zero mips: zero out pg_data_t when it's allocated memcg: gix memory accounting scalability in shrink_page_list mm/sparse: remove index_init_lock mm/sparse: more checks on mem_section number mm/sparse: optimize sparse_index_alloc memcg: add mem_cgroup_from_css() helper memcg: further prevent OOM with too many dirty pages memcg: prevent OOM with too many dirty pages mm: mmu_notifier: fix freed page still mapped in secondary MMU mm: memcg: only check anon swapin page charges for swap cache mm: memcg: only check swap cache pages for repeated charging mm: memcg: split swapin charge function into private and public part mm: memcg: remove needless !mm fixup to init_mm when charging mm: memcg: remove unneeded shmem charge type ...
| * nfs: enable swap on NFSMel Gorman2012-07-311-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement the new swapfile a_ops for NFS and hook up ->direct_IO. This will set the NFS socket to SOCK_MEMALLOC and run socket reconnect under PF_MEMALLOC as well as reset SOCK_MEMALLOC before engaging the protocol ->connect() method. PF_MEMALLOC should allow the allocation of struct socket and related objects and the early (re)setting of SOCK_MEMALLOC should allow us to receive the packets required for the TCP connection buildup. [jlayton@redhat.com: Restore PF_MEMALLOC task flags in all cases] [dfeng@redhat.com: Fix handling of multiple swap files] [a.p.zijlstra@chello.nl: Original patch] Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Paris <eparis@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Neil Brown <neilb@suse.de> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | nfs: skip commit in releasepage if we're freeing memory for fs-related reasonsJeff Layton2012-07-301-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We've had some reports of a deadlock where rpciod ends up with a stack trace like this: PID: 2507 TASK: ffff88103691ab40 CPU: 14 COMMAND: "rpciod/14" #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9 #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs] #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8 #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs] #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs] #6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670 #7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271 #8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638 #9 [ffff8810343bf818] shrink_zone at ffffffff8112788f #10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e #11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f #12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad #13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942 #14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a #15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9 #16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b #17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808 #18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c #19 [ffff8810343bfce8] inet_create at ffffffff81483ba6 #20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7 #21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc] #22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc] #23 [ffff8810343bfe38] worker_thread at ffffffff810887d0 #24 [ffff8810343bfee8] kthread at ffffffff8108dd96 #25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca rpciod is trying to allocate memory for a new socket to talk to the server. The VM ends up calling ->releasepage to get more memory, and it tries to do a blocking commit. That commit can't succeed however without a connected socket, so we deadlock. Fix this by setting PF_FSTRANS on the workqueue task prior to doing the socket allocation, and having nfs_release_page check for that flag when deciding whether to do a commit call. Also, set PF_FSTRANS unconditionally in rpc_async_schedule since that function can also do allocations sometimes. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org
* | sunrpc: clarify comments on rpc_make_runnableJeff Layton2012-07-301-2/+3
|/ | | | | | | | | | rpc_make_runnable is not generally called with the queue lock held, unless it's waking up a task that has been sitting on a waitqueue. This is safe when the task has not entered the FSM yet, but the comments don't really spell this out. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* SUNRPC: We must not use list_for_each_entry_safe() in rpc_wake_up()Trond Myklebust2012-03-191-4/+11
| | | | | | | | | | | | | | | | | The problem is that for the case of priority queues, we have to assume that __rpc_remove_wait_queue_priority will move new elements from the tk_wait.links lists into the queue->tasks[] list. We therefore cannot use list_for_each_entry_safe() on queue->tasks[], since that will skip these new tasks that __rpc_remove_wait_queue_priority is adding. Without this fix, rpc_wake_up and rpc_wake_up_status will both fail to wake up all functions on priority wait queues, which can result in some nasty hangs. Reported-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org
* SUNRPC: Ensure that we can trace waitqueues when !defined(CONFIG_SYSCTL)Trond Myklebust2012-02-151-3/+1
| | | | | | | | | | | | The tracepoint code relies on the queue->name being defined in order to be able to display the name of the waitqueue on which an RPC task is sleeping. Reported-by: Randy Dunlap <rdunlap@xenotime.net> Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Randy Dunlap <rdunlap@xenotime.net>
* SUNRPC: Add trace events to the sunrpc subsystemTrond Myklebust2012-01-311-0/+12
| | | | | | | Add declarations to allow tracing of RPC call creation, running, sleeping, and destruction. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* SUNRPC: Fix potential races in xprt_lock_write_next()Trond Myklebust2012-01-311-9/+33
| | | | | | | | We have to ensure that the wake up from the waitqueue and the assignment of xprt->snd_task are atomic. We can do this by assigning the snd_task while under the waitqueue spinlock. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* Merge branch 'master' into pm-sleepRafael J. Wysocki2011-12-211-7/+23
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * master: (848 commits) SELinux: Fix RCU deref check warning in sel_netport_insert() binary_sysctl(): fix memory leak mm/vmalloc.c: remove static declaration of va from __get_vm_area_node ipmi_watchdog: restore settings when BMC reset oom: fix integer overflow of points in oom_badness memcg: keep root group unchanged if creation fails nilfs2: potential integer overflow in nilfs_ioctl_clean_segments() nilfs2: unbreak compat ioctl cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask evm: prevent racing during tfm allocation evm: key must be set once during initialization mmc: vub300: fix type of firmware_rom_wait_states module parameter Revert "mmc: enable runtime PM by default" mmc: sdhci: remove "state" argument from sdhci_suspend_host x86, dumpstack: Fix code bytes breakage due to missing KERN_CONT IB/qib: Correct sense on freectxts increment and decrement RDMA/cma: Verify private data length cgroups: fix a css_set not found bug in cgroup_attach_proc oprofile: Fix uninitialized memory access when writing to writing to oprofilefs Revert "xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel" ... Conflicts: kernel/cgroup_freezer.c
| * SUNRPC: Fix the execution time statistics in the face of RPC restartsTrond Myklebust2011-12-011-7/+23
| | | | | | | | | | | | | | If the rpc_task gets restarted, then we want to ensure that we don't double-count the execution time statistics, timeout data, etc. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
OpenPOWER on IntegriCloud