summaryrefslogtreecommitdiffstats
path: root/net/sunrpc/xprtrdma/verbs.c
Commit message (Collapse)AuthorAgeFilesLines
* xprtrdma: Add vector of ops for each memory registration strategyChuck Lever2015-03-311-4/+7
| | | | | | | | | | | | | | | | Instead of employing switch() statements, let's use the typical Linux kernel idiom for handling behavioral variation: virtual functions. Start by defining a vector of operations for each supported memory registration mode, and by adding a source file for each mode. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com> Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com> Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Prevent infinite loop in rpcrdma_ep_create()Chuck Lever2015-03-311-2/+3
| | | | | | | | | | | | | | If a provider advertizes a zero max_fast_reg_page_list_len, FRWR depth detection loops forever. Instead of just failing the mount, try other memory registration modes. Fixes: 0fc6c4e7bb28 ("xprtrdma: mind the device's max fast . . .") Reported-by: Devesh Sharma <Devesh.Sharma@Emulex.Com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com> Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com> Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Byte-align FRWR registrationChuck Lever2015-03-311-8/+4
| | | | | | | | | | | | | | | | | | | | | The RPC/RDMA transport's FRWR registration logic registers whole pages. This means areas in the first and last pages that are not involved in the RDMA I/O are needlessly exposed to the server. Buffered I/O is typically page-aligned, so not a problem there. But for direct I/O, which can be byte-aligned, and for reply chunks, which are nearly always smaller than a page, the transport could expose memory outside the I/O buffer. FRWR allows byte-aligned memory registration, so let's use it as it was intended. Reported-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com> Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com> Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Display IPv6 addresses and port numbers correctlyChuck Lever2015-03-311-12/+9
| | | | | | | | | Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Tested-by: Devesh Sharma <Devesh.Sharma@Emulex.Com> Tested-by: Meghana Cheripady <Meghana.Cheripady@Emulex.Com> Tested-by: Veeresh U. Kokatnur <veereshuk@chelsio.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Clean up after adding regbuf managementChuck Lever2015-01-301-2/+2
| | | | | | | | | | rpcrdma_{de}register_internal() are used only in verbs.c now. MAX_RPCRDMAHDR is no longer used and can be removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Allocate zero pad separately from rpcrdma_bufferChuck Lever2015-01-301-19/+10
| | | | | | | | | | | | | Use the new rpcrdma_alloc_regbuf() API to shrink the amount of contiguous memory needed for a buffer pool by moving the zero pad buffer into a regbuf. This is for consistency with the other uses of internally registered memory. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Allocate RPC/RDMA receive buffer separately from struct rpcrdma_repChuck Lever2015-01-301-13/+14
| | | | | | | | | | | | | | | | | | | | | | | | The rr_base field is currently the buffer where RPC replies land. An RPC/RDMA reply header lands in this buffer. In some cases an RPC reply header also lands in this buffer, just after the RPC/RDMA header. The inline threshold is an agreed-on size limit for RDMA SEND operations that pass from server and client. The sum of the RPC/RDMA reply header size and the RPC reply header size must be less than this threshold. The largest RDMA RECV that the client should have to handle is the size of the inline threshold. The receive buffer should thus be the size of the inline threshold, and not related to RPCRDMA_MAX_SEGS. RPC replies received via RDMA WRITE (long replies) are caught in rq_rcv_buf, which is the second half of the RPC send buffer. Ie, such replies are not involved in any way with rr_base. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Allocate RPC/RDMA send buffer separately from struct rpcrdma_reqChuck Lever2015-01-301-19/+3
| | | | | | | | | | | | | | | | | | | | | | The rl_base field is currently the buffer where each RPC/RDMA call header is built. The inline threshold is an agreed-on size limit to for RDMA SEND operations that pass between client and server. The sum of the RPC/RDMA header size and the RPC header size must be less than or equal to this threshold. Increasing the r/wsize maximum will require MAX_SEGS to grow significantly, but the inline threshold size won't change (both sides agree on it). The server's inline threshold doesn't change. Since an RPC/RDMA header can never be larger than the inline threshold, make all RPC/RDMA header buffers the size of the inline threshold. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Allocate RPC send buffer separately from struct rpcrdma_reqChuck Lever2015-01-301-10/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Because internal memory registration is an expensive and synchronous operation, xprtrdma pre-registers send and receive buffers at mount time, and then re-uses them for each RPC. A "hardway" allocation is a memory allocation and registration that replaces a send buffer during the processing of an RPC. Hardway must be done if the RPC send buffer is too small to accommodate an RPC's call and reply headers. For xprtrdma, each RPC send buffer is currently part of struct rpcrdma_req so that xprt_rdma_free(), which is passed nothing but the address of an RPC send buffer, can find its matching struct rpcrdma_req and rpcrdma_rep quickly via container_of / offsetof. That means that hardway currently has to replace a whole rpcrmda_req when it replaces an RPC send buffer. This is often a fairly hefty chunk of contiguous memory due to the size of the rl_segments array and the fact that both the send and receive buffers are part of struct rpcrdma_req. Some obscure re-use of fields in rpcrdma_req is done so that xprt_rdma_free() can detect replaced rpcrdma_req structs, and restore the original. This commit breaks apart the RPC send buffer and struct rpcrdma_req so that increasing the size of the rl_segments array does not change the alignment of each RPC send buffer. (Increasing rl_segments is needed to bump up the maximum r/wsize for NFS/RDMA). This change opens up some interesting possibilities for improving the design of xprt_rdma_allocate(). xprt_rdma_allocate() is now the one place where RPC send buffers are allocated or re-allocated, and they are now always left in place by xprt_rdma_free(). A large re-allocation that includes both the rl_segments array and the RPC send buffer is no longer needed. Send buffer re-allocation becomes quite rare. Good send buffer alignment is guaranteed no matter what the size of the rl_segments array is. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Add struct rpcrdma_regbuf and helpersChuck Lever2015-01-301-0/+55
| | | | | | | | | | | | | | | There are several spots that allocate a buffer via kmalloc (usually contiguously with another data structure) and then register that buffer internally. I'd like to split the buffers out of these data structures to allow the data structures to scale. Start by adding functions that can kmalloc and register a buffer, and can manage/preserve the buffer's associated ib_sge and ib_mr fields. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Refactor rpcrdma_buffer_create() and rpcrdma_buffer_destroy()Chuck Lever2015-01-301-53/+95
| | | | | | | | | Move the details of how to create and destroy rpcrdma_req and rpcrdma_rep structures into helper functions. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Simplify synopsis of rpcrdma_buffer_create()Chuck Lever2015-01-301-2/+5
| | | | | | | | | Clean up: There is one call site for rpcrdma_buffer_create(). All of the arguments there are fields of an rpcrdma_xprt. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Take struct ib_qp_attr and ib_qp_init_attr off the stackChuck Lever2015-01-301-7/+8
| | | | | | | | Reduce stack footprint of the connection upcall handler function. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Take struct ib_device_attr off the stackChuck Lever2015-01-301-24/+13
| | | | | | | | | Device attributes are large, and are used in more than one place. Stash a copy in dynamically allocated memory. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Free the pd if ib_query_qp() failsChuck Lever2015-01-301-3/+7
| | | | | | | | | | | If ib_query_qp() fails or the memory registration mode isn't supported, don't leak the PD. An orphaned IB/core resource will cause IB module removal to hang. Fixes: bd7ed1d13304 ("RPC/RDMA: check selected memory registration ...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Remove rpcrdma_ep::rep_func and ::rep_xprtChuck Lever2015-01-301-3/+3
| | | | | | | | | | Clean up: The rep_func field always refers to rpcrdma_conn_func(). rep_func should have been removed by commit b45ccfd25d50 ("xprtrdma: Remove MEMWINDOWS registration modes"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Move credit update to RPC reply handlerChuck Lever2015-01-301-13/+2
| | | | | | | | | | | | | | | | | | Reduce work in the receive CQ handler, which can be run at hardware interrupt level, by moving the RPC/RDMA credit update logic to the RPC reply handler. This has some additional benefits: More header sanity checking is done before trusting the incoming credit value, and the receive CQ handler no longer touches the RPC/RDMA header (the CPU stalls while waiting for the header contents to be brought into the cache). This further extends work begun by commit e7ce710a8802 ("xprtrdma: Avoid deadlock when credit window is reset"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Remove rl_mr field, and the mr_chunk unionChuck Lever2015-01-301-13/+12
| | | | | | | | | | | | | | Clean up: Since commit 0ac531c18323 ("xprtrdma: Remove REGISTER memory registration mode"), the rl_mr pointer is no longer used anywhere. After removal, there's only a single member of the mr_chunk union, so mr_chunk can be removed as well, in favor of a single pointer field. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Remove rpcrdma_ep::rep_iaChuck Lever2015-01-301-1/+0
| | | | | | | | Clean up: This field is not used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: human-readable completion statusChuck Lever2015-01-301-13/+57
| | | | | | | | | | | | Make it easier to grep the system log for specific error conditions. The wc.opcode field is not included because opcode numbers are sparse, and because wc.opcode is not necessarily valid when completion reports an error. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* Merge tag 'nfs-rdma-for-3.19' of ↵Trond Myklebust2014-11-261-15/+99
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.linux-nfs.org/projects/anna/nfs-rdma into linux-next Pull NFS client RDMA changes for 3.19 from Anna Schumaker: "NFS: Client side changes for RDMA These patches various bugfixes and cleanups for using NFS over RDMA, including better error handling and performance improvements by using pad optimization. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>" * tag 'nfs-rdma-for-3.19' of git://git.linux-nfs.org/projects/anna/nfs-rdma: xprtrdma: Display async errors xprtrdma: Enable pad optimization xprtrdma: Re-write rpcrdma_flush_cqs() xprtrdma: Refactor tasklet scheduling xprtrdma: unmap all FMRs during transport disconnect xprtrdma: Cap req_cqinit xprtrdma: Return an errno from rpcrdma_register_external()
| * xprtrdma: Display async errorsChuck Lever2014-11-251-4/+32
| | | | | | | | | | | | | | | | An async error upcall is a hard error, and should be reported in the system log. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * xprtrdma: Re-write rpcrdma_flush_cqs()Chuck Lever2014-11-251-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently rpcrdma_flush_cqs() attempts to avoid code duplication, and simply invokes rpcrdma_recvcq_upcall and rpcrdma_sendcq_upcall. 1. rpcrdma_flush_cqs() can run concurrently with provider upcalls. Both flush_cqs() and the upcalls were invoking ib_poll_cq() in different threads using the same wc buffers (ep->rep_recv_wcs and ep->rep_send_wcs), added by commit 1c00dd077654 ("xprtrmda: Reduce calls to ib_poll_cq() in completion handlers"). During transport disconnect processing, this sometimes resulted in the same reply getting added to the rpcrdma_tasklets_g list more than once, which corrupted the list. 2. The upcall functions drain only a limited number of CQEs, thanks to the poll budget added by commit 8301a2c047cc ("xprtrdma: Limit work done by completion handler"). Fixes: a7bc211ac926 ("xprtrdma: On disconnect, don't ignore ... ") BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=276 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * xprtrdma: Refactor tasklet schedulingChuck Lever2014-11-251-5/+12
| | | | | | | | | | | | | | | | Restore the separate function that schedules the reply handling tasklet. I need to call it from two different paths. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * xprtrdma: unmap all FMRs during transport disconnectChuck Lever2014-11-251-1/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When using RPCRDMA_MTHCAFMR memory registration, after a few transport disconnect / reconnect cycles, ib_map_phys_fmr() starts to return EINVAL because the provider has exhausted its map pool. Make sure that all FMRs are unmapped during transport disconnect, and that ->send_request remarshals them during an RPC retransmit. This resets the transport's MRs to ensure that none are leaked during a disconnect. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * xprtrdma: Cap req_cqinitChuck Lever2014-11-251-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recent work made FRMR registration and invalidation completions unsignaled. This greatly reduces the adapter interrupt rate. Every so often, however, a posted send Work Request is allowed to signal. Otherwise, the provider's Work Queue will wrap and the workload will hang. The number of Work Requests that are allowed to remain unsignaled is determined by the value of req_cqinit. Currently, this is set to the size of the send Work Queue divided by two, minus 1. For FRMR, the send Work Queue is the maximum number of concurrent RPCs (currently 32) times the maximum number of Work Requests an RPC might use (currently 7, though some adapters may need more). For mlx4, this is 224 entries. This leaves completion signaling disabled for 111 send Work Requests. Some providers hold back dispatching Work Requests until a CQE is generated. If completions are disabled, then no CQEs are generated for quite some time, and that can stall the Work Queue. I've seen this occur running xfstests generic/113 over NFSv4, where eventually, posting a FAST_REG_MR Work Request fails with -ENOMEM because the Work Queue has overflowed. The connection is dropped and re-established. Cap the rep_cqinit setting so completions are not left turned off for too long. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=269 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
| * xprtrdma: Return an errno from rpcrdma_register_external()Chuck Lever2014-11-251-2/+2
| | | | | | | | | | | | | | | | | | | | | | The RPC/RDMA send_request method and the chunk registration code expects an errno from the registration function. This allows the upper layers to distinguish between a recoverable failure (for example, temporary memory exhaustion) and a hard failure (for example, a bug in the registration logic). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* | sunrpc: eliminate RPC_DEBUGJeff Layton2014-11-241-4/+4
|/ | | | | | | It's always set to whatever CONFIG_SUNRPC_DEBUG is, so just use that. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* xprtrdma: Handle additional connection eventsChuck Lever2014-07-311-11/+14
| | | | | | | | | | | | | | | Commit 38ca83a5 added RDMA_CM_EVENT_TIMEWAIT_EXIT. But that status is relevant only for consumers that re-use their QPs on new connections. xprtrdma creates a fresh QP on reconnection, so that event should be explicitly ignored. Squelch the alarming "unexpected CM event" message. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Remove RPCRDMA_PERSISTENT_REGISTRATION macroChuck Lever2014-07-311-13/+0
| | | | | | | | | | | | | | | Clean up. RPCRDMA_PERSISTENT_REGISTRATION was a compile-time switch between RPCRDMA_REGISTER mode and RPCRDMA_ALLPHYSICAL mode. Since RPCRDMA_REGISTER has been removed, there's no need for the extra conditional compilation. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Make rpcrdma_ep_disconnect() return voidChuck Lever2014-07-311-10/+4
| | | | | | | | | | | Clean up: The return code is used only for dprintk's that are already redundant. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Schedule reply tasklet once per upcallChuck Lever2014-07-311-16/+15
| | | | | | | | | | | Minor optimization: grab rpcrdma_tk_lock_g and disable hard IRQs just once after clearing the receive completion queue. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Allocate each struct rpcrdma_mw separatelyChuck Lever2014-07-311-99/+143
| | | | | | | | | | | | | Currently rpcrdma_buffer_create() allocates struct rpcrdma_mw's as a single contiguous area of memory. It amounts to quite a bit of memory, and there's no requirement for these to be carved from a single piece of contiguous memory. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Rename frmr_wrChuck Lever2014-07-311-13/+13
| | | | | | | | | | | Clean up: Name frmr_wr after the opcode of the Work Request, consistent with the send and local invalidation paths. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Disable completions for LOCAL_INV Work RequestsChuck Lever2014-07-311-9/+8
| | | | | | | | | | | | Instead of relying on a completion to change the state of an FRMR to FRMR_IS_INVALID, set it in advance. If an error occurs, a completion will fire anyway and mark the FRMR FRMR_IS_STALE. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Disable completions for FAST_REG_MR Work RequestsChuck Lever2014-07-311-5/+4
| | | | | | | | | | | | Instead of relying on a completion to change the state of an FRMR to FRMR_IS_VALID, set it in advance. If an error occurs, a completion will fire anyway and mark the FRMR FRMR_IS_STALE. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Don't post a LOCAL_INV in rpcrdma_register_frmr_external()Chuck Lever2014-07-311-20/+2
| | | | | | | | | | | | | | | | | | | | | | | | Any FRMR arriving in rpcrdma_register_frmr_external() is now guaranteed to be either invalid, or to be targeted by a queued LOCAL_INV that will invalidate it before the adapter processes the FAST_REG_MR being built here. The problem with current arrangement of chaining a LOCAL_INV to the FAST_REG_MR is that if the transport is not connected, the LOCAL_INV is flushed and the FAST_REG_MR is flushed. This leaves the FRMR valid with the old rkey. But rpcrdma_register_frmr_external() has already bumped the in-memory rkey. Next time through rpcrdma_register_frmr_external(), a LOCAL_INV and FAST_REG_MR is attempted again because the FRMR is still valid. But the rkey no longer matches the hardware's rkey, and a memory management operation error occurs. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Reset FRMRs after a flushed LOCAL_INV Work RequestChuck Lever2014-07-311-2/+92
| | | | | | | | | | | | | | | When a LOCAL_INV Work Request is flushed, it leaves an FRMR in the VALID state. This FRMR can be returned by rpcrdma_buffer_get(), and must be knocked down in rpcrdma_register_frmr_external() before it can be re-used. Instead, capture these in rpcrdma_buffer_get(), and reset them. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Reset FRMRs when FAST_REG_MR is flushed by a disconnectChuck Lever2014-07-311-2/+65
| | | | | | | | | | | | | | | | | | | | | | | | | | | FAST_REG_MR Work Requests update a Memory Region's rkey. Rkey's are used to block unwanted access to the memory controlled by an MR. The rkey is passed to the receiver (the NFS server, in our case), and is also used by xprtrdma to invalidate the MR when the RPC is complete. When a FAST_REG_MR Work Request is flushed after a transport disconnect, xprtrdma cannot tell whether the WR actually hit the adapter or not. So it is indeterminant at that point whether the existing rkey is still valid. After the transport connection is re-established, the next FAST_REG_MR or LOCAL_INV Work Request against that MR can sometimes fail because the rkey value does not match what xprtrdma expects. The only reliable way to recover in this case is to deregister and register the MR before it is used again. These operations can be done only in a process context, so handle it in the transport connect worker. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Properly handle exhaustion of the rb_mws listChuck Lever2014-07-311-32/+71
| | | | | | | | | | | If the rb_mws list is exhausted, clean up and return NULL so that call_allocate() will delay and try again. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Chain together all MWs in same buffer poolChuck Lever2014-07-311-0/+4
| | | | | | | | | | | | During connection loss recovery, need to visit every MW in a buffer pool. Any MW that is in use by an RPC will not be on the rb_mws list. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Back off rkey when FAST_REG_MR failsChuck Lever2014-07-311-0/+1
| | | | | | | | | If posting a FAST_REG_MR Work Reqeust fails, revert the rkey update to avoid subsequent IB_WC_MW_BIND_ERR completions. Suggested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Unclutter struct rpcrdma_mr_segChuck Lever2014-07-311-17/+17
| | | | | | | | | | | | | | | | Clean ups: - make it obvious that the rl_mw field is a pointer -- allocated separately, not as part of struct rpcrdma_mr_seg - promote "struct {} frmr;" to a named type - promote the state enum to a named type - name the MW state field the same way other fields in rpcrdma_mw are named Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: On disconnect, don't ignore pending CQEsChuck Lever2014-07-311-5/+9
| | | | | | | | | | | | | xprtrdma is currently throwing away queued completions during a reconnect. RPC replies posted just before connection loss, or successful completions that change the state of an FRMR, can be missed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Limit data payload size for ALLPHYSICALChuck Lever2014-07-311-0/+41
| | | | | | | | | | | | | | | | When the client uses physical memory registration, each page in the payload gets its own array entry in the RPC/RDMA header's chunk list. Therefore, don't advertise a maximum payload size that would require more array entries than can fit in the RPC buffer where RPC/RDMA headers are built. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=248 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Protect ia->ri_id when unmapping/invalidating MRsChuck Lever2014-07-311-6/+17
| | | | | | | | | | | | | | Ensure ia->ri_id remains valid while invoking dma_unmap_page() or posting LOCAL_INV during a transport reconnect. Otherwise, ia->ri_id->device or ia->ri_id->qp is NULL, which triggers a panic. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=259 Fixes: ec62f40 'xprtrdma: Ensure ia->ri_id->qp is not NULL when reconnecting' Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Fix panic in rpcrdma_register_frmr_external()Chuck Lever2014-07-311-5/+7
| | | | | | | | | | | | | | | | seg1->mr_nsegs is not yet initialized when it is used to unmap segments during an error exit. Use the same unmapping logic for all error exits. "if (frmr_wr.wr.fast_reg.length < len) {" used to be a BUG_ON check. The broken code will never be executed under normal operation. Fixes: c977dea (xprtrdma: Remove BUG_ON() call sites) Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Shirley Ma <shirley.ma@oracle.com> Tested-by: Devesh Sharma <devesh.sharma@emulex.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Fix DMA-API-DEBUG warning by checking dma_map resultYan Burman2014-07-221-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the following warning when DMA-API debug is enabled by checking ib_dma_map_single result: [ 1455.345548] ------------[ cut here ]------------ [ 1455.346863] WARNING: CPU: 3 PID: 3929 at /home/yanb/kernel/net-next/lib/dma-debug.c:1140 check_unmap+0x4e5/0x990() [ 1455.349350] mlx4_core 0000:00:07.0: DMA-API: device driver failed to check map error[device address=0x000000007c9f2090] [size=2656 bytes] [mapped as single] [ 1455.349350] Modules linked in: xprtrdma netconsole configfs nfsv3 nfs_acl ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm autofs4 auth_rpcgss oid_registry nfsv4 nfs fscache lockd sunrpc dm_mirror dm_region_hash dm_log microcode pcspkr mlx4_ib ib_sa ib_mad ib_core ib_addr mlx4_en ipv6 ptp pps_core vxlan mlx4_core virtio_balloon cirrus ttm drm_kms_helper drm sysimgblt sysfillrect syscopyarea i2c_piix4 i2c_core button ext3 jbd virtio_blk virtio_net virtio_pci virtio_ring virtio uhci_hcd ata_generic ata_piix libata [ 1455.349350] CPU: 3 PID: 3929 Comm: mount.nfs Not tainted 3.15.0-rc1-dbg+ #13 [ 1455.349350] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007 [ 1455.349350] 0000000000000474 ffff880069dcf628 ffffffff8151c341 ffffffff817b69d8 [ 1455.349350] ffff880069dcf678 ffff880069dcf668 ffffffff8105b5fc 0000000069dcf658 [ 1455.349350] ffff880069dcf778 ffff88007b0c9f00 ffffffff8255ec40 0000000000000a60 [ 1455.349350] Call Trace: [ 1455.349350] [<ffffffff8151c341>] dump_stack+0x52/0x81 [ 1455.349350] [<ffffffff8105b5fc>] warn_slowpath_common+0x8c/0xc0 [ 1455.349350] [<ffffffff8105b6e6>] warn_slowpath_fmt+0x46/0x50 [ 1455.349350] [<ffffffff812e6305>] check_unmap+0x4e5/0x990 [ 1455.349350] [<ffffffff81521fb0>] ? _raw_spin_unlock_irq+0x30/0x60 [ 1455.349350] [<ffffffff812e6a0a>] debug_dma_unmap_page+0x5a/0x60 [ 1455.349350] [<ffffffffa0389583>] rpcrdma_deregister_internal+0xb3/0xd0 [xprtrdma] [ 1455.349350] [<ffffffffa038a639>] rpcrdma_buffer_destroy+0x69/0x170 [xprtrdma] [ 1455.349350] [<ffffffffa03872ff>] xprt_rdma_destroy+0x3f/0xb0 [xprtrdma] [ 1455.349350] [<ffffffffa04a95ff>] xprt_destroy+0x6f/0x80 [sunrpc] [ 1455.349350] [<ffffffffa04a9625>] xprt_put+0x15/0x20 [sunrpc] [ 1455.349350] [<ffffffffa04a899a>] rpc_free_client+0x8a/0xe0 [sunrpc] [ 1455.349350] [<ffffffffa04a8a58>] rpc_release_client+0x68/0xa0 [sunrpc] [ 1455.349350] [<ffffffffa04a9060>] rpc_shutdown_client+0xb0/0xc0 [sunrpc] [ 1455.349350] [<ffffffffa04a8f5d>] ? rpc_ping+0x5d/0x70 [sunrpc] [ 1455.349350] [<ffffffffa04a91ab>] rpc_create_xprt+0xbb/0xd0 [sunrpc] [ 1455.349350] [<ffffffffa04a9273>] rpc_create+0xb3/0x160 [sunrpc] [ 1455.349350] [<ffffffff81129749>] ? __probe_kernel_read+0x69/0xb0 [ 1455.349350] [<ffffffffa053851c>] nfs_create_rpc_client+0xdc/0x100 [nfs] [ 1455.349350] [<ffffffffa0538cfa>] nfs_init_client+0x3a/0x90 [nfs] [ 1455.349350] [<ffffffffa05391c8>] nfs_get_client+0x478/0x5b0 [nfs] [ 1455.349350] [<ffffffffa0538e50>] ? nfs_get_client+0x100/0x5b0 [nfs] [ 1455.349350] [<ffffffff81172c6d>] ? kmem_cache_alloc_trace+0x24d/0x260 [ 1455.349350] [<ffffffffa05393f3>] nfs_create_server+0xf3/0x4c0 [nfs] [ 1455.349350] [<ffffffffa0545ff0>] ? nfs_request_mount+0xf0/0x1a0 [nfs] [ 1455.349350] [<ffffffffa031c0c3>] nfs3_create_server+0x13/0x30 [nfsv3] [ 1455.349350] [<ffffffffa0546293>] nfs_try_mount+0x1f3/0x230 [nfs] [ 1455.349350] [<ffffffff8108ea21>] ? get_parent_ip+0x11/0x50 [ 1455.349350] [<ffffffff812d6343>] ? __this_cpu_preempt_check+0x13/0x20 [ 1455.349350] [<ffffffff810d632b>] ? try_module_get+0x6b/0x190 [ 1455.349350] [<ffffffffa05449f7>] nfs_fs_mount+0x187/0x9d0 [nfs] [ 1455.349350] [<ffffffffa0545940>] ? nfs_clone_super+0x140/0x140 [nfs] [ 1455.349350] [<ffffffffa0543b20>] ? nfs_auth_info_match+0x40/0x40 [nfs] [ 1455.349350] [<ffffffff8117e360>] mount_fs+0x20/0xe0 [ 1455.349350] [<ffffffff811a1c16>] vfs_kern_mount+0x76/0x160 [ 1455.349350] [<ffffffff811a29a8>] do_mount+0x428/0xae0 [ 1455.349350] [<ffffffff811a30f0>] SyS_mount+0x90/0xe0 [ 1455.349350] [<ffffffff8152af52>] system_call_fastpath+0x16/0x1b [ 1455.349350] ---[ end trace f1f31572972e211d ]--- Signed-off-by: Yan Burman <yanb@mellanox.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Remove BUG_ON() call sitesChuck Lever2014-06-041-8/+10
| | | | | | | | If an error occurs in the marshaling logic, fail the RPC request being processed, but leave the client running. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* xprtrdma: Remove Tavor MTU settingChuck Lever2014-06-041-14/+0
| | | | | | | | | | | | | | Clean up. Remove HCA-specific clutter in xprtrdma, which is supposed to be device-independent. Hal Rosenstock <hal@dev.mellanox.co.il> observes: > Note that there is OpenSM option (enable_quirks) to return 1K MTU > in SA PathRecord responses for Tavor so that can be used for this. > The default setting for enable_quirks is FALSE so that would need > changing. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
OpenPOWER on IntegriCloud