diff options
Diffstat (limited to 'Documentation/networking')
-rw-r--r-- | Documentation/networking/af_xdp.rst | 101 |
1 files changed, 58 insertions, 43 deletions
diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst index 91928d9ee4bf..ff929cfab4f4 100644 --- a/Documentation/networking/af_xdp.rst +++ b/Documentation/networking/af_xdp.rst @@ -12,7 +12,7 @@ packet processing. This document assumes that the reader is familiar with BPF and XDP. If not, the Cilium project has an excellent reference guide at -http://cilium.readthedocs.io/en/doc-1.0/bpf/. +http://cilium.readthedocs.io/en/latest/bpf/. Using the XDP_REDIRECT action from an XDP program, the program can redirect ingress frames to other XDP enabled netdevs, using the @@ -33,22 +33,22 @@ for a while due to a possible retransmit, the descriptor that points to that packet can be changed to point to another and reused right away. This again avoids copying data. -The UMEM consists of a number of equally size frames and each frame -has a unique frame id. A descriptor in one of the rings references a -frame by referencing its frame id. The user space allocates memory for -this UMEM using whatever means it feels is most appropriate (malloc, -mmap, huge pages, etc). This memory area is then registered with the -kernel using the new setsockopt XDP_UMEM_REG. The UMEM also has two -rings: the FILL ring and the COMPLETION ring. The fill ring is used by -the application to send down frame ids for the kernel to fill in with -RX packet data. References to these frames will then appear in the RX -ring once each packet has been received. The completion ring, on the -other hand, contains frame ids that the kernel has transmitted -completely and can now be used again by user space, for either TX or -RX. Thus, the frame ids appearing in the completion ring are ids that -were previously transmitted using the TX ring. In summary, the RX and -FILL rings are used for the RX path and the TX and COMPLETION rings -are used for the TX path. +The UMEM consists of a number of equally sized chunks. A descriptor in +one of the rings references a frame by referencing its addr. The addr +is simply an offset within the entire UMEM region. The user space +allocates memory for this UMEM using whatever means it feels is most +appropriate (malloc, mmap, huge pages, etc). This memory area is then +registered with the kernel using the new setsockopt XDP_UMEM_REG. The +UMEM also has two rings: the FILL ring and the COMPLETION ring. The +fill ring is used by the application to send down addr for the kernel +to fill in with RX packet data. References to these frames will then +appear in the RX ring once each packet has been received. The +completion ring, on the other hand, contains frame addr that the +kernel has transmitted completely and can now be used again by user +space, for either TX or RX. Thus, the frame addrs appearing in the +completion ring are addrs that were previously transmitted using the +TX ring. In summary, the RX and FILL rings are used for the RX path +and the TX and COMPLETION rings are used for the TX path. The socket is then finally bound with a bind() call to a device and a specific queue id on that device, and it is not until bind is @@ -59,13 +59,13 @@ wants to do this, it simply skips the registration of the UMEM and its corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind call and submits the XSK of the process it would like to share UMEM with as well as its own newly created XSK socket. The new process will -then receive frame id references in its own RX ring that point to this -shared UMEM. Note that since the ring structures are single-consumer / -single-producer (for performance reasons), the new process has to -create its own socket with associated RX and TX rings, since it cannot -share this with the other process. This is also the reason that there -is only one set of FILL and COMPLETION rings per UMEM. It is the -responsibility of a single process to handle the UMEM. +then receive frame addr references in its own RX ring that point to +this shared UMEM. Note that since the ring structures are +single-consumer / single-producer (for performance reasons), the new +process has to create its own socket with associated RX and TX rings, +since it cannot share this with the other process. This is also the +reason that there is only one set of FILL and COMPLETION rings per +UMEM. It is the responsibility of a single process to handle the UMEM. How is then packets distributed from an XDP program to the XSKs? There is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The @@ -102,10 +102,10 @@ UMEM UMEM is a region of virtual contiguous memory, divided into equal-sized frames. An UMEM is associated to a netdev and a specific -queue id of that netdev. It is created and configured (frame size, -frame headroom, start address and size) by using the XDP_UMEM_REG -setsockopt system call. A UMEM is bound to a netdev and queue id, via -the bind() system call. +queue id of that netdev. It is created and configured (chunk size, +headroom, start address and size) by using the XDP_UMEM_REG setsockopt +system call. A UMEM is bound to a netdev and queue id, via the bind() +system call. An AF_XDP is socket linked to a single UMEM, but one UMEM can have multiple AF_XDP sockets. To share an UMEM created via one socket A, @@ -147,13 +147,17 @@ UMEM Fill Ring ~~~~~~~~~~~~~~ The Fill ring is used to transfer ownership of UMEM frames from -user-space to kernel-space. The UMEM indicies are passed in the -ring. As an example, if the UMEM is 64k and each frame is 4k, then the -UMEM has 16 frames and can pass indicies between 0 and 15. +user-space to kernel-space. The UMEM addrs are passed in the ring. As +an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has +16 chunks and can pass addrs between 0 and 64k. Frames passed to the kernel are used for the ingress path (RX rings). -The user application produces UMEM indicies to this ring. +The user application produces UMEM addrs to this ring. Note that the +kernel will mask the incoming addr. E.g. for a chunk size of 2k, the +log2(2048) LSB of the addr will be masked off, meaning that 2048, 2050 +and 3000 refers to the same chunk. + UMEM Completetion Ring ~~~~~~~~~~~~~~~~~~~~~~ @@ -165,16 +169,15 @@ used. Frames passed from the kernel to user-space are frames that has been sent (TX ring) and can be used by user-space again. -The user application consumes UMEM indicies from this ring. +The user application consumes UMEM addrs from this ring. RX Ring ~~~~~~~ The RX ring is the receiving side of a socket. Each entry in the ring -is a struct xdp_desc descriptor. The descriptor contains UMEM index -(idx), the length of the data (len), the offset into the frame -(offset). +is a struct xdp_desc descriptor. The descriptor contains UMEM offset +(addr) and the length of the data (len). If no frames have been passed to kernel via the Fill ring, no descriptors will (or can) appear on the RX ring. @@ -221,38 +224,50 @@ side is xdpsock_user.c and the XDP side xdpsock_kern.c. Naive ring dequeue and enqueue could look like this:: + // struct xdp_rxtx_ring { + // __u32 *producer; + // __u32 *consumer; + // struct xdp_desc *desc; + // }; + + // struct xdp_umem_ring { + // __u32 *producer; + // __u32 *consumer; + // __u64 *desc; + // }; + // typedef struct xdp_rxtx_ring RING; // typedef struct xdp_umem_ring RING; // typedef struct xdp_desc RING_TYPE; - // typedef __u32 RING_TYPE; + // typedef __u64 RING_TYPE; int dequeue_one(RING *ring, RING_TYPE *item) { - __u32 entries = ring->ptrs.producer - ring->ptrs.consumer; + __u32 entries = *ring->producer - *ring->consumer; if (entries == 0) return -1; // read-barrier! - *item = ring->desc[ring->ptrs.consumer & (RING_SIZE - 1)]; - ring->ptrs.consumer++; + *item = ring->desc[*ring->consumer & (RING_SIZE - 1)]; + (*ring->consumer)++; return 0; } int enqueue_one(RING *ring, const RING_TYPE *item) { - u32 free_entries = RING_SIZE - (ring->ptrs.producer - ring->ptrs.consumer); + u32 free_entries = RING_SIZE - (*ring->producer - *ring->consumer); if (free_entries == 0) return -1; - ring->desc[ring->ptrs.producer & (RING_SIZE - 1)] = *item; + ring->desc[*ring->producer & (RING_SIZE - 1)] = *item; // write-barrier! - ring->ptrs.producer++; + (*ring->producer)++; return 0; } |