talos-op-linux/kernel/bpf/sockmap.c, branch v4.17

bpf: parse and verdict prog attach may race with bpf map update

2018-05-17T22:27:37+00:00

In the sockmap design BPF programs (SK_SKB_STREAM_PARSER, SK_SKB_STREAM_VERDICT and SK_MSG_VERDICT) are attached to the sockmap map type and when a sock is added to the map the programs are used by the socket. However, sockmap updates from both userspace and BPF programs can happen concurrently with the attach and detach of these programs. To resolve this we use the bpf_prog_inc_not_zero and a READ_ONCE() primitive to ensure the program pointer is not refeched and possibly NULL'd before the refcnt increment. This happens inside a RCU critical section so although the pointer reference in the map object may be NULL (by a concurrent detach operation) the reference from READ_ONCE will not be free'd until after grace period. This ensures the object returned by READ_ONCE() is valid through the RCU criticl section and safe to use as long as we "know" it may be free'd shortly. Daniel spotted a case in the sock update API where instead of using the READ_ONCE() program reference we used the pointer from the original map, stab->bpf_{verdict|parse|txmsg}. The problem with this is the logic checks the object returned from the READ_ONCE() is not NULL and then tries to reference the object again but using the above map pointer, which may have already been NULL'd by a parallel detach operation. If this happened bpf_porg_inc_not_zero could dereference a NULL pointer. Fix this by using variable returned by READ_ONCE() that is checked for NULL. Fixes: 2f857d04601a ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support") Reported-by: Daniel Borkmann Signed-off-by: John Fastabend Acked-by: Martin KaFai Lau Signed-off-by: Daniel Borkmann

bpf: sockmap update rollback on error can incorrectly dec prog refcnt

2018-05-17T22:27:37+00:00

If the user were to only attach one of the parse or verdict programs then it is possible a subsequent sockmap update could incorrectly decrement the refcnt on the program. This happens because in the rollback logic, after an error, we have to decrement the program reference count when its been incremented. However, we only increment the program reference count if the user has both a verdict and a parse program. The reason for this is because, at least at the moment, both are required for any one to be meaningful. The problem fixed here is in the rollback path we decrement the program refcnt even if only one existing. But we never incremented the refcnt in the first place creating an imbalance. This patch fixes the error path to handle this case. Fixes: 2f857d04601a ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support") Reported-by: Daniel Borkmann Signed-off-by: John Fastabend Acked-by: Martin KaFai Lau Signed-off-by: Daniel Borkmann

bpf: sockmap, fix error handling in redirect failures

2018-05-02T22:30:45+00:00

When a redirect failure happens we release the buffers in-flight without calling a sk_mem_uncharge(), the uncharge is called before dropping the sock lock for the redirecte, however we missed updating the ring start index. When no apply actions are in progress this is OK because we uncharge the entire buffer before the redirect. But, when we have apply logic running its possible that only a portion of the buffer is being redirected. In this case we only do memory accounting for the buffer slice being redirected and expect to be able to loop over the BPF program again and/or if a sock is closed uncharge the memory at sock destruct time. With an invalid start index however the program logic looks at the start pointer index, checks the length, and when seeing the length is zero (from the initial release and failure to update the pointer) aborts without uncharging/releasing the remaining memory. The fix for this is simply to update the start index. To avoid fixing this error in two locations we do a small refactor and remove one case where it is open-coded. Then fix it in the single function. Signed-off-by: John Fastabend Signed-off-by: Alexei Starovoitov

bpf: sockmap, zero sg_size on error when buffer is released

2018-05-02T22:30:45+00:00

When an error occurs during a redirect we have two cases that need to be handled (i) we have a cork'ed buffer (ii) we have a normal sendmsg buffer. In the cork'ed buffer case we don't currently support recovering from errors in a redirect action. So the buffer is released and the error should _not_ be pushed back to the caller of sendmsg/sendpage. The rationale here is the user will get an error that relates to old data that may have been sent by some arbitrary thread on that sock. Instead we simple consume the data and tell the user that the data has been consumed. We may add proper error recovery in the future. However, this patch fixes a bug where the bytes outstanding counter sg_size was not zeroed. This could result in a case where if the user has both a cork'ed action and apply action in progress we may incorrectly call into the BPF program when the user expected an old verdict to be applied via the apply action. I don't have a use case where using apply and cork at the same time is valid but we never explicitly reject it because it should work fine. This patch ensures the sg_size is zeroed so we don't have this case. In the normal sendmsg buffer case (no cork data) we also do not zero sg_size. Again this can confuse the apply logic when the logic calls into the BPF program when the BPF programmer expected the old verdict to remain. So ensure we set sg_size to zero here as well. And additionally to keep the psock state in-sync with the sk_msg_buff release all the memory as well. Previously we did this before returning to the user but this left a gap where psock and sk_msg_buff states were out of sync which seems fragile. No additional overhead is taken here except for a call to check the length and realize its already been freed. This is in the error path as well so in my opinion lets have robust code over optimized error paths. Signed-off-by: John Fastabend Signed-off-by: Alexei Starovoitov

bpf: sockmap, fix scatterlist update on error path in send with apply

2018-05-02T22:30:44+00:00

When the call to do_tcp_sendpage() fails to send the complete block requested we either retry if only a partial send was completed or abort if we receive a error less than or equal to zero. Before returning though we must update the scatterlist length/offset to account for any partial send completed. Before this patch we did this at the end of the retry loop, but this was buggy when used while applying a verdict to fewer bytes than in the scatterlist. When the scatterlist length was being set we forgot to account for the apply logic reducing the size variable. So the result was we chopped off some bytes in the scatterlist without doing proper cleanup on them. This results in a WARNING when the sock is tore down because the bytes have previously been charged to the socket but are never uncharged. The simple fix is to simply do the accounting inside the retry loop subtracting from the absolute scatterlist values rather than trying to accumulate the totals and subtract at the end. Reported-by: Alexei Starovoitov Signed-off-by: John Fastabend Signed-off-by: Alexei Starovoitov

bpf: sockmap, fix double page_put on ENOMEM error in redirect path

2018-04-23T22:49:45+00:00

In the case where the socket memory boundary is hit the redirect path returns an ENOMEM error. However, before checking for this condition the redirect scatterlist buffer is setup with a valid page and length. This is never unwound so when the buffers are released latter in the error path we do a put_page() and clear the scatterlist fields. But, because the initial error happens before completing the scatterlist buffer we end up with both the original buffer and the redirect buffer pointing to the same page resulting in duplicate put_page() calls. To fix this simply move the initial configuration of the redirect scatterlist buffer below the sock memory check. Found this while running TCP_STREAM test with netperf using Cilium. Fixes: fa246693a111 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT") Signed-off-by: John Fastabend Signed-off-by: Daniel Borkmann

bpf: sockmap, sk_wait_event needed to handle blocking cases

2018-04-23T22:49:45+00:00

In the recvmsg handler we need to add a wait event to support the blocking use cases. Without this we return zero and may confuse user applications. In the wait event any data received on the sk either via sk_receive_queue or the psock ingress list will wake up the sock. Fixes: fa246693a111 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT") Signed-off-by: John Fastabend Signed-off-by: Daniel Borkmann

bpf: sockmap, map_release does not hold refcnt for pinned maps

2018-04-23T22:49:45+00:00

Relying on map_release hook to decrement the reference counts when a map is removed only works if the map is not being pinned. In the pinned case the ref is decremented immediately and the BPF programs released. After this BPF programs may not be in-use which is not what the user would expect. This patch moves the release logic into bpf_map_put_uref() and brings sockmap in-line with how a similar case is handled in prog array maps. Fixes: 3d9e952697de ("bpf: sockmap, fix leaking maps with attached but not detached progs") Signed-off-by: John Fastabend Signed-off-by: Daniel Borkmann

bpf: sockmap remove dead check

2018-04-20T20:09:51+00:00

Remove dead code that bails on `attr->value_size > KMALLOC_MAX_SIZE` - the previous check already bails on `attr->value_size != 4`. Signed-off-by: Jann Horn Signed-off-by: Daniel Borkmann

bpf: sockmap, duplicates release calls may NULL sk_prot

2018-04-04T09:04:31+00:00

It is possible to have multiple ULP tcp_release call paths in flight if a sock is closed and simultaneously being removed from the sockmap control path. The result would be setting the sk_prot to the saved values on the first iteration and then on the second iteration setting the value to NULL. This patch resolves this by ensuring we only reset the sk_prot pointer if we have a valid saved state to set. Fixes: 4f738adba30a7 ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data") Signed-off-by: John Fastabend Signed-off-by: Daniel Borkmann