summaryrefslogtreecommitdiffstats
path: root/net/core
Commit message (Collapse)AuthorAgeFilesLines
...
| | * | bpf: bpf_setsockopt: reset sock dst on SO_MARK changesPeter Oskolkov2019-01-181-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In sock_setsockopt() (net/core/sock.h), when SO_MARK option is used to change sk_mark, sk_dst_reset(sk) is called. The same should be done in bpf_setsockopt(). Fixes: 8c4b4c7e9ff0 ("bpf: Add setsockopt helper function to bpf") Reported-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Peter Oskolkov <posk@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * | bpf: Correctly annotate implicit fall through in bpf_base_func_protoMathieu Malaterre2019-01-171-1/+1
| | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a plan to build the kernel with -Wimplicit-fallthrough and this place in the code produced a warnings (W=1). To preserve as much of the existing comment only change a ‘:’ into a ‘,’. This is enough change, to match the regular expression expected by GCC. This commit removes the following warning: net/core/filter.c:5310:6: warning: this statement may fall through [-Wimplicit-fallthrough=] Signed-off-by: Mathieu Malaterre <malat@debian.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * | neighbour: Do not perturb drop profiles when neigh_probeYang Wei2019-01-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Replace the kfree_skb() by consume_skb() to be drop monitor(dropwatch, perf) friendly. Signed-off-by: Yang Wei <yang.wei9@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net/core/neighbour: fix kmemleak minimal reference count for hash tablesKonstantin Khlebnikov2019-01-161-1/+1
| |/ | | | | | | | | | | | | | | | | This should be 1 for normal allocations, 0 disables leak reporting. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Fixes: 85704cb8dcfd ("net/core/neighbour: tell kmemleak about hash tables") Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sock: do not set sk_cookie in sk_clone_lock()Yafang Shao2019-01-191-1/+0
| | | | | | | | | | | | | | | | | | | | The only call site of sk_clone_lock is in inet_csk_clone_lock, and sk_cookie will be set there. So we don't need to set sk_cookie in sk_clone_lock(). Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: namespace: perform strict checks also for doit handlersJakub Kicinski2019-01-191-2/+36
| | | | | | | | | | | | | | | | | | | | Make RTM_GETNSID's doit handler use strict checks when NETLINK_F_STRICT_CHK is set. v2: - don't check size >= sizeof(struct rtgenmsg) (Nicolas). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | rtnetlink: ifinfo: perform strict checks also for doit handlerJakub Kicinski2019-01-191-1/+48
| | | | | | | | | | | | | | | | Make RTM_GETLINK's doit handler use strict checks when NETLINK_F_STRICT_CHK is set. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | rtnetlink: stats: reject requests for unknown statsJakub Kicinski2019-01-191-0/+4
| | | | | | | | | | | | | | | | In the spirit of strict checks reject requests of stats the kernel does not support when NETLINK_F_STRICT_CHK is set. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | rtnetlink: stats: validate attributes in get as well as dumpsJakub Kicinski2019-01-191-21/+37
| | | | | | | | | | | | | | | | Make sure NETLINK_GET_STRICT_CHK influences both GETSTATS doit as well as the dump. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | devlink: Add health dump {get,clear} commandsEran Ben Elisha2019-01-181-0/+75
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add devlink health dump commands, in order to run an dump operation over a specific reporter. The supported operations are dump_get in order to get last saved dump (if not exist, dump now) and dump_clear to clear last saved dump. It is expected from driver's callback for diagnose command to fill it via the buffer descriptors API. Devlink will parse it and convert it to netlink nla API in order to pass it to the user. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | devlink: Add health diagnose commandEran Ben Elisha2019-01-181-0/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | Add devlink health diagnose command, in order to run a diagnose operation over a specific reporter. It is expected from driver's callback for diagnose command to fill it via the buffer descriptors API. Devlink will parse it and convert it to netlink nla API in order to pass it to the user. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | devlink: Add health recover commandEran Ben Elisha2019-01-181-0/+20
| | | | | | | | | | | | | | | | | | Add devlink health recover command to the uapi, in order to allow the user to execute a recover operation over a specific reporter. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | devlink: Add health set commandEran Ben Elisha2019-01-181-0/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | Add devlink health set command, in order to set configuration parameters for a specific reporter. Supported parameters are: - graceful_period: Time interval between auto recoveries (in msec) - auto_recover: Determines if the devlink shall execute recover upon receiving error for the reporter Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | devlink: Add health get commandEran Ben Elisha2019-01-181-0/+152
| | | | | | | | | | | | | | | | | | | | Add devlink health get command to provide reporter/s data for user space. Add the ability to get data per reporter or dump data from all available reporters. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | devlink: Add health report functionalityEran Ben Elisha2019-01-181-0/+93
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Upon error discover, every driver can report it to the devlink health mechanism via devlink_health_report function, using the appropriate reporter registered to it. Driver can pass error specific context which will be delivered to it as part of the dump / recovery callbacks. Once an error is reported, devlink health will do the following actions: * A log is being send to the kernel trace events buffer * Health status and statistics are being updated for the reporter instance * Object dump is being taken and stored at the reporter instance (as long as there is no other dump which is already stored) * Auto recovery attempt is being done. depends on: - Auto Recovery configuration - Grace period vs. time since last recover Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | devlink: Add health reporter create/destroy functionalityEran Ben Elisha2019-01-181-0/+127
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Devlink health reporter is an instance for reporting, diagnosing and recovering from run time errors discovered by the reporters. Define it's data structure and supported operations. In addition, expose devlink API to create and destroy a reporter. Each devlink instance will hold it's own reporters list. As part of the allocation, driver shall provide a set of callbacks which will be used the devlink in order to handle health reports and user commands related to this reporter. In addition, driver is entitled to provide some priv pointer, which can be fetched from the reporter by devlink_health_reporter_priv function. For each reporter, devlink will hold a metadata of statistics, buffers and status. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | devlink: Add health buffer supportEran Ben Elisha2019-01-181-0/+501
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Devlink health buffer is a mechanism to pass descriptors between drivers and devlink. The API allows the driver to add objects, object pair, value array (nested attributes), value and name. Driver can use this API to fill the buffers in a format which can be translated by the devlink to the netlink message. In order to fulfill it, an internal buffer descriptor is defined. This will hold the data and metadata per each attribute and by used to pass actual commands to the netlink. This mechanism will be later used in devlink health for dump and diagnose data store by the drivers. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: add a route cache full diagnostic messagePeter Oskolkov2019-01-171-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | In some testing scenarios, dst/route cache can fill up so quickly that even an explicit GC call occasionally fails to clean it up. This leads to sporadically failing calls to dst_alloc and "network unreachable" errors to the user, which is confusing. This patch adds a diagnostic message to make the cause of the failure easier to determine. Signed-off-by: Peter Oskolkov <posk@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Add extack argument to ndo_fdb_add()Petr Machata2019-01-171-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | Drivers may not be able to support certain FDB entries, and an error code is insufficient to give clear hints as to the reasons of rejection. In order to make it possible to communicate the rejection reason, extend ndo_fdb_add() with an extack argument. Adapt the existing implementations of ndo_fdb_add() to take the parameter (and ignore it). Pass the extack parameter when invoking ndo_fdb_add() from rtnl_fdb_add(). Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: introduce SO_BINDTOIFINDEX sockoptDavid Herrmann2019-01-171-10/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This introduces a new generic SOL_SOCKET-level socket option called SO_BINDTOIFINDEX. It behaves similar to SO_BINDTODEVICE, but takes a network interface index as argument, rather than the network interface name. User-space often refers to network-interfaces via their index, but has to temporarily resolve it to a name for a call into SO_BINDTODEVICE. This might pose problems when the network-device is renamed asynchronously by other parts of the system. When this happens, the SO_BINDTODEVICE might either fail, or worse, it might bind to the wrong device. In most cases user-space only ever operates on devices which they either manage themselves, or otherwise have a guarantee that the device name will not change (e.g., devices that are UP cannot be renamed). However, particularly in libraries this guarantee is non-obvious and it would be nice if that race-condition would simply not exist. It would make it easier for those libraries to operate even in situations where the device-name might change under the hood. A real use-case that we recently hit is trying to start the network stack early in the initrd but make it survive into the real system. Existing distributions rename network-interfaces during the transition from initrd into the real system. This, obviously, cannot affect devices that are up and running (unless you also consider moving them between network-namespaces). However, the network manager now has to make sure its management engine for dormant devices will not run in parallel to these renames. Particularly, when you offload operations like DHCP into separate processes, these might setup their sockets early, and thus have to resolve the device-name possibly running into this race-condition. By avoiding a call to resolve the device-name, we no longer depend on the name and can run network setup of dormant devices in parallel to the transition off the initrd. The SO_BINDTOIFINDEX ioctl plugs this race. Reviewed-by: Tom Gundersen <teg@jklm.no> Signed-off-by: David Herrmann <dh.herrmann@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Optimize sk_msg_clone() by data merge to end dst sg entryVakul Garg2019-01-171-8/+17
|/ | | | | | | | | | | | | | | | | | | | | Function sk_msg_clone has been modified to merge the data from source sg entry to destination sg entry if the cloned data resides in same page and is contiguous to the end entry of destination sk_msg. This improves kernel tls throughput to the tune of 10%. When the user space tls application calls sendmsg() with MSG_MORE, it leads to calling sk_msg_clone() with new data being cloned placed continuous to previously cloned data. Without this optimization, a new SG entry in the destination sk_msg i.e. rec->msg_plaintext in tls_clone_plaintext_msg() gets used. This leads to exhaustion of sg entries in rec->msg_plaintext even before a full 16K of allowable record data is accumulated. Hence we lose oppurtunity to encrypt and send a full 16K record. With this patch, the kernel tls can accumulate full 16K of record data irrespective of the size of data passed in sendmsg() with MSG_MORE. Signed-off-by: Vakul Garg <vakul.garg@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2019-01-163-11/+11
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Fix regression in multi-SKB responses to RTM_GETADDR, from Arthur Gautier. 2) Fix ipv6 frag parsing in openvswitch, from Yi-Hung Wei. 3) Unbounded recursion in ipv4 and ipv6 GUE tunnels, from Stefano Brivio. 4) Use after free in hns driver, from Yonglong Liu. 5) icmp6_send() needs to handle the case of NULL skb, from Eric Dumazet. 6) Missing rcu read lock in __inet6_bind() when operating on mapped addresses, from David Ahern. 7) Memory leak in tipc-nl_compat_publ_dump(), from Gustavo A. R. Silva. 8) Fix PHY vs r8169 module loading ordering issues, from Heiner Kallweit. 9) Fix bridge vlan memory leak, from Ido Schimmel. 10) Dev refcount leak in AF_PACKET, from Jason Gunthorpe. 11) Infoleak in ipv6_local_error(), flow label isn't completely initialized. From Eric Dumazet. 12) Handle mv88e6390 errata, from Andrew Lunn. 13) Making vhost/vsock CID hashing consistent, from Zha Bin. 14) Fix lack of UMH cleanup when it unexpectedly exits, from Taehee Yoo. 15) Bridge forwarding must clear skb->tstamp, from Paolo Abeni. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits) bnxt_en: Fix context memory allocation. bnxt_en: Fix ring checking logic on 57500 chips. mISDN: hfcsusb: Use struct_size() in kzalloc() net: clear skb->tstamp in bridge forwarding path net: bpfilter: disallow to remove bpfilter module while being used net: bpfilter: restart bpfilter_umh when error occurred net: bpfilter: use cleanup callback to release umh_info umh: add exit routine for UMH process isdn: i4l: isdn_tty: Fix some concurrency double-free bugs vhost/vsock: fix vhost vsock cid hashing inconsistent net: stmmac: Prevent RX starvation in stmmac_napi_poll() net: stmmac: Fix the logic of checking if RX Watchdog must be enabled net: stmmac: Check if CBS is supported before configuring net: stmmac: dwxgmac2: Only clear interrupts that are active net: stmmac: Fix PCI module removal leak tools/bpf: fix bpftool map dump with bitfields tools/bpf: test btf bitfield with >=256 struct member offset bpf: fix bpffs bitfield pretty print net: ethernet: mediatek: fix warning in phy_start_aneg tcp: change txhash on SYN-data timeout ...
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller2019-01-111-1/+1
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf 2019-01-11 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix TCP-BPF support for correctly setting the initial window via TCP_BPF_IW on an active TFO sender, from Yuchung. 2) Fix a panic in BPF's stack_map_get_build_id()'s ELF parsing on 32 bit archs caused by page_address() returning NULL, from Song. 3) Fix BTF pretty print in kernel and bpftool when bitfield member offset is greater than 256. Also add test cases, from Yonghong. 4) Fix improper argument handling in xdp1 sample, from Ioana. 5) Install missing tcp_server.py and tcp_client.py files from BPF selftests, from Anders. 6) Add test_libbpf to gitignore in libbpf and BPF selftests, from Stanislav. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * bpf: correctly set initial window on active Fast Open senderYuchung Cheng2019-01-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The existing BPF TCP initial congestion window (TCP_BPF_IW) does not to work on (active) Fast Open sender. This is because it changes the (initial) window only if data_segs_out is zero -- but data_segs_out is also incremented on SYN-data. This patch fixes the issue by proerly accounting for SYN-data additionally. Fixes: fc7478103c84 ("bpf: Adds support for setting initial cwnd") Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Acked-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * | net/core/neighbour: tell kmemleak about hash tablesKonstantin Khlebnikov2019-01-101-4/+9
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes false-positive kmemleak reports about leaked neighbour entries: unreferenced object 0xffff8885c6e4d0a8 (size 1024): comm "softirq", pid 0, jiffies 4294922664 (age 167640.804s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 20 2c f3 83 ff ff ff ff ........ ,...... 08 c0 ef 5f 84 88 ff ff 01 8c 7d 02 01 00 00 00 ..._......}..... backtrace: [<00000000748509fe>] ip6_finish_output2+0x887/0x1e40 [<0000000036d7a0d8>] ip6_output+0x1ba/0x600 [<0000000027ea7dba>] ip6_send_skb+0x92/0x2f0 [<00000000d6e2111d>] udp_v6_send_skb.isra.24+0x680/0x15e0 [<000000000668a8be>] udpv6_sendmsg+0x18c9/0x27a0 [<000000004bd5fa90>] sock_sendmsg+0xb3/0xf0 [<000000008227b29f>] ___sys_sendmsg+0x745/0x8f0 [<000000008698009d>] __sys_sendmsg+0xde/0x170 [<00000000889dacf1>] do_syscall_64+0x9b/0x400 [<0000000081cdb353>] entry_SYSCALL_64_after_hwframe+0x49/0xbe [<000000005767ed39>] 0xffffffffffffffff Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net, skbuff: do not prefer skb allocation fails earlyDavid Rientjes2019-01-041-6/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit dcda9b04713c ("mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic") replaced __GFP_REPEAT in alloc_skb_with_frags() with __GFP_RETRY_MAYFAIL when the allocation may directly reclaim. The previous behavior would require reclaim up to 1 << order pages for skb aligned header_len of order > PAGE_ALLOC_COSTLY_ORDER before failing, otherwise the allocations in alloc_skb() would loop in the page allocator looking for memory. __GFP_RETRY_MAYFAIL makes both allocations failable under memory pressure, including for the HEAD allocation. This can cause, among many other things, write() to fail with ENOTCONN during RPC when under memory pressure. These allocations should succeed as they did previous to dcda9b04713c even if it requires calling the oom killer and additional looping in the page allocator to find memory. There is no way to specify the previous behavior of __GFP_REPEAT, but it's unlikely to be necessary since the previous behavior only guaranteed that 1 << order pages would be reclaimed before failing for order > PAGE_ALLOC_COSTLY_ORDER. That reclaim is not guaranteed to be contiguous memory, so repeating for such large orders is usually not beneficial. Removing the setting of __GFP_RETRY_MAYFAIL to restore the previous behavior, specifically not allowing alloc_skb() to fail for small orders and oom kill if necessary rather than allowing RPCs to fail. Fixes: dcda9b04713c ("mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic") Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | jump_label: move 'asm goto' support test to KconfigMasahiro Yamada2019-01-061-3/+3
|/ | | | | | | | | | | | | | | | | | | | | Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label". The jump label is controlled by HAVE_JUMP_LABEL, which is defined like this: #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL) # define HAVE_JUMP_LABEL #endif We can improve this by testing 'asm goto' support in Kconfig, then make JUMP_LABEL depend on CC_HAS_ASM_GOTO. Ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will match to the real kernel capability. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2019-01-033-7/+25
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: "Several fixes here. Basically split down the line between newly introduced regressions and long existing problems: 1) Double free in tipc_enable_bearer(), from Cong Wang. 2) Many fixes to nf_conncount, from Florian Westphal. 3) op->get_regs_len() can throw an error, check it, from Yunsheng Lin. 4) Need to use GFP_ATOMIC in *_add_hash_mac_address() of fsl/fman driver, from Scott Wood. 5) Inifnite loop in fib_empty_table(), from Yue Haibing. 6) Use after free in ax25_fillin_cb(), from Cong Wang. 7) Fix socket locking in nr_find_socket(), also from Cong Wang. 8) Fix WoL wakeup enable in r8169, from Heiner Kallweit. 9) On 32-bit sock->sk_stamp is not thread-safe, from Deepa Dinamani. 10) Fix ptr_ring wrap during queue swap, from Cong Wang. 11) Missing shutdown callback in hinic driver, from Xue Chaojing. 12) Need to return NULL on error from ip6_neigh_lookup(), from Stefano Brivio. 13) BPF out of bounds speculation fixes from Daniel Borkmann" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (57 commits) ipv6: Consider sk_bound_dev_if when binding a socket to an address ipv6: Fix dump of specific table with strict checking bpf: add various test cases to selftests bpf: prevent out of bounds speculation on pointer arithmetic bpf: fix check_map_access smin_value test when pointer contains offset bpf: restrict unknown scalars of mixed signed bounds for unprivileged bpf: restrict stack pointer arithmetic for unprivileged bpf: restrict map value pointer arithmetic for unprivileged bpf: enable access to ax register also from verifier rewrite bpf: move tmp variable into ax register in interpreter bpf: move {prev_,}insn_idx into verifier env isdn: fix kernel-infoleak in capi_unlocked_ioctl ipv6: route: Fix return value of ip6_neigh_lookup() on neigh_create() error net/hamradio/6pack: use mod_timer() to rearm timers net-next/hinic:add shutdown callback net: hns3: call hns3_nic_net_open() while doing HNAE3_UP_CLIENT ip: validate header length on virtual device xmit tap: call skb_probe_transport_header after setting skb->dev ptr_ring: wrap back ->producer in __ptr_ring_swap_queue() net: rds: remove unnecessary NULL check ...
| * sock: Make sock->sk_stamp thread-safeDeepa Dinamani2019-01-011-5/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Al Viro mentioned (Message-ID <20170626041334.GZ10672@ZenIV.linux.org.uk>) that there is probably a race condition lurking in accesses of sk_stamp on 32-bit machines. sock->sk_stamp is of type ktime_t which is always an s64. On a 32 bit architecture, we might run into situations of unsafe access as the access to the field becomes non atomic. Use seqlocks for synchronization. This allows us to avoid using spinlocks for readers as readers do not need mutual exclusion. Another approach to solve this is to require sk_lock for all modifications of the timestamps. The current approach allows for timestamps to have their own lock: sk_stamp_lock. This allows for the patch to not compete with already existing critical sections, and side effects are limited to the paths in the patch. The addition of the new field maintains the data locality optimizations from commit 9115e8cd2a0c ("net: reorganize struct sock for better data locality") Note that all the instances of the sk_stamp accesses are either through the ioctl or the syscall recvmsg. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: rtnetlink: address is mandatory for rtnl_fdb_getNikolay Aleksandrov2018-12-301-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | We must have an address to lookup otherwise we'll derefence a null pointer in the ndo_fdb_get callbacks. CC: Roopa Prabhu <roopa@cumulusnetworks.com> CC: David Ahern <dsa@cumulusnetworks.com> Reported-by: syzbot+017b1f61c82a1c3e7efd@syzkaller.appspotmail.com Fixes: 5b2f94b27622 ("net: rtnetlink: support for fdb get") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ethtool: check the return value of get_regs_lenYunsheng Lin2018-12-281-2/+10
| | | | | | | | | | | | | | | | | | The return type for get_regs_len in struct ethtool_ops is int, the hns3 driver may return error when failing to get the regs len by sending cmd to firmware. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-blockLinus Torvalds2018-12-281-95/+64
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block updates from Jens Axboe: "This is the main pull request for block/storage for 4.21. Larger than usual, it was a busy round with lots of goodies queued up. Most notable is the removal of the old IO stack, which has been a long time coming. No new features for a while, everything coming in this week has all been fixes for things that were previously merged. This contains: - Use atomic counters instead of semaphores for mtip32xx (Arnd) - Cleanup of the mtip32xx request setup (Christoph) - Fix for circular locking dependency in loop (Jan, Tetsuo) - bcache (Coly, Guoju, Shenghui) * Optimizations for writeback caching * Various fixes and improvements - nvme (Chaitanya, Christoph, Sagi, Jay, me, Keith) * host and target support for NVMe over TCP * Error log page support * Support for separate read/write/poll queues * Much improved polling * discard OOM fallback * Tracepoint improvements - lightnvm (Hans, Hua, Igor, Matias, Javier) * Igor added packed metadata to pblk. Now drives without metadata per LBA can be used as well. * Fix from Geert on uninitialized value on chunk metadata reads. * Fixes from Hans and Javier to pblk recovery and write path. * Fix from Hua Su to fix a race condition in the pblk recovery code. * Scan optimization added to pblk recovery from Zhoujie. * Small geometry cleanup from me. - Conversion of the last few drivers that used the legacy path to blk-mq (me) - Removal of legacy IO path in SCSI (me, Christoph) - Removal of legacy IO stack and schedulers (me) - Support for much better polling, now without interrupts at all. blk-mq adds support for multiple queue maps, which enables us to have a map per type. This in turn enables nvme to have separate completion queues for polling, which can then be interrupt-less. Also means we're ready for async polled IO, which is hopefully coming in the next release. - Killing of (now) unused block exports (Christoph) - Unification of the blk-rq-qos and blk-wbt wait handling (Josef) - Support for zoned testing with null_blk (Masato) - sx8 conversion to per-host tag sets (Christoph) - IO priority improvements (Damien) - mq-deadline zoned fix (Damien) - Ref count blkcg series (Dennis) - Lots of blk-mq improvements and speedups (me) - sbitmap scalability improvements (me) - Make core inflight IO accounting per-cpu (Mikulas) - Export timeout setting in sysfs (Weiping) - Cleanup the direct issue path (Jianchao) - Export blk-wbt internals in block debugfs for easier debugging (Ming) - Lots of other fixes and improvements" * tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block: (364 commits) kyber: use sbitmap add_wait_queue/list_del wait helpers sbitmap: add helpers for add/del wait queue handling block: save irq state in blkg_lookup_create() dm: don't reuse bio for flushes nvme-pci: trace SQ status on completions nvme-rdma: implement polling queue map nvme-fabrics: allow user to pass in nr_poll_queues nvme-fabrics: allow nvmf_connect_io_queue to poll nvme-core: optionally poll sync commands block: make request_to_qc_t public nvme-tcp: fix spelling mistake "attepmpt" -> "attempt" nvme-tcp: fix endianess annotations nvmet-tcp: fix endianess annotations nvme-pci: refactor nvme_poll_irqdisable to make sparse happy nvme-pci: only set nr_maps to 2 if poll queues are supported nvmet: use a macro for default error location nvmet: fix comparison of a u16 with -1 blk-mq: enable IO poll if .nr_queues of type poll > 0 blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight() blk-mq: skip zero-queue maps in blk_mq_map_swqueue ...
| * datagram: introduce skb_copy_and_hash_datagram_iter helperSagi Grimberg2018-12-131-1/+19
| | | | | | | | | | | | | | | | | | | | | | | | Introduce a helper to copy datagram into an iovec iterator but also update a predefined hash. This is useful for consumers of skb_copy_datagram_iter to also support inflight data digest without having to finish to copy and only then traverse the iovec and calculate the digest hash. Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * datagram: consolidate datagram copy to iter helpersSagi Grimberg2018-12-131-94/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | skb_copy_datagram_iter and skb_copy_and_csum_datagram are essentialy the same but with a couple of differences: The first is the copy operation used which either a simple copy or a csum_and_copy, and the second are the behavior on the "short copy" path where simply copy needs to return the number of bytes successfully copied while csum_and_copy needs to fault immediately as the checksum is partial. Introduce __skb_datagram_iter that additionally accepts: 1. copy operation function pointer 2. private data that goes with the copy operation 3. fault_short flag to indicate the action on short copy Suggested-by: David S. Miller <davem@davemloft.net> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * datagram: open-code copy_page_to_iterSagi Grimberg2018-12-131-3/+6
| | | | | | | | | | | | | | | | | | This will be useful to consolidate skb_copy_and_hash_datagram_iter and skb_copy_and_csum_datagram to a single code path. Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds2018-12-2717-458/+1790
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking updates from David Miller: 1) New ipset extensions for matching on destination MAC addresses, from Stefano Brivio. 2) Add ipv4 ttl and tos, plus ipv6 flow label and hop limit offloads to nfp driver. From Stefano Brivio. 3) Implement GRO for plain UDP sockets, from Paolo Abeni. 4) Lots of work from Michał Mirosław to eliminate the VLAN_TAG_PRESENT bit so that we could support the entire vlan_tci value. 5) Rework the IPSEC policy lookups to better optimize more usecases, from Florian Westphal. 6) Infrastructure changes eliminating direct manipulation of SKB lists wherever possible, and to always use the appropriate SKB list helpers. This work is still ongoing... 7) Lots of PHY driver and state machine improvements and simplifications, from Heiner Kallweit. 8) Various TSO deferral refinements, from Eric Dumazet. 9) Add ntuple filter support to aquantia driver, from Dmitry Bogdanov. 10) Batch dropping of XDP packets in tuntap, from Jason Wang. 11) Lots of cleanups and improvements to the r8169 driver from Heiner Kallweit, including support for ->xmit_more. This driver has been getting some much needed love since he started working on it. 12) Lots of new forwarding selftests from Petr Machata. 13) Enable VXLAN learning in mlxsw driver, from Ido Schimmel. 14) Packed ring support for virtio, from Tiwei Bie. 15) Add new Aquantia AQtion USB driver, from Dmitry Bezrukov. 16) Add XDP support to dpaa2-eth driver, from Ioana Ciocoi Radulescu. 17) Implement coalescing on TCP backlog queue, from Eric Dumazet. 18) Implement carrier change in tun driver, from Nicolas Dichtel. 19) Support msg_zerocopy in UDP, from Willem de Bruijn. 20) Significantly improve garbage collection of neighbor objects when the table has many PERMANENT entries, from David Ahern. 21) Remove egdev usage from nfp and mlx5, and remove the facility completely from the tree as it no longer has any users. From Oz Shlomo and others. 22) Add a NETDEV_PRE_CHANGEADDR so that drivers can veto the change and therefore abort the operation before the commit phase (which is the NETDEV_CHANGEADDR event). From Petr Machata. 23) Add indirect call wrappers to avoid retpoline overhead, and use them in the GRO code paths. From Paolo Abeni. 24) Add support for netlink FDB get operations, from Roopa Prabhu. 25) Support bloom filter in mlxsw driver, from Nir Dotan. 26) Add SKB extension infrastructure. This consolidates the handling of the auxiliary SKB data used by IPSEC and bridge netfilter, and is designed to support the needs to MPTCP which could be integrated in the future. 27) Lots of XDP TX optimizations in mlx5 from Tariq Toukan. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1845 commits) net: dccp: fix kernel crash on module load drivers/net: appletalk/cops: remove redundant if statement and mask bnx2x: Fix NULL pointer dereference in bnx2x_del_all_vlans() on some hw net/net_namespace: Check the return value of register_pernet_subsys() net/netlink_compat: Fix a missing check of nla_parse_nested ieee802154: lowpan_header_create check must check daddr net/mlx4_core: drop useless LIST_HEAD mlxsw: spectrum: drop useless LIST_HEAD net/mlx5e: drop useless LIST_HEAD iptunnel: Set tun_flags in the iptunnel_metadata_reply from src net/mlx5e: fix semicolon.cocci warnings staging: octeon: fix build failure with XFRM enabled net: Revert recent Spectre-v1 patches. can: af_can: Fix Spectre v1 vulnerability packet: validate address length if non-zero nfc: af_nfc: Fix Spectre v1 vulnerability phonet: af_phonet: Fix Spectre v1 vulnerability net: core: Fix Spectre v1 vulnerability net: minor cleanup in skb_ext_add() net: drop the unused helper skb_ext_get() ...
| * \ Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2018-12-241-1/+2
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | Pull in bug fixes before respinning my net-next pull request. Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | net/net_namespace: Check the return value of register_pernet_subsys()Aditya Pakki2018-12-241-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In net_ns_init(), register_pernet_subsys() could fail while registering network namespace subsystems. The fix checks the return value and sends a panic() on failure. Signed-off-by: Aditya Pakki <pakki001@umn.edu> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | net: Revert recent Spectre-v1 patches.David S. Miller2018-12-231-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts: 50d5258634ae ("net: core: Fix Spectre v1 vulnerability") d686026b1e6e ("phonet: af_phonet: Fix Spectre v1 vulnerability") a95386f0390a ("nfc: af_nfc: Fix Spectre v1 vulnerability") a3ac5817ffe8 ("can: af_can: Fix Spectre v1 vulnerability") After some discussion with Alexei Starovoitov these all seem to be completely unnecessary. Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | net: core: Fix Spectre v1 vulnerabilityGustavo A. R. Silva2018-12-221-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | flen is indirectly controlled by user-space, hence leading to a potential exploitation of the Spectre variant 1 vulnerability. This issue was detected with the help of Smatch: net/core/filter.c:1101 bpf_check_classic() warn: potential spectre issue 'filter' [w] Fix this by sanitizing flen before using it to index filter at line 1101: switch (filter[flen - 1].code) { and through pc at line 1040: const struct sock_filter *ftest = &filter[pc]; Notice that given that speculation windows are large, the policy is to kill the speculation on the first load and not worry if it can be completed with a dependent load/store [1]. [1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2 Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2018-12-211-0/+3
| |\| |
| * | | net: minor cleanup in skb_ext_add()Paolo Abeni2018-12-211-5/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the extension to be added is already present, the only skb field we may need to update is 'extensions': we can reorder the code and avoid a branch. v1 -> v2: - be sure to flag the newly added extension as active Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: fix possible user-after-free in skb_ext_add()Paolo Abeni2018-12-211-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On cow we can free the old extension: we must avoid dereferencing such extension after skb_ext_maybe_cow(). Since 'new' contents are always equal to 'old' after the copy, we can fix the above accessing the relevant data using 'new'. Fixes: df5042f4c5b9 ("sk_buff: add skb extension infrastructure") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller2018-12-202-10/+43
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf-next 2018-12-21 The following pull-request contains BPF updates for your *net-next* tree. There is a merge conflict in test_verifier.c. Result looks as follows: [...] }, { "calls: cross frame pruning", .insns = { [...] .prog_type = BPF_PROG_TYPE_SOCKET_FILTER, .errstr_unpriv = "function calls to other bpf functions are allowed for root only", .result_unpriv = REJECT, .errstr = "!read_ok", .result = REJECT, }, { "jset: functional", .insns = { [...] { "jset: unknown const compare not taken", .insns = { BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_get_prandom_u32), BPF_JMP_IMM(BPF_JSET, BPF_REG_0, 1, 1), BPF_LDX_MEM(BPF_B, BPF_REG_8, BPF_REG_9, 0), BPF_EXIT_INSN(), }, .prog_type = BPF_PROG_TYPE_SOCKET_FILTER, .errstr_unpriv = "!read_ok", .result_unpriv = REJECT, .errstr = "!read_ok", .result = REJECT, }, [...] { "jset: range", .insns = { [...] }, .prog_type = BPF_PROG_TYPE_SOCKET_FILTER, .result_unpriv = ACCEPT, .result = ACCEPT, }, The main changes are: 1) Various BTF related improvements in order to get line info working. Meaning, verifier will now annotate the corresponding BPF C code to the error log, from Martin and Yonghong. 2) Implement support for raw BPF tracepoints in modules, from Matt. 3) Add several improvements to verifier state logic, namely speeding up stacksafe check, optimizations for stack state equivalence test and safety checks for liveness analysis, from Alexei. 4) Teach verifier to make use of BPF_JSET instruction, add several test cases to kselftests and remove nfp specific JSET optimization now that verifier has awareness, from Jakub. 5) Improve BPF verifier's slot_type marking logic in order to allow more stack slot sharing, from Jiong. 6) Add sk_msg->size member for context access and add set of fixes and improvements to make sock_map with kTLS usable with openssl based applications, from John. 7) Several cleanups and documentation updates in bpftool as well as auto-mount of tracefs for "bpftool prog tracelog" command, from Quentin. 8) Include sub-program tags from now on in bpf_prog_info in order to have a reliable way for user space to get all tags of the program e.g. needed for kallsyms correlation, from Song. 9) Add BTF annotations for cgroup_local_storage BPF maps and implement bpf fs pretty print support, from Roman. 10) Fix bpftool in order to allow for cross-compilation, from Ivan. 11) Update of bpftool license to GPLv2-only + BSD-2-Clause in order to be compatible with libbfd and allow for Debian packaging, from Jakub. 12) Remove an obsolete prog->aux sanitation in dump and get rid of version check for prog load, from Daniel. 13) Fix a memory leak in libbpf's line info handling, from Prashant. 14) Fix cpumap's frame alignment for build_skb() so that skb_shared_info does not get unaligned, from Jesper. 15) Fix test_progs kselftest to work with older compilers which are less smart in optimizing (and thus throwing build error), from Stanislav. 16) Cleanup and simplify AF_XDP socket teardown, from Björn. 17) Fix sk lookup in BPF kselftest's test_sock_addr with regards to netns_id argument, from Andrey. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | bpf: sk_msg, zap ingress queue on psock downJohn Fastabend2018-12-201-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In addition to releasing any cork'ed data on a psock when the psock is removed we should also release any skb's in the ingress work queue. Otherwise the skb's eventually get free'd but late in the tear down process so we see the WARNING due to non-zero sk_forward_alloc. void sk_stream_kill_queues(struct sock *sk) { ... WARN_ON(sk->sk_forward_alloc); ... } Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * | | bpf: sk_msg, fix socket data_ready eventsJohn Fastabend2018-12-201-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a skb verdict program is in-use and either another BPF program redirects to that socket or the new SK_PASS support is used the data_ready callback does not wake up application. Instead because the stream parser/verdict is using the sk data_ready callback we wake up the stream parser/verdict block. Fix this by adding a helper to check if the stream parser block is enabled on the sk and if so call the saved pointer which is the upper layers wake up function. This fixes application stalls observed when an application is waiting for data in a blocking read(). Fixes: d829e9c4112b ("tls: convert to generic sk_msg interface") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * | | bpf: skb_verdict, support SK_PASS on RX BPF pathJohn Fastabend2018-12-201-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add SK_PASS verdict support to SK_SKB_VERDICT programs. Now that support for redirects exists we can implement SK_PASS as a redirect to the same socket. This simplifies the BPF programs and avoids an extra map lookup on RX path for simple visibility cases. Further, reduces user (BPF programmer in this context) confusion when their program drops skb due to lack of support. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * | | bpf: skmsg, replace comments with BUILD bugJohn Fastabend2018-12-201-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enforce comment on structure layout dependency with a BUILD_BUG_ON to ensure the condition is maintained. Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * | | bpf: sk_msg, improve offset chk in _is_valid_accessJohn Fastabend2018-12-201-7/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The check for max offset in sk_msg_is_valid_access uses sizeof() which is incorrect because it would allow accessing possibly past the end of the struct in the padded case. Further, it doesn't preclude accessing any padding that may be added in the middle of a struct. All told this makes it fragile to rely on. To fix this explicitly check offsets with fields using the bpf_ctx_range() and bpf_ctx_range_till() macros. For reference the current structure layout looks as follows (reported by pahole) struct sk_msg_md { union { void * data; /* 8 */ }; /* 0 8 */ union { void * data_end; /* 8 */ }; /* 8 8 */ __u32 family; /* 16 4 */ __u32 remote_ip4; /* 20 4 */ __u32 local_ip4; /* 24 4 */ __u32 remote_ip6[4]; /* 28 16 */ __u32 local_ip6[4]; /* 44 16 */ __u32 remote_port; /* 60 4 */ /* --- cacheline 1 boundary (64 bytes) --- */ __u32 local_port; /* 64 4 */ __u32 size; /* 68 4 */ /* size: 72, cachelines: 2, members: 10 */ /* last cacheline: 8 bytes */ }; So there should be no padding at the moment but fixing this now prevents future errors. Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * | | bpf: sockmap, metadata support for reporting size of msgJohn Fastabend2018-12-191-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds metadata to sk_msg_md for BPF programs to read the sk_msg size. When the SK_MSG program is running under an application that is using sendfile the data is not copied into sk_msg buffers by default. Rather the BPF program uses sk_msg_pull_data to read the bytes in. This avoids doing the costly memcopy instructions when they are not in fact needed. However, if we don't know the size of the sk_msg we have to guess if needed bytes are available by doing a pull request which may fail. By including the size of the sk_msg BPF programs can check the size before issuing sk_msg_pull_data requests. Additionally, the same applies for sendmsg calls when the application provides multiple iovs. Here the BPF program needs to pull in data to update data pointers but its not clear where the data ends without a size parameter. In many cases "guessing" is not easy to do and results in multiple calls to pull and without bounded loops everything gets fairly tricky. Clean this up by including a u32 size field. Note, all writes into sk_msg_md are rejected already from sk_msg_is_valid_access so nothing additional is needed there. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
OpenPOWER on IntegriCloud