summaryrefslogtreecommitdiffstats
path: root/net/core
Commit message (Collapse)AuthorAgeFilesLines
* net: Add support for filtering neigh dump by master deviceDavid Ahern2015-09-291-1/+31
| | | | | | | | | | | Add support for filtering neighbor dumps by master device by adding the NDA_MASTER attribute to the dump request. A new netlink flag, NLM_F_DUMP_FILTERED, is added to indicate the kernel supports the request and output is filtered as requested. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: prepare fastopen code for upcoming listener changesEric Dumazet2015-09-291-1/+8
| | | | | | | | | | | | | | | | | | | | | While auditing TCP stack for upcoming 'lockless' listener changes, I found I had to change fastopen_init_queue() to properly init the object before publishing it. Otherwise an other cpu could try to lock the spinlock before it gets properly initialized. Instead of adding appropriate barriers, just remove dynamic memory allocations : - Structure is 28 bytes on 64bit arches. Using additional 8 bytes for holding a pointer seems overkill. - Two listeners can share same cache line and performance would suffer. If we really want to save few bytes, we would instead dynamically allocate whole struct request_sock_queue in the future. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* netpoll: Drop budget parameter from NAPI polling call hierarchyAlexander Duyck2015-09-291-12/+11
| | | | | | | | | | | | | | | | | | | | | | | For some reason we were carrying the budget value around between the various calls to napi->poll. If for example one of the drivers called had a bug in which it returned a non-zero value for work this could result in the budget value becoming negative. Rather than carry around a value of budget that is 0 or less we can instead just loop through and pass 0 to each napi->poll call. If any driver returns a value for work done that is non-zero then we can report that driver and continue rather than allowing a bad actor to make the budget value negative and pass that negative value to napi->poll. Note, the only actual change here is that instead of letting budget become negative we are keeping it at 0 regardless of the value returned for work since it should not be possible for the polling routine to do any actual work with a budget of 0. So if the polling routine returns a non-0 value we are just reporting it and continuing with a budget of 0 rather than letting that work value be subtracted from the budget of 0. Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-09-267-26/+49
|\ | | | | | | | | | | | | | | | | | | Conflicts: net/ipv4/arp.c The net/ipv4/arp.c conflict was one commit adding a new local variable while another commit was deleting one. Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: fix net_device refcountingRussell King2015-09-241-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | of_find_net_device_by_node() uses class_find_device() internally to lookup the corresponding network device. class_find_device() returns a reference to the embedded struct device, with its refcount incremented. Add a comment to the definition in net/core/net-sysfs.c indicating the need to drop this refcount, and fix the DSA code to drop this refcount when the OF-generated platform data is cleaned up and freed. Also arrange for the ref to be dropped when handling errors. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
| * fib_rules: fix fib rule dumps across multiple skbsWilson Kok2015-09-241-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dump_rules returns skb length and not error. But when family == AF_UNSPEC, the caller of dump_rules assumes that it returns an error. Hence, when family == AF_UNSPEC, we continue trying to dump on -EMSGSIZE errors resulting in incorrect dump idx carried between skbs belonging to the same dump. This results in fib rule dump always only dumping rules that fit into the first skb. This patch fixes dump_rules to return error so that we exit correctly and idx is correctly maintained between skbs that are part of the same dump. Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * netpoll: Close race condition between poll_one_napi and napi_disableNeil Horman2015-09-232-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Drivers might call napi_disable while not holding the napi instance poll_lock. In those instances, its possible for a race condition to exist between poll_one_napi and napi_disable. That is to say, poll_one_napi only tests the NAPI_STATE_SCHED bit to see if there is work to do during a poll, and as such the following may happen: CPU0 CPU1 ndo_tx_timeout napi_poll_dev napi_disable poll_one_napi test_and_set_bit (ret 0) test_bit (ret 1) reset adapter napi_poll_routine If the adapter gets a tx timeout without a napi instance scheduled, its possible for the adapter to think it has exclusive access to the hardware (as the napi instance is now scheduled via the napi_disable call), while the netpoll code thinks there is simply work to do. The result is parallel hardware access leading to corrupt data structures in the driver, and a crash. Additionaly, there is another, more critical race between netpoll and napi_disable. The disabled napi state is actually identical to the scheduled state for a given napi instance. The implication being that, if a napi instance is disabled, a netconsole instance would see the napi state of the device as having been scheduled, and poll it, likely while the driver was dong something requiring exclusive access. In the case above, its fairly clear that not having the rings in a state ready to be polled will cause any number of crashes. The fix should be pretty easy. netpoll uses its own bit to indicate that that the napi instance is in a state of being serviced by netpoll (NAPI_STATE_NPSVC). We can just gate disabling on that bit as well as the sched bit. That should prevent netpoll from conducting a napi poll if we convert its set bit to a test_and_set_bit operation to provide mutual exclusion Change notes: V2) Remove a trailing whtiespace Resubmit with proper subject prefix V3) Clean up spacing nits Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: "David S. Miller" <davem@davemloft.net> CC: jmaxwell@redhat.com Tested-by: jmaxwell@redhat.com Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: core: drop null test before destroy functionsJulia Lawall2015-09-151-8/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove unneeded NULL test. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x; @@ -if (x != NULL) { \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x); x = NULL; -} // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
| * rtnetlink: catch -EOPNOTSUPP errors from ndo_bridge_getlinkRoopa Prabhu2015-09-151-10/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | problem reported: kernel 4.1.3 ------------ # bridge vlan port vlan ids eth0 1 PVID Egress Untagged 90 91 92 93 94 95 96 97 98 99 100 vmbr0 1 PVID Egress Untagged 94 kernel 4.2 ----------- # bridge vlan port vlan ids ndo_bridge_getlink can return -EOPNOTSUPP when an interfaces ndo_bridge_getlink op is set to switchdev_port_bridge_getlink and CONFIG_SWITCHDEV is not defined. This today can happen to bond, rocker and team devices. This patch adds -EOPNOTSUPP checks after calls to ndo_bridge_getlink. Fixes: 85fdb956726ff2a ("switchdev: cut over to new switchdev_port_bridge_getlink") Reported-by: Alexandre DERUMIER <aderumier@odiso.com> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ebpf: emit correct src_reg for conditional jumpsTycho Andersen2015-09-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of always emitting BPF_REG_X, let's emit BPF_REG_X only when the source actually is BPF_X. This causes programs generated by the classic converter to not be importable via bpf(), as the eBPF verifier checks that the src_reg is correct or 0. While not a problem yet, this will be a problem when BPF_PROG_DUMP lands, and we can potentially dump and re-import programs generated by the converter. Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com> CC: Alexei Starovoitov <ast@kernel.org> CC: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: remove unused argument of __netdev_find_adj()Michal Kubeček2015-09-251-8/+7
| | | | | | | | | | | | | | | | The __netdev_find_adj() helper does not use its first argument, only the device to find and list to walk through. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
* | sched, bpf: let stack handle !IFF_UP devs on bpf_clone_redirectDaniel Borkmann2015-09-231-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | Similarly as already the case in bpf_redirect()/skb_do_redirect() pair, let the stack deal with devs that are !IFF_UP. dev_forward_skb() as well as dev_queue_xmit() will free the skb and increment drop counter internally in such cases, so we can spare the condition in bpf_clone_redirect(). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | bpf: add bpf_redirect() helperAlexei Starovoitov2015-09-172-0/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Existing bpf_clone_redirect() helper clones skb before redirecting it to RX or TX of destination netdev. Introduce bpf_redirect() helper that does that without cloning. Benchmarked with two hosts using 10G ixgbe NICs. One host is doing line rate pktgen. Another host is configured as: $ tc qdisc add dev $dev ingress $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \ action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop so it receives the packet on $dev and immediately xmits it on $dev + 1 The section 'clone_redirect_xmit' in tcbpf1_kern.o file has the program that does bpf_clone_redirect() and performance is 2.0 Mpps $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \ action bpf run object-file tcbpf1_kern.o section redirect_xmit drop which is using bpf_redirect() - 2.4 Mpps and using cls_bpf with integrated actions as: $ tc filter add dev $dev root pref 10 \ bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1 performance is 2.5 Mpps To summarize: u32+act_bpf using clone_redirect - 2.0 Mpps u32+act_bpf using redirect - 2.4 Mpps cls_bpf using redirect - 2.5 Mpps For comparison linux bridge in this setup is doing 2.1 Mpps and ixgbe rx + drop in ip_rcv - 7.8 Mpps Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | cls_bpf: introduce integrated actionsDaniel Borkmann2015-09-171-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Often cls_bpf classifier is used with single action drop attached. Optimize this use case and let cls_bpf return both classid and action. For backwards compatibility reasons enable this feature under TCA_BPF_FLAG_ACT_DIRECT flag. Then more interesting programs like the following are easier to write: int cls_bpf_prog(struct __sk_buff *skb) { /* classify arp, ip, ipv6 into different traffic classes * and drop all other packets */ switch (skb->protocol) { case htons(ETH_P_ARP): skb->tc_classid = 1; break; case htons(ETH_P_IP): skb->tc_classid = 2; break; case htons(ETH_P_IPV6): skb->tc_classid = 3; break; default: return TC_ACT_SHOT; } return TC_ACT_OK; } Joint work with Daniel Borkmann. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | netfilter: Pass net into okfnEric W. Biederman2015-09-171-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is immediately motivated by the bridge code that chains functions that call into netfilter. Without passing net into the okfns the bridge code would need to guess about the best expression for the network namespace to process packets in. As net is frequently one of the first things computed in continuation functions after netfilter has done it's job passing in the desired network namespace is in many cases a code simplification. To support this change the function dst_output_okfn is introduced to simplify passing dst_output as an okfn. For the moment dst_output_okfn just silently drops the struct net. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | bridge: Add br_netif_receive_skb remove netif_receive_skb_skEric W. Biederman2015-09-171-2/+2
| | | | | | | | | | | | | | | | netif_receive_skb_sk is only called once in the bridge code, replace it with a bridge specific function that calls netif_receive_skb. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Remove dev_queue_xmit_skEric W. Biederman2015-09-171-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A function with weird arguments that it will never use to accomdate a netfilter callback prototype is absolutely in the core of the networking stack. Frankly it does not make sense and it causes a lot of confusion as to why arguments that are never used are being passed to the function. As I am preparing to make a second change to arguments to the okfn even the names stops making sense. As I have removed the two callers of this function remove this confusion from the networking stack. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net-sysfs: get_netdev_queue_index() cleanupThadeu Lima de Souza Cascardo2015-09-171-6/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Redo commit ed1acc8cd8c22efa919da8d300bab646e01c2dce. Commit 822b3b2ebfff8e9b3d006086c527738a7ca00cd0 ("net: Add max rate tx queue attribute") moved get_netdev_queue_index around, but kept the old version. Probably because of a reuse of the original patch from before Eric's change to that function. Remove one inline keyword, and no need for a loop to find an index into a table. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Fixes: 822b3b2ebfff ("net: Add max rate tx queue attribute") Acked-by: Or Gerlitz <ogerlitz@mellanox.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | rtnetlink: RTEXT_FILTER_SKIP_STATS support to avoid dumping inet/inet6 statsSowmini Varadhan2015-09-151-1/+1
|/ | | | | | | | | | | | | | | | | | Many commonly used functions like getifaddrs() invoke RTM_GETLINK to dump the interface information, and do not need the the AF_INET6 statististics that are always returned by default from rtnl_fill_ifinfo(). Computing the statistics can be an expensive operation that impacts scaling, so it is desirable to avoid this if the information is not needed. This patch adds a the RTEXT_FILTER_SKIP_STATS extended info flag that can be passed with netlink_request() to avoid statistics computation for the ifinfo path. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: ipv6: use common fib_default_rule_prefPhil Sutter2015-09-091-7/+3
| | | | | | | | | | | | | | | | | | | | This switches IPv6 policy routing to use the shared fib_default_rule_pref() function of IPv4 and DECnet. It is also used in multicast routing for IPv4 as well as IPv6. The motivation for this patch is a complaint about iproute2 behaving inconsistent between IPv4 and IPv6 when adding policy rules: Formerly, IPv6 rules were assigned a fixed priority of 0x3FFF whereas for IPv4 the assigned priority value was decreased with each rule added. Since then all users of the default_pref field have been converted to assign the generic function fib_default_rule_pref(), fib_nl_newrule() may just use it directly instead. Therefore get rid of the function pointer altogether and make fib_default_rule_pref() static, as it's not used outside fib_rules.c anymore. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
* sock, diag: fix panic in sock_diag_put_filterinfoDaniel Borkmann2015-09-021-0/+3
| | | | | | | | | | | | | | | | | | | | | | diag socket's sock_diag_put_filterinfo() dumps classic BPF programs upon request to user space (ss -0 -b). However, native eBPF programs attached to sockets (SO_ATTACH_BPF) cannot be dumped with this method: Their orig_prog is always NULL. However, sock_diag_put_filterinfo() unconditionally tries to access its filter length resp. wants to copy the filter insns from there. Internal cBPF to eBPF transformations attached to sockets don't have this issue, as orig_prog state is kept. It's currently only used by packet sockets. If we would want to add native eBPF support in the future, this needs to be done through a different attribute than PACKET_DIAG_FILTER to not confuse possible user space disassemblers that work on diag data. Fixes: 89aa075832b0 ("net: sock: allow eBPF programs to be attached to sockets") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* flow_dissector: Use 'const' where possible.David S. Miller2015-09-011-38/+41
| | | | Signed-off-by: David S. Miller <davem@davemloft.net>
* flow: Move __get_hash_from_flowi{4,6} into flow_dissector.cDavid S. Miller2015-09-012-36/+35
| | | | | | | | These cannot live in net/core/flow.c which only builds when XFRM is enabled. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* flow_dissector: Don't use bit fields.David S. Miller2015-09-011-7/+7
| | | | | | | | | | | | | | Just have a flags member instead. In file included from include/linux/linkage.h:4:0, from include/linux/kernel.h:6, from net/core/flow_dissector.c:1: In function 'flow_keys_hash_start', inlined from 'flow_hash_from_keys' at net/core/flow_dissector.c:553:34: >> include/linux/compiler.h:447:38: error: call to '__compiletime_assert_459' declared with attribute error: BUILD_BUG_ON failed: FLOW_KEYS_HASH_OFFSET % sizeof(u32) Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* flow_dissector: Ignore flow dissector return value from ___skb_get_hashTom Herbert2015-09-011-9/+3
| | | | | | | | | | | In ___skb_get_hash ignore return value from skb_flow_dissect_flow_keys. A failure in that function likely means that there was a parse error, so we may as well use whatever fields were found before the error was hit. This is also good because it means we won't keep trying to derive the hash on subsequent calls to skb_get_hash for the same packet. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* flow_dissector: Add control/reporting of encapsulationTom Herbert2015-09-011-0/+15
| | | | | | | | | | Add an input flag to flow dissector on rather dissection should stop when encapsulation is detected (IP/IP or GRE). Also, add a key_control flag that indicates encapsulation was encountered during the dissection. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* flow_dissector: Add flag to stop parsing when an IPv6 flow label is seenTom Herbert2015-09-011-1/+4
| | | | | | | | | | | | | | | Add an input flag to flow dissector on rather dissection should be stopped when a flow label is encountered. Presumably, the flow label is derived from a sufficient hash of an inner transport packet so further dissection is not needed (that is ports are not included in the flow hash). Using the flow label instead of ports has the additional benefit that packet fragments should hash to same value as non-fragments for a flow (assuming that the same flow label is used). We set this flag by default in for skb_get_hash. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* flow_dissector: Add flag to stop parsing at L3Tom Herbert2015-09-011-0/+6
| | | | | | | | | | Add an input flag to flow dissector on rather dissection should be stopped when an L3 packet is encountered. This would be useful if a caller just wanted to get IP addresses of the outermost header (e.g. to do an L3 hash). Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* flow_dissector: Support IPv6 fragment headerTom Herbert2015-09-011-0/+25
| | | | | | | | Parse NEXTHDR_FRAGMENT. When seen account for it in the fragment bits of key_control. Also, check if first fragment should be parsed. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* flow_dissector: Add control/reporting of fragmentationTom Herbert2015-09-011-2/+13
| | | | | | | | | Add an input flag to flow dissector on rather dissection should be attempted on a first fragment. Also add key_control flags to indicate that a packet is a fragment or first fragment. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* flow_dissector: Add flags argument to skb_flow_dissector functionsTom Herbert2015-09-011-3/+4
| | | | | | | | The flags argument will allow control of the dissection process (for instance whether to parse beyond L3). Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* flow_dissector: Jump to exit code in __skb_flow_dissectTom Herbert2015-09-011-26/+25
| | | | | | | | | | Instead of returning immediately (on a parsing failure for instance) we jump to cleanup code. This always sets protocol values in key_control (even on a failure there is still valid information in the key_tags that was set before the problem was hit). Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* flowi: Abstract out functions to get flow hash based on flowiTom Herbert2015-09-011-0/+36
| | | | | | | | | | | Create __get_hash_from_flowi6 and __get_hash_from_flowi4 to get the flow keys and hash based on flowi structures. These are called by __skb_get_hash_flowi6 and __skb_get_hash_flowi4. Also, created get_hash_from_flowi6 and get_hash_from_flowi4 which can be called when just the hash value for a flowi is needed. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* skbuff: Make __skb_set_sw_hash a general functionTom Herbert2015-09-011-12/+6
| | | | | | | | | | | Move __skb_set_sw_hash to skbuff.h and add __skb_set_hash which is a common method (between __skb_set_sw_hash and skb_set_hash) to set the hash in an skbuff. Also, move skb_clear_hash to be closer to __skb_set_hash. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tun_dst: Remove opts_sizePravin B Shelar2015-08-311-1/+0
| | | | | | | | opts_size is only written and never read. Following patch removes this unused variable. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: use dctcp if enabled on the route to the initiatorDaniel Borkmann2015-08-311-0/+6
| | | | | | | | | | | | | | | | | | | | | | Currently, the following case doesn't use DCTCP, even if it should: A responder has f.e. Cubic as system wide default, but for a specific route to the initiating host, DCTCP is being set in RTAX_CC_ALGO. The initiating host then uses DCTCP as congestion control, but since the initiator sets ECT(0), tcp_ecn_create_request() doesn't set ecn_ok, and we have to fall back to Reno after 3WHS completes. We were thinking on how to solve this in a minimal, non-intrusive way without bloating tcp_ecn_create_request() needlessly: lets cache the CA ecn option flag in RTAX_FEATURES. In other words, when ECT(0) is set on the SYN packet, set ecn_ok=1 iff route RTAX_FEATURES contains the unexposed (internal-only) DST_FEATURE_ECN_CA. This allows to only do a single metric feature lookup inside tcp_ecn_create_request(). Joint work with Florian Westphal. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* ip_tunnels: record IP version in tunnel infoJiri Benc2015-08-291-0/+2
| | | | | | | | | | | | | | | | There's currently nothing preventing directing packets with IPv6 encapsulation data to IPv4 tunnels (and vice versa). If this happens, IPv6 addresses are incorrectly interpreted as IPv4 ones. Track whether the given ip_tunnel_key contains IPv4 or IPv6 data. Store this in ip_tunnel_info. Reject packets at appropriate places if they are supposed to be encapsulated into an incompatible protocol. Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: FIB tracepointsDavid Ahern2015-08-291-0/+1
| | | | | | | A few useful tracepoints developing VRF driver. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* netpoll: warn on netpoll_send_udp users who haven't disabled irqsNikolay Aleksandrov2015-08-281-0/+2
| | | | | | | | Make sure we catch future netpoll_send_udp users who use it without disabling irqs and also as a hint for poll_controller users. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-08-271-1/+1
|\
| * mm: make page pfmemalloc check more robustMichal Hocko2015-08-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb") added checks for page->pfmemalloc to __skb_fill_page_desc(): if (page->pfmemalloc && !page->mapping) skb->pfmemalloc = true; It assumes page->mapping == NULL implies that page->pfmemalloc can be trusted. However, __delete_from_page_cache() can set set page->mapping to NULL and leave page->index value alone. Due to being in union, a non-zero page->index will be interpreted as true page->pfmemalloc. So the assumption is invalid if the networking code can see such a page. And it seems it can. We have encountered this with a NFS over loopback setup when such a page is attached to a new skbuf. There is no copying going on in this case so the page confuses __skb_fill_page_desc which interprets the index as pfmemalloc flag and the network stack drops packets that have been allocated using the reserves unless they are to be queued on sockets handling the swapping which is the case here and that leads to hangs when the nfs client waits for a response from the server which has been dropped and thus never arrive. The struct page is already heavily packed so rather than finding another hole to put it in, let's do a trick instead. We can reuse the index again but define it to an impossible value (-1UL). This is the page index so it should never see the value that large. Replace all direct users of page->pfmemalloc by page_is_pfmemalloc which will hide this nastiness from unspoiled eyes. The information will get lost if somebody wants to use page->index obviously but that was the case before and the original code expected that the information should be persisted somewhere else if that is really needed (e.g. what SLAB and SLUB do). [akpm@linux-foundation.org: fix blooper in slub] Fixes: c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb") Signed-off-by: Michal Hocko <mhocko@suse.com> Debugged-by: Vlastimil Babka <vbabka@suse.com> Debugged-by: Jiri Bohac <jbohac@suse.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Acked-by: Mel Gorman <mgorman@suse.de> Cc: <stable@vger.kernel.org> [3.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | net: fix IFF_NO_QUEUE for drivers using alloc_netdevPhil Sutter2015-08-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Printing a warning in alloc_netdev_mqs() if tx_queue_len is zero and IFF_NO_QUEUE not set is not appropriate since drivers may use one of the alloc_netdev* macros instead of alloc_etherdev*, thereby not intentionally leaving tx_queue_len uninitialized. Instead check here if tx_queue_len is zero and set IFF_NO_QUEUE, so the value of tx_queue_len can be ignored in net/sched_generic.c. Fixes: 906470c ("net: warn if drivers set tx_queue_len = 0") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
* | sock: fix kernel doc errorJean Sacren2015-08-271-1/+1
| | | | | | | | | | | | | | | | | | | | The symbol '__sk_reclaim' is not present in the current tree. Apparently '__sk_reclaim' was meant to be '__sk_mem_reclaim', so fix it with the right symbol name for the kernel doc. Signed-off-by: Jean Sacren <sakiwit@gmail.com> Cc: Hideo Aoki <haoki@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: introduce change upper device notifier change infoJiri Pirko2015-08-271-2/+14
| | | | | | | | | | | | | | Add info that is passed along with NETDEV_CHANGEUPPER event. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: consolidate tc_classify{,_compat}Daniel Borkmann2015-08-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For classifiers getting invoked via tc_classify(), we always need an extra function call into tc_classify_compat(), as both are being exported as symbols and tc_classify() itself doesn't do much except handling of reclassifications when tp->classify() returned with TC_ACT_RECLASSIFY. CBQ and ATM are the only qdiscs that directly call into tc_classify_compat(), all others use tc_classify(). When tc actions are being configured out in the kernel, tc_classify() effectively does nothing besides delegating. We could spare this layer and consolidate both functions. pktgen on single CPU constantly pushing skbs directly into the netif_receive_skb() path with a dummy classifier on ingress qdisc attached, improves slightly from 22.3Mpps to 23.1Mpps. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | bpf: fix bpf_skb_set_tunnel_key() helperAlexei Starovoitov2015-08-261-0/+1
| | | | | | | | | | | | | | | | | | Make sure to indicate to tunnel driver that key.tun_id is set, otherwise gre won't recognize the metadata. Fixes: d3aa45ce6b94 ("bpf: add helpers to access tunnel metadata") Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | route: fix a use-after-freeWANG Cong2015-08-251-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes the following crash: general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.2.0-rc7+ #166 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff88010656d280 ti: ffff880106570000 task.ti: ffff880106570000 RIP: 0010:[<ffffffff8182f91b>] [<ffffffff8182f91b>] dst_destroy+0xa6/0xef RSP: 0018:ffff880107603e38 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffff8800d225a000 RCX: ffffffff82250fd0 RDX: 0000000000000001 RSI: ffffffff82250fd0 RDI: 6b6b6b6b6b6b6b6b RBP: ffff880107603e58 R08: 0000000000000001 R09: 0000000000000001 R10: 000000000000b530 R11: ffff880107609000 R12: 0000000000000000 R13: ffffffff82343c40 R14: 0000000000000000 R15: ffffffff8182fb4f FS: 0000000000000000(0000) GS:ffff880107600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007fcabd9d3000 CR3: 00000000d7279000 CR4: 00000000000006e0 Stack: ffffffff82250fd0 ffff8801077d6f00 ffffffff82253c40 ffff8800d225a000 ffff880107603e68 ffffffff8182fb5d ffff880107603f08 ffffffff810d795e ffffffff810d7648 ffff880106574000 ffff88010656d280 ffff88010656d280 Call Trace: <IRQ> [<ffffffff8182fb5d>] dst_destroy_rcu+0xe/0x1d [<ffffffff810d795e>] rcu_process_callbacks+0x618/0x7eb [<ffffffff810d7648>] ? rcu_process_callbacks+0x302/0x7eb [<ffffffff8182fb4f>] ? dst_gc_task+0x1eb/0x1eb [<ffffffff8107e11b>] __do_softirq+0x178/0x39f [<ffffffff8107e52e>] irq_exit+0x41/0x95 [<ffffffff81a4f215>] smp_apic_timer_interrupt+0x34/0x40 [<ffffffff81a4d5cd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff8100b968>] ? default_idle+0x21/0x32 [<ffffffff8100b966>] ? default_idle+0x1f/0x32 [<ffffffff8100bf19>] arch_cpu_idle+0xf/0x11 [<ffffffff810b0bc7>] default_idle_call+0x1f/0x21 [<ffffffff810b0dce>] cpu_startup_entry+0x1ad/0x273 [<ffffffff8102fe67>] start_secondary+0x135/0x156 dst is freed right before lwtstate_put(), this is not correct... Fixes: 61adedf3e3f1 ("route: move lwtunnel state to dst_entry") Acked-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net-next: Fix warning while make xmldocs caused by skbuff.cMasanari Iida2015-08-251-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fix following warnings. .//net/core/skbuff.c:407: warning: No description found for parameter 'len' .//net/core/skbuff.c:407: warning: Excess function parameter 'length' description in '__netdev_alloc_skb' .//net/core/skbuff.c:476: warning: No description found for parameter 'len' .//net/core/skbuff.c:476: warning: Excess function parameter 'length' description in '__napi_alloc_skb' Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | lwt: Add cfg argument to build_stateTom Herbert2015-08-241-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | Add cfg and family arguments to lwt build state functions. cfg is a void pointer and will either be a pointer to a fib_config or fib6_config structure. The family parameter indicates which one (either AF_INET or AF_INET6). LWT encpasulation implementation may use the fib configuration to build the LWT state. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-08-211-19/+18
|\ \ | |/ | | | | | | | | | | | | | | Conflicts: drivers/net/usb/qmi_wwan.c Overlapping additions of new device IDs to qmi_wwan.c Signed-off-by: David S. Miller <davem@davemloft.net>
OpenPOWER on IntegriCloud