summaryrefslogtreecommitdiffstats
path: root/net/ipv4/ip_output.c
Commit message (Collapse)AuthorAgeFilesLines
* net: ip: push gso skb forwarding handling down the stackFlorian Westphal2014-05-071-3/+48
| | | | | | | | | | | | | | | | Doing the segmentation in the forward path has one major drawback: When using virtio, we may process gso udp packets coming from host network stack. In that case, netfilter POSTROUTING will see one packet with udp header followed by multiple ip fragments. Delay the segmentation and do it after POSTROUTING invocation to avoid this. Fixes: fe6cc55f3a9 ("net: ip, ipv6: handle gso skbs in forwarding path") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: add a sock pointer to dst->output() path.Eric Dumazet2014-04-151-6/+5
| | | | | | | | | | | | | | | | In the dst->output() path for ipv4, the code assumes the skb it has to transmit is attached to an inet socket, specifically via ip_mc_output() : The sk_mc_loop() test triggers a WARN_ON() when the provider of the packet is an AF_PACKET socket. The dst->output() method gets an additional 'struct sock *sk' parameter. This needs a cascade of changes so that this parameter can be propagated from vxlan to final consumer. Fixes: 8f646c922d55 ("vxlan: keep original skb ownership") Reported-by: lucien xin <lucien.xin@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: add a sock pointer to ip_queue_xmit()Eric Dumazet2014-04-151-2/+3
| | | | | | | | | | | | | | | | ip_queue_xmit() assumes the skb it has to transmit is attached to an inet socket. Commit 31c70d5956fc ("l2tp: keep original skb ownership") changed l2tp to not change skb ownership and thus broke this assumption. One fix is to add a new 'struct sock *sk' parameter to ip_queue_xmit(), so that we do not assume skb->sk points to the socket used by l2tp tunnel. Fixes: 31c70d5956fc ("l2tp: keep original skb ownership") Reported-by: Zhan Jianyu <nasa4836@gmail.com> Tested-by: Zhan Jianyu <nasa4836@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2014-03-051-3/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/wireless/ath/ath9k/recv.c drivers/net/wireless/mwifiex/pcie.c net/ipv6/sit.c The SIT driver conflict consists of a bug fix being done by hand in 'net' (missing u64_stats_init()) whilst in 'net-next' a helper was created (netdev_alloc_pcpu_stats()) which takes care of this. The two wireless conflicts were overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * netfilter: nf_tables: fix nf_trace always-on with XT_TRACE=nFlorian Westphal2014-02-171-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When using nftables with CONFIG_NETFILTER_XT_TARGET_TRACE=n, we get lots of "TRACE: filter:output:policy:1 IN=..." warnings as several places will leave skb->nf_trace uninitialised. Unlike iptables tracing functionality is not conditional in nftables, so always copy/zero nf_trace setting when nftables is enabled. Move this into __nf_copy() helper. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | ipv4: yet another new IP_MTU_DISCOVER option IP_PMTUDISC_OMITHannes Frederic Sowa2014-02-261-6/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | IP_PMTUDISC_INTERFACE has a design error: because it does not allow the generation of fragments if the interface mtu is exceeded, it is very hard to make use of this option in already deployed name server software for which I introduced this option. This patch adds yet another new IP_MTU_DISCOVER option to not honor any path mtu information and not accepting new icmp notifications destined for the socket this option is enabled on. But we allow outgoing fragmentation in case the packet size exceeds the outgoing interface mtu. As such this new option can be used as a drop-in replacement for IP_PMTUDISC_DONT, which is currently in use by most name server software making the adoption of this option very smooth and easy. The original advantage of IP_PMTUDISC_INTERFACE is still maintained: ignoring incoming path MTU updates and not honoring discovered path MTUs in the output path. Fixes: 482fc6094afad5 ("ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE") Cc: Florian Weimer <fweimer@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: use ip_skb_dst_mtu to determine mtu in ip_fragmentHannes Frederic Sowa2014-02-261-2/+1
|/ | | | | | | | | | | | | | ip_skb_dst_mtu mostly falls back to ip_dst_mtu_maybe_forward if no socket is attached to the skb (in case of forwarding) or determines the mtu like we do in ip_finish_output, which actually checks if we should branch to ip_fragment. Thus use the same function to determine the mtu here, too. This is important for the introduction of IP_PMTUDISC_OMIT, where we want the packets getting cut in pieces of the size of the outgoing interface mtu. IPv6 already does this correctly. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: register igmp_notifier even when !CONFIG_PROC_FSWANG Cong2014-01-141-2/+2
| | | | | | | | | | | | | | We still need this notifier even when we don't config PROC_FS. It should be rare to have a kernel without PROC_FS, so just for completeness. Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: David S. Miller <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against ↵Hannes Frederic Sowa2014-01-131-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pmtu spoofing While forwarding we should not use the protocol path mtu to calculate the mtu for a forwarded packet but instead use the interface mtu. We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was introduced for multicast forwarding. But as it does not conflict with our usage in unicast code path it is perfect for reuse. I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular dependencies because of IPSKB_FORWARDED. Because someone might have written a software which does probe destinations manually and expects the kernel to honour those path mtus I introduced a new per-namespace "ip_forward_use_pmtu" knob so someone can disable this new behaviour. We also still use mtus which are locked on a route for forwarding. The reason for this change is, that path mtus information can be injected into the kernel via e.g. icmp_err protocol handler without verification of local sockets. As such, this could cause the IPv4 forwarding path to wrongfully emit fragmentation needed notifications or start to fragment packets along a path. Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED won't be set and further fragmentation logic will use the path mtu to determine the fragmentation size. They also recheck packet size with help of path mtu discovery and report appropriate errors. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: John Heffner <johnwheffner@gmail.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: consistent reporting of pmtu data in case of corkingHannes Frederic Sowa2013-12-221-2/+3
| | | | | | | | | | | | | | | | | We report different pmtu values back on the first write and on further writes on an corked socket. Also don't include the dst.header_len (respectively exthdrlen) as this should already be dealt with by the interface mtu of the outgoing (virtual) interface and policy of that interface should dictate if fragmentation should happen. Instead reduce the pmtu data by IP options as we do for IPv6. Make the same changes for ip_append_data, where we did not care about options or dst.header_len at all. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACEHannes Frederic Sowa2013-11-051-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sockets marked with IP_PMTUDISC_INTERFACE won't do path mtu discovery, their sockets won't accept and install new path mtu information and they will always use the interface mtu for outgoing packets. It is guaranteed that the packet is not fragmented locally. But we won't set the DF-Flag on the outgoing frames. Florian Weimer had the idea to use this flag to ensure DNS servers are never generating outgoing fragments. They may well be fragmented on the path, but the server never stores or usees path mtu values, which could well be forged in an attack. (The root of the problem with path MTU discovery is that there is no reliable way to authenticate ICMP Fragmentation Needed But DF Set messages because they are sent from intermediate routers with their source addresses, and the IMCP payload will not always contain sufficient information to identify a flow.) Recent research in the DNS community showed that it is possible to implement an attack where DNS cache poisoning is feasible by spoofing fragments. This work was done by Amir Herzberg and Haya Shulman: <https://sites.google.com/site/hayashulman/files/fragmentation-poisoning.pdf> This issue was previously discussed among the DNS community, e.g. <http://www.ietf.org/mail-archive/web/dnsext/current/msg01204.html>, without leading to fixes. This patch depends on the patch "ipv4: fix DO and PROBE pmtu mode regarding local fragmentation with UFO/CORK" for the enforcement of the non-fragmentable checks. If other users than ip_append_page/data should use this semantic too, we have to add a new flag to IPCB(skb)->flags to suppress local fragmentation and check for this in ip_finish_output. Many thanks to Florian Weimer for the idea and feedback while implementing this patch. Cc: David S. Miller <davem@davemloft.net> Suggested-by: Florian Weimer <fweimer@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: fix DO and PROBE pmtu mode regarding local fragmentation with UFO/CORKHannes Frederic Sowa2013-10-291-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | UFO as well as UDP_CORK do not respect IP_PMTUDISC_DO and IP_PMTUDISC_PROBE well enough. UFO enabled packet delivery just appends all frags to the cork and hands it over to the network card. So we just deliver non-DF udp fragments (DF-flag may get overwritten by hardware or virtual UFO enabled interface). UDP_CORK does enqueue the data until the cork is disengaged. At this point it sets the correct IP_DF and local_df flags and hands it over to ip_fragment which in this case will generate an icmp error which gets appended to the error socket queue. This is not reflected in the syscall error (of course, if UFO is enabled this also won't happen). Improve this by checking the pmtudisc flags before appending data to the socket and if we still can fit all data in one packet when IP_PMTUDISC_DO or IP_PMTUDISC_PROBE is set, only then proceed. We use (mtu-fragheaderlen) to check for the maximum length because we ensure not to generate a fragment and non-fragmented data does not need to have its length aligned on 64 bit boundaries. Also the passed in ip_options are already aligned correctly. Maybe, we can relax some other checks around ip_fragment. This needs more research. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2013-10-231-4/+9
|\ | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/usb/qmi_wwan.c include/net/dst.h Trivial merge conflicts, both were overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * ip_output: do skb ufo init for peeked non ufo skb as wellJiri Pirko2013-10-191-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now, if user application does: sendto len<mtu flag MSG_MORE sendto len>mtu flag 0 The skb is not treated as fragmented one because it is not initialized that way. So move the initialization to fix this. introduced by: commit e89e9cf539a28df7d0eb1d0a545368e9920b34ac "[IPv4/IPv6]: UFO Scatter-gather approach" Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv4: processing ancillary IP_TOS or IP_TTLFrancesco Fusco2013-09-281-3/+10
|/ | | | | | | | | | | | | | | | | | | If IP_TOS or IP_TTL are specified as ancillary data, then sendmsg() sends out packets with the specified TTL or TOS overriding the socket values specified with the traditional setsockopt(). The struct inet_cork stores the values of TOS, TTL and priority that are passed through the struct ipcm_cookie. If there are user-specified TOS (tos != -1) or TTL (ttl != 0) in the struct ipcm_cookie, these values are used to override the per-socket values. In case of TOS also the priority is changed accordingly. Two helper functions get_rttos and get_rtconn_flags are defined to take into account the presence of a user specified TOS value when computing RT_TOS and RT_CONN_FLAGS. Signed-off-by: Francesco Fusco <ffusco@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ip: generate unique IP identificator if local fragmentation is allowedAnsis Atteka2013-09-191-3/+3
| | | | | | | | | | | | | | | | | If local fragmentation is allowed, then ip_select_ident() and ip_select_ident_more() need to generate unique IDs to ensure correct defragmentation on the peer. For example, if IPsec (tunnel mode) has to encrypt large skbs that have local_df bit set, then all IP fragments that belonged to different ESP datagrams would have used the same identificator. If one of these IP fragments would get lost or reordered, then peer could possibly stitch together wrong IP fragments that did not belong to the same datagram. This would lead to a packet loss or data corruption. Signed-off-by: Ansis Atteka <aatteka@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ip: use ip_hdr() in __ip_make_skb() to retrieve IP headerAnsis Atteka2013-09-191-1/+1
| | | | | | | | skb->data already points to IP header, but for the sake of consistency we can also use ip_hdr() to retrieve it. Signed-off-by: Ansis Atteka <aatteka@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* xfrm: introduce helper for safe determination of mtuHannes Frederic Sowa2013-08-141-8/+0
| | | | | | | | | | | | | skb->sk socket can be of AF_INET or AF_INET6 address family. Thus we always have to make sure we a referring to the correct interpretation of skb->sk. We only depend on header defines to query the mtu, so we don't introduce a new dependency to ipv6 by this change. Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
* ipv4: ip_output: remove inline marking of EXPORT_SYMBOL functionsDenis Efremov2013-05-111-1/+1
| | | | | | | | | | EXPORT_SYMBOL and inline directives are contradictory to each other. The patch fixes this inconsistency. Found by Linux Driver Verification project (linuxtesting.org). Signed-off-by: Denis Efremov <yefremov.denis@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* netfilter: use IS_ENABLE to replace if defined in TRACE targetGao feng2013-04-021-2/+1
| | | | | Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* net: Fix possible wrong checksum generation.Pravin B Shelar2013-02-131-0/+1
| | | | | | | | | | | | | | | | | | | | | Patch cef401de7be8c4e (net: fix possible wrong checksum generation) fixed wrong checksum calculation but it broke TSO by defining new GSO type but not a netdev feature for that type. net_gso_ok() would not allow hardware checksum/segmentation offload of such packets without the feature. Following patch fixes TSO and wrong checksum. This patch uses same logic that Eric Dumazet used. Patch introduces new flag SKBTX_SHARED_FRAG if at least one frag can be modified by the user. but SKBTX_SHARED_FRAG flag is kept in skb shared info tx_flags rather than gso_type. tx_flags is better compared to gso_type since we can have skb with shared frag without gso packet. It does not link SHARED_FRAG to GSO, So there is no need to define netdev feature for this. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Handle encapsulated offloads before fragmentation or handing to lower devAlexander Duyck2012-12-091-0/+4
| | | | | | | | | | This change allows the VXLAN to enable Tx checksum offloading even on devices that do not support encapsulated checksum offloads. The advantage to this is that it allows for the lower device to change due to routing table changes without impacting features on the VXLAN itself. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: introduce rt_uses_gatewayJulian Anastasov2012-10-081-2/+2
| | | | | | | | | | | | | | | | Add new flag to remember when route is via gateway. We will use it to allow rt_gateway to contain address of directly connected host for the cases when DST_NOCACHE is used or when the NH exception caches per-destination route without DST_NOCACHE flag, i.e. when routes are not used for other destinations. By this way we force the neighbour resolving to work with the routed destination but we can use different address in the packet, feature needed for IPVS-DR where original packet for virtual IP is routed via route to real IP. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: use a per task frag allocatorEric Dumazet2012-09-241-42/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently use a per socket order-0 page cache for tcp_sendmsg() operations. This page is used to build fragments for skbs. Its done to increase probability of coalescing small write() into single segments in skbs still in write queue (not yet sent) But it wastes a lot of memory for applications handling many mostly idle sockets, since each socket holds one page in sk->sk_sndmsg_page Its also quite inefficient to build TSO 64KB packets, because we need about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit page allocator more than wanted. This patch adds a per task frag allocator and uses bigger pages, if available. An automatic fallback is done in case of memory pressure. (up to 32768 bytes per frag, thats order-3 pages on x86) This increases TCP stream performance by 20% on loopback device, but also benefits on other network devices, since 8x less frags are mapped on transmit and unmapped on tx completion. Alexander Duyck mentioned a probable performance win on systems with IOMMU enabled. Its possible some SG enabled hardware cant cope with bigger fragments, but their ndo_start_xmit() should already handle this, splitting a fragment in sub fragments, since some arches have PAGE_SIZE=65536 Successfully tested on various ethernet devices. (ixgbe, igb, bnx2x, tg3, mellanox mlx4) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Vijay Subramanian <subramanian.vijay@gmail.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: fix path MTU discovery with connection trackingPatrick McHardy2012-08-261-1/+3
| | | | | | | | | | | | | | | | IPv4 conntrack defragments incoming packet at the PRE_ROUTING hook and (in case of forwarded packets) refragments them at POST_ROUTING independent of the IP_DF flag. Refragmentation uses the dst_mtu() of the local route without caring about the original fragment sizes, thereby breaking PMTUD. This patch fixes this by keeping track of the largest received fragment with IP_DF set and generates an ICMP fragmentation required error during refragmentation if that size exceeds the MTU. Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: David S. Miller <davem@davemloft.net>
* ipv4: fix ip header ident selection in __ip_make_skb()Eric Dumazet2012-08-211-1/+1
| | | | | | | | | | | | | | | | | | Christian Casteyde reported a kmemcheck 32-bit read from uninitialized memory in __ip_select_ident(). It turns out that __ip_make_skb() called ip_select_ident() before properly initializing iph->daddr. This is a bug uncovered by commit 1d861aa4b3fb (inet: Minimize use of cached route inetpeer.) Addresses https://bugzilla.kernel.org/show_bug.cgi?id=46131 Reported-by: Christian Casteyde <casteyde.christian@free.fr> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: fix ip_send_skb()Eric Dumazet2012-08-101-3/+2
| | | | | | | | | | | ip_send_skb() can send orphaned skb, so we must pass the net pointer to avoid possible NULL dereference in error path. Bug added by commit 3a7c384ffd57 (ipv4: tcp: unicast_sock should not land outside of TCP stack) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: tcp: unicast_sock should not land outside of TCP stackEric Dumazet2012-08-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | commit be9f4a44e7d41cee (ipv4: tcp: remove per net tcp_sock) added a selinux regression, reported and bisected by John Stultz selinux_ip_postroute_compat() expect to find a valid sk->sk_security pointer, but this field is NULL for unicast_sock It turns out that unicast_sock are really temporary stuff to be able to reuse part of IP stack (ip_append_data()/ip_push_pending_frames()) Fact is that frames sent by ip_send_unicast_reply() should be orphaned to not fool LSM. Note IPv6 never had this problem, as tcp_v6_send_response() doesnt use a fake socket at all. I'll probably implement tcp_v4_send_response() to remove these unicast_sock in linux-3.7 Reported-by: John Stultz <johnstul@us.ibm.com> Bisected-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Eric Paris <eparis@parisplace.org> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ip: fix error handling in ip_finish_output2()Vasiliy Kulikov2012-08-061-1/+1
| | | | | | | | | | | | __neigh_create() returns either a pointer to struct neighbour or PTR_ERR(). But the caller expects it to return either a pointer or NULL. Replace the NULL check with IS_ERR() check. The bug was introduced in a263b3093641fb1ec377582c90986a7fd0625184 ("ipv4: Make neigh lookups directly in output packet path."). Signed-off-by: Vasily Kulikov <segoon@openwall.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'kill_rtcache'David S. Miller2012-07-221-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ipv4 routing cache is non-deterministic, performance wise, and is subject to reasonably easy to launch denial of service attacks. The routing cache works great for well behaved traffic, and the world was a much friendlier place when the tradeoffs that led to the routing cache's design were considered. What it boils down to is that the performance of the routing cache is a product of the traffic patterns seen by a system rather than being a product of the contents of the routing tables. The former of which is controllable by external entitites. Even for "well behaved" legitimate traffic, high volume sites can see hit rates in the routing cache of only ~%10. The general flow of this patch series is that first the routing cache is removed. We build a completely new rtable entry every lookup request. Next we make some simplifications due to the fact that removing the routing cache causes several members of struct rtable to become no longer necessary. Then we need to make some amends such that we can legally cache pre-constructed routes in the FIB nexthops. Firstly, we need to invalidate routes which are hit with nexthop exceptions. Secondly we have to change the semantics of rt->rt_gateway such that zero means that the destination is on-link and non-zero otherwise. Now that the preparations are ready, we start caching precomputed routes in the FIB nexthops. Output and input routes need different kinds of care when determining if we can legally do such caching or not. The details are in the commit log messages for those changes. The patch series then winds down with some more struct rtable simplifications and other tidy ups that remove unnecessary overhead. On a SPARC-T3 output route lookups are ~876 cycles. Input route lookups are ~1169 cycles with rpfilter disabled, and about ~1468 cycles with rpfilter enabled. These measurements were taken with the kbench_mod test module in the net_test_tools GIT tree: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net_test_tools.git That GIT tree also includes a udpflood tester tool and stresses route lookups on packet output. For example, on the same SPARC-T3 system we can run: time ./udpflood -l 10000000 10.2.2.11 with routing cache: real 1m21.955s user 0m6.530s sys 1m15.390s without routing cache: real 1m31.678s user 0m6.520s sys 1m25.140s Performance undoubtedly can easily be improved further. For example fib_table_lookup() performs a lot of excessive computations with all the masking and shifting, some of it conditionalized to deal with edge cases. Also, Eric's no-ref optimization for input route lookups can be re-instated for the FIB nexthop caching code path. I would be really pleased if someone would work on that. In fact anyone suitable motivated can just fire up perf on the loading of the test net_test_tools benchmark kernel module. I spend much of my time going: bash# perf record insmod ./kbench_mod.ko dst=172.30.42.22 src=74.128.0.1 iif=2 bash# perf report Thanks to helpful feedback from Joe Perches, Eric Dumazet, Ben Hutchings, and others. Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv4: Adjust semantics of rt->rt_gateway.David S. Miller2012-07-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to allow prefixed routes, we have to adjust how rt_gateway is set and interpreted. The new interpretation is: 1) rt_gateway == 0, destination is on-link, nexthop is iph->daddr 2) rt_gateway != 0, destination requires a nexthop gateway Abstract the fetching of the proper nexthop value using a new inline helper, rt_nexthop(), as suggested by Joe Perches. Signed-off-by: David S. Miller <davem@davemloft.net> Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
* | ipv4: tcp: set unicast_sock uc_ttl to -1Eric Dumazet2012-07-221-1/+2
|/ | | | | | | | | | | Set unicast_sock uc_ttl to -1 so that we select the right ttl, instead of sending packets with a 0 ttl. Bug added in commit be9f4a44e7d4 (ipv4: tcp: remove per net tcp_sock) Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: tcp: remove per net tcp_sockEric Dumazet2012-07-191-18/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket per network namespace. This leads to bad behavior on multiqueue NICS, because many cpus contend for the socket lock and once socket lock is acquired, extra false sharing on various socket fields slow down the operations. To better resist to attacks, we use a percpu socket. Each cpu can run without contention, using appropriate memory (local node) Additional features : 1) We also mirror the queue_mapping of the incoming skb, so that answers use the same queue if possible. 2) Setting SOCK_USE_WRITE_QUEUE socket flag speedup sock_wfree() 3) We now limit the number of in-flight RST/ACK [1] packets per cpu, instead of per namespace, and we honor the sysctl_wmem_default limit dynamically. (Prior to this patch, sysctl_wmem_default value was copied at boot time, so any further change would not affect tcp_sock limit) [1] These packets are only generated when no socket was matched for the incoming packet. Reported-by: Bill Sommerfeld <wsommerfeld@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Do delayed neigh confirmation.David S. Miller2012-07-051-1/+1
| | | | | | | | | | | | | | When a dst_confirm() happens, mark the confirmation as pending in the dst. Then on the next packet out, when we have the neigh in-hand, do the update. This removes the dependency in dst_confirm() of dst's having an attached neigh. While we're here, remove the explicit 'dst' NULL check, all except 2 or 3 call sites ensure it's not NULL. So just fix those cases up. Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Make neigh lookups directly in output packet path.David S. Miller2012-07-051-4/+8
| | | | | | Do not use the dst cached neigh, we'll be getting rid of that. Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Show that ip_send_reply() is purely unicast routine.David S. Miller2012-06-281-4/+5
| | | | | | | | | Rename it to ip_send_unicast_reply() and add explicit 'saddr' argument. This removed one of the few users of rt->rt_spec_dst. Signed-off-by: David S. Miller <davem@davemloft.net>
* net-next: add dev_loopback_xmit() to avoid duplicate codeMichel Machado2012-06-121-15/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | Add dev_loopback_xmit() in order to deduplicate functions ip_dev_loopback_xmit() (in net/ipv4/ip_output.c) and ip6_dev_loopback_xmit() (in net/ipv6/ip6_output.c). I was about to reinvent the wheel when I noticed that ip_dev_loopback_xmit() and ip6_dev_loopback_xmit() do exactly what I need and are not IP-only functions, but they were not available to reuse elsewhere. ip6_dev_loopback_xmit() does not have line "skb_dst_force(skb);", but I understand that this is harmless, and should be in dev_loopback_xmit(). Signed-off-by: Michel Machado <michel@digirati.com.br> CC: "David S. Miller" <davem@davemloft.net> CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> CC: James Morris <jmorris@namei.org> CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> CC: Patrick McHardy <kaber@trash.net> CC: Eric Dumazet <edumazet@google.com> CC: Jiri Pirko <jpirko@redhat.com> CC: "Michał Mirosław" <mirq-linux@rere.qmqm.pl> CC: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: use consume_skb() in place of kfree_skb()Eric Dumazet2012-06-041-2/+2
| | | | | | | Remove some dropwatch/drop_monitor false positives. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Convert net_ratelimit uses to net_<level>_ratelimitedJoe Perches2012-05-151-2/+2
| | | | | | | | | | Standardize the net core ratelimited logging functions. Coalesce formats, align arguments. Change a printk then vprintk sequence to use printf extension %pV. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Remove all #inclusions of asm/system.hDavid Howells2012-03-281-1/+0
| | | | | | | | | Remove all #inclusions of asm/system.h preparatory to splitting and killing it. Performed with the following command: perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *` Signed-off-by: David Howells <dhowells@redhat.com>
* net: Rename dst_get_neighbour{, _raw} to dst_get_neighbour_noref{, _raw}.David Miller2011-12-051-1/+1
| | | | | | | | To reflect the fact that a refrence is not obtained to the resulting neighbour entry. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Roland Dreier <roland@purestorage.com>
* ipv4: use a 64bit load/store in output pathEric Dumazet2011-12-011-4/+17
| | | | | | | | | | | | | gcc compiler is smart enough to use a single load/store if we memcpy(dptr, sptr, 8) on x86_64, regardless of CONFIG_CC_OPTIMIZE_FOR_SIZE In IP header, daddr immediately follows saddr, this wont change in the future. We only need to make sure our flowi4 (saddr,daddr) fields wont break the rule. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: tcp: fix TOS value in ACK messages sent from TIME_WAITEric Dumazet2011-10-241-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | There is a long standing bug in linux tcp stack, about ACK messages sent on behalf of TIME_WAIT sockets. In the IP header of the ACK message, we choose to reflect TOS field of incoming message, and this might break some setups. Example of things that were broken : - Routing using TOS as a selector - Firewalls - Trafic classification / shaping We now remember in timewait structure the inet tos field and use it in ACK generation, and route lookup. Notes : - We still reflect incoming TOS in RST messages. - We could extend MuraliRaja Muniraju patch to report TOS value in netlink messages for TIME_WAIT sockets. - A patch is needed for IPv6 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add skb frag size accessorsEric Dumazet2011-10-191-3/+3
| | | | | | | | | | | To ease skb->truesize sanitization, its better to be able to localize all references to skb frags size. Define accessors : skb_frag_size() to fetch frag size, and skb_frag_size_{set|add|sub}() to manipulate it. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: ipv4: convert to SKB frag APIsIan Campbell2011-08-241-3/+4
| | | | | | | | | | | | Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: use dst with ref during bcast/mcast loopbackJulian Anastasov2011-08-071-0/+1
| | | | | | | | | | Make sure skb dst has reference when moving to another context. Currently, I don't see protocols that can hit it when sending broadcasts/multicasts to loopback using noref dsts, so it is just a precaution. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: fix NULL dereferences in check_peer_redir()Eric Dumazet2011-08-031-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | Gergely Kalman reported crashes in check_peer_redir(). It appears commit f39925dbde778 (ipv4: Cache learned redirect information in inetpeer.) added a race, leading to possible NULL ptr dereference. Since we can now change dst neighbour, we should make sure a reader can safely use a neighbour. Add RCU protection to dst neighbour, and make sure check_peer_redir() can be called safely by different cpus in parallel. As neighbours are already freed after one RCU grace period, this patch should not add typical RCU penalty (cache cold effects) Many thanks to Gergely for providing a pretty report pointing to the bug. Reported-by: Gergely Kalman <synapse@hippy.csoma.elte.hu> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Constrain UFO fragment sizes to multiples of 8 bytesBill Sommerfeld2011-07-211-3/+3
| | | | | | | | | | | | Because the ip fragment offset field counts 8-byte chunks, ip fragments other than the last must contain a multiple of 8 bytes of payload. ip_ufo_append_data wasn't respecting this constraint and, depending on the MTU and ip option sizes, could create malformed non-final fragments. Google-Bug-Id: 5009328 Signed-off-by: Bill Sommerfeld <wsommerfeld@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Abstract dst->neighbour accesses behind helpers.David S. Miller2011-07-171-1/+1
| | | | | | dst_{get,set}_neighbour() Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Create and use new helper, neigh_output().David S. Miller2011-07-161-7/+3
| | | | Signed-off-by: David S. Miller <davem@davemloft.net>
OpenPOWER on IntegriCloud