blackbird-op-linux - Blackbird™ Linux sources for OpenPOWER

	Commit message (Collapse)	Author	Age	Files	Lines
*	[TCP]: BIC max increment too large	Stephen Hemminger	2005-11-02	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The max growth of BIC TCP is too large. Original code was based on BIC 1.0 and the default there was 32. Later code (2.6.13) included compensation for delayed acks, and should have reduced the default value to 16; since normally TCP gets one ack for every two packets sent. The current value of 32 makes BIC too aggressive and unfair to other flows. Submitted-by: Injong Rhee <rhee@eos.ncsu.edu> Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Acked-by: Ian McDonald <imcdnzl@gmail.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
*	[MCAST]: ip[6]_mc_add_src should be called when number of sources is zero	Yan Zheng	2005-11-02	1	-1/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	And filter mode is exclude. Further explanation by David Stevens: Multicast source filters aren't widely used yet, and that's really the only feature that's affected if an application actually exercises this bug, as far as I can tell. An ordinary filter-less multicast join should still work, and only forwarded multicast traffic making use of filters and doing empty-source filters with the MSFILTER ioctl would be at risk of not getting multicast traffic forwarded to them because the reports generated would not be based on the correct counts. Signed-off-by: Yan Zheng <yanzheng@21cn.com> Acked-by: David L Stevens <dlstevens@us.ibm.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
*	[NETFILTER]: Add "revision" support to arp_tables and ip6_tables	Harald Welte	2005-10-31	1	-70/+131
\| \| \| \| \| \| \| \| \|	As ip_tables has had for some time, this adds support for multiple revisions of each match/target. We steal one byte from the name in order to accommodate an 8-bit version number. Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
*	[PATCH] Typo fix: dot after newline in printk strings	Jean Delvare	2005-10-30	1	-1/+1
\| \| \| \| \| \| \| \|	Typo fix: dots appearing after a newline in printk strings. Signed-off-by: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[IPV4]: Fix issue reported by Coverity in ipv4/fib_frontend.c	Jayachandran C	2005-10-29	1	-1/+1
\| \| \| \| \| \| \| \|	fib_del_ifaddr() dereferences ifa->ifa_dev, so the code already assumes that ifa->ifa_dev is non-NULL, the check is unnecessary. Signed-off-by: Jayachandran C. <c.jayachandran at gmail.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
*	[IPv4/IPv6]: UFO Scatter-gather approach	Ananda Raju	2005-10-28	1	-5/+78
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Attached is a kernel patch for the UDP Fragmentation Offload (UFO) feature. 1. This patch incorporates the review comments by Jeff Garzik. 2. Renamed USO as UFO (UDP Fragmentation Offload) 3. udp sendfile support with UFO This patch uses the scatter-gather feature of skb to generate a large UDP datagram. Below is a "how-to" on the changes required in a network device driver to use the UFO interface. UDP Fragmentation Offload (UFO) Interface: ------------------------------------------- UFO is a feature wherein the Linux kernel network stack will offload the IP fragmentation functionality of a large UDP datagram to hardware. This will reduce the overhead of the stack in fragmenting the large UDP datagram into MTU sized packets. 1) Drivers indicate their capability of UFO using dev->features \|= NETIF_F_UFO \| NETIF_F_HW_CSUM \| NETIF_F_SG NETIF_F_HW_CSUM is required for UFO over ipv6. 2) A UFO packet will be submitted for transmission using the driver xmit routine. A UFO packet will have a non-zero value for "skb_shinfo(skb)->ufo_size". skb_shinfo(skb)->ufo_size will indicate the length of the data part in each IP fragment going out of the adapter after IP fragmentation by hardware. skb->data will contain the MAC/IP/UDP header and skb_shinfo(skb)->frags[] contains the data payload. The skb->ip_summed will be set to CHECKSUM_HW indicating that hardware has to do the checksum calculation. Hardware should compute the UDP checksum of the complete datagram and also the ip header checksum of each fragmented IP packet. For IPV6 the UFO provides the fragment identification-id in skb_shinfo(skb)->ip6_frag_id. The adapter should use this ID for generating IPv6 fragments. Signed-off-by: Ananda Raju <ananda.raju@neterion.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (forwarded) Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
*	Merge master.kernel.org:/pub/scm/linux/kernel/git/acme/net-2.6.15	Linus Torvalds	2005-10-28	6	-50/+101
\|\
\| *	[IPV4]: Fix setting broadcast for SIOCSIFNETMASK	David Engel	2005-10-26	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix setting of the broadcast address when the netmask is set via SIOCSIFNETMASK in Linux 2.6. The code wanted the old value of ifa->ifa_mask but used it after it had already been overwritten with the new value. Signed-off-by: David Engel <gigem@comcast.net> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
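The ordering problem described in this fix can be sketched in userspace C. This is only an illustration under stated assumptions: `struct ifaddr_sketch` and `set_netmask()` are hypothetical names, not the kernel's `struct in_ifaddr` or its ioctl handler, and addresses are kept in host byte order for readability. The point it shows is that the old mask has to be saved before the field is overwritten, so that the "was the broadcast address the default one?" test still compares against the old value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an interface address record. */
struct ifaddr_sketch {
    uint32_t local;      /* interface address (host byte order in this sketch) */
    uint32_t mask;       /* current netmask */
    uint32_t broadcast;  /* current broadcast address */
};

/*
 * Install a new netmask. If the broadcast address was the default one
 * for the OLD mask, recompute it for the new mask; a custom broadcast
 * address is left untouched.
 */
static void set_netmask(struct ifaddr_sketch *ifa, uint32_t new_mask)
{
    uint32_t old_mask;

    assert(ifa != NULL);

    old_mask = ifa->mask;   /* save the old value before overwriting it */
    ifa->mask = new_mask;

    if (ifa->broadcast == (ifa->local | ~old_mask))
        ifa->broadcast = ifa->local | ~new_mask;
}
```

Comparing against `ifa->mask` after the assignment would test the broadcast address against the new mask, which is the kind of mistake the commit message describes.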
\| *	[IPV4]: Remove dead code from ip_output.c	Jayachandran C	2005-10-26	1	-4/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	skb_prev is assigned from skb, which cannot be NULL. This patch removes the unnecessary NULL check. Signed-off-by: Jayachandran C. <c.jayachandran at gmail.com> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
\| *	[IPV4]: Kill redundant rcu_dereference on fa_info	Herbert Xu	2005-10-26	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch kills a redundant rcu_dereference on fa->fa_info in fib_trie.c. As this dereference directly follows a list_for_each_entry_rcu line, we have already taken a read barrier with respect to getting an entry from the list. This read barrier guarantees that all values read out of fa are valid. In particular, the contents of structure pointed to by fa->fa_info is initialised before fa->fa_info is actually set (see fn_trie_insert); the setting of fa->fa_info itself is further separated with a write barrier from the insertion of fa into the list. Therefore by taking a read barrier after obtaining fa from the list (which is given by list_for_each_entry_rcu), we can be sure that fa->fa_info contains a valid pointer, as well as the fact that the data pointed to by fa->fa_info is itself valid. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
\| *	[NETFILTER] ip_conntrack: Make "hashsize" conntrack parameter writable	Harald Welte	2005-10-26	1	-37/+95
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It's fairly simple to resize the hash table, but currently you need to remove and reinsert the module. That's bad (we lose connection state). Harald has even offered to write a daemon which sets this based on load. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
\| *	[NET]: Wider use of for_each_*cpu()	John Hawkes	2005-10-25	2	-7/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In 'net' change the explicit use of for-loops and NR_CPUS into the general for_each_cpu() or for_each_online_cpu() constructs, as appropriate. This widens the scope of potential future optimizations of the general constructs, as well as takes advantage of the existing optimizations of first_cpu() and next_cpu(), which is advantageous when the true CPU count is much smaller than NR_CPUS. Signed-off-by: John Hawkes <hawkes@sgi.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
* \|	[TCP]: Clear stale pred_flags when snd_wnd changes	Herbert Xu	2005-10-27	1	-0/+1
\|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This bug is responsible for causing the infamous "Treason uncloaked" messages that have been popping up everywhere since the printk was added. It has usually been blamed on foreign operating systems. However, some of those reports implicate Linux as both systems are running Linux or the TCP connection is going across the loopback interface. In fact, there really is a bug in the Linux TCP header prediction code that's been there since at least 2.1.8. This bug was tracked down with help from Dale Blount. The effect of this bug ranges from harmless "Treason uncloaked" messages to hung/aborted TCP connections. The details of the bug and fix are as follows. When snd_wnd is updated, we only update pred_flags if tcp_fast_path_check succeeds. When it fails (for example, when our rcvbuf is used up), we will leave pred_flags with an out-of-date snd_wnd value. When the out-of-date pred_flags happens to match the next incoming packet we will again hit the fast path and use the current snd_wnd which will be wrong. In the case of the treason messages, it just happens that the snd_wnd cached in pred_flags is zero while tp->snd_wnd is non-zero. Therefore when a zero-window packet comes in we incorrectly conclude that the window is non-zero. In fact if the peer continues to send us zero-window pure ACKs we will continue making the same mistake. It's only when the peer transmits a zero-window packet with data attached that we get a chance to snap out of it. This is what triggers the treason message at the next retransmit timeout. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
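The stale-prediction mechanism described above can be modelled in a few lines of C. This is a minimal sketch, not the kernel code: `struct tcp_sketch`, `fast_path_check()`, `update_send_window()` and the `rcvbuf_full` flag are hypothetical, and the constant used to build `pred_flags` is only an illustrative encoding. It shows the idea of the fix: whenever the send window changes, the cached prediction is cleared first, so a failed fast-path check can no longer leave an out-of-date window behind.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for the TCP socket state. */
struct tcp_sketch {
    uint32_t snd_wnd;      /* window most recently advertised by the peer */
    uint32_t pred_flags;   /* cached expected header word; 0 disables the fast path */
    int      rcvbuf_full;  /* models a condition that makes the fast-path check fail */
};

/* Rebuild pred_flags only when the fast path is currently usable. */
static void fast_path_check(struct tcp_sketch *tp)
{
    if (!tp->rcvbuf_full)
        tp->pred_flags = 0x50100000u | (tp->snd_wnd & 0xFFFFu); /* illustrative encoding */
}

/* Record a new send window without ever keeping a stale prediction. */
static void update_send_window(struct tcp_sketch *tp, uint32_t new_wnd)
{
    assert(tp != NULL);

    if (tp->snd_wnd != new_wnd) {
        tp->snd_wnd = new_wnd;
        tp->pred_flags = 0;      /* drop the old prediction first */
        fast_path_check(tp);     /* re-arm only if the check succeeds */
    }
}
```

Without the `tp->pred_flags = 0` line, a failed check would leave the previous window encoded in `pred_flags`, which is the situation the commit message blames for the "Treason uncloaked" reports.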
*	[SK_BUFF]: ipvs_property field must be copied	Julian Anastasov	2005-10-22	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	IPVS used the flag NFC_IPVS_PROPERTY in nfcache, but now that nfcache has been removed, the new flag 'ipvs_property' still needs to be copied. This patch should be included in 2.6.14. Further comments from Harald Welte: Sorry, seems like the bug was introduced by me. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
*	[TCP] Allow len == skb->len in tcp_fragment	Herbert Xu	2005-10-20	1	-11/+1
\| \| \| \| \| \| \| \| \|	It is legitimate to call tcp_fragment with len == skb->len since that is done for FIN packets and the FIN flag counts as one byte. So we should only check for the len > skb->len case. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
*	[TCP]: Ratelimit debugging warning.	Herbert Xu	2005-10-13	1	-5/+7
\| \| \| \| \| \| \|	Better safe than sorry. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NETFILTER]: Fix OOPSes on machines with discontiguous cpu numbering.	David S. Miller	2005-10-13	2	-11/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Original patch by Harald Welte, with feedback from Herbert Xu and testing by Sébastien Bernard. EBTABLES, ARP tables, and IP/IP6 tables all assume that cpus are numbered linearly. That is not necessarily true. This patch fixes that up by calculating the largest possible cpu number, and allocating enough per-cpu structure space given that. Signed-off-by: David S. Miller <davem@davemloft.net>
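The sizing idea in this fix can be sketched as follows. This is a userspace illustration under assumptions: the set of possible CPUs is modelled as a 64-bit mask, and `cpu_slots_needed()` and `alloc_per_cpu()` are hypothetical helpers, not kernel functions. It shows why counting the CPUs is not enough when the numbering has holes: the array must be sized by the highest possible CPU number plus one.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Number of per-CPU slots needed: index of the highest set bit, plus one. */
static unsigned int cpu_slots_needed(uint64_t possible_mask)
{
    unsigned int slots = 0;
    unsigned int cpu;

    for (cpu = 0; cpu < 64; cpu++)
        if (possible_mask & (UINT64_C(1) << cpu))
            slots = cpu + 1;

    return slots;
}

/* Allocate zeroed per-CPU storage that can be indexed by CPU number. */
static void *alloc_per_cpu(uint64_t possible_mask, size_t per_cpu_size)
{
    unsigned int slots = cpu_slots_needed(possible_mask);

    assert(slots > 0 && per_cpu_size > 0);
    return calloc(slots, per_cpu_size);
}
```

With CPUs 0 and 5 present, a count-based allocation would reserve two slots, while indexing by CPU number needs six.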
*	[TCP]: Add code to help track down "BUG at net/ipv4/tcp_output.c:438!"	Herbert Xu	2005-10-12	1	-1/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is the second report of this bug. Unfortunately the first reporter hasn't been able to reproduce it since to provide more debugging info. So let's apply this patch for 2.6.14 to 1) Make this non-fatal. 2) Provide the info we need to track it down. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TWSK]: Grab the module refcount for timewait sockets	Arnaldo Carvalho de Melo	2005-10-10	1	-0/+1
\| \| \| \| \| \| \| \|	This is required to avoid unloading a module that has active timewait sockets, such as DCCP. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NETFILTER] ctnetlink: add support to change protocol info	Pablo Neira Ayuso	2005-10-10	1	-0/+37
\| \| \| \| \| \| \| \| \|	This patch adds support to change the state of the private protocol information via conntrack_netlink. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NETFILTER] ctnetlink: allow userspace to change TCP state	Pablo Neira Ayuso	2005-10-10	1	-0/+23
\| \| \| \| \| \| \| \| \| \| \|	This patch adds the ability to change the state of a TCP connection. I know that this must be used with care but it's required to provide complete conntrack creation via conntrack_netlink. So I'll document this aspect in the upcoming docs. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NETFILTER]: Use only 32bit counters for CONNTRACK_ACCT	Harald Welte	2005-10-10	2	-9/+12
\| \| \| \| \| \| \| \| \| \|	Initially we used 64bit counters for conntrack-based accounting, since we had no event mechanism to tell userspace that our counters are about to overflow. With nfnetlink_conntrack, we now have such an event mechanism and thus can save 16 bytes per connection. Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPSEC] Fix block size/MTU bugs in ESP	Herbert Xu	2005-10-10	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch fixes the following bugs in ESP: * Fix transport mode MTU overestimate. This means that the inner MTU is smaller than it needs to be. Worse yet, given an input MTU which is a multiple of 4 it will always produce an estimate which is not a multiple of 4. For example, given a standard ESP/3DES/MD5 transform and an MTU of 1500, the resulting MTU for transport mode is 1462 when it should be 1464. The reason is that IP header lengths are always a multiple of 4 for IPv4 and 8 for IPv6. * Ensure that the block size is at least 4. This is required by RFC2406 and corresponds to what the esp_output function does. At the moment this only affects crypto_null as its block size is 1. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPSEC]: Use ALIGN macro in ESP	Herbert Xu	2005-10-10	1	-5/+6
\| \| \| \| \| \| \|	This patch uses the macro ALIGN in all the applicable spots for ESP. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
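The rounding that ALIGN performs, and the "block size is at least 4" rule from the ESP fix listed just below, can be illustrated with a small sketch. `ALIGN_UP` here uses the same round-up-to-a-power-of-two arithmetic as the kernel macro; `esp_blksize()` and `esp_padded_len()` are hypothetical helpers written for this example, and the two extra bytes stand for the ESP pad-length and next-header fields.

```c
#include <assert.h>

/*
 * Round x up to the next multiple of a. The alignment a must be a
 * power of two. Arguments are evaluated more than once, so pass
 * side-effect-free expressions.
 */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* Hypothetical helper: effective ESP block size, never below 4 bytes. */
static unsigned int esp_blksize(unsigned int cipher_blksize)
{
    assert(cipher_blksize > 0);
    return ALIGN_UP(cipher_blksize, 4u);
}

/*
 * Hypothetical helper: length of the encrypted part once the payload,
 * the pad-length byte and the next-header byte are padded to a whole
 * number of blocks.
 */
static unsigned int esp_padded_len(unsigned int payload,
                                   unsigned int cipher_blksize)
{
    unsigned int blk = esp_blksize(cipher_blksize);

    return ALIGN_UP(payload + 2, blk);
}
```

For a cipher with block size 1, such as crypto_null, `esp_blksize(1)` yields 4, so a 6-byte payload is padded to 8 bytes; with an 8-byte block cipher a 10-byte payload is padded to 16.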
*	[NETFILTER] ctnetlink: add one nesting level for TCP state	Pablo Neira Ayuso	2005-10-10	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For consistency, the TCP private protocol information is now nested as attributes under CTA_PROTOINFO_TCP. The sequence of attributes to access the TCP state information therefore becomes: CTA_PROTOINFO -> CTA_PROTOINFO_TCP -> CTA_PROTOINFO_TCP_STATE instead of: CTA_PROTOINFO -> CTA_PROTOINFO_TCP_STATE Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NETFILTER] ctnetlink: ICMP ID is not mandatory	Pablo Neira Ayuso	2005-10-10	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \|	The ID is only required by ICMP type 8 (echo), so it's not mandatory for all sorts of ICMP connections. This patch makes only the type and the code mandatory for ICMP netlink messages. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NETFILTER] conntrack_netlink: Fix endian issue with status from userspace	Harald Welte	2005-10-10	1	-1/+2
\| \| \| \| \| \| \| \| \|	When we send "status" from userspace, we forget to convert the endianness. This patch adds the required conversion. Thanks to Pablo Neira for discovering this. Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
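The kind of conversion this fix adds can be sketched in portable userspace C. The helper name `attr_get_u32()` is hypothetical, and the sketch assumes the attribute payload carries a 32-bit value in network (big-endian) byte order; it is not the ctnetlink code itself. It copies the bytes out with `memcpy` to avoid alignment problems and then applies `ntohl()`.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: read a 32-bit attribute payload that arrives in
 * network byte order and return it in host byte order.
 */
static uint32_t attr_get_u32(const void *payload)
{
    uint32_t wire;

    assert(payload != NULL);
    memcpy(&wire, payload, sizeof(wire)); /* payload may be unaligned */
    return ntohl(wire);
}
```

On a little-endian host, using the raw value without `ntohl()` would turn a status of 0x00000008 into 0x08000000, which is the class of bug the commit message refers to.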
*	[NETFILTER] ipt_ULOG: Mark ipt_ULOG as OBSOLETE	Harald Welte	2005-10-10	1	-1/+6
\| \| \| \| \| \| \| \| \|	Similar to nfnetlink_queue and ip_queue, we mark ipt_ULOG as obsolete. This should have been part of the original nfnetlink_log merge, but I somehow missed it. Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NETFILTER] PPTP helper: Add missing Kconfig dependency	Harald Welte	2005-10-10	1	-0/+1
\| \| \| \| \| \| \|	PPTP should not be selectable without conntrack enabled Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[PATCH] gfp flags annotations - part 1	Al Viro	2005-10-08	3	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \|	- added typedef unsigned int __nocast gfp_t; - replaced __nocast uses for gfp flags with gfp_t - it gives exactly the same warnings as far as sparse is concerned, doesn't change generated code (from gcc point of view we replaced unsigned int with typedef) and documents what's going on far better. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[TCP]: BIC coding bug in Linux 2.6.13	Stephen Hemminger	2005-10-05	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	A missing parenthesis causes BIC to be slow in increasing the congestion window. Spotted by Injong Rhee. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPVS]: fix sparse gfp nocast warnings	Randy Dunlap	2005-10-04	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	From: Randy Dunlap <rdunlap@xenotime.net> Fix implicit nocast warnings in ip_vs code: net/ipv4/ipvs/ip_vs_app.c:631:54: warning: implicit cast to nocast type Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NETFILTER]: Fix Kconfig typo	Horst H. von Brand	2005-10-04	1	-1/+1
\| \| \| \| \|	Signed-off-by: Horst H. von Brand <vonbrand@inf.utfsm.cl> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4]: fib_trie root-node expansion	Robert Olsson	2005-10-04	1	-2/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The patch below introduces special thresholds to keep the root node in the trie large. This gives a flatter tree at the cost of a modest memory increase. Overall it seems to be a gain, and this was also proposed by one of the authors of the paper in a recent seminar. Main table after loading 123 k routes: Aver depth: 3.30, Max depth: 9, Root-node size: 12 bits, Total size: 4044 kB. With the patch: Aver depth: 2.78, Max depth: 8, Root-node size: 15 bits, Total size: 4150 kB. An increase of 8-10% was seen in forwarding performance for an rDoS attack. Signed-off-by: Robert Olsson <robert.olsson@its.uu.se> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4]: Update icmp sysctl docs and disable broadcast ECHO/TIMESTAMP by default	David S. Miller	2005-10-03	1	-1/+1
\| \| \| \| \| \| \|	It's not a good idea to be smurf'able by default. The few people who need this can turn it on. Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4]: Replace __in_dev_get with __in_dev_get_rcu/rtnl	Herbert Xu	2005-10-03	10	-30/+32
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The following patch renames __in_dev_get() to __in_dev_get_rtnl() and introduces __in_dev_get_rcu() to cover the second case. 1) RCU with refcnt should use in_dev_get(). 2) RCU without refcnt should use __in_dev_get_rcu(). 3) All others must hold RTNL and use __in_dev_get_rtnl(). There is one exception in net/ipv4/route.c which is in fact a pre-existing race condition. I've marked it as such so that we remember to fix it. This patch is based on suggestions and prior work by Suzanne Wood and Paul McKenney. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[IPV4]: Fix "Proxy ARP seems broken"	Herbert Xu	2005-10-03	1	-6/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Meelis Roos <mroos@linux.ee> wrote: > RK> My firewall setup relies on proxyarp working. However, with 2.6.14-rc3, > RK> it appears to be completely broken. The firewall is 212.18.232.186, > > Same here with some kernel between 14-rc2 and 14-rc3 - no response to > ARP on a proxyarp gateway. Sorry, no exact revision and no more debugging > yet since it's a production gateway. The breakage is caused by the change to use the CB area for flagging whether a packet has been queued due to proxy_delay. This area gets cleared every time arp_rcv gets called. Unfortunately packets delayed due to proxy_delay also go through arp_rcv when they are reprocessed. In fact, I can't think of a reason why delayed proxy packets should go through netfilter again at all. So the easiest solution is to bypass that and go straight to arp_process. This is essentially what would've happened before netfilter support was added to ARP. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[INET]: speedup inet (tcp/dccp) lookups	Eric Dumazet	2005-10-03	2	-8/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Arnaldo and I agreed it could be applied now, because I have other pending patches depending on this one (Thank you Arnaldo) (The other important patch moves skc_refcnt in a separate cache line, so that the SMP/NUMA performance doesn't suffer from cache line ping pongs) 1) First some performance data : -------------------------------- tcp_v4_rcv() wastes a lot of time in __inet_lookup_established() The most time critical code is : sk_for_each(sk, node, &head->chain) { if (INET_MATCH(sk, acookie, saddr, daddr, ports, dif)) goto hit; /* You sunk my battleship! */ } The sk_for_each() does use prefetch() hints but only the beginning of "struct sock" is prefetched. As the first comparison in INET_MATCH uses inet_sk(__sk)->daddr, which is far away from the beginning of "struct sock", it has to bring a cold cache line into the CPU cache. Each iteration has to use at least 2 cache lines. This can be problematic if some chains are very long. 2) The goal ----------- The idea I had is to change things so that INET_MATCH() may return FALSE in 99% of cases only using the data already in the CPU cache, using one cache line per iteration. 3) Description of the patch --------------------------- Adds a new 'unsigned int skc_hash' field in 'struct sock_common', filling a 32 bits hole on 64 bits platform. 
struct sock_common { unsigned short skc_family; volatile unsigned char skc_state; unsigned char skc_reuse; int skc_bound_dev_if; struct hlist_node skc_node; struct hlist_node skc_bind_node; atomic_t skc_refcnt; + unsigned int skc_hash; struct proto *skc_prot; }; Store in this 32 bits field the full hash, not masked by (ehash_size - 1). Using this full hash as the first comparison done in INET_MATCH lets us immediately skip the element without touching a second cache line in case of a miss. Suppress the sk_hashent/tw_hashent fields since skc_hash (aliased to sk_hash and tw_hash) already contains the slot number if we mask with (ehash_size - 1). File include/net/inet_hashtables.h 64 bits platforms : #define INET_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\ (((__sk)->sk_hash == (__hash)) && \ ((*((__u64 *)&(inet_sk(__sk)->daddr))) == (__cookie)) && \ ((*((__u32 *)&(inet_sk(__sk)->dport))) == (__ports)) && \ (!((__sk)->sk_bound_dev_if) \|\| ((__sk)->sk_bound_dev_if == (__dif)))) 32bits platforms: #define TCP_IPV4_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\ (((__sk)->sk_hash == (__hash)) && \ (inet_sk(__sk)->daddr == (__saddr)) && \ (inet_sk(__sk)->rcv_saddr == (__daddr)) && \ (!((__sk)->sk_bound_dev_if) \|\| ((__sk)->sk_bound_dev_if == (__dif)))) - Adds a prefetch(head->chain.first) in __inet_lookup_established()/__tcp_v4_check_established() and __inet6_lookup_established()/__tcp_v6_check_established() and __dccp_v4_check_established() to bring into cache the first element of the list, before the {read\|write}_lock(&head->lock); Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
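The "compare the full hash first" idea from this commit can be shown with a small self-contained sketch. Everything here is hypothetical and simplified: `struct sock_sketch` and `lookup()` are not the kernel's `struct sock` or `__inet_lookup_established()`, the table is a plain array of singly linked chains, and no locking or prefetching is modelled. What it illustrates is that the unmasked hash stored next to the list pointer rejects almost every non-matching entry before the address and port fields are touched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified socket entry in a hash chain. */
struct sock_sketch {
    struct sock_sketch *next;
    uint32_t hash;           /* full, unmasked hash, stored next to the link */
    /* In the real structure the fields below live in another cache line. */
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

/*
 * Walk the chain selected by the masked hash. The full hash is compared
 * first, so a miss is usually decided without reading the address fields.
 * table_size must be a power of two.
 */
static struct sock_sketch *lookup(struct sock_sketch *const *table,
                                  uint32_t table_size, uint32_t hash,
                                  uint32_t saddr, uint32_t daddr,
                                  uint16_t sport, uint16_t dport)
{
    struct sock_sketch *sk;

    assert(table_size != 0 && (table_size & (table_size - 1)) == 0);

    for (sk = table[hash & (table_size - 1)]; sk != NULL; sk = sk->next) {
        if (sk->hash != hash)
            continue;        /* cheap reject on the already-cached field */
        if (sk->saddr == saddr && sk->daddr == daddr &&
            sk->sport == sport && sk->dport == dport)
            return sk;
    }
    return NULL;
}
```

Two entries that share a slot (same low bits) but differ in the upper hash bits are told apart by the first comparison alone, which is what allows one cache line per iteration on a miss.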
*	[NET]: Fix packet timestamping.	Herbert Xu	2005-10-03	2	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I've found the problem in general. It affects any 64-bit architecture. The problem occurs when you change the system time. Suppose that when you boot your system clock is forward by a day. This gets recorded down in skb_tv_base. You then wind the clock back by a day. From that point onwards the offset will be negative which essentially overflows the 32-bit variables they're stored in. In fact, why don't we just store the real time stamp in those 32-bit variables? After all, we're not going to overflow for quite a while yet. When we do overflow, we'll need a better solution of course. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP]: Don't over-clamp window in tcp_clamp_window()	Alexey Kuznetsov	2005-09-29	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	From: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Better handle the case where the sender sends full sized frames initially, then moves to a mode where it trickles out small amounts of data at a time. This known problem is even mentioned in the comments above tcp_grow_window() in tcp_input.c, specifically: ... * The scheme does not work when sender sends good segments opening * window and then starts to feed us spagetti. But it should work * in common situations. Otherwise, we have to rely on queue collapsing. ... When the sender gives full sized frames, the "struct sk_buff" overhead from each packet is small. So we'll advertise a larger window. If the sender moves to a mode where small segments are sent, this ratio becomes tilted to the other extreme and we start overrunning the socket buffer space. tcp_clamp_window() tries to address this, but its clamping of tp->window_clamp is a wee bit too aggressive for this particular case. Fix confirmed by Ion Badulescu. Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP]: Revert 6b251858d377196b8cea20e65cae60f584a42735	David S. Miller	2005-09-29	1	-5/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	But retain the comment fix. Alexey Kuznetsov has explained the situation as follows: -------------------- I think the fix is incorrect. Look, the RFC function init_cwnd(mss) is not continuous: f.e. for mss=1095 it needs initial window 1095*4, but for mss=1096 it is 1096*3. We do not know exactly what mss sender used for calculations. If we advertised 1096 (and calculate initial window 3*1096), the sender could limit it to some value < 1096 and then it will need window his_mss*4 > 3*1096 to send initial burst. See? So, the honest function for initial rcv_wnd derived from tcp_init_cwnd() is: init_rcv_wnd(mss)= min { init_cwnd(mss1)*mss1 for mss1 <= mss } It is something sort of: if (mss < 1096) return mss*4; if (mss < 1096*2) return 1096*4; return mss*2; (I just scribbled a graph on a piece of paper, it is difficult to see or to explain without this) I selected it differently giving more window than it is strictly required. Initial receive window must be large enough to allow sender following to the rfc (or just setting initial cwnd to 2) to send initial burst. But besides that it is arbitrary, so I decided to give slack space of one segment. Actually, the logic was: If mss is low/normal (<=ethernet), set window to receive more than initial burst allowed by rfc under the worst conditions i.e. mss*4. This gives slack space of 1 segment for ethernet frames. For msses slightly more than ethernet frame, take 3. Try to give slack space of 1 frame again. If mss is huge, force 2*mss. No slack space. Value 1460*3 is really confusing. Minimal one is 1096*2, but besides that it is an arbitrary value. It was meant to be ~4096. 1460*3 is just the magic number from RFC, 1460*3 = 1095*4 is the magic :-), so that I guess hands typed this themselves. -------------------- Signed-off-by: David S. Miller <davem@davemloft.net>
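The three-branch rule quoted in this message ("It is something sort of: ...") can be written out as a compilable function. This is only a transcription of that illustrative pseudocode, under the assumption that mss and the result are byte counts; the function name `init_rcv_wnd()` comes from the message, but this is not the kernel's tcp_select_initial_window() logic, which the message says was chosen differently.

```c
#include <assert.h>

/*
 * Illustrative initial receive window, in bytes, following the
 * piecewise sketch quoted in the commit message:
 *   small mss  -> room for 4 segments
 *   medium mss -> a flat 1096*4 bytes
 *   large mss  -> room for 2 segments
 */
static unsigned int init_rcv_wnd(unsigned int mss)
{
    assert(mss > 0);

    if (mss < 1096)
        return mss * 4;
    if (mss < 1096 * 2)
        return 1096 * 4;
    return mss * 2;
}
```

For example, an mss of 536 gives 2144 bytes, an Ethernet-sized mss of 1460 falls in the flat middle range and gives 4384 bytes, and a 9000-byte mss gives 18000 bytes.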
*	[TCP]: Fix init_cwnd calculations in tcp_select_initial_window()	David S. Miller	2005-09-28	1	-5/+6
\| \| \| \| \| \| \|	Match it up to what RFC2414 really specifies. Noticed by Rick Jones. Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NETFILTER]: Fix invalid module autoloading by splitting iptable_nat	Harald Welte	2005-09-26	4	-34/+35
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When you've enabled conntrack and NAT as a module (standard case in all distributions), and you've also enabled the new conntrack netlink interface, loading ip_conntrack_netlink.ko will auto-load iptable_nat.ko. This causes a huge performance penalty, since for every packet you iterate the nat code, even if you don't want it. This patch splits iptable_nat.ko into the NAT core (ip_nat.ko) and the iptables frontend (iptable_nat.ko). Therefore, ip_conntrack_netlink.ko will only pull ip_nat.ko, but not the frontend. ip_nat.ko will "only" allocate some resources, but not affect runtime performance. This separation is also a nice step in anticipation of new packet filters (nf-hipac, ipset, pkttables) being able to use the NAT core. Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NETFILTER] ip_conntrack: Update event cache when status changes	Harald Welte	2005-09-24	3	-1/+4
\| \| \| \| \| \| \| \| \|	The GRE, SCTP and TCP protocol helpers did not call ip_conntrack_event_cache() when updating ct->status. This patch adds the respective calls. Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NETFILTER]: Fix ip[6]t_NFQUEUE Kconfig dependency	Harald Welte	2005-09-24	2	-1/+12
\| \| \| \| \| \| \| \| \|	We have to introduce a separate Kconfig menu entry for the NFQUEUE targets. They cannot "just" depend on nfnetlink_queue, since nfnetlink_queue could be linked into the kernel, whereas iptables can be a module. Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NETFILTER] Fix conntrack event cache deadlock/oops	Harald Welte	2005-09-22	5	-28/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch fixes a number of bugs. It cannot be reasonably split up into multiple fixes, since all bugs interact with each other and affect the same function: Bug #1: The event cache code cannot be called while a lock is held. Therefore, the call to ip_conntrack_event_cache() within ip_ct_refresh_acct() needs to be moved outside of the locked section. This fixes a number of 2.6.14-rcX oops and deadlock reports. Bug #2: We used to call ct_add_counters() for unconfirmed connections without holding a lock. Since the add operations are not atomic, we could race with another CPU. Bug #3: ip_ct_refresh_acct() lost REFRESH events in some cases where refresh (and the corresponding event) are desired, but no accounting shall be performed. Both events and accounting implicitly depended on the skb parameter being non-null. We now re-introduce a non-accounting "ip_ct_refresh()" variant to explicitly state the desired behaviour. Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NETFILTER] Fix sparse endian warnings in pptp helper	Alexey Dobriyan	2005-09-22	1	-6/+8
\| \| \| \| \| \|	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[NETFILTER] fix DEBUG statement in PPTP helper	Harald Welte	2005-09-22	1	-1/+1
\| \| \| \| \| \| \| \|	As noted by Alexey Dobriyan, the DEBUGP statement prints the wrong callID. Signed-off-by: Harald Welte <laforge@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP]: Adjust Reno SACK estimate in tcp_fragment	Herbert Xu	2005-09-22	1	-0/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since the introduction of TSO pcount a year ago, it has been possible for tcp_fragment() to cause packets_out to decrease. Prior to that, tcp_retrans_try_collapse() was the only way for that to happen on the retransmission path. When this happens with Reno, it is possible for sacked_out to become invalid because it is only an estimate and not tied to any particular packet on the retransmission queue. Therefore we need to adjust sacked_out as well as left_out in the Reno case. The following patch does exactly that. This bug is pretty difficult to trigger in practice though since you need a SACKless peer with a retransmission that occurs just as the cached MTU value expires. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
*	[TCP]: Set default congestion control correctly for incoming connections.	Stephen Hemminger	2005-09-21	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	Patch from Joel Sing to fix the default congestion control algorithm for incoming connections. If a new congestion control handler is added (via module), it should become the default for new connections. Instead, the incoming connections use reno. The cause is incorrect initialisation, which makes the tcp_init_congestion_control() function return after the initial if test fails. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Acked-by: Ian McDonald <imcdnzl@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>