summaryrefslogtreecommitdiffstats
path: root/net/core/sock.c
Commit message (Collapse)AuthorAgeFilesLines
* net: Fix sock_wfree() raceEric Dumazet2009-09-301-7/+12
| | | | | | | | | | | | | | Commit 2b85a34e911bf483c27cfdd124aeb1605145dc80 (net: No more expensive sock_hold()/sock_put() on each tx) opens a window in sock_wfree() where another cpu might free the socket we are working on. A fix is to call sk->sk_write_space(sk) while still holding a reference on sk. Reported-by: Jike Song <albcamus@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Make setsockopt() optlen be unsigned.David S. Miller2009-09-301-4/+4
| | | | | | | | | | | | This provides safety against negative optlen at the type level instead of depending upon (sometimes non-trivial) checks against this sprinkled all over the the place, in each and every implementation. Based upon work done by Arjan van de Ven and feedback from Linus Torvalds. Signed-off-by: David S. Miller <davem@davemloft.net>
* mm: replace various uses of num_physpages by totalram_pagesJan Beulich2009-09-221-2/+2
| | | | | | | | | | | | | | | | | | | | | | | Sizing of memory allocations shouldn't depend on the number of physical pages found in a system, as that generally includes (perhaps a huge amount of) non-RAM pages. The amount of what actually is usable as storage should instead be used as a basis here. Some of the calculations (i.e. those not intending to use high memory) should likely even use (totalram_pages - totalhigh_pages). Signed-off-by: Jan Beulich <jbeulich@novell.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Dave Airlie <airlied@linux.ie> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: "David S. Miller" <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'master' of ↵David S. Miller2009-09-021-1/+1
|\ | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/yellowfin.c
| * net: sk_free() should be allowed right after sk_alloc()Jarek Poplawski2009-09-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After commit 2b85a34e911bf483c27cfdd124aeb1605145dc80 (net: No more expensive sock_hold()/sock_put() on each tx) sk_free() frees socks conditionally and depends on sk_wmem_alloc being set e.g. in sock_init_data(). But in some cases sk_free() is called earlier, usually after other alloc errors. Fix is to move sk_wmem_alloc initialization from sock_init_data() to sk_alloc() itself. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: implement a SO_DOMAIN getsockoptionJan Engelhardt2009-08-051-0/+5
| | | | | | | | | | | | | | | | | | | | | | This sockopt goes in line with SO_TYPE and SO_PROTOCOL. It makes it possible for userspace programs to pass around file descriptors — I am referring to arguments-to-functions, but it may even work for the fd passing over UNIX sockets — without needing to also pass the auxiliary information (PF_INET6/IPPROTO_TCP). Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: implement a SO_PROTOCOL getsockoptionJan Engelhardt2009-08-051-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similar to SO_TYPE returning the socket type, SO_PROTOCOL allows to retrieve the protocol used with a given socket. I am not quite sure why we have that-many copies of socket.h, and why the values are not the same on all arches either, but for where hex numbers dominate, I use 0x1029 for SO_PROTOCOL as that seems to be the next free unused number across a bunch of operating systems, or so Google results make me want to believe. SO_PROTOCOL for others just uses the next free Linux number, 38. Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: mark read-only arrays as constJan Engelhardt2009-08-051-3/+3
|/ | | | | | | | String literals are constant, and usually, we can also tag the array of pointers const too, moving it to the .rodata section. Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* Fix error return for setsockopt(SO_TIMESTAMPING)Rémi Denis-Courmont2009-07-201-1/+1
| | | | | | | | I guess it should be -EINVAL rather than EINVAL. I have not checked when the bug came in. Perhaps a candidate for -stable? Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sock_copy() fixesEric Dumazet2009-07-161-2/+18
| | | | | | | | | | | | | | | | | | | | | | Commit e912b1142be8f1e2c71c71001dc992c6e5eb2ec1 (net: sk_prot_alloc() should not blindly overwrite memory) took care of not zeroing whole new socket at allocation time. sock_copy() is another spot where we should be very careful. We should not set refcnt to a non null value, until we are sure other fields are correctly setup, or a lockless reader could catch this socket by mistake, while not fully (re)initialized. This patch puts sk_node & sk_refcnt to the very beginning of struct sock to ease sock_copy() & sk_prot_alloc() job. We add appropriate smp_wmb() before sk_refcnt initializations to match our RCU requirements (changes to sock keys should be committed to memory before sk_refcnt setting) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sk_prot_alloc() should not blindly overwrite memoryEric Dumazet2009-07-111-2/+17
| | | | | | | | | | | | | Some sockets use SLAB_DESTROY_BY_RCU, and our RCU code correctness depends on sk->sk_nulls_node.next being always valid. A NULL value is not allowed as it might fault a lockless reader. Current sk_prot_alloc() implementation doesnt respect this hypothesis, calling kmem_cache_alloc() with __GFP_ZERO. Just call memset() around the forbidden field. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: adding memory barrier to the poll and receive callbacksJiri Olsa2009-07-091-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adding memory barrier after the poll_wait function, paired with receive callbacks. Adding fuctions sock_poll_wait and sk_has_sleeper to wrap the memory barrier. Without the memory barrier, following race can happen. The race fires, when following code paths meet, and the tp->rcv_nxt and __add_wait_queue updates stay in CPU caches. CPU1 CPU2 sys_select receive packet ... ... __add_wait_queue update tp->rcv_nxt ... ... tp->rcv_nxt check sock_def_readable ... { schedule ... if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) wake_up_interruptible(sk->sk_sleep) ... } If there was no cache the code would work ok, since the wait_queue and rcv_nxt are opposit to each other. Meaning that once tp->rcv_nxt is updated by CPU2, the CPU1 either already passed the tp->rcv_nxt check and sleeps, or will get the new value for tp->rcv_nxt and will return with new data mask. In both cases the process (CPU1) is being added to the wait queue, so the waitqueue_active (CPU2) call cannot miss and will wake up CPU1. The bad case is when the __add_wait_queue changes done by CPU1 stay in its cache, and so does the tp->rcv_nxt update on CPU2 side. The CPU1 will then endup calling schedule and sleep forever if there are no more data on the socket. Calls to poll_wait in following modules were ommited: net/bluetooth/af_bluetooth.c net/irda/af_irda.c net/irda/irnet/irnet_ppp.c net/mac80211/rc80211_pid_debugfs.c net/phonet/socket.c net/rds/af_rds.c net/rfkill/core.c net/sunrpc/cache.c net/sunrpc/rpc_pipe.c net/tipc/socket.c Signed-off-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'for-linus2' of ↵Linus Torvalds2009-06-161-0/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/vegard/kmemcheck * 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/vegard/kmemcheck: (39 commits) signal: fix __send_signal() false positive kmemcheck warning fs: fix do_mount_root() false positive kmemcheck warning fs: introduce __getname_gfp() trace: annotate bitfields in struct ring_buffer_event net: annotate struct sock bitfield c2port: annotate bitfield for kmemcheck net: annotate inet_timewait_sock bitfields ieee1394/csr1212: fix false positive kmemcheck report ieee1394: annotate bitfield net: annotate bitfields in struct inet_sock net: use kmemcheck bitfields API for skbuff kmemcheck: introduce bitfield API kmemcheck: add opcode self-testing at boot x86: unify pte_hidden x86: make _PAGE_HIDDEN conditional kmemcheck: make kconfig accessible for other architectures kmemcheck: enable in the x86 Kconfig kmemcheck: add hooks for the page allocator kmemcheck: add hooks for page- and sg-dma-mappings kmemcheck: don't track page tables ...
| * net: annotate struct sock bitfieldVegard Nossum2009-06-151-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 2009/2/24 Ingo Molnar <mingo@elte.hu>: > ok, this is the last warning i have from today's overnight -tip > testruns - a 32-bit system warning in sock_init_data(): > > [ 2.610389] NET: Registered protocol family 16 > [ 2.616138] initcall netlink_proto_init+0x0/0x170 returned 0 after 7812 usecs > [ 2.620010] WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (f642c184) > [ 2.624002] 010000000200000000000000604990c000000000000000000000000000000000 > [ 2.634076] i i i i i i u u i i i i i i i i i i i i i i i i i i i i i i i i > [ 2.641038] ^ > [ 2.643376] > [ 2.644004] Pid: 1, comm: swapper Not tainted (2.6.29-rc6-tip-01751-g4d1c22c-dirty #885) > [ 2.648003] EIP: 0060:[<c07141a1>] EFLAGS: 00010282 CPU: 0 > [ 2.652008] EIP is at sock_init_data+0xa1/0x190 > [ 2.656003] EAX: 0001a800 EBX: f6836c00 ECX: 00463000 EDX: c0e46fe0 > [ 2.660003] ESI: f642c180 EDI: c0b83088 EBP: f6863ed8 ESP: c0c412ec > [ 2.664003] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 > [ 2.668003] CR0: 8005003b CR2: f682c400 CR3: 00b91000 CR4: 000006f0 > [ 2.672003] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 > [ 2.676003] DR6: ffff4ff0 DR7: 00000400 > [ 2.680002] [<c07423e5>] __netlink_create+0x35/0xa0 > [ 2.684002] [<c07443cc>] netlink_kernel_create+0x4c/0x140 > [ 2.688002] [<c072755e>] rtnetlink_net_init+0x1e/0x40 > [ 2.696002] [<c071b601>] register_pernet_operations+0x11/0x30 > [ 2.700002] [<c071b72c>] register_pernet_subsys+0x1c/0x30 > [ 2.704002] [<c0bf3c8c>] rtnetlink_init+0x4c/0x100 > [ 2.708002] [<c0bf4669>] netlink_proto_init+0x159/0x170 > [ 2.712002] [<c0101124>] do_one_initcall+0x24/0x150 > [ 2.716002] [<c0bbf3c7>] do_initcalls+0x27/0x40 > [ 2.723201] [<c0bbf3fc>] do_basic_setup+0x1c/0x20 > [ 2.728002] [<c0bbfb8a>] kernel_init+0x5a/0xa0 > [ 2.732002] [<c0103e47>] kernel_thread_helper+0x7/0x10 > [ 2.736002] [<ffffffff>] 0xffffffff We fix this false positive by annotating the bitfield in struct sock. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
* | net: No more expensive sock_hold()/sock_put() on each txEric Dumazet2009-06-111-4/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | One of the problem with sock memory accounting is it uses a pair of sock_hold()/sock_put() for each transmitted packet. This slows down bidirectional flows because the receive path also needs to take a refcount on socket and might use a different cpu than transmit path or transmit completion path. So these two atomic operations also trigger cache line bounces. We can see this in tx or tx/rx workloads (media gateways for example), where sock_wfree() can be in top five functions in profiles. We use this sock_hold()/sock_put() so that sock freeing is delayed until all tx packets are completed. As we also update sk_wmem_alloc, we could offset sk_wmem_alloc by one unit at init time, until sk_free() is called. Once sk_free() is called, we atomic_dec_and_test(sk_wmem_alloc) to decrement initial offset and atomicaly check if any packets are in flight. skb_set_owner_w() doesnt call sock_hold() anymore sock_wfree() doesnt call sock_put() anymore, but check if sk_wmem_alloc reached 0 to perform the final freeing. Drawback is that a skb->truesize error could lead to unfreeable sockets, or even worse, prematurely calling __sk_free() on a live socket. Nice speedups on SMP. tbench for example, going from 2691 MB/s to 2711 MB/s on my 8 cpu dev machine, even if tbench was not really hitting sk_refcnt contention point. 5 % speedup on a UDP transmit workload (depends on number of flows), lowering TX completion cpu usage. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Add constants for the ieee 802.15.4 stackSergey Lapin2009-06-091-0/+3
| | | | | | | | | | | | | | | | IEEE 802.15.4 stack requires several constants to be defined/adjusted. Signed-off-by: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com> Signed-off-by: Sergey Lapin <slapin@ossfans.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: net/core/sock.c cleanupEric Dumazet2009-05-271-59/+44
|/ | | | | | | Pure style cleanup patch. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* epoll keyed wakeups: make sockets use keyed wakeupsDavide Libenzi2009-04-011-3/+5
| | | | | | | | | | | | | | Add support for event-aware wakeups to the sockets code. Events are delivered to the wakeup target, so that epoll can avoid spurious wakeups for non-interesting events. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: William Lee Irwin III <wli@movementarian.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* RDS: Add RDS to AF key stringsAndy Grover2009-02-261-3/+3
| | | | | Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of /home/davem/src/GIT/linux-2.6/David S. Miller2009-02-241-2/+1
|\
| * net: amend the fix for SO_BSDCOMPAT gsopt infoleakEugene Teo2009-02-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | The fix for CVE-2009-0676 (upstream commit df0bca04) is incomplete. Note that the same problem of leaking kernel memory will reappear if someone on some architecture uses struct timeval with some internal padding (for example tv_sec 64-bit and tv_usec 32-bit) --- then, you are going to leak the padded bytes to userspace. Signed-off-by: Eugene Teo <eugeneteo@kernel.sg> Reported-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: Kill skb_truesize_check(), it only catches false-positives.David S. Miller2009-02-171-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | A long time ago we had bugs, primarily in TCP, where we would modify skb->truesize (for TSO queue collapsing) in ways which would corrupt the socket memory accounting. skb_truesize_check() was added in order to try and catch this error more systematically. However this debugging check has morphed into a Frankenstein of sorts and these days it does nothing other than catch false-positives. Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: socket infrastructure for SO_TIMESTAMPINGPatrick Ohly2009-02-151-12/+69
| | | | | | | | | | | | | | | | | | | | The overlap with the old SO_TIMESTAMP[NS] options is handled so that time stamping in software (net_enable_timestamp()) is enabled when SO_TIMESTAMP[NS] and/or SO_TIMESTAMPING_RX_SOFTWARE is set. It's disabled if all of these are off. Signed-off-by: Patrick Ohly <patrick.ohly@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'master' of /home/davem/src/GIT/linux-2.6/David S. Miller2009-02-141-0/+2
|\ \ | |/ | | | | | | | | Conflicts: drivers/net/wireless/iwlwifi/iwl-agn.c drivers/net/wireless/iwlwifi/iwl3945-base.c
| * net: 4 bytes kernel memory disclosure in SO_BSDCOMPAT gsopt try #2Clément Lecigne2009-02-121-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In function sock_getsockopt() located in net/core/sock.c, optval v.val is not correctly initialized and directly returned in userland in case we have SO_BSDCOMPAT option set. This dummy code should trigger the bug: int main(void) { unsigned char buf[4] = { 0, 0, 0, 0 }; int len; int sock; sock = socket(33, 2, 2); getsockopt(sock, 1, SO_BSDCOMPAT, &buf, &len); printf("%x%x%x%x\n", buf[0], buf[1], buf[2], buf[3]); close(sock); } Here is a patch that fix this bug by initalizing v.val just after its declaration. Signed-off-by: Clément Lecigne <clement.lecigne@netasq.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Reexport sock_alloc_send_pskbHerbert Xu2009-02-041-4/+4
|/ | | | | | | | | | | | The function sock_alloc_send_pskb is completely useless if not exported since most of the code in it won't be used as is. In fact, this code has already been duplicated in the tun driver. Now that we need accounting in the tun driver, we can in fact use this function as is. So this patch marks it for export again. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
* Revert "net: release skb->dst in sock_queue_rcv_skb()"David S. Miller2008-12-171-5/+1
| | | | | | | | | | This reverts commit 70355602879229c6f8bd694ec9c0814222bc4936. As pointed out by Mark McLoughlin IP_PKTINFO cmsg data is one post-queueing user, so this optimization is not valid right now. Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of ↵David S. Miller2008-11-261-19/+12
|\ | | | | | | | | | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/hp-plus.c drivers/net/wireless/ath5k/base.c drivers/net/wireless/ath9k/recv.c net/wireless/reg.c
| * net: Fix memory leak in the proto_register functionCatalin Marinas2008-11-211-19/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | If the slub allocator is used, kmem_cache_create() may merge two or more kmem_cache's into one but the cache name pointer is not updated and kmem_cache_name() is no longer guaranteed to return the pointer passed to the former function. This patch stores the kmalloc'ed pointers in the corresponding request_sock_ops and timewait_sock_ops structures. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: release skb->dst in sock_queue_rcv_skb()Eric Dumazet2008-11-261-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | When queuing a skb to sk->sk_receive_queue, we can release its dst, not anymore needed. Since current cpu did the dst_hold(), refcount is probably still hot int this cpu caches. This avoids readers to access the original dst to decrement its refcount, possibly a long time after packet reception. This should speedup UDP and RAW receive path. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Use a percpu_counter for sockets_allocatedEric Dumazet2008-11-251-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | Instead of using one atomic_t per protocol, use a percpu_counter for "sockets_allocated", to reduce cache line contention on heavy duty network servers. Note : We revert commit (248969ae31e1b3276fc4399d67ce29a5d81e6fd9 net: af_unix can make unix_nr_socks visbile in /proc), since it is not anymore used after sock_prot_inuse_add() addition Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: make /proc/net/protocols namespace awareEric Dumazet2008-11-191-5/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | Converting /proc/net/protocols to be namespace aware is quite easy and permits us to use sock_prot_inuse_get(). This provides seperate counters for each protocol. For example we can really count TCPv6 sockets and TCPv4 sockets, while previously, we had the same value, and this value was not namespace aware. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'master' of ↵David S. Miller2008-11-181-2/+0
|\ \ | |/ | | | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/isdn/i4l/isdn_net.c fs/cifs/connect.c
| * lockdep: include/linux/lockdep.h - fix warning in net/bluetooth/af_bluetooth.cIngo Molnar2008-11-131-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fix this warning: net/bluetooth/af_bluetooth.c:60: warning: ‘bt_key_strings’ defined but not used net/bluetooth/af_bluetooth.c:71: warning: ‘bt_slock_key_strings’ defined but not used this is a lockdep macro problem in the !LOCKDEP case. We cannot convert it to an inline because the macro works on multiple types, but we can mark the parameter used. [ also clean up a misaligned tab in sock_lock_init_class_and_name() ] [ also remove #ifdefs from around af_family_clock_key strings - which were certainly added to get rid of the ugly build warnings. ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Convert TCP & DCCP hash tables to use RCU / hlist_nullsEric Dumazet2008-11-161-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RCU was added to UDP lookups, using a fast infrastructure : - sockets kmem_cache use SLAB_DESTROY_BY_RCU and dont pay the price of call_rcu() at freeing time. - hlist_nulls permits to use few memory barriers. This patch uses same infrastructure for TCP/DCCP established and timewait sockets. Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications using short lived TCP connections. A followup patch, converting rwlocks to spinlocks will even speedup this case. __inet_lookup_established() is pretty fast now we dont have to dirty a contended cache line (read_lock/read_unlock) Only established and timewait hashtable are converted to RCU (bind table and listen table are still using traditional locking) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | udp: RCU handling for Unicast packets.Eric Dumazet2008-10-291-1/+2
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Goals are : 1) Optimizing handling of incoming Unicast UDP frames, so that no memory writes should happen in the fast path. Note: Multicasts and broadcasts still will need to take a lock, because doing a full lockless lookup in this case is difficult. 2) No expensive operations in the socket bind/unhash phases : - No expensive synchronize_rcu() calls. - No added rcu_head in socket structure, increasing memory needs, but more important, forcing us to use call_rcu() calls, that have the bad property of making sockets structure cold. (rcu grace period between socket freeing and its potential reuse make this socket being cold in CPU cache). David did a previous patch using call_rcu() and noticed a 20% impact on TCP connection rates. Quoting Cristopher Lameter : "Right. That results in cacheline cooldown. You'd want to recycle the object as they are cache hot on a per cpu basis. That is screwed up by the delayed regular rcu processing. We have seen multiple regressions due to cacheline cooldown. The only choice in cacheline hot sensitive areas is to deal with the complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU." - Because udp sockets are allocated from dedicated kmem_cache, use of SLAB_DESTROY_BY_RCU can help here. Theory of operation : --------------------- As the lookup is lockfree (using rcu_read_lock()/rcu_read_unlock()), special attention must be taken by readers and writers. Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed, reused, inserted in a different chain or in worst case in the same chain while readers could do lookups in the same time. In order to avoid loops, a reader must check each socket found in a chain really belongs to the chain the reader was traversing. If it finds a mismatch, lookup must start again at the begining. This *restart* loop is the reason we had to use rdlock for the multicast case, because we dont want to send same message several times to the same socket. We use RCU only for fast path. Thus, /proc/net/udp still takes spinlocks. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: wrap sk->sk_backlog_rcv()Peter Zijlstra2008-10-071-2/+2
| | | | | | | | Wrap calling sk->sk_backlog_rcv() in a function. This will allow extending the generic sk_backlog_rcv behaviour. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: David S. Miller <davem@davemloft.net>
* Phonet: global definitionsRemi Denis-Courmont2008-09-221-3/+6
| | | | | Signed-off-by: Remi Denis-Courmont <remi.denis-courmont@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ISDN sockets: add missing lockdep stringsRémi Denis-Courmont2008-09-181-3/+3
| | | | | Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Update entry in af_family_clock_key_stringsOliver Hartkopp2008-07-231-1/+1
| | | | | | | | | | | In the merge phase of the CAN subsystem the af_family_clock_key_strings[] have been added to sock.c in commit 443aef0eddfa44c158d1b94ebb431a70638fcab4 (lockdep: fixup sk_callback_lock annotation). This trivial patch adds the missing name for address family 29 (AF_CAN). Signed-off-by: Oliver Hartkopp <oliver@hartkopp.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* sock: add net to prot->enter_memory_pressure callbackPavel Emelyanov2008-07-161-1/+1
| | | | | | | | | | | | The tcp_enter_memory_pressure calls NET_INC_STATS, but doesn't have where to get the net from. I decided to add a sk argument, not the net itself, only to factor all the required sock_net(sk) calls inside the enter_memory_pressure callback itself. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Add sk_set_socket() helper.David S. Miller2008-06-171-2/+2
| | | | | | | | | In order to more easily grep for all things that set sk->sk_socket, add sk_set_socket() helper inline function. Suggested (although only half-seriously) by Evgeniy Polyakov. Signed-off-by: David S. Miller <davem@davemloft.net>
* net: remove CVS keywordsAdrian Bunk2008-06-111-2/+0
| | | | | | | | This patch removes CVS keywords that weren't updated for a long time from comments. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Fix typo in net/core/sock.c.Rami Rosen2008-05-141-1/+1
| | | | | | | | In sock_queue_rcv_skb() (net/core/sock.c) it should be: "Cast sk->rcvbuf ..." instead of: "Cast skb->rcvbuf ..." Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Add missing braces to multi-statement if()sIlpo Järvinen2008-05-021-1/+2
| | | | | | | | One finds all kinds of crazy things with some shell pipelining. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6Linus Torvalds2008-04-211-9/+0
|\ | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6: [SPARC]: Remove SunOS and Solaris binary support.
| * [SPARC]: Remove SunOS and Solaris binary support.David S. Miller2008-04-211-9/+0
| | | | | | | | | | | | As per Documentation/feature-removal-schedule.txt Signed-off-by: David S. Miller <davem@davemloft.net>
* | Remove documentation of non-existent sk_alloc argRusty Russell2008-04-211-1/+0
|/ | | | | | | As you can see, there's no zero_it arg (in fact code always uses __GFP_ZERO). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
* [NETNS]: Add netns refcnt debug for kernel sockets.Denis V. Lunev2008-04-161-0/+1
| | | | | | | | | Protocol control sockets and netlink kernel sockets should not prevent the namespace stop request. They are initialized and disposed in a special way by sk_change_net/sk_release_kernel. Signed-off-by: Denis V. Lunev <den@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of ↵David S. Miller2008-04-141-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/ehea/ehea_main.c drivers/net/wireless/iwlwifi/Kconfig drivers/net/wireless/rt2x00/rt61pci.c net/ipv4/inet_timewait_sock.c net/ipv6/raw.c net/mac80211/ieee80211_sta.c
OpenPOWER on IntegriCloud