summaryrefslogtreecommitdiffstats
path: root/net/core
Commit message (Collapse)AuthorAgeFilesLines
* net: Enable a userns root rtnl calls that are safe for unprivilged usersEric W. Biederman2012-11-183-24/+4
| | | | | | | | | | | | | | | | | - Only allow moving network devices to network namespaces you have CAP_NET_ADMIN privileges over. - Enable creating/deleting/modifying interfaces - Enable adding/deleting addresses - Enable adding/setting/deleting neighbour entries - Enable adding/removing routes - Enable adding/removing fib rules - Enable setting the forwarding state - Enable adding/removing ipv6 address labels - Enable setting bridge parameter Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Allow userns root control of the core of the network stack.Eric W. Biederman2012-11-184-13/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow an unpriviled user who has created a user namespace, and then created a network namespace to effectively use the new network namespace, by reducing capable(CAP_NET_ADMIN) and capable(CAP_NET_RAW) calls to be ns_capable(net->user_ns, CAP_NET_ADMIN), or capable(net->user_ns, CAP_NET_RAW) calls. Settings that merely control a single network device are allowed. Either the network device is a logical network device where restrictions make no difference or the network device is hardware NIC that has been explicity moved from the initial network namespace. In general policy and network stack state changes are allowed while resource control is left unchanged. Allow ethtool ioctls. Allow binding to network devices. Allow setting the socket mark. Allow setting the socket priority. Allow setting the network device alias via sysfs. Allow setting the mtu via sysfs. Allow changing the network device flags via sysfs. Allow setting the network device group via sysfs. Allow the following network device ioctls. SIOCGMIIPHY SIOCGMIIREG SIOCSIFNAME SIOCSIFFLAGS SIOCSIFMETRIC SIOCSIFMTU SIOCSIFHWADDR SIOCSIFSLAVE SIOCADDMULTI SIOCDELMULTI SIOCSIFHWBROADCAST SIOCSMIIREG SIOCBONDENSLAVE SIOCBONDRELEASE SIOCBONDSETHWADDR SIOCBONDCHANGEACTIVE SIOCBRADDIF SIOCBRDELIF SIOCSHWTSTAMP Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Allow userns root to force the scm credsEric W. Biederman2012-11-181-3/+3
| | | | | | | | | If the user calling sendmsg has the appropriate privieleges in their user namespace allow them to set the uid, gid, and pid in the SCM_CREDENTIALS control message to any valid value. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Push capable(CAP_NET_ADMIN) into the rtnl methodsEric W. Biederman2012-11-183-1/+31
| | | | | | | | | | | | | | | | | | - In rtnetlink_rcv_msg convert the capable(CAP_NET_ADMIN) check to ns_capable(net->user-ns, CAP_NET_ADMIN). Allowing unprivileged users to make netlink calls to modify their local network namespace. - In the rtnetlink doit methods add capable(CAP_NET_ADMIN) so that calls that are not safe for unprivileged users are still protected. Later patches will remove the extra capable calls from methods that are safe for unprivilged users. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Don't export sysctls to unprivileged usersEric W. Biederman2012-11-182-0/+9
| | | | | | | | | | | | | In preparation for supporting the creation of network namespaces by unprivileged users, modify all of the per net sysctl exports and refuse to allow them to unprivileged users. This makes it safe for unprivileged users in general to access per net sysctls, and allows sysctls to be exported to unprivileged users on an individual basis as they are deemed safe. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* userns: make each net (net_ns) belong to a user_nsEric W. Biederman2012-11-181-4/+12
| | | | | | | | | | | | | | | | | The user namespace which creates a new network namespace owns that namespace and all resources created in it. This way we can target capability checks for privileged operations against network resources to the user_ns which created the network namespace in which the resource lives. Privilege to the user namespace which owns the network namespace, or any parent user namespace thereof, provides the same privilege to the network resource. This patch is reworked from a version originally by Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* netns: Deduplicate and fix copy_net_ns when !CONFIG_NET_NSEric W. Biederman2012-11-181-7/+0
| | | | | | | | | | | | | | | | | | The copy of copy_net_ns used when the network stack is not built is broken as it does not return -EINVAL when attempting to create a new network namespace. We don't even have a previous network namespace. Since we need a copy of copy_net_ns in net/net_namespace.h that is available when the networking stack is not built at all move the correct version of copy_net_ns from net_namespace.c into net_namespace.h Leaving us with just 2 versions of copy_net_ns. One version for when we compile in network namespace suport and another stub for all other occasions. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2012-11-172-2/+5
|\ | | | | | | | | | | Minor line offset auto-merges. Signed-off-by: David S. Miller <davem@davemloft.net>
| * net-rps: Fix brokeness causing OOO packetsTom Herbert2012-11-161-1/+3
| | | | | | | | | | | | | | | | | | | | | | In commit c445477d74ab3779 which adds aRFS to the kernel, the CPU selected for RFS is not set correctly when CPU is changing. This is causing OOO packets and probably other issues. Signed-off-by: Tom Herbert <therbert@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: correct check in dev_addr_del()Jiri Pirko2012-11-151-1/+2
| | | | | | | | | | | | | | | | Check (ha->addr == dev->dev_addr) is always true because dev_addr_init() sets this. Correct the check to behave properly on addr removal. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: use right lock in __dev_remove_offloadEric Dumazet2012-11-161-2/+2
| | | | | | | | | | | | | | | | | | offload_base is protected by offload_lock, not ptype_lock Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vlad Yasevich <vyasevic@redhat.com> Acked-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Remove code duplication between offload structuresVlad Yasevich2012-11-151-7/+7
| | | | | | | | | | | | | | Move the offload callbacks into its own structure. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Switch to using the new packet offload infrustructureVlad Yasevich2012-11-151-10/+9
| | | | | | | | | | | | | | | | Convert to using the new GSO/GRO registration mechanism and new packet offload structure. Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Add generic packet offload infrastructure.Vlad Yasevich2012-11-151-0/+80
| | | | | | | | | | | | | | | | | | Create a new data structure to contain the GRO/GSO callbacks and add a new registration mechanism. Singed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2012-11-103-4/+7
|\ \ | |/ | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c Minor conflict between the BCM_CNIC define removal in net-next and a bug fix added to net. Based upon a conflict resolution patch posted by Stephen Rothwell. Signed-off-by: David S. Miller <davem@davemloft.net>
| * af-packet: fix oops when socket is not presentEric Leblond2012-11-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Due to a NULL dereference, the following patch is causing oops in normal trafic condition: commit c0de08d04215031d68fa13af36f347a6cfa252ca Author: Eric Leblond <eric@regit.org> Date:   Thu Aug 16 22:02:58 2012 +0000     af_packet: don't emit packet on orig fanout group This buggy patch was a feature fix and has reached most stable branches. When skb->sk is NULL and when packet fanout is used, there is a crash in match_fanout_group where skb->sk is accessed. This patch fixes the issue by returning false as soon as the socket is NULL: this correspond to the wanted behavior because the kernel as to resend the skb to all the listening socket in this case. Signed-off-by: Eric Leblond <eric@regit.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * rtnetlink: Use nlmsg type RTM_NEWNEIGH from dflt fdb dumpJohn Fastabend2012-11-031-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change the dflt fdb dump handler to use RTM_NEWNEIGH to be compatible with bridge dump routines. The dump reply from the network driver handlers should match the reply from bridge handler. The fact they were not in the ixgbe case was effectively a bug. This patch resolves it. Applications that were not checking the nlmsg type will continue to work. And now applications that do check the type will work as expected. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: fix secpath kmemleakEric Dumazet2012-10-221-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mike Kazantsev found 3.5 kernels and beyond were leaking memory, and tracked the faulty commit to a1c7fff7e18f59e ("net: netdev_alloc_skb() use build_skb()") While this commit seems fine, it uncovered a bug introduced in commit bad43ca8325 ("net: introduce skb_try_coalesce()), in function kfree_skb_partial()"): If head is stolen, we free the sk_buff, without removing references on secpath (skb->sp). So IPsec + IP defrag/reassembly (using skb coalescing), or TCP coalescing could leak secpath objects. Fix this bug by calling skb_release_head_state(skb) to properly release all possible references to linked objects. Reported-by: Mike Kazantsev <mk.fraggod@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Bisected-by: Mike Kazantsev <mk.fraggod@gmail.com> Tested-by: Mike Kazantsev <mk.fraggod@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: fix bridge notify hook to manage flags correctlyJohn Fastabend2012-11-031-8/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The bridge notify hook rtnl_bridge_notify() was not handling the case where the master flags was set or with both flags set. First flags are not being passed correctly and second the logic to parse them is broken. This patch passes the original flags value and fixes the logic. Reported-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | pktgen: clean up ktime_t helpersDaniel Borkmann2012-11-031-28/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some years ago, the ktime_t helper functions ktime_now() and ktime_lt() have been introduced. Instead of defining them inside pktgen.c, they should either use ktime_t library functions or, if not available, they should be defined in ktime.h, so that also others can benefit from them. ktime_compare() is introduced with a similar notion as in timespec_compare(). Signed-off-by: Daniel Borkmann <daniel.borkmann@tik.ee.ethz.ch> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: Thomas Gleixner <tglx@linutronix.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Fix continued iteration in rtnl_bridge_getlink()Ben Hutchings2012-11-021-16/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit e5a55a898720096f43bc24938f8875c0a1b34cd7 ('net: create generic bridge ops') broke the handling of a non-zero starting index in rtnl_bridge_getlink() (based on the old br_dump_ifinfo()). When the starting index is non-zero, we need to increment the current index for each entry that we are skipping. Also, we need to check the index before both cases, since we may previously have stopped iteration between getting information about a device from its master and from itself. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Tested-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | skb: api to report errors for zero copy skbsMichael S. Tsirkin2012-11-021-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | Orphaning frags for zero copy skbs needs to allocate data in atomic context so is has a chance to fail. If it does we currently discard the skb which is safe, but we don't report anything to the caller, so it can not recover by e.g. disabling zero copy. Add an API to free skb reporting such errors: this is used by tun in case orphaning frags fails. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | skb: report completion status for zero copy skbsMichael S. Tsirkin2012-11-021-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Even if skb is marked for zero copy, net core might still decide to copy it later which is somewhat slower than a copy in user context: besides copying the data we need to pin/unpin the pages. Add a parameter reporting such cases through zero copy callback: if this happens a lot, device can take this into account and switch to copying in user context. This patch updates all users but ignores the passed value for now: it will be used by follow-up patches. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | sk-filter: Add ability to get socket filter program (v2)Pavel Emelyanov2012-11-012-0/+136
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The SO_ATTACH_FILTER option is set only. I propose to add the get ability by using SO_ATTACH_FILTER in getsockopt. To be less irritating to eyes the SO_GET_FILTER alias to it is declared. This ability is required by checkpoint-restore project to be able to save full state of a socket. There are two issues with getting filter back. First, kernel modifies the sock_filter->code on filter load, thus in order to return the filter element back to user we have to decode it into user-visible constants. Fortunately the modification in question is interconvertible. Second, the BPF_S_ALU_DIV_K code modifies the command argument k to speed up the run-time division by doing kernel_k = reciprocal(user_k). Bad news is that different user_k may result in same kernel_k, so we can't get the original user_k back. Good news is that we don't have to do it. What we need to is calculate a user2_k so, that reciprocal(user2_k) == reciprocal(user_k) == kernel_k i.e. if it's re-loaded back the compiled again value will be exactly the same as it was. That said, the user2_k can be calculated like this user2_k = reciprocal(kernel_k) with an exception, that if kernel_k == 0, then user2_k == 1. The optlen argument is treated like this -- when zero, kernel returns the amount of sock_fprog elements in filter, otherwise it should be large enough for the sock_fprog array. changes since v1: * Declared SO_GET_FILTER in all arch headers * Added decode of vlan-tag codes Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: filter: add vlan tag accessEric Dumazet2012-10-311-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BPF filters lack ability to access skb->vlan_tci This patch adds two new ancillary accessors : SKF_AD_VLAN_TAG (44) mapped to vlan_tx_tag_get(skb) SKF_AD_VLAN_TAG_PRESENT (48) mapped to vlan_tx_tag_present(skb) This allows libpcap/tcpdump to use a kernel filter instead of having to fallback to accept all packets, then filter them in user space. Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Ani Sinha <ani@aristanetworks.com> Suggested-by: Daniel Borkmann <danborkmann@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ixgbe: add setlink, getlink support to ixgbe and ixgbevfJohn Fastabend2012-10-311-0/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds support for the net device ops to manage the embedded hardware bridge on ixgbe devices. With this patch the bridge mode can be toggled between VEB and VEPA to support stacking macvlan devices or using the embedded switch without any SW component in 802.1Qbg/br environments. Additionally, this adds source address pruning to the ixgbevf driver to prune any frames sent back from a reflective relay on the switch. This is required because the existing hardware does not support this. Without it frames get pushed into the stack with its own src mac which is invalid per 802.1Qbg VEPA definition. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: set and query VEB/VEPA bridge mode via PF_BRIDGEJohn Fastabend2012-10-311-4/+81
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Hardware switches may support enabling and disabling the loopback switch which puts the device in a VEPA mode defined in the IEEE 802.1Qbg specification. In this mode frames are not switched in the hardware but sent directly to the switch. SR-IOV capable NICs will likely support this mode I am aware of at least two such devices. Also I am told (but don't have any of this hardware available) that there are devices that only support VEPA modes. In these cases it is important at a minimum to be able to query these attributes. This patch adds an additional IFLA_BRIDGE_MODE attribute that can be set and dumped via the PF_BRIDGE:{SET|GET}LINK operations. Also anticipating bridge attributes that may be common for both embedded bridges and software bridges this adds a flags attribute IFLA_BRIDGE_FLAGS currently used to determine if the command or event is being generated to/from an embedded bridge or software bridge. Finally, the event generation is pulled out of the bridge module and into rtnetlink proper. For example using the macvlan driver in VEPA mode on top of an embedded switch requires putting the embedded switch into a VEPA mode to get the expected results. -------- -------- | VEPA | | VEPA | <-- macvlan vepa edge relays -------- -------- | | | | ------------------ | VEPA | <-- embedded switch in NIC ------------------ | | ------------------- | external switch | <-- shiny new physical ------------------- switch with VEPA support A packet sent from the macvlan VEPA at the top could be loopbacked on the embedded switch and never seen by the external switch. So in order for this to work the embedded switch needs to be set in the VEPA state via the above described commands. By making these attributes nested in IFLA_AF_SPEC we allow future extensions to be made as needed. CC: Lennert Buytenhek <buytenh@wantstofly.org> CC: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: create generic bridge opsJohn Fastabend2012-10-311-0/+80
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The PF_BRIDGE:RTM_{GET|SET}LINK nlmsg family and type are currently embedded in the ./net/bridge module. This prohibits them from being used by other bridging devices. One example of this being hardware that has embedded bridging components. In order to use these nlmsg types more generically this patch adds two net_device_ops hooks. One to set link bridge attributes and another to dump the current bride attributes. ndo_bridge_setlink() ndo_bridge_getlink() CC: Lennert Buytenhek <buytenh@wantstofly.org> CC: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | cgroup: net_cls: Pass in task to sock_update_classid()Daniel Wagner2012-10-261-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sock_update_classid() assumes that the update operation always are applied on the current task. sock_update_classid() needs to know on which tasks to work on in order to be able to migrate task between cgroups using the struct cgroup_subsys attach() callback. Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Joe Perches <joe@perches.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Cc: Tejun Heo <tj@kernel.org> Cc: <netdev@vger.kernel.org> Cc: <cgroups@vger.kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | cgroup: net_cls: Remove rcu_read_lock/unlockDaniel Wagner2012-10-261-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As Eric pointed out: "Hey task_cls_classid() has its own rcu protection since commit 3fb5a991916091a908d (cls_cgroup: Fix rcu lockdep warning) So we can safely revert Paul commit (1144182a8757f2a1) (We no longer need rcu_read_lock/unlock here)" Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: netdev@vger.kernel.org Cc: cgroups@vger.kernel.org Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | cgroup: net_prio: Mark local used function staticDaniel Wagner2012-10-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | net_prio_attach() is only access via cgroup_subsys callbacks, therefore we can reduce the visibility of this function. Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: John Fastabend <john.r.fastabend@intel.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Tejun Heo <tj@kernel.org> Cc: <netdev@vger.kernel.org> Cc: <cgroups@vger.kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | netlink: cleanup the unnecessary return value checkHans Zhang2012-10-231-3/+3
| | | | | | | | | | | | | | | | | | It's no needed to check the return value of tab since the NULL situation has been handled already, and the rtnl_msg_handlers[PF_UNSPEC] has been initialized as non-NULL during the rtnetlink_init(). Signed-off-by: Hans Zhang <zhanghonghui@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net:dev: remove double indentical assignment in dev_change_net_namespace().Rami Rosen2012-10-211-1/+0
| | | | | | | | | | | | | | | | This patch removes double assignment of err to -EINVAL in dev_change_net_namespace(). Signed-off-by: Rami Rosen <ramirose@gmail.com> Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | sockopt: Make SO_BINDTODEVICE readablePavel Emelyanov2012-10-211-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The SO_BINDTODEVICE option is the only SOL_SOCKET one that can be set, but cannot be get via sockopt API. The only way we can find the device id a socket is bound to is via sock-diag interface. But the diag works only on hashed sockets, while the opt in question can be set for yet unhashed one. That said, in order to know what device a socket is bound to (we do want to know this in checkpoint-restore project) I propose to make this option getsockopt-able and report the respective device index. Another solution to the problem might be to teach the sock-diag reporting info on unhashed sockets. Should I go this way instead? Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | pktgen: Use ipv6_addr_anyJoe Perches2012-10-211-5/+1
|/ | | | | | | Use the standard test for a non-zero ipv6 address. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add doc for in4_pton()Amerigo Wang2012-10-121-0/+12
| | | | | | | | | It is not easy to use in4_pton() correctly without reading its definition, so add some doc for it. Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add doc for in6_pton()Amerigo Wang2012-10-121-0/+12
| | | | | | | | | It is not easy to use in6_pton() correctly without reading its definition, so add some doc for it. Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* pktgen: replace scan_ip6() with in6_pton()Amerigo Wang2012-10-101-97/+4
| | | | | | Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* pktgen: enable automatic IPv6 address settingAmerigo Wang2012-10-101-13/+5
| | | | | | Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* pktgen: display IPv4 address in human-readable formatAmerigo Wang2012-10-101-2/+2
| | | | | | | | | | It is weird to display IPv4 address in %x format, what's more, IPv6 address is disaplayed in human-readable format too. So, make it human-readable. Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* pktgen: set different default min_pkt_size for different protocolsAmerigo Wang2012-10-101-7/+20
| | | | | | | | | ETH_ZLEN is too small for IPv6, so this default value is not suitable. Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* pktgen: fix crash when generating IPv6 packetsAmerigo Wang2012-10-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | For IPv6, sizeof(struct ipv6hdr) = 40, thus the following expression will result negative: datalen = pkt_dev->cur_pkt_size - 14 - sizeof(struct ipv6hdr) - sizeof(struct udphdr) - pkt_dev->pkt_overhead; And, the check "if (datalen < sizeof(struct pktgen_hdr))" will be passed as "datalen" is promoted to unsigned, therefore will cause a crash later. This is a quick fix by checking if "datalen" is negative. The following patch will increase the default value of 'min_pkt_size' for IPv6. This bug should exist for a long time, so Cc -stable too. Cc: <stable@vger.kernel.org> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* vlan: don't deliver frames for unknown vlans to protocolsFlorian Zumbiehl2012-10-081-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | 6a32e4f9dd9219261f8856f817e6655114cfec2f made the vlan code skip marking vlan-tagged frames for not locally configured vlans as PACKET_OTHERHOST if there was an rx_handler, as the rx_handler could cause the frame to be received on a different (virtual) vlan-capable interface where that vlan might be configured. As rx_handlers do not necessarily return RX_HANDLER_ANOTHER, this could cause frames for unknown vlans to be delivered to the protocol stack as if they had been received untagged. For example, if an ipv6 router advertisement that's tagged for a locally not configured vlan is received on an interface with macvlan interfaces attached, macvlan's rx_handler returns RX_HANDLER_PASS after delivering the frame to the macvlan interfaces, which caused it to be passed to the protocol stack, leading to ipv6 addresses for the announced prefix being configured even though those are completely unusable on the underlying interface. The fix moves marking as PACKET_OTHERHOST after the rx_handler so the rx_handler, if there is one, sees the frame unchanged, but afterwards, before the frame is delivered to the protocol stack, it gets marked whether there is an rx_handler or not. Signed-off-by: Florian Zumbiehl <florz@florz.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: gro: selective flush of packetsEric Dumazet2012-10-081-7/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Current GRO can hold packets in gro_list for almost unlimited time, in case napi->poll() handler consumes its budget over and over. In this case, napi_complete()/napi_gro_flush() are not called. Another problem is that gro_list is flushed in non friendly way : We scan the list and complete packets in the reverse order. (youngest packets first, oldest packets last) This defeats priorities that sender could have cooked. Since GRO currently only store TCP packets, we dont really notice the bug because of retransmits, but this behavior can add unexpected latencies, particularly on mice flows clamped by elephant flows. This patch makes sure no packet can stay more than 1 ms in queue, and only in stress situations. It also complete packets in the right order to minimize latencies. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jesse Gross <jesse@nicira.com> Cc: Tom Herbert <therbert@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: gro: fix a potential crash in skb_gro_reset_offsetEric Dumazet2012-10-071-6/+8
| | | | | | | | | | | | | | | | Before accessing skb first fragment, better make sure there is one. This is probably not needed for old kernels, since an ethernet frame cannot contain only an ethernet header, but the recent GRO addition to tunnels makes this patch needed. Also skb_gro_reset_offset() can be static, it actually allows compiler to inline it. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Fix skb_under_panic oops in neigh_resolve_outputramesh.nagappa@gmail.com2012-10-071-4/+2
| | | | | | | | | | | | | The retry loop in neigh_resolve_output() and neigh_connected_output() call dev_hard_header() with out reseting the skb to network_header. This causes the retry to fail with skb_under_panic. The fix is to reset the network_header within the retry loop. Signed-off-by: Ramesh Nagappa <ramesh.nagappa@ericsson.com> Reviewed-by: Shawn Lu <shawn.lu@ericsson.com> Reviewed-by: Robert Coulson <robert.coulson@ericsson.com> Reviewed-by: Billie Alsup <billie.alsup@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: remove skb recyclingEric Dumazet2012-10-071-47/+0
| | | | | | | | | | | | | | | | | | | | | | | Over time, skb recycling infrastructure got litle interest and many bugs. Generic rx path skb allocation is now using page fragments for efficient GRO / TCP coalescing, and recyling a tx skb for rx path is not worth the pain. Last identified bug is that fat skbs can be recycled and it can endup using high order pages after few iterations. With help from Maxime Bizon, who pointed out that commit 87151b8689d (net: allow pskb_expand_head() to get maximum tailroom) introduced this regression for recycled skbs. Instead of fixing this bug, lets remove skb recycling. Drivers wanting really hot skbs should use build_skb() anyway, to allocate/populate sk_buff right before netif_receive_skb() Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Maxime Bizon <mbizon@freebox.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'for-linus' of ↵Linus Torvalds2012-10-022-28/+13
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs update from Al Viro: - big one - consolidation of descriptor-related logics; almost all of that is moved to fs/file.c (BTW, I'm seriously tempted to rename the result to fd.c. As it is, we have a situation when file_table.c is about handling of struct file and file.c is about handling of descriptor tables; the reasons are historical - file_table.c used to be about a static array of struct file we used to have way back). A lot of stray ends got cleaned up and converted to saner primitives, disgusting mess in android/binder.c is still disgusting, but at least doesn't poke so much in descriptor table guts anymore. A bunch of relatively minor races got fixed in process, plus an ext4 struct file leak. - related thing - fget_light() partially unuglified; see fdget() in there (and yes, it generates the code as good as we used to have). - also related - bits of Cyrill's procfs stuff that got entangled into that work; _not_ all of it, just the initial move to fs/proc/fd.c and switch of fdinfo to seq_file. - Alex's fs/coredump.c spiltoff - the same story, had been easier to take that commit than mess with conflicts. The rest is a separate pile, this was just a mechanical code movement. - a few misc patches all over the place. Not all for this cycle, there'll be more (and quite a few currently sit in akpm's tree)." Fix up trivial conflicts in the android binder driver, and some fairly simple conflicts due to two different changes to the sock_alloc_file() interface ("take descriptor handling from sock_alloc_file() to callers" vs "net: Providing protocol type via system.sockprotoname xattr of /proc/PID/fd entries" adding a dentry name to the socket) * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits) MAX_LFS_FILESIZE should be a loff_t compat: fs: Generic compat_sys_sendfile implementation fs: push rcu_barrier() from deactivate_locked_super() to filesystems btrfs: reada_extent doesn't need kref for refcount coredump: move core dump functionality into its own file coredump: prevent double-free on an error path in core dumper usb/gadget: fix misannotations fcntl: fix misannotations ceph: don't abuse d_delete() on failure exits hypfs: ->d_parent is never NULL or negative vfs: delete surplus inode NULL check switch simple cases of fget_light to fdget new helpers: fdget()/fdput() switch o2hb_region_dev_write() to fget_light() proc_map_files_readdir(): don't bother with grabbing files make get_file() return its argument vhost_set_vring(): turn pollstart/pollstop into bool switch prctl_set_mm_exe_file() to fget_light() switch xfs_find_handle() to fget_light() switch xfs_swapext() to fget_light() ...
| * make get_file() return its argumentAl Viro2012-09-261-2/+1
| | | | | | | | | | | | simplifies a bunch of callers... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * new helper: iterate_fd()Al Viro2012-09-261-26/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | iterates through the opened files in given descriptor table, calling a supplied function; we stop once non-zero is returned. Callback gets struct file *, descriptor number and const void * argument passed to iterator. It is called with files->file_lock held, so it is not allowed to block. tty_io, netprio_cgroup and selinux flush_unauthorized_files() converted to its use. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
OpenPOWER on IntegriCloud