path: root/net/sched
Commit message | Author | Age | Files | Lines
* net: sched: flower: insert new filter to idr after setting its mask | Vlad Buslov | 2019-03-19 | 1 | -21/+22
[ Upstream commit ecb3dea400d3beaf611ce76ac7a51d4230492cf2 ]

When adding a new filter to the flower classifier, fl_change() inserts it into handle_idr before initializing the filter's extensions and assigning it a mask. Normally this ordering doesn't matter because all flower classifier ops callbacks assume rtnl lock protection. However, when the filter has an action whose kernel module is not loaded, the rtnl lock is released before the call to request_module(). During this window the filter can be accessed by a concurrent task before its initialization is completed, which can lead to a crash. Example case of a NULL pointer dereference in a concurrent dump:

    Task 1                          Task 2
    tc_new_tfilter()
     fl_change()
      idr_alloc_u32(fnew)
      fl_set_parms()
       tcf_exts_validate()
        tcf_action_init()
         tcf_action_init_1()
          rtnl_unlock()
          request_module()
          ...                       rtnl_lock()
                                    tc_dump_tfilter()
                                     tcf_chain_dump()
                                      fl_walk()
                                       idr_get_next_ul()
                                       tcf_node_dump()
                                        tcf_fill_node()
                                         fl_dump()
                                          mask = &f->mask->key; <- NULL ptr
          rtnl_lock()

Extension initialization and mask assignment don't depend on fnew->handle, which is allocated by idr_alloc_u32(). Move the idr allocation code after action creation and mask assignment in fl_change() to prevent concurrent access to a not fully initialized filter while the rtnl lock is released to load the action module.

Fixes: 01683a146999 ("net: sched: refactor flower walk to iterate over idr")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
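The ordering rule this fix enforces can be pictured with a tiny standalone model (illustrative names only, not the flower code): fully initialize the object first, and only then make it reachable by walkers.

    #include <stdio.h>

    struct mask   { int key; };
    struct filter { struct mask *mask; unsigned int handle; };

    static struct filter *visible;   /* stand-in for the handle_idr walkers iterate */

    /* stand-in for idr_alloc_u32(): this is the publication point */
    static void publish(struct filter *f) { f->handle = 1; visible = f; }

    static void fl_change_model(struct filter *f, struct mask *m)
    {
        /* an action module may be loaded here with RTNL dropped, so a dump can run */
        f->mask = m;     /* initialize everything a reader may touch ... */
        publish(f);      /* ... and publish last, so a dump never sees mask == NULL */
    }

    int main(void)
    {
        struct mask m = { .key = 42 };
        struct filter f = { 0 };

        fl_change_model(&f, &m);
        if (visible)
            printf("dump sees key %d\n", visible->mask->key);
        return 0;
    }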
* net: sched: put back q.qlen into a single location | Eric Dumazet | 2019-03-10 | 1 | -7/+6
[ Upstream commit 46b1c18f9deb326a7e18348e668e4c7ab7c7458b ]

In the series fc8b81a5981f ("Merge branch 'lockless-qdisc-series'") John made the assumption that the data path had no need to read the qdisc qlen (number of packets in the qdisc). It is true when pfifo_fast is used as the root qdisc, or as direct MQ/MQPRIO children. But pfifo_fast can be used as a leaf in classful qdiscs, and existing logic needs to access the child qlen in an efficient way.

HTB breaks badly, since it uses cl->leaf.q->q.qlen in:

    htb_activate() -> WARN_ON()
    htb_dequeue_tree() to decide if a class can be htb_deactivated when it has no more packets.

HFSC, DRR, CBQ, QFQ have similar issues, and some calls to qdisc_tree_reduce_backlog() also read q.qlen directly. Using qdisc_qlen_sum() (which iterates over all possible cpus) in the data path is a non-starter.

It seems we have to put qlen back in a central location, at least for stable kernels. For all qdiscs but pfifo_fast, qlen is guarded by the qdisc lock, so the existing q.qlen{++|--} are correct. For 'lockless' qdiscs (pfifo_fast so far), we need to use atomic_{inc|dec}() because the spinlock might not be held (for example from pfifo_fast_enqueue() and pfifo_fast_dequeue()).

This patch adds atomic_qlen (in the same location as qlen) and renames the following helpers, since we want to express that they can be used without the qdisc lock, and that qlen is no longer percpu:

    - qdisc_qstats_cpu_qlen_dec -> qdisc_qstats_atomic_qlen_dec()
    - qdisc_qstats_cpu_qlen_inc -> qdisc_qstats_atomic_qlen_inc()

Later (net-next) we might revert this patch by tracking all these qlen uses and replacing them with a more efficient method (not having to access a precise qlen, but an empty/non-empty status that might be less expensive to maintain/track). Another possibility is to have a legacy pfifo_fast version that would be used when used as a child qdisc, since the parent qdisc needs a spinlock anyway. But then, future lockless qdiscs would have the same problem.

Fixes: 7e66016f2c65 ("net: sched: helpers to sum qlen and qlen for per cpu logic")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
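A minimal userspace sketch of the accounting pattern described above (the helper names mirror the commit's naming but the types are illustrative, not the kernel implementation): a single atomic counter is cheap for a parent qdisc to read on the data path, unlike summing per-cpu counters.

    #include <stdatomic.h>
    #include <stdio.h>

    /* illustrative stand-in for the qdisc stats block */
    struct qdisc_stats {
        atomic_uint qlen;   /* "atomic_qlen": shared, lock-free counter */
    };

    /* counterparts of qdisc_qstats_atomic_qlen_inc()/_dec() from the commit */
    static void stats_qlen_inc(struct qdisc_stats *s) { atomic_fetch_add(&s->qlen, 1); }
    static void stats_qlen_dec(struct qdisc_stats *s) { atomic_fetch_sub(&s->qlen, 1); }

    int main(void)
    {
        struct qdisc_stats s = { .qlen = 0 };

        stats_qlen_inc(&s);   /* enqueue on a lockless leaf */
        stats_qlen_inc(&s);
        stats_qlen_dec(&s);   /* dequeue */

        /* a parent qdisc (e.g. HTB) can now read the child qlen directly */
        printf("qlen = %u\n", atomic_load(&s.qlen));
        return 0;
    }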
* net: netem: fix skb length BUG_ON in __skb_to_sgvec | Sheng Lan | 2019-02-28 | 1 | -3/+7
It can be reproduced by the following steps:

    1. virtio_net NIC is configured with gso/tso on
    2. configure nginx as http server with an index file bigger than 1M bytes
    3. use tc netem to produce duplicate packets and delay:
       tc qdisc add dev eth0 root netem delay 100ms 10ms 30% duplicate 90%
    4. continually curl the nginx http server to get index file on client
    5. BUG_ON is seen quickly

    [10258690.371129] kernel BUG at net/core/skbuff.c:4028!
    [10258690.371748] invalid opcode: 0000 [#1] SMP PTI
    [10258690.372094] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G W 5.0.0-rc6 #2
    [10258690.372094] RSP: 0018:ffffa05797b43da0 EFLAGS: 00010202
    [10258690.372094] RBP: 00000000000005ea R08: 0000000000000000 R09: 00000000000005ea
    [10258690.372094] R10: ffffa0579334d800 R11: 00000000000002c0 R12: 0000000000000002
    [10258690.372094] R13: 0000000000000000 R14: ffffa05793122900 R15: ffffa0578f7cb028
    [10258690.372094] FS:  0000000000000000(0000) GS:ffffa05797b40000(0000) knlGS:0000000000000000
    [10258690.372094] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [10258690.372094] CR2: 00007f1a6dc00868 CR3: 000000001000e000 CR4: 00000000000006e0
    [10258690.372094] Call Trace:
    [10258690.372094]  <IRQ>
    [10258690.372094]  skb_to_sgvec+0x11/0x40
    [10258690.372094]  start_xmit+0x38c/0x520 [virtio_net]
    [10258690.372094]  dev_hard_start_xmit+0x9b/0x200
    [10258690.372094]  sch_direct_xmit+0xff/0x260
    [10258690.372094]  __qdisc_run+0x15e/0x4e0
    [10258690.372094]  net_tx_action+0x137/0x210
    [10258690.372094]  __do_softirq+0xd6/0x2a9
    [10258690.372094]  irq_exit+0xde/0xf0
    [10258690.372094]  smp_apic_timer_interrupt+0x74/0x140
    [10258690.372094]  apic_timer_interrupt+0xf/0x20
    [10258690.372094]  </IRQ>

In __skb_to_sgvec(), skb->len is not equal to the sum of the skb's linear data size and nonlinear data size, so the BUG_ON is triggered. This happens because the skb is cloned and part of its nonlinear data has been split off.

The duplicate packet is cloned in netem_enqueue() and may be delayed some time in the qdisc. When the qdisc len reaches the limit and NET_XMIT_DROP is returned, the skb will be retransmitted later from the write queue. That skb will be fragmented by tso_fragment() as the limit size (which depends on cwnd and mss) decreases, and part of its nonlinear data will be split off. The length of the skb cloned by netem is not updated, so when we use a virtio_net NIC and invoke skb_to_sgvec(), the BUG_ON triggers.

To fix it, netem returns NET_XMIT_SUCCESS to the upper stack when it clones a duplicate packet.

Fixes: 35d889d1 ("sch_netem: fix skb leak in netem_enqueue()")
Signed-off-by: Sheng Lan <lansheng@huawei.com>
Reported-by: Qin Ji <jiqin.ji@huawei.com>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
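A compact standalone sketch of the return-code decision the fix describes (toy names, not the netem source): once a duplicate clone has been queued, report success even if the original enqueue dropped, so the upper layer never re-segments an skb the clone still points into.

    #include <stdbool.h>
    #include <stdio.h>

    #define NET_XMIT_SUCCESS 0
    #define NET_XMIT_DROP    1

    /* Model of the netem_enqueue() verdict: a drop of the original is
     * masked when a duplicate clone was created and queued. */
    static int netem_enqueue_model(bool duplicated, int child_verdict)
    {
        if (child_verdict == NET_XMIT_DROP && duplicated)
            return NET_XMIT_SUCCESS;
        return child_verdict;
    }

    int main(void)
    {
        printf("%d\n", netem_enqueue_model(true, NET_XMIT_DROP));  /* 0: treated as success */
        printf("%d\n", netem_enqueue_model(false, NET_XMIT_DROP)); /* 1: a real drop */
        return 0;
    }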
* net: sched: act_tunnel_key: fix NULL pointer dereference during init | Vlad Buslov | 2019-02-25 | 1 | -1/+2
The metadata pointer is only initialized for action TCA_TUNNEL_KEY_ACT_SET, but it is unconditionally dereferenced in the tunnel_key_init() error handler. Verify that the metadata pointer is not NULL before dereferencing it in the tunnel_key_init() error handling code.

Fixes: ee28bb56ac5b ("net/sched: fix memory leak in act_tunnel_key_init()")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* net/sched: act_skbedit: fix refcount leak when replace fails | Davide Caratti | 2019-02-24 | 1 | -2/+1
When act_skbedit was converted to use RCU in the data plane, we added an error path, but we forgot to drop the action refcount in case of failure during a 'replace' operation:

    # tc actions add action skbedit ptype otherhost pass index 100
    # tc action show action skbedit
    total acts 1

            action order 0: skbedit ptype otherhost pass
             index 100 ref 1 bind 0
    # tc actions replace action skbedit ptype otherhost drop index 100
    RTNETLINK answers: Cannot allocate memory
    We have an error talking to the kernel
    # tc action show action skbedit
    total acts 1

            action order 0: skbedit ptype otherhost pass
             index 100 ref 2 bind 0

Ensure we call tcf_idr_release(), in case 'params_new' allocation failed, also when the action is being replaced.

Fixes: c749cdda9089 ("net/sched: act_skbedit: don't use spinlock in the data path")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* net/sched: act_ipt: fix refcount leak when replace fails | Davide Caratti | 2019-02-24 | 1 | -2/+1
After commit 4e8ddd7f1758 ("net: sched: don't release reference on action overwrite"), the error path of all actions was converted to drop the refcount also when the action was being overwritten. But we forgot act_ipt_init(), in case allocation of 'tname' was not successful:

    # tc action add action xt -j LOG --log-prefix hello index 100
    tablename: mangle hook: NF_IP_POST_ROUTING
            target:  LOG level warning prefix "hello"
            index 100
    # tc action show action xt
    total acts 1

            action order 0: tablename: mangle hook: NF_IP_POST_ROUTING
                    target LOG level warning prefix "hello"
            index 100 ref 1 bind 0
    # tc action replace action xt -j LOG --log-prefix world index 100
    tablename: mangle hook: NF_IP_POST_ROUTING
            target:  LOG level warning prefix "world"
            index 100
    RTNETLINK answers: Cannot allocate memory
    We have an error talking to the kernel
    # tc action show action xt
    total acts 1

            action order 0: tablename: mangle hook: NF_IP_POST_ROUTING
                    target LOG level warning prefix "hello"
            index 100 ref 2 bind 0

Ensure we call tcf_idr_release(), in case 'tname' allocation failed, also when the action is being replaced.

Fixes: 4e8ddd7f1758 ("net: sched: don't release reference on action overwrite")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
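Both the act_skbedit and act_ipt fixes above apply the same error-path rule, which a toy standalone model can illustrate (illustrative names, not the kernel API): if an allocation fails after the action's reference was taken, the reference must be dropped whether the action is newly created or being replaced.

    #include <stdio.h>

    struct action { int refcnt; };

    static void idr_release(struct action *a) { a->refcnt--; }

    static int act_init(struct action *a)
    {
        void *params = NULL;    /* simulate the failing allocation */

        a->refcnt++;            /* reference taken before the allocation */
        if (!params) {
            /* the buggy versions skipped this when replacing, leaving
             * the extra reference behind (the "ref 2" shown above) */
            idr_release(a);
            return -1;          /* -ENOMEM in the kernel */
        }
        return 0;
    }

    int main(void)
    {
        struct action a = { .refcnt = 1 };

        act_init(&a);
        printf("refcnt = %d\n", a.refcnt);   /* still 1: no leak */
        return 0;
    }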
* net_sched: fix two more memory leaks in cls_tcindex | Cong Wang | 2019-02-12 | 1 | -9/+7
struct tcindex_filter_result contains two parts: struct tcf_exts and struct tcf_result.

For the local variable 'cr', its exts part is never used, yet it is initialized without being properly released on the success path. So just completely remove the exts part to fix this leak.

For the local variable 'new_filter_result', it is never properly released if not used by 'r' on the success path.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* net_sched: fix a memory leak in cls_tcindex | Cong Wang | 2019-02-12 | 1 | -16/+30
When tcindex_destroy() destroys all the filter results in the perfect hash table, it invokes the walker to delete each of them. However, results with class==0 are skipped in either tcindex_walk() or tcindex_delete(), which causes a memory leak reported by kmemleak.

This patch fixes it by skipping the walker and directly deleting these filter results so we don't miss any filter result.

As a result of this change, we have to initialize exts->net properly in tcindex_alloc_perfect_hash(). For net-next, we need to consider whether we should initialize ->net in tcf_exts_init() instead; until then, just directly test CONFIG_NET_CLS_ACT=y.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* net_sched: fix a race condition in tcindex_destroy() | Cong Wang | 2019-02-12 | 1 | -7/+11
tcindex_destroy() invokes tcindex_destroy_element() via a walker to delete each filter result in its perfect hash table, and tcindex_destroy_element() calls tcindex_delete(), which schedules tcf RCU works to do the final deletion work. Unfortunately this races with the RCU callback __tcindex_destroy(), which could lead to a use-after-free as reported by Adrian.

Fix this by migrating this RCU callback to tcf RCU work too; as that workqueue is ordered, we will not have a use-after-free.

Note, we don't need to hold the netns refcnt because we don't call tcf_exts_destroy() here.

Fixes: 27ce4f05e2ab ("net_sched: use tcf_queue_work() in tcindex filter")
Reported-by: Adrian <bugs@abtelecom.ro>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* Documentation: bring operstate documentation up-to-date | Jouke Witteveen | 2019-02-11 | 1 | -1/+1
Netlink moved from bitmasks to group numbers long ago.

Signed-off-by: Jouke Witteveen <j.witteveen@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* net: cls_flower: Remove filter from mask before freeing it | Petr Machata | 2019-02-04 | 1 | -1/+5
In fl_change(), when adding a new rule (i.e. fold == NULL), a driver may reject the new rule, for example due to resource exhaustion. By that point, the new rule was already assigned a mask, and it was added to that mask's hash table. The clean-up path that's invoked as a result of the rejection however neglects to undo the hash table addition, and proceeds to free the new rule, thus leaving a dangling pointer in the hash table.

Fix by removing fnew from the mask's hash table before it is freed.

Fixes: 35cc3cefc4de ("net/sched: cls_flower: Reject duplicated rules also under skip_sw")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* net/sched: cls_flower: allocate mask dynamically in fl_change() | Ivan Vecera | 2019-01-17 | 1 | -5/+14
Recent changes (especially 05cd271fd61a ("cls_flower: Support multiple masks per priority")) in the fl_flow_mask structure grew it: its current size, e.g. on x86_64 with defconfig, is 760 bytes, and more than 1024 bytes with some debug options enabled. Prior to the mentioned commit its size was 176 bytes (using defconfig on x86_64). Given this, it's reasonable to allocate this structure dynamically in fl_change() to reduce the stack usage.

v2:
- use kzalloc() instead of kcalloc()

Fixes: 05cd271fd61a ("cls_flower: Support multiple masks per priority")
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Paul Blakey <paulb@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* net_sched: refetch skb protocol for each filter | Cong Wang | 2019-01-16 | 1 | -2/+1
Martin reported that a set of filters don't work after changing from reclassify to continue. Looking into the code, it looks like the skb protocol is not always fetched for each iteration of the filters. But, as demonstrated by Martin, TC actions could modify skb->protocol, for example act_vlan; this means we have to refetch the skb protocol in each iteration, rather than using the one we fetch at the beginning of the loop.

This bug is _not_ introduced by commit 3b3ae880266d ("net: sched: consolidate tc_classify{,_compat}"); technically, if act_vlan is the only action that modifies the skb protocol, then it is commit c7e2b9689ef8 ("sched: introduce vlan action") which introduced this bug.

Reported-by: Martin Olsson <martin.olsson+netdev@sentorsecurity.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
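A standalone sketch of the loop-shape change described above (illustrative types, not the tcf_classify() source): the protocol must be re-read from the skb on every iteration, because an action run by one filter may have rewritten it.

    #include <stdio.h>

    struct skb { unsigned short protocol; };
    struct filter {
        unsigned short protocol;               /* protocol this filter matches on */
        int (*classify)(struct skb *);         /* may change skb->protocol (e.g. act_vlan) */
        struct filter *next;
    };

    #define TC_ACT_UNSPEC (-1)

    /* Re-fetch skb->protocol on each iteration instead of caching it once
     * before the loop, so filters after a protocol-rewriting action still match. */
    static int classify_model(struct skb *skb, struct filter *head)
    {
        for (struct filter *f = head; f; f = f->next) {
            if (skb->protocol != f->protocol)  /* read inside the loop */
                continue;
            int verdict = f->classify(skb);
            if (verdict != TC_ACT_UNSPEC)
                return verdict;
        }
        return TC_ACT_UNSPEC;
    }

    static int pop_vlan(struct skb *skb) { skb->protocol = 0x0800; return TC_ACT_UNSPEC; }
    static int drop_ip(struct skb *skb)  { (void)skb; return 2; /* think TC_ACT_SHOT */ }

    int main(void)
    {
        struct filter f2 = { 0x0800, drop_ip, NULL };   /* matches IPv4 */
        struct filter f1 = { 0x8100, pop_vlan, &f2 };   /* matches 802.1Q, rewrites proto */
        struct skb skb = { .protocol = 0x8100 };

        printf("verdict = %d\n", classify_model(&skb, &f1));  /* 2: second filter matches */
        return 0;
    }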
* net/sched: act_tunnel_key: fix memory leak in case of action replace | Davide Caratti | 2019-01-15 | 1 | -8/+11
Running the following TDC test cases:

    7afc - Replace tunnel_key set action with all parameters
    364d - Replace tunnel_key set action with all parameters and cookie

it's possible to trigger kmemleak warnings like:

    unreferenced object 0xffff94797127ab40 (size 192):
      comm "tc", pid 3248, jiffies 4300565293 (age 1006.862s)
      hex dump (first 32 bytes):
        00 00 00 00 00 00 00 00 c0 93 f9 8a ff ff ff ff  ................
        41 84 ee 89 ff ff ff ff 00 00 00 00 00 00 00 00  A...............
      backtrace:
        [<000000001e85b61c>] tunnel_key_init+0x31d/0x820 [act_tunnel_key]
        [<000000007f3f6ee7>] tcf_action_init_1+0x384/0x4c0
        [<00000000e89e3ded>] tcf_action_init+0x12b/0x1a0
        [<00000000c1c8c0f8>] tcf_action_add+0x73/0x170
        [<0000000095a9fc28>] tc_ctl_action+0x122/0x160
        [<000000004bebeac5>] rtnetlink_rcv_msg+0x263/0x2d0
        [<000000009fd862dd>] netlink_rcv_skb+0x4a/0x110
        [<00000000b55199e7>] netlink_unicast+0x1a0/0x250
        [<000000004996cd21>] netlink_sendmsg+0x2c1/0x3c0
        [<000000004d6a94b4>] sock_sendmsg+0x36/0x40
        [<000000005d9f0208>] ___sys_sendmsg+0x280/0x2f0
        [<00000000dec19023>] __sys_sendmsg+0x5e/0xa0
        [<000000004b82ac81>] do_syscall_64+0x5b/0x180
        [<00000000a0f1209a>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [<000000002926b2ab>] 0xffffffffffffffff

When the tunnel_key action is replaced, the kernel forgets to release the dst metadata: ensure they are released by tunnel_key_init(), the same way it's done in tunnel_key_release().

Fixes: d0f6dd8a914f4 ("net/sched: Introduce act_tunnel_key")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* sch_cake: Correctly update parent qlen when splitting GSO packets | Toke Høiland-Jørgensen | 2019-01-15 | 1 | -2/+3
To ensure parent qdiscs have the same notion of the number of enqueued packets even after splitting a GSO packet, update the qdisc tree with the number of packets that was added due to the split.

Reported-by: Pete Heist <pete@heistp.net>
Tested-by: Pete Heist <pete@heistp.net>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* sched: Fix detection of empty queues in child qdiscs | Toke Høiland-Jørgensen | 2019-01-15 | 3 | -3/+9
Several qdiscs check on enqueue whether the packet was enqueued to a class with an empty queue, in which case the class is activated. This is done by checking if the qlen is exactly 1 after enqueue. However, if GSO splitting is enabled in the child qdisc, a single packet can result in a qlen longer than 1. This means the activation check fails, leading to a stalled queue.

Fix this by checking if the queue is empty *before* enqueue, and running the activation logic if this was the case.

Reported-by: Pete Heist <pete@heistp.net>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
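A minimal standalone model of the activation check described above (illustrative names): sample the queue length before handing the packet to the child, since a GSO split can make qlen jump from 0 to N in one enqueue.

    #include <stdbool.h>
    #include <stdio.h>

    struct child_qdisc { int qlen; };

    /* a GSO packet may be split into several segments by the child */
    static void child_enqueue(struct child_qdisc *q, int segs) { q->qlen += segs; }

    static void parent_enqueue(struct child_qdisc *child, int segs, bool *activated)
    {
        bool first = child->qlen == 0;   /* check *before* enqueue ... */

        child_enqueue(child, segs);

        if (first)                       /* ... not "qlen == 1" afterwards */
            *activated = true;
    }

    int main(void)
    {
        struct child_qdisc child = { 0 };
        bool activated = false;

        parent_enqueue(&child, 3, &activated);   /* one GSO skb, three segments */
        printf("qlen=%d activated=%d\n", child.qlen, activated);   /* 3, 1 */
        return 0;
    }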
* sched: Avoid dereferencing skb pointer after child enqueue | Toke Høiland-Jørgensen | 2019-01-15 | 8 | -16/+23
Parent qdiscs may dereference the pointer to the enqueued skb after enqueue. However, both CAKE and TBF call consume_skb() on the original skb when splitting GSO packets, leading to a potential use-after-free in the parent.

Fix this by avoiding dereferencing the skb pointer after enqueueing to the child.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
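A small standalone model of the pattern the fix applies (illustrative names): read everything you need from the skb before the child enqueue, because the child may free the original skb while splitting it.

    #include <stdlib.h>
    #include <stdio.h>

    struct skb { unsigned int len; };
    struct parent_stats { unsigned long bytes; };

    /* the child may consume (free) the skb, e.g. when splitting a GSO packet */
    static int child_enqueue(struct skb *skb) { free(skb); return 0; }

    static int parent_enqueue(struct skb *skb, struct parent_stats *st)
    {
        unsigned int len = skb->len;   /* cache before the skb can be freed */
        int ret = child_enqueue(skb);

        st->bytes += len;              /* safe: no use-after-free of skb->len */
        return ret;
    }

    int main(void)
    {
        struct parent_stats st = { 0 };
        struct skb *skb = malloc(sizeof(*skb));

        skb->len = 1500;
        parent_enqueue(skb, &st);
        printf("bytes accounted: %lu\n", st.bytes);
        return 0;
    }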
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next | Linus Torvalds | 2018-12-27 | 16 | -483/+1074
Pull networking updates from David Miller:

 1) New ipset extensions for matching on destination MAC addresses, from Stefano Brivio.
 2) Add ipv4 ttl and tos, plus ipv6 flow label and hop limit offloads to nfp driver. From Stefano Brivio.
 3) Implement GRO for plain UDP sockets, from Paolo Abeni.
 4) Lots of work from Michał Mirosław to eliminate the VLAN_TAG_PRESENT bit so that we could support the entire vlan_tci value.
 5) Rework the IPSEC policy lookups to better optimize more usecases, from Florian Westphal.
 6) Infrastructure changes eliminating direct manipulation of SKB lists wherever possible, and to always use the appropriate SKB list helpers. This work is still ongoing...
 7) Lots of PHY driver and state machine improvements and simplifications, from Heiner Kallweit.
 8) Various TSO deferral refinements, from Eric Dumazet.
 9) Add ntuple filter support to aquantia driver, from Dmitry Bogdanov.
10) Batch dropping of XDP packets in tuntap, from Jason Wang.
11) Lots of cleanups and improvements to the r8169 driver from Heiner Kallweit, including support for ->xmit_more. This driver has been getting some much needed love since he started working on it.
12) Lots of new forwarding selftests from Petr Machata.
13) Enable VXLAN learning in mlxsw driver, from Ido Schimmel.
14) Packed ring support for virtio, from Tiwei Bie.
15) Add new Aquantia AQtion USB driver, from Dmitry Bezrukov.
16) Add XDP support to dpaa2-eth driver, from Ioana Ciocoi Radulescu.
17) Implement coalescing on TCP backlog queue, from Eric Dumazet.
18) Implement carrier change in tun driver, from Nicolas Dichtel.
19) Support msg_zerocopy in UDP, from Willem de Bruijn.
20) Significantly improve garbage collection of neighbor objects when the table has many PERMANENT entries, from David Ahern.
21) Remove egdev usage from nfp and mlx5, and remove the facility completely from the tree as it no longer has any users. From Oz Shlomo and others.
22) Add a NETDEV_PRE_CHANGEADDR so that drivers can veto the change and therefore abort the operation before the commit phase (which is the NETDEV_CHANGEADDR event). From Petr Machata.
23) Add indirect call wrappers to avoid retpoline overhead, and use them in the GRO code paths. From Paolo Abeni.
24) Add support for netlink FDB get operations, from Roopa Prabhu.
25) Support bloom filter in mlxsw driver, from Nir Dotan.
26) Add SKB extension infrastructure. This consolidates the handling of the auxiliary SKB data used by IPSEC and bridge netfilter, and is designed to support the needs of MPTCP which could be integrated in the future.
27) Lots of XDP TX optimizations in mlx5 from Tariq Toukan.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1845 commits)
  net: dccp: fix kernel crash on module load
  drivers/net: appletalk/cops: remove redundant if statement and mask
  bnx2x: Fix NULL pointer dereference in bnx2x_del_all_vlans() on some hw
  net/net_namespace: Check the return value of register_pernet_subsys()
  net/netlink_compat: Fix a missing check of nla_parse_nested
  ieee802154: lowpan_header_create check must check daddr
  net/mlx4_core: drop useless LIST_HEAD
  mlxsw: spectrum: drop useless LIST_HEAD
  net/mlx5e: drop useless LIST_HEAD
  iptunnel: Set tun_flags in the iptunnel_metadata_reply from src
  net/mlx5e: fix semicolon.cocci warnings
  staging: octeon: fix build failure with XFRM enabled
  net: Revert recent Spectre-v1 patches.
  can: af_can: Fix Spectre v1 vulnerability
  packet: validate address length if non-zero
  nfc: af_nfc: Fix Spectre v1 vulnerability
  phonet: af_phonet: Fix Spectre v1 vulnerability
  net: core: Fix Spectre v1 vulnerability
  net: minor cleanup in skb_ext_add()
  net: drop the unused helper skb_ext_get()
  ...
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | David S. Miller | 2018-12-20 | 1 | -4/+3
Lots of conflicts, but happily all cases of overlapping changes, parallel adds, things of that nature. Thanks to Stephen Rothwell, Saeed Mahameed, and others for their guidance in these resolutions.

Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: sched: simplify the qdisc_leaf code | Tonghao Zhang | 2018-12-15 | 1 | -3/+1
Except for being returned, the variable 'leaf' is not used in qdisc_leaf(). For simplicity, remove it.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net_sched: fold tcf_block_cb_call() into tc_setup_cb_call() | Cong Wang | 2018-12-14 | 5 | -45/+33
After commit 69bd48404f25 ("net/sched: Remove egdev mechanism"), tc_setup_cb_call() is nearly identical to tcf_block_cb_call(), so we can just fold tcf_block_cb_call() into tc_setup_cb_call() and remove its unused parameter 'exts'.

Fixes: 69bd48404f25 ("net/sched: Remove egdev mechanism")
Cc: Oz Shlomo <ozsh@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net/sched: Remove egdev mechanism | Oz Shlomo | 2018-12-10 | 2 | -267/+1
The egdev mechanism was replaced by the TC indirect block notifications platform.

Signed-off-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Cc: John Hurley <john.hurley@netronome.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | David S. Miller | 2018-12-09 | 3 | -25/+25
Several conflicts, seemingly all over the place.

I used Stephen Rothwell's sample resolutions for many of these, if not just to double check my own work, so definitely the credit largely goes to him.

The NFP conflict consisted of a bug fix (moving operations past the rhashtable operation) while changing the initial argument in the function call in the moved code.

The net/dsa/master.c conflict had to do with a bug fix intermixing of making dsa_master_set_mtu() static with the fixing of the tagging attribute location.

cls_flower had a conflict because the dup reject fix from Or overlapped with the addition of port range classification.

__set_phy_supported()'s conflict was relatively easy to resolve because Andrew fixed it in both trees, so it was just a matter of taking the net-next copy. Or at least I think it was :-)

Joe Stringer's fix to the handling of netns id 0 in bpf_sk_lookup() intermixed with changes on how the sdif and caller_net are calculated in these code paths in net-next.

The remaining BPF conflicts were largely about the addition of the __bpf_md_ptr stuff in 'net' overlapping with adjustments and additions to the relevant data structure where the MD pointer macros are used.

Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: netem: use a list in addition to rbtree | Peter Oskolkov | 2018-12-05 | 1 | -20/+69
When testing high-bandwidth TCP streams with large windows, high latency, and low jitter, netem consumes a lot of CPU cycles doing rbtree rebalancing.

This patch uses a linear list/queue in addition to the rbtree: if an incoming packet is past the tail of the linear queue, it is added there, otherwise it is inserted into the rbtree.

Without this patch, perf shows netem_enqueue, netem_dequeue, and rb_* functions among the top offenders. With this patch, only netem_enqueue is noticeable if jitter is low/absent.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
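A standalone sketch of the fast-path idea (toy types; the slow path below uses a plain sorted insert where the kernel uses an rbtree): packets whose departure time is not before the current tail are appended in O(1), and only reordered packets pay for the ordered insert.

    #include <stdlib.h>
    #include <stdio.h>

    struct pkt { long long time_to_send; struct pkt *next; };

    struct tfifo {
        struct pkt *head, *tail;   /* linear queue, ordered by construction */
    };

    /* stand-in for the rbtree slow path: sorted insert into the list */
    static void slow_insert(struct tfifo *q, struct pkt *p)
    {
        struct pkt **pp = &q->head;

        while (*pp && (*pp)->time_to_send <= p->time_to_send)
            pp = &(*pp)->next;
        p->next = *pp;
        *pp = p;
        if (!p->next)
            q->tail = p;
    }

    static void netem_insert_model(struct tfifo *q, struct pkt *p)
    {
        if (!q->tail || p->time_to_send >= q->tail->time_to_send) {
            /* fast path: already past the tail, append in O(1) */
            p->next = NULL;
            if (q->tail)
                q->tail->next = p;
            else
                q->head = p;
            q->tail = p;
        } else {
            slow_insert(q, p);   /* reordered (jittered) packet */
        }
    }

    int main(void)
    {
        struct tfifo q = { 0 };
        long long times[] = { 10, 20, 15, 30 };

        for (int i = 0; i < 4; i++) {
            struct pkt *p = calloc(1, sizeof(*p));
            p->time_to_send = times[i];
            netem_insert_model(&q, p);
        }
        for (struct pkt *p = q.head; p; p = p->next)
            printf("%lld ", p->time_to_send);   /* 10 15 20 30 */
        printf("\n");
        return 0;
    }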
| * | | net/sched: act_tunnel_key: Don't dump dst port if it wasn't set | Adi Nissim | 2018-12-04 | 1 | -1/+3
It's possible to set a tunnel without a destination port. However, on dump(), a zero dst port is returned to user space even if it was not set; fix that.

Note that so far it wasn't required, because key-less tunnels were not supported and the UDP tunnels do require a destination port.

Signed-off-by: Adi Nissim <adin@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net/sched: act_tunnel_key: Allow key-less tunnels | Adi Nissim | 2018-12-04 | 1 | -10/+11
Allow setting a tunnel without a tunnel key. This is required for tunneling protocols, such as GRE, that define the key as an optional field.

Signed-off-by: Adi Nissim <adin@mellanox.com>
Acked-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | David S. Miller | 2018-11-24 | 1 | -14/+22
| * | | | net_sched: sch_fq: avoid calling ktime_get_ns() if not needed | Eric Dumazet | 2018-11-20 | 1 | -1/+6
There are two cases where we can avoid calling ktime_get_ns():

1) Queue is empty.
2) Internal queue is not empty.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | net: sched: cls_u32: add res to offload information | Jakub Kicinski | 2018-11-19 | 1 | -0/+2
In case of egress offloads the class/flowid assigned by the filter may be very important for offloaded Qdisc selection. Provide this info to drivers.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | net: sched: gred: support reporting stats from offloads | Jakub Kicinski | 2018-11-19 | 1 | -0/+47
Allow drivers which offload GRED to report back statistics. Since a lot of the GRED stats are fairly ad hoc in nature, pass to drivers the standard struct gnet_stats_basic/gnet_stats_queue pairs, and untangle the values in the core.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | net: sched: gred: add basic Qdisc offload | Jakub Kicinski | 2018-11-19 | 1 | -0/+47
Add basic offload for the GRED Qdisc. Inform the drivers any time the Qdisc or virtual queue configuration changes.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | David S. Miller | 2018-11-19 | 2 | -13/+21
| * | | | | net: sched: gred: allow manipulating per-DP RED flags | Jakub Kicinski | 2018-11-16 | 1 | -1/+143
Allow users to set and dump RED flags (ECN enabled and harddrop) on a per-virtual-queue basis. Validation of attributes is split from changes to make sure we won't have to undo previous operations when we find out the configuration is invalid.

The objective is to allow changing per-Qdisc parameters without overwriting the per-vq configured flags.

Old user space will not pass the TCA_GRED_VQ_FLAGS attribute and per-Qdisc flags will always get propagated to the virtual queues.

New user space which wants to make use of per-vq flags should set per-Qdisc flags to 0 and then configure per-vq flags as it sees fit. Once per-vq flags are set, per-Qdisc flags can't be changed to non-zero. Vice versa - if the per-Qdisc flags are non-zero, the TCA_GRED_VQ_FLAGS attribute has to either be omitted or set to the same value as the per-Qdisc flags.

Update per-Qdisc parameters:

    per-Qdisc | per-VQ | result
            0 |      0 | all vq flags updated
            0 |  non-0 | error (vq flags in use)
        non-0 |      0 | -- impossible --
        non-0 |  non-0 | all vq flags updated

Update per-VQ state (flags parameter not specified):

    no change to flags

Update per-VQ state (flags parameter set):

    per-Qdisc | per-VQ | result
            0 |    any | per-vq flags updated
        non-0 |      0 | -- impossible --
        non-0 |  non-0 | error (per-Qdisc flags in use)

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | net: sched: gred: store red flags per virtual queue | Jakub Kicinski | 2018-11-16 | 1 | -6/+18
Right now ECN marking and HARD drop (the common RED flags) can only be configured for the entire Qdisc. In preparation for per-vq flags, store the values in the virtual queue structure. Setting per-vq flags will only be allowed when no flags are set for the entire Qdisc. For the new flags we will also make sure undefined bits are 0.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | net: sched: gred: provide a better structured dump and expose stats | Jakub Kicinski | 2018-11-16 | 1 | -1/+52
Currently all GRED's virtual queue data is dumped in a single array in a single attribute. This makes it pretty much impossible to add new fields. In order to expose more detailed stats, add a new set of attributes. We can now expose the 64 bit value of bytesin and all the mark stats which were not part of the original design.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | net: sched: gred: store bytesin as a 64 bit value | Jakub Kicinski | 2018-11-16 | 1 | -1/+1
32 bit counters for bytes are not really going to last long in the modern world. Make sch_gred count bytes on a 64 bit counter. It will still get truncated during dump, but a follow-up patch will add a set of new stat dump attributes.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | net: sched: gred: use extack to provide more details on configuration errors | Jakub Kicinski | 2018-11-16 | 1 | -11/+33
Add extack messages to -EINVAL errors, to help users identify their mistakes.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | net: sched: gred: pass extack to nla_parse_nested() | Jakub Kicinski | 2018-11-16 | 1 | -2/+2
In case netlink wants to provide a parsing error, pass extack to nla_parse_nested().

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | net: sched: gred: separate error and non-error path in gred_change() | Jakub Kicinski | 2018-11-16 | 1 | -6/+6
We will soon want to add more code to the non-error path, so separate it from the error handling flow.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | etf: Drop all expired packets | Jesus Sanchez-Palencia | 2018-11-16 | 1 | -15/+21
Currently on dequeue() ETF only drops the first expired packet, which causes a problem if the next packet is already expired. When this happens, the watchdog will be configured with a time in the past, fire straight away and the packet will finally be dropped once the dequeue() function of the qdisc is called again.

We can save quite a few cycles and improve the overall behavior of the qdisc if we drop all expired packets if the next packet is expired. This should allow ETF to recover faster from bad situations. But packet drops are still a very serious warning that the requirements imposed on the system aren't reasonable.

This was inspired by how the implementation of hrtimers uses the rb_tree inside the kernel.

Signed-off-by: Jesus Sanchez-Palencia <jesus.s.palencia@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
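A standalone model of the dequeue behaviour described above (toy queue, illustrative names): instead of dropping a single expired packet per dequeue, keep dropping while the head of the time-sorted queue is already past its deadline.

    #include <stdio.h>

    #define NPKT 4

    /* departure times of a toy time-sorted queue, head first */
    static long long txtime[NPKT] = { 90, 95, 120, 150 };
    static int head;   /* index of the current head packet */

    /* drop every packet whose deadline already passed, then hand out the head */
    static int etf_dequeue_model(long long now)
    {
        while (head < NPKT && txtime[head] < now) {
            printf("drop packet with txtime %lld\n", txtime[head]);
            head++;                        /* report the drop, keep going */
        }
        return head < NPKT ? head : -1;    /* next packet still worth sending */
    }

    int main(void)
    {
        int idx = etf_dequeue_model(100);  /* "now" is 100: 90 and 95 expired */

        if (idx >= 0)
            printf("send packet with txtime %lld\n", txtime[idx]);   /* 120 */
        return 0;
    }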
| * | | | | etf: Split timersortedlist_erase() | Jesus Sanchez-Palencia | 2018-11-16 | 1 | -15/+29
This is just a refactor that will simplify the implementation of the next patch in this series, which will drop all expired packets on the dequeue flow.

Signed-off-by: Jesus Sanchez-Palencia <jesus.s.palencia@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | etf: Use cached rb_root | Jesus Sanchez-Palencia | 2018-11-16 | 1 | -9/+12
ETF's peek() operation is heavily used, so use an rb_root_cached instead and leverage rb_first_cached(), which will run in O(1) instead of O(log n).

Even if on timesortedlist_clear() we could be using rb_erase(), we choose to use rb_erase_cached(), because if in the future we allow runtime changes to ETF parameters and need to do a '_clear()', using the non-cached variant might cause some hard-to-debug issues.

Signed-off-by: Jesus Sanchez-Palencia <jesus.s.palencia@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
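A minimal standalone illustration of the caching idea (a toy sorted list standing in for the rbtree): keeping a pointer to the leftmost (earliest) node makes the hot peek() an O(1) load instead of a walk down the tree.

    #include <stdlib.h>
    #include <stdio.h>

    struct node { long long txtime; struct node *next; };

    struct timesortedlist {
        struct node *root;       /* sorted list as a stand-in for the rbtree */
        struct node *leftmost;   /* cached minimum, what rb_root_cached keeps */
    };

    static void insert(struct timesortedlist *l, long long txtime)
    {
        struct node *n = malloc(sizeof(*n));
        struct node **pp = &l->root;

        n->txtime = txtime;
        while (*pp && (*pp)->txtime <= txtime)
            pp = &(*pp)->next;
        n->next = *pp;
        *pp = n;
        if (!l->leftmost || txtime < l->leftmost->txtime)
            l->leftmost = n;     /* maintain the cache on insert */
    }

    /* the hot path: peek the next packet to send in O(1) */
    static struct node *peek(struct timesortedlist *l) { return l->leftmost; }

    int main(void)
    {
        struct timesortedlist l = { 0 };

        insert(&l, 300);
        insert(&l, 100);
        insert(&l, 200);
        printf("next txtime: %lld\n", peek(&l)->txtime);   /* 100 */
        return 0;
    }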
| * | | | | etf: Cancel timer if there are no pending skbs | Jesus Sanchez-Palencia | 2018-11-16 | 1 | -1/+3
There is no point in firing the qdisc watchdog if there are no future skbs pending in the queue and the watchdog had been set previously.

Signed-off-by: Jesus Sanchez-Palencia <jesus.s.palencia@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | net: sched: cls_flower: Classify packets using port ranges | Amritha Nambiar | 2018-11-15 | 1 | -6/+149
Added support in tc flower for filtering based on port ranges.

Example:

1. Match on a port range:
-------------------------

    $ tc filter add dev enp4s0 protocol ip parent ffff:\
      prio 1 flower ip_proto tcp dst_port range 20-30 skip_hw\
      action drop

    $ tc -s filter show dev enp4s0 parent ffff:
    filter protocol ip pref 1 flower chain 0
    filter protocol ip pref 1 flower chain 0 handle 0x1
      eth_type ipv4
      ip_proto tcp
      dst_port range 20-30
      skip_hw
      not_in_hw
            action order 1: gact action drop
             random type none pass val 0
             index 1 ref 1 bind 1 installed 85 sec used 3 sec
            Action statistics:
            Sent 460 bytes 10 pkt (dropped 10, overlimits 0 requeues 0)
            backlog 0b 0p requeues 0

2. Match on IP address and port range:
--------------------------------------

    $ tc filter add dev enp4s0 protocol ip parent ffff:\
      prio 1 flower dst_ip 192.168.1.1 ip_proto tcp dst_port range 100-200\
      skip_hw action drop

    $ tc -s filter show dev enp4s0 parent ffff:
    filter protocol ip pref 1 flower chain 0 handle 0x2
      eth_type ipv4
      ip_proto tcp
      dst_ip 192.168.1.1
      dst_port range 100-200
      skip_hw
      not_in_hw
            action order 1: gact action drop
             random type none pass val 0
             index 2 ref 1 bind 1 installed 58 sec used 2 sec
            Action statistics:
            Sent 920 bytes 20 pkt (dropped 20, overlimits 0 requeues 0)
            backlog 0b 0p requeues 0

v4:
1. Added condition before setting port key.
2. Organized setting and dumping port range keys into functions and added validation of input range.

v3:
1. Moved new fields in UAPI enum to the end of enum.
2. Removed couple of empty lines.

v2:
Addressed Jiri's comments:
1. Added separate functions for dst and src comparisons.
2. Removed endpoint enum.
3. Added new bit TCA_FLOWER_FLAGS_RANGE to decide normal/range lookup.
4. Cleaned up fl_lookup function.

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | net: sched: red: notify drivers about RED's limit parameter | Jakub Kicinski | 2018-11-14 | 1 | -0/+1
RED qdisc's limit parameter changes the behaviour of the qdisc; for instance, if it's set to 0, the qdisc will drop all the packets. When a replace operation happens and the parameter is set to non-0, a new fifo qdisc will be instantiated and will replace the old child qdisc, which will be destroyed.

Drivers need to know the parameter, even if they don't impose the actual limit, to be able to reliably reconstruct the Qdisc hierarchy.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | net: sched: mq: offload a graft notification | Jakub Kicinski | 2018-11-14 | 1 | -0/+9
Drivers offloading Qdiscs should have reasonable certainty the offloaded behaviour matches the SW path. This is impossible if the driver does not know about all Qdiscs or when Qdiscs move and are reused. Send a graft notification from MQ.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | net: sched: red: offload a graft notification | Jakub Kicinski | 2018-11-14 | 1 | -0/+17
Drivers offloading Qdiscs should have reasonable certainty the offloaded behaviour matches the SW path. This is impossible if the driver does not know about all Qdiscs or when Qdiscs move and are reused. Send a graft notification from RED. The drivers are expected to simply stop offloading the Qdisc if a non-standard child is ever grafted onto it.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | net: sched: provide notification for graft on root | Jakub Kicinski | 2018-11-14 | 1 | -0/+17
Drivers are currently not notified when a Qdisc is grafted as root. This requires special-casing Qdiscs added with parent = TC_H_ROOT in the driver. Also there is no notification sent to the driver when an existing Qdisc is grafted as root.

Add this very simple notification; drivers should now be able to track their Qdisc tree fully.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net | David S. Miller | 2018-11-11 | 3 | -11/+15
| * | | | | | net_sched: sch_fq: add dctcp-like marking | Eric Dumazet | 2018-11-11 | 1 | -0/+21
Similar to 80ba92fa1a92 ("codel: add ce_threshold attribute")

After EDT adoption, it became easier to implement DCTCP-like CE marking. In many cases, queues are not building in the network fabric but on the hosts themselves. If packets leaving fq missed their Earliest Departure Time by XXX usec, we mark them with ECN CE. This gives a feedback (after one RTT) to the sender to slow down and find a better operating mode.

Example:

    tc qd replace dev eth0 root fq ce_threshold 2.5ms

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
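A tiny standalone model of the marking rule described above (illustrative names, not the fq source): at dequeue, a packet that missed its Earliest Departure Time by more than ce_threshold gets its ECN CE mark set.

    #include <stdbool.h>
    #include <stdio.h>

    struct pkt {
        long long edt_ns;   /* Earliest Departure Time the pacer assigned */
        bool ce_marked;
    };

    /* mark CE when the packet leaves later than edt + ce_threshold */
    static void fq_dequeue_mark_model(struct pkt *p, long long now_ns,
                                      long long ce_threshold_ns)
    {
        if (now_ns - p->edt_ns > ce_threshold_ns)
            p->ce_marked = true;
    }

    int main(void)
    {
        struct pkt p = { .edt_ns = 1000000, .ce_marked = false };

        /* 2.5 ms threshold, packet leaves 3 ms late -> congestion signal */
        fq_dequeue_mark_model(&p, 4000000, 2500000);
        printf("ce_marked = %d\n", p.ce_marked);
        return 0;
    }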