Diffstat (limited to 'Documentation/networking')
-rw-r--r-- Documentation/networking/af_xdp.rst | 287
-rw-r--r-- Documentation/networking/caif/caif.rst (renamed from Documentation/networking/caif/README) | 88
-rw-r--r-- Documentation/networking/device_drivers/aquantia/atlantic.txt | 46
-rw-r--r-- Documentation/networking/device_drivers/freescale/dpaa.txt | 12
-rw-r--r-- Documentation/networking/device_drivers/freescale/dpaa2/index.rst | 1
-rw-r--r-- Documentation/networking/device_drivers/freescale/dpaa2/mac-phy-support.rst | 191
-rw-r--r-- Documentation/networking/device_drivers/index.rst | 6
-rw-r--r-- Documentation/networking/device_drivers/intel/e100.rst | 14
-rw-r--r-- Documentation/networking/device_drivers/intel/e1000.rst | 12
-rw-r--r-- Documentation/networking/device_drivers/intel/e1000e.rst | 14
-rw-r--r-- Documentation/networking/device_drivers/intel/fm10k.rst | 10
-rw-r--r-- Documentation/networking/device_drivers/intel/i40e.rst | 8
-rw-r--r-- Documentation/networking/device_drivers/intel/iavf.rst | 123
-rw-r--r-- Documentation/networking/device_drivers/intel/ice.rst | 6
-rw-r--r-- Documentation/networking/device_drivers/intel/igb.rst | 12
-rw-r--r-- Documentation/networking/device_drivers/intel/igbvf.rst | 6
-rw-r--r-- Documentation/networking/device_drivers/intel/ixgbe.rst | 10
-rw-r--r-- Documentation/networking/device_drivers/intel/ixgbevf.rst | 6
-rw-r--r-- Documentation/networking/device_drivers/marvell/octeontx2.rst | 159
-rw-r--r-- Documentation/networking/device_drivers/mellanox/mlx5.rst | 133
-rw-r--r-- Documentation/networking/device_drivers/microsoft/netvsc.txt | 21
-rw-r--r-- Documentation/networking/device_drivers/netronome/nfp.rst | 249
-rw-r--r-- Documentation/networking/device_drivers/pensando/ionic.rst | 45
-rw-r--r-- Documentation/networking/device_drivers/stmicro/stmmac.rst | 697
-rw-r--r-- Documentation/networking/device_drivers/stmicro/stmmac.txt | 401
-rw-r--r-- Documentation/networking/device_drivers/ti/cpsw_switchdev.txt | 209
-rw-r--r-- Documentation/networking/devlink-health.txt | 86
-rw-r--r-- Documentation/networking/devlink-info-versions.rst | 48
-rw-r--r-- Documentation/networking/devlink-params-bnxt.txt | 18
-rw-r--r-- Documentation/networking/devlink-params-mlxsw.txt | 10
-rw-r--r-- Documentation/networking/devlink-params.txt | 51
-rw-r--r-- Documentation/networking/devlink/bnxt.rst | 74
-rw-r--r-- Documentation/networking/devlink/devlink-dpipe.rst | 252
-rw-r--r-- Documentation/networking/devlink/devlink-health.rst | 114
-rw-r--r-- Documentation/networking/devlink/devlink-info.rst | 100
-rw-r--r-- Documentation/networking/devlink/devlink-params.rst | 108
-rw-r--r-- Documentation/networking/devlink/devlink-region.rst | 60
-rw-r--r-- Documentation/networking/devlink/devlink-resource.rst | 62
-rw-r--r-- Documentation/networking/devlink/devlink-trap.rst | 289
-rw-r--r-- Documentation/networking/devlink/index.rst | 42
-rw-r--r-- Documentation/networking/devlink/ionic.rst | 29
-rw-r--r-- Documentation/networking/devlink/mlx4.rst | 56
-rw-r--r-- Documentation/networking/devlink/mlx5.rst | 59
-rw-r--r-- Documentation/networking/devlink/mlxsw.rst | 81
-rw-r--r-- Documentation/networking/devlink/mv88e6xxx.rst | 28
-rw-r--r-- Documentation/networking/devlink/netdevsim.rst | 72
-rw-r--r-- Documentation/networking/devlink/nfp.rst | 65
-rw-r--r-- Documentation/networking/devlink/qed.rst | 26
-rw-r--r-- Documentation/networking/devlink/ti-cpsw-switch.rst | 31
-rw-r--r-- Documentation/networking/dsa/sja1105.rst | 84
-rw-r--r-- Documentation/networking/ethtool-netlink.rst | 618
-rw-r--r-- Documentation/networking/filter.txt | 8
-rw-r--r-- Documentation/networking/index.rst | 7
-rw-r--r-- Documentation/networking/ip-sysctl.txt | 66
-rw-r--r-- Documentation/networking/j1939.rst | 422
-rw-r--r-- Documentation/networking/mac80211_hwsim/mac80211_hwsim.rst (renamed from Documentation/networking/mac80211_hwsim/README) | 28
-rw-r--r-- Documentation/networking/net_dim.txt | 36
-rw-r--r-- Documentation/networking/netdev-FAQ.rst | 4
-rw-r--r-- Documentation/networking/nf_flowtable.txt | 2
-rw-r--r-- Documentation/networking/nfc.rst (renamed from Documentation/networking/nfc.txt) | 74
-rw-r--r-- Documentation/networking/phy.rst | 23
-rw-r--r-- Documentation/networking/ppp_generic.txt | 2
-rw-r--r-- Documentation/networking/sfp-phylink.rst | 6
-rw-r--r-- Documentation/networking/tls-offload.rst | 4
-rw-r--r-- Documentation/networking/tls.rst | 26
65 files changed, 5081 insertions, 856 deletions
diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
index eeedc2e826aa..5bc55a4e3bce 100644
--- a/Documentation/networking/af_xdp.rst
+++ b/Documentation/networking/af_xdp.rst
@@ -40,13 +40,13 @@ allocates memory for this UMEM using whatever means it feels is most
appropriate (malloc, mmap, huge pages, etc). This memory area is then
registered with the kernel using the new setsockopt XDP_UMEM_REG. The
UMEM also has two rings: the FILL ring and the COMPLETION ring. The
-fill ring is used by the application to send down addr for the kernel
+FILL ring is used by the application to send down addr for the kernel
to fill in with RX packet data. References to these frames will then
appear in the RX ring once each packet has been received. The
-completion ring, on the other hand, contains frame addr that the
+COMPLETION ring, on the other hand, contains frame addr that the
kernel has transmitted completely and can now be used again by user
space, for either TX or RX. Thus, the frame addrs appearing in the
-completion ring are addrs that were previously transmitted using the
+COMPLETION ring are addrs that were previously transmitted using the
TX ring. In summary, the RX and FILL rings are used for the RX path
and the TX and COMPLETION rings are used for the TX path.
@@ -91,11 +91,16 @@ Concepts
========
In order to use an AF_XDP socket, a number of associated objects need
-to be setup.
+to be set up. These objects and their options are explained in the
+following sections.
-Jonathan Corbet has also written an excellent article on LWN,
-"Accelerating networking with AF_XDP". It can be found at
-https://lwn.net/Articles/750845/.
+For an overview of how AF_XDP works, you can also take a look at the
+Linux Plumbers paper from 2018 on the subject:
+http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf. Do
+NOT consult the paper from 2017 on "AF_PACKET v4", the first attempt
+at AF_XDP. Nearly everything has changed since then. Jonathan Corbet has
+also written an excellent article on LWN, "Accelerating networking
+with AF_XDP". It can be found at https://lwn.net/Articles/750845/.
UMEM
----
@@ -113,22 +118,22 @@ the next socket B can do this by setting the XDP_SHARED_UMEM flag in
struct sockaddr_xdp member sxdp_flags, and passing the file descriptor
of A to struct sockaddr_xdp member sxdp_shared_umem_fd.
-The UMEM has two single-producer/single-consumer rings, that are used
+The UMEM has two single-producer/single-consumer rings that are used
to transfer ownership of UMEM frames between the kernel and the
user-space application.
Rings
-----
-There are a four different kind of rings: Fill, Completion, RX and
+There are four different kinds of rings: FILL, COMPLETION, RX and
TX. All rings are single-producer/single-consumer, so the user-space
application need explicit synchronization of multiple
processes/threads are reading/writing to them.
-The UMEM uses two rings: Fill and Completion. Each socket associated
+The UMEM uses two rings: FILL and COMPLETION. Each socket associated
with the UMEM must have an RX queue, TX queue or both. Say, that there
is a setup with four sockets (all doing TX and RX). Then there will be
-one Fill ring, one Completion ring, four TX rings and four RX rings.
+one FILL ring, one COMPLETION ring, four TX rings and four RX rings.
The rings are head(producer)/tail(consumer) based rings. A producer
writes the data ring at the index pointed out by struct xdp_ring
@@ -146,24 +151,26 @@ The size of the rings need to be of size power of two.
UMEM Fill Ring
~~~~~~~~~~~~~~
-The Fill ring is used to transfer ownership of UMEM frames from
+The FILL ring is used to transfer ownership of UMEM frames from
user-space to kernel-space. The UMEM addrs are passed in the ring. As
an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has
16 chunks and can pass addrs between 0 and 64k.
Frames passed to the kernel are used for the ingress path (RX rings).
-The user application produces UMEM addrs to this ring. Note that the
-kernel will mask the incoming addr. E.g. for a chunk size of 2k, the
-log2(2048) LSB of the addr will be masked off, meaning that 2048, 2050
-and 3000 refers to the same chunk.
+The user application produces UMEM addrs to this ring. Note that, if
+running the application with aligned chunk mode, the kernel will mask
+the incoming addr. E.g. for a chunk size of 2k, the log2(2048) LSB of
+the addr will be masked off, meaning that 2048, 2050 and 3000 refer
+to the same chunk. If the user application is run in the unaligned
+chunks mode, then the incoming addr will be left untouched.
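
The masking in aligned chunk mode can be sketched in plain C as follows. This is an illustration of the arithmetic only, not the kernel's code; the helper name `umem_chunk_base` is hypothetical and it assumes the chunk size is a power of two (as 2k and 4k are).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the start of the chunk that contains addr.
 * Clearing the log2(chunk_size) least significant bits is the masking
 * the text describes for aligned chunk mode.
 */
static inline uint64_t umem_chunk_base(uint64_t addr, uint64_t chunk_size)
{
	/* The mask trick only works for a non-zero power of two. */
	assert(chunk_size != 0 && (chunk_size & (chunk_size - 1)) == 0);

	return addr & ~(chunk_size - 1);
}
```

With a chunk size of 2048, the addrs 2048, 2050 and 3000 all map to chunk base 2048, while 4096 starts the next chunk. In unaligned chunk mode no such masking is applied.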
UMEM Completion Ring
~~~~~~~~~~~~~~~~~~~~
-The Completion Ring is used transfer ownership of UMEM frames from
-kernel-space to user-space. Just like the Fill ring, UMEM indicies are
+The COMPLETION ring is used to transfer ownership of UMEM frames from
+kernel-space to user-space. Just like the FILL ring, UMEM indices are
used.
Frames passed from the kernel to user-space are frames that has been
@@ -179,7 +186,7 @@ The RX ring is the receiving side of a socket. Each entry in the ring
is a struct xdp_desc descriptor. The descriptor contains UMEM offset
(addr) and the length of the data (len).
-If no frames have been passed to kernel via the Fill ring, no
+If no frames have been passed to kernel via the FILL ring, no
descriptors will (or can) appear on the RX ring.
The user application consumes struct xdp_desc descriptors from this
@@ -197,8 +204,24 @@ be relaxed in the future.
The user application produces struct xdp_desc descriptors to this
ring.
+Libbpf
+======
+
+Libbpf is a helper library for eBPF and XDP that makes using these
+technologies a lot simpler. It also contains specific helper functions
+in tools/lib/bpf/xsk.h for facilitating the use of AF_XDP. It
+contains two types of functions: those that can be used to make the
+setup of an AF_XDP socket easier and ones that can be used in the data
+plane to access the rings safely and quickly. To see an example of how
+to use this API, please take a look at the sample application in
+samples/bpf/xdpsock_user.c which uses libbpf for both setup and data
+plane operations.
+
+We recommend that you use this library unless you have become a power
+user. It will make your program a lot simpler.
+
XSKMAP / BPF_MAP_TYPE_XSKMAP
-----------------------------
+============================
On XDP side there is a BPF map type BPF_MAP_TYPE_XSKMAP (XSKMAP) that
is used in conjunction with bpf_redirect_map() to pass the ingress
@@ -214,21 +237,202 @@ queue 17. Only the XDP program executing for eth0 and queue 17 will
successfully pass data to the socket. Please refer to the sample
application (samples/bpf/) in for an example.
+Configuration Flags and Socket Options
+======================================
+
+These are the various configuration flags that can be used to control
+and monitor the behavior of AF_XDP sockets.
+
+XDP_COPY and XDP_ZERO_COPY bind flags
+-------------------------------------
+
+When you bind to a socket, the kernel will first try to use zero-copy
+mode. If zero-copy is not supported, it will fall back on using copy
+mode, i.e. copying all packets out to user space. But if you would
+like to force a certain mode, you can use the following flags. If you
+pass the XDP_COPY flag to the bind call, the kernel will force the
+socket into copy mode. If it cannot use copy mode, the bind call will
+fail with an error. Conversely, the XDP_ZERO_COPY flag will force the
+socket into zero-copy mode or fail.
+
+XDP_SHARED_UMEM bind flag
+-------------------------
+
+This flag enables you to bind multiple sockets to the same UMEM, but
+only if they share the same queue id. In this mode, each socket has
+its own RX and TX rings, but the UMEM (tied to the first socket
+created) only has a single FILL ring and a single COMPLETION
+ring. To use this mode, create the first socket and bind it in the normal
+way. Create a second socket and create an RX and a TX ring, or at
+least one of them, but no FILL or COMPLETION rings as the ones from
+the first socket will be used. In the bind call, set the
+XDP_SHARED_UMEM option and provide the initial socket's fd in the
+sxdp_shared_umem_fd field. You can attach an arbitrary number of extra
+sockets this way.
+
+Which socket will a packet then arrive on? This is decided by the XDP
+program. Put all the sockets in the XSKMAP and just indicate which
+index in the array you would like to send each packet to. A simple
+round-robin example of distributing packets is shown below:
+
+.. code-block:: c
+
+ #include <linux/bpf.h>
+ #include "bpf_helpers.h"
+
+ #define MAX_SOCKS 16
+
+ struct {
+     __uint(type, BPF_MAP_TYPE_XSKMAP);
+     __uint(max_entries, MAX_SOCKS);
+     __uint(key_size, sizeof(int));
+     __uint(value_size, sizeof(int));
+ } xsks_map SEC(".maps");
+
+ static unsigned int rr;
+
+ SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
+ {
+     rr = (rr + 1) & (MAX_SOCKS - 1);
+
+     return bpf_redirect_map(&xsks_map, rr, XDP_DROP);
+ }
+
+Note that since there is only a single set of FILL and COMPLETION
+rings, and they are single-producer, single-consumer rings, you need
+to make sure that multiple processes or threads do not use these rings
+concurrently. There are no synchronization primitives in the
+libbpf code that protect multiple users at this point in time.
+
+Libbpf uses this mode if you create more than one socket tied to the
+same umem. However, note that you need to supply the
+XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD libbpf_flag with the
+xsk_socket__create calls and load your own XDP program as there is no
+built-in one in libbpf that will route the traffic for you.
+
+XDP_USE_NEED_WAKEUP bind flag
+-----------------------------
+
+This option adds support for a new flag called need_wakeup that is
+present in the FILL ring and the TX ring, the rings for which user
+space is a producer. When this option is set in the bind call, the
+need_wakeup flag will be set if the kernel needs to be explicitly
+woken up by a syscall to continue processing packets. If the flag is
+zero, no syscall is needed.
+
+If the flag is set on the FILL ring, the application needs to call
+poll() to be able to continue to receive packets on the RX ring. This
+can happen, for example, when the kernel has detected that there are no
+more buffers on the FILL ring and no buffers left on the RX HW ring of
+the NIC. In this case, interrupts are turned off as the NIC cannot
+receive any packets (as there are no buffers to put them in), and the
+need_wakeup flag is set so that user space can put buffers on the
+FILL ring and then call poll() so that the kernel driver can put these
+buffers on the HW ring and start to receive packets.
+
+If the flag is set for the TX ring, it means that the application
+needs to explicitly notify the kernel to send any packets put on the
+TX ring. This can be accomplished either by a poll() call, as in the
+RX path, or by calling sendto().
+
+An example of how to use this flag can be found in
+samples/bpf/xdpsock_user.c. An example with the use of libbpf helpers
+would look like this for the TX path:
+
+.. code-block:: c
+
+ if (xsk_ring_prod__needs_wakeup(&my_tx_ring))
+ sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0);
+
+I.e., only use the syscall if the flag is set.
+
+We recommend that you always enable this mode as it usually leads to
+better performance especially if you run the application and the
+driver on the same core, but also if you use different cores for the
+application and the kernel driver, as it reduces the number of
+syscalls needed for the TX path.
+
+XDP_{RX|TX|UMEM_FILL|UMEM_COMPLETION}_RING setsockopts
+------------------------------------------------------
+
+These setsockopts set the number of descriptors that the RX, TX,
+FILL, and COMPLETION rings respectively should have. It is mandatory
+to set the size of at least one of the RX and TX rings. If you set
+both, you will be able to both receive and send traffic from your
+application, but if you only want to do one of them, you can save
+resources by only setting up one of them. Both the FILL ring and the
+COMPLETION ring are mandatory as you need to have a UMEM tied to your
+socket. But if the XDP_SHARED_UMEM flag is used, any socket after the
+first one does not have a UMEM and should in that case not have any
+FILL or COMPLETION rings created as the ones from the shared umem will
+be used. Note that the rings are single-producer single-consumer, so
+do not try to access them from multiple processes at the same
+time. See the XDP_SHARED_UMEM section.
+
+In libbpf, you can create Rx-only and Tx-only sockets by supplying
+NULL to the rx and tx arguments, respectively, to the
+xsk_socket__create function.
+
+If you create a Tx-only socket, we recommend that you do not put any
+packets on the FILL ring. If you do this, drivers might think you are
+going to receive something when you in fact will not, and this can
+negatively impact performance.
+
+XDP_UMEM_REG setsockopt
+-----------------------
+
+This setsockopt registers a UMEM to a socket. This is the area that
+contains all the buffers that packets can reside in. The call takes a
+pointer to the beginning of this area and the size of it. Moreover, it
+also has a parameter called chunk_size that is the size that the UMEM is
+divided into. It can only be 2K or 4K at the moment. If you have a
+UMEM area that is 128K and a chunk size of 2K, this means that you
+will be able to hold a maximum of 128K / 2K = 64 packets in your UMEM
+area and that your largest packet size can be 2K.
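
The chunk arithmetic above can be written out as a small C sketch. The helper names `umem_num_chunks` and `umem_chunk_addr` are hypothetical and not part of the uapi or libbpf; the sketch assumes the UMEM size is a whole multiple of the chunk size.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: how many chunks (and so packets) a UMEM can hold. */
static inline uint64_t umem_num_chunks(uint64_t umem_size, uint64_t chunk_size)
{
	/* Assumption of this sketch: the area divides evenly into chunks. */
	assert(chunk_size != 0 && umem_size % chunk_size == 0);

	return umem_size / chunk_size;
}

/* Hypothetical helper: offset into the UMEM where chunk number idx starts. */
static inline uint64_t umem_chunk_addr(uint64_t idx, uint64_t chunk_size)
{
	return idx * chunk_size;
}
```

For a 128K UMEM with 2K chunks this gives 64 chunks, the last of which starts at offset 63 * 2048 = 129024.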
+
+There is also an option to set the headroom of each single buffer in
+the UMEM. If you set this to N bytes, it means that the packet will
+start N bytes into the buffer leaving the first N bytes for the
+application to use. The final option is the flags field, but it will
+be dealt with in separate sections for each UMEM flag.
+
+XDP_STATISTICS getsockopt
+-------------------------
+
+Gets drop statistics of a socket that can be useful for debug
+purposes. The supported statistics are shown below:
+
+.. code-block:: c
+
+ struct xdp_statistics {
+ __u64 rx_dropped; /* Dropped for reasons other than invalid desc */
+ __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
+ __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
+ };
+
+XDP_OPTIONS getsockopt
+----------------------
+
+Gets options from an XDP socket. The only one supported so far is
+XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.
+
Usage
=====
-In order to use AF_XDP sockets there are two parts needed. The
+In order to use AF_XDP sockets, two parts are needed: the
user-space application and the XDP program. For a complete setup and
usage example, please refer to the sample application. The user-space
side is xdpsock_user.c and the XDP side is part of libbpf.
-The XDP code sample included in tools/lib/bpf/xsk.c is the following::
+The XDP code sample included in tools/lib/bpf/xsk.c is the following:
+
+.. code-block:: c
SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
{
int index = ctx->rx_queue_index;
- // A set entry here means that the correspnding queue_id
+ // A set entry here means that the corresponding queue_id
// has an active AF_XDP socket bound to it.
if (bpf_map_lookup_elem(&xsks_map, &index))
return bpf_redirect_map(&xsks_map, index, 0);
@@ -236,7 +440,10 @@ The XDP code sample included in tools/lib/bpf/xsk.c is the following::
return XDP_PASS;
}
-Naive ring dequeue and enqueue could look like this::
+A simple but not so performant ring dequeue and enqueue could look
+like this:
+
+.. code-block:: c
// struct xdp_rxtx_ring {
// __u32 *producer;
@@ -285,17 +492,16 @@ Naive ring dequeue and enqueue could look like this::
return 0;
}
-
-For a more optimized version, please refer to the sample application.
+But please use the libbpf functions as they are optimized and ready to
+use. They will make your life easier.
Sample application
==================
There is a xdpsock benchmarking/test application included that
-demonstrates how to use AF_XDP sockets with both private and shared
-UMEMs. Say that you would like your UDP traffic from port 4242 to end
-up in queue 16, that we will enable AF_XDP on. Here, we use ethtool
-for this::
+demonstrates how to use AF_XDP sockets with private UMEMs. Say that
+you would like your UDP traffic from port 4242 to end up in queue 16,
+that we will enable AF_XDP on. Here, we use ethtool for this::
ethtool -N p3p2 rx-flow-hash udp4 fn
ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
@@ -309,13 +515,18 @@ using::
For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
can be displayed with "-h", as usual.
+This sample application uses libbpf to make the setup and usage of
+AF_XDP simpler. If you want to know how the raw uapi of AF_XDP is
+really used to make something more advanced, take a look at the libbpf
+code in tools/lib/bpf/xsk.[ch].
+
FAQ
=======
Q: I am not seeing any traffic on the socket. What am I doing wrong?
A: When a netdev of a physical NIC is initialized, Linux usually
- allocates one Rx and Tx queue pair per core. So on a 8 core system,
+ allocates one RX and TX queue pair per core. So on a 8 core system,
queue ids 0 to 7 will be allocated, one per core. In the AF_XDP
bind call or the xsk_socket__create libbpf function call, you
specify a specific queue id to bind to and it is only the traffic
@@ -341,9 +552,21 @@ A: When a netdev of a physical NIC is initialized, Linux usually
sudo ethtool -N <interface> flow-type udp4 src-port 4242 dst-port \
4242 action 2
- A number of other ways are possible all up to the capabilitites of
+ A number of other ways are possible all up to the capabilities of
the NIC you have.
+Q: Can I use the XSKMAP to implement a switch between different umems
+ in copy mode?
+
+A: The short answer is no, that is not supported at the moment. The
+ XSKMAP can only be used to switch traffic coming in on queue id X
+ to sockets bound to the same queue id X. The XSKMAP can contain
+ sockets bound to different queue ids, for example X and Y, but only
+ traffic coming in from queue id Y can be directed to sockets bound
+ to the same queue id Y. In zero-copy mode, you should use the
+ switch, or other distribution mechanism, in your NIC to direct
+ traffic to the correct queue id and socket.
+
Credits
=======
diff --git a/Documentation/networking/caif/README b/Documentation/networking/caif/caif.rst
index 757ccfaa1385..07afc8063d4d 100644
--- a/Documentation/networking/caif/README
+++ b/Documentation/networking/caif/caif.rst
@@ -1,18 +1,31 @@
-Copyright (C) ST-Ericsson AB 2010
-Author: Sjur Brendeland/ sjur.brandeland@stericsson.com
-License terms: GNU General Public License (GPL) version 2
----------------------------------------------------------
+:orphan:
-=== Start ===
-If you have compiled CAIF for modules do:
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
-$modprobe crc_ccitt
-$modprobe caif
-$modprobe caif_socket
-$modprobe chnl_net
+================
+Using Linux CAIF
+================
-=== Preparing the setup with a STE modem ===
+
+:Copyright: |copy| ST-Ericsson AB 2010
+
+:Author: Sjur Brendeland/ sjur.brandeland@stericsson.com
+
+Start
+=====
+
+If you have compiled CAIF for modules do::
+
+ $modprobe crc_ccitt
+ $modprobe caif
+ $modprobe caif_socket
+ $modprobe chnl_net
+
+
+Preparing the setup with a STE modem
+====================================
If you are working on integration of CAIF you should make sure
that the kernel is built with module support.
@@ -32,24 +45,30 @@ module parameter "ser_use_stx".
Normally Frame Checksum is always used on UART, but this is also provided as a
module parameter "ser_use_fcs".
-$ modprobe caif_serial ser_ttyname=/dev/ttyS0 ser_use_stx=yes
-$ ifconfig caif_ttyS0 up
+::
+
+ $ modprobe caif_serial ser_ttyname=/dev/ttyS0 ser_use_stx=yes
+ $ ifconfig caif_ttyS0 up
-PLEASE NOTE: There is a limitation in Android shell.
+PLEASE NOTE:
+ There is a limitation in Android shell.
It only accepts one argument to insmod/modprobe!
-=== Trouble shooting ===
+Troubleshooting
+================
There are debugfs parameters provided for serial communication.
/sys/kernel/debug/caif_serial/<tty-name>/
* ser_state: Prints the bit-mask status where
+
- 0x02 means SENDING, this is a transient state.
- 0x10 means FLOW_OFF_SENT, i.e. the previous frame has not been sent
- and is blocking further send operation. Flow OFF has been propagated
- to all CAIF Channels using this TTY.
+ and is blocking further send operation. Flow OFF has been propagated
+ to all CAIF Channels using this TTY.
* tty_status: Prints the bit-mask tty status information
+
- 0x01 - tty->warned is on.
- 0x02 - tty->low_latency is on.
- 0x04 - tty->packed is on.
@@ -58,13 +77,17 @@ There are debugfs parameters provided for serial communication.
- 0x20 - tty->stopped is on.
* last_tx_msg: Binary blob Prints the last transmitted frame.
- This can be printed with
+
+ This can be printed with::
+
$od --format=x1 /sys/kernel/debug/caif_serial/<tty>/last_rx_msg.
- The first two tx messages sent look like this. Note: The initial
- byte 02 is start of frame extension (STX) used for re-syncing
- upon errors.
- - Enumeration:
+ The first two tx messages sent look like this. Note: The initial
+ byte 02 is start of frame extension (STX) used for re-syncing
+ upon errors.
+
+ - Enumeration::
+
0000000 02 05 00 00 03 01 d2 02
| | | | | |
STX(1) | | | |
@@ -73,7 +96,9 @@ There are debugfs parameters provided for serial communication.
Command:Enumeration(1)
Link-ID(1)
Checksum(2)
- - Channel Setup:
+
+ - Channel Setup::
+
0000000 02 07 00 00 00 21 a1 00 48 df
| | | | | | | |
STX(1) | | | | | |
@@ -86,13 +111,18 @@ There are debugfs parameters provided for serial communication.
Checksum(2)
* last_rx_msg: Prints the last transmitted frame.
- The RX messages for LinkSetup look almost identical but they have the
- bit 0x20 set in the command bit, and Channel Setup has added one byte
- before Checksum containing Channel ID.
- NOTE: Several CAIF Messages might be concatenated. The maximum debug
+
+ The RX messages for LinkSetup look almost identical but they have the
+ bit 0x20 set in the command bit, and Channel Setup has added one byte
+ before Checksum containing Channel ID.
+
+ NOTE:
+ Several CAIF Messages might be concatenated. The maximum debug
buffer size is 128 bytes.
-== Error Scenarios:
+Error Scenarios
+===============
+
- last_tx_msg contains channel setup message and last_rx_msg is empty ->
The host seems to be able to send over the UART, at least the CAIF ldisc get
notified that sending is completed.
@@ -103,7 +133,9 @@ There are debugfs parameters provided for serial communication.
- if /sys/kernel/debug/caif_serial/<tty>/tty_status is non-zero there
might be problems transmitting over UART.
+
E.g. host and modem wiring is not correct you will typically see
tty_status = 0x10 (hw_stopped) and ser_state = 0x10 (FLOW_OFF_SENT).
+
You will probably see the enumeration message in last_tx_message
and empty last_rx_message.
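
As a sketch of how the bit masks above can be checked once the debugfs values have been read, the C fragment below tests for the wiring-fault scenario. The macro and function names are hypothetical and chosen for illustration. Only bit values stated in this document are used (ser_state 0x02 and 0x10, tty_status 0x10 for hw_stopped).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the ser_state bits listed in the text. */
#define CAIF_SER_SENDING        0x02u  /* transient state */
#define CAIF_SER_FLOW_OFF_SENT  0x10u  /* previous frame blocks further sends */

/* Hypothetical name for the tty_status bit the error scenario mentions. */
#define CAIF_TTY_HW_STOPPED     0x10u

/*
 * True when both values show the pattern described for incorrect
 * host/modem wiring: hw_stopped on the tty and FLOW_OFF_SENT in the
 * serial state.
 */
static inline bool caif_looks_like_wiring_fault(unsigned int ser_state,
						unsigned int tty_status)
{
	return (ser_state & CAIF_SER_FLOW_OFF_SENT) &&
	       (tty_status & CAIF_TTY_HW_STOPPED);
}
```

A ser_state of 0x02 alone only means a send is in progress and does not match this scenario.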
diff --git a/Documentation/networking/device_drivers/aquantia/atlantic.txt b/Documentation/networking/device_drivers/aquantia/atlantic.txt
index d235cbaeccc6..2013fcedc2da 100644
--- a/Documentation/networking/device_drivers/aquantia/atlantic.txt
+++ b/Documentation/networking/device_drivers/aquantia/atlantic.txt
@@ -1,5 +1,5 @@
-aQuantia AQtion Driver for the aQuantia Multi-Gigabit PCI Express Family of
-Ethernet Adapters
+Marvell(Aquantia) AQtion Driver for the aQuantia Multi-Gigabit PCI Express
+Family of Ethernet Adapters
=============================================================================
Contents
@@ -325,6 +325,46 @@ Supported ethtool options
Example:
ethtool -N eth0 flow-type udp4 action 0 loc 32
+ UDP GSO hardware offload
+ ---------------------------------
+ UDP GSO boosts UDP TX rates by offloading UDP header allocation
+ to hardware. A special userspace socket option is required for this. It
+ can be validated with /kernel/tools/testing/selftests/net/
+
+ udpgso_bench_tx -u -4 -D 10.0.1.1 -s 6300 -S 100
+
+ This sends 100-byte UDP packets formed from a single
+ 6300-byte user buffer.
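
The text does not name the socket option. The sketch below assumes it is the Linux UDP_SEGMENT option, which is what the udpgso_bench_tx selftest sets; the helper names are hypothetical, and the fallback value 103 is taken from linux/udp.h for libc headers that do not define it.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103	/* assumption: value from linux/udp.h */
#endif

/* Hypothetical helper: datagrams produced from one buffer of buf_len bytes. */
static inline size_t udp_gso_segments(size_t buf_len, size_t gso_size)
{
	assert(gso_size > 0);

	/* A trailing partial segment still becomes its own datagram. */
	return (buf_len + gso_size - 1) / gso_size;
}

/*
 * Hypothetical helper: ask the kernel to split every send on fd into
 * gso_size-byte datagrams. Returns 0 on success and -1 on error, like
 * setsockopt() itself.
 */
static inline int udp_enable_gso(int fd, int gso_size)
{
	return setsockopt(fd, IPPROTO_UDP, UDP_SEGMENT,
			  &gso_size, sizeof(gso_size));
}
```

With the benchmark parameters above (a 6300-byte buffer and 100-byte segments) a single send results in 63 datagrams.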
+
+ UDP GSO is configured by:
+
+ ethtool -K eth0 tx-udp-segmentation on
+
+ Private flags (testing)
+ ---------------------------------
+
+ Atlantic driver supports private flags for hardware custom features:
+
+ $ ethtool --show-priv-flags ethX
+
+ Private flags for ethX:
+ DMASystemLoopback : off
+ PKTSystemLoopback : off
+ DMANetworkLoopback : off
+ PHYInternalLoopback: off
+ PHYExternalLoopback: off
+
+ Example:
+
+ $ ethtool --set-priv-flags ethX DMASystemLoopback on
+
+ DMASystemLoopback: DMA Host loopback.
+ PKTSystemLoopback: Packet buffer host loopback.
+ DMANetworkLoopback: Network side loopback on DMA block.
+ PHYInternalLoopback: Internal loopback on Phy.
+ PHYExternalLoopback: External loopback on Phy (with loopback ethernet cable).
+
+
Command Line Parameters
=======================
The following command line parameters are available on atlantic driver:
@@ -426,7 +466,7 @@ Support
If an issue is identified with the released source code on the supported
kernel with a supported adapter, email the specific information related
-to the issue to support@aquantia.com
+to the issue to aqn_support@marvell.com
License
=======
diff --git a/Documentation/networking/device_drivers/freescale/dpaa.txt b/Documentation/networking/device_drivers/freescale/dpaa.txt
index f88194f71c54..b06601ff9200 100644
--- a/Documentation/networking/device_drivers/freescale/dpaa.txt
+++ b/Documentation/networking/device_drivers/freescale/dpaa.txt
@@ -129,9 +129,9 @@ CONFIG_AQUANTIA_PHY=y
DPAA Ethernet Frame Processing
==============================
-On Rx, buffers for the incoming frames are retrieved from one of the three
-existing buffers pools. The driver initializes and seeds these, each with
-buffers of different sizes: 1KB, 2KB and 4KB.
+On Rx, buffers for the incoming frames are retrieved from the buffers found
+in the dedicated interface buffer pool. The driver initializes and seeds these
+with one-page buffers.
On Tx, all transmitted frames are returned to the driver through Tx
confirmation frame queues. The driver is then responsible for freeing the
@@ -254,7 +254,7 @@ The following statistics are exported for each interface through ethtool:
The driver also exports the following information in sysfs:
- the FQ IDs for each FQ type
- /sys/devices/platform/dpaa-ethernet.0/net/<int>/fqids
+ /sys/devices/platform/soc/<addr>.fman/<addr>.ethernet/dpaa-ethernet.<id>/net/fm<nr>-mac<nr>/fqids
- - the IDs of the buffer pools in use
- /sys/devices/platform/dpaa-ethernet.0/net/<int>/bpids
+ - the ID of the buffer pool in use
+ /sys/devices/platform/soc/<addr>.fman/<addr>.ethernet/dpaa-ethernet.<id>/net/fm<nr>-mac<nr>/bpids
diff --git a/Documentation/networking/device_drivers/freescale/dpaa2/index.rst b/Documentation/networking/device_drivers/freescale/dpaa2/index.rst
index 67bd87fe6c53..ee40fcc5ddff 100644
--- a/Documentation/networking/device_drivers/freescale/dpaa2/index.rst
+++ b/Documentation/networking/device_drivers/freescale/dpaa2/index.rst
@@ -8,3 +8,4 @@ DPAA2 Documentation
overview
dpio-driver
ethernet-driver
+ mac-phy-support
diff --git a/Documentation/networking/device_drivers/freescale/dpaa2/mac-phy-support.rst b/Documentation/networking/device_drivers/freescale/dpaa2/mac-phy-support.rst
new file mode 100644
index 000000000000..51e6624fb774
--- /dev/null
+++ b/Documentation/networking/device_drivers/freescale/dpaa2/mac-phy-support.rst
@@ -0,0 +1,191 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+=======================
+DPAA2 MAC / PHY support
+=======================
+
+:Copyright: |copy| 2019 NXP
+
+Overview
+--------
+
+The DPAA2 MAC / PHY support consists of a set of APIs that help DPAA2 network
+drivers (dpaa2-eth, dpaa2-ethsw) interact with the PHY library.
+
+DPAA2 Software Architecture
+---------------------------
+
+Among other DPAA2 objects, the fsl-mc bus exports DPNI objects (abstracting a
+network interface) and DPMAC objects (abstracting a MAC). The dpaa2-eth driver
+probes on the DPNI object and connects to and configures a DPMAC object with
+the help of phylink.
+
+Data connections may be established between a DPNI and a DPMAC, or between two
+DPNIs. Depending on the connection type, the netif_carrier_[on/off] is handled
+directly by the dpaa2-eth driver or by phylink.
+
+.. code-block:: none
+
+ Sources of abstracted link state information presented by the MC firmware
+
+ +--------------------------------------+
+ +------------+ +---------+ | xgmac_mdio |
+ | net_device | | phylink |--| +-----+ +-----+ +-----+ +-----+ |
+ +------------+ +---------+ | | PHY | | PHY | | PHY | | PHY | |
+ | | | +-----+ +-----+ +-----+ +-----+ |
+ +------------------------------------+ | External MDIO bus |
+ | dpaa2-eth | +--------------------------------------+
+ +------------------------------------+
+ | | Linux
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ | | MC firmware
+ | /| V
+ +----------+ / | +----------+
+ | | / | | |
+ | | | | | |
+ | DPNI |<------| |<------| DPMAC |
+ | | | | | |
+ | | \ |<---+ | |
+ +----------+ \ | | +----------+
+ \| |
+ |
+ +--------------------------------------+
+ | MC firmware polling MAC PCS for link |
+ | +-----+ +-----+ +-----+ +-----+ |
+ | | PCS | | PCS | | PCS | | PCS | |
+ | +-----+ +-----+ +-----+ +-----+ |
+ | Internal MDIO bus |
+ +--------------------------------------+
+
+
+Depending on an MC firmware configuration setting, each MAC may be in one of two modes:
+
+- DPMAC_LINK_TYPE_FIXED: the link state management is handled exclusively by
+ the MC firmware by polling the MAC PCS. Without the need to register a
+ phylink instance, the dpaa2-eth driver will not bind to the connected dpmac
+ object at all.
+
+- DPMAC_LINK_TYPE_PHY: The MC firmware is left waiting for link state update
+ events, but those are in fact passed strictly between the dpaa2-mac (based on
+ phylink) and its attached net_device driver (dpaa2-eth, dpaa2-ethsw),
+ effectively bypassing the firmware.
+
+Implementation
+--------------
+
+At probe time or when a DPNI's endpoint is dynamically changed, the dpaa2-eth
+driver is responsible for finding out whether the peer object is a DPMAC and,
+if so, for integrating it with PHYLINK using the dpaa2_mac_connect() API, which
+will do the following:
+
+ - look up the device tree for a PHYLINK-compatible OF binding (phy-handle)
+ - create a PHYLINK instance associated with the received net_device
+ - connect to the PHY using phylink_of_phy_connect()
+
+The following phylink_mac_ops callbacks are implemented:
+
+ - .validate() will populate the supported linkmodes with the MAC capabilities
+ only when the phy_interface_t is RGMII_* (at the moment, this is the only
+ link type supported by the driver).
+
+ - .mac_config() will configure the MAC in the new configuration using the
+ dpmac_set_link_state() MC firmware API.
+
+ - .mac_link_up() / .mac_link_down() will update the MAC link using the same
+ API described above.
+
+At driver unbind() or when the DPNI object is disconnected from the DPMAC, the
+dpaa2-eth driver calls dpaa2_mac_disconnect() which will, in turn, disconnect
+from the PHY and destroy the PHYLINK instance.
+
+In case of a DPNI-DPMAC connection, an 'ip link set dev eth0 up' would start
+the following sequence of operations:
+
+(1) phylink_start() called from .dev_open().
+(2) The .mac_config() and .mac_link_up() callbacks are called by PHYLINK.
+(3) In order to configure the HW MAC, the MC Firmware API
+ dpmac_set_link_state() is called.
+(4) The firmware will eventually set up the HW MAC in the new configuration.
+(5) A netif_carrier_on() call is made directly from PHYLINK on the associated
+ net_device.
+(6) The dpaa2-eth driver handles the LINK_STATE_CHANGE irq in order to
+ enable/disable Rx taildrop based on the pause frame settings.
+
+.. code-block:: none
+
+ +---------+ +---------+
+ | PHYLINK |-------------->| eth0 |
+ +---------+ (5) +---------+
+ (1) ^ |
+ | |
+ | v (2)
+ +-----------------------------------+
+ | dpaa2-eth |
+ +-----------------------------------+
+ | ^ (6)
+ | |
+ v (3) |
+ +---------+---------------+---------+
+ | DPMAC | | DPNI |
+ +---------+ +---------+
+ | MC Firmware |
+ +-----------------------------------+
+ |
+ |
+ v (4)
+ +-----------------------------------+
+ | HW MAC |
+ +-----------------------------------+
+
+In case of a DPNI-DPNI connection, a usual sequence of operations looks like
+the following:
+
+(1) ip link set dev eth0 up
+(2) The dpni_enable() MC API is called on the associated fsl_mc_device.
+(3) ip link set dev eth1 up
+(4) The dpni_enable() MC API is called on the associated fsl_mc_device.
+(5) The LINK_STATE_CHANGED irq is received by both instances of the dpaa2-eth
+ driver because now the operational link state is up.
+(6) The netif_carrier_on() is called on the exported net_device from
+ link_state_update().
+
+.. code-block:: none
+
+ +---------+ +---------+
+ | eth0 | | eth1 |
+ +---------+ +---------+
+ | ^ ^ |
+ | | | |
+ (1) v | (6) (6) | v (3)
+ +---------+ +---------+
+ |dpaa2-eth| |dpaa2-eth|
+ +---------+ +---------+
+ | ^ ^ |
+ | | | |
+ (2) v | (5) (5) | v (4)
+ +---------+---------------+---------+
+ | DPNI | | DPNI |
+ +---------+ +---------+
+ | MC Firmware |
+ +-----------------------------------+
+
+
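The DPNI-DPNI sequence above can be sketched as a small user-space model. This is only an illustration of the ordering, not driver code: `toy_dpni`, `toy_dpni_connect()`, `toy_dpni_enable()` and `toy_link_state_update()` are hypothetical names, and the model simply assumes that the operational link state is up once both connected endpoints have been enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of two DPNIs connected back to back (hypothetical, not driver code). */
struct toy_dpni {
	bool enabled;          /* set by the stand-in for dpni_enable() */
	bool carrier;          /* stand-in for the net_device carrier state */
	struct toy_dpni *peer; /* DPNI at the other end of the connection */
};

/* Models link_state_update(): carrier follows the operational link state. */
static void toy_link_state_update(struct toy_dpni *ni)
{
	ni->carrier = ni->enabled && ni->peer != NULL && ni->peer->enabled;
}

/* Models steps (2) and (4), plus the irq delivered to both ends in step (5). */
static void toy_dpni_enable(struct toy_dpni *ni)
{
	ni->enabled = true;
	toy_link_state_update(ni);
	if (ni->peer != NULL)
		toy_link_state_update(ni->peer);
}

/* Establishes the DPNI-DPNI data connection in the model. */
static void toy_dpni_connect(struct toy_dpni *a, struct toy_dpni *b)
{
	a->peer = b;
	b->peer = a;
}
```

In this model, enabling only one endpoint leaves both carriers off; the carrier is reported on both sides only after the second endpoint is enabled, which mirrors steps (5) and (6).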
+Exported API
+------------
+
+Any DPAA2 driver that drives endpoints of DPMAC objects should service its
+_EVENT_ENDPOINT_CHANGED irq and connect/disconnect from the associated DPMAC
+when necessary using the API listed below::
+
+ - int dpaa2_mac_connect(struct dpaa2_mac *mac);
+ - void dpaa2_mac_disconnect(struct dpaa2_mac *mac);
+
+A phylink integration is necessary only when the partner DPMAC is not of TYPE_FIXED.
+One can check for this condition using the below API::
+
+ - bool dpaa2_mac_is_type_fixed(struct fsl_mc_device *dpmac_dev, struct fsl_mc_io *mc_io);
+
+Before connecting to a MAC, the caller must allocate and populate the
+dpaa2_mac structure with the associated net_device, a pointer to the MC portal
+to be used and the actual fsl_mc_device structure of the DPMAC.
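A caller of the exported API might follow the pattern sketched below. This is a user-space mock under stated assumptions, not kernel code: the three `dpaa2_mac_*` functions are stubs that only reuse the prototypes listed above, the struct layouts (including the `type_fixed` and `connected` fields) are invented for illustration, and `example_attach_mac()` / `example_detach_mac()` are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-ins for kernel types; the real definitions are much richer. */
struct net_device { const char *name; };
struct fsl_mc_io { int portal_id; };
struct fsl_mc_device { bool type_fixed; };	/* hypothetical field */

struct dpaa2_mac {
	struct fsl_mc_device *mc_dev;	/* fsl_mc_device of the DPMAC */
	struct fsl_mc_io *mc_io;	/* MC portal to be used */
	struct net_device *net_dev;	/* associated net_device */
	bool connected;			/* hypothetical bookkeeping for the mock */
};

/* Stubs with the documented prototypes; the bodies are placeholders. */
static bool dpaa2_mac_is_type_fixed(struct fsl_mc_device *dpmac_dev,
				    struct fsl_mc_io *mc_io)
{
	(void)mc_io;
	return dpmac_dev->type_fixed;
}

static int dpaa2_mac_connect(struct dpaa2_mac *mac)
{
	mac->connected = true;
	return 0;
}

static void dpaa2_mac_disconnect(struct dpaa2_mac *mac)
{
	mac->connected = false;
}

/* Returns a connected dpaa2_mac, or NULL when the DPMAC is TYPE_FIXED
 * (no phylink integration needed) or when setup fails. */
static struct dpaa2_mac *example_attach_mac(struct net_device *ndev,
					    struct fsl_mc_io *mc_io,
					    struct fsl_mc_device *dpmac_dev)
{
	struct dpaa2_mac *mac;

	if (dpaa2_mac_is_type_fixed(dpmac_dev, mc_io))
		return NULL;

	mac = calloc(1, sizeof(*mac));
	if (!mac)
		return NULL;

	/* Populate the structure before connecting, as the text requires. */
	mac->mc_dev = dpmac_dev;
	mac->mc_io = mc_io;
	mac->net_dev = ndev;

	if (dpaa2_mac_connect(mac)) {
		free(mac);
		return NULL;
	}
	return mac;
}

/* Counterpart for driver unbind or endpoint disconnect. */
static void example_detach_mac(struct dpaa2_mac *mac)
{
	if (!mac)
		return;
	dpaa2_mac_disconnect(mac);
	free(mac);
}
```

The type check comes first so that nothing is allocated for a TYPE_FIXED DPMAC, whose link state is managed by the MC firmware.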
diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst
index 2b7fefe72351..a191faaf97de 100644
--- a/Documentation/networking/device_drivers/index.rst
+++ b/Documentation/networking/device_drivers/index.rst
@@ -22,9 +22,13 @@ Contents:
intel/iavf
intel/ice
google/gve
+ marvell/octeontx2
mellanox/mlx5
+ netronome/nfp
+ pensando/ionic
+ stmicro/stmmac
-.. only:: subproject
+.. only:: subproject and html
Indices
=======
diff --git a/Documentation/networking/device_drivers/intel/e100.rst b/Documentation/networking/device_drivers/intel/e100.rst
index 2b9f4887beda..caf023cc88de 100644
--- a/Documentation/networking/device_drivers/intel/e100.rst
+++ b/Documentation/networking/device_drivers/intel/e100.rst
@@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0+
-==============================================================
-Linux* Base Driver for the Intel(R) PRO/100 Family of Adapters
-==============================================================
+=============================================================
+Linux Base Driver for the Intel(R) PRO/100 Family of Adapters
+=============================================================
June 1, 2018
@@ -21,7 +21,7 @@ Contents
In This Release
===============
-This file describes the Linux* Base Driver for the Intel(R) PRO/100 Family of
+This file describes the Linux Base Driver for the Intel(R) PRO/100 Family of
Adapters. This driver includes support for Itanium(R)2-based systems.
For questions related to hardware requirements, refer to the documentation
@@ -138,9 +138,9 @@ version 1.6 or later is required for this functionality.
The latest release of ethtool can be found from
https://www.kernel.org/pub/software/network/ethtool/
-Enabling Wake on LAN* (WoL)
----------------------------
-WoL is provided through the ethtool* utility. For instructions on
+Enabling Wake on LAN (WoL)
+--------------------------
+WoL is provided through the ethtool utility. For instructions on
enabling WoL with ethtool, refer to the ethtool man page. WoL will be
enabled on the system during the next shut down or reboot. For this
driver version, in order to enable WoL, the e100 driver must be loaded
diff --git a/Documentation/networking/device_drivers/intel/e1000.rst b/Documentation/networking/device_drivers/intel/e1000.rst
index 956560b6e745..4aaae0f7d6ba 100644
--- a/Documentation/networking/device_drivers/intel/e1000.rst
+++ b/Documentation/networking/device_drivers/intel/e1000.rst
@@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0+
-===========================================================
-Linux* Base Driver for Intel(R) Ethernet Network Connection
-===========================================================
+==========================================================
+Linux Base Driver for Intel(R) Ethernet Network Connection
+==========================================================
Intel Gigabit Linux driver.
Copyright(c) 1999 - 2013 Intel Corporation.
@@ -438,10 +438,10 @@ ethtool
The latest release of ethtool can be found from
https://www.kernel.org/pub/software/network/ethtool/
-Enabling Wake on LAN* (WoL)
----------------------------
+Enabling Wake on LAN (WoL)
+--------------------------
- WoL is configured through the ethtool* utility.
+ WoL is configured through the ethtool utility.
WoL will be enabled on the system during the next shut down or reboot.
For this driver version, in order to enable WoL, the e1000 driver must be
diff --git a/Documentation/networking/device_drivers/intel/e1000e.rst b/Documentation/networking/device_drivers/intel/e1000e.rst
index 01999f05509c..f49cd370e7bf 100644
--- a/Documentation/networking/device_drivers/intel/e1000e.rst
+++ b/Documentation/networking/device_drivers/intel/e1000e.rst
@@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0+
-======================================================
-Linux* Driver for Intel(R) Ethernet Network Connection
-======================================================
+=====================================================
+Linux Driver for Intel(R) Ethernet Network Connection
+=====================================================
Intel Gigabit Linux driver.
Copyright(c) 2008-2018 Intel Corporation.
@@ -338,7 +338,7 @@ and higher cannot be forced. Use the autonegotiation advertising setting to
manually set devices for 1 Gbps and higher.
Speed, duplex, and autonegotiation advertising are configured through the
-ethtool* utility.
+ethtool utility.
Caution: Only experienced network administrators should force speed and duplex
or change autonegotiation advertising manually. The settings at the switch must
@@ -351,9 +351,9 @@ will not attempt to auto-negotiate with its link partner since those adapters
operate only in full duplex and only at their native speed.
-Enabling Wake on LAN* (WoL)
----------------------------
-WoL is configured through the ethtool* utility.
+Enabling Wake on LAN (WoL)
+--------------------------
+WoL is configured through the ethtool utility.
WoL will be enabled on the system during the next shut down or reboot. For
this driver version, in order to enable WoL, the e1000e driver must be loaded
diff --git a/Documentation/networking/device_drivers/intel/fm10k.rst b/Documentation/networking/device_drivers/intel/fm10k.rst
index ac3269e34f55..4d279e64e221 100644
--- a/Documentation/networking/device_drivers/intel/fm10k.rst
+++ b/Documentation/networking/device_drivers/intel/fm10k.rst
@@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0+
-==============================================================
-Linux* Base Driver for Intel(R) Ethernet Multi-host Controller
-==============================================================
+=============================================================
+Linux Base Driver for Intel(R) Ethernet Multi-host Controller
+=============================================================
August 20, 2018
Copyright(c) 2015-2018 Intel Corporation.
@@ -120,8 +120,8 @@ rx-flow-hash tcp4|udp4|ah4|esp4|sctp4|tcp6|udp6|ah6|esp6|sctp6 m|v|t|s|d|f|n|r
Known Issues/Troubleshooting
============================
-Enabling SR-IOV in a 64-bit Microsoft* Windows Server* 2012/R2 guest OS under Linux KVM
----------------------------------------------------------------------------------------
+Enabling SR-IOV in a 64-bit Microsoft Windows Server 2012/R2 guest OS under Linux KVM
+-------------------------------------------------------------------------------------
KVM Hypervisor/VMM supports direct assignment of a PCIe device to a VM. This
includes traditional PCIe devices, as well as SR-IOV-capable devices based on
the Intel Ethernet Controller XL710.
diff --git a/Documentation/networking/device_drivers/intel/i40e.rst b/Documentation/networking/device_drivers/intel/i40e.rst
index 848fd388fa6e..8a9b18573688 100644
--- a/Documentation/networking/device_drivers/intel/i40e.rst
+++ b/Documentation/networking/device_drivers/intel/i40e.rst
@@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0+
-==================================================================
-Linux* Base Driver for the Intel(R) Ethernet Controller 700 Series
-==================================================================
+=================================================================
+Linux Base Driver for the Intel(R) Ethernet Controller 700 Series
+=================================================================
Intel 40 Gigabit Linux driver.
Copyright(c) 1999-2018 Intel Corporation.
@@ -384,7 +384,7 @@ NOTE: You cannot set the speed for devices based on the Intel(R) Ethernet
Network Adapter XXV710 based devices.
Speed, duplex, and autonegotiation advertising are configured through the
-ethtool* utility.
+ethtool utility.
Caution: Only experienced network administrators should force speed and duplex
or change autonegotiation advertising manually. The settings at the switch must
diff --git a/Documentation/networking/device_drivers/intel/iavf.rst b/Documentation/networking/device_drivers/intel/iavf.rst
index 2d0c3baa1752..84ac7e75f363 100644
--- a/Documentation/networking/device_drivers/intel/iavf.rst
+++ b/Documentation/networking/device_drivers/intel/iavf.rst
@@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0+
-==================================================================
-Linux* Base Driver for Intel(R) Ethernet Adaptive Virtual Function
-==================================================================
+=================================================================
+Linux Base Driver for Intel(R) Ethernet Adaptive Virtual Function
+=================================================================
Intel Ethernet Adaptive Virtual Function Linux driver.
Copyright(c) 2013-2018 Intel Corporation.
@@ -10,12 +10,16 @@ Copyright(c) 2013-2018 Intel Corporation.
Contents
========
+- Overview
- Identifying Your Adapter
- Additional Configurations
- Known Issues/Troubleshooting
- Support
-This file describes the iavf Linux* Base Driver. This driver was formerly
+Overview
+========
+
+This file describes the iavf Linux Base Driver. This driver was formerly
called i40evf.
The iavf driver supports the below mentioned virtual function devices and
@@ -27,6 +31,7 @@ The guest OS loading the iavf driver must support MSI-X interrupts.
Identifying Your Adapter
========================
+
The driver in this kernel is compatible with devices based on the following:
* Intel(R) XL710 X710 Virtual Function
* Intel(R) X722 Virtual Function
@@ -50,9 +55,10 @@ Link messages will not be displayed to the console if the distribution is
restricting system messages. In order to see network driver link messages on
your console, set dmesg to eight by entering the following::
- dmesg -n 8
+ # dmesg -n 8
-NOTE: This setting is not saved across reboots.
+NOTE:
+ This setting is not saved across reboots.
ethtool
-------
@@ -72,11 +78,11 @@ then requests from that VF to set VLAN tag stripping will be ignored.
To enable/disable VLAN tag stripping for a VF, issue the following command
from inside the VM in which you are running the VF::
- ethtool -K <if_name> rxvlan on/off
+ # ethtool -K <if_name> rxvlan on/off
or alternatively::
- ethtool --offload <if_name> rxvlan on/off
+ # ethtool --offload <if_name> rxvlan on/off
Adaptive Virtual Function
-------------------------
@@ -91,21 +97,21 @@ additional features depending on what features are available in the PF with
which the AVF is associated. The following are base mode features:
- 4 Queue Pairs (QP) and associated Configuration Status Registers (CSRs)
- for Tx/Rx.
-- i40e descriptors and ring format.
-- Descriptor write-back completion.
-- 1 control queue, with i40e descriptors, CSRs and ring format.
-- 5 MSI-X interrupt vectors and corresponding i40e CSRs.
-- 1 Interrupt Throttle Rate (ITR) index.
-- 1 Virtual Station Interface (VSI) per VF.
+ for Tx/Rx
+- i40e descriptors and ring format
+- Descriptor write-back completion
+- 1 control queue, with i40e descriptors, CSRs and ring format
+- 5 MSI-X interrupt vectors and corresponding i40e CSRs
+- 1 Interrupt Throttle Rate (ITR) index
+- 1 Virtual Station Interface (VSI) per VF
- 1 Traffic Class (TC), TC0
- Receive Side Scaling (RSS) with 64 entry indirection table and key,
- configured through the PF.
-- 1 unicast MAC address reserved per VF.
-- 16 MAC address filters for each VF.
-- Stateless offloads - non-tunneled checksums.
-- AVF device ID.
-- HW mailbox is used for VF to PF communications (including on Windows).
+ configured through the PF
+- 1 unicast MAC address reserved per VF
+- 16 MAC address filters for each VF
+- Stateless offloads - non-tunneled checksums
+- AVF device ID
+- HW mailbox is used for VF to PF communications (including on Windows)
IEEE 802.1ad (QinQ) Support
---------------------------
@@ -117,8 +123,8 @@ VLAN ID, among other uses.
The following are examples of how to configure 802.1ad (QinQ)::
- ip link add link eth0 eth0.24 type vlan proto 802.1ad id 24
- ip link add link eth0.24 eth0.24.371 type vlan proto 802.1Q id 371
+ # ip link add link eth0 eth0.24 type vlan proto 802.1ad id 24
+ # ip link add link eth0.24 eth0.24.371 type vlan proto 802.1Q id 371
Where "24" and "371" are example VLAN IDs.
@@ -133,6 +139,19 @@ specific application. This can reduce latency for the specified application,
and allow Tx traffic to be rate limited per application. Follow the steps below
to set ADq.
+Requirements:
+
+- The sch_mqprio, act_mirred and cls_flower modules must be loaded
+- The latest version of iproute2
+- If another driver (for example, DPDK) has set cloud filters, you cannot
+ enable ADQ
+- Depending on the underlying PF device, ADQ cannot be enabled when the
+ following features are enabled:
+
+ + Data Center Bridging (DCB)
+ + Multiple Functions per Port (MFP)
+ + Sideband Filters
+
1. Create traffic classes (TCs). Maximum of 8 TCs can be created per interface.
The shaper bw_rlimit parameter is optional.
@@ -141,9 +160,9 @@ to 1Gbit for tc0 and 3Gbit for tc1.
::
- # tc qdisc add dev <interface> root mqprio num_tc 2 map 0 0 0 0 1 1 1 1
- queues 16@0 16@16 hw 1 mode channel shaper bw_rlimit min_rate 1Gbit 2Gbit
- max_rate 1Gbit 3Gbit
+ tc qdisc add dev <interface> root mqprio num_tc 2 map 0 0 0 0 1 1 1 1
+ queues 16@0 16@16 hw 1 mode channel shaper bw_rlimit min_rate 1Gbit 2Gbit
+ max_rate 1Gbit 3Gbit
map: priority mapping for up to 16 priorities to tcs (e.g. map 0 0 0 0 1 1 1 1
sets priorities 0-3 to use tc0 and 4-7 to use tc1)
@@ -162,6 +181,10 @@ Totals must be equal or less than port speed.
For example: min_rate 1Gbit 3Gbit: Verify bandwidth limit using network
monitoring tools such as ifstat or sar –n DEV [interval] [number of samples]
+NOTE:
+ Setting up channels via ethtool (ethtool -L) is not supported when the
+ TCs are configured using mqprio.
+
2. Enable HW TC offload on interface::
# ethtool -K <interface> hw-tc-offload on
@@ -171,16 +194,16 @@ monitoring tools such as ifstat or sar –n DEV [interval] [number of samples]
# tc qdisc add dev <interface> ingress
NOTES:
- - Run all tc commands from the iproute2 <pathtoiproute2>/tc/ directory.
- - ADq is not compatible with cloud filters.
+ - Run all tc commands from the iproute2 <pathtoiproute2>/tc/ directory
+ - ADq is not compatible with cloud filters
- Setting up channels via ethtool (ethtool -L) is not supported when the TCs
- are configured using mqprio.
+ are configured using mqprio
- You must have iproute2 latest version
- - NVM version 6.01 or later is required.
+ - NVM version 6.01 or later is required
- ADq cannot be enabled when any the following features are enabled: Data
- Center Bridging (DCB), Multiple Functions per Port (MFP), or Sideband Filters.
+ Center Bridging (DCB), Multiple Functions per Port (MFP), or Sideband Filters
- If another driver (for example, DPDK) has set cloud filters, you cannot
- enable ADq.
+ enable ADq
- Tunnel filters are not supported in ADq. If encapsulated packets do arrive
in non-tunnel mode, filtering will be done on the inner headers. For example,
for VXLAN traffic in non-tunnel mode, PCTYPE is identified as a VXLAN
@@ -198,6 +221,16 @@ NOTES:
Known Issues/Troubleshooting
============================
+Bonding fails with VFs bound to an Intel(R) Ethernet Controller 700 series device
+---------------------------------------------------------------------------------
+If you bind Virtual Functions (VFs) to an Intel(R) Ethernet Controller 700
+series based device, the VF slaves may fail when they become the active slave.
+If the MAC address of the VF is set by the PF (Physical Function) of the
+device, when you add a slave, or change the active-backup slave, Linux bonding
+tries to sync the backup slave's MAC address to the same MAC address as the
+active slave. Linux bonding will fail at this point. This issue will not occur
+if the VF's MAC address is not set by the PF.
+
Traffic Is Not Being Passed Between VM and Client
-------------------------------------------------
You may not be able to pass traffic between a client system and a
@@ -215,13 +248,28 @@ Do not unload a port's driver if a Virtual Function (VF) with an active Virtual
Machine (VM) is bound to it. Doing so will cause the port to appear to hang.
Once the VM shuts down, or otherwise releases the VF, the command will complete.
+Using four traffic classes fails
+--------------------------------
+Do not try to reserve more than three traffic classes in the iavf driver. Doing
+so will fail to set any traffic classes and will cause the driver to write
+errors to stdout. Use a maximum of three queues to avoid this issue.
+
+Multiple log error messages on iavf driver removal
+--------------------------------------------------
+If you have several VFs and you remove the iavf driver, several instances of
+the following log errors are written to the log::
+
+ Unable to send opcode 2 to PF, err I40E_ERR_QUEUE_EMPTY, aq_err ok
+ Unable to send the message to VF 2 aq_err 12
+ ARQ Overflow Error detected
+
Virtual machine does not get link
---------------------------------
If the virtual machine has more than one virtual port assigned to it, and those
virtual ports are bound to different physical ports, you may not get link on
all of the virtual ports. The following command may work around the issue::
- ethtool -r <PF>
+ # ethtool -r <PF>
Where <PF> is the PF interface in the host, for example: p5p1. You may need to
run the command more than once to get link on all virtual ports.
@@ -251,12 +299,13 @@ traffic.
If you have multiple interfaces in a server, either turn on ARP filtering by
entering::
- echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
+ # echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
-NOTE: This setting is not saved across reboots. The configuration change can be
-made permanent by adding the following line to the file /etc/sysctl.conf::
+NOTE:
+ This setting is not saved across reboots. The configuration change can be
+ made permanent by adding the following line to the file /etc/sysctl.conf::
- net.ipv4.conf.all.arp_filter = 1
+ net.ipv4.conf.all.arp_filter = 1
Another alternative is to install the interfaces in separate broadcast domains
(either in different switches or in a switch partitioned to VLANs).
diff --git a/Documentation/networking/device_drivers/intel/ice.rst b/Documentation/networking/device_drivers/intel/ice.rst
index c220aa2711c6..ee43ea57d443 100644
--- a/Documentation/networking/device_drivers/intel/ice.rst
+++ b/Documentation/networking/device_drivers/intel/ice.rst
@@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0+
-===================================================================
-Linux* Base Driver for the Intel(R) Ethernet Connection E800 Series
-===================================================================
+==================================================================
+Linux Base Driver for the Intel(R) Ethernet Connection E800 Series
+==================================================================
Intel ice Linux driver.
Copyright(c) 2018 Intel Corporation.
diff --git a/Documentation/networking/device_drivers/intel/igb.rst b/Documentation/networking/device_drivers/intel/igb.rst
index fc8cfaa5dcfa..87e560fe5eaa 100644
--- a/Documentation/networking/device_drivers/intel/igb.rst
+++ b/Documentation/networking/device_drivers/intel/igb.rst
@@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0+
-===========================================================
-Linux* Base Driver for Intel(R) Ethernet Network Connection
-===========================================================
+==========================================================
+Linux Base Driver for Intel(R) Ethernet Network Connection
+==========================================================
Intel Gigabit Linux driver.
Copyright(c) 1999-2018 Intel Corporation.
@@ -129,9 +129,9 @@ version is required for this functionality. Download it at:
https://www.kernel.org/pub/software/network/ethtool/
-Enabling Wake on LAN* (WoL)
----------------------------
-WoL is configured through the ethtool* utility.
+Enabling Wake on LAN (WoL)
+--------------------------
+WoL is configured through the ethtool utility.
WoL will be enabled on the system during the next shut down or reboot. For
this driver version, in order to enable WoL, the igb driver must be loaded
diff --git a/Documentation/networking/device_drivers/intel/igbvf.rst b/Documentation/networking/device_drivers/intel/igbvf.rst
index 9cddabe8108e..557fc020ef31 100644
--- a/Documentation/networking/device_drivers/intel/igbvf.rst
+++ b/Documentation/networking/device_drivers/intel/igbvf.rst
@@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0+
-============================================================
-Linux* Base Virtual Function Driver for Intel(R) 1G Ethernet
-============================================================
+===========================================================
+Linux Base Virtual Function Driver for Intel(R) 1G Ethernet
+===========================================================
Intel Gigabit Virtual Function Linux driver.
Copyright(c) 1999-2018 Intel Corporation.
diff --git a/Documentation/networking/device_drivers/intel/ixgbe.rst b/Documentation/networking/device_drivers/intel/ixgbe.rst
index c7d25483fedb..f1d5233e5e51 100644
--- a/Documentation/networking/device_drivers/intel/ixgbe.rst
+++ b/Documentation/networking/device_drivers/intel/ixgbe.rst
@@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0+
-=============================================================================
-Linux* Base Driver for the Intel(R) Ethernet 10 Gigabit PCI Express Adapters
-=============================================================================
+===========================================================================
+Linux Base Driver for the Intel(R) Ethernet 10 Gigabit PCI Express Adapters
+===========================================================================
Intel 10 Gigabit Linux driver.
Copyright(c) 1999-2018 Intel Corporation.
@@ -519,8 +519,8 @@ The offload is also supported for ixgbe's VFs, but the VF must be set as
Known Issues/Troubleshooting
============================
-Enabling SR-IOV in a 64-bit Microsoft* Windows Server* 2012/R2 guest OS
------------------------------------------------------------------------
+Enabling SR-IOV in a 64-bit Microsoft Windows Server 2012/R2 guest OS
+---------------------------------------------------------------------
Linux KVM Hypervisor/VMM supports direct assignment of a PCIe device to a VM.
This includes traditional PCIe devices, as well as SR-IOV-capable devices based
on the Intel Ethernet Controller XL710.
diff --git a/Documentation/networking/device_drivers/intel/ixgbevf.rst b/Documentation/networking/device_drivers/intel/ixgbevf.rst
index 5d4977360157..76bbde736f21 100644
--- a/Documentation/networking/device_drivers/intel/ixgbevf.rst
+++ b/Documentation/networking/device_drivers/intel/ixgbevf.rst
@@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0+
-=============================================================
-Linux* Base Virtual Function Driver for Intel(R) 10G Ethernet
-=============================================================
+============================================================
+Linux Base Virtual Function Driver for Intel(R) 10G Ethernet
+============================================================
Intel 10 Gigabit Virtual Function Linux driver.
Copyright(c) 1999-2018 Intel Corporation.
diff --git a/Documentation/networking/device_drivers/marvell/octeontx2.rst b/Documentation/networking/device_drivers/marvell/octeontx2.rst
new file mode 100644
index 000000000000..88f508338c5f
--- /dev/null
+++ b/Documentation/networking/device_drivers/marvell/octeontx2.rst
@@ -0,0 +1,159 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+====================================
+Marvell OcteonTX2 RVU Kernel Drivers
+====================================
+
+Copyright (c) 2020 Marvell International Ltd.
+
+Contents
+========
+
+- `Overview`_
+- `Drivers`_
+- `Basic packet flow`_
+
+Overview
+========
+
+Resource virtualization unit (RVU) on Marvell's OcteonTX2 SOC maps HW
+resources from the network, crypto and other functional blocks into
+PCI-compatible physical and virtual functions. Each functional block
+again has multiple local functions (LFs) for provisioning to PCI devices.
+RVU supports multiple PCIe SRIOV physical functions (PFs) and virtual
+functions (VFs). PF0 is called the administrative / admin function (AF)
+and has privileges to provision RVU functional block's LFs to each of the
+PF/VF.
+
+RVU managed networking functional blocks
+ - Network pool or buffer allocator (NPA)
+ - Network interface controller (NIX)
+ - Network parser CAM (NPC)
+ - Schedule/Synchronize/Order unit (SSO)
+ - Loopback interface (LBK)
+
+RVU managed non-networking functional blocks
+ - Crypto accelerator (CPT)
+ - Scheduled timers unit (TIM)
+ - Schedule/Synchronize/Order unit (SSO)
+ Used for both networking and non-networking use cases
+
+Resource provisioning examples
+ - A PF/VF with NIX-LF & NPA-LF resources works as a pure network device
+ - A PF/VF with CPT-LF resource works as a pure crypto offload device.
+
+RVU functional blocks are highly configurable as per software requirements.
+
+Firmware sets up the following before the kernel boots
+ - Enables required number of RVU PFs based on number of physical links.
+ - Number of VFs per PF is either static or configurable at compile time.
+ Based on config, firmware assigns VFs to each of the PFs.
+ - Also assigns MSIX vectors to each of the PFs and VFs.
+ - These are not changed after kernel boot.
+
+Drivers
+=======
+
+The Linux kernel has multiple drivers registering to different PFs and VFs
+of RVU. With respect to networking there are three flavours of drivers.
+
+Admin Function driver
+---------------------
+
+As mentioned above, RVU PF0 is called the admin function (AF). This driver
+supports resource provisioning and configuration of functional blocks and
+doesn't handle any I/O. It sets up a few basic things, but most of the
+functionality is achieved via configuration requests from PFs and VFs.
+
+PFs/VFs communicate with AF via a shared memory region (mailbox). Upon
+receiving requests, AF does resource provisioning and other HW configuration.
+AF is always attached to the host kernel, but PFs and their VFs may be used by
+the host kernel itself, or attached to VMs or to userspace applications like
+DPDK. So AF has to handle provisioning/configuration requests sent
+by any device from any domain.
+
+AF driver also interacts with underlying firmware to
+ - Manage physical ethernet links, i.e. CGX LMACs.
+ - Retrieve information like speed, duplex, autoneg etc
+ - Retrieve PHY EEPROM and stats.
+ - Configure FEC, PAM modes
+ - etc
+
+From the pure networking side, the AF driver supports the following functionality.
+ - Map a physical link to a RVU PF to which a netdev is registered.
+ - Attach NIX and NPA block LFs to RVU PF/VF which provide buffer pools, RQs, SQs
+ for regular networking functionality.
+ - Flow control (pause frames) enable/disable/config.
+ - HW PTP timestamping related config.
+ - NPC parser profile config, basically how to parse pkt and what info to extract.
+ - NPC extract profile config, what to extract from the pkt to match data in MCAM entries.
+ - Manage NPC MCAM entries, upon request can frame and install requested packet forwarding rules.
+ - Defines receive side scaling (RSS) algorithms.
+ - Defines segmentation offload algorithms (eg TSO)
+ - VLAN stripping, capture and insertion config.
+ - SSO and TIM blocks config which provide packet scheduling support.
+ - Debugfs support, to check current resource provisioning, current status of
+ NPA pools, NIX RQ, SQ and CQs, various stats etc, which helps in debugging issues.
+ - And many more.
+
+Physical Function driver
+------------------------
+
+This RVU PF handles IO and is mapped to a physical ethernet link, and this
+driver registers a netdev. It supports SR-IOV. As said above, this driver
+communicates with AF via a mailbox. To retrieve information from physical
+links, this driver talks to AF; AF gets that info from firmware and responds
+back, i.e. the PF driver cannot talk to firmware directly.
+
+Supports ethtool for configuring links, RSS, queue count, queue size,
+flow control, ntuple filters, dump PHY EEPROM, config FEC etc.
+
+Virtual Function driver
+-----------------------
+
+There are two types of VFs: VFs that share the physical link with their parent
+SR-IOV PF, and VFs which work in pairs using internal HW loopback channels (LBK).
+
+Type1:
+ - These VFs and their parent PF share a physical link and are used for outside communication.
+ - VFs cannot communicate with AF directly, they send mbox message to PF and PF
+ forwards that to AF. AF after processing, responds back to PF and PF forwards
+ the reply to VF.
+ - From a functionality point of view there is no difference between PF and VF, as the same type
+ of HW resources is attached to both. But the user is able to configure a few things only
+ from the PF, as the PF is treated as owner/admin of the link.
+
+Type2:
+ - RVU PF0, i.e. the admin function, creates these VFs and maps them to the loopback block's channels.
+ - A set of two VFs (VF0 & VF1, VF2 & VF3 and so on) works as a pair, i.e. pkts sent out of
+ VF0 will be received by VF1 and vice versa.
+ - These VFs can be used by applications or virtual machines to communicate between them
+ without sending traffic outside. There is no switch present in HW, hence the support
+ for loopback VFs.
+ - These communicate directly with AF (PF0) via mbox.
+
+Except for the IO channels or links used for packet reception and transmission there is
+no other difference between these VF types. AF driver takes care of IO channel mapping,
+hence same VF driver works for both types of devices.
+
+Basic packet flow
+=================
+
+Ingress
+-------
+
+1. CGX LMAC receives packet.
+2. Forwards the packet to the NIX block.
+3. Then submitted to NPC block for parsing and then MCAM lookup to get the destination RVU device.
+4. NIX LF attached to the destination RVU device allocates a buffer from RQ mapped buffer pool of NPA block LF.
+5. RQ may be selected by RSS or by configuring MCAM rule with a RQ number.
+6. Packet is DMA'ed and driver is notified.
+
+Egress
+------
+
+1. Driver prepares a send descriptor and submits to SQ for transmission.
+2. The SQ is already configured (by AF) to transmit on a specific link/channel.
+3. The SQ descriptor ring is maintained in buffers allocated from SQ mapped pool of NPA block LF.
+4. NIX block transmits the pkt on the designated channel.
+5. NPC MCAM entries can be installed to divert pkt onto a different channel.
diff --git a/Documentation/networking/device_drivers/mellanox/mlx5.rst b/Documentation/networking/device_drivers/mellanox/mlx5.rst
index 214325897732..f575a49790e8 100644
--- a/Documentation/networking/device_drivers/mellanox/mlx5.rst
+++ b/Documentation/networking/device_drivers/mellanox/mlx5.rst
@@ -11,7 +11,9 @@ Contents
- `Enabling the driver and kconfig options`_
- `Devlink info`_
+- `Devlink parameters`_
- `Devlink health reporters`_
+- `mlx5 tracepoints`_
Enabling the driver and kconfig options
================================================
@@ -121,12 +123,65 @@ User command example::
stored:
fw.version 16.26.0100
+Devlink parameters
+==================
+
+flow_steering_mode: Device flow steering mode
+---------------------------------------------
+The flow steering mode parameter controls the flow steering mode of the driver.
+Two modes are supported:
+1. 'dmfs' - Device managed flow steering.
+2. 'smfs' - Software/Driver managed flow steering.
+
+In DMFS mode, the HW steering entities are created and managed through the
+Firmware.
+In SMFS mode, the HW steering entities are created and managed by
+the driver directly in hardware without firmware intervention.
+
+SMFS mode is faster and provides a better rule insertion rate compared to the default DMFS mode.
+
+User command examples:
+
+- Set SMFS flow steering mode::
+
+ $ devlink dev param set pci/0000:06:00.0 name flow_steering_mode value "smfs" cmode runtime
+
+- Read device flow steering mode::
+
+ $ devlink dev param show pci/0000:06:00.0 name flow_steering_mode
+ pci/0000:06:00.0:
+ name flow_steering_mode type driver-specific
+ values:
+ cmode runtime value smfs
+
+enable_roce: RoCE enablement state
+----------------------------------
+RoCE enablement state controls driver support for RoCE traffic.
+When RoCE is disabled, there is no gid table, only raw ethernet QPs are supported and traffic on the well known UDP RoCE port is handled as raw ethernet traffic.
+
+To change RoCE enablement state a user must change the driverinit cmode value and run devlink reload.
+
+User command examples:
+
+- Disable RoCE::
+
+ $ devlink dev param set pci/0000:06:00.0 name enable_roce value false cmode driverinit
+ $ devlink dev reload pci/0000:06:00.0
+
+- Read RoCE enablement state::
+
+ $ devlink dev param show pci/0000:06:00.0 name enable_roce
+ pci/0000:06:00.0:
+ name enable_roce type generic
+ values:
+ cmode driverinit value true
+
Devlink health reporters
========================
tx reporter
-----------
-The tx reporter is responsible of two error scenarios:
+The tx reporter is responsible for reporting and recovering of the following two error scenarios:
- TX timeout
Report on kernel tx timeout detection.
@@ -135,7 +190,7 @@ The tx reporter is responsible of two error scenarios:
Report on error tx completion.
Recover by flushing the TX queue and reset it.
-TX reporter also support Diagnose callback, on which it provides
+TX reporter also supports on demand diagnose callback, on which it provides
real time information of its send queues status.
User commands examples:
@@ -144,11 +199,40 @@ User commands examples:
$ devlink health diagnose pci/0000:82:00.0 reporter tx
+NOTE: This command has valid output only when interface is up, otherwise the command has empty output.
+
- Show number of tx errors indicated, number of recover flows ended successfully,
is autorecover enabled and graceful period from last recover::
$ devlink health show pci/0000:82:00.0 reporter tx
+rx reporter
+-----------
+The rx reporter is responsible for reporting and recovering of the following two error scenarios:
+
+- RX queues initialization (population) timeout
+ RX queues descriptors population on ring initialization is done in
+ napi context via triggering an irq, in case of a failure to get
+ the minimum amount of descriptors, a timeout would occur and it
+ could be recoverable by polling the EQ (Event Queue).
+- RX completions with errors (reported by HW on interrupt context)
+ Report on rx completion error.
+ Recover (if needed) by flushing the related queue and reset it.
+
+RX reporter also supports on demand diagnose callback, on which it
+provides real time information of its receive queues status.
+
+- Diagnose rx queues status, and corresponding completion queue::
+
+ $ devlink health diagnose pci/0000:82:00.0 reporter rx
+
+NOTE: This command has valid output only when interface is up, otherwise the command has empty output.
+
+- Show number of rx errors indicated, number of recover flows ended successfully,
+ is autorecover enabled and graceful period from last recover::
+
+ $ devlink health show pci/0000:82:00.0 reporter rx
+
fw reporter
-----------
The fw reporter implements diagnose and dump callbacks.
@@ -190,3 +274,48 @@ User commands examples:
$ devlink health dump show pci/0000:82:00.1 reporter fw_fatal
NOTE: This command can run only on PF.
+
+mlx5 tracepoints
+================
+
+mlx5 driver provides internal trace points for tracking and debugging using
+kernel tracepoints interfaces (refer to Documentation/trace/ftrace.rst).
+
+For the list of supported mlx5 events check /sys/kernel/debug/tracing/events/mlx5/
+
+tc and eswitch offloads tracepoints:
+
+- mlx5e_configure_flower: trace flower filter actions and cookies offloaded to mlx5::
+
+ $ echo mlx5:mlx5e_configure_flower >> /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ tc-6535 [019] ...1 2672.404466: mlx5e_configure_flower: cookie=0000000067874a55 actions= REDIRECT
+
+- mlx5e_delete_flower: trace flower filter actions and cookies deleted from mlx5::
+
+ $ echo mlx5:mlx5e_delete_flower >> /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ tc-6569 [010] .N.1 2686.379075: mlx5e_delete_flower: cookie=0000000067874a55 actions= NULL
+
+- mlx5e_stats_flower: trace flower stats request::
+
+ $ echo mlx5:mlx5e_stats_flower >> /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ tc-6546 [010] ...1 2679.704889: mlx5e_stats_flower: cookie=0000000060eb3d6a bytes=0 packets=0 lastused=4295560217
+
+- mlx5e_tc_update_neigh_used_value: trace tunnel rule neigh update value offloaded to mlx5::
+
+ $ echo mlx5:mlx5e_tc_update_neigh_used_value >> /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ kworker/u48:4-8806 [009] ...1 55117.882428: mlx5e_tc_update_neigh_used_value: netdev: ens1f0 IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_used=1
+
+- mlx5e_rep_neigh_update: trace neigh update tasks scheduled due to neigh state change events::
+
+ $ echo mlx5:mlx5e_rep_neigh_update >> /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ kworker/u48:7-2221 [009] ...1 1475.387435: mlx5e_rep_neigh_update: netdev: ens1f0 MAC: 24:8a:07:9a:17:9a IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_connected=1
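+
+As a minimal sketch based on the generic ftrace ``set_event`` syntax (nothing
+here is specific to mlx5), a previously enabled event can be disabled again by
+prefixing its name with ``!``, and the trace buffer can be cleared by writing
+to the ``trace`` file::
+
+  $ echo '!mlx5:mlx5e_configure_flower' >> /sys/kernel/debug/tracing/set_event
+  $ echo > /sys/kernel/debug/tracing/trace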
diff --git a/Documentation/networking/device_drivers/microsoft/netvsc.txt b/Documentation/networking/device_drivers/microsoft/netvsc.txt
index 3bfa635bbbd5..cd63556b27a0 100644
--- a/Documentation/networking/device_drivers/microsoft/netvsc.txt
+++ b/Documentation/networking/device_drivers/microsoft/netvsc.txt
@@ -82,3 +82,24 @@ Features
contain one or more packets. The send buffer is an optimization, the driver
will use slower method to handle very large packets or if the send buffer
area is exhausted.
+
+ XDP support
+ -----------
+ XDP (eXpress Data Path) is a feature that runs eBPF bytecode at an early
+ stage when packets arrive at a NIC. The goal is to increase performance
+ for packet processing, reducing the overhead of SKB allocation and other
+ upper network layers.
+
+ hv_netvsc supports XDP in native mode, and transparently sets the XDP
+ program on the associated VF NIC as well.
+
+ Setting / unsetting an XDP program on the synthetic NIC (netvsc) propagates
+ to the VF NIC automatically. Setting / unsetting an XDP program on the VF NIC
+ directly is not recommended: it is not propagated to the synthetic NIC, and
+ it may be overwritten by the setting on the synthetic NIC.
+
+ XDP program cannot run with LRO (RSC) enabled, so you need to disable LRO
+ before running XDP:
+ ethtool -K eth0 lro off
+
+ XDP_REDIRECT action is not yet supported.
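+
+ As a rough sketch, assuming a compiled XDP object file named xdp_prog.o
+ (a hypothetical name) whose program lives in a section called "xdp", the
+ program can be attached to and detached from the synthetic NIC with
+ iproute2:
+        ip link set dev eth0 xdp obj xdp_prog.o sec xdp
+        ip link set dev eth0 xdp off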
diff --git a/Documentation/networking/device_drivers/netronome/nfp.rst b/Documentation/networking/device_drivers/netronome/nfp.rst
new file mode 100644
index 000000000000..ada611fb427c
--- /dev/null
+++ b/Documentation/networking/device_drivers/netronome/nfp.rst
@@ -0,0 +1,249 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+=============================================
+Netronome Flow Processor (NFP) Kernel Drivers
+=============================================
+
+Copyright (c) 2019, Netronome Systems, Inc.
+
+Contents
+========
+
+- `Overview`_
+- `Acquiring Firmware`_
+- `Statistics`_
+
+Overview
+========
+
+This driver supports Netronome's line of Flow Processor devices,
+including the NFP4000, NFP5000, and NFP6000 models, which are also
+incorporated in the company's family of Agilio SmartNICs. The SR-IOV
+physical and virtual functions for these devices are supported by
+the driver.
+
+Acquiring Firmware
+==================
+
+The NFP4000 and NFP6000 devices require application specific firmware
+to function. Application firmware can be located either on the host file system
+or in the device flash (if supported by management firmware).
+
+Firmware files on the host filesystem contain card type (`AMDA-*` string), media
+config etc. They should be placed in `/lib/firmware/netronome` directory to
+load firmware from the host file system.
+
+Firmware for basic NIC operation is available in the upstream
+`linux-firmware.git` repository.
+
+Firmware in NVRAM
+-----------------
+
+Recent versions of management firmware support loading application
+firmware from flash when the host driver gets probed. The firmware loading
+policy configuration may be used to configure this feature appropriately.
+
+Devlink or ethtool can be used to update the application firmware on the device
+flash by providing the appropriate `nic_AMDA*.nffw` file to the respective
+command. Users need to take care to write the correct firmware image for the
+card and media configuration to flash.
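+
+A hedged sketch of both approaches, assuming the device sits at PCI address
+``0000:02:00.0``, its netdev is called ``ethX`` and the image name matches the
+card (all three are placeholders); the file name is resolved relative to
+`/lib/firmware`::
+
+  $ devlink dev flash pci/0000:02:00.0 file netronome/nic_AMDA0081-0001_1x40.nffw
+  $ ethtool -f ethX netronome/nic_AMDA0081-0001_1x40.nffw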
+
+Available storage space in flash depends on the card being used.
+
+Dealing with multiple projects
+------------------------------
+
+NFP hardware is fully programmable, therefore there can be different
+firmware images targeting different applications.
+
+When using application firmware from host, we recommend placing
+actual firmware files in application-named subdirectories in
+`/lib/firmware/netronome` and linking the desired files, e.g.::
+
+ $ tree /lib/firmware/netronome/
+ /lib/firmware/netronome/
+ ├── bpf
+ │   ├── nic_AMDA0081-0001_1x40.nffw
+ │   └── nic_AMDA0081-0001_4x10.nffw
+ ├── flower
+ │   ├── nic_AMDA0081-0001_1x40.nffw
+ │   └── nic_AMDA0081-0001_4x10.nffw
+ ├── nic
+ │   ├── nic_AMDA0081-0001_1x40.nffw
+ │   └── nic_AMDA0081-0001_4x10.nffw
+ ├── nic_AMDA0081-0001_1x40.nffw -> bpf/nic_AMDA0081-0001_1x40.nffw
+ └── nic_AMDA0081-0001_4x10.nffw -> bpf/nic_AMDA0081-0001_4x10.nffw
+
+ 3 directories, 8 files
+
+You may need to use hard instead of symbolic links on distributions
+which use old `mkinitrd` command instead of `dracut` (e.g. Ubuntu).
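+
+For illustration, switching the default images from the ``bpf`` to the
+``flower`` application in the layout shown above could look like this (a
+sketch only; adjust the file names to the card in use)::
+
+  $ cd /lib/firmware/netronome
+  $ ln -sf flower/nic_AMDA0081-0001_1x40.nffw nic_AMDA0081-0001_1x40.nffw
+  $ ln -sf flower/nic_AMDA0081-0001_4x10.nffw nic_AMDA0081-0001_4x10.nffw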
+
+After changing firmware files you may need to regenerate the initramfs
+image. Initramfs contains drivers and firmware files your system may
+need to boot. Refer to the documentation of your distribution to find
+out how to update initramfs. A good indication of a stale initramfs
+is the system loading the wrong driver or firmware on boot, while everything
+works correctly when the driver is later reloaded manually.
+
+Selecting firmware per device
+-----------------------------
+
+Most commonly all cards on the system use the same type of firmware.
+If you want to load specific firmware image for a specific card, you
+can use either the PCI bus address or serial number. Driver will print
+which files it's looking for when it recognizes a NFP device::
+
+ nfp: Looking for firmware file in order of priority:
+ nfp: netronome/serial-00-12-34-aa-bb-cc-10-ff.nffw: not found
+ nfp: netronome/pci-0000:02:00.0.nffw: not found
+ nfp: netronome/nic_AMDA0081-0001_1x40.nffw: found, loading...
+
+In this case if file (or link) called *serial-00-12-34-aa-bb-cc-10-ff.nffw*
+or *pci-0000:02:00.0.nffw* is present in `/lib/firmware/netronome` this
+firmware file will take precedence over `nic_AMDA*` files.
+
+Note that `serial-*` and `pci-*` files are **not** automatically included
+in initramfs, you will have to refer to documentation of appropriate tools
+to find out how to include them.
+
+Firmware loading policy
+-----------------------
+
+Firmware loading policy is controlled via three HWinfo parameters
+stored as key value pairs in the device flash:
+
+app_fw_from_flash
+ Defines which firmware should take precedence, 'Disk' (0), 'Flash' (1) or
+ the 'Preferred' (2) firmware. When 'Preferred' is selected, the management
+ firmware makes the decision over which firmware will be loaded by comparing
+ versions of the flash firmware and the host supplied firmware.
+ This variable is configurable using the 'fw_load_policy'
+ devlink parameter.
+
+abi_drv_reset
+ Defines if the driver should reset the firmware when
+ the driver is probed, either 'Disk' (0) if firmware was found on disk,
+ 'Always' (1) reset or 'Never' (2) reset. Note that the device is always
+ reset on driver unload if firmware was loaded when the driver was probed.
+ This variable is configurable using the 'reset_dev_on_drv_probe'
+ devlink parameter.
+
+abi_drv_load_ifc
+ Defines a list of PF devices allowed to load FW on the device.
+ This variable is not currently user configurable.
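+
+As a sketch, the two devlink-configurable parameters named above can be
+inspected as follows, assuming the device is at PCI address ``0000:02:00.0``
+(a placeholder)::
+
+  $ devlink dev param show pci/0000:02:00.0 name fw_load_policy
+  $ devlink dev param show pci/0000:02:00.0 name reset_dev_on_drv_probe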
+
+Statistics
+==========
+
+The following device statistics are available through the ``ethtool -S`` interface:
+
+.. flat-table:: NFP device statistics
+ :header-rows: 1
+ :widths: 3 1 11
+
+ * - Name
+ - ID
+ - Meaning
+
+ * - dev_rx_discards
+ - 1
+ - A packet can be discarded on the RX path for one of the following reasons:
+
+ * The NIC is not in promisc mode, and the destination MAC address
+ doesn't match the interface's MAC address.
+ * The received packet is larger than the max buffer size on the host.
+ I.e. it exceeds the Layer 3 MRU.
+ * There is no freelist descriptor available on the host for the packet.
+ It is likely that the NIC couldn't cache one in time.
+ * A BPF program discarded the packet.
+ * The datapath drop action was executed.
+ * The MAC discarded the packet due to lack of ingress buffer space
+ on the NIC.
+
+ * - dev_rx_errors
+ - 2
+ - A packet can be counted (and dropped) as RX error for the following
+ reasons:
+
+ * A problem with the VEB lookup (only when SR-IOV is used).
+ * A physical layer problem that causes Ethernet errors, like FCS or
+ alignment errors. The cause is usually faulty cables or SFPs.
+
+ * - dev_rx_bytes
+ - 3
+ - Total number of bytes received.
+
+ * - dev_rx_uc_bytes
+ - 4
+ - Unicast bytes received.
+
+ * - dev_rx_mc_bytes
+ - 5
+ - Multicast bytes received.
+
+ * - dev_rx_bc_bytes
+ - 6
+ - Broadcast bytes received.
+
+ * - dev_rx_pkts
+ - 7
+ - Total number of packets received.
+
+ * - dev_rx_mc_pkts
+ - 8
+ - Multicast packets received.
+
+ * - dev_rx_bc_pkts
+ - 9
+ - Broadcast packets received.
+
+ * - dev_tx_discards
+ - 10
+ - A packet can be discarded in the TX direction if the MAC is
+ being flow controlled and the NIC runs out of TX queue space.
+
+ * - dev_tx_errors
+ - 11
+ - A packet can be counted as TX error (and dropped) for one of the
+ following reasons:
+
+ * The packet is an LSO segment, but the Layer 3 or Layer 4 offset
+ could not be determined. Therefore LSO could not continue.
+ * An invalid packet descriptor was received over PCIe.
+ * The packet Layer 3 length exceeds the device MTU.
+ * An error on the MAC/physical layer. Usually due to faulty cables or
+ SFPs.
+ * A CTM buffer could not be allocated.
+ * The packet offset was incorrect and could not be fixed by the NIC.
+
+ * - dev_tx_bytes
+ - 12
+ - Total number of bytes transmitted.
+
+ * - dev_tx_uc_bytes
+ - 13
+ - Unicast bytes transmitted.
+
+ * - dev_tx_mc_bytes
+ - 14
+ - Multicast bytes transmitted.
+
+ * - dev_tx_bc_bytes
+ - 15
+ - Broadcast bytes transmitted.
+
+ * - dev_tx_pkts
+ - 16
+ - Total number of packets transmitted.
+
+ * - dev_tx_mc_pkts
+ - 17
+ - Multicast packets transmitted.
+
+ * - dev_tx_bc_pkts
+ - 18
+ - Broadcast packets transmitted.
+
+Note that statistics unknown to the driver will be displayed as
+``dev_unknown_stat$ID``, where ``$ID`` refers to the second column
+above.
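+
+For example, to show only the device discard and error counters of a netdev,
+here called ``ethX`` as a placeholder::
+
+  $ ethtool -S ethX | grep -E 'dev_(rx|tx)_(discards|errors)'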
diff --git a/Documentation/networking/device_drivers/pensando/ionic.rst b/Documentation/networking/device_drivers/pensando/ionic.rst
new file mode 100644
index 000000000000..c17d680cf334
--- /dev/null
+++ b/Documentation/networking/device_drivers/pensando/ionic.rst
@@ -0,0 +1,45 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+========================================================
+Linux Driver for the Pensando(R) Ethernet adapter family
+========================================================
+
+Pensando Linux Ethernet driver.
+Copyright(c) 2019 Pensando Systems, Inc
+
+Contents
+========
+
+- Identifying the Adapter
+- Support
+
+Identifying the Adapter
+=======================
+
+To find if one or more Pensando PCI Ethernet devices are installed on the
+host, check for the PCI devices::
+
+ $ lspci -d 1dd8:
+ b5:00.0 Ethernet controller: Device 1dd8:1002
+ b6:00.0 Ethernet controller: Device 1dd8:1002
+
+If such devices are listed as above, then the ionic.ko driver should find
+and configure them for use. There should be log entries in the kernel
+messages such as these::
+
+ $ dmesg | grep ionic
+ ionic Pensando Ethernet NIC Driver, ver 0.15.0-k
+ ionic 0000:b5:00.0 enp181s0: renamed from eth0
+ ionic 0000:b6:00.0 enp182s0: renamed from eth0
+
+Support
+=======
+For general Linux networking support, please use the netdev mailing
+list, which is monitored by Pensando personnel::
+
+ netdev@vger.kernel.org
+
+For more specific support needs, please use the Pensando driver support
+email::
+
+ drivers@pensando.io
diff --git a/Documentation/networking/device_drivers/stmicro/stmmac.rst b/Documentation/networking/device_drivers/stmicro/stmmac.rst
new file mode 100644
index 000000000000..c34bab3d2df0
--- /dev/null
+++ b/Documentation/networking/device_drivers/stmicro/stmmac.rst
@@ -0,0 +1,697 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+==============================================================
+Linux Driver for the Synopsys(R) Ethernet Controllers "stmmac"
+==============================================================
+
+Authors: Giuseppe Cavallaro <peppe.cavallaro@st.com>,
+Alexandre Torgue <alexandre.torgue@st.com>, Jose Abreu <joabreu@synopsys.com>
+
+Contents
+========
+
+- In This Release
+- Feature List
+- Kernel Configuration
+- Command Line Parameters
+- Driver Information and Notes
+- Debug Information
+- Support
+
+In This Release
+===============
+
+This file describes the stmmac Linux Driver for all the Synopsys(R) Ethernet
+Controllers.
+
+Currently, this network device driver is for all STi embedded MAC/GMAC
+(i.e. 7xxx/5xxx SoCs), SPEAr (arm), Loongson1B (mips) and XILINX XC2V3000
+FF1152AMT0221 D1215994A VIRTEX FPGA board. The Synopsys Ethernet QoS 5.0 IPK
+is also supported.
+
+DesignWare(R) Cores Ethernet MAC 10/100/1000 Universal version 3.70a
+(and older) and DesignWare(R) Cores Ethernet Quality-of-Service version 4.0
+(and later) have been used for developing this driver, as well as
+DesignWare(R) Cores XGMAC - 10G Ethernet MAC.
+
+This driver supports both the platform bus and PCI.
+
+This driver includes support for the following Synopsys(R) DesignWare(R)
+Cores Ethernet Controllers and corresponding minimum and maximum versions:
+
++-------------------------------+--------------+--------------+--------------+
+| Controller Name | Min. Version | Max. Version | Abbrev. Name |
++===============================+==============+==============+==============+
+| Ethernet MAC Universal | N/A | 3.73a | GMAC |
++-------------------------------+--------------+--------------+--------------+
+| Ethernet Quality-of-Service | 4.00a | N/A | GMAC4+ |
++-------------------------------+--------------+--------------+--------------+
+| XGMAC - 10G Ethernet MAC | 2.10a | N/A | XGMAC2+ |
++-------------------------------+--------------+--------------+--------------+
+
+For questions related to hardware requirements, refer to the documentation
+supplied with your Ethernet adapter. All hardware requirements listed apply
+to use with Linux.
+
+Feature List
+============
+
+The following features are available in this driver:
+ - GMII/MII/RGMII/SGMII/RMII/XGMII Interface
+ - Half-Duplex / Full-Duplex Operation
+ - Energy Efficient Ethernet (EEE)
+ - IEEE 802.3x PAUSE Packets (Flow Control)
+ - RMON/MIB Counters
+ - IEEE 1588 Timestamping (PTP)
+ - Pulse-Per-Second Output (PPS)
+ - MDIO Clause 22 / Clause 45 Interface
+ - MAC Loopback
+ - ARP Offloading
+ - Automatic CRC / PAD Insertion and Checking
+ - Checksum Offload for Received and Transmitted Packets
+ - Standard or Jumbo Ethernet Packets
+ - Source Address Insertion / Replacement
+ - VLAN TAG Insertion / Replacement / Deletion / Filtering (HASH and PERFECT)
+ - Programmable TX and RX Watchdog and Coalesce Settings
+ - Destination Address Filtering (PERFECT)
+ - HASH Filtering (Multicast)
+ - Layer 3 / Layer 4 Filtering
+ - Remote Wake-Up Detection
+ - Receive Side Scaling (RSS)
+ - Frame Preemption for TX and RX
+ - Programmable Burst Length, Threshold, Queue Size
+ - Multiple Queues (up to 8)
+ - Multiple Scheduling Algorithms (TX: WRR, DWRR, WFQ, SP, CBS, EST, TBS;
+ RX: WRR, SP)
+ - Flexible RX Parser
+ - TCP / UDP Segmentation Offload (TSO, USO)
+ - Split Header (SPH)
+ - Safety Features (ECC Protection, Data Parity Protection)
+ - Selftests using Ethtool
+
+Kernel Configuration
+====================
+
+The kernel configuration option is ``CONFIG_STMMAC_ETH``:
+ - ``CONFIG_STMMAC_PLATFORM``: is to enable the platform driver.
+ - ``CONFIG_STMMAC_PCI``: is to enable the pci driver.
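+
+A minimal ``.config`` fragment enabling the core driver together with both
+bus glue drivers might look as follows; building them as modules (``m``)
+rather than built-in (``y``) is an arbitrary choice made for this sketch::
+
+  CONFIG_STMMAC_ETH=m
+  CONFIG_STMMAC_PLATFORM=m
+  CONFIG_STMMAC_PCI=m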
+
+Command Line Parameters
+=======================
+
+If the driver is built as a module the following optional parameters are used
+by entering them on the command line with the modprobe command using this
+syntax (e.g. for PCI module)::
+
+ modprobe stmmac_pci [<option>=<VAL1>,<VAL2>,...]
+
+Driver parameters can also be passed on the kernel command line by using::
+
+ stmmaceth=watchdog:100,chain_mode:1
+
+The default value for each parameter is generally the recommended setting,
+unless otherwise noted.
+
+watchdog
+--------
+:Valid Range: 5000-None
+:Default Value: 5000
+
+This parameter overrides the transmit timeout in milliseconds.
+
+debug
+-----
+:Valid Range: 0-16 (0=none,...,16=all)
+:Default Value: 0
+
+This parameter adjusts the level of debug messages displayed in the system
+logs.
+
+phyaddr
+-------
+:Valid Range: 0-31
+:Default Value: -1
+
+This parameter overrides the physical address of the PHY device.
+
+flow_ctrl
+---------
+:Valid Range: 0-3 (0=off,1=rx,2=tx,3=rx/tx)
+:Default Value: 3
+
+This parameter changes the default Flow Control ability.
+
+pause
+-----
+:Valid Range: 0-65535
+:Default Value: 65535
+
+This parameter changes the default Flow Control Pause time.
+
+tc
+--
+:Valid Range: 64-256
+:Default Value: 64
+
+This parameter changes the default HW FIFO Threshold control value.
+
+buf_sz
+------
+:Valid Range: 1536-16384
+:Default Value: 1536
+
+This parameter changes the default RX DMA packet buffer size.
+
+eee_timer
+---------
+:Valid Range: 0-None
+:Default Value: 1000
+
+This parameter changes the default LPI TX Expiration time in milliseconds.
+
+chain_mode
+----------
+:Valid Range: 0-1 (0=off,1=on)
+:Default Value: 0
+
+This parameter changes the default mode of operation from Ring Mode to
+Chain Mode.
+
+Driver Information and Notes
+============================
+
+Transmit Process
+----------------
+
+The xmit method is invoked when the kernel needs to transmit a packet; it sets
+the descriptors in the ring and informs the DMA engine that there is a packet
+ready to be transmitted.
+
+By default, the driver sets the ``NETIF_F_SG`` bit in the features field of
+the ``net_device`` structure, enabling the scatter-gather feature. This is
+true on chips and configurations where the checksum can be done in hardware.
+
+Once the controller has finished transmitting the packet, a timer is
+scheduled to release the transmit resources.
+
+Receive Process
+---------------
+
+When one or more packets are received, an interrupt happens. The interrupts
+are not queued, so the driver has to scan all the descriptors in the ring
+during the receive process.
+
+This is based on NAPI, so the interrupt handler signals only if there is work
+to be done, and it exits. Then the poll method will be scheduled at some
+future point.
+
+The incoming packets are stored, by the DMA, in a list of pre-allocated socket
+buffers in order to avoid the memcpy (zero-copy).
+
+Interrupt Mitigation
+--------------------
+
+The driver is able to mitigate the number of its DMA interrupts using NAPI for
+the reception on chips older than 3.50. Newer chips have a HW RX Watchdog
+used for this mitigation.
+
+Mitigation parameters can be tuned by ethtool.
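+
+As a sketch, the current coalescing settings can be shown with ``ethtool -c``
+and changed with ``ethtool -C``; the value below is an arbitrary example, and
+which parameters are accepted depends on the core in use::
+
+  ethtool -c ethX
+  ethtool -C ethX rx-usecs 100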
+
+WoL
+---
+
+The Wake-on-LAN feature through Magic and Unicast frames is supported for the
+GMAC, GMAC4/5 and XGMAC cores.
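+
+For instance, using the generic ethtool Wake-on-LAN options (``g`` for Magic
+Packet, ``u`` for unicast frames, ``d`` to disable), with ``ethX`` as a
+placeholder interface name::
+
+  ethtool -s ethX wol g
+  ethtool -s ethX wol u
+  ethtool -s ethX wol d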
+
+DMA Descriptors
+---------------
+
+The driver handles both normal and alternate descriptors. The latter have only
+been tested on DesignWare(R) Cores Ethernet MAC Universal version 3.41a and later.
+
+stmmac supports DMA descriptors operating in both dual buffer (RING) and
+linked-list (CHAINED) mode. In RING mode each descriptor points to two data
+buffers, whereas in CHAINED mode it points to only one data buffer.
+RING mode is the default.
+
+In CHAINED mode each descriptor has a pointer to the next descriptor in the
+list, hence creating the explicit chaining in the descriptor itself, whereas
+such explicit chaining is not possible in RING mode.
+
+Extended Descriptors
+--------------------
+
+The extended descriptors give us information about the Ethernet payload when
+it is carrying PTP packets or TCP/UDP/ICMP over IP. These are not available on
+GMAC Synopsys(R) chips older than 3.50. At probe time the driver will
+decide if these can actually be used. This support is also mandatory for PTPv2
+because the extra descriptors are used for saving the hardware timestamps and
+Extended Status.
+
+Ethtool Support
+---------------
+
+Ethtool is supported. For example, driver statistics (including RMON) and
+internal errors can be retrieved using::
+
+ ethtool -S ethX
+
+Ethtool selftests are also supported. These allow some early sanity checks of
+the HW using MAC and PHY loopback mechanisms::
+
+ ethtool -t ethX
+
+Jumbo and Segmentation Offloading
+---------------------------------
+
+Jumbo frames are supported and tested for the GMAC. GSO has also been added,
+but it is performed in software. LRO is not supported.
+
+TSO Support
+-----------
+
+The TSO (TCP Segmentation Offload) feature is supported by the GMAC 4.x and
+newer and XGMAC chip families. When a packet is sent through the TCP protocol,
+the TCP stack ensures that the SKB provided to the low-level driver (stmmac in
+our case) matches the maximum frame length (IP header + TCP header + payload
+<= 1500 bytes for an MTU of 1500). This means that if an application using TCP
+wants to send a packet whose length (after adding headers) exceeds 1514 bytes,
+the packet is split into several TCP packets: the data payload is split and
+headers (TCP/IP ..) are added. This is done in software.
+
+When TSO is enabled, the TCP stack does not care about the maximum frame
+length and provides the SKB to stmmac as it is. The GMAC IP then has to
+perform the segmentation by itself to match the maximum frame length.
+
+This feature can be enabled in device tree through ``snps,tso`` entry.
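+
+Once the feature is enabled, its state can be checked and toggled at run-time
+with the generic ethtool offload options. This is only a sketch: ``ethX`` is a
+placeholder, and toggling works only if the core actually supports TSO::
+
+    # report whether TCP segmentation offload is currently on
+    ethtool -k ethX | grep tcp-segmentation-offload
+    # switch it off and back on
+    ethtool -K ethX tso off
+    ethtool -K ethX tso on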
+
+Energy Efficient Ethernet
+-------------------------
+
+Energy Efficient Ethernet (EEE) enables the IEEE 802.3 MAC sublayer along with
+a family of Physical layers to operate in the Low Power Idle (LPI) mode. The
+EEE mode supports the IEEE 802.3 MAC operation at 100Mbps, 1000Mbps and 10Gbps.
+
+The LPI mode allows power saving by switching off parts of the communication
+device functionality when there is no data to be transmitted or received.
+The systems on both sides of the link can disable some functionality and
+save power during periods of low link utilization. The MAC controls whether
+the system should enter or exit the LPI mode and communicates this to the PHY.
+
+As soon as the interface is opened, the driver verifies whether EEE can be
+supported. This is done by looking at both the DMA HW capability register and
+the PHY device's MMD registers.
+
+To enter TX LPI mode, the driver needs a software timer that enables and
+disables the LPI mode when there is nothing to be transmitted.
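+
+As a sketch, the EEE state and the TX LPI timer can be inspected and adjusted
+with the generic ethtool EEE options, assuming both the MAC and the PHY support
+EEE. The timer value below is only an example::
+
+    ethtool --show-eee ethX
+    ethtool --set-eee ethX eee on tx-lpi on tx-timer 1000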
+
+Precision Time Protocol (PTP)
+-----------------------------
+
+The driver supports the IEEE 1588-2002, Precision Time Protocol (PTP), which
+enables precise synchronization of clocks in measurement and control systems
+implemented with technologies such as network communication.
+
+In addition to the basic timestamp features described in IEEE 1588-2002,
+new GMAC cores support the advanced timestamp features.
+IEEE 1588-2008 support can be enabled when configuring the Kernel.
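+
+The timestamping capabilities actually exposed by an interface, and the PTP
+hardware clock it is bound to, can be listed with ethtool. This is only a
+sketch, and the output depends on the core::
+
+    ethtool -T ethX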
+
+SGMII/RGMII Support
+-------------------
+
+New GMAC devices provide their own way to manage RGMII/SGMII. This information
+is available at run-time by looking at the HW capability register. This means
+that stmmac can manage auto-negotiation and link status without using PHYLIB.
+In fact, the HW provides a subset of extended registers to restart the ANE and
+verify Full/Half duplex mode and Speed. Thanks to these registers, it is
+possible to look at the Auto-negotiated Link Partner Ability.
+
+Physical
+--------
+
+The driver is compatible with the Physical Abstraction Layer, so it can be
+connected to PHY and GPHY devices.
+
+Platform Information
+--------------------
+
+Several pieces of information can be passed through the platform and
+device-tree.
+
+::
+
+ struct plat_stmmacenet_data {
+
+1) Bus identifier::
+
+ int bus_id;
+
+2) PHY Physical Address. If set to -1 the driver will pick the first PHY it
+finds::
+
+ int phy_addr;
+
+3) PHY Device Interface::
+
+ int interface;
+
+4) Specific platform fields for the MDIO bus::
+
+ struct stmmac_mdio_bus_data *mdio_bus_data;
+
+5) Internal DMA parameters::
+
+ struct stmmac_dma_cfg *dma_cfg;
+
+6) Fixed CSR Clock Range selection::
+
+ int clk_csr;
+
+7) HW uses the GMAC core::
+
+ int has_gmac;
+
+8) If set the MAC will use Enhanced Descriptors::
+
+ int enh_desc;
+
+9) Core is able to perform TX Checksum and/or RX Checksum in HW::
+
+ int tx_coe;
+ int rx_coe;
+
+11) Some HWs are not able to perform the csum in HW for over-sized frames due
+to limited buffer sizes. When this flag is set, the csum is done in SW on
+JUMBO frames::
+
+ int bugged_jumbo;
+
+12) Core has the embedded power module::
+
+ int pmt;
+
+13) Force DMA to use the Store and Forward mode or Threshold mode::
+
+ int force_sf_dma_mode;
+ int force_thresh_dma_mode;
+
+15) Force to disable the RX Watchdog feature and switch to NAPI mode::
+
+ int riwt_off;
+
+16) Limit the maximum operating speed and MTU::
+
+ int max_speed;
+ int maxmtu;
+
+18) Number of Multicast/Unicast filters::
+
+ int multicast_filter_bins;
+ int unicast_filter_entries;
+
+20) Limit the maximum TX and RX FIFO size::
+
+ int tx_fifo_size;
+ int rx_fifo_size;
+
+21) Use the specified number of TX and RX Queues::
+
+ u32 rx_queues_to_use;
+ u32 tx_queues_to_use;
+
+22) Use the specified TX and RX scheduling algorithm::
+
+ u8 rx_sched_algorithm;
+ u8 tx_sched_algorithm;
+
+23) Internal TX and RX Queue parameters::
+
+ struct stmmac_rxq_cfg rx_queues_cfg[MTL_MAX_RX_QUEUES];
+ struct stmmac_txq_cfg tx_queues_cfg[MTL_MAX_TX_QUEUES];
+
+24) This callback is used for modifying some syscfg registers (on ST SoCs)
+according to the link speed negotiated by the physical layer::
+
+ void (*fix_mac_speed)(void *priv, unsigned int speed);
+
+25) Callbacks used for calling a custom initialization. This is sometimes
+necessary on some platforms (e.g. ST boxes) where the HW needs some PIO lines
+or system cfg registers to be set. init/exit callbacks should not use or
+modify platform data::
+
+ int (*init)(struct platform_device *pdev, void *priv);
+ void (*exit)(struct platform_device *pdev, void *priv);
+
+26) Perform HW setup of the bus. For example, on some ST platforms this field
+is used to configure the AMBA bridge to generate more efficient STBus traffic::
+
+ struct mac_device_info *(*setup)(void *priv);
+ void *bsp_priv;
+
+27) Internal clocks and rates::
+
+ struct clk *stmmac_clk;
+ struct clk *pclk;
+ struct clk *clk_ptp_ref;
+ unsigned int clk_ptp_rate;
+ unsigned int clk_ref_rate;
+ s32 ptp_max_adj;
+
+28) Main reset::
+
+ struct reset_control *stmmac_rst;
+
+29) AXI Internal Parameters::
+
+ struct stmmac_axi *axi;
+
+30) HW uses GMAC>4 cores::
+
+ int has_gmac4;
+
+31) HW is sun8i based::
+
+ bool has_sun8i;
+
+32) Enables TSO feature::
+
+ bool tso_en;
+
+33) Enables Receive Side Scaling (RSS) feature::
+
+ int rss_en;
+
+34) MAC Port selection::
+
+ int mac_port_sel_speed;
+
+35) Enables TX LPI Clock Gating::
+
+ bool en_tx_lpi_clockgating;
+
+36) HW uses XGMAC>2.10 cores::
+
+ int has_xgmac;
+
+::
+
+ }
+
+For MDIO bus data, we have:
+
+::
+
+ struct stmmac_mdio_bus_data {
+
+1) PHY mask passed when MDIO bus is registered::
+
+ unsigned int phy_mask;
+
+2) List of IRQs, one per PHY::
+
+ int *irqs;
+
+3) If IRQs is NULL, use this for probed PHY::
+
+ int probed_phy_irq;
+
+4) Set to true if PHY needs reset::
+
+ bool needs_reset;
+
+::
+
+ }
+
+For DMA engine configuration, we have:
+
+::
+
+ struct stmmac_dma_cfg {
+
+1) Programmable Burst Length (TX and RX)::
+
+ int pbl;
+
+2) If set, DMA TX / RX will use this value rather than pbl::
+
+ int txpbl;
+ int rxpbl;
+
+3) Enable 8xPBL::
+
+ bool pblx8;
+
+4) Enable Fixed or Mixed burst::
+
+ int fixed_burst;
+ int mixed_burst;
+
+5) Enable Address Aligned Beats::
+
+ bool aal;
+
+6) Enable Enhanced Addressing (> 32 bits)::
+
+ bool eame;
+
+::
+
+ }
+
+For DMA AXI parameters, we have:
+
+::
+
+ struct stmmac_axi {
+
+1) Enable AXI LPI::
+
+ bool axi_lpi_en;
+ bool axi_xit_frm;
+
+2) Set AXI Write / Read maximum outstanding requests::
+
+ u32 axi_wr_osr_lmt;
+ u32 axi_rd_osr_lmt;
+
+3) Set AXI 4KB bursts::
+
+ bool axi_kbbe;
+
+4) Set AXI maximum burst length map::
+
+ u32 axi_blen[AXI_BLEN];
+
+5) Set AXI Fixed burst / mixed burst::
+
+ bool axi_fb;
+ bool axi_mb;
+
+6) Set AXI rebuild incrx mode::
+
+ bool axi_rb;
+
+::
+
+ }
+
+For the RX Queues configuration, we have:
+
+::
+
+ struct stmmac_rxq_cfg {
+
+1) Mode to use (DCB or AVB)::
+
+ u8 mode_to_use;
+
+2) DMA channel to use::
+
+ u32 chan;
+
+3) Packet routing, if applicable::
+
+ u8 pkt_route;
+
+4) Use priority routing, and priority to route::
+
+ bool use_prio;
+ u32 prio;
+
+::
+
+ }
+
+For the TX Queues configuration, we have:
+
+::
+
+ struct stmmac_txq_cfg {
+
+1) Queue weight in scheduler::
+
+ u32 weight;
+
+2) Mode to use (DCB or AVB)::
+
+ u8 mode_to_use;
+
+3) Credit-Based Shaper parameters::
+
+ u32 send_slope;
+ u32 idle_slope;
+ u32 high_credit;
+ u32 low_credit;
+
+4) Use priority scheduling, and priority::
+
+ bool use_prio;
+ u32 prio;
+
+::
+
+ }
+
+Device Tree Information
+-----------------------
+
+Please refer to the following document:
+Documentation/devicetree/bindings/net/snps,dwmac.yaml
+
+HW Capabilities
+---------------
+
+Note that on new chips, where the HW capability register is available, many
+configurations are discovered at run-time, for example to determine whether
+EEE, HW csum, PTP, enhanced descriptors etc. are actually available. As a
+strategy adopted in this driver, the information from the HW capability
+register can replace what has been passed from the platform.
+
+Debug Information
+=================
+
+The driver exports a lot of information, e.g. internal statistics, debug
+information, MAC and DMA registers etc.
+
+These can be read in several ways depending on the type of information
+actually needed.
+
+For example, a user can use the ethtool support to get statistics, e.g.
+using ``ethtool -S ethX`` (which shows the Management counters (MMC) if
+supported), or to see the MAC/DMA registers, e.g. using ``ethtool -d ethX``.
+
+When the Kernel is compiled with ``CONFIG_DEBUG_FS``, the driver exports the
+following debugfs entries:
+
+ - ``descriptors_status``: To show the DMA TX/RX descriptor rings
+ - ``dma_cap``: To show the HW Capabilities
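+
+A sketch of reading these entries, assuming debugfs is mounted on
+``/sys/kernel/debug`` and that the entries live in a per-interface directory
+under ``stmmaceth`` (the exact path is an assumption and may differ between
+kernel versions)::
+
+    mount -t debugfs none /sys/kernel/debug
+    cat /sys/kernel/debug/stmmaceth/ethX/dma_cap
+    cat /sys/kernel/debug/stmmaceth/ethX/descriptors_status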
+
+Developers can also use the ``debug`` module parameter to get further debug
+information (please see: NETIF Msg Level).
+
+Support
+=======
+
+If an issue is identified with the released source code on a supported kernel
+with a supported adapter, email the specific information related to the
+issue to netdev@vger.kernel.org
diff --git a/Documentation/networking/device_drivers/stmicro/stmmac.txt b/Documentation/networking/device_drivers/stmicro/stmmac.txt
deleted file mode 100644
index 1ae979fd90d2..000000000000
--- a/Documentation/networking/device_drivers/stmicro/stmmac.txt
+++ /dev/null
@@ -1,401 +0,0 @@
- STMicroelectronics 10/100/1000 Synopsys Ethernet driver
-
-Copyright (C) 2007-2015 STMicroelectronics Ltd
-Author: Giuseppe Cavallaro <peppe.cavallaro@st.com>
-
-This is the driver for the MAC 10/100/1000 on-chip Ethernet controllers
-(Synopsys IP blocks).
-
-Currently this network device driver is for all STi embedded MAC/GMAC
-(i.e. 7xxx/5xxx SoCs), SPEAr (arm), Loongson1B (mips) and XLINX XC2V3000
-FF1152AMT0221 D1215994A VIRTEX FPGA board.
-
-DWC Ether MAC 10/100/1000 Universal version 3.70a (and older) and DWC Ether
-MAC 10/100 Universal version 4.0 have been used for developing this driver.
-
-This driver supports both the platform bus and PCI.
-
-Please, for more information also visit: www.stlinux.com
-
-1) Kernel Configuration
-The kernel configuration option is STMMAC_ETH:
- Device Drivers ---> Network device support ---> Ethernet (1000 Mbit) --->
- STMicroelectronics 10/100/1000 Ethernet driver (STMMAC_ETH)
-
-CONFIG_STMMAC_PLATFORM: is to enable the platform driver.
-CONFIG_STMMAC_PCI: is to enable the pci driver.
-
-2) Driver parameters list:
- debug: message level (0: no output, 16: all);
- phyaddr: to manually provide the physical address to the PHY device;
- buf_sz: DMA buffer size;
- tc: control the HW FIFO threshold;
- watchdog: transmit timeout (in milliseconds);
- flow_ctrl: Flow control ability [on/off];
- pause: Flow Control Pause Time;
- eee_timer: tx EEE timer;
- chain_mode: select chain mode instead of ring.
-
-3) Command line options
-Driver parameters can be also passed in command line by using:
- stmmaceth=watchdog:100,chain_mode=1
-
-4) Driver information and notes
-
-4.1) Transmit process
-The xmit method is invoked when the kernel needs to transmit a packet; it sets
-the descriptors in the ring and informs the DMA engine, that there is a packet
-ready to be transmitted.
-By default, the driver sets the NETIF_F_SG bit in the features field of the
-net_device structure, enabling the scatter-gather feature. This is true on
-chips and configurations where the checksum can be done in hardware.
-Once the controller has finished transmitting the packet, timer will be
-scheduled to release the transmit resources.
-
-4.2) Receive process
-When one or more packets are received, an interrupt happens. The interrupts
-are not queued, so the driver has to scan all the descriptors in the ring during
-the receive process.
-This is based on NAPI, so the interrupt handler signals only if there is work
-to be done, and it exits.
-Then the poll method will be scheduled at some future point.
-The incoming packets are stored, by the DMA, in a list of pre-allocated socket
-buffers in order to avoid the memcpy (zero-copy).
-
-4.3) Interrupt mitigation
-The driver is able to mitigate the number of its DMA interrupts
-using NAPI for the reception on chips older than the 3.50.
-New chips have an HW RX-Watchdog used for this mitigation.
-Mitigation parameters can be tuned by ethtool.
-
-4.4) WOL
-Wake up on Lan feature through Magic and Unicast frames are supported for the
-GMAC core.
-
-4.5) DMA descriptors
-Driver handles both normal and alternate descriptors. The latter has been only
-tested on DWC Ether MAC 10/100/1000 Universal version 3.41a and later.
-
-STMMAC supports DMA descriptor to operate both in dual buffer (RING)
-and linked-list(CHAINED) mode. In RING each descriptor points to two
-data buffer pointers whereas in CHAINED mode they point to only one data
-buffer pointer. RING mode is the default.
-
-In CHAINED mode each descriptor will have pointer to next descriptor in
-the list, hence creating the explicit chaining in the descriptor itself,
-whereas such explicit chaining is not possible in RING mode.
-
-4.5.1) Extended descriptors
-The extended descriptors give us information about the Ethernet payload
-when it is carrying PTP packets or TCP/UDP/ICMP over IP.
-These are not available on GMAC Synopsys chips older than the 3.50.
-At probe time the driver will decide if these can be actually used.
-This support also is mandatory for PTPv2 because the extra descriptors
-are used for saving the hardware timestamps and Extended Status.
-
-4.6) Ethtool support
-Ethtool is supported.
-
-For example, driver statistics (including RMON), internal errors can be taken
-using:
- # ethtool -S ethX
-command
-
-4.7) Jumbo and Segmentation Offloading
-Jumbo frames are supported and tested for the GMAC.
-The GSO has been also added but it's performed in software.
-LRO is not supported.
-
-4.8) Physical
-The driver is compatible with Physical Abstraction Layer to be connected with
-PHY and GPHY devices.
-
-4.9) Platform information
-Several information can be passed through the platform and device-tree.
-
-struct plat_stmmacenet_data {
- char *phy_bus_name;
- int bus_id;
- int phy_addr;
- int interface;
- struct stmmac_mdio_bus_data *mdio_bus_data;
- struct stmmac_dma_cfg *dma_cfg;
- int clk_csr;
- int has_gmac;
- int enh_desc;
- int tx_coe;
- int rx_coe;
- int bugged_jumbo;
- int pmt;
- int force_sf_dma_mode;
- int force_thresh_dma_mode;
- int riwt_off;
- int max_speed;
- int maxmtu;
- void (*fix_mac_speed)(void *priv, unsigned int speed);
- void (*bus_setup)(void __iomem *ioaddr);
- int (*init)(struct platform_device *pdev, void *priv);
- void (*exit)(struct platform_device *pdev, void *priv);
- void *bsp_priv;
- int has_gmac4;
- bool tso_en;
-};
-
-Where:
- o phy_bus_name: phy bus name to attach to the stmmac.
- o bus_id: bus identifier.
- o phy_addr: the physical address can be passed from the platform.
- If it is set to -1 the driver will automatically
- detect it at run-time by probing all the 32 addresses.
- o interface: PHY device's interface.
- o mdio_bus_data: specific platform fields for the MDIO bus.
- o dma_cfg: internal DMA parameters
- o pbl: the Programmable Burst Length is maximum number of beats to
- be transferred in one DMA transaction.
- GMAC also enables the 4xPBL by default. (8xPBL for GMAC 3.50 and newer)
- o txpbl/rxpbl: GMAC and newer supports independent DMA pbl for tx/rx.
- o pblx8: Enable 8xPBL (4xPBL for core rev < 3.50). Enabled by default.
- o fixed_burst/mixed_burst/aal
- o clk_csr: fixed CSR Clock range selection.
- o has_gmac: uses the GMAC core.
- o enh_desc: if sets the MAC will use the enhanced descriptor structure.
- o tx_coe: core is able to perform the tx csum in HW.
- o rx_coe: the supports three check sum offloading engine types:
- type_1, type_2 (full csum) and no RX coe.
- o bugged_jumbo: some HWs are not able to perform the csum in HW for
- over-sized frames due to limited buffer sizes.
- Setting this flag the csum will be done in SW on
- JUMBO frames.
- o pmt: core has the embedded power module (optional).
- o force_sf_dma_mode: force DMA to use the Store and Forward mode
- instead of the Threshold.
- o force_thresh_dma_mode: force DMA to use the Threshold mode other than
- the Store and Forward mode.
- o riwt_off: force to disable the RX watchdog feature and switch to NAPI mode.
- o fix_mac_speed: this callback is used for modifying some syscfg registers
- (on ST SoCs) according to the link speed negotiated by the
- physical layer .
- o bus_setup: perform HW setup of the bus. For example, on some ST platforms
- this field is used to configure the AMBA bridge to generate more
- efficient STBus traffic.
- o init/exit: callbacks used for calling a custom initialization;
- this is sometime necessary on some platforms (e.g. ST boxes)
- where the HW needs to have set some PIO lines or system cfg
- registers. init/exit callbacks should not use or modify
- platform data.
- o bsp_priv: another private pointer.
- o has_gmac4: uses GMAC4 core.
- o tso_en: Enables TSO (TCP Segmentation Offload) feature.
-
-For MDIO bus The we have:
-
- struct stmmac_mdio_bus_data {
- int (*phy_reset)(void *priv);
- unsigned int phy_mask;
- int *irqs;
- int probed_phy_irq;
- };
-
-Where:
- o phy_reset: hook to reset the phy device attached to the bus.
- o phy_mask: phy mask passed when register the MDIO bus within the driver.
- o irqs: list of IRQs, one per PHY.
- o probed_phy_irq: if irqs is NULL, use this for probed PHY.
-
-For DMA engine we have the following internal fields that should be
-tuned according to the HW capabilities.
-
-struct stmmac_dma_cfg {
- int pbl;
- int txpbl;
- int rxpbl;
- bool pblx8;
- int fixed_burst;
- int mixed_burst;
- bool aal;
-};
-
-Where:
- o pbl: Programmable Burst Length (tx and rx)
- o txpbl: Transmit Programmable Burst Length. Only for GMAC and newer.
- If set, DMA tx will use this value rather than pbl.
- o rxpbl: Receive Programmable Burst Length. Only for GMAC and newer.
- If set, DMA rx will use this value rather than pbl.
- o pblx8: Enable 8xPBL (4xPBL for core rev < 3.50). Enabled by default.
- o fixed_burst: program the DMA to use the fixed burst mode
- o mixed_burst: program the DMA to use the mixed burst mode
- o aal: Address-Aligned Beats
-
----
-
-Below an example how the structures above are using on ST platforms.
-
- static struct plat_stmmacenet_data stxYYY_ethernet_platform_data = {
- .has_gmac = 0,
- .enh_desc = 0,
- .fix_mac_speed = stxYYY_ethernet_fix_mac_speed,
- |
- |-> to write an internal syscfg
- | on this platform when the
- | link speed changes from 10 to
- | 100 and viceversa
- .init = &stmmac_claim_resource,
- |
- |-> On ST SoC this calls own "PAD"
- | manager framework to claim
- | all the resources necessary
- | (GPIO ...). The .custom_cfg field
- | is used to pass a custom config.
-};
-
-Below the usage of the stmmac_mdio_bus_data: on this SoC, in fact,
-there are two MAC cores: one MAC is for MDIO Bus/PHY emulation
-with fixed_link support.
-
-static struct stmmac_mdio_bus_data stmmac1_mdio_bus = {
- .phy_reset = phy_reset;
- |
- |-> function to provide the phy_reset on this board
- .phy_mask = 0,
-};
-
-static struct fixed_phy_status stmmac0_fixed_phy_status = {
- .link = 1,
- .speed = 100,
- .duplex = 1,
-};
-
-During the board's device_init we can configure the first
-MAC for fixed_link by calling:
- fixed_phy_add(PHY_POLL, 1, &stmmac0_fixed_phy_status);
-and the second one, with a real PHY device attached to the bus,
-by using the stmmac_mdio_bus_data structure (to provide the id, the
-reset procedure etc).
-
-Note that, starting from new chips, where it is available the HW capability
-register, many configurations are discovered at run-time for example to
-understand if EEE, HW csum, PTP, enhanced descriptor etc are actually
-available. As strategy adopted in this driver, the information from the HW
-capability register can replace what has been passed from the platform.
-
-4.10) Device-tree support.
-
-Please see the following document:
- Documentation/devicetree/bindings/net/stmmac.txt
-
-4.11) This is a summary of the content of some relevant files:
- o stmmac_main.c: implements the main network device driver;
- o stmmac_mdio.c: provides MDIO functions;
- o stmmac_pci: this is the PCI driver;
- o stmmac_platform.c: this the platform driver (OF supported);
- o stmmac_ethtool.c: implements the ethtool support;
- o stmmac.h: private driver structure;
- o common.h: common definitions and VFTs;
- o mmc_core.c/mmc.h: Management MAC Counters;
- o stmmac_hwtstamp.c: HW timestamp support for PTP;
- o stmmac_ptp.c: PTP 1588 clock;
- o stmmac_pcs.h: Physical Coding Sublayer common implementation;
- o dwmac-<XXX>.c: these are for the platform glue-logic file; e.g. dwmac-sti.c
- for STMicroelectronics SoCs.
-
-- GMAC 3.x
- o descs.h: descriptor structure definitions;
- o dwmac1000_core.c: dwmac GiGa core functions;
- o dwmac1000_dma.c: dma functions for the GMAC chip;
- o dwmac1000.h: specific header file for the dwmac GiGa;
- o dwmac100_core: dwmac 100 core code;
- o dwmac100_dma.c: dma functions for the dwmac 100 chip;
- o dwmac1000.h: specific header file for the MAC;
- o dwmac_lib.c: generic DMA functions;
- o enh_desc.c: functions for handling enhanced descriptors;
- o norm_desc.c: functions for handling normal descriptors;
- o chain_mode.c/ring_mode.c:: functions to manage RING/CHAINED modes;
-
-- GMAC4.x generation
- o dwmac4_core.c: dwmac GMAC4.x core functions;
- o dwmac4_desc.c: functions for handling GMAC4.x descriptors;
- o dwmac4_descs.h: descriptor definitions;
- o dwmac4_dma.c: dma functions for the GMAC4.x chip;
- o dwmac4_dma.h: dma definitions for the GMAC4.x chip;
- o dwmac4.h: core definitions for the GMAC4.x chip;
- o dwmac4_lib.c: generic GMAC4.x functions;
-
-4.12) TSO support (GMAC4.x)
-
-TSO (Tcp Segmentation Offload) feature is supported by GMAC 4.x chip family.
-When a packet is sent through TCP protocol, the TCP stack ensures that
-the SKB provided to the low level driver (stmmac in our case) matches with
-the maximum frame len (IP header + TCP header + payload <= 1500 bytes (for
-MTU set to 1500)). It means that if an application using TCP want to send a
-packet which will have a length (after adding headers) > 1514 the packet
-will be split in several TCP packets: The data payload is split and headers
-(TCP/IP ..) are added. It is done by software.
-
-When TSO is enabled, the TCP stack doesn't care about the maximum frame
-length and provide SKB packet to stmmac as it is. The GMAC IP will have to
-perform the segmentation by it self to match with maximum frame length.
-
-This feature can be enabled in device tree through "snps,tso" entry.
-
-5) Debug Information
-
-The driver exports many information i.e. internal statistics,
-debug information, MAC and DMA registers etc.
-
-These can be read in several ways depending on the
-type of the information actually needed.
-
-For example a user can be use the ethtool support
-to get statistics: e.g. using: ethtool -S ethX
-(that shows the Management counters (MMC) if supported)
-or sees the MAC/DMA registers: e.g. using: ethtool -d ethX
-
-Compiling the Kernel with CONFIG_DEBUG_FS the driver will export the following
-debugfs entries:
-
-/sys/kernel/debug/stmmaceth/descriptors_status
- To show the DMA TX/RX descriptor rings
-
-Developer can also use the "debug" module parameter to get further debug
-information (please see: NETIF Msg Level).
-
-6) Energy Efficient Ethernet
-
-Energy Efficient Ethernet(EEE) enables IEEE 802.3 MAC sublayer along
-with a family of Physical layer to operate in the Low power Idle(LPI)
-mode. The EEE mode supports the IEEE 802.3 MAC operation at 100Mbps,
-1000Mbps & 10Gbps.
-
-The LPI mode allows power saving by switching off parts of the
-communication device functionality when there is no data to be
-transmitted & received. The system on both the side of the link can
-disable some functionalities & save power during the period of low-link
-utilization. The MAC controls whether the system should enter or exit
-the LPI mode & communicate this to PHY.
-
-As soon as the interface is opened, the driver verifies if the EEE can
-be supported. This is done by looking at both the DMA HW capability
-register and the PHY devices MCD registers.
-To enter in Tx LPI mode the driver needs to have a software timer
-that enable and disable the LPI mode when there is nothing to be
-transmitted.
-
-7) Precision Time Protocol (PTP)
-The driver supports the IEEE 1588-2002, Precision Time Protocol (PTP),
-which enables precise synchronization of clocks in measurement and
-control systems implemented with technologies such as network
-communication.
-
-In addition to the basic timestamp features mentioned in IEEE 1588-2002
-Timestamps, new GMAC cores support the advanced timestamp features.
-IEEE 1588-2008 that can be enabled when configure the Kernel.
-
-8) SGMII/RGMII support
-New GMAC devices provide own way to manage RGMII/SGMII.
-This information is available at run-time by looking at the
-HW capability register. This means that the stmmac can manage
-auto-negotiation and link status w/o using the PHYLIB stuff.
-In fact, the HW provides a subset of extended registers to
-restart the ANE, verify Full/Half duplex mode and Speed.
-Thanks to these registers, it is possible to look at the
-Auto-negotiated Link Parter Ability.
diff --git a/Documentation/networking/device_drivers/ti/cpsw_switchdev.txt b/Documentation/networking/device_drivers/ti/cpsw_switchdev.txt
new file mode 100644
index 000000000000..12855ab268b8
--- /dev/null
+++ b/Documentation/networking/device_drivers/ti/cpsw_switchdev.txt
@@ -0,0 +1,209 @@
+* Texas Instruments CPSW switchdev based ethernet driver 2.0
+
+- Port renaming
+On older udev versions, renaming of ethX to swXpY is not automatically
+supported.
+In order to rename via udev:
+ip -d link show dev sw0p1 | grep switchid
+
+SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}==<switchid>, \
+ ATTR{phys_port_name}!="", NAME="sw0$attr{phys_port_name}"
+
+
+====================
+# Dual mac mode
+====================
+- The new (cpsw_new.c) driver operates in dual-emac mode by default, thus
+working as 2 individual network interfaces. The main differences from the
+legacy CPSW driver are:
+ - optimized promiscuous mode: P0_UNI_FLOOD (both ports) is enabled in
+addition to ALLMULTI (current port) instead of ALE_BYPASS.
+So, ports in promiscuous mode keep the possibility of mcast and vlan filtering,
+which provides significant benefits when ports are joined to the same bridge,
+but without enabling "switch" mode, or to different bridges.
+ - learning is disabled on ports, as it does not make much sense for
+ segregated ports - no forwarding in HW.
+ - basic support for devlink is enabled.
+
+ devlink dev show
+ platform/48484000.switch
+
+ devlink dev param show
+ platform/48484000.switch:
+ name switch_mode type driver-specific
+ values:
+ cmode runtime value false
+ name ale_bypass type driver-specific
+ values:
+ cmode runtime value false
+
+Devlink configuration parameters
+====================
+See Documentation/networking/devlink/ti-cpsw-switch.rst
+
+====================
+# Bridging in dual mac mode
+====================
+The dual_mac mode requires two vids to be reserved for internal purposes,
+which, by default, equal the CPSW Port numbers. As a result, the bridge has to
+be configured in vlan unaware mode or default_pvid has to be adjusted.
+
+ ip link add name br0 type bridge
+ ip link set dev br0 type bridge vlan_filtering 0
+ echo 0 > /sys/class/net/br0/bridge/default_pvid
+ ip link set dev sw0p1 master br0
+ ip link set dev sw0p2 master br0
+ - or -
+ ip link add name br0 type bridge
+ ip link set dev br0 type bridge vlan_filtering 0
+ echo 100 > /sys/class/net/br0/bridge/default_pvid
+ ip link set dev br0 type bridge vlan_filtering 1
+ ip link set dev sw0p1 master br0
+ ip link set dev sw0p2 master br0
+
+====================
+# Enabling "switch"
+====================
+Switch mode can be enabled by setting the devlink driver parameter
+"switch_mode" to 1/true:
+ devlink dev param set platform/48484000.switch \
+ name switch_mode value 1 cmode runtime
+
+This can be done regardless of the state of the Port's netdev devices - UP/DOWN,
+but the Port's netdev devices have to be UP before joining the bridge to avoid
+overwriting the bridge configuration, as the CPSW switch driver completely
+reloads its configuration when the first Port changes its state to UP.
+
+When both interfaces have joined the bridge, the CPSW switch driver will enable
+marking packets with the offload_fwd_mark flag unless "ale_bypass=0"
+
+All configuration is implemented via switchdev API.
+
+====================
+# Bridge setup
+====================
+ devlink dev param set platform/48484000.switch \
+ name switch_mode value 1 cmode runtime
+
+ ip link add name br0 type bridge
+ ip link set dev br0 type bridge ageing_time 1000
+ ip link set dev sw0p1 up
+ ip link set dev sw0p2 up
+ ip link set dev sw0p1 master br0
+ ip link set dev sw0p2 master br0
+ [*] bridge vlan add dev br0 vid 1 pvid untagged self
+
+[*] if vlan_filtering=1. where default_pvid=1
+
+=================
+# On/off STP
+=================
+ip link set dev BRDEV type bridge stp_state 1/0
+
+Note. Steps [*] are mandatory.
+
+====================
+# VLAN configuration
+====================
+bridge vlan add dev br0 vid 1 pvid untagged self <---- add cpu port to VLAN 1
+
+Note. This step is mandatory for bridge/default_pvid.
+
+=================
+# Add extra VLANs
+=================
+ 1. untagged:
+ bridge vlan add dev sw0p1 vid 100 pvid untagged master
+ bridge vlan add dev sw0p2 vid 100 pvid untagged master
+ bridge vlan add dev br0 vid 100 pvid untagged self <---- Add cpu port to VLAN100
+
+ 2. tagged:
+ bridge vlan add dev sw0p1 vid 100 master
+ bridge vlan add dev sw0p2 vid 100 master
+ bridge vlan add dev br0 vid 100 pvid tagged self <---- Add cpu port to VLAN100
+
+====
+FDBs
+====
+FDBs are automatically added on the appropriate switch port upon detection
+
+Manually adding FDBs:
+bridge fdb add aa:bb:cc:dd:ee:ff dev sw0p1 master vlan 100
+bridge fdb add aa:bb:cc:dd:ee:fe dev sw0p2 master <---- Add on all VLANs
+
+====
+MDBs
+====
+MDBs are automatically added on the appropriate switch port upon detection
+
+Manually adding MDBs:
+bridge mdb add dev br0 port sw0p1 grp 239.1.1.1 permanent vid 100
+bridge mdb add dev br0 port sw0p1 grp 239.1.1.1 permanent <---- Add on all VLANs
+
+==================
+Multicast flooding
+==================
+CPU port mcast_flooding is always on
+
+Turning flooding on/off on switch ports:
+bridge link set dev sw0p1 mcast_flood on/off
+
+==================
+Access and Trunk port
+==================
+ bridge vlan add dev sw0p1 vid 100 pvid untagged master
+ bridge vlan add dev sw0p2 vid 100 master
+
+
+ bridge vlan add dev br0 vid 100 self
+ ip link add link br0 name br0.100 type vlan id 100
+
+ Note. Setting PVID on the bridge device itself works only for the
+ default VLAN (default_pvid).
+
+=====================
+ NFS
+=====================
+The only way for NFS to work is by chrooting to a minimal environment when
+switch configuration that will affect connectivity is needed.
+The example below assumes you are booting over NFS with the eth1 interface
+(the script is hacky and is only there to prove that NFS is doable).
+
+setup.sh:
+#!/bin/sh
+mkdir proc
+mount -t proc none /proc
+ifconfig br0 > /dev/null
+if [ $? -ne 0 ]; then
+ echo "Setting up bridge"
+ ip link add name br0 type bridge
+ ip link set dev br0 type bridge ageing_time 1000
+ ip link set dev br0 type bridge vlan_filtering 1
+
+ ip link set eth1 down
+ ip link set eth1 name sw0p1
+ ip link set dev sw0p1 up
+ ip link set dev sw0p2 up
+ ip link set dev sw0p2 master br0
+ ip link set dev sw0p1 master br0
+ bridge vlan add dev br0 vid 1 pvid untagged self
+ ifconfig sw0p1 0.0.0.0
+ udhcpc -i br0
+fi
+umount /proc
+
+run_nfs.sh:
+#!/bin/sh
+mkdir /tmp/root/bin -p
+mkdir /tmp/root/lib -p
+
+cp -r /lib/ /tmp/root/
+cp -r /bin/ /tmp/root/
+cp /sbin/ip /tmp/root/bin
+cp /sbin/bridge /tmp/root/bin
+cp /sbin/ifconfig /tmp/root/bin
+cp /sbin/udhcpc /tmp/root/bin
+cp /path/to/setup.sh /tmp/root/bin
+chroot /tmp/root/ busybox sh /bin/setup.sh
+
+run ./run_nfs.sh
diff --git a/Documentation/networking/devlink-health.txt b/Documentation/networking/devlink-health.txt
deleted file mode 100644
index 1db3fbea0831..000000000000
--- a/Documentation/networking/devlink-health.txt
+++ /dev/null
@@ -1,86 +0,0 @@
-The health mechanism is targeted for Real Time Alerting, in order to know when
-something bad had happened to a PCI device
-- Provide alert debug information
-- Self healing
-- If problem needs vendor support, provide a way to gather all needed debugging
- information.
-
-The main idea is to unify and centralize driver health reports in the
-generic devlink instance and allow the user to set different
-attributes of the health reporting and recovery procedures.
-
-The devlink health reporter:
-Device driver creates a "health reporter" per each error/health type.
-Error/Health type can be a known/generic (eg pci error, fw error, rx/tx error)
-or unknown (driver specific).
-For each registered health reporter a driver can issue error/health reports
-asynchronously. All health reports handling is done by devlink.
-Device driver can provide specific callbacks for each "health reporter", e.g.
- - Recovery procedures
- - Diagnostics and object dump procedures
- - OOB initial parameters
-Different parts of the driver can register different types of health reporters
-with different handlers.
-
-Once an error is reported, devlink health will do the following actions:
- * A log is being send to the kernel trace events buffer
- * Health status and statistics are being updated for the reporter instance
- * Object dump is being taken and saved at the reporter instance (as long as
- there is no other dump which is already stored)
- * Auto recovery attempt is being done. Depends on:
- - Auto-recovery configuration
- - Grace period vs. time passed since last recover
-
-The user interface:
-User can access/change each reporter's parameters and driver specific callbacks
-via devlink, e.g per error type (per health reporter)
- - Configure reporter's generic parameters (like: disable/enable auto recovery)
- - Invoke recovery procedure
- - Run diagnostics
- - Object dump
-
-The devlink health interface (via netlink):
-DEVLINK_CMD_HEALTH_REPORTER_GET
- Retrieves status and configuration info per DEV and reporter.
-DEVLINK_CMD_HEALTH_REPORTER_SET
- Allows reporter-related configuration setting.
-DEVLINK_CMD_HEALTH_REPORTER_RECOVER
- Triggers a reporter's recovery procedure.
-DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE
- Retrieves diagnostics data from a reporter on a device.
-DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET
- Retrieves the last stored dump. Devlink health
- saves a single dump. If an dump is not already stored by the devlink
- for this reporter, devlink generates a new dump.
- dump output is defined by the reporter.
-DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR
- Clears the last saved dump file for the specified reporter.
-
-
-                                                netlink
-                                      +--------------------------+
-                                      |                          |
-                                      |            +             |
-                                      |            |             |
-                                      +--------------------------+
-                                                   |request for ops
-                                                   |(diagnose,
- mlx5_core                              devlink    |recover,
-                                                   |dump)
-+--------+                            +--------------------------+
-|        |                            |    reporter|             |
-|        |                            |  +---------v----------+  |
-|        |   ops execution            |  |                    |  |
-|     <----------------------------------+                    |  |
-|        |                            |  |                    |  |
-|        |                            |  + ^------------------+  |
-|        |                            |    | request for ops     |
-|        |                            |    | (recover, dump)     |
-|        |                            |    |                     |
-|        |                            |  +-+------------------+  |
-|        |     health report          |  | health handler     |  |
-|        +------------------------------->                    |  |
-|        |                            |  +--------------------+  |
-|        |  health reporter create    |                          |
-|        +---------------------------->                          |
-+--------+                            +--------------------------+
diff --git a/Documentation/networking/devlink-info-versions.rst b/Documentation/networking/devlink-info-versions.rst
deleted file mode 100644
index 4316342b7746..000000000000
--- a/Documentation/networking/devlink-info-versions.rst
+++ /dev/null
@@ -1,48 +0,0 @@
-.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
-
-=====================
-Devlink info versions
-=====================
-
-board.id
-========
-
-Unique identifier of the board design.
-
-board.rev
-=========
-
-Board design revision.
-
-board.manufacture
-=================
-
-An identifier of the company or the facility which produced the part.
-
-fw.mgmt
-=======
-
-Control unit firmware version. This firmware is responsible for house
-keeping tasks, PHY control etc. but not the packet-by-packet data path
-operation.
-
-fw.app
-======
-
-Data path microcode controlling high-speed packet processing.
-
-fw.undi
-=======
-
-UNDI software, may include the UEFI driver, firmware or both.
-
-fw.ncsi
-=======
-
-Version of the software responsible for supporting/handling the
-Network Controller Sideband Interface.
-
-fw.psid
-=======
-
-Unique identifier of the firmware parameter set.
diff --git a/Documentation/networking/devlink-params-bnxt.txt b/Documentation/networking/devlink-params-bnxt.txt
deleted file mode 100644
index 481aa303d5b4..000000000000
--- a/Documentation/networking/devlink-params-bnxt.txt
+++ /dev/null
@@ -1,18 +0,0 @@
-enable_sriov [DEVICE, GENERIC]
- Configuration mode: Permanent
-
-ignore_ari [DEVICE, GENERIC]
- Configuration mode: Permanent
-
-msix_vec_per_pf_max [DEVICE, GENERIC]
- Configuration mode: Permanent
-
-msix_vec_per_pf_min [DEVICE, GENERIC]
- Configuration mode: Permanent
-
-gre_ver_check [DEVICE, DRIVER-SPECIFIC]
- Generic Routing Encapsulation (GRE) version check will
- be enabled in the device. If disabled, device skips
- version checking for incoming packets.
- Type: Boolean
- Configuration mode: Permanent
diff --git a/Documentation/networking/devlink-params-mlxsw.txt b/Documentation/networking/devlink-params-mlxsw.txt
deleted file mode 100644
index c63ea9fc7009..000000000000
--- a/Documentation/networking/devlink-params-mlxsw.txt
+++ /dev/null
@@ -1,10 +0,0 @@
-fw_load_policy [DEVICE, GENERIC]
- Configuration mode: driverinit
-
-acl_region_rehash_interval [DEVICE, DRIVER-SPECIFIC]
- Sets an interval for periodic ACL region rehashes.
- The value is in milliseconds, minimal value is "3000".
- Value "0" disables the periodic work.
- The first rehash will be run right after value is set.
- Type: u32
- Configuration mode: runtime
diff --git a/Documentation/networking/devlink-params.txt b/Documentation/networking/devlink-params.txt
deleted file mode 100644
index 2d26434ddcf8..000000000000
--- a/Documentation/networking/devlink-params.txt
+++ /dev/null
@@ -1,51 +0,0 @@
-Devlink configuration parameters
-================================
-Following is the list of configuration parameters via devlink interface.
-Each parameter can be generic or driver specific and are device level
-parameters.
-
-Note that the driver-specific files should contain the generic params
-they support to, with supported config modes.
-
-Each parameter can be set in different configuration modes:
- runtime - set while driver is running, no reset required.
- driverinit - applied while driver initializes, requires restart
- driver by devlink reload command.
- permanent - written to device's non-volatile memory, hard reset
- required.
-
-Following is the list of parameters:
-====================================
-enable_sriov [DEVICE, GENERIC]
- Enable Single Root I/O Virtualisation (SRIOV) in
- the device.
- Type: Boolean
-
-ignore_ari [DEVICE, GENERIC]
- Ignore Alternative Routing-ID Interpretation (ARI)
- capability. If enabled, adapter will ignore ARI
- capability even when platforms has the support
- enabled and creates same number of partitions when
- platform does not support ARI.
- Type: Boolean
-
-msix_vec_per_pf_max [DEVICE, GENERIC]
- Provides the maximum number of MSIX interrupts that
- a device can create. Value is same across all
- physical functions (PFs) in the device.
- Type: u32
-
-msix_vec_per_pf_min [DEVICE, GENERIC]
- Provides the minimum number of MSIX interrupts required
- for the device initialization. Value is same across all
- physical functions (PFs) in the device.
- Type: u32
-
-fw_load_policy [DEVICE, GENERIC]
- Controls the device's firmware loading policy.
- Valid values:
- * DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DRIVER (0)
- Load firmware version preferred by the driver.
- * DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_FLASH (1)
- Load firmware currently stored in flash.
- Type: u8
diff --git a/Documentation/networking/devlink/bnxt.rst b/Documentation/networking/devlink/bnxt.rst
new file mode 100644
index 000000000000..82ef9ec46707
--- /dev/null
+++ b/Documentation/networking/devlink/bnxt.rst
@@ -0,0 +1,74 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+bnxt devlink support
+====================
+
+This document describes the devlink features implemented by the ``bnxt``
+device driver.
+
+Parameters
+==========
+
+.. list-table:: Generic parameters implemented
+
+   * - Name
+     - Mode
+   * - ``enable_sriov``
+     - Permanent
+   * - ``ignore_ari``
+     - Permanent
+   * - ``msix_vec_per_pf_max``
+     - Permanent
+   * - ``msix_vec_per_pf_min``
+     - Permanent
+
+The ``bnxt`` driver also implements the following driver-specific
+parameters.
+
+.. list-table:: Driver-specific parameters implemented
+   :widths: 5 5 5 85
+
+   * - Name
+     - Type
+     - Mode
+     - Description
+   * - ``gre_ver_check``
+     - Boolean
+     - Permanent
+     - Generic Routing Encapsulation (GRE) version check will be enabled in
+       the device. If disabled, the device will skip the version check for
+       incoming packets.
+
+Info versions
+=============
+
+The ``bnxt_en`` driver reports the following versions:
+
+.. list-table:: devlink info versions implemented
+   :widths: 5 5 90
+
+   * - Name
+     - Type
+     - Description
+   * - ``asic.id``
+     - fixed
+     - ASIC design identifier
+   * - ``asic.rev``
+     - fixed
+     - ASIC design revision
+   * - ``fw.psid``
+     - stored, running
+     - Firmware parameter set version of the board
+   * - ``fw``
+     - stored, running
+     - Overall board firmware version
+   * - ``fw.app``
+     - stored, running
+     - Data path firmware version
+   * - ``fw.mgmt``
+     - stored, running
+     - Management firmware version
+   * - ``fw.roce``
+     - stored, running
+     - RoCE management firmware version
diff --git a/Documentation/networking/devlink/devlink-dpipe.rst b/Documentation/networking/devlink/devlink-dpipe.rst
new file mode 100644
index 000000000000..468fe1001b74
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-dpipe.rst
@@ -0,0 +1,252 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Devlink DPIPE
+=============
+
+Background
+==========
+
+While performing the hardware offloading process, much of the hardware
+specifics cannot be presented. These details are useful for debugging, and
+``devlink-dpipe`` provides a standardized way to provide visibility into the
+offloading process.
+
+For example, the routing longest prefix match (LPM) algorithm used by the
+Linux kernel may differ from the hardware implementation. The pipeline debug
+API (DPIPE) is aimed at providing the user visibility into the ASIC's
+pipeline in a generic way.
+
+The hardware offload process is expected to be done in a way that the user
+should not be able to distinguish between the hardware vs. software
+implementation. In this process, hardware specifics are neglected. In
+reality those details can have lots of meaning and should be exposed in some
+standard way.
+
+This problem is made even more complex when one wishes to offload the
+control path of the whole networking stack to a switch ASIC. Due to
+differences in the hardware and software models some processes cannot be
+represented correctly.
+
+One example is the kernel's LPM algorithm which in many cases differs
+greatly from the hardware implementation. The configuration API is the same,
+but one cannot rely on the Forward Information Base (FIB) to look like the
+Level Path Compression trie (LPC-trie) in hardware.
+
+In many situations, trying to analyze a system failure based solely on the
+kernel's dump may not be enough. By combining this data with complementary
+information about the underlying hardware, this debugging can be made
+easier; additionally, the information can be useful when debugging
+performance issues.
+
+Overview
+========
+
+The ``devlink-dpipe`` interface closes this gap. The hardware's pipeline is
+modeled as a graph of match/action tables. Each table represents a specific
+hardware block. This model is not new, first being used by the P4 language.
+
+Traditionally it has been used as an alternative model for hardware
+configuration, but the ``devlink-dpipe`` interface uses it for visibility
+purposes as a standard complementary tool. The system's view from
+``devlink-dpipe`` should change according to the changes done by the
+standard configuration tools.
+
+For example, it's quite common to implement Access Control Lists (ACL)
+using Ternary Content Addressable Memory (TCAM). The TCAM memory can be
+divided into TCAM regions. Complex TC filters can have multiple rules with
+different priorities and different lookup keys. On the other hand hardware
+TCAM regions have a predefined lookup key. Offloading the TC filter rules
+using TCAM engine can result in multiple TCAM regions being interconnected
+in a chain (which may affect the data path latency). In response to a new TC
+filter new tables should be created describing those regions.
+
+Model
+=====
+
+The ``DPIPE`` model introduces several objects:
+
+ * headers
+ * tables
+ * entries
+
+A ``header`` describes packet formats and provides names for fields within
+the packet. A ``table`` describes hardware blocks. An ``entry`` describes
+the actual content of a specific table.
+
+The hardware pipeline is not port specific, but rather describes the whole
+ASIC. Thus it is tied to the top of the ``devlink`` infrastructure.
+
+Drivers can register and unregister tables at run time, in order to support
+dynamic behavior. This dynamic behavior is mandatory for describing hardware
+blocks like TCAM regions which can be allocated and freed dynamically.
+
+``devlink-dpipe`` generally is not intended for configuration. The exception
+is hardware counting for a specific table.
+
+The following commands are used to obtain the ``dpipe`` objects from
+userspace:
+
+ * ``table_get``: Receive a table's description.
+ * ``headers_get``: Receive a device's supported headers.
+ * ``entries_get``: Receive a table's current entries.
+ * ``counters_set``: Enable or disable counters on a table.
+
+Table
+-----
+
+The driver should implement the following operations for each table:
+
+  * ``matches_dump``: Dump the supported matches.
+  * ``actions_dump``: Dump the supported actions.
+  * ``entries_dump``: Dump the actual content of the table.
+  * ``counters_set_update``: Synchronize hardware with counters enabled or
+    disabled.
+
+Header/Field
+------------
+
+In a similar way to P4, headers and fields are used to describe a table's
+behavior. There is a slight difference between the standard protocol headers
+and specific ASIC metadata. The protocol headers should be declared in the
+``devlink`` core API. On the other hand, ASIC metadata is driver specific
+and should be defined in the driver. Additionally, each driver-specific
+devlink documentation file should document the driver-specific ``dpipe``
+headers it implements. The headers and fields are identified by enumeration.
+
+In order to provide further visibility some ASIC metadata fields could be
+mapped to kernel objects. For example, internal router interface indexes can
+be directly mapped to the net device ifindex. FIB table indexes used by
+different Virtual Routing and Forwarding (VRF) tables can be mapped to
+internal routing table indexes.
+
+Match
+-----
+
+Matches are kept primitive and close to hardware operation. Match types like
+LPM are not supported due to the fact that this is exactly a process we wish
+to describe in full detail. Example of matches:
+
+ * ``field_exact``: Exact match on a specific field.
+ * ``field_exact_mask``: Exact match on a specific field after masking.
+ * ``field_range``: Match on a specific range.
+
+The IDs of the header and the field should be specified in order to
+identify the specific field. Furthermore, the header index should be
+specified in order to distinguish multiple headers of the same type in a
+packet (tunneling).
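The three match kinds listed above can be illustrated with a short Python sketch. The function names simply reuse the match names from this section; they are hypothetical helpers written for illustration and are not part of any kernel or ``devlink`` API.

```python
# Illustrative only: these helpers mirror the three match kinds named above.
# They are not part of any kernel or devlink API.

def field_exact(value: int, key: int) -> bool:
    """Exact match on a specific field."""
    return value == key


def field_exact_mask(value: int, key: int, mask: int) -> bool:
    """Exact match on a specific field after masking both sides."""
    return (value & mask) == (key & mask)


def field_range(value: int, low: int, high: int) -> bool:
    """Match when the field falls inside an inclusive range."""
    return low <= value <= high


# 10.1.2.3 tested against 10.1.0.0/16, both written as 32-bit integers.
addr = (10 << 24) | (1 << 16) | (2 << 8) | 3
net = (10 << 24) | (1 << 16)

print(field_exact_mask(addr, net, 0xFFFF0000))  # True
print(field_exact(addr, net))                   # False
print(field_range(443, 0, 1023))                # True
```

Note that a prefix lookup such as LPM has to be expressed as a series of ``field_exact_mask`` tables, one per mask, which is the decomposition the text says DPIPE wants to expose.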
+
+Action
+------
+
+Similar to match, the actions are kept primitive and close to hardware
+operation. For example:
+
+ * ``field_modify``: Modify the field value.
+ * ``field_inc``: Increment the field value.
+ * ``push_header``: Add a header.
+ * ``pop_header``: Remove a header.
+
+Entry
+-----
+
+Entries of a specific table can be dumped on demand. Each entry is
+identified with an index and its properties are described by a list of
+match/action values and a specific counter. By dumping the tables' content,
+the interactions between tables can be resolved.
+
+Abstraction Example
+===================
+
+The following is an example of the abstraction model of the L3 part of
+Mellanox Spectrum ASIC. The blocks are described in the order they appear in
+the pipeline. The table sizes in the following examples are not real
+hardware sizes and are provided for demonstration purposes.
+
+LPM
+---
+
+The LPM algorithm can be implemented as a list of hash tables. Each hash
+table contains routes with the same prefix length. The root of the list is
+/32, and in case of a miss the hardware will continue to the next hash
+table. The depth of the search will affect the data path latency.
+
+In case of a hit the entry contains information about the next stage of the
+pipeline which resolves the MAC address. The next stage can be either local
+host table for directly connected routes, or adjacency table for next-hops.
+The ``meta.lpm_prefix`` field is used to connect two LPM tables.
+
+.. code::
+
+    table lpm_prefix_16 {
+      size: 4096,
+      counters_enabled: true,
+      match: { meta.vr_id: exact,
+               ipv4.dst_addr: exact_mask,
+               ipv6.dst_addr: exact_mask,
+               meta.lpm_prefix: exact },
+      action: { meta.adj_index: set,
+                meta.adj_group_size: set,
+                meta.rif_port: set,
+                meta.lpm_prefix: set },
+    }
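The chained lookup described in this section can be sketched in Python. This is a simplified model under stated assumptions, not the Spectrum implementation: the ``LpmChain`` class is hypothetical, it covers IPv4 only, and it probes every prefix length from /32 down to /0, whereas real hardware would only chain the tables that actually exist.

```python
import ipaddress


class LpmChain:
    """Hypothetical model: one hash table (dict) per prefix length."""

    def __init__(self):
        # Keyed by prefix length; each table maps a masked address to a result.
        self.tables = {length: {} for length in range(32, -1, -1)}

    @staticmethod
    def _mask(addr: int, length: int) -> int:
        return addr & ((0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF)

    def add(self, prefix: str, result: str) -> None:
        net = ipaddress.ip_network(prefix)
        self.tables[net.prefixlen][int(net.network_address)] = result

    def lookup(self, addr: str):
        """Return (result, tables_probed); result is None on a full miss."""
        value = int(ipaddress.ip_address(addr))
        probed = 0
        for length in range(32, -1, -1):  # the root of the list is /32
            probed += 1
            hit = self.tables[length].get(self._mask(value, length))
            if hit is not None:
                return hit, probed
        return None, probed


lpm = LpmChain()
lpm.add("10.0.0.0/8", "rif1")
lpm.add("10.1.0.0/16", "rif2")
print(lpm.lookup("10.1.2.3"))  # ('rif2', 17)
print(lpm.lookup("10.9.9.9"))  # ('rif1', 25)
```

The second element of the result counts how many tables were probed before the hit, which is the search depth that the text says affects data path latency.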
+
+Local Host
+----------
+
+In the case of local routes the LPM lookup already resolves the egress
+router interface (RIF), yet the exact MAC address is not known. The local
+host table is a hash table combining the output interface id with
+destination IP address as a key. The result is the MAC address.
+
+.. code::
+
+    table local_host {
+      size: 4096,
+      counters_enabled: true,
+      match: { meta.rif_port: exact,
+               ipv4.dst_addr: exact},
+      action: { ethernet.daddr: set }
+    }
+
+Adjacency
+---------
+
+In case of remote routes this table does the ECMP. The LPM lookup results in
+ECMP group size and index that serves as a global offset into this table.
+Concurrently a hash of the packet is generated. Based on the ECMP group size
+and the packet's hash a local offset is generated. Multiple LPM entries can
+point to the same adjacency group.
+
+.. code::
+
+    table adjacency {
+      size: 4096,
+      counters_enabled: true,
+      match: { meta.adj_index: exact,
+               meta.adj_group_size: exact,
+               meta.packet_hash_index: exact },
+      action: { ethernet.daddr: set,
+                meta.erif: set }
+    }
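As a rough illustration of how the global offset and the local offset combine, the following Python sketch picks an adjacency entry. The text does not say how the hash is reduced to a local offset; the modulo used here is an assumption made for the example, and ``adjacency_index`` is a hypothetical helper.

```python
def adjacency_index(adj_index: int, adj_group_size: int, packet_hash: int) -> int:
    """Global offset from the LPM result plus a hash-derived local offset."""
    if adj_group_size <= 0:
        raise ValueError("an ECMP group must contain at least one entry")
    # Assumption: the local offset is the packet hash modulo the group size.
    local_offset = packet_hash % adj_group_size
    return adj_index + local_offset


# A group of 4 next hops starting at entry 100 of the adjacency table.
print(adjacency_index(100, 4, 0x1234ABCD))  # 101
```

Because only ``adj_index`` and ``adj_group_size`` come from the LPM entry, several LPM entries can share one group simply by carrying the same pair of values.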
+
+ERIF
+----
+
+In case the egress RIF and destination MAC have been resolved by previous
+tables this table does multiple operations like TTL decrease and MTU check.
+Then the decision of forward/drop is taken and the port L3 statistics are
+updated based on the packet's type (broadcast, unicast, multicast).
+
+.. code::
+
+    table erif {
+      size: 800,
+      counters_enabled: true,
+      match: { meta.rif_port: exact,
+               meta.is_l3_unicast: exact,
+               meta.is_l3_broadcast: exact,
+               meta.is_l3_multicast: exact },
+      action: { meta.l3_drop: set,
+                meta.l3_forward: set }
+    }
diff --git a/Documentation/networking/devlink/devlink-health.rst b/Documentation/networking/devlink/devlink-health.rst
new file mode 100644
index 000000000000..0c99b11f05f9
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-health.rst
@@ -0,0 +1,114 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+Devlink Health
+==============
+
+Background
+==========
+
+The ``devlink`` health mechanism is targeted for Real Time Alerting, in
+order to know when something bad happened to a PCI device.
+
+ * Provide alert debug information.
+ * Self healing.
+ * If problem needs vendor support, provide a way to gather all needed
+ debugging information.
+
+Overview
+========
+
+The main idea is to unify and centralize driver health reports in the
+generic ``devlink`` instance and allow the user to set different
+attributes of the health reporting and recovery procedures.
+
+The ``devlink`` health reporter:
+Device driver creates a "health reporter" per each error/health type.
+Error/Health type can be known/generic (e.g. PCI error, FW error, rx/tx error)
+or unknown (driver specific).
+For each registered health reporter a driver can issue error/health reports
+asynchronously. All health reports handling is done by ``devlink``.
+Device driver can provide specific callbacks for each "health reporter", e.g.:
+
+ * Recovery procedures
+ * Diagnostics procedures
+ * Object dump procedures
+ * OOB initial parameters
+
+Different parts of the driver can register different types of health reporters
+with different handlers.
+
+Actions
+=======
+
+Once an error is reported, devlink health will perform the following actions:
+
+  * A log is sent to the kernel trace events buffer
+  * Health status and statistics are updated for the reporter instance
+  * An object dump is taken and saved at the reporter instance (as long as
+    there is no other dump which is already stored)
+  * An auto recovery attempt is made, depending on:
+    - Auto-recovery configuration
+    - Grace period vs. time passed since last recover
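The two conditions on auto recovery can be sketched as a small decision function. This is a simplified model written for illustration, not the kernel's code: the ``Reporter`` class, its field names and the millisecond clock are all assumptions, and the real implementation may check additional state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reporter:
    """Hypothetical stand-in for a health reporter's recovery settings."""
    auto_recover: bool
    grace_period_ms: int
    last_recovery_ms: Optional[int] = None  # None means never recovered


def should_auto_recover(reporter: Reporter, now_ms: int) -> bool:
    """Attempt recovery only if enabled and the grace period has elapsed."""
    if not reporter.auto_recover:
        return False
    if reporter.last_recovery_ms is None:
        return True
    return now_ms - reporter.last_recovery_ms >= reporter.grace_period_ms


r = Reporter(auto_recover=True, grace_period_ms=500, last_recovery_ms=1000)
print(should_auto_recover(r, now_ms=1200))  # False
print(should_auto_recover(r, now_ms=1600))  # True
```

The grace period keeps a persistently failing device from being recovered in a tight loop.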
+
+User Interface
+==============
+
+User can access/change each reporter's parameters and driver specific callbacks
+via ``devlink``, e.g. per error type (per health reporter):
+
+ * Configure reporter's generic parameters (like: disable/enable auto recovery)
+ * Invoke recovery procedure
+ * Run diagnostics
+ * Object dump
+
+.. list-table:: List of devlink health interfaces
+   :widths: 10 90
+
+   * - Name
+     - Description
+   * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
+     - Retrieves status and configuration info per DEV and reporter.
+   * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
+     - Allows reporter-related configuration setting.
+   * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
+     - Triggers a reporter's recovery procedure.
+   * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
+     - Retrieves diagnostics data from a reporter on a device.
+   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
+     - Retrieves the last stored dump. Devlink health
+       saves a single dump. If a dump is not already stored by devlink
+       for this reporter, devlink generates a new dump.
+       Dump output is defined by the reporter.
+   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
+     - Clears the last saved dump file for the specified reporter.
+
+The following diagram provides a general overview of ``devlink-health``::
+
+                                                    netlink
+                                          +--------------------------+
+                                          |                          |
+                                          |            +             |
+                                          |            |             |
+                                          +--------------------------+
+                                                       |request for ops
+                                                       |(diagnose,
+     mlx5_core                              devlink    |recover,
+                                                       |dump)
+    +--------+                            +--------------------------+
+    |        |                            |    reporter|             |
+    |        |                            |  +---------v----------+  |
+    |        |   ops execution            |  |                    |  |
+    |     <----------------------------------+                    |  |
+    |        |                            |  |                    |  |
+    |        |                            |  + ^------------------+  |
+    |        |                            |    | request for ops     |
+    |        |                            |    | (recover, dump)     |
+    |        |                            |    |                     |
+    |        |                            |  +-+------------------+  |
+    |        |     health report          |  | health handler     |  |
+    |        +------------------------------->                    |  |
+    |        |                            |  +--------------------+  |
+    |        |  health reporter create    |                          |
+    |        +---------------------------->                          |
+    +--------+                            +--------------------------+
diff --git a/Documentation/networking/devlink/devlink-info.rst b/Documentation/networking/devlink/devlink-info.rst
new file mode 100644
index 000000000000..70981dd1b981
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-info.rst
@@ -0,0 +1,100 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+============
+Devlink Info
+============
+
+The ``devlink-info`` mechanism enables device drivers to report device
+information in a generic fashion. It is extensible, and enables exporting
+even device or driver specific information.
+
+``devlink`` supports representing the following types of versions:
+
+.. list-table:: List of version types
+   :widths: 5 95
+
+   * - Type
+     - Description
+   * - ``fixed``
+     - Represents fixed versions, which cannot change. For example,
+       component identifiers or the board version reported in the PCI VPD.
+   * - ``running``
+     - Represents the version of the currently running component. For
+       example the running version of firmware. These versions generally
+       only update after a reboot.
+   * - ``stored``
+     - Represents the version of a component as stored, such as after a
+       flash update. Stored values should update to reflect changes in the
+       flash even if a reboot has not yet occurred.
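One practical use of the ``running`` and ``stored`` types is spotting a flashed update that has not yet taken effect. The Python sketch below compares the two sets of versions; ``pending_updates`` is a hypothetical helper and the version strings are made up for the example.

```python
def pending_updates(running: dict, stored: dict) -> dict:
    """Return components whose stored version differs from the running one."""
    return {
        name: (running[name], stored[name])
        for name in sorted(running.keys() & stored.keys())
        if running[name] != stored[name]
    }


# Made-up version strings keyed by the generic names used in this document.
running = {"fw": "1.2.3", "fw.mgmt": "10.0.1", "fw.app": "5.5"}
stored = {"fw": "1.3.0", "fw.mgmt": "10.0.1", "fw.app": "5.6"}

print(pending_updates(running, stored))
# {'fw': ('1.2.3', '1.3.0'), 'fw.app': ('5.5', '5.6')}
```

A non-empty result suggests that the device is still running older firmware than what is in flash, which per the table above usually resolves after a reboot.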
+
+Generic Versions
+================
+
+It is expected that drivers use the following generic names for exporting
+version information. Other information may be exposed using driver-specific
+names, but these should be documented in the driver-specific file.
+
+board.id
+--------
+
+Unique identifier of the board design.
+
+board.rev
+---------
+
+Board design revision.
+
+asic.id
+-------
+
+ASIC design identifier.
+
+asic.rev
+--------
+
+ASIC design revision.
+
+board.manufacture
+-----------------
+
+An identifier of the company or the facility which produced the part.
+
+fw
+--
+
+Overall firmware version, often representing the collection of
+fw.mgmt, fw.app, etc.
+
+fw.mgmt
+-------
+
+Control unit firmware version. This firmware is responsible for house
+keeping tasks, PHY control etc. but not the packet-by-packet data path
+operation.
+
+fw.app
+------
+
+Data path microcode controlling high-speed packet processing.
+
+fw.undi
+-------
+
+UNDI software, may include the UEFI driver, firmware or both.
+
+fw.ncsi
+-------
+
+Version of the software responsible for supporting/handling the
+Network Controller Sideband Interface.
+
+fw.psid
+-------
+
+Unique identifier of the firmware parameter set.
+
+fw.roce
+-------
+
+RoCE firmware version, which is responsible for handling RoCE
+management.
diff --git a/Documentation/networking/devlink/devlink-params.rst b/Documentation/networking/devlink/devlink-params.rst
new file mode 100644
index 000000000000..da2f85c0fa21
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-params.rst
@@ -0,0 +1,108 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+Devlink Params
+==============
+
+``devlink`` provides capability for a driver to expose device parameters for low
+level device functionality. Since devlink can operate at the device-wide
+level, it can be used to provide configuration that may affect multiple
+ports on a single device.
+
+This document describes a number of generic parameters that are supported
+across multiple drivers. Each driver is also free to add their own
+parameters. Each driver must document the specific parameters they support,
+whether generic or not.
+
+Configuration modes
+===================
+
+Parameters may be set in different configuration modes.
+
+.. list-table:: Possible configuration modes
+   :widths: 5 90
+
+   * - Name
+     - Description
+   * - ``runtime``
+     - set while the driver is running, and takes effect immediately. No
+       reset is required.
+   * - ``driverinit``
+     - applied while the driver initializes. Requires the user to restart
+       the driver using the ``devlink`` reload command.
+   * - ``permanent``
+     - written to the device's non-volatile memory. A hard reset is required
+       for it to take effect.
+
+Reloading
+---------
+
+In order for ``driverinit`` parameters to take effect, the driver must
+support reloading via the ``devlink-reload`` command. This command will
+request a reload of the device driver.
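The interaction between the configuration mode and reloading can be sketched as a helper that assembles, but does not run, the command lines a user would issue. The ``devlink dev param set ... cmode ...`` form follows the example used elsewhere in this patch; ``param_set_commands`` itself, the device name and the chosen value are hypothetical.

```python
CMODES = ("runtime", "driverinit", "permanent")


def param_set_commands(dev: str, name: str, value, cmode: str) -> list:
    """Build argv lists for setting a parameter; nothing is executed."""
    if cmode not in CMODES:
        raise ValueError(f"unknown configuration mode: {cmode}")
    commands = [[
        "devlink", "dev", "param", "set", dev,
        "name", name, "value", str(value), "cmode", cmode,
    ]]
    if cmode == "driverinit":
        # driverinit values are only applied once the driver is reloaded.
        commands.append(["devlink", "dev", "reload", dev])
    return commands


# Hypothetical device and value, for illustration only.
for argv in param_set_commands("pci/0000:01:00.0", "max_macs", 64, "driverinit"):
    print(" ".join(argv))
```

A ``permanent`` value is deliberately not followed by any command here, since the hard reset it needs is outside what ``devlink`` reload provides.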
+
+Generic configuration parameters
+================================
+The following is a list of generic configuration parameters that drivers may
+add. Use of generic parameters is preferred over each driver creating their
+own name.
+
+.. list-table:: List of generic parameters
+   :widths: 5 5 90
+
+   * - Name
+     - Type
+     - Description
+   * - ``enable_sriov``
+     - Boolean
+     - Enable Single Root I/O Virtualization (SRIOV) in the device.
+   * - ``ignore_ari``
+     - Boolean
+     - Ignore Alternative Routing-ID Interpretation (ARI) capability. If
+       enabled, the adapter will ignore ARI capability even when the
+       platform has support enabled. The device will create the same number
+       of partitions as when the platform does not support ARI.
+   * - ``msix_vec_per_pf_max``
+     - u32
+     - Provides the maximum number of MSI-X interrupts that a device can
+       create. Value is the same across all physical functions (PFs) in the
+       device.
+   * - ``msix_vec_per_pf_min``
+     - u32
+     - Provides the minimum number of MSI-X interrupts required for the
+       device to initialize. Value is the same across all physical functions
+       (PFs) in the device.
+   * - ``fw_load_policy``
+     - u8
+     - Control the device's firmware loading policy.
+        - ``DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DRIVER`` (0)
+          Load firmware version preferred by the driver.
+        - ``DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_FLASH`` (1)
+          Load firmware currently stored in flash.
+        - ``DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DISK`` (2)
+          Load firmware currently available on host's disk.
+   * - ``reset_dev_on_drv_probe``
+     - u8
+     - Controls the device's reset policy on driver probe.
+        - ``DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_UNKNOWN`` (0)
+          Unknown or invalid value.
+        - ``DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_ALWAYS`` (1)
+          Always reset device on driver probe.
+        - ``DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_NEVER`` (2)
+          Never reset device on driver probe.
+        - ``DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_DISK`` (3)
+          Reset the device only if firmware can be found in the filesystem.
+   * - ``enable_roce``
+     - Boolean
+     - Enable handling of RoCE traffic in the device.
+   * - ``internal_err_reset``
+     - Boolean
+     - When enabled, the device driver will reset the device on internal
+       errors.
+   * - ``max_macs``
+     - u32
+     - Specifies the maximum number of MAC addresses per ethernet port of
+       this device.
+   * - ``region_snapshot_enable``
+     - Boolean
+     - Enable capture of ``devlink-region`` snapshots.
diff --git a/Documentation/networking/devlink/devlink-region.rst b/Documentation/networking/devlink/devlink-region.rst
new file mode 100644
index 000000000000..1a7683e7acb2
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-region.rst
@@ -0,0 +1,60 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+Devlink Region
+==============
+
+``devlink`` regions enable access to driver defined address regions using
+devlink.
+
+Each device can create and register its own supported address regions. The
+region can then be accessed via the devlink region interface.
+
+Region snapshots are collected by the driver, and can be accessed via read
+or dump commands. This allows future analysis on the created snapshots.
+Regions may optionally support triggering snapshots on demand.
+
+The major benefit to creating a region is to provide access to internal
+address regions that are otherwise inaccessible to the user.
+
+Regions may also be used to provide an additional way to debug complex error
+ states, but see also :doc:`devlink-health`.
+
+example usage
+-------------
+
+.. code:: shell
+
+ $ devlink region help
+ $ devlink region show [ DEV/REGION ]
+ $ devlink region del DEV/REGION snapshot SNAPSHOT_ID
+ $ devlink region dump DEV/REGION [ snapshot SNAPSHOT_ID ]
+ $ devlink region read DEV/REGION [ snapshot SNAPSHOT_ID ]
+ address ADDRESS length LENGTH
+
+ # Show all of the exposed regions with region sizes:
+ $ devlink region show
+ pci/0000:00:05.0/cr-space: size 1048576 snapshot [1 2]
+ pci/0000:00:05.0/fw-health: size 64 snapshot [1 2]
+
+ # Delete a snapshot using:
+ $ devlink region del pci/0000:00:05.0/cr-space snapshot 1
+
+ # Trigger (request) a snapshot be taken:
+ $ devlink region trigger pci/0000:00:05.0/cr-space
+
+ # Dump a snapshot:
+ $ devlink region dump pci/0000:00:05.0/fw-health snapshot 1
+ 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
+ 0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
+ 0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc
+ 0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5
+
+ # Read a specific part of a snapshot:
+ $ devlink region read pci/0000:00:05.0/fw-health snapshot 1 address 0
+ length 16
+ 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
+
+As regions are likely very device or driver specific, no generic regions are
+defined. See the driver-specific documentation files for information on the
+specific regions a driver supports.
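+
+ As a quick sanity check, the amount of data in a saved dump can be compared
+ with the region size reported by ``devlink region show``. The sketch below
+ assumes the dump layout shown above (an offset column followed by 16-bit
+ hexadecimal words); ``dump_bytes`` is a hypothetical helper, not part of
+ ``devlink``.
+

```shell
#!/bin/bash
# Hypothetical helper: count the bytes contained in `devlink region dump`
# output read from stdin. Assumes each line is an offset followed by
# 16-bit hexadecimal words, as in the fw-health example above.
dump_bytes() {
    awk '{ words += NF - 1 } END { print words * 2 }'
}

# The fw-health snapshot from the example; prints 64, matching "size 64".
dump_bytes <<'EOF'
0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc
0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5
EOF
```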
diff --git a/Documentation/networking/devlink/devlink-resource.rst b/Documentation/networking/devlink/devlink-resource.rst
new file mode 100644
index 000000000000..93e92d2f0752
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-resource.rst
@@ -0,0 +1,62 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================
+Devlink Resource
+================
+
+``devlink`` provides the ability for drivers to register resources, which
+can allow administrators to see the device restrictions for a given
+resource, as well as how much of the given resource is currently
+in use. Additionally, these resources can optionally have configurable size.
+This could enable the administrator to limit the number of resources that
+are used.
+
+For example, the ``netdevsim`` driver enables ``/IPv4/fib`` and
+``/IPv4/fib-rules`` as resources to limit the number of IPv4 FIB entries and
+rules for a given device.
+
+Resource Ids
+============
+
+Each resource is represented by an id, and contains information about its
+current size and related sub resources. To access a sub resource, you
+specify the path of the resource. For example ``/IPv4/fib`` is the id for
+the ``fib`` sub-resource under the ``IPv4`` resource.
+
+example usage
+-------------
+
+The resources exposed by the driver can be observed, for example:
+
+.. code:: shell
+
+ $ devlink resource show pci/0000:03:00.0
+ pci/0000:03:00.0:
+ name kvd size 245760 unit entry
+ resources:
+ name linear size 98304 occ 0 unit entry size_min 0 size_max 147456 size_gran 128
+ name hash_double size 60416 unit entry size_min 32768 size_max 180224 size_gran 128
+ name hash_single size 87040 unit entry size_min 65536 size_max 212992 size_gran 128
+
+ The size of some resources can be changed. For example:
+
+ .. code:: shell
+
+ $ devlink resource set pci/0000:03:00.0 path /kvd/hash_single size 73088
+ $ devlink resource set pci/0000:03:00.0 path /kvd/hash_double size 74368
+
+ The changes do not take effect immediately. This can be verified through the
+ ``size_new`` attribute, which shows the pending change in size. For example:
+
+.. code:: shell
+
+ $ devlink resource show pci/0000:03:00.0
+ pci/0000:03:00.0:
+ name kvd size 245760 unit entry size_valid false
+ resources:
+ name linear size 98304 size_new 147456 occ 0 unit entry size_min 0 size_max 147456 size_gran 128
+ name hash_double size 60416 unit entry size_min 32768 size_max 180224 size_gran 128
+ name hash_single size 87040 unit entry size_min 65536 size_max 212992 size_gran 128
+
+Note that changes in resource size may require a device reload to properly
+take effect.
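+
+ Before issuing ``devlink resource set``, a candidate size can be checked
+ against the limits printed by ``devlink resource show``. The following is a
+ minimal sketch, assuming that a size is acceptable only when it lies within
+ ``size_min`` and ``size_max`` and is a multiple of ``size_gran``;
+ ``size_is_valid`` is a hypothetical helper and the device may apply further
+ constraints.
+

```shell
#!/bin/bash
# Hypothetical helper: check a requested size against the size_min,
# size_max and size_gran attributes printed by `devlink resource show`.
# Returns success (0) when the size is within range and aligned.
size_is_valid() {
    local size="$1" min="$2" max="$3" gran="$4"

    (( size >= min && size <= max && size % gran == 0 ))
}

# Limits of hash_single from the listing above.
if size_is_valid 73088 65536 212992 128; then
    echo "73088: ok"
fi

# 73000 is in range but not a multiple of 128.
if ! size_is_valid 73000 65536 212992 128; then
    echo "73000: rejected"
fi
```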
diff --git a/Documentation/networking/devlink/devlink-trap.rst b/Documentation/networking/devlink/devlink-trap.rst
new file mode 100644
index 000000000000..47a429bb8658
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-trap.rst
@@ -0,0 +1,289 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+Devlink Trap
+============
+
+Background
+==========
+
+ Devices capable of offloading the kernel's datapath and performing functions
+ such as bridging and routing must also be able to send specific packets to the
+ kernel (i.e., the CPU) for processing.
+
+For example, a device acting as a multicast-aware bridge must be able to send
+IGMP membership reports to the kernel for processing by the bridge module.
+Without processing such packets, the bridge module could never populate its
+MDB.
+
+ As another example, consider a device acting as a router which has received an IP
+packet with a TTL of 1. Upon routing the packet the device must send it to the
+kernel so that it will route it as well and generate an ICMP Time Exceeded
+error datagram. Without letting the kernel route such packets itself, utilities
+such as ``traceroute`` could never work.
+
+The fundamental ability of sending certain packets to the kernel for processing
+is called "packet trapping".
+
+Overview
+========
+
+The ``devlink-trap`` mechanism allows capable device drivers to register their
+supported packet traps with ``devlink`` and report trapped packets to
+``devlink`` for further analysis.
+
+ Upon receiving trapped packets, ``devlink`` will perform per-trap packet and
+ byte accounting and potentially report the packet to user space via a netlink
+event along with all the provided metadata (e.g., trap reason, timestamp, input
+port). This is especially useful for drop traps (see :ref:`Trap-Types`)
+as it allows users to obtain further visibility into packet drops that would
+otherwise be invisible.
+
+The following diagram provides a general overview of ``devlink-trap``::
+
+ Netlink event: Packet w/ metadata
+ Or a summary of recent drops
+ ^
+ |
+ Userspace |
+ +---------------------------------------------------+
+ Kernel |
+ |
+ +-------+--------+
+ | |
+ | drop_monitor |
+ | |
+ +-------^--------+
+ |
+ |
+ |
+ +----+----+
+ | | Kernel's Rx path
+ | devlink | (non-drop traps)
+ | |
+ +----^----+ ^
+ | |
+ +-----------+
+ |
+ +-------+-------+
+ | |
+ | Device driver |
+ | |
+ +-------^-------+
+ Kernel |
+ +---------------------------------------------------+
+ Hardware |
+ | Trapped packet
+ |
+ +--+---+
+ | |
+ | ASIC |
+ | |
+ +------+
+
+.. _Trap-Types:
+
+Trap Types
+==========
+
+The ``devlink-trap`` mechanism supports the following packet trap types:
+
+ * ``drop``: Trapped packets were dropped by the underlying device. Packets
+ are only processed by ``devlink`` and not injected to the kernel's Rx path.
+ The trap action (see :ref:`Trap-Actions`) can be changed.
+ * ``exception``: Trapped packets were not forwarded as intended by the
+ underlying device due to an exception (e.g., TTL error, missing neighbour
+ entry) and trapped to the control plane for resolution. Packets are
+ processed by ``devlink`` and injected to the kernel's Rx path. Changing the
+ action of such traps is not allowed, as it can easily break the control
+ plane.
+
+.. _Trap-Actions:
+
+Trap Actions
+============
+
+The ``devlink-trap`` mechanism supports the following packet trap actions:
+
+ * ``trap``: The sole copy of the packet is sent to the CPU.
+ * ``drop``: The packet is dropped by the underlying device and a copy is not
+ sent to the CPU.
+
+Generic Packet Traps
+====================
+
+Generic packet traps are used to describe traps that trap well-defined packets
+or packets that are trapped due to well-defined conditions (e.g., TTL error).
+Such traps can be shared by multiple device drivers and their description must
+be added to the following table:
+
+.. list-table:: List of Generic Packet Traps
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``source_mac_is_multicast``
+ - ``drop``
+ - Traps incoming packets that the device decided to drop because of a
+ multicast source MAC
+ * - ``vlan_tag_mismatch``
+ - ``drop``
+ - Traps incoming packets that the device decided to drop in case of VLAN
+ tag mismatch: The ingress bridge port is not configured with a PVID and
+ the packet is untagged or prio-tagged
+ * - ``ingress_vlan_filter``
+ - ``drop``
+ - Traps incoming packets that the device decided to drop in case they are
+ tagged with a VLAN that is not configured on the ingress bridge port
+ * - ``ingress_spanning_tree_filter``
+ - ``drop``
+ - Traps incoming packets that the device decided to drop in case the STP
+ state of the ingress bridge port is not "forwarding"
+ * - ``port_list_is_empty``
+ - ``drop``
+ - Traps packets that the device decided to drop in case they need to be
+ flooded (e.g., unknown unicast, unregistered multicast) and there are
+ no ports the packets should be flooded to
+ * - ``port_loopback_filter``
+ - ``drop``
+ - Traps packets that the device decided to drop because, after layer 2
+ forwarding, the only port through which they should be transmitted is
+ the port from which they were received
+ * - ``blackhole_route``
+ - ``drop``
+ - Traps packets that the device decided to drop in case they hit a
+ blackhole route
+ * - ``ttl_value_is_too_small``
+ - ``exception``
+ - Traps unicast packets that should be forwarded by the device whose TTL
+ was decremented to 0 or less
+ * - ``tail_drop``
+ - ``drop``
+ - Traps packets that the device decided to drop because they could not be
+ enqueued to a transmission queue which is full
+ * - ``non_ip``
+ - ``drop``
+ - Traps packets that the device decided to drop because they need to
+ undergo a layer 3 lookup, but are not IP or MPLS packets
+ * - ``uc_dip_over_mc_dmac``
+ - ``drop``
+ - Traps packets that the device decided to drop because they need to be
+ routed and they have a unicast destination IP and a multicast destination
+ MAC
+ * - ``dip_is_loopback_address``
+ - ``drop``
+ - Traps packets that the device decided to drop because they need to be
+ routed and their destination IP is the loopback address (i.e., 127.0.0.0/8
+ and ::1/128)
+ * - ``sip_is_mc``
+ - ``drop``
+ - Traps packets that the device decided to drop because they need to be
+ routed and their source IP is multicast (i.e., 224.0.0.0/4 and ff00::/8)
+ * - ``sip_is_loopback_address``
+ - ``drop``
+ - Traps packets that the device decided to drop because they need to be
+ routed and their source IP is the loopback address (i.e., 127.0.0.0/8 and ::1/128)
+ * - ``ip_header_corrupted``
+ - ``drop``
+ - Traps packets that the device decided to drop because they need to be
+ routed and their IP header is corrupted: wrong checksum, wrong IP version
+ or too short Internet Header Length (IHL)
+ * - ``ipv4_sip_is_limited_bc``
+ - ``drop``
+ - Traps packets that the device decided to drop because they need to be
+ routed and their source IP is limited broadcast (i.e., 255.255.255.255/32)
+ * - ``ipv6_mc_dip_reserved_scope``
+ - ``drop``
+ - Traps IPv6 packets that the device decided to drop because they need to
+ be routed and their IPv6 multicast destination IP has a reserved scope
+ (i.e., ffx0::/16)
+ * - ``ipv6_mc_dip_interface_local_scope``
+ - ``drop``
+ - Traps IPv6 packets that the device decided to drop because they need to
+ be routed and their IPv6 multicast destination IP has an interface-local scope
+ (i.e., ffx1::/16)
+ * - ``mtu_value_is_too_small``
+ - ``exception``
+ - Traps packets that should have been routed by the device, but were bigger
+ than the MTU of the egress interface
+ * - ``unresolved_neigh``
+ - ``exception``
+ - Traps packets that did not have a matching IP neighbour after routing
+ * - ``mc_reverse_path_forwarding``
+ - ``exception``
+ - Traps multicast IP packets that failed reverse-path forwarding (RPF)
+ check during multicast routing
+ * - ``reject_route``
+ - ``exception``
+ - Traps packets that hit reject routes (i.e., "unreachable", "prohibit")
+ * - ``ipv4_lpm_miss``
+ - ``exception``
+ - Traps unicast IPv4 packets that did not match any route
+ * - ``ipv6_lpm_miss``
+ - ``exception``
+ - Traps unicast IPv6 packets that did not match any route
+ * - ``non_routable_packet``
+ - ``drop``
+ - Traps packets that the device decided to drop because they are not
+ supposed to be routed. For example, IGMP queries can be flooded by the
+ device in layer 2 and reach the router. Such packets should not be
+ routed and instead dropped
+ * - ``decap_error``
+ - ``exception``
+ - Traps NVE and IPinIP packets that the device decided to drop because of
+ failure during decapsulation (e.g., packet being too short, reserved
+ bits set in VXLAN header)
+ * - ``overlay_smac_is_mc``
+ - ``drop``
+ - Traps NVE packets that the device decided to drop because their overlay
+ source MAC is multicast
+
+Driver-specific Packet Traps
+============================
+
+Device drivers can register driver-specific packet traps, but these must be
+clearly documented. Such traps can correspond to device-specific exceptions and
+help debug packet drops caused by these exceptions. The following list includes
+links to the description of driver-specific traps registered by various device
+drivers:
+
+ * :doc:`netdevsim`
+ * :doc:`mlxsw`
+
+Generic Packet Trap Groups
+==========================
+
+Generic packet trap groups are used to aggregate logically related packet
+traps. These groups allow the user to batch operations such as setting the trap
+action of all member traps. In addition, ``devlink-trap`` can report aggregated
+per-group packets and bytes statistics, in case per-trap statistics are too
+narrow. The description of these groups must be added to the following table:
+
+.. list-table:: List of Generic Packet Trap Groups
+ :widths: 10 90
+
+ * - Name
+ - Description
+ * - ``l2_drops``
+ - Contains packet traps for packets that were dropped by the device during
+ layer 2 forwarding (i.e., bridge)
+ * - ``l3_drops``
+ - Contains packet traps for packets that were dropped by the device or hit
+ an exception (e.g., TTL error) during layer 3 forwarding
+ * - ``buffer_drops``
+ - Contains packet traps for packets that were dropped by the device due to
+ an enqueue decision
+ * - ``tunnel_drops``
+ - Contains packet traps for packets that were dropped by the device during
+ tunnel encapsulation / decapsulation
+
+Testing
+=======
+
+See ``tools/testing/selftests/drivers/net/netdevsim/devlink_trap.sh`` for a
+test covering the core infrastructure. Test cases should be added for any new
+functionality.
+
+Device drivers should focus their tests on device-specific functionality, such
+as the triggering of supported packet traps.
diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst
new file mode 100644
index 000000000000..087ff54d53fc
--- /dev/null
+++ b/Documentation/networking/devlink/index.rst
@@ -0,0 +1,42 @@
+Linux Devlink Documentation
+===========================
+
+devlink is an API to expose device information and resources not directly
+related to any device class, such as chip-wide/switch-ASIC-wide configuration.
+
+Interface documentation
+-----------------------
+
+The following pages describe various interfaces available through devlink in
+general.
+
+.. toctree::
+ :maxdepth: 1
+
+ devlink-dpipe
+ devlink-health
+ devlink-info
+ devlink-params
+ devlink-region
+ devlink-resource
+ devlink-trap
+
+Driver-specific documentation
+-----------------------------
+
+Each driver that implements ``devlink`` is expected to document what
+parameters, info versions, and other features it supports.
+
+.. toctree::
+ :maxdepth: 1
+
+ bnxt
+ ionic
+ mlx4
+ mlx5
+ mlxsw
+ mv88e6xxx
+ netdevsim
+ nfp
+ qed
+ ti-cpsw-switch
diff --git a/Documentation/networking/devlink/ionic.rst b/Documentation/networking/devlink/ionic.rst
new file mode 100644
index 000000000000..48da9c92d584
--- /dev/null
+++ b/Documentation/networking/devlink/ionic.rst
@@ -0,0 +1,29 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+ionic devlink support
+=====================
+
+This document describes the devlink features implemented by the ``ionic``
+device driver.
+
+Info versions
+=============
+
+The ``ionic`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``fw``
+ - running
+ - Version of firmware running on the device
+ * - ``asic.id``
+ - fixed
+ - The ASIC type for this device
+ * - ``asic.rev``
+ - fixed
+ - The revision of the ASIC for this device
diff --git a/Documentation/networking/devlink/mlx4.rst b/Documentation/networking/devlink/mlx4.rst
new file mode 100644
index 000000000000..7b2d17ea5471
--- /dev/null
+++ b/Documentation/networking/devlink/mlx4.rst
@@ -0,0 +1,56 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+mlx4 devlink support
+====================
+
+This document describes the devlink features implemented by the ``mlx4``
+device driver.
+
+Parameters
+==========
+
+.. list-table:: Generic parameters implemented
+
+ * - Name
+ - Mode
+ * - ``internal_err_reset``
+ - driverinit, runtime
+ * - ``max_macs``
+ - driverinit
+ * - ``region_snapshot_enable``
+ - driverinit, runtime
+
+The ``mlx4`` driver also implements the following driver-specific
+parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``enable_64b_cqe_eqe``
+ - Boolean
+ - driverinit
+ - Enable 64 byte CQEs/EQEs, if the FW supports it.
+ * - ``enable_4k_uar``
+ - Boolean
+ - driverinit
+ - Enable using the 4k UAR.
+
+The ``mlx4`` driver supports reloading via ``DEVLINK_CMD_RELOAD``
+
+Regions
+=======
+
+The ``mlx4`` driver supports dumping the firmware PCI crspace and health
+buffer during a critical firmware issue.
+
+ The driver takes a snapshot when a firmware command times out, the firmware
+ gets stuck, or the catastrophic buffer holds a non-zero value.
+
+The ``cr-space`` region will contain the firmware PCI crspace contents. The
+``fw-health`` region will contain the device firmware's health buffer.
+Snapshots for both of these regions are taken on the same event triggers.
diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst
new file mode 100644
index 000000000000..629a6e69c036
--- /dev/null
+++ b/Documentation/networking/devlink/mlx5.rst
@@ -0,0 +1,59 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+mlx5 devlink support
+====================
+
+This document describes the devlink features implemented by the ``mlx5``
+device driver.
+
+Parameters
+==========
+
+.. list-table:: Generic parameters implemented
+
+ * - Name
+ - Mode
+ * - ``enable_roce``
+ - driverinit
+
+The ``mlx5`` driver also implements the following driver-specific
+parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``flow_steering_mode``
+ - string
+ - runtime
+ - Controls the flow steering mode of the driver
+
+ * ``dmfs`` Device managed flow steering. In DMFS mode, the HW
+ steering entities are created and managed through firmware.
+ * ``smfs`` Software managed flow steering. In SMFS mode, the HW
+ steering entities are created and managed through the driver without
+ firmware intervention.
+
+The ``mlx5`` driver supports reloading via ``DEVLINK_CMD_RELOAD``
+
+Info versions
+=============
+
+The ``mlx5`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``fw.psid``
+ - fixed
+ - Used to represent the board id of the device.
+ * - ``fw.version``
+ - stored, running
+ - Three digit major.minor.subminor firmware version number.
diff --git a/Documentation/networking/devlink/mlxsw.rst b/Documentation/networking/devlink/mlxsw.rst
new file mode 100644
index 000000000000..cf857cb4ba8f
--- /dev/null
+++ b/Documentation/networking/devlink/mlxsw.rst
@@ -0,0 +1,81 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+mlxsw devlink support
+=====================
+
+This document describes the devlink features implemented by the ``mlxsw``
+device driver.
+
+Parameters
+==========
+
+.. list-table:: Generic parameters implemented
+
+ * - Name
+ - Mode
+ * - ``fw_load_policy``
+ - driverinit
+
+The ``mlxsw`` driver also implements the following driver-specific
+parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``acl_region_rehash_interval``
+ - u32
+ - runtime
+ - Sets an interval for periodic ACL region rehashes. The value is
+ specified in milliseconds, with a minimum of ``3000``. The value of
+ ``0`` disables periodic work entirely. The first rehash will be run
+ immediately after the value is set.
+
+The ``mlxsw`` driver supports reloading via ``DEVLINK_CMD_RELOAD``
+
+Info versions
+=============
+
+The ``mlxsw`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``hw.revision``
+ - fixed
+ - The hardware revision for this board
+ * - ``fw.psid``
+ - fixed
+ - Firmware PSID
+ * - ``fw.version``
+ - running
+ - Three digit firmware version
+
+Driver-specific Traps
+=====================
+
+.. list-table:: List of Driver-specific Traps Registered by ``mlxsw``
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``irif_disabled``
+ - ``drop``
+ - Traps packets that the device decided to drop because they need to be
+ routed from a disabled router interface (RIF). This can happen during
+ RIF dismantle, when the RIF is first disabled before being removed
+ completely
+ * - ``erif_disabled``
+ - ``drop``
+ - Traps packets that the device decided to drop because they need to be
+ routed through a disabled router interface (RIF). This can happen during
+ RIF dismantle, when the RIF is first disabled before being removed
+ completely
diff --git a/Documentation/networking/devlink/mv88e6xxx.rst b/Documentation/networking/devlink/mv88e6xxx.rst
new file mode 100644
index 000000000000..c621212a47a1
--- /dev/null
+++ b/Documentation/networking/devlink/mv88e6xxx.rst
@@ -0,0 +1,28 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+mv88e6xxx devlink support
+=========================
+
+This document describes the devlink features implemented by the ``mv88e6xxx``
+device driver.
+
+Parameters
+==========
+
+The ``mv88e6xxx`` driver implements the following driver-specific parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``ATU_hash``
+ - u8
+ - runtime
+ - Select one of four possible hashing algorithms for MAC addresses in
+ the Address Translation Unit. A value of 3 may work better than the
+ default of 1 when many MAC addresses have the same OUI. Only the
+ values 0 to 3 are valid for this parameter.
diff --git a/Documentation/networking/devlink/netdevsim.rst b/Documentation/networking/devlink/netdevsim.rst
new file mode 100644
index 000000000000..2a266b7e7b38
--- /dev/null
+++ b/Documentation/networking/devlink/netdevsim.rst
@@ -0,0 +1,72 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+netdevsim devlink support
+=========================
+
+This document describes the ``devlink`` features supported by the
+``netdevsim`` device driver.
+
+Parameters
+==========
+
+.. list-table:: Generic parameters implemented
+
+ * - Name
+ - Mode
+ * - ``max_macs``
+ - driverinit
+
+The ``netdevsim`` driver also implements the following driver-specific
+parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``test1``
+ - Boolean
+ - driverinit
+ - Test parameter used to show how a driver-specific devlink parameter
+ can be implemented.
+
+The ``netdevsim`` driver supports reloading via ``DEVLINK_CMD_RELOAD``
+
+Regions
+=======
+
+The ``netdevsim`` driver exposes a ``dummy`` region as an example of how the
+devlink-region interfaces work. A snapshot is taken whenever the
+``take_snapshot`` debugfs file is written to.
+
+Resources
+=========
+
+The ``netdevsim`` driver exposes resources to control the number of FIB
+entries and FIB rule entries that the driver will allow.
+
+.. code:: shell
+
+ $ devlink resource set netdevsim/netdevsim0 path /IPv4/fib size 96
+ $ devlink resource set netdevsim/netdevsim0 path /IPv4/fib-rules size 16
+ $ devlink resource set netdevsim/netdevsim0 path /IPv6/fib size 64
+ $ devlink resource set netdevsim/netdevsim0 path /IPv6/fib-rules size 16
+ $ devlink dev reload netdevsim/netdevsim0
+
+Driver-specific Traps
+=====================
+
+.. list-table:: List of Driver-specific Traps Registered by ``netdevsim``
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``fid_miss``
+ - ``exception``
+ - When a packet enters the device it is classified to a filtering
+ identifier (FID) based on the ingress port and VLAN. This trap is used
+ to trap packets for which a FID could not be found
diff --git a/Documentation/networking/devlink/nfp.rst b/Documentation/networking/devlink/nfp.rst
new file mode 100644
index 000000000000..a1717db0dfcc
--- /dev/null
+++ b/Documentation/networking/devlink/nfp.rst
@@ -0,0 +1,65 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+nfp devlink support
+===================
+
+This document describes the devlink features implemented by the ``nfp``
+device driver.
+
+Parameters
+==========
+
+.. list-table:: Generic parameters implemented
+
+ * - Name
+ - Mode
+ * - ``fw_load_policy``
+ - permanent
+ * - ``reset_dev_on_drv_probe``
+ - permanent
+
+Info versions
+=============
+
+The ``nfp`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``board.id``
+ - fixed
+ - Part number identifying the board design
+ * - ``board.rev``
+ - fixed
+ - Revision of the board design
+ * - ``board.manufacture``
+ - fixed
+ - Vendor of the board design
+ * - ``board.model``
+ - fixed
+ - Model name of the board design
+ * - ``fw.bundle_id``
+ - stored, running
+ - Firmware bundle id
+ * - ``fw.mgmt``
+ - stored, running
+ - Version of the management firmware
+ * - ``fw.cpld``
+ - stored, running
+ - The CPLD firmware component version
+ * - ``fw.app``
+ - stored, running
+ - The APP firmware component version
+ * - ``fw.undi``
+ - stored, running
+ - The UNDI firmware component version
+ * - ``fw.ncsi``
+ - stored, running
+ - The NCSI firmware component version
+ * - ``chip.init``
+ - stored, running
+ - The CFGR firmware component version
diff --git a/Documentation/networking/devlink/qed.rst b/Documentation/networking/devlink/qed.rst
new file mode 100644
index 000000000000..805c6f63621a
--- /dev/null
+++ b/Documentation/networking/devlink/qed.rst
@@ -0,0 +1,26 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+qed devlink support
+===================
+
+This document describes the devlink features implemented by the ``qed`` core
+device driver.
+
+Parameters
+==========
+
+The ``qed`` driver implements the following driver-specific parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``iwarp_cmt``
+ - Boolean
+ - runtime
+ - Enable iWARP functionality for 100g devices. Note that this impacts
+ L2 performance, and is therefore not enabled by default.
diff --git a/Documentation/networking/devlink/ti-cpsw-switch.rst b/Documentation/networking/devlink/ti-cpsw-switch.rst
new file mode 100644
index 000000000000..dc399e32abaa
--- /dev/null
+++ b/Documentation/networking/devlink/ti-cpsw-switch.rst
@@ -0,0 +1,31 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+ti-cpsw-switch devlink support
+==============================
+
+This document describes the devlink features implemented by the ``ti-cpsw-switch``
+device driver.
+
+Parameters
+==========
+
+The ``ti-cpsw-switch`` driver implements the following driver-specific
+parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``ale_bypass``
+ - Boolean
+ - runtime
+ - Enables ALE_CONTROL(4).BYPASS mode for debugging purposes. In this
+ mode, all packets will be sent to the host port only.
+ * - ``switch_mode``
+ - Boolean
+ - runtime
+ - Enable switch mode
diff --git a/Documentation/networking/dsa/sja1105.rst b/Documentation/networking/dsa/sja1105.rst
index cb2858dece93..64553d8d91cb 100644
--- a/Documentation/networking/dsa/sja1105.rst
+++ b/Documentation/networking/dsa/sja1105.rst
@@ -146,6 +146,90 @@ enslaves eth0 and eth1 (the DSA master of the switch ports). This is because in
this mode, the switch ports beneath br0 are not capable of regular traffic, and
are only used as a conduit for switchdev operations.
+Offloads
+========
+
+Time-aware scheduling
+---------------------
+
+The switch supports a variation of the enhancements for scheduled traffic
+specified in IEEE 802.1Q-2018 (formerly 802.1Qbv). This means it can be used to
+ensure deterministic latency for priority traffic that is sent in-band with its
+gate-open event in the network schedule.
+
+This capability can be managed through the tc-taprio offload ('flags 2'). The
+difference compared to the software implementation of taprio is that the latter
+ would only be able to shape traffic originating from the CPU, but not
+autonomously forwarded flows.
+
+The device has 8 traffic classes, and maps incoming frames to one of them based
+on the VLAN PCP bits (if no VLAN is present, the port-based default is used).
+As described in the previous sections, depending on the value of
+``vlan_filtering``, the EtherType recognized by the switch as being VLAN can
+either be the typical 0x8100 or a custom value used internally by the driver
+for tagging. Therefore, the switch ignores the VLAN PCP if used in standalone
+or bridge mode with ``vlan_filtering=0``, as it will not recognize the 0x8100
+EtherType. In these modes, injecting into a particular TX queue can only be
+done by the DSA net devices, which populate the PCP field of the tagging header
+on egress. Using ``vlan_filtering=1``, the behavior is the other way around:
+offloaded flows can be steered to TX queues based on the VLAN PCP, but the DSA
+net devices are no longer able to do that. To inject frames into a hardware TX
+queue with VLAN awareness active, it is necessary to create a VLAN
+ sub-interface on the DSA master port, and send normal (0x8100) VLAN-tagged
+ frames towards the switch, with the VLAN PCP bits set appropriately.
+
+Management traffic (having DMAC 01-80-C2-xx-xx-xx or 01-19-1B-xx-xx-xx) is the
+notable exception: the switch always treats it with a fixed priority and
+disregards any VLAN PCP bits even if present. The traffic class for management
+traffic has a value of 7 (highest priority) at the moment, which is not
+configurable in the driver.
+
+Below is an example of configuring a 500 us cyclic schedule on egress port
+``swp5``. The traffic class gate for management traffic (7) is open for 100 us,
+and the gates for all other traffic classes are open for 400 us::
+
+  #!/bin/bash
+
+  set -e -u -o pipefail
+
+  NSEC_PER_SEC="1000000000"
+
+  # Print the hexadecimal gate mask for a space-separated list of traffic
+  # classes.
+  gatemask() {
+          local tc_list="$1"
+          local mask=0
+
+          for tc in ${tc_list}; do
+                  mask=$((${mask} | (1 << ${tc})))
+          done
+
+          printf "%02x" ${mask}
+  }
+
+  if ! systemctl is-active --quiet ptp4l; then
+          echo "Please start the ptp4l service"
+          exit 1
+  fi
+
+  now=$(phc_ctl /dev/ptp1 get | gawk '/clock time is/ { print $5; }')
+  # Phase-align the base time to the start of the next second.
+  sec=$(echo "${now}" | gawk -F. '{ print $1; }')
+  base_time="$(((${sec} + 1) * ${NSEC_PER_SEC}))"
+
+  tc qdisc add dev swp5 parent root handle 100 taprio \
+          num_tc 8 \
+          map 0 1 2 3 4 5 6 7 \
+          queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
+          base-time ${base_time} \
+          sched-entry S $(gatemask 7) 100000 \
+          sched-entry S $(gatemask "0 1 2 3 4 5 6") 400000 \
+          flags 2
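+
+The gate masks and the base time used by the script can be cross-checked
+off-target. The following is a minimal Python sketch, not part of the driver
+or of iproute2; the helper names are hypothetical and simply mirror the shell
+functions above:
+

```python
NSEC_PER_SEC = 1_000_000_000


def gatemask(traffic_classes):
    """Return the gate mask with one bit set per open traffic class."""
    mask = 0
    for tc in traffic_classes:
        mask |= 1 << tc
    return mask


def next_second_ns(phc_time):
    """Round a "seconds.nanoseconds" PHC reading up to the next full second."""
    seconds = int(str(phc_time).split(".")[0])
    return (seconds + 1) * NSEC_PER_SEC


# (gate mask, interval in ns) pairs matching the two sched-entry lines.
schedule = [
    (gatemask([7]), 100_000),
    (gatemask(range(7)), 400_000),
]
cycle_time_ns = sum(interval for _, interval in schedule)

if __name__ == "__main__":
    for mask, interval in schedule:
        print(f"sched-entry S {mask:02x} {interval}")
    print("cycle time:", cycle_time_ns, "ns")
```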
+
+It is possible to apply the tc-taprio offload on multiple egress ports. There
+are hardware restrictions related to the fact that no gate event may trigger
+simultaneously on two ports. The driver checks the consistency of the schedules
+against this restriction and errors out when appropriate. Schedule analysis is
+needed to avoid this, which is outside the scope of the document.
+
Device Tree bindings and board design
=====================================
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
new file mode 100644
index 000000000000..f1f868479ceb
--- /dev/null
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -0,0 +1,618 @@
+=============================
+Netlink interface for ethtool
+=============================
+
+
+Basic information
+=================
+
+Netlink interface for ethtool uses generic netlink family ``ethtool``
+(userspace application should use macros ``ETHTOOL_GENL_NAME`` and
+``ETHTOOL_GENL_VERSION`` defined in ``<linux/ethtool_netlink.h>`` uapi
+header). This family does not use a specific header; all information in
+requests and replies is passed using netlink attributes.
+
+The ethtool netlink interface uses extended ACK for error and warning
+reporting; userspace application developers are encouraged to make these
+messages available to the user in a suitable way.
+
+Requests can be divided into three categories: "get" (retrieving information),
+"set" (setting parameters) and "action" (invoking an action).
+
+All "set" and "action" type requests require admin privileges
+(``CAP_NET_ADMIN`` in the namespace). Most "get" type requests are allowed for
+anyone but there are exceptions (where the response contains sensitive
+information). In some cases, the request as such is allowed for anyone but
+unprivileged users have attributes with sensitive information (e.g.
+wake-on-lan password) omitted.
+
+
+Conventions
+===========
+
+Attributes which represent a boolean value usually use NLA_U8 type so that we
+can distinguish three states: "on", "off" and "not present" (meaning the
+information is not available in "get" requests or value is not to be changed
+in "set" requests). For these attributes, the "true" value should be passed as
+number 1 but any non-zero value should be understood as "true" by recipient.
+In the tables below, "bool" denotes NLA_U8 attributes interpreted in this way.
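+
+The three-state convention can be sketched as follows. This is only an
+illustration: it assumes the message has already been parsed into a
+hypothetical dictionary mapping attribute types to raw payload bytes, which is
+not something this interface defines:
+

```python
def decode_bool(attrs, attr_type):
    """Return True, False, or None when the attribute is not present."""
    payload = attrs.get(attr_type)
    if payload is None:
        return None            # "not present": unknown / leave unchanged
    return payload[0] != 0     # any non-zero value is understood as "true"


def encode_bool(value):
    """Encode a boolean as an NLA_U8 payload; "true" is sent as 1."""
    return bytes([1 if value else 0])
```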
+
+In the message structure descriptions below, if an attribute name is suffixed
+with "+", parent nest can contain multiple attributes of the same type. This
+implements an array of entries.
+
+
+Request header
+==============
+
+Each request or reply message contains a nested attribute with common header.
+Structure of this header is
+
+  ==============================  ======  =============================
+  ``ETHTOOL_A_HEADER_DEV_INDEX``  u32     device ifindex
+  ``ETHTOOL_A_HEADER_DEV_NAME``   string  device name
+  ``ETHTOOL_A_HEADER_FLAGS``      u32     flags common for all requests
+  ==============================  ======  =============================
+
+``ETHTOOL_A_HEADER_DEV_INDEX`` and ``ETHTOOL_A_HEADER_DEV_NAME`` identify the
+device the message relates to. One of them is sufficient in requests; if both
+are used, they must identify the same device. Some requests, e.g. global string
+sets, do not require device identification. Most ``GET`` requests also allow
+dump requests without device identification to query the same information for
+all devices providing it (each device in a separate message).
+
+``ETHTOOL_A_HEADER_FLAGS`` is a bitmap of request flags common for all request
+types. The interpretation of these flags is the same for all request types but
+a flag may not apply to every request type. Recognized flags are:
+
+  =================================  ===================================
+  ``ETHTOOL_FLAG_COMPACT_BITSETS``   use compact format bitsets in reply
+  ``ETHTOOL_FLAG_OMIT_REPLY``        omit optional reply (_SET and _ACT)
+  =================================  ===================================
+
+New request flags should follow the general idea that if the flag is not set,
+the behaviour is backward compatible, i.e. requests from old clients not aware
+of the flag should be interpreted the way the client expects. A client must
+not set flags it does not understand.
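+
+As an illustration of how the common header nest might be serialized, here is
+a minimal Python sketch. The numeric attribute and flag values are assumptions
+(they are not stated in this document and should be taken from
+``<linux/ethtool_netlink.h>``), ``request_header()`` is a hypothetical helper,
+and the attribute layout is the generic netlink one (16-bit length, 16-bit
+type, payload padded to 4 bytes):
+

```python
import struct

NLA_HDRLEN = 4
NLA_F_NESTED = 0x8000

# Assumed values; verify against <linux/ethtool_netlink.h>.
ETHTOOL_A_HEADER_DEV_INDEX = 1
ETHTOOL_A_HEADER_DEV_NAME = 2
ETHTOOL_A_HEADER_FLAGS = 3
ETHTOOL_FLAG_COMPACT_BITSETS = 1 << 0
ETHTOOL_FLAG_OMIT_REPLY = 1 << 1


def nla(attr_type, payload):
    """Serialize one netlink attribute, padding the payload to 4 bytes."""
    length = NLA_HDRLEN + len(payload)   # nla_len excludes the padding
    padding = b"\0" * (-length % 4)
    return struct.pack("=HH", length, attr_type) + payload + padding


def request_header(nest_type, ifindex=None, ifname=None, flags=0):
    """Build the nested header attribute of a request (hypothetical helper)."""
    body = b""
    if ifindex is not None:
        body += nla(ETHTOOL_A_HEADER_DEV_INDEX, struct.pack("=I", ifindex))
    if ifname is not None:
        body += nla(ETHTOOL_A_HEADER_DEV_NAME, ifname.encode() + b"\0")
    if flags:
        body += nla(ETHTOOL_A_HEADER_FLAGS, struct.pack("=I", flags))
    return nla(nest_type | NLA_F_NESTED, body)
```
+
+Either the index or the name is enough to identify the device, so a request
+would typically pass only one of ``ifindex`` and ``ifname``.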
+
+
+Bit sets
+========
+
+For short bitmaps of (reasonably) fixed length, standard ``NLA_BITFIELD32``
+type is used. For arbitrary length bitmaps, ethtool netlink uses a nested
+attribute with contents of one of two forms: compact (two binary bitmaps
+representing bit values and mask of affected bits) and bit-by-bit (list of
+bits identified by either index or name).
+
+Verbose (bit-by-bit) bitsets allow sending symbolic names for bits together
+with their values which saves a round trip (when the bitset is passed in a
+request) or at least a second request (when the bitset is in a reply). This is
+useful for one shot applications like traditional ethtool command. On the
+other hand, long running applications like ethtool monitor (displaying
+notifications) or network management daemons may prefer fetching the names
+only once and using compact form to save message size. Notifications from
+ethtool netlink interface always use compact form for bitsets.
+
+A bitset can represent either a value/mask pair (``ETHTOOL_A_BITSET_NOMASK``
+not set) or a single bitmap (``ETHTOOL_A_BITSET_NOMASK`` set). In requests
+modifying a bitmap, the former changes the bit set in mask to values set in
+value and preserves the rest; the latter sets the bits set in the bitmap and
+clears the rest.
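+
+The two update rules can be written down compactly. The sketch below models
+bitmaps as Python integers and uses a hypothetical ``apply_bitset()`` helper;
+passing no mask stands for ``ETHTOOL_A_BITSET_NOMASK`` being set:
+

```python
def apply_bitset(current, value, mask=None):
    """Return the bitmap that results from applying a bitset to `current`."""
    if mask is None:
        # Single bitmap: listed bits are set, everything else is cleared.
        return value
    # Value/mask pair: bits in the mask take their value, the rest is kept.
    return (current & ~mask) | (value & mask)
```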
+
+Compact form: nested (bitset) attribute contents:
+
+  ============================  ======  ============================
+  ``ETHTOOL_A_BITSET_NOMASK``   flag    no mask, only a list
+  ``ETHTOOL_A_BITSET_SIZE``     u32     number of significant bits
+  ``ETHTOOL_A_BITSET_VALUE``    binary  bitmap of bit values
+  ``ETHTOOL_A_BITSET_MASK``     binary  bitmap of valid bits
+  ============================  ======  ============================
+
+Value and mask must have length at least ``ETHTOOL_A_BITSET_SIZE`` bits
+rounded up to a multiple of 32 bits. They consist of 32-bit words in host byte
+order, words ordered from least significant to most significant (i.e. the same
+way as bitmaps are passed with ioctl interface).
+
+For compact form, ``ETHTOOL_A_BITSET_SIZE`` and ``ETHTOOL_A_BITSET_VALUE`` are
+mandatory. ``ETHTOOL_A_BITSET_MASK`` attribute is mandatory if
+``ETHTOOL_A_BITSET_NOMASK`` is not set (bitset represents a value/mask pair);
+if ``ETHTOOL_A_BITSET_NOMASK`` is set, ``ETHTOOL_A_BITSET_MASK`` is not
+allowed (bitset represents a single bitmap).
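+
+The word layout described above can be illustrated with a short Python
+sketch. The helper names are hypothetical; bitmaps are modelled as integers
+and packed into 32-bit words in host byte order, least significant word
+first:
+

```python
import struct


def pack_bitmap(bits, size):
    """Pack `bits` into the payload of a VALUE or MASK attribute."""
    if bits >> size:
        raise ValueError("bit set beyond ETHTOOL_A_BITSET_SIZE")
    nwords = (size + 31) // 32            # round up to a multiple of 32 bits
    words = [(bits >> (32 * i)) & 0xFFFFFFFF for i in range(nwords)]
    return struct.pack(f"={nwords}I", *words)


def unpack_bitmap(payload):
    """Inverse of pack_bitmap(): rebuild the integer from 32-bit words."""
    nwords = len(payload) // 4
    words = struct.unpack(f"={nwords}I", payload[:4 * nwords])
    return sum(word << (32 * i) for i, word in enumerate(words))
```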
+
+Kernel bit set length may differ from userspace length if older application is
+used on newer kernel or vice versa. If userspace bitmap is longer, an error is
+issued only if the request actually tries to set values of some bits not
+recognized by kernel.
+
+Bit-by-bit form: nested (bitset) attribute contents:
+
+ +------------------------------------+--------+-----------------------------+
+ | ``ETHTOOL_A_BITSET_NOMASK`` | flag | no mask, only a list |
+ +------------------------------------+--------+-----------------------------+
+ | ``ETHTOOL_A_BITSET_SIZE`` | u32 | number of significant bits |
+ +------------------------------------+--------+-----------------------------+
+ | ``ETHTOOL_A_BITSET_BITS`` | nested | array of bits |
+ +-+----------------------------------+--------+-----------------------------+
+ | | ``ETHTOOL_A_BITSET_BITS_BIT+`` | nested | one bit |
+ +-+-+--------------------------------+--------+-----------------------------+
+ | | | ``ETHTOOL_A_BITSET_BIT_INDEX`` | u32 | bit index (0 for LSB) |
+ +-+-+--------------------------------+--------+-----------------------------+
+ | | | ``ETHTOOL_A_BITSET_BIT_NAME`` | string | bit name |
+ +-+-+--------------------------------+--------+-----------------------------+
+ | | | ``ETHTOOL_A_BITSET_BIT_VALUE`` | flag | present if bit is set |
+ +-+-+--------------------------------+--------+-----------------------------+
+
+Bit size is optional for bit-by-bit form. ``ETHTOOL_A_BITSET_BITS`` nest can
+only contain ``ETHTOOL_A_BITSET_BITS_BIT`` attributes but there can be an
+arbitrary number of them. A bit may be identified by its index or by its
+name. When used in requests, listed bits are set to 0 or 1 according to
+``ETHTOOL_A_BITSET_BIT_VALUE``, the rest is preserved. A request fails if
+index exceeds kernel bit length or if name is not recognized.
+
+When ``ETHTOOL_A_BITSET_NOMASK`` flag is present, bitset is interpreted as
+a simple bitmap. ``ETHTOOL_A_BITSET_BIT_VALUE`` attributes are not used in
+such case. Such bitset represents a bitmap with listed bits set and the rest
+zero.
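+
+A sketch of how a receiver could interpret the bit-by-bit form, assuming the
+nest has already been parsed into ``(identifier, value)`` pairs where the
+identifier is either an index or a name. ``apply_bit_list()`` and the
+``names`` lookup table are hypothetical and only model the rules stated
+above:
+

```python
def apply_bit_list(current, bits, kernel_nbits, names=None, nomask=False):
    """Apply a bit-by-bit bitset to the integer bitmap `current`."""
    names = names or {}
    # With NOMASK the result starts from zero; otherwise the rest is preserved.
    result = 0 if nomask else current
    for ident, value in bits:
        if isinstance(ident, str):
            if ident not in names:
                raise ValueError(f"unknown bit name: {ident}")
            index = names[ident]
        else:
            index = ident
        if index >= kernel_nbits:
            raise ValueError(f"bit index {index} out of range")
        if nomask or value:
            result |= 1 << index      # NOMASK ignores the per-bit value
        else:
            result &= ~(1 << index)
    return result
```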
+
+In requests, application can use either form. Form used by kernel in reply is
+determined by ``ETHTOOL_FLAG_COMPACT_BITSETS`` flag in flags field of request
+header. Semantics of value and mask depends on the attribute.
+
+
+List of message types
+=====================
+
+All constants identifying message types use ``ETHTOOL_MSG_`` prefix and suffix
+according to message purpose:
+
+  ==============  ======================================
+  ``_GET``        userspace request to retrieve data
+  ``_SET``        userspace request to set data
+  ``_ACT``        userspace request to perform an action
+  ``_GET_REPLY``  kernel reply to a ``GET`` request
+  ``_SET_REPLY``  kernel reply to a ``SET`` request
+  ``_ACT_REPLY``  kernel reply to an ``ACT`` request
+  ``_NTF``        kernel notification
+  ==============  ======================================
+
+Userspace to kernel:
+
+ ===================================== ================================
+ ``ETHTOOL_MSG_STRSET_GET`` get string set
+ ``ETHTOOL_MSG_LINKINFO_GET`` get link settings
+ ``ETHTOOL_MSG_LINKINFO_SET`` set link settings
+ ``ETHTOOL_MSG_LINKMODES_GET`` get link modes info
+ ``ETHTOOL_MSG_LINKMODES_SET`` set link modes info
+ ``ETHTOOL_MSG_LINKSTATE_GET`` get link state
+ ``ETHTOOL_MSG_DEBUG_GET`` get debugging settings
+ ``ETHTOOL_MSG_DEBUG_SET`` set debugging settings
+ ``ETHTOOL_MSG_WOL_GET`` get wake-on-lan settings
+ ``ETHTOOL_MSG_WOL_SET`` set wake-on-lan settings
+ ===================================== ================================
+
+Kernel to userspace:
+
+ ===================================== =================================
+ ``ETHTOOL_MSG_STRSET_GET_REPLY`` string set contents
+ ``ETHTOOL_MSG_LINKINFO_GET_REPLY`` link settings
+ ``ETHTOOL_MSG_LINKINFO_NTF`` link settings notification
+ ``ETHTOOL_MSG_LINKMODES_GET_REPLY`` link modes info
+ ``ETHTOOL_MSG_LINKMODES_NTF`` link modes notification
+ ``ETHTOOL_MSG_LINKSTATE_GET_REPLY`` link state info
+ ``ETHTOOL_MSG_DEBUG_GET_REPLY`` debugging settings
+ ``ETHTOOL_MSG_DEBUG_NTF`` debugging settings notification
+ ``ETHTOOL_MSG_WOL_GET_REPLY`` wake-on-lan settings
+ ``ETHTOOL_MSG_WOL_NTF`` wake-on-lan settings notification
+ ===================================== =================================
+
+``GET`` requests are sent by userspace applications to retrieve device
+information. They usually do not contain any message specific attributes.
+Kernel replies with corresponding ``GET_REPLY`` message. For most types, ``GET``
+request with ``NLM_F_DUMP`` and no device identification can be used to query
+the information for all devices supporting the request.
+
+If the data can be also modified, corresponding ``SET`` message with the same
+layout as corresponding ``GET_REPLY`` is used to request changes. Only
+attributes where a change is requested are included in such request (also, not
+all attributes may be changed). Replies to most ``SET`` requests consist only
+of error code and extack; if kernel provides additional data, it is sent in
+the form of corresponding ``SET_REPLY`` message which can be suppressed by
+setting ``ETHTOOL_FLAG_OMIT_REPLY`` flag in request header.
+
+Data modification also triggers sending a ``NTF`` message with a notification.
+These usually bear only a subset of attributes which was affected by the
+change. The same notification is issued if the data is modified using other
+means (mostly ioctl ethtool interface). Unlike notifications from ethtool
+netlink code which are only sent if something actually changed, notifications
+triggered by ioctl interface may be sent even if the request did not actually
+change any data.
+
+``ACT`` messages request kernel (driver) to perform a specific action. If some
+information is reported by kernel (which can be suppressed by setting
+``ETHTOOL_FLAG_OMIT_REPLY`` flag in request header), the reply takes form of
+an ``ACT_REPLY`` message. Performing an action also triggers a notification
+(``NTF`` message).
+
+Later sections describe the format and semantics of these messages.
+
+
+STRSET_GET
+==========
+
+Requests contents of a string set as provided by ioctl commands
+``ETHTOOL_GSSET_INFO`` and ``ETHTOOL_GSTRINGS``. String sets are not user
+writeable so that the corresponding ``STRSET_SET`` message is only used in
+kernel replies. There are two types of string sets: global (independent of
+a device, e.g. device feature names) and device specific (e.g. device private
+flags).
+
+Request contents:
+
+ +---------------------------------------+--------+------------------------+
+ | ``ETHTOOL_A_STRSET_HEADER`` | nested | request header |
+ +---------------------------------------+--------+------------------------+
+ | ``ETHTOOL_A_STRSET_STRINGSETS`` | nested | string set to request |
+ +-+-------------------------------------+--------+------------------------+
+ | | ``ETHTOOL_A_STRINGSETS_STRINGSET+`` | nested | one string set |
+ +-+-+-----------------------------------+--------+------------------------+
+ | | | ``ETHTOOL_A_STRINGSET_ID`` | u32 | set id |
+ +-+-+-----------------------------------+--------+------------------------+
+
+Kernel response contents:
+
+ +---------------------------------------+--------+-----------------------+
+ | ``ETHTOOL_A_STRSET_HEADER`` | nested | reply header |
+ +---------------------------------------+--------+-----------------------+
+ | ``ETHTOOL_A_STRSET_STRINGSETS`` | nested | array of string sets |
+ +-+-------------------------------------+--------+-----------------------+
+ | | ``ETHTOOL_A_STRINGSETS_STRINGSET+`` | nested | one string set |
+ +-+-+-----------------------------------+--------+-----------------------+
+ | | | ``ETHTOOL_A_STRINGSET_ID`` | u32 | set id |
+ +-+-+-----------------------------------+--------+-----------------------+
+ | | | ``ETHTOOL_A_STRINGSET_COUNT`` | u32 | number of strings |
+ +-+-+-----------------------------------+--------+-----------------------+
+ | | | ``ETHTOOL_A_STRINGSET_STRINGS`` | nested | array of strings |
+ +-+-+-+---------------------------------+--------+-----------------------+
+ | | | | ``ETHTOOL_A_STRINGS_STRING+`` | nested | one string |
+ +-+-+-+-+-------------------------------+--------+-----------------------+
+ | | | | | ``ETHTOOL_A_STRING_INDEX`` | u32 | string index |
+ +-+-+-+-+-------------------------------+--------+-----------------------+
+ | | | | | ``ETHTOOL_A_STRING_VALUE`` | string | string value |
+ +-+-+-+-+-------------------------------+--------+-----------------------+
+ | ``ETHTOOL_A_STRSET_COUNTS_ONLY`` | flag | return only counts |
+ +---------------------------------------+--------+-----------------------+
+
+Device identification in request header is optional. Depending on its presence
+and the ``NLM_F_DUMP`` flag, there are three types of ``STRSET_GET`` requests:
+
+ - no ``NLM_F_DUMP``, no device: get "global" string sets
+ - no ``NLM_F_DUMP``, with device: get string sets related to the device
+ - ``NLM_F_DUMP``, no device: get device related string sets for all devices
+
+If there is no ``ETHTOOL_A_STRSET_STRINGSETS`` array, all string sets of
+requested type are returned, otherwise only those specified in the request.
+Flag ``ETHTOOL_A_STRSET_COUNTS_ONLY`` tells kernel to only return string
+counts of the sets, not the actual strings.
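+
+The three request types can be summarized as a small decision function. This
+is only a sketch with a hypothetical helper name; the combination of
+``NLM_F_DUMP`` with a device is not described above, so it is reported as
+unspecified rather than guessed:
+

```python
def strset_request_kind(dump, has_device):
    """Classify a STRSET_GET request by NLM_F_DUMP and device presence."""
    if not dump and not has_device:
        return "global string sets"
    if not dump and has_device:
        return "string sets of one device"
    if dump and not has_device:
        return "device string sets of all devices"
    return None  # NLM_F_DUMP with a device: not covered by the text above
```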
+
+
+LINKINFO_GET
+============
+
+Requests link settings as provided by ``ETHTOOL_GLINKSETTINGS`` except for
+link modes and autonegotiation related information. The request does not use
+any message specific attributes.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_LINKINFO_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_LINKINFO_HEADER`` nested reply header
+ ``ETHTOOL_A_LINKINFO_PORT`` u8 physical port
+ ``ETHTOOL_A_LINKINFO_PHYADDR`` u8 phy MDIO address
+ ``ETHTOOL_A_LINKINFO_TP_MDIX`` u8 MDI(-X) status
+ ``ETHTOOL_A_LINKINFO_TP_MDIX_CTRL`` u8 MDI(-X) control
+ ``ETHTOOL_A_LINKINFO_TRANSCEIVER`` u8 transceiver
+ ==================================== ====== ==========================
+
+Attributes and their values have the same meaning as matching members of the
+corresponding ioctl structures.
+
+``LINKINFO_GET`` allows dump requests (kernel returns reply messages for all
+devices supporting the request).
+
+
+LINKINFO_SET
+============
+
+``LINKINFO_SET`` request allows setting some of the attributes reported by
+``LINKINFO_GET``.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_LINKINFO_HEADER`` nested request header
+ ``ETHTOOL_A_LINKINFO_PORT`` u8 physical port
+ ``ETHTOOL_A_LINKINFO_PHYADDR`` u8 phy MDIO address
+ ``ETHTOOL_A_LINKINFO_TP_MDIX_CTRL`` u8 MDI(-X) control
+ ==================================== ====== ==========================
+
+MDI(-X) status and transceiver cannot be set; a request containing the
+corresponding attributes is rejected.
+
+
+LINKMODES_GET
+=============
+
+Requests link modes (supported, advertised and peer advertised) and related
+information (autonegotiation status, link speed and duplex) as provided by
+``ETHTOOL_GLINKSETTINGS``. The request does not use any message specific
+attributes.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_LINKMODES_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_LINKMODES_HEADER`` nested reply header
+ ``ETHTOOL_A_LINKMODES_AUTONEG`` u8 autonegotiation status
+ ``ETHTOOL_A_LINKMODES_OURS`` bitset advertised link modes
+ ``ETHTOOL_A_LINKMODES_PEER`` bitset partner link modes
+ ``ETHTOOL_A_LINKMODES_SPEED`` u32 link speed (Mb/s)
+ ``ETHTOOL_A_LINKMODES_DUPLEX`` u8 duplex mode
+ ==================================== ====== ==========================
+
+For ``ETHTOOL_A_LINKMODES_OURS``, value represents advertised modes and mask
+represents supported modes. ``ETHTOOL_A_LINKMODES_PEER`` in the reply is a bit
+list.
+
+``LINKMODES_GET`` allows dump requests (kernel returns reply messages for all
+devices supporting the request).
+
+
+LINKMODES_SET
+=============
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_LINKMODES_HEADER`` nested request header
+ ``ETHTOOL_A_LINKMODES_AUTONEG`` u8 autonegotiation status
+ ``ETHTOOL_A_LINKMODES_OURS`` bitset advertised link modes
+ ``ETHTOOL_A_LINKMODES_PEER`` bitset partner link modes
+ ``ETHTOOL_A_LINKMODES_SPEED`` u32 link speed (Mb/s)
+ ``ETHTOOL_A_LINKMODES_DUPLEX`` u8 duplex mode
+ ==================================== ====== ==========================
+
+``ETHTOOL_A_LINKMODES_OURS`` bit set allows setting advertised link modes. If
+autonegotiation is on (either set now or kept from before), advertised modes
+are not changed (no ``ETHTOOL_A_LINKMODES_OURS`` attribute) and at least one
+of speed and duplex is specified, kernel adjusts advertised modes to all
+supported modes matching speed, duplex or both (whatever is specified). This
+autoselection is done on ethtool side with ioctl interface, netlink interface
+is supposed to allow requesting changes without knowing what exactly kernel
+supports.
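+
+The autoselection rule can be sketched as follows. The ``LinkMode`` tuple,
+the mode names and the ``adjusted_advertised()`` helper are hypothetical and
+only illustrate the condition and the matching described above, not the
+kernel implementation:
+

```python
from typing import NamedTuple


class LinkMode(NamedTuple):
    name: str
    speed: int      # Mb/s
    duplex: str     # "half" or "full"


def adjusted_advertised(supported, autoneg, ours_present,
                        speed=None, duplex=None):
    """Return the new advertised modes, or None if nothing is adjusted."""
    if not autoneg or ours_present or (speed is None and duplex is None):
        return None
    return [mode for mode in supported
            if (speed is None or mode.speed == speed)
            and (duplex is None or mode.duplex == duplex)]
```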
+
+
+LINKSTATE_GET
+=============
+
+Requests link state information. At the moment, only link up/down flag (as
+provided by ``ETHTOOL_GLINK`` ioctl command) is provided but some future
+extensions are planned (e.g. link down reason). This request does not have any
+message specific attributes.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_LINKSTATE_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_LINKSTATE_HEADER`` nested reply header
+ ``ETHTOOL_A_LINKSTATE_LINK`` bool link state (up/down)
+ ==================================== ====== ==========================
+
+For most NIC drivers, the value of ``ETHTOOL_A_LINKSTATE_LINK`` returns
+carrier flag provided by ``netif_carrier_ok()`` but there are drivers which
+define their own handler.
+
+``LINKSTATE_GET`` allows dump requests (kernel returns reply messages for all
+devices supporting the request).
+
+
+DEBUG_GET
+=========
+
+Requests debugging settings of a device. At the moment, only message mask is
+provided.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_DEBUG_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_DEBUG_HEADER`` nested reply header
+ ``ETHTOOL_A_DEBUG_MSGMASK`` bitset message mask
+ ==================================== ====== ==========================
+
+The message mask (``ETHTOOL_A_DEBUG_MSGMASK``) is equal to message level as
+provided by ``ETHTOOL_GMSGLVL`` and set by ``ETHTOOL_SMSGLVL`` in ioctl
+interface. While it is called message level there for historical reasons, most
+drivers and almost all newer drivers use it as a mask of enabled message
+classes (represented by ``NETIF_MSG_*`` constants); therefore netlink
+interface follows its actual use in practice.
+
+``DEBUG_GET`` allows dump requests (kernel returns reply messages for all
+devices supporting the request).
+
+
+DEBUG_SET
+=========
+
+Set or update debugging settings of a device. At the moment, only message mask
+is supported.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_DEBUG_HEADER`` nested request header
+ ``ETHTOOL_A_DEBUG_MSGMASK`` bitset message mask
+ ==================================== ====== ==========================
+
+``ETHTOOL_A_DEBUG_MSGMASK`` bit set allows setting or modifying mask of
+enabled debugging message types for the device.
+
+
+WOL_GET
+=======
+
+Query device wake-on-lan settings. Unlike most "GET" type requests,
+``ETHTOOL_MSG_WOL_GET`` requires (netns) ``CAP_NET_ADMIN`` privileges as it
+(potentially) provides SecureOn(tm) password which is confidential.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_WOL_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_WOL_HEADER`` nested reply header
+ ``ETHTOOL_A_WOL_MODES`` bitset mask of enabled WoL modes
+ ``ETHTOOL_A_WOL_SOPASS`` binary SecureOn(tm) password
+ ==================================== ====== ==========================
+
+In reply, ``ETHTOOL_A_WOL_MODES`` mask consists of modes supported by the
+device, value of modes which are enabled. ``ETHTOOL_A_WOL_SOPASS`` is only
+included in reply if ``WAKE_MAGICSECURE`` mode is supported.
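+
+A sketch of how a client might interpret the value and mask of
+``ETHTOOL_A_WOL_MODES`` once they are available as integers. The bit
+positions follow the ``WAKE_*`` constants of ``<linux/ethtool.h>``; this
+ordering, and the assumption that bitset indices match it, should be verified
+against the headers (or the corresponding string set) and is not stated in
+this document:
+

```python
# Assumed order of WAKE_* bits, least significant bit first.
WAKE_NAMES = [
    "WAKE_PHY", "WAKE_UCAST", "WAKE_MCAST", "WAKE_BCAST",
    "WAKE_ARP", "WAKE_MAGIC", "WAKE_MAGICSECURE", "WAKE_FILTER",
]
WAKE_MAGICSECURE = 1 << WAKE_NAMES.index("WAKE_MAGICSECURE")


def decode_wol_modes(value, mask):
    """Split a WoL bitset into (supported, enabled) lists of mode names."""
    supported = [n for bit, n in enumerate(WAKE_NAMES) if (mask >> bit) & 1]
    enabled = [n for bit, n in enumerate(WAKE_NAMES)
               if ((value & mask) >> bit) & 1]
    return supported, enabled


def sopass_expected(mask):
    """SOPASS is only included if WAKE_MAGICSECURE is supported."""
    return bool(mask & WAKE_MAGICSECURE)
```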
+
+
+WOL_SET
+=======
+
+Set or update wake-on-lan settings.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_WOL_HEADER`` nested request header
+ ``ETHTOOL_A_WOL_MODES`` bitset enabled WoL modes
+ ``ETHTOOL_A_WOL_SOPASS`` binary SecureOn(tm) password
+ ==================================== ====== ==========================
+
+``ETHTOOL_A_WOL_SOPASS`` is only allowed for devices supporting
+``WAKE_MAGICSECURE`` mode.
+
+
+Request translation
+===================
+
+The following table maps ioctl commands to netlink commands providing their
+functionality. Entries with "n/a" in right column are commands which do not
+have their netlink replacement yet.
+
+ =================================== =====================================
+ ioctl command netlink command
+ =================================== =====================================
+ ``ETHTOOL_GSET`` ``ETHTOOL_MSG_LINKINFO_GET``
+ ``ETHTOOL_MSG_LINKMODES_GET``
+ ``ETHTOOL_SSET`` ``ETHTOOL_MSG_LINKINFO_SET``
+ ``ETHTOOL_MSG_LINKMODES_SET``
+ ``ETHTOOL_GDRVINFO`` n/a
+ ``ETHTOOL_GREGS`` n/a
+ ``ETHTOOL_GWOL`` ``ETHTOOL_MSG_WOL_GET``
+ ``ETHTOOL_SWOL`` ``ETHTOOL_MSG_WOL_SET``
+ ``ETHTOOL_GMSGLVL`` ``ETHTOOL_MSG_DEBUG_GET``
+ ``ETHTOOL_SMSGLVL`` ``ETHTOOL_MSG_DEBUG_SET``
+ ``ETHTOOL_NWAY_RST`` n/a
+ ``ETHTOOL_GLINK`` ``ETHTOOL_MSG_LINKSTATE_GET``
+ ``ETHTOOL_GEEPROM`` n/a
+ ``ETHTOOL_SEEPROM`` n/a
+ ``ETHTOOL_GCOALESCE`` n/a
+ ``ETHTOOL_SCOALESCE`` n/a
+ ``ETHTOOL_GRINGPARAM`` n/a
+ ``ETHTOOL_SRINGPARAM`` n/a
+ ``ETHTOOL_GPAUSEPARAM`` n/a
+ ``ETHTOOL_SPAUSEPARAM`` n/a
+ ``ETHTOOL_GRXCSUM`` n/a
+ ``ETHTOOL_SRXCSUM`` n/a
+ ``ETHTOOL_GTXCSUM`` n/a
+ ``ETHTOOL_STXCSUM`` n/a
+ ``ETHTOOL_GSG`` n/a
+ ``ETHTOOL_SSG`` n/a
+ ``ETHTOOL_TEST`` n/a
+ ``ETHTOOL_GSTRINGS`` ``ETHTOOL_MSG_STRSET_GET``
+ ``ETHTOOL_PHYS_ID`` n/a
+ ``ETHTOOL_GSTATS`` n/a
+ ``ETHTOOL_GTSO`` n/a
+ ``ETHTOOL_STSO`` n/a
+ ``ETHTOOL_GPERMADDR`` rtnetlink ``RTM_GETLINK``
+ ``ETHTOOL_GUFO`` n/a
+ ``ETHTOOL_SUFO`` n/a
+ ``ETHTOOL_GGSO`` n/a
+ ``ETHTOOL_SGSO`` n/a
+ ``ETHTOOL_GFLAGS`` n/a
+ ``ETHTOOL_SFLAGS`` n/a
+ ``ETHTOOL_GPFLAGS`` n/a
+ ``ETHTOOL_SPFLAGS`` n/a
+ ``ETHTOOL_GRXFH`` n/a
+ ``ETHTOOL_SRXFH`` n/a
+ ``ETHTOOL_GGRO`` n/a
+ ``ETHTOOL_SGRO`` n/a
+ ``ETHTOOL_GRXRINGS`` n/a
+ ``ETHTOOL_GRXCLSRLCNT`` n/a
+ ``ETHTOOL_GRXCLSRULE`` n/a
+ ``ETHTOOL_GRXCLSRLALL`` n/a
+ ``ETHTOOL_SRXCLSRLDEL`` n/a
+ ``ETHTOOL_SRXCLSRLINS`` n/a
+ ``ETHTOOL_FLASHDEV`` n/a
+ ``ETHTOOL_RESET`` n/a
+ ``ETHTOOL_SRXNTUPLE`` n/a
+ ``ETHTOOL_GRXNTUPLE`` n/a
+ ``ETHTOOL_GSSET_INFO`` ``ETHTOOL_MSG_STRSET_GET``
+ ``ETHTOOL_GRXFHINDIR`` n/a
+ ``ETHTOOL_SRXFHINDIR`` n/a
+ ``ETHTOOL_GFEATURES`` n/a
+ ``ETHTOOL_SFEATURES`` n/a
+ ``ETHTOOL_GCHANNELS`` n/a
+ ``ETHTOOL_SCHANNELS`` n/a
+ ``ETHTOOL_SET_DUMP`` n/a
+ ``ETHTOOL_GET_DUMP_FLAG`` n/a
+ ``ETHTOOL_GET_DUMP_DATA`` n/a
+ ``ETHTOOL_GET_TS_INFO`` n/a
+ ``ETHTOOL_GMODULEINFO`` n/a
+ ``ETHTOOL_GMODULEEEPROM`` n/a
+ ``ETHTOOL_GEEE`` n/a
+ ``ETHTOOL_SEEE`` n/a
+ ``ETHTOOL_GRSSH`` n/a
+ ``ETHTOOL_SRSSH`` n/a
+ ``ETHTOOL_GTUNABLE`` n/a
+ ``ETHTOOL_STUNABLE`` n/a
+ ``ETHTOOL_GPHYSTATS`` n/a
+ ``ETHTOOL_PERQUEUE`` n/a
+ ``ETHTOOL_GLINKSETTINGS`` ``ETHTOOL_MSG_LINKINFO_GET``
+ ``ETHTOOL_MSG_LINKMODES_GET``
+ ``ETHTOOL_SLINKSETTINGS`` ``ETHTOOL_MSG_LINKINFO_SET``
+ ``ETHTOOL_MSG_LINKMODES_SET``
+ ``ETHTOOL_PHY_GTUNABLE`` n/a
+ ``ETHTOOL_PHY_STUNABLE`` n/a
+ ``ETHTOOL_GFECPARAM`` n/a
+ ``ETHTOOL_SFECPARAM`` n/a
+ =================================== =====================================
diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index 319e5e041f38..c4a328f2d57a 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -770,10 +770,10 @@ Some core changes of the new internal format:
callq foo
mov %rax,%r13
mov %rbx,%rdi
- mov $0x2,%esi
- mov $0x3,%edx
- mov $0x4,%ecx
- mov $0x5,%r8d
+ mov $0x6,%esi
+ mov $0x7,%edx
+ mov $0x8,%ecx
+ mov $0x9,%r8d
callq bar
add %r13,%rax
mov -0x228(%rbp),%rbx
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index a46fca264bee..d07d9855dcd3 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -13,8 +13,10 @@ Contents:
can_ucan_protocol
device_drivers/index
dsa/index
- devlink-info-versions
+ devlink/index
+ ethtool-netlink
ieee802154
+ j1939
kapi
z8530book
msg_zerocopy
@@ -30,8 +32,9 @@ Contents:
scaling
tls
tls-offload
+ nfc
-.. only:: subproject
+.. only:: subproject and html
Indices
=======
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index df33674799b5..5f53faff4e25 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -207,8 +207,8 @@ TCP variables:
somaxconn - INTEGER
Limit of socket listen() backlog, known in userspace as SOMAXCONN.
- Defaults to 128. See also tcp_max_syn_backlog for additional tuning
- for TCP sockets.
+ Defaults to 4096. (Was 128 before linux-5.4)
+ See also tcp_max_syn_backlog for additional tuning for TCP sockets.
tcp_abort_on_overflow - BOOLEAN
If listening service is too slow to accept new connections,
@@ -256,6 +256,12 @@ tcp_base_mss - INTEGER
Path MTU discovery (MTU probing). If MTU probing is enabled,
this is the initial MSS used by the connection.
+tcp_mtu_probe_floor - INTEGER
+ If MTU probing is enabled this caps the minimum MSS used for search_low
+ for the connection.
+
+ Default : 48
+
tcp_min_snd_mss - INTEGER
TCP SYN and SYNACK messages usually advertise an ADVMSS option,
as described in RFC 1122 and RFC 6691.
@@ -402,11 +408,14 @@ tcp_max_orphans - INTEGER
up to ~64K of unswappable memory.
tcp_max_syn_backlog - INTEGER
- Maximal number of remembered connection requests, which have not
- received an acknowledgment from connecting client.
+ Maximal number of remembered connection requests (SYN_RECV),
+ which have not received an acknowledgment from connecting client.
+ This is a per-listener limit.
The minimal value is 128 for low memory machines, and it will
increase in proportion to the memory of machine.
If server suffers from overload, try increasing this number.
+ Remember to also check /proc/sys/net/core/somaxconn.
+ A SYN_RECV request socket consumes about 304 bytes of memory.
tcp_max_tw_buckets - INTEGER
Maximal number of timewait sockets held by system simultaneously.
@@ -470,6 +479,10 @@ tcp_no_metrics_save - BOOLEAN
degradation. If set, TCP will not cache metrics on closing
connections.
+tcp_no_ssthresh_metrics_save - BOOLEAN
+ Controls whether TCP saves ssthresh metrics in the route cache.
+ Default is 1, which disables ssthresh metrics.
+
tcp_orphan_retries - INTEGER
This value influences the timeout of a locally closed TCP connection,
when RTO retransmissions remain unacknowledged.
@@ -594,7 +607,7 @@ tcp_synack_retries - INTEGER
with the current initial RTO of 1second. With this the final timeout
for a passive TCP connection will happen after 63seconds.
-tcp_syncookies - BOOLEAN
+tcp_syncookies - INTEGER
Only valid when the kernel was compiled with CONFIG_SYN_COOKIES
Send out syncookies when the syn backlog queue of a socket
overflows. This is to prevent against the common 'SYN flood attack'
@@ -895,8 +908,9 @@ ip_local_port_range - 2 INTEGERS
Defines the local port range that is used by TCP and UDP to
choose the local port. The first number is the first, the
second the last local port number.
- If possible, it is better these numbers have different parity.
- (one even and one odd values)
+ If possible, it is better these numbers have different parity
+ (one even and one odd value).
+ Must be greater than or equal to ip_unprivileged_port_start.
The default values are 32768 and 60999 respectively.
ip_local_reserved_ports - list of comma separated ranges
@@ -934,8 +948,8 @@ ip_unprivileged_port_start - INTEGER
This is a per-namespace sysctl. It defines the first
unprivileged port in the network namespace. Privileged ports
require root or CAP_NET_BIND_SERVICE in order to bind to them.
- To disable all privileged ports, set this to 0. It may not
- overlap with the ip_local_reserved_ports range.
+ To disable all privileged ports, set this to 0. The privileged ports must
+ not overlap with the ip_local_port_range.
Default: 1024
@@ -2082,6 +2096,28 @@ pf_enable - INTEGER
Default: 1
+pf_expose - INTEGER
+ Unset or enable/disable pf (pf is short for potentially failed) state
+ exposure. Applications can control the exposure of the PF path state
+ in the SCTP_PEER_ADDR_CHANGE event and the SCTP_GET_PEER_ADDR_INFO
+ sockopt. When it's unset, no SCTP_PEER_ADDR_CHANGE event with
+ SCTP_ADDR_PF state will be sent and a SCTP_PF-state transport info
+ can be got via SCTP_GET_PEER_ADDR_INFO sockopt; When it's enabled,
+ a SCTP_PEER_ADDR_CHANGE event will be sent for a transport becoming
+ SCTP_PF state and a SCTP_PF-state transport info can be got via
+ SCTP_GET_PEER_ADDR_INFO sockopt; When it's disabled, no
+ SCTP_PEER_ADDR_CHANGE event will be sent and it returns -EACCES when
+ trying to get a SCTP_PF-state transport info via SCTP_GET_PEER_ADDR_INFO
+ sockopt.
+
+ 0: Unset pf state exposure, Compatible with old applications.
+
+ 1: Disable pf state exposure.
+
+ 2: Enable pf state exposure.
+
+ Default: 0
+
addip_noauth_enable - BOOLEAN
Dynamic Address Reconfiguration (ADD-IP) requires the use of
authentication to protect the operations of adding or removing new
@@ -2164,6 +2200,18 @@ pf_retrans - INTEGER
Default: 0
+ps_retrans - INTEGER
+ Primary.Switchover.Max.Retrans (PSMR) is a tunable parameter taken
+ from section 5, "Primary Path Switchover", of RFC 7829. The primary path
+ will be changed to another active path when the path error counter on
+ the old primary path exceeds PSMR, so that "the SCTP sender is allowed
+ to continue data transmission on a new working path even when the old
+ primary destination address becomes active again". Note this feature
+ is disabled by initializing 'ps_retrans' per netns as 0xffff by default,
+ and its value can't be less than 'pf_retrans' when changing by sysctl.
+
+ Default: 0xffff
+
rto_initial - INTEGER
The initial round trip timeout value in milliseconds that will be used
in calculating round trip times. This is the initial time interval
diff --git a/Documentation/networking/j1939.rst b/Documentation/networking/j1939.rst
new file mode 100644
index 000000000000..f5be243d250a
--- /dev/null
+++ b/Documentation/networking/j1939.rst
@@ -0,0 +1,422 @@
+.. SPDX-License-Identifier: (GPL-2.0 OR MIT)
+
+===================
+J1939 Documentation
+===================
+
+Overview / What Is J1939
+========================
+
+SAE J1939 defines a higher layer protocol on CAN. It implements a more
+sophisticated addressing scheme and extends the maximum packet size above 8
+bytes. Several derived specifications exist, which differ from the original
+J1939 on the application level, like MilCAN A, NMEA2000 and especially
+ISO-11783 (ISOBUS). This last one specifies the so-called ETP (Extended
+Transport Protocol) which has been included in this implementation. This
+results in a maximum packet size of ((2 ^ 24) - 1) * 7 bytes, just under 112 MiB.
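+
+The arithmetic behind this limit can be checked with a few lines of C. This is
+only an illustrative sketch: it assumes that an ETP session carries at most
+(2 ^ 24) - 1 packets of 7 data bytes each, and etp_max_payload_bytes() is a
+hypothetical name, not part of the kernel API.
+
+.. code-block:: C
+
+    #include <stdint.h>
+
+    /* Hypothetical helper: largest ETP payload in bytes. */
+    static uint64_t etp_max_payload_bytes(void)
+    {
+        const uint64_t max_packets = (UINT64_C(1) << 24) - 1;
+        const uint64_t bytes_per_packet = 7;
+
+        return max_packets * bytes_per_packet;
+    }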
+
+Specifications used
+-------------------
+
+* SAE J1939-21 : data link layer
+* SAE J1939-81 : network management
+* ISO 11783-6 : Virtual Terminal (Extended Transport Protocol)
+
+.. _j1939-motivation:
+
+Motivation
+==========
+
+Given the fact there's something like SocketCAN with an API similar to BSD
+sockets, we found some reasons to justify a kernel implementation for the
+addressing and transport methods used by J1939.
+
+* **Addressing:** when a process on an ECU communicates via J1939, it should
+ not necessarily know its source address, although at least one process per
+ ECU should know the source address. Other processes should be able to reuse
+ that address. This way, address parameters for different processes
+ cooperating for the same ECU are not duplicated. This way of working is
+ closely related to the UNIX concept where programs do just one thing, and do
+ it well.
+
+* **Dynamic addressing:** Address Claiming in J1939 is time critical.
+ Furthermore data transport should be handled properly during the address
+ negotiation. Putting this functionality in the kernel eliminates it as a
+ requirement for _every_ user space process that communicates via J1939. This
+ results in a consistent J1939 bus with proper addressing.
+
+* **Transport:** both TP & ETP reuse some PGNs to relay big packets over them.
+ Different processes may thus use the same TP & ETP PGNs without actually
+ knowing it. The individual TP & ETP sessions _must_ be serialized
+ (synchronized) between different processes. The kernel solves this problem
+ properly and eliminates the serialization (synchronization) as a requirement
+ for _every_ user space process that communicates via J1939.
+
+J1939 defines some other features (relaying, gateway, fast packet transport,
+...). In-kernel code for these would not contribute to protocol stability.
+Therefore, these parts are left to user space.
+
+The J1939 sockets operate on CAN network devices (see SocketCAN). Any J1939
+user space library operating on CAN raw sockets will still operate properly.
+Since such library does not communicate with the in-kernel implementation, care
+must be taken that these two do not interfere. In practice, this means they
+cannot share ECU addresses. A single ECU (or virtual ECU) address is used by
+the library exclusively, or by the in-kernel system exclusively.
+
+J1939 concepts
+==============
+
+PGN
+---
+
+The PGN (Parameter Group Number) is a number to identify a packet. The PGN
+is composed as follows:
+
+* 1 bit : Reserved Bit
+* 1 bit : Data Page
+* 8 bits : PF (PDU Format)
+* 8 bits : PS (PDU Specific)
+
+In J1939-21 distinction is made between PDU1 format (where PF < 240) and PDU2
+format (where PF >= 240). Furthermore, when using PDU2 format, the PS-field
+contains a so-called Group Extension, which is part of the PGN. When using PDU2
+format, the Group Extension is set in the PS-field.
+
+On the other hand, when using PDU1 format, the PS-field contains a so-called
+Destination Address, which is _not_ part of the PGN. When communicating a PGN
+from user space to kernel (or vice versa) and PDU1 format is used, the PS-field
+of the PGN shall be set to zero. The Destination Address shall be set
+elsewhere.
+
+Regarding the mapping of a PGN to the 29-bit CAN identifier, the kernel reads
+the Destination Address from, or writes it to, the appropriate bits of the
+identifier.
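+
+As a rough illustration of the PDU1/PDU2 rules above, the following sketch
+splits a 29-bit extended CAN identifier into priority, PGN, Destination Address
+and Source Address. The bit positions (3 priority bits, then the 18 PGN bits,
+then 8 source address bits) follow the usual J1939-21 identifier layout, which
+is an assumption not spelled out in this document. struct j1939_id_fields and
+j1939_split_id() are hypothetical user space names, not kernel API.
+
+.. code-block:: C
+
+    #include <stdint.h>
+
+    /* Hypothetical helper type, for illustration only. */
+    struct j1939_id_fields {
+        uint8_t prio;   /* 3-bit priority */
+        uint32_t pgn;   /* PGN, with PS cleared in the PDU1 case */
+        uint8_t da;     /* Destination Address; 0xff for PDU2 */
+        uint8_t sa;     /* Source Address */
+    };
+
+    static struct j1939_id_fields j1939_split_id(uint32_t can_id)
+    {
+        struct j1939_id_fields f;
+        uint8_t pf = (can_id >> 16) & 0xff;
+        uint8_t ps = (can_id >> 8) & 0xff;
+
+        f.prio = (can_id >> 26) & 0x7;
+        f.sa = can_id & 0xff;
+        f.pgn = (can_id >> 8) & 0x3ffff;
+
+        if (pf < 240) {
+            /* PDU1: PS is the Destination Address, not part of the PGN */
+            f.da = ps;
+            f.pgn &= 0x3ff00;
+        } else {
+            /* PDU2: PS is the Group Extension and stays in the PGN */
+            f.da = 0xff;
+        }
+        return f;
+    }
+
+For instance, identifier 0x18EA3020 uses PDU1 format (PF is 0xEA), so the sketch
+reports PGN 0x0EA00 with Destination Address 0x30 and Source Address 0x20.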
+
+
+Addressing
+----------
+
+Both static and dynamic addressing methods can be used.
+
+For static addresses, no extra checks are made by the kernel, and provided
+addresses are considered right. This responsibility is for the OEM or system
+integrator.
+
+For dynamic addressing, so-called Address Claiming, extra support is foreseen
+in the kernel. In J1939 any ECU is known by its 64-bit NAME. At the moment of
+a successful address claim, the kernel keeps track of both NAME and source
+address being claimed. This serves as a base for filter schemes. By default,
+packets with a destination that is not local will be rejected.
+
+Mixed mode packets (from a static to a dynamic address or vice versa) are
+allowed. The BSD sockets define separate API calls for getting/setting the
+local & remote address and are applicable for J1939 sockets.
+
+Filtering
+---------
+
+J1939 defines white list filters per socket that a user can set in order to
+receive a subset of the J1939 traffic. Filtering can be based on:
+
+* SA
+* SOURCE_NAME
+* PGN
+
+When multiple filters are in place for a single socket, and a packet comes in
+that matches several of those filters, the packet is only received once for
+that socket.
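+
+The matching behaviour described above can be modelled in user space roughly as
+follows. This is a simplified sketch, not the kernel code: it assumes that each
+filter is a value/mask pair per field and that a packet is accepted as soon as
+one filter matches, which is why a packet matching several filters is still
+delivered only once. struct filter_model, struct pkt_model and filters_accept()
+are hypothetical names.
+
+.. code-block:: C
+
+    #include <stdbool.h>
+    #include <stddef.h>
+    #include <stdint.h>
+
+    /* Hypothetical model of one white list filter entry. */
+    struct filter_model {
+        uint64_t name, name_mask;
+        uint32_t pgn, pgn_mask;
+        uint8_t addr, addr_mask;
+    };
+
+    /* Hypothetical model of the fields of an incoming packet. */
+    struct pkt_model {
+        uint64_t src_name;
+        uint32_t pgn;
+        uint8_t sa;
+    };
+
+    static bool filters_accept(const struct filter_model *f, size_t n,
+                               const struct pkt_model *p)
+    {
+        size_t i;
+
+        if (n == 0)
+            return true;    /* assumed: no filters means receive everything */
+
+        for (i = 0; i < n; i++) {
+            if ((p->pgn & f[i].pgn_mask) != f[i].pgn)
+                continue;
+            if ((p->sa & f[i].addr_mask) != f[i].addr)
+                continue;
+            if ((p->src_name & f[i].name_mask) != f[i].name)
+                continue;
+            return true;    /* first match wins, packet is delivered once */
+        }
+        return false;
+    }
+
+A mask of zero makes the corresponding field a wildcard, so a filter that only
+sets pgn and pgn_mask selects on the PGN alone.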
+
+How to Use J1939
+================
+
+API Calls
+---------
+
+On CAN, you first need to open a socket for communicating over a CAN network.
+To use J1939, #include <linux/can/j1939.h>. From there, <linux/can.h> will be
+included too. To open a socket, use:
+
+.. code-block:: C
+
+ s = socket(PF_CAN, SOCK_DGRAM, CAN_J1939);
+
+J1939 does use SOCK_DGRAM sockets. In the J1939 specification, connections are
+mentioned in the context of transport protocol sessions. These still deliver
+packets to the other end (using several CAN packets). SOCK_STREAM is not
+supported.
+
+After the successful creation of the socket, you would normally use the bind(2)
+and/or connect(2) system call to bind the socket to a CAN interface. After
+binding and/or connecting the socket, you can read(2) and write(2) from/to the
+socket or use send(2), sendto(2), sendmsg(2) and the recv*() counterpart
+operations on the socket as usual. There are also J1939 specific socket options
+described below.
+
+In order to send data, a bind(2) must have been successful. bind(2) assigns a
+local address to a socket.
+
+Different from CAN is that the payload data is just the data that gets sent,
+without its header info. The header info is derived from the sockaddr supplied
+to bind(2), connect(2), sendto(2) and recvfrom(2). A write(2) with size 4 will
+result in a packet with 4 bytes.
+
+The sockaddr structure has extensions for use with J1939 as specified below:
+
+.. code-block:: C
+
+ struct sockaddr_can {
+ sa_family_t can_family;
+ int can_ifindex;
+ union {
+ struct {
+ __u64 name;
+ /* pgn:
+ * 8 bit: PS in PDU2 case, else 0
+ * 8 bit: PF
+ * 1 bit: DP
+ * 1 bit: reserved
+ */
+ __u32 pgn;
+ __u8 addr;
+ } j1939;
+ } can_addr;
+ };
+
+can_family & can_ifindex serve the same purpose as for other SocketCAN sockets.
+
+can_addr.j1939.pgn specifies the PGN (max 0x3ffff). Individual bits are
+specified above.
+
+can_addr.j1939.name contains the 64-bit J1939 NAME.
+
+can_addr.j1939.addr contains the address.
+
+The bind(2) system call assigns the local address, i.e. the source address when
+sending packets. If a PGN is set during bind(2), it is used as an RX filter,
+i.e. only packets with a matching PGN are received. If an ADDR or NAME is set
+it is used as a receive filter, too. It will match the destination NAME or ADDR
+of the incoming packet. The NAME filter will work only if appropriate Address
+Claiming for this name was done on the CAN bus and registered/cached by the
+kernel.
+
+On the other hand connect(2) assigns the remote address, i.e. the destination
+address. The PGN from connect(2) is used as the default PGN when sending
+packets. If ADDR or NAME is set it will be used as the default destination ADDR
+or NAME. Further a set ADDR or NAME during connect(2) is used as a receive
+filter. It will match the source NAME or ADDR of the incoming packet.
+
+Both write(2) and send(2) will send a packet with local address from bind(2) and
+the remote address from connect(2). Use sendto(2) to overwrite the destination
+address.
+
+If can_addr.j1939.name is set (!= 0) the NAME is looked up by the kernel and
+the corresponding ADDR is used. If can_addr.j1939.name is not set (== 0),
+can_addr.j1939.addr is used.
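+
+The connect(2) behaviour and the NAME versus ADDR rule above can be sketched as
+follows. This is a minimal illustration under stated assumptions: it needs
+kernel headers that provide <linux/can/j1939.h>, and j1939_make_addr() and
+j1939_connect_to() are hypothetical helper names, not part of any library.
+
+.. code-block:: C
+
+    #include <stdint.h>
+    #include <string.h>
+    #include <sys/socket.h>
+    #include <linux/can.h>
+    #include <linux/can/j1939.h>
+
+    /* Hypothetical helper: fill a J1939 sockaddr_can in one place. */
+    static struct sockaddr_can j1939_make_addr(int ifindex, uint64_t name,
+                                               uint32_t pgn, uint8_t addr)
+    {
+        struct sockaddr_can sa;
+
+        memset(&sa, 0, sizeof(sa));
+        sa.can_family = AF_CAN;
+        sa.can_ifindex = ifindex;
+        sa.can_addr.j1939.name = name;  /* != 0: kernel looks up the ADDR */
+        sa.can_addr.j1939.pgn = pgn;
+        sa.can_addr.j1939.addr = addr;  /* used when name == J1939_NO_NAME */
+        return sa;
+    }
+
+    /* Hypothetical helper: set the default destination and PGN of a bound
+     * socket, so that write(2) and send(2) can be used afterwards.
+     */
+    static int j1939_connect_to(int sock, int ifindex, uint64_t dst_name,
+                                uint8_t dst_addr, uint32_t pgn)
+    {
+        struct sockaddr_can peer = j1939_make_addr(ifindex, dst_name, pgn,
+                                                   dst_addr);
+
+        return connect(sock, (const struct sockaddr *)&peer, sizeof(peer));
+    }
+
+Passing J1939_NO_NAME as dst_name selects the destination by dst_addr alone.
+As with any connect(2) call, the return value should be checked before
+writing to the socket.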
+
+When creating a socket, reasonable defaults are set. Some options can be
+modified with setsockopt(2) & getsockopt(2).
+
+RX path related options:
+
+- SO_J1939_FILTER - configure array of filters
+- SO_J1939_PROMISC - disable filters set by bind(2) and connect(2)
+
+By default no broadcast packets can be sent or received. To enable sending or
+receiving broadcast packets use the socket option SO_BROADCAST:
+
+.. code-block:: C
+
+ int value = 1;
+ setsockopt(sock, SOL_SOCKET, SO_BROADCAST, &value, sizeof(value));
+
+The following diagram illustrates the RX path:
+
+.. code::
+
+                    +--------------------+
+                    |  incoming packet   |
+                    +--------------------+
+                              |
+                              V
+                    +--------------------+
+                    | SO_J1939_PROMISC?  |
+                    +--------------------+
+                             |  |
+                         no  |  | yes
+                             |  |
+                   .---------'  `---------.
+                   |                      |
+     +---------------------------+        |
+     | bind() + connect() +      |        |
+     | SO_BROADCAST filter       |        |
+     +---------------------------+        |
+                   |                      |
+                   |<---------------------'
+                   V
+     +---------------------------+
+     |      SO_J1939_FILTER      |
+     +---------------------------+
+                   |
+                   V
+     +---------------------------+
+     |        socket recv()      |
+     +---------------------------+
+
+TX path related options:
+
+- SO_J1939_SEND_PRIO - change default send priority for the socket
+
+Message Flags during send() and Related System Calls
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+send(2), sendto(2) and sendmsg(2) take a 'flags' argument. Currently
+supported flags are:
+
+* MSG_DONTWAIT, i.e. non-blocking operation.
+
+recvmsg(2)
+^^^^^^^^^^
+
+In most cases recvmsg(2) is needed if you want to extract more information than
+recvfrom(2) can provide, for example packet priority and timestamp. The
+Destination Address, name and packet priority (if applicable) are attached to
+the msghdr in the recvmsg(2) call. They can be extracted using cmsg(3) macros,
+with cmsg_level == SOL_CAN_J1939 && cmsg_type == SCM_J1939_DEST_ADDR,
+SCM_J1939_DEST_NAME or SCM_J1939_PRIO. The returned data is a uint8_t for
+priority and dst_addr, and uint64_t for dst_name.
+
+.. code-block:: C
+
+ uint8_t priority, dst_addr;
+ uint64_t dst_name;
+ struct cmsghdr *cmsg;
+
+ for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
+ switch (cmsg->cmsg_level) {
+ case SOL_CAN_J1939:
+ if (cmsg->cmsg_type == SCM_J1939_DEST_ADDR)
+ dst_addr = *CMSG_DATA(cmsg);
+ else if (cmsg->cmsg_type == SCM_J1939_DEST_NAME)
+ memcpy(&dst_name, CMSG_DATA(cmsg), cmsg->cmsg_len - CMSG_LEN(0));
+ else if (cmsg->cmsg_type == SCM_J1939_PRIO)
+ priority = *CMSG_DATA(cmsg);
+ break;
+ }
+ }
+
+Dynamic Addressing
+------------------
+
+Distinction has to be made between using the claimed address and doing an
+address claim. To use an already claimed address, one has to fill in the
+j1939.name member and provide it to bind(2). If the name had claimed an address
+earlier, all further messages being sent will use that address. And the
+j1939.addr member will be ignored.
+
+An exception on this is PGN 0x0ee00. This is the "Address Claim/Cannot Claim
+Address" message and the kernel will use the j1939.addr member for that PGN if
+necessary.
+
+To claim an address, the following code example can be used:
+
+.. code-block:: C
+
+ struct sockaddr_can baddr = {
+ .can_family = AF_CAN,
+ .can_addr.j1939 = {
+ .name = name,
+ .addr = J1939_IDLE_ADDR,
+ .pgn = J1939_NO_PGN, /* to disable bind() rx filter for PGN */
+ },
+ .can_ifindex = if_nametoindex("can0"),
+ };
+
+ bind(sock, (struct sockaddr *)&baddr, sizeof(baddr));
+
+ /* for Address Claiming broadcast must be allowed */
+ int value = 1;
+ setsockopt(sock, SOL_SOCKET, SO_BROADCAST, &value, sizeof(value));
+
+ /* configure advanced RX filter with PGN needed for Address Claiming */
+ const struct j1939_filter filt[] = {
+ {
+ .pgn = J1939_PGN_ADDRESS_CLAIMED,
+ .pgn_mask = J1939_PGN_PDU1_MAX,
+ }, {
+ .pgn = J1939_PGN_REQUEST,
+ .pgn_mask = J1939_PGN_PDU1_MAX,
+ }, {
+ .pgn = J1939_PGN_ADDRESS_COMMANDED,
+ .pgn_mask = J1939_PGN_MAX,
+ },
+ };
+
+ setsockopt(sock, SOL_CAN_J1939, SO_J1939_FILTER, &filt, sizeof(filt));
+
+ uint64_t dat = htole64(name);
+ const struct sockaddr_can saddr = {
+ .can_family = AF_CAN,
+ .can_addr.j1939 = {
+ .pgn = J1939_PGN_ADDRESS_CLAIMED,
+ .addr = J1939_NO_ADDR,
+ },
+ };
+
+ /* Afterwards do a sendto(2) with data set to the NAME (Little Endian). If the
+ * NAME provided does not match the j1939.name provided to bind(2), EPROTO
+ * will be returned.
+ */
+ sendto(sock, &dat, sizeof(dat), 0, (const struct sockaddr *)&saddr, sizeof(saddr));
+
+If no-one else contests the address claim within 250ms after transmission, the
+kernel marks the NAME-SA assignment as valid. The valid assignment will be kept
+among other valid NAME-SA assignments. From that point, any socket bound to the
+NAME can send packets.
+
+If another ECU claims the address, the kernel will mark the NAME-SA expired.
+No socket bound to the NAME can send packets (other than address claims). To
+claim another address, some socket bound to NAME must bind(2) again, but with
+only j1939.addr changed to the new SA, and must then send a valid address claim
+packet. This restarts the state machine in the kernel (and any other
+participant on the bus) for this NAME.
+
+can-utils also includes the jacd tool, so it can be used as a code example or as
+a default Address Claiming daemon.
+
+Send Examples
+-------------
+
+Static Addressing
+^^^^^^^^^^^^^^^^^
+
+This example will send a PGN (0x12300) from SA 0x20 to DA 0x30.
+
+Bind:
+
+.. code-block:: C
+
+ struct sockaddr_can baddr = {
+ .can_family = AF_CAN,
+ .can_addr.j1939 = {
+ .name = J1939_NO_NAME,
+ .addr = 0x20,
+ .pgn = J1939_NO_PGN,
+ },
+ .can_ifindex = if_nametoindex("can0"),
+ };
+
+ bind(sock, (struct sockaddr *)&baddr, sizeof(baddr));
+
+Now, the socket 'sock' is bound to the SA 0x20. Since no connect(2) was called,
+at this point we can use only sendto(2) or sendmsg(2).
+
+Send:
+
+.. code-block:: C
+
+ const struct sockaddr_can saddr = {
+ .can_family = AF_CAN,
+ .can_addr.j1939 = {
+ .name = J1939_NO_NAME,
+ .pgn = 0x12300,
+ .addr = 0x30,
+ },
+ };
+
+ sendto(sock, dat, sizeof(dat), 0, (const struct sockaddr *)&saddr, sizeof(saddr));
diff --git a/Documentation/networking/mac80211_hwsim/README b/Documentation/networking/mac80211_hwsim/mac80211_hwsim.rst
index 3566a725d19c..d2266ce5534e 100644
--- a/Documentation/networking/mac80211_hwsim/README
+++ b/Documentation/networking/mac80211_hwsim/mac80211_hwsim.rst
@@ -1,5 +1,13 @@
+:orphan:
+
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===================================================================
mac80211_hwsim - software simulator of 802.11 radio(s) for mac80211
-Copyright (c) 2008, Jouni Malinen <j@w1.fi>
+===================================================================
+
+:Copyright: |copy| 2008, Jouni Malinen <j@w1.fi>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License version 2 as
@@ -7,6 +15,7 @@ published by the Free Software Foundation.
Introduction
+============
mac80211_hwsim is a Linux kernel module that can be used to simulate
arbitrary number of IEEE 802.11 radios for mac80211. It can be used to
@@ -43,6 +52,7 @@ regardless of channel.
Simple example
+==============
This example shows how to use mac80211_hwsim to simulate two radios:
one to act as an access point and the other as a station that
@@ -50,17 +60,19 @@ associates with the AP. hostapd and wpa_supplicant are used to take
care of WPA2-PSK authentication. In addition, hostapd is also
processing access point side of association.
+::
+
-# Build mac80211_hwsim as part of kernel configuration
+ # Build mac80211_hwsim as part of kernel configuration
-# Load the module
-modprobe mac80211_hwsim
+ # Load the module
+ modprobe mac80211_hwsim
-# Run hostapd (AP) for wlan0
-hostapd hostapd.conf
+ # Run hostapd (AP) for wlan0
+ hostapd hostapd.conf
-# Run wpa_supplicant (station) for wlan1
-wpa_supplicant -Dnl80211 -iwlan1 -c wpa_supplicant.conf
+ # Run wpa_supplicant (station) for wlan1
+ wpa_supplicant -Dnl80211 -iwlan1 -c wpa_supplicant.conf
More test cases are available in hostap.git:
diff --git a/Documentation/networking/net_dim.txt b/Documentation/networking/net_dim.txt
index 9cb31c5e2dcd..9bdb7d5a3ba3 100644
--- a/Documentation/networking/net_dim.txt
+++ b/Documentation/networking/net_dim.txt
@@ -92,16 +92,16 @@ under some conditions.
Part III: Registering a Network Device to DIM
==============================================
-Net DIM API exposes the main function net_dim(struct net_dim *dim,
-struct net_dim_sample end_sample). This function is the entry point to the Net
+Net DIM API exposes the main function net_dim(struct dim *dim,
+struct dim_sample end_sample). This function is the entry point to the Net
DIM algorithm and has to be called every time the driver would like to check if
it should change interrupt moderation parameters. The driver should provide two
-data structures: struct net_dim and struct net_dim_sample. Struct net_dim
+data structures: struct dim and struct dim_sample. Struct dim
describes the state of DIM for a specific object (RX queue, TX queue,
other queues, etc.). This includes the current selected profile, previous data
samples, the callback function provided by the driver and more.
-Struct net_dim_sample describes a data sample, which will be compared to the
-data sample stored in struct net_dim in order to decide on the algorithm's next
+Struct dim_sample describes a data sample, which will be compared to the
+data sample stored in struct dim in order to decide on the algorithm's next
step. The sample should include bytes, packets and interrupts, measured by
the driver.
@@ -110,9 +110,9 @@ main net_dim() function. The recommended method is to call net_dim() on each
interrupt. Since Net DIM has a built-in moderation and it might decide to skip
iterations under certain conditions, there is no need to moderate the net_dim()
calls as well. As mentioned above, the driver needs to provide an object of type
-struct net_dim to the net_dim() function call. It is advised for each entity
-using Net DIM to hold a struct net_dim as part of its data structure and use it
-as the main Net DIM API object. The struct net_dim_sample should hold the latest
+struct dim to the net_dim() function call. It is advised for each entity
+using Net DIM to hold a struct dim as part of its data structure and use it
+as the main Net DIM API object. The struct dim_sample should hold the latest
bytes, packets and interrupts count. No need to perform any calculations, just
include the raw data.
@@ -132,19 +132,19 @@ usage is not complete but it should make the outline of the usage clear.
my_driver.c:
-#include <linux/net_dim.h>
+#include <linux/dim.h>
/* Callback for net DIM to schedule on a decision to change moderation */
void my_driver_do_dim_work(struct work_struct *work)
{
- /* Get struct net_dim from struct work_struct */
- struct net_dim *dim = container_of(work, struct net_dim,
- work);
+ /* Get struct dim from struct work_struct */
+ struct dim *dim = container_of(work, struct dim,
+ work);
/* Do interrupt moderation related stuff */
...
/* Signal net DIM work is done and it should move to next iteration */
- dim->state = NET_DIM_START_MEASURE;
+ dim->state = DIM_START_MEASURE;
}
/* My driver's interrupt handler */
@@ -152,13 +152,13 @@ int my_driver_handle_interrupt(struct my_driver_entity *my_entity, ...)
{
...
/* A struct to hold current measured data */
- struct net_dim_sample dim_sample;
+ struct dim_sample dim_sample;
...
/* Initiate data sample struct with current data */
- net_dim_sample(my_entity->events,
- my_entity->packets,
- my_entity->bytes,
- &dim_sample);
+ dim_update_sample(my_entity->events,
+ my_entity->packets,
+ my_entity->bytes,
+ &dim_sample);
/* Call net DIM */
net_dim(&my_entity->dim, dim_sample);
...
diff --git a/Documentation/networking/netdev-FAQ.rst b/Documentation/networking/netdev-FAQ.rst
index 642fa963be3c..d5c9320901c3 100644
--- a/Documentation/networking/netdev-FAQ.rst
+++ b/Documentation/networking/netdev-FAQ.rst
@@ -34,8 +34,8 @@ the names, the ``net`` tree is for fixes to existing code already in the
mainline tree from Linus, and ``net-next`` is where the new code goes
for the future release. You can find the trees here:
-- https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git
-- https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git
+- https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git
+- https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git
Q: How often do changes from these trees make it to the mainline Linus tree?
----------------------------------------------------------------------------
diff --git a/Documentation/networking/nf_flowtable.txt b/Documentation/networking/nf_flowtable.txt
index ca2136c76042..0bf32d1121be 100644
--- a/Documentation/networking/nf_flowtable.txt
+++ b/Documentation/networking/nf_flowtable.txt
@@ -76,7 +76,7 @@ flowtable and add one rule to your forward chain.
table inet x {
flowtable f {
- hook ingress priority 0 devices = { eth0, eth1 };
+ hook ingress priority 0; devices = { eth0, eth1 };
}
chain y {
type filter hook forward priority 0; policy accept;
diff --git a/Documentation/networking/nfc.txt b/Documentation/networking/nfc.rst
index b24c29bdae27..9aab3a88c9b2 100644
--- a/Documentation/networking/nfc.txt
+++ b/Documentation/networking/nfc.rst
@@ -1,3 +1,4 @@
+===================
Linux NFC subsystem
===================
@@ -8,7 +9,7 @@ This document covers the architecture overview, the device driver interface
description and the userspace interface description.
Architecture overview
----------------------
+=====================
The NFC subsystem is responsible for:
- NFC adapters management;
@@ -25,33 +26,34 @@ The control operations are available to userspace via generic netlink.
The low-level data exchange interface is provided by the new socket family
PF_NFC. The NFC_SOCKPROTO_RAW performs raw communication with NFC targets.
-
- +--------------------------------------+
- | USER SPACE |
- +--------------------------------------+
- ^ ^
- | low-level | control
- | data exchange | operations
- | |
- | v
- | +-----------+
- | AF_NFC | netlink |
- | socket +-----------+
- | raw ^
- | |
- v v
- +---------+ +-----------+
- | rawsock | <--------> | core |
- +---------+ +-----------+
- ^
- |
- v
- +-----------+
- | driver |
- +-----------+
+.. code-block:: none
+
+ +--------------------------------------+
+ | USER SPACE |
+ +--------------------------------------+
+ ^ ^
+ | low-level | control
+ | data exchange | operations
+ | |
+ | v
+ | +-----------+
+ | AF_NFC | netlink |
+ | socket +-----------+
+ | raw ^
+ | |
+ v v
+ +---------+ +-----------+
+ | rawsock | <--------> | core |
+ +---------+ +-----------+
+ ^
+ |
+ v
+ +-----------+
+ | driver |
+ +-----------+
Device Driver Interface
------------------------
+=======================
When registering on the NFC subsystem, the device driver must inform the core
of the set of supported NFC protocols and the set of ops callbacks. The ops
@@ -64,7 +66,7 @@ callbacks that must be implemented are the following:
* data_exchange - send data and receive the response (transceive operation)
Userspace interface
---------------------
+===================
The userspace interface is divided in control operations and low-level data
exchange operation.
@@ -82,7 +84,7 @@ The operations are composed by commands and events, all listed below:
* NFC_EVENT_DEVICE_ADDED - reports an NFC device addition
* NFC_EVENT_DEVICE_REMOVED - reports an NFC device removal
* NFC_EVENT_TARGETS_FOUND - reports START_POLL results when 1 or more targets
-are found
+ are found
The user must call START_POLL to poll for NFC targets, passing the desired NFC
protocols through NFC_ATTR_PROTOCOLS attribute. The device remains in polling
@@ -101,14 +103,14 @@ it's closed.
LOW-LEVEL DATA EXCHANGE:
The userspace must use PF_NFC sockets to perform any data communication with
-targets. All NFC sockets use AF_NFC:
-
-struct sockaddr_nfc {
- sa_family_t sa_family;
- __u32 dev_idx;
- __u32 target_idx;
- __u32 nfc_protocol;
-};
+targets. All NFC sockets use AF_NFC::
+
+ struct sockaddr_nfc {
+ sa_family_t sa_family;
+ __u32 dev_idx;
+ __u32 target_idx;
+ __u32 nfc_protocol;
+ };
To establish a connection with one target, the user must create an
NFC_SOCKPROTO_RAW socket and call the 'connect' syscall with the sockaddr_nfc
diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst
index a689966bc4be..1e4735cc0553 100644
--- a/Documentation/networking/phy.rst
+++ b/Documentation/networking/phy.rst
@@ -73,7 +73,7 @@ The Reduced Gigabit Medium Independent Interface (RGMII) is a 12-pin
electrical signal interface using a synchronous 125Mhz clock signal and several
data lines. Due to this design decision, a 1.5ns to 2ns delay must be added
between the clock line (RXC or TXC) and the data lines to let the PHY (clock
-sink) have enough setup and hold times to sample the data lines correctly. The
+sink) have a large enough setup and hold time to sample the data lines correctly. The
PHY library offers different types of PHY_INTERFACE_MODE_RGMII* values to let
the PHY driver and optionally the MAC driver, implement the required delay. The
values of phy_interface_t must be understood from the perspective of the PHY
@@ -267,6 +267,24 @@ Some of the interface modes are described below:
duplex, pause or other settings. This is dependent on the MAC and/or
PHY behaviour.
+``PHY_INTERFACE_MODE_10GBASER``
+ This is the IEEE 802.3 Clause 49 defined 10GBASE-R protocol used with
+ various different mediums. Please refer to the IEEE standard for a
+ definition of this.
+
+ Note: 10GBASE-R is just one protocol that can be used with XFI and SFI.
+ XFI and SFI permit multiple protocols over a single SERDES lane, and
+ also define the electrical characteristics of the signals with a host
+ compliance board plugged into the host XFP/SFP connector. Therefore,
+ XFI and SFI are not PHY interface types in their own right.
+
+``PHY_INTERFACE_MODE_10GKR``
+ This is the IEEE 802.3 Clause 49 defined 10GBASE-R with Clause 73
+ autonegotiation. Please refer to the IEEE standard for further
+ information.
+
+ Note: due to legacy usage, some 10GBASE-R usage incorrectly makes
+ use of this definition.
Pause frames / flow control
===========================
@@ -352,7 +370,8 @@ Fills the phydev structure with up-to-date information about the current
settings in the PHY.
::
- int phy_ethtool_sset(struct phy_device *phydev, struct ethtool_cmd *cmd);
+ int phy_ethtool_ksettings_set(struct phy_device *phydev,
+ const struct ethtool_link_ksettings *cmd);
Ethtool convenience functions.
::
diff --git a/Documentation/networking/ppp_generic.txt b/Documentation/networking/ppp_generic.txt
index 61daf4b39600..fd563aff5fc9 100644
--- a/Documentation/networking/ppp_generic.txt
+++ b/Documentation/networking/ppp_generic.txt
@@ -378,6 +378,8 @@ an interface unit are:
CONFIG_PPP_FILTER option is enabled, the set of packets which reset
the transmit and receive idle timers is restricted to those which
pass the `active' packet filter.
+ Two versions of this command exist, to deal with user space
+ expecting times as either 32-bit or 64-bit time_t seconds.
* PPPIOCSMAXCID sets the maximum connection-ID parameter (and thus the
number of connection slots) for the TCP header compressor and
diff --git a/Documentation/networking/sfp-phylink.rst b/Documentation/networking/sfp-phylink.rst
index 91446b431b70..d753a309f9d1 100644
--- a/Documentation/networking/sfp-phylink.rst
+++ b/Documentation/networking/sfp-phylink.rst
@@ -8,7 +8,8 @@ Overview
========
phylink is a mechanism to support hot-pluggable networking modules
-without needing to re-initialise the adapter on hot-plug events.
+directly connected to a MAC without needing to re-initialise the
+adapter on hot-plug events.
phylink supports conventional phylib-based setups, fixed link setups
and SFP (Small Formfactor Pluggable) modules at present.
@@ -250,7 +251,8 @@ this documentation.
phylink_mac_change(priv->phylink, link_is_up);
where ``link_is_up`` is true if the link is currently up or false
- otherwise.
+ otherwise. If a MAC is unable to provide these interrupts, then
+ it should set ``priv->phylink_config.pcs_poll = true;`` in step 9.
11. Verify that the driver does not call::
diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst
index 0dd3f748239f..f914e81fd3a6 100644
--- a/Documentation/networking/tls-offload.rst
+++ b/Documentation/networking/tls-offload.rst
@@ -436,6 +436,10 @@ by the driver:
encryption.
* ``tx_tls_ooo`` - number of TX packets which were part of a TLS stream
but did not arrive in the expected order.
+ * ``tx_tls_skip_no_sync_data`` - number of TX packets which were part of
+ a TLS stream and arrived out-of-order, but skipped the HW offload routine
+ and went to the regular transmit flow as they were retransmissions of the
+ connection handshake.
* ``tx_tls_drop_no_sync_data`` - number of TX packets which were part of
a TLS stream dropped, because they arrived out of order and associated
record could not be found.
diff --git a/Documentation/networking/tls.rst b/Documentation/networking/tls.rst
index 5bcbf75e2025..8cb2cd4e2a80 100644
--- a/Documentation/networking/tls.rst
+++ b/Documentation/networking/tls.rst
@@ -213,3 +213,29 @@ A patchset to OpenSSL to use ktls as the record layer is
of calling send directly after a handshake using gnutls.
Since it doesn't implement a full record layer, control
messages are not supported.
+
+Statistics
+==========
+
+TLS implementation exposes the following per-namespace statistics
+(``/proc/net/tls_stat``):
+
+- ``TlsCurrTxSw``, ``TlsCurrRxSw`` -
+ number of TX and RX sessions currently installed where host handles
+ cryptography
+
+- ``TlsCurrTxDevice``, ``TlsCurrRxDevice`` -
+ number of TX and RX sessions currently installed where NIC handles
+ cryptography
+
+- ``TlsTxSw``, ``TlsRxSw`` -
+ number of TX and RX sessions opened with host cryptography
+
+- ``TlsTxDevice``, ``TlsRxDevice`` -
+ number of TX and RX sessions opened with NIC cryptography
+
+- ``TlsDecryptError`` -
+ record decryption failed (e.g. due to incorrect authentication tag)
+
+- ``TlsDeviceRxResync`` -
+ number of RX resyncs sent to NICs handling cryptography
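
As a reader's illustration of the statistics added above (not part of the patch), the following is a minimal sketch of how user space might look up one counter. It assumes each line of ``/proc/net/tls_stat`` holds a counter name followed by whitespace and an unsigned decimal value; that layout is an assumption, not something the patch text states. The helper names ``parse_tls_stat_line`` and ``read_tls_counter`` are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse one line of the assumed "Name <whitespace> value" layout.
 * Returns 1 and fills name/value on success, 0 on a malformed line
 * or when the name does not fit in the caller's buffer.
 */
int parse_tls_stat_line(const char *line, char *name, size_t name_len,
                        unsigned long *value)
{
    char buf[64];
    unsigned long v;

    if (sscanf(line, "%63s %lu", buf, &v) != 2)
        return 0;
    if (strlen(buf) >= name_len)
        return 0;
    strcpy(name, buf);
    *value = v;
    return 1;
}

/*
 * Scan an already opened stream for the counter called "wanted".
 * Returns 1 and stores the value if found, 0 otherwise.
 */
int read_tls_counter(FILE *fp, const char *wanted, unsigned long *value)
{
    char line[128];
    char name[64];
    unsigned long v;

    while (fgets(line, sizeof(line), fp)) {
        if (parse_tls_stat_line(line, name, sizeof(name), &v) &&
            strcmp(name, wanted) == 0) {
            *value = v;
            return 1;
        }
    }
    return 0;
}
```

A caller would typically ``fopen("/proc/net/tls_stat", "r")``, pass the stream to ``read_tls_counter()`` with a name such as ``"TlsDecryptError"``, and treat a return value of 0 as "counter not present" (for example on a kernel without this patch).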