diff options
author | Jeff Garzik <jgarzik@pobox.com> | 2005-09-01 18:02:01 -0400 |
---|---|---|
committer | Jeff Garzik <jgarzik@pobox.com> | 2005-09-01 18:02:01 -0400 |
commit | e3ee3b78f83688a0ae4315e8be71b2eac559904a (patch) | |
tree | deb03bcdd020262af450ed23382d7c921263f5cf /Documentation/networking | |
parent | 91cb70c1769d9b72dd1efe40c31f01005820b09e (diff) | |
parent | 6b39374a27eb4be7e9d82145ae270ba02ea90dc8 (diff) | |
download | talos-obmc-linux-e3ee3b78f83688a0ae4315e8be71b2eac559904a.tar.gz talos-obmc-linux-e3ee3b78f83688a0ae4315e8be71b2eac559904a.zip |
/spare/repo/netdev-2.6 branch 'master'
Diffstat (limited to 'Documentation/networking')
-rw-r--r-- | Documentation/networking/cxgb.txt | 352 | ||||
-rw-r--r-- | Documentation/networking/phy.txt | 288 |
2 files changed, 640 insertions, 0 deletions
diff --git a/Documentation/networking/cxgb.txt b/Documentation/networking/cxgb.txt new file mode 100644 index 000000000000..76324638626b --- /dev/null +++ b/Documentation/networking/cxgb.txt @@ -0,0 +1,352 @@ + Chelsio N210 10Gb Ethernet Network Controller + + Driver Release Notes for Linux + + Version 2.1.1 + + June 20, 2005 + +CONTENTS +======== + INTRODUCTION + FEATURES + PERFORMANCE + DRIVER MESSAGES + KNOWN ISSUES + SUPPORT + + +INTRODUCTION +============ + + This document describes the Linux driver for Chelsio 10Gb Ethernet Network + Controller. This driver supports the Chelsio N210 NIC and is backward + compatible with the Chelsio N110 model 10Gb NICs. + + +FEATURES +======== + + Adaptive Interrupts (adaptive-rx) + --------------------------------- + + This feature provides an adaptive algorithm that adjusts the interrupt + coalescing parameters, allowing the driver to dynamically adapt the latency + settings to achieve the highest performance during various types of network + load. + + The interface used to control this feature is ethtool. Please see the + ethtool manpage for additional usage information. + + By default, adaptive-rx is disabled. + To enable adaptive-rx: + + ethtool -C <interface> adaptive-rx on + + To disable adaptive-rx, use ethtool: + + ethtool -C <interface> adaptive-rx off + + After disabling adaptive-rx, the timer latency value will be set to 50us. + You may set the timer latency after disabling adaptive-rx: + + ethtool -C <interface> rx-usecs <microseconds> + + An example to set the timer latency value to 100us on eth0: + + ethtool -C eth0 rx-usecs 100 + + You may also provide a timer latency value while disabling adpative-rx: + + ethtool -C <interface> adaptive-rx off rx-usecs <microseconds> + + If adaptive-rx is disabled and a timer latency value is specified, the timer + will be set to the specified value until changed by the user or until + adaptive-rx is enabled. + + To view the status of the adaptive-rx and timer latency values: + + ethtool -c <interface> + + + TCP Segmentation Offloading (TSO) Support + ----------------------------------------- + + This feature, also known as "large send", enables a system's protocol stack + to offload portions of outbound TCP processing to a network interface card + thereby reducing system CPU utilization and enhancing performance. + + The interface used to control this feature is ethtool version 1.8 or higher. + Please see the ethtool manpage for additional usage information. + + By default, TSO is enabled. + To disable TSO: + + ethtool -K <interface> tso off + + To enable TSO: + + ethtool -K <interface> tso on + + To view the status of TSO: + + ethtool -k <interface> + + +PERFORMANCE +=========== + + The following information is provided as an example of how to change system + parameters for "performance tuning" an what value to use. You may or may not + want to change these system parameters, depending on your server/workstation + application. Doing so is not warranted in any way by Chelsio Communications, + and is done at "YOUR OWN RISK". Chelsio will not be held responsible for loss + of data or damage to equipment. + + Your distribution may have a different way of doing things, or you may prefer + a different method. These commands are shown only to provide an example of + what to do and are by no means definitive. + + Making any of the following system changes will only last until you reboot + your system. You may want to write a script that runs at boot-up which + includes the optimal settings for your system. + + Setting PCI Latency Timer: + setpci -d 1425:* 0x0c.l=0x0000F800 + + Disabling TCP timestamp: + sysctl -w net.ipv4.tcp_timestamps=0 + + Disabling SACK: + sysctl -w net.ipv4.tcp_sack=0 + + Setting large number of incoming connection requests: + sysctl -w net.ipv4.tcp_max_syn_backlog=3000 + + Setting maximum receive socket buffer size: + sysctl -w net.core.rmem_max=1024000 + + Setting maximum send socket buffer size: + sysctl -w net.core.wmem_max=1024000 + + Set smp_affinity (on a multiprocessor system) to a single CPU: + echo 1 > /proc/irq/<interrupt_number>/smp_affinity + + Setting default receive socket buffer size: + sysctl -w net.core.rmem_default=524287 + + Setting default send socket buffer size: + sysctl -w net.core.wmem_default=524287 + + Setting maximum option memory buffers: + sysctl -w net.core.optmem_max=524287 + + Setting maximum backlog (# of unprocessed packets before kernel drops): + sysctl -w net.core.netdev_max_backlog=300000 + + Setting TCP read buffers (min/default/max): + sysctl -w net.ipv4.tcp_rmem="10000000 10000000 10000000" + + Setting TCP write buffers (min/pressure/max): + sysctl -w net.ipv4.tcp_wmem="10000000 10000000 10000000" + + Setting TCP buffer space (min/pressure/max): + sysctl -w net.ipv4.tcp_mem="10000000 10000000 10000000" + + TCP window size for single connections: + The receive buffer (RX_WINDOW) size must be at least as large as the + Bandwidth-Delay Product of the communication link between the sender and + receiver. Due to the variations of RTT, you may want to increase the buffer + size up to 2 times the Bandwidth-Delay Product. Reference page 289 of + "TCP/IP Illustrated, Volume 1, The Protocols" by W. Richard Stevens. + At 10Gb speeds, use the following formula: + RX_WINDOW >= 1.25MBytes * RTT(in milliseconds) + Example for RTT with 100us: RX_WINDOW = (1,250,000 * 0.1) = 125,000 + RX_WINDOW sizes of 256KB - 512KB should be sufficient. + Setting the min, max, and default receive buffer (RX_WINDOW) size: + sysctl -w net.ipv4.tcp_rmem="<min> <default> <max>" + + TCP window size for multiple connections: + The receive buffer (RX_WINDOW) size may be calculated the same as single + connections, but should be divided by the number of connections. The + smaller window prevents congestion and facilitates better pacing, + especially if/when MAC level flow control does not work well or when it is + not supported on the machine. Experimentation may be necessary to attain + the correct value. This method is provided as a starting point fot the + correct receive buffer size. + Setting the min, max, and default receive buffer (RX_WINDOW) size is + performed in the same manner as single connection. + + +DRIVER MESSAGES +=============== + + The following messages are the most common messages logged by syslog. These + may be found in /var/log/messages. + + Driver up: + Chelsio Network Driver - version 2.1.1 + + NIC detected: + eth#: Chelsio N210 1x10GBaseX NIC (rev #), PCIX 133MHz/64-bit + + Link up: + eth#: link is up at 10 Gbps, full duplex + + Link down: + eth#: link is down + + +KNOWN ISSUES +============ + + These issues have been identified during testing. The following information + is provided as a workaround to the problem. In some cases, this problem is + inherent to Linux or to a particular Linux Distribution and/or hardware + platform. + + 1. Large number of TCP retransmits on a multiprocessor (SMP) system. + + On a system with multiple CPUs, the interrupt (IRQ) for the network + controller may be bound to more than one CPU. This will cause TCP + retransmits if the packet data were to be split across different CPUs + and re-assembled in a different order than expected. + + To eliminate the TCP retransmits, set smp_affinity on the particular + interrupt to a single CPU. You can locate the interrupt (IRQ) used on + the N110/N210 by using ifconfig: + ifconfig <dev_name> | grep Interrupt + Set the smp_affinity to a single CPU: + echo 1 > /proc/irq/<interrupt_number>/smp_affinity + + It is highly suggested that you do not run the irqbalance daemon on your + system, as this will change any smp_affinity setting you have applied. + The irqbalance daemon runs on a 10 second interval and binds interrupts + to the least loaded CPU determined by the daemon. To disable this daemon: + chkconfig --level 2345 irqbalance off + + By default, some Linux distributions enable the kernel feature, + irqbalance, which performs the same function as the daemon. To disable + this feature, add the following line to your bootloader: + noirqbalance + + Example using the Grub bootloader: + title Red Hat Enterprise Linux AS (2.4.21-27.ELsmp) + root (hd0,0) + kernel /vmlinuz-2.4.21-27.ELsmp ro root=/dev/hda3 noirqbalance + initrd /initrd-2.4.21-27.ELsmp.img + + 2. After running insmod, the driver is loaded and the incorrect network + interface is brought up without running ifup. + + When using 2.4.x kernels, including RHEL kernels, the Linux kernel + invokes a script named "hotplug". This script is primarily used to + automatically bring up USB devices when they are plugged in, however, + the script also attempts to automatically bring up a network interface + after loading the kernel module. The hotplug script does this by scanning + the ifcfg-eth# config files in /etc/sysconfig/network-scripts, looking + for HWADDR=<mac_address>. + + If the hotplug script does not find the HWADDRR within any of the + ifcfg-eth# files, it will bring up the device with the next available + interface name. If this interface is already configured for a different + network card, your new interface will have incorrect IP address and + network settings. + + To solve this issue, you can add the HWADDR=<mac_address> key to the + interface config file of your network controller. + + To disable this "hotplug" feature, you may add the driver (module name) + to the "blacklist" file located in /etc/hotplug. It has been noted that + this does not work for network devices because the net.agent script + does not use the blacklist file. Simply remove, or rename, the net.agent + script located in /etc/hotplug to disable this feature. + + 3. Transport Protocol (TP) hangs when running heavy multi-connection traffic + on an AMD Opteron system with HyperTransport PCI-X Tunnel chipset. + + If your AMD Opteron system uses the AMD-8131 HyperTransport PCI-X Tunnel + chipset, you may experience the "133-Mhz Mode Split Completion Data + Corruption" bug identified by AMD while using a 133Mhz PCI-X card on the + bus PCI-X bus. + + AMD states, "Under highly specific conditions, the AMD-8131 PCI-X Tunnel + can provide stale data via split completion cycles to a PCI-X card that + is operating at 133 Mhz", causing data corruption. + + AMD's provides three workarounds for this problem, however, Chelsio + recommends the first option for best performance with this bug: + + For 133Mhz secondary bus operation, limit the transaction length and + the number of outstanding transactions, via BIOS configuration + programming of the PCI-X card, to the following: + + Data Length (bytes): 1k + Total allowed outstanding transactions: 2 + + Please refer to AMD 8131-HT/PCI-X Errata 26310 Rev 3.08 August 2004, + section 56, "133-MHz Mode Split Completion Data Corruption" for more + details with this bug and workarounds suggested by AMD. + + It may be possible to work outside AMD's recommended PCI-X settings, try + increasing the Data Length to 2k bytes for increased performance. If you + have issues with these settings, please revert to the "safe" settings + and duplicate the problem before submitting a bug or asking for support. + + NOTE: The default setting on most systems is 8 outstanding transactions + and 2k bytes data length. + + 4. On multiprocessor systems, it has been noted that an application which + is handling 10Gb networking can switch between CPUs causing degraded + and/or unstable performance. + + If running on an SMP system and taking performance measurements, it + is suggested you either run the latest netperf-2.4.0+ or use a binding + tool such as Tim Hockin's procstate utilities (runon) + <http://www.hockin.org/~thockin/procstate/>. + + Binding netserver and netperf (or other applications) to particular + CPUs will have a significant difference in performance measurements. + You may need to experiment which CPU to bind the application to in + order to achieve the best performance for your system. + + If you are developing an application designed for 10Gb networking, + please keep in mind you may want to look at kernel functions + sched_setaffinity & sched_getaffinity to bind your application. + + If you are just running user-space applications such as ftp, telnet, + etc., you may want to try the runon tool provided by Tim Hockin's + procstate utility. You could also try binding the interface to a + particular CPU: runon 0 ifup eth0 + + +SUPPORT +======= + + If you have problems with the software or hardware, please contact our + customer support team via email at support@chelsio.com or check our website + at http://www.chelsio.com + +=============================================================================== + + Chelsio Communications + 370 San Aleso Ave. + Suite 100 + Sunnyvale, CA 94085 + http://www.chelsio.com + +This program is free software; you can redistribute it and/or modify +it under the terms of the GNU General Public License, version 2, as +published by the Free Software Foundation. + +You should have received a copy of the GNU General Public License along +with this program; if not, write to the Free Software Foundation, Inc., +59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + +THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED +WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. + + Copyright (c) 2003-2005 Chelsio Communications. All rights reserved. + +=============================================================================== diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt new file mode 100644 index 000000000000..29ccae409031 --- /dev/null +++ b/Documentation/networking/phy.txt @@ -0,0 +1,288 @@ + +------- +PHY Abstraction Layer +(Updated 2005-07-21) + +Purpose + + Most network devices consist of set of registers which provide an interface + to a MAC layer, which communicates with the physical connection through a + PHY. The PHY concerns itself with negotiating link parameters with the link + partner on the other side of the network connection (typically, an ethernet + cable), and provides a register interface to allow drivers to determine what + settings were chosen, and to configure what settings are allowed. + + While these devices are distinct from the network devices, and conform to a + standard layout for the registers, it has been common practice to integrate + the PHY management code with the network driver. This has resulted in large + amounts of redundant code. Also, on embedded systems with multiple (and + sometimes quite different) ethernet controllers connected to the same + management bus, it is difficult to ensure safe use of the bus. + + Since the PHYs are devices, and the management busses through which they are + accessed are, in fact, busses, the PHY Abstraction Layer treats them as such. + In doing so, it has these goals: + + 1) Increase code-reuse + 2) Increase overall code-maintainability + 3) Speed development time for new network drivers, and for new systems + + Basically, this layer is meant to provide an interface to PHY devices which + allows network driver writers to write as little code as possible, while + still providing a full feature set. + +The MDIO bus + + Most network devices are connected to a PHY by means of a management bus. + Different devices use different busses (though some share common interfaces). + In order to take advantage of the PAL, each bus interface needs to be + registered as a distinct device. + + 1) read and write functions must be implemented. Their prototypes are: + + int write(struct mii_bus *bus, int mii_id, int regnum, u16 value); + int read(struct mii_bus *bus, int mii_id, int regnum); + + mii_id is the address on the bus for the PHY, and regnum is the register + number. These functions are guaranteed not to be called from interrupt + time, so it is safe for them to block, waiting for an interrupt to signal + the operation is complete + + 2) A reset function is necessary. This is used to return the bus to an + initialized state. + + 3) A probe function is needed. This function should set up anything the bus + driver needs, setup the mii_bus structure, and register with the PAL using + mdiobus_register. Similarly, there's a remove function to undo all of + that (use mdiobus_unregister). + + 4) Like any driver, the device_driver structure must be configured, and init + exit functions are used to register the driver. + + 5) The bus must also be declared somewhere as a device, and registered. + + As an example for how one driver implemented an mdio bus driver, see + drivers/net/gianfar_mii.c and arch/ppc/syslib/mpc85xx_devices.c + +Connecting to a PHY + + Sometime during startup, the network driver needs to establish a connection + between the PHY device, and the network device. At this time, the PHY's bus + and drivers need to all have been loaded, so it is ready for the connection. + At this point, there are several ways to connect to the PHY: + + 1) The PAL handles everything, and only calls the network driver when + the link state changes, so it can react. + + 2) The PAL handles everything except interrupts (usually because the + controller has the interrupt registers). + + 3) The PAL handles everything, but checks in with the driver every second, + allowing the network driver to react first to any changes before the PAL + does. + + 4) The PAL serves only as a library of functions, with the network device + manually calling functions to update status, and configure the PHY + + +Letting the PHY Abstraction Layer do Everything + + If you choose option 1 (The hope is that every driver can, but to still be + useful to drivers that can't), connecting to the PHY is simple: + + First, you need a function to react to changes in the link state. This + function follows this protocol: + + static void adjust_link(struct net_device *dev); + + Next, you need to know the device name of the PHY connected to this device. + The name will look something like, "phy0:0", where the first number is the + bus id, and the second is the PHY's address on that bus. + + Now, to connect, just call this function: + + phydev = phy_connect(dev, phy_name, &adjust_link, flags); + + phydev is a pointer to the phy_device structure which represents the PHY. If + phy_connect is successful, it will return the pointer. dev, here, is the + pointer to your net_device. Once done, this function will have started the + PHY's software state machine, and registered for the PHY's interrupt, if it + has one. The phydev structure will be populated with information about the + current state, though the PHY will not yet be truly operational at this + point. + + flags is a u32 which can optionally contain phy-specific flags. + This is useful if the system has put hardware restrictions on + the PHY/controller, of which the PHY needs to be aware. + + Now just make sure that phydev->supported and phydev->advertising have any + values pruned from them which don't make sense for your controller (a 10/100 + controller may be connected to a gigabit capable PHY, so you would need to + mask off SUPPORTED_1000baseT*). See include/linux/ethtool.h for definitions + for these bitfields. Note that you should not SET any bits, or the PHY may + get put into an unsupported state. + + Lastly, once the controller is ready to handle network traffic, you call + phy_start(phydev). This tells the PAL that you are ready, and configures the + PHY to connect to the network. If you want to handle your own interrupts, + just set phydev->irq to PHY_IGNORE_INTERRUPT before you call phy_start. + Similarly, if you don't want to use interrupts, set phydev->irq to PHY_POLL. + + When you want to disconnect from the network (even if just briefly), you call + phy_stop(phydev). + +Keeping Close Tabs on the PAL + + It is possible that the PAL's built-in state machine needs a little help to + keep your network device and the PHY properly in sync. If so, you can + register a helper function when connecting to the PHY, which will be called + every second before the state machine reacts to any changes. To do this, you + need to manually call phy_attach() and phy_prepare_link(), and then call + phy_start_machine() with the second argument set to point to your special + handler. + + Currently there are no examples of how to use this functionality, and testing + on it has been limited because the author does not have any drivers which use + it (they all use option 1). So Caveat Emptor. + +Doing it all yourself + + There's a remote chance that the PAL's built-in state machine cannot track + the complex interactions between the PHY and your network device. If this is + so, you can simply call phy_attach(), and not call phy_start_machine or + phy_prepare_link(). This will mean that phydev->state is entirely yours to + handle (phy_start and phy_stop toggle between some of the states, so you + might need to avoid them). + + An effort has been made to make sure that useful functionality can be + accessed without the state-machine running, and most of these functions are + descended from functions which did not interact with a complex state-machine. + However, again, no effort has been made so far to test running without the + state machine, so tryer beware. + + Here is a brief rundown of the functions: + + int phy_read(struct phy_device *phydev, u16 regnum); + int phy_write(struct phy_device *phydev, u16 regnum, u16 val); + + Simple read/write primitives. They invoke the bus's read/write function + pointers. + + void phy_print_status(struct phy_device *phydev); + + A convenience function to print out the PHY status neatly. + + int phy_clear_interrupt(struct phy_device *phydev); + int phy_config_interrupt(struct phy_device *phydev, u32 interrupts); + + Clear the PHY's interrupt, and configure which ones are allowed, + respectively. Currently only supports all on, or all off. + + int phy_enable_interrupts(struct phy_device *phydev); + int phy_disable_interrupts(struct phy_device *phydev); + + Functions which enable/disable PHY interrupts, clearing them + before and after, respectively. + + int phy_start_interrupts(struct phy_device *phydev); + int phy_stop_interrupts(struct phy_device *phydev); + + Requests the IRQ for the PHY interrupts, then enables them for + start, or disables then frees them for stop. + + struct phy_device * phy_attach(struct net_device *dev, const char *phy_id, + u32 flags); + + Attaches a network device to a particular PHY, binding the PHY to a generic + driver if none was found during bus initialization. Passes in + any phy-specific flags as needed. + + int phy_start_aneg(struct phy_device *phydev); + + Using variables inside the phydev structure, either configures advertising + and resets autonegotiation, or disables autonegotiation, and configures + forced settings. + + static inline int phy_read_status(struct phy_device *phydev); + + Fills the phydev structure with up-to-date information about the current + settings in the PHY. + + void phy_sanitize_settings(struct phy_device *phydev) + + Resolves differences between currently desired settings, and + supported settings for the given PHY device. Does not make + the changes in the hardware, though. + + int phy_ethtool_sset(struct phy_device *phydev, struct ethtool_cmd *cmd); + int phy_ethtool_gset(struct phy_device *phydev, struct ethtool_cmd *cmd); + + Ethtool convenience functions. + + int phy_mii_ioctl(struct phy_device *phydev, + struct mii_ioctl_data *mii_data, int cmd); + + The MII ioctl. Note that this function will completely screw up the state + machine if you write registers like BMCR, BMSR, ADVERTISE, etc. Best to + use this only to write registers which are not standard, and don't set off + a renegotiation. + + +PHY Device Drivers + + With the PHY Abstraction Layer, adding support for new PHYs is + quite easy. In some cases, no work is required at all! However, + many PHYs require a little hand-holding to get up-and-running. + +Generic PHY driver + + If the desired PHY doesn't have any errata, quirks, or special + features you want to support, then it may be best to not add + support, and let the PHY Abstraction Layer's Generic PHY Driver + do all of the work. + +Writing a PHY driver + + If you do need to write a PHY driver, the first thing to do is + make sure it can be matched with an appropriate PHY device. + This is done during bus initialization by reading the device's + UID (stored in registers 2 and 3), then comparing it to each + driver's phy_id field by ANDing it with each driver's + phy_id_mask field. Also, it needs a name. Here's an example: + + static struct phy_driver dm9161_driver = { + .phy_id = 0x0181b880, + .name = "Davicom DM9161E", + .phy_id_mask = 0x0ffffff0, + ... + } + + Next, you need to specify what features (speed, duplex, autoneg, + etc) your PHY device and driver support. Most PHYs support + PHY_BASIC_FEATURES, but you can look in include/mii.h for other + features. + + Each driver consists of a number of function pointers: + + config_init: configures PHY into a sane state after a reset. + For instance, a Davicom PHY requires descrambling disabled. + probe: Does any setup needed by the driver + suspend/resume: power management + config_aneg: Changes the speed/duplex/negotiation settings + read_status: Reads the current speed/duplex/negotiation settings + ack_interrupt: Clear a pending interrupt + config_intr: Enable or disable interrupts + remove: Does any driver take-down + + Of these, only config_aneg and read_status are required to be + assigned by the driver code. The rest are optional. Also, it is + preferred to use the generic phy driver's versions of these two + functions if at all possible: genphy_read_status and + genphy_config_aneg. If this is not possible, it is likely that + you only need to perform some actions before and after invoking + these functions, and so your functions will wrap the generic + ones. + + Feel free to look at the Marvell, Cicada, and Davicom drivers in + drivers/net/phy/ for examples (the lxt and qsemi drivers have + not been tested as of this writing) |