Diffstat (limited to 'Documentation')
-rw-r--r-- Documentation/bcache.txt 12
-rw-r--r-- Documentation/devices.txt 8
-rw-r--r-- Documentation/devicetree/bindings/net/allwinner,sun4i-emac.txt 22
-rw-r--r-- Documentation/devicetree/bindings/net/allwinner,sun4i-mdio.txt 26
-rw-r--r-- Documentation/devicetree/bindings/net/cpsw.txt 6
-rw-r--r-- Documentation/devicetree/bindings/net/davicom-dm9000.txt 26
-rw-r--r-- Documentation/devicetree/bindings/net/macb.txt 2
-rw-r--r-- Documentation/devicetree/bindings/net/marvell-orion-net.txt 85
-rw-r--r-- Documentation/devicetree/bindings/net/micrel-ks8851.txt 9
-rw-r--r-- Documentation/devicetree/bindings/net/via-velocity.txt 20
-rw-r--r-- Documentation/devicetree/bindings/rtc/atmel,at91rm9200-rtc.txt 2
-rw-r--r-- Documentation/devicetree/bindings/vendor-prefixes.txt 1
-rw-r--r-- Documentation/devicetree/bindings/video/exynos_hdmi.txt (renamed from Documentation/devicetree/bindings/drm/exynos/hdmi.txt) 0
-rw-r--r-- Documentation/devicetree/bindings/video/exynos_hdmiddc.txt (renamed from Documentation/devicetree/bindings/drm/exynos/hdmiddc.txt) 0
-rw-r--r-- Documentation/devicetree/bindings/video/exynos_hdmiphy.txt (renamed from Documentation/devicetree/bindings/drm/exynos/hdmiphy.txt) 0
-rw-r--r-- Documentation/devicetree/bindings/video/exynos_mixer.txt (renamed from Documentation/devicetree/bindings/drm/exynos/mixer.txt) 0
-rw-r--r-- Documentation/devicetree/bindings/video/simple-framebuffer.txt 25
-rw-r--r-- Documentation/devicetree/usage-model.txt 8
-rw-r--r-- Documentation/dmatest.txt 6
-rw-r--r-- Documentation/filesystems/xfs.txt 3
-rw-r--r-- Documentation/kernel-parameters.txt 24
-rw-r--r-- Documentation/kernel-per-CPU-kthreads.txt 202
-rw-r--r-- Documentation/m68k/kernel-options.txt 2
-rw-r--r-- Documentation/networking/.gitignore 1
-rw-r--r-- Documentation/networking/00-INDEX 2
-rw-r--r-- Documentation/networking/Makefile 5
-rw-r--r-- Documentation/networking/bonding.txt 54
-rw-r--r-- Documentation/networking/ifenslave.c 1105
-rw-r--r-- Documentation/networking/ip-sysctl.txt 13
-rw-r--r-- Documentation/networking/netlink_mmap.txt 14
-rw-r--r-- Documentation/networking/packet_mmap.txt 133
-rw-r--r-- Documentation/networking/scaling.txt 58
-rw-r--r-- Documentation/power/devices.txt 15
-rw-r--r-- Documentation/power/interface.txt 4
-rw-r--r-- Documentation/power/notifiers.txt 6
-rw-r--r-- Documentation/power/states.txt 30
-rw-r--r-- Documentation/powerpc/transactional_memory.txt 27
-rw-r--r-- Documentation/rapidio/rapidio.txt 128
-rw-r--r-- Documentation/rapidio/sysfs.txt 17
-rw-r--r-- Documentation/sysctl/net.txt 27
40 files changed, 795 insertions, 1333 deletions
diff --git a/Documentation/bcache.txt b/Documentation/bcache.txt
index 77db8809bd96..b3a7e7d384f6 100644
--- a/Documentation/bcache.txt
+++ b/Documentation/bcache.txt
@@ -319,7 +319,10 @@ cache<0..n>
Symlink to each of the cache devices comprising this cache set.
cache_available_percent
- Percentage of cache device free.
+ Percentage of cache device which doesn't contain dirty data, and could
+ potentially be used for writeback. This doesn't mean this space isn't used
+ for clean cached data; the unused statistic (in priority_stats) is typically
+ much lower.
clear_stats
Clears the statistics associated with this cache
@@ -423,8 +426,11 @@ nbuckets
Total buckets in this cache
priority_stats
- Statistics about how recently data in the cache has been accessed. This can
- reveal your working set size.
+ Statistics about how recently data in the cache has been accessed.
+ This can reveal your working set size. Unused is the percentage of
+ the cache that doesn't contain any data. Metadata is bcache's
+ metadata overhead. Average is the average priority of cache buckets.
+ Next is a list of quantiles with the priority threshold of each.
written
Sum of all data that has been written to the cache; comparison with
diff --git a/Documentation/devices.txt b/Documentation/devices.txt
index 08f01e79c41a..b9015912bca6 100644
--- a/Documentation/devices.txt
+++ b/Documentation/devices.txt
@@ -498,12 +498,8 @@ Your cooperation is appreciated.
Each device type has 5 bits (32 minors).
- 13 block 8-bit MFM/RLL/IDE controller
- 0 = /dev/xda First XT disk whole disk
- 64 = /dev/xdb Second XT disk whole disk
-
- Partitions are handled in the same way as IDE disks
- (see major number 3).
+ 13 block Previously used for the XT disk (/dev/xdN)
+ Deleted in kernel v3.9.
14 char Open Sound System (OSS)
0 = /dev/mixer Mixer control
diff --git a/Documentation/devicetree/bindings/net/allwinner,sun4i-emac.txt b/Documentation/devicetree/bindings/net/allwinner,sun4i-emac.txt
new file mode 100644
index 000000000000..b90bfcd138ff
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/allwinner,sun4i-emac.txt
@@ -0,0 +1,22 @@
+* Allwinner EMAC ethernet controller
+
+Required properties:
+- compatible: should be "allwinner,sun4i-emac".
+- reg: address and length of the register set for the device.
+- interrupts: interrupt for the device
+- phy: A phandle to a phy node defining the PHY address (as the reg
+ property, a single integer).
+- clocks: A phandle to the reference clock for this device
+
+Optional properties:
+- (local-)mac-address: mac address to be used by this driver
+
+Example:
+
+emac: ethernet@01c0b000 {
+ compatible = "allwinner,sun4i-emac";
+ reg = <0x01c0b000 0x1000>;
+ interrupts = <55>;
+ clocks = <&ahb_gates 17>;
+ phy = <&phy0>;
+};
diff --git a/Documentation/devicetree/bindings/net/allwinner,sun4i-mdio.txt b/Documentation/devicetree/bindings/net/allwinner,sun4i-mdio.txt
new file mode 100644
index 000000000000..00b9f9a3ec1d
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/allwinner,sun4i-mdio.txt
@@ -0,0 +1,26 @@
+* Allwinner A10 MDIO Ethernet Controller interface
+
+Required properties:
+- compatible: should be "allwinner,sun4i-mdio".
+- reg: address and length of the register set for the device.
+
+Optional properties:
+- phy-supply: phandle to a regulator if the PHY needs one
+
+Example at the SoC level:
+mdio@01c0b080 {
+ compatible = "allwinner,sun4i-mdio";
+ reg = <0x01c0b080 0x14>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+};
+
+And at the board level:
+
+mdio@01c0b080 {
+ phy-supply = <&reg_emac_3v3>;
+
+ phy0: ethernet-phy@0 {
+ reg = <0>;
+ };
+};
diff --git a/Documentation/devicetree/bindings/net/cpsw.txt b/Documentation/devicetree/bindings/net/cpsw.txt
index 4f2ca6b4a182..05d660e4ac64 100644
--- a/Documentation/devicetree/bindings/net/cpsw.txt
+++ b/Documentation/devicetree/bindings/net/cpsw.txt
@@ -28,6 +28,8 @@ Optional properties:
Slave Properties:
Required properties:
- phy_id : Specifies slave phy id
+- phy-mode : The interface between the SoC and the PHY (a string
+ that of_get_phy_mode() can understand)
- mac-address : Specifies slave MAC address
Optional properties:
@@ -58,11 +60,13 @@ Examples:
cpts_clock_shift = <29>;
cpsw_emac0: slave@0 {
phy_id = <&davinci_mdio>, <0>;
+ phy-mode = "rgmii-txid";
/* Filled in by U-Boot */
mac-address = [ 00 00 00 00 00 00 ];
};
cpsw_emac1: slave@1 {
phy_id = <&davinci_mdio>, <1>;
+ phy-mode = "rgmii-txid";
/* Filled in by U-Boot */
mac-address = [ 00 00 00 00 00 00 ];
};
@@ -84,11 +88,13 @@ Examples:
cpts_clock_shift = <29>;
cpsw_emac0: slave@0 {
phy_id = <&davinci_mdio>, <0>;
+ phy-mode = "rgmii-txid";
/* Filled in by U-Boot */
mac-address = [ 00 00 00 00 00 00 ];
};
cpsw_emac1: slave@1 {
phy_id = <&davinci_mdio>, <1>;
+ phy-mode = "rgmii-txid";
/* Filled in by U-Boot */
mac-address = [ 00 00 00 00 00 00 ];
};
diff --git a/Documentation/devicetree/bindings/net/davicom-dm9000.txt b/Documentation/devicetree/bindings/net/davicom-dm9000.txt
new file mode 100644
index 000000000000..2d39c990e641
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/davicom-dm9000.txt
@@ -0,0 +1,26 @@
+Davicom DM9000 Fast Ethernet controller
+
+Required properties:
+- compatible = "davicom,dm9000";
+- reg : physical addresses and sizes of registers, must contain 2 entries:
+ first entry : address register,
+ second entry : data register.
+- interrupt-parent : interrupt controller to which the device is connected
+- interrupts : interrupt specifier specific to interrupt controller
+
+Optional properties:
+- local-mac-address : A bytestring of 6 bytes specifying Ethernet MAC address
+ to use (from firmware or bootloader)
+- davicom,no-eeprom : Configuration EEPROM is not available
+- davicom,ext-phy : Use external PHY
+
+Example:
+
+ ethernet@18000000 {
+ compatible = "davicom,dm9000";
+ reg = <0x18000000 0x2 0x18000004 0x2>;
+ interrupt-parent = <&gpn>;
+ interrupts = <7 4>;
+ local-mac-address = [00 00 de ad be ef];
+ davicom,no-eeprom;
+ };
diff --git a/Documentation/devicetree/bindings/net/macb.txt b/Documentation/devicetree/bindings/net/macb.txt
index 44afa0e5057d..4ff65047bb9a 100644
--- a/Documentation/devicetree/bindings/net/macb.txt
+++ b/Documentation/devicetree/bindings/net/macb.txt
@@ -4,7 +4,7 @@ Required properties:
- compatible: Should be "cdns,[<chip>-]{macb|gem}"
Use "cdns,at91sam9260-macb" Atmel at91sam9260 and at91sam9263 SoCs.
Use "cdns,at32ap7000-macb" for other 10/100 usage or use the generic form: "cdns,macb".
- Use "cnds,pc302-gem" for Picochip picoXcell pc302 and later devices based on
+ Use "cdns,pc302-gem" for Picochip picoXcell pc302 and later devices based on
the Cadence GEM, or the generic form: "cdns,gem".
- reg: Address and length of the register set for the device
- interrupts: Should contain macb interrupt
diff --git a/Documentation/devicetree/bindings/net/marvell-orion-net.txt b/Documentation/devicetree/bindings/net/marvell-orion-net.txt
new file mode 100644
index 000000000000..a73b79f227e1
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/marvell-orion-net.txt
@@ -0,0 +1,85 @@
+Marvell Orion/Discovery ethernet controller
+=============================================
+
+The Marvell Discovery ethernet controller can be found on Marvell Orion SoCs
+(Kirkwood, Dove, Orion5x, and Discovery Innovation) and as part of Marvell
+Discovery system controller chips (mv64[345]60).
+
+The Discovery ethernet controller is described with two levels of nodes. The
+first level describes the ethernet controller itself and the second level
+describes up to 3 ethernet port nodes within that controller. The reason for
+the multiple levels is that the port registers are interleaved within a single
+set of controller registers. Each port node describes port-specific properties.
+
+Note: The above separation is only true for Discovery system controllers.
+For Orion SoCs we stick to the separation, although there each controller has
+only one port associated. Multiple ports are implemented as multiple single-port
+controllers. As Kirkwood has some issues with proper initialization after reset,
+an extra compatible string is added for it.
+
+* Ethernet controller node
+
+Required controller properties:
+ - #address-cells: shall be 1.
+ - #size-cells: shall be 0.
+ - compatible: shall be one of "marvell,orion-eth", "marvell,kirkwood-eth".
+ - reg: address and length of the controller registers.
+
+Optional controller properties:
+ - clocks: phandle reference to the controller clock.
+ - marvell,tx-checksum-limit: max tx packet size for hardware checksum.
+
+* Ethernet port node
+
+Required port properties:
+ - device_type: shall be "network".
+ - compatible: shall be one of "marvell,orion-eth-port",
+ "marvell,kirkwood-eth-port".
+ - reg: port number relative to ethernet controller, shall be 0, 1, or 2.
+ - interrupts: port interrupt.
+ - local-mac-address: 6 bytes MAC address.
+
+Optional port properties:
+ - marvell,tx-queue-size: size of the transmit ring buffer.
+ - marvell,tx-sram-addr: address of transmit descriptor buffer located in SRAM.
+ - marvell,tx-sram-size: size of transmit descriptor buffer located in SRAM.
+ - marvell,rx-queue-size: size of the receive ring buffer.
+ - marvell,rx-sram-addr: address of receive descriptor buffer located in SRAM.
+ - marvell,rx-sram-size: size of receive descriptor buffer located in SRAM.
+
+and
+
+ - phy-handle: phandle reference to ethernet PHY.
+
+or
+
+ - speed: port speed if no PHY connected.
+ - duplex: port mode if no PHY connected.
+
+* Node example:
+
+mdio-bus {
+ ...
+ ethphy: ethernet-phy@8 {
+ device_type = "ethernet-phy";
+ ...
+ };
+};
+
+eth: ethernet-controller@72000 {
+ compatible = "marvell,orion-eth";
+ #address-cells = <1>;
+ #size-cells = <0>;
+ reg = <0x72000 0x2000>;
+ clocks = <&gate_clk 2>;
+ marvell,tx-checksum-limit = <1600>;
+
+ ethernet@0 {
+ device_type = "network";
+ compatible = "marvell,orion-eth-port";
+ reg = <0>;
+ interrupts = <29>;
+ phy-handle = <&ethphy>;
+ local-mac-address = [00 00 00 00 00 00];
+ };
+};
diff --git a/Documentation/devicetree/bindings/net/micrel-ks8851.txt b/Documentation/devicetree/bindings/net/micrel-ks8851.txt
new file mode 100644
index 000000000000..11ace3c3d805
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/micrel-ks8851.txt
@@ -0,0 +1,9 @@
+Micrel KS8851 Ethernet mac
+
+Required properties:
+- compatible = "micrel,ks8851-ml" of parallel interface
+- reg : 2 physical address and size of registers for data and command
+- interrupts : interrupt connection
+
+Optional properties:
+- local-mac-address : Ethernet mac address to use
diff --git a/Documentation/devicetree/bindings/net/via-velocity.txt b/Documentation/devicetree/bindings/net/via-velocity.txt
new file mode 100644
index 000000000000..b3db469b1ad7
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/via-velocity.txt
@@ -0,0 +1,20 @@
+* VIA Velocity 10/100/1000 Network Controller
+
+Required properties:
+- compatible : Should be "via,velocity-vt6110"
+- reg : Address and length of the io space
+- interrupts : Should contain the controller interrupt line
+
+Optional properties:
+- no-eeprom : PCI network cards use an external EEPROM to store data. Embedded
+ devices quite often set this data in uboot and do not provide an eeprom.
+ Specify this option if you have no external eeprom.
+
+Examples:
+
+eth0@d8004000 {
+ compatible = "via,velocity-vt6110";
+ reg = <0xd8004000 0x400>;
+ interrupts = <10>;
+ no-eeprom;
+};
diff --git a/Documentation/devicetree/bindings/rtc/atmel,at91rm9200-rtc.txt b/Documentation/devicetree/bindings/rtc/atmel,at91rm9200-rtc.txt
index 2a3feabd3b22..34c1505774bf 100644
--- a/Documentation/devicetree/bindings/rtc/atmel,at91rm9200-rtc.txt
+++ b/Documentation/devicetree/bindings/rtc/atmel,at91rm9200-rtc.txt
@@ -1,7 +1,7 @@
Atmel AT91RM9200 Real Time Clock
Required properties:
-- compatible: should be: "atmel,at91rm9200-rtc"
+- compatible: should be: "atmel,at91rm9200-rtc" or "atmel,at91sam9x5-rtc"
- reg: physical base address of the controller and length of memory mapped
region.
- interrupts: rtc alarm/event interrupt
diff --git a/Documentation/devicetree/bindings/vendor-prefixes.txt b/Documentation/devicetree/bindings/vendor-prefixes.txt
index 6931c4348d24..2fe74e6ec209 100644
--- a/Documentation/devicetree/bindings/vendor-prefixes.txt
+++ b/Documentation/devicetree/bindings/vendor-prefixes.txt
@@ -18,6 +18,7 @@ chrp Common Hardware Reference Platform
cirrus Cirrus Logic, Inc.
cortina Cortina Systems, Inc.
dallas Maxim Integrated Products (formerly Dallas Semiconductor)
+davicom DAVICOM Semiconductor, Inc.
denx Denx Software Engineering
emmicro EM Microelectronic
epson Seiko Epson Corp.
diff --git a/Documentation/devicetree/bindings/drm/exynos/hdmi.txt b/Documentation/devicetree/bindings/video/exynos_hdmi.txt
index 589edee37394..589edee37394 100644
--- a/Documentation/devicetree/bindings/drm/exynos/hdmi.txt
+++ b/Documentation/devicetree/bindings/video/exynos_hdmi.txt
diff --git a/Documentation/devicetree/bindings/drm/exynos/hdmiddc.txt b/Documentation/devicetree/bindings/video/exynos_hdmiddc.txt
index fa166d945809..fa166d945809 100644
--- a/Documentation/devicetree/bindings/drm/exynos/hdmiddc.txt
+++ b/Documentation/devicetree/bindings/video/exynos_hdmiddc.txt
diff --git a/Documentation/devicetree/bindings/drm/exynos/hdmiphy.txt b/Documentation/devicetree/bindings/video/exynos_hdmiphy.txt
index 858f4f9b902f..858f4f9b902f 100644
--- a/Documentation/devicetree/bindings/drm/exynos/hdmiphy.txt
+++ b/Documentation/devicetree/bindings/video/exynos_hdmiphy.txt
diff --git a/Documentation/devicetree/bindings/drm/exynos/mixer.txt b/Documentation/devicetree/bindings/video/exynos_mixer.txt
index 9b2ea0343566..9b2ea0343566 100644
--- a/Documentation/devicetree/bindings/drm/exynos/mixer.txt
+++ b/Documentation/devicetree/bindings/video/exynos_mixer.txt
diff --git a/Documentation/devicetree/bindings/video/simple-framebuffer.txt b/Documentation/devicetree/bindings/video/simple-framebuffer.txt
new file mode 100644
index 000000000000..3ea460583111
--- /dev/null
+++ b/Documentation/devicetree/bindings/video/simple-framebuffer.txt
@@ -0,0 +1,25 @@
+Simple Framebuffer
+
+A simple frame-buffer describes a raw memory region that may be rendered to,
+with the assumption that the display hardware has already been set up to scan
+out from that buffer.
+
+Required properties:
+- compatible: "simple-framebuffer"
+- reg: Should contain the location and size of the framebuffer memory.
+- width: The width of the framebuffer in pixels.
+- height: The height of the framebuffer in pixels.
+- stride: The number of bytes in each line of the framebuffer.
+- format: The format of the framebuffer surface. Valid values are:
+ - r5g6b5 (16-bit pixels, d[15:11]=r, d[10:5]=g, d[4:0]=b).
+
+Example:
+
+ framebuffer {
+ compatible = "simple-framebuffer";
+ reg = <0x1d385000 (1600 * 1200 * 2)>;
+ width = <1600>;
+ height = <1200>;
+ stride = <(1600 * 2)>;
+ format = "r5g6b5";
+ };
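
The arithmetic behind the example node above (reg size, stride, and the r5g6b5 bit layout) can be checked with a short sketch. This is illustrative Python, not driver code: the function names are hypothetical, and it assumes tightly packed lines (stride = width * 2 bytes), as in the example, although real hardware may use a larger stride.

```python
BYTES_PER_PIXEL = 2  # r5g6b5 uses 16-bit pixels


def framebuffer_geometry(width, height):
    """Return (stride, size) in bytes, assuming tightly packed lines."""
    stride = width * BYTES_PER_PIXEL
    return stride, stride * height


def pack_r5g6b5(r, g, b):
    """Pack 8-bit R, G, B into d[15:11]=r, d[10:5]=g, d[4:0]=b."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)


def pixel_offset(x, y, stride):
    """Byte offset of pixel (x, y) from the start of the framebuffer."""
    return y * stride + x * BYTES_PER_PIXEL


stride, size = framebuffer_geometry(1600, 1200)
print(stride, size)                        # matches <(1600 * 2)> and (1600 * 1200 * 2)
print(hex(pack_r5g6b5(255, 0, 0)))         # pure red occupies the top five bits
```

Keeping the stride as a separate property, rather than deriving it from the width, is what lets a consumer step from one line to the next without assuming that lines are contiguous.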
diff --git a/Documentation/devicetree/usage-model.txt b/Documentation/devicetree/usage-model.txt
index ef9d06c9f8fd..0efedaad5165 100644
--- a/Documentation/devicetree/usage-model.txt
+++ b/Documentation/devicetree/usage-model.txt
@@ -191,9 +191,11 @@ Linux it will look something like this:
};
The bootargs property contains the kernel arguments, and the initrd-*
-properties define the address and size of an initrd blob. The
-chosen node may also optionally contain an arbitrary number of
-additional properties for platform-specific configuration data.
+properties define the address and size of an initrd blob. Note that
+initrd-end is the first address after the initrd image, so this doesn't
+match the usual semantic of struct resource. The chosen node may also
+optionally contain an arbitrary number of additional properties for
+platform-specific configuration data.
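
The difference between the two end conventions can be sketched as follows. This is a minimal Python illustration with hypothetical helper names and made-up addresses; it assumes only what the paragraph above states, namely that initrd-end is exclusive while struct resource uses an inclusive end.

```python
def initrd_size(initrd_start, initrd_end):
    """Size of the initrd when initrd_end is the first address after the image."""
    if initrd_end < initrd_start:
        raise ValueError("initrd-end must not be below initrd-start")
    return initrd_end - initrd_start


def to_resource_end(initrd_end):
    """Convert the exclusive initrd-end to an inclusive, resource-style end."""
    return initrd_end - 1


def resource_size(start, inclusive_end):
    """Size of a range whose end is its last valid address."""
    return inclusive_end - start + 1


start, end = 0x82000000, 0x82800000   # illustrative values only
assert initrd_size(start, end) == resource_size(start, to_resource_end(end))
print(hex(initrd_size(start, end)))
```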
During early boot, the architecture setup code calls of_scan_flat_dt()
several times with different helper callbacks to parse device tree
diff --git a/Documentation/dmatest.txt b/Documentation/dmatest.txt
index 279ac0a8c5b1..132a094c7bc3 100644
--- a/Documentation/dmatest.txt
+++ b/Documentation/dmatest.txt
@@ -34,7 +34,7 @@ command:
After a while you will start to get messages about current status or error like
in the original code.
-Note that running a new test will stop any in progress test.
+Note that running a new test will not stop any in progress test.
The following command should return actual state of the test.
% cat /sys/kernel/debug/dmatest/run
@@ -52,8 +52,8 @@ To wait for test done the user may perform a busy loop that checks the state.
The module parameters that is supplied to the kernel command line will be used
for the first performed test. After user gets a control, the test could be
-interrupted or re-run with same or different parameters. For the details see
-the above section "Part 2 - When dmatest is built as a module..."
+re-run with the same or different parameters. For the details see the above
+section "Part 2 - When dmatest is built as a module..."
In both cases the module parameters are used as initial values for the test case.
You always could check them at run-time by running
diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
index 3e4b3dd1e046..83577f0232a0 100644
--- a/Documentation/filesystems/xfs.txt
+++ b/Documentation/filesystems/xfs.txt
@@ -33,6 +33,9 @@ When mounting an XFS filesystem, the following options are accepted.
removing extended attributes) the on-disk superblock feature
bit field will be updated to reflect this format being in use.
+ CRC enabled filesystems always use the attr2 format, and so
+ will reject the noattr2 mount option if it is set.
+
barrier
Enables the use of block layer write barriers for writes into
the journal and unwritten extent conversion. This allows for
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index c3bfacb92910..2fe6e767b3d6 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3005,6 +3005,27 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
Force threading of all interrupt handlers except those
marked explicitly IRQF_NO_THREAD.
+ tmem [KNL,XEN]
+ Enable the Transcendent memory driver if built-in.
+
+ tmem.cleancache=0|1 [KNL, XEN]
+ Default is on (1). Disable the usage of the cleancache
+ API to send anonymous pages to the hypervisor.
+
+ tmem.frontswap=0|1 [KNL, XEN]
+ Default is on (1). Disable the usage of the frontswap
+ API to send swap pages to the hypervisor. If disabled
+ the selfballooning and selfshrinking are force disabled.
+
+ tmem.selfballooning=0|1 [KNL, XEN]
+ Default is on (1). Disable the driving of swap pages
+ to the hypervisor.
+
+ tmem.selfshrinking=0|1 [KNL, XEN]
+ Default is on (1). Partial swapoff that immediately
+ transfers pages from Xen hypervisor back to the
+ kernel based on different criteria.
+
topology= [S390]
Format: {off | on}
Specify if the kernel should make use of the cpu
@@ -3330,9 +3351,6 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
plus one apbt timer for broadcast timer.
x86_mrst_timer=apbt_only | lapic_and_apbt
- xd= [HW,XT] Original XT pre-IDE (RLL encoded) disks.
- xd_geo= See header of drivers/block/xd.c.
-
xen_emul_unplug= [HW,X86,XEN]
Unplug Xen emulated devices
Format: [unplug0,][unplug1]
diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
new file mode 100644
index 000000000000..cbf7ae412da4
--- /dev/null
+++ b/Documentation/kernel-per-CPU-kthreads.txt
@@ -0,0 +1,202 @@
+REDUCING OS JITTER DUE TO PER-CPU KTHREADS
+
+This document lists per-CPU kthreads in the Linux kernel and presents
+options to control their OS jitter. Note that non-per-CPU kthreads are
+not listed here. To reduce OS jitter from non-per-CPU kthreads, bind
+them to a "housekeeping" CPU dedicated to such work.
+
+
+REFERENCES
+
+o Documentation/IRQ-affinity.txt: Binding interrupts to sets of CPUs.
+
+o Documentation/cgroups: Using cgroups to bind tasks to sets of CPUs.
+
+o man taskset: Using the taskset command to bind tasks to sets
+ of CPUs.
+
+o man sched_setaffinity: Using the sched_setaffinity() system
+ call to bind tasks to sets of CPUs.
+
+o /sys/devices/system/cpu/cpuN/online: Control CPU N's hotplug state,
+ writing "0" to offline and "1" to online.
+
+o In order to locate kernel-generated OS jitter on CPU N:
+
+ cd /sys/kernel/debug/tracing
+ echo 1 > max_graph_depth # Increase the "1" for more detail
+ echo function_graph > current_tracer
+ # run workload
+ cat per_cpu/cpuN/trace
+
+
+KTHREADS
+
+Name: ehca_comp/%u
+Purpose: Periodically process Infiniband-related work.
+To reduce its OS jitter, do any of the following:
+1. Don't use eHCA Infiniband hardware, instead choosing hardware
+ that does not require per-CPU kthreads. This will prevent these
+ kthreads from being created in the first place. (This will
+ work for most people, as this hardware, though important, is
+ relatively old and is produced in relatively low unit volumes.)
+2. Do all eHCA-Infiniband-related work on other CPUs, including
+ interrupts.
+3. Rework the eHCA driver so that its per-CPU kthreads are
+ provisioned only on selected CPUs.
+
+
+Name: irq/%d-%s
+Purpose: Handle threaded interrupts.
+To reduce its OS jitter, do the following:
+1. Use irq affinity to force the irq threads to execute on
+ some other CPU.
+
+Name: kcmtpd_ctr_%d
+Purpose: Handle Bluetooth work.
+To reduce its OS jitter, do one of the following:
+1. Don't use Bluetooth, in which case these kthreads won't be
+ created in the first place.
+2. Use irq affinity to force Bluetooth-related interrupts to
+ occur on some other CPU and furthermore initiate all
+ Bluetooth activity on some other CPU.
+
+Name: ksoftirqd/%u
+Purpose: Execute softirq handlers when threaded or when under heavy load.
+To reduce its OS jitter, each softirq vector must be handled
+separately as follows:
+TIMER_SOFTIRQ: Do all of the following:
+1. To the extent possible, keep the CPU out of the kernel when it
+ is non-idle, for example, by avoiding system calls and by forcing
+ both kernel threads and interrupts to execute elsewhere.
+2. Build with CONFIG_HOTPLUG_CPU=y. After boot completes, force
+ the CPU offline, then bring it back online. This forces
+ recurring timers to migrate elsewhere. If you are concerned
+ with multiple CPUs, force them all offline before bringing the
+ first one back online. Once you have onlined the CPUs in question,
+ do not offline any other CPUs, because doing so could force the
+ timer back onto one of the CPUs in question.
+NET_TX_SOFTIRQ and NET_RX_SOFTIRQ: Do all of the following:
+1. Force networking interrupts onto other CPUs.
+2. Initiate any network I/O on other CPUs.
+3. Once your application has started, prevent CPU-hotplug operations
+ from being initiated from tasks that might run on the CPU to
+ be de-jittered. (It is OK to force this CPU offline and then
+ bring it back online before you start your application.)
+BLOCK_SOFTIRQ: Do all of the following:
+1. Force block-device interrupts onto some other CPU.
+2. Initiate any block I/O on other CPUs.
+3. Once your application has started, prevent CPU-hotplug operations
+ from being initiated from tasks that might run on the CPU to
+ be de-jittered. (It is OK to force this CPU offline and then
+ bring it back online before you start your application.)
+BLOCK_IOPOLL_SOFTIRQ: Do all of the following:
+1. Force block-device interrupts onto some other CPU.
+2. Initiate any block I/O and block-I/O polling on other CPUs.
+3. Once your application has started, prevent CPU-hotplug operations
+ from being initiated from tasks that might run on the CPU to
+ be de-jittered. (It is OK to force this CPU offline and then
+ bring it back online before you start your application.)
+TASKLET_SOFTIRQ: Do one or more of the following:
+1. Avoid use of drivers that use tasklets. (Such drivers will contain
+ calls to things like tasklet_schedule().)
+2. Convert all drivers that you must use from tasklets to workqueues.
+3. Force interrupts for drivers using tasklets onto other CPUs,
+ and also do I/O involving these drivers on other CPUs.
+SCHED_SOFTIRQ: Do all of the following:
+1. Avoid sending scheduler IPIs to the CPU to be de-jittered,
+ for example, ensure that at most one runnable kthread is present
+ on that CPU. If a thread that expects to run on the de-jittered
+ CPU awakens, the scheduler will send an IPI that can result in
+ a subsequent SCHED_SOFTIRQ.
+2. Build with CONFIG_RCU_NOCB_CPU=y, CONFIG_RCU_NOCB_CPU_ALL=y,
+ CONFIG_NO_HZ_FULL=y, and, in addition, ensure that the CPU
+ to be de-jittered is marked as an adaptive-ticks CPU using the
+ "nohz_full=" boot parameter. This reduces the number of
+ scheduler-clock interrupts that the de-jittered CPU receives,
+ minimizing its chances of being selected to do the load balancing
+ work that runs in SCHED_SOFTIRQ context.
+3. To the extent possible, keep the CPU out of the kernel when it
+ is non-idle, for example, by avoiding system calls and by
+ forcing both kernel threads and interrupts to execute elsewhere.
+ This further reduces the number of scheduler-clock interrupts
+ received by the de-jittered CPU.
+HRTIMER_SOFTIRQ: Do all of the following:
+1. To the extent possible, keep the CPU out of the kernel when it
+ is non-idle. For example, avoid system calls and force both
+ kernel threads and interrupts to execute elsewhere.
+2. Build with CONFIG_HOTPLUG_CPU=y. Once boot completes, force the
+ CPU offline, then bring it back online. This forces recurring
+ timers to migrate elsewhere. If you are concerned with multiple
+ CPUs, force them all offline before bringing the first one
+ back online. Once you have onlined the CPUs in question, do not
+ offline any other CPUs, because doing so could force the timer
+ back onto one of the CPUs in question.
+RCU_SOFTIRQ: Do at least one of the following:
+1. Offload callbacks and keep the CPU in either dyntick-idle or
+ adaptive-ticks state by doing all of the following:
+ a. Build with CONFIG_RCU_NOCB_CPU=y, CONFIG_RCU_NOCB_CPU_ALL=y,
+ CONFIG_NO_HZ_FULL=y, and, in addition ensure that the CPU
+ to be de-jittered is marked as an adaptive-ticks CPU using
+ the "nohz_full=" boot parameter. Bind the rcuo kthreads
+ to housekeeping CPUs, which can tolerate OS jitter.
+ b. To the extent possible, keep the CPU out of the kernel
+ when it is non-idle, for example, by avoiding system
+ calls and by forcing both kernel threads and interrupts
+ to execute elsewhere.
+2. Enable RCU to do its processing remotely via dyntick-idle by
+ doing all of the following:
+ a. Build with CONFIG_NO_HZ=y and CONFIG_RCU_FAST_NO_HZ=y.
+ b. Ensure that the CPU goes idle frequently, allowing other
+ CPUs to detect that it has passed through an RCU quiescent
+ state. If the kernel is built with CONFIG_NO_HZ_FULL=y,
+ userspace execution also allows other CPUs to detect that
+ the CPU in question has passed through a quiescent state.
+ c. To the extent possible, keep the CPU out of the kernel
+ when it is non-idle, for example, by avoiding system
+ calls and by forcing both kernel threads and interrupts
+ to execute elsewhere.
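
The offline-then-online procedure recommended above for TIMER_SOFTIRQ and HRTIMER_SOFTIRQ can be sketched as follows. This is a hedged Python illustration, not a supported tool: the function name is hypothetical, it assumes the sysfs "online" files listed under REFERENCES, and on a real system it needs root privileges and a CONFIG_HOTPLUG_CPU=y kernel. The sysfs root is a parameter so that the logic can be tried against a scratch directory first.

```python
from pathlib import Path


def cycle_cpus(cpus, sysfs_root="/sys/devices/system/cpu"):
    """Offline every CPU in `cpus`, then bring them all back online.

    Returns the (cpu, value) writes in the order they were performed.
    """
    root = Path(sysfs_root)
    log = []
    # First pass: take all CPUs of interest offline, so that recurring
    # timers cannot migrate from one of them to another.
    for cpu in cpus:
        (root / f"cpu{cpu}" / "online").write_text("0")
        log.append((cpu, "0"))
    # Second pass: only after all of them are offline, online them again.
    for cpu in cpus:
        (root / f"cpu{cpu}" / "online").write_text("1")
        log.append((cpu, "1"))
    return log
```

Doing this in two passes, rather than cycling one CPU at a time, follows the text's advice to force all CPUs of concern offline before bringing the first one back online.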
+
+Name: rcuc/%u
+Purpose: Execute RCU callbacks in CONFIG_RCU_BOOST=y kernels.
+To reduce its OS jitter, do at least one of the following:
+1. Build the kernel with CONFIG_PREEMPT=n. This prevents these
+ kthreads from being created in the first place, and also obviates
+ the need for RCU priority boosting. This approach is feasible
+ for workloads that do not require high degrees of responsiveness.
+2. Build the kernel with CONFIG_RCU_BOOST=n. This prevents these
+ kthreads from being created in the first place. This approach
+ is feasible only if your workload never requires RCU priority
+ boosting, for example, if you ensure frequent idle time on all
+ CPUs that might execute within the kernel.
+3. Build with CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y,
+ which offloads all RCU callbacks to kthreads that can be moved
+ off of CPUs susceptible to OS jitter. This approach prevents the
+ rcuc/%u kthreads from having any work to do, so that they are
+ never awakened.
+4. Ensure that the CPU never enters the kernel, and, in particular,
+ avoid initiating any CPU hotplug operations on this CPU. This is
+ another way of preventing any callbacks from being queued on the
+ CPU, again preventing the rcuc/%u kthreads from having any work
+ to do.
+
+Name: rcuob/%d, rcuop/%d, and rcuos/%d
+Purpose: Offload RCU callbacks from the corresponding CPU.
+To reduce its OS jitter, do at least one of the following:
+1. Use affinity, cgroups, or other mechanism to force these kthreads
+ to execute on some other CPU.
+2. Build with CONFIG_RCU_NOCB_CPU=n, which will prevent these
+ kthreads from being created in the first place. However, please
+ note that this will not eliminate OS jitter, but will instead
+ shift it to RCU_SOFTIRQ.
+
+Name: watchdog/%u
+Purpose: Detect software lockups on each CPU.
+To reduce its OS jitter, do at least one of the following:
+1. Build with CONFIG_LOCKUP_DETECTOR=n, which will prevent these
+ kthreads from being created in the first place.
+2. Echo a zero to /proc/sys/kernel/watchdog to disable the
+ watchdog timer.
+3. Echo a large number to /proc/sys/kernel/watchdog_thresh in
+ order to reduce the frequency of OS jitter due to the watchdog
+ timer down to a level that is acceptable for your workload.
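The two runtime options for the watchdog/%u kthreads (options 2 and 3 above) can be sketched as a pair of dry-run shell helpers. This is only an illustration: the function names are hypothetical, the value 60 is an arbitrary example threshold in seconds, and the range a given kernel accepts for watchdog_thresh may be limited. The helpers print the commands rather than running them, since the real writes need root and change system behavior.

```shell
# Dry run for option 2: print the command that disables the watchdog.
# (Hypothetical helper name; nothing is written to /proc here.)
watchdog_disable_cmd() {
    echo "echo 0 > /proc/sys/kernel/watchdog"
}

# Dry run for option 3: print the command that raises the threshold.
# $1 is the threshold in seconds; choose a value suited to your workload.
watchdog_thresh_cmd() {
    echo "echo $1 > /proc/sys/kernel/watchdog_thresh"
}

watchdog_disable_cmd
watchdog_thresh_cmd 60
```

To apply a printed command, review it and then run it as root, for example by piping the helper's output to sh. Only one of the two options is needed.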
diff --git a/Documentation/m68k/kernel-options.txt b/Documentation/m68k/kernel-options.txt
index 97d45f276fe6..eaf32a1fd0b1 100644
--- a/Documentation/m68k/kernel-options.txt
+++ b/Documentation/m68k/kernel-options.txt
@@ -80,8 +80,6 @@ Valid names are:
/dev/sdd: -> 0x0830 (fourth SCSI disk)
/dev/sde: -> 0x0840 (fifth SCSI disk)
/dev/fd : -> 0x0200 (floppy disk)
- /dev/xda: -> 0x0c00 (first XT disk, unused in Linux/m68k)
- /dev/xdb: -> 0x0c40 (second XT disk, unused in Linux/m68k)
The name must be followed by a decimal number, that stands for the
partition number. Internally, the value of the number is just
diff --git a/Documentation/networking/.gitignore b/Documentation/networking/.gitignore
index 286a5680f490..e69de29bb2d1 100644
--- a/Documentation/networking/.gitignore
+++ b/Documentation/networking/.gitignore
@@ -1 +0,0 @@
-ifenslave
diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX
index 258d9b92c36f..32dfbd924121 100644
--- a/Documentation/networking/00-INDEX
+++ b/Documentation/networking/00-INDEX
@@ -88,8 +88,6 @@ gianfar.txt
- Gianfar Ethernet Driver.
ieee802154.txt
- Linux IEEE 802.15.4 implementation, API and drivers
-ifenslave.c
- - Configure network interfaces for parallel routing (bonding).
igb.txt
- README for the Intel Gigabit Ethernet Driver (igb).
igbvf.txt
diff --git a/Documentation/networking/Makefile b/Documentation/networking/Makefile
index 24c308dd3fd1..0aa1ac98fc2b 100644
--- a/Documentation/networking/Makefile
+++ b/Documentation/networking/Makefile
@@ -1,11 +1,6 @@
# kbuild trick to avoid linker error. Can be omitted if a module is built.
obj- := dummy.o
-# List of programs to build
-hostprogs-y := ifenslave
-
-HOSTCFLAGS_ifenslave.o += -I$(objtree)/usr/include
-
# Tell kbuild to always build the programs
always := $(hostprogs-y)
diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index 10a015c384b8..e7454fcc9176 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -104,8 +104,7 @@ Table of Contents
==============================
Most popular distro kernels ship with the bonding driver
-already available as a module and the ifenslave user level control
-program installed and ready for use. If your distro does not, or you
+already available as a module. If your distro does not, or you
have need to compile bonding from source (e.g., configuring and
installing a mainline kernel from kernel.org), you'll need to perform
the following steps:
@@ -124,46 +123,13 @@ device support" section. It is recommended that you configure the
driver as module since it is currently the only way to pass parameters
to the driver or configure more than one bonding device.
- Build and install the new kernel and modules, then continue
-below to install ifenslave.
+ Build and install the new kernel and modules.
-1.2 Install ifenslave Control Utility
+1.2 Bonding Control Utility
-------------------------------------
- The ifenslave user level control program is included in the
-kernel source tree, in the file Documentation/networking/ifenslave.c.
-It is generally recommended that you use the ifenslave that
-corresponds to the kernel that you are using (either from the same
-source tree or supplied with the distro), however, ifenslave
-executables from older kernels should function (but features newer
-than the ifenslave release are not supported). Running an ifenslave
-that is newer than the kernel is not supported, and may or may not
-work.
-
- To install ifenslave, do the following:
-
-# gcc -Wall -O -I/usr/src/linux/include ifenslave.c -o ifenslave
-# cp ifenslave /sbin/ifenslave
-
- If your kernel source is not in "/usr/src/linux," then replace
-"/usr/src/linux/include" in the above with the location of your kernel
-source include directory.
-
- You may wish to back up any existing /sbin/ifenslave, or, for
-testing or informal use, tag the ifenslave to the kernel version
-(e.g., name the ifenslave executable /sbin/ifenslave-2.6.10).
-
-IMPORTANT NOTE:
-
- If you omit the "-I" or specify an incorrect directory, you
-may end up with an ifenslave that is incompatible with the kernel
-you're trying to build it for. Some distros (e.g., Red Hat from 7.1
-onwards) do not have /usr/include/linux symbolically linked to the
-default kernel source include directory.
-
-SECOND IMPORTANT NOTE:
- If you plan to configure bonding using sysfs or using the
-/etc/network/interfaces file, you do not need to use ifenslave.
+ It is recommended to configure bonding via iproute2 (netlink)
+or sysfs; the old ifenslave control utility is obsolete.
2. Bonding Driver Options
=========================
@@ -851,7 +817,7 @@ resend_igmp
==============================
You can configure bonding using either your distro's network
-initialization scripts, or manually using either ifenslave or the
+initialization scripts, or manually using either iproute2 or the
sysfs interface. Distros generally use one of three packages for the
network initialization scripts: initscripts, sysconfig or interfaces.
Recent versions of these packages have support for bonding, while older
@@ -1160,7 +1126,7 @@ not support this method for specifying multiple bonding interfaces; for
those instances, see the "Configuring Multiple Bonds Manually" section,
below.
-3.3 Configuring Bonding Manually with Ifenslave
+3.3 Configuring Bonding Manually with iproute2
-----------------------------------------------
This section applies to distros whose network initialization
@@ -1171,7 +1137,7 @@ version 8.
The general method for these systems is to place the bonding
module parameters into a config file in /etc/modprobe.d/ (as
appropriate for the installed distro), then add modprobe and/or
-ifenslave commands to the system's global init script. The name of
+`ip link` commands to the system's global init script. The name of
the global init script differs; for sysconfig, it is
/etc/init.d/boot.local and for initscripts it is /etc/rc.d/rc.local.
@@ -1183,8 +1149,8 @@ reboots, edit the appropriate file (/etc/init.d/boot.local or
modprobe bonding mode=balance-alb miimon=100
modprobe e100
ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up
-ifenslave bond0 eth0
-ifenslave bond0 eth1
+ip link set eth0 master bond0
+ip link set eth1 master bond0
Replace the example bonding module parameters and bond0
network configuration (IP address, netmask, etc) with the appropriate
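The move from ifenslave to iproute2 in the bonding.txt hunks above can be sketched as dry-run shell helpers that print the `ip link` commands for attaching and releasing slaves. The function names are hypothetical, and bond0, eth0 and eth1 are just the example interface names from the documentation. `ip link set <dev> master <bond>` takes the place of `ifenslave <bond> <dev>`, and `ip link set <dev> nomaster` takes the place of `ifenslave -d <bond> <dev>`.

```shell
# Dry run: print one "ip link set ... master" command per slave.
# $1 is the bond device; the remaining arguments are slave interfaces.
bond_attach_cmds() {
    bond="$1"
    shift
    for slave in "$@"; do
        echo "ip link set ${slave} master ${bond}"
    done
}

# Dry run: print one "ip link set ... nomaster" command per slave,
# which detaches the interface from whatever master it currently has.
bond_release_cmds() {
    for slave in "$@"; do
        echo "ip link set ${slave} nomaster"
    done
}

bond_attach_cmds bond0 eth0 eth1
bond_release_cmds eth0
```

The helpers only print. To apply the commands, review the output and run it as root on a system where the bonding module is loaded and the bond device exists.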
diff --git a/Documentation/networking/ifenslave.c b/Documentation/networking/ifenslave.c
deleted file mode 100644
index ac5debb2f16c..000000000000
--- a/Documentation/networking/ifenslave.c
+++ /dev/null
@@ -1,1105 +0,0 @@
-/* Mode: C;
- * ifenslave.c: Configure network interfaces for parallel routing.
- *
- * This program controls the Linux implementation of running multiple
- * network interfaces in parallel.
- *
- * Author: Donald Becker <becker@cesdis.gsfc.nasa.gov>
- * Copyright 1994-1996 Donald Becker
- *
- * This program is free software; you can redistribute it
- * and/or modify it under the terms of the GNU General Public
- * License as published by the Free Software Foundation.
- *
- * The author may be reached as becker@CESDIS.gsfc.nasa.gov, or C/O
- * Center of Excellence in Space Data and Information Sciences
- * Code 930.5, Goddard Space Flight Center, Greenbelt MD 20771
- *
- * Changes :
- * - 2000/10/02 Willy Tarreau <willy at meta-x.org> :
- * - few fixes. Master's MAC address is now correctly taken from
- * the first device when not previously set ;
- * - detach support : call BOND_RELEASE to detach an enslaved interface.
- * - give a mini-howto from command-line help : # ifenslave -h
- *
- * - 2001/02/16 Chad N. Tindel <ctindel at ieee dot org> :
- * - Master is now brought down before setting the MAC address. In
- * the 2.4 kernel you can't change the MAC address while the device is
- * up because you get EBUSY.
- *
- * - 2001/09/13 Takao Indoh <indou dot takao at jp dot fujitsu dot com>
- * - Added the ability to change the active interface on a mode 1 bond
- * at runtime.
- *
- * - 2001/10/23 Chad N. Tindel <ctindel at ieee dot org> :
- * - No longer set the MAC address of the master. The bond device will
- * take care of this itself
- * - Try the SIOC*** versions of the bonding ioctls before using the
- * old versions
- * - 2002/02/18 Erik Habbinga <erik_habbinga @ hp dot com> :
- * - ifr2.ifr_flags was not initialized in the hwaddr_notset case,
- * SIOCGIFFLAGS now called before hwaddr_notset test
- *
- * - 2002/10/31 Tony Cureington <tony.cureington * hp_com> :
- * - If the master does not have a hardware address when the first slave
- * is enslaved, the master is assigned the hardware address of that
- * slave - there is a comment in bonding.c stating "ifenslave takes
- * care of this now." This corrects the problem of slaves having
- * different hardware addresses in active-backup mode when
- * multiple interfaces are specified on a single ifenslave command
- * (ifenslave bond0 eth0 eth1).
- *
- * - 2003/03/18 - Tsippy Mendelson <tsippy.mendelson at intel dot com> and
- * Shmulik Hen <shmulik.hen at intel dot com>
- * - Moved setting the slave's mac address and openning it, from
- * the application to the driver. This enables support of modes
- * that need to use the unique mac address of each slave.
- * The driver also takes care of closing the slave and restoring its
- * original mac address upon release.
- * In addition, block possibility of enslaving before the master is up.
- * This prevents putting the system in an undefined state.
- *
- * - 2003/05/01 - Amir Noam <amir.noam at intel dot com>
- * - Added ABI version control to restore compatibility between
- * new/old ifenslave and new/old bonding.
- * - Prevent adding an adapter that is already a slave.
- * Fixes the problem of stalling the transmission and leaving
- * the slave in a down state.
- *
- * - 2003/05/01 - Shmulik Hen <shmulik.hen at intel dot com>
- * - Prevent enslaving if the bond device is down.
- * Fixes the problem of leaving the system in unstable state and
- * halting when trying to remove the module.
- * - Close socket on all abnormal exists.
- * - Add versioning scheme that follows that of the bonding driver.
- * current version is 1.0.0 as a base line.
- *
- * - 2003/05/22 - Jay Vosburgh <fubar at us dot ibm dot com>
- * - ifenslave -c was broken; it's now fixed
- * - Fixed problem with routes vanishing from master during enslave
- * processing.
- *
- * - 2003/05/27 - Amir Noam <amir.noam at intel dot com>
- * - Fix backward compatibility issues:
- * For drivers not using ABI versions, slave was set down while
- * it should be left up before enslaving.
- * Also, master was not set down and the default set_mac_address()
- * would fail and generate an error message in the system log.
- * - For opt_c: slave should not be set to the master's setting
- * while it is running. It was already set during enslave. To
- * simplify things, it is now handled separately.
- *
- * - 2003/12/01 - Shmulik Hen <shmulik.hen at intel dot com>
- * - Code cleanup and style changes
- * set version to 1.1.0
- */
-
-#define APP_VERSION "1.1.0"
-#define APP_RELDATE "December 1, 2003"
-#define APP_NAME "ifenslave"
-
-static char *version =
-APP_NAME ".c:v" APP_VERSION " (" APP_RELDATE ")\n"
-"o Donald Becker (becker@cesdis.gsfc.nasa.gov).\n"
-"o Detach support added on 2000/10/02 by Willy Tarreau (willy at meta-x.org).\n"
-"o 2.4 kernel support added on 2001/02/16 by Chad N. Tindel\n"
-" (ctindel at ieee dot org).\n";
-
-static const char *usage_msg =
-"Usage: ifenslave [-f] <master-if> <slave-if> [<slave-if>...]\n"
-" ifenslave -d <master-if> <slave-if> [<slave-if>...]\n"
-" ifenslave -c <master-if> <slave-if>\n"
-" ifenslave --help\n";
-
-static const char *help_msg =
-"\n"
-" To create a bond device, simply follow these three steps :\n"
-" - ensure that the required drivers are properly loaded :\n"
-" # modprobe bonding ; modprobe <3c59x|eepro100|pcnet32|tulip|...>\n"
-" - assign an IP address to the bond device :\n"
-" # ifconfig bond0 <addr> netmask <mask> broadcast <bcast>\n"
-" - attach all the interfaces you need to the bond device :\n"
-" # ifenslave [{-f|--force}] bond0 eth0 [eth1 [eth2]...]\n"
-" If bond0 didn't have a MAC address, it will take eth0's. Then, all\n"
-" interfaces attached AFTER this assignment will get the same MAC addr.\n"
-" (except for ALB/TLB modes)\n"
-"\n"
-" To set the bond device down and automatically release all the slaves :\n"
-" # ifconfig bond0 down\n"
-"\n"
-" To detach a dead interface without setting the bond device down :\n"
-" # ifenslave {-d|--detach} bond0 eth0 [eth1 [eth2]...]\n"
-"\n"
-" To change active slave :\n"
-" # ifenslave {-c|--change-active} bond0 eth0\n"
-"\n"
-" To show master interface info\n"
-" # ifenslave bond0\n"
-"\n"
-" To show all interfaces info\n"
-" # ifenslave {-a|--all-interfaces}\n"
-"\n"
-" To be more verbose\n"
-" # ifenslave {-v|--verbose} ...\n"
-"\n"
-" # ifenslave {-u|--usage} Show usage\n"
-" # ifenslave {-V|--version} Show version\n"
-" # ifenslave {-h|--help} This message\n"
-"\n";
-
-#include <unistd.h>
-#include <stdlib.h>
-#include <stdio.h>
-#include <ctype.h>
-#include <string.h>
-#include <errno.h>
-#include <fcntl.h>
-#include <getopt.h>
-#include <sys/types.h>
-#include <sys/socket.h>
-#include <sys/ioctl.h>
-#include <linux/if.h>
-#include <net/if_arp.h>
-#include <linux/if_ether.h>
-#include <linux/if_bonding.h>
-#include <linux/sockios.h>
-
-typedef unsigned long long u64; /* hack, so we may include kernel's ethtool.h */
-typedef __uint32_t u32; /* ditto */
-typedef __uint16_t u16; /* ditto */
-typedef __uint8_t u8; /* ditto */
-#include <linux/ethtool.h>
-
-struct option longopts[] = {
- /* { name has_arg *flag val } */
- {"all-interfaces", 0, 0, 'a'}, /* Show all interfaces. */
- {"change-active", 0, 0, 'c'}, /* Change the active slave. */
- {"detach", 0, 0, 'd'}, /* Detach a slave interface. */
- {"force", 0, 0, 'f'}, /* Force the operation. */
- {"help", 0, 0, 'h'}, /* Give help */
- {"usage", 0, 0, 'u'}, /* Give usage */
- {"verbose", 0, 0, 'v'}, /* Report each action taken. */
- {"version", 0, 0, 'V'}, /* Emit version information. */
- { 0, 0, 0, 0}
-};
-
-/* Command-line flags. */
-unsigned int
-opt_a = 0, /* Show-all-interfaces flag. */
-opt_c = 0, /* Change-active-slave flag. */
-opt_d = 0, /* Detach a slave interface. */
-opt_f = 0, /* Force the operation. */
-opt_h = 0, /* Help */
-opt_u = 0, /* Usage */
-opt_v = 0, /* Verbose flag. */
-opt_V = 0; /* Version */
-
-int skfd = -1; /* AF_INET socket for ioctl() calls.*/
-int abi_ver = 0; /* userland - kernel ABI version */
-int hwaddr_set = 0; /* Master's hwaddr is set */
-int saved_errno;
-
-struct ifreq master_mtu, master_flags, master_hwaddr;
-struct ifreq slave_mtu, slave_flags, slave_hwaddr;
-
-struct dev_ifr {
- struct ifreq *req_ifr;
- char *req_name;
- int req_type;
-};
-
-struct dev_ifr master_ifra[] = {
- {&master_mtu, "SIOCGIFMTU", SIOCGIFMTU},
- {&master_flags, "SIOCGIFFLAGS", SIOCGIFFLAGS},
- {&master_hwaddr, "SIOCGIFHWADDR", SIOCGIFHWADDR},
- {NULL, "", 0}
-};
-
-struct dev_ifr slave_ifra[] = {
- {&slave_mtu, "SIOCGIFMTU", SIOCGIFMTU},
- {&slave_flags, "SIOCGIFFLAGS", SIOCGIFFLAGS},
- {&slave_hwaddr, "SIOCGIFHWADDR", SIOCGIFHWADDR},
- {NULL, "", 0}
-};
-
-static void if_print(char *ifname);
-static int get_drv_info(char *master_ifname);
-static int get_if_settings(char *ifname, struct dev_ifr ifra[]);
-static int get_slave_flags(char *slave_ifname);
-static int set_master_hwaddr(char *master_ifname, struct sockaddr *hwaddr);
-static int set_slave_hwaddr(char *slave_ifname, struct sockaddr *hwaddr);
-static int set_slave_mtu(char *slave_ifname, int mtu);
-static int set_if_flags(char *ifname, short flags);
-static int set_if_up(char *ifname, short flags);
-static int set_if_down(char *ifname, short flags);
-static int clear_if_addr(char *ifname);
-static int set_if_addr(char *master_ifname, char *slave_ifname);
-static int change_active(char *master_ifname, char *slave_ifname);
-static int enslave(char *master_ifname, char *slave_ifname);
-static int release(char *master_ifname, char *slave_ifname);
-#define v_print(fmt, args...) \
- if (opt_v) \
- fprintf(stderr, fmt, ## args )
-
-int main(int argc, char *argv[])
-{
- char **spp, *master_ifname, *slave_ifname;
- int c, i, rv;
- int res = 0;
- int exclusive = 0;
-
- while ((c = getopt_long(argc, argv, "acdfhuvV", longopts, 0)) != EOF) {
- switch (c) {
- case 'a': opt_a++; exclusive++; break;
- case 'c': opt_c++; exclusive++; break;
- case 'd': opt_d++; exclusive++; break;
- case 'f': opt_f++; exclusive++; break;
- case 'h': opt_h++; exclusive++; break;
- case 'u': opt_u++; exclusive++; break;
- case 'v': opt_v++; break;
- case 'V': opt_V++; exclusive++; break;
-
- case '?':
- fprintf(stderr, "%s", usage_msg);
- res = 2;
- goto out;
- }
- }
-
- /* options check */
- if (exclusive > 1) {
- fprintf(stderr, "%s", usage_msg);
- res = 2;
- goto out;
- }
-
- if (opt_v || opt_V) {
- printf("%s", version);
- if (opt_V) {
- res = 0;
- goto out;
- }
- }
-
- if (opt_u) {
- printf("%s", usage_msg);
- res = 0;
- goto out;
- }
-
- if (opt_h) {
- printf("%s", usage_msg);
- printf("%s", help_msg);
- res = 0;
- goto out;
- }
-
- /* Open a basic socket */
- if ((skfd = socket(AF_INET, SOCK_DGRAM, 0)) < 0) {
- perror("socket");
- res = 1;
- goto out;
- }
-
- if (opt_a) {
- if (optind == argc) {
- /* No remaining args */
- /* show all interfaces */
- if_print((char *)NULL);
- goto out;
- } else {
- /* Just show usage */
- fprintf(stderr, "%s", usage_msg);
- res = 2;
- goto out;
- }
- }
-
- /* Copy the interface name */
- spp = argv + optind;
- master_ifname = *spp++;
-
- if (master_ifname == NULL) {
- fprintf(stderr, "%s", usage_msg);
- res = 2;
- goto out;
- }
-
- /* exchange abi version with bonding module */
- res = get_drv_info(master_ifname);
- if (res) {
- fprintf(stderr,
- "Master '%s': Error: handshake with driver failed. "
- "Aborting\n",
- master_ifname);
- goto out;
- }
-
- slave_ifname = *spp++;
-
- if (slave_ifname == NULL) {
- if (opt_d || opt_c) {
- fprintf(stderr, "%s", usage_msg);
- res = 2;
- goto out;
- }
-
- /* A single arg means show the
- * configuration for this interface
- */
- if_print(master_ifname);
- goto out;
- }
-
- res = get_if_settings(master_ifname, master_ifra);
- if (res) {
- /* Probably a good reason not to go on */
- fprintf(stderr,
- "Master '%s': Error: get settings failed: %s. "
- "Aborting\n",
- master_ifname, strerror(res));
- goto out;
- }
-
- /* check if master is indeed a master;
- * if not then fail any operation
- */
- if (!(master_flags.ifr_flags & IFF_MASTER)) {
- fprintf(stderr,
- "Illegal operation; the specified interface '%s' "
- "is not a master. Aborting\n",
- master_ifname);
- res = 1;
- goto out;
- }
-
- /* check if master is up; if not then fail any operation */
- if (!(master_flags.ifr_flags & IFF_UP)) {
- fprintf(stderr,
- "Illegal operation; the specified master interface "
- "'%s' is not up.\n",
- master_ifname);
- res = 1;
- goto out;
- }
-
- /* Only for enslaving */
- if (!opt_c && !opt_d) {
- sa_family_t master_family = master_hwaddr.ifr_hwaddr.sa_family;
- unsigned char *hwaddr =
- (unsigned char *)master_hwaddr.ifr_hwaddr.sa_data;
-
- /* The family '1' is ARPHRD_ETHER for ethernet. */
- if (master_family != 1 && !opt_f) {
- fprintf(stderr,
- "Illegal operation: The specified master "
- "interface '%s' is not ethernet-like.\n "
- "This program is designed to work with "
- "ethernet-like network interfaces.\n "
- "Use the '-f' option to force the "
- "operation.\n",
- master_ifname);
- res = 1;
- goto out;
- }
-
- /* Check master's hw addr */
- for (i = 0; i < 6; i++) {
- if (hwaddr[i] != 0) {
- hwaddr_set = 1;
- break;
- }
- }
-
- if (hwaddr_set) {
- v_print("current hardware address of master '%s' "
- "is %2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x, "
- "type %d\n",
- master_ifname,
- hwaddr[0], hwaddr[1],
- hwaddr[2], hwaddr[3],
- hwaddr[4], hwaddr[5],
- master_family);
- }
- }
-
- /* Accepts only one slave */
- if (opt_c) {
- /* change active slave */
- res = get_slave_flags(slave_ifname);
- if (res) {
- fprintf(stderr,
- "Slave '%s': Error: get flags failed. "
- "Aborting\n",
- slave_ifname);
- goto out;
- }
- res = change_active(master_ifname, slave_ifname);
- if (res) {
- fprintf(stderr,
- "Master '%s', Slave '%s': Error: "
- "Change active failed\n",
- master_ifname, slave_ifname);
- }
- } else {
- /* Accept multiple slaves */
- do {
- if (opt_d) {
- /* detach a slave interface from the master */
- rv = get_slave_flags(slave_ifname);
- if (rv) {
- /* Can't work with this slave. */
- /* remember the error and skip it*/
- fprintf(stderr,
- "Slave '%s': Error: get flags "
- "failed. Skipping\n",
- slave_ifname);
- res = rv;
- continue;
- }
- rv = release(master_ifname, slave_ifname);
- if (rv) {
- fprintf(stderr,
- "Master '%s', Slave '%s': Error: "
- "Release failed\n",
- master_ifname, slave_ifname);
- res = rv;
- }
- } else {
- /* attach a slave interface to the master */
- rv = get_if_settings(slave_ifname, slave_ifra);
- if (rv) {
- /* Can't work with this slave. */
- /* remember the error and skip it*/
- fprintf(stderr,
- "Slave '%s': Error: get "
- "settings failed: %s. "
- "Skipping\n",
- slave_ifname, strerror(rv));
- res = rv;
- continue;
- }
- rv = enslave(master_ifname, slave_ifname);
- if (rv) {
- fprintf(stderr,
- "Master '%s', Slave '%s': Error: "
- "Enslave failed\n",
- master_ifname, slave_ifname);
- res = rv;
- }
- }
- } while ((slave_ifname = *spp++) != NULL);
- }
-
-out:
- if (skfd >= 0) {
- close(skfd);
- }
-
- return res;
-}
-
-static short mif_flags;
-
-/* Get the inteface configuration from the kernel. */
-static int if_getconfig(char *ifname)
-{
- struct ifreq ifr;
- int metric, mtu; /* Parameters of the master interface. */
- struct sockaddr dstaddr, broadaddr, netmask;
- unsigned char *hwaddr;
-
- strcpy(ifr.ifr_name, ifname);
- if (ioctl(skfd, SIOCGIFFLAGS, &ifr) < 0)
- return -1;
- mif_flags = ifr.ifr_flags;
- printf("The result of SIOCGIFFLAGS on %s is %x.\n",
- ifname, ifr.ifr_flags);
-
- strcpy(ifr.ifr_name, ifname);
- if (ioctl(skfd, SIOCGIFADDR, &ifr) < 0)
- return -1;
- printf("The result of SIOCGIFADDR is %2.2x.%2.2x.%2.2x.%2.2x.\n",
- ifr.ifr_addr.sa_data[0], ifr.ifr_addr.sa_data[1],
- ifr.ifr_addr.sa_data[2], ifr.ifr_addr.sa_data[3]);
-
- strcpy(ifr.ifr_name, ifname);
- if (ioctl(skfd, SIOCGIFHWADDR, &ifr) < 0)
- return -1;
-
- /* Gotta convert from 'char' to unsigned for printf(). */
- hwaddr = (unsigned char *)ifr.ifr_hwaddr.sa_data;
- printf("The result of SIOCGIFHWADDR is type %d "
- "%2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x.\n",
- ifr.ifr_hwaddr.sa_family, hwaddr[0], hwaddr[1],
- hwaddr[2], hwaddr[3], hwaddr[4], hwaddr[5]);
-
- strcpy(ifr.ifr_name, ifname);
- if (ioctl(skfd, SIOCGIFMETRIC, &ifr) < 0) {
- metric = 0;
- } else
- metric = ifr.ifr_metric;
- printf("The result of SIOCGIFMETRIC is %d\n", metric);
-
- strcpy(ifr.ifr_name, ifname);
- if (ioctl(skfd, SIOCGIFMTU, &ifr) < 0)
- mtu = 0;
- else
- mtu = ifr.ifr_mtu;
- printf("The result of SIOCGIFMTU is %d\n", mtu);
-
- strcpy(ifr.ifr_name, ifname);
- if (ioctl(skfd, SIOCGIFDSTADDR, &ifr) < 0) {
- memset(&dstaddr, 0, sizeof(struct sockaddr));
- } else
- dstaddr = ifr.ifr_dstaddr;
-
- strcpy(ifr.ifr_name, ifname);
- if (ioctl(skfd, SIOCGIFBRDADDR, &ifr) < 0) {
- memset(&broadaddr, 0, sizeof(struct sockaddr));
- } else
- broadaddr = ifr.ifr_broadaddr;
-
- strcpy(ifr.ifr_name, ifname);
- if (ioctl(skfd, SIOCGIFNETMASK, &ifr) < 0) {
- memset(&netmask, 0, sizeof(struct sockaddr));
- } else
- netmask = ifr.ifr_netmask;
-
- return 0;
-}
-
-static void if_print(char *ifname)
-{
- char buff[1024];
- struct ifconf ifc;
- struct ifreq *ifr;
- int i;
-
- if (ifname == (char *)NULL) {
- ifc.ifc_len = sizeof(buff);
- ifc.ifc_buf = buff;
- if (ioctl(skfd, SIOCGIFCONF, &ifc) < 0) {
- perror("SIOCGIFCONF failed");
- return;
- }
-
- ifr = ifc.ifc_req;
- for (i = ifc.ifc_len / sizeof(struct ifreq); --i >= 0; ifr++) {
- if (if_getconfig(ifr->ifr_name) < 0) {
- fprintf(stderr,
- "%s: unknown interface.\n",
- ifr->ifr_name);
- continue;
- }
-
- if (((mif_flags & IFF_UP) == 0) && !opt_a) continue;
- /*ife_print(&ife);*/
- }
- } else {
- if (if_getconfig(ifname) < 0) {
- fprintf(stderr,
- "%s: unknown interface.\n", ifname);
- }
- }
-}
-
-static int get_drv_info(char *master_ifname)
-{
- struct ifreq ifr;
- struct ethtool_drvinfo info;
- char *endptr;
-
- memset(&ifr, 0, sizeof(ifr));
- strncpy(ifr.ifr_name, master_ifname, IFNAMSIZ);
- ifr.ifr_data = (caddr_t)&info;
-
- info.cmd = ETHTOOL_GDRVINFO;
- strncpy(info.driver, "ifenslave", 32);
- snprintf(info.fw_version, 32, "%d", BOND_ABI_VERSION);
-
- if (ioctl(skfd, SIOCETHTOOL, &ifr) < 0) {
- if (errno == EOPNOTSUPP) {
- goto out;
- }
-
- saved_errno = errno;
- v_print("Master '%s': Error: get bonding info failed %s\n",
- master_ifname, strerror(saved_errno));
- return 1;
- }
-
- abi_ver = strtoul(info.fw_version, &endptr, 0);
- if (*endptr) {
- v_print("Master '%s': Error: got invalid string as an ABI "
- "version from the bonding module\n",
- master_ifname);
- return 1;
- }
-
-out:
- v_print("ABI ver is %d\n", abi_ver);
-
- return 0;
-}
-
-static int change_active(char *master_ifname, char *slave_ifname)
-{
- struct ifreq ifr;
- int res = 0;
-
- if (!(slave_flags.ifr_flags & IFF_SLAVE)) {
- fprintf(stderr,
- "Illegal operation: The specified slave interface "
- "'%s' is not a slave\n",
- slave_ifname);
- return 1;
- }
-
- strncpy(ifr.ifr_name, master_ifname, IFNAMSIZ);
- strncpy(ifr.ifr_slave, slave_ifname, IFNAMSIZ);
- if ((ioctl(skfd, SIOCBONDCHANGEACTIVE, &ifr) < 0) &&
- (ioctl(skfd, BOND_CHANGE_ACTIVE_OLD, &ifr) < 0)) {
- saved_errno = errno;
- v_print("Master '%s': Error: SIOCBONDCHANGEACTIVE failed: "
- "%s\n",
- master_ifname, strerror(saved_errno));
- res = 1;
- }
-
- return res;
-}
-
-static int enslave(char *master_ifname, char *slave_ifname)
-{
- struct ifreq ifr;
- int res = 0;
-
- if (slave_flags.ifr_flags & IFF_SLAVE) {
- fprintf(stderr,
- "Illegal operation: The specified slave interface "
- "'%s' is already a slave\n",
- slave_ifname);
- return 1;
- }
-
- res = set_if_down(slave_ifname, slave_flags.ifr_flags);
- if (res) {
- fprintf(stderr,
- "Slave '%s': Error: bring interface down failed\n",
- slave_ifname);
- return res;
- }
-
- if (abi_ver < 2) {
- /* Older bonding versions would panic if the slave has no IP
- * address, so get the IP setting from the master.
- */
- set_if_addr(master_ifname, slave_ifname);
- } else {
- res = clear_if_addr(slave_ifname);
- if (res) {
- fprintf(stderr,
- "Slave '%s': Error: clear address failed\n",
- slave_ifname);
- return res;
- }
- }
-
- if (master_mtu.ifr_mtu != slave_mtu.ifr_mtu) {
- res = set_slave_mtu(slave_ifname, master_mtu.ifr_mtu);
- if (res) {
- fprintf(stderr,
- "Slave '%s': Error: set MTU failed\n",
- slave_ifname);
- return res;
- }
- }
-
- if (hwaddr_set) {
- /* Master already has an hwaddr
- * so set it's hwaddr to the slave
- */
- if (abi_ver < 1) {
- /* The driver is using an old ABI, so
- * the application sets the slave's
- * hwaddr
- */
- res = set_slave_hwaddr(slave_ifname,
- &(master_hwaddr.ifr_hwaddr));
- if (res) {
- fprintf(stderr,
- "Slave '%s': Error: set hw address "
- "failed\n",
- slave_ifname);
- goto undo_mtu;
- }
-
- /* For old ABI the application needs to bring the
- * slave back up
- */
- res = set_if_up(slave_ifname, slave_flags.ifr_flags);
- if (res) {
- fprintf(stderr,
- "Slave '%s': Error: bring interface "
- "down failed\n",
- slave_ifname);
- goto undo_slave_mac;
- }
- }
- /* The driver is using a new ABI,
- * so the driver takes care of setting
- * the slave's hwaddr and bringing
- * it up again
- */
- } else {
- /* No hwaddr for master yet, so
- * set the slave's hwaddr to it
- */
- if (abi_ver < 1) {
- /* For old ABI, the master needs to be
- * down before setting its hwaddr
- */
- res = set_if_down(master_ifname, master_flags.ifr_flags);
- if (res) {
- fprintf(stderr,
- "Master '%s': Error: bring interface "
- "down failed\n",
- master_ifname);
- goto undo_mtu;
- }
- }
-
- res = set_master_hwaddr(master_ifname,
- &(slave_hwaddr.ifr_hwaddr));
- if (res) {
- fprintf(stderr,
- "Master '%s': Error: set hw address "
- "failed\n",
- master_ifname);
- goto undo_mtu;
- }
-
- if (abi_ver < 1) {
- /* For old ABI, bring the master
- * back up
- */
- res = set_if_up(master_ifname, master_flags.ifr_flags);
- if (res) {
- fprintf(stderr,
- "Master '%s': Error: bring interface "
- "up failed\n",
- master_ifname);
- goto undo_master_mac;
- }
- }
-
- hwaddr_set = 1;
- }
-
- /* Do the real thing */
- strncpy(ifr.ifr_name, master_ifname, IFNAMSIZ);
- strncpy(ifr.ifr_slave, slave_ifname, IFNAMSIZ);
- if ((ioctl(skfd, SIOCBONDENSLAVE, &ifr) < 0) &&
- (ioctl(skfd, BOND_ENSLAVE_OLD, &ifr) < 0)) {
- saved_errno = errno;
- v_print("Master '%s': Error: SIOCBONDENSLAVE failed: %s\n",
- master_ifname, strerror(saved_errno));
- res = 1;
- }
-
- if (res) {
- goto undo_master_mac;
- }
-
- return 0;
-
-/* rollback (best effort) */
-undo_master_mac:
- set_master_hwaddr(master_ifname, &(master_hwaddr.ifr_hwaddr));
- hwaddr_set = 0;
- goto undo_mtu;
-undo_slave_mac:
- set_slave_hwaddr(slave_ifname, &(slave_hwaddr.ifr_hwaddr));
-undo_mtu:
- set_slave_mtu(slave_ifname, slave_mtu.ifr_mtu);
- return res;
-}
-
-static int release(char *master_ifname, char *slave_ifname)
-{
- struct ifreq ifr;
- int res = 0;
-
- if (!(slave_flags.ifr_flags & IFF_SLAVE)) {
- fprintf(stderr,
- "Illegal operation: The specified slave interface "
- "'%s' is not a slave\n",
- slave_ifname);
- return 1;
- }
-
- strncpy(ifr.ifr_name, master_ifname, IFNAMSIZ);
- strncpy(ifr.ifr_slave, slave_ifname, IFNAMSIZ);
- if ((ioctl(skfd, SIOCBONDRELEASE, &ifr) < 0) &&
- (ioctl(skfd, BOND_RELEASE_OLD, &ifr) < 0)) {
- saved_errno = errno;
- v_print("Master '%s': Error: SIOCBONDRELEASE failed: %s\n",
- master_ifname, strerror(saved_errno));
- return 1;
- } else if (abi_ver < 1) {
- /* The driver is using an old ABI, so we'll set the interface
- * down to avoid any conflicts due to same MAC/IP
- */
- res = set_if_down(slave_ifname, slave_flags.ifr_flags);
- if (res) {
- fprintf(stderr,
- "Slave '%s': Error: bring interface "
- "down failed\n",
- slave_ifname);
- }
- }
-
- /* set to default mtu */
- set_slave_mtu(slave_ifname, 1500);
-
- return res;
-}
-
-static int get_if_settings(char *ifname, struct dev_ifr ifra[])
-{
- int i;
- int res = 0;
-
- for (i = 0; ifra[i].req_ifr; i++) {
- strncpy(ifra[i].req_ifr->ifr_name, ifname, IFNAMSIZ);
- res = ioctl(skfd, ifra[i].req_type, ifra[i].req_ifr);
- if (res < 0) {
- saved_errno = errno;
- v_print("Interface '%s': Error: %s failed: %s\n",
- ifname, ifra[i].req_name,
- strerror(saved_errno));
-
- return saved_errno;
- }
- }
-
- return 0;
-}
-
-static int get_slave_flags(char *slave_ifname)
-{
- int res = 0;
-
- strncpy(slave_flags.ifr_name, slave_ifname, IFNAMSIZ);
- res = ioctl(skfd, SIOCGIFFLAGS, &slave_flags);
- if (res < 0) {
- saved_errno = errno;
- v_print("Slave '%s': Error: SIOCGIFFLAGS failed: %s\n",
- slave_ifname, strerror(saved_errno));
- } else {
- v_print("Slave %s: flags %04X.\n",
- slave_ifname, slave_flags.ifr_flags);
- }
-
- return res;
-}
-
-static int set_master_hwaddr(char *master_ifname, struct sockaddr *hwaddr)
-{
- unsigned char *addr = (unsigned char *)hwaddr->sa_data;
- struct ifreq ifr;
- int res = 0;
-
- strncpy(ifr.ifr_name, master_ifname, IFNAMSIZ);
- memcpy(&(ifr.ifr_hwaddr), hwaddr, sizeof(struct sockaddr));
- res = ioctl(skfd, SIOCSIFHWADDR, &ifr);
- if (res < 0) {
- saved_errno = errno;
- v_print("Master '%s': Error: SIOCSIFHWADDR failed: %s\n",
- master_ifname, strerror(saved_errno));
- return res;
- } else {
- v_print("Master '%s': hardware address set to "
- "%2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x.\n",
- master_ifname, addr[0], addr[1], addr[2],
- addr[3], addr[4], addr[5]);
- }
-
- return res;
-}
-
-static int set_slave_hwaddr(char *slave_ifname, struct sockaddr *hwaddr)
-{
- unsigned char *addr = (unsigned char *)hwaddr->sa_data;
- struct ifreq ifr;
- int res = 0;
-
- strncpy(ifr.ifr_name, slave_ifname, IFNAMSIZ);
- memcpy(&(ifr.ifr_hwaddr), hwaddr, sizeof(struct sockaddr));
- res = ioctl(skfd, SIOCSIFHWADDR, &ifr);
- if (res < 0) {
- saved_errno = errno;
-
- v_print("Slave '%s': Error: SIOCSIFHWADDR failed: %s\n",
- slave_ifname, strerror(saved_errno));
-
- if (saved_errno == EBUSY) {
- v_print(" The device is busy: it must be idle "
- "before running this command.\n");
- } else if (saved_errno == EOPNOTSUPP) {
- v_print(" The device does not support setting "
- "the MAC address.\n"
- " Your kernel likely does not support slave "
- "devices.\n");
- } else if (saved_errno == EINVAL) {
- v_print(" The device's address type does not match "
- "the master's address type.\n");
- }
- return res;
- } else {
- v_print("Slave '%s': hardware address set to "
- "%2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x.\n",
- slave_ifname, addr[0], addr[1], addr[2],
- addr[3], addr[4], addr[5]);
- }
-
- return res;
-}
-
-static int set_slave_mtu(char *slave_ifname, int mtu)
-{
- struct ifreq ifr;
- int res = 0;
-
- ifr.ifr_mtu = mtu;
- strncpy(ifr.ifr_name, slave_ifname, IFNAMSIZ);
-
- res = ioctl(skfd, SIOCSIFMTU, &ifr);
- if (res < 0) {
- saved_errno = errno;
- v_print("Slave '%s': Error: SIOCSIFMTU failed: %s\n",
- slave_ifname, strerror(saved_errno));
- } else {
- v_print("Slave '%s': MTU set to %d.\n", slave_ifname, mtu);
- }
-
- return res;
-}
-
-static int set_if_flags(char *ifname, short flags)
-{
- struct ifreq ifr;
- int res = 0;
-
- ifr.ifr_flags = flags;
- strncpy(ifr.ifr_name, ifname, IFNAMSIZ);
-
- res = ioctl(skfd, SIOCSIFFLAGS, &ifr);
- if (res < 0) {
- saved_errno = errno;
- v_print("Interface '%s': Error: SIOCSIFFLAGS failed: %s\n",
- ifname, strerror(saved_errno));
- } else {
- v_print("Interface '%s': flags set to %04X.\n", ifname, flags);
- }
-
- return res;
-}
-
-static int set_if_up(char *ifname, short flags)
-{
- return set_if_flags(ifname, flags | IFF_UP);
-}
-
-static int set_if_down(char *ifname, short flags)
-{
- return set_if_flags(ifname, flags & ~IFF_UP);
-}
-
-static int clear_if_addr(char *ifname)
-{
- struct ifreq ifr;
- int res = 0;
-
- strncpy(ifr.ifr_name, ifname, IFNAMSIZ);
- ifr.ifr_addr.sa_family = AF_INET;
- memset(ifr.ifr_addr.sa_data, 0, sizeof(ifr.ifr_addr.sa_data));
-
- res = ioctl(skfd, SIOCSIFADDR, &ifr);
- if (res < 0) {
- saved_errno = errno;
- v_print("Interface '%s': Error: SIOCSIFADDR failed: %s\n",
- ifname, strerror(saved_errno));
- } else {
- v_print("Interface '%s': address cleared\n", ifname);
- }
-
- return res;
-}
-
-static int set_if_addr(char *master_ifname, char *slave_ifname)
-{
- struct ifreq ifr;
- int res;
- unsigned char *ipaddr;
- int i;
- struct {
- char *req_name;
- char *desc;
- int g_ioctl;
- int s_ioctl;
- } ifra[] = {
- {"IFADDR", "addr", SIOCGIFADDR, SIOCSIFADDR},
- {"DSTADDR", "destination addr", SIOCGIFDSTADDR, SIOCSIFDSTADDR},
- {"BRDADDR", "broadcast addr", SIOCGIFBRDADDR, SIOCSIFBRDADDR},
- {"NETMASK", "netmask", SIOCGIFNETMASK, SIOCSIFNETMASK},
- {NULL, NULL, 0, 0},
- };
-
- for (i = 0; ifra[i].req_name; i++) {
- strncpy(ifr.ifr_name, master_ifname, IFNAMSIZ);
- res = ioctl(skfd, ifra[i].g_ioctl, &ifr);
- if (res < 0) {
- int saved_errno = errno;
-
- v_print("Interface '%s': Error: SIOCG%s failed: %s\n",
- master_ifname, ifra[i].req_name,
- strerror(saved_errno));
-
- ifr.ifr_addr.sa_family = AF_INET;
- memset(ifr.ifr_addr.sa_data, 0,
- sizeof(ifr.ifr_addr.sa_data));
- }
-
- strncpy(ifr.ifr_name, slave_ifname, IFNAMSIZ);
- res = ioctl(skfd, ifra[i].s_ioctl, &ifr);
- if (res < 0) {
- int saved_errno = errno;
-
- v_print("Interface '%s': Error: SIOCS%s failed: %s\n",
- slave_ifname, ifra[i].req_name,
- strerror(saved_errno));
-
- }
-
- ipaddr = (unsigned char *)ifr.ifr_addr.sa_data;
- v_print("Interface '%s': set IP %s to %d.%d.%d.%d\n",
- slave_ifname, ifra[i].desc,
- ipaddr[0], ipaddr[1], ipaddr[2], ipaddr[3]);
- }
-
- return 0;
-}
-
-/*
- * Local variables:
- * version-control: t
- * kept-new-versions: 5
- * c-indent-level: 4
- * c-basic-offset: 4
- * tab-width: 4
- * compile-command: "gcc -Wall -Wstrict-prototypes -O -I/usr/src/linux/include ifenslave.c -o ifenslave"
- * End:
- */
-
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index f98ca633b528..36e5a402ed0e 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -183,7 +183,7 @@ tcp_early_retrans - INTEGER
for triggering fast retransmit when the amount of outstanding data is
small and when no previously unsent data can be transmitted (such
that limited transmit could be used). Also controls the use of
- Tail loss probe (TLP) that converts RTOs occuring due to tail
+ Tail loss probe (TLP) that converts RTOs occurring due to tail
losses into fast recovery (draft-dukkipati-tcpm-tcp-loss-probe-01).
Possible values:
0 disables ER
@@ -685,6 +685,15 @@ ip_dynaddr - BOOLEAN
occurs.
Default: 0
+ip_early_demux - BOOLEAN
+ Optimize input packet processing down to one demux for
+ certain kinds of local sockets. Currently we only do this
+ for established TCP sockets.
+
+ It may add an additional cost for pure routing workloads that
+ reduces overall throughput; in such a case you should disable it.
+ Default: 1
+
icmp_echo_ignore_all - BOOLEAN
If set non-zero, then the kernel will ignore all ICMP ECHO
requests sent to it.
@@ -729,7 +738,7 @@ icmp_ignore_bogus_error_responses - BOOLEAN
frames. Such violations are normally logged via a kernel warning.
If this is set to TRUE, the kernel will not give such warnings, which
will avoid log file clutter.
- Default: FALSE
+ Default: 1
icmp_errors_use_inbound_ifaddr - BOOLEAN
diff --git a/Documentation/networking/netlink_mmap.txt b/Documentation/networking/netlink_mmap.txt
index 1c2dab409625..e6088baf109d 100644
--- a/Documentation/networking/netlink_mmap.txt
+++ b/Documentation/networking/netlink_mmap.txt
@@ -54,7 +54,7 @@ it will use an allocated socket buffer as usual and the contents will be
copied to the ring on transmission, nullifying most of the performance gains.
Dumps of kernel databases automatically support memory mapped I/O.
-Conversion of the transmit path involves changing message contruction to
+Conversion of the transmit path involves changing message construction to
use memory from the TX ring instead of (usually) a buffer declared on the
stack and setting up the frame header approriately. Optionally poll() can
be used to wait for free frames in the TX ring.
@@ -65,8 +65,8 @@ Structured and definitions for using memory mapped I/O are contained in
RX and TX rings
----------------
-Each ring contains a number of continous memory blocks, containing frames of
-fixed size dependant on the parameters used for ring setup.
+Each ring contains a number of continuous memory blocks, containing frames of
+fixed size dependent on the parameters used for ring setup.
Ring: [ block 0 ]
[ frame 0 ]
@@ -80,7 +80,7 @@ Ring: [ block 0 ]
[ frame 2 * n + 1 ]
The blocks are only visible to the kernel, from the point of view of user-space
-the ring just contains the frames in a continous memory zone.
+the ring just contains the frames in a continuous memory zone.
The ring parameters used for setting up the ring are defined as follows:
@@ -91,7 +91,7 @@ struct nl_mmap_req {
unsigned int nm_frame_nr;
};
-Frames are grouped into blocks, where each block is a continous region of memory
+Frames are grouped into blocks, where each block is a continuous region of memory
and holds nm_block_size / nm_frame_size frames. The total number of frames in
the ring is nm_frame_nr. The following invariants hold:
@@ -113,8 +113,8 @@ Some parameters are constrained, specifically:
- nm_frame_nr must equal the actual number of frames as specified above.
-When the kernel can't allocate phsyically continous memory for a ring block,
-it will fall back to use physically discontinous memory. This might affect
+When the kernel can't allocate physically continuous memory for a ring block,
+it will fall back to use physically discontinuous memory. This might affect
performance negatively, in order to avoid this the nm_frame_size parameter
should be chosen to be as small as possible for the required frame size and
the number of blocks should be increased instead.
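
The ring geometry rules quoted in the netlink_mmap.txt hunks above (frames per
block is nm_block_size / nm_frame_size, and nm_frame_nr must equal the actual
number of frames) can be sketched as a small consistency check. This is only an
illustration: struct ring_req is a local stand-in that mirrors the fields of
struct nl_mmap_req, and the helper names are hypothetical, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in mirroring the fields of struct nl_mmap_req (assumption:
 * a real program would use the definition from the kernel headers). */
struct ring_req {
	unsigned int block_size;	/* nm_block_size */
	unsigned int block_nr;		/* nm_block_nr */
	unsigned int frame_size;	/* nm_frame_size */
	unsigned int frame_nr;		/* nm_frame_nr */
};

/* Number of fixed-size frames that fit into one block. */
static unsigned int frames_per_block(const struct ring_req *req)
{
	assert(req->frame_size != 0);
	return req->block_size / req->frame_size;
}

/* Derive frame_nr from the block geometry instead of guessing it. */
static void ring_req_set_frame_nr(struct ring_req *req)
{
	req->frame_nr = frames_per_block(req) * req->block_nr;
}

/* True when frame_nr equals frames_per_block * block_nr. */
static bool ring_req_consistent(const struct ring_req *req)
{
	return req->frame_size != 0 &&
	       req->frame_nr == frames_per_block(req) * req->block_nr;
}
```

Deriving frame_nr from the other three fields, rather than setting it by hand,
is one way to keep the request from violating the constraint on nm_frame_nr.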
diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index 23dd80e82b8e..8572796b1eb6 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -704,6 +704,12 @@ So it seems to be a good candidate to be used with packet fanout.
Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile
it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.):
+/* Written from scratch, but kernel-to-user space API usage
+ * dissected from lolpcap:
+ * Copyright 2011, Chetan Loke <loke.chetan@gmail.com>
+ * License: GPL, version 2.0
+ */
+
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
@@ -722,27 +728,6 @@ it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.):
#include <linux/if_ether.h>
#include <linux/ip.h>
-#define BLOCK_SIZE (1 << 22)
-#define FRAME_SIZE 2048
-
-#define NUM_BLOCKS 64
-#define NUM_FRAMES ((BLOCK_SIZE * NUM_BLOCKS) / FRAME_SIZE)
-
-#define BLOCK_RETIRE_TOV_IN_MS 64
-#define BLOCK_PRIV_AREA_SZ 13
-
-#define ALIGN_8(x) (((x) + 8 - 1) & ~(8 - 1))
-
-#define BLOCK_STATUS(x) ((x)->h1.block_status)
-#define BLOCK_NUM_PKTS(x) ((x)->h1.num_pkts)
-#define BLOCK_O2FP(x) ((x)->h1.offset_to_first_pkt)
-#define BLOCK_LEN(x) ((x)->h1.blk_len)
-#define BLOCK_SNUM(x) ((x)->h1.seq_num)
-#define BLOCK_O2PRIV(x) ((x)->offset_to_priv)
-#define BLOCK_PRIV(x) ((void *) ((uint8_t *) (x) + BLOCK_O2PRIV(x)))
-#define BLOCK_HDR_LEN (ALIGN_8(sizeof(struct block_desc)))
-#define BLOCK_PLUS_PRIV(sz_pri) (BLOCK_HDR_LEN + ALIGN_8((sz_pri)))
-
#ifndef likely
# define likely(x) __builtin_expect(!!(x), 1)
#endif
@@ -765,7 +750,7 @@ struct ring {
static unsigned long packets_total = 0, bytes_total = 0;
static sig_atomic_t sigint = 0;
-void sighandler(int num)
+static void sighandler(int num)
{
sigint = 1;
}
@@ -774,6 +759,8 @@ static int setup_socket(struct ring *ring, char *netdev)
{
int err, i, fd, v = TPACKET_V3;
struct sockaddr_ll ll;
+ unsigned int blocksiz = 1 << 22, framesiz = 1 << 11;
+ unsigned int blocknum = 64;
fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
if (fd < 0) {
@@ -788,13 +775,12 @@ static int setup_socket(struct ring *ring, char *netdev)
}
memset(&ring->req, 0, sizeof(ring->req));
- ring->req.tp_block_size = BLOCK_SIZE;
- ring->req.tp_frame_size = FRAME_SIZE;
- ring->req.tp_block_nr = NUM_BLOCKS;
- ring->req.tp_frame_nr = NUM_FRAMES;
- ring->req.tp_retire_blk_tov = BLOCK_RETIRE_TOV_IN_MS;
- ring->req.tp_sizeof_priv = BLOCK_PRIV_AREA_SZ;
- ring->req.tp_feature_req_word |= TP_FT_REQ_FILL_RXHASH;
+ ring->req.tp_block_size = blocksiz;
+ ring->req.tp_frame_size = framesiz;
+ ring->req.tp_block_nr = blocknum;
+ ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz;
+ ring->req.tp_retire_blk_tov = 60;
+ ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH;
err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req,
sizeof(ring->req));
@@ -804,8 +790,7 @@ static int setup_socket(struct ring *ring, char *netdev)
}
ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr,
- PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
- fd, 0);
+ PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0);
if (ring->map == MAP_FAILED) {
perror("mmap");
exit(1);
@@ -835,58 +820,6 @@ static int setup_socket(struct ring *ring, char *netdev)
return fd;
}
-#ifdef __checked
-static uint64_t prev_block_seq_num = 0;
-
-void assert_block_seq_num(struct block_desc *pbd)
-{
- if (unlikely(prev_block_seq_num + 1 != BLOCK_SNUM(pbd))) {
- printf("prev_block_seq_num:%"PRIu64", expected seq:%"PRIu64" != "
- "actual seq:%"PRIu64"\n", prev_block_seq_num,
- prev_block_seq_num + 1, (uint64_t) BLOCK_SNUM(pbd));
- exit(1);
- }
-
- prev_block_seq_num = BLOCK_SNUM(pbd);
-}
-
-static void assert_block_len(struct block_desc *pbd, uint32_t bytes, int block_num)
-{
- if (BLOCK_NUM_PKTS(pbd)) {
- if (unlikely(bytes != BLOCK_LEN(pbd))) {
- printf("block:%u with %upackets, expected len:%u != actual len:%u\n",
- block_num, BLOCK_NUM_PKTS(pbd), bytes, BLOCK_LEN(pbd));
- exit(1);
- }
- } else {
- if (unlikely(BLOCK_LEN(pbd) != BLOCK_PLUS_PRIV(BLOCK_PRIV_AREA_SZ))) {
- printf("block:%u, expected len:%lu != actual len:%u\n",
- block_num, BLOCK_HDR_LEN, BLOCK_LEN(pbd));
- exit(1);
- }
- }
-}
-
-static void assert_block_header(struct block_desc *pbd, const int block_num)
-{
- uint32_t block_status = BLOCK_STATUS(pbd);
-
- if (unlikely((block_status & TP_STATUS_USER) == 0)) {
- printf("block:%u, not in TP_STATUS_USER\n", block_num);
- exit(1);
- }
-
- assert_block_seq_num(pbd);
-}
-#else
-static inline void assert_block_header(struct block_desc *pbd, const int block_num)
-{
-}
-static void assert_block_len(struct block_desc *pbd, uint32_t bytes, int block_num)
-{
-}
-#endif
-
static void display(struct tpacket3_hdr *ppd)
{
struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac);
@@ -916,37 +849,27 @@ static void display(struct tpacket3_hdr *ppd)
static void walk_block(struct block_desc *pbd, const int block_num)
{
- int num_pkts = BLOCK_NUM_PKTS(pbd), i;
+ int num_pkts = pbd->h1.num_pkts, i;
unsigned long bytes = 0;
- unsigned long bytes_with_padding = BLOCK_PLUS_PRIV(BLOCK_PRIV_AREA_SZ);
struct tpacket3_hdr *ppd;
- assert_block_header(pbd, block_num);
-
- ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd + BLOCK_O2FP(pbd));
+ ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd +
+ pbd->h1.offset_to_first_pkt);
for (i = 0; i < num_pkts; ++i) {
bytes += ppd->tp_snaplen;
- if (ppd->tp_next_offset)
- bytes_with_padding += ppd->tp_next_offset;
- else
- bytes_with_padding += ALIGN_8(ppd->tp_snaplen + ppd->tp_mac);
-
display(ppd);
- ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd + ppd->tp_next_offset);
- __sync_synchronize();
+ ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd +
+ ppd->tp_next_offset);
}
- assert_block_len(pbd, bytes_with_padding, block_num);
-
packets_total += num_pkts;
bytes_total += bytes;
}
-void flush_block(struct block_desc *pbd)
+static void flush_block(struct block_desc *pbd)
{
- BLOCK_STATUS(pbd) = TP_STATUS_KERNEL;
- __sync_synchronize();
+ pbd->h1.block_status = TP_STATUS_KERNEL;
}
static void teardown_socket(struct ring *ring, int fd)
@@ -962,7 +885,7 @@ int main(int argc, char **argp)
socklen_t len;
struct ring ring;
struct pollfd pfd;
- unsigned int block_num = 0;
+ unsigned int block_num = 0, blocks = 64;
struct block_desc *pbd;
struct tpacket_stats_v3 stats;
@@ -984,15 +907,15 @@ int main(int argc, char **argp)
while (likely(!sigint)) {
pbd = (struct block_desc *) ring.rd[block_num].iov_base;
-retry_block:
- if ((BLOCK_STATUS(pbd) & TP_STATUS_USER) == 0) {
+
+ if ((pbd->h1.block_status & TP_STATUS_USER) == 0) {
poll(&pfd, 1, -1);
- goto retry_block;
+ continue;
}
walk_block(pbd, block_num);
flush_block(pbd);
- block_num = (block_num + 1) % NUM_BLOCKS;
+ block_num = (block_num + 1) % blocks;
}
len = sizeof(stats);
diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
index 579994afbe06..ca6977f5b2ed 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -163,6 +163,64 @@ and unnecessary. If there are fewer hardware queues than CPUs, then
RPS might be beneficial if the rps_cpus for each queue are the ones that
share the same memory domain as the interrupting CPU for that queue.
+==== RPS Flow Limit
+
+RPS scales kernel receive processing across CPUs without introducing
+reordering. The trade-off to sending all packets from the same flow
+to the same CPU is CPU load imbalance if flows vary in packet rate.
+In the extreme case a single flow dominates traffic. Especially on
+common server workloads with many concurrent connections, such
+behavior indicates a problem such as a misconfiguration or spoofed
+source Denial of Service attack.
+
+Flow Limit is an optional RPS feature that prioritizes small flows
+during CPU contention by dropping packets from large flows slightly
+ahead of those from small flows. It is active only when an RPS or RFS
+destination CPU approaches saturation. Once a CPU's input packet
+queue exceeds half the maximum queue length (as set by sysctl
+net.core.netdev_max_backlog), the kernel starts a per-flow packet
+count over the last 256 packets. If a flow exceeds a set ratio (by
+default, half) of these packets when a new packet arrives, then the
+new packet is dropped. Packets from other flows are still only
+dropped once the input packet queue reaches netdev_max_backlog.
+No packets are dropped when the input packet queue length is below
+the threshold, so flow limit does not sever connections outright:
+even large flows maintain connectivity.
+
+== Interface
+
+Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not
+turned on. It is implemented for each CPU independently (to avoid lock
+and cache contention) and toggled per CPU by setting the relevant bit
+in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU
+bitmap interface as rps_cpus (see above) when called from procfs:
+
+ /proc/sys/net/core/flow_limit_cpu_bitmap
+
+Per-flow rate is calculated by hashing each packet into a hashtable
+bucket and incrementing a per-bucket counter. The hash function is
+the same that selects a CPU in RPS, but as the number of buckets can
+be much larger than the number of CPUs, flow limit has finer-grained
+identification of large flows and fewer false positives. The default
+table has 4096 buckets. This value can be modified through sysctl
+
+ net.core.flow_limit_table_len
+
+The value is only consulted when a new table is allocated. Modifying
+it does not update active tables.
+
+== Suggested Configuration
+
+Flow limit is useful on systems with many concurrent connections,
+where a single connection taking up 50% of a CPU indicates a problem.
+In such environments, enable the feature on all CPUs that handle
+network rx interrupts (as set in /proc/irq/N/smp_affinity).
+
+The feature depends on the input packet queue length exceeding
+the flow limit threshold (50%) + the flow history length (256).
+Setting net.core.netdev_max_backlog to either 1000 or 10000
+performed well in experiments.
+
RFS: Receive Flow Steering
==========================
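
The drop rule described in the "RPS Flow Limit" text above can be modelled in a
few lines of C. This is a simplified sketch that follows the prose (a history
of the last 256 packets, 4096 buckets, a one-half ratio, accounting only once
the queue exceeds half of netdev_max_backlog); it is not the kernel's
implementation, and all names in it are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define HISTORY_LEN 256u	/* packets remembered, per the text above */
#define NUM_BUCKETS 4096u	/* default table size, per the text above */

struct flow_limit {
	uint16_t history[HISTORY_LEN];	/* bucket of each remembered packet */
	uint16_t buckets[NUM_BUCKETS];	/* packets per bucket in the history */
	unsigned int head;		/* next history slot to overwrite */
	unsigned int filled;		/* number of valid history entries */
};

static void flow_limit_init(struct flow_limit *fl)
{
	memset(fl, 0, sizeof(*fl));
}

/* Returns true if the model would drop the newly arrived packet. */
static bool flow_limit_should_drop(struct flow_limit *fl, uint32_t flow_hash,
				   unsigned int qlen, unsigned int max_backlog)
{
	uint16_t bucket = (uint16_t)(flow_hash % NUM_BUCKETS);

	assert(max_backlog > 0);

	/* Below the threshold nothing is counted and nothing is dropped. */
	if (qlen <= max_backlog / 2)
		return false;

	/* Forget the oldest packet once the history window is full. */
	if (fl->filled == HISTORY_LEN)
		fl->buckets[fl->history[fl->head]]--;
	else
		fl->filled++;

	/* Record the new packet in the window and in its bucket. */
	fl->history[fl->head] = bucket;
	fl->head = (fl->head + 1) % HISTORY_LEN;
	fl->buckets[bucket]++;

	/* Drop when this flow holds more than half of the window. */
	return fl->buckets[bucket] > HISTORY_LEN / 2;
}
```

In this model a single flow is first dropped on its 129th packet within the
window, while packets from other flows hashed to different buckets still pass.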
diff --git a/Documentation/power/devices.txt b/Documentation/power/devices.txt
index 504dfe4d52eb..a66c9821b5ce 100644
--- a/Documentation/power/devices.txt
+++ b/Documentation/power/devices.txt
@@ -268,7 +268,7 @@ situations.
System Power Management Phases
------------------------------
Suspending or resuming the system is done in several phases. Different phases
-are used for standby or memory sleep states ("suspend-to-RAM") and the
+are used for freeze, standby, and memory sleep states ("suspend-to-RAM") and the
hibernation state ("suspend-to-disk"). Each phase involves executing callbacks
for every device before the next phase begins. Not all busses or classes
support all these callbacks and not all drivers use all the callbacks. The
@@ -309,7 +309,8 @@ execute the corresponding method from dev->driver->pm instead if there is one.
Entering System Suspend
-----------------------
-When the system goes into the standby or memory sleep state, the phases are:
+When the system goes into the freeze, standby or memory sleep state,
+the phases are:
prepare, suspend, suspend_late, suspend_noirq.
@@ -368,7 +369,7 @@ the devices that were suspended.
Leaving System Suspend
----------------------
-When resuming from standby or memory sleep, the phases are:
+When resuming from freeze, standby or memory sleep, the phases are:
resume_noirq, resume_early, resume, complete.
@@ -433,8 +434,8 @@ the system log.
Entering Hibernation
--------------------
-Hibernating the system is more complicated than putting it into the standby or
-memory sleep state, because it involves creating and saving a system image.
+Hibernating the system is more complicated than putting it into the other
+sleep states, because it involves creating and saving a system image.
Therefore there are more phases for hibernation, with a different set of
callbacks. These phases always run after tasks have been frozen and memory has
been freed.
@@ -485,8 +486,8 @@ image forms an atomic snapshot of the system state.
At this point the system image is saved, and the devices then need to be
prepared for the upcoming system shutdown. This is much like suspending them
-before putting the system into the standby or memory sleep state, and the phases
-are similar.
+before putting the system into the freeze, standby or memory sleep state,
+and the phases are similar.
9. The prepare phase is discussed above.
diff --git a/Documentation/power/interface.txt b/Documentation/power/interface.txt
index c537834af005..f1f0f59a7c47 100644
--- a/Documentation/power/interface.txt
+++ b/Documentation/power/interface.txt
@@ -7,8 +7,8 @@ running. The interface exists in /sys/power/ directory (assuming sysfs
is mounted at /sys).
/sys/power/state controls system power state. Reading from this file
-returns what states are supported, which is hard-coded to 'standby'
-(Power-On Suspend), 'mem' (Suspend-to-RAM), and 'disk'
+returns what states are supported, which is hard-coded to 'freeze',
+'standby' (Power-On Suspend), 'mem' (Suspend-to-RAM), and 'disk'
(Suspend-to-Disk).
Writing to this file one of those strings causes the system to
diff --git a/Documentation/power/notifiers.txt b/Documentation/power/notifiers.txt
index c2a4a346c0d9..a81fa254303d 100644
--- a/Documentation/power/notifiers.txt
+++ b/Documentation/power/notifiers.txt
@@ -15,8 +15,10 @@ A suspend/hibernation notifier may be used for this purpose.
The subsystems or drivers having such needs can register suspend notifiers that
will be called upon the following events by the PM core:
-PM_HIBERNATION_PREPARE The system is going to hibernate or suspend, tasks will
- be frozen immediately.
+PM_HIBERNATION_PREPARE The system is going to hibernate, tasks will be frozen
+ immediately. This is different from PM_SUSPEND_PREPARE
+ below because here we do additional work between notifiers
+ and drivers freezing.
PM_POST_HIBERNATION The system memory state has been restored from a
hibernation image or an error occurred during
diff --git a/Documentation/power/states.txt b/Documentation/power/states.txt
index 4416b28630df..442d43df9b25 100644
--- a/Documentation/power/states.txt
+++ b/Documentation/power/states.txt
@@ -2,12 +2,26 @@
System Power Management States
-The kernel supports three power management states generically, though
-each is dependent on platform support code to implement the low-level
-details for each state. This file describes each state, what they are
+The kernel supports four power management states generically, though
+one is generic and the other three are dependent on platform support
+code to implement the low-level details for each state.
+This file describes each state, what they are
commonly called, what ACPI state they map to, and what string to write
to /sys/power/state to enter that state
+State: Freeze / Low-Power Idle
+ACPI State: S0
+String: "freeze"
+
+This state is a generic, pure software, light-weight, low-power state.
+It allows more energy to be saved relative to idle by freezing user
+space and putting all I/O devices into low-power states (possibly
+lower-power than available at run time), such that the processors can
+spend more time in their idle states.
+This state can be used for platforms without Standby/Suspend-to-RAM
+support, or it can be used in addition to Suspend-to-RAM (memory sleep)
+to provide reduced resume latency.
+
State: Standby / Power-On Suspend
ACPI State: S1
@@ -22,9 +36,6 @@ We try to put devices in a low-power state equivalent to D1, which
also offers low power savings, but low resume latency. Not all devices
support D1, and those that don't are left on.
-A transition from Standby to the On state should take about 1-2
-seconds.
-
State: Suspend-to-RAM
ACPI State: S3
@@ -42,9 +53,6 @@ transition back to the On state.
For at least ACPI, STR requires some minimal boot-strapping code to
resume the system from STR. This may be true on other platforms.
-A transition from Suspend-to-RAM to the On state should take about
-3-5 seconds.
-
State: Suspend-to-disk
ACPI State: S4
@@ -74,7 +82,3 @@ low-power state (like ACPI S4), or it may simply power down. Powering
down offers greater savings, and allows this mechanism to work on any
system. However, entering a real low-power state allows the user to
trigger wake up events (e.g. pressing a key or opening a laptop lid).
-
-A transition from Suspend-to-Disk to the On state should take about 30
-seconds, though it's typically a bit more with the current
-implementation.
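
Since not every platform offers every state, a program may want to check what
/sys/power/state reports before writing "freeze" to it. The following is a
minimal sketch that assumes the file contains a whitespace-separated list of
state strings; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Read the first line of 'path' into 'buf'; returns false on failure. */
static bool read_line(const char *path, char *buf, size_t size)
{
	FILE *f = fopen(path, "r");
	bool ok;

	if (!f)
		return false;
	ok = fgets(buf, (int)size, f) != NULL;
	fclose(f);
	return ok;
}

/* True if 'name' appears as a whole word in the list 'states'. */
static bool state_listed(const char *states, const char *name)
{
	size_t len = strlen(name);
	const char *p = states;

	assert(len > 0);
	while (*p) {
		size_t tok = strcspn(p, " \t\n");	/* length of this word */

		if (tok == len && strncmp(p, name, len) == 0)
			return true;
		p += tok;
		p += strspn(p, " \t\n");		/* skip separators */
	}
	return false;
}
```

A caller would typically pass "/sys/power/state" to read_line() and then test
state_listed(buf, "freeze") before attempting to enter the state.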
diff --git a/Documentation/powerpc/transactional_memory.txt b/Documentation/powerpc/transactional_memory.txt
index c907be41d60f..dc23e58ae264 100644
--- a/Documentation/powerpc/transactional_memory.txt
+++ b/Documentation/powerpc/transactional_memory.txt
@@ -147,6 +147,25 @@ Example signal handler:
fix_the_problem(ucp->dar);
}
+When in an active transaction that takes a signal, we need to be careful with
+the stack. It's possible that the stack has moved back up after the tbegin.
+The obvious case here is when the tbegin is called inside a function that
+returns before a tend. In this case, the stack is part of the checkpointed
+transactional memory state. If we write over this non transactionally or in
+suspend, we are in trouble because if we get a tm abort, the program counter and
+stack pointer will be back at the tbegin but our in memory stack won't be valid
+anymore.
+
+To avoid this, when taking a signal in an active transaction, we need to use
+the stack pointer from the checkpointed state, rather than the speculated
+state. This ensures that the signal context (written tm suspended) will be
+written below the stack required for the rollback. The transaction is aborted
+because of the treclaim, so any memory written between the tbegin and the
+signal will be rolled back anyway.
+
+For signals taken in non-TM or suspended mode, we use the
+normal/non-checkpointed stack pointer.
+
Failure cause codes used by kernel
==================================
@@ -155,14 +174,18 @@ These are defined in <asm/reg.h>, and distinguish different reasons why the
kernel aborted a transaction:
TM_CAUSE_RESCHED Thread was rescheduled.
+ TM_CAUSE_TLBI Software TLB invalidate.
TM_CAUSE_FAC_UNAV FP/VEC/VSX unavailable trap.
TM_CAUSE_SYSCALL Currently unused; future syscalls that must abort
transactions for consistency will use this.
TM_CAUSE_SIGNAL Signal delivered.
TM_CAUSE_MISC Currently unused.
+ TM_CAUSE_ALIGNMENT Alignment fault.
+ TM_CAUSE_EMULATE Emulation that touched memory.
-These can be checked by the user program's abort handler as TEXASR[0:7].
-
+These can be checked by the user program's abort handler as TEXASR[0:7]. If
+bit 7 is set, it indicates that the error is considered persistent. For example,
+a TM_CAUSE_ALIGNMENT will be persistent while a TM_CAUSE_RESCHED will not.
GDB
===
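
An abort handler could use the persistent bit described in the
transactional_memory.txt hunk above to decide whether retrying is worthwhile.
The sketch below assumes that TEXASR is a 64-bit value using IBM bit numbering,
so that TEXASR[0:7] is the most significant byte and bit 7 is the lowest bit of
that byte. The helper names and the retry policy are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Extract TEXASR[0:7], the failure cause code (assumed to be the top byte). */
static uint8_t texasr_failure_code(uint64_t texasr)
{
	return (uint8_t)(texasr >> 56);
}

/* Bit 7 of the cause code is its least significant bit: "persistent". */
static bool texasr_failure_persistent(uint64_t texasr)
{
	return (texasr_failure_code(texasr) & 0x01) != 0;
}

/* Retry only transient failures, and only up to max_attempts times. */
static bool tm_should_retry(uint64_t texasr, unsigned int attempts,
			    unsigned int max_attempts)
{
	assert(attempts <= max_attempts);
	return !texasr_failure_persistent(texasr) && attempts < max_attempts;
}
```

With this policy a persistent cause goes straight to the non-transactional
fallback path, while a transient cause is retried a bounded number of times.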
diff --git a/Documentation/rapidio/rapidio.txt b/Documentation/rapidio/rapidio.txt
index c75694b35d08..a9c16c979da2 100644
--- a/Documentation/rapidio/rapidio.txt
+++ b/Documentation/rapidio/rapidio.txt
@@ -79,20 +79,63 @@ master port that is used to communicate with devices within the network.
In order to initialize the RapidIO subsystem, a platform must initialize and
register at least one master port within the RapidIO network. To register mport
within the subsystem controller driver initialization code calls function
-rio_register_mport() for each available master port. After all active master
-ports are registered with a RapidIO subsystem, the rio_init_mports() routine
-is called to perform enumeration and discovery.
+rio_register_mport() for each available master port.
-In the current PowerPC-based implementation a subsys_initcall() is specified to
-perform controller initialization and mport registration. At the end it directly
-calls rio_init_mports() to execute RapidIO enumeration and discovery.
+RapidIO subsystem uses subsys_initcall() or device_initcall() to perform
+controller initialization (depending on controller device type).
+
+After all active master ports are registered with a RapidIO subsystem,
+an enumeration and/or discovery routine may be called automatically or
+by user-space command.
4. Enumeration and Discovery
----------------------------
-When rio_init_mports() is called it scans a list of registered master ports and
-calls an enumeration or discovery routine depending on the configured role of a
-master port: host or agent.
+4.1 Overview
+------------
+
+RapidIO subsystem configuration options allow users to specify enumeration and
+discovery methods as statically linked components or loadable modules.
+An enumeration/discovery method implementation and available input parameters
+define how any given method can be attached to available RapidIO mports:
+simply to all available mports OR individually to the specified mport device.
+
+Depending on selected enumeration/discovery build configuration, there are
+several methods to initiate an enumeration and/or discovery process:
+
+ (a) Statically linked enumeration and discovery process can be started
+ automatically during kernel initialization time using corresponding module
+ parameters. This was the original method used since introduction of RapidIO
+ subsystem. Now this method relies on enumerator module parameter which is
+ 'rio-scan.scan' for existing basic enumeration/discovery method.
+ When automatic start of enumeration/discovery is used a user has to ensure
+ that all discovering endpoints are started before the enumerating endpoint
+ and are waiting for enumeration to be completed.
+ Configuration option CONFIG_RAPIDIO_DISC_TIMEOUT defines time that discovering
+ endpoint waits for enumeration to be completed. If the specified timeout
+ expires the discovery process is terminated without obtaining RapidIO network
+ information. NOTE: a timed out discovery process may be restarted later using
+ a user-space command as it is described later if the given endpoint was
+ enumerated successfully.
+
+ (b) Statically linked enumeration and discovery process can be started by
+ a command from user space. This initiation method provides more flexibility
+ for a system startup compared to the option (a) above. After all participating
+ endpoints have been successfully booted, an enumeration process shall be
+ started first by issuing a user-space command, after an enumeration is
+ completed a discovery process can be started on all remaining endpoints.
+
+ (c) Modular enumeration and discovery process can be started by a command from
+ user space. After an enumeration/discovery module is loaded, a network scan
+ process can be started by issuing a user-space command.
+ Similar to the option (b) above, an enumerator has to be started first.
+
+ (d) Modular enumeration and discovery process can be started by a module
+ initialization routine. In this case an enumerating module shall be loaded
+ first.
+
+When a network scan process is started it calls an enumeration or discovery
+routine depending on the configured role of a master port: host or agent.
Enumeration is performed by a master port if it is configured as a host port by
assigning a host device ID greater than or equal to zero. A host device ID is
@@ -104,8 +147,58 @@ for it.
The enumeration and discovery routines use RapidIO maintenance transactions
to access the configuration space of devices.
-The enumeration process is implemented according to the enumeration algorithm
-outlined in the RapidIO Interconnect Specification: Annex I [1].
+4.2 Automatic Start of Enumeration and Discovery
+------------------------------------------------
+
+The automatic enumeration/discovery start method is applicable only to the
+built-in enumeration/discovery RapidIO configuration selection. To enable
+automatic enumeration/discovery start with the existing basic enumeration
+method, set the boot command line parameter "rio-scan.scan=1".
+
+This configuration requires synchronized start of all RapidIO endpoints that
+form a network which will be enumerated/discovered. Discovering endpoints have
+to be started before an enumeration starts to ensure that all RapidIO
+controllers have been initialized and are ready to be discovered. Configuration
+parameter CONFIG_RAPIDIO_DISC_TIMEOUT defines the time (in seconds) that
+a discovering endpoint will wait for enumeration to be completed.
+
+When automatic enumeration/discovery start is selected, the basic method's
+initialization routine calls rio_init_mports() to perform enumeration or
+discovery for all known mport devices.
+
+Depending on RapidIO network size and configuration this automatic
+enumeration/discovery start method may be difficult to use due to the
+requirement for synchronized start of all endpoints.
+
+4.3 User-space Start of Enumeration and Discovery
+-------------------------------------------------
+
+User-space start of enumeration and discovery can be used with built-in and
+modular build configurations. For user-space controlled start, the RapidIO
+subsystem creates the sysfs write-only attribute file '/sys/bus/rapidio/scan'.
+To initiate an enumeration or discovery process on a specific mport device,
+a user needs to write the mport_ID (not the RapidIO destination ID) into that
+file. The mport_ID is a sequential number (0 ... RIO_MAX_MPORTS) assigned
+during mport device registration. For example, for a machine with a single
+RapidIO controller, the mport_ID for that controller will always be 0.
+
+To initiate RapidIO enumeration/discovery on all available mports, a user may
+write '-1' (or RIO_MPORT_ANY) into the scan attribute file.
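The user-space start described above can be sketched as a small helper. This is
only an illustration: the function name start_rio_scan() and its scan_path
parameter are hypothetical, and the helper assumes the caller has permission to
write the attribute file. The only facts taken from the text are the path
'/sys/bus/rapidio/scan', the meaning of mport_ID, and the value -1 for all
mports.

```python
RIO_MPORT_ANY = -1  # value the text gives for "all available mports"


def start_rio_scan(mport_id=RIO_MPORT_ANY, scan_path="/sys/bus/rapidio/scan"):
    """Request enumeration/discovery by writing an mport_ID to the scan file.

    mport_id is the sequential mport number (not a RapidIO destination ID),
    or RIO_MPORT_ANY to request a scan on all available mports.
    scan_path is a parameter only so the helper can be tried on a plain file.
    """
    if mport_id < RIO_MPORT_ANY:
        raise ValueError("mport_id must be -1 (all mports) or a non-negative ID")
    value = str(mport_id)
    # The attribute is write-only, so open it for writing and never read back.
    with open(scan_path, "w") as scan_file:
        scan_file.write(value + "\n")
    return value
```

On a machine with a single RapidIO controller, start_rio_scan(0) would request
a scan on that controller's mport. As the text notes for the user-space start
options, the enumerating endpoint has to be started before the discovering ones.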
+
+4.4 Basic Enumeration Method
+----------------------------
+
+This is the original enumeration/discovery method, which has been available
+since the first release of the RapidIO subsystem code. The enumeration process
+is implemented according to the enumeration algorithm outlined in the RapidIO
+Interconnect Specification: Annex I [1].
+
+This method can be configured as a statically linked or loadable module.
+The method's single parameter "scan" allows triggering the
+enumeration/discovery process from the module initialization routine.
+
+This enumeration/discovery method can be started only once and does not support
+unloading if it is built as a module.
The enumeration process traverses the network using a recursive depth-first
algorithm. When a new device is found, the enumerator takes ownership of that
@@ -160,6 +253,19 @@ time period. If this wait time period expires before enumeration is completed,
an agent skips RapidIO discovery and continues with remaining kernel
initialization.
+4.5 Adding New Enumeration/Discovery Method
+-------------------------------------------
+
+RapidIO subsystem code organization allows addition of new enumeration/discovery
+methods as new configuration options without significant impact to the core
+RapidIO code.
+
+A new enumeration/discovery method has to be attached to one or more mport
+devices before an enumeration/discovery process can be started. Normally, the
+method's module initialization routine calls rio_register_scan() to attach
+an enumerator to a specified mport device (or devices). The basic enumerator
+implementation demonstrates this process.
+
5. References
-------------
diff --git a/Documentation/rapidio/sysfs.txt b/Documentation/rapidio/sysfs.txt
index 97f71ce575d6..19878179da4c 100644
--- a/Documentation/rapidio/sysfs.txt
+++ b/Documentation/rapidio/sysfs.txt
@@ -88,3 +88,20 @@ that exports additional attributes.
IDT_GEN2:
errlog - reads contents of device error log until it is empty.
+
+
+5. RapidIO Bus Attributes
+-------------------------
+
+RapidIO bus subdirectory /sys/bus/rapidio implements the following bus-specific
+attribute:
+
+ scan - allows triggering the enumeration/discovery process from user space.
+ This is a write-only attribute. To initiate an enumeration or discovery
+ process on a specific mport device, a user needs to write mport_ID (not
+ RapidIO destination ID) into this file. The mport_ID is a sequential
+ number (0 ... RIO_MAX_MPORTS) assigned to the mport device.
+ For example, for a machine with a single RapidIO controller, mport_ID
+ for that controller always will be 0.
+ To initiate RapidIO enumeration/discovery on all available mports
+ a user must write '-1' (or RIO_MPORT_ANY) into this attribute file.
diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
index 98335b7a5337..5369879eafe2 100644
--- a/Documentation/sysctl/net.txt
+++ b/Documentation/sysctl/net.txt
@@ -26,7 +26,7 @@ Table : Subdirectories in /proc/sys/net
ipv4 IP version 4 x25 X.25 protocol
ipx IPX token-ring IBM token ring
bridge Bridging decnet DEC net
- ipv6 IP version 6
+ ipv6 IP version 6 tipc TIPC
..............................................................................
1. /proc/sys/net/core - Network core options
@@ -50,6 +50,13 @@ The maximum number of packets that kernel can handle on a NAPI interrupt,
it's a Per-CPU variable.
Default: 64
+low_latency_poll
+----------------
+Low latency busy poll timeout. (needs CONFIG_NET_LL_RX_POLL)
+Approximate time in us to spin waiting for packets on the device queue.
+Recommended value is 50. May increase power usage.
+Default: 0 (off)
+
rmem_default
------------
@@ -93,8 +100,7 @@ netdev_budget
Maximum number of packets taken from all interfaces in one polling cycle (NAPI
poll). In one polling cycle interfaces which are registered to polling are
-probed in a round-robin manner. The limit of packets in one such probe can be
-set per-device via sysfs class/net/<device>/weight .
+probed in a round-robin manner.
netdev_max_backlog
------------------
@@ -201,3 +207,18 @@ IPX.
The /proc/net/ipx_route table holds a list of IPX routes. For each route it
gives the destination network, the router node (or Directly) and the network
address of the router (or Connected) for internal networks.
+
+6. TIPC
+-------------------------------------------------------
+
+The TIPC protocol now has a tunable for the receive memory, similar to
+tcp_rmem, i.e. a vector of 3 INTEGERs: (min, default, max)
+
+ # cat /proc/sys/net/tipc/tipc_rmem
+ 4252725 34021800 68043600
+ #
+
+The max value is set to CONN_OVERLOAD_LIMIT, and the default and min values
+are scaled (shifted) versions of that same value. Note that the min value
+is not at this point in time used in any meaningful way, but the triplet is
+preserved in order to be consistent with things like tcp_rmem.
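As a rough illustration of the triplet layout, the sketch below parses the
sample output shown above. The function name parse_tipc_rmem() is hypothetical.
The shift amounts checked at the end (1 bit for the default, 4 bits for the
min) are inferred from the sample values only and are an assumption, not a
statement about how the kernel computes them.

```python
def parse_tipc_rmem(text):
    """Split a tipc_rmem line into its (min, default, max) integers."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError("expected three integers: min, default, max")
    minimum, default, maximum = (int(field) for field in fields)
    return {"min": minimum, "default": default, "max": maximum}


# Sample line as printed by 'cat /proc/sys/net/tipc/tipc_rmem' above.
rmem = parse_tipc_rmem("4252725 34021800 68043600")

# For these sample values the default and min are right-shifted copies of max.
default_matches = (rmem["max"] >> 1) == rmem["default"]
min_matches = (rmem["max"] >> 4) == rmem["min"]
```

On a live system the same function could be applied to the contents of
/proc/sys/net/tipc/tipc_rmem instead of the sample string.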