diff options
Diffstat (limited to 'Documentation')
138 files changed, 11093 insertions, 1493 deletions
diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX index f28a24e0279b..433cf5e9ae04 100644 --- a/Documentation/00-INDEX +++ b/Documentation/00-INDEX @@ -46,6 +46,8 @@ SubmittingPatches - procedure to get a source patch included into the kernel tree. VGA-softcursor.txt - how to change your VGA cursor from a blinking underscore. +applying-patches.txt + - description of various trees and how to apply their patches. arm/ - directory with info about Linux on the ARM architecture. basic_profiling.txt @@ -275,7 +277,7 @@ tty.txt unicode.txt - info on the Unicode character/font mapping used in Linux. uml/ - - directory with infomation about User Mode Linux. + - directory with information about User Mode Linux. usb/ - directory with info regarding the Universal Serial Bus. video4linux/ diff --git a/Documentation/Changes b/Documentation/Changes index 5eaab0441d76..783ddc3ce4e8 100644 --- a/Documentation/Changes +++ b/Documentation/Changes @@ -65,7 +65,7 @@ o isdn4k-utils 3.1pre1 # isdnctrl 2>&1|grep version o nfs-utils 1.0.5 # showmount --version o procps 3.2.0 # ps --version o oprofile 0.9 # oprofiled --version -o udev 058 # udevinfo -V +o udev 071 # udevinfo -V Kernel compilation ================== @@ -237,6 +237,12 @@ udev udev is a userspace application for populating /dev dynamically with only entries for devices actually present. udev replaces devfs. +FUSE +---- + +Needs libfuse 2.4.0 or later. Absolute minimum is 2.3.0 but mount +options 'direct_io' and 'kernel_cache' won't work. + Networking ========== @@ -390,6 +396,10 @@ udev ---- o <http://www.kernel.org/pub/linux/utils/kernel/hotplug/udev.html> +FUSE +---- +o <http://sourceforge.net/projects/fuse> + Networking ********** diff --git a/Documentation/CodingStyle b/Documentation/CodingStyle index f25b3953f513..eb7db3c19227 100644 --- a/Documentation/CodingStyle +++ b/Documentation/CodingStyle @@ -236,6 +236,9 @@ ugly), but try to avoid excess. Instead, put the comments at the head of the function, telling people what it does, and possibly WHY it does it. +When commenting the kernel API functions, please use the kerneldoc format. +See the files Documentation/kernel-doc-nano-HOWTO.txt and scripts/kernel-doc +for details. Chapter 8: You've made a mess of it @@ -407,7 +410,26 @@ Kernel messages do not have to be terminated with a period. Printing numbers in parentheses (%d) adds no value and should be avoided. - Chapter 13: References + Chapter 13: Allocating memory + +The kernel provides the following general purpose memory allocators: +kmalloc(), kzalloc(), kcalloc(), and vmalloc(). Please refer to the API +documentation for further information about them. + +The preferred form for passing a size of a struct is the following: + + p = kmalloc(sizeof(*p), ...); + +The alternative form where struct name is spelled out hurts readability and +introduces an opportunity for a bug when the pointer variable type is changed +but the corresponding sizeof that is passed to a memory allocator is not. + +Casting the return value which is a void pointer is redundant. The conversion +from void pointer to any other pointer type is guaranteed by the C programming +language. + + + Chapter 14: References The C Programming Language, Second Edition by Brian W. Kernighan and Dennis M. Ritchie. diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt index 6ee3cd6134df..1af0f2d50220 100644 --- a/Documentation/DMA-API.txt +++ b/Documentation/DMA-API.txt @@ -121,7 +121,7 @@ pool's device. dma_addr_t addr); This puts memory back into the pool. The pool is what was passed to -the the pool allocation routine; the cpu and dma addresses are what +the pool allocation routine; the cpu and dma addresses are what were returned when that routine allocated the memory being freed. diff --git a/Documentation/DMA-ISA-LPC.txt b/Documentation/DMA-ISA-LPC.txt new file mode 100644 index 000000000000..705f6be92bdb --- /dev/null +++ b/Documentation/DMA-ISA-LPC.txt @@ -0,0 +1,151 @@ + DMA with ISA and LPC devices + ============================ + + Pierre Ossman <drzeus@drzeus.cx> + +This document describes how to do DMA transfers using the old ISA DMA +controller. Even though ISA is more or less dead today the LPC bus +uses the same DMA system so it will be around for quite some time. + +Part I - Headers and dependencies +--------------------------------- + +To do ISA style DMA you need to include two headers: + +#include <linux/dma-mapping.h> +#include <asm/dma.h> + +The first is the generic DMA API used to convert virtual addresses to +physical addresses (see Documentation/DMA-API.txt for details). + +The second contains the routines specific to ISA DMA transfers. Since +this is not present on all platforms make sure you construct your +Kconfig to be dependent on ISA_DMA_API (not ISA) so that nobody tries +to build your driver on unsupported platforms. + +Part II - Buffer allocation +--------------------------- + +The ISA DMA controller has some very strict requirements on which +memory it can access so extra care must be taken when allocating +buffers. + +(You usually need a special buffer for DMA transfers instead of +transferring directly to and from your normal data structures.) + +The DMA-able address space is the lowest 16 MB of _physical_ memory. +Also the transfer block may not cross page boundaries (which are 64 +or 128 KiB depending on which channel you use). + +In order to allocate a piece of memory that satisfies all these +requirements you pass the flag GFP_DMA to kmalloc. + +Unfortunately the memory available for ISA DMA is scarce so unless you +allocate the memory during boot-up it's a good idea to also pass +__GFP_REPEAT and __GFP_NOWARN to make the allocater try a bit harder. + +(This scarcity also means that you should allocate the buffer as +early as possible and not release it until the driver is unloaded.) + +Part III - Address translation +------------------------------ + +To translate the virtual address to a physical use the normal DMA +API. Do _not_ use isa_virt_to_phys() even though it does the same +thing. The reason for this is that the function isa_virt_to_phys() +will require a Kconfig dependency to ISA, not just ISA_DMA_API which +is really all you need. Remember that even though the DMA controller +has its origins in ISA it is used elsewhere. + +Note: x86_64 had a broken DMA API when it came to ISA but has since +been fixed. If your arch has problems then fix the DMA API instead of +reverting to the ISA functions. + +Part IV - Channels +------------------ + +A normal ISA DMA controller has 8 channels. The lower four are for +8-bit transfers and the upper four are for 16-bit transfers. + +(Actually the DMA controller is really two separate controllers where +channel 4 is used to give DMA access for the second controller (0-3). +This means that of the four 16-bits channels only three are usable.) + +You allocate these in a similar fashion as all basic resources: + +extern int request_dma(unsigned int dmanr, const char * device_id); +extern void free_dma(unsigned int dmanr); + +The ability to use 16-bit or 8-bit transfers is _not_ up to you as a +driver author but depends on what the hardware supports. Check your +specs or test different channels. + +Part V - Transfer data +---------------------- + +Now for the good stuff, the actual DMA transfer. :) + +Before you use any ISA DMA routines you need to claim the DMA lock +using claim_dma_lock(). The reason is that some DMA operations are +not atomic so only one driver may fiddle with the registers at a +time. + +The first time you use the DMA controller you should call +clear_dma_ff(). This clears an internal register in the DMA +controller that is used for the non-atomic operations. As long as you +(and everyone else) uses the locking functions then you only need to +reset this once. + +Next, you tell the controller in which direction you intend to do the +transfer using set_dma_mode(). Currently you have the options +DMA_MODE_READ and DMA_MODE_WRITE. + +Set the address from where the transfer should start (this needs to +be 16-bit aligned for 16-bit transfers) and how many bytes to +transfer. Note that it's _bytes_. The DMA routines will do all the +required translation to values that the DMA controller understands. + +The final step is enabling the DMA channel and releasing the DMA +lock. + +Once the DMA transfer is finished (or timed out) you should disable +the channel again. You should also check get_dma_residue() to make +sure that all data has been transfered. + +Example: + +int flags, residue; + +flags = claim_dma_lock(); + +clear_dma_ff(); + +set_dma_mode(channel, DMA_MODE_WRITE); +set_dma_addr(channel, phys_addr); +set_dma_count(channel, num_bytes); + +dma_enable(channel); + +release_dma_lock(flags); + +while (!device_done()); + +flags = claim_dma_lock(); + +dma_disable(channel); + +residue = dma_get_residue(channel); +if (residue != 0) + printk(KERN_ERR "driver: Incomplete DMA transfer!" + " %d bytes left!\n", residue); + +release_dma_lock(flags); + +Part VI - Suspend/resume +------------------------ + +It is the driver's responsibility to make sure that the machine isn't +suspended while a DMA transfer is in progress. Also, all DMA settings +are lost when the system suspends so if your driver relies on the DMA +controller being in a certain state then you have to restore these +registers upon resume. diff --git a/Documentation/DocBook/journal-api.tmpl b/Documentation/DocBook/journal-api.tmpl index 1ef6f43c6d8f..341aaa4ce481 100644 --- a/Documentation/DocBook/journal-api.tmpl +++ b/Documentation/DocBook/journal-api.tmpl @@ -116,7 +116,7 @@ filesystem. Almost. You still need to actually journal your filesystem changes, this is done by wrapping them into transactions. Additionally you -also need to wrap the modification of each of the the buffers +also need to wrap the modification of each of the buffers with calls to the journal layer, so it knows what the modifications you are actually making are. To do this use journal_start() which returns a transaction handle. @@ -128,7 +128,7 @@ and its counterpart journal_stop(), which indicates the end of a transaction are nestable calls, so you can reenter a transaction if necessary, but remember you must call journal_stop() the same number of times as journal_start() before the transaction is completed (or more accurately -leaves the the update phase). Ext3/VFS makes use of this feature to simplify +leaves the update phase). Ext3/VFS makes use of this feature to simplify quota support. </para> diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl index d650ce36485f..ec474e5a25ed 100644 --- a/Documentation/DocBook/kernel-api.tmpl +++ b/Documentation/DocBook/kernel-api.tmpl @@ -239,9 +239,9 @@ X!Ilib/string.c <title>Network device support</title> <sect1><title>Driver Support</title> !Enet/core/dev.c - </sect1> - <sect1><title>8390 Based Network Cards</title> -!Edrivers/net/8390.c +!Enet/ethernet/eth.c +!Einclude/linux/etherdevice.h +!Enet/core/wireless.c </sect1> <sect1><title>Synchronous PPP</title> !Edrivers/net/wan/syncppp.c @@ -286,7 +286,9 @@ X!Edrivers/pci/search.c --> !Edrivers/pci/msi.c !Edrivers/pci/bus.c -!Edrivers/pci/hotplug.c +<!-- FIXME: Removed for now since no structured comments in source +X!Edrivers/pci/hotplug.c +--> !Edrivers/pci/probe.c !Edrivers/pci/rom.c </sect1> diff --git a/Documentation/DocBook/kernel-hacking.tmpl b/Documentation/DocBook/kernel-hacking.tmpl index 49a9ef82d575..582032eea872 100644 --- a/Documentation/DocBook/kernel-hacking.tmpl +++ b/Documentation/DocBook/kernel-hacking.tmpl @@ -8,8 +8,7 @@ <authorgroup> <author> - <firstname>Paul</firstname> - <othername>Rusty</othername> + <firstname>Rusty</firstname> <surname>Russell</surname> <affiliation> <address> @@ -20,7 +19,7 @@ </authorgroup> <copyright> - <year>2001</year> + <year>2005</year> <holder>Rusty Russell</holder> </copyright> @@ -64,7 +63,7 @@ <chapter id="introduction"> <title>Introduction</title> <para> - Welcome, gentle reader, to Rusty's Unreliable Guide to Linux + Welcome, gentle reader, to Rusty's Remarkably Unreliable Guide to Linux Kernel Hacking. This document describes the common routines and general requirements for kernel code: its goal is to serve as a primer for Linux kernel development for experienced C @@ -96,13 +95,13 @@ <listitem> <para> - not associated with any process, serving a softirq, tasklet or bh; + not associated with any process, serving a softirq or tasklet; </para> </listitem> <listitem> <para> - running in kernel space, associated with a process; + running in kernel space, associated with a process (user context); </para> </listitem> @@ -114,11 +113,12 @@ </itemizedlist> <para> - There is a strict ordering between these: other than the last - category (userspace) each can only be pre-empted by those above. - For example, while a softirq is running on a CPU, no other - softirq will pre-empt it, but a hardware interrupt can. However, - any other CPUs in the system execute independently. + There is an ordering between these. The bottom two can preempt + each other, but above that is a strict hierarchy: each can only be + preempted by the ones above it. For example, while a softirq is + running on a CPU, no other softirq will preempt it, but a hardware + interrupt can. However, any other CPUs in the system execute + independently. </para> <para> @@ -130,10 +130,10 @@ <title>User Context</title> <para> - User context is when you are coming in from a system call or - other trap: you can sleep, and you own the CPU (except for - interrupts) until you call <function>schedule()</function>. - In other words, user context (unlike userspace) is not pre-emptable. + User context is when you are coming in from a system call or other + trap: like userspace, you can be preempted by more important tasks + and by interrupts. You can sleep, by calling + <function>schedule()</function>. </para> <note> @@ -153,7 +153,7 @@ <caution> <para> - Beware that if you have interrupts or bottom halves disabled + Beware that if you have preemption or softirqs disabled (see below), <function>in_interrupt()</function> will return a false positive. </para> @@ -168,10 +168,10 @@ <hardware>keyboard</hardware> are examples of real hardware which produce interrupts at any time. The kernel runs interrupt handlers, which services the hardware. The kernel - guarantees that this handler is never re-entered: if another + guarantees that this handler is never re-entered: if the same interrupt arrives, it is queued (or dropped). Because it disables interrupts, this handler has to be fast: frequently it - simply acknowledges the interrupt, marks a `software interrupt' + simply acknowledges the interrupt, marks a 'software interrupt' for execution and exits. </para> @@ -188,60 +188,52 @@ </sect1> <sect1 id="basics-softirqs"> - <title>Software Interrupt Context: Bottom Halves, Tasklets, softirqs</title> + <title>Software Interrupt Context: Softirqs and Tasklets</title> <para> Whenever a system call is about to return to userspace, or a - hardware interrupt handler exits, any `software interrupts' + hardware interrupt handler exits, any 'software interrupts' which are marked pending (usually by hardware interrupts) are run (<filename>kernel/softirq.c</filename>). </para> <para> Much of the real interrupt handling work is done here. Early in - the transition to <acronym>SMP</acronym>, there were only `bottom + the transition to <acronym>SMP</acronym>, there were only 'bottom halves' (BHs), which didn't take advantage of multiple CPUs. Shortly after we switched from wind-up computers made of match-sticks and snot, - we abandoned this limitation. + we abandoned this limitation and switched to 'softirqs'. </para> <para> <filename class="headerfile">include/linux/interrupt.h</filename> lists the - different BH's. No matter how many CPUs you have, no two BHs will run at - the same time. This made the transition to SMP simpler, but sucks hard for - scalable performance. A very important bottom half is the timer - BH (<filename class="headerfile">include/linux/timer.h</filename>): you - can register to have it call functions for you in a given length of time. + different softirqs. A very important softirq is the + timer softirq (<filename + class="headerfile">include/linux/timer.h</filename>): you can + register to have it call functions for you in a given length of + time. </para> <para> - 2.3.43 introduced softirqs, and re-implemented the (now - deprecated) BHs underneath them. Softirqs are fully-SMP - versions of BHs: they can run on as many CPUs at once as - required. This means they need to deal with any races in shared - data using their own locks. A bitmask is used to keep track of - which are enabled, so the 32 available softirqs should not be - used up lightly. (<emphasis>Yes</emphasis>, people will - notice). - </para> - - <para> - tasklets (<filename class="headerfile">include/linux/interrupt.h</filename>) - are like softirqs, except they are dynamically-registrable (meaning you - can have as many as you want), and they also guarantee that any tasklet - will only run on one CPU at any time, although different tasklets can - run simultaneously (unlike different BHs). + Softirqs are often a pain to deal with, since the same softirq + will run simultaneously on more than one CPU. For this reason, + tasklets (<filename + class="headerfile">include/linux/interrupt.h</filename>) are more + often used: they are dynamically-registrable (meaning you can have + as many as you want), and they also guarantee that any tasklet + will only run on one CPU at any time, although different tasklets + can run simultaneously. </para> <caution> <para> - The name `tasklet' is misleading: they have nothing to do with `tasks', + The name 'tasklet' is misleading: they have nothing to do with 'tasks', and probably more to do with some bad vodka Alexey Kuznetsov had at the time. </para> </caution> <para> - You can tell you are in a softirq (or bottom half, or tasklet) + You can tell you are in a softirq (or tasklet) using the <function>in_softirq()</function> macro (<filename class="headerfile">include/linux/interrupt.h</filename>). </para> @@ -288,11 +280,10 @@ <term>A rigid stack limit</term> <listitem> <para> - The kernel stack is about 6K in 2.2 (for most - architectures: it's about 14K on the Alpha), and shared - with interrupts so you can't use it all. Avoid deep - recursion and huge local arrays on the stack (allocate - them dynamically instead). + Depending on configuration options the kernel stack is about 3K to 6K for most 32-bit architectures: it's + about 14K on most 64-bit archs, and often shared with interrupts + so you can't use it all. Avoid deep recursion and huge local + arrays on the stack (allocate them dynamically instead). </para> </listitem> </varlistentry> @@ -339,7 +330,7 @@ asmlinkage long sys_mycall(int arg) <para> If all your routine does is read or write some parameter, consider - implementing a <function>sysctl</function> interface instead. + implementing a <function>sysfs</function> interface instead. </para> <para> @@ -417,7 +408,10 @@ cond_resched(); /* Will sleep */ </para> <para> - You will eventually lock up your box if you break these rules. + You should always compile your kernel + <symbol>CONFIG_DEBUG_SPINLOCK_SLEEP</symbol> on, and it will warn + you if you break these rules. If you <emphasis>do</emphasis> break + the rules, you will eventually lock up your box. </para> <para> @@ -515,8 +509,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); success). </para> </caution> - [Yes, this moronic interface makes me cringe. Please submit a - patch and become my hero --RR.] + [Yes, this moronic interface makes me cringe. The flamewar comes up every year or so. --RR.] </para> <para> The functions may sleep implicitly. This should never be called @@ -587,10 +580,11 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); </variablelist> <para> - If you see a <errorname>kmem_grow: Called nonatomically from int - </errorname> warning message you called a memory allocation function - from interrupt context without <constant>GFP_ATOMIC</constant>. - You should really fix that. Run, don't walk. + If you see a <errorname>sleeping function called from invalid + context</errorname> warning message, then maybe you called a + sleeping allocation function from interrupt context without + <constant>GFP_ATOMIC</constant>. You should really fix that. + Run, don't walk. </para> <para> @@ -639,16 +633,16 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); </sect1> <sect1 id="routines-udelay"> - <title><function>udelay()</function>/<function>mdelay()</function> + <title><function>mdelay()</function>/<function>udelay()</function> <filename class="headerfile">include/asm/delay.h</filename> <filename class="headerfile">include/linux/delay.h</filename> </title> <para> - The <function>udelay()</function> function can be used for small pauses. - Do not use large values with <function>udelay()</function> as you risk + The <function>udelay()</function> and <function>ndelay()</function> functions can be used for small pauses. + Do not use large values with them as you risk overflow - the helper function <function>mdelay()</function> is useful - here, or even consider <function>schedule_timeout()</function>. + here, or consider <function>msleep()</function>. </para> </sect1> @@ -698,8 +692,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); These routines disable soft interrupts on the local CPU, and restore them. They are reentrant; if soft interrupts were disabled before, they will still be disabled after this pair - of functions has been called. They prevent softirqs, tasklets - and bottom halves from running on the current CPU. + of functions has been called. They prevent softirqs and tasklets + from running on the current CPU. </para> </sect1> @@ -708,10 +702,16 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); <filename class="headerfile">include/asm/smp.h</filename></title> <para> - <function>smp_processor_id()</function> returns the current - processor number, between 0 and <symbol>NR_CPUS</symbol> (the - maximum number of CPUs supported by Linux, currently 32). These - values are not necessarily continuous. + <function>get_cpu()</function> disables preemption (so you won't + suddenly get moved to another CPU) and returns the current + processor number, between 0 and <symbol>NR_CPUS</symbol>. Note + that the CPU numbers are not necessarily continuous. You return + it again with <function>put_cpu()</function> when you are done. + </para> + <para> + If you know you cannot be preempted by another task (ie. you are + in interrupt context, or have preemption disabled) you can use + smp_processor_id(). </para> </sect1> @@ -722,19 +722,14 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); <para> After boot, the kernel frees up a special section; functions marked with <type>__init</type> and data structures marked with - <type>__initdata</type> are dropped after boot is complete (within - modules this directive is currently ignored). <type>__exit</type> + <type>__initdata</type> are dropped after boot is complete: similarly + modules discard this memory after initialization. <type>__exit</type> is used to declare a function which is only required on exit: the function will be dropped if this file is not compiled as a module. See the header file for use. Note that it makes no sense for a function marked with <type>__init</type> to be exported to modules with <function>EXPORT_SYMBOL()</function> - this will break. </para> - <para> - Static data structures marked as <type>__initdata</type> must be initialised - (as opposed to ordinary static data which is zeroed BSS) and cannot be - <type>const</type>. - </para> </sect1> @@ -762,9 +757,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); <para> The function can return a negative error number to cause module loading to fail (unfortunately, this has no effect if - the module is compiled into the kernel). For modules, this is - called in user context, with interrupts enabled, and the - kernel lock held, so it can sleep. + the module is compiled into the kernel). This function is + called in user context with interrupts enabled, so it can sleep. </para> </sect1> @@ -779,6 +773,34 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); reached zero. This function can also sleep, but cannot fail: everything must be cleaned up by the time it returns. </para> + + <para> + Note that this macro is optional: if it is not present, your + module will not be removable (except for 'rmmod -f'). + </para> + </sect1> + + <sect1 id="routines-module-use-counters"> + <title> <function>try_module_get()</function>/<function>module_put()</function> + <filename class="headerfile">include/linux/module.h</filename></title> + + <para> + These manipulate the module usage count, to protect against + removal (a module also can't be removed if another module uses one + of its exported symbols: see below). Before calling into module + code, you should call <function>try_module_get()</function> on + that module: if it fails, then the module is being removed and you + should act as if it wasn't there. Otherwise, you can safely enter + the module, and call <function>module_put()</function> when you're + finished. + </para> + + <para> + Most registerable structures have an + <structfield>owner</structfield> field, such as in the + <structname>file_operations</structname> structure. Set this field + to the macro <symbol>THIS_MODULE</symbol>. + </para> </sect1> <!-- add info on new-style module refcounting here --> @@ -821,7 +843,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); There is a macro to do this: <function>wait_event_interruptible()</function> - <filename class="headerfile">include/linux/sched.h</filename> The + <filename class="headerfile">include/linux/wait.h</filename> The first argument is the wait queue head, and the second is an expression which is evaluated; the macro returns <returnvalue>0</returnvalue> when this expression is true, or @@ -847,10 +869,11 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); <para> Call <function>wake_up()</function> - <filename class="headerfile">include/linux/sched.h</filename>;, + <filename class="headerfile">include/linux/wait.h</filename>;, which will wake up every process in the queue. The exception is if one has <constant>TASK_EXCLUSIVE</constant> set, in which case - the remainder of the queue will not be woken. + the remainder of the queue will not be woken. There are other variants + of this basic function available in the same header. </para> </sect1> </chapter> @@ -863,7 +886,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); first class of operations work on <type>atomic_t</type> <filename class="headerfile">include/asm/atomic.h</filename>; this - contains a signed integer (at least 24 bits long), and you must use + contains a signed integer (at least 32 bits long), and you must use these functions to manipulate or read atomic_t variables. <function>atomic_read()</function> and <function>atomic_set()</function> get and set the counter, @@ -882,13 +905,12 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); <para> Note that these functions are slower than normal arithmetic, and - so should not be used unnecessarily. On some platforms they - are much slower, like 32-bit Sparc where they use a spinlock. + so should not be used unnecessarily. </para> <para> - The second class of atomic operations is atomic bit operations on a - <type>long</type>, defined in + The second class of atomic operations is atomic bit operations on an + <type>unsigned long</type>, defined in <filename class="headerfile">include/linux/bitops.h</filename>. These operations generally take a pointer to the bit pattern, and a bit @@ -899,7 +921,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); <function>test_and_clear_bit()</function> and <function>test_and_change_bit()</function> do the same thing, except return true if the bit was previously set; these are - particularly useful for very simple locking. + particularly useful for atomically setting flags. </para> <para> @@ -907,12 +929,6 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); than BITS_PER_LONG. The resulting behavior is strange on big-endian platforms though so it is a good idea not to do this. </para> - - <para> - Note that the order of bits depends on the architecture, and in - particular, the bitfield passed to these operations must be at - least as large as a <type>long</type>. - </para> </chapter> <chapter id="symbols"> @@ -932,11 +948,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); <filename class="headerfile">include/linux/module.h</filename></title> <para> - This is the classic method of exporting a symbol, and it works - for both modules and non-modules. In the kernel all these - declarations are often bundled into a single file to help - genksyms (which searches source files for these declarations). - See the comment on genksyms and Makefiles below. + This is the classic method of exporting a symbol: dynamically + loaded modules will be able to use the symbol as normal. </para> </sect1> @@ -949,7 +962,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); symbols exported by <function>EXPORT_SYMBOL_GPL()</function> can only be seen by modules with a <function>MODULE_LICENSE()</function> that specifies a GPL - compatible license. + compatible license. It implies that the function is considered + an internal implementation issue, and not really an interface. </para> </sect1> </chapter> @@ -962,12 +976,13 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); <filename class="headerfile">include/linux/list.h</filename></title> <para> - There are three sets of linked-list routines in the kernel - headers, but this one seems to be winning out (and Linus has - used it). If you don't have some particular pressing need for - a single list, it's a good choice. In fact, I don't care - whether it's a good choice or not, just use it so we can get - rid of the others. + There used to be three sets of linked-list routines in the kernel + headers, but this one is the winner. If you don't have some + particular pressing need for a single list, it's a good choice. + </para> + + <para> + In particular, <function>list_for_each_entry</function> is useful. </para> </sect1> @@ -979,14 +994,13 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress)); convention, and return <returnvalue>0</returnvalue> for success, and a negative error number (eg. <returnvalue>-EFAULT</returnvalue>) for failure. This can be - unintuitive at first, but it's fairly widespread in the networking - code, for example. + unintuitive at first, but it's fairly widespread in the kernel. </para> <para> - The filesystem code uses <function>ERR_PTR()</function> + Using <function>ERR_PTR()</function> - <filename class="headerfile">include/linux/fs.h</filename>; to + <filename class="headerfile">include/linux/err.h</filename>; to encode a negative error number into a pointer, and <function>IS_ERR()</function> and <function>PTR_ERR()</function> to get it back out again: avoids a separate pointer parameter for @@ -1040,7 +1054,7 @@ static struct block_device_operations opt_fops = { supported, due to lack of general use, but the following are considered standard (see the GCC info page section "C Extensions" for more details - Yes, really the info page, the - man page is only a short summary of the stuff in info): + man page is only a short summary of the stuff in info). </para> <itemizedlist> <listitem> @@ -1091,7 +1105,7 @@ static struct block_device_operations opt_fops = { </listitem> <listitem> <para> - Function names as strings (__FUNCTION__) + Function names as strings (__FUNCTION__). </para> </listitem> <listitem> @@ -1164,63 +1178,35 @@ static struct block_device_operations opt_fops = { <listitem> <para> Usually you want a configuration option for your kernel hack. - Edit <filename>Config.in</filename> in the appropriate directory - (but under <filename>arch/</filename> it's called - <filename>config.in</filename>). The Config Language used is not - bash, even though it looks like bash; the safe way is to use only - the constructs that you already see in - <filename>Config.in</filename> files (see - <filename>Documentation/kbuild/kconfig-language.txt</filename>). - It's good to run "make xconfig" at least once to test (because - it's the only one with a static parser). - </para> - - <para> - Variables which can be Y or N use <type>bool</type> followed by a - tagline and the config define name (which must start with - CONFIG_). The <type>tristate</type> function is the same, but - allows the answer M (which defines - <symbol>CONFIG_foo_MODULE</symbol> in your source, instead of - <symbol>CONFIG_FOO</symbol>) if <symbol>CONFIG_MODULES</symbol> - is enabled. + Edit <filename>Kconfig</filename> in the appropriate directory. + The Config language is simple to use by cut and paste, and there's + complete documentation in + <filename>Documentation/kbuild/kconfig-language.txt</filename>. </para> <para> You may well want to make your CONFIG option only visible if <symbol>CONFIG_EXPERIMENTAL</symbol> is enabled: this serves as a warning to users. There many other fancy things you can do: see - the various <filename>Config.in</filename> files for ideas. + the various <filename>Kconfig</filename> files for ideas. </para> - </listitem> - <listitem> <para> - Edit the <filename>Makefile</filename>: the CONFIG variables are - exported here so you can conditionalize compilation with `ifeq'. - If your file exports symbols then add the names to - <varname>export-objs</varname> so that genksyms will find them. - <caution> - <para> - There is a restriction on the kernel build system that objects - which export symbols must have globally unique names. - If your object does not have a globally unique name then the - standard fix is to move the - <function>EXPORT_SYMBOL()</function> statements to their own - object with a unique name. - This is why several systems have separate exporting objects, - usually suffixed with ksyms. - </para> - </caution> + In your description of the option, make sure you address both the + expert user and the user who knows nothing about your feature. Mention + incompatibilities and issues here. <emphasis> Definitely + </emphasis> end your description with <quote> if in doubt, say N + </quote> (or, occasionally, `Y'); this is for people who have no + idea what you are talking about. </para> </listitem> <listitem> <para> - Document your option in Documentation/Configure.help. Mention - incompatibilities and issues here. <emphasis> Definitely - </emphasis> end your description with <quote> if in doubt, say N - </quote> (or, occasionally, `Y'); this is for people who have no - idea what you are talking about. + Edit the <filename>Makefile</filename>: the CONFIG variables are + exported here so you can usually just add a "obj-$(CONFIG_xxx) += + xxx.o" line. The syntax is documented in + <filename>Documentation/kbuild/makefiles.txt</filename>. </para> </listitem> @@ -1253,20 +1239,12 @@ static struct block_device_operations opt_fops = { </para> <para> - <filename>include/linux/brlock.h:</filename> + <filename>include/asm-i386/delay.h:</filename> </para> <programlisting> -extern inline void br_read_lock (enum brlock_indices idx) -{ - /* - * This causes a link-time bug message if an - * invalid index is used: - */ - if (idx >= __BR_END) - __br_lock_usage_bug(); - - read_lock(&__brlock_array[smp_processor_id()][idx]); -} +#define ndelay(n) (__builtin_constant_p(n) ? \ + ((n) > 20000 ? __bad_ndelay() : __const_udelay((n) * 5ul)) : \ + __ndelay(n)) </programlisting> <para> diff --git a/Documentation/DocBook/libata.tmpl b/Documentation/DocBook/libata.tmpl index 375ae760dc1e..d260d92089ad 100644 --- a/Documentation/DocBook/libata.tmpl +++ b/Documentation/DocBook/libata.tmpl @@ -415,6 +415,362 @@ and other resources, etc. </sect1> </chapter> + <chapter id="libataEH"> + <title>Error handling</title> + + <para> + This chapter describes how errors are handled under libata. + Readers are advised to read SCSI EH + (Documentation/scsi/scsi_eh.txt) and ATA exceptions doc first. + </para> + + <sect1><title>Origins of commands</title> + <para> + In libata, a command is represented with struct ata_queued_cmd + or qc. qc's are preallocated during port initialization and + repetitively used for command executions. Currently only one + qc is allocated per port but yet-to-be-merged NCQ branch + allocates one for each tag and maps each qc to NCQ tag 1-to-1. + </para> + <para> + libata commands can originate from two sources - libata itself + and SCSI midlayer. libata internal commands are used for + initialization and error handling. All normal blk requests + and commands for SCSI emulation are passed as SCSI commands + through queuecommand callback of SCSI host template. + </para> + </sect1> + + <sect1><title>How commands are issued</title> + + <variablelist> + + <varlistentry><term>Internal commands</term> + <listitem> + <para> + First, qc is allocated and initialized using + ata_qc_new_init(). Although ata_qc_new_init() doesn't + implement any wait or retry mechanism when qc is not + available, internal commands are currently issued only during + initialization and error recovery, so no other command is + active and allocation is guaranteed to succeed. + </para> + <para> + Once allocated qc's taskfile is initialized for the command to + be executed. qc currently has two mechanisms to notify + completion. One is via qc->complete_fn() callback and the + other is completion qc->waiting. qc->complete_fn() callback + is the asynchronous path used by normal SCSI translated + commands and qc->waiting is the synchronous (issuer sleeps in + process context) path used by internal commands. + </para> + <para> + Once initialization is complete, host_set lock is acquired + and the qc is issued. + </para> + </listitem> + </varlistentry> + + <varlistentry><term>SCSI commands</term> + <listitem> + <para> + All libata drivers use ata_scsi_queuecmd() as + hostt->queuecommand callback. scmds can either be simulated + or translated. No qc is involved in processing a simulated + scmd. The result is computed right away and the scmd is + completed. + </para> + <para> + For a translated scmd, ata_qc_new_init() is invoked to + allocate a qc and the scmd is translated into the qc. SCSI + midlayer's completion notification function pointer is stored + into qc->scsidone. + </para> + <para> + qc->complete_fn() callback is used for completion + notification. ATA commands use ata_scsi_qc_complete() while + ATAPI commands use atapi_qc_complete(). Both functions end up + calling qc->scsidone to notify upper layer when the qc is + finished. After translation is completed, the qc is issued + with ata_qc_issue(). + </para> + <para> + Note that SCSI midlayer invokes hostt->queuecommand while + holding host_set lock, so all above occur while holding + host_set lock. + </para> + </listitem> + </varlistentry> + + </variablelist> + </sect1> + + <sect1><title>How commands are processed</title> + <para> + Depending on which protocol and which controller are used, + commands are processed differently. For the purpose of + discussion, a controller which uses taskfile interface and all + standard callbacks is assumed. + </para> + <para> + Currently 6 ATA command protocols are used. They can be + sorted into the following four categories according to how + they are processed. + </para> + + <variablelist> + <varlistentry><term>ATA NO DATA or DMA</term> + <listitem> + <para> + ATA_PROT_NODATA and ATA_PROT_DMA fall into this category. + These types of commands don't require any software + intervention once issued. Device will raise interrupt on + completion. + </para> + </listitem> + </varlistentry> + + <varlistentry><term>ATA PIO</term> + <listitem> + <para> + ATA_PROT_PIO is in this category. libata currently + implements PIO with polling. ATA_NIEN bit is set to turn + off interrupt and pio_task on ata_wq performs polling and + IO. + </para> + </listitem> + </varlistentry> + + <varlistentry><term>ATAPI NODATA or DMA</term> + <listitem> + <para> + ATA_PROT_ATAPI_NODATA and ATA_PROT_ATAPI_DMA are in this + category. packet_task is used to poll BSY bit after + issuing PACKET command. Once BSY is turned off by the + device, packet_task transfers CDB and hands off processing + to interrupt handler. + </para> + </listitem> + </varlistentry> + + <varlistentry><term>ATAPI PIO</term> + <listitem> + <para> + ATA_PROT_ATAPI is in this category. ATA_NIEN bit is set + and, as in ATAPI NODATA or DMA, packet_task submits cdb. + However, after submitting cdb, further processing (data + transfer) is handed off to pio_task. + </para> + </listitem> + </varlistentry> + </variablelist> + </sect1> + + <sect1><title>How commands are completed</title> + <para> + Once issued, all qc's are either completed with + ata_qc_complete() or time out. For commands which are handled + by interrupts, ata_host_intr() invokes ata_qc_complete(), and, + for PIO tasks, pio_task invokes ata_qc_complete(). In error + cases, packet_task may also complete commands. + </para> + <para> + ata_qc_complete() does the following. + </para> + + <orderedlist> + + <listitem> + <para> + DMA memory is unmapped. + </para> + </listitem> + + <listitem> + <para> + ATA_QCFLAG_ACTIVE is clared from qc->flags. + </para> + </listitem> + + <listitem> + <para> + qc->complete_fn() callback is invoked. If the return value of + the callback is not zero. Completion is short circuited and + ata_qc_complete() returns. + </para> + </listitem> + + <listitem> + <para> + __ata_qc_complete() is called, which does + <orderedlist> + + <listitem> + <para> + qc->flags is cleared to zero. + </para> + </listitem> + + <listitem> + <para> + ap->active_tag and qc->tag are poisoned. + </para> + </listitem> + + <listitem> + <para> + qc->waiting is claread & completed (in that order). + </para> + </listitem> + + <listitem> + <para> + qc is deallocated by clearing appropriate bit in ap->qactive. + </para> + </listitem> + + </orderedlist> + </para> + </listitem> + + </orderedlist> + + <para> + So, it basically notifies upper layer and deallocates qc. One + exception is short-circuit path in #3 which is used by + atapi_qc_complete(). + </para> + <para> + For all non-ATAPI commands, whether it fails or not, almost + the same code path is taken and very little error handling + takes place. A qc is completed with success status if it + succeeded, with failed status otherwise. + </para> + <para> + However, failed ATAPI commands require more handling as + REQUEST SENSE is needed to acquire sense data. If an ATAPI + command fails, ata_qc_complete() is invoked with error status, + which in turn invokes atapi_qc_complete() via + qc->complete_fn() callback. + </para> + <para> + This makes atapi_qc_complete() set scmd->result to + SAM_STAT_CHECK_CONDITION, complete the scmd and return 1. As + the sense data is empty but scmd->result is CHECK CONDITION, + SCSI midlayer will invoke EH for the scmd, and returning 1 + makes ata_qc_complete() to return without deallocating the qc. + This leads us to ata_scsi_error() with partially completed qc. + </para> + + </sect1> + + <sect1><title>ata_scsi_error()</title> + <para> + ata_scsi_error() is the current hostt->eh_strategy_handler() + for libata. As discussed above, this will be entered in two + cases - timeout and ATAPI error completion. This function + calls low level libata driver's eng_timeout() callback, the + standard callback for which is ata_eng_timeout(). It checks + if a qc is active and calls ata_qc_timeout() on the qc if so. + Actual error handling occurs in ata_qc_timeout(). + </para> + <para> + If EH is invoked for timeout, ata_qc_timeout() stops BMDMA and + completes the qc. Note that as we're currently in EH, we + cannot call scsi_done. As described in SCSI EH doc, a + recovered scmd should be either retried with + scsi_queue_insert() or finished with scsi_finish_command(). + Here, we override qc->scsidone with scsi_finish_command() and + calls ata_qc_complete(). + </para> + <para> + If EH is invoked due to a failed ATAPI qc, the qc here is + completed but not deallocated. The purpose of this + half-completion is to use the qc as place holder to make EH + code reach this place. This is a bit hackish, but it works. + </para> + <para> + Once control reaches here, the qc is deallocated by invoking + __ata_qc_complete() explicitly. Then, internal qc for REQUEST + SENSE is issued. Once sense data is acquired, scmd is + finished by directly invoking scsi_finish_command() on the + scmd. Note that as we already have completed and deallocated + the qc which was associated with the scmd, we don't need + to/cannot call ata_qc_complete() again. + </para> + + </sect1> + + <sect1><title>Problems with the current EH</title> + + <itemizedlist> + + <listitem> + <para> + Error representation is too crude. Currently any and all + error conditions are represented with ATA STATUS and ERROR + registers. Errors which aren't ATA device errors are treated + as ATA device errors by setting ATA_ERR bit. Better error + descriptor which can properly represent ATA and other + errors/exceptions is needed. + </para> + </listitem> + + <listitem> + <para> + When handling timeouts, no action is taken to make device + forget about the timed out command and ready for new commands. + </para> + </listitem> + + <listitem> + <para> + EH handling via ata_scsi_error() is not properly protected + from usual command processing. On EH entrance, the device is + not in quiescent state. Timed out commands may succeed or + fail any time. pio_task and atapi_task may still be running. + </para> + </listitem> + + <listitem> + <para> + Too weak error recovery. Devices / controllers causing HSM + mismatch errors and other errors quite often require reset to + return to known state. Also, advanced error handling is + necessary to support features like NCQ and hotplug. + </para> + </listitem> + + <listitem> + <para> + ATA errors are directly handled in the interrupt handler and + PIO errors in pio_task. This is problematic for advanced + error handling for the following reasons. + </para> + <para> + First, advanced error handling often requires context and + internal qc execution. + </para> + <para> + Second, even a simple failure (say, CRC error) needs + information gathering and could trigger complex error handling + (say, resetting & reconfiguring). Having multiple code + paths to gather information, enter EH and trigger actions + makes life painful. + </para> + <para> + Third, scattered EH code makes implementing low level drivers + difficult. Low level drivers override libata callbacks. If + EH is scattered over several places, each affected callbacks + should perform its part of error handling. This can be error + prone and painful. + </para> + </listitem> + + </itemizedlist> + </sect1> + </chapter> + <chapter id="libataExt"> <title>libata Library</title> !Edrivers/scsi/libata-core.c @@ -431,6 +787,722 @@ and other resources, etc. !Idrivers/scsi/libata-scsi.c </chapter> + <chapter id="ataExceptions"> + <title>ATA errors & exceptions</title> + + <para> + This chapter tries to identify what error/exception conditions exist + for ATA/ATAPI devices and describe how they should be handled in + implementation-neutral way. + </para> + + <para> + The term 'error' is used to describe conditions where either an + explicit error condition is reported from device or a command has + timed out. + </para> + + <para> + The term 'exception' is either used to describe exceptional + conditions which are not errors (say, power or hotplug events), or + to describe both errors and non-error exceptional conditions. Where + explicit distinction between error and exception is necessary, the + term 'non-error exception' is used. + </para> + + <sect1 id="excat"> + <title>Exception categories</title> + <para> + Exceptions are described primarily with respect to legacy + taskfile + bus master IDE interface. If a controller provides + other better mechanism for error reporting, mapping those into + categories described below shouldn't be difficult. + </para> + + <para> + In the following sections, two recovery actions - reset and + reconfiguring transport - are mentioned. These are described + further in <xref linkend="exrec"/>. + </para> + + <sect2 id="excatHSMviolation"> + <title>HSM violation</title> + <para> + This error is indicated when STATUS value doesn't match HSM + requirement during issuing or excution any ATA/ATAPI command. + </para> + + <itemizedlist> + <title>Examples</title> + + <listitem> + <para> + ATA_STATUS doesn't contain !BSY && DRDY && !DRQ while trying + to issue a command. + </para> + </listitem> + + <listitem> + <para> + !BSY && !DRQ during PIO data transfer. + </para> + </listitem> + + <listitem> + <para> + DRQ on command completion. + </para> + </listitem> + + <listitem> + <para> + !BSY && ERR after CDB tranfer starts but before the + last byte of CDB is transferred. ATA/ATAPI standard states + that "The device shall not terminate the PACKET command + with an error before the last byte of the command packet has + been written" in the error outputs description of PACKET + command and the state diagram doesn't include such + transitions. + </para> + </listitem> + + </itemizedlist> + + <para> + In these cases, HSM is violated and not much information + regarding the error can be acquired from STATUS or ERROR + register. IOW, this error can be anything - driver bug, + faulty device, controller and/or cable. + </para> + + <para> + As HSM is violated, reset is necessary to restore known state. + Reconfiguring transport for lower speed might be helpful too + as transmission errors sometimes cause this kind of errors. + </para> + </sect2> + + <sect2 id="excatDevErr"> + <title>ATA/ATAPI device error (non-NCQ / non-CHECK CONDITION)</title> + + <para> + These are errors detected and reported by ATA/ATAPI devices + indicating device problems. For this type of errors, STATUS + and ERROR register values are valid and describe error + condition. Note that some of ATA bus errors are detected by + ATA/ATAPI devices and reported using the same mechanism as + device errors. Those cases are described later in this + section. + </para> + + <para> + For ATA commands, this type of errors are indicated by !BSY + && ERR during command execution and on completion. + </para> + + <para>For ATAPI commands,</para> + + <itemizedlist> + + <listitem> + <para> + !BSY && ERR && ABRT right after issuing PACKET + indicates that PACKET command is not supported and falls in + this category. + </para> + </listitem> + + <listitem> + <para> + !BSY && ERR(==CHK) && !ABRT after the last + byte of CDB is transferred indicates CHECK CONDITION and + doesn't fall in this category. + </para> + </listitem> + + <listitem> + <para> + !BSY && ERR(==CHK) && ABRT after the last byte + of CDB is transferred *probably* indicates CHECK CONDITION and + doesn't fall in this category. + </para> + </listitem> + + </itemizedlist> + + <para> + Of errors detected as above, the followings are not ATA/ATAPI + device errors but ATA bus errors and should be handled + according to <xref linkend="excatATAbusErr"/>. + </para> + + <variablelist> + + <varlistentry> + <term>CRC error during data transfer</term> + <listitem> + <para> + This is indicated by ICRC bit in the ERROR register and + means that corruption occurred during data transfer. Upto + ATA/ATAPI-7, the standard specifies that this bit is only + applicable to UDMA transfers but ATA/ATAPI-8 draft revision + 1f says that the bit may be applicable to multiword DMA and + PIO. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term>ABRT error during data transfer or on completion</term> + <listitem> + <para> + Upto ATA/ATAPI-7, the standard specifies that ABRT could be + set on ICRC errors and on cases where a device is not able + to complete a command. Combined with the fact that MWDMA + and PIO transfer errors aren't allowed to use ICRC bit upto + ATA/ATAPI-7, it seems to imply that ABRT bit alone could + indicate tranfer errors. + </para> + <para> + However, ATA/ATAPI-8 draft revision 1f removes the part + that ICRC errors can turn on ABRT. So, this is kind of + gray area. Some heuristics are needed here. + </para> + </listitem> + </varlistentry> + + </variablelist> + + <para> + ATA/ATAPI device errors can be further categorized as follows. + </para> + + <variablelist> + + <varlistentry> + <term>Media errors</term> + <listitem> + <para> + This is indicated by UNC bit in the ERROR register. ATA + devices reports UNC error only after certain number of + retries cannot recover the data, so there's nothing much + else to do other than notifying upper layer. + </para> + <para> + READ and WRITE commands report CHS or LBA of the first + failed sector but ATA/ATAPI standard specifies that the + amount of transferred data on error completion is + indeterminate, so we cannot assume that sectors preceding + the failed sector have been transferred and thus cannot + complete those sectors successfully as SCSI does. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term>Media changed / media change requested error</term> + <listitem> + <para> + <<TODO: fill here>> + </para> + </listitem> + </varlistentry> + + <varlistentry><term>Address error</term> + <listitem> + <para> + This is indicated by IDNF bit in the ERROR register. + Report to upper layer. + </para> + </listitem> + </varlistentry> + + <varlistentry><term>Other errors</term> + <listitem> + <para> + This can be invalid command or parameter indicated by ABRT + ERROR bit or some other error condition. Note that ABRT + bit can indicate a lot of things including ICRC and Address + errors. Heuristics needed. + </para> + </listitem> + </varlistentry> + + </variablelist> + + <para> + Depending on commands, not all STATUS/ERROR bits are + applicable. These non-applicable bits are marked with + "na" in the output descriptions but upto ATA/ATAPI-7 + no definition of "na" can be found. However, + ATA/ATAPI-8 draft revision 1f describes "N/A" as + follows. + </para> + + <blockquote> + <variablelist> + <varlistentry><term>3.2.3.3a N/A</term> + <listitem> + <para> + A keyword the indicates a field has no defined value in + this standard and should not be checked by the host or + device. N/A fields should be cleared to zero. + </para> + </listitem> + </varlistentry> + </variablelist> + </blockquote> + + <para> + So, it seems reasonable to assume that "na" bits are + cleared to zero by devices and thus need no explicit masking. + </para> + + </sect2> + + <sect2 id="excatATAPIcc"> + <title>ATAPI device CHECK CONDITION</title> + + <para> + ATAPI device CHECK CONDITION error is indicated by set CHK bit + (ERR bit) in the STATUS register after the last byte of CDB is + transferred for a PACKET command. For this kind of errors, + sense data should be acquired to gather information regarding + the errors. REQUEST SENSE packet command should be used to + acquire sense data. + </para> + + <para> + Once sense data is acquired, this type of errors can be + handled similary to other SCSI errors. Note that sense data + may indicate ATA bus error (e.g. Sense Key 04h HARDWARE ERROR + && ASC/ASCQ 47h/00h SCSI PARITY ERROR). In such + cases, the error should be considered as an ATA bus error and + handled according to <xref linkend="excatATAbusErr"/>. + </para> + + </sect2> + + <sect2 id="excatNCQerr"> + <title>ATA device error (NCQ)</title> + + <para> + NCQ command error is indicated by cleared BSY and set ERR bit + during NCQ command phase (one or more NCQ commands + outstanding). Although STATUS and ERROR registers will + contain valid values describing the error, READ LOG EXT is + required to clear the error condition, determine which command + has failed and acquire more information. + </para> + + <para> + READ LOG EXT Log Page 10h reports which tag has failed and + taskfile register values describing the error. With this + information the failed command can be handled as a normal ATA + command error as in <xref linkend="excatDevErr"/> and all + other in-flight commands must be retried. Note that this + retry should not be counted - it's likely that commands + retried this way would have completed normally if it were not + for the failed command. + </para> + + <para> + Note that ATA bus errors can be reported as ATA device NCQ + errors. This should be handled as described in <xref + linkend="excatATAbusErr"/>. + </para> + + <para> + If READ LOG EXT Log Page 10h fails or reports NQ, we're + thoroughly screwed. This condition should be treated + according to <xref linkend="excatHSMviolation"/>. + </para> + + </sect2> + + <sect2 id="excatATAbusErr"> + <title>ATA bus error</title> + + <para> + ATA bus error means that data corruption occurred during + transmission over ATA bus (SATA or PATA). This type of errors + can be indicated by + </para> + + <itemizedlist> + + <listitem> + <para> + ICRC or ABRT error as described in <xref linkend="excatDevErr"/>. + </para> + </listitem> + + <listitem> + <para> + Controller-specific error completion with error information + indicating transmission error. + </para> + </listitem> + + <listitem> + <para> + On some controllers, command timeout. In this case, there may + be a mechanism to determine that the timeout is due to + transmission error. + </para> + </listitem> + + <listitem> + <para> + Unknown/random errors, timeouts and all sorts of weirdities. + </para> + </listitem> + + </itemizedlist> + + <para> + As described above, transmission errors can cause wide variety + of symptoms ranging from device ICRC error to random device + lockup, and, for many cases, there is no way to tell if an + error condition is due to transmission error or not; + therefore, it's necessary to employ some kind of heuristic + when dealing with errors and timeouts. For example, + encountering repetitive ABRT errors for known supported + command is likely to indicate ATA bus error. + </para> + + <para> + Once it's determined that ATA bus errors have possibly + occurred, lowering ATA bus transmission speed is one of + actions which may alleviate the problem. See <xref + linkend="exrecReconf"/> for more information. + </para> + + </sect2> + + <sect2 id="excatPCIbusErr"> + <title>PCI bus error</title> + + <para> + Data corruption or other failures during transmission over PCI + (or other system bus). For standard BMDMA, this is indicated + by Error bit in the BMDMA Status register. This type of + errors must be logged as it indicates something is very wrong + with the system. Resetting host controller is recommended. + </para> + + </sect2> + + <sect2 id="excatLateCompletion"> + <title>Late completion</title> + + <para> + This occurs when timeout occurs and the timeout handler finds + out that the timed out command has completed successfully or + with error. This is usually caused by lost interrupts. This + type of errors must be logged. Resetting host controller is + recommended. + </para> + + </sect2> + + <sect2 id="excatUnknown"> + <title>Unknown error (timeout)</title> + + <para> + This is when timeout occurs and the command is still + processing or the host and device are in unknown state. When + this occurs, HSM could be in any valid or invalid state. To + bring the device to known state and make it forget about the + timed out command, resetting is necessary. The timed out + command may be retried. + </para> + + <para> + Timeouts can also be caused by transmission errors. Refer to + <xref linkend="excatATAbusErr"/> for more details. + </para> + + </sect2> + + <sect2 id="excatHoplugPM"> + <title>Hotplug and power management exceptions</title> + + <para> + <<TODO: fill here>> + </para> + + </sect2> + + </sect1> + + <sect1 id="exrec"> + <title>EH recovery actions</title> + + <para> + This section discusses several important recovery actions. + </para> + + <sect2 id="exrecClr"> + <title>Clearing error condition</title> + + <para> + Many controllers require its error registers to be cleared by + error handler. Different controllers may have different + requirements. + </para> + + <para> + For SATA, it's strongly recommended to clear at least SError + register during error handling. + </para> + </sect2> + + <sect2 id="exrecRst"> + <title>Reset</title> + + <para> + During EH, resetting is necessary in the following cases. + </para> + + <itemizedlist> + + <listitem> + <para> + HSM is in unknown or invalid state + </para> + </listitem> + + <listitem> + <para> + HBA is in unknown or invalid state + </para> + </listitem> + + <listitem> + <para> + EH needs to make HBA/device forget about in-flight commands + </para> + </listitem> + + <listitem> + <para> + HBA/device behaves weirdly + </para> + </listitem> + + </itemizedlist> + + <para> + Resetting during EH might be a good idea regardless of error + condition to improve EH robustness. Whether to reset both or + either one of HBA and device depends on situation but the + following scheme is recommended. + </para> + + <itemizedlist> + + <listitem> + <para> + When it's known that HBA is in ready state but ATA/ATAPI + device in in unknown state, reset only device. + </para> + </listitem> + + <listitem> + <para> + If HBA is in unknown state, reset both HBA and device. + </para> + </listitem> + + </itemizedlist> + + <para> + HBA resetting is implementation specific. For a controller + complying to taskfile/BMDMA PCI IDE, stopping active DMA + transaction may be sufficient iff BMDMA state is the only HBA + context. But even mostly taskfile/BMDMA PCI IDE complying + controllers may have implementation specific requirements and + mechanism to reset themselves. This must be addressed by + specific drivers. + </para> + + <para> + OTOH, ATA/ATAPI standard describes in detail ways to reset + ATA/ATAPI devices. + </para> + + <variablelist> + + <varlistentry><term>PATA hardware reset</term> + <listitem> + <para> + This is hardware initiated device reset signalled with + asserted PATA RESET- signal. There is no standard way to + initiate hardware reset from software although some + hardware provides registers that allow driver to directly + tweak the RESET- signal. + </para> + </listitem> + </varlistentry> + + <varlistentry><term>Software reset</term> + <listitem> + <para> + This is achieved by turning CONTROL SRST bit on for at + least 5us. Both PATA and SATA support it but, in case of + SATA, this may require controller-specific support as the + second Register FIS to clear SRST should be transmitted + while BSY bit is still set. Note that on PATA, this resets + both master and slave devices on a channel. + </para> + </listitem> + </varlistentry> + + <varlistentry><term>EXECUTE DEVICE DIAGNOSTIC command</term> + <listitem> + <para> + Although ATA/ATAPI standard doesn't describe exactly, EDD + implies some level of resetting, possibly similar level + with software reset. Host-side EDD protocol can be handled + with normal command processing and most SATA controllers + should be able to handle EDD's just like other commands. + As in software reset, EDD affects both devices on a PATA + bus. + </para> + <para> + Although EDD does reset devices, this doesn't suit error + handling as EDD cannot be issued while BSY is set and it's + unclear how it will act when device is in unknown/weird + state. + </para> + </listitem> + </varlistentry> + + <varlistentry><term>ATAPI DEVICE RESET command</term> + <listitem> + <para> + This is very similar to software reset except that reset + can be restricted to the selected device without affecting + the other device sharing the cable. + </para> + </listitem> + </varlistentry> + + <varlistentry><term>SATA phy reset</term> + <listitem> + <para> + This is the preferred way of resetting a SATA device. In + effect, it's identical to PATA hardware reset. Note that + this can be done with the standard SCR Control register. + As such, it's usually easier to implement than software + reset. + </para> + </listitem> + </varlistentry> + + </variablelist> + + <para> + One more thing to consider when resetting devices is that + resetting clears certain configuration parameters and they + need to be set to their previous or newly adjusted values + after reset. + </para> + + <para> + Parameters affected are. + </para> + + <itemizedlist> + + <listitem> + <para> + CHS set up with INITIALIZE DEVICE PARAMETERS (seldomly used) + </para> + </listitem> + + <listitem> + <para> + Parameters set with SET FEATURES including transfer mode setting + </para> + </listitem> + + <listitem> + <para> + Block count set with SET MULTIPLE MODE + </para> + </listitem> + + <listitem> + <para> + Other parameters (SET MAX, MEDIA LOCK...) + </para> + </listitem> + + </itemizedlist> + + <para> + ATA/ATAPI standard specifies that some parameters must be + maintained across hardware or software reset, but doesn't + strictly specify all of them. Always reconfiguring needed + parameters after reset is required for robustness. Note that + this also applies when resuming from deep sleep (power-off). + </para> + + <para> + Also, ATA/ATAPI standard requires that IDENTIFY DEVICE / + IDENTIFY PACKET DEVICE is issued after any configuration + parameter is updated or a hardware reset and the result used + for further operation. OS driver is required to implement + revalidation mechanism to support this. + </para> + + </sect2> + + <sect2 id="exrecReconf"> + <title>Reconfigure transport</title> + + <para> + For both PATA and SATA, a lot of corners are cut for cheap + connectors, cables or controllers and it's quite common to see + high transmission error rate. This can be mitigated by + lowering transmission speed. + </para> + + <para> + The following is a possible scheme Jeff Garzik suggested. + </para> + + <blockquote> + <para> + If more than $N (3?) transmission errors happen in 15 minutes, + </para> + <itemizedlist> + <listitem> + <para> + if SATA, decrease SATA PHY speed. if speed cannot be decreased, + </para> + </listitem> + <listitem> + <para> + decrease UDMA xfer speed. if at UDMA0, switch to PIO4, + </para> + </listitem> + <listitem> + <para> + decrease PIO xfer speed. if at PIO3, complain, but continue + </para> + </listitem> + </itemizedlist> + </blockquote> + + </sect2> + + </sect1> + + </chapter> + <chapter id="PiixInt"> <title>ata_piix Internals</title> !Idrivers/scsi/ata_piix.c diff --git a/Documentation/DocBook/mcabook.tmpl b/Documentation/DocBook/mcabook.tmpl index 4367f4642f3d..42a760cd7467 100644 --- a/Documentation/DocBook/mcabook.tmpl +++ b/Documentation/DocBook/mcabook.tmpl @@ -96,7 +96,7 @@ <chapter id="pubfunctions"> <title>Public Functions Provided</title> -!Earch/i386/kernel/mca.c +!Edrivers/mca/mca-legacy.c </chapter> <chapter id="dmafunctions"> diff --git a/Documentation/DocBook/usb.tmpl b/Documentation/DocBook/usb.tmpl index f3ef0bf435e9..15ce0f21e5e0 100644 --- a/Documentation/DocBook/usb.tmpl +++ b/Documentation/DocBook/usb.tmpl @@ -291,7 +291,7 @@ !Edrivers/usb/core/hcd.c !Edrivers/usb/core/hcd-pci.c -!Edrivers/usb/core/buffer.c +!Idrivers/usb/core/buffer.c </chapter> <chapter> @@ -841,7 +841,7 @@ usbdev_ioctl (int fd, int ifno, unsigned request, void *param) File modification time is not updated by this request. </para><para> Those struct members are from some interface descriptor - applying to the the current configuration. + applying to the current configuration. The interface number is the bInterfaceNumber value, and the altsetting number is the bAlternateSetting value. (This resets each endpoint in the interface.) diff --git a/Documentation/DocBook/writing_usb_driver.tmpl b/Documentation/DocBook/writing_usb_driver.tmpl index 51f3bfb6fb6e..008a341234d0 100644 --- a/Documentation/DocBook/writing_usb_driver.tmpl +++ b/Documentation/DocBook/writing_usb_driver.tmpl @@ -345,8 +345,7 @@ if (!retval) { <programlisting> static inline void skel_delete (struct usb_skel *dev) { - if (dev->bulk_in_buffer != NULL) - kfree (dev->bulk_in_buffer); + kfree (dev->bulk_in_buffer); if (dev->bulk_out_buffer != NULL) usb_buffer_free (dev->udev, dev->bulk_out_size, dev->bulk_out_buffer, diff --git a/Documentation/IPMI.txt b/Documentation/IPMI.txt index 84d3d4d10c17..bf1cf98d2a27 100644 --- a/Documentation/IPMI.txt +++ b/Documentation/IPMI.txt @@ -605,12 +605,13 @@ is in the ipmi_poweroff module. When the system requests a powerdown, it will send the proper IPMI commands to do this. This is supported on several platforms. -There is a module parameter named "poweroff_control" that may either be zero -(do a power down) or 2 (do a power cycle, power the system off, then power -it on in a few seconds). Setting ipmi_poweroff.poweroff_control=x will do -the same thing on the kernel command line. The parameter is also available -via the proc filesystem in /proc/ipmi/poweroff_control. Note that if the -system does not support power cycling, it will always to the power off. +There is a module parameter named "poweroff_powercycle" that may +either be zero (do a power down) or non-zero (do a power cycle, power +the system off, then power it on in a few seconds). Setting +ipmi_poweroff.poweroff_control=x will do the same thing on the kernel +command line. The parameter is also available via the proc filesystem +in /proc/sys/dev/ipmi/poweroff_powercycle. Note that if the system +does not support power cycling, it will always do the power off. Note that if you have ACPI enabled, the system will prefer using ACPI to power off. diff --git a/Documentation/MSI-HOWTO.txt b/Documentation/MSI-HOWTO.txt index d5032eb480aa..63edc5f847c4 100644 --- a/Documentation/MSI-HOWTO.txt +++ b/Documentation/MSI-HOWTO.txt @@ -430,7 +430,7 @@ which may result in system hang. The software driver of specific MSI-capable hardware is responsible for whether calling pci_enable_msi or not. A return of zero indicates the kernel successfully initializes the MSI/MSI-X capability structure of the -device funtion. The device function is now running on MSI/MSI-X mode. +device function. The device function is now running on MSI/MSI-X mode. 5.6 How to tell whether MSI/MSI-X is enabled on device function diff --git a/Documentation/RCU/NMI-RCU.txt b/Documentation/RCU/NMI-RCU.txt new file mode 100644 index 000000000000..d0634a5c3445 --- /dev/null +++ b/Documentation/RCU/NMI-RCU.txt @@ -0,0 +1,112 @@ +Using RCU to Protect Dynamic NMI Handlers + + +Although RCU is usually used to protect read-mostly data structures, +it is possible to use RCU to provide dynamic non-maskable interrupt +handlers, as well as dynamic irq handlers. This document describes +how to do this, drawing loosely from Zwane Mwaikambo's NMI-timer +work in "arch/i386/oprofile/nmi_timer_int.c" and in +"arch/i386/kernel/traps.c". + +The relevant pieces of code are listed below, each followed by a +brief explanation. + + static int dummy_nmi_callback(struct pt_regs *regs, int cpu) + { + return 0; + } + +The dummy_nmi_callback() function is a "dummy" NMI handler that does +nothing, but returns zero, thus saying that it did nothing, allowing +the NMI handler to take the default machine-specific action. + + static nmi_callback_t nmi_callback = dummy_nmi_callback; + +This nmi_callback variable is a global function pointer to the current +NMI handler. + + fastcall void do_nmi(struct pt_regs * regs, long error_code) + { + int cpu; + + nmi_enter(); + + cpu = smp_processor_id(); + ++nmi_count(cpu); + + if (!rcu_dereference(nmi_callback)(regs, cpu)) + default_do_nmi(regs); + + nmi_exit(); + } + +The do_nmi() function processes each NMI. It first disables preemption +in the same way that a hardware irq would, then increments the per-CPU +count of NMIs. It then invokes the NMI handler stored in the nmi_callback +function pointer. If this handler returns zero, do_nmi() invokes the +default_do_nmi() function to handle a machine-specific NMI. Finally, +preemption is restored. + +Strictly speaking, rcu_dereference() is not needed, since this code runs +only on i386, which does not need rcu_dereference() anyway. However, +it is a good documentation aid, particularly for anyone attempting to +do something similar on Alpha. + +Quick Quiz: Why might the rcu_dereference() be necessary on Alpha, + given that the code referenced by the pointer is read-only? + + +Back to the discussion of NMI and RCU... + + void set_nmi_callback(nmi_callback_t callback) + { + rcu_assign_pointer(nmi_callback, callback); + } + +The set_nmi_callback() function registers an NMI handler. Note that any +data that is to be used by the callback must be initialized up -before- +the call to set_nmi_callback(). On architectures that do not order +writes, the rcu_assign_pointer() ensures that the NMI handler sees the +initialized values. + + void unset_nmi_callback(void) + { + rcu_assign_pointer(nmi_callback, dummy_nmi_callback); + } + +This function unregisters an NMI handler, restoring the original +dummy_nmi_handler(). However, there may well be an NMI handler +currently executing on some other CPU. We therefore cannot free +up any data structures used by the old NMI handler until execution +of it completes on all other CPUs. + +One way to accomplish this is via synchronize_sched(), perhaps as +follows: + + unset_nmi_callback(); + synchronize_sched(); + kfree(my_nmi_data); + +This works because synchronize_sched() blocks until all CPUs complete +any preemption-disabled segments of code that they were executing. +Since NMI handlers disable preemption, synchronize_sched() is guaranteed +not to return until all ongoing NMI handlers exit. It is therefore safe +to free up the handler's data as soon as synchronize_sched() returns. + + +Answer to Quick Quiz + + Why might the rcu_dereference() be necessary on Alpha, given + that the code referenced by the pointer is read-only? + + Answer: The caller to set_nmi_callback() might well have + initialized some data that is to be used by the + new NMI handler. In this case, the rcu_dereference() + would be needed, because otherwise a CPU that received + an NMI just after the new handler was set might see + the pointer to the new NMI handler, but the old + pre-initialized version of the handler's data. + + More important, the rcu_dereference() makes it clear + to someone reading the code that the pointer is being + protected by RCU. diff --git a/Documentation/RCU/RTFP.txt b/Documentation/RCU/RTFP.txt index 9c6d450138ea..fcbcbc35b122 100644 --- a/Documentation/RCU/RTFP.txt +++ b/Documentation/RCU/RTFP.txt @@ -2,7 +2,8 @@ Read the F-ing Papers! This document describes RCU-related publications, and is followed by -the corresponding bibtex entries. +the corresponding bibtex entries. A number of the publications may +be found at http://www.rdrop.com/users/paulmck/RCU/. The first thing resembling RCU was published in 1980, when Kung and Lehman [Kung80] recommended use of a garbage collector to defer destruction @@ -113,6 +114,10 @@ describing how to make RCU safe for soft-realtime applications [Sarma04c], and a paper describing SELinux performance with RCU [JamesMorris04b]. +2005 has seen further adaptation of RCU to realtime use, permitting +preemption of RCU realtime critical sections [PaulMcKenney05a, +PaulMcKenney05b]. + Bibtex Entries @article{Kung80 @@ -410,3 +415,32 @@ Oregon Health and Sciences University" \url{http://www.livejournal.com/users/james_morris/2153.html} [Viewed December 10, 2004]" } + +@unpublished{PaulMcKenney05a +,Author="Paul E. McKenney" +,Title="{[RFC]} {RCU} and {CONFIG\_PREEMPT\_RT} progress" +,month="May" +,year="2005" +,note="Available: +\url{http://lkml.org/lkml/2005/5/9/185} +[Viewed May 13, 2005]" +,annotation=" + First publication of working lock-based deferred free patches + for the CONFIG_PREEMPT_RT environment. +" +} + +@conference{PaulMcKenney05b +,Author="Paul E. McKenney and Dipankar Sarma" +,Title="Towards Hard Realtime Response from the Linux Kernel on SMP Hardware" +,Booktitle="linux.conf.au 2005" +,month="April" +,year="2005" +,address="Canberra, Australia" +,note="Available: +\url{http://www.rdrop.com/users/paulmck/RCU/realtimeRCU.2005.04.23a.pdf} +[Viewed May 13, 2005]" +,annotation=" + Realtime turns into making RCU yet more realtime friendly. +" +} diff --git a/Documentation/RCU/UP.txt b/Documentation/RCU/UP.txt index 3bfb84b3b7db..aab4a9ec3931 100644 --- a/Documentation/RCU/UP.txt +++ b/Documentation/RCU/UP.txt @@ -8,7 +8,7 @@ is that since there is only one CPU, it should not be necessary to wait for anything else to get done, since there are no other CPUs for anything else to be happening on. Although this approach will -sort- -of- work a surprising amount of the time, it is a very bad idea in general. -This document presents two examples that demonstrate exactly how bad an +This document presents three examples that demonstrate exactly how bad an idea this is. @@ -26,6 +26,9 @@ from softirq, the list scan would find itself referencing a newly freed element B. This situation can greatly decrease the life expectancy of your kernel. +This same problem can occur if call_rcu() is invoked from a hardware +interrupt handler. + Example 2: Function-Call Fatality @@ -44,8 +47,37 @@ its arguments would cause it to fail to make the fundamental guarantee underlying RCU, namely that call_rcu() defers invoking its arguments until all RCU read-side critical sections currently executing have completed. -Quick Quiz: why is it -not- legal to invoke synchronize_rcu() in -this case? +Quick Quiz #1: why is it -not- legal to invoke synchronize_rcu() in + this case? + + +Example 3: Death by Deadlock + +Suppose that call_rcu() is invoked while holding a lock, and that the +callback function must acquire this same lock. In this case, if +call_rcu() were to directly invoke the callback, the result would +be self-deadlock. + +In some cases, it would possible to restructure to code so that +the call_rcu() is delayed until after the lock is released. However, +there are cases where this can be quite ugly: + +1. If a number of items need to be passed to call_rcu() within + the same critical section, then the code would need to create + a list of them, then traverse the list once the lock was + released. + +2. In some cases, the lock will be held across some kernel API, + so that delaying the call_rcu() until the lock is released + requires that the data item be passed up via a common API. + It is far better to guarantee that callbacks are invoked + with no locks held than to have to modify such APIs to allow + arbitrary data items to be passed back up through them. + +If call_rcu() directly invokes the callback, painful locking restrictions +or API changes would be required. + +Quick Quiz #2: What locking restriction must RCU callbacks respect? Summary @@ -53,12 +85,35 @@ Summary Permitting call_rcu() to immediately invoke its arguments or permitting synchronize_rcu() to immediately return breaks RCU, even on a UP system. So do not do it! Even on a UP system, the RCU infrastructure -must- -respect grace periods. - - -Answer to Quick Quiz - -The calling function is scanning an RCU-protected linked list, and -is therefore within an RCU read-side critical section. Therefore, -the called function has been invoked within an RCU read-side critical -section, and is not permitted to block. +respect grace periods, and -must- invoke callbacks from a known environment +in which no locks are held. + + +Answer to Quick Quiz #1: + Why is it -not- legal to invoke synchronize_rcu() in this case? + + Because the calling function is scanning an RCU-protected linked + list, and is therefore within an RCU read-side critical section. + Therefore, the called function has been invoked within an RCU + read-side critical section, and is not permitted to block. + +Answer to Quick Quiz #2: + What locking restriction must RCU callbacks respect? + + Any lock that is acquired within an RCU callback must be + acquired elsewhere using an _irq variant of the spinlock + primitive. For example, if "mylock" is acquired by an + RCU callback, then a process-context acquisition of this + lock must use something like spin_lock_irqsave() to + acquire the lock. + + If the process-context code were to simply use spin_lock(), + then, since RCU callbacks can be invoked from softirq context, + the callback might be called from a softirq that interrupted + the process-context critical section. This would result in + self-deadlock. + + This restriction might seem gratuitous, since very few RCU + callbacks acquire locks directly. However, a great many RCU + callbacks do acquire locks -indirectly-, for example, via + the kfree() primitive. diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt index 8f3fb77c9cd3..e118a7c1a092 100644 --- a/Documentation/RCU/checklist.txt +++ b/Documentation/RCU/checklist.txt @@ -43,6 +43,10 @@ over a rather long period of time, but improvements are always welcome! rcu_read_lock_bh()) in the read-side critical sections, and are also an excellent aid to readability. + As a rough rule of thumb, any dereference of an RCU-protected + pointer must be covered by rcu_read_lock() or rcu_read_lock_bh() + or by the appropriate update-side lock. + 3. Does the update code tolerate concurrent accesses? The whole point of RCU is to permit readers to run without @@ -90,7 +94,11 @@ over a rather long period of time, but improvements are always welcome! The rcu_dereference() primitive is used by the various "_rcu()" list-traversal primitives, such as the - list_for_each_entry_rcu(). + list_for_each_entry_rcu(). Note that it is perfectly + legal (if redundant) for update-side code to use + rcu_dereference() and the "_rcu()" list-traversal + primitives. This is particularly useful in code + that is common to readers and updaters. b. If the list macros are being used, the list_add_tail_rcu() and list_add_rcu() primitives must be used in order @@ -150,16 +158,9 @@ over a rather long period of time, but improvements are always welcome! Use of the _rcu() list-traversal primitives outside of an RCU read-side critical section causes no harm other than - a slight performance degradation on Alpha CPUs and some - confusion on the part of people trying to read the code. - - Another way of thinking of this is "If you are holding the - lock that prevents the data structure from changing, why do - you also need RCU-based protection?" That said, there may - well be situations where use of the _rcu() list-traversal - primitives while the update-side lock is held results in - simpler and more maintainable code. The jury is still out - on this question. + a slight performance degradation on Alpha CPUs. It can + also be quite helpful in reducing code bloat when common + code is shared between readers and updaters. 10. Conversely, if you are in an RCU read-side critical section, you -must- use the "_rcu()" variants of the list macros. diff --git a/Documentation/RCU/rcu.txt b/Documentation/RCU/rcu.txt index eb444006683e..6fa092251586 100644 --- a/Documentation/RCU/rcu.txt +++ b/Documentation/RCU/rcu.txt @@ -64,6 +64,54 @@ o I hear that RCU is patented? What is with that? Of these, one was allowed to lapse by the assignee, and the others have been contributed to the Linux kernel under GPL. +o I hear that RCU needs work in order to support realtime kernels? + + Yes, work in progress. + o Where can I find more information on RCU? See the RTFP.txt file in this directory. + Or point your browser at http://www.rdrop.com/users/paulmck/RCU/. + +o What are all these files in this directory? + + + NMI-RCU.txt + + Describes how to use RCU to implement dynamic + NMI handlers, which can be revectored on the fly, + without rebooting. + + RTFP.txt + + List of RCU-related publications and web sites. + + UP.txt + + Discussion of RCU usage in UP kernels. + + arrayRCU.txt + + Describes how to use RCU to protect arrays, with + resizeable arrays whose elements reference other + data structures being of the most interest. + + checklist.txt + + Lists things to check for when inspecting code that + uses RCU. + + listRCU.txt + + Describes how to use RCU to protect linked lists. + This is the simplest and most common use of RCU + in the Linux kernel. + + rcu.txt + + You are reading it! + + whatisRCU.txt + + Overview of how the RCU implementation works. Along + the way, presents a conceptual view of RCU. diff --git a/Documentation/RCU/rcuref.txt b/Documentation/RCU/rcuref.txt new file mode 100644 index 000000000000..a23fee66064d --- /dev/null +++ b/Documentation/RCU/rcuref.txt @@ -0,0 +1,74 @@ +Refcounter framework for elements of lists/arrays protected by +RCU. + +Refcounting on elements of lists which are protected by traditional +reader/writer spinlocks or semaphores are straight forward as in: + +1. 2. +add() search_and_reference() +{ { + alloc_object read_lock(&list_lock); + ... search_for_element + atomic_set(&el->rc, 1); atomic_inc(&el->rc); + write_lock(&list_lock); ... + add_element read_unlock(&list_lock); + ... ... + write_unlock(&list_lock); } +} + +3. 4. +release_referenced() delete() +{ { + ... write_lock(&list_lock); + atomic_dec(&el->rc, relfunc) ... + ... delete_element +} write_unlock(&list_lock); + ... + if (atomic_dec_and_test(&el->rc)) + kfree(el); + ... + } + +If this list/array is made lock free using rcu as in changing the +write_lock in add() and delete() to spin_lock and changing read_lock +in search_and_reference to rcu_read_lock(), the rcuref_get in +search_and_reference could potentially hold reference to an element which +has already been deleted from the list/array. rcuref_lf_get_rcu takes +care of this scenario. search_and_reference should look as; + +1. 2. +add() search_and_reference() +{ { + alloc_object rcu_read_lock(); + ... search_for_element + atomic_set(&el->rc, 1); if (rcuref_inc_lf(&el->rc)) { + write_lock(&list_lock); rcu_read_unlock(); + return FAIL; + add_element } + ... ... + write_unlock(&list_lock); rcu_read_unlock(); +} } +3. 4. +release_referenced() delete() +{ { + ... write_lock(&list_lock); + rcuref_dec(&el->rc, relfunc) ... + ... delete_element +} write_unlock(&list_lock); + ... + if (rcuref_dec_and_test(&el->rc)) + call_rcu(&el->head, el_free); + ... + } + +Sometimes, reference to the element need to be obtained in the +update (write) stream. In such cases, rcuref_inc_lf might be an overkill +since the spinlock serialising list updates are held. rcuref_inc +is to be used in such cases. +For arches which do not have cmpxchg rcuref_inc_lf +api uses a hashed spinlock implementation and the same hashed spinlock +is acquired in all rcuref_xxx primitives to preserve atomicity. +Note: Use rcuref_inc api only if you need to use rcuref_inc_lf on the +refcounter atleast at one place. Mixing rcuref_inc and atomic_xxx api +might lead to races. rcuref_inc_lf() must be used in lockfree +RCU critical sections only. diff --git a/Documentation/RCU/torture.txt b/Documentation/RCU/torture.txt new file mode 100644 index 000000000000..e4c38152f7f7 --- /dev/null +++ b/Documentation/RCU/torture.txt @@ -0,0 +1,122 @@ +RCU Torture Test Operation + + +CONFIG_RCU_TORTURE_TEST + +The CONFIG_RCU_TORTURE_TEST config option is available for all RCU +implementations. It creates an rcutorture kernel module that can +be loaded to run a torture test. The test periodically outputs +status messages via printk(), which can be examined via the dmesg +command (perhaps grepping for "rcutorture"). The test is started +when the module is loaded, and stops when the module is unloaded. + +However, actually setting this config option to "y" results in the system +running the test immediately upon boot, and ending only when the system +is taken down. Normally, one will instead want to build the system +with CONFIG_RCU_TORTURE_TEST=m and to use modprobe and rmmod to control +the test, perhaps using a script similar to the one shown at the end of +this document. Note that you will need CONFIG_MODULE_UNLOAD in order +to be able to end the test. + + +MODULE PARAMETERS + +This module has the following parameters: + +nreaders This is the number of RCU reading threads supported. + The default is twice the number of CPUs. Why twice? + To properly exercise RCU implementations with preemptible + read-side critical sections. + +stat_interval The number of seconds between output of torture + statistics (via printk()). Regardless of the interval, + statistics are printed when the module is unloaded. + Setting the interval to zero causes the statistics to + be printed -only- when the module is unloaded, and this + is the default. + +verbose Enable debug printk()s. Default is disabled. + + +OUTPUT + +The statistics output is as follows: + + rcutorture: --- Start of test: nreaders=16 stat_interval=0 verbose=0 + rcutorture: rtc: 0000000000000000 ver: 1916 tfle: 0 rta: 1916 rtaf: 0 rtf: 1915 + rcutorture: Reader Pipe: 1466408 9747 0 0 0 0 0 0 0 0 0 + rcutorture: Reader Batch: 1464477 11678 0 0 0 0 0 0 0 0 + rcutorture: Free-Block Circulation: 1915 1915 1915 1915 1915 1915 1915 1915 1915 1915 0 + rcutorture: --- End of test + +The command "dmesg | grep rcutorture:" will extract this information on +most systems. On more esoteric configurations, it may be necessary to +use other commands to access the output of the printk()s used by +the RCU torture test. The printk()s use KERN_ALERT, so they should +be evident. ;-) + +The entries are as follows: + +o "ggp": The number of counter flips (or batches) since boot. + +o "rtc": The hexadecimal address of the structure currently visible + to readers. + +o "ver": The number of times since boot that the rcutw writer task + has changed the structure visible to readers. + +o "tfle": If non-zero, indicates that the "torture freelist" + containing structure to be placed into the "rtc" area is empty. + This condition is important, since it can fool you into thinking + that RCU is working when it is not. :-/ + +o "rta": Number of structures allocated from the torture freelist. + +o "rtaf": Number of allocations from the torture freelist that have + failed due to the list being empty. + +o "rtf": Number of frees into the torture freelist. + +o "Reader Pipe": Histogram of "ages" of structures seen by readers. + If any entries past the first two are non-zero, RCU is broken. + And rcutorture prints the error flag string "!!!" to make sure + you notice. The age of a newly allocated structure is zero, + it becomes one when removed from reader visibility, and is + incremented once per grace period subsequently -- and is freed + after passing through (RCU_TORTURE_PIPE_LEN-2) grace periods. + + The output displayed above was taken from a correctly working + RCU. If you want to see what it looks like when broken, break + it yourself. ;-) + +o "Reader Batch": Another histogram of "ages" of structures seen + by readers, but in terms of counter flips (or batches) rather + than in terms of grace periods. The legal number of non-zero + entries is again two. The reason for this separate view is + that it is easier to get the third entry to show up in the + "Reader Batch" list than in the "Reader Pipe" list. + +o "Free-Block Circulation": Shows the number of torture structures + that have reached a given point in the pipeline. The first element + should closely correspond to the number of structures allocated, + the second to the number that have been removed from reader view, + and all but the last remaining to the corresponding number of + passes through a grace period. The last entry should be zero, + as it is only incremented if a torture structure's counter + somehow gets incremented farther than it should. + + +USAGE + +The following script may be used to torture RCU: + + #!/bin/sh + + modprobe rcutorture + sleep 100 + rmmod rcutorture + dmesg | grep rcutorture: + +The output can be manually inspected for the error flag of "!!!". +One could of course create a more elaborate script that automatically +checked for such errors. diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt new file mode 100644 index 000000000000..354d89c78377 --- /dev/null +++ b/Documentation/RCU/whatisRCU.txt @@ -0,0 +1,902 @@ +What is RCU? + +RCU is a synchronization mechanism that was added to the Linux kernel +during the 2.5 development effort that is optimized for read-mostly +situations. Although RCU is actually quite simple once you understand it, +getting there can sometimes be a challenge. Part of the problem is that +most of the past descriptions of RCU have been written with the mistaken +assumption that there is "one true way" to describe RCU. Instead, +the experience has been that different people must take different paths +to arrive at an understanding of RCU. This document provides several +different paths, as follows: + +1. RCU OVERVIEW +2. WHAT IS RCU'S CORE API? +3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API? +4. WHAT IF MY UPDATING THREAD CANNOT BLOCK? +5. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU? +6. ANALOGY WITH READER-WRITER LOCKING +7. FULL LIST OF RCU APIs +8. ANSWERS TO QUICK QUIZZES + +People who prefer starting with a conceptual overview should focus on +Section 1, though most readers will profit by reading this section at +some point. People who prefer to start with an API that they can then +experiment with should focus on Section 2. People who prefer to start +with example uses should focus on Sections 3 and 4. People who need to +understand the RCU implementation should focus on Section 5, then dive +into the kernel source code. People who reason best by analogy should +focus on Section 6. Section 7 serves as an index to the docbook API +documentation, and Section 8 is the traditional answer key. + +So, start with the section that makes the most sense to you and your +preferred method of learning. If you need to know everything about +everything, feel free to read the whole thing -- but if you are really +that type of person, you have perused the source code and will therefore +never need this document anyway. ;-) + + +1. RCU OVERVIEW + +The basic idea behind RCU is to split updates into "removal" and +"reclamation" phases. The removal phase removes references to data items +within a data structure (possibly by replacing them with references to +new versions of these data items), and can run concurrently with readers. +The reason that it is safe to run the removal phase concurrently with +readers is the semantics of modern CPUs guarantee that readers will see +either the old or the new version of the data structure rather than a +partially updated reference. The reclamation phase does the work of reclaiming +(e.g., freeing) the data items removed from the data structure during the +removal phase. Because reclaiming data items can disrupt any readers +concurrently referencing those data items, the reclamation phase must +not start until readers no longer hold references to those data items. + +Splitting the update into removal and reclamation phases permits the +updater to perform the removal phase immediately, and to defer the +reclamation phase until all readers active during the removal phase have +completed, either by blocking until they finish or by registering a +callback that is invoked after they finish. Only readers that are active +during the removal phase need be considered, because any reader starting +after the removal phase will be unable to gain a reference to the removed +data items, and therefore cannot be disrupted by the reclamation phase. + +So the typical RCU update sequence goes something like the following: + +a. Remove pointers to a data structure, so that subsequent + readers cannot gain a reference to it. + +b. Wait for all previous readers to complete their RCU read-side + critical sections. + +c. At this point, there cannot be any readers who hold references + to the data structure, so it now may safely be reclaimed + (e.g., kfree()d). + +Step (b) above is the key idea underlying RCU's deferred destruction. +The ability to wait until all readers are done allows RCU readers to +use much lighter-weight synchronization, in some cases, absolutely no +synchronization at all. In contrast, in more conventional lock-based +schemes, readers must use heavy-weight synchronization in order to +prevent an updater from deleting the data structure out from under them. +This is because lock-based updaters typically update data items in place, +and must therefore exclude readers. In contrast, RCU-based updaters +typically take advantage of the fact that writes to single aligned +pointers are atomic on modern CPUs, allowing atomic insertion, removal, +and replacement of data items in a linked structure without disrupting +readers. Concurrent RCU readers can then continue accessing the old +versions, and can dispense with the atomic operations, memory barriers, +and communications cache misses that are so expensive on present-day +SMP computer systems, even in absence of lock contention. + +In the three-step procedure shown above, the updater is performing both +the removal and the reclamation step, but it is often helpful for an +entirely different thread to do the reclamation, as is in fact the case +in the Linux kernel's directory-entry cache (dcache). Even if the same +thread performs both the update step (step (a) above) and the reclamation +step (step (c) above), it is often helpful to think of them separately. +For example, RCU readers and updaters need not communicate at all, +but RCU provides implicit low-overhead communication between readers +and reclaimers, namely, in step (b) above. + +So how the heck can a reclaimer tell when a reader is done, given +that readers are not doing any sort of synchronization operations??? +Read on to learn about how RCU's API makes this easy. + + +2. WHAT IS RCU'S CORE API? + +The core RCU API is quite small: + +a. rcu_read_lock() +b. rcu_read_unlock() +c. synchronize_rcu() / call_rcu() +d. rcu_assign_pointer() +e. rcu_dereference() + +There are many other members of the RCU API, but the rest can be +expressed in terms of these five, though most implementations instead +express synchronize_rcu() in terms of the call_rcu() callback API. + +The five core RCU APIs are described below, the other 18 will be enumerated +later. See the kernel docbook documentation for more info, or look directly +at the function header comments. + +rcu_read_lock() + + void rcu_read_lock(void); + + Used by a reader to inform the reclaimer that the reader is + entering an RCU read-side critical section. It is illegal + to block while in an RCU read-side critical section, though + kernels built with CONFIG_PREEMPT_RCU can preempt RCU read-side + critical sections. Any RCU-protected data structure accessed + during an RCU read-side critical section is guaranteed to remain + unreclaimed for the full duration of that critical section. + Reference counts may be used in conjunction with RCU to maintain + longer-term references to data structures. + +rcu_read_unlock() + + void rcu_read_unlock(void); + + Used by a reader to inform the reclaimer that the reader is + exiting an RCU read-side critical section. Note that RCU + read-side critical sections may be nested and/or overlapping. + +synchronize_rcu() + + void synchronize_rcu(void); + + Marks the end of updater code and the beginning of reclaimer + code. It does this by blocking until all pre-existing RCU + read-side critical sections on all CPUs have completed. + Note that synchronize_rcu() will -not- necessarily wait for + any subsequent RCU read-side critical sections to complete. + For example, consider the following sequence of events: + + CPU 0 CPU 1 CPU 2 + ----------------- ------------------------- --------------- + 1. rcu_read_lock() + 2. enters synchronize_rcu() + 3. rcu_read_lock() + 4. rcu_read_unlock() + 5. exits synchronize_rcu() + 6. rcu_read_unlock() + + To reiterate, synchronize_rcu() waits only for ongoing RCU + read-side critical sections to complete, not necessarily for + any that begin after synchronize_rcu() is invoked. + + Of course, synchronize_rcu() does not necessarily return + -immediately- after the last pre-existing RCU read-side critical + section completes. For one thing, there might well be scheduling + delays. For another thing, many RCU implementations process + requests in batches in order to improve efficiencies, which can + further delay synchronize_rcu(). + + Since synchronize_rcu() is the API that must figure out when + readers are done, its implementation is key to RCU. For RCU + to be useful in all but the most read-intensive situations, + synchronize_rcu()'s overhead must also be quite small. + + The call_rcu() API is a callback form of synchronize_rcu(), + and is described in more detail in a later section. Instead of + blocking, it registers a function and argument which are invoked + after all ongoing RCU read-side critical sections have completed. + This callback variant is particularly useful in situations where + it is illegal to block. + +rcu_assign_pointer() + + typeof(p) rcu_assign_pointer(p, typeof(p) v); + + Yes, rcu_assign_pointer() -is- implemented as a macro, though it + would be cool to be able to declare a function in this manner. + (Compiler experts will no doubt disagree.) + + The updater uses this function to assign a new value to an + RCU-protected pointer, in order to safely communicate the change + in value from the updater to the reader. This function returns + the new value, and also executes any memory-barrier instructions + required for a given CPU architecture. + + Perhaps more important, it serves to document which pointers + are protected by RCU. That said, rcu_assign_pointer() is most + frequently used indirectly, via the _rcu list-manipulation + primitives such as list_add_rcu(). + +rcu_dereference() + + typeof(p) rcu_dereference(p); + + Like rcu_assign_pointer(), rcu_dereference() must be implemented + as a macro. + + The reader uses rcu_dereference() to fetch an RCU-protected + pointer, which returns a value that may then be safely + dereferenced. Note that rcu_deference() does not actually + dereference the pointer, instead, it protects the pointer for + later dereferencing. It also executes any needed memory-barrier + instructions for a given CPU architecture. Currently, only Alpha + needs memory barriers within rcu_dereference() -- on other CPUs, + it compiles to nothing, not even a compiler directive. + + Common coding practice uses rcu_dereference() to copy an + RCU-protected pointer to a local variable, then dereferences + this local variable, for example as follows: + + p = rcu_dereference(head.next); + return p->data; + + However, in this case, one could just as easily combine these + into one statement: + + return rcu_dereference(head.next)->data; + + If you are going to be fetching multiple fields from the + RCU-protected structure, using the local variable is of + course preferred. Repeated rcu_dereference() calls look + ugly and incur unnecessary overhead on Alpha CPUs. + + Note that the value returned by rcu_dereference() is valid + only within the enclosing RCU read-side critical section. + For example, the following is -not- legal: + + rcu_read_lock(); + p = rcu_dereference(head.next); + rcu_read_unlock(); + x = p->address; + rcu_read_lock(); + y = p->data; + rcu_read_unlock(); + + Holding a reference from one RCU read-side critical section + to another is just as illegal as holding a reference from + one lock-based critical section to another! Similarly, + using a reference outside of the critical section in which + it was acquired is just as illegal as doing so with normal + locking. + + As with rcu_assign_pointer(), an important function of + rcu_dereference() is to document which pointers are protected + by RCU. And, again like rcu_assign_pointer(), rcu_dereference() + is typically used indirectly, via the _rcu list-manipulation + primitives, such as list_for_each_entry_rcu(). + +The following diagram shows how each API communicates among the +reader, updater, and reclaimer. + + + rcu_assign_pointer() + +--------+ + +---------------------->| reader |---------+ + | +--------+ | + | | | + | | | Protect: + | | | rcu_read_lock() + | | | rcu_read_unlock() + | rcu_dereference() | | + +---------+ | | + | updater |<---------------------+ | + +---------+ V + | +-----------+ + +----------------------------------->| reclaimer | + +-----------+ + Defer: + synchronize_rcu() & call_rcu() + + +The RCU infrastructure observes the time sequence of rcu_read_lock(), +rcu_read_unlock(), synchronize_rcu(), and call_rcu() invocations in +order to determine when (1) synchronize_rcu() invocations may return +to their callers and (2) call_rcu() callbacks may be invoked. Efficient +implementations of the RCU infrastructure make heavy use of batching in +order to amortize their overhead over many uses of the corresponding APIs. + +There are no fewer than three RCU mechanisms in the Linux kernel; the +diagram above shows the first one, which is by far the most commonly used. +The rcu_dereference() and rcu_assign_pointer() primitives are used for +all three mechanisms, but different defer and protect primitives are +used as follows: + + Defer Protect + +a. synchronize_rcu() rcu_read_lock() / rcu_read_unlock() + call_rcu() + +b. call_rcu_bh() rcu_read_lock_bh() / rcu_read_unlock_bh() + +c. synchronize_sched() preempt_disable() / preempt_enable() + local_irq_save() / local_irq_restore() + hardirq enter / hardirq exit + NMI enter / NMI exit + +These three mechanisms are used as follows: + +a. RCU applied to normal data structures. + +b. RCU applied to networking data structures that may be subjected + to remote denial-of-service attacks. + +c. RCU applied to scheduler and interrupt/NMI-handler tasks. + +Again, most uses will be of (a). The (b) and (c) cases are important +for specialized uses, but are relatively uncommon. + + +3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API? + +This section shows a simple use of the core RCU API to protect a +global pointer to a dynamically allocated structure. More typical +uses of RCU may be found in listRCU.txt, arrayRCU.txt, and NMI-RCU.txt. + + struct foo { + int a; + char b; + long c; + }; + DEFINE_SPINLOCK(foo_mutex); + + struct foo *gbl_foo; + + /* + * Create a new struct foo that is the same as the one currently + * pointed to by gbl_foo, except that field "a" is replaced + * with "new_a". Points gbl_foo to the new structure, and + * frees up the old structure after a grace period. + * + * Uses rcu_assign_pointer() to ensure that concurrent readers + * see the initialized version of the new structure. + * + * Uses synchronize_rcu() to ensure that any readers that might + * have references to the old structure complete before freeing + * the old structure. + */ + void foo_update_a(int new_a) + { + struct foo *new_fp; + struct foo *old_fp; + + new_fp = kmalloc(sizeof(*fp), GFP_KERNEL); + spin_lock(&foo_mutex); + old_fp = gbl_foo; + *new_fp = *old_fp; + new_fp->a = new_a; + rcu_assign_pointer(gbl_foo, new_fp); + spin_unlock(&foo_mutex); + synchronize_rcu(); + kfree(old_fp); + } + + /* + * Return the value of field "a" of the current gbl_foo + * structure. Use rcu_read_lock() and rcu_read_unlock() + * to ensure that the structure does not get deleted out + * from under us, and use rcu_dereference() to ensure that + * we see the initialized version of the structure (important + * for DEC Alpha and for people reading the code). + */ + int foo_get_a(void) + { + int retval; + + rcu_read_lock(); + retval = rcu_dereference(gbl_foo)->a; + rcu_read_unlock(); + return retval; + } + +So, to sum up: + +o Use rcu_read_lock() and rcu_read_unlock() to guard RCU + read-side critical sections. + +o Within an RCU read-side critical section, use rcu_dereference() + to dereference RCU-protected pointers. + +o Use some solid scheme (such as locks or semaphores) to + keep concurrent updates from interfering with each other. + +o Use rcu_assign_pointer() to update an RCU-protected pointer. + This primitive protects concurrent readers from the updater, + -not- concurrent updates from each other! You therefore still + need to use locking (or something similar) to keep concurrent + rcu_assign_pointer() primitives from interfering with each other. + +o Use synchronize_rcu() -after- removing a data element from an + RCU-protected data structure, but -before- reclaiming/freeing + the data element, in order to wait for the completion of all + RCU read-side critical sections that might be referencing that + data item. + +See checklist.txt for additional rules to follow when using RCU. + + +4. WHAT IF MY UPDATING THREAD CANNOT BLOCK? + +In the example above, foo_update_a() blocks until a grace period elapses. +This is quite simple, but in some cases one cannot afford to wait so +long -- there might be other high-priority work to be done. + +In such cases, one uses call_rcu() rather than synchronize_rcu(). +The call_rcu() API is as follows: + + void call_rcu(struct rcu_head * head, + void (*func)(struct rcu_head *head)); + +This function invokes func(head) after a grace period has elapsed. +This invocation might happen from either softirq or process context, +so the function is not permitted to block. The foo struct needs to +have an rcu_head structure added, perhaps as follows: + + struct foo { + int a; + char b; + long c; + struct rcu_head rcu; + }; + +The foo_update_a() function might then be written as follows: + + /* + * Create a new struct foo that is the same as the one currently + * pointed to by gbl_foo, except that field "a" is replaced + * with "new_a". Points gbl_foo to the new structure, and + * frees up the old structure after a grace period. + * + * Uses rcu_assign_pointer() to ensure that concurrent readers + * see the initialized version of the new structure. + * + * Uses call_rcu() to ensure that any readers that might have + * references to the old structure complete before freeing the + * old structure. + */ + void foo_update_a(int new_a) + { + struct foo *new_fp; + struct foo *old_fp; + + new_fp = kmalloc(sizeof(*fp), GFP_KERNEL); + spin_lock(&foo_mutex); + old_fp = gbl_foo; + *new_fp = *old_fp; + new_fp->a = new_a; + rcu_assign_pointer(gbl_foo, new_fp); + spin_unlock(&foo_mutex); + call_rcu(&old_fp->rcu, foo_reclaim); + } + +The foo_reclaim() function might appear as follows: + + void foo_reclaim(struct rcu_head *rp) + { + struct foo *fp = container_of(rp, struct foo, rcu); + + kfree(fp); + } + +The container_of() primitive is a macro that, given a pointer into a +struct, the type of the struct, and the pointed-to field within the +struct, returns a pointer to the beginning of the struct. + +The use of call_rcu() permits the caller of foo_update_a() to +immediately regain control, without needing to worry further about the +old version of the newly updated element. It also clearly shows the +RCU distinction between updater, namely foo_update_a(), and reclaimer, +namely foo_reclaim(). + +The summary of advice is the same as for the previous section, except +that we are now using call_rcu() rather than synchronize_rcu(): + +o Use call_rcu() -after- removing a data element from an + RCU-protected data structure in order to register a callback + function that will be invoked after the completion of all RCU + read-side critical sections that might be referencing that + data item. + +Again, see checklist.txt for additional rules governing the use of RCU. + + +5. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU? + +One of the nice things about RCU is that it has extremely simple "toy" +implementations that are a good first step towards understanding the +production-quality implementations in the Linux kernel. This section +presents two such "toy" implementations of RCU, one that is implemented +in terms of familiar locking primitives, and another that more closely +resembles "classic" RCU. Both are way too simple for real-world use, +lacking both functionality and performance. However, they are useful +in getting a feel for how RCU works. See kernel/rcupdate.c for a +production-quality implementation, and see: + + http://www.rdrop.com/users/paulmck/RCU + +for papers describing the Linux kernel RCU implementation. The OLS'01 +and OLS'02 papers are a good introduction, and the dissertation provides +more details on the current implementation. + + +5A. "TOY" IMPLEMENTATION #1: LOCKING + +This section presents a "toy" RCU implementation that is based on +familiar locking primitives. Its overhead makes it a non-starter for +real-life use, as does its lack of scalability. It is also unsuitable +for realtime use, since it allows scheduling latency to "bleed" from +one read-side critical section to another. + +However, it is probably the easiest implementation to relate to, so is +a good starting point. + +It is extremely simple: + + static DEFINE_RWLOCK(rcu_gp_mutex); + + void rcu_read_lock(void) + { + read_lock(&rcu_gp_mutex); + } + + void rcu_read_unlock(void) + { + read_unlock(&rcu_gp_mutex); + } + + void synchronize_rcu(void) + { + write_lock(&rcu_gp_mutex); + write_unlock(&rcu_gp_mutex); + } + +[You can ignore rcu_assign_pointer() and rcu_dereference() without +missing much. But here they are anyway. And whatever you do, don't +forget about them when submitting patches making use of RCU!] + + #define rcu_assign_pointer(p, v) ({ \ + smp_wmb(); \ + (p) = (v); \ + }) + + #define rcu_dereference(p) ({ \ + typeof(p) _________p1 = p; \ + smp_read_barrier_depends(); \ + (_________p1); \ + }) + + +The rcu_read_lock() and rcu_read_unlock() primitive read-acquire +and release a global reader-writer lock. The synchronize_rcu() +primitive write-acquires this same lock, then immediately releases +it. This means that once synchronize_rcu() exits, all RCU read-side +critical sections that were in progress before synchonize_rcu() was +called are guaranteed to have completed -- there is no way that +synchronize_rcu() would have been able to write-acquire the lock +otherwise. + +It is possible to nest rcu_read_lock(), since reader-writer locks may +be recursively acquired. Note also that rcu_read_lock() is immune +from deadlock (an important property of RCU). The reason for this is +that the only thing that can block rcu_read_lock() is a synchronize_rcu(). +But synchronize_rcu() does not acquire any locks while holding rcu_gp_mutex, +so there can be no deadlock cycle. + +Quick Quiz #1: Why is this argument naive? How could a deadlock + occur when using this algorithm in a real-world Linux + kernel? How could this deadlock be avoided? + + +5B. "TOY" EXAMPLE #2: CLASSIC RCU + +This section presents a "toy" RCU implementation that is based on +"classic RCU". It is also short on performance (but only for updates) and +on features such as hotplug CPU and the ability to run in CONFIG_PREEMPT +kernels. The definitions of rcu_dereference() and rcu_assign_pointer() +are the same as those shown in the preceding section, so they are omitted. + + void rcu_read_lock(void) { } + + void rcu_read_unlock(void) { } + + void synchronize_rcu(void) + { + int cpu; + + for_each_cpu(cpu) + run_on(cpu); + } + +Note that rcu_read_lock() and rcu_read_unlock() do absolutely nothing. +This is the great strength of classic RCU in a non-preemptive kernel: +read-side overhead is precisely zero, at least on non-Alpha CPUs. +And there is absolutely no way that rcu_read_lock() can possibly +participate in a deadlock cycle! + +The implementation of synchronize_rcu() simply schedules itself on each +CPU in turn. The run_on() primitive can be implemented straightforwardly +in terms of the sched_setaffinity() primitive. Of course, a somewhat less +"toy" implementation would restore the affinity upon completion rather +than just leaving all tasks running on the last CPU, but when I said +"toy", I meant -toy-! + +So how the heck is this supposed to work??? + +Remember that it is illegal to block while in an RCU read-side critical +section. Therefore, if a given CPU executes a context switch, we know +that it must have completed all preceding RCU read-side critical sections. +Once -all- CPUs have executed a context switch, then -all- preceding +RCU read-side critical sections will have completed. + +So, suppose that we remove a data item from its structure and then invoke +synchronize_rcu(). Once synchronize_rcu() returns, we are guaranteed +that there are no RCU read-side critical sections holding a reference +to that data item, so we can safely reclaim it. + +Quick Quiz #2: Give an example where Classic RCU's read-side + overhead is -negative-. + +Quick Quiz #3: If it is illegal to block in an RCU read-side + critical section, what the heck do you do in + PREEMPT_RT, where normal spinlocks can block??? + + +6. ANALOGY WITH READER-WRITER LOCKING + +Although RCU can be used in many different ways, a very common use of +RCU is analogous to reader-writer locking. The following unified +diff shows how closely related RCU and reader-writer locking can be. + + @@ -13,15 +14,15 @@ + struct list_head *lp; + struct el *p; + + - read_lock(); + - list_for_each_entry(p, head, lp) { + + rcu_read_lock(); + + list_for_each_entry_rcu(p, head, lp) { + if (p->key == key) { + *result = p->data; + - read_unlock(); + + rcu_read_unlock(); + return 1; + } + } + - read_unlock(); + + rcu_read_unlock(); + return 0; + } + + @@ -29,15 +30,16 @@ + { + struct el *p; + + - write_lock(&listmutex); + + spin_lock(&listmutex); + list_for_each_entry(p, head, lp) { + if (p->key == key) { + list_del(&p->list); + - write_unlock(&listmutex); + + spin_unlock(&listmutex); + + synchronize_rcu(); + kfree(p); + return 1; + } + } + - write_unlock(&listmutex); + + spin_unlock(&listmutex); + return 0; + } + +Or, for those who prefer a side-by-side listing: + + 1 struct el { 1 struct el { + 2 struct list_head list; 2 struct list_head list; + 3 long key; 3 long key; + 4 spinlock_t mutex; 4 spinlock_t mutex; + 5 int data; 5 int data; + 6 /* Other data fields */ 6 /* Other data fields */ + 7 }; 7 }; + 8 spinlock_t listmutex; 8 spinlock_t listmutex; + 9 struct el head; 9 struct el head; + + 1 int search(long key, int *result) 1 int search(long key, int *result) + 2 { 2 { + 3 struct list_head *lp; 3 struct list_head *lp; + 4 struct el *p; 4 struct el *p; + 5 5 + 6 read_lock(); 6 rcu_read_lock(); + 7 list_for_each_entry(p, head, lp) { 7 list_for_each_entry_rcu(p, head, lp) { + 8 if (p->key == key) { 8 if (p->key == key) { + 9 *result = p->data; 9 *result = p->data; +10 read_unlock(); 10 rcu_read_unlock(); +11 return 1; 11 return 1; +12 } 12 } +13 } 13 } +14 read_unlock(); 14 rcu_read_unlock(); +15 return 0; 15 return 0; +16 } 16 } + + 1 int delete(long key) 1 int delete(long key) + 2 { 2 { + 3 struct el *p; 3 struct el *p; + 4 4 + 5 write_lock(&listmutex); 5 spin_lock(&listmutex); + 6 list_for_each_entry(p, head, lp) { 6 list_for_each_entry(p, head, lp) { + 7 if (p->key == key) { 7 if (p->key == key) { + 8 list_del(&p->list); 8 list_del(&p->list); + 9 write_unlock(&listmutex); 9 spin_unlock(&listmutex); + 10 synchronize_rcu(); +10 kfree(p); 11 kfree(p); +11 return 1; 12 return 1; +12 } 13 } +13 } 14 } +14 write_unlock(&listmutex); 15 spin_unlock(&listmutex); +15 return 0; 16 return 0; +16 } 17 } + +Either way, the differences are quite small. Read-side locking moves +to rcu_read_lock() and rcu_read_unlock, update-side locking moves from +from a reader-writer lock to a simple spinlock, and a synchronize_rcu() +precedes the kfree(). + +However, there is one potential catch: the read-side and update-side +critical sections can now run concurrently. In many cases, this will +not be a problem, but it is necessary to check carefully regardless. +For example, if multiple independent list updates must be seen as +a single atomic update, converting to RCU will require special care. + +Also, the presence of synchronize_rcu() means that the RCU version of +delete() can now block. If this is a problem, there is a callback-based +mechanism that never blocks, namely call_rcu(), that can be used in +place of synchronize_rcu(). + + +7. FULL LIST OF RCU APIs + +The RCU APIs are documented in docbook-format header comments in the +Linux-kernel source code, but it helps to have a full list of the +APIs, since there does not appear to be a way to categorize them +in docbook. Here is the list, by category. + +Markers for RCU read-side critical sections: + + rcu_read_lock + rcu_read_unlock + rcu_read_lock_bh + rcu_read_unlock_bh + +RCU pointer/list traversal: + + rcu_dereference + list_for_each_rcu (to be deprecated in favor of + list_for_each_entry_rcu) + list_for_each_safe_rcu (deprecated, not used) + list_for_each_entry_rcu + list_for_each_continue_rcu (to be deprecated in favor of new + list_for_each_entry_continue_rcu) + hlist_for_each_rcu (to be deprecated in favor of + hlist_for_each_entry_rcu) + hlist_for_each_entry_rcu + +RCU pointer update: + + rcu_assign_pointer + list_add_rcu + list_add_tail_rcu + list_del_rcu + list_replace_rcu + hlist_del_rcu + hlist_add_head_rcu + +RCU grace period: + + synchronize_kernel (deprecated) + synchronize_net + synchronize_sched + synchronize_rcu + call_rcu + call_rcu_bh + +See the comment headers in the source code (or the docbook generated +from them) for more information. + + +8. ANSWERS TO QUICK QUIZZES + +Quick Quiz #1: Why is this argument naive? How could a deadlock + occur when using this algorithm in a real-world Linux + kernel? [Referring to the lock-based "toy" RCU + algorithm.] + +Answer: Consider the following sequence of events: + + 1. CPU 0 acquires some unrelated lock, call it + "problematic_lock". + + 2. CPU 1 enters synchronize_rcu(), write-acquiring + rcu_gp_mutex. + + 3. CPU 0 enters rcu_read_lock(), but must wait + because CPU 1 holds rcu_gp_mutex. + + 4. CPU 1 is interrupted, and the irq handler + attempts to acquire problematic_lock. + + The system is now deadlocked. + + One way to avoid this deadlock is to use an approach like + that of CONFIG_PREEMPT_RT, where all normal spinlocks + become blocking locks, and all irq handlers execute in + the context of special tasks. In this case, in step 4 + above, the irq handler would block, allowing CPU 1 to + release rcu_gp_mutex, avoiding the deadlock. + + Even in the absence of deadlock, this RCU implementation + allows latency to "bleed" from readers to other + readers through synchronize_rcu(). To see this, + consider task A in an RCU read-side critical section + (thus read-holding rcu_gp_mutex), task B blocked + attempting to write-acquire rcu_gp_mutex, and + task C blocked in rcu_read_lock() attempting to + read_acquire rcu_gp_mutex. Task A's RCU read-side + latency is holding up task C, albeit indirectly via + task B. + + Realtime RCU implementations therefore use a counter-based + approach where tasks in RCU read-side critical sections + cannot be blocked by tasks executing synchronize_rcu(). + +Quick Quiz #2: Give an example where Classic RCU's read-side + overhead is -negative-. + +Answer: Imagine a single-CPU system with a non-CONFIG_PREEMPT + kernel where a routing table is used by process-context + code, but can be updated by irq-context code (for example, + by an "ICMP REDIRECT" packet). The usual way of handling + this would be to have the process-context code disable + interrupts while searching the routing table. Use of + RCU allows such interrupt-disabling to be dispensed with. + Thus, without RCU, you pay the cost of disabling interrupts, + and with RCU you don't. + + One can argue that the overhead of RCU in this + case is negative with respect to the single-CPU + interrupt-disabling approach. Others might argue that + the overhead of RCU is merely zero, and that replacing + the positive overhead of the interrupt-disabling scheme + with the zero-overhead RCU scheme does not constitute + negative overhead. + + In real life, of course, things are more complex. But + even the theoretical possibility of negative overhead for + a synchronization primitive is a bit unexpected. ;-) + +Quick Quiz #3: If it is illegal to block in an RCU read-side + critical section, what the heck do you do in + PREEMPT_RT, where normal spinlocks can block??? + +Answer: Just as PREEMPT_RT permits preemption of spinlock + critical sections, it permits preemption of RCU + read-side critical sections. It also permits + spinlocks blocking while in RCU read-side critical + sections. + + Why the apparent inconsistency? Because it is it + possible to use priority boosting to keep the RCU + grace periods short if need be (for example, if running + short of memory). In contrast, if blocking waiting + for (say) network reception, there is no way to know + what should be boosted. Especially given that the + process we need to boost might well be a human being + who just went out for a pizza or something. And although + a computer-operated cattle prod might arouse serious + interest, it might also provoke serious objections. + Besides, how does the computer know what pizza parlor + the human being went to??? + + +ACKNOWLEDGEMENTS + +My thanks to the people who helped make this human-readable, including +Jon Walpole, Josh Triplett, Serge Hallyn, and Suzanne Wood. + + +For more information, see http://www.rdrop.com/users/paulmck/RCU. diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches index 7f43b040311e..237d54c44bc5 100644 --- a/Documentation/SubmittingPatches +++ b/Documentation/SubmittingPatches @@ -301,8 +301,84 @@ now, but you can do this to mark internal company procedures or just point out some special detail about the sign-off. +12) The canonical patch format -12) More references for submitting patches +The canonical patch subject line is: + + Subject: [PATCH 001/123] subsystem: summary phrase + +The canonical patch message body contains the following: + + - A "from" line specifying the patch author. + + - An empty line. + + - The body of the explanation, which will be copied to the + permanent changelog to describe this patch. + + - The "Signed-off-by:" lines, described above, which will + also go in the changelog. + + - A marker line containing simply "---". + + - Any additional comments not suitable for the changelog. + + - The actual patch (diff output). + +The Subject line format makes it very easy to sort the emails +alphabetically by subject line - pretty much any email reader will +support that - since because the sequence number is zero-padded, +the numerical and alphabetic sort is the same. + +The "subsystem" in the email's Subject should identify which +area or subsystem of the kernel is being patched. + +The "summary phrase" in the email's Subject should concisely +describe the patch which that email contains. The "summary +phrase" should not be a filename. Do not use the same "summary +phrase" for every patch in a whole patch series. + +Bear in mind that the "summary phrase" of your email becomes +a globally-unique identifier for that patch. It propagates +all the way into the git changelog. The "summary phrase" may +later be used in developer discussions which refer to the patch. +People will want to google for the "summary phrase" to read +discussion regarding that patch. + +A couple of example Subjects: + + Subject: [patch 2/5] ext2: improve scalability of bitmap searching + Subject: [PATCHv2 001/207] x86: fix eflags tracking + +The "from" line must be the very first line in the message body, +and has the form: + + From: Original Author <author@example.com> + +The "from" line specifies who will be credited as the author of the +patch in the permanent changelog. If the "from" line is missing, +then the "From:" line from the email header will be used to determine +the patch author in the changelog. + +The explanation body will be committed to the permanent source +changelog, so should make sense to a competent reader who has long +since forgotten the immediate details of the discussion that might +have led to this patch. + +The "---" marker line serves the essential purpose of marking for patch +handling tools where the changelog message ends. + +One good use for the additional comments after the "---" marker is for +a diffstat, to show what files have changed, and the number of inserted +and deleted lines per file. A diffstat is especially useful on bigger +patches. Other comments relevant only to the moment or the maintainer, +not suitable for the permanent changelog, should also go here. + +See more details on the proper patch format in the following +references. + + +13) More references for submitting patches Andrew Morton, "The perfect patch" (tpp). <http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt> @@ -310,6 +386,14 @@ Andrew Morton, "The perfect patch" (tpp). Jeff Garzik, "Linux kernel patch submission format." <http://linux.yyz.us/patch-format.html> +Greg KH, "How to piss off a kernel subsystem maintainer" + <http://www.kroah.com/log/2005/03/31/> + +Kernel Documentation/CodingStyle + <http://sosdg.org/~coywolf/lxr/source/Documentation/CodingStyle> + +Linus Torvald's mail on the canonical patch format: + <http://lkml.org/lkml/2005/4/7/183> ----------------------------------- diff --git a/Documentation/acpi-hotkey.txt b/Documentation/acpi-hotkey.txt index 4c115a7bb826..744f1aec6553 100644 --- a/Documentation/acpi-hotkey.txt +++ b/Documentation/acpi-hotkey.txt @@ -33,3 +33,6 @@ The result of the execution of this aml method is attached to /proc/acpi/hotkey/poll_method, which is dnyamically created. Please use command "cat /proc/acpi/hotkey/polling_method" to retrieve it. + +Note: Use cmdline "acpi_generic_hotkey" to over-ride +platform-specific with generic driver. diff --git a/Documentation/aoe/mkshelf.sh b/Documentation/aoe/mkshelf.sh index 8bacf9f2c7cc..32615814271c 100644 --- a/Documentation/aoe/mkshelf.sh +++ b/Documentation/aoe/mkshelf.sh @@ -8,13 +8,15 @@ fi n_partitions=${n_partitions:-16} dir=$1 shelf=$2 +nslots=16 +maxslot=`echo $nslots 1 - p | dc` MAJOR=152 set -e -minor=`echo 10 \* $shelf \* $n_partitions | bc` +minor=`echo $nslots \* $shelf \* $n_partitions | bc` endp=`echo $n_partitions - 1 | bc` -for slot in `seq 0 9`; do +for slot in `seq 0 $maxslot`; do for part in `seq 0 $endp`; do name=e$shelf.$slot test "$part" != "0" && name=${name}p$part diff --git a/Documentation/applying-patches.txt b/Documentation/applying-patches.txt new file mode 100644 index 000000000000..681e426e2482 --- /dev/null +++ b/Documentation/applying-patches.txt @@ -0,0 +1,439 @@ + + Applying Patches To The Linux Kernel + ------------------------------------ + + (Written by Jesper Juhl, August 2005) + + + +A frequently asked question on the Linux Kernel Mailing List is how to apply +a patch to the kernel or, more specifically, what base kernel a patch for +one of the many trees/branches should be applied to. Hopefully this document +will explain this to you. + +In addition to explaining how to apply and revert patches, a brief +description of the different kernel trees (and examples of how to apply +their specific patches) is also provided. + + +What is a patch? +--- + A patch is a small text document containing a delta of changes between two +different versions of a source tree. Patches are created with the `diff' +program. +To correctly apply a patch you need to know what base it was generated from +and what new version the patch will change the source tree into. These +should both be present in the patch file metadata or be possible to deduce +from the filename. + + +How do I apply or revert a patch? +--- + You apply a patch with the `patch' program. The patch program reads a diff +(or patch) file and makes the changes to the source tree described in it. + +Patches for the Linux kernel are generated relative to the parent directory +holding the kernel source dir. + +This means that paths to files inside the patch file contain the name of the +kernel source directories it was generated against (or some other directory +names like "a/" and "b/"). +Since this is unlikely to match the name of the kernel source dir on your +local machine (but is often useful info to see what version an otherwise +unlabeled patch was generated against) you should change into your kernel +source directory and then strip the first element of the path from filenames +in the patch file when applying it (the -p1 argument to `patch' does this). + +To revert a previously applied patch, use the -R argument to patch. +So, if you applied a patch like this: + patch -p1 < ../patch-x.y.z + +You can revert (undo) it like this: + patch -R -p1 < ../patch-x.y.z + + +How do I feed a patch/diff file to `patch'? +--- + This (as usual with Linux and other UNIX like operating systems) can be +done in several different ways. +In all the examples below I feed the file (in uncompressed form) to patch +via stdin using the following syntax: + patch -p1 < path/to/patch-x.y.z + +If you just want to be able to follow the examples below and don't want to +know of more than one way to use patch, then you can stop reading this +section here. + +Patch can also get the name of the file to use via the -i argument, like +this: + patch -p1 -i path/to/patch-x.y.z + +If your patch file is compressed with gzip or bzip2 and you don't want to +uncompress it before applying it, then you can feed it to patch like this +instead: + zcat path/to/patch-x.y.z.gz | patch -p1 + bzcat path/to/patch-x.y.z.bz2 | patch -p1 + +If you wish to uncompress the patch file by hand first before applying it +(what I assume you've done in the examples below), then you simply run +gunzip or bunzip2 on the file - like this: + gunzip patch-x.y.z.gz + bunzip2 patch-x.y.z.bz2 + +Which will leave you with a plain text patch-x.y.z file that you can feed to +patch via stdin or the -i argument, as you prefer. + +A few other nice arguments for patch are -s which causes patch to be silent +except for errors which is nice to prevent errors from scrolling out of the +screen too fast, and --dry-run which causes patch to just print a listing of +what would happen, but doesn't actually make any changes. Finally --verbose +tells patch to print more information about the work being done. + + +Common errors when patching +--- + When patch applies a patch file it attempts to verify the sanity of the +file in different ways. +Checking that the file looks like a valid patch file, checking the code +around the bits being modified matches the context provided in the patch are +just two of the basic sanity checks patch does. + +If patch encounters something that doesn't look quite right it has two +options. It can either refuse to apply the changes and abort or it can try +to find a way to make the patch apply with a few minor changes. + +One example of something that's not 'quite right' that patch will attempt to +fix up is if all the context matches, the lines being changed match, but the +line numbers are different. This can happen, for example, if the patch makes +a change in the middle of the file but for some reasons a few lines have +been added or removed near the beginning of the file. In that case +everything looks good it has just moved up or down a bit, and patch will +usually adjust the line numbers and apply the patch. + +Whenever patch applies a patch that it had to modify a bit to make it fit +it'll tell you about it by saying the patch applied with 'fuzz'. +You should be wary of such changes since even though patch probably got it +right it doesn't /always/ get it right, and the result will sometimes be +wrong. + +When patch encounters a change that it can't fix up with fuzz it rejects it +outright and leaves a file with a .rej extension (a reject file). You can +read this file to see exactely what change couldn't be applied, so you can +go fix it up by hand if you wish. + +If you don't have any third party patches applied to your kernel source, but +only patches from kernel.org and you apply the patches in the correct order, +and have made no modifications yourself to the source files, then you should +never see a fuzz or reject message from patch. If you do see such messages +anyway, then there's a high risk that either your local source tree or the +patch file is corrupted in some way. In that case you should probably try +redownloading the patch and if things are still not OK then you'd be advised +to start with a fresh tree downloaded in full from kernel.org. + +Let's look a bit more at some of the messages patch can produce. + +If patch stops and presents a "File to patch:" prompt, then patch could not +find a file to be patched. Most likely you forgot to specify -p1 or you are +in the wrong directory. Less often, you'll find patches that need to be +applied with -p0 instead of -p1 (reading the patch file should reveal if +this is the case - if so, then this is an error by the person who created +the patch but is not fatal). + +If you get "Hunk #2 succeeded at 1887 with fuzz 2 (offset 7 lines)." or a +message similar to that, then it means that patch had to adjust the location +of the change (in this example it needed to move 7 lines from where it +expected to make the change to make it fit). +The resulting file may or may not be OK, depending on the reason the file +was different than expected. +This often happens if you try to apply a patch that was generated against a +different kernel version than the one you are trying to patch. + +If you get a message like "Hunk #3 FAILED at 2387.", then it means that the +patch could not be applied correctly and the patch program was unable to +fuzz its way through. This will generate a .rej file with the change that +caused the patch to fail and also a .orig file showing you the original +content that couldn't be changed. + +If you get "Reversed (or previously applied) patch detected! Assume -R? [n]" +then patch detected that the change contained in the patch seems to have +already been made. +If you actually did apply this patch previously and you just re-applied it +in error, then just say [n]o and abort this patch. If you applied this patch +previously and actually intended to revert it, but forgot to specify -R, +then you can say [y]es here to make patch revert it for you. +This can also happen if the creator of the patch reversed the source and +destination directories when creating the patch, and in that case reverting +the patch will in fact apply it. + +A message similar to "patch: **** unexpected end of file in patch" or "patch +unexpectedly ends in middle of line" means that patch could make no sense of +the file you fed to it. Either your download is broken or you tried to feed +patch a compressed patch file without uncompressing it first. + +As I already mentioned above, these errors should never happen if you apply +a patch from kernel.org to the correct version of an unmodified source tree. +So if you get these errors with kernel.org patches then you should probably +assume that either your patch file or your tree is broken and I'd advice you +to start over with a fresh download of a full kernel tree and the patch you +wish to apply. + + +Are there any alternatives to `patch'? +--- + Yes there are alternatives. You can use the `interdiff' program +(http://cyberelk.net/tim/patchutils/) to generate a patch representing the +differences between two patches and then apply the result. +This will let you move from something like 2.6.12.2 to 2.6.12.3 in a single +step. The -z flag to interdiff will even let you feed it patches in gzip or +bzip2 compressed form directly without the use of zcat or bzcat or manual +decompression. + +Here's how you'd go from 2.6.12.2 to 2.6.12.3 in a single step: + interdiff -z ../patch-2.6.12.2.bz2 ../patch-2.6.12.3.gz | patch -p1 + +Although interdiff may save you a step or two you are generally advised to +do the additional steps since interdiff can get things wrong in some cases. + + Another alternative is `ketchup', which is a python script for automatic +downloading and applying of patches (http://www.selenic.com/ketchup/). + +Other nice tools are diffstat which shows a summary of changes made by a +patch, lsdiff which displays a short listing of affected files in a patch +file, along with (optionally) the line numbers of the start of each patch +and grepdiff which displays a list of the files modified by a patch where +the patch contains a given regular expression. + + +Where can I download the patches? +--- + The patches are available at http://kernel.org/ +Most recent patches are linked from the front page, but they also have +specific homes. + +The 2.6.x.y (-stable) and 2.6.x patches live at + ftp://ftp.kernel.org/pub/linux/kernel/v2.6/ + +The -rc patches live at + ftp://ftp.kernel.org/pub/linux/kernel/v2.6/testing/ + +The -git patches live at + ftp://ftp.kernel.org/pub/linux/kernel/v2.6/snapshots/ + +The -mm kernels live at + ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/ + +In place of ftp.kernel.org you can use ftp.cc.kernel.org, where cc is a +country code. This way you'll be downloading from a mirror site that's most +likely geographically closer to you, resulting in faster downloads for you, +less bandwidth used globally and less load on the main kernel.org servers - +these are good things, do use mirrors when possible. + + +The 2.6.x kernels +--- + These are the base stable releases released by Linus. The highest numbered +release is the most recent. + +If regressions or other serious flaws are found then a -stable fix patch +will be released (see below) on top of this base. Once a new 2.6.x base +kernel is released, a patch is made available that is a delta between the +previous 2.6.x kernel and the new one. + +To apply a patch moving from 2.6.11 to 2.6.12 you'd do the following (note +that such patches do *NOT* apply on top of 2.6.x.y kernels but on top of the +base 2.6.x kernel - if you need to move from 2.6.x.y to 2.6.x+1 you need to +first revert the 2.6.x.y patch). + +Here are some examples: + +# moving from 2.6.11 to 2.6.12 +$ cd ~/linux-2.6.11 # change to kernel source dir +$ patch -p1 < ../patch-2.6.12 # apply the 2.6.12 patch +$ cd .. +$ mv linux-2.6.11 linux-2.6.12 # rename source dir + +# moving from 2.6.11.1 to 2.6.12 +$ cd ~/linux-2.6.11.1 # change to kernel source dir +$ patch -p1 -R < ../patch-2.6.11.1 # revert the 2.6.11.1 patch + # source dir is now 2.6.11 +$ patch -p1 < ../patch-2.6.12 # apply new 2.6.12 patch +$ cd .. +$ mv linux-2.6.11.1 inux-2.6.12 # rename source dir + + +The 2.6.x.y kernels +--- + Kernels with 4 digit versions are -stable kernels. They contain small(ish) +critical fixes for security problems or significant regressions discovered +in a given 2.6.x kernel. + +This is the recommended branch for users who want the most recent stable +kernel and are not interested in helping test development/experimental +versions. + +If no 2.6.x.y kernel is available, then the highest numbered 2.6.x kernel is +the current stable kernel. + +These patches are not incremental, meaning that for example the 2.6.12.3 +patch does not apply on top of the 2.6.12.2 kernel source, but rather on top +of the base 2.6.12 kernel source. +So, in order to apply the 2.6.12.3 patch to your existing 2.6.12.2 kernel +source you have to first back out the 2.6.12.2 patch (so you are left with a +base 2.6.12 kernel source) and then apply the new 2.6.12.3 patch. + +Here's a small example: + +$ cd ~/linux-2.6.12.2 # change into the kernel source dir +$ patch -p1 -R < ../patch-2.6.12.2 # revert the 2.6.12.2 patch +$ patch -p1 < ../patch-2.6.12.3 # apply the new 2.6.12.3 patch +$ cd .. +$ mv linux-2.6.12.2 linux-2.6.12.3 # rename the kernel source dir + + +The -rc kernels +--- + These are release-candidate kernels. These are development kernels released +by Linus whenever he deems the current git (the kernel's source management +tool) tree to be in a reasonably sane state adequate for testing. + +These kernels are not stable and you should expect occasional breakage if +you intend to run them. This is however the most stable of the main +development branches and is also what will eventually turn into the next +stable kernel, so it is important that it be tested by as many people as +possible. + +This is a good branch to run for people who want to help out testing +development kernels but do not want to run some of the really experimental +stuff (such people should see the sections about -git and -mm kernels below). + +The -rc patches are not incremental, they apply to a base 2.6.x kernel, just +like the 2.6.x.y patches described above. The kernel version before the -rcN +suffix denotes the version of the kernel that this -rc kernel will eventually +turn into. +So, 2.6.13-rc5 means that this is the fifth release candidate for the 2.6.13 +kernel and the patch should be applied on top of the 2.6.12 kernel source. + +Here are 3 examples of how to apply these patches: + +# first an example of moving from 2.6.12 to 2.6.13-rc3 +$ cd ~/linux-2.6.12 # change into the 2.6.12 source dir +$ patch -p1 < ../patch-2.6.13-rc3 # apply the 2.6.13-rc3 patch +$ cd .. +$ mv linux-2.6.12 linux-2.6.13-rc3 # rename the source dir + +# now let's move from 2.6.13-rc3 to 2.6.13-rc5 +$ cd ~/linux-2.6.13-rc3 # change into the 2.6.13-rc3 dir +$ patch -p1 -R < ../patch-2.6.13-rc3 # revert the 2.6.13-rc3 patch +$ patch -p1 < ../patch-2.6.13-rc5 # apply the new 2.6.13-rc5 patch +$ cd .. +$ mv linux-2.6.13-rc3 linux-2.6.13-rc5 # rename the source dir + +# finally let's try and move from 2.6.12.3 to 2.6.13-rc5 +$ cd ~/linux-2.6.12.3 # change to the kernel source dir +$ patch -p1 -R < ../patch-2.6.12.3 # revert the 2.6.12.3 patch +$ patch -p1 < ../patch-2.6.13-rc5 # apply new 2.6.13-rc5 patch +$ cd .. +$ mv linux-2.6.12.3 linux-2.6.13-rc5 # rename the kernel source dir + + +The -git kernels +--- + These are daily snapshots of Linus' kernel tree (managed in a git +repository, hence the name). + +These patches are usually released daily and represent the current state of +Linus' tree. They are more experimental than -rc kernels since they are +generated automatically without even a cursory glance to see if they are +sane. + +-git patches are not incremental and apply either to a base 2.6.x kernel or +a base 2.6.x-rc kernel - you can see which from their name. +A patch named 2.6.12-git1 applies to the 2.6.12 kernel source and a patch +named 2.6.13-rc3-git2 applies to the source of the 2.6.13-rc3 kernel. + +Here are some examples of how to apply these patches: + +# moving from 2.6.12 to 2.6.12-git1 +$ cd ~/linux-2.6.12 # change to the kernel source dir +$ patch -p1 < ../patch-2.6.12-git1 # apply the 2.6.12-git1 patch +$ cd .. +$ mv linux-2.6.12 linux-2.6.12-git1 # rename the kernel source dir + +# moving from 2.6.12-git1 to 2.6.13-rc2-git3 +$ cd ~/linux-2.6.12-git1 # change to the kernel source dir +$ patch -p1 -R < ../patch-2.6.12-git1 # revert the 2.6.12-git1 patch + # we now have a 2.6.12 kernel +$ patch -p1 < ../patch-2.6.13-rc2 # apply the 2.6.13-rc2 patch + # the kernel is now 2.6.13-rc2 +$ patch -p1 < ../patch-2.6.13-rc2-git3 # apply the 2.6.13-rc2-git3 patch + # the kernel is now 2.6.13-rc2-git3 +$ cd .. +$ mv linux-2.6.12-git1 linux-2.6.13-rc2-git3 # rename source dir + + +The -mm kernels +--- + These are experimental kernels released by Andrew Morton. + +The -mm tree serves as a sort of proving ground for new features and other +experimental patches. +Once a patch has proved its worth in -mm for a while Andrew pushes it on to +Linus for inclusion in mainline. + +Although it's encouraged that patches flow to Linus via the -mm tree, this +is not always enforced. +Subsystem maintainers (or individuals) sometimes push their patches directly +to Linus, even though (or after) they have been merged and tested in -mm (or +sometimes even without prior testing in -mm). + +You should generally strive to get your patches into mainline via -mm to +ensure maximum testing. + +This branch is in constant flux and contains many experimental features, a +lot of debugging patches not appropriate for mainline etc and is the most +experimental of the branches described in this document. + +These kernels are not appropriate for use on systems that are supposed to be +stable and they are more risky to run than any of the other branches (make +sure you have up-to-date backups - that goes for any experimental kernel but +even more so for -mm kernels). + +These kernels in addition to all the other experimental patches they contain +usually also contain any changes in the mainline -git kernels available at +the time of release. + +Testing of -mm kernels is greatly appreciated since the whole point of the +tree is to weed out regressions, crashes, data corruption bugs, build +breakage (and any other bug in general) before changes are merged into the +more stable mainline Linus tree. +But testers of -mm should be aware that breakage in this tree is more common +than in any other tree. + +The -mm kernels are not released on a fixed schedule, but usually a few -mm +kernels are released in between each -rc kernel (1 to 3 is common). +The -mm kernels apply to either a base 2.6.x kernel (when no -rc kernels +have been released yet) or to a Linus -rc kernel. + +Here are some examples of applying the -mm patches: + +# moving from 2.6.12 to 2.6.12-mm1 +$ cd ~/linux-2.6.12 # change to the 2.6.12 source dir +$ patch -p1 < ../2.6.12-mm1 # apply the 2.6.12-mm1 patch +$ cd .. +$ mv linux-2.6.12 linux-2.6.12-mm1 # rename the source appropriately + +# moving from 2.6.12-mm1 to 2.6.13-rc3-mm3 +$ cd ~/linux-2.6.12-mm1 +$ patch -p1 -R < ../2.6.12-mm1 # revert the 2.6.12-mm1 patch + # we now have a 2.6.12 source +$ patch -p1 < ../patch-2.6.13-rc3 # apply the 2.6.13-rc3 patch + # we now have a 2.6.13-rc3 source +$ patch -p1 < ../2.6.13-rc3-mm3 # apply the 2.6.13-rc3-mm3 patch +$ cd .. +$ mv linux-2.6.12-mm1 linux-2.6.13-rc3-mm3 # rename the source dir + + +This concludes this list of explanations of the various kernel trees and I +hope you are now crystal clear on how to apply the various patches and help +testing the kernel. + diff --git a/Documentation/arm/Samsung-S3C24XX/Overview.txt b/Documentation/arm/Samsung-S3C24XX/Overview.txt index 3af4d29a8938..89aa89d526ac 100644 --- a/Documentation/arm/Samsung-S3C24XX/Overview.txt +++ b/Documentation/arm/Samsung-S3C24XX/Overview.txt @@ -81,7 +81,8 @@ Adding New Machines Any large scale modifications, or new drivers should be discussed on the ARM kernel mailing list (linux-arm-kernel) before being - attempted. + attempted. See http://www.arm.linux.org.uk/mailinglists/ for the + mailing list information. NAND @@ -120,6 +121,43 @@ Clock Management various clock units +Platform Data +------------- + + Whenever a device has platform specific data that is specified + on a per-machine basis, care should be taken to ensure the + following: + + 1) that default data is not left in the device to confuse the + driver if a machine does not set it at startup + + 2) the data should (if possible) be marked as __initdata, + to ensure that the data is thrown away if the machine is + not the one currently in use. + + The best way of doing this is to make a function that + kmalloc()s an area of memory, and copies the __initdata + and then sets the relevant device's platform data. Making + the function `__init` takes care of ensuring it is discarded + with the rest of the initialisation code + + static __init void s3c24xx_xxx_set_platdata(struct xxx_data *pd) + { + struct s3c2410_xxx_mach_info *npd; + + npd = kmalloc(sizeof(struct s3c2410_xxx_mach_info), GFP_KERNEL); + if (npd) { + memcpy(npd, pd, sizeof(struct s3c2410_xxx_mach_info)); + s3c_device_xxx.dev.platform_data = npd; + } else { + printk(KERN_ERR "no memory for xxx platform data\n"); + } + } + + Note, since the code is marked as __init, it should not be + exported outside arch/arm/mach-s3c2410/, or exported to + modules via EXPORT_SYMBOL() and related functions. + Port Contributors ----------------- @@ -149,6 +187,7 @@ Document Changes 06 Mar 2005 - BJD - Added Christer Weinigel 08 Mar 2005 - BJD - Added LCVR to list of people, updated introduction 08 Mar 2005 - BJD - Added section on adding machines + 09 Sep 2005 - BJD - Added section on platform data Document Author --------------- diff --git a/Documentation/arm/Samsung-S3C24XX/USB-Host.txt b/Documentation/arm/Samsung-S3C24XX/USB-Host.txt new file mode 100644 index 000000000000..b93b68e2b143 --- /dev/null +++ b/Documentation/arm/Samsung-S3C24XX/USB-Host.txt @@ -0,0 +1,93 @@ + S3C24XX USB Host support + ======================== + + + +Introduction +------------ + + This document details the S3C2410/S3C2440 in-built OHCI USB host support. + +Configuration +------------- + + Enable at least the following kernel options: + + menuconfig: + + Device Drivers ---> + USB support ---> + <*> Support for Host-side USB + <*> OHCI HCD support + + + .config: + CONFIG_USB + CONFIG_USB_OHCI_HCD + + + Once these options are configured, the standard set of USB device + drivers can be configured and used. + + +Board Support +------------- + + The driver attaches to a platform device, which will need to be + added by the board specific support file in linux/arch/arm/mach-s3c2410, + such as mach-bast.c or mach-smdk2410.c + + The platform device's platform_data field is only needed if the + board implements extra power control or over-current monitoring. + + The OHCI driver does not ensure the state of the S3C2410's MISCCTRL + register, so if both ports are to be used for the host, then it is + the board support file's responsibility to ensure that the second + port is configured to be connected to the OHCI core. + + +Platform Data +------------- + + See linux/include/asm-arm/arch-s3c2410/usb-control.h for the + descriptions of the platform device data. An implementation + can be found in linux/arch/arm/mach-s3c2410/usb-simtec.c . + + The `struct s3c2410_hcd_info` contains a pair of functions + that get called to enable over-current detection, and to + control the port power status. + + The ports are numbered 0 and 1. + + power_control: + + Called to enable or disable the power on the port. + + enable_oc: + + Called to enable or disable the over-current monitoring. + This should claim or release the resources being used to + check the power condition on the port, such as an IRQ. + + report_oc: + + The OHCI driver fills this field in for the over-current code + to call when there is a change to the over-current state on + an port. The ports argument is a bitmask of 1 bit per port, + with bit X being 1 for an over-current on port X. + + The function s3c2410_usb_report_oc() has been provided to + ensure this is called correctly. + + port[x]: + + This is struct describes each port, 0 or 1. The platform driver + should set the flags field of each port to S3C_HCDFLG_USED if + the port is enabled. + + + +Document Author +--------------- + +Ben Dooks, (c) 2005 Simtec Electronics diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt index 6dd274d7e1cf..2d65c2182161 100644 --- a/Documentation/block/biodoc.txt +++ b/Documentation/block/biodoc.txt @@ -906,9 +906,20 @@ Aside: 4. The I/O scheduler -I/O schedulers are now per queue. They should be runtime switchable and modular -but aren't yet. Jens has most bits to do this, but the sysfs implementation is -missing. +I/O scheduler, a.k.a. elevator, is implemented in two layers. Generic dispatch +queue and specific I/O schedulers. Unless stated otherwise, elevator is used +to refer to both parts and I/O scheduler to specific I/O schedulers. + +Block layer implements generic dispatch queue in ll_rw_blk.c and elevator.c. +The generic dispatch queue is responsible for properly ordering barrier +requests, requeueing, handling non-fs requests and all other subtleties. + +Specific I/O schedulers are responsible for ordering normal filesystem +requests. They can also choose to delay certain requests to improve +throughput or whatever purpose. As the plural form indicates, there are +multiple I/O schedulers. They can be built as modules but at least one should +be built inside the kernel. Each queue can choose different one and can also +change to another one dynamically. A block layer call to the i/o scheduler follows the convention elv_xxx(). This calls elevator_xxx_fn in the elevator switch (drivers/block/elevator.c). Oh, @@ -921,44 +932,36 @@ keeping work. The functions an elevator may implement are: (* are mandatory) elevator_merge_fn called to query requests for merge with a bio -elevator_merge_req_fn " " " with another request +elevator_merge_req_fn called when two requests get merged. the one + which gets merged into the other one will be + never seen by I/O scheduler again. IOW, after + being merged, the request is gone. elevator_merged_fn called when a request in the scheduler has been involved in a merge. It is used in the deadline scheduler for example, to reposition the request if its sorting order has changed. -*elevator_next_req_fn returns the next scheduled request, or NULL - if there are none (or none are ready). +elevator_dispatch_fn fills the dispatch queue with ready requests. + I/O schedulers are free to postpone requests by + not filling the dispatch queue unless @force + is non-zero. Once dispatched, I/O schedulers + are not allowed to manipulate the requests - + they belong to generic dispatch queue. -*elevator_add_req_fn called to add a new request into the scheduler +elevator_add_req_fn called to add a new request into the scheduler elevator_queue_empty_fn returns true if the merge queue is empty. Drivers shouldn't use this, but rather check if elv_next_request is NULL (without losing the request if one exists!) -elevator_remove_req_fn This is called when a driver claims ownership of - the target request - it now belongs to the - driver. It must not be modified or merged. - Drivers must not lose the request! A subsequent - call of elevator_next_req_fn must return the - _next_ request. - -elevator_requeue_req_fn called to add a request to the scheduler. This - is used when the request has alrnadebeen - returned by elv_next_request, but hasn't - completed. If this is not implemented then - elevator_add_req_fn is called instead. - elevator_former_req_fn elevator_latter_req_fn These return the request before or after the one specified in disk sort order. Used by the block layer to find merge possibilities. -elevator_completed_req_fn called when a request is completed. This might - come about due to being merged with another or - when the device completes the request. +elevator_completed_req_fn called when a request is completed. elevator_may_queue_fn returns true if the scheduler wants to allow the current context to queue a new request even if @@ -967,13 +970,33 @@ elevator_may_queue_fn returns true if the scheduler wants to allow the elevator_set_req_fn elevator_put_req_fn Must be used to allocate and free any elevator - specific storate for a request. + specific storage for a request. + +elevator_activate_req_fn Called when device driver first sees a request. + I/O schedulers can use this callback to + determine when actual execution of a request + starts. +elevator_deactivate_req_fn Called when device driver decides to delay + a request by requeueing it. elevator_init_fn elevator_exit_fn Allocate and free any elevator specific storage for a queue. -4.2 I/O scheduler implementation +4.2 Request flows seen by I/O schedulers +All requests seens by I/O schedulers strictly follow one of the following three +flows. + + set_req_fn -> + + i. add_req_fn -> (merged_fn ->)* -> dispatch_fn -> activate_req_fn -> + (deactivate_req_fn -> activate_req_fn ->)* -> completed_req_fn + ii. add_req_fn -> (merged_fn ->)* -> merge_req_fn + iii. [none] + + -> put_req_fn + +4.3 I/O scheduler implementation The generic i/o scheduler algorithm attempts to sort/merge/batch requests for optimal disk scan and request servicing performance (based on generic principles and device capabilities), optimized for: @@ -993,18 +1016,7 @@ request in sort order to prevent binary tree lookups. This arrangement is not a generic block layer characteristic however, so elevators may implement queues as they please. -ii. Last merge hint -The last merge hint is part of the generic queue layer. I/O schedulers must do -some management on it. For the most part, the most important thing is to make -sure q->last_merge is cleared (set to NULL) when the request on it is no longer -a candidate for merging (for example if it has been sent to the driver). - -The last merge performed is cached as a hint for the subsequent request. If -sequential data is being submitted, the hint is used to perform merges without -any scanning. This is not sufficient when there are multiple processes doing -I/O though, so a "merge hash" is used by some schedulers. - -iii. Merge hash +ii. Merge hash AS and deadline use a hash table indexed by the last sector of a request. This enables merging code to quickly look up "back merge" candidates, even when multiple I/O streams are being performed at once on one disk. @@ -1013,29 +1025,8 @@ multiple I/O streams are being performed at once on one disk. are far less common than "back merges" due to the nature of most I/O patterns. Front merges are handled by the binary trees in AS and deadline schedulers. -iv. Handling barrier cases -A request with flags REQ_HARDBARRIER or REQ_SOFTBARRIER must not be ordered -around. That is, they must be processed after all older requests, and before -any newer ones. This includes merges! - -In AS and deadline schedulers, barriers have the effect of flushing the reorder -queue. The performance cost of this will vary from nothing to a lot depending -on i/o patterns and device characteristics. Obviously they won't improve -performance, so their use should be kept to a minimum. - -v. Handling insertion position directives -A request may be inserted with a position directive. The directives are one of -ELEVATOR_INSERT_BACK, ELEVATOR_INSERT_FRONT, ELEVATOR_INSERT_SORT. - -ELEVATOR_INSERT_SORT is a general directive for non-barrier requests. -ELEVATOR_INSERT_BACK is used to insert a barrier to the back of the queue. -ELEVATOR_INSERT_FRONT is used to insert a barrier to the front of the queue, and -overrides the ordering requested by any previous barriers. In practice this is -harmless and required, because it is used for SCSI requeueing. This does not -require flushing the reorder queue, so does not impose a performance penalty. - -vi. Plugging the queue to batch requests in anticipation of opportunities for - merge/sort optimizations +iii. Plugging the queue to batch requests in anticipation of opportunities for + merge/sort optimizations This is just the same as in 2.4 so far, though per-device unplugging support is anticipated for 2.5. Also with a priority-based i/o scheduler, @@ -1069,7 +1060,7 @@ Aside: blk_kick_queue() to unplug a specific queue (right away ?) or optionally, all queues, is in the plan. -4.3 I/O contexts +4.4 I/O contexts I/O contexts provide a dynamically allocated per process data area. They may be used in I/O schedulers, and in the block layer (could be used for IO statis, priorities for example). See *io_context in drivers/block/ll_rw_blk.c, and diff --git a/Documentation/cachetlb.txt b/Documentation/cachetlb.txt index e132fb1163b0..7eb715e07eda 100644 --- a/Documentation/cachetlb.txt +++ b/Documentation/cachetlb.txt @@ -49,9 +49,6 @@ changes occur: page table operations such as what happens during fork, and exec. - Platform developers note that generic code will always - invoke this interface without mm->page_table_lock held. - 3) void flush_tlb_range(struct vm_area_struct *vma, unsigned long start, unsigned long end) @@ -72,9 +69,6 @@ changes occur: call flush_tlb_page (see below) for each entry which may be modified. - Platform developers note that generic code will always - invoke this interface with mm->page_table_lock held. - 4) void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr) This time we need to remove the PAGE_SIZE sized translation @@ -93,9 +87,6 @@ changes occur: This is used primarily during fault processing. - Platform developers note that generic code will always - invoke this interface with mm->page_table_lock held. - 5) void flush_tlb_pgtables(struct mm_struct *mm, unsigned long start, unsigned long end) diff --git a/Documentation/cciss.txt b/Documentation/cciss.txt index c8f9a73111da..68a711fb82cf 100644 --- a/Documentation/cciss.txt +++ b/Documentation/cciss.txt @@ -17,7 +17,9 @@ This driver is known to work with the following cards: * SA P600 * SA P800 * SA E400 - * SA E300 + * SA P400i + * SA E200 + * SA E200i If nodes are not already created in the /dev/cciss directory, run as root: diff --git a/Documentation/cdrom/sonycd535 b/Documentation/cdrom/sonycd535 index 59581a4b302a..b81e109970aa 100644 --- a/Documentation/cdrom/sonycd535 +++ b/Documentation/cdrom/sonycd535 @@ -68,7 +68,8 @@ it a better device citizen. Further thanks to Joel Katz Porfiri Claudio <C.Porfiri@nisms.tei.ericsson.se> for patches to make the driver work with the older CDU-510/515 series, and Heiko Eissfeldt <heiko@colossus.escape.de> for pointing out that -the verify_area() checks were ignoring the results of said checks. +the verify_area() checks were ignoring the results of said checks +(note: verify_area() has since been replaced by access_ok()). (Acknowledgments from Ron Jeppesen in the 0.3 release:) Thanks to Corey Minyard who wrote the original CDU-31A driver on which diff --git a/Documentation/connector/cn_test.c b/Documentation/connector/cn_test.c new file mode 100644 index 000000000000..b7de82e9c0e0 --- /dev/null +++ b/Documentation/connector/cn_test.c @@ -0,0 +1,194 @@ +/* + * cn_test.c + * + * 2004-2005 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/moduleparam.h> +#include <linux/skbuff.h> +#include <linux/timer.h> + +#include "connector.h" + +static struct cb_id cn_test_id = { 0x123, 0x456 }; +static char cn_test_name[] = "cn_test"; +static struct sock *nls; +static struct timer_list cn_test_timer; + +void cn_test_callback(void *data) +{ + struct cn_msg *msg = (struct cn_msg *)data; + + printk("%s: %lu: idx=%x, val=%x, seq=%u, ack=%u, len=%d: %s.\n", + __func__, jiffies, msg->id.idx, msg->id.val, + msg->seq, msg->ack, msg->len, (char *)msg->data); +} + +static int cn_test_want_notify(void) +{ + struct cn_ctl_msg *ctl; + struct cn_notify_req *req; + struct cn_msg *msg = NULL; + int size, size0; + struct sk_buff *skb; + struct nlmsghdr *nlh; + u32 group = 1; + + size0 = sizeof(*msg) + sizeof(*ctl) + 3 * sizeof(*req); + + size = NLMSG_SPACE(size0); + + skb = alloc_skb(size, GFP_ATOMIC); + if (!skb) { + printk(KERN_ERR "Failed to allocate new skb with size=%u.\n", + size); + + return -ENOMEM; + } + + nlh = NLMSG_PUT(skb, 0, 0x123, NLMSG_DONE, size - sizeof(*nlh)); + + msg = (struct cn_msg *)NLMSG_DATA(nlh); + + memset(msg, 0, size0); + + msg->id.idx = -1; + msg->id.val = -1; + msg->seq = 0x123; + msg->ack = 0x345; + msg->len = size0 - sizeof(*msg); + + ctl = (struct cn_ctl_msg *)(msg + 1); + + ctl->idx_notify_num = 1; + ctl->val_notify_num = 2; + ctl->group = group; + ctl->len = msg->len - sizeof(*ctl); + + req = (struct cn_notify_req *)(ctl + 1); + + /* + * Idx. + */ + req->first = cn_test_id.idx; + req->range = 10; + + /* + * Val 0. + */ + req++; + req->first = cn_test_id.val; + req->range = 10; + + /* + * Val 1. + */ + req++; + req->first = cn_test_id.val + 20; + req->range = 10; + + NETLINK_CB(skb).dst_groups = ctl->group; + //netlink_broadcast(nls, skb, 0, ctl->group, GFP_ATOMIC); + netlink_unicast(nls, skb, 0, 0); + + printk(KERN_INFO "Request was sent. Group=0x%x.\n", ctl->group); + + return 0; + +nlmsg_failure: + printk(KERN_ERR "Failed to send %u.%u\n", msg->seq, msg->ack); + kfree_skb(skb); + return -EINVAL; +} + +static u32 cn_test_timer_counter; +static void cn_test_timer_func(unsigned long __data) +{ + struct cn_msg *m; + char data[32]; + + m = kmalloc(sizeof(*m) + sizeof(data), GFP_ATOMIC); + if (m) { + memset(m, 0, sizeof(*m) + sizeof(data)); + + memcpy(&m->id, &cn_test_id, sizeof(m->id)); + m->seq = cn_test_timer_counter; + m->len = sizeof(data); + + m->len = + scnprintf(data, sizeof(data), "counter = %u", + cn_test_timer_counter) + 1; + + memcpy(m + 1, data, m->len); + + cn_netlink_send(m, 0, gfp_any()); + kfree(m); + } + + cn_test_timer_counter++; + + mod_timer(&cn_test_timer, jiffies + HZ); +} + +static int cn_test_init(void) +{ + int err; + + err = cn_add_callback(&cn_test_id, cn_test_name, cn_test_callback); + if (err) + goto err_out; + cn_test_id.val++; + err = cn_add_callback(&cn_test_id, cn_test_name, cn_test_callback); + if (err) { + cn_del_callback(&cn_test_id); + goto err_out; + } + + init_timer(&cn_test_timer); + cn_test_timer.function = cn_test_timer_func; + cn_test_timer.expires = jiffies + HZ; + cn_test_timer.data = 0; + add_timer(&cn_test_timer); + + return 0; + + err_out: + if (nls && nls->sk_socket) + sock_release(nls->sk_socket); + + return err; +} + +static void cn_test_fini(void) +{ + del_timer_sync(&cn_test_timer); + cn_del_callback(&cn_test_id); + cn_test_id.val--; + cn_del_callback(&cn_test_id); + if (nls && nls->sk_socket) + sock_release(nls->sk_socket); +} + +module_init(cn_test_init); +module_exit(cn_test_fini); + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Evgeniy Polyakov <johnpol@2ka.mipt.ru>"); +MODULE_DESCRIPTION("Connector's test module"); diff --git a/Documentation/connector/connector.txt b/Documentation/connector/connector.txt new file mode 100644 index 000000000000..57a314b14cf8 --- /dev/null +++ b/Documentation/connector/connector.txt @@ -0,0 +1,177 @@ +/*****************************************/ +Kernel Connector. +/*****************************************/ + +Kernel connector - new netlink based userspace <-> kernel space easy +to use communication module. + +Connector driver adds possibility to connect various agents using +netlink based network. One must register callback and +identifier. When driver receives special netlink message with +appropriate identifier, appropriate callback will be called. + +From the userspace point of view it's quite straightforward: + + socket(); + bind(); + send(); + recv(); + +But if kernelspace want to use full power of such connections, driver +writer must create special sockets, must know about struct sk_buff +handling... Connector allows any kernelspace agents to use netlink +based networking for inter-process communication in a significantly +easier way: + +int cn_add_callback(struct cb_id *id, char *name, void (*callback) (void *)); +void cn_netlink_send(struct cn_msg *msg, u32 __group, int gfp_mask); + +struct cb_id +{ + __u32 idx; + __u32 val; +}; + +idx and val are unique identifiers which must be registered in +connector.h for in-kernel usage. void (*callback) (void *) - is a +callback function which will be called when message with above idx.val +will be received by connector core. Argument for that function must +be dereferenced to struct cn_msg *. + +struct cn_msg +{ + struct cb_id id; + + __u32 seq; + __u32 ack; + + __u32 len; /* Length of the following data */ + __u8 data[0]; +}; + +/*****************************************/ +Connector interfaces. +/*****************************************/ + +int cn_add_callback(struct cb_id *id, char *name, void (*callback) (void *)); + +Registers new callback with connector core. + +struct cb_id *id - unique connector's user identifier. + It must be registered in connector.h for legal in-kernel users. +char *name - connector's callback symbolic name. +void (*callback) (void *) - connector's callback. + Argument must be dereferenced to struct cn_msg *. + +void cn_del_callback(struct cb_id *id); + +Unregisters new callback with connector core. + +struct cb_id *id - unique connector's user identifier. + +void cn_netlink_send(struct cn_msg *msg, u32 __groups, int gfp_mask); + +Sends message to the specified groups. It can be safely called from +any context, but may silently fail under strong memory pressure. + +struct cn_msg * - message header(with attached data). +u32 __group - destination group. + If __group is zero, then appropriate group will + be searched through all registered connector users, + and message will be delivered to the group which was + created for user with the same ID as in msg. + If __group is not zero, then message will be delivered + to the specified group. +int gfp_mask - GFP mask. + +Note: When registering new callback user, connector core assigns +netlink group to the user which is equal to it's id.idx. + +/*****************************************/ +Protocol description. +/*****************************************/ + +Current offers transport layer with fixed header. Recommended +protocol which uses such header is following: + +msg->seq and msg->ack are used to determine message genealogy. When +someone sends message it puts there locally unique sequence and random +acknowledge numbers. Sequence number may be copied into +nlmsghdr->nlmsg_seq too. + +Sequence number is incremented with each message to be sent. + +If we expect reply to our message, then sequence number in received +message MUST be the same as in original message, and acknowledge +number MUST be the same + 1. + +If we receive message and it's sequence number is not equal to one we +are expecting, then it is new message. If we receive message and it's +sequence number is the same as one we are expecting, but it's +acknowledge is not equal acknowledge number in original message + 1, +then it is new message. + +Obviously, protocol header contains above id. + +connector allows event notification in the following form: kernel +driver or userspace process can ask connector to notify it when +selected id's will be turned on or off(registered or unregistered it's +callback). It is done by sending special command to connector +driver(it also registers itself with id={-1, -1}). + +As example of usage Documentation/connector now contains cn_test.c - +testing module which uses connector to request notification and to +send messages. + +/*****************************************/ +Reliability. +/*****************************************/ + +Netlink itself is not reliable protocol, that means that messages can +be lost due to memory pressure or process' receiving queue overflowed, +so caller is warned must be prepared. That is why struct cn_msg [main +connector's message header] contains u32 seq and u32 ack fields. + +/*****************************************/ +Userspace usage. +/*****************************************/ +2.6.14 has a new netlink socket implementation, which by default does not +allow to send data to netlink groups other than 1. +So, if to use netlink socket (for example using connector) +with different group number userspace application must subscribe to +that group. It can be achieved by following pseudocode: + +s = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR); + +l_local.nl_family = AF_NETLINK; +l_local.nl_groups = 12345; +l_local.nl_pid = 0; + +if (bind(s, (struct sockaddr *)&l_local, sizeof(struct sockaddr_nl)) == -1) { + perror("bind"); + close(s); + return -1; +} + +{ + int on = l_local.nl_groups; + setsockopt(s, 270, 1, &on, sizeof(on)); +} + +Where 270 above is SOL_NETLINK, and 1 is a NETLINK_ADD_MEMBERSHIP socket +option. To drop multicast subscription one should call above socket option +with NETLINK_DROP_MEMBERSHIP parameter which is defined as 0. + +2.6.14 netlink code only allows to select a group which is less or equal to +the maximum group number, which is used at netlink_kernel_create() time. +In case of connector it is CN_NETLINK_USERS + 0xf, so if you want to use +group number 12345, you must increment CN_NETLINK_USERS to that number. +Additional 0xf numbers are allocated to be used by non-in-kernel users. + +Due to this limitation, group 0xffffffff does not work now, so one can +not use add/remove connector's group notifications, but as far as I know, +only cn_test.c test module used it. + +Some work in netlink area is still being done, so things can be changed in +2.6.15 timeframe, if it will happen, documentation will be updated for that +kernel. diff --git a/Documentation/cpu-freq/cpufreq-stats.txt b/Documentation/cpu-freq/cpufreq-stats.txt index e2d1e760b4ba..6a82948ff4bd 100644 --- a/Documentation/cpu-freq/cpufreq-stats.txt +++ b/Documentation/cpu-freq/cpufreq-stats.txt @@ -36,7 +36,7 @@ cpufreq stats provides following statistics (explained in detail below). All the statistics will be from the time the stats driver has been inserted to the time when a read of a particular statistic is done. Obviously, stats -driver will not have any information about the the frequcny transitions before +driver will not have any information about the frequency transitions before the stats driver insertion. -------------------------------------------------------------------------------- diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt index ad944c060312..a09a8eb80665 100644 --- a/Documentation/cpusets.txt +++ b/Documentation/cpusets.txt @@ -60,6 +60,18 @@ all of the cpus in the system. This removes any overhead due to load balancing code trying to pull tasks outside of the cpu exclusive cpuset only to be prevented by the tasks' cpus_allowed mask. +A cpuset that is mem_exclusive restricts kernel allocations for +page, buffer and other data commonly shared by the kernel across +multiple users. All cpusets, whether mem_exclusive or not, restrict +allocations of memory for user space. This enables configuring a +system so that several independent jobs can share common kernel +data, such as file system pages, while isolating each jobs user +allocation in its own cpuset. To do this, construct a large +mem_exclusive cpuset to hold all the jobs, and construct child, +non-mem_exclusive cpusets for each individual job. Only a small +amount of typical kernel memory, such as requests from interrupt +handlers, is allowed to be taken outside even a mem_exclusive cpuset. + User level code may create and destroy cpusets by name in the cpuset virtual file system, manage the attributes and permissions of these cpusets and which CPUs and Memory Nodes are assigned to each cpuset, @@ -82,7 +94,7 @@ the available CPU and Memory resources amongst the requesting tasks. But larger systems, which benefit more from careful processor and memory placement to reduce memory access times and contention, and which typically represent a larger investment for the customer, -can benefit from explictly placing jobs on properly sized subsets of +can benefit from explicitly placing jobs on properly sized subsets of the system. This can be especially valuable on: @@ -265,7 +277,7 @@ rewritten to the 'tasks' file of its cpuset. This is done to avoid impacting the scheduler code in the kernel with a check for changes in a tasks processor placement. -There is an exception to the above. If hotplug funtionality is used +There is an exception to the above. If hotplug functionality is used to remove all the CPUs that are currently assigned to a cpuset, then the kernel will automatically update the cpus_allowed of all tasks attached to CPUs in that cpuset to allow all CPUs. When memory diff --git a/Documentation/crypto/api-intro.txt b/Documentation/crypto/api-intro.txt index a2d5b4900772..74dffc68ff9f 100644 --- a/Documentation/crypto/api-intro.txt +++ b/Documentation/crypto/api-intro.txt @@ -223,6 +223,7 @@ CAST5 algorithm contributors: TEA/XTEA algorithm contributors: Aaron Grothe + Michael Ringe Khazad algorithm contributors: Aaron Grothe diff --git a/Documentation/crypto/descore-readme.txt b/Documentation/crypto/descore-readme.txt index 166474c2ee0b..16e9e6350755 100644 --- a/Documentation/crypto/descore-readme.txt +++ b/Documentation/crypto/descore-readme.txt @@ -1,4 +1,4 @@ -Below is the orginal README file from the descore.shar package. +Below is the original README file from the descore.shar package. ------------------------------------------------------------------------------ des - fast & portable DES encryption & decryption. diff --git a/Documentation/dcdbas.txt b/Documentation/dcdbas.txt new file mode 100644 index 000000000000..e1c52e2dc361 --- /dev/null +++ b/Documentation/dcdbas.txt @@ -0,0 +1,91 @@ +Overview + +The Dell Systems Management Base Driver provides a sysfs interface for +systems management software such as Dell OpenManage to perform system +management interrupts and host control actions (system power cycle or +power off after OS shutdown) on certain Dell systems. + +Dell OpenManage requires this driver on the following Dell PowerEdge systems: +300, 1300, 1400, 400SC, 500SC, 1500SC, 1550, 600SC, 1600SC, 650, 1655MC, +700, and 750. Other Dell software such as the open source libsmbios project +is expected to make use of this driver, and it may include the use of this +driver on other Dell systems. + +The Dell libsmbios project aims towards providing access to as much BIOS +information as possible. See http://linux.dell.com/libsmbios/main/ for +more information about the libsmbios project. + + +System Management Interrupt + +On some Dell systems, systems management software must access certain +management information via a system management interrupt (SMI). The SMI data +buffer must reside in 32-bit address space, and the physical address of the +buffer is required for the SMI. The driver maintains the memory required for +the SMI and provides a way for the application to generate the SMI. +The driver creates the following sysfs entries for systems management +software to perform these system management interrupts: + +/sys/devices/platform/dcdbas/smi_data +/sys/devices/platform/dcdbas/smi_data_buf_phys_addr +/sys/devices/platform/dcdbas/smi_data_buf_size +/sys/devices/platform/dcdbas/smi_request + +Systems management software must perform the following steps to execute +a SMI using this driver: + +1) Lock smi_data. +2) Write system management command to smi_data. +3) Write "1" to smi_request to generate a calling interface SMI or + "2" to generate a raw SMI. +4) Read system management command response from smi_data. +5) Unlock smi_data. + + +Host Control Action + +Dell OpenManage supports a host control feature that allows the administrator +to perform a power cycle or power off of the system after the OS has finished +shutting down. On some Dell systems, this host control feature requires that +a driver perform a SMI after the OS has finished shutting down. + +The driver creates the following sysfs entries for systems management software +to schedule the driver to perform a power cycle or power off host control +action after the system has finished shutting down: + +/sys/devices/platform/dcdbas/host_control_action +/sys/devices/platform/dcdbas/host_control_smi_type +/sys/devices/platform/dcdbas/host_control_on_shutdown + +Dell OpenManage performs the following steps to execute a power cycle or +power off host control action using this driver: + +1) Write host control action to be performed to host_control_action. +2) Write type of SMI that driver needs to perform to host_control_smi_type. +3) Write "1" to host_control_on_shutdown to enable host control action. +4) Initiate OS shutdown. + (Driver will perform host control SMI when it is notified that the OS + has finished shutting down.) + + +Host Control SMI Type + +The following table shows the value to write to host_control_smi_type to +perform a power cycle or power off host control action: + +PowerEdge System Host Control SMI Type +---------------- --------------------- + 300 HC_SMITYPE_TYPE1 + 1300 HC_SMITYPE_TYPE1 + 1400 HC_SMITYPE_TYPE2 + 500SC HC_SMITYPE_TYPE2 + 1500SC HC_SMITYPE_TYPE2 + 1550 HC_SMITYPE_TYPE2 + 600SC HC_SMITYPE_TYPE2 + 1600SC HC_SMITYPE_TYPE2 + 650 HC_SMITYPE_TYPE2 + 1655MC HC_SMITYPE_TYPE2 + 700 HC_SMITYPE_TYPE3 + 750 HC_SMITYPE_TYPE3 + + diff --git a/Documentation/dell_rbu.txt b/Documentation/dell_rbu.txt new file mode 100644 index 000000000000..941343a7a265 --- /dev/null +++ b/Documentation/dell_rbu.txt @@ -0,0 +1,100 @@ +Purpose: +Demonstrate the usage of the new open sourced rbu (Remote BIOS Update) driver +for updating BIOS images on Dell servers and desktops. + +Scope: +This document discusses the functionality of the rbu driver only. +It does not cover the support needed from aplications to enable the BIOS to +update itself with the image downloaded in to the memory. + +Overview: +This driver works with Dell OpenManage or Dell Update Packages for updating +the BIOS on Dell servers (starting from servers sold since 1999), desktops +and notebooks (starting from those sold in 2005). +Please go to http://support.dell.com register and you can find info on +OpenManage and Dell Update packages (DUP). +Libsmbios can also be used to update BIOS on Dell systems go to +http://linux.dell.com/libsmbios/ for details. + +Dell_RBU driver supports BIOS update using the monilothic image and packetized +image methods. In case of moniolithic the driver allocates a contiguous chunk +of physical pages having the BIOS image. In case of packetized the app +using the driver breaks the image in to packets of fixed sizes and the driver +would place each packet in contiguous physical memory. The driver also +maintains a link list of packets for reading them back. +If the dell_rbu driver is unloaded all the allocated memory is freed. + +The rbu driver needs to have an application (as mentioned above)which will +inform the BIOS to enable the update in the next system reboot. + +The user should not unload the rbu driver after downloading the BIOS image +or updating. + +The driver load creates the following directories under the /sys file system. +/sys/class/firmware/dell_rbu/loading +/sys/class/firmware/dell_rbu/data +/sys/devices/platform/dell_rbu/image_type +/sys/devices/platform/dell_rbu/data +/sys/devices/platform/dell_rbu/packet_size + +The driver supports two types of update mechanism; monolithic and packetized. +These update mechanism depends upon the BIOS currently running on the system. +Most of the Dell systems support a monolithic update where the BIOS image is +copied to a single contiguous block of physical memory. +In case of packet mechanism the single memory can be broken in smaller chuks +of contiguous memory and the BIOS image is scattered in these packets. + +By default the driver uses monolithic memory for the update type. This can be +changed to packets during the driver load time by specifying the load +parameter image_type=packet. This can also be changed later as below +echo packet > /sys/devices/platform/dell_rbu/image_type + +In packet update mode the packet size has to be given before any packets can +be downloaded. It is done as below +echo XXXX > /sys/devices/platform/dell_rbu/packet_size +In the packet update mechanism, the user neesd to create a new file having +packets of data arranged back to back. It can be done as follows +The user creates packets header, gets the chunk of the BIOS image and +placs it next to the packetheader; now, the packetheader + BIOS image chunk +added to geather should match the specified packet_size. This makes one +packet, the user needs to create more such packets out of the entire BIOS +image file and then arrange all these packets back to back in to one single +file. +This file is then copied to /sys/class/firmware/dell_rbu/data. +Once this file gets to the driver, the driver extracts packet_size data from +the file and spreads it accross the physical memory in contiguous packet_sized +space. +This method makes sure that all the packets get to the driver in a single operation. + +In monolithic update the user simply get the BIOS image (.hdr file) and copies +to the data file as is without any change to the BIOS image itself. + +Do the steps below to download the BIOS image. +1) echo 1 > /sys/class/firmware/dell_rbu/loading +2) cp bios_image.hdr /sys/class/firmware/dell_rbu/data +3) echo 0 > /sys/class/firmware/dell_rbu/loading + +The /sys/class/firmware/dell_rbu/ entries will remain till the following is +done. +echo -1 > /sys/class/firmware/dell_rbu/loading. +Until this step is completed the driver cannot be unloaded. +Also echoing either mono ,packet or init in to image_type will free up the +memory allocated by the driver. + +If an user by accident executes steps 1 and 3 above without executing step 2; +it will make the /sys/class/firmware/dell_rbu/ entries to disappear. +The entries can be recreated by doing the following +echo init > /sys/devices/platform/dell_rbu/image_type +NOTE: echoing init in image_type does not change it original value. + +Also the driver provides /sys/devices/platform/dell_rbu/data readonly file to +read back the image downloaded. + +NOTE: +This driver requires a patch for firmware_class.c which has the modified +request_firmware_nowait function. +Also after updating the BIOS image an user mdoe application neeeds to execute +code which message the BIOS update request to the BIOS. So on the next reboot +the BIOS knows about the new image downloaded and it updates it self. +Also don't unload the rbu drive if the image has to be updated. + diff --git a/Documentation/device-mapper/snapshot.txt b/Documentation/device-mapper/snapshot.txt new file mode 100644 index 000000000000..dca274ff4005 --- /dev/null +++ b/Documentation/device-mapper/snapshot.txt @@ -0,0 +1,73 @@ +Device-mapper snapshot support +============================== + +Device-mapper allows you, without massive data copying: + +*) To create snapshots of any block device i.e. mountable, saved states of +the block device which are also writable without interfering with the +original content; +*) To create device "forks", i.e. multiple different versions of the +same data stream. + + +In both cases, dm copies only the chunks of data that get changed and +uses a separate copy-on-write (COW) block device for storage. + + +There are two dm targets available: snapshot and snapshot-origin. + +*) snapshot-origin <origin> + +which will normally have one or more snapshots based on it. +You must create the snapshot-origin device before you can create snapshots. +Reads will be mapped directly to the backing device. For each write, the +original data will be saved in the <COW device> of each snapshot to keep +its visible content unchanged, at least until the <COW device> fills up. + + +*) snapshot <origin> <COW device> <persistent?> <chunksize> + +A snapshot is created of the <origin> block device. Changed chunks of +<chunksize> sectors will be stored on the <COW device>. Writes will +only go to the <COW device>. Reads will come from the <COW device> or +from <origin> for unchanged data. <COW device> will often be +smaller than the origin and if it fills up the snapshot will become +useless and be disabled, returning errors. So it is important to monitor +the amount of free space and expand the <COW device> before it fills up. + +<persistent?> is P (Persistent) or N (Not persistent - will not survive +after reboot). + + +How this is used by LVM2 +======================== +When you create the first LVM2 snapshot of a volume, four dm devices are used: + +1) a device containing the original mapping table of the source volume; +2) a device used as the <COW device>; +3) a "snapshot" device, combining #1 and #2, which is the visible snapshot + volume; +4) the "original" volume (which uses the device number used by the original + source volume), whose table is replaced by a "snapshot-origin" mapping + from device #1. + +A fixed naming scheme is used, so with the following commands: + +lvcreate -L 1G -n base volumeGroup +lvcreate -L 100M --snapshot -n snap volumeGroup/base + +we'll have this situation (with volumes in above order): + +# dmsetup table|grep volumeGroup + +volumeGroup-base-real: 0 2097152 linear 8:19 384 +volumeGroup-snap-cow: 0 204800 linear 8:19 2097536 +volumeGroup-snap: 0 2097152 snapshot 254:11 254:12 P 16 +volumeGroup-base: 0 2097152 snapshot-origin 254:11 + +# ls -lL /dev/mapper/volumeGroup-* +brw------- 1 root root 254, 11 29 ago 18:15 /dev/mapper/volumeGroup-base-real +brw------- 1 root root 254, 12 29 ago 18:15 /dev/mapper/volumeGroup-snap-cow +brw------- 1 root root 254, 13 29 ago 18:15 /dev/mapper/volumeGroup-snap +brw------- 1 root root 254, 10 29 ago 18:14 /dev/mapper/volumeGroup-base + diff --git a/Documentation/dontdiff b/Documentation/dontdiff index 96bea278bbf6..24adfe9af3ca 100644 --- a/Documentation/dontdiff +++ b/Documentation/dontdiff @@ -55,6 +55,7 @@ aic7*seq.h* aicasm aicdb.h* asm +asm-offsets.* asm_offsets.* autoconf.h* bbootsect diff --git a/Documentation/driver-model/driver.txt b/Documentation/driver-model/driver.txt index fabaca1ab1b0..59806c9761f7 100644 --- a/Documentation/driver-model/driver.txt +++ b/Documentation/driver-model/driver.txt @@ -14,8 +14,8 @@ struct device_driver { int (*probe) (struct device * dev); int (*remove) (struct device * dev); - int (*suspend) (struct device * dev, pm_message_t state, u32 level); - int (*resume) (struct device * dev, u32 level); + int (*suspend) (struct device * dev, pm_message_t state); + int (*resume) (struct device * dev); }; @@ -194,69 +194,13 @@ device; i.e. anything in the device's driver_data field. If the device is still present, it should quiesce the device and place it into a supported low-power state. - int (*suspend) (struct device * dev, pm_message_t state, u32 level); + int (*suspend) (struct device * dev, pm_message_t state); -suspend is called to put the device in a low power state. There are -several stages to successfully suspending a device, which is denoted in -the @level parameter. Breaking the suspend transition into several -stages affords the platform flexibility in performing device power -management based on the requirements of the system and the -user-defined policy. +suspend is called to put the device in a low power state. -SUSPEND_NOTIFY notifies the device that a suspend transition is about -to happen. This happens on system power state transitions to verify -that all devices can successfully suspend. + int (*resume) (struct device * dev); -A driver may choose to fail on this call, which should cause the -entire suspend transition to fail. A driver should fail only if it -knows that the device will not be able to be resumed properly when the -system wakes up again. It could also fail if it somehow determines it -is in the middle of an operation too important to stop. - -SUSPEND_DISABLE tells the device to stop I/O transactions. When it -stops transactions, or what it should do with unfinished transactions -is a policy of the driver. After this call, the driver should not -accept any other I/O requests. - -SUSPEND_SAVE_STATE tells the device to save the context of the -hardware. This includes any bus-specific hardware state and -device-specific hardware state. A pointer to this saved state can be -stored in the device's saved_state field. - -SUSPEND_POWER_DOWN tells the driver to place the device in the low -power state requested. - -Whether suspend is called with a given level is a policy of the -platform. Some levels may be omitted; drivers must not assume the -reception of any level. However, all levels must be called in the -order above; i.e. notification will always come before disabling; -disabling the device will come before suspending the device. - -All calls are made with interrupts enabled, except for the -SUSPEND_POWER_DOWN level. - - int (*resume) (struct device * dev, u32 level); - -Resume is used to bring a device back from a low power state. Like the -suspend transition, it happens in several stages. - -RESUME_POWER_ON tells the driver to set the power state to the state -before the suspend call (The device could have already been in a low -power state before the suspend call to put in a lower power state). - -RESUME_RESTORE_STATE tells the driver to restore the state saved by -the SUSPEND_SAVE_STATE suspend call. - -RESUME_ENABLE tells the driver to start accepting I/O transactions -again. Depending on driver policy, the device may already have pending -I/O requests. - -RESUME_POWER_ON is called with interrupts disabled. The other resume -levels are called with interrupts enabled. - -As with the various suspend stages, the driver must not assume that -any other resume calls have been or will be made. Each call should be -self-contained and not dependent on any external state. +Resume is used to bring a device back from a low power state. Attributes diff --git a/Documentation/driver-model/porting.txt b/Documentation/driver-model/porting.txt index ff2fef2107f0..98b233cb8b36 100644 --- a/Documentation/driver-model/porting.txt +++ b/Documentation/driver-model/porting.txt @@ -350,7 +350,7 @@ When a driver is registered, the bus's list of devices is iterated over. bus->match() is called for each device that is not already claimed by a driver. -When a device is successfully bound to a device, device->driver is +When a device is successfully bound to a driver, device->driver is set, the device is added to a per-driver list of devices, and a symlink is created in the driver's sysfs directory that points to the device's physical directory: diff --git a/Documentation/dvb/bt8xx.txt b/Documentation/dvb/bt8xx.txt index e6b8d05bc08d..cb63b7a93c82 100644 --- a/Documentation/dvb/bt8xx.txt +++ b/Documentation/dvb/bt8xx.txt @@ -1,55 +1,74 @@ -How to get the Nebula Electronics DigiTV, Pinnacle PCTV Sat, Twinhan DST + clones working -========================================================================================= +How to get the Nebula, PCTV and Twinhan DST cards working +========================================================= -1) General information -====================== +This class of cards has a bt878a as the PCI interface, and +require the bttv driver. -This class of cards has a bt878a chip as the PCI interface. -The different card drivers require the bttv driver to provide the means -to access the i2c bus and the gpio pins of the bt8xx chipset. +Please pay close attention to the warning about the bttv module +options below for the DST card. -2) Compilation rules for Kernel >= 2.6.12 -========================================= +1) General informations +======================= -Enable the following options: +These drivers require the bttv driver to provide the means to access +the i2c bus and the gpio pins of the bt8xx chipset. +Because of this, you need to enable "Device drivers" => "Multimedia devices" - => "Video For Linux" => "BT848 Video For Linux" + => "Video For Linux" => "BT848 Video For Linux" + +Furthermore you need to enable "Device drivers" => "Multimedia devices" => "Digital Video Broadcasting Devices" - => "DVB for Linux" "DVB Core Support" "Nebula/Pinnacle PCTV/TwinHan PCI Cards" + => "DVB for Linux" "DVB Core Support" "BT8xx based PCI cards" -3) Loading Modules, described by two approaches -=============================================== +2) Loading Modules +================== In general you need to load the bttv driver, which will handle the gpio and -i2c communication for us, plus the common dvb-bt8xx device driver, -which is called the backend. -The frontends for Nebula DigiTV (nxt6000), Pinnacle PCTV Sat (cx24110), -TwinHan DST + clones (dst and dst-ca) are loaded automatically by the backend. -For further details about TwinHan DST + clones see /Documentation/dvb/ci.txt. +i2c communication for us, plus the common dvb-bt8xx device driver. +The frontends for Nebula (nxt6000), Pinnacle PCTV (cx24110) and +TwinHan (dst) are loaded automatically by the dvb-bt8xx device driver. -3a) The manual approach ------------------------ +3a) Nebula / Pinnacle PCTV +-------------------------- -Loading modules: -modprobe bttv -modprobe dvb-bt8xx + $ modprobe bttv (normally bttv is being loaded automatically by kmod) + $ modprobe dvb-bt8xx (or just place dvb-bt8xx in /etc/modules for automatic loading) -Unloading modules: -modprobe -r dvb-bt8xx -modprobe -r bttv -3b) The automatic approach +3b) TwinHan and Clones -------------------------- -If not already done by installation, place a line either in -/etc/modules.conf or in /etc/modprobe.conf containing this text: -alias char-major-81 bttv + $ modprobe bttv i2c_hw=1 card=0x71 + $ modprobe dvb-bt8xx + $ modprobe dst + +The value 0x71 will override the PCI type detection for dvb-bt8xx, +which is necessary for TwinHan cards. + +If you're having an older card (blue color circuit) and card=0x71 locks +your machine, try using 0x68, too. If that does not work, ask on the +mailing list. + +The DST module takes a couple of useful parameters. + +verbose takes values 0 to 4. These values control the verbosity level, +and can be used to debug also. + +verbose=0 means complete disabling of messages + 1 only error messages are displayed + 2 notifications are also displayed + 3 informational messages are also displayed + 4 debug setting + +dst_addons takes values 0 and 0x20. A value of 0 means it is a FTA card. +0x20 means it has a Conditional Access slot. + +The autodected values are determined bythe cards 'response +string' which you can see in your logs e.g. -Then place a line in /etc/modules containing this text: -dvb-bt8xx +dst_get_device_id: Recognise [DSTMCI] -Reboot your system and have fun! -- -Authors: Richard Walker, Jamie Honan, Michael Hunold, Manu Abraham, Uwe Bugla +Authors: Richard Walker, Jamie Honan, Michael Hunold, Manu Abraham diff --git a/Documentation/dvb/ci.txt b/Documentation/dvb/ci.txt index 62e0701b542a..95f0e73b2135 100644 --- a/Documentation/dvb/ci.txt +++ b/Documentation/dvb/ci.txt @@ -23,7 +23,6 @@ This application requires the following to function properly as of now. eg: $ szap -c channels.conf -r "TMC" -x (b) a channels.conf containing a valid PMT PID - eg: TMC:11996:h:0:27500:278:512:650:321 here 278 is a valid PMT PID. the rest of the values are the @@ -31,13 +30,7 @@ This application requires the following to function properly as of now. (c) after running a szap, you have to run ca_zap, for the descrambler to function, - - eg: $ ca_zap patched_channels.conf "TMC" - - The patched means a patch to apply to scan, such that scan can - generate a channels.conf_with pmt, which has this PMT PID info - (NOTE: szap cannot use this channels.conf with the PMT_PID) - + eg: $ ca_zap channels.conf "TMC" (d) Hopeflly Enjoy your favourite subscribed channel as you do with a FTA card. diff --git a/Documentation/exception.txt b/Documentation/exception.txt index f1d436993eb1..3cb39ade290e 100644 --- a/Documentation/exception.txt +++ b/Documentation/exception.txt @@ -7,7 +7,7 @@ To protect itself the kernel has to verify this address. In older versions of Linux this was done with the int verify_area(int type, const void * addr, unsigned long size) -function. +function (which has since been replaced by access_ok()). This function verified that the memory area starting at address addr and of size size was accessible for the operation specified diff --git a/Documentation/fb/cyblafb/bugs b/Documentation/fb/cyblafb/bugs new file mode 100644 index 000000000000..f90cc66ea919 --- /dev/null +++ b/Documentation/fb/cyblafb/bugs @@ -0,0 +1,14 @@ +Bugs +==== + +I currently don't know of any bug. Please do send reports to: + - linux-fbdev-devel@lists.sourceforge.net + - Knut_Petersen@t-online.de. + + +Untested features +================= + +All LCD stuff is untested. If it worked in tridentfb, it should work in +cyblafb. Please test and report the results to Knut_Petersen@t-online.de. + diff --git a/Documentation/fb/cyblafb/credits b/Documentation/fb/cyblafb/credits new file mode 100644 index 000000000000..0eb3b443dc2b --- /dev/null +++ b/Documentation/fb/cyblafb/credits @@ -0,0 +1,7 @@ +Thanks to +========= + * Alan Hourihane, for writing the X trident driver + * Jani Monoses, for writing the tridentfb driver + * Antonino A. Daplas, for review of the first published + version of cyblafb and some code + * Jochen Hein, for testing and a helpfull bug report diff --git a/Documentation/fb/cyblafb/documentation b/Documentation/fb/cyblafb/documentation new file mode 100644 index 000000000000..bb1aac048425 --- /dev/null +++ b/Documentation/fb/cyblafb/documentation @@ -0,0 +1,17 @@ +Available Documentation +======================= + +Apollo PLE 133 Chipset VT8601A North Bridge Datasheet, Rev. 1.82, October 22, +2001, available from VIA: + + http://www.viavpsd.com/product/6/15/DS8601A182.pdf + +The datasheet is incomplete, some registers that need to be programmed are not +explained at all and important bits are listed as "reserved". But you really +need the datasheet to understand the code. "p. xxx" comments refer to page +numbers of this document. + +XFree/XOrg drivers are available and of good quality, looking at the code +there is a good idea if the datasheet does not provide enough information +or if the datasheet seems to be wrong. + diff --git a/Documentation/fb/cyblafb/fb.modes b/Documentation/fb/cyblafb/fb.modes new file mode 100644 index 000000000000..cf4351fc32ff --- /dev/null +++ b/Documentation/fb/cyblafb/fb.modes @@ -0,0 +1,155 @@ +# +# Sample fb.modes file +# +# Provides an incomplete list of working modes for +# the cyberblade/i1 graphics core. +# +# The value 4294967256 is used instead of -40. Of course, -40 is not +# a really reasonable value, but chip design does not always follow +# logic. Believe me, it's ok, and it's the way the BIOS does it. +# +# fbset requires 4294967256 in fb.modes and -40 as an argument to +# the -t parameter. That's also not too reasonable, and it might change +# in the future or might even be differt for your current version. +# + +mode "640x480-50" + geometry 640 480 640 3756 8 + timings 47619 4294967256 24 17 0 216 3 +endmode + +mode "640x480-60" + geometry 640 480 640 3756 8 + timings 39682 4294967256 24 17 0 216 3 +endmode + +mode "640x480-70" + geometry 640 480 640 3756 8 + timings 34013 4294967256 24 17 0 216 3 +endmode + +mode "640x480-72" + geometry 640 480 640 3756 8 + timings 33068 4294967256 24 17 0 216 3 +endmode + +mode "640x480-75" + geometry 640 480 640 3756 8 + timings 31746 4294967256 24 17 0 216 3 +endmode + +mode "640x480-80" + geometry 640 480 640 3756 8 + timings 29761 4294967256 24 17 0 216 3 +endmode + +mode "640x480-85" + geometry 640 480 640 3756 8 + timings 28011 4294967256 24 17 0 216 3 +endmode + +mode "800x600-50" + geometry 800 600 800 3221 8 + timings 30303 96 24 14 0 136 11 +endmode + +mode "800x600-60" + geometry 800 600 800 3221 8 + timings 25252 96 24 14 0 136 11 +endmode + +mode "800x600-70" + geometry 800 600 800 3221 8 + timings 21645 96 24 14 0 136 11 +endmode + +mode "800x600-72" + geometry 800 600 800 3221 8 + timings 21043 96 24 14 0 136 11 +endmode + +mode "800x600-75" + geometry 800 600 800 3221 8 + timings 20202 96 24 14 0 136 11 +endmode + +mode "800x600-80" + geometry 800 600 800 3221 8 + timings 18939 96 24 14 0 136 11 +endmode + +mode "800x600-85" + geometry 800 600 800 3221 8 + timings 17825 96 24 14 0 136 11 +endmode + +mode "1024x768-50" + geometry 1024 768 1024 2815 8 + timings 19054 144 24 29 0 120 3 +endmode + +mode "1024x768-60" + geometry 1024 768 1024 2815 8 + timings 15880 144 24 29 0 120 3 +endmode + +mode "1024x768-70" + geometry 1024 768 1024 2815 8 + timings 13610 144 24 29 0 120 3 +endmode + +mode "1024x768-72" + geometry 1024 768 1024 2815 8 + timings 13232 144 24 29 0 120 3 +endmode + +mode "1024x768-75" + geometry 1024 768 1024 2815 8 + timings 12703 144 24 29 0 120 3 +endmode + +mode "1024x768-80" + geometry 1024 768 1024 2815 8 + timings 11910 144 24 29 0 120 3 +endmode + +mode "1024x768-85" + geometry 1024 768 1024 2815 8 + timings 11209 144 24 29 0 120 3 +endmode + +mode "1280x1024-50" + geometry 1280 1024 1280 2662 8 + timings 11114 232 16 39 0 160 3 +endmode + +mode "1280x1024-60" + geometry 1280 1024 1280 2662 8 + timings 9262 232 16 39 0 160 3 +endmode + +mode "1280x1024-70" + geometry 1280 1024 1280 2662 8 + timings 7939 232 16 39 0 160 3 +endmode + +mode "1280x1024-72" + geometry 1280 1024 1280 2662 8 + timings 7719 232 16 39 0 160 3 +endmode + +mode "1280x1024-75" + geometry 1280 1024 1280 2662 8 + timings 7410 232 16 39 0 160 3 +endmode + +mode "1280x1024-80" + geometry 1280 1024 1280 2662 8 + timings 6946 232 16 39 0 160 3 +endmode + +mode "1280x1024-85" + geometry 1280 1024 1280 2662 8 + timings 6538 232 16 39 0 160 3 +endmode + diff --git a/Documentation/fb/cyblafb/performance b/Documentation/fb/cyblafb/performance new file mode 100644 index 000000000000..eb4e47a9cea6 --- /dev/null +++ b/Documentation/fb/cyblafb/performance @@ -0,0 +1,80 @@ +Speed +===== + +CyBlaFB is much faster than tridentfb and vesafb. Compare the performance data +for mode 1280x1024-[8,16,32]@61 Hz. + +Test 1: Cat a file with 2000 lines of 0 characters. +Test 2: Cat a file with 2000 lines of 80 characters. +Test 3: Cat a file with 2000 lines of 160 characters. + +All values show system time use in seconds, kernel 2.6.12 was used for +the measurements. 2.6.13 is a bit slower, 2.6.14 hopefully will include a +patch that speeds up kernel bitblitting a lot ( > 20%). + ++-----------+-----------------------------------------------------+ +| | not accelerated | +| TRIDENTFB +-----------------+-----------------+-----------------+ +| of 2.6.12 | 8 bpp | 16 bpp | 32 bpp | +| | noypan | ypan | noypan | ypan | noypan | ypan | ++-----------+--------+--------+--------+--------+--------+--------+ +| Test 1 | 4.31 | 4.33 | 6.05 | 12.81 | ---- | ---- | +| Test 2 | 67.94 | 5.44 | 123.16 | 14.79 | ---- | ---- | +| Test 3 | 131.36 | 6.55 | 240.12 | 16.76 | ---- | ---- | ++-----------+--------+--------+--------+--------+--------+--------+ +| Comments | | | completely bro- | +| | | | ken, monitor | +| | | | switches off | ++-----------+-----------------+-----------------+-----------------+ + + ++-----------+-----------------------------------------------------+ +| | accelerated | +| TRIDENTFB +-----------------+-----------------+-----------------+ +| of 2.6.12 | 8 bpp | 16 bpp | 32 bpp | +| | noypan | ypan | noypan | ypan | noypan | ypan | ++-----------+--------+--------+--------+--------+--------+--------+ +| Test 1 | ---- | ---- | 20.62 | 1.22 | ---- | ---- | +| Test 2 | ---- | ---- | 22.61 | 3.19 | ---- | ---- | +| Test 3 | ---- | ---- | 24.59 | 5.16 | ---- | ---- | ++-----------+--------+--------+--------+--------+--------+--------+ +| Comments | broken, writing | broken, ok only | completely bro- | +| | to wrong places | if bgcolor is | ken, monitor | +| | on screen + bug | black, bug in | switches off | +| | in fillrect() | fillrect() | | ++-----------+-----------------+-----------------+-----------------+ + + ++-----------+-----------------------------------------------------+ +| | not accelerated | +| VESAFB +-----------------+-----------------+-----------------+ +| of 2.6.12 | 8 bpp | 16 bpp | 32 bpp | +| | noypan | ypan | noypan | ypan | noypan | ypan | ++-----------+--------+--------+--------+--------+--------+--------+ +| Test 1 | 4.26 | 3.76 | 5.99 | 7.23 | ---- | ---- | +| Test 2 | 65.65 | 4.89 | 120.88 | 9.08 | ---- | ---- | +| Test 3 | 126.91 | 5.94 | 235.77 | 11.03 | ---- | ---- | ++-----------+--------+--------+--------+--------+--------+--------+ +| Comments | vga=0x307 | vga=0x31a | vga=0x31b not | +| | fh=80kHz | fh=80kHz | supported by | +| | fv=75kHz | fv=75kHz | video BIOS and | +| | | | hardware | ++-----------+-----------------+-----------------+-----------------+ + + ++-----------+-----------------------------------------------------+ +| | accelerated | +| CYBLAFB +-----------------+-----------------+-----------------+ +| | 8 bpp | 16 bpp | 32 bpp | +| | noypan | ypan | noypan | ypan | noypan | ypan | ++-----------+--------+--------+--------+--------+--------+--------+ +| Test 1 | 8.02 | 0.23 | 19.04 | 0.61 | 57.12 | 2.74 | +| Test 2 | 8.38 | 0.55 | 19.39 | 0.92 | 57.54 | 3.13 | +| Test 3 | 8.73 | 0.86 | 19.74 | 1.24 | 57.95 | 3.51 | ++-----------+--------+--------+--------+--------+--------+--------+ +| Comments | | | | +| | | | | +| | | | | +| | | | | ++-----------+-----------------+-----------------+-----------------+ + diff --git a/Documentation/fb/cyblafb/todo b/Documentation/fb/cyblafb/todo new file mode 100644 index 000000000000..80fb2f89b6c1 --- /dev/null +++ b/Documentation/fb/cyblafb/todo @@ -0,0 +1,32 @@ +TODO / Missing features +======================= + +Verify LCD stuff "stretch" and "center" options are + completely untested ... this code needs to be + verified. As I don't have access to such + hardware, please contact me if you are + willing run some tests. + +Interlaced video modes The reason that interleaved + modes are disabled is that I do not know + the meaning of the vertical interlace + parameter. Also the datasheet mentions a + bit d8 of a horizontal interlace parameter, + but nowhere the lower 8 bits. Please help + if you can. + +low-res double scan modes Who needs it? + +accelerated color blitting Who needs it? The console driver does use color + blitting for nothing but drawing the penguine, + everything else is done using color expanding + blitting of 1bpp character bitmaps. + +xpanning Who needs it? + +ioctls Who needs it? + +TV-out Will be done later + +??? Feel free to contact me if you have any + feature requests diff --git a/Documentation/fb/cyblafb/usage b/Documentation/fb/cyblafb/usage new file mode 100644 index 000000000000..e627c8f54211 --- /dev/null +++ b/Documentation/fb/cyblafb/usage @@ -0,0 +1,206 @@ +CyBlaFB is a framebuffer driver for the Cyberblade/i1 graphics core integrated +into the VIA Apollo PLE133 (aka vt8601) south bridge. It is developed and +tested using a VIA EPIA 5000 board. + +Cyblafb - compiled into the kernel or as a module? +================================================== + +You might compile cyblafb either as a module or compile it permanently into the +kernel. + +Unless you have a real reason to do so you should not compile both vesafb and +cyblafb permanently into the kernel. It's possible and it helps during the +developement cycle, but it's useless and will at least block some otherwise +usefull memory for ordinary users. + +Selecting Modes +=============== + + Startup Mode + ============ + + First of all, you might use the "vga=???" boot parameter as it is + documented in vesafb.txt and svga.txt. Cyblafb will detect the video + mode selected and will use the geometry and timings found by + inspecting the hardware registers. + + video=cyblafb vga=0x317 + + Alternatively you might use a combination of the mode, ref and bpp + parameters. If you compiled the driver into the kernel, add something + like this to the kernel command line: + + video=cyblafb:1280x1024,bpp=16,ref=50 ... + + If you compiled the driver as a module, the same mode would be + selected by the following command: + + modprobe cyblafb mode=1280x1024 bpp=16 ref=50 ... + + None of the modes possible to select as startup modes are affected by + the problems described at the end of the next subsection. + + Mode changes using fbset + ======================== + + You might use fbset to change the video mode, see "man fbset". Cyblafb + generally does assume that you know what you are doing. But it does + some checks, especially those that are needed to prevent you from + damaging your hardware. + + - only 8, 16, 24 and 32 bpp video modes are accepted + - interlaced video modes are not accepted + - double scan video modes are not accepted + - if a flat panel is found, cyblafb does not allow you + to program a resolution higher than the physical + resolution of the flat panel monitor + - cyblafb does not allow xres to differ from xres_virtual + - cyblafb does not allow vclk to exceed 230 MHz. As 32 bpp + and (currently) 24 bit modes use a doubled vclk internally, + the dotclock limit as seen by fbset is 115 MHz for those + modes and 230 MHz for 8 and 16 bpp modes. + + Any request that violates the rules given above will be ignored and + fbset will return an error. + + If you program a virtual y resolution higher than the hardware limit, + cyblafb will silently decrease that value to the highest possible + value. + + Attempts to disable acceleration are ignored. + + Some video modes that should work do not work as expected. If you use + the standard fb.modes, fbset 640x480-60 will program that mode, but + you will see a vertical area, about two characters wide, with only + much darker characters than the other characters on the screen. + Cyblafb does allow that mode to be set, as it does not violate the + official specifications. It would need a lot of code to reliably sort + out all invalid modes, playing around with the margin values will + give a valid mode quickly. And if cyblafb would detect such an invalid + mode, should it silently alter the requested values or should it + report an error? Both options have some pros and cons. As stated + above, none of the startup modes are affected, and if you set + verbosity to 1 or higher, cyblafb will print the fbset command that + would be needed to program that mode using fbset. + + +Other Parameters +================ + + +crt don't autodetect, assume monitor connected to + standard VGA connector + +fp don't autodetect, assume flat panel display + connected to flat panel monitor interface + +nativex inform driver about native x resolution of + flat panel monitor connected to special + interface (should be autodetected) + +stretch stretch image to adapt low resolution modes to + higer resolutions of flat panel monitors + connected to special interface + +center center image to adapt low resolution modes to + higer resolutions of flat panel monitors + connected to special interface + +memsize use if autodetected memsize is wrong ... + should never be necessary + +nopcirr disable PCI read retry +nopciwr disable PCI write retry +nopcirb disable PCI read bursts +nopciwb disable PCI write bursts + +bpp bpp for specified modes + valid values: 8 || 16 || 24 || 32 + +ref refresh rate for specified mode + valid values: 50 <= ref <= 85 + +mode 640x480 or 800x600 or 1024x768 or 1280x1024 + if not specified, the startup mode will be detected + and used, so you might also use the vga=??? parameter + described in vesafb.txt. If you do not specify a mode, + bpp and ref parameters are ignored. + +verbosity 0 is the default, increase to at least 2 for every + bug report! + +vesafb allows cyblafb to be loaded after vesafb has been + loaded. See sections "Module unloading ...". + + +Development hints +================= + +It's much faster do compile a module and to load the new version after +unloading the old module than to compile a new kernel and to reboot. So if you +try to work on cyblafb, it might be a good idea to use cyblafb as a module. +In real life, fast often means dangerous, and that's also the case here. If +you introduce a serious bug when cyblafb is compiled into the kernel, the +kernel will lock or oops with a high probability before the file system is +mounted, and the danger for your data is low. If you load a broken own version +of cyblafb on a running system, the danger for the integrity of the file +system is much higher as you might need a hard reset afterwards. Decide +yourself. + +Module unloading, the vfb method +================================ + +If you want to unload/reload cyblafb using the virtual framebuffer, you need +to enable vfb support in the kernel first. After that, load the modules as +shown below: + + modprobe vfb vfb_enable=1 + modprobe fbcon + modprobe cyblafb + fbset -fb /dev/fb1 1280x1024-60 -vyres 2662 + con2fb /dev/fb1 /dev/tty1 + ... + +If you now made some changes to cyblafb and want to reload it, you might do it +as show below: + + con2fb /dev/fb0 /dev/tty1 + ... + rmmod cyblafb + modprobe cyblafb + con2fb /dev/fb1 /dev/tty1 + ... + +Of course, you might choose another mode, and most certainly you also want to +map some other /dev/tty* to the real framebuffer device. You might also choose +to compile fbcon as a kernel module or place it permanently in the kernel. + +I do not know of any way to unload fbcon, and fbcon will prevent the +framebuffer device loaded first from unloading. [If there is a way, then +please add a description here!] + +Module unloading, the vesafb method +=================================== + +Configure the kernel: + + <*> Support for frame buffer devices + [*] VESA VGA graphics support + <M> Cyberblade/i1 support + +Add e.g. "video=vesafb:ypan vga=0x307" to the kernel parameters. The ypan +parameter is important, choose any vga parameter you like as long as it is +a graphics mode. + +After booting, load cyblafb without any mode and bpp parameter and assign +cyblafb to individual ttys using con2fb, e.g.: + + modprobe cyblafb vesafb=1 + con2fb /dev/fb1 /dev/tty1 + +Unloading cyblafb works without problems after you assign vesafb to all +ttys again, e.g.: + + con2fb /dev/fb0 /dev/tty1 + rmmod cyblafb + diff --git a/Documentation/fb/cyblafb/whycyblafb b/Documentation/fb/cyblafb/whycyblafb new file mode 100644 index 000000000000..a123bc11e698 --- /dev/null +++ b/Documentation/fb/cyblafb/whycyblafb @@ -0,0 +1,85 @@ +I tried the following framebuffer drivers: + + - TRIDENTFB is full of bugs. Acceleration is broken for Blade3D + graphics cores like the cyberblade/i1. It claims to support a great + number of devices, but documentation for most of these devices is + unfortunately not available. There is _no_ reason to use tridentfb + for cyberblade/i1 + CRT users. VESAFB is faster, and the one + advantage, mode switching, is broken in tridentfb. + + - VESAFB is used by many distributions as a standard. Vesafb does + not support mode switching. VESAFB is a bit faster than the working + configurations of TRIDENTFB, but it is still too slow, even if you + use ypan. + + - EPIAFB (you'll find it on sourceforge) supports the Cyberblade/i1 + graphics core, but it still has serious bugs and developement seems + to have stopped. This is the one driver with TV-out support. If you + do need this feature, try epiafb. + +None of these drivers was a real option for me. + +I believe that is unreasonable to change code that announces to support 20 +devices if I only have more or less sufficient documentation for exactly one +of these. The risk of breaking device foo while fixing device bar is too high. + +So I decided to start CyBlaFB as a stripped down tridentfb. + +All code specific to other Trident chips has been removed. After that there +were a lot of cosmetic changes to increase the readability of the code. All +register names were changed to those mnemonics used in the datasheet. Function +and macro names were changed if they hindered easy understanding of the code. + +After that I debugged the code and implemented some new features. I'll try to +give a little summary of the main changes: + + - calculation of vertical and horizontal timings was fixed + + - video signal quality has been improved dramatically + + - acceleration: + + - fillrect and copyarea were fixed and reenabled + + - color expanding imageblit was newly implemented, color + imageblit (only used to draw the penguine) still uses the + generic code. + + - init of the acceleration engine was improved and moved to a + place where it really works ... + + - sync function has a timeout now and tries to reset and + reinit the accel engine if necessary + + - fewer slow copyarea calls when doing ypan scrolling by using + undocumented bit d21 of screen start address stored in + CR2B[5]. BIOS does use it also, so this should be safe. + + - cyblafb rejects any attempt to set modes that would cause vclk + values above reasonable 230 MHz. 32bit modes use a clock + multiplicator of 2, so fbset does show the correct values for + pixclock but not for vclk in this case. The fbset limit is 115 MHz + for 32 bpp modes. + + - cyblafb rejects modes known to be broken or unimplemented (all + interlaced modes, all doublescan modes for now) + + - cyblafb now works independant of the video mode in effect at startup + time (tridentfb does not init all needed registers to reasonable + values) + + - switching between video modes does work reliably now + + - the first video mode now is the one selected on startup using the + vga=???? mechanism or any of + - 640x480, 800x600, 1024x768, 1280x1024 + - 8, 16, 24 or 32 bpp + - refresh between 50 Hz and 85 Hz, 1 Hz steps (1280x1024-32 + is limited to 63Hz) + + - pci retry and pci burst mode are settable (try to disable if you + experience latency problems) + + - built as a module cyblafb might be unloaded and reloaded using + the vfb module and con2vt or might be used together with vesafb + diff --git a/Documentation/fb/intel810.txt b/Documentation/fb/intel810.txt index fd68b162e4a1..4f0d6bc789ef 100644 --- a/Documentation/fb/intel810.txt +++ b/Documentation/fb/intel810.txt @@ -5,6 +5,7 @@ Intel 810/815 Framebuffer driver March 17, 2002 First Released: July 2001 + Last Update: September 12, 2005 ================================================================ A. Introduction @@ -44,6 +45,8 @@ B. Features - Hardware Cursor Support + - Supports EDID probing either by DDC/I2C or through the BIOS + C. List of available options a. "video=i810fb" @@ -52,14 +55,17 @@ C. List of available options Recommendation: required b. "xres:<value>" - select horizontal resolution in pixels + select horizontal resolution in pixels. (This parameter will be + ignored if 'mode_option' is specified. See 'o' below). Recommendation: user preference (default = 640) c. "yres:<value>" select vertical resolution in scanlines. If Discrete Video Timings - is enabled, this will be ignored and computed as 3*xres/4. + is enabled, this will be ignored and computed as 3*xres/4. (This + parameter will be ignored if 'mode_option' is specified. See 'o' + below) Recommendation: user preference (default = 480) @@ -86,7 +92,8 @@ C. List of available options g. "hsync1/hsync2:<value>" select the minimum and maximum Horizontal Sync Frequency of the monitor in KHz. If a using a fixed frequency monitor, hsync1 must - be equal to hsync2. + be equal to hsync2. If EDID probing is successful, these will be + ignored and values will be taken from the EDID block. Recommendation: check monitor manual for correct values default (29/30) @@ -94,7 +101,8 @@ C. List of available options h. "vsync1/vsync2:<value>" select the minimum and maximum Vertical Sync Frequency of the monitor in Hz. You can also use this option to lock your monitor's refresh - rate. + rate. If EDID probing is successful, these will be ignored and values + will be taken from the EDID block. Recommendation: check monitor manual for correct values (default = 60/60) @@ -154,7 +162,11 @@ C. List of available options Recommendation: do not set (default = not set) - + o. <xres>x<yres>[-<bpp>][@<refresh>] + The driver will now accept specification of boot mode option. If this + is specified, the options 'xres' and 'yres' will be ignored. See + Documentation/fb/modedb.txt for usage. + D. Kernel booting Separate each option/option-pair by commas (,) and the option from its value @@ -176,7 +188,10 @@ will be computed based on the hsync1/hsync2 and vsync1/vsync2 values. IMPORTANT: You must include hsync1, hsync2, vsync1 and vsync2 to enable video modes -better than 640x480 at 60Hz. +better than 640x480 at 60Hz. HOWEVER, if your chipset/display combination +supports I2C and has an EDID block, you can safely exclude hsync1, hsync2, +vsync1 and vsync2 parameters. These parameters will be taken from the EDID +block. E. Module options @@ -217,32 +232,21 @@ F. Setup This is required. The option is under "Character Devices" d. Under "Graphics Support", select "Intel 810/815" either statically - or as a module. Choose "use VESA GTF for video timings" if you - need to maximize the capability of your display. To be on the + or as a module. Choose "use VESA Generalized Timing Formula" if + you need to maximize the capability of your display. To be on the safe side, you can leave this unselected. - e. If you want a framebuffer console, enable it under "Console + e. If you want support for DDC/I2C probing (Plug and Play Displays), + set 'Enable DDC Support' to 'y'. To make this option appear, set + 'use VESA Generalized Timing Formula' to 'y'. + + f. If you want a framebuffer console, enable it under "Console Drivers" - f. Compile your kernel. + g. Compile your kernel. - g. Load the driver as described in section D and E. + h. Load the driver as described in section D and E. - Optional: - h. If you are going to run XFree86 with its native drivers, the - standard XFree86 4.1.0 and 4.2.0 drivers should work as is. - However, there's a bug in the XFree86 i810 drivers. It attempts - to use XAA even when switched to the console. This will crash - your server. I have a fix at this site: - - http://i810fb.sourceforge.net. - - You can either use the patch, or just replace - - /usr/X11R6/lib/modules/drivers/i810_drv.o - - with the one provided at the website. - i. Try the DirectFB (http://www.directfb.org) + the i810 gfxdriver patch to see the chipset in action (or inaction :-). diff --git a/Documentation/fb/modedb.txt b/Documentation/fb/modedb.txt index e04458b319d5..4fcdb4cf4cca 100644 --- a/Documentation/fb/modedb.txt +++ b/Documentation/fb/modedb.txt @@ -20,12 +20,83 @@ in a video= option, fbmem considers that to be a global video mode option. Valid mode specifiers (mode_option argument): - <xres>x<yres>[-<bpp>][@<refresh>] + <xres>x<yres>[M][R][-<bpp>][@<refresh>][i][m] <name>[-<bpp>][@<refresh>] with <xres>, <yres>, <bpp> and <refresh> decimal numbers and <name> a string. Things between square brackets are optional. +If 'M' is specified in the mode_option argument (after <yres> and before +<bpp> and <refresh>, if specified) the timings will be calculated using +VESA(TM) Coordinated Video Timings instead of looking up the mode from a table. +If 'R' is specified, do a 'reduced blanking' calculation for digital displays. +If 'i' is specified, calculate for an interlaced mode. And if 'm' is +specified, add margins to the calculation (1.8% of xres rounded down to 8 +pixels and 1.8% of yres). + + Sample usage: 1024x768M@60m - CVT timing with margins + +***** oOo ***** oOo ***** oOo ***** oOo ***** oOo ***** oOo ***** oOo ***** + +What is the VESA(TM) Coordinated Video Timings (CVT)? + +From the VESA(TM) Website: + + "The purpose of CVT is to provide a method for generating a consistent + and coordinated set of standard formats, display refresh rates, and + timing specifications for computer display products, both those + employing CRTs, and those using other display technologies. The + intention of CVT is to give both source and display manufacturers a + common set of tools to enable new timings to be developed in a + consistent manner that ensures greater compatibility." + +This is the third standard approved by VESA(TM) concerning video timings. The +first was the Discrete Video Timings (DVT) which is a collection of +pre-defined modes approved by VESA(TM). The second is the Generalized Timing +Formula (GTF) which is an algorithm to calculate the timings, given the +pixelclock, the horizontal sync frequency, or the vertical refresh rate. + +The GTF is limited by the fact that it is designed mainly for CRT displays. +It artificially increases the pixelclock because of its high blanking +requirement. This is inappropriate for digital display interface with its high +data rate which requires that it conserves the pixelclock as much as possible. +Also, GTF does not take into account the aspect ratio of the display. + +The CVT addresses these limitations. If used with CRT's, the formula used +is a derivation of GTF with a few modifications. If used with digital +displays, the "reduced blanking" calculation can be used. + +From the framebuffer subsystem perspective, new formats need not be added +to the global mode database whenever a new mode is released by display +manufacturers. Specifying for CVT will work for most, if not all, relatively +new CRT displays and probably with most flatpanels, if 'reduced blanking' +calculation is specified. (The CVT compatibility of the display can be +determined from its EDID. The version 1.3 of the EDID has extra 128-byte +blocks where additional timing information is placed. As of this time, there +is no support yet in the layer to parse this additional blocks.) + +CVT also introduced a new naming convention (should be seen from dmesg output): + + <pix>M<a>[-R] + + where: pix = total amount of pixels in MB (xres x yres) + M = always present + a = aspect ratio (3 - 4:3; 4 - 5:4; 9 - 15:9, 16:9; A - 16:10) + -R = reduced blanking + + example: .48M3-R - 800x600 with reduced blanking + +Note: VESA(TM) has restrictions on what is a standard CVT timing: + + - aspect ratio can only be one of the above values + - acceptable refresh rates are 50, 60, 70 or 85 Hz only + - if reduced blanking, the refresh rate must be at 60Hz + +If one of the above are not satisfied, the kernel will print a warning but the +timings will still be calculated. + +***** oOo ***** oOo ***** oOo ***** oOo ***** oOo ***** oOo ***** oOo ***** + To find a suitable video mode, you just call int __init fb_find_mode(struct fb_var_screeninfo *var, diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index 8b1430b46655..b67189a8d8d4 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -17,32 +17,6 @@ Who: Greg Kroah-Hartman <greg@kroah.com> --------------------------- -What: ACPI S4bios support -When: May 2005 -Why: Noone uses it, and it probably does not work, anyway. swsusp is - faster, more reliable, and people are actually using it. -Who: Pavel Machek <pavel@suse.cz> - ---------------------------- - -What: PCI Name Database (CONFIG_PCI_NAMES) -When: July 2005 -Why: It bloats the kernel unnecessarily, and is handled by userspace better - (pciutils supports it.) Will eliminate the need to try to keep the - pci.ids file in sync with the sf.net database all of the time. -Who: Greg Kroah-Hartman <gregkh@suse.de> - ---------------------------- - -What: io_remap_page_range() (macro or function) -When: September 2005 -Why: Replaced by io_remap_pfn_range() which allows more memory space - addressabilty (by using a pfn) and supports sparc & sparc64 - iospace as part of the pfn. -Who: Randy Dunlap <rddunlap@osdl.org> - ---------------------------- - What: RAW driver (CONFIG_RAW_DRIVER) When: December 2005 Why: declared obsolete since kernel 2.6.3 @@ -51,14 +25,6 @@ Who: Adrian Bunk <bunk@stusta.de> --------------------------- -What: register_ioctl32_conversion() / unregister_ioctl32_conversion() -When: April 2005 -Why: Replaced by ->compat_ioctl in file_operations and other method - vecors. -Who: Andi Kleen <ak@muc.de>, Christoph Hellwig <hch@lst.de> - ---------------------------- - What: RCU API moves to EXPORT_SYMBOL_GPL When: April 2006 Files: include/linux/rcupdate.h, kernel/rcupdate.c @@ -74,14 +40,6 @@ Who: Paul E. McKenney <paulmck@us.ibm.com> --------------------------- -What: remove verify_area() -When: July 2006 -Files: Various uaccess.h headers. -Why: Deprecated and redundant. access_ok() should be used instead. -Who: Jesper Juhl <juhl-lkml@dif.dk> - ---------------------------- - What: IEEE1394 Audio and Music Data Transmission Protocol driver, Connection Management Procedures driver When: November 2005 @@ -102,16 +60,6 @@ Who: Jody McIntyre <scjody@steamballoon.com> --------------------------- -What: register_serial/unregister_serial -When: September 2005 -Why: This interface does not allow serial ports to be registered against - a struct device, and as such does not allow correct power management - of such ports. 8250-based ports should use serial8250_register_port - and serial8250_unregister_port, or platform devices instead. -Who: Russell King <rmk@arm.linux.org.uk> - ---------------------------- - What: i2c sysfs name change: in1_ref, vid deprecated in favour of cpu0_vid When: November 2005 Files: drivers/i2c/chips/adm1025.c, drivers/i2c/chips/adm1026.c @@ -135,3 +83,15 @@ Why: With the 16-bit PCMCIA subsystem now behaving (almost) like a pcmciautils package available at http://kernel.org/pub/linux/utils/kernel/pcmcia/ Who: Dominik Brodowski <linux@brodo.de> + +--------------------------- + +What: ip_queue and ip6_queue (old ipv4-only and ipv6-only netfilter queue) +When: December 2005 +Why: This interface has been obsoleted by the new layer3-independent + "nfnetlink_queue". The Kernel interface is compatible, so the old + ip[6]tables "QUEUE" targets still work and will transparently handle + all packets into nfnetlink queue number 0. Userspace users will have + to link against API-compatible library on top of libnfnetlink_queue + instead of the current 'libipq'. +Who: Harald Welte <laforge@netfilter.org> diff --git a/Documentation/filesystems/files.txt b/Documentation/filesystems/files.txt new file mode 100644 index 000000000000..8c206f4e0250 --- /dev/null +++ b/Documentation/filesystems/files.txt @@ -0,0 +1,123 @@ +File management in the Linux kernel +----------------------------------- + +This document describes how locking for files (struct file) +and file descriptor table (struct files) works. + +Up until 2.6.12, the file descriptor table has been protected +with a lock (files->file_lock) and reference count (files->count). +->file_lock protected accesses to all the file related fields +of the table. ->count was used for sharing the file descriptor +table between tasks cloned with CLONE_FILES flag. Typically +this would be the case for posix threads. As with the common +refcounting model in the kernel, the last task doing +a put_files_struct() frees the file descriptor (fd) table. +The files (struct file) themselves are protected using +reference count (->f_count). + +In the new lock-free model of file descriptor management, +the reference counting is similar, but the locking is +based on RCU. The file descriptor table contains multiple +elements - the fd sets (open_fds and close_on_exec, the +array of file pointers, the sizes of the sets and the array +etc.). In order for the updates to appear atomic to +a lock-free reader, all the elements of the file descriptor +table are in a separate structure - struct fdtable. +files_struct contains a pointer to struct fdtable through +which the actual fd table is accessed. Initially the +fdtable is embedded in files_struct itself. On a subsequent +expansion of fdtable, a new fdtable structure is allocated +and files->fdtab points to the new structure. The fdtable +structure is freed with RCU and lock-free readers either +see the old fdtable or the new fdtable making the update +appear atomic. Here are the locking rules for +the fdtable structure - + +1. All references to the fdtable must be done through + the files_fdtable() macro : + + struct fdtable *fdt; + + rcu_read_lock(); + + fdt = files_fdtable(files); + .... + if (n <= fdt->max_fds) + .... + ... + rcu_read_unlock(); + + files_fdtable() uses rcu_dereference() macro which takes care of + the memory barrier requirements for lock-free dereference. + The fdtable pointer must be read within the read-side + critical section. + +2. Reading of the fdtable as described above must be protected + by rcu_read_lock()/rcu_read_unlock(). + +3. For any update to the the fd table, files->file_lock must + be held. + +4. To look up the file structure given an fd, a reader + must use either fcheck() or fcheck_files() APIs. These + take care of barrier requirements due to lock-free lookup. + An example : + + struct file *file; + + rcu_read_lock(); + file = fcheck(fd); + if (file) { + ... + } + .... + rcu_read_unlock(); + +5. Handling of the file structures is special. Since the look-up + of the fd (fget()/fget_light()) are lock-free, it is possible + that look-up may race with the last put() operation on the + file structure. This is avoided using the rcuref APIs + on ->f_count : + + rcu_read_lock(); + file = fcheck_files(files, fd); + if (file) { + if (rcuref_inc_lf(&file->f_count)) + *fput_needed = 1; + else + /* Didn't get the reference, someone's freed */ + file = NULL; + } + rcu_read_unlock(); + .... + return file; + + rcuref_inc_lf() detects if refcounts is already zero or + goes to zero during increment. If it does, we fail + fget()/fget_light(). + +6. Since both fdtable and file structures can be looked up + lock-free, they must be installed using rcu_assign_pointer() + API. If they are looked up lock-free, rcu_dereference() + must be used. However it is advisable to use files_fdtable() + and fcheck()/fcheck_files() which take care of these issues. + +7. While updating, the fdtable pointer must be looked up while + holding files->file_lock. If ->file_lock is dropped, then + another thread expand the files thereby creating a new + fdtable and making the earlier fdtable pointer stale. + For example : + + spin_lock(&files->file_lock); + fd = locate_fd(files, file, start); + if (fd >= 0) { + /* locate_fd() may have expanded fdtable, load the ptr */ + fdt = files_fdtable(files); + FD_SET(fd, fdt->open_fds); + FD_CLR(fd, fdt->close_on_exec); + spin_unlock(&files->file_lock); + ..... + + Since locate_fd() can drop ->file_lock (and reacquire ->file_lock), + the fdtable pointer (fdt) must be loaded after locate_fd(). + diff --git a/Documentation/filesystems/fuse.txt b/Documentation/filesystems/fuse.txt new file mode 100644 index 000000000000..6b5741e651a2 --- /dev/null +++ b/Documentation/filesystems/fuse.txt @@ -0,0 +1,315 @@ +Definitions +~~~~~~~~~~~ + +Userspace filesystem: + + A filesystem in which data and metadata are provided by an ordinary + userspace process. The filesystem can be accessed normally through + the kernel interface. + +Filesystem daemon: + + The process(es) providing the data and metadata of the filesystem. + +Non-privileged mount (or user mount): + + A userspace filesystem mounted by a non-privileged (non-root) user. + The filesystem daemon is running with the privileges of the mounting + user. NOTE: this is not the same as mounts allowed with the "user" + option in /etc/fstab, which is not discussed here. + +Mount owner: + + The user who does the mounting. + +User: + + The user who is performing filesystem operations. + +What is FUSE? +~~~~~~~~~~~~~ + +FUSE is a userspace filesystem framework. It consists of a kernel +module (fuse.ko), a userspace library (libfuse.*) and a mount utility +(fusermount). + +One of the most important features of FUSE is allowing secure, +non-privileged mounts. This opens up new possibilities for the use of +filesystems. A good example is sshfs: a secure network filesystem +using the sftp protocol. + +The userspace library and utilities are available from the FUSE +homepage: + + http://fuse.sourceforge.net/ + +Mount options +~~~~~~~~~~~~~ + +'fd=N' + + The file descriptor to use for communication between the userspace + filesystem and the kernel. The file descriptor must have been + obtained by opening the FUSE device ('/dev/fuse'). + +'rootmode=M' + + The file mode of the filesystem's root in octal representation. + +'user_id=N' + + The numeric user id of the mount owner. + +'group_id=N' + + The numeric group id of the mount owner. + +'default_permissions' + + By default FUSE doesn't check file access permissions, the + filesystem is free to implement it's access policy or leave it to + the underlying file access mechanism (e.g. in case of network + filesystems). This option enables permission checking, restricting + access based on file mode. This is option is usually useful + together with the 'allow_other' mount option. + +'allow_other' + + This option overrides the security measure restricting file access + to the user mounting the filesystem. This option is by default only + allowed to root, but this restriction can be removed with a + (userspace) configuration option. + +'max_read=N' + + With this option the maximum size of read operations can be set. + The default is infinite. Note that the size of read requests is + limited anyway to 32 pages (which is 128kbyte on i386). + +How do non-privileged mounts work? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Since the mount() system call is a privileged operation, a helper +program (fusermount) is needed, which is installed setuid root. + +The implication of providing non-privileged mounts is that the mount +owner must not be able to use this capability to compromise the +system. Obvious requirements arising from this are: + + A) mount owner should not be able to get elevated privileges with the + help of the mounted filesystem + + B) mount owner should not get illegitimate access to information from + other users' and the super user's processes + + C) mount owner should not be able to induce undesired behavior in + other users' or the super user's processes + +How are requirements fulfilled? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + + A) The mount owner could gain elevated privileges by either: + + 1) creating a filesystem containing a device file, then opening + this device + + 2) creating a filesystem containing a suid or sgid application, + then executing this application + + The solution is not to allow opening device files and ignore + setuid and setgid bits when executing programs. To ensure this + fusermount always adds "nosuid" and "nodev" to the mount options + for non-privileged mounts. + + B) If another user is accessing files or directories in the + filesystem, the filesystem daemon serving requests can record the + exact sequence and timing of operations performed. This + information is otherwise inaccessible to the mount owner, so this + counts as an information leak. + + The solution to this problem will be presented in point 2) of C). + + C) There are several ways in which the mount owner can induce + undesired behavior in other users' processes, such as: + + 1) mounting a filesystem over a file or directory which the mount + owner could otherwise not be able to modify (or could only + make limited modifications). + + This is solved in fusermount, by checking the access + permissions on the mountpoint and only allowing the mount if + the mount owner can do unlimited modification (has write + access to the mountpoint, and mountpoint is not a "sticky" + directory) + + 2) Even if 1) is solved the mount owner can change the behavior + of other users' processes. + + i) It can slow down or indefinitely delay the execution of a + filesystem operation creating a DoS against the user or the + whole system. For example a suid application locking a + system file, and then accessing a file on the mount owner's + filesystem could be stopped, and thus causing the system + file to be locked forever. + + ii) It can present files or directories of unlimited length, or + directory structures of unlimited depth, possibly causing a + system process to eat up diskspace, memory or other + resources, again causing DoS. + + The solution to this as well as B) is not to allow processes + to access the filesystem, which could otherwise not be + monitored or manipulated by the mount owner. Since if the + mount owner can ptrace a process, it can do all of the above + without using a FUSE mount, the same criteria as used in + ptrace can be used to check if a process is allowed to access + the filesystem or not. + + Note that the ptrace check is not strictly necessary to + prevent B/2/i, it is enough to check if mount owner has enough + privilege to send signal to the process accessing the + filesystem, since SIGSTOP can be used to get a similar effect. + +I think these limitations are unacceptable? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If a sysadmin trusts the users enough, or can ensure through other +measures, that system processes will never enter non-privileged +mounts, it can relax the last limitation with a "user_allow_other" +config option. If this config option is set, the mounting user can +add the "allow_other" mount option which disables the check for other +users' processes. + +Kernel - userspace interface +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The following diagram shows how a filesystem operation (in this +example unlink) is performed in FUSE. + +NOTE: everything in this description is greatly simplified + + | "rm /mnt/fuse/file" | FUSE filesystem daemon + | | + | | >sys_read() + | | >fuse_dev_read() + | | >request_wait() + | | [sleep on fc->waitq] + | | + | >sys_unlink() | + | >fuse_unlink() | + | [get request from | + | fc->unused_list] | + | >request_send() | + | [queue req on fc->pending] | + | [wake up fc->waitq] | [woken up] + | >request_wait_answer() | + | [sleep on req->waitq] | + | | <request_wait() + | | [remove req from fc->pending] + | | [copy req to read buffer] + | | [add req to fc->processing] + | | <fuse_dev_read() + | | <sys_read() + | | + | | [perform unlink] + | | + | | >sys_write() + | | >fuse_dev_write() + | | [look up req in fc->processing] + | | [remove from fc->processing] + | | [copy write buffer to req] + | [woken up] | [wake up req->waitq] + | | <fuse_dev_write() + | | <sys_write() + | <request_wait_answer() | + | <request_send() | + | [add request to | + | fc->unused_list] | + | <fuse_unlink() | + | <sys_unlink() | + +There are a couple of ways in which to deadlock a FUSE filesystem. +Since we are talking about unprivileged userspace programs, +something must be done about these. + +Scenario 1 - Simple deadlock +----------------------------- + + | "rm /mnt/fuse/file" | FUSE filesystem daemon + | | + | >sys_unlink("/mnt/fuse/file") | + | [acquire inode semaphore | + | for "file"] | + | >fuse_unlink() | + | [sleep on req->waitq] | + | | <sys_read() + | | >sys_unlink("/mnt/fuse/file") + | | [acquire inode semaphore + | | for "file"] + | | *DEADLOCK* + +The solution for this is to allow requests to be interrupted while +they are in userspace: + + | [interrupted by signal] | + | <fuse_unlink() | + | [release semaphore] | [semaphore acquired] + | <sys_unlink() | + | | >fuse_unlink() + | | [queue req on fc->pending] + | | [wake up fc->waitq] + | | [sleep on req->waitq] + +If the filesystem daemon was single threaded, this will stop here, +since there's no other thread to dequeue and execute the request. +In this case the solution is to kill the FUSE daemon as well. If +there are multiple serving threads, you just have to kill them as +long as any remain. + +Moral: a filesystem which deadlocks, can soon find itself dead. + +Scenario 2 - Tricky deadlock +---------------------------- + +This one needs a carefully crafted filesystem. It's a variation on +the above, only the call back to the filesystem is not explicit, +but is caused by a pagefault. + + | Kamikaze filesystem thread 1 | Kamikaze filesystem thread 2 + | | + | [fd = open("/mnt/fuse/file")] | [request served normally] + | [mmap fd to 'addr'] | + | [close fd] | [FLUSH triggers 'magic' flag] + | [read a byte from addr] | + | >do_page_fault() | + | [find or create page] | + | [lock page] | + | >fuse_readpage() | + | [queue READ request] | + | [sleep on req->waitq] | + | | [read request to buffer] + | | [create reply header before addr] + | | >sys_write(addr - headerlength) + | | >fuse_dev_write() + | | [look up req in fc->processing] + | | [remove from fc->processing] + | | [copy write buffer to req] + | | >do_page_fault() + | | [find or create page] + | | [lock page] + | | * DEADLOCK * + +Solution is again to let the the request be interrupted (not +elaborated further). + +An additional problem is that while the write buffer is being +copied to the request, the request must not be interrupted. This +is because the destination address of the copy may not be valid +after the request is interrupted. + +This is solved with doing the copy atomically, and allowing +interruption while the page(s) belonging to the write buffer are +faulted with get_user_pages(). The 'req->locked' flag indicates +when the copy is taking place, and interruption is delayed until +this flag is unset. + diff --git a/Documentation/filesystems/ntfs.txt b/Documentation/filesystems/ntfs.txt index eef4aca0c753..614de3124901 100644 --- a/Documentation/filesystems/ntfs.txt +++ b/Documentation/filesystems/ntfs.txt @@ -50,9 +50,14 @@ userspace utilities, etc. Features ======== -- This is a complete rewrite of the NTFS driver that used to be in the kernel. - This new driver implements NTFS read support and is functionally equivalent - to the old ntfs driver. +- This is a complete rewrite of the NTFS driver that used to be in the 2.4 and + earlier kernels. This new driver implements NTFS read support and is + functionally equivalent to the old ntfs driver and it also implements limited + write support. The biggest limitation at present is that files/directories + cannot be created or deleted. See below for the list of write features that + are so far supported. Another limitation is that writing to compressed files + is not implemented at all. Also, neither read nor write access to encrypted + files is so far implemented. - The new driver has full support for sparse files on NTFS 3.x volumes which the old driver isn't happy with. - The new driver supports execution of binaries due to mmap() now being @@ -78,7 +83,20 @@ Features - The new driver supports fsync(2), fdatasync(2), and msync(2). - The new driver supports readv(2) and writev(2). - The new driver supports access time updates (including mtime and ctime). - +- The new driver supports truncate(2) and open(2) with O_TRUNC. But at present + only very limited support for highly fragmented files, i.e. ones which have + their data attribute split across multiple extents, is included. Another + limitation is that at present truncate(2) will never create sparse files, + since to mark a file sparse we need to modify the directory entry for the + file and we do not implement directory modifications yet. +- The new driver supports write(2) which can both overwrite existing data and + extend the file size so that you can write beyond the existing data. Also, + writing into sparse regions is supported and the holes are filled in with + clusters. But at present only limited support for highly fragmented files, + i.e. ones which have their data attribute split across multiple extents, is + included. Another limitation is that write(2) will never create sparse + files, since to mark a file sparse we need to modify the directory entry for + the file and we do not implement directory modifications yet. Supported mount options ======================= @@ -439,6 +457,34 @@ ChangeLog Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog. +2.1.25: + - Write support is now extended with write(2) being able to both + overwrite existing file data and to extend files. Also, if a write + to a sparse region occurs, write(2) will fill in the hole. Note, + mmap(2) based writes still do not support writing into holes or + writing beyond the initialized size. + - Write support has a new feature and that is that truncate(2) and + open(2) with O_TRUNC are now implemented thus files can be both made + smaller and larger. + - Note: Both write(2) and truncate(2)/open(2) with O_TRUNC still have + limitations in that they + - only provide limited support for highly fragmented files. + - only work on regular, i.e. uncompressed and unencrypted files. + - never create sparse files although this will change once directory + operations are implemented. + - Lots of bug fixes and enhancements across the board. +2.1.24: + - Support journals ($LogFile) which have been modified by chkdsk. This + means users can boot into Windows after we marked the volume dirty. + The Windows boot will run chkdsk and then reboot. The user can then + immediately boot into Linux rather than having to do a full Windows + boot first before rebooting into Linux and we will recognize such a + journal and empty it as it is clean by definition. + - Support journals ($LogFile) with only one restart page as well as + journals with two different restart pages. We sanity check both and + either use the only sane one or the more recent one of the two in the + case that both are valid. + - Lots of bug fixes and enhancements across the board. 2.1.23: - Stamp the user space journal, aka transaction log, aka $UsnJrnl, if it is present and active thus telling Windows and applications using diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 6c98f2bd421e..d4773565ea2f 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -133,6 +133,7 @@ Table 1-1: Process specific entries in /proc statm Process memory status information status Process status in human readable form wchan If CONFIG_KALLSYMS is set, a pre-decoded wchan + smaps Extension based on maps, presenting the rss size for each mapped file .............................................................................. For example, to get the status information of a process, all you have to do is @@ -1240,16 +1241,38 @@ swap-intensive. overcommit_memory ----------------- -This file contains one value. The following algorithm is used to decide if -there's enough memory: if the value of overcommit_memory is positive, then -there's always enough memory. This is a useful feature, since programs often -malloc() huge amounts of memory 'just in case', while they only use a small -part of it. Leaving this value at 0 will lead to the failure of such a huge -malloc(), when in fact the system has enough memory for the program to run. +Controls overcommit of system memory, possibly allowing processes +to allocate (but not use) more memory than is actually available. -On the other hand, enabling this feature can cause you to run out of memory -and thrash the system to death, so large and/or important servers will want to -set this value to 0. + +0 - Heuristic overcommit handling. Obvious overcommits of + address space are refused. Used for a typical system. It + ensures a seriously wild allocation fails while allowing + overcommit to reduce swap usage. root is allowed to + allocate slighly more memory in this mode. This is the + default. + +1 - Always overcommit. Appropriate for some scientific + applications. + +2 - Don't overcommit. The total address space commit + for the system is not permitted to exceed swap plus a + configurable percentage (default is 50) of physical RAM. + Depending on the percentage you use, in most situations + this means a process will not be killed while attempting + to use already-allocated memory but will receive errors + on memory allocation as appropriate. + +overcommit_ratio +---------------- + +Percentage of physical memory size to include in overcommit calculations +(see above.) + +Memory allocation limit = swapspace + physmem * (overcommit_ratio / 100) + + swapspace = total size of all swap areas + physmem = size of physical memory in system nr_hugepages and hugetlb_shm_group ---------------------------------- diff --git a/Documentation/filesystems/relayfs.txt b/Documentation/filesystems/relayfs.txt new file mode 100644 index 000000000000..d803abed29f0 --- /dev/null +++ b/Documentation/filesystems/relayfs.txt @@ -0,0 +1,362 @@ + +relayfs - a high-speed data relay filesystem +============================================ + +relayfs is a filesystem designed to provide an efficient mechanism for +tools and facilities to relay large and potentially sustained streams +of data from kernel space to user space. + +The main abstraction of relayfs is the 'channel'. A channel consists +of a set of per-cpu kernel buffers each represented by a file in the +relayfs filesystem. Kernel clients write into a channel using +efficient write functions which automatically log to the current cpu's +channel buffer. User space applications mmap() the per-cpu files and +retrieve the data as it becomes available. + +The format of the data logged into the channel buffers is completely +up to the relayfs client; relayfs does however provide hooks which +allow clients to impose some structure on the buffer data. Nor does +relayfs implement any form of data filtering - this also is left to +the client. The purpose is to keep relayfs as simple as possible. + +This document provides an overview of the relayfs API. The details of +the function parameters are documented along with the functions in the +filesystem code - please see that for details. + +Semantics +========= + +Each relayfs channel has one buffer per CPU, each buffer has one or +more sub-buffers. Messages are written to the first sub-buffer until +it is too full to contain a new message, in which case it it is +written to the next (if available). Messages are never split across +sub-buffers. At this point, userspace can be notified so it empties +the first sub-buffer, while the kernel continues writing to the next. + +When notified that a sub-buffer is full, the kernel knows how many +bytes of it are padding i.e. unused. Userspace can use this knowledge +to copy only valid data. + +After copying it, userspace can notify the kernel that a sub-buffer +has been consumed. + +relayfs can operate in a mode where it will overwrite data not yet +collected by userspace, and not wait for it to consume it. + +relayfs itself does not provide for communication of such data between +userspace and kernel, allowing the kernel side to remain simple and not +impose a single interface on userspace. It does provide a separate +helper though, described below. + +klog, relay-app & librelay +========================== + +relayfs itself is ready to use, but to make things easier, two +additional systems are provided. klog is a simple wrapper to make +writing formatted text or raw data to a channel simpler, regardless of +whether a channel to write into exists or not, or whether relayfs is +compiled into the kernel or is configured as a module. relay-app is +the kernel counterpart of userspace librelay.c, combined these two +files provide glue to easily stream data to disk, without having to +bother with housekeeping. klog and relay-app can be used together, +with klog providing high-level logging functions to the kernel and +relay-app taking care of kernel-user control and disk-logging chores. + +It is possible to use relayfs without relay-app & librelay, but you'll +have to implement communication between userspace and kernel, allowing +both to convey the state of buffers (full, empty, amount of padding). + +klog, relay-app and librelay can be found in the relay-apps tarball on +http://relayfs.sourceforge.net + +The relayfs user space API +========================== + +relayfs implements basic file operations for user space access to +relayfs channel buffer data. Here are the file operations that are +available and some comments regarding their behavior: + +open() enables user to open an _existing_ buffer. + +mmap() results in channel buffer being mapped into the caller's + memory space. Note that you can't do a partial mmap - you must + map the entire file, which is NRBUF * SUBBUFSIZE. + +read() read the contents of a channel buffer. The bytes read are + 'consumed' by the reader i.e. they won't be available again + to subsequent reads. If the channel is being used in + no-overwrite mode (the default), it can be read at any time + even if there's an active kernel writer. If the channel is + being used in overwrite mode and there are active channel + writers, results may be unpredictable - users should make + sure that all logging to the channel has ended before using + read() with overwrite mode. + +poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are + notified when sub-buffer boundaries are crossed. + +close() decrements the channel buffer's refcount. When the refcount + reaches 0 i.e. when no process or kernel client has the buffer + open, the channel buffer is freed. + + +In order for a user application to make use of relayfs files, the +relayfs filesystem must be mounted. For example, + + mount -t relayfs relayfs /mnt/relay + +NOTE: relayfs doesn't need to be mounted for kernel clients to create + or use channels - it only needs to be mounted when user space + applications need access to the buffer data. + + +The relayfs kernel API +====================== + +Here's a summary of the API relayfs provides to in-kernel clients: + + + channel management functions: + + relay_open(base_filename, parent, subbuf_size, n_subbufs, + callbacks) + relay_close(chan) + relay_flush(chan) + relay_reset(chan) + relayfs_create_dir(name, parent) + relayfs_remove_dir(dentry) + + channel management typically called on instigation of userspace: + + relay_subbufs_consumed(chan, cpu, subbufs_consumed) + + write functions: + + relay_write(chan, data, length) + __relay_write(chan, data, length) + relay_reserve(chan, length) + + callbacks: + + subbuf_start(buf, subbuf, prev_subbuf, prev_padding) + buf_mapped(buf, filp) + buf_unmapped(buf, filp) + + helper functions: + + relay_buf_full(buf) + subbuf_start_reserve(buf, length) + + +Creating a channel +------------------ + +relay_open() is used to create a channel, along with its per-cpu +channel buffers. Each channel buffer will have an associated file +created for it in the relayfs filesystem, which can be opened and +mmapped from user space if desired. The files are named +basename0...basenameN-1 where N is the number of online cpus, and by +default will be created in the root of the filesystem. If you want a +directory structure to contain your relayfs files, you can create it +with relayfs_create_dir() and pass the parent directory to +relay_open(). Clients are responsible for cleaning up any directory +structure they create when the channel is closed - use +relayfs_remove_dir() for that. + +The total size of each per-cpu buffer is calculated by multiplying the +number of sub-buffers by the sub-buffer size passed into relay_open(). +The idea behind sub-buffers is that they're basically an extension of +double-buffering to N buffers, and they also allow applications to +easily implement random-access-on-buffer-boundary schemes, which can +be important for some high-volume applications. The number and size +of sub-buffers is completely dependent on the application and even for +the same application, different conditions will warrant different +values for these parameters at different times. Typically, the right +values to use are best decided after some experimentation; in general, +though, it's safe to assume that having only 1 sub-buffer is a bad +idea - you're guaranteed to either overwrite data or lose events +depending on the channel mode being used. + +Channel 'modes' +--------------- + +relayfs channels can be used in either of two modes - 'overwrite' or +'no-overwrite'. The mode is entirely determined by the implementation +of the subbuf_start() callback, as described below. In 'overwrite' +mode, also known as 'flight recorder' mode, writes continuously cycle +around the buffer and will never fail, but will unconditionally +overwrite old data regardless of whether it's actually been consumed. +In no-overwrite mode, writes will fail i.e. data will be lost, if the +number of unconsumed sub-buffers equals the total number of +sub-buffers in the channel. It should be clear that if there is no +consumer or if the consumer can't consume sub-buffers fast enought, +data will be lost in either case; the only difference is whether data +is lost from the beginning or the end of a buffer. + +As explained above, a relayfs channel is made of up one or more +per-cpu channel buffers, each implemented as a circular buffer +subdivided into one or more sub-buffers. Messages are written into +the current sub-buffer of the channel's current per-cpu buffer via the +write functions described below. Whenever a message can't fit into +the current sub-buffer, because there's no room left for it, the +client is notified via the subbuf_start() callback that a switch to a +new sub-buffer is about to occur. The client uses this callback to 1) +initialize the next sub-buffer if appropriate 2) finalize the previous +sub-buffer if appropriate and 3) return a boolean value indicating +whether or not to actually go ahead with the sub-buffer switch. + +To implement 'no-overwrite' mode, the userspace client would provide +an implementation of the subbuf_start() callback something like the +following: + +static int subbuf_start(struct rchan_buf *buf, + void *subbuf, + void *prev_subbuf, + unsigned int prev_padding) +{ + if (prev_subbuf) + *((unsigned *)prev_subbuf) = prev_padding; + + if (relay_buf_full(buf)) + return 0; + + subbuf_start_reserve(buf, sizeof(unsigned int)); + + return 1; +} + +If the current buffer is full i.e. all sub-buffers remain unconsumed, +the callback returns 0 to indicate that the buffer switch should not +occur yet i.e. until the consumer has had a chance to read the current +set of ready sub-buffers. For the relay_buf_full() function to make +sense, the consumer is reponsible for notifying relayfs when +sub-buffers have been consumed via relay_subbufs_consumed(). Any +subsequent attempts to write into the buffer will again invoke the +subbuf_start() callback with the same parameters; only when the +consumer has consumed one or more of the ready sub-buffers will +relay_buf_full() return 0, in which case the buffer switch can +continue. + +The implementation of the subbuf_start() callback for 'overwrite' mode +would be very similar: + +static int subbuf_start(struct rchan_buf *buf, + void *subbuf, + void *prev_subbuf, + unsigned int prev_padding) +{ + if (prev_subbuf) + *((unsigned *)prev_subbuf) = prev_padding; + + subbuf_start_reserve(buf, sizeof(unsigned int)); + + return 1; +} + +In this case, the relay_buf_full() check is meaningless and the +callback always returns 1, causing the buffer switch to occur +unconditionally. It's also meaningless for the client to use the +relay_subbufs_consumed() function in this mode, as it's never +consulted. + +The default subbuf_start() implementation, used if the client doesn't +define any callbacks, or doesn't define the subbuf_start() callback, +implements the simplest possible 'no-overwrite' mode i.e. it does +nothing but return 0. + +Header information can be reserved at the beginning of each sub-buffer +by calling the subbuf_start_reserve() helper function from within the +subbuf_start() callback. This reserved area can be used to store +whatever information the client wants. In the example above, room is +reserved in each sub-buffer to store the padding count for that +sub-buffer. This is filled in for the previous sub-buffer in the +subbuf_start() implementation; the padding value for the previous +sub-buffer is passed into the subbuf_start() callback along with a +pointer to the previous sub-buffer, since the padding value isn't +known until a sub-buffer is filled. The subbuf_start() callback is +also called for the first sub-buffer when the channel is opened, to +give the client a chance to reserve space in it. In this case the +previous sub-buffer pointer passed into the callback will be NULL, so +the client should check the value of the prev_subbuf pointer before +writing into the previous sub-buffer. + +Writing to a channel +-------------------- + +kernel clients write data into the current cpu's channel buffer using +relay_write() or __relay_write(). relay_write() is the main logging +function - it uses local_irqsave() to protect the buffer and should be +used if you might be logging from interrupt context. If you know +you'll never be logging from interrupt context, you can use +__relay_write(), which only disables preemption. These functions +don't return a value, so you can't determine whether or not they +failed - the assumption is that you wouldn't want to check a return +value in the fast logging path anyway, and that they'll always succeed +unless the buffer is full and no-overwrite mode is being used, in +which case you can detect a failed write in the subbuf_start() +callback by calling the relay_buf_full() helper function. + +relay_reserve() is used to reserve a slot in a channel buffer which +can be written to later. This would typically be used in applications +that need to write directly into a channel buffer without having to +stage data in a temporary buffer beforehand. Because the actual write +may not happen immediately after the slot is reserved, applications +using relay_reserve() can keep a count of the number of bytes actually +written, either in space reserved in the sub-buffers themselves or as +a separate array. See the 'reserve' example in the relay-apps tarball +at http://relayfs.sourceforge.net for an example of how this can be +done. Because the write is under control of the client and is +separated from the reserve, relay_reserve() doesn't protect the buffer +at all - it's up to the client to provide the appropriate +synchronization when using relay_reserve(). + +Closing a channel +----------------- + +The client calls relay_close() when it's finished using the channel. +The channel and its associated buffers are destroyed when there are no +longer any references to any of the channel buffers. relay_flush() +forces a sub-buffer switch on all the channel buffers, and can be used +to finalize and process the last sub-buffers before the channel is +closed. + +Misc +---- + +Some applications may want to keep a channel around and re-use it +rather than open and close a new channel for each use. relay_reset() +can be used for this purpose - it resets a channel to its initial +state without reallocating channel buffer memory or destroying +existing mappings. It should however only be called when it's safe to +do so i.e. when the channel isn't currently being written to. + +Finally, there are a couple of utility callbacks that can be used for +different purposes. buf_mapped() is called whenever a channel buffer +is mmapped from user space and buf_unmapped() is called when it's +unmapped. The client can use this notification to trigger actions +within the kernel application, such as enabling/disabling logging to +the channel. + + +Resources +========= + +For news, example code, mailing list, etc. see the relayfs homepage: + + http://relayfs.sourceforge.net + + +Credits +======= + +The ideas and specs for relayfs came about as a result of discussions +on tracing involving the following: + +Michel Dagenais <michel.dagenais@polymtl.ca> +Richard Moore <richardj_moore@uk.ibm.com> +Bob Wisniewski <bob@watson.ibm.com> +Karim Yaghmour <karim@opersys.com> +Tom Zanussi <zanussi@us.ibm.com> + +Also thanks to Hubertus Franke for a lot of useful suggestions and bug +reports. diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt index dc276598a65a..c8bce82ddcac 100644 --- a/Documentation/filesystems/sysfs.txt +++ b/Documentation/filesystems/sysfs.txt @@ -90,7 +90,7 @@ void device_remove_file(struct device *, struct device_attribute *); It also defines this helper for defining device attributes: -#define DEVICE_ATTR(_name,_mode,_show,_store) \ +#define DEVICE_ATTR(_name, _mode, _show, _store) \ struct device_attribute dev_attr_##_name = { \ .attr = {.name = __stringify(_name) , .mode = _mode }, \ .show = _show, \ @@ -99,14 +99,14 @@ struct device_attribute dev_attr_##_name = { \ For example, declaring -static DEVICE_ATTR(foo,0644,show_foo,store_foo); +static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo); is equivalent to doing: static struct device_attribute dev_attr_foo = { .attr = { .name = "foo", - .mode = 0644, + .mode = S_IWUSR | S_IRUGO, }, .show = show_foo, .store = store_foo, @@ -121,8 +121,8 @@ set of sysfs operations for forwarding read and write calls to the show and store methods of the attribute owners. struct sysfs_ops { - ssize_t (*show)(struct kobject *, struct attribute *,char *); - ssize_t (*store)(struct kobject *,struct attribute *,const char *); + ssize_t (*show)(struct kobject *, struct attribute *, char *); + ssize_t (*store)(struct kobject *, struct attribute *, const char *); }; [ Subsystems should have already defined a struct kobj_type as a @@ -137,7 +137,7 @@ calls the associated methods. To illustrate: -#define to_dev_attr(_attr) container_of(_attr,struct device_attribute,attr) +#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr) #define to_dev(d) container_of(d, struct device, kobj) static ssize_t @@ -148,7 +148,7 @@ dev_attr_show(struct kobject * kobj, struct attribute * attr, char * buf) ssize_t ret = 0; if (dev_attr->show) - ret = dev_attr->show(dev,buf); + ret = dev_attr->show(dev, buf); return ret; } @@ -216,16 +216,16 @@ A very simple (and naive) implementation of a device attribute is: static ssize_t show_name(struct device *dev, struct device_attribute *attr, char *buf) { - return sprintf(buf,"%s\n",dev->name); + return snprintf(buf, PAGE_SIZE, "%s\n", dev->name); } static ssize_t store_name(struct device * dev, const char * buf) { - sscanf(buf,"%20s",dev->name); - return strlen(buf); + sscanf(buf, "%20s", dev->name); + return strnlen(buf, PAGE_SIZE); } -static DEVICE_ATTR(name,S_IRUGO,show_name,store_name); +static DEVICE_ATTR(name, S_IRUGO, show_name, store_name); (Note that the real implementation doesn't allow userspace to set the @@ -290,7 +290,7 @@ struct device_attribute { Declaring: -DEVICE_ATTR(_name,_str,_mode,_show,_store); +DEVICE_ATTR(_name, _str, _mode, _show, _store); Creation/Removal: @@ -310,7 +310,7 @@ struct bus_attribute { Declaring: -BUS_ATTR(_name,_mode,_show,_store) +BUS_ATTR(_name, _mode, _show, _store) Creation/Removal: @@ -331,7 +331,7 @@ struct driver_attribute { Declaring: -DRIVER_ATTR(_name,_mode,_show,_store) +DRIVER_ATTR(_name, _mode, _show, _store) Creation/Removal: diff --git a/Documentation/filesystems/v9fs.txt b/Documentation/filesystems/v9fs.txt new file mode 100644 index 000000000000..4e92feb6b507 --- /dev/null +++ b/Documentation/filesystems/v9fs.txt @@ -0,0 +1,95 @@ + V9FS: 9P2000 for Linux + ====================== + +ABOUT +===== + +v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol. + +This software was originally developed by Ron Minnich <rminnich@lanl.gov> +and Maya Gokhale <maya@lanl.gov>. Additional development by Greg Watson +<gwatson@lanl.gov> and most recently Eric Van Hensbergen +<ericvh@gmail.com> and Latchesar Ionkov <lucho@ionkov.net>. + +USAGE +===== + +For remote file server: + + mount -t 9P 10.10.1.2 /mnt/9 + +For Plan 9 From User Space applications (http://swtch.com/plan9) + + mount -t 9P `namespace`/acme /mnt/9 -o proto=unix,name=$USER + +OPTIONS +======= + + proto=name select an alternative transport. Valid options are + currently: + unix - specifying a named pipe mount point + tcp - specifying a normal TCP/IP connection + fd - used passed file descriptors for connection + (see rfdno and wfdno) + + name=name user name to attempt mount as on the remote server. The + server may override or ignore this value. Certain user + names may require authentication. + + aname=name aname specifies the file tree to access when the server is + offering several exported file systems. + + debug=n specifies debug level. The debug level is a bitmask. + 0x01 = display verbose error messages + 0x02 = developer debug (DEBUG_CURRENT) + 0x04 = display 9P trace + 0x08 = display VFS trace + 0x10 = display Marshalling debug + 0x20 = display RPC debug + 0x40 = display transport debug + 0x80 = display allocation debug + + rfdno=n the file descriptor for reading with proto=fd + + wfdno=n the file descriptor for writing with proto=fd + + maxdata=n the number of bytes to use for 9P packet payload (msize) + + port=n port to connect to on the remote server + + timeout=n request timeouts (in ms) (default 60000ms) + + noextend force legacy mode (no 9P2000.u semantics) + + uid attempt to mount as a particular uid + + gid attempt to mount with a particular gid + + afid security channel - used by Plan 9 authentication protocols + + nodevmap do not map special files - represent them as normal files. + This can be used to share devices/named pipes/sockets between + hosts. This functionality will be expanded in later versions. + +RESOURCES +========= + +The Linux version of the 9P server, along with some client-side utilities +can be found at http://v9fs.sf.net (along with a CVS repository of the +development branch of this module). There are user and developer mailing +lists here, as well as a bug-tracker. + +For more information on the Plan 9 Operating System check out +http://plan9.bell-labs.com/plan9 + +For information on Plan 9 from User Space (Plan 9 applications and libraries +ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9 + + +STATUS +====== + +The 2.6 kernel support is working on PPC and x86. + +PLEASE USE THE SOURCEFORGE BUG-TRACKER TO REPORT PROBLEMS. + diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index 3f318dd44c77..f042c12e0ed2 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -1,35 +1,27 @@ -/* -*- auto-fill -*- */ - Overview of the Virtual File System + Overview of the Linux Virtual File System - Richard Gooch <rgooch@atnf.csiro.au> + Original author: Richard Gooch <rgooch@atnf.csiro.au> - 5-JUL-1999 + Last updated on August 25, 2005 + Copyright (C) 1999 Richard Gooch + Copyright (C) 2005 Pekka Enberg -Conventions used in this document <section> -================================= + This file is released under the GPLv2. -Each section in this document will have the string "<section>" at the -right-hand side of the section title. Each subsection will have -"<subsection>" at the right-hand side. These strings are meant to make -it easier to search through the document. -NOTE that the master copy of this document is available online at: -http://www.atnf.csiro.au/~rgooch/linux/docs/vfs.txt - - -What is it? <section> +What is it? =========== The Virtual File System (otherwise known as the Virtual Filesystem Switch) is the software layer in the kernel that provides the filesystem interface to userspace programs. It also provides an abstraction within the kernel which allows different filesystem -implementations to co-exist. +implementations to coexist. -A Quick Look At How It Works <section> +A Quick Look At How It Works ============================ In this section I'll briefly describe how things work, before @@ -38,7 +30,8 @@ when user programs open and manipulate files, and then look from the other view which is how a filesystem is supported and subsequently mounted. -Opening a File <subsection> + +Opening a File -------------- The VFS implements the open(2), stat(2), chmod(2) and similar system @@ -77,7 +70,7 @@ back to userspace. Opening a file requires another operation: allocation of a file structure (this is the kernel-side implementation of file -descriptors). The freshly allocated file structure is initialised with +descriptors). The freshly allocated file structure is initialized with a pointer to the dentry and a set of file operation member functions. These are taken from the inode data. The open() file method is then called so the specific filesystem implementation can do it's work. You @@ -102,7 +95,8 @@ filesystem or driver code at the same time, on different processors. You should ensure that access to shared resources is protected by appropriate locks. -Registering and Mounting a Filesystem <subsection> + +Registering and Mounting a Filesystem ------------------------------------- If you want to support a new kind of filesystem in the kernel, all you @@ -123,17 +117,21 @@ updated to point to the root inode for the new filesystem. It's now time to look at things in more detail. -struct file_system_type <section> +struct file_system_type ======================= -This describes the filesystem. As of kernel 2.1.99, the following +This describes the filesystem. As of kernel 2.6.13, the following members are defined: struct file_system_type { const char *name; int fs_flags; - struct super_block *(*read_super) (struct super_block *, void *, int); - struct file_system_type * next; + struct super_block *(*get_sb) (struct file_system_type *, int, + const char *, void *); + void (*kill_sb) (struct super_block *); + struct module *owner; + struct file_system_type * next; + struct list_head fs_supers; }; name: the name of the filesystem type, such as "ext2", "iso9660", @@ -141,51 +139,97 @@ struct file_system_type { fs_flags: various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.) - read_super: the method to call when a new instance of this + get_sb: the method to call when a new instance of this filesystem should be mounted - next: for internal VFS use: you should initialise this to NULL + kill_sb: the method to call when an instance of this filesystem + should be unmounted + + owner: for internal VFS use: you should initialize this to THIS_MODULE in + most cases. -The read_super() method has the following arguments: + next: for internal VFS use: you should initialize this to NULL + +The get_sb() method has the following arguments: struct super_block *sb: the superblock structure. This is partially - initialised by the VFS and the rest must be initialised by the - read_super() method + initialized by the VFS and the rest must be initialized by the + get_sb() method + + int flags: mount flags + + const char *dev_name: the device name we are mounting. void *data: arbitrary mount options, usually comes as an ASCII string int silent: whether or not to be silent on error -The read_super() method must determine if the block device specified +The get_sb() method must determine if the block device specified in the superblock contains a filesystem of the type the method supports. On success the method returns the superblock pointer, on failure it returns NULL. The most interesting member of the superblock structure that the -read_super() method fills in is the "s_op" field. This is a pointer to +get_sb() method fills in is the "s_op" field. This is a pointer to a "struct super_operations" which describes the next level of the filesystem implementation. +Usually, a filesystem uses generic one of the generic get_sb() +implementations and provides a fill_super() method instead. The +generic methods are: + + get_sb_bdev: mount a filesystem residing on a block device -struct super_operations <section> + get_sb_nodev: mount a filesystem that is not backed by a device + + get_sb_single: mount a filesystem which shares the instance between + all mounts + +A fill_super() method implementation has the following arguments: + + struct super_block *sb: the superblock structure. The method fill_super() + must initialize this properly. + + void *data: arbitrary mount options, usually comes as an ASCII + string + + int silent: whether or not to be silent on error + + +struct super_operations ======================= This describes how the VFS can manipulate the superblock of your -filesystem. As of kernel 2.1.99, the following members are defined: +filesystem. As of kernel 2.6.13, the following members are defined: struct super_operations { - void (*read_inode) (struct inode *); - int (*write_inode) (struct inode *, int); - void (*put_inode) (struct inode *); - void (*drop_inode) (struct inode *); - void (*delete_inode) (struct inode *); - int (*notify_change) (struct dentry *, struct iattr *); - void (*put_super) (struct super_block *); - void (*write_super) (struct super_block *); - int (*statfs) (struct super_block *, struct statfs *, int); - int (*remount_fs) (struct super_block *, int *, char *); - void (*clear_inode) (struct inode *); + struct inode *(*alloc_inode)(struct super_block *sb); + void (*destroy_inode)(struct inode *); + + void (*read_inode) (struct inode *); + + void (*dirty_inode) (struct inode *); + int (*write_inode) (struct inode *, int); + void (*put_inode) (struct inode *); + void (*drop_inode) (struct inode *); + void (*delete_inode) (struct inode *); + void (*put_super) (struct super_block *); + void (*write_super) (struct super_block *); + int (*sync_fs)(struct super_block *sb, int wait); + void (*write_super_lockfs) (struct super_block *); + void (*unlockfs) (struct super_block *); + int (*statfs) (struct super_block *, struct kstatfs *); + int (*remount_fs) (struct super_block *, int *, char *); + void (*clear_inode) (struct inode *); + void (*umount_begin) (struct super_block *); + + void (*sync_inodes) (struct super_block *sb, + struct writeback_control *wbc); + int (*show_options)(struct seq_file *, struct vfsmount *); + + ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t); + ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); }; All methods are called without any locks being held, unless otherwise @@ -193,43 +237,62 @@ noted. This means that most methods can block safely. All methods are only called from a process context (i.e. not from an interrupt handler or bottom half). + alloc_inode: this method is called by inode_alloc() to allocate memory + for struct inode and initialize it. + + destroy_inode: this method is called by destroy_inode() to release + resources allocated for struct inode. + read_inode: this method is called to read a specific inode from the - mounted filesystem. The "i_ino" member in the "struct inode" - will be initialised by the VFS to indicate which inode to - read. Other members are filled in by this method + mounted filesystem. The i_ino member in the struct inode is + initialized by the VFS to indicate which inode to read. Other + members are filled in by this method. + + You can set this to NULL and use iget5_locked() instead of iget() + to read inodes. This is necessary for filesystems for which the + inode number is not sufficient to identify an inode. + + dirty_inode: this method is called by the VFS to mark an inode dirty. write_inode: this method is called when the VFS needs to write an inode to disc. The second parameter indicates whether the write should be synchronous or not, not all filesystems check this flag. put_inode: called when the VFS inode is removed from the inode - cache. This method is optional + cache. drop_inode: called when the last access to the inode is dropped, with the inode_lock spinlock held. - This method should be either NULL (normal unix filesystem + This method should be either NULL (normal UNIX filesystem semantics) or "generic_delete_inode" (for filesystems that do not want to cache inodes - causing "delete_inode" to always be called regardless of the value of i_nlink) - The "generic_delete_inode()" behaviour is equivalent to the + The "generic_delete_inode()" behavior is equivalent to the old practice of using "force_delete" in the put_inode() case, but does not have the races that the "force_delete()" approach had. delete_inode: called when the VFS wants to delete an inode - notify_change: called when VFS inode attributes are changed. If this - is NULL the VFS falls back to the write_inode() method. This - is called with the kernel lock held - put_super: called when the VFS wishes to free the superblock (i.e. unmount). This is called with the superblock lock held write_super: called when the VFS superblock needs to be written to disc. This method is optional + sync_fs: called when VFS is writing out all dirty data associated with + a superblock. The second parameter indicates whether the method + should wait until the write out has been completed. Optional. + + write_super_lockfs: called when VFS is locking a filesystem and forcing + it into a consistent state. This function is currently used by the + Logical Volume Manager (LVM). + + unlockfs: called when VFS is unlocking a filesystem and making it writable + again. + statfs: called when the VFS needs to get filesystem statistics. This is called with the kernel lock held @@ -238,21 +301,31 @@ or bottom half). clear_inode: called then the VFS clears the inode. Optional + umount_begin: called when the VFS is unmounting a filesystem. + + sync_inodes: called when the VFS is writing out dirty data associated with + a superblock. + + show_options: called by the VFS to show mount options for /proc/<pid>/mounts. + + quota_read: called by the VFS to read from filesystem quota file. + + quota_write: called by the VFS to write to filesystem quota file. + The read_inode() method is responsible for filling in the "i_op" field. This is a pointer to a "struct inode_operations" which describes the methods that can be performed on individual inodes. -struct inode_operations <section> +struct inode_operations ======================= This describes how the VFS can manipulate an inode in your -filesystem. As of kernel 2.1.99, the following members are defined: +filesystem. As of kernel 2.6.13, the following members are defined: struct inode_operations { - struct file_operations * default_file_ops; - int (*create) (struct inode *,struct dentry *,int); - int (*lookup) (struct inode *,struct dentry *); + int (*create) (struct inode *,struct dentry *,int, struct nameidata *); + struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *); int (*link) (struct dentry *,struct inode *,struct dentry *); int (*unlink) (struct inode *,struct dentry *); int (*symlink) (struct inode *,struct dentry *,const char *); @@ -261,25 +334,22 @@ struct inode_operations { int (*mknod) (struct inode *,struct dentry *,int,dev_t); int (*rename) (struct inode *, struct dentry *, struct inode *, struct dentry *); - int (*readlink) (struct dentry *, char *,int); - struct dentry * (*follow_link) (struct dentry *, struct dentry *); - int (*readpage) (struct file *, struct page *); - int (*writepage) (struct page *page, struct writeback_control *wbc); - int (*bmap) (struct inode *,int); + int (*readlink) (struct dentry *, char __user *,int); + void * (*follow_link) (struct dentry *, struct nameidata *); + void (*put_link) (struct dentry *, struct nameidata *, void *); void (*truncate) (struct inode *); - int (*permission) (struct inode *, int); - int (*smap) (struct inode *,int); - int (*updatepage) (struct file *, struct page *, const char *, - unsigned long, unsigned int, int); - int (*revalidate) (struct dentry *); + int (*permission) (struct inode *, int, struct nameidata *); + int (*setattr) (struct dentry *, struct iattr *); + int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *); + int (*setxattr) (struct dentry *, const char *,const void *,size_t,int); + ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); + ssize_t (*listxattr) (struct dentry *, char *, size_t); + int (*removexattr) (struct dentry *, const char *); }; Again, all methods are called without any locks being held, unless otherwise noted. - default_file_ops: this is a pointer to a "struct file_operations" - which describes how to open and then manipulate open files - create: called by the open(2) and creat(2) system calls. Only required if you want to support regular files. The dentry you get should not have an inode (i.e. it should be a negative @@ -328,31 +398,143 @@ otherwise noted. you want to support reading symbolic links follow_link: called by the VFS to follow a symbolic link to the - inode it points to. Only required if you want to support - symbolic links + inode it points to. Only required if you want to support + symbolic links. This function returns a void pointer cookie + that is passed to put_link(). + + put_link: called by the VFS to release resources allocated by + follow_link(). The cookie returned by follow_link() is passed to + to this function as the last parameter. It is used by filesystems + such as NFS where page cache is not stable (i.e. page that was + installed when the symbolic link walk started might not be in the + page cache at the end of the walk). + + truncate: called by the VFS to change the size of a file. The i_size + field of the inode is set to the desired size by the VFS before + this function is called. This function is called by the truncate(2) + system call and related functionality. + + permission: called by the VFS to check for access rights on a POSIX-like + filesystem. + + setattr: called by the VFS to set attributes for a file. This function is + called by chmod(2) and related system calls. + + getattr: called by the VFS to get attributes of a file. This function is + called by stat(2) and related system calls. + + setxattr: called by the VFS to set an extended attribute for a file. + Extended attribute is a name:value pair associated with an inode. This + function is called by setxattr(2) system call. + + getxattr: called by the VFS to retrieve the value of an extended attribute + name. This function is called by getxattr(2) function call. + + listxattr: called by the VFS to list all extended attributes for a given + file. This function is called by listxattr(2) system call. + + removexattr: called by the VFS to remove an extended attribute from a file. + This function is called by removexattr(2) system call. + + +struct address_space_operations +=============================== + +This describes how the VFS can manipulate mapping of a file to page cache in +your filesystem. As of kernel 2.6.13, the following members are defined: + +struct address_space_operations { + int (*writepage)(struct page *page, struct writeback_control *wbc); + int (*readpage)(struct file *, struct page *); + int (*sync_page)(struct page *); + int (*writepages)(struct address_space *, struct writeback_control *); + int (*set_page_dirty)(struct page *page); + int (*readpages)(struct file *filp, struct address_space *mapping, + struct list_head *pages, unsigned nr_pages); + int (*prepare_write)(struct file *, struct page *, unsigned, unsigned); + int (*commit_write)(struct file *, struct page *, unsigned, unsigned); + sector_t (*bmap)(struct address_space *, sector_t); + int (*invalidatepage) (struct page *, unsigned long); + int (*releasepage) (struct page *, int); + ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, + loff_t offset, unsigned long nr_segs); + struct page* (*get_xip_page)(struct address_space *, sector_t, + int); +}; + + writepage: called by the VM write a dirty page to backing store. + + readpage: called by the VM to read a page from backing store. + + sync_page: called by the VM to notify the backing store to perform all + queued I/O operations for a page. I/O operations for other pages + associated with this address_space object may also be performed. + + writepages: called by the VM to write out pages associated with the + address_space object. + + set_page_dirty: called by the VM to set a page dirty. + + readpages: called by the VM to read pages associated with the address_space + object. + prepare_write: called by the generic write path in VM to set up a write + request for a page. -struct file_operations <section> + commit_write: called by the generic write path in VM to write page to + its backing store. + + bmap: called by the VFS to map a logical block offset within object to + physical block number. This method is use by for the legacy FIBMAP + ioctl. Other uses are discouraged. + + invalidatepage: called by the VM on truncate to disassociate a page from its + address_space mapping. + + releasepage: called by the VFS to release filesystem specific metadata from + a page. + + direct_IO: called by the VM for direct I/O writes and reads. + + get_xip_page: called by the VM to translate a block number to a page. + The page is valid until the corresponding filesystem is unmounted. + Filesystems that want to use execute-in-place (XIP) need to implement + it. An example implementation can be found in fs/ext2/xip.c. + + +struct file_operations ====================== This describes how the VFS can manipulate an open file. As of kernel -2.1.99, the following members are defined: +2.6.13, the following members are defined: struct file_operations { loff_t (*llseek) (struct file *, loff_t, int); - ssize_t (*read) (struct file *, char *, size_t, loff_t *); - ssize_t (*write) (struct file *, const char *, size_t, loff_t *); + ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); + ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t); + ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); + ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t); int (*readdir) (struct file *, void *, filldir_t); unsigned int (*poll) (struct file *, struct poll_table_struct *); int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); + long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); + long (*compat_ioctl) (struct file *, unsigned int, unsigned long); int (*mmap) (struct file *, struct vm_area_struct *); int (*open) (struct inode *, struct file *); + int (*flush) (struct file *); int (*release) (struct inode *, struct file *); - int (*fsync) (struct file *, struct dentry *); - int (*fasync) (struct file *, int); - int (*check_media_change) (kdev_t dev); - int (*revalidate) (kdev_t dev); + int (*fsync) (struct file *, struct dentry *, int datasync); + int (*aio_fsync) (struct kiocb *, int datasync); + int (*fasync) (int, struct file *, int); int (*lock) (struct file *, int, struct file_lock *); + ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *); + ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *); + ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *); + ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int); + unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); + int (*check_flags)(int); + int (*dir_notify)(struct file *filp, unsigned long arg); + int (*flock) (struct file *, int, struct file_lock *); }; Again, all methods are called without any locks being held, unless @@ -362,8 +544,12 @@ otherwise noted. read: called by read(2) and related system calls + aio_read: called by io_submit(2) and other asynchronous I/O operations + write: called by write(2) and related system calls + aio_write: called by io_submit(2) and other asynchronous I/O operations + readdir: called when the VFS needs to read the directory contents poll: called by the VFS when a process wants to check if there is @@ -372,18 +558,25 @@ otherwise noted. ioctl: called by the ioctl(2) system call + unlocked_ioctl: called by the ioctl(2) system call. Filesystems that do not + require the BKL should use this method instead of the ioctl() above. + + compat_ioctl: called by the ioctl(2) system call when 32 bit system calls + are used on 64 bit kernels. + mmap: called by the mmap(2) system call open: called by the VFS when an inode should be opened. When the VFS - opens a file, it creates a new "struct file" and initialises - the "f_op" file operations member with the "default_file_ops" - field in the inode structure. It then calls the open method - for the newly allocated file structure. You might think that - the open method really belongs in "struct inode_operations", - and you may be right. I think it's done the way it is because - it makes filesystems simpler to implement. The open() method - is a good place to initialise the "private_data" member in the - file structure if you want to point to a device structure + opens a file, it creates a new "struct file". It then calls the + open method for the newly allocated file structure. You might + think that the open method really belongs in + "struct inode_operations", and you may be right. I think it's + done the way it is because it makes filesystems simpler to + implement. The open() method is a good place to initialize the + "private_data" member in the file structure if you want to point + to a device structure + + flush: called by the close(2) system call to flush a file release: called when the last reference to an open file is closed @@ -392,6 +585,23 @@ otherwise noted. fasync: called by the fcntl(2) system call when asynchronous (non-blocking) mode is enabled for a file + lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW + commands + + readv: called by the readv(2) system call + + writev: called by the writev(2) system call + + sendfile: called by the sendfile(2) system call + + get_unmapped_area: called by the mmap(2) system call + + check_flags: called by the fcntl(2) system call for F_SETFL command + + dir_notify: called by the fcntl(2) system call for F_NOTIFY command + + flock: called by the flock(2) system call + Note that the file operations are implemented by the specific filesystem in which the inode resides. When opening a device node (character or block special) most filesystems will call special @@ -400,29 +610,28 @@ driver information. These support routines replace the filesystem file operations with those for the device driver, and then proceed to call the new open() method for the file. This is how opening a device file in the filesystem eventually ends up calling the device driver open() -method. Note the devfs (the Device FileSystem) has a more direct path -from device node to device driver (this is an unofficial kernel -patch). +method. -Directory Entry Cache (dcache) <section> ------------------------------- +Directory Entry Cache (dcache) +============================== + struct dentry_operations -======================== +------------------------ This describes how a filesystem can overload the standard dentry operations. Dentries and the dcache are the domain of the VFS and the individual filesystem implementations. Device drivers have no business here. These methods may be set to NULL, as they are either optional or -the VFS uses a default. As of kernel 2.1.99, the following members are +the VFS uses a default. As of kernel 2.6.13, the following members are defined: struct dentry_operations { - int (*d_revalidate)(struct dentry *); + int (*d_revalidate)(struct dentry *, struct nameidata *); int (*d_hash) (struct dentry *, struct qstr *); int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); - void (*d_delete)(struct dentry *); + int (*d_delete)(struct dentry *); void (*d_release)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); }; @@ -451,6 +660,7 @@ Each dentry has a pointer to its parent dentry, as well as a hash list of child dentries. Child dentries are basically like files in a directory. + Directory Entry Cache APIs -------------------------- @@ -471,7 +681,7 @@ manipulate dentries: "d_delete" method is called d_drop: this unhashes a dentry from its parents hash list. A - subsequent call to dput() will dellocate the dentry if its + subsequent call to dput() will deallocate the dentry if its usage count drops to 0 d_delete: delete a dentry. If there are no other open references to @@ -507,16 +717,16 @@ up by walking the tree starting with the first component of the pathname and using that dentry along with the next component to look up the next level and so on. Since it is a frequent operation for workloads like multiuser -environments and webservers, it is important to optimize +environments and web servers, it is important to optimize this path. Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus in every component during path look-up. Since 2.5.10 onwards, -fastwalk algorithm changed this by holding the dcache_lock +fast-walk algorithm changed this by holding the dcache_lock at the beginning and walking as many cached path component -dentries as possible. This signficantly decreases the number +dentries as possible. This significantly decreases the number of acquisition of dcache_lock. However it also increases the -lock hold time signficantly and affects performance in large +lock hold time significantly and affects performance in large SMP machines. Since 2.5.62 kernel, dcache has been using a new locking model that uses RCU to make dcache look-up lock-free. @@ -527,7 +737,7 @@ protected the hash chain, d_child, d_alias, d_lru lists as well as d_inode and several other things like mount look-up. RCU-based changes affect only the way the hash chain is protected. For everything else the dcache_lock must be taken for both traversing as well as -updating. The hash chain updations too take the dcache_lock. +updating. The hash chain updates too take the dcache_lock. The significant change is the way d_lookup traverses the hash chain, it doesn't acquire the dcache_lock for this and rely on RCU to ensure that the dentry has not been *freed*. @@ -535,14 +745,15 @@ ensure that the dentry has not been *freed*. Dcache locking details ---------------------- + For many multi-user workloads, open() and stat() on files are very frequently occurring operations. Both involve walking of path names to find the dentry corresponding to the concerned file. In 2.4 kernel, dcache_lock was held during look-up of each path component. Contention and -cacheline bouncing of this global lock caused significant +cache-line bouncing of this global lock caused significant scalability problems. With the introduction of RCU -in linux kernel, this was worked around by making +in Linux kernel, this was worked around by making the look-up of path components during path walking lock-free. @@ -562,7 +773,7 @@ Some of the important changes are : 2. Insertion of a dentry into the hash table is done using hlist_add_head_rcu() which take care of ordering the writes - the writes to the dentry must be visible before the dentry - is inserted. This works in conjuction with hlist_for_each_rcu() + is inserted. This works in conjunction with hlist_for_each_rcu() while walking the hash chain. The only requirement is that all initialization to the dentry must be done before hlist_add_head_rcu() since we don't have dcache_lock protection while traversing @@ -584,7 +795,7 @@ Some of the important changes are : the same. In some sense, dcache_rcu path walking looks like the pre-2.5.10 version. -5. All dentry hash chain updations must take the dcache_lock as well as +5. All dentry hash chain updates must take the dcache_lock as well as the per-dentry lock in that order. dput() does this to ensure that a dentry that has just been looked up in another CPU doesn't get deleted before dget() can be done on it. @@ -640,10 +851,10 @@ handled as described below : Since we redo the d_parent check and compare name while holding d_lock, lock-free look-up will not race against d_move(). -4. There can be a theoritical race when a dentry keeps coming back +4. There can be a theoretical race when a dentry keeps coming back to original bucket due to double moves. Due to this look-up may consider that it has never moved and can end up in a infinite loop. - But this is not any worse that theoritical livelocks we already + But this is not any worse that theoretical livelocks we already have in the kernel. diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt index c7d5d0c7067d..74aeb142ae5f 100644 --- a/Documentation/filesystems/xfs.txt +++ b/Documentation/filesystems/xfs.txt @@ -19,15 +19,43 @@ Mount Options When mounting an XFS filesystem, the following options are accepted. - biosize=size - Sets the preferred buffered I/O size (default size is 64K). - "size" must be expressed as the logarithm (base2) of the - desired I/O size. - Valid values for this option are 14 through 16, inclusive - (i.e. 16K, 32K, and 64K bytes). On machines with a 4K - pagesize, 13 (8K bytes) is also a valid size. - The preferred buffered I/O size can also be altered on an - individual file basis using the ioctl(2) system call. + allocsize=size + Sets the buffered I/O end-of-file preallocation size when + doing delayed allocation writeout (default size is 64KiB). + Valid values for this option are page size (typically 4KiB) + through to 1GiB, inclusive, in power-of-2 increments. + + attr2/noattr2 + The options enable/disable (default is disabled for backward + compatibility on-disk) an "opportunistic" improvement to be + made in the way inline extended attributes are stored on-disk. + When the new form is used for the first time (by setting or + removing extended attributes) the on-disk superblock feature + bit field will be updated to reflect this format being in use. + + barrier + Enables the use of block layer write barriers for writes into + the journal and unwritten extent conversion. This allows for + drive level write caching to be enabled, for devices that + support write barriers. + + dmapi + Enable the DMAPI (Data Management API) event callouts. + Use with the "mtpt" option. + + grpid/bsdgroups and nogrpid/sysvgroups + These options define what group ID a newly created file gets. + When grpid is set, it takes the group ID of the directory in + which it is created; otherwise (the default) it takes the fsgid + of the current process, unless the directory has the setgid bit + set, in which case it takes the gid from the parent directory, + and also gets the setgid bit set if it is a directory itself. + + ihashsize=value + Sets the number of hash buckets available for hashing the + in-memory inodes of the specified mount point. If a value + of zero is used, the value selected by the default algorithm + will be displayed in /proc/mounts. ikeep/noikeep When inode clusters are emptied of inodes, keep them around @@ -35,12 +63,31 @@ When mounting an XFS filesystem, the following options are accepted. and is still the default for now. Using the noikeep option, inode clusters are returned to the free space pool. + inode64 + Indicates that XFS is allowed to create inodes at any location + in the filesystem, including those which will result in inode + numbers occupying more than 32 bits of significance. This is + provided for backwards compatibility, but causes problems for + backup applications that cannot handle large inode numbers. + + largeio/nolargeio + If "nolargeio" is specified, the optimal I/O reported in + st_blksize by stat(2) will be as small as possible to allow user + applications to avoid inefficient read/modify/write I/O. + If "largeio" specified, a filesystem that has a "swidth" specified + will return the "swidth" value (in bytes) in st_blksize. If the + filesystem does not have a "swidth" specified but does specify + an "allocsize" then "allocsize" (in bytes) will be returned + instead. + If neither of these two options are specified, then filesystem + will behave as if "nolargeio" was specified. + logbufs=value Set the number of in-memory log buffers. Valid numbers range from 2-8 inclusive. The default value is 8 buffers for filesystems with a - blocksize of 64K, 4 buffers for filesystems with a blocksize - of 32K, 3 buffers for filesystems with a blocksize of 16K + blocksize of 64KiB, 4 buffers for filesystems with a blocksize + of 32KiB, 3 buffers for filesystems with a blocksize of 16KiB and 2 buffers for all other configurations. Increasing the number of buffers may increase performance on some workloads at the cost of the memory used for the additional log buffers @@ -49,10 +96,10 @@ When mounting an XFS filesystem, the following options are accepted. logbsize=value Set the size of each in-memory log buffer. Size may be specified in bytes, or in kilobytes with a "k" suffix. - Valid sizes for version 1 and version 2 logs are 16384 (16k) and - 32768 (32k). Valid sizes for version 2 logs also include + Valid sizes for version 1 and version 2 logs are 16384 (16k) and + 32768 (32k). Valid sizes for version 2 logs also include 65536 (64k), 131072 (128k) and 262144 (256k). - The default value for machines with more than 32MB of memory + The default value for machines with more than 32MiB of memory is 32768, machines with less memory use 16384 by default. logdev=device and rtdev=device @@ -62,6 +109,11 @@ When mounting an XFS filesystem, the following options are accepted. optional, and the log section can be separate from the data section or contained within it. + mtpt=mountpoint + Use with the "dmapi" option. The value specified here will be + included in the DMAPI mount event, and should be the path of + the actual mountpoint that is used. + noalign Data allocations will not be aligned at stripe unit boundaries. @@ -91,13 +143,17 @@ When mounting an XFS filesystem, the following options are accepted. O_SYNC writes can be lost if the system crashes. If timestamp updates are critical, use the osyncisosync option. - quota/usrquota/uqnoenforce + uquota/usrquota/uqnoenforce/quota User disk quota accounting enabled, and limits (optionally) - enforced. + enforced. Refer to xfs_quota(8) for further details. - grpquota/gqnoenforce + gquota/grpquota/gqnoenforce Group disk quota accounting enabled and limits (optionally) - enforced. + enforced. Refer to xfs_quota(8) for further details. + + pquota/prjquota/pqnoenforce + Project disk quota accounting enabled and limits (optionally) + enforced. Refer to xfs_quota(8) for further details. sunit=value and swidth=value Used to specify the stripe unit and width for a RAID device or @@ -113,15 +169,21 @@ When mounting an XFS filesystem, the following options are accepted. The "swidth" option is required if the "sunit" option has been specified, and must be a multiple of the "sunit" value. + swalloc + Data allocations will be rounded up to stripe width boundaries + when the current end of file is being extended and the file + size is larger than the stripe width size. + + sysctls ======= The following sysctls are available for the XFS filesystem: fs.xfs.stats_clear (Min: 0 Default: 0 Max: 1) - Setting this to "1" clears accumulated XFS statistics + Setting this to "1" clears accumulated XFS statistics in /proc/fs/xfs/stat. It then immediately resets to "0". - + fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000) The interval at which the xfssyncd thread flushes metadata out to disk. This thread will flush log activity out, and @@ -143,9 +205,9 @@ The following sysctls are available for the XFS filesystem: XFS_ERRLEVEL_HIGH: 5 fs.xfs.panic_mask (Min: 0 Default: 0 Max: 127) - Causes certain error conditions to call BUG(). Value is a bitmask; + Causes certain error conditions to call BUG(). Value is a bitmask; AND together the tags which represent errors which should cause panics: - + XFS_NO_PTAG 0 XFS_PTAG_IFLUSH 0x00000001 XFS_PTAG_LOGRES 0x00000002 @@ -155,7 +217,7 @@ The following sysctls are available for the XFS filesystem: XFS_PTAG_SHUTDOWN_IOERROR 0x00000020 XFS_PTAG_SHUTDOWN_LOGERROR 0x00000040 - This option is intended for debugging only. + This option is intended for debugging only. fs.xfs.irix_symlink_mode (Min: 0 Default: 0 Max: 1) Controls whether symlinks are created with mode 0777 (default) @@ -164,25 +226,37 @@ The following sysctls are available for the XFS filesystem: fs.xfs.irix_sgid_inherit (Min: 0 Default: 0 Max: 1) Controls files created in SGID directories. If the group ID of the new file does not match the effective group - ID or one of the supplementary group IDs of the parent dir, the - ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl + ID or one of the supplementary group IDs of the parent dir, the + ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl is set. fs.xfs.restrict_chown (Min: 0 Default: 1 Max: 1) Controls whether unprivileged users can use chown to "give away" a file to another user. - fs.xfs.inherit_sync (Min: 0 Default: 1 Max 1) - Setting this to "1" will cause the "sync" flag set - by the chattr(1) command on a directory to be + fs.xfs.inherit_sync (Min: 0 Default: 1 Max: 1) + Setting this to "1" will cause the "sync" flag set + by the xfs_io(8) chattr command on a directory to be inherited by files in that directory. - fs.xfs.inherit_nodump (Min: 0 Default: 1 Max 1) - Setting this to "1" will cause the "nodump" flag set - by the chattr(1) command on a directory to be + fs.xfs.inherit_nodump (Min: 0 Default: 1 Max: 1) + Setting this to "1" will cause the "nodump" flag set + by the xfs_io(8) chattr command on a directory to be inherited by files in that directory. - fs.xfs.inherit_noatime (Min: 0 Default: 1 Max 1) - Setting this to "1" will cause the "noatime" flag set - by the chattr(1) command on a directory to be + fs.xfs.inherit_noatime (Min: 0 Default: 1 Max: 1) + Setting this to "1" will cause the "noatime" flag set + by the xfs_io(8) chattr command on a directory to be inherited by files in that directory. + + fs.xfs.inherit_nosymlinks (Min: 0 Default: 1 Max: 1) + Setting this to "1" will cause the "nosymlinks" flag set + by the xfs_io(8) chattr command on a directory to be + inherited by files in that directory. + + fs.xfs.rotorstep (Min: 1 Default: 1 Max: 256) + In "inode32" allocation mode, this option determines how many + files the allocator attempts to allocate in the same allocation + group before moving to the next allocation group. The intent + is to control the rate at which the allocator moves between + allocation groups when allocating extents for new files. diff --git a/Documentation/firmware_class/firmware_sample_driver.c b/Documentation/firmware_class/firmware_sample_driver.c index e1c56a7e6583..d3ad2c24490a 100644 --- a/Documentation/firmware_class/firmware_sample_driver.c +++ b/Documentation/firmware_class/firmware_sample_driver.c @@ -13,6 +13,7 @@ #include <linux/kernel.h> #include <linux/init.h> #include <linux/device.h> +#include <linux/string.h> #include "linux/firmware.h" @@ -32,14 +33,14 @@ static void sample_firmware_load(char *firmware, int size) u8 buf[size+1]; memcpy(buf, firmware, size); buf[size] = '\0'; - printk("firmware_sample_driver: firmware: %s\n", buf); + printk(KERN_INFO "firmware_sample_driver: firmware: %s\n", buf); } static void sample_probe_default(void) { /* uses the default method to get the firmware */ const struct firmware *fw_entry; - printk("firmware_sample_driver: a ghost device got inserted :)\n"); + printk(KERN_INFO "firmware_sample_driver: a ghost device got inserted :)\n"); if(request_firmware(&fw_entry, "sample_driver_fw", &ghost_device)!=0) { @@ -61,7 +62,7 @@ static void sample_probe_specific(void) /* NOTE: This currently doesn't work */ - printk("firmware_sample_driver: a ghost device got inserted :)\n"); + printk(KERN_INFO "firmware_sample_driver: a ghost device got inserted :)\n"); if(request_firmware(NULL, "sample_driver_fw", &ghost_device)!=0) { @@ -83,7 +84,7 @@ static void sample_probe_async_cont(const struct firmware *fw, void *context) return; } - printk("firmware_sample_driver: device pointer \"%s\"\n", + printk(KERN_INFO "firmware_sample_driver: device pointer \"%s\"\n", (char *)context); sample_firmware_load(fw->data, fw->size); } diff --git a/Documentation/firmware_class/firmware_sample_firmware_class.c b/Documentation/firmware_class/firmware_sample_firmware_class.c index 09eab2f1b373..57b956aecbc5 100644 --- a/Documentation/firmware_class/firmware_sample_firmware_class.c +++ b/Documentation/firmware_class/firmware_sample_firmware_class.c @@ -14,6 +14,8 @@ #include <linux/module.h> #include <linux/init.h> #include <linux/timer.h> +#include <linux/slab.h> +#include <linux/string.h> #include <linux/firmware.h> diff --git a/Documentation/hwmon/it87 b/Documentation/hwmon/it87 index 0d0195040d88..7f42e441c645 100644 --- a/Documentation/hwmon/it87 +++ b/Documentation/hwmon/it87 @@ -4,18 +4,18 @@ Kernel driver it87 Supported chips: * IT8705F Prefix: 'it87' - Addresses scanned: from Super I/O config space, or default ISA 0x290 (8 I/O ports) + Addresses scanned: from Super I/O config space (8 I/O ports) Datasheet: Publicly available at the ITE website http://www.ite.com.tw/ * IT8712F Prefix: 'it8712' Addresses scanned: I2C 0x28 - 0x2f - from Super I/O config space, or default ISA 0x290 (8 I/O ports) + from Super I/O config space (8 I/O ports) Datasheet: Publicly available at the ITE website http://www.ite.com.tw/ * SiS950 [clone of IT8705F] - Prefix: 'sis950' - Addresses scanned: from Super I/O config space, or default ISA 0x290 (8 I/O ports) + Prefix: 'it87' + Addresses scanned: from Super I/O config space (8 I/O ports) Datasheet: No longer be available Author: Christophe Gauthron <chrisg@0-in.com> diff --git a/Documentation/hwmon/lm78 b/Documentation/hwmon/lm78 index 357086ed7f64..fd5dc7a19f0e 100644 --- a/Documentation/hwmon/lm78 +++ b/Documentation/hwmon/lm78 @@ -2,16 +2,11 @@ Kernel driver lm78 ================== Supported chips: - * National Semiconductor LM78 + * National Semiconductor LM78 / LM78-J Prefix: 'lm78' Addresses scanned: I2C 0x20 - 0x2f, ISA 0x290 (8 I/O ports) Datasheet: Publicly available at the National Semiconductor website http://www.national.com/ - * National Semiconductor LM78-J - Prefix: 'lm78-j' - Addresses scanned: I2C 0x20 - 0x2f, ISA 0x290 (8 I/O ports) - Datasheet: Publicly available at the National Semiconductor website - http://www.national.com/ * National Semiconductor LM79 Prefix: 'lm79' Addresses scanned: I2C 0x20 - 0x2f, ISA 0x290 (8 I/O ports) diff --git a/Documentation/hwmon/lm90 b/Documentation/hwmon/lm90 index 2c4cf39471f4..438cb24cee5b 100644 --- a/Documentation/hwmon/lm90 +++ b/Documentation/hwmon/lm90 @@ -24,14 +24,14 @@ Supported chips: http://www.national.com/pf/LM/LM86.html * Analog Devices ADM1032 Prefix: 'adm1032' - Addresses scanned: I2C 0x4c + Addresses scanned: I2C 0x4c and 0x4d Datasheet: Publicly available at the Analog Devices website - http://products.analog.com/products/info.asp?product=ADM1032 + http://www.analog.com/en/prod/0,2877,ADM1032,00.html * Analog Devices ADT7461 Prefix: 'adt7461' - Addresses scanned: I2C 0x4c + Addresses scanned: I2C 0x4c and 0x4d Datasheet: Publicly available at the Analog Devices website - http://products.analog.com/products/info.asp?product=ADT7461 + http://www.analog.com/en/prod/0,2877,ADT7461,00.html Note: Only if in ADM1032 compatibility mode * Maxim MAX6657 Prefix: 'max6657' @@ -71,8 +71,8 @@ increased resolution of the remote temperature measurement. The different chipsets of the family are not strictly identical, although very similar. This driver doesn't handle any specific feature for now, -but could if there ever was a need for it. For reference, here comes a -non-exhaustive list of specific features: +with the exception of SMBus PEC. For reference, here comes a non-exhaustive +list of specific features: LM90: * Filter and alert configuration register at 0xBF. @@ -91,6 +91,7 @@ ADM1032: * Conversion averaging. * Up to 64 conversions/s. * ALERT is triggered by open remote sensor. + * SMBus PEC support for Write Byte and Receive Byte transactions. ADT7461 * Extended temperature range (breaks compatibility) @@ -119,3 +120,37 @@ The lm90 driver will not update its values more frequently than every other second; reading them more often will do no harm, but will return 'old' values. +PEC Support +----------- + +The ADM1032 is the only chip of the family which supports PEC. It does +not support PEC on all transactions though, so some care must be taken. + +When reading a register value, the PEC byte is computed and sent by the +ADM1032 chip. However, in the case of a combined transaction (SMBus Read +Byte), the ADM1032 computes the CRC value over only the second half of +the message rather than its entirety, because it thinks the first half +of the message belongs to a different transaction. As a result, the CRC +value differs from what the SMBus master expects, and all reads fail. + +For this reason, the lm90 driver will enable PEC for the ADM1032 only if +the bus supports the SMBus Send Byte and Receive Byte transaction types. +These transactions will be used to read register values, instead of +SMBus Read Byte, and PEC will work properly. + +Additionally, the ADM1032 doesn't support SMBus Send Byte with PEC. +Instead, it will try to write the PEC value to the register (because the +SMBus Send Byte transaction with PEC is similar to a Write Byte transaction +without PEC), which is not what we want. Thus, PEC is explicitely disabled +on SMBus Send Byte transactions in the lm90 driver. + +PEC on byte data transactions represents a significant increase in bandwidth +usage (+33% for writes, +25% for reads) in normal conditions. With the need +to use two SMBus transaction for reads, this overhead jumps to +50%. Worse, +two transactions will typically mean twice as much delay waiting for +transaction completion, effectively doubling the register cache refresh time. +I guess reliability comes at a price, but it's quite expensive this time. + +So, as not everyone might enjoy the slowdown, PEC can be disabled through +sysfs. Just write 0 to the "pec" file and PEC will be disabled. Write 1 +to that file to enable PEC again. diff --git a/Documentation/hwmon/smsc47b397 b/Documentation/hwmon/smsc47b397 index da9d80c96432..20682f15ae41 100644 --- a/Documentation/hwmon/smsc47b397 +++ b/Documentation/hwmon/smsc47b397 @@ -3,6 +3,7 @@ Kernel driver smsc47b397 Supported chips: * SMSC LPC47B397-NC + * SMSC SCH5307-NS Prefix: 'smsc47b397' Addresses scanned: none, address read from Super I/O config space Datasheet: In this file @@ -12,11 +13,14 @@ Authors: Mark M. Hoffman <mhoffman@lightlink.com> November 23, 2004 -The following specification describes the SMSC LPC47B397-NC sensor chip +The following specification describes the SMSC LPC47B397-NC[1] sensor chip (for which there is no public datasheet available). This document was provided by Craig Kelly (In-Store Broadcast Network) and edited/corrected by Mark M. Hoffman <mhoffman@lightlink.com>. +[1] And SMSC SCH5307-NS, which has a different device ID but is otherwise +compatible. + * * * * * Methods for detecting the HP SIO and reading the thermal data on a dc7100. @@ -127,7 +131,7 @@ OUT DX,AL The registers of interest for identifying the SIO on the dc7100 are Device ID (0x20) and Device Rev (0x21). -The Device ID will read 0X6F +The Device ID will read 0x6F (for SCH5307-NS, 0x81) The Device Rev currently reads 0x01 Obtaining the HWM Base Address. diff --git a/Documentation/hwmon/smsc47m1 b/Documentation/hwmon/smsc47m1 index 34e6478c1425..c15bbe68264e 100644 --- a/Documentation/hwmon/smsc47m1 +++ b/Documentation/hwmon/smsc47m1 @@ -12,6 +12,10 @@ Supported chips: http://www.smsc.com/main/datasheets/47m14x.pdf http://www.smsc.com/main/tools/discontinued/47m15x.pdf http://www.smsc.com/main/datasheets/47m192.pdf + * SMSC LPC47M997 + Addresses scanned: none, address read from Super I/O config space + Prefix: 'smsc47m1' + Datasheet: none Authors: Mark D. Studebaker <mdsxyz123@yahoo.com>, @@ -30,6 +34,9 @@ The 47M15x and 47M192 chips contain a full 'hardware monitoring block' in addition to the fan monitoring and control. The hardware monitoring block is not supported by the driver. +No documentation is available for the 47M997, but it has the same device +ID as the 47M15x and 47M192 chips and seems to be compatible. + Fan rotation speeds are reported in RPM (rotations per minute). An alarm is triggered if the rotation speed has dropped below a programmable limit. Fan readings can be divided by a programmable divider (1, 2, 4 or 8) to give diff --git a/Documentation/hwmon/sysfs-interface b/Documentation/hwmon/sysfs-interface index 346400519d0d..764cdc5480e7 100644 --- a/Documentation/hwmon/sysfs-interface +++ b/Documentation/hwmon/sysfs-interface @@ -272,3 +272,6 @@ beep_mask Bitmask for beep. eeprom Raw EEPROM data in binary form. Read only. + +pec Enable or disable PEC (SMBus only) + Read/Write diff --git a/Documentation/hwmon/via686a b/Documentation/hwmon/via686a index b82014cb7c53..a936fb3824b2 100644 --- a/Documentation/hwmon/via686a +++ b/Documentation/hwmon/via686a @@ -18,8 +18,9 @@ Authors: Module Parameters ----------------- -force_addr=0xaddr Set the I/O base address. Useful for Asus A7V boards - that don't set the address in the BIOS. Does not do a +force_addr=0xaddr Set the I/O base address. Useful for boards that + don't set the address in the BIOS. Look for a BIOS + upgrade before resorting to this. Does not do a PCI force; the via686a must still be present in lspci. Don't use this unless the driver complains that the base address is not set. @@ -63,3 +64,15 @@ miss once-only alarms. The driver only updates its values each 1.5 seconds; reading it more often will do no harm, but will return 'old' values. + +Known Issues +------------ + +This driver handles sensors integrated in some VIA south bridges. It is +possible that a motherboard maker used a VT82C686A/B chip as part of a +product design but was not interested in its hardware monitoring features, +in which case the sensor inputs will not be wired. This is the case of +the Asus K7V, A7V and A7V133 motherboards, to name only a few of them. +So, if you need the force_addr parameter, and end up with values which +don't seem to make any sense, don't look any further: your chip is simply +not wired for hardware monitoring. diff --git a/Documentation/hwmon/w83792d b/Documentation/hwmon/w83792d new file mode 100644 index 000000000000..8171c285bb55 --- /dev/null +++ b/Documentation/hwmon/w83792d @@ -0,0 +1,174 @@ +Kernel driver w83792d +===================== + +Supported chips: + * Winbond W83792D + Prefix: 'w83792d' + Addresses scanned: I2C 0x2c - 0x2f + Datasheet: http://www.winbond.com.tw/E-WINBONDHTM/partner/PDFresult.asp?Pname=1035 + +Author: Chunhao Huang +Contact: DZShen <DZShen@Winbond.com.tw> + + +Module Parameters +----------------- + +* init int + (default 1) + Use 'init=0' to bypass initializing the chip. + Try this if your computer crashes when you load the module. + +* force_subclients=bus,caddr,saddr,saddr + This is used to force the i2c addresses for subclients of + a certain chip. Example usage is `force_subclients=0,0x2f,0x4a,0x4b' + to force the subclients of chip 0x2f on bus 0 to i2c addresses + 0x4a and 0x4b. + + +Description +----------- + +This driver implements support for the Winbond W83792AD/D. + +Detection of the chip can sometimes be foiled because it can be in an +internal state that allows no clean access (Bank with ID register is not +currently selected). If you know the address of the chip, use a 'force' +parameter; this will put it into a more well-behaved state first. + +The driver implements three temperature sensors, seven fan rotation speed +sensors, nine voltage sensors, and two automatic fan regulation +strategies called: Smart Fan I (Thermal Cruise mode) and Smart Fan II. +Automatic fan control mode is possible only for fan1-fan3. Fan4-fan7 can run +synchronized with selected fan (fan1-fan3). This functionality and manual PWM +control for fan4-fan7 is not yet implemented. + +Temperatures are measured in degrees Celsius and measurement resolution is 1 +degC for temp1 and 0.5 degC for temp2 and temp3. An alarm is triggered when +the temperature gets higher than the Overtemperature Shutdown value; it stays +on until the temperature falls below the Hysteresis value. + +Fan rotation speeds are reported in RPM (rotations per minute). An alarm is +triggered if the rotation speed has dropped below a programmable limit. Fan +readings can be divided by a programmable divider (1, 2, 4, 8, 16, 32, 64 or +128) to give the readings more range or accuracy. + +Voltage sensors (also known as IN sensors) report their values in millivolts. +An alarm is triggered if the voltage has crossed a programmable minimum +or maximum limit. + +Alarms are provided as output from "realtime status register". Following bits +are defined: + +bit - alarm on: +0 - in0 +1 - in1 +2 - temp1 +3 - temp2 +4 - temp3 +5 - fan1 +6 - fan2 +7 - fan3 +8 - in2 +9 - in3 +10 - in4 +11 - in5 +12 - in6 +13 - VID change +14 - chassis +15 - fan7 +16 - tart1 +17 - tart2 +18 - tart3 +19 - in7 +20 - in8 +21 - fan4 +22 - fan5 +23 - fan6 + +Tart will be asserted while target temperature cannot be achieved after 3 minutes +of full speed rotation of corresponding fan. + +In addition to the alarms described above, there is a CHAS alarm on the chips +which triggers if your computer case is open (This one is latched, contrary +to realtime alarms). + +The chips only update values each 3 seconds; reading them more often will +do no harm, but will return 'old' values. + + +W83792D PROBLEMS +---------------- +Known problems: + - This driver is only for Winbond W83792D C version device, there + are also some motherboards with B version W83792D device. The + calculation method to in6-in7(measured value, limits) is a little + different between C and B version. C or B version can be identified + by CR[0x49h]. + - The function of vid and vrm has not been finished, because I'm NOT + very familiar with them. Adding support is welcome. + - The function of chassis open detection needs more tests. + - If you have ASUS server board and chip was not found: Then you will + need to upgrade to latest (or beta) BIOS. If it does not help please + contact us. + +Fan control +----------- + +Manual mode +----------- + +Works as expected. You just need to specify desired PWM/DC value (fan speed) +in appropriate pwm# file. + +Thermal cruise +-------------- + +In this mode, W83792D provides the Smart Fan system to automatically control +fan speed to keep the temperatures of CPU and the system within specific +range. At first a wanted temperature and interval must be set. This is done +via thermal_cruise# file. The tolerance# file serves to create T +- tolerance +interval. The fan speed will be lowered as long as the current temperature +remains below the thermal_cruise# +- tolerance# value. Once the temperature +exceeds the high limit (T+tolerance), the fan will be turned on with a +specific speed set by pwm# and automatically controlled its PWM duty cycle +with the temperature varying. Three conditions may occur: + +(1) If the temperature still exceeds the high limit, PWM duty +cycle will increase slowly. + +(2) If the temperature goes below the high limit, but still above the low +limit (T-tolerance), the fan speed will be fixed at the current speed because +the temperature is in the target range. + +(3) If the temperature goes below the low limit, PWM duty cycle will decrease +slowly to 0 or a preset stop value until the temperature exceeds the low +limit. (The preset stop value handling is not yet implemented in driver) + +Smart Fan II +------------ + +W83792D also provides a special mode for fan. Four temperature points are +available. When related temperature sensors detects the temperature in preset +temperature region (sf2_point@_fan# +- tolerance#) it will cause fans to run +on programmed value from sf2_level@_fan#. You need to set four temperatures +for each fan. + + +/sys files +---------- + +pwm[1-3] - this file stores PWM duty cycle or DC value (fan speed) in range: + 0 (stop) to 255 (full) +pwm[1-3]_enable - this file controls mode of fan/temperature control: + * 0 Disabled + * 1 Manual mode + * 2 Smart Fan II + * 3 Thermal Cruise +pwm[1-3]_mode - Select PWM of DC mode + * 0 DC + * 1 PWM +thermal_cruise[1-3] - Selects the desired temperature for cruise (degC) +tolerance[1-3] - Value in degrees of Celsius (degC) for +- T +sf2_point[1-4]_fan[1-3] - four temperature points for each fan for Smart Fan II +sf2_level[1-3]_fan[1-3] - three PWM/DC levels for each fan for Smart Fan II diff --git a/Documentation/i2c/busses/i2c-i810 b/Documentation/i2c/busses/i2c-i810 index 0544eb332887..83c3b9743c3c 100644 --- a/Documentation/i2c/busses/i2c-i810 +++ b/Documentation/i2c/busses/i2c-i810 @@ -2,6 +2,7 @@ Kernel driver i2c-i810 Supported adapters: * Intel 82810, 82810-DC100, 82810E, and 82815 (GMCH) + * Intel 82845G (GMCH) Authors: Frodo Looijaard <frodol@dds.nl>, diff --git a/Documentation/i2c/busses/i2c-viapro b/Documentation/i2c/busses/i2c-viapro index 702f5ac68c09..9363b8bd6109 100644 --- a/Documentation/i2c/busses/i2c-viapro +++ b/Documentation/i2c/busses/i2c-viapro @@ -4,17 +4,18 @@ Supported adapters: * VIA Technologies, Inc. VT82C596A/B Datasheet: Sometimes available at the VIA website - * VIA Technologies, Inc. VT82C686A/B + * VIA Technologies, Inc. VT82C686A/B Datasheet: Sometimes available at the VIA website * VIA Technologies, Inc. VT8231, VT8233, VT8233A, VT8235, VT8237 Datasheet: available on request from Via Authors: - Frodo Looijaard <frodol@dds.nl>, - Philip Edelbrock <phil@netroedge.com>, - Kyösti Mälkki <kmalkki@cc.hut.fi>, - Mark D. Studebaker <mdsxyz123@yahoo.com> + Frodo Looijaard <frodol@dds.nl>, + Philip Edelbrock <phil@netroedge.com>, + Kyösti Mälkki <kmalkki@cc.hut.fi>, + Mark D. Studebaker <mdsxyz123@yahoo.com>, + Jean Delvare <khali@linux-fr.org> Module Parameters ----------------- @@ -28,20 +29,22 @@ Description ----------- i2c-viapro is a true SMBus host driver for motherboards with one of the -supported VIA southbridges. +supported VIA south bridges. Your lspci -n listing must show one of these : - device 1106:3050 (VT82C596 function 3) - device 1106:3051 (VT82C596 function 3) + device 1106:3050 (VT82C596A function 3) + device 1106:3051 (VT82C596B function 3) device 1106:3057 (VT82C686 function 4) device 1106:3074 (VT8233) device 1106:3147 (VT8233A) - device 1106:8235 (VT8231) - devide 1106:3177 (VT8235) - devide 1106:3227 (VT8237) + device 1106:8235 (VT8231 function 4) + device 1106:3177 (VT8235) + device 1106:3227 (VT8237R) If none of these show up, you should look in the BIOS for settings like enable ACPI / SMBus or even USB. - +Except for the oldest chips (VT82C596A/B, VT82C686A and most probably +VT8231), this driver supports I2C block transactions. Such transactions +are mainly useful to read from and write to EEPROMs. diff --git a/Documentation/i2c/chips/max6875 b/Documentation/i2c/chips/max6875 index b02002898a09..96fec562a8e9 100644 --- a/Documentation/i2c/chips/max6875 +++ b/Documentation/i2c/chips/max6875 @@ -4,22 +4,13 @@ Kernel driver max6875 Supported chips: * Maxim MAX6874, MAX6875 Prefix: 'max6875' - Addresses scanned: 0x50, 0x52 + Addresses scanned: None (see below) Datasheet: http://pdfserv.maxim-ic.com/en/ds/MAX6874-MAX6875.pdf Author: Ben Gardner <bgardner@wabtec.com> -Module Parameters ------------------ - -* allow_write int - Set to non-zero to enable write permission: - *0: Read only - 1: Read and write - - Description ----------- @@ -33,34 +24,85 @@ registers. The Maxim MAX6874 is a similar, mostly compatible device, with more intputs and outputs: - vin gpi vout MAX6874 6 4 8 MAX6875 4 3 5 -MAX6874 chips can have four different addresses (as opposed to only two for -the MAX6875). The additional addresses (0x54 and 0x56) are not probed by -this driver by default, but the probe module parameter can be used if -needed. - -See the datasheet for details on how to program the EEPROM. +See the datasheet for more information. Sysfs entries ------------- -eeprom_user - 512 bytes of user-defined EEPROM space. Only writable if - allow_write was set and register 0x43 is 0. - -eeprom_config - 70 bytes of config EEPROM. Note that changes will not get - loaded into register space until a power cycle or device reset. - -reg_config - 70 bytes of register space. Any changes take affect immediately. +eeprom - 512 bytes of user-defined EEPROM space. General Remarks --------------- -A typical application will require that the EEPROMs be programmed once and -never altered afterwards. +Valid addresses for the MAX6875 are 0x50 and 0x52. +Valid addresses for the MAX6874 are 0x50, 0x52, 0x54 and 0x56. +The driver does not probe any address, so you must force the address. + +Example: +$ modprobe max6875 force=0,0x50 + +The MAX6874/MAX6875 ignores address bit 0, so this driver attaches to multiple +addresses. For example, for address 0x50, it also reserves 0x51. +The even-address instance is called 'max6875', the odd one is 'max6875 subclient'. + + +Programming the chip using i2c-dev +---------------------------------- + +Use the i2c-dev interface to access and program the chips. +Reads and writes are performed differently depending on the address range. + +The configuration registers are at addresses 0x00 - 0x45. +Use i2c_smbus_write_byte_data() to write a register and +i2c_smbus_read_byte_data() to read a register. +The command is the register number. + +Examples: +To write a 1 to register 0x45: + i2c_smbus_write_byte_data(fd, 0x45, 1); + +To read register 0x45: + value = i2c_smbus_read_byte_data(fd, 0x45); + + +The configuration EEPROM is at addresses 0x8000 - 0x8045. +The user EEPROM is at addresses 0x8100 - 0x82ff. + +Use i2c_smbus_write_word_data() to write a byte to EEPROM. + +The command is the upper byte of the address: 0x80, 0x81, or 0x82. +The data word is the lower part of the address or'd with data << 8. + cmd = address >> 8; + val = (address & 0xff) | (data << 8); + +Example: +To write 0x5a to address 0x8003: + i2c_smbus_write_word_data(fd, 0x80, 0x5a03); + + +Reading data from the EEPROM is a little more complicated. +Use i2c_smbus_write_byte_data() to set the read address and then +i2c_smbus_read_byte() or i2c_smbus_read_i2c_block_data() to read the data. + +Example: +To read data starting at offset 0x8100, first set the address: + i2c_smbus_write_byte_data(fd, 0x81, 0x00); + +And then read the data + value = i2c_smbus_read_byte(fd); + + or + + count = i2c_smbus_read_i2c_block_data(fd, 0x84, buffer); + +The block read should read 16 bytes. +0x84 is the block read command. + +See the datasheet for more details. diff --git a/Documentation/i2c/chips/x1205 b/Documentation/i2c/chips/x1205 new file mode 100644 index 000000000000..09407c991fe5 --- /dev/null +++ b/Documentation/i2c/chips/x1205 @@ -0,0 +1,38 @@ +Kernel driver x1205 +=================== + +Supported chips: + * Xicor X1205 RTC + Prefix: 'x1205' + Addresses scanned: none + Datasheet: http://www.intersil.com/cda/deviceinfo/0,1477,X1205,00.html + +Authors: + Karen Spearel <kas11@tampabay.rr.com>, + Alessandro Zummo <a.zummo@towertech.it> + +Description +----------- + +This module aims to provide complete access to the Xicor X1205 RTC. +Recently Xicor has merged with Intersil, but the chip is +still sold under the Xicor brand. + +This chip is located at address 0x6f and uses a 2-byte register addressing. +Two bytes need to be written to read a single register, while most +other chips just require one and take the second one as the data +to be written. To prevent corrupting unknown chips, the user must +explicitely set the probe parameter. + +example: + +modprobe x1205 probe=0,0x6f + +The module supports one more option, hctosys, which is used to set the +software clock from the x1205. On systems where the x1205 is the +only hardware rtc, this parameter could be used to achieve a correct +date/time earlier in the system boot sequence. + +example: + +modprobe x1205 probe=0,0x6f hctosys=1 diff --git a/Documentation/i2c/functionality b/Documentation/i2c/functionality index 8a78a95ae04e..60cca249e452 100644 --- a/Documentation/i2c/functionality +++ b/Documentation/i2c/functionality @@ -17,9 +17,10 @@ For the most up-to-date list of functionality constants, please check I2C_FUNC_I2C Plain i2c-level commands (Pure SMBus adapters typically can not do these) I2C_FUNC_10BIT_ADDR Handles the 10-bit address extensions - I2C_FUNC_PROTOCOL_MANGLING Knows about the I2C_M_REV_DIR_ADDR, - I2C_M_REV_DIR_ADDR and I2C_M_REV_DIR_NOSTART - flags (which modify the i2c protocol!) + I2C_FUNC_PROTOCOL_MANGLING Knows about the I2C_M_IGNORE_NAK, + I2C_M_REV_DIR_ADDR, I2C_M_NOSTART and + I2C_M_NO_RD_ACK flags (which modify the + I2C protocol!) I2C_FUNC_SMBUS_QUICK Handles the SMBus write_quick command I2C_FUNC_SMBUS_READ_BYTE Handles the SMBus read_byte command I2C_FUNC_SMBUS_WRITE_BYTE Handles the SMBus write_byte command @@ -115,7 +116,7 @@ CHECKING THROUGH /DEV If you try to access an adapter from a userspace program, you will have to use the /dev interface. You will still have to check whether the functionality you need is supported, of course. This is done using -the I2C_FUNCS ioctl. An example, adapted from the lm_sensors i2c_detect +the I2C_FUNCS ioctl. An example, adapted from the lm_sensors i2cdetect program, is below: int file; diff --git a/Documentation/i2c/porting-clients b/Documentation/i2c/porting-clients index a7adbdd9ea8a..184fac2377aa 100644 --- a/Documentation/i2c/porting-clients +++ b/Documentation/i2c/porting-clients @@ -1,4 +1,4 @@ -Revision 4, 2004-03-30 +Revision 5, 2005-07-29 Jean Delvare <khali@linux-fr.org> Greg KH <greg@kroah.com> @@ -17,20 +17,22 @@ yours for best results. Technical changes: -* [Includes] Get rid of "version.h". Replace <linux/i2c-proc.h> with - <linux/i2c-sensor.h>. Includes typically look like that: +* [Includes] Get rid of "version.h" and <linux/i2c-proc.h>. + Includes typically look like that: #include <linux/module.h> #include <linux/init.h> #include <linux/slab.h> #include <linux/i2c.h> - #include <linux/i2c-sensor.h> - #include <linux/i2c-vid.h> /* if you need VRM support */ + #include <linux/hwmon.h> /* for hardware monitoring drivers */ + #include <linux/hwmon-sysfs.h> + #include <linux/hwmon-vid.h> /* if you need VRM support */ #include <asm/io.h> /* if you have I/O operations */ Please respect this inclusion order. Some extra headers may be required for a given driver (e.g. "lm75.h"). -* [Addresses] SENSORS_I2C_END becomes I2C_CLIENT_END, SENSORS_ISA_END - becomes I2C_CLIENT_ISA_END. +* [Addresses] SENSORS_I2C_END becomes I2C_CLIENT_END, ISA addresses + are no more handled by the i2c core. + SENSORS_INSMOD_<n> becomes I2C_CLIENT_INSMOD_<n>. * [Client data] Get rid of sysctl_id. Try using standard names for register values (for example, temp_os becomes temp_max). You're @@ -66,19 +68,21 @@ Technical changes: if (!(adapter->class & I2C_CLASS_HWMON)) return 0; ISA-only drivers of course don't need this. + Call i2c_probe() instead of i2c_detect(). * [Detect] As mentioned earlier, the flags parameter is gone. The type_name and client_name strings are replaced by a single name string, which will be filled with a lowercase, short string (typically the driver name, e.g. "lm75"). In i2c-only drivers, drop the i2c_is_isa_adapter check, it's - useless. + useless. Same for isa-only drivers, as the test would always be + true. Only hybrid drivers (which are quite rare) still need it. The errorN labels are reduced to the number needed. If that number is 2 (i2c-only drivers), it is advised that the labels are named exit and exit_free. For i2c+isa drivers, labels should be named ERROR0, ERROR1 and ERROR2. Don't forget to properly set err before jumping to error labels. By the way, labels should be left-aligned. - Use memset to fill the client and data area with 0x00. + Use kzalloc instead of kmalloc. Use i2c_set_clientdata to set the client data (as opposed to a direct access to client->data). Use strlcpy instead of strcpy to copy the client name. @@ -86,6 +90,8 @@ Technical changes: device_create_file. Move the driver initialization before any sysfs file creation. Drop client->id. + Drop any 24RF08 corruption prevention you find, as this is now done + at the i2c-core level, and doing it twice voids it. * [Init] Limits must not be set by the driver (can be done later in user-space). Chip should not be reset default (although a module @@ -93,7 +99,8 @@ Technical changes: limited to the strictly necessary steps. * [Detach] Get rid of data, remove the call to - i2c_deregister_entry. + i2c_deregister_entry. Do not log an error message if + i2c_detach_client fails, as i2c-core will now do it for you. * [Update] Don't access client->data directly, use i2c_get_clientdata(client) instead. diff --git a/Documentation/i2c/writing-clients b/Documentation/i2c/writing-clients index 91664be91ffc..cff7b652588a 100644 --- a/Documentation/i2c/writing-clients +++ b/Documentation/i2c/writing-clients @@ -33,8 +33,8 @@ static struct i2c_driver foo_driver = { .command = &foo_command /* may be NULL */ } -The name can be chosen freely, and may be upto 40 characters long. Please -use something descriptive here. +The name field must match the driver name, including the case. It must not +contain spaces, and may be up to 31 characters long. Don't worry about the flags field; just put I2C_DF_NOTIFY into it. This means that your driver will be notified when new adapters are found. @@ -43,9 +43,6 @@ This is almost always what you want. All other fields are for call-back functions which will be explained below. -There use to be two additional fields in this structure, inc_use et dec_use, -for module usage count, but these fields were obsoleted and removed. - Extra client data ================= @@ -58,6 +55,7 @@ be very useful. An example structure is below. struct foo_data { + struct i2c_client client; struct semaphore lock; /* For ISA access in `sensors' drivers. */ int sysctl_id; /* To keep the /proc directory entry for `sensors' drivers. */ @@ -148,15 +146,15 @@ are defined in i2c.h to help you support them, as well as a generic detection algorithm. You do not have to use this parameter interface; but don't try to use -function i2c_probe() (or i2c_detect()) if you don't. +function i2c_probe() if you don't. NOTE: If you want to write a `sensors' driver, the interface is slightly different! See below. -Probing classes (i2c) ---------------------- +Probing classes +--------------- All parameters are given as lists of unsigned 16-bit integers. Lists are terminated by I2C_CLIENT_END. @@ -171,12 +169,18 @@ The following lists are used internally: ignore: insmod parameter. A list of pairs. The first value is a bus number (-1 for any I2C bus), the second is the I2C address. These addresses are never probed. - This parameter overrules 'normal' and 'probe', but not the 'force' lists. + This parameter overrules the 'normal_i2c' list only. force: insmod parameter. A list of pairs. The first value is a bus number (-1 for any I2C bus), the second is the I2C address. A device is blindly assumed to be on the given address, no probing is done. +Additionally, kind-specific force lists may optionally be defined if +the driver supports several chip kinds. They are grouped in a +NULL-terminated list of pointers named forces, those first element if the +generic force list mentioned above. Each additional list correspond to an +insmod parameter of the form force_<kind>. + Fortunately, as a module writer, you just have to define the `normal_i2c' parameter. The complete declaration could look like this: @@ -186,66 +190,17 @@ parameter. The complete declaration could look like this: /* Magic definition of all other variables and things */ I2C_CLIENT_INSMOD; + /* Or, if your driver supports, say, 2 kind of devices: */ + I2C_CLIENT_INSMOD_2(foo, bar); + +If you use the multi-kind form, an enum will be defined for you: + enum chips { any_chip, foo, bar, ... } +You can then (and certainly should) use it in the driver code. Note that you *have* to call the defined variable `normal_i2c', without any prefix! -Probing classes (sensors) -------------------------- - -If you write a `sensors' driver, you use a slightly different interface. -As well as I2C addresses, we have to cope with ISA addresses. Also, we -use a enum of chip types. Don't forget to include `sensors.h'. - -The following lists are used internally. They are all lists of integers. - - normal_i2c: filled in by the module writer. Terminated by SENSORS_I2C_END. - A list of I2C addresses which should normally be examined. - normal_isa: filled in by the module writer. Terminated by SENSORS_ISA_END. - A list of ISA addresses which should normally be examined. - probe: insmod parameter. Initialize this list with SENSORS_I2C_END values. - A list of pairs. The first value is a bus number (SENSORS_ISA_BUS for - the ISA bus, -1 for any I2C bus), the second is the address. These - addresses are also probed, as if they were in the 'normal' list. - ignore: insmod parameter. Initialize this list with SENSORS_I2C_END values. - A list of pairs. The first value is a bus number (SENSORS_ISA_BUS for - the ISA bus, -1 for any I2C bus), the second is the I2C address. These - addresses are never probed. This parameter overrules 'normal' and - 'probe', but not the 'force' lists. - -Also used is a list of pointers to sensors_force_data structures: - force_data: insmod parameters. A list, ending with an element of which - the force field is NULL. - Each element contains the type of chip and a list of pairs. - The first value is a bus number (SENSORS_ISA_BUS for the ISA bus, - -1 for any I2C bus), the second is the address. - These are automatically translated to insmod variables of the form - force_foo. - -So we have a generic insmod variabled `force', and chip-specific variables -`force_CHIPNAME'. - -Fortunately, as a module writer, you just have to define the `normal_i2c' -and `normal_isa' parameters, and define what chip names are used. -The complete declaration could look like this: - /* Scan i2c addresses 0x37, and 0x48 to 0x4f */ - static unsigned short normal_i2c[] = { 0x37, 0x48, 0x49, 0x4a, 0x4b, 0x4c, - 0x4d, 0x4e, 0x4f, I2C_CLIENT_END }; - /* Scan ISA address 0x290 */ - static unsigned int normal_isa[] = {0x0290,SENSORS_ISA_END}; - - /* Define chips foo and bar, as well as all module parameters and things */ - SENSORS_INSMOD_2(foo,bar); - -If you have one chip, you use macro SENSORS_INSMOD_1(chip), if you have 2 -you use macro SENSORS_INSMOD_2(chip1,chip2), etc. If you do not want to -bother with chip types, you can use SENSORS_INSMOD_0. - -A enum is automatically defined as follows: - enum chips { any_chip, chip1, chip2, ... } - - Attaching to an adapter ----------------------- @@ -264,17 +219,10 @@ detected at a specific address, another callback is called. return i2c_probe(adapter,&addr_data,&foo_detect_client); } -For `sensors' drivers, use the i2c_detect function instead: - - int foo_attach_adapter(struct i2c_adapter *adapter) - { - return i2c_detect(adapter,&addr_data,&foo_detect_client); - } - Remember, structure `addr_data' is defined by the macros explained above, so you do not have to define it yourself. -The i2c_probe or i2c_detect function will call the foo_detect_client +The i2c_probe function will call the foo_detect_client function only for those i2c addresses that actually have a device on them (unless a `force' parameter was used). In addition, addresses that are already in use (by some other registered client) are skipped. @@ -283,19 +231,18 @@ are already in use (by some other registered client) are skipped. The detect client function -------------------------- -The detect client function is called by i2c_probe or i2c_detect. -The `kind' parameter contains 0 if this call is due to a `force' -parameter, and -1 otherwise (for i2c_detect, it contains 0 if -this call is due to the generic `force' parameter, and the chip type -number if it is due to a specific `force' parameter). +The detect client function is called by i2c_probe. The `kind' parameter +contains -1 for a probed detection, 0 for a forced detection, or a positive +number for a forced detection with a chip type forced. Below, some things are only needed if this is a `sensors' driver. Those parts are between /* SENSORS ONLY START */ and /* SENSORS ONLY END */ markers. -This function should only return an error (any value != 0) if there is -some reason why no more detection should be done anymore. If the -detection just fails for this address, return 0. +Returning an error different from -ENODEV in a detect function will cause +the detection to stop: other addresses and adapters won't be scanned. +This should only be done on fatal or internal errors, such as a memory +shortage or i2c_attach_client failing. For now, you can ignore the `flags' parameter. It is there for future use. @@ -320,13 +267,13 @@ For now, you can ignore the `flags' parameter. It is there for future use. const char *type_name = ""; int is_isa = i2c_is_isa_adapter(adapter); - if (is_isa) { + /* Do this only if the chip can additionally be found on the ISA bus + (hybrid chip). */ - /* If this client can't be on the ISA bus at all, we can stop now - (call `goto ERROR0'). But for kicks, we will assume it is all - right. */ + if (is_isa) { /* Discard immediately if this ISA range is already used */ + /* FIXME: never use check_region(), only request_region() */ if (check_region(address,FOO_EXTENT)) goto ERROR0; @@ -362,22 +309,15 @@ For now, you can ignore the `flags' parameter. It is there for future use. client structure, even though we cannot fill it completely yet. But it allows us to access several i2c functions safely */ - /* Note that we reserve some space for foo_data too. If you don't - need it, remove it. We do it here to help to lessen memory - fragmentation. */ - if (! (new_client = kmalloc(sizeof(struct i2c_client) + - sizeof(struct foo_data), - GFP_KERNEL))) { + if (!(data = kzalloc(sizeof(struct foo_data), GFP_KERNEL))) { err = -ENOMEM; goto ERROR0; } - /* This is tricky, but it will set the data to the right value. */ - client->data = new_client + 1; - data = (struct foo_data *) (client->data); + new_client = &data->client; + i2c_set_clientdata(new_client, data); new_client->addr = address; - new_client->data = data; new_client->adapter = adapter; new_client->driver = &foo_driver; new_client->flags = 0; @@ -495,17 +435,15 @@ much simpler than the attachment code, fortunately! /* SENSORS ONLY END */ /* Try to detach the client from i2c space */ - if ((err = i2c_detach_client(client))) { - printk("foo.o: Client deregistration failed, client not detached.\n"); + if ((err = i2c_detach_client(client))) return err; - } - /* SENSORS ONLY START */ + /* HYBRID SENSORS CHIP ONLY START */ if i2c_is_isa_client(client) release_region(client->addr,LM78_EXTENT); - /* SENSORS ONLY END */ + /* HYBRID SENSORS CHIP ONLY END */ - kfree(client); /* Frees client data too, if allocated at the same time */ + kfree(data); return 0; } @@ -630,12 +568,12 @@ SMBus communication extern s32 i2c_smbus_write_block_data(struct i2c_client * client, u8 command, u8 length, u8 *values); + extern s32 i2c_smbus_read_i2c_block_data(struct i2c_client * client, + u8 command, u8 *values); These ones were removed in Linux 2.6.10 because they had no users, but could be added back later if needed: - extern s32 i2c_smbus_read_i2c_block_data(struct i2c_client * client, - u8 command, u8 *values); extern s32 i2c_smbus_read_block_data(struct i2c_client * client, u8 command, u8 *values); extern s32 i2c_smbus_write_i2c_block_data(struct i2c_client * client, diff --git a/Documentation/i386/boot.txt b/Documentation/i386/boot.txt index 1c48f0eba6fb..10312bebe55d 100644 --- a/Documentation/i386/boot.txt +++ b/Documentation/i386/boot.txt @@ -2,7 +2,7 @@ ---------------------------- H. Peter Anvin <hpa@zytor.com> - Last update 2002-01-01 + Last update 2005-09-02 On the i386 platform, the Linux kernel uses a rather complicated boot convention. This has evolved partially due to historical aspects, as @@ -34,6 +34,8 @@ Protocol 2.02: (Kernel 2.4.0-test3-pre3) New command line protocol. Protocol 2.03: (Kernel 2.4.18-pre1) Explicitly makes the highest possible initrd address available to the bootloader. +Protocol 2.04: (Kernel 2.6.14) Extend the syssize field to four bytes. + **** MEMORY LAYOUT @@ -103,10 +105,9 @@ The header looks like: Offset Proto Name Meaning /Size -01F1/1 ALL setup_sects The size of the setup in sectors +01F1/1 ALL(1 setup_sects The size of the setup in sectors 01F2/2 ALL root_flags If set, the root is mounted readonly -01F4/2 ALL syssize DO NOT USE - for bootsect.S use only -01F6/2 ALL swap_dev DO NOT USE - obsolete +01F4/4 2.04+(2 syssize The size of the 32-bit code in 16-byte paras 01F8/2 ALL ram_size DO NOT USE - for bootsect.S use only 01FA/2 ALL vid_mode Video mode control 01FC/2 ALL root_dev Default root device number @@ -129,8 +130,12 @@ Offset Proto Name Meaning 0228/4 2.02+ cmd_line_ptr 32-bit pointer to the kernel command line 022C/4 2.03+ initrd_addr_max Highest legal initrd address -For backwards compatibility, if the setup_sects field contains 0, the -real value is 4. +(1) For backwards compatibility, if the setup_sects field contains 0, the + real value is 4. + +(2) For boot protocol prior to 2.04, the upper two bytes of the syssize + field are unusable, which means the size of a bzImage kernel + cannot be determined. If the "HdrS" (0x53726448) magic number is not found at offset 0x202, the boot protocol version is "old". Loading an old kernel, the @@ -230,12 +235,16 @@ loader to communicate with the kernel. Some of its options are also relevant to the boot loader itself, see "special command line options" below. -The kernel command line is a null-terminated string up to 255 -characters long, plus the final null. +The kernel command line is a null-terminated string currently up to +255 characters long, plus the final null. A string that is too long +will be automatically truncated by the kernel, a boot loader may allow +a longer command line to be passed to permit future kernels to extend +this limit. If the boot protocol version is 2.02 or later, the address of the kernel command line is given by the header field cmd_line_ptr (see -above.) +above.) This address can be anywhere between the end of the setup +heap and 0xA0000. If the protocol version is *not* 2.02 or higher, the kernel command line is entered using the following protocol: @@ -255,7 +264,7 @@ command line is entered using the following protocol: **** SAMPLE BOOT CONFIGURATION As a sample configuration, assume the following layout of the real -mode segment: +mode segment (this is a typical, and recommended layout): 0x0000-0x7FFF Real mode kernel 0x8000-0x8FFF Stack and heap @@ -312,9 +321,9 @@ Such a boot loader should enter the following fields in the header: **** LOADING THE REST OF THE KERNEL -The non-real-mode kernel starts at offset (setup_sects+1)*512 in the -kernel file (again, if setup_sects == 0 the real value is 4.) It -should be loaded at address 0x10000 for Image/zImage kernels and +The 32-bit (non-real-mode) kernel starts at offset (setup_sects+1)*512 +in the kernel file (again, if setup_sects == 0 the real value is 4.) +It should be loaded at address 0x10000 for Image/zImage kernels and 0x100000 for bzImage kernels. The kernel is a bzImage kernel if the protocol >= 2.00 and the 0x01 diff --git a/Documentation/ia64/mca.txt b/Documentation/ia64/mca.txt new file mode 100644 index 000000000000..a71cc6a67ef7 --- /dev/null +++ b/Documentation/ia64/mca.txt @@ -0,0 +1,194 @@ +An ad-hoc collection of notes on IA64 MCA and INIT processing. Feel +free to update it with notes about any area that is not clear. + +--- + +MCA/INIT are completely asynchronous. They can occur at any time, when +the OS is in any state. Including when one of the cpus is already +holding a spinlock. Trying to get any lock from MCA/INIT state is +asking for deadlock. Also the state of structures that are protected +by locks is indeterminate, including linked lists. + +--- + +The complicated ia64 MCA process. All of this is mandated by Intel's +specification for ia64 SAL, error recovery and and unwind, it is not as +if we have a choice here. + +* MCA occurs on one cpu, usually due to a double bit memory error. + This is the monarch cpu. + +* SAL sends an MCA rendezvous interrupt (which is a normal interrupt) + to all the other cpus, the slaves. + +* Slave cpus that receive the MCA interrupt call down into SAL, they + end up spinning disabled while the MCA is being serviced. + +* If any slave cpu was already spinning disabled when the MCA occurred + then it cannot service the MCA interrupt. SAL waits ~20 seconds then + sends an unmaskable INIT event to the slave cpus that have not + already rendezvoused. + +* Because MCA/INIT can be delivered at any time, including when the cpu + is down in PAL in physical mode, the registers at the time of the + event are _completely_ undefined. In particular the MCA/INIT + handlers cannot rely on the thread pointer, PAL physical mode can + (and does) modify TP. It is allowed to do that as long as it resets + TP on return. However MCA/INIT events expose us to these PAL + internal TP changes. Hence curr_task(). + +* If an MCA/INIT event occurs while the kernel was running (not user + space) and the kernel has called PAL then the MCA/INIT handler cannot + assume that the kernel stack is in a fit state to be used. Mainly + because PAL may or may not maintain the stack pointer internally. + Because the MCA/INIT handlers cannot trust the kernel stack, they + have to use their own, per-cpu stacks. The MCA/INIT stacks are + preformatted with just enough task state to let the relevant handlers + do their job. + +* Unlike most other architectures, the ia64 struct task is embedded in + the kernel stack[1]. So switching to a new kernel stack means that + we switch to a new task as well. Because various bits of the kernel + assume that current points into the struct task, switching to a new + stack also means a new value for current. + +* Once all slaves have rendezvoused and are spinning disabled, the + monarch is entered. The monarch now tries to diagnose the problem + and decide if it can recover or not. + +* Part of the monarch's job is to look at the state of all the other + tasks. The only way to do that on ia64 is to call the unwinder, + as mandated by Intel. + +* The starting point for the unwind depends on whether a task is + running or not. That is, whether it is on a cpu or is blocked. The + monarch has to determine whether or not a task is on a cpu before it + knows how to start unwinding it. The tasks that received an MCA or + INIT event are no longer running, they have been converted to blocked + tasks. But (and its a big but), the cpus that received the MCA + rendezvous interrupt are still running on their normal kernel stacks! + +* To distinguish between these two cases, the monarch must know which + tasks are on a cpu and which are not. Hence each slave cpu that + switches to an MCA/INIT stack, registers its new stack using + set_curr_task(), so the monarch can tell that the _original_ task is + no longer running on that cpu. That gives us a decent chance of + getting a valid backtrace of the _original_ task. + +* MCA/INIT can be nested, to a depth of 2 on any cpu. In the case of a + nested error, we want diagnostics on the MCA/INIT handler that + failed, not on the task that was originally running. Again this + requires set_curr_task() so the MCA/INIT handlers can register their + own stack as running on that cpu. Then a recursive error gets a + trace of the failing handler's "task". + +[1] My (Keith Owens) original design called for ia64 to separate its + struct task and the kernel stacks. Then the MCA/INIT data would be + chained stacks like i386 interrupt stacks. But that required + radical surgery on the rest of ia64, plus extra hard wired TLB + entries with its associated performance degradation. David + Mosberger vetoed that approach. Which meant that separate kernel + stacks meant separate "tasks" for the MCA/INIT handlers. + +--- + +INIT is less complicated than MCA. Pressing the nmi button or using +the equivalent command on the management console sends INIT to all +cpus. SAL picks one one of the cpus as the monarch and the rest are +slaves. All the OS INIT handlers are entered at approximately the same +time. The OS monarch prints the state of all tasks and returns, after +which the slaves return and the system resumes. + +At least that is what is supposed to happen. Alas there are broken +versions of SAL out there. Some drive all the cpus as monarchs. Some +drive them all as slaves. Some drive one cpu as monarch, wait for that +cpu to return from the OS then drive the rest as slaves. Some versions +of SAL cannot even cope with returning from the OS, they spin inside +SAL on resume. The OS INIT code has workarounds for some of these +broken SAL symptoms, but some simply cannot be fixed from the OS side. + +--- + +The scheduler hooks used by ia64 (curr_task, set_curr_task) are layer +violations. Unfortunately MCA/INIT start off as massive layer +violations (can occur at _any_ time) and they build from there. + +At least ia64 makes an attempt at recovering from hardware errors, but +it is a difficult problem because of the asynchronous nature of these +errors. When processing an unmaskable interrupt we sometimes need +special code to cope with our inability to take any locks. + +--- + +How is ia64 MCA/INIT different from x86 NMI? + +* x86 NMI typically gets delivered to one cpu. MCA/INIT gets sent to + all cpus. + +* x86 NMI cannot be nested. MCA/INIT can be nested, to a depth of 2 + per cpu. + +* x86 has a separate struct task which points to one of multiple kernel + stacks. ia64 has the struct task embedded in the single kernel + stack, so switching stack means switching task. + +* x86 does not call the BIOS so the NMI handler does not have to worry + about any registers having changed. MCA/INIT can occur while the cpu + is in PAL in physical mode, with undefined registers and an undefined + kernel stack. + +* i386 backtrace is not very sensitive to whether a process is running + or not. ia64 unwind is very, very sensitive to whether a process is + running or not. + +--- + +What happens when MCA/INIT is delivered what a cpu is running user +space code? + +The user mode registers are stored in the RSE area of the MCA/INIT on +entry to the OS and are restored from there on return to SAL, so user +mode registers are preserved across a recoverable MCA/INIT. Since the +OS has no idea what unwind data is available for the user space stack, +MCA/INIT never tries to backtrace user space. Which means that the OS +does not bother making the user space process look like a blocked task, +i.e. the OS does not copy pt_regs and switch_stack to the user space +stack. Also the OS has no idea how big the user space RSE and memory +stacks are, which makes it too risky to copy the saved state to a user +mode stack. + +--- + +How do we get a backtrace on the tasks that were running when MCA/INIT +was delivered? + +mca.c:::ia64_mca_modify_original_stack(). That identifies and +verifies the original kernel stack, copies the dirty registers from +the MCA/INIT stack's RSE to the original stack's RSE, copies the +skeleton struct pt_regs and switch_stack to the original stack, fills +in the skeleton structures from the PAL minstate area and updates the +original stack's thread.ksp. That makes the original stack look +exactly like any other blocked task, i.e. it now appears to be +sleeping. To get a backtrace, just start with thread.ksp for the +original task and unwind like any other sleeping task. + +--- + +How do we identify the tasks that were running when MCA/INIT was +delivered? + +If the previous task has been verified and converted to a blocked +state, then sos->prev_task on the MCA/INIT stack is updated to point to +the previous task. You can look at that field in dumps or debuggers. +To help distinguish between the handler and the original tasks, +handlers have _TIF_MCA_INIT set in thread_info.flags. + +The sos data is always in the MCA/INIT handler stack, at offset +MCA_SOS_OFFSET. You can get that value from mca_asm.h or calculate it +as KERNEL_STACK_SIZE - sizeof(struct pt_regs) - sizeof(struct +ia64_sal_os_state), with 16 byte alignment for all structures. + +Also the comm field of the MCA/INIT task is modified to include the pid +of the original task, for humans to use. For example, a comm field of +'MCA 12159' means that pid 12159 was running when the MCA was +delivered. diff --git a/Documentation/ibm-acpi.txt b/Documentation/ibm-acpi.txt index c437b1aeff55..8b3fd82b2ce7 100644 --- a/Documentation/ibm-acpi.txt +++ b/Documentation/ibm-acpi.txt @@ -1,16 +1,16 @@ IBM ThinkPad ACPI Extras Driver - Version 0.8 - 8 November 2004 + Version 0.12 + 17 August 2005 Borislav Deianov <borislav@users.sf.net> http://ibm-acpi.sf.net/ -This is a Linux ACPI driver for the IBM ThinkPad laptops. It aims to -support various features of these laptops which are accessible through -the ACPI framework but not otherwise supported by the generic Linux -ACPI drivers. +This is a Linux ACPI driver for the IBM ThinkPad laptops. It supports +various features of these laptops which are accessible through the +ACPI framework but not otherwise supported by the generic Linux ACPI +drivers. Status @@ -25,9 +25,14 @@ detailed description): - ThinkLight on and off - limited docking and undocking - UltraBay eject - - Experimental: CMOS control - - Experimental: LED control - - Experimental: ACPI sounds + - CMOS control + - LED control + - ACPI sounds + - temperature sensors + - Experimental: embedded controller register dump + - Experimental: LCD brightness control + - Experimental: volume control + - Experimental: fan speed, fan enable/disable A compatibility table by model and feature is maintained on the web site, http://ibm-acpi.sf.net/. I appreciate any success or failure @@ -91,12 +96,12 @@ driver is still in the alpha stage, the exact proc file format and commands supported by the various features is guaranteed to change frequently. -Driver Version -- /proc/acpi/ibm/driver --------------------------------------- +Driver version -- /proc/acpi/ibm/driver +--------------------------------------- The driver name and version. No commands can be written to this file. -Hot Keys -- /proc/acpi/ibm/hotkey +Hot keys -- /proc/acpi/ibm/hotkey --------------------------------- Without this driver, only the Fn-F4 key (sleep button) generates an @@ -188,7 +193,7 @@ and, on the X40, video corruption. By disabling automatic switching, the flickering or video corruption can be avoided. The video_switch command cycles through the available video outputs -(it sumulates the behavior of Fn-F7). +(it simulates the behavior of Fn-F7). Video expansion can be toggled through this feature. This controls whether the display is expanded to fill the entire LCD screen when a @@ -201,6 +206,12 @@ Fn-F7 from working. This also disables the video output switching features of this driver, as it uses the same ACPI methods as Fn-F7. Video switching on the console should still work. +UPDATE: There's now a patch for the X.org Radeon driver which +addresses this issue. Some people are reporting success with the patch +while others are still having problems. For more information: + +https://bugs.freedesktop.org/show_bug.cgi?id=2000 + ThinkLight control -- /proc/acpi/ibm/light ------------------------------------------ @@ -211,7 +222,7 @@ models which do not make the status available will show it as echo on > /proc/acpi/ibm/light echo off > /proc/acpi/ibm/light -Docking / Undocking -- /proc/acpi/ibm/dock +Docking / undocking -- /proc/acpi/ibm/dock ------------------------------------------ Docking and undocking (e.g. with the X4 UltraBase) requires some @@ -228,11 +239,15 @@ NOTE: These events will only be generated if the laptop was docked when originally booted. This is due to the current lack of support for hot plugging of devices in the Linux ACPI framework. If the laptop was booted while not in the dock, the following message is shown in the -logs: "ibm_acpi: dock device not present". No dock-related events are -generated but the dock and undock commands described below still -work. They can be executed manually or triggered by Fn key -combinations (see the example acpid configuration files included in -the driver tarball package available on the web site). +logs: + + Mar 17 01:42:34 aero kernel: ibm_acpi: dock device not present + +In this case, no dock-related events are generated but the dock and +undock commands described below still work. They can be executed +manually or triggered by Fn key combinations (see the example acpid +configuration files included in the driver tarball package available +on the web site). When the eject request button on the dock is pressed, the first event above is generated. The handler for this event should issue the @@ -267,7 +282,7 @@ the only docking stations currently supported are the X-series UltraBase docks and "dumb" port replicators like the Mini Dock (the latter don't need any ACPI support, actually). -UltraBay Eject -- /proc/acpi/ibm/bay +UltraBay eject -- /proc/acpi/ibm/bay ------------------------------------ Inserting or ejecting an UltraBay device requires some actions to be @@ -284,8 +299,11 @@ when the laptop was originally booted (on the X series, the UltraBay is in the dock, so it may not be present if the laptop was undocked). This is due to the current lack of support for hot plugging of devices in the Linux ACPI framework. If the laptop was booted without the -UltraBay, the following message is shown in the logs: "ibm_acpi: bay -device not present". No bay-related events are generated but the eject +UltraBay, the following message is shown in the logs: + + Mar 17 01:42:34 aero kernel: ibm_acpi: bay device not present + +In this case, no bay-related events are generated but the eject command described below still works. It can be executed manually or triggered by a hot key combination. @@ -306,22 +324,33 @@ necessary to enable the UltraBay device (e.g. call idectl). The contents of the /proc/acpi/ibm/bay file shows the current status of the UltraBay, as provided by the ACPI framework. -Experimental Features ---------------------- +EXPERIMENTAL warm eject support on the 600e/x, A22p and A3x (To use +this feature, you need to supply the experimental=1 parameter when +loading the module): + +These models do not have a button near the UltraBay device to request +a hot eject but rather require the laptop to be put to sleep +(suspend-to-ram) before the bay device is ejected or inserted). +The sequence of steps to eject the device is as follows: + + echo eject > /proc/acpi/ibm/bay + put the ThinkPad to sleep + remove the drive + resume from sleep + cat /proc/acpi/ibm/bay should show that the drive was removed + +On the A3x, both the UltraBay 2000 and UltraBay Plus devices are +supported. Use "eject2" instead of "eject" for the second bay. -The following features are marked experimental because using them -involves guessing the correct values of some parameters. Guessing -incorrectly may have undesirable effects like crashing your -ThinkPad. USE THESE WITH CAUTION! To activate them, you'll need to -supply the experimental=1 parameter when loading the module. +Note: the UltraBay eject support on the 600e/x, A22p and A3x is +EXPERIMENTAL and may not work as expected. USE WITH CAUTION! -Experimental: CMOS control - /proc/acpi/ibm/cmos ------------------------------------------------- +CMOS control -- /proc/acpi/ibm/cmos +----------------------------------- This feature is used internally by the ACPI firmware to control the -ThinkLight on most newer ThinkPad models. It appears that it can also -control LCD brightness, sounds volume and more, but only on some -models. +ThinkLight on most newer ThinkPad models. It may also control LCD +brightness, sounds volume and more, but only on some models. The commands are non-negative integer numbers: @@ -330,10 +359,9 @@ The commands are non-negative integer numbers: echo 2 >/proc/acpi/ibm/cmos ... -The range of numbers which are used internally by various models is 0 -to 21, but it's possible that numbers outside this range have -interesting behavior. Here is the behavior on the X40 (tpb is the -ThinkPad Buttons utility): +The range of valid numbers is 0 to 21, but not all have an effect and +the behavior varies from model to model. Here is the behavior on the +X40 (tpb is the ThinkPad Buttons utility): 0 - no effect but tpb reports "Volume down" 1 - no effect but tpb reports "Volume up" @@ -346,26 +374,18 @@ ThinkPad Buttons utility): 13 - ThinkLight off 14 - no effect but tpb reports ThinkLight status change -If you try this feature, please send me a report similar to the -above. On models which allow control of LCD brightness or sound -volume, I'd like to provide this functionality in an user-friendly -way, but first I need a way to identify the models which this is -possible. - -Experimental: LED control - /proc/acpi/ibm/LED ----------------------------------------------- +LED control -- /proc/acpi/ibm/led +--------------------------------- Some of the LED indicators can be controlled through this feature. The available commands are: - echo <led number> on >/proc/acpi/ibm/led - echo <led number> off >/proc/acpi/ibm/led - echo <led number> blink >/proc/acpi/ibm/led + echo '<led number> on' >/proc/acpi/ibm/led + echo '<led number> off' >/proc/acpi/ibm/led + echo '<led number> blink' >/proc/acpi/ibm/led -The <led number> parameter is a non-negative integer. The range of LED -numbers used internally by various models is 0 to 7 but it's possible -that numbers outside this range are also valid. Here is the mapping on -the X40: +The <led number> range is 0 to 7. The set of LEDs that can be +controlled varies from model to model. Here is the mapping on the X40: 0 - power 1 - battery (orange) @@ -376,49 +396,224 @@ the X40: All of the above can be turned on and off and can be made to blink. -If you try this feature, please send me a report similar to the -above. I'd like to provide this functionality in an user-friendly way, -but first I need to identify the which numbers correspond to which -LEDs on various models. - -Experimental: ACPI sounds - /proc/acpi/ibm/beep ------------------------------------------------ +ACPI sounds -- /proc/acpi/ibm/beep +---------------------------------- The BEEP method is used internally by the ACPI firmware to provide -audible alerts in various situtation. This feature allows the same +audible alerts in various situations. This feature allows the same sounds to be triggered manually. The commands are non-negative integer numbers: - echo 0 >/proc/acpi/ibm/beep - echo 1 >/proc/acpi/ibm/beep - echo 2 >/proc/acpi/ibm/beep - ... + echo <number> >/proc/acpi/ibm/beep -The range of numbers which are used internally by various models is 0 -to 17, but it's possible that numbers outside this range are also -valid. Here is the behavior on the X40: +The valid <number> range is 0 to 17. Not all numbers trigger sounds +and the sounds vary from model to model. Here is the behavior on the +X40: - 2 - two beeps, pause, third beep + 0 - stop a sound in progress (but use 17 to stop 16) + 2 - two beeps, pause, third beep ("low battery") 3 - single beep - 4 - "unable" + 4 - high, followed by low-pitched beep ("unable") 5 - single beep - 6 - "AC/DC" + 6 - very high, followed by high-pitched beep ("AC/DC") 7 - high-pitched beep 9 - three short beeps 10 - very long beep 12 - low-pitched beep + 15 - three high-pitched beeps repeating constantly, stop with 0 + 16 - one medium-pitched beep repeating constantly, stop with 17 + 17 - stop 16 + +Temperature sensors -- /proc/acpi/ibm/thermal +--------------------------------------------- + +Most ThinkPads include six or more separate temperature sensors but +only expose the CPU temperature through the standard ACPI methods. +This feature shows readings from up to eight different sensors. Some +readings may not be valid, e.g. may show large negative values. For +example, on the X40, a typical output may be: + +temperatures: 42 42 45 41 36 -128 33 -128 + +Thomas Gruber took his R51 apart and traced all six active sensors in +his laptop (the location of sensors may vary on other models): + +1: CPU +2: Mini PCI Module +3: HDD +4: GPU +5: Battery +6: N/A +7: Battery +8: N/A + +No commands can be written to this file. + +EXPERIMENTAL: Embedded controller reigster dump -- /proc/acpi/ibm/ecdump +------------------------------------------------------------------------ + +This feature is marked EXPERIMENTAL because the implementation +directly accesses hardware registers and may not work as expected. USE +WITH CAUTION! To use this feature, you need to supply the +experimental=1 parameter when loading the module. + +This feature dumps the values of 256 embedded controller +registers. Values which have changed since the last time the registers +were dumped are marked with a star: + +[root@x40 ibm-acpi]# cat /proc/acpi/ibm/ecdump +EC +00 +01 +02 +03 +04 +05 +06 +07 +08 +09 +0a +0b +0c +0d +0e +0f +EC 0x00: a7 47 87 01 fe 96 00 08 01 00 cb 00 00 00 40 00 +EC 0x10: 00 00 ff ff f4 3c 87 09 01 ff 42 01 ff ff 0d 00 +EC 0x20: 00 00 00 00 00 00 00 00 00 00 00 03 43 00 00 80 +EC 0x30: 01 07 1a 00 30 04 00 00 *85 00 00 10 00 50 00 00 +EC 0x40: 00 00 00 00 00 00 14 01 00 04 00 00 00 00 00 00 +EC 0x50: 00 c0 02 0d 00 01 01 02 02 03 03 03 03 *bc *02 *bc +EC 0x60: *02 *bc *02 00 00 00 00 00 00 00 00 00 00 00 00 00 +EC 0x70: 00 00 00 00 00 12 30 40 *24 *26 *2c *27 *20 80 *1f 80 +EC 0x80: 00 00 00 06 *37 *0e 03 00 00 00 0e 07 00 00 00 00 +EC 0x90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 +EC 0xa0: *ff 09 ff 09 ff ff *64 00 *00 *00 *a2 41 *ff *ff *e0 00 +EC 0xb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 +EC 0xc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 +EC 0xd0: 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 +EC 0xe0: 00 00 00 00 00 00 00 00 11 20 49 04 24 06 55 03 +EC 0xf0: 31 55 48 54 35 38 57 57 08 2f 45 73 07 65 6c 1a + +This feature can be used to determine the register holding the fan +speed on some models. To do that, do the following: + + - make sure the battery is fully charged + - make sure the fan is running + - run 'cat /proc/acpi/ibm/ecdump' several times, once per second or so + +The first step makes sure various charging-related values don't +vary. The second ensures that the fan-related values do vary, since +the fan speed fluctuates a bit. The third will (hopefully) mark the +fan register with a star: + +[root@x40 ibm-acpi]# cat /proc/acpi/ibm/ecdump +EC +00 +01 +02 +03 +04 +05 +06 +07 +08 +09 +0a +0b +0c +0d +0e +0f +EC 0x00: a7 47 87 01 fe 96 00 08 01 00 cb 00 00 00 40 00 +EC 0x10: 00 00 ff ff f4 3c 87 09 01 ff 42 01 ff ff 0d 00 +EC 0x20: 00 00 00 00 00 00 00 00 00 00 00 03 43 00 00 80 +EC 0x30: 01 07 1a 00 30 04 00 00 85 00 00 10 00 50 00 00 +EC 0x40: 00 00 00 00 00 00 14 01 00 04 00 00 00 00 00 00 +EC 0x50: 00 c0 02 0d 00 01 01 02 02 03 03 03 03 bc 02 bc +EC 0x60: 02 bc 02 00 00 00 00 00 00 00 00 00 00 00 00 00 +EC 0x70: 00 00 00 00 00 12 30 40 24 27 2c 27 21 80 1f 80 +EC 0x80: 00 00 00 06 *be 0d 03 00 00 00 0e 07 00 00 00 00 +EC 0x90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 +EC 0xa0: ff 09 ff 09 ff ff 64 00 00 00 a2 41 ff ff e0 00 +EC 0xb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 +EC 0xc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 +EC 0xd0: 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 +EC 0xe0: 00 00 00 00 00 00 00 00 11 20 49 04 24 06 55 03 +EC 0xf0: 31 55 48 54 35 38 57 57 08 2f 45 73 07 65 6c 1a + +Another set of values that varies often is the temperature +readings. Since temperatures don't change vary fast, you can take +several quick dumps to eliminate them. + +You can use a similar method to figure out the meaning of other +embedded controller registers - e.g. make sure nothing else changes +except the charging or discharging battery to determine which +registers contain the current battery capacity, etc. If you experiment +with this, do send me your results (including some complete dumps with +a description of the conditions when they were taken.) + +EXPERIMENTAL: LCD brightness control -- /proc/acpi/ibm/brightness +----------------------------------------------------------------- + +This feature is marked EXPERIMENTAL because the implementation +directly accesses hardware registers and may not work as expected. USE +WITH CAUTION! To use this feature, you need to supply the +experimental=1 parameter when loading the module. + +This feature allows software control of the LCD brightness on ThinkPad +models which don't have a hardware brightness slider. The available +commands are: + + echo up >/proc/acpi/ibm/brightness + echo down >/proc/acpi/ibm/brightness + echo 'level <level>' >/proc/acpi/ibm/brightness + +The <level> number range is 0 to 7, although not all of them may be +distinct. The current brightness level is shown in the file. + +EXPERIMENTAL: Volume control -- /proc/acpi/ibm/volume +----------------------------------------------------- + +This feature is marked EXPERIMENTAL because the implementation +directly accesses hardware registers and may not work as expected. USE +WITH CAUTION! To use this feature, you need to supply the +experimental=1 parameter when loading the module. + +This feature allows volume control on ThinkPad models which don't have +a hardware volume knob. The available commands are: + + echo up >/proc/acpi/ibm/volume + echo down >/proc/acpi/ibm/volume + echo mute >/proc/acpi/ibm/volume + echo 'level <level>' >/proc/acpi/ibm/volume + +The <level> number range is 0 to 15 although not all of them may be +distinct. The unmute the volume after the mute command, use either the +up or down command (the level command will not unmute the volume). +The current volume level and mute state is shown in the file. + +EXPERIMENTAL: fan speed, fan enable/disable -- /proc/acpi/ibm/fan +----------------------------------------------------------------- + +This feature is marked EXPERIMENTAL because the implementation +directly accesses hardware registers and may not work as expected. USE +WITH CAUTION! To use this feature, you need to supply the +experimental=1 parameter when loading the module. + +This feature attempts to show the current fan speed. The speed is read +directly from the hardware registers of the embedded controller. This +is known to work on later R, T and X series ThinkPads but may show a +bogus value on other models. + +The fan may be enabled or disabled with the following commands: + + echo enable >/proc/acpi/ibm/fan + echo disable >/proc/acpi/ibm/fan + +WARNING WARNING WARNING: do not leave the fan disabled unless you are +monitoring the temperature sensor readings and you are ready to enable +it if necessary to avoid overheating. + +The fan only runs if it's enabled *and* the various temperature +sensors which control it read high enough. On the X40, this seems to +depend on the CPU and HDD temperatures. Specifically, the fan is +turned on when either the CPU temperature climbs to 56 degrees or the +HDD temperature climbs to 46 degrees. The fan is turned off when the +CPU temperature drops to 49 degrees and the HDD temperature drops to +41 degrees. These thresholds cannot currently be controlled. + +On the X31 and X40 (and ONLY on those models), the fan speed can be +controlled to a certain degree. Once the fan is running, it can be +forced to run faster or slower with the following command: + + echo 'speed <speed>' > /proc/acpi/ibm/thermal + +The sustainable range of fan speeds on the X40 appears to be from +about 3700 to about 7350. Values outside this range either do not have +any effect or the fan speed eventually settles somewhere in that +range. The fan cannot be stopped or started with this command. + +On the 570, temperature readings are not available through this +feature and the fan control works a little differently. The fan speed +is reported in levels from 0 (off) to 7 (max) and can be controlled +with the following command: -(I've only been able to identify a couple of them). - -If you try this feature, please send me a report similar to the -above. I'd like to provide this functionality in an user-friendly way, -but first I need to identify the which numbers correspond to which -sounds on various models. + echo 'level <level>' > /proc/acpi/ibm/thermal -Multiple Command, Module Parameters ------------------------------------ +Multiple Commands, Module Parameters +------------------------------------ Multiple commands can be written to the proc files in one shot by separating them with commas, for example: @@ -451,24 +646,19 @@ scripts (included with ibm-acpi for completeness): /usr/local/sbin/laptop_mode -- from the Linux kernel source distribution, see Documentation/laptop-mode.txt /sbin/service -- comes with Redhat/Fedora distributions + /usr/sbin/hibernate -- from the Software Suspend 2 distribution, + see http://softwaresuspend.berlios.de/ -Toan T Nguyen <ntt@control.uchicago.edu> has written a SuSE powersave -script for the X20, included in config/usr/sbin/ibm_hotkeys_X20 +Toan T Nguyen <ntt@physics.ucla.edu> notes that Suse uses the +powersave program to suspend ('powersave --suspend-to-ram') or +hibernate ('powersave --suspend-to-disk'). This means that the +hibernate script is not needed on that distribution. Henrik Brix Andersen <brix@gentoo.org> has written a Gentoo ACPI event handler script for the X31. You can get the latest version from http://dev.gentoo.org/~brix/files/x31.sh David Schweikert <dws@ee.eth.ch> has written an alternative blank.sh -script which works on Debian systems, included in -configs/etc/acpi/actions/blank-debian.sh - - -TODO ----- - -I'd like to implement the following features but haven't yet found the -time and/or I don't yet know how to implement them: - -- UltraBay floppy drive support - +script which works on Debian systems. This scripts has now been +extended to also work on Fedora systems and included as the default +blank.sh in the distribution. diff --git a/Documentation/input/appletouch.txt b/Documentation/input/appletouch.txt new file mode 100644 index 000000000000..b48d11d0326d --- /dev/null +++ b/Documentation/input/appletouch.txt @@ -0,0 +1,84 @@ +Apple Touchpad Driver (appletouch) +---------------------------------- + Copyright (C) 2005 Stelian Pop <stelian@popies.net> + +appletouch is a Linux kernel driver for the USB touchpad found on post +February 2005 Apple Alu Powerbooks. + +This driver is derived from Johannes Berg's appletrackpad driver[1], but it has +been improved in some areas: + * appletouch is a full kernel driver, no userspace program is necessary + * appletouch can be interfaced with the synaptics X11 driver, in order + to have touchpad acceleration, scrolling, etc. + +Credits go to Johannes Berg for reverse-engineering the touchpad protocol, +Frank Arnold for further improvements, and Alex Harper for some additional +information about the inner workings of the touchpad sensors. + +Usage: +------ + +In order to use the touchpad in the basic mode, compile the driver and load +the module. A new input device will be detected and you will be able to read +the mouse data from /dev/input/mice (using gpm, or X11). + +In X11, you can configure the touchpad to use the synaptics X11 driver, which +will give additional functionalities, like acceleration, scrolling, 2 finger +tap for middle button mouse emulation, 3 finger tap for right button mouse +emulation, etc. In order to do this, make sure you're using a recent version of +the synaptics driver (tested with 0.14.2, available from [2]), and configure a +new input device in your X11 configuration file (take a look below for an +example). For additional configuration, see the synaptics driver documentation. + + Section "InputDevice" + Identifier "Synaptics Touchpad" + Driver "synaptics" + Option "SendCoreEvents" "true" + Option "Device" "/dev/input/mice" + Option "Protocol" "auto-dev" + Option "LeftEdge" "0" + Option "RightEdge" "850" + Option "TopEdge" "0" + Option "BottomEdge" "645" + Option "MinSpeed" "0.4" + Option "MaxSpeed" "1" + Option "AccelFactor" "0.02" + Option "FingerLow" "0" + Option "FingerHigh" "30" + Option "MaxTapMove" "20" + Option "MaxTapTime" "100" + Option "HorizScrollDelta" "0" + Option "VertScrollDelta" "30" + Option "SHMConfig" "on" + EndSection + + Section "ServerLayout" + ... + InputDevice "Mouse" + InputDevice "Synaptics Touchpad" + ... + EndSection + +Fuzz problems: +-------------- + +The touchpad sensors are very sensitive to heat, and will generate a lot of +noise when the temperature changes. This is especially true when you power-on +the laptop for the first time. + +The appletouch driver tries to handle this noise and auto adapt itself, but it +is not perfect. If finger movements are not recognized anymore, try reloading +the driver. + +You can activate debugging using the 'debug' module parameter. A value of 0 +deactivates any debugging, 1 activates tracing of invalid samples, 2 activates +full tracing (each sample is being traced): + modprobe appletouch debug=1 + or + echo "1" > /sys/module/appletouch/parameters/debug + +Links: +------ + +[1]: http://johannes.sipsolutions.net/PowerBook/touchpad/ +[2]: http://web.telia.com/~u89404340/touchpad/index.html diff --git a/Documentation/input/yealink.txt b/Documentation/input/yealink.txt new file mode 100644 index 000000000000..0962c5c948be --- /dev/null +++ b/Documentation/input/yealink.txt @@ -0,0 +1,216 @@ +Driver documentation for yealink usb-p1k phones + +0. Status +~~~~~~~~~ +The p1k is a relatively cheap usb 1.1 phone with: + - keyboard full support, yealink.ko / input event API + - LCD full support, yealink.ko / sysfs API + - LED full support, yealink.ko / sysfs API + - dialtone full support, yealink.ko / sysfs API + - ringtone full support, yealink.ko / sysfs API + - audio playback full support, snd_usb_audio.ko / alsa API + - audio record full support, snd_usb_audio.ko / alsa API + +For vendor documentation see http://www.yealink.com + + +1. Compilation (stand alone version) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Currently only kernel 2.6.x.y versions are supported. +In order to build the yealink.ko module do + + make + +If you encounter problems please check if in the MAKE_OPTS variable in +the Makefile is pointing to the location where your kernel sources +are located, default /usr/src/linux. + + +1.1 Troubleshooting +~~~~~~~~~~~~~~~~~~~ +Q: Module yealink compiled and installed without any problem but phone + is not initialized and does not react to any actions. +A: If you see something like: + hiddev0: USB HID v1.00 Device [Yealink Network Technology Ltd. VOIP USB Phone + in dmesg, it means that the hid driver has grabbed the device first. Try to + load module yealink before any other usb hid driver. Please see the + instructions provided by your distribution on module configuration. + +Q: Phone is working now (displays version and accepts keypad input) but I can't + find the sysfs files. +A: The sysfs files are located on the particular usb endpoint. On most + distributions you can do: "find /sys/ -name get_icons" for a hint. + + +2. keyboard features +~~~~~~~~~~~~~~~~~~~~ +The current mapping in the kernel is provided by the map_p1k_to_key +function: + + Physical USB-P1K button layout input events + + + up up + IN OUT left, right + down down + + pickup C hangup enter, backspace, escape + 1 2 3 1, 2, 3 + 4 5 6 4, 5, 6, + 7 8 9 7, 8, 9, + * 0 # *, 0, #, + + The "up" and "down" keys, are symbolised by arrows on the button. + The "pickup" and "hangup" keys are symbolised by a green and red phone + on the button. + + +3. LCD features +~~~~~~~~~~~~~~~ +The LCD is divided and organised as a 3 line display: + + |[] [][] [][] [][] in |[][] + |[] M [][] D [][] : [][] out |[][] + store + + NEW REP SU MO TU WE TH FR SA + + [] [] [] [] [] [] [] [] [] [] [] [] + [] [] [] [] [] [] [] [] [] [] [] [] + + +Line 1 Format (see below) : 18.e8.M8.88...188 + Icon names : M D : IN OUT STORE +Line 2 Format : ......... + Icon name : NEW REP SU MO TU WE TH FR SA +Line 3 Format : 888888888888 + + +Format description: + From a user space perspective the world is seperated in "digits" and "icons". + A digit can have a character set, an icon can only be ON or OFF. + + Format specifier + '8' : Generic 7 segment digit with individual addressable segments + + Reduced capabillity 7 segm digit, when segments are hard wired together. + '1' : 2 segments digit only able to produce a 1. + 'e' : Most significant day of the month digit, + able to produce at least 1 2 3. + 'M' : Most significant minute digit, + able to produce at least 0 1 2 3 4 5. + + Icons or pictograms: + '.' : For example like AM, PM, SU, a 'dot' .. or other single segment + elements. + + +4. Driver usage +~~~~~~~~~~~~~~~ +For userland the following interfaces are available using the sysfs interface: + /sys/.../ + line1 Read/Write, lcd line1 + line2 Read/Write, lcd line2 + line3 Read/Write, lcd line3 + + get_icons Read, returns a set of available icons. + hide_icon Write, hide the element by writing the icon name. + show_icon Write, display the element by writing the icon name. + + map_seg7 Read/Write, the 7 segments char set, common for all + yealink phones. (see map_to_7segment.h) + + ringtone Write, upload binary representation of a ringtone, + see yealink.c. status EXPERIMENTAL due to potential + races between async. and sync usb calls. + + +4.1 lineX +~~~~~~~~~ +Reading /sys/../lineX will return the format string with its current value: + + Example: + cat ./line3 + 888888888888 + Linux Rocks! + +Writing to /sys/../lineX will set the coresponding LCD line. + - Excess characters are ignored. + - If less characters are written than allowed, the remaining digits are + unchanged. + - The tab '\t'and '\n' char does not overwrite the original content. + - Writing a space to an icon will always hide its content. + + Example: + date +"%m.%e.%k:%M" | sed 's/^0/ /' > ./line1 + + Will update the LCD with the current date & time. + + +4.2 get_icons +~~~~~~~~~~~~~ +Reading will return all available icon names and its current settings: + + cat ./get_icons + on M + on D + on : + IN + OUT + STORE + NEW + REP + SU + MO + TU + WE + TH + FR + SA + LED + DIALTONE + RINGTONE + + +4.3 show/hide icons +~~~~~~~~~~~~~~~~~~~ +Writing to these files will update the state of the icon. +Only one icon at a time can be updated. + +If an icon is also on a ./lineX the corresponding value is +updated with the first letter of the icon. + + Example - light up the store icon: + echo -n "STORE" > ./show_icon + + cat ./line1 + 18.e8.M8.88...188 + S + + Example - sound the ringtone for 10 seconds: + echo -n RINGTONE > /sys/..../show_icon + sleep 10 + echo -n RINGTONE > /sys/..../hide_icon + + +5. Sound features +~~~~~~~~~~~~~~~~~ +Sound is supported by the ALSA driver: snd_usb_audio + +One 16-bit channel with sample and playback rates of 8000 Hz is the practical +limit of the device. + + Example - recording test: + arecord -v -d 10 -r 8000 -f S16_LE -t wav foobar.wav + + Example - playback test: + aplay foobar.wav + + +6. Credits & Acknowledgments +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + - Olivier Vandorpe, for starting the usbb2k-api project doing much of + the reverse engineering. + - Martin Diehl, for pointing out how to handle USB memory allocation. + - Dmitry Torokhov, for the numerous code reviews and suggestions. + diff --git a/Documentation/ioctl/cdrom.txt b/Documentation/ioctl/cdrom.txt index 4ccdcc6fe364..8ec32cc49eb1 100644 --- a/Documentation/ioctl/cdrom.txt +++ b/Documentation/ioctl/cdrom.txt @@ -878,7 +878,7 @@ DVD_READ_STRUCT Read structure error returns: EINVAL physical.layer_num exceeds number of layers - EIO Recieved invalid response from drive + EIO Received invalid response from drive diff --git a/Documentation/kbuild/makefiles.txt b/Documentation/kbuild/makefiles.txt index 2616a58a5a4b..d802ce88bedc 100644 --- a/Documentation/kbuild/makefiles.txt +++ b/Documentation/kbuild/makefiles.txt @@ -31,7 +31,7 @@ This document describes the Linux kernel Makefiles. === 6 Architecture Makefiles --- 6.1 Set variables to tweak the build to the architecture - --- 6.2 Add prerequisites to prepare: + --- 6.2 Add prerequisites to archprepare: --- 6.3 List directories to visit when descending --- 6.4 Architecture specific boot images --- 6.5 Building non-kbuild targets @@ -734,18 +734,18 @@ When kbuild executes the following steps are followed (roughly): for loadable kernel modules. ---- 6.2 Add prerequisites to prepare: +--- 6.2 Add prerequisites to archprepare: - The prepare: rule is used to list prerequisites that needs to be + The archprepare: rule is used to list prerequisites that needs to be built before starting to descend down in the subdirectories. This is usual header files containing assembler constants. Example: - #arch/s390/Makefile - prepare: include/asm-$(ARCH)/offsets.h + #arch/arm/Makefile + archprepare: maketools - In this example the file include/asm-$(ARCH)/offsets.h will - be built before descending down in the subdirectories. + In this example the file target maketools will be processed + before descending down in the subdirectories. See also chapter XXX-TODO that describe how kbuild supports generating offset header files. @@ -872,7 +872,13 @@ When kbuild executes the following steps are followed (roughly): Assignments to $(targets) are without $(obj)/ prefix. if_changed may be used in conjunction with custom commands as defined in 6.7 "Custom kbuild commands". + Note: It is a typical mistake to forget the FORCE prerequisite. + Another common pitfall is that whitespace is sometimes + significant; for instance, the below will fail (note the extra space + after the comma): + target: source(s) FORCE + #WRONG!# $(call if_changed, ld/objcopy/gzip) ld Link target. Often LDFLAGS_$@ is used to set specific options to ld. diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt index 7ff213f4becd..5f08f9ce6046 100644 --- a/Documentation/kdump/kdump.txt +++ b/Documentation/kdump/kdump.txt @@ -39,8 +39,7 @@ SETUP and apply http://lse.sourceforge.net/kdump/patches/kexec-tools-1.101-kdump.patch and after that build the source. -2) Download and build the appropriate (latest) kexec/kdump (-mm) kernel - patchset and apply it to the vanilla kernel tree. +2) Download and build the appropriate (2.6.13-rc1 onwards) vanilla kernel. Two kernels need to be built in order to get this feature working. @@ -67,11 +66,11 @@ SETUP c) Enable "/proc/vmcore support" (Optional, in Pseudo filesystems). CONFIG_PROC_VMCORE=y d) Disable SMP support and build a UP kernel (Until it is fixed). - CONFIG_SMP=n + CONFIG_SMP=n e) Enable "Local APIC support on uniprocessors". - CONFIG_X86_UP_APIC=y + CONFIG_X86_UP_APIC=y f) Enable "IO-APIC support on uniprocessors" - CONFIG_X86_UP_IOAPIC=y + CONFIG_X86_UP_IOAPIC=y Note: i) Options a) and b) depend upon "Configure standard kernel features (for small systems)" (under General setup). @@ -84,17 +83,23 @@ SETUP 4) Load the second kernel to be booted using: - kexec -p <second-kernel> --crash-dump --args-linux --append="root=<root-dev> - init 1 irqpoll" + kexec -p <second-kernel> --args-linux --elf32-core-headers + --append="root=<root-dev> init 1 irqpoll" Note: i) <second-kernel> has to be a vmlinux image. bzImage will not work, as of now. - ii) By default ELF headers are stored in ELF32 format (for i386). This - is sufficient to represent the physical memory up to 4GB. To store - headers in ELF64 format, specifiy "--elf64-core-headers" on the - kexec command line additionally. + ii) By default ELF headers are stored in ELF64 format. Option + --elf32-core-headers forces generation of ELF32 headers. gdb can + not open ELF64 headers on 32 bit systems. So creating ELF32 + headers can come handy for users who have got non-PAE systems and + hence have memory less than 4GB. iii) Specify "irqpoll" as command line parameter. This reduces driver initialization failures in second kernel due to shared interrupts. + iv) <root-dev> needs to be specified in a format corresponding to + the root device name in the output of mount command. + v) If you have built the drivers required to mount root file + system as modules in <second-kernel>, then, specify + --initrd=<initrd-for-second-kernel>. 5) System reboots into the second kernel when a panic occurs. A module can be written to force the panic or "ALT-SysRq-c" can be used initiate a crash diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index a998a8c2f95b..5dffcfefc3c7 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -17,7 +17,7 @@ are specified on the kernel command line with the module name plus usbcore.blinkenlights=1 -The text in square brackets at the beginning of the description state the +The text in square brackets at the beginning of the description states the restrictions on the kernel for the said kernel parameter to be valid. The restrictions referred to are that the relevant option is valid if: @@ -27,8 +27,8 @@ restrictions referred to are that the relevant option is valid if: APM Advanced Power Management support is enabled. AX25 Appropriate AX.25 support is enabled. CD Appropriate CD support is enabled. - DEVFS devfs support is enabled. - DRM Direct Rendering Management support is enabled. + DEVFS devfs support is enabled. + DRM Direct Rendering Management support is enabled. EDD BIOS Enhanced Disk Drive Services (EDD) is enabled EFI EFI Partitioning (GPT) is enabled EIDE EIDE/ATAPI support is enabled. @@ -71,7 +71,7 @@ restrictions referred to are that the relevant option is valid if: SERIAL Serial support is enabled. SMP The kernel is an SMP kernel. SPARC Sparc architecture is enabled. - SWSUSP Software suspension is enabled. + SWSUSP Software suspend is enabled. TS Appropriate touchscreen support is enabled. USB USB support is enabled. USBHID USB Human Interface Device support is enabled. @@ -105,13 +105,13 @@ running once the system is up. See header of drivers/scsi/53c7xx.c. See also Documentation/scsi/ncr53c7xx.txt. - acpi= [HW,ACPI] Advanced Configuration and Power Interface - Format: { force | off | ht | strict } + acpi= [HW,ACPI] Advanced Configuration and Power Interface + Format: { force | off | ht | strict | noirq } force -- enable ACPI if default was off off -- disable ACPI if default was on noirq -- do not use ACPI for IRQ routing ht -- run only enough ACPI to enable Hyper Threading - strict -- Be less tolerant of platforms that are not + strict -- Be less tolerant of platforms that are not strictly ACPI specification compliant. See also Documentation/pm.txt, pci=noacpi @@ -119,20 +119,23 @@ running once the system is up. acpi_sleep= [HW,ACPI] Sleep options Format: { s3_bios, s3_mode } See Documentation/power/video.txt - + acpi_sci= [HW,ACPI] ACPI System Control Interrupt trigger mode - Format: { level | edge | high | low } + Format: { level | edge | high | low } - acpi_irq_balance [HW,ACPI] ACPI will balance active IRQs - default in APIC mode + acpi_irq_balance [HW,ACPI] + ACPI will balance active IRQs + default in APIC mode - acpi_irq_nobalance [HW,ACPI] ACPI will not move active IRQs (default) - default in PIC mode + acpi_irq_nobalance [HW,ACPI] + ACPI will not move active IRQs (default) + default in PIC mode - acpi_irq_pci= [HW,ACPI] If irq_balance, Clear listed IRQs for use by PCI + acpi_irq_pci= [HW,ACPI] If irq_balance, clear listed IRQs for + use by PCI Format: <irq>,<irq>... - acpi_irq_isa= [HW,ACPI] If irq_balance, Mark listed IRQs used by ISA + acpi_irq_isa= [HW,ACPI] If irq_balance, mark listed IRQs used by ISA Format: <irq>,<irq>... acpi_osi= [HW,ACPI] empty param disables _OSI @@ -145,20 +148,35 @@ running once the system is up. acpi_dbg_layer= [HW,ACPI] Format: <int> - Each bit of the <int> indicates an acpi debug layer, + Each bit of the <int> indicates an ACPI debug layer, 1: enable, 0: disable. It is useful for boot time debugging. After system has booted up, it can be set via /proc/acpi/debug_layer. acpi_dbg_level= [HW,ACPI] Format: <int> - Each bit of the <int> indicates an acpi debug level, + Each bit of the <int> indicates an ACPI debug level, 1: enable, 0: disable. It is useful for boot time debugging. After system has booted up, it can be set via /proc/acpi/debug_level. acpi_fake_ecdt [HW,ACPI] Workaround failure due to BIOS lacking ECDT + acpi_generic_hotkey [HW,ACPI] + Allow consolidated generic hotkey driver to + override platform specific driver. + See also Documentation/acpi-hotkey.txt. + + enable_timer_pin_1 [i386,x86-64] + Enable PIN 1 of APIC timer + Can be useful to work around chipset bugs + (in particular on some ATI chipsets). + The kernel tries to set a reasonable default. + + disable_timer_pin_1 [i386,x86-64] + Disable PIN 1 of APIC timer + Can be useful to work around chipset bugs. + ad1816= [HW,OSS] Format: <io>,<irq>,<dma>,<dma2> See also Documentation/sound/oss/AD1816. @@ -168,7 +186,7 @@ running once the system is up. adlib= [HW,OSS] Format: <io> - + advansys= [HW,SCSI] See header of drivers/scsi/advansys.c. @@ -178,7 +196,7 @@ running once the system is up. aedsp16= [HW,OSS] Audio Excel DSP 16 Format: <io>,<irq>,<dma>,<mss_io>,<mpu_io>,<mpu_irq> See also header of sound/oss/aedsp16.c. - + aha152x= [HW,SCSI] See Documentation/scsi/aha152x.txt. @@ -191,10 +209,6 @@ running once the system is up. aic79xx= [HW,SCSI] See Documentation/scsi/aic79xx.txt. - AM53C974= [HW,SCSI] - Format: <host-scsi-id>,<target-scsi-id>,<max-rate>,<max-offset> - See also header of drivers/scsi/AM53C974.c. - amijoy.map= [HW,JOY] Amiga joystick support Map of devices attached to JOY0DAT and JOY1DAT Format: <a>,<b> @@ -205,23 +219,24 @@ running once the system is up. connected to one of 16 gameports Format: <type1>,<type2>,..<type16> - apc= [HW,SPARC] Power management functions (SPARCstation-4/5 + deriv.) + apc= [HW,SPARC] + Power management functions (SPARCstation-4/5 + deriv.) Format: noidle Disable APC CPU standby support. SPARCstation-Fox does not play well with APC CPU idle - disable it if you have APC and your system crashes randomly. - apic= [APIC,i386] Change the output verbosity whilst booting + apic= [APIC,i386] Change the output verbosity whilst booting Format: { quiet (default) | verbose | debug } Change the amount of debugging information output when initialising the APIC and IO-APIC components. - + apm= [APM] Advanced Power Management See header of arch/i386/kernel/apm.c. applicom= [HW] Format: <mem>,<irq> - + arcrimi= [HW,NET] ARCnet - "RIM I" (entirely mem-mapped) cards Format: <io>,<irq>,<nodeID> @@ -236,38 +251,40 @@ running once the system is up. atkbd.reset= [HW] Reset keyboard during initialization - atkbd.set= [HW] Select keyboard code set - Format: <int> (2 = AT (default) 3 = PS/2) + atkbd.set= [HW] Select keyboard code set + Format: <int> (2 = AT (default), 3 = PS/2) atkbd.scroll= [HW] Enable scroll wheel on MS Office and similar keyboards atkbd.softraw= [HW] Choose between synthetic and real raw mode Format: <bool> (0 = real, 1 = synthetic (default)) - - atkbd.softrepeat= - [HW] Use software keyboard repeat + + atkbd.softrepeat= [HW] + Use software keyboard repeat autotest [IA64] awe= [HW,OSS] AWE32/SB32/AWE64 wave table synth Format: <io>,<memsize>,<isapnp> - + aztcd= [HW,CD] Aztech CD268 CDROM driver Format: <io>,0x79 (?) baycom_epp= [HW,AX25] Format: <io>,<mode> - + baycom_par= [HW,AX25] BayCom Parallel Port AX.25 Modem Format: <io>,<mode> See header of drivers/net/hamradio/baycom_par.c. - baycom_ser_fdx= [HW,AX25] BayCom Serial Port AX.25 Modem (Full Duplex Mode) + baycom_ser_fdx= [HW,AX25] + BayCom Serial Port AX.25 Modem (Full Duplex Mode) Format: <io>,<irq>,<mode>[,<baud>] See header of drivers/net/hamradio/baycom_ser_fdx.c. - baycom_ser_hdx= [HW,AX25] BayCom Serial Port AX.25 Modem (Half Duplex Mode) + baycom_ser_hdx= [HW,AX25] + BayCom Serial Port AX.25 Modem (Half Duplex Mode) Format: <io>,<irq>,<mode> See header of drivers/net/hamradio/baycom_ser_hdx.c. @@ -278,7 +295,8 @@ running once the system is up. blkmtd_count= bttv.card= [HW,V4L] bttv (bt848 + bt878 based grabber cards) - bttv.radio= Most important insmod options are available as kernel args too. + bttv.radio= Most important insmod options are available as + kernel args too. bttv.pll= See Documentation/video4linux/bttv/Insmod-options bttv.tuner= and Documentation/video4linux/bttv/CARDLIST @@ -304,15 +322,17 @@ running once the system is up. checkreqprot [SELINUX] Set initial checkreqprot flag value. Format: { "0" | "1" } See security/selinux/Kconfig help text. - 0 -- check protection applied by kernel (includes any implied execute protection). + 0 -- check protection applied by kernel (includes + any implied execute protection). 1 -- check protection requested by application. Default value is set via a kernel config option. - Value can be changed at runtime via /selinux/checkreqprot. - - clock= [BUGS=IA-32, HW] gettimeofday timesource override. + Value can be changed at runtime via + /selinux/checkreqprot. + + clock= [BUGS=IA-32,HW] gettimeofday timesource override. Forces specified timesource (if avaliable) to be used - when calculating gettimeofday(). If specicified timesource - is not avalible, it defaults to PIT. + when calculating gettimeofday(). If specicified + timesource is not avalible, it defaults to PIT. Format: { pit | tsc | cyclone | pmtmr } hpet= [IA-32,HPET] option to disable HPET and use PIT. @@ -322,17 +342,19 @@ running once the system is up. Format: { auto | [<io>,][<irq>] } com20020= [HW,NET] ARCnet - COM20020 chipset - Format: <io>[,<irq>[,<nodeID>[,<backplane>[,<ckp>[,<timeout>]]]]] + Format: + <io>[,<irq>[,<nodeID>[,<backplane>[,<ckp>[,<timeout>]]]]] com90io= [HW,NET] ARCnet - COM90xx chipset (IO-mapped buffers) Format: <io>[,<irq>] - com90xx= [HW,NET] ARCnet - COM90xx chipset (memory-mapped buffers) + com90xx= [HW,NET] + ARCnet - COM90xx chipset (memory-mapped buffers) Format: <io>[,<irq>[,<memstart>]] condev= [HW,S390] console device conmode= - + console= [KNL] Output console device and options. tty<n> Use the virtual console device <n>. @@ -353,7 +375,8 @@ running once the system is up. options are the same as for ttyS, above. cpcihp_generic= [HW,PCI] Generic port I/O CompactPCI driver - Format: <first_slot>,<last_slot>,<port>,<enum_bit>[,<debug>] + Format: + <first_slot>,<last_slot>,<port>,<enum_bit>[,<debug>] cpia_pp= [HW,PPT] Format: { parport<nr> | auto | none } @@ -370,10 +393,10 @@ running once the system is up. cs89x0_media= [HW,NET] Format: { rj45 | aui | bnc } - + cyclades= [HW,SERIAL] Cyclades multi-serial port adapter. - - dasd= [HW,NET] + + dasd= [HW,NET] See header of drivers/s390/block/dasd_devmap.c. db9.dev[2|3]= [HW,JOY] Multisystem joystick support via parallel port @@ -392,7 +415,7 @@ running once the system is up. dhash_entries= [KNL] Set number of hash buckets for dentry cache. - + digi= [HW,SERIAL] IO parameters + enable/disable command. @@ -410,11 +433,11 @@ running once the system is up. dtc3181e= [HW,SCSI] - earlyprintk= [IA-32, X86-64] + earlyprintk= [IA-32,X86-64] earlyprintk=vga earlyprintk=serial[,ttySn[,baudrate]] - Append ,keep to not disable it when the real console + Append ",keep" to not disable it when the real console takes over. Only vga or serial at a time, not both. @@ -437,7 +460,7 @@ running once the system is up. Format: {"of[f]" | "sk[ipmbr]"} See comment in arch/i386/boot/edd.S - eicon= [HW,ISDN] + eicon= [HW,ISDN] Format: <id>,<membase>,<irq> eisa_irq_edge= [PARISC,HW] @@ -448,12 +471,13 @@ running once the system is up. arch/i386/kernel/cpu/cpufreq/elanfreq.c. elevator= [IOSCHED] - Format: {"as"|"cfq"|"deadline"|"noop"} - See Documentation/block/as-iosched.txt - and Documentation/block/deadline-iosched.txt for details. + Format: {"as" | "cfq" | "deadline" | "noop"} + See Documentation/block/as-iosched.txt and + Documentation/block/deadline-iosched.txt for details. + elfcorehdr= [IA-32] - Specifies physical address of start of kernel core image - elf header. + Specifies physical address of start of kernel core + image elf header. See Documentation/kdump.txt for details. enforcing [SELINUX] Set initial enforcing status. @@ -471,7 +495,7 @@ running once the system is up. es1371= [HW,OSS] Format: <spdif>,[<nomix>,[<amplifier>]] See also header of sound/oss/es1371.c. - + ether= [HW,NET] Ethernet cards parameters This option is obsoleted by the "netdev=" option, which has equivalent usage. See its documentation for details. @@ -512,12 +536,13 @@ running once the system is up. gus= [HW,OSS] Format: <io>,<irq>,<dma>,<dma16> - + gvp11= [HW,SCSI] hashdist= [KNL,NUMA] Large hashes allocated during boot are distributed across NUMA nodes. Defaults on for IA-64, off otherwise. + Format: 0 | 1 (for off | on) hcl= [IA-64] SGI's Hardware Graph compatibility layer @@ -544,6 +569,7 @@ running once the system is up. keyboard and can not control its state (Don't attempt to blink the leds) i8042.noaux [HW] Don't check for auxiliary (== mouse) port + i8042.nokbd [HW] Don't check/create keyboard port i8042.nomux [HW] Don't check presence of an active multiplexing controller i8042.nopnp [HW] Don't use ACPIPnP / PnPBIOS to discover KBD/AUX @@ -580,13 +606,13 @@ running once the system is up. ide?= [HW] (E)IDE subsystem Format: ide?=noprobe or chipset specific parameters. See Documentation/ide.txt. - + idebus= [HW] (E)IDE subsystem - VLB/PCI bus speed See Documentation/ide.txt. idle= [HW] Format: idle=poll or idle=halt - + ihash_entries= [KNL] Set number of hash buckets for inode cache. @@ -634,7 +660,7 @@ running once the system is up. firmware running. isapnp= [ISAPNP] - Format: <RDP>, <reset>, <pci_scan>, <verbosity> + Format: <RDP>,<reset>,<pci_scan>,<verbosity> isolcpus= [KNL,SMP] Isolate CPUs from the general scheduler. Format: <cpu number>,...,<cpu number> @@ -646,32 +672,33 @@ running once the system is up. "number of CPUs in system - 1". This option is the preferred way to isolate CPUs. The - alternative - manually setting the CPU mask of all tasks - in the system can cause problems and suboptimal load - balancer performance. + alternative -- manually setting the CPU mask of all + tasks in the system -- can cause problems and + suboptimal load balancer performance. isp16= [HW,CD] Format: <io>,<irq>,<dma>,<setup> - iucv= [HW,NET] + iucv= [HW,NET] js= [HW,JOY] Analog joystick See Documentation/input/joystick.txt. keepinitrd [HW,ARM] - kstack=N [IA-32, X86-64] Print N words from the kernel stack + kstack=N [IA-32,X86-64] Print N words from the kernel stack in oops dumps. l2cr= [PPC] - lapic [IA-32,APIC] Enable the local APIC even if BIOS disabled it. + lapic [IA-32,APIC] Enable the local APIC even if BIOS + disabled it. lasi= [HW,SCSI] PARISC LASI driver for the 53c700 chip Format: addr:<io>,irq:<irq> - llsc*= [IA64] - See function print_params() in arch/ia64/sn/kernel/llsc4.c. + llsc*= [IA64] See function print_params() in + arch/ia64/sn/kernel/llsc4.c. load_ramdisk= [RAM] List of ramdisks to load from floppy See Documentation/ramdisk.txt. @@ -698,8 +725,9 @@ running once the system is up. 7 (KERN_DEBUG) debug-level messages log_buf_len=n Sets the size of the printk ring buffer, in bytes. - Format is n, nk, nM. n must be a power of two. The - default is set in kernel config. + Format: { n | nk | nM } + n must be a power of two. The default size + is set in the kernel config file. lp=0 [LP] Specify parallel ports to use, e.g, lp=port[,port...] lp=none,parport0 (lp0 not configured, lp1 uses @@ -735,23 +763,23 @@ running once the system is up. ltpc= [NET] Format: <io>,<irq>,<dma> - mac5380= [HW,SCSI] - Format: <can_queue>,<cmd_per_lun>,<sg_tablesize>,<hostid>,<use_tags> + mac5380= [HW,SCSI] Format: + <can_queue>,<cmd_per_lun>,<sg_tablesize>,<hostid>,<use_tags> - mac53c9x= [HW,SCSI] - Format: <num_esps>,<disconnect>,<nosync>,<can_queue>,<cmd_per_lun>,<sg_tablesize>,<hostid>,<use_tags> + mac53c9x= [HW,SCSI] Format: + <num_esps>,<disconnect>,<nosync>,<can_queue>,<cmd_per_lun>,<sg_tablesize>,<hostid>,<use_tags> - machvec= [IA64] - Force the use of a particular machine-vector (machvec) in a generic - kernel. Example: machvec=hpzx1_swiotlb + machvec= [IA64] Force the use of a particular machine-vector + (machvec) in a generic kernel. + Example: machvec=hpzx1_swiotlb - mad16= [HW,OSS] - Format: <io>,<irq>,<dma>,<dma16>,<mpu_io>,<mpu_irq>,<joystick> + mad16= [HW,OSS] Format: + <io>,<irq>,<dma>,<dma16>,<mpu_io>,<mpu_irq>,<joystick> maui= [HW,OSS] Format: <io>,<irq> - - max_loop= [LOOP] Maximum number of loopback devices that can + + max_loop= [LOOP] Maximum number of loopback devices that can be mounted Format: <1-256> @@ -761,11 +789,11 @@ running once the system is up. max_addr=[KMG] [KNL,BOOT,ia64] All physical memory greater than or equal to this physical address is ignored. - max_luns= [SCSI] Maximum number of LUNs to probe + max_luns= [SCSI] Maximum number of LUNs to probe. Should be between 1 and 2^32-1. max_report_luns= - [SCSI] Maximum number of LUNs received + [SCSI] Maximum number of LUNs received. Should be between 1 and 16384. mca-pentium [BUGS=IA-32] @@ -781,11 +809,11 @@ running once the system is up. md= [HW] RAID subsystems devices and level See Documentation/md.txt. - + mdacon= [MDA] Format: <first>,<last> Specifies range of consoles to be captured by the MDA. - + mem=nn[KMG] [KNL,BOOT] Force usage of a specific amount of memory Amount of memory to be used when the kernel is not able to see the whole system memory or for test. @@ -836,15 +864,15 @@ running once the system is up. MTD_Partition= [MTD] Format: <name>,<region-number>,<size>,<offset> - MTD_Region= [MTD] - Format: <name>,<region-number>[,<base>,<size>,<buswidth>,<altbuswidth>] + MTD_Region= [MTD] Format: + <name>,<region-number>[,<base>,<size>,<buswidth>,<altbuswidth>] mtdparts= [MTD] See drivers/mtd/cmdline.c. mtouchusb.raw_coordinates= - [HW] Make the MicroTouch USB driver use raw coordinates ('y', default) - or cooked coordinates ('n') + [HW] Make the MicroTouch USB driver use raw coordinates + ('y', default) or cooked coordinates ('n') n2= [NET] SDL Inc. RISCom/N2 synchronous serial card @@ -865,7 +893,9 @@ running once the system is up. Format: <irq>,<io>,<mem_start>,<mem_end>,<name> Note that mem_start is often overloaded to mean something different and driver-specific. - + This usage is only documented in each driver source + file if at all. + nfsaddrs= [NFS] See Documentation/nfsroot.txt. @@ -878,8 +908,8 @@ running once the system is up. emulation library even if a 387 maths coprocessor is present. - noalign [KNL,ARM] - + noalign [KNL,ARM] + noapic [SMP,APIC] Tells the kernel to not make use of any IOAPICs that may be present in the system. @@ -890,19 +920,19 @@ running once the system is up. on "Classic" PPC cores. nocache [ARM] - + nodisconnect [HW,SCSI,M68K] Disables SCSI disconnects. noexec [IA-64] - noexec [IA-32, X86-64] + noexec [IA-32,X86-64] noexec=on: enable non-executable mappings (default) noexec=off: disable nn-executable mappings nofxsr [BUGS=IA-32] nohlt [BUGS=ARM] - + no-hlt [BUGS=IA-32] Tells the kernel that the hlt instruction doesn't work correctly and not to use it. @@ -933,8 +963,9 @@ running once the system is up. noresidual [PPC] Don't use residual data on PReP machines. - noresume [SWSUSP] Disables resume and restore original swap space. - + noresume [SWSUSP] Disables resume and restores original swap + space. + no-scroll [VGA] Disables scrollback. This is required for the Braillex ib80-piezo Braille reader made by F.H. Papenmeier (Germany). @@ -950,16 +981,16 @@ running once the system is up. nousb [USB] Disable the USB subsystem nowb [ARM] - + opl3= [HW,OSS] Format: <io> opl3sa= [HW,OSS] Format: <io>,<irq>,<dma>,<dma2>,<mpu_io>,<mpu_irq> - opl3sa2= [HW,OSS] - Format: <io>,<irq>,<dma>,<dma2>,<mss_io>,<mpu_io>,<ymode>,<loopback>[,<isapnp>,<multiple] - + opl3sa2= [HW,OSS] Format: + <io>,<irq>,<dma>,<dma2>,<mss_io>,<mpu_io>,<ymode>,<loopback>[,<isapnp>,<multiple] + oprofile.timer= [HW] Use timer interrupt instead of performance counters @@ -978,36 +1009,33 @@ running once the system is up. Format: <parport#> parkbd.mode= [HW] Parallel port keyboard adapter mode of operation, 0 for XT, 1 for AT (default is AT). - Format: <mode> - - parport=0 [HW,PPT] Specify parallel ports. 0 disables. - parport=auto Use 'auto' to force the driver to use - parport=0xBBB[,IRQ[,DMA]] any IRQ/DMA settings detected (the - default is to ignore detected IRQ/DMA - settings because of possible - conflicts). You can specify the base - address, IRQ, and DMA settings; IRQ and - DMA should be numbers, or 'auto' (for - using detected settings on that - particular port), or 'nofifo' (to avoid - using a FIFO even if it is detected). - Parallel ports are assigned in the - order they are specified on the command - line, starting with parport0. - - parport_init_mode= - [HW,PPT] Configure VIA parallel port to - operate in specific mode. This is - necessary on Pegasos computer where - firmware has no options for setting up - parallel port mode and sets it to - spp. Currently this function knows - 686a and 8231 chips. + Format: <mode> + + parport= [HW,PPT] Specify parallel ports. 0 disables. + Format: { 0 | auto | 0xBBB[,IRQ[,DMA]] } + Use 'auto' to force the driver to use any + IRQ/DMA settings detected (the default is to + ignore detected IRQ/DMA settings because of + possible conflicts). You can specify the base + address, IRQ, and DMA settings; IRQ and DMA + should be numbers, or 'auto' (for using detected + settings on that particular port), or 'nofifo' + (to avoid using a FIFO even if it is detected). + Parallel ports are assigned in the order they + are specified on the command line, starting + with parport0. + + parport_init_mode= [HW,PPT] + Configure VIA parallel port to operate in + a specific mode. This is necessary on Pegasos + computer where firmware has no options for setting + up parallel port mode and sets it to spp. + Currently this function knows 686a and 8231 chips. Format: [spp|ps2|epp|ecp|ecpepp] - pas2= [HW,OSS] - Format: <io>,<irq>,<dma>,<dma16>,<sb_io>,<sb_irq>,<sb_dma>,<sb_dma16> - + pas2= [HW,OSS] Format: + <io>,<irq>,<dma>,<dma16>,<sb_io>,<sb_irq>,<sb_dma>,<sb_dma16> + pas16= [HW,SCSI] See header of drivers/scsi/pas16.c. @@ -1017,64 +1045,67 @@ running once the system is up. See header of drivers/block/paride/pcd.c. See also Documentation/paride.txt. - pci=option[,option...] [PCI] various PCI subsystem options: - off [IA-32] don't probe for the PCI bus - bios [IA-32] force use of PCI BIOS, don't access - the hardware directly. Use this if your machine - has a non-standard PCI host bridge. - nobios [IA-32] disallow use of PCI BIOS, only direct - hardware access methods are allowed. Use this - if you experience crashes upon bootup and you - suspect they are caused by the BIOS. - conf1 [IA-32] Force use of PCI Configuration Mechanism 1. - conf2 [IA-32] Force use of PCI Configuration Mechanism 2. - nosort [IA-32] Don't sort PCI devices according to - order given by the PCI BIOS. This sorting is done - to get a device order compatible with older kernels. - biosirq [IA-32] Use PCI BIOS calls to get the interrupt - routing table. These calls are known to be buggy - on several machines and they hang the machine when used, - but on other computers it's the only way to get the - interrupt routing table. Try this option if the kernel - is unable to allocate IRQs or discover secondary PCI - buses on your motherboard. - rom [IA-32] Assign address space to expansion ROMs. - Use with caution as certain devices share address - decoders between ROMs and other resources. - irqmask=0xMMMM [IA-32] Set a bit mask of IRQs allowed to be assigned - automatically to PCI devices. You can make the kernel - exclude IRQs of your ISA cards this way. + pci=option[,option...] [PCI] various PCI subsystem options: + off [IA-32] don't probe for the PCI bus + bios [IA-32] force use of PCI BIOS, don't access + the hardware directly. Use this if your machine + has a non-standard PCI host bridge. + nobios [IA-32] disallow use of PCI BIOS, only direct + hardware access methods are allowed. Use this + if you experience crashes upon bootup and you + suspect they are caused by the BIOS. + conf1 [IA-32] Force use of PCI Configuration + Mechanism 1. + conf2 [IA-32] Force use of PCI Configuration + Mechanism 2. + nosort [IA-32] Don't sort PCI devices according to + order given by the PCI BIOS. This sorting is + done to get a device order compatible with + older kernels. + biosirq [IA-32] Use PCI BIOS calls to get the interrupt + routing table. These calls are known to be buggy + on several machines and they hang the machine + when used, but on other computers it's the only + way to get the interrupt routing table. Try + this option if the kernel is unable to allocate + IRQs or discover secondary PCI buses on your + motherboard. + rom [IA-32] Assign address space to expansion ROMs. + Use with caution as certain devices share + address decoders between ROMs and other + resources. + irqmask=0xMMMM [IA-32] Set a bit mask of IRQs allowed to be + assigned automatically to PCI devices. You can + make the kernel exclude IRQs of your ISA cards + this way. pirqaddr=0xAAAAA [IA-32] Specify the physical address - of the PIRQ table (normally generated - by the BIOS) if it is outside the - F0000h-100000h range. - lastbus=N [IA-32] Scan all buses till bus #N. Can be useful - if the kernel is unable to find your secondary buses - and you want to tell it explicitly which ones they are. - assign-busses [IA-32] Always assign all PCI bus - numbers ourselves, overriding - whatever the firmware may have - done. - usepirqmask [IA-32] Honor the possible IRQ mask - stored in the BIOS $PIR table. This is - needed on some systems with broken - BIOSes, notably some HP Pavilion N5400 - and Omnibook XE3 notebooks. This will - have no effect if ACPI IRQ routing is - enabled. - noacpi [IA-32] Do not use ACPI for IRQ routing - or for PCI scanning. - routeirq Do IRQ routing for all PCI devices. - This is normally done in pci_enable_device(), - so this option is a temporary workaround - for broken drivers that don't call it. - - firmware [ARM] Do not re-enumerate the bus but - instead just use the configuration - from the bootloader. This is currently - used on IXP2000 systems where the - bus has to be configured a certain way - for adjunct CPUs. + of the PIRQ table (normally generated + by the BIOS) if it is outside the + F0000h-100000h range. + lastbus=N [IA-32] Scan all buses thru bus #N. Can be + useful if the kernel is unable to find your + secondary buses and you want to tell it + explicitly which ones they are. + assign-busses [IA-32] Always assign all PCI bus + numbers ourselves, overriding + whatever the firmware may have done. + usepirqmask [IA-32] Honor the possible IRQ mask stored + in the BIOS $PIR table. This is needed on + some systems with broken BIOSes, notably + some HP Pavilion N5400 and Omnibook XE3 + notebooks. This will have no effect if ACPI + IRQ routing is enabled. + noacpi [IA-32] Do not use ACPI for IRQ routing + or for PCI scanning. + routeirq Do IRQ routing for all PCI devices. + This is normally done in pci_enable_device(), + so this option is a temporary workaround + for broken drivers that don't call it. + firmware [ARM] Do not re-enumerate the bus but instead + just use the configuration from the + bootloader. This is currently used on + IXP2000 systems where the bus has to be + configured a certain way for adjunct CPUs. pcmv= [HW,PCMCIA] BadgePAD 4 @@ -1112,19 +1143,20 @@ running once the system is up. [ISAPNP] Exclude DMAs for the autoconfiguration pnp_reserve_io= [ISAPNP] Exclude I/O ports for the autoconfiguration - Ranges are in pairs (I/O port base and size). + Ranges are in pairs (I/O port base and size). pnp_reserve_mem= - [ISAPNP] Exclude memory regions for the autoconfiguration + [ISAPNP] Exclude memory regions for the + autoconfiguration. Ranges are in pairs (memory base and size). profile= [KNL] Enable kernel profiling via /proc/profile - { schedule | <number> } - (param: schedule - profile schedule points} - (param: profile step/bucket size as a power of 2 for - statistical time based profiling) + Format: [schedule,]<number> + Param: "schedule" - profile schedule points. + Param: <number> - step/bucket size as a power of 2 for + statistical time based profiling. - processor.max_cstate= [HW, ACPI] + processor.max_cstate= [HW,ACPI] Limit processor to maximum C-state max_cstate=9 overrides any DMI blacklist limit. @@ -1132,27 +1164,28 @@ running once the system is up. before loading. See Documentation/ramdisk.txt. - psmouse.proto= [HW,MOUSE] Highest PS2 mouse protocol extension to - probe for (bare|imps|exps|lifebook|any). + psmouse.proto= [HW,MOUSE] Highest PS2 mouse protocol extension to + probe for; one of (bare|imps|exps|lifebook|any). psmouse.rate= [HW,MOUSE] Set desired mouse report rate, in reports per second. - psmouse.resetafter= - [HW,MOUSE] Try to reset the device after so many bad packets + psmouse.resetafter= [HW,MOUSE] + Try to reset the device after so many bad packets (0 = never). psmouse.resolution= [HW,MOUSE] Set desired mouse resolution, in dpi. psmouse.smartscroll= - [HW,MOUSE] Controls Logitech smartscroll autorepeat, + [HW,MOUSE] Controls Logitech smartscroll autorepeat. 0 = disabled, 1 = enabled (default). pss= [HW,OSS] Personal Sound System (ECHO ESC614) - Format: <io>,<mss_io>,<mss_irq>,<mss_dma>,<mpu_io>,<mpu_irq> + Format: + <io>,<mss_io>,<mss_irq>,<mss_dma>,<mpu_io>,<mpu_irq> pt. [PARIDE] See Documentation/paride.txt. quiet= [KNL] Disable log messages - + r128= [HW,DRM] raid= [HW,RAID] @@ -1161,21 +1194,26 @@ running once the system is up. ramdisk= [RAM] Sizes of RAM disks in kilobytes [deprecated] See Documentation/ramdisk.txt. - ramdisk_blocksize= - [RAM] + ramdisk_blocksize= [RAM] See Documentation/ramdisk.txt. - + ramdisk_size= [RAM] Sizes of RAM disks in kilobytes New name for the ramdisk parameter. See Documentation/ramdisk.txt. + rdinit= [KNL] + Format: <full_path> + Run specified binary instead of /init from the ramdisk, + used for early userspace startup. See initrd. + reboot= [BUGS=IA-32,BUGS=ARM,BUGS=IA-64] Rebooting mode Format: <reboot_mode>[,<reboot_mode2>[,...]] See arch/*/kernel/reboot.c. reserve= [KNL,BUGS] Force the kernel to ignore some iomem area - resume= [SWSUSP] Specify the partition device for software suspension + resume= [SWSUSP] + Specify the partition device for software suspend rhash_entries= [KNL,NET] Set number of hash buckets for route cache @@ -1205,7 +1243,7 @@ running once the system is up. Format: <io>,<irq>,<dma>,<dma2> sbni= [NET] Granch SBNI12 leased line adapter - + sbpcd= [HW,CD] Soundblaster CD adapter Format: <io>,<type> See a comment before function sbpcd_setup() in @@ -1238,21 +1276,20 @@ running once the system is up. serialnumber [BUGS=IA-32] - sg_def_reserved_size= - [SCSI] - + sg_def_reserved_size= [SCSI] + sgalaxy= [HW,OSS] Format: <io>,<irq>,<dma>,<dma2>,<sgbase> shapers= [NET] Maximal number of shapers. - + sim710= [SCSI,HW] See header of drivers/scsi/sim710.c. simeth= [IA-64] simscsi= - + sjcd= [HW,CD] Format: <io>,<irq>,<dma> See header of drivers/cdrom/sjcd.c. @@ -1383,10 +1420,10 @@ running once the system is up. snd-wavefront= [HW,ALSA] snd-ymfpci= [HW,ALSA] - + sonicvibes= [HW,OSS] Format: <reverb> - + sonycd535= [HW,CD] Format: <io>[,<irq>] @@ -1403,7 +1440,7 @@ running once the system is up. sscape= [HW,OSS] Format: <io>,<irq>,<dma>,<mpu_io>,<mpu_irq> - + st= [HW,SCSI] SCSI tape parameters (buffers, etc.) See Documentation/scsi/st.txt. @@ -1423,10 +1460,8 @@ running once the system is up. stifb= [HW] Format: bpp:<bpp1>[:<bpp2>[:<bpp3>...]] - stram_swap= [HW,M68k] - swiotlb= [IA-64] Number of I/O TLB slabs - + switches= [HW,M68k] sym53c416= [HW,SCSI] @@ -1459,14 +1494,16 @@ running once the system is up. tp720= [HW,PS2] trix= [HW,OSS] MediaTrix AudioTrix Pro - Format: <io>,<irq>,<dma>,<dma2>,<sb_io>,<sb_irq>,<sb_dma>,<mpu_io>,<mpu_irq> - + Format: + <io>,<irq>,<dma>,<dma2>,<sb_io>,<sb_irq>,<sb_dma>,<mpu_io>,<mpu_irq> + tsdev.xres= [TS] Horizontal screen resolution. tsdev.yres= [TS] Vertical screen resolution. - turbografx.map[2|3]= - [HW,JOY] TurboGraFX parallel port interface - Format: <port#>,<js1>,<js2>,<js3>,<js4>,<js5>,<js6>,<js7> + turbografx.map[2|3]= [HW,JOY] + TurboGraFX parallel port interface + Format: + <port#>,<js1>,<js2>,<js3>,<js4>,<js5>,<js6>,<js7> See also Documentation/input/joystick-parport.txt u14-34f= [HW,SCSI] UltraStor 14F/34F SCSI host adapter @@ -1478,21 +1515,20 @@ running once the system is up. uart6850= [HW,OSS] Format: <io>,<irq> - usb-handoff [HW] Enable early USB BIOS -> OS handoff - usbhid.mousepoll= [USBHID] The interval which mice are to be polled at. - + video= [FB] Frame buffer configuration See Documentation/fb/modedb.txt. vga= [BOOT,IA-32] Select a particular video mode - See Documentation/i386/boot.txt and Documentation/svga.txt. + See Documentation/i386/boot.txt and + Documentation/svga.txt. Use vga=ask for menu. This is actually a boot loader parameter; the value is passed to the kernel using a special protocol. - vmalloc=nn[KMG] [KNL,BOOT] forces the vmalloc area to have an exact + vmalloc=nn[KMG] [KNL,BOOT] Forces the vmalloc area to have an exact size of <nn>. This can be used to increase the minimum size (128MB on x86). It can also be used to decrease the size and leave more room for directly @@ -1500,11 +1536,11 @@ running once the system is up. vmhalt= [KNL,S390] - vmpoff= [KNL,S390] - + vmpoff= [KNL,S390] + waveartist= [HW,OSS] Format: <io>,<irq>,<dma>,<dma2> - + wd33c93= [HW,SCSI] See header of drivers/scsi/wd33c93.c. @@ -1518,21 +1554,25 @@ running once the system is up. xd_geo= See header of drivers/block/xd.c. xirc2ps_cs= [NET,PCMCIA] - Format: <irq>,<irq_mask>,<io>,<full_duplex>,<do_sound>,<lockup_hack>[,<irq2>[,<irq3>[,<irq4>]]] - + Format: + <irq>,<irq_mask>,<io>,<full_duplex>,<do_sound>,<lockup_hack>[,<irq2>[,<irq3>[,<irq4>]]] +______________________________________________________________________ Changelog: +2000-06-?? Mr. Unknown The last known update (for 2.4.0) - the changelog was not kept before. - 2000-06-?? Mr. Unknown +2002-11-24 Petr Baudis <pasky@ucw.cz> + Randy Dunlap <randy.dunlap@verizon.net> Update for 2.5.49, description for most of the options introduced, references to other documentation (C files, READMEs, ..), added S390, PPC, SPARC, MTD, ALSA and OSS category. Minor corrections and reformatting. - 2002-11-24 Petr Baudis <pasky@ucw.cz> - Randy Dunlap <randy.dunlap@verizon.net> + +2005-10-19 Randy Dunlap <rdunlap@xenotime.net> + Lots of typos, whitespace, some reformatting. TODO: diff --git a/Documentation/keys-request-key.txt b/Documentation/keys-request-key.txt new file mode 100644 index 000000000000..5f2b9c5edbb5 --- /dev/null +++ b/Documentation/keys-request-key.txt @@ -0,0 +1,161 @@ + =================== + KEY REQUEST SERVICE + =================== + +The key request service is part of the key retention service (refer to +Documentation/keys.txt). This document explains more fully how that the +requesting algorithm works. + +The process starts by either the kernel requesting a service by calling +request_key(): + + struct key *request_key(const struct key_type *type, + const char *description, + const char *callout_string); + +Or by userspace invoking the request_key system call: + + key_serial_t request_key(const char *type, + const char *description, + const char *callout_info, + key_serial_t dest_keyring); + +The main difference between the two access points is that the in-kernel +interface does not need to link the key to a keyring to prevent it from being +immediately destroyed. The kernel interface returns a pointer directly to the +key, and it's up to the caller to destroy the key. + +The userspace interface links the key to a keyring associated with the process +to prevent the key from going away, and returns the serial number of the key to +the caller. + + +=========== +THE PROCESS +=========== + +A request proceeds in the following manner: + + (1) Process A calls request_key() [the userspace syscall calls the kernel + interface]. + + (2) request_key() searches the process's subscribed keyrings to see if there's + a suitable key there. If there is, it returns the key. If there isn't, and + callout_info is not set, an error is returned. Otherwise the process + proceeds to the next step. + + (3) request_key() sees that A doesn't have the desired key yet, so it creates + two things: + + (a) An uninstantiated key U of requested type and description. + + (b) An authorisation key V that refers to key U and notes that process A + is the context in which key U should be instantiated and secured, and + from which associated key requests may be satisfied. + + (4) request_key() then forks and executes /sbin/request-key with a new session + keyring that contains a link to auth key V. + + (5) /sbin/request-key execs an appropriate program to perform the actual + instantiation. + + (6) The program may want to access another key from A's context (say a + Kerberos TGT key). It just requests the appropriate key, and the keyring + search notes that the session keyring has auth key V in its bottom level. + + This will permit it to then search the keyrings of process A with the + UID, GID, groups and security info of process A as if it was process A, + and come up with key W. + + (7) The program then does what it must to get the data with which to + instantiate key U, using key W as a reference (perhaps it contacts a + Kerberos server using the TGT) and then instantiates key U. + + (8) Upon instantiating key U, auth key V is automatically revoked so that it + may not be used again. + + (9) The program then exits 0 and request_key() deletes key V and returns key + U to the caller. + +This also extends further. If key W (step 5 above) didn't exist, key W would be +created uninstantiated, another auth key (X) would be created [as per step 3] +and another copy of /sbin/request-key spawned [as per step 4]; but the context +specified by auth key X will still be process A, as it was in auth key V. + +This is because process A's keyrings can't simply be attached to +/sbin/request-key at the appropriate places because (a) execve will discard two +of them, and (b) it requires the same UID/GID/Groups all the way through. + + +====================== +NEGATIVE INSTANTIATION +====================== + +Rather than instantiating a key, it is possible for the possessor of an +authorisation key to negatively instantiate a key that's under construction. +This is a short duration placeholder that causes any attempt at re-requesting +the key whilst it exists to fail with error ENOKEY. + +This is provided to prevent excessive repeated spawning of /sbin/request-key +processes for a key that will never be obtainable. + +Should the /sbin/request-key process exit anything other than 0 or die on a +signal, the key under construction will be automatically negatively +instantiated for a short amount of time. + + +==================== +THE SEARCH ALGORITHM +==================== + +A search of any particular keyring proceeds in the following fashion: + + (1) When the key management code searches for a key (keyring_search_aux) it + firstly calls key_permission(SEARCH) on the keyring it's starting with, + if this denies permission, it doesn't search further. + + (2) It considers all the non-keyring keys within that keyring and, if any key + matches the criteria specified, calls key_permission(SEARCH) on it to see + if the key is allowed to be found. If it is, that key is returned; if + not, the search continues, and the error code is retained if of higher + priority than the one currently set. + + (3) It then considers all the keyring-type keys in the keyring it's currently + searching. It calls key_permission(SEARCH) on each keyring, and if this + grants permission, it recurses, executing steps (2) and (3) on that + keyring. + +The process stops immediately a valid key is found with permission granted to +use it. Any error from a previous match attempt is discarded and the key is +returned. + +When search_process_keyrings() is invoked, it performs the following searches +until one succeeds: + + (1) If extant, the process's thread keyring is searched. + + (2) If extant, the process's process keyring is searched. + + (3) The process's session keyring is searched. + + (4) If the process has a request_key() authorisation key in its session + keyring then: + + (a) If extant, the calling process's thread keyring is searched. + + (b) If extant, the calling process's process keyring is searched. + + (c) The calling process's session keyring is searched. + +The moment one succeeds, all pending errors are discarded and the found key is +returned. + +Only if all these fail does the whole thing fail with the highest priority +error. Note that several errors may have come from LSM. + +The error priority is: + + EKEYREVOKED > EKEYEXPIRED > ENOKEY + +EACCES/EPERM are only returned on a direct search of a specific keyring where +the basal keyring does not grant Search permission. diff --git a/Documentation/keys.txt b/Documentation/keys.txt index 0321ded4b9ae..31154882000a 100644 --- a/Documentation/keys.txt +++ b/Documentation/keys.txt @@ -195,8 +195,8 @@ KEY ACCESS PERMISSIONS ====================== Keys have an owner user ID, a group access ID, and a permissions mask. The mask -has up to eight bits each for user, group and other access. Only five of each -set of eight bits are defined. These permissions granted are: +has up to eight bits each for possessor, user, group and other access. Only +six of each set of eight bits are defined. These permissions granted are: (*) View @@ -224,6 +224,10 @@ set of eight bits are defined. These permissions granted are: keyring to a key, a process must have Write permission on the keyring and Link permission on the key. + (*) Set Attribute + + This permits a key's UID, GID and permissions mask to be changed. + For changing the ownership, group ID or permissions mask, being the owner of the key or having the sysadmin capability is sufficient. @@ -241,16 +245,16 @@ about the status of the key service: type, description and permissions. The payload of the key is not available this way: - SERIAL FLAGS USAGE EXPY PERM UID GID TYPE DESCRIPTION: SUMMARY - 00000001 I----- 39 perm 1f0000 0 0 keyring _uid_ses.0: 1/4 - 00000002 I----- 2 perm 1f0000 0 0 keyring _uid.0: empty - 00000007 I----- 1 perm 1f0000 0 0 keyring _pid.1: empty - 0000018d I----- 1 perm 1f0000 0 0 keyring _pid.412: empty - 000004d2 I--Q-- 1 perm 1f0000 32 -1 keyring _uid.32: 1/4 - 000004d3 I--Q-- 3 perm 1f0000 32 -1 keyring _uid_ses.32: empty - 00000892 I--QU- 1 perm 1f0000 0 0 user metal:copper: 0 - 00000893 I--Q-N 1 35s 1f0000 0 0 user metal:silver: 0 - 00000894 I--Q-- 1 10h 1f0000 0 0 user metal:gold: 0 + SERIAL FLAGS USAGE EXPY PERM UID GID TYPE DESCRIPTION: SUMMARY + 00000001 I----- 39 perm 1f3f0000 0 0 keyring _uid_ses.0: 1/4 + 00000002 I----- 2 perm 1f3f0000 0 0 keyring _uid.0: empty + 00000007 I----- 1 perm 1f3f0000 0 0 keyring _pid.1: empty + 0000018d I----- 1 perm 1f3f0000 0 0 keyring _pid.412: empty + 000004d2 I--Q-- 1 perm 1f3f0000 32 -1 keyring _uid.32: 1/4 + 000004d3 I--Q-- 3 perm 1f3f0000 32 -1 keyring _uid_ses.32: empty + 00000892 I--QU- 1 perm 1f000000 0 0 user metal:copper: 0 + 00000893 I--Q-N 1 35s 1f3f0000 0 0 user metal:silver: 0 + 00000894 I--Q-- 1 10h 003f0000 0 0 user metal:gold: 0 The flags are: @@ -361,6 +365,8 @@ The main syscalls are: /sbin/request-key will be invoked in an attempt to obtain a key. The callout_info string will be passed as an argument to the program. + See also Documentation/keys-request-key.txt. + The keyctl syscall functions are: @@ -533,8 +539,8 @@ The keyctl syscall functions are: (*) Read the payload data from a key: - key_serial_t keyctl(KEYCTL_READ, key_serial_t keyring, char *buffer, - size_t buflen); + long keyctl(KEYCTL_READ, key_serial_t keyring, char *buffer, + size_t buflen); This function attempts to read the payload data from the specified key into the buffer. The process must have read permission on the key to @@ -555,9 +561,9 @@ The keyctl syscall functions are: (*) Instantiate a partially constructed key. - key_serial_t keyctl(KEYCTL_INSTANTIATE, key_serial_t key, - const void *payload, size_t plen, - key_serial_t keyring); + long keyctl(KEYCTL_INSTANTIATE, key_serial_t key, + const void *payload, size_t plen, + key_serial_t keyring); If the kernel calls back to userspace to complete the instantiation of a key, userspace should use this call to supply data for the key before the @@ -576,8 +582,8 @@ The keyctl syscall functions are: (*) Negatively instantiate a partially constructed key. - key_serial_t keyctl(KEYCTL_NEGATE, key_serial_t key, - unsigned timeout, key_serial_t keyring); + long keyctl(KEYCTL_NEGATE, key_serial_t key, + unsigned timeout, key_serial_t keyring); If the kernel calls back to userspace to complete the instantiation of a key, userspace should use this call mark the key as negative before the @@ -637,6 +643,34 @@ call, and the key released upon close. How to deal with conflicting keys due to two different users opening the same file is left to the filesystem author to solve. +Note that there are two different types of pointers to keys that may be +encountered: + + (*) struct key * + + This simply points to the key structure itself. Key structures will be at + least four-byte aligned. + + (*) key_ref_t + + This is equivalent to a struct key *, but the least significant bit is set + if the caller "possesses" the key. By "possession" it is meant that the + calling processes has a searchable link to the key from one of its + keyrings. There are three functions for dealing with these: + + key_ref_t make_key_ref(const struct key *key, + unsigned long possession); + + struct key *key_ref_to_ptr(const key_ref_t key_ref); + + unsigned long is_key_possessed(const key_ref_t key_ref); + + The first function constructs a key reference from a key pointer and + possession information (which must be 0 or 1 and not any other value). + + The second function retrieves the key pointer from a reference and the + third retrieves the possession flag. + When accessing a key's payload contents, certain precautions must be taken to prevent access vs modification races. See the section "Notes on accessing payload contents" for more information. @@ -660,12 +694,18 @@ payload contents" for more information. If successful, the key will have been attached to the default keyring for implicitly obtained request-key keys, as set by KEYCTL_SET_REQKEY_KEYRING. + See also Documentation/keys-request-key.txt. + (*) When it is no longer required, the key should be released using: void key_put(struct key *key); - This can be called from interrupt context. If CONFIG_KEYS is not set then + Or: + + void key_ref_put(key_ref_t key_ref); + + These can be called from interrupt context. If CONFIG_KEYS is not set then the argument will not be parsed. @@ -689,13 +729,17 @@ payload contents" for more information. (*) If a keyring was found in the search, this can be further searched by: - struct key *keyring_search(struct key *keyring, - const struct key_type *type, - const char *description) + key_ref_t keyring_search(key_ref_t keyring_ref, + const struct key_type *type, + const char *description) This searches the keyring tree specified for a matching key. Error ENOKEY - is returned upon failure. If successful, the returned key will need to be - released. + is returned upon failure (use IS_ERR/PTR_ERR to determine). If successful, + the returned key will need to be released. + + The possession attribute from the keyring reference is used to control + access through the permissions mask and is propagated to the returned key + reference pointer if successful. (*) To check the validity of a key, this function can be called: @@ -732,7 +776,7 @@ More complex payload contents must be allocated and a pointer to them set in key->payload.data. One of the following ways must be selected to access the data: - (1) Unmodifyable key type. + (1) Unmodifiable key type. If the key type does not have a modify method, then the key's payload can be accessed without any form of locking, provided that it's known to be diff --git a/Documentation/kprobes.txt b/Documentation/kprobes.txt new file mode 100644 index 000000000000..0541fe1de704 --- /dev/null +++ b/Documentation/kprobes.txt @@ -0,0 +1,588 @@ +Title : Kernel Probes (Kprobes) +Authors : Jim Keniston <jkenisto@us.ibm.com> + : Prasanna S Panchamukhi <prasanna@in.ibm.com> + +CONTENTS + +1. Concepts: Kprobes, Jprobes, Return Probes +2. Architectures Supported +3. Configuring Kprobes +4. API Reference +5. Kprobes Features and Limitations +6. Probe Overhead +7. TODO +8. Kprobes Example +9. Jprobes Example +10. Kretprobes Example + +1. Concepts: Kprobes, Jprobes, Return Probes + +Kprobes enables you to dynamically break into any kernel routine and +collect debugging and performance information non-disruptively. You +can trap at almost any kernel code address, specifying a handler +routine to be invoked when the breakpoint is hit. + +There are currently three types of probes: kprobes, jprobes, and +kretprobes (also called return probes). A kprobe can be inserted +on virtually any instruction in the kernel. A jprobe is inserted at +the entry to a kernel function, and provides convenient access to the +function's arguments. A return probe fires when a specified function +returns. + +In the typical case, Kprobes-based instrumentation is packaged as +a kernel module. The module's init function installs ("registers") +one or more probes, and the exit function unregisters them. A +registration function such as register_kprobe() specifies where +the probe is to be inserted and what handler is to be called when +the probe is hit. + +The next three subsections explain how the different types of +probes work. They explain certain things that you'll need to +know in order to make the best use of Kprobes -- e.g., the +difference between a pre_handler and a post_handler, and how +to use the maxactive and nmissed fields of a kretprobe. But +if you're in a hurry to start using Kprobes, you can skip ahead +to section 2. + +1.1 How Does a Kprobe Work? + +When a kprobe is registered, Kprobes makes a copy of the probed +instruction and replaces the first byte(s) of the probed instruction +with a breakpoint instruction (e.g., int3 on i386 and x86_64). + +When a CPU hits the breakpoint instruction, a trap occurs, the CPU's +registers are saved, and control passes to Kprobes via the +notifier_call_chain mechanism. Kprobes executes the "pre_handler" +associated with the kprobe, passing the handler the addresses of the +kprobe struct and the saved registers. + +Next, Kprobes single-steps its copy of the probed instruction. +(It would be simpler to single-step the actual instruction in place, +but then Kprobes would have to temporarily remove the breakpoint +instruction. This would open a small time window when another CPU +could sail right past the probepoint.) + +After the instruction is single-stepped, Kprobes executes the +"post_handler," if any, that is associated with the kprobe. +Execution then continues with the instruction following the probepoint. + +1.2 How Does a Jprobe Work? + +A jprobe is implemented using a kprobe that is placed on a function's +entry point. It employs a simple mirroring principle to allow +seamless access to the probed function's arguments. The jprobe +handler routine should have the same signature (arg list and return +type) as the function being probed, and must always end by calling +the Kprobes function jprobe_return(). + +Here's how it works. When the probe is hit, Kprobes makes a copy of +the saved registers and a generous portion of the stack (see below). +Kprobes then points the saved instruction pointer at the jprobe's +handler routine, and returns from the trap. As a result, control +passes to the handler, which is presented with the same register and +stack contents as the probed function. When it is done, the handler +calls jprobe_return(), which traps again to restore the original stack +contents and processor state and switch to the probed function. + +By convention, the callee owns its arguments, so gcc may produce code +that unexpectedly modifies that portion of the stack. This is why +Kprobes saves a copy of the stack and restores it after the jprobe +handler has run. Up to MAX_STACK_SIZE bytes are copied -- e.g., +64 bytes on i386. + +Note that the probed function's args may be passed on the stack +or in registers (e.g., for x86_64 or for an i386 fastcall function). +The jprobe will work in either case, so long as the handler's +prototype matches that of the probed function. + +1.3 How Does a Return Probe Work? + +When you call register_kretprobe(), Kprobes establishes a kprobe at +the entry to the function. When the probed function is called and this +probe is hit, Kprobes saves a copy of the return address, and replaces +the return address with the address of a "trampoline." The trampoline +is an arbitrary piece of code -- typically just a nop instruction. +At boot time, Kprobes registers a kprobe at the trampoline. + +When the probed function executes its return instruction, control +passes to the trampoline and that probe is hit. Kprobes' trampoline +handler calls the user-specified handler associated with the kretprobe, +then sets the saved instruction pointer to the saved return address, +and that's where execution resumes upon return from the trap. + +While the probed function is executing, its return address is +stored in an object of type kretprobe_instance. Before calling +register_kretprobe(), the user sets the maxactive field of the +kretprobe struct to specify how many instances of the specified +function can be probed simultaneously. register_kretprobe() +pre-allocates the indicated number of kretprobe_instance objects. + +For example, if the function is non-recursive and is called with a +spinlock held, maxactive = 1 should be enough. If the function is +non-recursive and can never relinquish the CPU (e.g., via a semaphore +or preemption), NR_CPUS should be enough. If maxactive <= 0, it is +set to a default value. If CONFIG_PREEMPT is enabled, the default +is max(10, 2*NR_CPUS). Otherwise, the default is NR_CPUS. + +It's not a disaster if you set maxactive too low; you'll just miss +some probes. In the kretprobe struct, the nmissed field is set to +zero when the return probe is registered, and is incremented every +time the probed function is entered but there is no kretprobe_instance +object available for establishing the return probe. + +2. Architectures Supported + +Kprobes, jprobes, and return probes are implemented on the following +architectures: + +- i386 +- x86_64 (AMD-64, E64MT) +- ppc64 +- ia64 (Support for probes on certain instruction types is still in progress.) +- sparc64 (Return probes not yet implemented.) + +3. Configuring Kprobes + +When configuring the kernel using make menuconfig/xconfig/oldconfig, +ensure that CONFIG_KPROBES is set to "y". Under "Kernel hacking", +look for "Kprobes". You may have to enable "Kernel debugging" +(CONFIG_DEBUG_KERNEL) before you can enable Kprobes. + +You may also want to ensure that CONFIG_KALLSYMS and perhaps even +CONFIG_KALLSYMS_ALL are set to "y", since kallsyms_lookup_name() +is a handy, version-independent way to find a function's address. + +If you need to insert a probe in the middle of a function, you may find +it useful to "Compile the kernel with debug info" (CONFIG_DEBUG_INFO), +so you can use "objdump -d -l vmlinux" to see the source-to-object +code mapping. + +4. API Reference + +The Kprobes API includes a "register" function and an "unregister" +function for each type of probe. Here are terse, mini-man-page +specifications for these functions and the associated probe handlers +that you'll write. See the latter half of this document for examples. + +4.1 register_kprobe + +#include <linux/kprobes.h> +int register_kprobe(struct kprobe *kp); + +Sets a breakpoint at the address kp->addr. When the breakpoint is +hit, Kprobes calls kp->pre_handler. After the probed instruction +is single-stepped, Kprobe calls kp->post_handler. If a fault +occurs during execution of kp->pre_handler or kp->post_handler, +or during single-stepping of the probed instruction, Kprobes calls +kp->fault_handler. Any or all handlers can be NULL. + +register_kprobe() returns 0 on success, or a negative errno otherwise. + +User's pre-handler (kp->pre_handler): +#include <linux/kprobes.h> +#include <linux/ptrace.h> +int pre_handler(struct kprobe *p, struct pt_regs *regs); + +Called with p pointing to the kprobe associated with the breakpoint, +and regs pointing to the struct containing the registers saved when +the breakpoint was hit. Return 0 here unless you're a Kprobes geek. + +User's post-handler (kp->post_handler): +#include <linux/kprobes.h> +#include <linux/ptrace.h> +void post_handler(struct kprobe *p, struct pt_regs *regs, + unsigned long flags); + +p and regs are as described for the pre_handler. flags always seems +to be zero. + +User's fault-handler (kp->fault_handler): +#include <linux/kprobes.h> +#include <linux/ptrace.h> +int fault_handler(struct kprobe *p, struct pt_regs *regs, int trapnr); + +p and regs are as described for the pre_handler. trapnr is the +architecture-specific trap number associated with the fault (e.g., +on i386, 13 for a general protection fault or 14 for a page fault). +Returns 1 if it successfully handled the exception. + +4.2 register_jprobe + +#include <linux/kprobes.h> +int register_jprobe(struct jprobe *jp) + +Sets a breakpoint at the address jp->kp.addr, which must be the address +of the first instruction of a function. When the breakpoint is hit, +Kprobes runs the handler whose address is jp->entry. + +The handler should have the same arg list and return type as the probed +function; and just before it returns, it must call jprobe_return(). +(The handler never actually returns, since jprobe_return() returns +control to Kprobes.) If the probed function is declared asmlinkage, +fastcall, or anything else that affects how args are passed, the +handler's declaration must match. + +register_jprobe() returns 0 on success, or a negative errno otherwise. + +4.3 register_kretprobe + +#include <linux/kprobes.h> +int register_kretprobe(struct kretprobe *rp); + +Establishes a return probe for the function whose address is +rp->kp.addr. When that function returns, Kprobes calls rp->handler. +You must set rp->maxactive appropriately before you call +register_kretprobe(); see "How Does a Return Probe Work?" for details. + +register_kretprobe() returns 0 on success, or a negative errno +otherwise. + +User's return-probe handler (rp->handler): +#include <linux/kprobes.h> +#include <linux/ptrace.h> +int kretprobe_handler(struct kretprobe_instance *ri, struct pt_regs *regs); + +regs is as described for kprobe.pre_handler. ri points to the +kretprobe_instance object, of which the following fields may be +of interest: +- ret_addr: the return address +- rp: points to the corresponding kretprobe object +- task: points to the corresponding task struct +The handler's return value is currently ignored. + +4.4 unregister_*probe + +#include <linux/kprobes.h> +void unregister_kprobe(struct kprobe *kp); +void unregister_jprobe(struct jprobe *jp); +void unregister_kretprobe(struct kretprobe *rp); + +Removes the specified probe. The unregister function can be called +at any time after the probe has been registered. + +5. Kprobes Features and Limitations + +As of Linux v2.6.12, Kprobes allows multiple probes at the same +address. Currently, however, there cannot be multiple jprobes on +the same function at the same time. + +In general, you can install a probe anywhere in the kernel. +In particular, you can probe interrupt handlers. Known exceptions +are discussed in this section. + +For obvious reasons, it's a bad idea to install a probe in +the code that implements Kprobes (mostly kernel/kprobes.c and +arch/*/kernel/kprobes.c). A patch in the v2.6.13 timeframe instructs +Kprobes to reject such requests. + +If you install a probe in an inline-able function, Kprobes makes +no attempt to chase down all inline instances of the function and +install probes there. gcc may inline a function without being asked, +so keep this in mind if you're not seeing the probe hits you expect. + +A probe handler can modify the environment of the probed function +-- e.g., by modifying kernel data structures, or by modifying the +contents of the pt_regs struct (which are restored to the registers +upon return from the breakpoint). So Kprobes can be used, for example, +to install a bug fix or to inject faults for testing. Kprobes, of +course, has no way to distinguish the deliberately injected faults +from the accidental ones. Don't drink and probe. + +Kprobes makes no attempt to prevent probe handlers from stepping on +each other -- e.g., probing printk() and then calling printk() from a +probe handler. As of Linux v2.6.12, if a probe handler hits a probe, +that second probe's handlers won't be run in that instance. + +In Linux v2.6.12 and previous versions, Kprobes' data structures are +protected by a single lock that is held during probe registration and +unregistration and while handlers are run. Thus, no two handlers +can run simultaneously. To improve scalability on SMP systems, +this restriction will probably be removed soon, in which case +multiple handlers (or multiple instances of the same handler) may +run concurrently on different CPUs. Code your handlers accordingly. + +Kprobes does not use semaphores or allocate memory except during +registration and unregistration. + +Probe handlers are run with preemption disabled. Depending on the +architecture, handlers may also run with interrupts disabled. In any +case, your handler should not yield the CPU (e.g., by attempting to +acquire a semaphore). + +Since a return probe is implemented by replacing the return +address with the trampoline's address, stack backtraces and calls +to __builtin_return_address() will typically yield the trampoline's +address instead of the real return address for kretprobed functions. +(As far as we can tell, __builtin_return_address() is used only +for instrumentation and error reporting.) + +If the number of times a function is called does not match the +number of times it returns, registering a return probe on that +function may produce undesirable results. We have the do_exit() +and do_execve() cases covered. do_fork() is not an issue. We're +unaware of other specific cases where this could be a problem. + +6. Probe Overhead + +On a typical CPU in use in 2005, a kprobe hit takes 0.5 to 1.0 +microseconds to process. Specifically, a benchmark that hits the same +probepoint repeatedly, firing a simple handler each time, reports 1-2 +million hits per second, depending on the architecture. A jprobe or +return-probe hit typically takes 50-75% longer than a kprobe hit. +When you have a return probe set on a function, adding a kprobe at +the entry to that function adds essentially no overhead. + +Here are sample overhead figures (in usec) for different architectures. +k = kprobe; j = jprobe; r = return probe; kr = kprobe + return probe +on same function; jr = jprobe + return probe on same function + +i386: Intel Pentium M, 1495 MHz, 2957.31 bogomips +k = 0.57 usec; j = 1.00; r = 0.92; kr = 0.99; jr = 1.40 + +x86_64: AMD Opteron 246, 1994 MHz, 3971.48 bogomips +k = 0.49 usec; j = 0.76; r = 0.80; kr = 0.82; jr = 1.07 + +ppc64: POWER5 (gr), 1656 MHz (SMT disabled, 1 virtual CPU per physical CPU) +k = 0.77 usec; j = 1.31; r = 1.26; kr = 1.45; jr = 1.99 + +7. TODO + +a. SystemTap (http://sourceware.org/systemtap): Work in progress +to provide a simplified programming interface for probe-based +instrumentation. +b. Improved SMP scalability: Currently, work is in progress to handle +multiple kprobes in parallel. +c. Kernel return probes for sparc64. +d. Support for other architectures. +e. User-space probes. + +8. Kprobes Example + +Here's a sample kernel module showing the use of kprobes to dump a +stack trace and selected i386 registers when do_fork() is called. +----- cut here ----- +/*kprobe_example.c*/ +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/kprobes.h> +#include <linux/kallsyms.h> +#include <linux/sched.h> + +/*For each probe you need to allocate a kprobe structure*/ +static struct kprobe kp; + +/*kprobe pre_handler: called just before the probed instruction is executed*/ +int handler_pre(struct kprobe *p, struct pt_regs *regs) +{ + printk("pre_handler: p->addr=0x%p, eip=%lx, eflags=0x%lx\n", + p->addr, regs->eip, regs->eflags); + dump_stack(); + return 0; +} + +/*kprobe post_handler: called after the probed instruction is executed*/ +void handler_post(struct kprobe *p, struct pt_regs *regs, unsigned long flags) +{ + printk("post_handler: p->addr=0x%p, eflags=0x%lx\n", + p->addr, regs->eflags); +} + +/* fault_handler: this is called if an exception is generated for any + * instruction within the pre- or post-handler, or when Kprobes + * single-steps the probed instruction. + */ +int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr) +{ + printk("fault_handler: p->addr=0x%p, trap #%dn", + p->addr, trapnr); + /* Return 0 because we don't handle the fault. */ + return 0; +} + +int init_module(void) +{ + int ret; + kp.pre_handler = handler_pre; + kp.post_handler = handler_post; + kp.fault_handler = handler_fault; + kp.addr = (kprobe_opcode_t*) kallsyms_lookup_name("do_fork"); + /* register the kprobe now */ + if (!kp.addr) { + printk("Couldn't find %s to plant kprobe\n", "do_fork"); + return -1; + } + if ((ret = register_kprobe(&kp) < 0)) { + printk("register_kprobe failed, returned %d\n", ret); + return -1; + } + printk("kprobe registered\n"); + return 0; +} + +void cleanup_module(void) +{ + unregister_kprobe(&kp); + printk("kprobe unregistered\n"); +} + +MODULE_LICENSE("GPL"); +----- cut here ----- + +You can build the kernel module, kprobe-example.ko, using the following +Makefile: +----- cut here ----- +obj-m := kprobe-example.o +KDIR := /lib/modules/$(shell uname -r)/build +PWD := $(shell pwd) +default: + $(MAKE) -C $(KDIR) SUBDIRS=$(PWD) modules +clean: + rm -f *.mod.c *.ko *.o +----- cut here ----- + +$ make +$ su - +... +# insmod kprobe-example.ko + +You will see the trace data in /var/log/messages and on the console +whenever do_fork() is invoked to create a new process. + +9. Jprobes Example + +Here's a sample kernel module showing the use of jprobes to dump +the arguments of do_fork(). +----- cut here ----- +/*jprobe-example.c */ +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/fs.h> +#include <linux/uio.h> +#include <linux/kprobes.h> +#include <linux/kallsyms.h> + +/* + * Jumper probe for do_fork. + * Mirror principle enables access to arguments of the probed routine + * from the probe handler. + */ + +/* Proxy routine having the same arguments as actual do_fork() routine */ +long jdo_fork(unsigned long clone_flags, unsigned long stack_start, + struct pt_regs *regs, unsigned long stack_size, + int __user * parent_tidptr, int __user * child_tidptr) +{ + printk("jprobe: clone_flags=0x%lx, stack_size=0x%lx, regs=0x%p\n", + clone_flags, stack_size, regs); + /* Always end with a call to jprobe_return(). */ + jprobe_return(); + /*NOTREACHED*/ + return 0; +} + +static struct jprobe my_jprobe = { + .entry = (kprobe_opcode_t *) jdo_fork +}; + +int init_module(void) +{ + int ret; + my_jprobe.kp.addr = (kprobe_opcode_t *) kallsyms_lookup_name("do_fork"); + if (!my_jprobe.kp.addr) { + printk("Couldn't find %s to plant jprobe\n", "do_fork"); + return -1; + } + + if ((ret = register_jprobe(&my_jprobe)) <0) { + printk("register_jprobe failed, returned %d\n", ret); + return -1; + } + printk("Planted jprobe at %p, handler addr %p\n", + my_jprobe.kp.addr, my_jprobe.entry); + return 0; +} + +void cleanup_module(void) +{ + unregister_jprobe(&my_jprobe); + printk("jprobe unregistered\n"); +} + +MODULE_LICENSE("GPL"); +----- cut here ----- + +Build and insert the kernel module as shown in the above kprobe +example. You will see the trace data in /var/log/messages and on +the console whenever do_fork() is invoked to create a new process. +(Some messages may be suppressed if syslogd is configured to +eliminate duplicate messages.) + +10. Kretprobes Example + +Here's a sample kernel module showing the use of return probes to +report failed calls to sys_open(). +----- cut here ----- +/*kretprobe-example.c*/ +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/kprobes.h> +#include <linux/kallsyms.h> + +static const char *probed_func = "sys_open"; + +/* Return-probe handler: If the probed function fails, log the return value. */ +static int ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs) +{ + // Substitute the appropriate register name for your architecture -- + // e.g., regs->rax for x86_64, regs->gpr[3] for ppc64. + int retval = (int) regs->eax; + if (retval < 0) { + printk("%s returns %d\n", probed_func, retval); + } + return 0; +} + +static struct kretprobe my_kretprobe = { + .handler = ret_handler, + /* Probe up to 20 instances concurrently. */ + .maxactive = 20 +}; + +int init_module(void) +{ + int ret; + my_kretprobe.kp.addr = + (kprobe_opcode_t *) kallsyms_lookup_name(probed_func); + if (!my_kretprobe.kp.addr) { + printk("Couldn't find %s to plant return probe\n", probed_func); + return -1; + } + if ((ret = register_kretprobe(&my_kretprobe)) < 0) { + printk("register_kretprobe failed, returned %d\n", ret); + return -1; + } + printk("Planted return probe at %p\n", my_kretprobe.kp.addr); + return 0; +} + +void cleanup_module(void) +{ + unregister_kretprobe(&my_kretprobe); + printk("kretprobe unregistered\n"); + /* nmissed > 0 suggests that maxactive was set too low. */ + printk("Missed probing %d instances of %s\n", + my_kretprobe.nmissed, probed_func); +} + +MODULE_LICENSE("GPL"); +----- cut here ----- + +Build and insert the kernel module as shown in the above kprobe +example. You will see the trace data in /var/log/messages and on the +console whenever sys_open() returns a negative value. (Some messages +may be suppressed if syslogd is configured to eliminate duplicate +messages.) + +For additional information on Kprobes, refer to the following URLs: +http://www-106.ibm.com/developerworks/library/l-kprobes.html?ca=dgr-lnxw42Kprobe +http://www.redhat.com/magazine/005mar05/features/kprobes/ diff --git a/Documentation/m68k/kernel-options.txt b/Documentation/m68k/kernel-options.txt index e191baad8308..d5d3f064f552 100644 --- a/Documentation/m68k/kernel-options.txt +++ b/Documentation/m68k/kernel-options.txt @@ -626,7 +626,7 @@ ignored (others aren't affected). can be performed in optimal order. Not all SCSI devices support tagged queuing (:-(). -4.6 switches= +4.5 switches= ------------- Syntax: switches=<list of switches> @@ -661,28 +661,6 @@ correctly. earlier initialization ("ov_"-less) takes precedence. But the switching-off on reset still happens in this case. -4.5) stram_swap= ----------------- - -Syntax: stram_swap=<do_swap>[,<max_swap>] - - This option is available only if the kernel has been compiled with -CONFIG_STRAM_SWAP enabled. Normally, the kernel then determines -dynamically whether to actually use ST-RAM as swap space. (Currently, -the fraction of ST-RAM must be less or equal 1/3 of total memory to -enable this swapping.) You can override the kernel's decision by -specifying this option. 1 for <do_swap> means always enable the swap, -even if you have less alternate RAM. 0 stands for never swap to -ST-RAM, even if it's small enough compared to the rest of memory. - - If ST-RAM swapping is enabled, the kernel usually uses all free -ST-RAM as swap "device". If the kernel resides in ST-RAM, the region -allocated by it is obviously never used for swapping :-) You can also -limit this amount by specifying the second parameter, <max_swap>, if -you want to use parts of ST-RAM as normal system memory. <max_swap> is -in kBytes and the number should be a multiple of 4 (otherwise: rounded -down). - 5) Options for Amiga Only: ========================== diff --git a/Documentation/mips/AU1xxx_IDE.README b/Documentation/mips/AU1xxx_IDE.README new file mode 100644 index 000000000000..a7e4c4ea3560 --- /dev/null +++ b/Documentation/mips/AU1xxx_IDE.README @@ -0,0 +1,168 @@ +README for MIPS AU1XXX IDE driver - Released 2005-07-15 + +ABOUT +----- +This file describes the 'drivers/ide/mips/au1xxx-ide.c', related files and the +services they provide. + +If you are short in patience and just want to know how to add your hard disc to +the white or black list, go to the 'ADD NEW HARD DISC TO WHITE OR BLACK LIST' +section. + + +LICENSE +------- + +Copyright (c) 2003-2005 AMD, Personal Connectivity Solutions + +This program is free software; you can redistribute it and/or modify it under +the terms of the GNU General Public License as published by the Free Software +Foundation; either version 2 of the License, or (at your option) any later +version. + +THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, +INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND +FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR +BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR +CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF +SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS +INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN +CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) +ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. + +You should have received a copy of the GNU General Public License along with +this program; if not, write to the Free Software Foundation, Inc., +675 Mass Ave, Cambridge, MA 02139, USA. + +Note: for more information, please refer "AMD Alchemy Au1200/Au1550 IDE + Interface and Linux Device Driver" Application Note. + + +FILES, CONFIGS AND COMPATABILITY +-------------------------------- + +Two files are introduced: + + a) 'include/asm-mips/mach-au1x00/au1xxx_ide.h' + containes : struct _auide_hwif + struct drive_list_entry dma_white_list + struct drive_list_entry dma_black_list + timing parameters for PIO mode 0/1/2/3/4 + timing parameters for MWDMA 0/1/2 + + b) 'drivers/ide/mips/au1xxx-ide.c' + contains the functionality of the AU1XXX IDE driver + +Four configs variables are introduced: + + CONFIG_BLK_DEV_IDE_AU1XXX_PIO_DBDMA - enable the PIO+DBDMA mode + CONFIG_BLK_DEV_IDE_AU1XXX_MDMA2_DBDMA - enable the MWDMA mode + CONFIG_BLK_DEV_IDE_AU1XXX_BURSTABLE_ON - set Burstable FIFO in DBDMA + controler + CONFIG_BLK_DEV_IDE_AU1XXX_SEQTS_PER_RQ - maximum transfer size + per descriptor + +If MWDMA is enabled and the connected hard disc is not on the white list, the +kernel switches to a "safe mwdma mode" at boot time. In this mode the IDE +performance is substantial slower then in full speed mwdma. In this case +please add your hard disc to the white list (follow instruction from 'ADD NEW +HARD DISC TO WHITE OR BLACK LIST' section). + + +SUPPORTED IDE MODES +------------------- + +The AU1XXX IDE driver supported all PIO modes - PIO mode 0/1/2/3/4 - and all +MWDMA modes - MWDMA 0/1/2 -. There is no support for SWDMA and UDMA mode. + +To change the PIO mode use the program hdparm with option -p, e.g. +'hdparm -p0 [device]' for PIO mode 0. To enable the MWDMA mode use the option +-X, e.g. 'hdparm -X32 [device]' for MWDMA mode 0. + + +PERFORMANCE CONFIGURATIONS +-------------------------- + +If the used system doesn't need USB support enable the following kernel configs: + +CONFIG_IDE=y +CONFIG_BLK_DEV_IDE=y +CONFIG_IDE_GENERIC=y +CONFIG_BLK_DEV_IDEPCI=y +CONFIG_BLK_DEV_GENERIC=y +CONFIG_BLK_DEV_IDEDMA_PCI=y +CONFIG_IDEDMA_PCI_AUTO=y +CONFIG_BLK_DEV_IDE_AU1XXX=y +CONFIG_BLK_DEV_IDE_AU1XXX_MDMA2_DBDMA=y +CONFIG_BLK_DEV_IDE_AU1XXX_BURSTABLE_ON=y +CONFIG_BLK_DEV_IDE_AU1XXX_SEQTS_PER_RQ=128 +CONFIG_BLK_DEV_IDEDMA=y +CONFIG_IDEDMA_AUTO=y + +If the used system need the USB support enable the following kernel configs for +high IDE to USB throughput. + +CONFIG_BLK_DEV_IDEDISK=y +CONFIG_IDE_GENERIC=y +CONFIG_BLK_DEV_IDEPCI=y +CONFIG_BLK_DEV_GENERIC=y +CONFIG_BLK_DEV_IDEDMA_PCI=y +CONFIG_IDEDMA_PCI_AUTO=y +CONFIG_BLK_DEV_IDE_AU1XXX=y +CONFIG_BLK_DEV_IDE_AU1XXX_MDMA2_DBDMA=y +CONFIG_BLK_DEV_IDE_AU1XXX_SEQTS_PER_RQ=128 +CONFIG_BLK_DEV_IDEDMA=y +CONFIG_IDEDMA_AUTO=y + + +ADD NEW HARD DISC TO WHITE OR BLACK LIST +---------------------------------------- + +Step 1 : detect the model name of your hard disc + + a) connect your hard disc to the AU1XXX + + b) boot your kernel and get the hard disc model. + + Example boot log: + + --snipped-- + Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2 + ide: Assuming 50MHz system bus speed for PIO modes; override with idebus=xx + Au1xxx IDE(builtin) configured for MWDMA2 + Probing IDE interface ide0... + hda: Maxtor 6E040L0, ATA DISK drive + ide0 at 0xac800000-0xac800007,0xac8001c0 on irq 64 + hda: max request size: 64KiB + hda: 80293248 sectors (41110 MB) w/2048KiB Cache, CHS=65535/16/63, (U)DMA + --snipped-- + + In this example 'Maxtor 6E040L0'. + +Step 2 : edit 'include/asm-mips/mach-au1x00/au1xxx_ide.h' + + Add your hard disc to the dma_white_list or dma_black_list structur. + +Step 3 : Recompile the kernel + + Enable MWDMA support in the kernel configuration. Recompile the kernel and + reboot. + +Step 4 : Tests + + If you have add a hard disc to the white list, please run some stress tests + for verification. + + +ACKNOWLEDGMENTS +--------------- + +These drivers wouldn't have been done without the base of kernel 2.4.x AU1XXX +IDE driver from AMD. + +Additional input also from: +Matthias Lenk <matthias.lenk@amd.com> + +Happy hacking! +Enrico Walther <enrico.walther@amd.com> diff --git a/Documentation/mono.txt b/Documentation/mono.txt index 6739ab9615ef..807a0c7b4737 100644 --- a/Documentation/mono.txt +++ b/Documentation/mono.txt @@ -30,7 +30,7 @@ other program after you have done the following: Read the file 'binfmt_misc.txt' in this directory to know more about the configuration process. -3) Add the following enries to /etc/rc.local or similar script +3) Add the following entries to /etc/rc.local or similar script to be run at system startup: # Insert BINFMT_MISC module into the kernel diff --git a/Documentation/networking/README.ipw2100 b/Documentation/networking/README.ipw2100 new file mode 100644 index 000000000000..2046948b020d --- /dev/null +++ b/Documentation/networking/README.ipw2100 @@ -0,0 +1,246 @@ + +=========================== +Intel(R) PRO/Wireless 2100 Network Connection Driver for Linux +README.ipw2100 + +March 14, 2005 + +=========================== +Index +--------------------------- +0. Introduction +1. Release 1.1.0 Current Features +2. Command Line Parameters +3. Sysfs Helper Files +4. Radio Kill Switch +5. Dynamic Firmware +6. Power Management +7. Support +8. License + + +=========================== +0. Introduction +------------ ----- ----- ---- --- -- - + +This document provides a brief overview of the features supported by the +IPW2100 driver project. The main project website, where the latest +development version of the driver can be found, is: + + http://ipw2100.sourceforge.net + +There you can find the not only the latest releases, but also information about +potential fixes and patches, as well as links to the development mailing list +for the driver project. + + +=========================== +1. Release 1.1.0 Current Supported Features +--------------------------- +- Managed (BSS) and Ad-Hoc (IBSS) +- WEP (shared key and open) +- Wireless Tools support +- 802.1x (tested with XSupplicant 1.0.1) + +Enabled (but not supported) features: +- Monitor/RFMon mode +- WPA/WPA2 + +The distinction between officially supported and enabled is a reflection +on the amount of validation and interoperability testing that has been +performed on a given feature. + + +=========================== +2. Command Line Parameters +--------------------------- + +If the driver is built as a module, the following optional parameters are used +by entering them on the command line with the modprobe command using this +syntax: + + modprobe ipw2100 [<option>=<VAL1><,VAL2>...] + +For example, to disable the radio on driver loading, enter: + + modprobe ipw2100 disable=1 + +The ipw2100 driver supports the following module parameters: + +Name Value Example: +debug 0x0-0xffffffff debug=1024 +mode 0,1,2 mode=1 /* AdHoc */ +channel int channel=3 /* Only valid in AdHoc or Monitor */ +associate boolean associate=0 /* Do NOT auto associate */ +disable boolean disable=1 /* Do not power the HW */ + + +=========================== +3. Sysfs Helper Files +--------------------------- + +There are several ways to control the behavior of the driver. Many of the +general capabilities are exposed through the Wireless Tools (iwconfig). There +are a few capabilities that are exposed through entries in the Linux Sysfs. + + +----- Driver Level ------ +For the driver level files, look in /sys/bus/pci/drivers/ipw2100/ + + debug_level + + This controls the same global as the 'debug' module parameter. For + information on the various debugging levels available, run the 'dvals' + script found in the driver source directory. + + NOTE: 'debug_level' is only enabled if CONFIG_IPW2100_DEBUG is turn + on. + +----- Device Level ------ +For the device level files look in + + /sys/bus/pci/drivers/ipw2100/{PCI-ID}/ + +For example: + /sys/bus/pci/drivers/ipw2100/0000:02:01.0 + +For the device level files, see /sys/bus/pci/drivers/ipw2100: + + rf_kill + read - + 0 = RF kill not enabled (radio on) + 1 = SW based RF kill active (radio off) + 2 = HW based RF kill active (radio off) + 3 = Both HW and SW RF kill active (radio off) + write - + 0 = If SW based RF kill active, turn the radio back on + 1 = If radio is on, activate SW based RF kill + + NOTE: If you enable the SW based RF kill and then toggle the HW + based RF kill from ON -> OFF -> ON, the radio will NOT come back on + + +=========================== +4. Radio Kill Switch +--------------------------- +Most laptops provide the ability for the user to physically disable the radio. +Some vendors have implemented this as a physical switch that requires no +software to turn the radio off and on. On other laptops, however, the switch +is controlled through a button being pressed and a software driver then making +calls to turn the radio off and on. This is referred to as a "software based +RF kill switch" + +See the Sysfs helper file 'rf_kill' for determining the state of the RF switch +on your system. + + +=========================== +5. Dynamic Firmware +--------------------------- +As the firmware is licensed under a restricted use license, it can not be +included within the kernel sources. To enable the IPW2100 you will need a +firmware image to load into the wireless NIC's processors. + +You can obtain these images from <http://ipw2100.sf.net/firmware.php>. + +See INSTALL for instructions on installing the firmware. + + +=========================== +6. Power Management +--------------------------- +The IPW2100 supports the configuration of the Power Save Protocol +through a private wireless extension interface. The IPW2100 supports +the following different modes: + + off No power management. Radio is always on. + on Automatic power management + 1-5 Different levels of power management. The higher the + number the greater the power savings, but with an impact to + packet latencies. + +Power management works by powering down the radio after a certain +interval of time has passed where no packets are passed through the +radio. Once powered down, the radio remains in that state for a given +period of time. For higher power savings, the interval between last +packet processed to sleep is shorter and the sleep period is longer. + +When the radio is asleep, the access point sending data to the station +must buffer packets at the AP until the station wakes up and requests +any buffered packets. If you have an AP that does not correctly support +the PSP protocol you may experience packet loss or very poor performance +while power management is enabled. If this is the case, you will need +to try and find a firmware update for your AP, or disable power +management (via `iwconfig eth1 power off`) + +To configure the power level on the IPW2100 you use a combination of +iwconfig and iwpriv. iwconfig is used to turn power management on, off, +and set it to auto. + + iwconfig eth1 power off Disables radio power down + iwconfig eth1 power on Enables radio power management to + last set level (defaults to AUTO) + iwpriv eth1 set_power 0 Sets power level to AUTO and enables + power management if not previously + enabled. + iwpriv eth1 set_power 1-5 Set the power level as specified, + enabling power management if not + previously enabled. + +You can view the current power level setting via: + + iwpriv eth1 get_power + +It will return the current period or timeout that is configured as a string +in the form of xxxx/yyyy (z) where xxxx is the timeout interval (amount of +time after packet processing), yyyy is the period to sleep (amount of time to +wait before powering the radio and querying the access point for buffered +packets), and z is the 'power level'. If power management is turned off the +xxxx/yyyy will be replaced with 'off' -- the level reported will be the active +level if `iwconfig eth1 power on` is invoked. + + +=========================== +7. Support +--------------------------- + +For general development information and support, +go to: + + http://ipw2100.sf.net/ + +The ipw2100 1.1.0 driver and firmware can be downloaded from: + + http://support.intel.com + +For installation support on the ipw2100 1.1.0 driver on Linux kernels +2.6.8 or greater, email support is available from: + + http://supportmail.intel.com + +=========================== +8. License +--------------------------- + + Copyright(c) 2003 - 2005 Intel Corporation. All rights reserved. + + This program is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License (version 2) as + published by the Free Software Foundation. + + This program is distributed in the hope that it will be useful, but WITHOUT + ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + You should have received a copy of the GNU General Public License along with + this program; if not, write to the Free Software Foundation, Inc., 59 + Temple Place - Suite 330, Boston, MA 02111-1307, USA. + + The full GNU General Public License is included in this distribution in the + file called LICENSE. + + License Contact Information: + James P. Ketrenos <ipw2100-admin@linux.intel.com> + Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497 + diff --git a/Documentation/networking/README.ipw2200 b/Documentation/networking/README.ipw2200 new file mode 100644 index 000000000000..6916080c5f03 --- /dev/null +++ b/Documentation/networking/README.ipw2200 @@ -0,0 +1,300 @@ + +Intel(R) PRO/Wireless 2915ABG Driver for Linux in support of: + +Intel(R) PRO/Wireless 2200BG Network Connection +Intel(R) PRO/Wireless 2915ABG Network Connection + +Note: The Intel(R) PRO/Wireless 2915ABG Driver for Linux and Intel(R) +PRO/Wireless 2200BG Driver for Linux is a unified driver that works on +both hardware adapters listed above. In this document the Intel(R) +PRO/Wireless 2915ABG Driver for Linux will be used to reference the +unified driver. + +Copyright (C) 2004-2005, Intel Corporation + +README.ipw2200 + +Version: 1.0.0 +Date : January 31, 2005 + + +Index +----------------------------------------------- +1. Introduction +1.1. Overview of features +1.2. Module parameters +1.3. Wireless Extension Private Methods +1.4. Sysfs Helper Files +2. About the Version Numbers +3. Support +4. License + + +1. Introduction +----------------------------------------------- +The following sections attempt to provide a brief introduction to using +the Intel(R) PRO/Wireless 2915ABG Driver for Linux. + +This document is not meant to be a comprehensive manual on +understanding or using wireless technologies, but should be sufficient +to get you moving without wires on Linux. + +For information on building and installing the driver, see the INSTALL +file. + + +1.1. Overview of Features +----------------------------------------------- +The current release (1.0.0) supports the following features: + ++ BSS mode (Infrastructure, Managed) ++ IBSS mode (Ad-Hoc) ++ WEP (OPEN and SHARED KEY mode) ++ 802.1x EAP via wpa_supplicant and xsupplicant ++ Wireless Extension support ++ Full B and G rate support (2200 and 2915) ++ Full A rate support (2915 only) ++ Transmit power control ++ S state support (ACPI suspend/resume) ++ long/short preamble support + + + +1.2. Command Line Parameters +----------------------------------------------- + +Like many modules used in the Linux kernel, the Intel(R) PRO/Wireless +2915ABG Driver for Linux allows certain configuration options to be +provided as module parameters. The most common way to specify a module +parameter is via the command line. + +The general form is: + +% modprobe ipw2200 parameter=value + +Where the supported parameter are: + + associate + Set to 0 to disable the auto scan-and-associate functionality of the + driver. If disabled, the driver will not attempt to scan + for and associate to a network until it has been configured with + one or more properties for the target network, for example configuring + the network SSID. Default is 1 (auto-associate) + + Example: % modprobe ipw2200 associate=0 + + auto_create + Set to 0 to disable the auto creation of an Ad-Hoc network + matching the channel and network name parameters provided. + Default is 1. + + channel + channel number for association. The normal method for setting + the channel would be to use the standard wireless tools + (i.e. `iwconfig eth1 channel 10`), but it is useful sometimes + to set this while debugging. Channel 0 means 'ANY' + + debug + If using a debug build, this is used to control the amount of debug + info is logged. See the 'dval' and 'load' script for more info on + how to use this (the dval and load scripts are provided as part + of the ipw2200 development snapshot releases available from the + SourceForge project at http://ipw2200.sf.net) + + mode + Can be used to set the default mode of the adapter. + 0 = Managed, 1 = Ad-Hoc + + +1.3. Wireless Extension Private Methods +----------------------------------------------- + +As an interface designed to handle generic hardware, there are certain +capabilities not exposed through the normal Wireless Tool interface. As +such, a provision is provided for a driver to declare custom, or +private, methods. The Intel(R) PRO/Wireless 2915ABG Driver for Linux +defines several of these to configure various settings. + +The general form of using the private wireless methods is: + + % iwpriv $IFNAME method parameters + +Where $IFNAME is the interface name the device is registered with +(typically eth1, customized via one of the various network interface +name managers, such as ifrename) + +The supported private methods are: + + get_mode + Can be used to report out which IEEE mode the driver is + configured to support. Example: + + % iwpriv eth1 get_mode + eth1 get_mode:802.11bg (6) + + set_mode + Can be used to configure which IEEE mode the driver will + support. + + Usage: + % iwpriv eth1 set_mode {mode} + Where {mode} is a number in the range 1-7: + 1 802.11a (2915 only) + 2 802.11b + 3 802.11ab (2915 only) + 4 802.11g + 5 802.11ag (2915 only) + 6 802.11bg + 7 802.11abg (2915 only) + + get_preamble + Can be used to report configuration of preamble length. + + set_preamble + Can be used to set the configuration of preamble length: + + Usage: + % iwpriv eth1 set_preamble {mode} + Where {mode} is one of: + 1 Long preamble only + 0 Auto (long or short based on connection) + + +1.4. Sysfs Helper Files: +----------------------------------------------- + +The Linux kernel provides a pseudo file system that can be used to +access various components of the operating system. The Intel(R) +PRO/Wireless 2915ABG Driver for Linux exposes several configuration +parameters through this mechanism. + +An entry in the sysfs can support reading and/or writing. You can +typically query the contents of a sysfs entry through the use of cat, +and can set the contents via echo. For example: + +% cat /sys/bus/pci/drivers/ipw2200/debug_level + +Will report the current debug level of the driver's logging subsystem +(only available if CONFIG_IPW_DEBUG was configured when the driver was +built). + +You can set the debug level via: + +% echo $VALUE > /sys/bus/pci/drivers/ipw2200/debug_level + +Where $VALUE would be a number in the case of this sysfs entry. The +input to sysfs files does not have to be a number. For example, the +firmware loader used by hotplug utilizes sysfs entries for transferring +the firmware image from user space into the driver. + +The Intel(R) PRO/Wireless 2915ABG Driver for Linux exposes sysfs entries +at two levels -- driver level, which apply to all instances of the +driver (in the event that there are more than one device installed) and +device level, which applies only to the single specific instance. + + +1.4.1 Driver Level Sysfs Helper Files +----------------------------------------------- + +For the driver level files, look in /sys/bus/pci/drivers/ipw2200/ + + debug_level + + This controls the same global as the 'debug' module parameter + + +1.4.2 Device Level Sysfs Helper Files +----------------------------------------------- + +For the device level files, look in + + /sys/bus/pci/drivers/ipw2200/{PCI-ID}/ + +For example: + /sys/bus/pci/drivers/ipw2200/0000:02:01.0 + +For the device level files, see /sys/bus/pci/[drivers/ipw2200: + + rf_kill + read - + 0 = RF kill not enabled (radio on) + 1 = SW based RF kill active (radio off) + 2 = HW based RF kill active (radio off) + 3 = Both HW and SW RF kill active (radio off) + write - + 0 = If SW based RF kill active, turn the radio back on + 1 = If radio is on, activate SW based RF kill + + NOTE: If you enable the SW based RF kill and then toggle the HW + based RF kill from ON -> OFF -> ON, the radio will NOT come back on + + ucode + read-only access to the ucode version number + + +2. About the Version Numbers +----------------------------------------------- + +Due to the nature of open source development projects, there are +frequently changes being incorporated that have not gone through +a complete validation process. These changes are incorporated into +development snapshot releases. + +Releases are numbered with a three level scheme: + + major.minor.development + +Any version where the 'development' portion is 0 (for example +1.0.0, 1.1.0, etc.) indicates a stable version that will be made +available for kernel inclusion. + +Any version where the 'development' portion is not a 0 (for +example 1.0.1, 1.1.5, etc.) indicates a development version that is +being made available for testing and cutting edge users. The stability +and functionality of the development releases are not know. We make +efforts to try and keep all snapshots reasonably stable, but due to the +frequency of their release, and the desire to get those releases +available as quickly as possible, unknown anomalies should be expected. + +The major version number will be incremented when significant changes +are made to the driver. Currently, there are no major changes planned. + + +3. Support +----------------------------------------------- + +For installation support of the 1.0.0 version, you can contact +http://supportmail.intel.com, or you can use the open source project +support. + +For general information and support, go to: + + http://ipw2200.sf.net/ + + +4. License +----------------------------------------------- + + Copyright(c) 2003 - 2005 Intel Corporation. All rights reserved. + + This program is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License version 2 as + published by the Free Software Foundation. + + This program is distributed in the hope that it will be useful, but WITHOUT + ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + You should have received a copy of the GNU General Public License along with + this program; if not, write to the Free Software Foundation, Inc., 59 + Temple Place - Suite 330, Boston, MA 02111-1307, USA. + + The full GNU General Public License is included in this distribution in the + file called LICENSE. + + Contact Information: + James P. Ketrenos <ipw2100-admin@linux.intel.com> + Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497 + + diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index 24d029455baa..b0fe41da007b 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -777,7 +777,7 @@ doing so is the same as described in the "Configuring Multiple Bonds Manually" section, below. NOTE: It has been observed that some Red Hat supplied kernels -are apparently unable to rename modules at load time (the "-obonding1" +are apparently unable to rename modules at load time (the "-o bond1" part). Attempts to pass that option to modprobe will produce an "Operation not permitted" error. This has been reported on some Fedora Core kernels, and has been seen on RHEL 4 as well. On kernels @@ -883,7 +883,8 @@ the above does not work, and the second bonding instance never sees its options. In that case, the second options line can be substituted as follows: -install bonding1 /sbin/modprobe bonding -obond1 mode=balance-alb miimon=50 +install bond1 /sbin/modprobe --ignore-install bonding -o bond1 \ + mode=balance-alb miimon=50 This may be repeated any number of times, specifying a new and unique name in place of bond1 for each subsequent instance. @@ -1241,7 +1242,7 @@ traffic while still maintaining carrier on. If running SNMP agents, the bonding driver should be loaded before any network drivers participating in a bond. This requirement -is due to the the interface index (ipAdEntIfIndex) being associated to +is due to the interface index (ipAdEntIfIndex) being associated to the first interface found with a given IP address. That is, there is only one ipAdEntIfIndex for each IP address. For example, if eth0 and eth1 are slaves of bond0 and the driver for eth0 is loaded before the @@ -1937,7 +1938,7 @@ switches currently available support 802.3ad. If not explicitly configured (with ifconfig or ip link), the MAC address of the bonding device is taken from its first slave device. This MAC address is then passed to all following slaves and -remains persistent (even if the the first slave is removed) until the +remains persistent (even if the first slave is removed) until the bonding device is brought down or reconfigured. If you wish to change the MAC address, you can set it with diff --git a/Documentation/networking/cxgb.txt b/Documentation/networking/cxgb.txt new file mode 100644 index 000000000000..76324638626b --- /dev/null +++ b/Documentation/networking/cxgb.txt @@ -0,0 +1,352 @@ + Chelsio N210 10Gb Ethernet Network Controller + + Driver Release Notes for Linux + + Version 2.1.1 + + June 20, 2005 + +CONTENTS +======== + INTRODUCTION + FEATURES + PERFORMANCE + DRIVER MESSAGES + KNOWN ISSUES + SUPPORT + + +INTRODUCTION +============ + + This document describes the Linux driver for Chelsio 10Gb Ethernet Network + Controller. This driver supports the Chelsio N210 NIC and is backward + compatible with the Chelsio N110 model 10Gb NICs. + + +FEATURES +======== + + Adaptive Interrupts (adaptive-rx) + --------------------------------- + + This feature provides an adaptive algorithm that adjusts the interrupt + coalescing parameters, allowing the driver to dynamically adapt the latency + settings to achieve the highest performance during various types of network + load. + + The interface used to control this feature is ethtool. Please see the + ethtool manpage for additional usage information. + + By default, adaptive-rx is disabled. + To enable adaptive-rx: + + ethtool -C <interface> adaptive-rx on + + To disable adaptive-rx, use ethtool: + + ethtool -C <interface> adaptive-rx off + + After disabling adaptive-rx, the timer latency value will be set to 50us. + You may set the timer latency after disabling adaptive-rx: + + ethtool -C <interface> rx-usecs <microseconds> + + An example to set the timer latency value to 100us on eth0: + + ethtool -C eth0 rx-usecs 100 + + You may also provide a timer latency value while disabling adpative-rx: + + ethtool -C <interface> adaptive-rx off rx-usecs <microseconds> + + If adaptive-rx is disabled and a timer latency value is specified, the timer + will be set to the specified value until changed by the user or until + adaptive-rx is enabled. + + To view the status of the adaptive-rx and timer latency values: + + ethtool -c <interface> + + + TCP Segmentation Offloading (TSO) Support + ----------------------------------------- + + This feature, also known as "large send", enables a system's protocol stack + to offload portions of outbound TCP processing to a network interface card + thereby reducing system CPU utilization and enhancing performance. + + The interface used to control this feature is ethtool version 1.8 or higher. + Please see the ethtool manpage for additional usage information. + + By default, TSO is enabled. + To disable TSO: + + ethtool -K <interface> tso off + + To enable TSO: + + ethtool -K <interface> tso on + + To view the status of TSO: + + ethtool -k <interface> + + +PERFORMANCE +=========== + + The following information is provided as an example of how to change system + parameters for "performance tuning" an what value to use. You may or may not + want to change these system parameters, depending on your server/workstation + application. Doing so is not warranted in any way by Chelsio Communications, + and is done at "YOUR OWN RISK". Chelsio will not be held responsible for loss + of data or damage to equipment. + + Your distribution may have a different way of doing things, or you may prefer + a different method. These commands are shown only to provide an example of + what to do and are by no means definitive. + + Making any of the following system changes will only last until you reboot + your system. You may want to write a script that runs at boot-up which + includes the optimal settings for your system. + + Setting PCI Latency Timer: + setpci -d 1425:* 0x0c.l=0x0000F800 + + Disabling TCP timestamp: + sysctl -w net.ipv4.tcp_timestamps=0 + + Disabling SACK: + sysctl -w net.ipv4.tcp_sack=0 + + Setting large number of incoming connection requests: + sysctl -w net.ipv4.tcp_max_syn_backlog=3000 + + Setting maximum receive socket buffer size: + sysctl -w net.core.rmem_max=1024000 + + Setting maximum send socket buffer size: + sysctl -w net.core.wmem_max=1024000 + + Set smp_affinity (on a multiprocessor system) to a single CPU: + echo 1 > /proc/irq/<interrupt_number>/smp_affinity + + Setting default receive socket buffer size: + sysctl -w net.core.rmem_default=524287 + + Setting default send socket buffer size: + sysctl -w net.core.wmem_default=524287 + + Setting maximum option memory buffers: + sysctl -w net.core.optmem_max=524287 + + Setting maximum backlog (# of unprocessed packets before kernel drops): + sysctl -w net.core.netdev_max_backlog=300000 + + Setting TCP read buffers (min/default/max): + sysctl -w net.ipv4.tcp_rmem="10000000 10000000 10000000" + + Setting TCP write buffers (min/pressure/max): + sysctl -w net.ipv4.tcp_wmem="10000000 10000000 10000000" + + Setting TCP buffer space (min/pressure/max): + sysctl -w net.ipv4.tcp_mem="10000000 10000000 10000000" + + TCP window size for single connections: + The receive buffer (RX_WINDOW) size must be at least as large as the + Bandwidth-Delay Product of the communication link between the sender and + receiver. Due to the variations of RTT, you may want to increase the buffer + size up to 2 times the Bandwidth-Delay Product. Reference page 289 of + "TCP/IP Illustrated, Volume 1, The Protocols" by W. Richard Stevens. + At 10Gb speeds, use the following formula: + RX_WINDOW >= 1.25MBytes * RTT(in milliseconds) + Example for RTT with 100us: RX_WINDOW = (1,250,000 * 0.1) = 125,000 + RX_WINDOW sizes of 256KB - 512KB should be sufficient. + Setting the min, max, and default receive buffer (RX_WINDOW) size: + sysctl -w net.ipv4.tcp_rmem="<min> <default> <max>" + + TCP window size for multiple connections: + The receive buffer (RX_WINDOW) size may be calculated the same as single + connections, but should be divided by the number of connections. The + smaller window prevents congestion and facilitates better pacing, + especially if/when MAC level flow control does not work well or when it is + not supported on the machine. Experimentation may be necessary to attain + the correct value. This method is provided as a starting point fot the + correct receive buffer size. + Setting the min, max, and default receive buffer (RX_WINDOW) size is + performed in the same manner as single connection. + + +DRIVER MESSAGES +=============== + + The following messages are the most common messages logged by syslog. These + may be found in /var/log/messages. + + Driver up: + Chelsio Network Driver - version 2.1.1 + + NIC detected: + eth#: Chelsio N210 1x10GBaseX NIC (rev #), PCIX 133MHz/64-bit + + Link up: + eth#: link is up at 10 Gbps, full duplex + + Link down: + eth#: link is down + + +KNOWN ISSUES +============ + + These issues have been identified during testing. The following information + is provided as a workaround to the problem. In some cases, this problem is + inherent to Linux or to a particular Linux Distribution and/or hardware + platform. + + 1. Large number of TCP retransmits on a multiprocessor (SMP) system. + + On a system with multiple CPUs, the interrupt (IRQ) for the network + controller may be bound to more than one CPU. This will cause TCP + retransmits if the packet data were to be split across different CPUs + and re-assembled in a different order than expected. + + To eliminate the TCP retransmits, set smp_affinity on the particular + interrupt to a single CPU. You can locate the interrupt (IRQ) used on + the N110/N210 by using ifconfig: + ifconfig <dev_name> | grep Interrupt + Set the smp_affinity to a single CPU: + echo 1 > /proc/irq/<interrupt_number>/smp_affinity + + It is highly suggested that you do not run the irqbalance daemon on your + system, as this will change any smp_affinity setting you have applied. + The irqbalance daemon runs on a 10 second interval and binds interrupts + to the least loaded CPU determined by the daemon. To disable this daemon: + chkconfig --level 2345 irqbalance off + + By default, some Linux distributions enable the kernel feature, + irqbalance, which performs the same function as the daemon. To disable + this feature, add the following line to your bootloader: + noirqbalance + + Example using the Grub bootloader: + title Red Hat Enterprise Linux AS (2.4.21-27.ELsmp) + root (hd0,0) + kernel /vmlinuz-2.4.21-27.ELsmp ro root=/dev/hda3 noirqbalance + initrd /initrd-2.4.21-27.ELsmp.img + + 2. After running insmod, the driver is loaded and the incorrect network + interface is brought up without running ifup. + + When using 2.4.x kernels, including RHEL kernels, the Linux kernel + invokes a script named "hotplug". This script is primarily used to + automatically bring up USB devices when they are plugged in, however, + the script also attempts to automatically bring up a network interface + after loading the kernel module. The hotplug script does this by scanning + the ifcfg-eth# config files in /etc/sysconfig/network-scripts, looking + for HWADDR=<mac_address>. + + If the hotplug script does not find the HWADDRR within any of the + ifcfg-eth# files, it will bring up the device with the next available + interface name. If this interface is already configured for a different + network card, your new interface will have incorrect IP address and + network settings. + + To solve this issue, you can add the HWADDR=<mac_address> key to the + interface config file of your network controller. + + To disable this "hotplug" feature, you may add the driver (module name) + to the "blacklist" file located in /etc/hotplug. It has been noted that + this does not work for network devices because the net.agent script + does not use the blacklist file. Simply remove, or rename, the net.agent + script located in /etc/hotplug to disable this feature. + + 3. Transport Protocol (TP) hangs when running heavy multi-connection traffic + on an AMD Opteron system with HyperTransport PCI-X Tunnel chipset. + + If your AMD Opteron system uses the AMD-8131 HyperTransport PCI-X Tunnel + chipset, you may experience the "133-Mhz Mode Split Completion Data + Corruption" bug identified by AMD while using a 133Mhz PCI-X card on the + bus PCI-X bus. + + AMD states, "Under highly specific conditions, the AMD-8131 PCI-X Tunnel + can provide stale data via split completion cycles to a PCI-X card that + is operating at 133 Mhz", causing data corruption. + + AMD's provides three workarounds for this problem, however, Chelsio + recommends the first option for best performance with this bug: + + For 133Mhz secondary bus operation, limit the transaction length and + the number of outstanding transactions, via BIOS configuration + programming of the PCI-X card, to the following: + + Data Length (bytes): 1k + Total allowed outstanding transactions: 2 + + Please refer to AMD 8131-HT/PCI-X Errata 26310 Rev 3.08 August 2004, + section 56, "133-MHz Mode Split Completion Data Corruption" for more + details with this bug and workarounds suggested by AMD. + + It may be possible to work outside AMD's recommended PCI-X settings, try + increasing the Data Length to 2k bytes for increased performance. If you + have issues with these settings, please revert to the "safe" settings + and duplicate the problem before submitting a bug or asking for support. + + NOTE: The default setting on most systems is 8 outstanding transactions + and 2k bytes data length. + + 4. On multiprocessor systems, it has been noted that an application which + is handling 10Gb networking can switch between CPUs causing degraded + and/or unstable performance. + + If running on an SMP system and taking performance measurements, it + is suggested you either run the latest netperf-2.4.0+ or use a binding + tool such as Tim Hockin's procstate utilities (runon) + <http://www.hockin.org/~thockin/procstate/>. + + Binding netserver and netperf (or other applications) to particular + CPUs will have a significant difference in performance measurements. + You may need to experiment which CPU to bind the application to in + order to achieve the best performance for your system. + + If you are developing an application designed for 10Gb networking, + please keep in mind you may want to look at kernel functions + sched_setaffinity & sched_getaffinity to bind your application. + + If you are just running user-space applications such as ftp, telnet, + etc., you may want to try the runon tool provided by Tim Hockin's + procstate utility. You could also try binding the interface to a + particular CPU: runon 0 ifup eth0 + + +SUPPORT +======= + + If you have problems with the software or hardware, please contact our + customer support team via email at support@chelsio.com or check our website + at http://www.chelsio.com + +=============================================================================== + + Chelsio Communications + 370 San Aleso Ave. + Suite 100 + Sunnyvale, CA 94085 + http://www.chelsio.com + +This program is free software; you can redistribute it and/or modify +it under the terms of the GNU General Public License, version 2, as +published by the Free Software Foundation. + +You should have received a copy of the GNU General Public License along +with this program; if not, write to the Free Software Foundation, Inc., +59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + +THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED +WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. + + Copyright (c) 2003-2005 Chelsio Communications. All rights reserved. + +=============================================================================== diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index ab65714d95fc..65895bb51414 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -309,7 +309,7 @@ tcp_tso_win_divisor - INTEGER can be consumed by a single TSO frame. The setting of this parameter is a choice between burstiness and building larger TSO frames. - Default: 8 + Default: 3 tcp_frto - BOOLEAN Enables F-RTO, an enhanced recovery algorithm for TCP retransmission @@ -355,10 +355,14 @@ ip_dynaddr - BOOLEAN Default: 0 icmp_echo_ignore_all - BOOLEAN + If set non-zero, then the kernel will ignore all ICMP ECHO + requests sent to it. + Default: 0 + icmp_echo_ignore_broadcasts - BOOLEAN - If either is set to true, then the kernel will ignore either all - ICMP ECHO requests sent to it or just those to broadcast/multicast - addresses, respectively. + If set non-zero, then the kernel will ignore all ICMP ECHO and + TIMESTAMP requests sent to it via broadcast/multicast. + Default: 1 icmp_ratelimit - INTEGER Limit the maximal rates for sending ICMP packets whose type matches diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt new file mode 100644 index 000000000000..29ccae409031 --- /dev/null +++ b/Documentation/networking/phy.txt @@ -0,0 +1,288 @@ + +------- +PHY Abstraction Layer +(Updated 2005-07-21) + +Purpose + + Most network devices consist of set of registers which provide an interface + to a MAC layer, which communicates with the physical connection through a + PHY. The PHY concerns itself with negotiating link parameters with the link + partner on the other side of the network connection (typically, an ethernet + cable), and provides a register interface to allow drivers to determine what + settings were chosen, and to configure what settings are allowed. + + While these devices are distinct from the network devices, and conform to a + standard layout for the registers, it has been common practice to integrate + the PHY management code with the network driver. This has resulted in large + amounts of redundant code. Also, on embedded systems with multiple (and + sometimes quite different) ethernet controllers connected to the same + management bus, it is difficult to ensure safe use of the bus. + + Since the PHYs are devices, and the management busses through which they are + accessed are, in fact, busses, the PHY Abstraction Layer treats them as such. + In doing so, it has these goals: + + 1) Increase code-reuse + 2) Increase overall code-maintainability + 3) Speed development time for new network drivers, and for new systems + + Basically, this layer is meant to provide an interface to PHY devices which + allows network driver writers to write as little code as possible, while + still providing a full feature set. + +The MDIO bus + + Most network devices are connected to a PHY by means of a management bus. + Different devices use different busses (though some share common interfaces). + In order to take advantage of the PAL, each bus interface needs to be + registered as a distinct device. + + 1) read and write functions must be implemented. Their prototypes are: + + int write(struct mii_bus *bus, int mii_id, int regnum, u16 value); + int read(struct mii_bus *bus, int mii_id, int regnum); + + mii_id is the address on the bus for the PHY, and regnum is the register + number. These functions are guaranteed not to be called from interrupt + time, so it is safe for them to block, waiting for an interrupt to signal + the operation is complete + + 2) A reset function is necessary. This is used to return the bus to an + initialized state. + + 3) A probe function is needed. This function should set up anything the bus + driver needs, setup the mii_bus structure, and register with the PAL using + mdiobus_register. Similarly, there's a remove function to undo all of + that (use mdiobus_unregister). + + 4) Like any driver, the device_driver structure must be configured, and init + exit functions are used to register the driver. + + 5) The bus must also be declared somewhere as a device, and registered. + + As an example for how one driver implemented an mdio bus driver, see + drivers/net/gianfar_mii.c and arch/ppc/syslib/mpc85xx_devices.c + +Connecting to a PHY + + Sometime during startup, the network driver needs to establish a connection + between the PHY device, and the network device. At this time, the PHY's bus + and drivers need to all have been loaded, so it is ready for the connection. + At this point, there are several ways to connect to the PHY: + + 1) The PAL handles everything, and only calls the network driver when + the link state changes, so it can react. + + 2) The PAL handles everything except interrupts (usually because the + controller has the interrupt registers). + + 3) The PAL handles everything, but checks in with the driver every second, + allowing the network driver to react first to any changes before the PAL + does. + + 4) The PAL serves only as a library of functions, with the network device + manually calling functions to update status, and configure the PHY + + +Letting the PHY Abstraction Layer do Everything + + If you choose option 1 (The hope is that every driver can, but to still be + useful to drivers that can't), connecting to the PHY is simple: + + First, you need a function to react to changes in the link state. This + function follows this protocol: + + static void adjust_link(struct net_device *dev); + + Next, you need to know the device name of the PHY connected to this device. + The name will look something like, "phy0:0", where the first number is the + bus id, and the second is the PHY's address on that bus. + + Now, to connect, just call this function: + + phydev = phy_connect(dev, phy_name, &adjust_link, flags); + + phydev is a pointer to the phy_device structure which represents the PHY. If + phy_connect is successful, it will return the pointer. dev, here, is the + pointer to your net_device. Once done, this function will have started the + PHY's software state machine, and registered for the PHY's interrupt, if it + has one. The phydev structure will be populated with information about the + current state, though the PHY will not yet be truly operational at this + point. + + flags is a u32 which can optionally contain phy-specific flags. + This is useful if the system has put hardware restrictions on + the PHY/controller, of which the PHY needs to be aware. + + Now just make sure that phydev->supported and phydev->advertising have any + values pruned from them which don't make sense for your controller (a 10/100 + controller may be connected to a gigabit capable PHY, so you would need to + mask off SUPPORTED_1000baseT*). See include/linux/ethtool.h for definitions + for these bitfields. Note that you should not SET any bits, or the PHY may + get put into an unsupported state. + + Lastly, once the controller is ready to handle network traffic, you call + phy_start(phydev). This tells the PAL that you are ready, and configures the + PHY to connect to the network. If you want to handle your own interrupts, + just set phydev->irq to PHY_IGNORE_INTERRUPT before you call phy_start. + Similarly, if you don't want to use interrupts, set phydev->irq to PHY_POLL. + + When you want to disconnect from the network (even if just briefly), you call + phy_stop(phydev). + +Keeping Close Tabs on the PAL + + It is possible that the PAL's built-in state machine needs a little help to + keep your network device and the PHY properly in sync. If so, you can + register a helper function when connecting to the PHY, which will be called + every second before the state machine reacts to any changes. To do this, you + need to manually call phy_attach() and phy_prepare_link(), and then call + phy_start_machine() with the second argument set to point to your special + handler. + + Currently there are no examples of how to use this functionality, and testing + on it has been limited because the author does not have any drivers which use + it (they all use option 1). So Caveat Emptor. + +Doing it all yourself + + There's a remote chance that the PAL's built-in state machine cannot track + the complex interactions between the PHY and your network device. If this is + so, you can simply call phy_attach(), and not call phy_start_machine or + phy_prepare_link(). This will mean that phydev->state is entirely yours to + handle (phy_start and phy_stop toggle between some of the states, so you + might need to avoid them). + + An effort has been made to make sure that useful functionality can be + accessed without the state-machine running, and most of these functions are + descended from functions which did not interact with a complex state-machine. + However, again, no effort has been made so far to test running without the + state machine, so tryer beware. + + Here is a brief rundown of the functions: + + int phy_read(struct phy_device *phydev, u16 regnum); + int phy_write(struct phy_device *phydev, u16 regnum, u16 val); + + Simple read/write primitives. They invoke the bus's read/write function + pointers. + + void phy_print_status(struct phy_device *phydev); + + A convenience function to print out the PHY status neatly. + + int phy_clear_interrupt(struct phy_device *phydev); + int phy_config_interrupt(struct phy_device *phydev, u32 interrupts); + + Clear the PHY's interrupt, and configure which ones are allowed, + respectively. Currently only supports all on, or all off. + + int phy_enable_interrupts(struct phy_device *phydev); + int phy_disable_interrupts(struct phy_device *phydev); + + Functions which enable/disable PHY interrupts, clearing them + before and after, respectively. + + int phy_start_interrupts(struct phy_device *phydev); + int phy_stop_interrupts(struct phy_device *phydev); + + Requests the IRQ for the PHY interrupts, then enables them for + start, or disables then frees them for stop. + + struct phy_device * phy_attach(struct net_device *dev, const char *phy_id, + u32 flags); + + Attaches a network device to a particular PHY, binding the PHY to a generic + driver if none was found during bus initialization. Passes in + any phy-specific flags as needed. + + int phy_start_aneg(struct phy_device *phydev); + + Using variables inside the phydev structure, either configures advertising + and resets autonegotiation, or disables autonegotiation, and configures + forced settings. + + static inline int phy_read_status(struct phy_device *phydev); + + Fills the phydev structure with up-to-date information about the current + settings in the PHY. + + void phy_sanitize_settings(struct phy_device *phydev) + + Resolves differences between currently desired settings, and + supported settings for the given PHY device. Does not make + the changes in the hardware, though. + + int phy_ethtool_sset(struct phy_device *phydev, struct ethtool_cmd *cmd); + int phy_ethtool_gset(struct phy_device *phydev, struct ethtool_cmd *cmd); + + Ethtool convenience functions. + + int phy_mii_ioctl(struct phy_device *phydev, + struct mii_ioctl_data *mii_data, int cmd); + + The MII ioctl. Note that this function will completely screw up the state + machine if you write registers like BMCR, BMSR, ADVERTISE, etc. Best to + use this only to write registers which are not standard, and don't set off + a renegotiation. + + +PHY Device Drivers + + With the PHY Abstraction Layer, adding support for new PHYs is + quite easy. In some cases, no work is required at all! However, + many PHYs require a little hand-holding to get up-and-running. + +Generic PHY driver + + If the desired PHY doesn't have any errata, quirks, or special + features you want to support, then it may be best to not add + support, and let the PHY Abstraction Layer's Generic PHY Driver + do all of the work. + +Writing a PHY driver + + If you do need to write a PHY driver, the first thing to do is + make sure it can be matched with an appropriate PHY device. + This is done during bus initialization by reading the device's + UID (stored in registers 2 and 3), then comparing it to each + driver's phy_id field by ANDing it with each driver's + phy_id_mask field. Also, it needs a name. Here's an example: + + static struct phy_driver dm9161_driver = { + .phy_id = 0x0181b880, + .name = "Davicom DM9161E", + .phy_id_mask = 0x0ffffff0, + ... + } + + Next, you need to specify what features (speed, duplex, autoneg, + etc) your PHY device and driver support. Most PHYs support + PHY_BASIC_FEATURES, but you can look in include/mii.h for other + features. + + Each driver consists of a number of function pointers: + + config_init: configures PHY into a sane state after a reset. + For instance, a Davicom PHY requires descrambling disabled. + probe: Does any setup needed by the driver + suspend/resume: power management + config_aneg: Changes the speed/duplex/negotiation settings + read_status: Reads the current speed/duplex/negotiation settings + ack_interrupt: Clear a pending interrupt + config_intr: Enable or disable interrupts + remove: Does any driver take-down + + Of these, only config_aneg and read_status are required to be + assigned by the driver code. The rest are optional. Also, it is + preferred to use the generic phy driver's versions of these two + functions if at all possible: genphy_read_status and + genphy_config_aneg. If this is not possible, it is likely that + you only need to perform some actions before and after invoking + these functions, and so your functions will wrap the generic + ones. + + Feel free to look at the Marvell, Cicada, and Davicom drivers in + drivers/net/phy/ for examples (the lxt and qsemi drivers have + not been tested as of this writing) diff --git a/Documentation/networking/wan-router.txt b/Documentation/networking/wan-router.txt index aea20cd2a56e..c96897aa08b6 100644 --- a/Documentation/networking/wan-router.txt +++ b/Documentation/networking/wan-router.txt @@ -355,7 +355,7 @@ REVISION HISTORY There is no functional difference between the two packages 2.0.7 Aug 26, 1999 o Merged X25API code into WANPIPE. - o Fixed a memeory leak for X25API + o Fixed a memory leak for X25API o Updated the X25API code for 2.2.X kernels. o Improved NEM handling. @@ -514,7 +514,7 @@ beta2-2.2.0 Jan 8 2001 o Patches for 2.4.0 kernel o Patches for 2.2.18 kernel o Minor updates to PPP and CHLDC drivers. - Note: No functinal difference. + Note: No functional difference. beta3-2.2.9 Jan 10 2001 o I missed the 2.2.18 kernel patches in beta2-2.2.0 diff --git a/Documentation/oops-tracing.txt b/Documentation/oops-tracing.txt index da711028e5f7..66eaaab7773d 100644 --- a/Documentation/oops-tracing.txt +++ b/Documentation/oops-tracing.txt @@ -205,8 +205,8 @@ Phone: 701-234-7556 Tainted kernels: Some oops reports contain the string 'Tainted: ' after the program -counter, this indicates that the kernel has been tainted by some -mechanism. The string is followed by a series of position sensitive +counter. This indicates that the kernel has been tainted by some +mechanism. The string is followed by a series of position-sensitive characters, each representing a particular tainted value. 1: 'G' if all modules loaded have a GPL or compatible license, 'P' if @@ -214,16 +214,25 @@ characters, each representing a particular tainted value. MODULE_LICENSE or with a MODULE_LICENSE that is not recognised by insmod as GPL compatible are assumed to be proprietary. - 2: 'F' if any module was force loaded by insmod -f, ' ' if all + 2: 'F' if any module was force loaded by "insmod -f", ' ' if all modules were loaded normally. 3: 'S' if the oops occurred on an SMP kernel running on hardware that - hasn't been certified as safe to run multiprocessor. - Currently this occurs only on various Athlons that are not - SMP capable. + hasn't been certified as safe to run multiprocessor. + Currently this occurs only on various Athlons that are not + SMP capable. + + 4: 'R' if a module was force unloaded by "rmmod -f", ' ' if all + modules were unloaded normally. + + 5: 'M' if any processor has reported a Machine Check Exception, + ' ' if no Machine Check Exceptions have occurred. + + 6: 'B' if a page-release function has found a bad page reference or + some unexpected page flags. The primary reason for the 'Tainted: ' string is to tell kernel debuggers if this is a clean kernel or if anything unusual has -occurred. Tainting is permanent, even if an offending module is -unloading the tainted value remains to indicate that the kernel is not +occurred. Tainting is permanent: even if an offending module is +unloaded, the tainted value remains to indicate that the kernel is not trustworthy. diff --git a/Documentation/pci.txt b/Documentation/pci.txt index 62b1dc5d97e2..711210b38f5f 100644 --- a/Documentation/pci.txt +++ b/Documentation/pci.txt @@ -84,7 +84,7 @@ Each entry consists of: Most drivers don't need to use the driver_data field. Best practice for use of driver_data is to use it as an index into a static list of -equivalant device types, not to use it as a pointer. +equivalent device types, not to use it as a pointer. Have a table entry {PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID} to have probe() called for every PCI device known to the system. @@ -266,20 +266,6 @@ port an old driver to the new PCI interface. They are no longer present in the kernel as they aren't compatible with hotplug or PCI domains or having sane locking. -pcibios_present() and Since ages, you don't need to test presence -pci_present() of PCI subsystem when trying to talk to it. - If it's not there, the list of PCI devices - is empty and all functions for searching for - devices just return NULL. -pcibios_(read|write)_* Superseded by their pci_(read|write)_* - counterparts. -pcibios_find_* Superseded by their pci_get_* counterparts. -pci_for_each_dev() Superseded by pci_get_device() -pci_for_each_dev_reverse() Superseded by pci_find_device_reverse() -pci_for_each_bus() Superseded by pci_find_next_bus() pci_find_device() Superseded by pci_get_device() pci_find_subsys() Superseded by pci_get_subsys() pci_find_slot() Superseded by pci_get_slot() -pcibios_find_class() Superseded by pci_get_class() -pci_find_class() Superseded by pci_get_class() -pci_(read|write)_*_nodev() Superseded by pci_bus_(read|write)_*() diff --git a/Documentation/pm.txt b/Documentation/pm.txt index cc63ae18d147..2ea1149bf6b0 100644 --- a/Documentation/pm.txt +++ b/Documentation/pm.txt @@ -38,6 +38,12 @@ system the associated daemon will exit gracefully. Driver Interface -- OBSOLETE, DO NOT USE! ----------------************************* + +Note: pm_register(), pm_access(), pm_dev_idle() and friends are +obsolete. Please do not use them. Instead you should properly hook +your driver into the driver model, and use its suspend()/resume() +callbacks to do this kind of stuff. + If you are writing a new driver or maintaining an old driver, it should include power management support. Without power management support, a single driver may prevent a system with power management diff --git a/Documentation/power/swsusp-dmcrypt.txt b/Documentation/power/swsusp-dmcrypt.txt new file mode 100644 index 000000000000..59931b46ff7e --- /dev/null +++ b/Documentation/power/swsusp-dmcrypt.txt @@ -0,0 +1,138 @@ +Author: Andreas Steinmetz <ast@domdv.de> + + +How to use dm-crypt and swsusp together: +======================================== + +Some prerequisites: +You know how dm-crypt works. If not, visit the following web page: +http://www.saout.de/misc/dm-crypt/ +You have read Documentation/power/swsusp.txt and understand it. +You did read Documentation/initrd.txt and know how an initrd works. +You know how to create or how to modify an initrd. + +Now your system is properly set up, your disk is encrypted except for +the swap device(s) and the boot partition which may contain a mini +system for crypto setup and/or rescue purposes. You may even have +an initrd that does your current crypto setup already. + +At this point you want to encrypt your swap, too. Still you want to +be able to suspend using swsusp. This, however, means that you +have to be able to either enter a passphrase or that you read +the key(s) from an external device like a pcmcia flash disk +or an usb stick prior to resume. So you need an initrd, that sets +up dm-crypt and then asks swsusp to resume from the encrypted +swap device. + +The most important thing is that you set up dm-crypt in such +a way that the swap device you suspend to/resume from has +always the same major/minor within the initrd as well as +within your running system. The easiest way to achieve this is +to always set up this swap device first with dmsetup, so that +it will always look like the following: + +brw------- 1 root root 254, 0 Jul 28 13:37 /dev/mapper/swap0 + +Now set up your kernel to use /dev/mapper/swap0 as the default +resume partition, so your kernel .config contains: + +CONFIG_PM_STD_PARTITION="/dev/mapper/swap0" + +Prepare your boot loader to use the initrd you will create or +modify. For lilo the simplest setup looks like the following +lines: + +image=/boot/vmlinuz +initrd=/boot/initrd.gz +label=linux +append="root=/dev/ram0 init=/linuxrc rw" + +Finally you need to create or modify your initrd. Lets assume +you create an initrd that reads the required dm-crypt setup +from a pcmcia flash disk card. The card is formatted with an ext2 +fs which resides on /dev/hde1 when the card is inserted. The +card contains at least the encrypted swap setup in a file +named "swapkey". /etc/fstab of your initrd contains something +like the following: + +/dev/hda1 /mnt ext3 ro 0 0 +none /proc proc defaults,noatime,nodiratime 0 0 +none /sys sysfs defaults,noatime,nodiratime 0 0 + +/dev/hda1 contains an unencrypted mini system that sets up all +of your crypto devices, again by reading the setup from the +pcmcia flash disk. What follows now is a /linuxrc for your +initrd that allows you to resume from encrypted swap and that +continues boot with your mini system on /dev/hda1 if resume +does not happen: + +#!/bin/sh +PATH=/sbin:/bin:/usr/sbin:/usr/bin +mount /proc +mount /sys +mapped=0 +noresume=`grep -c noresume /proc/cmdline` +if [ "$*" != "" ] +then + noresume=1 +fi +dmesg -n 1 +/sbin/cardmgr -q +for i in 1 2 3 4 5 6 7 8 9 0 +do + if [ -f /proc/ide/hde/media ] + then + usleep 500000 + mount -t ext2 -o ro /dev/hde1 /mnt + if [ -f /mnt/swapkey ] + then + dmsetup create swap0 /mnt/swapkey > /dev/null 2>&1 && mapped=1 + fi + umount /mnt + break + fi + usleep 500000 +done +killproc /sbin/cardmgr +dmesg -n 6 +if [ $mapped = 1 ] +then + if [ $noresume != 0 ] + then + mkswap /dev/mapper/swap0 > /dev/null 2>&1 + fi + echo 254:0 > /sys/power/resume + dmsetup remove swap0 +fi +umount /sys +mount /mnt +umount /proc +cd /mnt +pivot_root . mnt +mount /proc +umount -l /mnt +umount /proc +exec chroot . /sbin/init $* < dev/console > dev/console 2>&1 + +Please don't mind the weird loop above, busybox's msh doesn't know +the let statement. Now, what is happening in the script? +First we have to decide if we want to try to resume, or not. +We will not resume if booting with "noresume" or any parameters +for init like "single" or "emergency" as boot parameters. + +Then we need to set up dmcrypt with the setup data from the +pcmcia flash disk. If this succeeds we need to reset the swap +device if we don't want to resume. The line "echo 254:0 > /sys/power/resume" +then attempts to resume from the first device mapper device. +Note that it is important to set the device in /sys/power/resume, +regardless if resuming or not, otherwise later suspend will fail. +If resume starts, script execution terminates here. + +Otherwise we just remove the encrypted swap device and leave it to the +mini system on /dev/hda1 to set the whole crypto up (it is up to +you to modify this to your taste). + +What then follows is the well known process to change the root +file system and continue booting from there. I prefer to unmount +the initrd prior to continue booting but it is up to you to modify +this. diff --git a/Documentation/power/swsusp.txt b/Documentation/power/swsusp.txt index 7a6b78966459..b0d50840788e 100644 --- a/Documentation/power/swsusp.txt +++ b/Documentation/power/swsusp.txt @@ -1,22 +1,20 @@ -From kernel/suspend.c: +Some warnings, first. * BIG FAT WARNING ********************************************************* * - * If you have unsupported (*) devices using DMA... - * ...say goodbye to your data. - * * If you touch anything on disk between suspend and resume... * ...kiss your data goodbye. * - * If your disk driver does not support suspend... (IDE does) - * ...you'd better find out how to get along - * without your data. - * - * If you change kernel command line between suspend and resume... - * ...prepare for nasty fsck or worse. + * If you do resume from initrd after your filesystems are mounted... + * ...bye bye root partition. + * [this is actually same case as above] * - * If you change your hardware while system is suspended... - * ...well, it was not good idea. + * If you have unsupported (*) devices using DMA, you may have some + * problems. If your disk driver does not support suspend... (IDE does), + * it may cause some problems, too. If you change kernel command line + * between suspend and resume, it may do something wrong. If you change + * your hardware while system is suspended... well, it was not good idea; + * but it will probably only crash. * * (*) suspend/resume support is needed to make it safe. @@ -30,6 +28,13 @@ echo shutdown > /sys/power/disk; echo disk > /sys/power/state echo platform > /sys/power/disk; echo disk > /sys/power/state +Encrypted suspend image: +------------------------ +If you want to store your suspend image encrypted with a temporary +key to prevent data gathering after resume you must compile +crypto and the aes algorithm into the kernel - modules won't work +as they cannot be loaded at resume time. + Article about goals and implementation of Software Suspend for Linux ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -85,11 +90,6 @@ resume. You have your server on UPS. Power died, and UPS is indicating 30 seconds to failure. What do you do? Suspend to disk. -Ethernet card in your server died. You want to replace it. Your -server is not hotplug capable. What do you do? Suspend to disk, -replace ethernet card, resume. If you are fast your users will not -even see broken connections. - Q: Maybe I'm missing something, but why don't the regular I/O paths work? @@ -117,31 +117,6 @@ Q: Does linux support ACPI S4? A: Yes. That's what echo platform > /sys/power/disk does. -Q: My machine doesn't work with ACPI. How can I use swsusp than ? - -A: Do a reboot() syscall with right parameters. Warning: glibc gets in -its way, so check with strace: - -reboot(LINUX_REBOOT_MAGIC1, LINUX_REBOOT_MAGIC2, 0xd000fce2) - -(Thanks to Peter Osterlund:) - -#include <unistd.h> -#include <syscall.h> - -#define LINUX_REBOOT_MAGIC1 0xfee1dead -#define LINUX_REBOOT_MAGIC2 672274793 -#define LINUX_REBOOT_CMD_SW_SUSPEND 0xD000FCE2 - -int main() -{ - syscall(SYS_reboot, LINUX_REBOOT_MAGIC1, LINUX_REBOOT_MAGIC2, - LINUX_REBOOT_CMD_SW_SUSPEND, 0); - return 0; -} - -Also /sys/ interface should be still present. - Q: What is 'suspend2'? A: suspend2 is 'Software Suspend 2', a forked implementation of @@ -311,3 +286,46 @@ As a rule of thumb use encrypted swap to protect your data while your system is shut down or suspended. Additionally use the encrypted suspend image to prevent sensitive data from being stolen after resume. + +Q: Why can't we suspend to a swap file? + +A: Because accessing swap file needs the filesystem mounted, and +filesystem might do something wrong (like replaying the journal) +during mount. + +There are few ways to get that fixed: + +1) Probably could be solved by modifying every filesystem to support +some kind of "really read-only!" option. Patches welcome. + +2) suspend2 gets around that by storing absolute positions in on-disk +image (and blocksize), with resume parameter pointing directly to +suspend header. + +Q: Is there a maximum system RAM size that is supported by swsusp? + +A: It should work okay with highmem. + +Q: Does swsusp (to disk) use only one swap partition or can it use +multiple swap partitions (aggregate them into one logical space)? + +A: Only one swap partition, sorry. + +Q: If my application(s) causes lots of memory & swap space to be used +(over half of the total system RAM), is it correct that it is likely +to be useless to try to suspend to disk while that app is running? + +A: No, it should work okay, as long as your app does not mlock() +it. Just prepare big enough swap partition. + +Q: What information is usefull for debugging suspend-to-disk problems? + +A: Well, last messages on the screen are always useful. If something +is broken, it is usually some kernel driver, therefore trying with as +little as possible modules loaded helps a lot. I also prefer people to +suspend from console, preferably without X running. Booting with +init=/bin/bash, then swapon and starting suspend sequence manually +usually does the trick. Then it is good idea to try with latest +vanilla kernel. + + diff --git a/Documentation/power/video.txt b/Documentation/power/video.txt index 7a4a5036d123..526d6dd267ea 100644 --- a/Documentation/power/video.txt +++ b/Documentation/power/video.txt @@ -46,6 +46,12 @@ There are a few types of systems where video works after S3 resume: POSTing bios works. Ole Rohne has patch to do just that at http://dev.gentoo.org/~marineam/patch-radeonfb-2.6.11-rc2-mm2. +(8) on some systems, you can use the video_post utility mentioned here: + http://bugzilla.kernel.org/show_bug.cgi?id=3670. Do echo 3 > /sys/power/state + && /usr/sbin/video_post - which will initialize the display in console mode. + If you are in X, you can switch to a virtual terminal and back to X using + CTRL+ALT+F1 - CTRL+ALT+F7 to get the display working in graphical mode again. + Now, if you pass acpi_sleep=something, and it does not work with your bios, you'll get a hard crash during resume. Be careful. Also it is safest to do your experiments with plain old VGA console. The vesafb @@ -64,7 +70,8 @@ Model hack (or "how to do it") ------------------------------------------------------------------------------ Acer Aspire 1406LC ole's late BIOS init (7), turn off DRI Acer TM 242FX vbetool (6) -Acer TM C300 vga=normal (only suspend on console, not in X), vbetool (6) +Acer TM C110 video_post (8) +Acer TM C300 vga=normal (only suspend on console, not in X), vbetool (6) or video_post (8) Acer TM 4052LCi s3_bios (2) Acer TM 636Lci s3_bios vga=normal (2) Acer TM 650 (Radeon M7) vga=normal plus boot-radeon (5) gets text console back @@ -113,6 +120,7 @@ IBM ThinkPad T42p (2373-GTG) s3_bios (2) IBM TP X20 ??? (*) IBM TP X30 s3_bios (2) IBM TP X31 / Type 2672-XXH none (1), use radeontool (http://fdd.com/software/radeon/) to turn off backlight. +IBM TP X32 none (1), but backlight is on and video is trashed after long suspend IBM Thinkpad X40 Type 2371-7JG s3_bios,s3_mode (4) Medion MD4220 ??? (*) Samsung P35 vbetool needed (6) diff --git a/Documentation/powerpc/eeh-pci-error-recovery.txt b/Documentation/powerpc/eeh-pci-error-recovery.txt index 2bfe71beec5b..e75d7474322c 100644 --- a/Documentation/powerpc/eeh-pci-error-recovery.txt +++ b/Documentation/powerpc/eeh-pci-error-recovery.txt @@ -134,7 +134,7 @@ pci_get_device_by_addr() will find the pci device associated with that address (if any). The default include/asm-ppc64/io.h macros readb(), inb(), insb(), -etc. include a check to see if the the i/o read returned all-0xff's. +etc. include a check to see if the i/o read returned all-0xff's. If so, these make a call to eeh_dn_check_failure(), which in turn asks the firmware if the all-ff's value is the sign of a true EEH error. If it is not, processing continues as normal. The grand diff --git a/Documentation/s390/s390dbf.txt b/Documentation/s390/s390dbf.txt index e24fdeada970..e321a8ed2a2d 100644 --- a/Documentation/s390/s390dbf.txt +++ b/Documentation/s390/s390dbf.txt @@ -468,7 +468,7 @@ The hex_ascii view shows the data field in hex and ascii representation The raw view returns a bytestream as the debug areas are stored in memory. The sprintf view formats the debug entries in the same way as the sprintf -function would do. The sprintf event/expection fuctions write to the +function would do. The sprintf event/expection functions write to the debug entry a pointer to the format string (size = sizeof(long)) and for each vararg a long value. So e.g. for a debug entry with a format string plus two varargs one would need to allocate a (3 * sizeof(long)) diff --git a/Documentation/scsi/00-INDEX b/Documentation/scsi/00-INDEX index f9cb5bdcce41..fef92ebf266f 100644 --- a/Documentation/scsi/00-INDEX +++ b/Documentation/scsi/00-INDEX @@ -60,6 +60,8 @@ scsi.txt - short blurb on using SCSI support as a module. scsi_mid_low_api.txt - info on API between SCSI layer and low level drivers +scsi_eh.txt + - info on SCSI midlayer error handling infrastructure st.txt - info on scsi tape driver sym53c500_cs.txt diff --git a/Documentation/scsi/aic7xxx.txt b/Documentation/scsi/aic7xxx.txt index 160e7354cd1e..47e74ddc4bc9 100644 --- a/Documentation/scsi/aic7xxx.txt +++ b/Documentation/scsi/aic7xxx.txt @@ -1,5 +1,5 @@ ==================================================================== -= Adaptec Aic7xxx Fast -> Ultra160 Family Manager Set v6.2.28 = += Adaptec Aic7xxx Fast -> Ultra160 Family Manager Set v7.0 = = README for = = The Linux Operating System = ==================================================================== @@ -131,6 +131,10 @@ The following information is available in this file: SCSI "stub" effects. 2. Version History + 7.0 (4th August, 2005) + - Updated driver to use SCSI transport class infrastructure + - Upported sequencer and core fixes from last adaptec released + version of the driver. 6.2.36 (June 3rd, 2003) - Correct code that disables PCI parity error checking. - Correct and simplify handling of the ignore wide residue diff --git a/Documentation/scsi/ibmmca.txt b/Documentation/scsi/ibmmca.txt index 2814491600ff..2ffb3ae0ef4d 100644 --- a/Documentation/scsi/ibmmca.txt +++ b/Documentation/scsi/ibmmca.txt @@ -344,7 +344,7 @@ /proc/scsi/ibmmca/<host_no>. ibmmca_proc_info() provides this information. This table is quite informative for interested users. It shows the load - of commands on the subsystem and wether you are running the bypassed + of commands on the subsystem and whether you are running the bypassed (software) or integrated (hardware) SCSI-command set (see below). The amount of accesses is shown. Read, write, modeselect is shown separately in order to help debugging problems with CD-ROMs or tapedrives. diff --git a/Documentation/scsi/scsi_eh.txt b/Documentation/scsi/scsi_eh.txt new file mode 100644 index 000000000000..534a50922a7b --- /dev/null +++ b/Documentation/scsi/scsi_eh.txt @@ -0,0 +1,479 @@ + +SCSI EH +====================================== + + This document describes SCSI midlayer error handling infrastructure. +Please refer to Documentation/scsi/scsi_mid_low_api.txt for more +information regarding SCSI midlayer. + +TABLE OF CONTENTS + +[1] How SCSI commands travel through the midlayer and to EH + [1-1] struct scsi_cmnd + [1-2] How do scmd's get completed? + [1-2-1] Completing a scmd w/ scsi_done + [1-2-2] Completing a scmd w/ timeout + [1-3] How EH takes over +[2] How SCSI EH works + [2-1] EH through fine-grained callbacks + [2-1-1] Overview + [2-1-2] Flow of scmds through EH + [2-1-3] Flow of control + [2-2] EH through hostt->eh_strategy_handler() + [2-2-1] Pre hostt->eh_strategy_handler() SCSI midlayer conditions + [2-2-2] Post hostt->eh_strategy_handler() SCSI midlayer conditions + [2-2-3] Things to consider + + +[1] How SCSI commands travel through the midlayer and to EH + +[1-1] struct scsi_cmnd + + Each SCSI command is represented with struct scsi_cmnd (== scmd). A +scmd has two list_head's to link itself into lists. The two are +scmd->list and scmd->eh_entry. The former is used for free list or +per-device allocated scmd list and not of much interest to this EH +discussion. The latter is used for completion and EH lists and unless +otherwise stated scmds are always linked using scmd->eh_entry in this +discussion. + + +[1-2] How do scmd's get completed? + + Once LLDD gets hold of a scmd, either the LLDD will complete the +command by calling scsi_done callback passed from midlayer when +invoking hostt->queuecommand() or SCSI midlayer will time it out. + + +[1-2-1] Completing a scmd w/ scsi_done + + For all non-EH commands, scsi_done() is the completion callback. It +does the following. + + 1. Delete timeout timer. If it fails, it means that timeout timer + has expired and is going to finish the command. Just return. + + 2. Link scmd to per-cpu scsi_done_q using scmd->en_entry + + 3. Raise SCSI_SOFTIRQ + + SCSI_SOFTIRQ handler scsi_softirq calls scsi_decide_disposition() to +determine what to do with the command. scsi_decide_disposition() +looks at the scmd->result value and sense data to determine what to do +with the command. + + - SUCCESS + scsi_finish_command() is invoked for the command. The + function does some maintenance choirs and notify completion by + calling scmd->done() callback, which, for fs requests, would + be HLD completion callback - sd:sd_rw_intr, sr:rw_intr, + st:st_intr. + + - NEEDS_RETRY + - ADD_TO_MLQUEUE + scmd is requeued to blk queue. + + - otherwise + scsi_eh_scmd_add(scmd, 0) is invoked for the command. See + [1-3] for details of this funciton. + + +[1-2-2] Completing a scmd w/ timeout + + The timeout handler is scsi_times_out(). When a timeout occurs, this +function + + 1. invokes optional hostt->eh_timedout() callback. Return value can + be one of + + - EH_HANDLED + This indicates that eh_timedout() dealt with the timeout. The + scmd is passed to __scsi_done() and thus linked into per-cpu + scsi_done_q. Normal command completion described in [1-2-1] + follows. + + - EH_RESET_TIMER + This indicates that more time is required to finish the + command. Timer is restarted. This action is counted as a + retry and only allowed scmd->allowed + 1(!) times. Once the + limit is reached, action for EH_NOT_HANDLED is taken instead. + + *NOTE* This action is racy as the LLDD could finish the scmd + after the timeout has expired but before it's added back. In + such cases, scsi_done() would think that timeout has occurred + and return without doing anything. We lose completion and the + command will time out again. + + - EH_NOT_HANDLED + This is the same as when eh_timedout() callback doesn't exist. + Step #2 is taken. + + 2. scsi_eh_scmd_add(scmd, SCSI_EH_CANCEL_CMD) is invoked for the + command. See [1-3] for more information. + + +[1-3] How EH takes over + + scmds enter EH via scsi_eh_scmd_add(), which does the following. + + 1. Turns on scmd->eh_eflags as requested. It's 0 for error + completions and SCSI_EH_CANCEL_CMD for timeouts. + + 2. Links scmd->eh_entry to shost->eh_cmd_q + + 3. Sets SHOST_RECOVERY bit in shost->shost_state + + 4. Increments shost->host_failed + + 5. Wakes up SCSI EH thread if shost->host_busy == shost->host_failed + + As can be seen above, once any scmd is added to shost->eh_cmd_q, +SHOST_RECOVERY shost_state bit is turned on. This prevents any new +scmd to be issued from blk queue to the host; eventually, all scmds on +the host either complete normally, fail and get added to eh_cmd_q, or +time out and get added to shost->eh_cmd_q. + + If all scmds either complete or fail, the number of in-flight scmds +becomes equal to the number of failed scmds - i.e. shost->host_busy == +shost->host_failed. This wakes up SCSI EH thread. So, once woken up, +SCSI EH thread can expect that all in-flight commands have failed and +are linked on shost->eh_cmd_q. + + Note that this does not mean lower layers are quiescent. If a LLDD +completed a scmd with error status, the LLDD and lower layers are +assumed to forget about the scmd at that point. However, if a scmd +has timed out, unless hostt->eh_timedout() made lower layers forget +about the scmd, which currently no LLDD does, the command is still +active as long as lower layers are concerned and completion could +occur at any time. Of course, all such completions are ignored as the +timer has already expired. + + We'll talk about how SCSI EH takes actions to abort - make LLDD +forget about - timed out scmds later. + + +[2] How SCSI EH works + + LLDD's can implement SCSI EH actions in one of the following two +ways. + + - Fine-grained EH callbacks + LLDD can implement fine-grained EH callbacks and let SCSI + midlayer drive error handling and call appropriate callbacks. + This will be dicussed further in [2-1]. + + - eh_strategy_handler() callback + This is one big callback which should perform whole error + handling. As such, it should do all choirs SCSI midlayer + performs during recovery. This will be discussed in [2-2]. + + Once recovery is complete, SCSI EH resumes normal operation by +calling scsi_restart_operations(), which + + 1. Checks if door locking is needed and locks door. + + 2. Clears SHOST_RECOVERY shost_state bit + + 3. Wakes up waiters on shost->host_wait. This occurs if someone + calls scsi_block_when_processing_errors() on the host. + (*QUESTION* why is it needed? All operations will be blocked + anyway after it reaches blk queue.) + + 4. Kicks queues in all devices on the host in the asses + + +[2-1] EH through fine-grained callbacks + +[2-1-1] Overview + + If eh_strategy_handler() is not present, SCSI midlayer takes charge +of driving error handling. EH's goals are two - make LLDD, host and +device forget about timed out scmds and make them ready for new +commands. A scmd is said to be recovered if the scmd is forgotten by +lower layers and lower layers are ready to process or fail the scmd +again. + + To achieve these goals, EH performs recovery actions with increasing +severity. Some actions are performed by issueing SCSI commands and +others are performed by invoking one of the following fine-grained +hostt EH callbacks. Callbacks may be omitted and omitted ones are +considered to fail always. + +int (* eh_abort_handler)(struct scsi_cmnd *); +int (* eh_device_reset_handler)(struct scsi_cmnd *); +int (* eh_bus_reset_handler)(struct scsi_cmnd *); +int (* eh_host_reset_handler)(struct scsi_cmnd *); + + Higher-severity actions are taken only when lower-severity actions +cannot recover some of failed scmds. Also, note that failure of the +highest-severity action means EH failure and results in offlining of +all unrecovered devices. + + During recovery, the following rules are followed + + - Recovery actions are performed on failed scmds on the to do list, + eh_work_q. If a recovery action succeeds for a scmd, recovered + scmds are removed from eh_work_q. + + Note that single recovery action on a scmd can recover multiple + scmds. e.g. resetting a device recovers all failed scmds on the + device. + + - Higher severity actions are taken iff eh_work_q is not empty after + lower severity actions are complete. + + - EH reuses failed scmds to issue commands for recovery. For + timed-out scmds, SCSI EH ensures that LLDD forgets about a scmd + before reusing it for EH commands. + + When a scmd is recovered, the scmd is moved from eh_work_q to EH +local eh_done_q using scsi_eh_finish_cmd(). After all scmds are +recovered (eh_work_q is empty), scsi_eh_flush_done_q() is invoked to +either retry or error-finish (notify upper layer of failure) recovered +scmds. + + scmds are retried iff its sdev is still online (not offlined during +EH), REQ_FAILFAST is not set and ++scmd->retries is less than +scmd->allowed. + + +[2-1-2] Flow of scmds through EH + + 1. Error completion / time out + ACTION: scsi_eh_scmd_add() is invoked for scmd + - set scmd->eh_eflags + - add scmd to shost->eh_cmd_q + - set SHOST_RECOVERY + - shost->host_failed++ + LOCKING: shost->host_lock + + 2. EH starts + ACTION: move all scmds to EH's local eh_work_q. shost->eh_cmd_q + is cleared. + LOCKING: shost->host_lock (not strictly necessary, just for + consistency) + + 3. scmd recovered + ACTION: scsi_eh_finish_cmd() is invoked to EH-finish scmd + - shost->host_failed-- + - clear scmd->eh_eflags + - scsi_setup_cmd_retry() + - move from local eh_work_q to local eh_done_q + LOCKING: none + + 4. EH completes + ACTION: scsi_eh_flush_done_q() retries scmds or notifies upper + layer of failure. + - scmd is removed from eh_done_q and scmd->eh_entry is cleared + - if retry is necessary, scmd is requeued using + scsi_queue_insert() + - otherwise, scsi_finish_command() is invoked for scmd + LOCKING: queue or finish function performs appropriate locking + + +[2-1-3] Flow of control + + EH through fine-grained callbacks start from scsi_unjam_host(). + +<<scsi_unjam_host>> + + 1. Lock shost->host_lock, splice_init shost->eh_cmd_q into local + eh_work_q and unlock host_lock. Note that shost->eh_cmd_q is + cleared by this action. + + 2. Invoke scsi_eh_get_sense. + + <<scsi_eh_get_sense>> + + This action is taken for each error-completed + (!SCSI_EH_CANCEL_CMD) commands without valid sense data. Most + SCSI transports/LLDDs automatically acquire sense data on + command failures (autosense). Autosense is recommended for + performance reasons and as sense information could get out of + sync inbetween occurrence of CHECK CONDITION and this action. + + Note that if autosense is not supported, scmd->sense_buffer + contains invalid sense data when error-completing the scmd + with scsi_done(). scsi_decide_disposition() always returns + FAILED in such cases thus invoking SCSI EH. When the scmd + reaches here, sense data is acquired and + scsi_decide_disposition() is called again. + + 1. Invoke scsi_request_sense() which issues REQUEST_SENSE + command. If fails, no action. Note that taking no action + causes higher-severity recovery to be taken for the scmd. + + 2. Invoke scsi_decide_disposition() on the scmd + + - SUCCESS + scmd->retries is set to scmd->allowed preventing + scsi_eh_flush_done_q() from retrying the scmd and + scsi_eh_finish_cmd() is invoked. + + - NEEDS_RETRY + scsi_eh_finish_cmd() invoked + + - otherwise + No action. + + 3. If !list_empty(&eh_work_q), invoke scsi_eh_abort_cmds(). + + <<scsi_eh_abort_cmds>> + + This action is taken for each timed out command. + hostt->eh_abort_handler() is invoked for each scmd. The + handler returns SUCCESS if it has succeeded to make LLDD and + all related hardware forget about the scmd. + + If a timedout scmd is successfully aborted and the sdev is + either offline or ready, scsi_eh_finish_cmd() is invoked for + the scmd. Otherwise, the scmd is left in eh_work_q for + higher-severity actions. + + Note that both offline and ready status mean that the sdev is + ready to process new scmds, where processing also implies + immediate failing; thus, if a sdev is in one of the two + states, no further recovery action is needed. + + Device readiness is tested using scsi_eh_tur() which issues + TEST_UNIT_READY command. Note that the scmd must have been + aborted successfully before reusing it for TEST_UNIT_READY. + + 4. If !list_empty(&eh_work_q), invoke scsi_eh_ready_devs() + + <<scsi_eh_ready_devs>> + + This function takes four increasingly more severe measures to + make failed sdevs ready for new commands. + + 1. Invoke scsi_eh_stu() + + <<scsi_eh_stu>> + + For each sdev which has failed scmds with valid sense data + of which scsi_check_sense()'s verdict is FAILED, + START_STOP_UNIT command is issued w/ start=1. Note that + as we explicitly choose error-completed scmds, it is known + that lower layers have forgotten about the scmd and we can + reuse it for STU. + + If STU succeeds and the sdev is either offline or ready, + all failed scmds on the sdev are EH-finished with + scsi_eh_finish_cmd(). + + *NOTE* If hostt->eh_abort_handler() isn't implemented or + failed, we may still have timed out scmds at this point + and STU doesn't make lower layers forget about those + scmds. Yet, this function EH-finish all scmds on the sdev + if STU succeeds leaving lower layers in an inconsistent + state. It seems that STU action should be taken only when + a sdev has no timed out scmd. + + 2. If !list_empty(&eh_work_q), invoke scsi_eh_bus_device_reset(). + + <<scsi_eh_bus_device_reset>> + + This action is very similar to scsi_eh_stu() except that, + instead of issuing STU, hostt->eh_device_reset_handler() + is used. Also, as we're not issuing SCSI commands and + resetting clears all scmds on the sdev, there is no need + to choose error-completed scmds. + + 3. If !list_empty(&eh_work_q), invoke scsi_eh_bus_reset() + + <<scsi_eh_bus_reset>> + + hostt->eh_bus_reset_handler() is invoked for each channel + with failed scmds. If bus reset succeeds, all failed + scmds on all ready or offline sdevs on the channel are + EH-finished. + + 4. If !list_empty(&eh_work_q), invoke scsi_eh_host_reset() + + <<scsi_eh_host_reset>> + + This is the last resort. hostt->eh_host_reset_handler() + is invoked. If host reset succeeds, all failed scmds on + all ready or offline sdevs on the host are EH-finished. + + 5. If !list_empty(&eh_work_q), invoke scsi_eh_offline_sdevs() + + <<scsi_eh_offline_sdevs>> + + Take all sdevs which still have unrecovered scmds offline + and EH-finish the scmds. + + 5. Invoke scsi_eh_flush_done_q(). + + <<scsi_eh_flush_done_q>> + + At this point all scmds are recovered (or given up) and + put on eh_done_q by scsi_eh_finish_cmd(). This function + flushes eh_done_q by either retrying or notifying upper + layer of failure of the scmds. + + +[2-2] EH through hostt->eh_strategy_handler() + + hostt->eh_strategy_handler() is invoked in the place of +scsi_unjam_host() and it is responsible for whole recovery process. +On completion, the handler should have made lower layers forget about +all failed scmds and either ready for new commands or offline. Also, +it should perform SCSI EH maintenance choirs to maintain integrity of +SCSI midlayer. IOW, of the steps described in [2-1-2], all steps +except for #1 must be implemented by eh_strategy_handler(). + + +[2-2-1] Pre hostt->eh_strategy_handler() SCSI midlayer conditions + + The following conditions are true on entry to the handler. + + - Each failed scmd's eh_flags field is set appropriately. + + - Each failed scmd is linked on scmd->eh_cmd_q by scmd->eh_entry. + + - SHOST_RECOVERY is set. + + - shost->host_failed == shost->host_busy + + +[2-2-2] Post hostt->eh_strategy_handler() SCSI midlayer conditions + + The following conditions must be true on exit from the handler. + + - shost->host_failed is zero. + + - Each scmd's eh_eflags field is cleared. + + - Each scmd is in such a state that scsi_setup_cmd_retry() on the + scmd doesn't make any difference. + + - shost->eh_cmd_q is cleared. + + - Each scmd->eh_entry is cleared. + + - Either scsi_queue_insert() or scsi_finish_command() is called on + each scmd. Note that the handler is free to use scmd->retries and + ->allowed to limit the number of retries. + + +[2-2-3] Things to consider + + - Know that timed out scmds are still active on lower layers. Make + lower layers forget about them before doing anything else with + those scmds. + + - For consistency, when accessing/modifying shost data structure, + grab shost->host_lock. + + - On completion, each failed sdev must have forgotten about all + active scmds. + + - On completion, each failed sdev must be ready for new commands or + offline. + + +-- +Tejun Heo +htejun@gmail.com +11th September 2005 diff --git a/Documentation/scsi/scsi_mid_low_api.txt b/Documentation/scsi/scsi_mid_low_api.txt index 7536823c0cb1..44df89c9c049 100644 --- a/Documentation/scsi/scsi_mid_low_api.txt +++ b/Documentation/scsi/scsi_mid_low_api.txt @@ -373,13 +373,11 @@ Summary: scsi_activate_tcq - turn on tag command queueing scsi_add_device - creates new scsi device (lu) instance scsi_add_host - perform sysfs registration and SCSI bus scan. - scsi_add_timer - (re-)start timer on a SCSI command. scsi_adjust_queue_depth - change the queue depth on a SCSI device scsi_assign_lock - replace default host_lock with given lock scsi_bios_ptable - return copy of block device's partition table scsi_block_requests - prevent further commands being queued to given host scsi_deactivate_tcq - turn off tag command queueing - scsi_delete_timer - cancel timer on a SCSI command. scsi_host_alloc - return a new scsi_host instance whose refcount==1 scsi_host_get - increments Scsi_Host instance's refcount scsi_host_put - decrements Scsi_Host instance's refcount (free if 0) @@ -458,27 +456,6 @@ int scsi_add_host(struct Scsi_Host *shost, struct device * dev) /** - * scsi_add_timer - (re-)start timer on a SCSI command. - * @scmd: pointer to scsi command instance - * @timeout: duration of timeout in "jiffies" - * @complete: pointer to function to call if timeout expires - * - * Returns nothing - * - * Might block: no - * - * Notes: Each scsi command has its own timer, and as it is added - * to the queue, we set up the timer. When the command completes, - * we cancel the timer. An LLD can use this function to change - * the existing timeout value. - * - * Defined in: drivers/scsi/scsi_error.c - **/ -void scsi_add_timer(struct scsi_cmnd *scmd, int timeout, - void (*complete)(struct scsi_cmnd *)) - - -/** * scsi_adjust_queue_depth - allow LLD to change queue depth on a SCSI device * @sdev: pointer to SCSI device to change queue depth on * @tagged: 0 - no tagged queuing @@ -566,24 +543,6 @@ void scsi_deactivate_tcq(struct scsi_device *sdev, int depth) /** - * scsi_delete_timer - cancel timer on a SCSI command. - * @scmd: pointer to scsi command instance - * - * Returns 1 if able to cancel timer else 0 (i.e. too late or already - * cancelled). - * - * Might block: no [may in the future if it invokes del_timer_sync()] - * - * Notes: All commands issued by upper levels already have a timeout - * associated with them. An LLD can use this function to cancel the - * timer. - * - * Defined in: drivers/scsi/scsi_error.c - **/ -int scsi_delete_timer(struct scsi_cmnd *scmd) - - -/** * scsi_host_alloc - create a scsi host adapter instance and perform basic * initialization. * @sht: pointer to scsi host template diff --git a/Documentation/serial/driver b/Documentation/serial/driver index ac7eabbf662a..42ef9970bc86 100644 --- a/Documentation/serial/driver +++ b/Documentation/serial/driver @@ -111,24 +111,20 @@ hardware. Interrupts: locally disabled. This call must not sleep - stop_tx(port,tty_stop) + stop_tx(port) Stop transmitting characters. This might be due to the CTS line becoming inactive or the tty layer indicating we want - to stop transmission. + to stop transmission due to an XOFF character. - tty_stop: 1 if this call is due to the TTY layer issuing a - TTY stop to the driver (equiv to rs_stop). + The driver should stop transmitting characters as soon as + possible. Locking: port->lock taken. Interrupts: locally disabled. This call must not sleep - start_tx(port,tty_start) - start transmitting characters. (incidentally, nonempty will - always be nonzero, and shouldn't be used - it will be dropped). - - tty_start: 1 if this call was due to the TTY layer issuing - a TTY start to the driver (equiv to rs_start) + start_tx(port) + Start transmitting characters. Locking: port->lock taken. Interrupts: locally disabled. @@ -288,26 +284,31 @@ hardware. Other functions --------------- -uart_update_timeout(port,cflag,quot) +uart_update_timeout(port,cflag,baud) Update the FIFO drain timeout, port->timeout, according to the - number of bits, parity, stop bits and quotient. + number of bits, parity, stop bits and baud rate. Locking: caller is expected to take port->lock Interrupts: n/a -uart_get_baud_rate(port,termios) +uart_get_baud_rate(port,termios,old,min,max) Return the numeric baud rate for the specified termios, taking account of the special 38400 baud "kludge". The B0 baud rate is mapped to 9600 baud. + If the baud rate is not within min..max, then if old is non-NULL, + the original baud rate will be tried. If that exceeds the + min..max constraint, 9600 baud will be returned. termios will + be updated to the baud rate in use. + + Note: min..max must always allow 9600 baud to be selected. + Locking: caller dependent. Interrupts: n/a -uart_get_divisor(port,termios,oldtermios) - Return the divsor (baud_base / baud) for the selected baud rate - specified by termios. If the baud rate is out of range, try - the original baud rate specified by oldtermios (if non-NULL). - If that fails, try 9600 baud. +uart_get_divisor(port,baud) + Return the divsor (baud_base / baud) for the specified baud + rate, appropriately rounded. If 38400 baud and custom divisor is selected, return the custom divisor instead. @@ -315,6 +316,46 @@ uart_get_divisor(port,termios,oldtermios) Locking: caller dependent. Interrupts: n/a +uart_match_port(port1,port2) + This utility function can be used to determine whether two + uart_port structures describe the same port. + + Locking: n/a + Interrupts: n/a + +uart_write_wakeup(port) + A driver is expected to call this function when the number of + characters in the transmit buffer have dropped below a threshold. + + Locking: port->lock should be held. + Interrupts: n/a + +uart_register_driver(drv) + Register a uart driver with the core driver. We in turn register + with the tty layer, and initialise the core driver per-port state. + + drv->port should be NULL, and the per-port structures should be + registered using uart_add_one_port after this call has succeeded. + + Locking: none + Interrupts: enabled + +uart_unregister_driver() + Remove all references to a driver from the core driver. The low + level driver must have removed all its ports via the + uart_remove_one_port() if it registered them with uart_add_one_port(). + + Locking: none + Interrupts: enabled + +uart_suspend_port() + +uart_resume_port() + +uart_add_one_port() + +uart_remove_one_port() + Other notes ----------- diff --git a/Documentation/sonypi.txt b/Documentation/sonypi.txt index 0f3b2405d09e..c1237a925505 100644 --- a/Documentation/sonypi.txt +++ b/Documentation/sonypi.txt @@ -99,6 +99,7 @@ statically linked into the kernel). Those options are: SONYPI_MEYE_MASK 0x0400 SONYPI_MEMORYSTICK_MASK 0x0800 SONYPI_BATTERY_MASK 0x1000 + SONYPI_WIRELESS_MASK 0x2000 useinput: if set (which is the default) two input devices are created, one which interprets the jogdial events as @@ -137,6 +138,15 @@ Bugs: speed handling etc). Use ACPI instead of APM if it works on your laptop. + - sonypi lacks the ability to distinguish between certain key + events on some models. + + - some models with the nvidia card (geforce go 6200 tc) uses a + different way to adjust the backlighting of the screen. There + is a userspace utility to adjust the brightness on those models, + which can be downloaded from + http://www.acc.umu.se/~erikw/program/smartdimmer-0.1.tar.bz2 + - since all development was done by reverse engineering, there is _absolutely no guarantee_ that this driver will not crash your laptop. Permanently. diff --git a/Documentation/sound/alsa/ALSA-Configuration.txt b/Documentation/sound/alsa/ALSA-Configuration.txt index a18ecb92b356..13cba955cb5a 100644 --- a/Documentation/sound/alsa/ALSA-Configuration.txt +++ b/Documentation/sound/alsa/ALSA-Configuration.txt @@ -75,7 +75,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. adsp_map - PCM device number maps assigned to the 2st OSS device. - Default: 1 nonblock_open - - Don't block opening busy PCM devices. + - Don't block opening busy PCM devices. Default: 1 For example, when dsp_map=2, /dev/dsp will be mapped to PCM #2 of the card #0. Similarly, when adsp_map=0, /dev/adsp will be mapped @@ -132,6 +132,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. mpu_irq - IRQ # for MPU-401 UART (PnP setup) dma1 - first DMA # for AD1816A chip (PnP setup) dma2 - second DMA # for AD1816A chip (PnP setup) + clockfreq - Clock frequency for AD1816A chip (default = 0, 33000Hz) Module supports up to 8 cards, autoprobe and PnP. @@ -147,6 +148,16 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. Module supports up to 8 cards. This module does not support autoprobe thus main port must be specified!!! Other ports are optional. + Module snd-ad1889 + ----------------- + + Module for Analog Devices AD1889 chips. + + ac97_quirk - AC'97 workaround for strange hardware + See the description of intel8x0 module for details. + + This module supports up to 8 cards. + Module snd-ali5451 ------------------ @@ -188,15 +199,20 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. Module snd-atiixp ----------------- - Module for ATI IXP 150/200/250 AC97 controllers. + Module for ATI IXP 150/200/250/400 AC97 controllers. - ac97_clock - AC'97 clock (defalut = 48000) + ac97_clock - AC'97 clock (default = 48000) ac97_quirk - AC'97 workaround for strange hardware - See the description of intel8x0 module for details. + See "AC97 Quirk Option" section below. spdif_aclink - S/PDIF transfer over AC-link (default = 1) This module supports up to 8 cards and autoprobe. + ATI IXP has two different methods to control SPDIF output. One is + over AC-link and another is over the "direct" SPDIF output. The + implementation depends on the motherboard, and you'll need to + choose the correct one via spdif_aclink module option. + Module snd-atiixp-modem ----------------------- @@ -229,7 +245,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. The hardware EQ hardware and SPDIF is only present in the Vortex2 and Advantage. - Note: Some ALSA mixer applicactions don't handle the SPDIF samplerate + Note: Some ALSA mixer applications don't handle the SPDIF sample rate control correctly. If you have problems regarding this, try another ALSA compliant mixer (alsamixer works). @@ -301,7 +317,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. mpu_port - 0x300,0x310,0x320,0x330, 0 = disable (default) fm_port - 0x388 (default), 0 = disable (default) - soft_ac3 - Sofware-conversion of raw SPDIF packets (model 033 only) + soft_ac3 - Software-conversion of raw SPDIF packets (model 033 only) (default = 1) joystick_port - Joystick port address (0 = disable, 1 = auto-detect) @@ -383,7 +399,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. Module for PCI sound cards based on CS4610/CS4612/CS4614/CS4615/CS4622/ CS4624/CS4630/CS4280 PCI chips. - external_amp - Force to enable external amplifer. + external_amp - Force to enable external amplifier. thinkpad - Force to enable Thinkpad's CLKRUN control. mmap_valid - Support OSS mmap mode (default = 0). @@ -619,7 +635,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. VIA VT8251/VT8237A model - force the model name - position_fix - Fix DMA pointer (0 = FIFO size, 1 = none, 2 = POSBUF) + position_fix - Fix DMA pointer (0 = auto, 1 = none, 2 = POSBUF, 3 = FIFO size) Module supports up to 8 cards. @@ -655,6 +671,11 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. allout 5-jack in back, 2-jack in front, SPDIF out auto auto-config reading BIOS (default) + If the default configuration doesn't work and one of the above + matches with your device, report it together with the PCI + subsystem ID (output of "lspci -nv") to ALSA BTS or alsa-devel + ML (see the section "Links and Addresses"). + Note 2: If you get click noises on output, try the module option position_fix=1 or 2. position_fix=1 will use the SD_LPIB register value without FIFO size correction as the current @@ -782,20 +803,13 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. ac97_clock - AC'97 codec clock base (0 = auto-detect) ac97_quirk - AC'97 workaround for strange hardware - The following strings are accepted: - default = don't override the default setting - disable = disable the quirk - hp_only = use headphone control as master - swap_hp = swap headphone and master controls - swap_surround = swap master and surround controls - ad_sharing = for AD1985, turn on OMS bit and use headphone - alc_jack = for ALC65x, turn on the jack sense mode - inv_eapd = inverted EAPD implementation - mute_led = bind EAPD bit for turning on/off mute LED - For backward compatibility, the corresponding integer - value -1, 0, ... are accepted, too. + See "AC97 Quirk Option" section below. buggy_irq - Enable workaround for buggy interrupts on some - motherboards (default off) + motherboards (default yes on nForce chips, + otherwise off) + buggy_semaphore - Enable workaround for hardwares with buggy + semaphores (e.g. on some ASUS laptops) + (default off) Module supports autoprobe and multiple bus-master chips (max 8). @@ -807,13 +821,6 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. motherboard has these devices, use the ns558 or snd-mpu401 modules, respectively. - The ac97_quirk option is used to enable/override the workaround - for specific devices. Some hardware have swapped output pins - between Master and Headphone, or Surround. The driver provides - the auto-detection of known problematic devices, but some might - be unknown or wrongly detected. In such a case, pass the proper - value with this option. - The power-management is supported. Module snd-intel8x0m @@ -965,7 +972,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. with machines with other (most likely CS423x or OPL3SAx) chips, even though the device is detected in lspci. In such a case, try other drivers, e.g. snd-cs4232 or snd-opl3sa2. Some has ISA-PnP - but some doesn't have ISA PnP. You'll need to speicfy isapnp=0 + but some doesn't have ISA PnP. You'll need to specify isapnp=0 and proper hardware parameters in the case without ISA PnP. Note: some laptops need a workaround for AC97 RESET. For the @@ -1301,7 +1308,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. channels [VIA8233/C, 8235, 8237 only] ac97_quirk - AC'97 workaround for strange hardware - See the description of intel8x0 module for details. + See "AC97 Quirk Option" section below. Module supports autoprobe and multiple bus-master chips (max 8). @@ -1326,16 +1333,17 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. "lspci -nv"). If dxs_support=5 does not work, try dxs_support=4; if it doesn't work too, try dxs_support=1. (dxs_support=1 is - usually for old motherboards. The correct implementated + usually for old motherboards. The correct implemented board should work with 4 or 5.) If it still doesn't work and the default setting is ok, dxs_support=3 is the right choice. If the default setting doesn't work at all, try dxs_support=2 to disable the DXS channels. In any cases, please let us know the result and the - subsystem vendor/device ids. + subsystem vendor/device ids. See "Links and Addresses" + below. Note: for the MPU401 on VIA823x, use snd-mpu401 driver - additonally. The mpu_port option is for VIA686 chips only. + additionally. The mpu_port option is for VIA686 chips only. Module snd-via82xx-modem ------------------------ @@ -1397,8 +1405,10 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. Module supports up to 8 cards. The module is compiled only when PCMCIA is supported on kernel. - To activate the driver via the card manager, you'll need to set - up /etc/pcmcia/vxpocket.conf. See the sound/pcmcia/vx/vxpocket.c. + With the older 2.6.x kernel, to activate the driver via the card + manager, you'll need to set up /etc/pcmcia/vxpocket.conf. See the + sound/pcmcia/vx/vxpocket.c. 2.6.13 or later kernel requires no + longer require a config file. When the driver is compiled as a module and the hotplug firmware is supported, the firmware data is loaded via hotplug automatically. @@ -1410,6 +1420,9 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. Note: the driver is build only when CONFIG_ISA is set. + Note2: snd-vxp440 driver is merged to snd-vxpocket driver since + ALSA 1.0.10. + Module snd-ymfpci ----------------- @@ -1435,6 +1448,37 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. Note: the driver is build only when CONFIG_ISA is set. +AC97 Quirk Option +================= + +The ac97_quirk option is used to enable/override the workaround for +specific devices on drivers for on-board AC'97 controllers like +snd-intel8x0. Some hardware have swapped output pins between Master +and Headphone, or Surround (thanks to confusion of AC'97 +specifications from version to version :-) + +The driver provides the auto-detection of known problematic devices, +but some might be unknown or wrongly detected. In such a case, pass +the proper value with this option. + +The following strings are accepted: + - default Don't override the default setting + - disable Disable the quirk + - hp_only Bind Master and Headphone controls as a single control + - swap_hp Swap headphone and master controls + - swap_surround Swap master and surround controls + - ad_sharing For AD1985, turn on OMS bit and use headphone + - alc_jack For ALC65x, turn on the jack sense mode + - inv_eapd Inverted EAPD implementation + - mute_led Bind EAPD bit for turning on/off mute LED + +For backward compatibility, the corresponding integer value -1, 0, +... are accepted, too. + +For example, if "Master" volume control has no effect on your device +but only "Headphone" does, pass ac97_quirk=hp_only module option. + + Configuring Non-ISAPNP Cards ============================ @@ -1458,7 +1502,7 @@ devices where %i is sound card number from zero to seven. To auto-load an ALSA driver for OSS services, define the string 'sound-slot-%i' where %i means the slot number for OSS, which corresponds to the card index of ALSA. Usually, define this -as the the same card module. +as the same card module. An example configuration for a single emu10k1 card is like below: ----- /etc/modprobe.conf @@ -1552,6 +1596,8 @@ Proc interfaces (/proc/asound) - whole-frag write only whole fragments (optimization affecting playback only) - no-silence do not fill silence ahead to avoid clicks + - buggy-ptr Returns the whitespace blocks in GETOPTR ioctl + instead of filled blocks Example: echo "x11amp 128 16384" > /proc/asound/card0/pcm0p/oss echo "squake 0 0 disable" > /proc/asound/card0/pcm0c/oss @@ -1588,9 +1634,14 @@ commands to the snd-page-alloc driver: use. -Links -===== +Links and Addresses +=================== ALSA project homepage http://www.alsa-project.org + ALSA Bug Tracking System + https://bugtrack.alsa-project.org/bugs/ + + ALSA Developers ML + mailto:alsa-devel@lists.sourceforge.net diff --git a/Documentation/sound/alsa/DocBook/writing-an-alsa-driver.tmpl b/Documentation/sound/alsa/DocBook/writing-an-alsa-driver.tmpl index db0b7d2dc477..24e85520890b 100644 --- a/Documentation/sound/alsa/DocBook/writing-an-alsa-driver.tmpl +++ b/Documentation/sound/alsa/DocBook/writing-an-alsa-driver.tmpl @@ -447,7 +447,7 @@ .... /* allocate a chip-specific data with zero filled */ - chip = kcalloc(1, sizeof(*chip), GFP_KERNEL); + chip = kzalloc(sizeof(*chip), GFP_KERNEL); if (chip == NULL) return -ENOMEM; @@ -949,7 +949,7 @@ After allocating a card instance via <function>snd_card_new()</function> (with <constant>NULL</constant> on the 4th arg), call - <function>kcalloc()</function>. + <function>kzalloc()</function>. <informalexample> <programlisting> @@ -958,7 +958,7 @@ mychip_t *chip; card = snd_card_new(index[dev], id[dev], THIS_MODULE, NULL); ..... - chip = kcalloc(1, sizeof(*chip), GFP_KERNEL); + chip = kzalloc(sizeof(*chip), GFP_KERNEL); ]]> </programlisting> </informalexample> @@ -1136,7 +1136,7 @@ return -ENXIO; } - chip = kcalloc(1, sizeof(*chip), GFP_KERNEL); + chip = kzalloc(sizeof(*chip), GFP_KERNEL); if (chip == NULL) { pci_disable_device(pci); return -ENOMEM; @@ -1292,7 +1292,7 @@ need to initialize this number as -1 before actual allocation, since irq 0 is valid. The port address and its resource pointer can be initialized as null by - <function>kcalloc()</function> automatically, so you + <function>kzalloc()</function> automatically, so you don't have to take care of resetting them. </para> @@ -3422,10 +3422,17 @@ struct _snd_pcm_runtime { <para> The <structfield>iface</structfield> field specifies the type of - the control, - <constant>SNDRV_CTL_ELEM_IFACE_XXX</constant>. There are - <constant>MIXER</constant>, <constant>PCM</constant>, - <constant>CARD</constant>, etc. + the control, <constant>SNDRV_CTL_ELEM_IFACE_XXX</constant>, which + is usually <constant>MIXER</constant>. + Use <constant>CARD</constant> for global controls that are not + logically part of the mixer. + If the control is closely associated with some specific device on + the sound card, use <constant>HWDEP</constant>, + <constant>PCM</constant>, <constant>RAWMIDI</constant>, + <constant>TIMER</constant>, or <constant>SEQUENCER</constant>, and + specify the device number with the + <structfield>device</structfield> and + <structfield>subdevice</structfield> fields. </para> <para> diff --git a/Documentation/sparse.txt b/Documentation/sparse.txt index f97841478459..1829009db771 100644 --- a/Documentation/sparse.txt +++ b/Documentation/sparse.txt @@ -51,13 +51,13 @@ or you don't get any checking at all. Where to get sparse ~~~~~~~~~~~~~~~~~~~ -With BK, you can just get it from +With git, you can just get it from - bk://sparse.bkbits.net/sparse + rsync://rsync.kernel.org/pub/scm/devel/sparse/sparse.git and DaveJ has tar-balls at - http://www.codemonkey.org.uk/projects/bitkeeper/sparse/ + http://www.codemonkey.org.uk/projects/git-snapshots/sparse/ Once you have it, just do diff --git a/Documentation/sysrq.txt b/Documentation/sysrq.txt index 136d817c01ba..baf17b381588 100644 --- a/Documentation/sysrq.txt +++ b/Documentation/sysrq.txt @@ -171,7 +171,7 @@ the header 'include/linux/sysrq.h', this will define everything else you need. Next, you must create a sysrq_key_op struct, and populate it with A) the key handler function you will use, B) a help_msg string, that will print when SysRQ prints help, and C) an action_msg string, that will print right before your -handler is called. Your handler must conform to the protoype in 'sysrq.h'. +handler is called. Your handler must conform to the prototype in 'sysrq.h'. After the sysrq_key_op is created, you can call the macro register_sysrq_key(int key, struct sysrq_key_op *op_p) that is defined in diff --git a/Documentation/uml/UserModeLinux-HOWTO.txt b/Documentation/uml/UserModeLinux-HOWTO.txt index 0c7b654fec99..544430e39980 100644 --- a/Documentation/uml/UserModeLinux-HOWTO.txt +++ b/Documentation/uml/UserModeLinux-HOWTO.txt @@ -2176,7 +2176,7 @@ If you want to access files on the host machine from inside UML, you can treat it as a separate machine and either nfs mount directories from the host or copy files into the virtual machine with scp or rcp. - However, since UML is running on the the host, it can access those + However, since UML is running on the host, it can access those files just like any other process and make them available inside the virtual machine without needing to use the network. diff --git a/Documentation/usb/URB.txt b/Documentation/usb/URB.txt index d59b95cc6f1b..a49e5f2c2b46 100644 --- a/Documentation/usb/URB.txt +++ b/Documentation/usb/URB.txt @@ -1,5 +1,6 @@ Revised: 2000-Dec-05. Again: 2002-Jul-06 +Again: 2005-Sep-19 NOTE: @@ -18,8 +19,8 @@ called USB Request Block, or URB for short. and deliver the data and status back. - Execution of an URB is inherently an asynchronous operation, i.e. the - usb_submit_urb(urb) call returns immediately after it has successfully queued - the requested action. + usb_submit_urb(urb) call returns immediately after it has successfully + queued the requested action. - Transfers for one URB can be canceled with usb_unlink_urb(urb) at any time. @@ -94,8 +95,9 @@ To free an URB, use void usb_free_urb(struct urb *urb) -You may not free an urb that you've submitted, but which hasn't yet been -returned to you in a completion callback. +You may free an urb that you've submitted, but which hasn't yet been +returned to you in a completion callback. It will automatically be +deallocated when it is no longer in use. 1.4. What has to be filled in? @@ -145,30 +147,36 @@ to get seamless ISO streaming. 1.6. How to cancel an already running URB? -For an URB which you've submitted, but which hasn't been returned to -your driver by the host controller, call +There are two ways to cancel an URB you've submitted but which hasn't +been returned to your driver yet. For an asynchronous cancel, call int usb_unlink_urb(struct urb *urb) It removes the urb from the internal list and frees all allocated -HW descriptors. The status is changed to reflect unlinking. After -usb_unlink_urb() returns with that status code, you can free the URB -with usb_free_urb(). +HW descriptors. The status is changed to reflect unlinking. Note +that the URB will not normally have finished when usb_unlink_urb() +returns; you must still wait for the completion handler to be called. -There is also an asynchronous unlink mode. To use this, set the -the URB_ASYNC_UNLINK flag in urb->transfer flags before calling -usb_unlink_urb(). When using async unlinking, the URB will not -normally be unlinked when usb_unlink_urb() returns. Instead, wait -for the completion handler to be called. +To cancel an URB synchronously, call + + void usb_kill_urb(struct urb *urb) + +It does everything usb_unlink_urb does, and in addition it waits +until after the URB has been returned and the completion handler +has finished. It also marks the URB as temporarily unusable, so +that if the completion handler or anyone else tries to resubmit it +they will get a -EPERM error. Thus you can be sure that when +usb_kill_urb() returns, the URB is totally idle. 1.7. What about the completion handler? The handler is of the following type: - typedef void (*usb_complete_t)(struct urb *); + typedef void (*usb_complete_t)(struct urb *, struct pt_regs *) -i.e. it gets just the URB that caused the completion call. +I.e., it gets the URB that caused the completion call, plus the +register values at the time of the corresponding interrupt (if any). In the completion handler, you should have a look at urb->status to detect any USB errors. Since the context parameter is included in the URB, you can pass information to the completion handler. @@ -176,17 +184,11 @@ you can pass information to the completion handler. Note that even when an error (or unlink) is reported, data may have been transferred. That's because USB transfers are packetized; it might take sixteen packets to transfer your 1KByte buffer, and ten of them might -have transferred succesfully before the completion is called. +have transferred succesfully before the completion was called. NOTE: ***** WARNING ***** -Don't use urb->dev field in your completion handler; it's cleared -as part of giving urbs back to drivers. (Addressing an issue with -ownership of periodic URBs, which was otherwise ambiguous.) Instead, -use urb->context to hold all the data your driver needs. - -NOTE: ***** WARNING ***** -Also, NEVER SLEEP IN A COMPLETION HANDLER. These are normally called +NEVER SLEEP IN A COMPLETION HANDLER. These are normally called during hardware interrupt processing. If you can, defer substantial work to a tasklet (bottom half) to keep system latencies low. You'll probably need to use spinlocks to protect data structures you manipulate @@ -229,24 +231,10 @@ ISO data with some other event stream. Interrupt transfers, like isochronous transfers, are periodic, and happen in intervals that are powers of two (1, 2, 4 etc) units. Units are frames for full and low speed devices, and microframes for high speed ones. - -Currently, after you submit one interrupt URB, that urb is owned by the -host controller driver until you cancel it with usb_unlink_urb(). You -may unlink interrupt urbs in their completion handlers, if you need to. - -After a transfer completion is called, the URB is automagically resubmitted. -THIS BEHAVIOR IS EXPECTED TO BE REMOVED!! - -Interrupt transfers may only send (or receive) the "maxpacket" value for -the given interrupt endpoint; if you need more data, you will need to -copy that data out of (or into) another buffer. Similarly, you can't -queue interrupt transfers. -THESE RESTRICTIONS ARE EXPECTED TO BE REMOVED!! - -Note that this automagic resubmission model does make it awkward to use -interrupt OUT transfers. The portable solution involves unlinking those -OUT urbs after the data is transferred, and perhaps submitting a final -URB for a short packet. - The usb_submit_urb() call modifies urb->interval to the implemented interval value that is less than or equal to the requested interval value. + +In Linux 2.6, unlike earlier versions, interrupt URBs are not automagically +restarted when they complete. They end when the completion handler is +called, just like other URBs. If you want an interrupt URB to be restarted, +your completion handler must resubmit it. diff --git a/Documentation/usb/gadget_serial.txt b/Documentation/usb/gadget_serial.txt index a938c3dd13d6..815f5c2301ff 100644 --- a/Documentation/usb/gadget_serial.txt +++ b/Documentation/usb/gadget_serial.txt @@ -20,7 +20,7 @@ License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA. -This document and the the gadget serial driver itself are +This document and the gadget serial driver itself are Copyright (C) 2004 by Al Borchers (alborchers@steinerpoint.com). If you have questions, problems, or suggestions for this driver diff --git a/Documentation/usb/proc_usb_info.txt b/Documentation/usb/proc_usb_info.txt index 729c72d34c89..f86550fe38ee 100644 --- a/Documentation/usb/proc_usb_info.txt +++ b/Documentation/usb/proc_usb_info.txt @@ -20,7 +20,7 @@ the /proc/bus/usb/BBB/DDD files. to /etc/fstab. This will mount usbfs at each reboot. You can then issue `cat /proc/bus/usb/devices` to extract - USB device information, and user mode drivers can use usbfs + USB device information, and user mode drivers can use usbfs to interact with USB devices. There are a number of mount options supported by usbfs. @@ -32,7 +32,7 @@ the /proc/bus/usb/BBB/DDD files. still see references to the older "usbdevfs" name. For more information on mounting the usbfs file system, see the -"USB Device Filesystem" section of the USB Guide. The latest copy +"USB Device Filesystem" section of the USB Guide. The latest copy of the USB Guide can be found at http://www.linux-usb.org/ @@ -133,7 +133,7 @@ B: Alloc=ddd/ddd us (xx%), #Int=ddd, #Iso=ddd are the only transfers that reserve bandwidth. Control and bulk transfers use all other bandwidth, including reserved bandwidth that is not used for transfers (such as for short packets). - + The percentage is how much of the "reserved" bandwidth is scheduled by those transfers. For a low or full speed bus (loosely, "USB 1.1"), 90% of the bus bandwidth is reserved. For a high speed bus (loosely, @@ -197,7 +197,7 @@ C:* #Ifs=dd Cfg#=dd Atr=xx MPwr=dddmA | | |__NumberOfInterfaces | |__ "*" indicates the active configuration (others are " ") |__Config info tag - + USB devices may have multiple configurations, each of which act rather differently. For example, a bus-powered configuration might be much less capable than one that is self-powered. Only @@ -228,7 +228,7 @@ I: If#=dd Alt=dd #EPs=dd Cls=xx(sssss) Sub=xx Prot=xx Driver=ssss For example, default settings may not use more than a small amount of periodic bandwidth. To use significant fractions of bus bandwidth, drivers must select a non-default altsetting. - + Only one setting for an interface may be active at a time, and only one driver may bind to an interface at a time. Most devices have only one alternate setting per interface. @@ -297,18 +297,21 @@ S: SerialNumber=dce0 C:* #Ifs= 1 Cfg#= 1 Atr=40 MxPwr= 0mA I: If#= 0 Alt= 0 #EPs= 1 Cls=09(hub ) Sub=00 Prot=00 Driver=hub E: Ad=81(I) Atr=03(Int.) MxPS= 8 Ivl=255ms + T: Bus=00 Lev=01 Prnt=01 Port=00 Cnt=01 Dev#= 2 Spd=12 MxCh= 4 D: Ver= 1.00 Cls=09(hub ) Sub=00 Prot=00 MxPS= 8 #Cfgs= 1 P: Vendor=0451 ProdID=1446 Rev= 1.00 C:* #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=100mA I: If#= 0 Alt= 0 #EPs= 1 Cls=09(hub ) Sub=00 Prot=00 Driver=hub E: Ad=81(I) Atr=03(Int.) MxPS= 1 Ivl=255ms + T: Bus=00 Lev=02 Prnt=02 Port=00 Cnt=01 Dev#= 3 Spd=1.5 MxCh= 0 D: Ver= 1.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs= 1 P: Vendor=04b4 ProdID=0001 Rev= 0.00 C:* #Ifs= 1 Cfg#= 1 Atr=80 MxPwr=100mA I: If#= 0 Alt= 0 #EPs= 1 Cls=03(HID ) Sub=01 Prot=02 Driver=mouse E: Ad=81(I) Atr=03(Int.) MxPS= 3 Ivl= 10ms + T: Bus=00 Lev=02 Prnt=02 Port=02 Cnt=02 Dev#= 4 Spd=12 MxCh= 0 D: Ver= 1.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS= 8 #Cfgs= 1 P: Vendor=0565 ProdID=0001 Rev= 1.08 diff --git a/Documentation/usb/usbmon.txt b/Documentation/usb/usbmon.txt index f1896ee3bb2a..63cb7edd177e 100644 --- a/Documentation/usb/usbmon.txt +++ b/Documentation/usb/usbmon.txt @@ -102,7 +102,7 @@ Here is the list of words, from left to right: - URB Status. This field makes no sense for submissions, but is present to help scripts with parsing. In error case, it contains the error code. In case of a setup packet, it contains a Setup Tag. If scripts read a number - in this field, the proceed to read Data Length. Otherwise, they read + in this field, they proceed to read Data Length. Otherwise, they read the setup packet before reading the Data Length. - Setup packet, if present, consists of 5 words: one of each for bmRequestType, bRequest, wValue, wIndex, wLength, as specified by the USB Specification 2.0. diff --git a/Documentation/video4linux/CARDLIST.bttv b/Documentation/video4linux/CARDLIST.bttv index 62a12a08e2ac..ec785f9f15a3 100644 --- a/Documentation/video4linux/CARDLIST.bttv +++ b/Documentation/video4linux/CARDLIST.bttv @@ -126,10 +126,12 @@ card=124 - AverMedia AverTV DVB-T 761 card=125 - MATRIX Vision Sigma-SQ card=126 - MATRIX Vision Sigma-SLC card=127 - APAC Viewcomp 878(AMAX) -card=128 - DVICO FusionHDTV DVB-T Lite +card=128 - DViCO FusionHDTV DVB-T Lite card=129 - V-Gear MyVCD card=130 - Super TV Tuner card=131 - Tibet Systems 'Progress DVR' CS16 card=132 - Kodicom 4400R (master) card=133 - Kodicom 4400R (slave) card=134 - Adlink RTV24 +card=135 - DViCO FusionHDTV 5 Lite +card=136 - Acorp Y878F diff --git a/Documentation/video4linux/CARDLIST.saa7134 b/Documentation/video4linux/CARDLIST.saa7134 index 1b5a3a9ffbe2..dc57225f39be 100644 --- a/Documentation/video4linux/CARDLIST.saa7134 +++ b/Documentation/video4linux/CARDLIST.saa7134 @@ -62,3 +62,6 @@ 61 -> Philips TOUGH DVB-T reference design [1131:2004] 62 -> Compro VideoMate TV Gold+II 63 -> Kworld Xpert TV PVR7134 + 64 -> FlyTV mini Asus Digimatrix [1043:0210,1043:0210] + 65 -> V-Stream Studio TV Terminator + 66 -> Yuan TUN-900 (saa7135) diff --git a/Documentation/video4linux/CARDLIST.tuner b/Documentation/video4linux/CARDLIST.tuner index f3302e1b1b9c..f5876be658a6 100644 --- a/Documentation/video4linux/CARDLIST.tuner +++ b/Documentation/video4linux/CARDLIST.tuner @@ -64,3 +64,4 @@ tuner=62 - Philips TEA5767HN FM Radio tuner=63 - Philips FMD1216ME MK3 Hybrid Tuner tuner=64 - LG TDVS-H062F/TUA6034 tuner=65 - Ymec TVF66T5-B/DFF +tuner=66 - LG NTSC (TALN mini series) diff --git a/Documentation/video4linux/Zoran b/Documentation/video4linux/Zoran index 01425c21986b..52c94bd7dca1 100644 --- a/Documentation/video4linux/Zoran +++ b/Documentation/video4linux/Zoran @@ -222,7 +222,7 @@ was introduced in 1991, is used in the DC10 old can generate: PAL , NTSC , SECAM The adv717x, should be able to produce PAL N. But you find nothing PAL N -specific in the the registers. Seem that you have to reuse a other standard +specific in the registers. Seem that you have to reuse a other standard to generate PAL N, maybe it would work if you use the PAL M settings. ========================== diff --git a/Documentation/video4linux/bttv/Insmod-options b/Documentation/video4linux/bttv/Insmod-options index 7bb5a50b0779..fc94ff235ffa 100644 --- a/Documentation/video4linux/bttv/Insmod-options +++ b/Documentation/video4linux/bttv/Insmod-options @@ -44,6 +44,9 @@ bttv.o push used by bttv. bttv will disable overlay by default on this hardware to avoid crashes. With this insmod option you can override this. + no_overlay=1 Disable overlay. It should be used by broken + hardware that doesn't support PCI2PCI direct + transfers. automute=0/1 Automatically mutes the sound if there is no TV signal, on by default. You might try to disable this if you have bad input signal diff --git a/Documentation/vm/locking b/Documentation/vm/locking index c3ef09ae3bb1..f366fa956179 100644 --- a/Documentation/vm/locking +++ b/Documentation/vm/locking @@ -83,19 +83,18 @@ single address space optimization, so that the zap_page_range (from vmtruncate) does not lose sending ipi's to cloned threads that might be spawned underneath it and go to user mode to drag in pte's into tlbs. -swap_list_lock/swap_device_lock -------------------------------- +swap_lock +-------------- The swap devices are chained in priority order from the "swap_list" header. The "swap_list" is used for the round-robin swaphandle allocation strategy. The #free swaphandles is maintained in "nr_swap_pages". These two together -are protected by the swap_list_lock. +are protected by the swap_lock. -The swap_device_lock, which is per swap device, protects the reference -counts on the corresponding swaphandles, maintained in the "swap_map" -array, and the "highest_bit" and "lowest_bit" fields. +The swap_lock also protects all the device reference counts on the +corresponding swaphandles, maintained in the "swap_map" array, and the +"highest_bit" and "lowest_bit" fields. -Both of these are spinlocks, and are never acquired from intr level. The -locking hierarchy is swap_list_lock -> swap_device_lock. +The swap_lock is a spinlock, and is never acquired from intr level. To prevent races between swap space deletion or async readahead swapins deciding whether a swap handle is being used, ie worthy of being read in diff --git a/Documentation/watchdog/watchdog-api.txt b/Documentation/watchdog/watchdog-api.txt index 28388aa700c6..c5beb548cfc4 100644 --- a/Documentation/watchdog/watchdog-api.txt +++ b/Documentation/watchdog/watchdog-api.txt @@ -228,6 +228,26 @@ advantechwdt.c -- Advantech Single Board Computer The GETSTATUS call returns if the device is open or not. [FIXME -- silliness again?] +booke_wdt.c -- PowerPC BookE Watchdog Timer + + Timeout default varies according to frequency, supports + SETTIMEOUT + + Watchdog can not be turned off, CONFIG_WATCHDOG_NOWAYOUT + does not make sense + + GETSUPPORT returns the watchdog_info struct, and + GETSTATUS returns the supported options. GETBOOTSTATUS + returns a 1 if the last reset was caused by the + watchdog and a 0 otherwise. This watchdog can not be + disabled once it has been started. The wdt_period kernel + parameter selects which bit of the time base changing + from 0->1 will trigger the watchdog exception. Changing + the timeout from the ioctl calls will change the + wdt_period as defined above. Finally if you would like to + replace the default Watchdog Handler you can implement the + WatchdogHandler() function in your own code. + eurotechwdt.c -- Eurotech CPU-1220/1410 The timeout can be set using the SETTIMEOUT ioctl and defaults diff --git a/Documentation/x86_64/boot-options.txt b/Documentation/x86_64/boot-options.txt index 476c0c22fbb7..ffe1c062088b 100644 --- a/Documentation/x86_64/boot-options.txt +++ b/Documentation/x86_64/boot-options.txt @@ -6,6 +6,16 @@ only the AMD64 specific ones are listed here. Machine check mce=off disable machine check + mce=bootlog Enable logging of machine checks left over from booting. + Disabled by default because some BIOS leave bogus ones. + If your BIOS doesn't do that it's a good idea to enable though + to make sure you log even machine check events that result + in a reboot. + mce=tolerancelevel (number) + 0: always panic, 1: panic if deadlock possible, + 2: try to avoid panic, 3: never panic or exit (for testing) + default is 1 + Can be also set using sysfs which is preferable. nomce (for compatibility with i386): same as mce=off |