Diffstat (limited to 'Documentation')
-rw-r--r-- Documentation/CodingStyle 100
-rw-r--r-- Documentation/DocBook/kernel-api.tmpl 13
-rw-r--r-- Documentation/RCU/whatisRCU.txt 1
-rw-r--r-- Documentation/SubmitChecklist 57
-rw-r--r-- Documentation/devices.txt 135
-rw-r--r-- Documentation/feature-removal-schedule.txt 15
-rw-r--r-- Documentation/filesystems/Locking 9
-rw-r--r-- Documentation/filesystems/porting 7
-rw-r--r-- Documentation/filesystems/vfs.txt 6
-rw-r--r-- Documentation/ia64/aliasing.txt 208
-rw-r--r-- Documentation/ioctl-number.txt 2
-rw-r--r-- Documentation/kernel-parameters.txt 3
-rw-r--r-- Documentation/networking/tuntap.txt 11
-rw-r--r-- Documentation/power/swsusp.txt 45
-rw-r--r-- Documentation/power/video.txt 4
-rw-r--r-- Documentation/sparse.txt 36
-rw-r--r-- Documentation/sysctl/vm.txt 13
-rw-r--r-- Documentation/vm/page_migration 114
18 files changed, 624 insertions, 155 deletions
diff --git a/Documentation/CodingStyle b/Documentation/CodingStyle
index ce5d2c038cf5..6d2412ec91ed 100644
--- a/Documentation/CodingStyle
+++ b/Documentation/CodingStyle
@@ -155,7 +155,83 @@ problem, which is called the function-growth-hormone-imbalance syndrome.
See next chapter.
- Chapter 5: Functions
+ Chapter 5: Typedefs
+
+Please don't use things like "vps_t".
+
+It's a _mistake_ to use typedef for structures and pointers. When you see a
+
+ vps_t a;
+
+in the source, what does it mean?
+
+In contrast, if it says
+
+ struct virtual_container *a;
+
+you can actually tell what "a" is.
+
+Lots of people think that typedefs "help readability". Not so. They are
+useful only for:
+
+ (a) totally opaque objects (where the typedef is actively used to _hide_
+ what the object is).
+
+ Example: "pte_t" etc. opaque objects that you can only access using
+ the proper accessor functions.
+
+ NOTE! Opaqueness and "accessor functions" are not good in themselves.
+ The reason we have them for things like pte_t etc. is that there
+ really is absolutely _zero_ portably accessible information there.
+
+ (b) Clear integer types, where the abstraction _helps_ avoid confusion
+ whether it is "int" or "long".
+
+ u8/u16/u32 are perfectly fine typedefs, although they fit into
+ category (d) better than here.
+
+ NOTE! Again - there needs to be a _reason_ for this. If something is
+ "unsigned long", then there's no reason to do
+
+ typedef unsigned long myflags_t;
+
+ but if there is a clear reason for why it under certain circumstances
+ might be an "unsigned int" and under other configurations might be
+ "unsigned long", then by all means go ahead and use a typedef.
+
+ (c) when you use sparse to literally create a _new_ type for
+ type-checking.
+
+ (d) New types which are identical to standard C99 types, in certain
+ exceptional circumstances.
+
+ Although it would only take a short amount of time for the eyes and
+ brain to become accustomed to the standard types like 'uint32_t',
+ some people object to their use anyway.
+
+ Therefore, the Linux-specific 'u8/u16/u32/u64' types and their
+ signed equivalents which are identical to standard types are
+ permitted -- although they are not mandatory in new code of your
+ own.
+
+ When editing existing code which already uses one or the other set
+ of types, you should conform to the existing choices in that code.
+
+ (e) Types safe for use in userspace.
+
+ In certain structures which are visible to userspace, we cannot
+ require C99 types and cannot use the 'u32' form above. Thus, we
+ use __u32 and similar types in all structures which are shared
+ with userspace.
+
+Maybe there are other cases too, but the rule should basically be to NEVER
+EVER use a typedef unless you can clearly match one of those rules.
+
+In general, a pointer, or a struct that has elements that can reasonably
+be directly accessed should _never_ be a typedef.
+
+
+ Chapter 6: Functions
Functions should be short and sweet, and do just one thing. They should
fit on one or two screenfuls of text (the ISO/ANSI screen size is 80x24,
@@ -183,7 +259,7 @@ and it gets confused. You know you're brilliant, but maybe you'd like
to understand what you did 2 weeks from now.
- Chapter 6: Centralized exiting of functions
+ Chapter 7: Centralized exiting of functions
Albeit deprecated by some people, the equivalent of the goto statement is
used frequently by compilers in form of the unconditional jump instruction.
@@ -220,7 +296,7 @@ out:
return result;
}
- Chapter 7: Commenting
+ Chapter 8: Commenting
Comments are good, but there is also a danger of over-commenting. NEVER
try to explain HOW your code works in a comment: it's much better to
@@ -240,7 +316,7 @@ When commenting the kernel API functions, please use the kerneldoc format.
See the files Documentation/kernel-doc-nano-HOWTO.txt and scripts/kernel-doc
for details.
- Chapter 8: You've made a mess of it
+ Chapter 9: You've made a mess of it
That's OK, we all do. You've probably been told by your long-time Unix
user helper that "GNU emacs" automatically formats the C sources for
@@ -288,7 +364,7 @@ re-formatting you may want to take a look at the man page. But
remember: "indent" is not a fix for bad programming.
- Chapter 9: Configuration-files
+ Chapter 10: Configuration-files
For configuration options (arch/xxx/Kconfig, and all the Kconfig files),
somewhat different indentation is used.
@@ -313,7 +389,7 @@ support for file-systems, for instance) should be denoted (DANGEROUS), other
experimental options should be denoted (EXPERIMENTAL).
- Chapter 10: Data structures
+ Chapter 11: Data structures
Data structures that have visibility outside the single-threaded
environment they are created and destroyed in should always have
@@ -344,7 +420,7 @@ Remember: if another thread can find your data structure, and you don't
have a reference count on it, you almost certainly have a bug.
- Chapter 11: Macros, Enums and RTL
+ Chapter 12: Macros, Enums and RTL
Names of macros defining constants and labels in enums are capitalized.
@@ -399,7 +475,7 @@ The cpp manual deals with macros exhaustively. The gcc internals manual also
covers RTL which is used frequently with assembly language in the kernel.
- Chapter 12: Printing kernel messages
+ Chapter 13: Printing kernel messages
Kernel developers like to be seen as literate. Do mind the spelling
of kernel messages to make a good impression. Do not use crippled
@@ -410,7 +486,7 @@ Kernel messages do not have to be terminated with a period.
Printing numbers in parentheses (%d) adds no value and should be avoided.
- Chapter 13: Allocating memory
+ Chapter 14: Allocating memory
The kernel provides the following general purpose memory allocators:
kmalloc(), kzalloc(), kcalloc(), and vmalloc(). Please refer to the API
@@ -429,7 +505,7 @@ from void pointer to any other pointer type is guaranteed by the C programming
language.
- Chapter 14: The inline disease
+ Chapter 15: The inline disease
There appears to be a common misperception that gcc has a magic "make me
faster" speedup option called "inline". While the use of inlines can be
@@ -457,7 +533,7 @@ something it would have done anyway.
- Chapter 15: References
+ Appendix I: References
The C Programming Language, Second Edition
by Brian W. Kernighan and Dennis M. Ritchie.
@@ -481,4 +557,4 @@ Kernel CodingStyle, by greg@kroah.com at OLS 2002:
http://www.kroah.com/linux/talks/ols_2002_kernel_codingstyle_talk/html/
--
-Last updated on 30 December 2005 by a community effort on LKML.
+Last updated on 30 April 2006.
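Reviewer note on the new typedef chapter: the guidance can be illustrated with a small userspace sketch. Everything named here is hypothetical and exists only for illustration: the fields of struct virtual_container, the handle_t type, and the helper functions are assumptions, not kernel code.

```c
#include <stdint.h>

/* Spelled-out struct: the declaration itself tells the reader what "a" is. */
struct virtual_container {
	int id;
	uint32_t flags;	/* standard C99 fixed-width type */
};

/*
 * Category (a): a hypothetical opaque handle in the spirit of pte_t.
 * Callers are expected to go through the accessors below and never
 * touch the member directly.
 */
typedef struct { unsigned long val; } handle_t;

handle_t make_handle(unsigned long v)
{
	handle_t h = { v };
	return h;
}

unsigned long handle_val(handle_t h)
{
	return h.val;
}

/* Directly accessible members, so no typedef: take a struct pointer. */
int container_id(const struct virtual_container *a)
{
	return a->id;
}
```

The typedef is kept only for the handle, where hiding the representation is the point. The container stays a plain struct because its members can reasonably be accessed directly.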
diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl
index ca02e04a906c..31b727ceb127 100644
--- a/Documentation/DocBook/kernel-api.tmpl
+++ b/Documentation/DocBook/kernel-api.tmpl
@@ -117,6 +117,7 @@ X!Ilib/string.c
<chapter id="mm">
<title>Memory Management in Linux</title>
<sect1><title>The Slab Cache</title>
+!Iinclude/linux/slab.h
!Emm/slab.c
</sect1>
<sect1><title>User Space Memory Access</title>
@@ -331,6 +332,18 @@ X!Earch/i386/kernel/mca.c
!Esecurity/security.c
</chapter>
+ <chapter id="audit">
+ <title>Audit Interfaces</title>
+!Ekernel/audit.c
+!Ikernel/auditsc.c
+!Ikernel/auditfilter.c
+ </chapter>
+
+ <chapter id="accounting">
+ <title>Accounting Framework</title>
+!Ikernel/acct.c
+ </chapter>
+
<chapter id="pmfuncs">
<title>Power Management</title>
!Ekernel/power/pm.c
diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt
index 07cb93b82ba9..6e459420ee9f 100644
--- a/Documentation/RCU/whatisRCU.txt
+++ b/Documentation/RCU/whatisRCU.txt
@@ -790,7 +790,6 @@ RCU pointer update:
RCU grace period:
- synchronize_kernel (deprecated)
synchronize_net
synchronize_sched
synchronize_rcu
diff --git a/Documentation/SubmitChecklist b/Documentation/SubmitChecklist
new file mode 100644
index 000000000000..8230098da529
--- /dev/null
+++ b/Documentation/SubmitChecklist
@@ -0,0 +1,57 @@
+Linux Kernel patch submittal checklist
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Here are some basic things that developers should do if they
+want to see their kernel patch submittals accepted quicker.
+
+These are all above and beyond the documentation that is provided
+in Documentation/SubmittingPatches and elsewhere about submitting
+Linux kernel patches.
+
+
+
+- Builds cleanly with applicable or modified CONFIG options =y, =m, and =n.
+ No gcc warnings/errors, no linker warnings/errors.
+
+- Passes allnoconfig, allmodconfig
+
+- Builds on multiple CPU arch-es by using local cross-compile tools
+ or something like PLM at OSDL.
+
+- ppc64 is a good architecture for cross-compilation checking because it
+ tends to use `unsigned long' for 64-bit quantities.
+
+- Matches kernel coding style(!)
+
+- Any new or modified CONFIG options don't muck up the config menu.
+
+- All new Kconfig options have help text.
+
+- Has been carefully reviewed with respect to relevant Kconfig
+ combinations. This is very hard to get right with testing --
+ brainpower pays off here.
+
+- Check cleanly with sparse.
+
+- Use 'make checkstack' and 'make namespacecheck' and fix any
+ problems that they find. Note: checkstack does not point out
+ problems explicitly, but any one function that uses more than
+ 512 bytes on the stack is a candidate for change.
+
+- Include kernel-doc to document global kernel APIs. (Not required
+ for static functions, but OK there also.) Use 'make htmldocs'
+ or 'make mandocs' to check the kernel-doc and fix any issues.
+
+- Has been tested with CONFIG_PREEMPT, CONFIG_DEBUG_PREEMPT,
+ CONFIG_DEBUG_SLAB, CONFIG_DEBUG_PAGEALLOC, CONFIG_DEBUG_MUTEXES,
+ CONFIG_DEBUG_SPINLOCK, CONFIG_DEBUG_SPINLOCK_SLEEP all simultaneously
+ enabled.
+
+- Has been build- and runtime tested with and without CONFIG_SMP and
+ CONFIG_PREEMPT.
+
+- If the patch affects IO/Disk, etc: has been tested with and without
+ CONFIG_LBD.
+
+
+2006-APR-27
diff --git a/Documentation/devices.txt b/Documentation/devices.txt
index b369a8c46a73..b2f593fc76ca 100644
--- a/Documentation/devices.txt
+++ b/Documentation/devices.txt
@@ -3,7 +3,7 @@
Maintained by Torben Mathiasen <device@lanana.org>
- Last revised: 25 January 2005
+ Last revised: 01 March 2006
This list is the Linux Device List, the official registry of allocated
device numbers and /dev directory nodes for the Linux operating
@@ -94,7 +94,6 @@ Your cooperation is appreciated.
9 = /dev/urandom Faster, less secure random number gen.
10 = /dev/aio Asyncronous I/O notification interface
11 = /dev/kmsg Writes to this come out as printk's
- 12 = /dev/oldmem Access to crash dump from kexec kernel
1 block RAM disk
0 = /dev/ram0 First RAM disk
1 = /dev/ram1 Second RAM disk
@@ -262,13 +261,13 @@ Your cooperation is appreciated.
NOTE: These devices permit both read and write access.
7 block Loopback devices
- 0 = /dev/loop0 First loopback device
- 1 = /dev/loop1 Second loopback device
+ 0 = /dev/loop0 First loop device
+ 1 = /dev/loop1 Second loop device
...
- The loopback devices are used to mount filesystems not
+ The loop devices are used to mount filesystems not
associated with block devices. The binding to the
- loopback devices is handled by mount(8) or losetup(8).
+ loop devices is handled by mount(8) or losetup(8).
8 block SCSI disk devices (0-15)
0 = /dev/sda First SCSI disk whole disk
@@ -943,7 +942,7 @@ Your cooperation is appreciated.
240 = /dev/ftlp FTL on 16th Memory Technology Device
Partitions are handled in the same way as for IDE
- disks (see major number 3) expect that the partition
+ disks (see major number 3) except that the partition
limit is 15 rather than 63 per disk (same as SCSI.)
45 char isdn4linux ISDN BRI driver
@@ -1168,7 +1167,7 @@ Your cooperation is appreciated.
The filename of the encrypted container and the passwords
are sent via ioctls (using the sdmount tool) to the master
node which then activates them via one of the
- /dev/scramdisk/x nodes for loopback mounting (all handled
+ /dev/scramdisk/x nodes for loop mounting (all handled
through the sdmount tool).
Requested by: andy@scramdisklinux.org
@@ -2538,18 +2537,32 @@ Your cooperation is appreciated.
0 = /dev/usb/lp0 First USB printer
...
15 = /dev/usb/lp15 16th USB printer
- 16 = /dev/usb/mouse0 First USB mouse
- ...
- 31 = /dev/usb/mouse15 16th USB mouse
- 32 = /dev/usb/ez0 First USB firmware loader
- ...
- 47 = /dev/usb/ez15 16th USB firmware loader
48 = /dev/usb/scanner0 First USB scanner
...
63 = /dev/usb/scanner15 16th USB scanner
64 = /dev/usb/rio500 Diamond Rio 500
65 = /dev/usb/usblcd USBLCD Interface (info@usblcd.de)
66 = /dev/usb/cpad0 Synaptics cPad (mouse/LCD)
+ 96 = /dev/usb/hiddev0 1st USB HID device
+ ...
+ 111 = /dev/usb/hiddev15 16th USB HID device
+ 112 = /dev/usb/auer0 1st auerswald ISDN device
+ ...
+ 127 = /dev/usb/auer15 16th auerswald ISDN device
+ 128 = /dev/usb/brlvgr0 First Braille Voyager device
+ ...
+ 131 = /dev/usb/brlvgr3 Fourth Braille Voyager device
+ 132 = /dev/usb/idmouse ID Mouse (fingerprint scanner) device
+ 133 = /dev/usb/sisusbvga1 First SiSUSB VGA device
+ ...
+ 140 = /dev/usb/sisusbvga8 Eighth SiSUSB VGA device
+ 144 = /dev/usb/lcd USB LCD device
+ 160 = /dev/usb/legousbtower0 1st USB Legotower device
+ ...
+ 175 = /dev/usb/legousbtower15 16th USB Legotower device
+ 240 = /dev/usb/dabusb0 First dabusb device
+ ...
+ 243 = /dev/usb/dabusb3 Fourth dabusb device
180 block USB block devices
0 = /dev/uba First USB block device
@@ -2710,6 +2723,17 @@ Your cooperation is appreciated.
1 = /dev/cpu/1/msr MSRs on CPU 1
...
+202 block Xen Virtual Block Device
+ 0 = /dev/xvda First Xen VBD whole disk
+ 16 = /dev/xvdb Second Xen VBD whole disk
+ 32 = /dev/xvdc Third Xen VBD whole disk
+ ...
+ 240 = /dev/xvdp Sixteenth Xen VBD whole disk
+
+ Partitions are handled in the same way as for IDE
+ disks (see major number 3) except that the limit on
+ partitions is 15.
+
203 char CPU CPUID information
0 = /dev/cpu/0/cpuid CPUID on CPU 0
1 = /dev/cpu/1/cpuid CPUID on CPU 1
@@ -2747,11 +2771,26 @@ Your cooperation is appreciated.
46 = /dev/ttyCPM0 PPC CPM (SCC or SMC) - port 0
...
47 = /dev/ttyCPM5 PPC CPM (SCC or SMC) - port 5
- 50 = /dev/ttyIOC40 Altix serial card
+ 50 = /dev/ttyIOC0 Altix serial card
+ ...
+ 81 = /dev/ttyIOC31 Altix serial card
+ 82 = /dev/ttyVR0 NEC VR4100 series SIU
+ 83 = /dev/ttyVR1 NEC VR4100 series DSIU
+ 84 = /dev/ttyIOC84 Altix ioc4 serial card
+ ...
+ 115 = /dev/ttyIOC115 Altix ioc4 serial card
+ 116 = /dev/ttySIOC0 Altix ioc3 serial card
+ ...
+ 147 = /dev/ttySIOC31 Altix ioc3 serial card
+ 148 = /dev/ttyPSC0 PPC PSC - port 0
+ ...
+ 153 = /dev/ttyPSC5 PPC PSC - port 5
+ 154 = /dev/ttyAT0 ATMEL serial port 0
...
- 81 = /dev/ttyIOC431 Altix serial card
- 82 = /dev/ttyVR0 NEC VR4100 series SIU
- 83 = /dev/ttyVR1 NEC VR4100 series DSIU
+ 169 = /dev/ttyAT15 ATMEL serial port 15
+ 170 = /dev/ttyNX0 Hilscher netX serial port 0
+ ...
+ 185 = /dev/ttyNX15 Hilscher netX serial port 15
205 char Low-density serial ports (alternate device)
0 = /dev/culu0 Callout device for ttyLU0
@@ -2786,8 +2825,8 @@ Your cooperation is appreciated.
50 = /dev/cuioc40 Callout device for ttyIOC40
...
81 = /dev/cuioc431 Callout device for ttyIOC431
- 82 = /dev/cuvr0 Callout device for ttyVR0
- 83 = /dev/cuvr1 Callout device for ttyVR1
+ 82 = /dev/cuvr0 Callout device for ttyVR0
+ 83 = /dev/cuvr1 Callout device for ttyVR1
206 char OnStream SC-x0 tape devices
@@ -2897,7 +2936,6 @@ Your cooperation is appreciated.
...
196 = /dev/dvb/adapter3/video0 first video decoder of fourth card
-
216 char Bluetooth RFCOMM TTY devices
0 = /dev/rfcomm0 First Bluetooth RFCOMM TTY device
1 = /dev/rfcomm1 Second Bluetooth RFCOMM TTY device
@@ -3002,12 +3040,43 @@ Your cooperation is appreciated.
ioctl()'s can be used to rewind the tape regardless of
the device used to access it.
-231 char InfiniBand MAD
+231 char InfiniBand
0 = /dev/infiniband/umad0
1 = /dev/infiniband/umad1
- ...
+ ...
+ 63 = /dev/infiniband/umad63 64th InfiniBand MAD device
+ 64 = /dev/infiniband/issm0 First InfiniBand IsSM device
+ 65 = /dev/infiniband/issm1 Second InfiniBand IsSM device
+ ...
+ 127 = /dev/infiniband/issm63 64th InfiniBand IsSM device
+ 128 = /dev/infiniband/uverbs0 First InfiniBand verbs device
+ 129 = /dev/infiniband/uverbs1 Second InfiniBand verbs device
+ ...
+ 159 = /dev/infiniband/uverbs31 32nd InfiniBand verbs device
+
+232 char Biometric Devices
+ 0 = /dev/biometric/sensor0/fingerprint first fingerprint sensor on first device
+ 1 = /dev/biometric/sensor0/iris first iris sensor on first device
+ 2 = /dev/biometric/sensor0/retina first retina sensor on first device
+ 3 = /dev/biometric/sensor0/voiceprint first voiceprint sensor on first device
+ 4 = /dev/biometric/sensor0/facial first facial sensor on first device
+ 5 = /dev/biometric/sensor0/hand first hand sensor on first device
+ ...
+ 10 = /dev/biometric/sensor1/fingerprint first fingerprint sensor on second device
+ ...
+ 20 = /dev/biometric/sensor2/fingerprint first fingerprint sensor on third device
+ ...
-232-239 UNASSIGNED
+233 char PathScale InfiniPath interconnect
+ 0 = /dev/ipath Primary device for programs (any unit)
+ 1 = /dev/ipath0 Access specifically to unit 0
+ 2 = /dev/ipath1 Access specifically to unit 1
+ ...
+ 4 = /dev/ipath3 Access specifically to unit 3
+ 129 = /dev/ipath_sma Device used by Subnet Management Agent
+ 130 = /dev/ipath_diag Device used by diagnostics programs
+
+234-239 UNASSIGNED
240-254 char LOCAL/EXPERIMENTAL USE
240-254 block LOCAL/EXPERIMENTAL USE
@@ -3021,6 +3090,24 @@ Your cooperation is appreciated.
This major is reserved to assist the expansion to a
larger number space. No device nodes with this major
should ever be created on the filesystem.
+ (This is probably not true anymore, but I'll leave it
+ for now /Torben)
+
+---LARGE MAJORS!!!!!---
+
+256 char Equinox SST multi-port serial boards
+ 0 = /dev/ttyEQ0 First serial port on first Equinox SST board
+ 127 = /dev/ttyEQ127 Last serial port on first Equinox SST board
+ 128 = /dev/ttyEQ128 First serial port on second Equinox SST board
+ ...
+ 1027 = /dev/ttyEQ1027 Last serial port on eighth Equinox SST board
+
+256 block Resident Flash Disk Flash Translation Layer
+ 0 = /dev/rfda First RFD FTL layer
+ 16 = /dev/rfdb Second RFD FTL layer
+ ...
+ 240 = /dev/rfdp 16th RFD FTL layer
+
**** ADDITIONAL /dev DIRECTORY ENTRIES
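Reviewer note on the new major 202 entry: the listing implies 16 minors per Xen virtual block device, with the whole disk at a multiple of 16 and partitions 1 to 15 following it. A minimal sketch of that arithmetic, assuming a hypothetical helper name and the 16-disk range shown in the listing (xvda through xvdp):

```c
/* Each disk owns 16 consecutive minors: slot 0 is the whole disk,
 * slots 1..15 are its partitions. */
#define XVD_MINORS_PER_DISK 16
#define XVD_MAX_DISKS       16	/* xvda .. xvdp, as listed for major 202 */

/*
 * Hypothetical helper: map (disk index, partition number) to a minor.
 * partition 0 means the whole disk. Returns -1 for out-of-range input.
 */
int xvd_minor(int disk, int partition)
{
	if (disk < 0 || disk >= XVD_MAX_DISKS)
		return -1;
	if (partition < 0 || partition >= XVD_MINORS_PER_DISK)
		return -1;
	return disk * XVD_MINORS_PER_DISK + partition;
}
```

With this mapping, disk index 1 and partition 0 give minor 16 (/dev/xvdb), and disk index 15 gives minor 240 (/dev/xvdp), matching the table.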
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index f7293297f326..027285d0c26c 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -33,21 +33,6 @@ Who: Adrian Bunk <bunk@stusta.de>
---------------------------
-What: RCU API moves to EXPORT_SYMBOL_GPL
-When: April 2006
-Files: include/linux/rcupdate.h, kernel/rcupdate.c
-Why: Outside of Linux, the only implementations of anything even
- vaguely resembling RCU that I am aware of are in DYNIX/ptx,
- VM/XA, Tornado, and K42. I do not expect anyone to port binary
- drivers or kernel modules from any of these, since the first two
- are owned by IBM and the last two are open-source research OSes.
- So these will move to GPL after a grace period to allow
- people, who might be using implementations that I am not aware
- of, to adjust to this upcoming change.
-Who: Paul E. McKenney <paulmck@us.ibm.com>
-
----------------------------
-
What: raw1394: requests of type RAW1394_REQ_ISO_SEND, RAW1394_REQ_ISO_LISTEN
When: November 2006
Why: Deprecated in favour of the new ioctl-based rawiso interface, which is
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 1045da582b9b..d31efbbdfe50 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -99,7 +99,7 @@ prototypes:
int (*sync_fs)(struct super_block *sb, int wait);
void (*write_super_lockfs) (struct super_block *);
void (*unlockfs) (struct super_block *);
- int (*statfs) (struct super_block *, struct kstatfs *);
+ int (*statfs) (struct dentry *, struct kstatfs *);
int (*remount_fs) (struct super_block *, int *, char *);
void (*clear_inode) (struct inode *);
void (*umount_begin) (struct super_block *);
@@ -142,15 +142,16 @@ see also dquot_operations section.
--------------------------- file_system_type ---------------------------
prototypes:
- struct super_block *(*get_sb) (struct file_system_type *, int,
- const char *, void *);
+ int (*get_sb) (struct file_system_type *, int,
+ const char *, void *, struct vfsmount *);
void (*kill_sb) (struct super_block *);
locking rules:
may block BKL
get_sb yes yes
kill_sb yes yes
-->get_sb() returns error or a locked superblock (exclusive on ->s_umount).
+->get_sb() returns error or 0 with locked superblock attached to the vfsmount
+(exclusive on ->s_umount).
->kill_sb() takes a write-locked superblock, does all shutdown work on it,
unlocks and drops the reference.
diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index 2f388460cbe7..5531694059ab 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -50,10 +50,11 @@ Turn your foo_read_super() into a function that would return 0 in case of
success and negative number in case of error (-EINVAL unless you have more
informative error value to report). Call it foo_fill_super(). Now declare
-struct super_block foo_get_sb(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data)
+int foo_get_sb(struct file_system_type *fs_type,
+ int flags, const char *dev_name, void *data, struct vfsmount *mnt)
{
- return get_sb_bdev(fs_type, flags, dev_name, data, ext2_fill_super);
+ return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super,
+ mnt);
}
(or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 3a2e5520c1e3..9d3aed628bc1 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -113,8 +113,8 @@ members are defined:
struct file_system_type {
const char *name;
int fs_flags;
- struct super_block *(*get_sb) (struct file_system_type *, int,
- const char *, void *);
+ int (*get_sb) (struct file_system_type *, int,
+ const char *, void *, struct vfsmount *);
void (*kill_sb) (struct super_block *);
struct module *owner;
struct file_system_type * next;
@@ -211,7 +211,7 @@ struct super_operations {
int (*sync_fs)(struct super_block *sb, int wait);
void (*write_super_lockfs) (struct super_block *);
void (*unlockfs) (struct super_block *);
- int (*statfs) (struct super_block *, struct kstatfs *);
+ int (*statfs) (struct dentry *, struct kstatfs *);
int (*remount_fs) (struct super_block *, int *, char *);
void (*clear_inode) (struct inode *);
void (*umount_begin) (struct super_block *);
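Reviewer note on the ->get_sb() change in Locking, porting and vfs.txt: the method now returns an int and hands the superblock back through the vfsmount instead of returning it. The following is a userspace mock of that calling convention only. The struct definitions are stripped-down stand-ins, and foo_sb and the body of foo_get_sb are assumptions made for illustration; a real filesystem would call get_sb_bdev() or a sibling as shown in the porting hunk.

```c
#include <errno.h>
#include <stddef.h>

/* Stand-in types: only the members this sketch needs. */
struct super_block { int s_magic; };
struct vfsmount { struct super_block *mnt_sb; };
struct file_system_type { const char *name; };

/* Hypothetical pre-built superblock for the mock filesystem. */
struct super_block foo_sb = { 0x1234 };

/*
 * New-style get_sb: return 0 on success with the superblock attached
 * to the vfsmount, or a negative errno on failure.
 */
int foo_get_sb(struct file_system_type *fs_type, int flags,
	       const char *dev_name, void *data, struct vfsmount *mnt)
{
	(void)fs_type;
	(void)flags;
	(void)data;

	if (dev_name == NULL)
		return -EINVAL;

	mnt->mnt_sb = &foo_sb;
	return 0;
}
```

Callers check the integer result first and only then look at the superblock attached to the mount, rather than testing a returned pointer for an error value.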
diff --git a/Documentation/ia64/aliasing.txt b/Documentation/ia64/aliasing.txt
new file mode 100644
index 000000000000..38f9a52d1820
--- /dev/null
+++ b/Documentation/ia64/aliasing.txt
@@ -0,0 +1,208 @@
+ MEMORY ATTRIBUTE ALIASING ON IA-64
+
+ Bjorn Helgaas
+ <bjorn.helgaas@hp.com>
+ May 4, 2006
+
+
+MEMORY ATTRIBUTES
+
+ Itanium supports several attributes for virtual memory references.
+ The attribute is part of the virtual translation, i.e., it is
+ contained in the TLB entry. The ones of most interest to the Linux
+ kernel are:
+
+ WB Write-back (cacheable)
+ UC Uncacheable
+ WC Write-coalescing
+
+ System memory typically uses the WB attribute. The UC attribute is
+ used for memory-mapped I/O devices. The WC attribute is uncacheable
+ like UC is, but writes may be delayed and combined to increase
+ performance for things like frame buffers.
+
+ The Itanium architecture requires that we avoid accessing the same
+ page with both a cacheable mapping and an uncacheable mapping[1].
+
+ The design of the chipset determines which attributes are supported
+ on which regions of the address space. For example, some chipsets
+ support either WB or UC access to main memory, while others support
+ only WB access.
+
+MEMORY MAP
+
+ Platform firmware describes the physical memory map and the
+ supported attributes for each region. At boot-time, the kernel uses
+ the EFI GetMemoryMap() interface. ACPI can also describe memory
+ devices and the attributes they support, but Linux/ia64 currently
+ doesn't use this information.
+
+ The kernel uses the efi_memmap table returned from GetMemoryMap() to
+ learn the attributes supported by each region of physical address
+ space. Unfortunately, this table does not completely describe the
+ address space because some machines omit some or all of the MMIO
+ regions from the map.
+
+ The kernel maintains another table, kern_memmap, which describes the
+ memory Linux is actually using and the attribute for each region.
+ This contains only system memory; it does not contain MMIO space.
+
+ The kern_memmap table typically contains only a subset of the system
+ memory described by the efi_memmap. Linux/ia64 can't use all memory
+ in the system because of constraints imposed by the identity mapping
+ scheme.
+
+ The efi_memmap table is preserved unmodified because the original
+ boot-time information is required for kexec.
+
+KERNEL IDENTITY MAPPINGS
+
+ Linux/ia64 identity mappings are done with large pages, currently
+ either 16MB or 64MB, referred to as "granules." Cacheable mappings
+ are speculative[2], so the processor can read any location in the
+ page at any time, independent of the programmer's intentions. This
+ means that to avoid attribute aliasing, Linux can create a cacheable
+ identity mapping only when the entire granule supports cacheable
+ access.
+
+ Therefore, kern_memmap contains only full granule-sized regions that
+ can be referenced safely by an identity mapping.
+
+ Uncacheable mappings are not speculative, so the processor will
+ generate UC accesses only to locations explicitly referenced by
+ software. This allows UC identity mappings to cover granules that
+ are only partially populated, or populated with a combination of UC
+ and WB regions.
+
+USER MAPPINGS
+
+ User mappings are typically done with 16K or 64K pages. The smaller
+ page size allows more flexibility because only 16K or 64K has to be
+ homogeneous with respect to memory attributes.
+
+POTENTIAL ATTRIBUTE ALIASING CASES
+
+ There are several ways the kernel creates new mappings:
+
+ mmap of /dev/mem
+
+ This uses remap_pfn_range(), which creates user mappings. These
+ mappings may be either WB or UC. If the region being mapped
+ happens to be in kern_memmap, meaning that it may also be mapped
+ by a kernel identity mapping, the user mapping must use the same
+ attribute as the kernel mapping.
+
+ If the region is not in kern_memmap, the user mapping should use
+ an attribute reported as being supported in the EFI memory map.
+
+ Since the EFI memory map does not describe MMIO on some
+ machines, this should use an uncacheable mapping as a fallback.
+
+ mmap of /sys/class/pci_bus/.../legacy_mem
+
+ This is very similar to mmap of /dev/mem, except that legacy_mem
+ only allows mmap of the one megabyte "legacy MMIO" area for a
+ specific PCI bus. Typically this is the first megabyte of
+ physical address space, but it may be different on machines with
+ several VGA devices.
+
+ "X" uses this to access VGA frame buffers. Using legacy_mem
+ rather than /dev/mem allows multiple instances of X to talk to
+ different VGA cards.
+
+ The /dev/mem mmap constraints apply.
+
+ However, since this is for mapping legacy MMIO space, WB access
+ does not make sense. This matters on machines without legacy
+ VGA support: these machines may have WB memory for the entire
+ first megabyte (or even the entire first granule).
+
+ On these machines, we could mmap legacy_mem as WB, which would
+ be safe in terms of attribute aliasing, but X has no way of
+ knowing that it is accessing regular memory, not a frame buffer,
+ so the kernel should fail the mmap rather than doing it with WB.
+
+ read/write of /dev/mem
+
+ This uses copy_from_user(), which implicitly uses a kernel
+ identity mapping. This is obviously safe for things in
+ kern_memmap.
+
+ There may be corner cases of things that are not in kern_memmap,
+ but could be accessed this way. For example, registers in MMIO
+ space are not in kern_memmap, but could be accessed with a UC
+ mapping. This would not cause attribute aliasing. But
+ registers typically can be accessed only with four-byte or
+ eight-byte accesses, and the copy_from_user() path doesn't allow
+ any control over the access size, so this would be dangerous.
+
+ ioremap()
+
+ This returns a kernel identity mapping for use inside the
+ kernel.
+
+ If the region is in kern_memmap, we should use the attribute
+ specified there. Otherwise, if the EFI memory map reports that
+ the entire granule supports WB, we should use that (granules
+ that are partially reserved or occupied by firmware do not appear
+ in kern_memmap). Otherwise, we should use a UC mapping.
+
+PAST PROBLEM CASES
+
+ mmap of various MMIO regions from /dev/mem by "X" on Intel platforms
+
+ The EFI memory map may not report these MMIO regions.
+
+ These must be allowed so that X will work. This means that
+ when the EFI memory map is incomplete, every /dev/mem mmap must
+ succeed. It may create either WB or UC user mappings, depending
+ on whether the region is in kern_memmap or the EFI memory map.
+
+ mmap of 0x0-0xA0000 /dev/mem by "hwinfo" on HP sx1000 with VGA enabled
+
+ See https://bugzilla.novell.com/show_bug.cgi?id=140858.
+
+ The EFI memory map reports the following attributes:
+ 0x00000-0x9FFFF WB only
+ 0xA0000-0xBFFFF UC only (VGA frame buffer)
+ 0xC0000-0xFFFFF WB only
+
+ This mmap is done with user pages, not kernel identity mappings,
+ so it is safe to use WB mappings.
+
+ The kernel VGA driver may ioremap the VGA frame buffer at 0xA0000,
+ which will use a granule-sized UC mapping covering 0-0xFFFFF. This
+ granule covers some WB-only memory, but since UC is non-speculative,
+ the processor will never generate an uncacheable reference to the
+ WB-only areas unless the driver explicitly touches them.
+
+ mmap of 0x0-0xFFFFF legacy_mem by "X"
+
+ If the EFI memory map reports this entire range as WB, there
+ is no VGA MMIO hole, and the mmap should fail or be done with
+ a WB mapping.
+
+ There's no easy way for X to determine whether the 0xA0000-0xBFFFF
+ region is a frame buffer or just memory, so I think it's best to
+ just fail this mmap request rather than using a WB mapping. As
+ far as I know, there's no need to map legacy_mem with WB
+ mappings.
+
+ Otherwise, a UC mapping of the entire region is probably safe.
+ The VGA hole means the region will not be in kern_memmap. The
+ HP sx1000 chipset doesn't support UC access to the memory surrounding
+ the VGA hole, but X doesn't need that area anyway and should not
+ reference it.
+
+ mmap of 0xA0000-0xBFFFF legacy_mem by "X" on HP sx1000 with VGA disabled
+
+ The EFI memory map reports the following attributes:
+ 0x00000-0xFFFFF WB only (no VGA MMIO hole)
+
+ This is a special case of the previous case, and the mmap should
+ fail for the same reason as above.
+
+NOTES
+
+ [1] SDM rev 2.2, vol 2, sec 4.4.1.
+ [2] SDM rev 2.2, vol 2, sec 4.4.6.
diff --git a/Documentation/ioctl-number.txt b/Documentation/ioctl-number.txt
index 171a44ebd939..1543802ef53e 100644
--- a/Documentation/ioctl-number.txt
+++ b/Documentation/ioctl-number.txt
@@ -85,7 +85,9 @@ Code Seq# Include File Comments
<mailto:maassen@uni-freiburg.de>
'C' all linux/soundcard.h
'D' all asm-s390/dasd.h
+'E' all linux/input.h
'F' all linux/fb.h
+'H' all linux/hiddev.h
'I' all linux/isdn.h
'J' 00-1F drivers/scsi/gdth_ioctl.h
'K' all linux/kd.h
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index a9d3a1794b23..bca6f389da66 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -147,6 +147,9 @@ running once the system is up.
acpi_irq_isa= [HW,ACPI] If irq_balance, mark listed IRQs used by ISA
Format: <irq>,<irq>...
+ acpi_os_name= [HW,ACPI] Tell ACPI BIOS the name of the OS
+ Format: To spoof as Windows 98: ="Microsoft Windows"
+
acpi_osi= [HW,ACPI] empty param disables _OSI
acpi_serialize [HW,ACPI] force serialization of AML methods
diff --git a/Documentation/networking/tuntap.txt b/Documentation/networking/tuntap.txt
index 76750fb9151a..839cbb71388b 100644
--- a/Documentation/networking/tuntap.txt
+++ b/Documentation/networking/tuntap.txt
@@ -39,10 +39,13 @@ Copyright (C) 1999-2000 Maxim Krasnyansky <max_mk@yahoo.com>
mknod /dev/net/tun c 10 200
Set permissions:
- e.g. chmod 0700 /dev/net/tun
- if you want the device only accessible by root. Giving regular users the
- right to assign network devices is NOT a good idea. Users could assign
- bogus network interfaces to trick firewalls or administrators.
+ e.g. chmod 0666 /dev/net/tun
+ There's no harm in allowing the device to be accessible by non-root users,
+ since CAP_NET_ADMIN is required for creating network devices or for
+ connecting to network devices which aren't owned by the user in question.
+ If you want to create persistent devices and give ownership of them to
+ unprivileged users, then you need the /dev/net/tun device to be usable by
+ those users.
Driver module autoloading
diff --git a/Documentation/power/swsusp.txt b/Documentation/power/swsusp.txt
index 516c5019013b..823b2cf6e3dc 100644
--- a/Documentation/power/swsusp.txt
+++ b/Documentation/power/swsusp.txt
@@ -350,9 +350,34 @@ Q: How do I make suspend more verbose?
A: If you want to see any non-error kernel messages on the virtual
terminal the kernel switches to during suspend, you have to set the
-kernel console loglevel to at least 5, for example by doing
-
- echo 5 > /proc/sys/kernel/printk
+kernel console loglevel to at least 5, so that messages of level 4
+(KERN_WARNING) and more severe are shown, for example by doing
+
+ # save the old loglevel
+ read LOGLEVEL DUMMY < /proc/sys/kernel/printk
+ # set the loglevel so we see the progress bar.
+ # if the level is higher than needed, we leave it alone.
+ if [ $LOGLEVEL -lt 5 ]; then
+ echo 5 > /proc/sys/kernel/printk
+ fi
+
+ IMG_SZ=0
+ read IMG_SZ < /sys/power/image_size
+ echo -n disk > /sys/power/state
+ RET=$?
+ #
+ # the logic here is:
+ # if image_size > 0 (without kernel support, IMG_SZ will be zero),
+ # then try again with image_size set to zero.
+ if [ $RET -ne 0 -a $IMG_SZ -ne 0 ]; then # try again with minimal image size
+ echo 0 > /sys/power/image_size
+ echo -n disk > /sys/power/state
+ RET=$?
+ fi
+
+ # restore previous loglevel
+ echo $LOGLEVEL > /proc/sys/kernel/printk
+ exit $RET
Q: Is this true that if I have a mounted filesystem on a USB device and
I suspend to disk, I can lose data unless the filesystem has been mounted
@@ -380,3 +405,17 @@ safest thing is to unmount all filesystems on removable media (such USB,
Firewire, CompactFlash, MMC, external SATA, or even IDE hotplug bays)
before suspending; then remount them after resuming.
+Q: I upgraded the kernel from 2.6.15 to 2.6.16. Both kernels were
+compiled with similar configuration files. However, I found that
+suspend to disk (and resume) is much slower on 2.6.16 than on
+2.6.15. Any idea why that might happen or how I can speed it up?
+
+A: This is because the suspend image is now larger than it was
+for 2.6.15 (by saving more data we get a more responsive system
+after resume).
+
+There's the /sys/power/image_size knob that controls the size of the
+image. If you set it to 0 (e.g. by echo 0 > /sys/power/image_size as
+root), the 2.6.15 behavior should be restored. If it is still too
+slow, take a look at suspend.sf.net -- userland suspend is faster and
+supports LZF compression to speed it up further.
diff --git a/Documentation/power/video.txt b/Documentation/power/video.txt
index 43a889f8f08d..d859faa3a463 100644
--- a/Documentation/power/video.txt
+++ b/Documentation/power/video.txt
@@ -90,6 +90,7 @@ Table of known working notebooks:
Model hack (or "how to do it")
------------------------------------------------------------------------------
Acer Aspire 1406LC ole's late BIOS init (7), turn off DRI
+Acer TM 230 s3_bios (2)
Acer TM 242FX vbetool (6)
Acer TM C110 video_post (8)
Acer TM C300 vga=normal (only suspend on console, not in X), vbetool (6) or video_post (8)
@@ -115,6 +116,7 @@ Dell D610 vga=normal and X (possibly vbestate (6) too, but not tested)
Dell Inspiron 4000 ??? (*)
Dell Inspiron 500m ??? (*)
Dell Inspiron 510m ???
+Dell Inspiron 5150 vbetool needed (6)
Dell Inspiron 600m ??? (*)
Dell Inspiron 8200 ??? (*)
Dell Inspiron 8500 ??? (*)
@@ -125,6 +127,7 @@ HP NX7000 ??? (*)
HP Pavilion ZD7000 vbetool post needed, need open-source nv driver for X
HP Omnibook XE3 athlon version none (1)
HP Omnibook XE3GC none (1), video is S3 Savage/IX-MV
+HP Omnibook XE3L-GF vbetool (6)
HP Omnibook 5150 none (1), (S1 also works OK)
IBM TP T20, model 2647-44G none (1), video is S3 Inc. 86C270-294 Savage/IX-MV, vesafb gets "interesting" but X work.
IBM TP A31 / Type 2652-M5G s3_mode (3) [works ok with BIOS 1.04 2002-08-23, but not at all with BIOS 1.11 2004-11-05 :-(]
@@ -157,6 +160,7 @@ Sony Vaio vgn-s260 X or boot-radeon can init it (5)
Sony Vaio vgn-S580BH vga=normal, but suspend from X. Console will be blank unless you return to X.
Sony Vaio vgn-FS115B s3_bios (2),s3_mode (4)
Toshiba Libretto L5 none (1)
+Toshiba Libretto 100CT/110CT vbetool (6)
Toshiba Portege 3020CT s3_mode (3)
Toshiba Satellite 4030CDT s3_mode (3) (S1 also works OK)
Toshiba Satellite 4080XCDT s3_mode (3) (S1 also works OK)
diff --git a/Documentation/sparse.txt b/Documentation/sparse.txt
index 3f1c5464b1c9..5a311c38dd1a 100644
--- a/Documentation/sparse.txt
+++ b/Documentation/sparse.txt
@@ -1,5 +1,6 @@
Copyright 2004 Linus Torvalds
Copyright 2004 Pavel Machek <pavel@suse.cz>
+Copyright 2006 Bob Copeland <me@bobcopeland.com>
Using sparse for typechecking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -41,15 +42,8 @@ sure that bitwise types don't get mixed up (little-endian vs big-endian
vs cpu-endian vs whatever), and there the constant "0" really _is_
special.
-Use
-
- make C=[12] CF=-Wbitwise
-
-or you don't get any checking at all.
-
-
-Where to get sparse
-~~~~~~~~~~~~~~~~~~~
+Getting sparse
+~~~~~~~~~~~~~~
With git, you can just get it from
@@ -57,7 +51,7 @@ With git, you can just get it from
and DaveJ has tar-balls at
- http://www.codemonkey.org.uk/projects/git-snapshots/sparse/
+ http://www.codemonkey.org.uk/projects/git-snapshots/sparse/
Once you have it, just do
@@ -65,8 +59,20 @@ Once you have it, just do
make
make install
-as your regular user, and it will install sparse in your ~/bin directory.
-After that, doing a kernel make with "make C=1" will run sparse on all the
-C files that get recompiled, or with "make C=2" will run sparse on the
-files whether they need to be recompiled or not (ie the latter is fast way
-to check the whole tree if you have already built it).
+as a regular user, and it will install sparse in your ~/bin directory.
+
+Using sparse
+~~~~~~~~~~~~
+
+Do a kernel make with "make C=1" to run sparse on all the C files that get
+recompiled, or use "make C=2" to run sparse on the files whether they need to
+be recompiled or not. The latter is a fast way to check the whole tree if you
+have already built it.
+
+The optional make variable CF can be used to pass arguments to sparse. The
+build system passes -Wbitwise to sparse automatically. To perform endianness
+checks, you may define __CHECK_ENDIAN__:
+
+ make C=2 CF="-D__CHECK_ENDIAN__"
+
+These checks are disabled by default as they generate a host of warnings.
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index a46c10fcddfc..2dc246af4885 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -29,6 +29,7 @@ Currently, these files are in /proc/sys/vm:
- drop-caches
- zone_reclaim_mode
- zone_reclaim_interval
+- panic_on_oom
==============================================================
@@ -178,3 +179,15 @@ Time is set in seconds and set by default to 30 seconds.
Reduce the interval if undesired off node allocations occur. However, too
frequent scans will have a negative impact on off node allocation performance.
+=============================================================
+
+panic_on_oom
+
+This enables or disables the panic-on-out-of-memory feature. If this is set
+to 1, the kernel panics when an out-of-memory condition occurs. If this is set
+to 0, the kernel invokes the oom_killer, which kills some rogue process.
+Usually, the oom_killer can kill rogue processes and the system will survive.
+If you want to panic the system rather than kill rogue processes, set this
+to 1.
+
+The default value is 0.
+
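A minimal shell sketch of how the panic_on_oom setting described above could be inspected. The helper name describe_panic_on_oom and its optional file argument are assumptions made for illustration; only the path /proc/sys/vm/panic_on_oom and the values 0 and 1 come from the text.

```shell
# Hypothetical helper (not part of the kernel tree): describe the
# panic_on_oom setting stored in the given file. The path defaults to
# /proc/sys/vm/panic_on_oom; passing another file lets the function be
# tried on systems that lack this sysctl.
describe_panic_on_oom() {
    file="${1:-/proc/sys/vm/panic_on_oom}"
    if [ ! -r "$file" ]; then
        echo "unavailable"              # sysctl missing or not readable
        return 1
    fi
    value=""
    read -r value < "$file" || true     # tolerate a missing trailing newline
    case "$value" in
        0) echo "oom_killer" ;;         # default: kill a rogue process
        1) echo "panic" ;;              # panic the kernel on out-of-memory
        *) echo "unknown" ;;            # any value the text does not describe
    esac
}
```

Calling describe_panic_on_oom without arguments reads the live setting. Changing it requires root, for example by writing 1 to /proc/sys/vm/panic_on_oom.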
diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration
index 0dd4ef30c361..99f89aa10169 100644
--- a/Documentation/vm/page_migration
+++ b/Documentation/vm/page_migration
@@ -26,8 +26,13 @@ a process are located. See also the numa_maps manpage in the numactl package.
Manual migration is useful if for example the scheduler has relocated
a process to a processor on a distant node. A batch scheduler or an
administrator may detect the situation and move the pages of the process
-nearer to the new processor. At some point in the future we may have
-some mechanism in the scheduler that will automatically move the pages.
+nearer to the new processor. The kernel itself provides only
+manual page migration support. Automatic page migration may be implemented
+through user space processes that move pages. A special function call
+"move_pages" allows the moving of individual pages within a process.
+A NUMA profiler may, for example, obtain a log showing frequent off node
+accesses and may use the result to move pages to more advantageous
+locations.
Larger installations usually partition the system using cpusets into
sections of nodes. Paul Jackson has equipped cpusets with the ability to
@@ -62,22 +67,14 @@ A. In kernel use of migrate_pages()
It also prevents the swapper or other scans to encounter
the page.
-2. Generate a list of newly allocates page. These pages will contain the
- contents of the pages from the first list after page migration is
- complete.
+2. We need to have a function of type new_page_t that can be
+ passed to migrate_pages(). This function should figure out
+ how to allocate the correct new page given the old page.
3. The migrate_pages() function is called which attempts
- to do the migration. It returns the moved pages in the
- list specified as the third parameter and the failed
- migrations in the fourth parameter. The first parameter
- will contain the pages that could still be retried.
-
-4. The leftover pages of various types are returned
- to the LRU using putback_to_lru_pages() or otherwise
- disposed of. The pages will still have the refcount as
- increased by isolate_lru_pages() if putback_to_lru_pages() is not
- used! The kernel may want to handle the various cases of failures in
- different ways.
+ to do the migration. It will call the function to allocate
+ the new page for each page that is considered for
+ moving.
B. How migrate_pages() works
----------------------------
@@ -93,83 +90,58 @@ Steps:
2. Insure that writeback is complete.
-3. Make sure that the page has assigned swap cache entry if
- it is an anonyous page. The swap cache reference is necessary
- to preserve the information contain in the page table maps while
- page migration occurs.
-
-4. Prep the new page that we want to move to. It is locked
+3. Prep the new page that we want to move to. It is locked
and set to not being uptodate so that all accesses to the new
page immediately lock while the move is in progress.
-5. All the page table references to the page are either dropped (file
- backed pages) or converted to swap references (anonymous pages).
- This should decrease the reference count.
+4. The new page is prepped with some settings from the old page so that
+ accesses to the new page will discover a page with the correct settings.
+
+5. All the page table references to the page are converted
+ to migration entries or dropped (nonlinear vmas).
+ This decreases the mapcount of a page. If the resulting
+ mapcount is not zero then we do not migrate the page.
+ All user space processes that attempt to access the page
+ will now wait on the page lock.
6. The radix tree lock is taken. This will cause all processes trying
- to reestablish a pte to block on the radix tree spinlock.
+ to access the page via the mapping to block on the radix tree spinlock.
7. The refcount of the page is examined and we back out if references remain
otherwise we know that we are the only one referencing this page.
8. The radix tree is checked and if it does not contain the pointer to this
- page then we back out because someone else modified the mapping first.
-
-9. The mapping is checked. If the mapping is gone then a truncate action may
- be in progress and we back out.
-
-10. The new page is prepped with some settings from the old page so that
- accesses to the new page will be discovered to have the correct settings.
+ page then we back out because someone else modified the radix tree.
-11. The radix tree is changed to point to the new page.
+9. The radix tree is changed to point to the new page.
-12. The reference count of the old page is dropped because the radix tree
- reference is gone.
+10. The reference count of the old page is dropped because the radix tree
+ reference is gone. A reference to the new page is established because
+ the new page is referenced by the radix tree.
-13. The radix tree lock is dropped. With that lookups become possible again
- and other processes will move from spinning on the tree lock to sleeping on
- the locked new page.
+11. The radix tree lock is dropped. With that lookups in the mapping
+ become possible again. Processes will move from spinning on the tree_lock
+ to sleeping on the locked new page.
-14. The page contents are copied to the new page.
+12. The page contents are copied to the new page.
-15. The remaining page flags are copied to the new page.
+13. The remaining page flags are copied to the new page.
-16. The old page flags are cleared to indicate that the page does
- not use any information anymore.
+14. The old page flags are cleared to indicate that the page does
+ not provide any information anymore.
-17. Queued up writeback on the new page is triggered.
+15. Queued up writeback on the new page is triggered.
-18. If swap pte's were generated for the page then replace them with real
- ptes. This will reenable access for processes not blocked by the page lock.
+16. If migration entries were inserted into the page table, then replace
+ them with real ptes. Doing so will enable access for user space processes
+ not already waiting for the page lock.
17. The page locks are dropped from the old and new page.
- Processes waiting on the page lock can continue.
+ Processes waiting on the page lock will redo their page faults
+ and will reach the new page.
18. The new page is moved to the LRU and can be scanned by the swapper
etc again.
-TODO list
----------
-
-- Page migration requires the use of swap handles to preserve the
- information of the anonymous page table entries. This means that swap
- space is reserved but never used. The maximum number of swap handles used
- is determined by CHUNK_SIZE (see mm/mempolicy.c) per ongoing migration.
- Reservation of pages could be avoided by having a special type of swap
- handle that does not require swap space and that would only track the page
- references. Something like that was proposed by Marcelo Tosatti in the
- past (search for migration cache on lkml or linux-mm@kvack.org).
-
-- Page migration unmaps ptes for file backed pages and requires page
- faults to reestablish these ptes. This could be optimized by somehow
- recording the references before migration and then reestablish them later.
- However, there are several locking challenges that have to be overcome
- before this is possible.
-
-- Page migration generates read ptes for anonymous pages. Dirty page
- faults are required to make the pages writable again. It may be possible
- to generate a pte marked dirty if it is known that the page is dirty and
- that this process has the only reference to that page.
-
-Christoph Lameter, March 8, 2006.
+Christoph Lameter, May 8, 2006.