summaryrefslogtreecommitdiffstats
path: root/drivers/md/md.c
Commit message (Collapse)AuthorAgeFilesLines
* md: Add module.h to all files using it implicitlyPaul Gortmaker2011-10-311-0/+1
| | | | | | | | A pending cleanup will mean that module.h won't be implicitly everywhere anymore. Make sure the modular drivers in md dir are actually calling out for <module.h> explicitly in advance. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* md.c: trivial comment fixChris Dunlop2011-10-191-2/+2
| | | | | | | Trivial comment fix Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: NeilBrown <neilb@suse.de>
* MD: Allow restarting an interrupted incremental recovery.Andrei Warkentin2011-10-181-7/+15
| | | | | | | | | | | | | If an incremental recovery was interrupted, a subsequent re-add will result in a full recovery, even though an incremental should be possible (seen with raid1). Solve this problem by not updating the superblock on the recovering device until array is not degraded any longer. Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrei Warkentin <andreiw@vmware.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md: clear In_sync bit on devices added to an active array.NeilBrown2011-10-181-0/+1
| | | | | | | | | | | | | | | When we add a device to an active array it can be meaningful to set the 'insync' flag. This indicates that the device is in-sync with the array except for locations recorded in the bitmap. A bitmap-based recovery can then bring it completely in-sync. Internally we move that flag to 'saved_raid_disk' but forgot to clear In_sync like we do in add_new_disk. So clear In_sync after moving its value to saved_raid_disk. Reported-by: Andrei Warkentin <andreiw@vmware.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md: rename "mdk_personality" to "md_personality"NeilBrown2011-10-111-8/+8
| | | | | | "mdk" doesn't mean anything any more. Signed-off-by: NeilBrown <neilb@suse.de>
* md: remove typedefs: mdk_thread_t -> struct md_threadNeilBrown2011-10-111-7/+7
| | | | Signed-off-by: NeilBrown <neilb@suse.de>
* md: remove typedefs: mddev_t -> struct mddevNeilBrown2011-10-111-180/+180
| | | | | | Having mddev_t and 'struct mddev_s' is ugly and not preferred Signed-off-by: NeilBrown <neilb@suse.de>
* md: removing typedefs: mdk_rdev_t -> struct md_rdevNeilBrown2011-10-111-105/+107
| | | | | | | The typedefs are just annoying. 'mdk' probably refers to 'md_k.h' which used to be an include file that defined this thing. Signed-off-by: NeilBrown <neilb@suse.de>
* md: remove PRINTK and dprintk debugging and use pr_debugNeilBrown2011-10-071-17/+11
| | | | | | Being able to dynamically enable these make them much more useful. Signed-off-by: NeilBrown <neilb@suse.de>
* md: don't delay reboot by 1 second if no MD devices existDaniel P. Berrange2011-09-231-2/+6
| | | | | | | | | | | | | | | | | | The md_notify_reboot() method includes a call to mdelay(1000), to deal with "exotic SCSI devices" which are too volatile on reboot. The delay is unconditional. Even if the machine does not have any block devices, let alone MD devices, the kernel shutdown sequence is slowed down. 1 second does not matter much with physical hardware, but with certain virtualization use cases any wasted time in the bootup & shutdown sequence counts for alot. * drivers/md/md.c: md_notify_reboot() - only impose a delay if there was at least one MD device to be stopped during reboot Signed-off-by: Daniel P. Berrange <berrange@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md: Avoid waking up a thread after it has been freed.NeilBrown2011-09-211-3/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | Two related problems: 1/ some error paths call "md_unregister_thread(mddev->thread)" without subsequently clearing ->thread. A subsequent call to mddev_unlock will try to wake the thread, and crash. 2/ Most calls to md_wakeup_thread are protected against the thread disappeared either by: - holding the ->mutex - having an active request, so something else must be keeping the array active. However mddev_unlock calls md_wakeup_thread after dropping the mutex and without any certainty of an active request, so the ->thread could theoretically disappear. So we need a spinlock to provide some protections. So change md_unregister_thread to take a pointer to the thread pointer, and ensure that it always does the required locking, and clears the pointer properly. Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com> Signed-off-by: NeilBrown <neilb@suse.de> cc: stable@kernel.org
* md: Fix handling for devices from 2TB to 4TB in 0.90 metadata.NeilBrown2011-09-101-2/+10
| | | | | | | | | | | | | | | | | | | | 0.90 metadata uses an unsigned 32bit number to count the number of kilobytes used from each device. This should allow up to 4TB per device. However we multiply this by 2 (to get sectors) before casting to a larger type, so sizes above 2TB get truncated. Also we allow rdev->sectors to be larger than 4TB, so it is possible for the array to be resized larger than the metadata can handle. So make sure rdev->sectors never exceeds 4TB when 0.90 metadata is in used. Also the sanity check at the end of super_90_load should include level 1 as it used ->size too. (RAID0 and Linear don't use ->size at all). Reported-by: Pim Zandbergen <P.Zandbergen@macroscoop.nl> Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
* md: fix clearing of 'blocked' flag in the presence of bad blocks.NeilBrown2011-08-301-1/+1
| | | | | | | | | | | When the 'blocked' flag on a device is cleared while there are unacknowledged bad blocks we must fail the device. This is needed for backwards compatability of the interface. The code currently uses the wrong test for "unacknowledged bad blocks exist". Change it to the right test. Signed-off-by: NeilBrown <neilb@suse.de>
* md: use REQ_NOIDLE flag in md_super_write()Namhyung Kim2011-08-251-1/+1
| | | | | | | | | | | | | | Queue idling is used for the anticipation of immediate sequencial I/O's but md_super_write() is a kind of one- shot operation, coupled with md_super_wait(), so the idling in this case will be just a waste of time. Specifying REQ_NOIDLE prevents it. Instead of adding the flag to submit_bio() directly, use pre-defined macro WRITE_FLUSH_FUA. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md: ensure changes to 'write-mostly' are reflected in metadata.NeilBrown2011-08-251-0/+5
| | | | | | | | | | The 'write-mostly' flag can be changed through sysfs. With 0.90 metadata, those changes are reflected in the metadata. For 1.x metadata, they aren't. So fix super_1_sync to record 'write-mostly' status. Signed-off-by: NeilBrown <neilb@suse.de>
* md: report failure if a 'set faulty' request doesn't.NeilBrown2011-08-251-1/+6
| | | | | | | | | | | | Sometimes a device will refuse to be set faulty. e.g. RAID1 will never let the last working device become faulty. So check if "md_error()" did manage to set the faulty flag and fail with EBUSY if it didn't. Resolves-Debian-Bug: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=601198 Reported-by: Mike Hommey <mh+reportbug@glandium.org> Signed-off-by: NeilBrown <neilb@suse.de>
* Merge branch 'for-linus' of git://neil.brown.name/mdLinus Torvalds2011-07-281-86/+785
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-linus' of git://neil.brown.name/md: (75 commits) md/raid10: handle further errors during fix_read_error better. md/raid10: Handle read errors during recovery better. md/raid10: simplify read error handling during recovery. md/raid10: record bad blocks due to write errors during resync/recovery. md/raid10: attempt to fix read errors during resync/check md/raid10: Handle write errors by updating badblock log. md/raid10: clear bad-block record when write succeeds. md/raid10: avoid writing to known bad blocks on known bad drives. md/raid10 record bad blocks as needed during recovery. md/raid10: avoid reading known bad blocks during resync/recovery. md/raid10 - avoid reading from known bad blocks - part 3 md/raid10: avoid reading from known bad blocks - part 2 md/raid10: avoid reading from known bad blocks - part 1 md/raid10: Split handle_read_error out from raid10d. md/raid10: simplify/reindent some loops. md/raid5: Clear bad blocks on successful write. md/raid5. Don't write to known bad block on doubtful devices. md/raid5: write errors should be recorded as bad blocks if possible. md/raid5: use bad-block log to improve handling of uncorrectable read errors. md/raid5: avoid reading from known bad blocks. ...
| * md/raid10 record bad blocks as needed during recovery.NeilBrown2011-07-281-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | When recovering one or more devices, if all the good devices have bad blocks we should record a bad block on the device being rebuilt. If this fails, we need to abort the recovery. To ensure we don't think that we aborted later than we actually did, we need to move the check for MD_RECOVERY_INTR earlier in md_do_sync, in particular before mddev->curr_resync is updated. Signed-off-by: NeilBrown <neilb@suse.de>
| * md: make it easier to wait for bad blocks to be acknowledged.NeilBrown2011-07-281-25/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is only safe to choose not to write to a bad block if that bad block is safely recorded in metadata - i.e. if it has been 'acknowledged'. If it hasn't we need to wait for the acknowledgement. We support that using rdev->blocked wait and md_wait_for_blocked_rdev by introducing a new device flag 'BlockedBadBlock'. This flag is only advisory. It is cleared whenever we acknowledge a bad block, so that a waiter can re-check the particular bad blocks that it is interested it. It should be set by a caller when they find they need to wait. This (set after test) is inherently racy, but as md_wait_for_blocked_rdev already has a timeout, losing the race will have minimal impact. When we clear "Blocked" was also clear "BlockedBadBlocks" incase it was set incorrectly (see above race). We also modify the way we manage 'Blocked' to fit better with the new handling of 'BlockedBadBlocks' and to make it consistent between externally managed and internally managed metadata. This requires that each raidXd loop checks if the metadata needs to be written and triggers a write (md_check_recovery) if needed. Otherwise a queued write request might cause raidXd to wait for the metadata to write, and only that thread can write it. Before writing metadata, we set FaultRecorded for all devices that are Faulty, then after writing the metadata we clear Blocked for any device for which the Fault was certainly Recorded. The 'faulty' device flag now appears in sysfs if the device is faulty *or* it has unacknowledged bad blocks. So user-space which does not understand bad blocks can continue to function correctly. User space which does, should not assume a device is faulty until it sees the 'faulty' flag, and then sees the list of unacknowledged bad blocks is empty. Signed-off-by: NeilBrown <neilb@suse.de>
| * md: add 'write_error' flag to component devices.NeilBrown2011-07-281-0/+12
| | | | | | | | | | | | | | | | | | If a device has ever seen a write error, we will want to handle known-bad-blocks differently. So create an appropriate state flag and export it via sysfs. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
| * md/raid1: avoid reading from known bad blocks.NeilBrown2011-07-281-0/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that we have a bad block list, we should not read from those blocks. There are several main parts to this: 1/ read_balance needs to check for bad blocks, and return not only the chosen device, but also how many good blocks are available there. 2/ fix_read_error needs to avoid trying to read from bad blocks. 3/ read submission must be ready to issue multiple reads to different devices as different bad blocks on different devices could mean that a single large read cannot be served by any one device, but can still be served by the array. This requires keeping count of the number of outstanding requests per bio. This count is stored in 'bi_phys_segments' 4/ retrying a read needs to also be ready to submit a smaller read and queue another request for the rest. This does not yet handle bad blocks when reading to perform resync, recovery, or check. 'md_trim_bio' will also be used for RAID10, so put it in md.c and export it. Signed-off-by: NeilBrown <neilb@suse.de>
| * md: Disable bad blocks and v0.90 metadata.NeilBrown2011-07-281-0/+4
| | | | | | | | | | | | | | v0.90 metadata cannot record bad blocks, so when loading metadata for such a device, set shift to -1. Signed-off-by: NeilBrown <neilb@suse.de>
| * md: load/store badblock list from v1.x metadataNeilBrown2011-07-281-6/+102
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Space must have been allocated when array was created. A feature flag is set when the badblock list is non-empty, to ensure old kernels don't load and trust the whole device. We only update the on-disk badblocklist when it has changed. If the badblocklist (or other metadata) is stored on a bad block, we don't cope very well. If metadata has no room for bad block, flag bad-blocks as disabled, and do the same for 0.90 metadata. Signed-off-by: NeilBrown <neilb@suse.de>
| * md/bad-block-log: add sysfs interface for accessing bad-block-log.NeilBrown2011-07-281-0/+123
| | | | | | | | | | | | | | | | | | | | | | | | | | This can show the log (providing it fits in one page) and allows bad blocks to be 'acknowledged' meaning that they have safely been recorded in metadata. Clearing bad blocks is not allowed via sysfs (except for code testing). A bad block can only be cleared when a write to the block succeeds. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
| * md: beginnings of bad block management.NeilBrown2011-07-281-3/+412
| | | | | | | | | | | | | | | | | | | | | | This the first step in allowing md to track bad-blocks per-device so that we can fail individual blocks rather than the whole device. This patch just adds a data structure for recording bad blocks, with routines to add, remove, search the list. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
| * md: remove suspicious size_of()NeilBrown2011-07-281-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When calling bioset_create we pass the size of the front_pad as sizeof(mddev) which looks suspicious as mddev is a pointer and so it looks like a common mistake where sizeof(*mddev) was intended. The size is actually correct as we want to store a pointer in the front padding of the bios created by the bioset, so make the intent more explicit by using sizeof(mddev_t *) Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * MD: generate an event when array sync is completeJonathan Brassow2011-07-271-0/+2
| | | | | | | | | | | | | | | | This patch causes MD to generate an event (for device-mapper) when the synchronization thread is reaped. This is expected behavior for device-mapper. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * md: get rid of unnecessary casts on page_address()Namhyung Kim2011-07-271-12/+11
| | | | | | | | | | | | | | page_address() returns void pointer, so the casts can be removed. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * md: change managed of recovery_disabled.NeilBrown2011-07-271-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we hit a read error while recovering a mirror, we want to abort the recovery without necessarily failing the disk - as having a disk this a read error is better than not having an array at all. Currently this is managed with a per-array flag "recovery_disabled" and is only implemented for RAID1. For RAID10 we will need finer grained control as we might want to disable recovery for individual devices separately. So push more of the decision making into the personality. 'recovery_disabled' is now a 'cookie' which is copied when the personality want to disable recovery and is changed when a device is added to the array as this is used as a trigger to 'try recovery again'. This will allow RAID10 to get the control that it needs. Signed-off-by: NeilBrown <neilb@suse.de>
| * md: remove ro check in md_check_recovery()Namhyung Kim2011-07-271-3/+0
| | | | | | | | | | | | | | | | | | | | Commit c89a8eee6154 ("Allow faulty devices to be removed from a readonly array.") added some work on ro array in the function, but it couldn't be done since we didn't allow the ro array to be handled from the beginning. Fix it. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * md: introduce link/unlink_rdev() helpersNamhyung Kim2011-07-271-33/+14
| | | | | | | | | | | | | | | | | | There are places where sysfs links to rdev are handled in a same way. Add the helper functions to consolidate them. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
* | fs: seq_file - add event counter to simplify poll() supportKay Sievers2011-07-201-18/+8
|/ | | | | | | | | | | | | | Moving the event counter into the dynamically allocated 'struc seq_file' allows poll() support without the need to allocate its own tracking structure. All current users are switched over to use the new counter. Requested-by: Andrew Morton akpm@linux-foundation.org Acked-by: NeilBrown <neilb@suse.de> Tested-by: Lucas De Marchi lucas.demarchi@profusion.mobi Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* md: avoid endless recovery loop when waiting for fail device to complete.NeilBrown2011-06-281-0/+1
| | | | | | | | | | | | | | | | | | | | | If a device fails in a way that causes pending request to take a while to complete, md will not be able to immediately remove it from the array in remove_and_add_spares. It will then incorrectly look like a spare device and md will try to recover it even though it is failed. This leads to a recovery process starting and instantly aborting over and over again. We should check if the device is faulty before considering it to be a spare. This will avoid trying to start a recovery that cannot proceed. This bug was introduced in 2.6.26 so that patch is suitable for any kernel since then. Cc: stable@kernel.org Reported-by: Jim Paradis <james.paradis@stratus.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md: check ->hot_remove_disk when removing diskNamhyung Kim2011-06-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Check pers->hot_remove_disk instead of pers->hot_add_disk in slot_store() during disk removal. The linear personality only has ->hot_add_disk and no ->hot_remove_disk, so that removing disk in the array resulted to following kernel bug: $ sudo mdadm --create /dev/md0 --level=linear --raid-devices=4 /dev/loop[0-3] $ echo none | sudo tee /sys/block/md0/md/dev-loop2/slot BUG: unable to handle kernel NULL pointer dereference at (null) IP: [< (null)>] (null) PGD c9f5d067 PUD 8575a067 PMD 0 Oops: 0010 [#1] SMP CPU 2 Modules linked in: linear loop bridge stp llc kvm_intel kvm asus_atk0110 sr_mod cdrom sg Pid: 10450, comm: tee Not tainted 3.0.0-rc1-leonard+ #173 System manufacturer System Product Name/P5G41TD-M PRO RIP: 0010:[<0000000000000000>] [< (null)>] (null) RSP: 0018:ffff880085757df0 EFLAGS: 00010282 RAX: ffffffffa00168e0 RBX: ffff8800d1431800 RCX: 000000000000006e RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff88008543c000 RBP: ffff880085757e48 R08: 0000000000000002 R09: 000000000000000a R10: 0000000000000000 R11: ffff88008543c2e0 R12: 00000000ffffffff R13: ffff8800b4641000 R14: 0000000000000005 R15: 0000000000000000 FS: 00007fe8c9e05700(0000) GS:ffff88011fa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 00000000b4502000 CR4: 00000000000406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process tee (pid: 10450, threadinfo ffff880085756000, task ffff8800c9f08000) Stack: ffffffff8138496a ffff8800b4641000 ffff88008543c268 0000000000000000 ffff8800b4641000 ffff88008543c000 ffff8800d1431868 ffffffff81a78a90 ffff8800b4641000 ffff88008543c000 ffff8800d1431800 ffff880085757e98 Call Trace: [<ffffffff8138496a>] ? slot_store+0xaa/0x265 [<ffffffff81384bae>] rdev_attr_store+0x89/0xa8 [<ffffffff8115a96a>] sysfs_write_file+0x108/0x144 [<ffffffff81106b87>] vfs_write+0xb1/0x10d [<ffffffff8106e6c0>] ? trace_hardirqs_on_caller+0x111/0x135 [<ffffffff81106cac>] sys_write+0x4d/0x77 [<ffffffff814fe702>] system_call_fastpath+0x16/0x1b Code: Bad RIP value. RIP [< (null)>] (null) RSP <ffff880085757df0> CR2: 0000000000000000 ---[ end trace ba5fc64319a826fb ]--- Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
* md: Using poll /proc/mdstat can monitor the events of adding a spare disks马建朋2011-06-091-0/+2
| | | | | Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
* MD: add sync_super to mddev_t structJonathan Brassow2011-06-081-2/+13
| | | | | | | | | | | | | | Add the 'sync_super' function pointer to MD array structure (struct mddev_s) If device-mapper (dm-raid.c) is to define its own on-disk superblock and be able to load it, there must still be a way for MD to initiate superblock updates. The simplest way to make this happen is to provide a pointer in the MD array structure that can be set by device-mapper (or other module) with a function to do this. If the function has been set, it will be used; otherwise, the method with be looked up via 'super_types' as usual. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
* MD: move thread wakeups into resumeJonathan Brassow2011-06-081-3/+7
| | | | | | | | | | | | | | | Move personality and sync/recovery thread starting outside md_run. Moving the wakeup's of the personality and sync/recovery threads out of md_run and into do_md_run and mddev_resume solves two issues: 1) It allows bitmap_load to be called before the sync_thread is run and 2) when MD personalities are used by device-mapper (dm-raid.c), the start-up of the array is better alligned with device-mapper primatives (CTR/resume/suspend/DTR). I/O - in this case, recovery operations - should not happen until after a resume has taken place. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
* MD: possible typoJonathan Brassow2011-06-081-2/+2
| | | | | | | | | | | Make message a bit clearer by s/blocks/k/ I chose 'k' vs 'kiB' or 'kB' because it is what is used earlier in the message. 'k' may be a bit ambigous, but I think it's better than "blocks" which normally means 512, but means 1024 in MD. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
* MD: no sync IO while suspendedJonathan Brassow2011-06-081-1/+3
| | | | | | | | | | Disallow resync I/O while the RAID array is suspended. Recovery, resync, and metadata I/O should not be allowed while a device is suspended. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
* MD: no integrity register if no gendiskJonathan Brassow2011-06-081-2/+2
| | | | | | | | | | Don't attempt md_integrity_register if there is no gendisk struct available. When MD arrays are built via device-mapper, the gendisk structure is not available via mddev. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md: allow resync_start to be set while an array is active.NeilBrown2011-05-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The sysfs attribute 'resync_start' (known internally as recovery_cp), records where a resync is up to. A value of 0 means the array is not known to be in-sync at all. A value of MaxSector means the array is believed to be fully in-sync. When the size of member devices of an array (RAID1,RAID4/5/6) is increased, the array can be increased to match. This process sets resync_start to the old end-of-device offset so that the new part of the array gets resynced. However with RAID1 (and RAID6) a resync is not technically necessary and may be undesirable. So it would be good if the implied resync after the array is resized could be avoided. So: change 'resync_start' so the value can be changed while the array is active, and as a precaution only allow it to be changed while resync/recovery is 'frozen'. Changing it once resync has started is not going to be useful anyway. This allows the array to be resized without a resync by: write 'frozen' to 'sync_action' write new size to 'component_size' (this will set resync_start) write 'none' to 'resync_start' write 'idle' to 'sync_action'. Also slightly improve some tests on recovery_cp when resizing raid1/raid5. Now that an arbitrary value could be set we should be more careful in our tests. Signed-off-by: NeilBrown <neilb@suse.de>
* md: reject a re-add request that cannot be honoured.NeilBrown2011-05-111-0/+10
| | | | | | | | | | | | | | | The 'add_new_disk' ioctl can be used to add a device either as a spare, or as an active disk that just needs to be resynced based on write-intent-bitmap information (re-add) Currently if a re-add is requested but fails we add as a spare instead. This makes it impossible for user-space to check for failure. So change to require that a re-add attempt will either succeed or completely fail. User-space can then decide what to do next. Signed-off-by: NeilBrown <neilb@suse.de>
* md: Fix race when creating a new md device.NeilBrown2011-05-111-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a race when creating an md device by opening /dev/mdXX. If two processes do this at much the same time they will follow the call path __blkdev_get -> get_gendisk -> kobj_lookup The first will call -> md_probe -> md_alloc -> add_disk -> blk_register_region and the race happens when the second gets to kobj_lookup after add_disk has called blk_register_region but before it returns to md_alloc. In the case the second will not call md_probe (as the probe is already done) but will get a handle on the gendisk, return to __blkdev_get which will then call md_open (via the ->open) pointer. As mddev->gendisk hasn't been set yet, md_open will think something is wrong an return with ERESTARTSYS. This can loop endlessly while the first thread makes no progress through add_disk. Nothing is blocking it, but due to scheduler behaviour it doesn't get a turn. So this is essentially a live-lock. We fix this by simply moving the assignment to mddev->gendisk before the call the add_disk() so md_open doesn't get confused. Also move blk_queue_flush earlier because add_disk should be as late as possible. To make sure that md_open doesn't complete until md_alloc has done all that is needed, we take mddev->open_mutex during the last part of md_alloc. md_open will wait for this. This can cause a lock-up on boot so Cc:ing for stable. For 2.6.36 and earlier a different patch will be needed as the 'blk_queue_flush' call isn't there. Signed-off-by: NeilBrown <neilb@suse.de> Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com> Tested-by: Thomas Jarosch <thomas.jarosch@intra2net.com> Cc: stable@kernel.org
* md: Cleanup after raid45->raid0 takeoverKrzysztof Wojcik2011-04-201-0/+1
| | | | | | | | | | | | | | Problem: After raid4->raid0 takeover operation, another takeover operation (e.g raid0->raid10) results "kernel oops". Root cause: Variables 'degraded' in mddev structure is not cleared on raid45->raid0 takeover. This patch reset this variable. Signed-off-by: Krzysztof Wojcik <krzysztof.wojcik@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md: provide generic support for handling unplug callbacks.NeilBrown2011-04-181-0/+56
| | | | | | | | | | | | | When an md device adds a request to a queue, it can call mddev_check_plugged. If this succeeds then we know that the md thread will be woken up shortly, and ->plug_cnt will be non-zero until then, so some processing can be delayed. If it fails, then no unplug callback is expected and the make_request function needs to do whatever is required to make the request happen. Signed-off-by: NeilBrown <neilb@suse.de>
* md - remove old plugging code.NeilBrown2011-04-181-51/+0
| | | | | | | | | | | | | md has some plugging infrastructure for RAID5 to use because the normal plugging infrastructure required a 'request_queue', and when called from dm, RAID5 doesn't have one of those available. This relied on the ->unplug_fn callback which doesn't exist any more. So remove all of that code, both in md and raid5. Subsequent patches with restore the plugging functionality. Signed-off-by: NeilBrown <neilb@suse.de>
* Fix common misspellingsLucas De Marchi2011-03-311-1/+1
| | | | | | Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
* md: Fix integrity registration error when no devices are capableMartin K. Petersen2011-03-281-6/+2
| | | | | | | | | | | | We incorrectly returned -EINVAL when none of the devices in the array had an integrity profile. This in turn prevented mdadm from starting the metadevice. Fix this so we only return errors on mismatched profiles and memory allocation failures. Reported-by: Giacomo Catenazzi <cate@cateee.net> Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds2011-03-241-12/+8
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits) Documentation/iostats.txt: bit-size reference etc. cfq-iosched: removing unnecessary think time checking cfq-iosched: Don't clear queue stats when preempt. blk-throttle: Reset group slice when limits are changed blk-cgroup: Only give unaccounted_time under debug cfq-iosched: Don't set active queue in preempt block: fix non-atomic access to genhd inflight structures block: attempt to merge with existing requests on plug flush block: NULL dereference on error path in __blkdev_get() cfq-iosched: Don't update group weights when on service tree fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away block: Require subsystems to explicitly allocate bio_set integrity mempool jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging fs: make fsync_buffers_list() plug mm: make generic_writepages() use plugging blk-cgroup: Add unaccounted time to timeslice_used. block: fixup plugging stubs for !CONFIG_BLOCK block: remove obsolete comments for blkdev_issue_zeroout. blktrace: Use rq->cmd_flags directly in blk_add_trace_rq. ... Fix up conflicts in fs/{aio.c,super.c}
| * block: Require subsystems to explicitly allocate bio_set integrity mempoolMartin K. Petersen2011-03-171-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MD and DM create a new bio_set for every metadevice. Each bio_set has an integrity mempool attached regardless of whether the metadevice is capable of passing integrity metadata. This is a waste of memory. Instead we defer the allocation decision to MD and DM since we know at metadevice creation time whether integrity passthrough is needed or not. Automatic integrity mempool allocation can then be removed from bioset_create() and we make an explicit integrity allocation for the fs_bio_set. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Acked-by: Mike Snitzer <snizer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
OpenPOWER on IntegriCloud