summaryrefslogtreecommitdiffstats
path: root/drivers/md
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'dm-3.13-changes' of ↵Linus Torvalds2013-11-1416-455/+1466
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper changes from Mike Snitzer: "A set of device-mapper changes for 3.13. Improve reliability of buffer allocations for dm messages with a small number of arguments, a couple path group initialization fixes for dm multipath, a fix for resizing a dm array, various fixes and optimizations for dm cache, a fix for device mapper's Kconfig menu indentation. Features added include: - dm crypt support for activating legacy CBC TrueCrypt containers (useful for forensics of these old TCRYPT containers) - reduced dm-cache memory requirements for each block in the cache - basic support for shrinking a dm-cache's cache (fast) device - most notably, dm-cache support for managing cache coherency when deploying dm-cache with sophisticated origin volumes (that support hardware snapshots and/or clustering): these changes come in the form of a new passthrough operation mode and a cache block invalidation interface" * tag 'dm-3.13-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (32 commits) dm cache: resolve small nits and improve Documentation dm cache: add cache block invalidation support dm cache: add remove_cblock method to policy interface dm cache policy mq: reduce memory requirements dm cache metadata: check the metadata version when reading the superblock dm cache: add passthrough mode dm cache: cache shrinking support dm cache: promotion optimisation for writes dm cache: be much more aggressive about promoting writes to discarded blocks dm cache policy mq: implement writeback_work() and mq_{set,clear}_dirty() dm cache: optimize commit_if_needed dm space map disk: optimise sm_disk_dec_block MAINTAINERS: add reference to device-mapper's linux-dm.git tree dm: fix Kconfig menu indentation dm: allow remove to be deferred dm table: print error on preresume failure dm crypt: add TCW IV mode for old CBC TCRYPT containers dm crypt: properly handle extra key string in initialization dm cache: log error message if dm_kcopyd_copy() fails dm cache: use cell_defer() boolean argument consistently ...
| * dm cache: resolve small nits and improve DocumentationMike Snitzer2013-11-122-2/+2
| | | | | | | | | | | | | | | | Document passthrough mode, cache shrinking, and cache invalidation. Also, use strcasecmp() and hlist_unhashed(). Reported-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache: add cache block invalidation supportJoe Thornber2013-11-111-3/+222
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Cache block invalidation is removing an entry from the cache without writing it back. Cache blocks can be invalidated via the 'invalidate_cblocks' message, which takes an arbitrary number of cblock ranges: invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]* E.g. dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789 Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache: add remove_cblock method to policy interfaceJoe Thornber2013-11-113-4/+57
| | | | | | | | | | | | | | | | | | | | | | | | Implement policy_remove_cblock() and add remove_cblock method to the mq policy. These methods will be used by the following cache block invalidation patch which adds the 'invalidate_cblocks' message to the cache core. Also, update some comments in dm-cache-policy.h Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache policy mq: reduce memory requirementsJoe Thornber2013-11-111-312/+231
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Rather than storing the cblock in each cache entry, we allocate all entries in an array and infer the cblock from the entry position. Saves 4 bytes of memory per cache block. In addition, this gives us an easy way of looking up cache entries by cblock. We no longer need to keep an explicit bitset to track which cblocks have been allocated. And no searching is needed to find free cblocks. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache metadata: check the metadata version when reading the superblockJoe Thornber2013-11-111-3/+21
| | | | | | | | | | | | | | Need to check the version to verify on-disk metadata is supported. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache: add passthrough modeJoe Thornber2013-11-113-35/+184
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | "Passthrough" is a dm-cache operating mode (like writethrough or writeback) which is intended to be used when the cache contents are not known to be coherent with the origin device. It behaves as follows: * All reads are served from the origin device (all reads miss the cache) * All writes are forwarded to the origin device; additionally, write hits cause cache block invalidates This mode decouples cache coherency checks from cache device creation, largely to avoid having to perform coherency checks while booting. Boot scripts can create cache devices in passthrough mode and put them into service (mount cached filesystems, for example) without having to worry about coherency. Coherency that exists is maintained, although the cache will gradually cool as writes take place. Later, applications can perform coherency checks, the nature of which will depend on the type of the underlying storage. If coherency can be verified, the cache device can be transitioned to writethrough or writeback mode while still warm; otherwise, the cache contents can be discarded prior to transitioning to the desired operating mode. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Morgan Mears <Morgan.Mears@netapp.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache: cache shrinking supportJoe Thornber2013-11-112-9/+120
| | | | | | | | | | | | | | | | Allow a cache to shrink if the blocks being removed from the cache are not dirty. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache: promotion optimisation for writesJoe Thornber2013-11-091-6/+87
| | | | | | | | | | | | | | | | | | | | | | | | If a write block triggers promotion and covers a whole block we can avoid a copy. Introduce dm_{hook,unhook}_bio to simplify saving and restoring bio fields (bi_private is now used by overwrite). Switch writethrough support over to using these helpers too. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache: be much more aggressive about promoting writes to discarded blocksJoe Thornber2013-11-091-21/+63
| | | | | | | | | | | | | | | | | | | | | | | | Previously these promotions only got priority if there were unused cache blocks. Now we give them priority if there are any clean blocks in the cache. The fio_soak_test in the device-mapper-test-suite now gives uniform performance across subvolumes (~16 seconds). Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache policy mq: implement writeback_work() and mq_{set,clear}_dirty()Joe Thornber2013-11-091-19/+128
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are now two multiqueues for in cache blocks. A clean one and a dirty one. writeback_work comes from the dirty one. Demotions come from the clean one. There are two benefits: - Performance improvement, since demoting a clean block is a noop. - The cache cleans itself when io load is light. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache: optimize commit_if_neededHeinz Mauelshagen2013-11-091-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | Check commit_requested flag _before_ calling dm_cache_changed_this_transaction() superfluously. Also, be sure to set last_commit_jiffies _after_ dm_cache_commit() completes. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm space map disk: optimise sm_disk_dec_blockJoe Thornber2013-11-091-17/+1
| | | | | | | | | | | | | | | | | | | | Don't waste time spotting blocks that have been allocated and then freed in the same transaction. The extra lookup is expensive, and I don't think it really gives us much. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm: fix Kconfig menu indentationMikulas Patocka2013-11-091-11/+11
| | | | | | | | | | | | | | | | | | The option DM_LOG_USERSPACE is sub-option of DM_MIRROR, so place it right after DM_MIRROR. Doing so fixes various other Device mapper targets/features to be properly nested under "Device mapper support". Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm: allow remove to be deferredMikulas Patocka2013-11-093-10/+86
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch allows the removal of an open device to be deferred until it is closed. (Previously such a removal attempt would fail.) The deferred remove functionality is enabled by setting the flag DM_DEFERRED_REMOVE in the ioctl structure on DM_DEV_REMOVE or DM_REMOVE_ALL ioctl. On return from DM_DEV_REMOVE, the flag DM_DEFERRED_REMOVE indicates if the device was removed immediately or flagged to be removed on close - if the flag is clear, the device was removed. On return from DM_DEV_STATUS and other ioctls, the flag DM_DEFERRED_REMOVE is set if the device is scheduled to be removed on closure. A device that is scheduled to be deleted can be revived using the message "@cancel_deferred_remove". This message clears the DMF_DEFERRED_REMOVE flag so that the device won't be deleted on close. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm table: print error on preresume failureMike Snitzer2013-11-091-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | If preresume fails it is worth logging an error given that a device is left suspended due to the failure. This change was motivated by local preresume error logging that was added to the cache target ("preresume failed"). Elevating this target-agnostic context for the where the target-specific error occurred relative to the DM core's callouts makes sense. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com>
| * dm crypt: add TCW IV mode for old CBC TCRYPT containersMilan Broz2013-11-091-2/+183
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dm-crypt can already activate TCRYPT (TrueCrypt compatible) containers in LRW or XTS block encryption mode. TCRYPT containers prior to version 4.1 use CBC mode with some additional tweaks, this patch adds support for these containers. This new mode is implemented using special IV generator named TCW (TrueCrypt IV with whitening). TCW IV only supports containers that are encrypted with one cipher (Tested with AES, Twofish, Serpent, CAST5 and TripleDES). While this mode is legacy and is known to be vulnerable to some watermarking attacks (e.g. revealing of hidden disk existence) it can still be useful to activate old containers without using 3rd party software or for independent forensic analysis of such containers. (Both the userspace and kernel code is an independent implementation based on the format documentation and it completely avoids use of original source code.) The TCW IV generator uses two additional keys: Kw (whitening seed, size is always 16 bytes - TCW_WHITENING_SIZE) and Kiv (IV seed, size is always the IV size of the selected cipher). These keys are concatenated at the end of the main encryption key provided in mapping table. While whitening is completely independent from IV, it is implemented inside IV generator for simplification. The whitening value is always 16 bytes long and is calculated per sector from provided Kw as initial seed, xored with sector number and mixed with CRC32 algorithm. Resulting value is xored with ciphertext sector content. IV is calculated from the provided Kiv as initial IV seed and xored with sector number. Detailed calculation can be found in the Truecrypt documentation for version < 4.1 and will also be described on dm-crypt site, see: http://code.google.com/p/cryptsetup/wiki/DMCrypt The experimental support for activation of these containers is already present in git devel brach of cryptsetup. Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm crypt: properly handle extra key string in initializationMilan Broz2013-11-091-11/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some encryption modes use extra keys (e.g. loopAES has IV seed) which are not used in block cipher initialization but are part of key string in table constructor. This patch adds an additional field which describes the length of the extra key(s) and substracts it before real key encryption setting. The key_size always includes the size, in bytes, of the key provided in mapping table. The key_parts describes how many parts (usually keys) are contained in the whole key buffer. And key_extra_size contains size in bytes of additional keys part (this number of bytes must be subtracted because it is processed by the IV generator). | K1 | K2 | .... | K64 | Kiv | |----------- key_size ----------------- | | |-key_extra_size-| | [64 keys] | [1 key] | => key_parts = 65 Example where key string contains main key K, whitening key Kw and IV seed Kiv: | K | Kiv | Kw | |--------------- key_size --------------| | |-----key_extra_size------| | [1 key] | [1 key] | [1 key] | => key_parts = 3 Because key_extra_size is calculated during IV mode setting, key initialization is moved after this step. For now, this change has no effect to supported modes (thanks to ilog2 rounding) but it is required by the following patch. Also, fix a sparse warning in crypt_iv_lmk_one(). Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache: log error message if dm_kcopyd_copy() failsHeinz Mauelshagen2013-11-091-1/+3
| | | | | | | | | | | | | | | | A migration failure should be logged (albeit limited). Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache: use cell_defer() boolean argument consistentlyHeinz Mauelshagen2013-11-091-4/+4
| | | | | | | | | | | | | | | | Fix a few cell_defer() calls that weren't passing a bool. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache: return -EINVAL if the user specifies unknown cache policyMikulas Patocka2013-11-092-8/+9
| | | | | | | | | | | | | | | | Return -EINVAL when the specified cache policy is unknown rather than returning -ENOMEM. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache metadata: return bool from __superblock_all_zeroesJoe Thornber2013-11-091-4/+5
| | | | | | | | | | Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache policy mq: a few small fixesJoe Thornber2013-11-091-10/+12
| | | | | | | | | | | | | | | | | | | | | | | | Rename takeout_queue to concat_queue. Fix a harmless bug in mq policies pop() function. Currently pop() always succeeds, with up coming changes this wont be the case. Fix typo in comment above pre_cache_to_cache prototype. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache policy: remove return from void policy_remove_mappingJoe Thornber2013-11-091-1/+1
| | | | | | | | | | | | | | No need to return from a void function. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache: improve efficiency of quiescing flag managementJoe Thornber2013-11-091-22/+5
| | | | | | | | | | | | | | | | Make the quiescing flag an atomic_t and stop protecting it with a spin lock. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache: fix a race condition between queuing new migrations and quiescing ↵Joe Thornber2013-11-091-14/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | for a shutdown The code that was trying to do this was inadequate. The postsuspend method (in ioctl context), needs to wait for the worker thread to acknowledge the request to quiesce. Otherwise the migration count may drop to zero temporarily before the worker thread realises we're quiescing. In this case the target will be taken down, but the worker thread may have issued a new migration, which will cause an oops when it completes. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.9+
| * dm cache: io destined for the cache device can now serve as tick biosJoe Thornber2013-11-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | Previously only origin bios could trigger ticks, which meant if all the io was destined for the cache no ticks were generated. If no ticks are generated then multiple hits, and movements in general, are attributed to the same tick. Only a stop gap fix, we need a better solution. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm cache policy mq: protect residency method with existing mutexJoe Thornber2013-11-091-2/+6
| | | | | | | | | | | | | | | | | | It is safe to use a mutex in mq_residency() at this point since it is only called from ioctl context. But future-proof mq_residency() by using might_sleep() to catch new contexts that cannot sleep. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm array: fix bug in growing arrayJoe Thornber2013-11-051-1/+4
| | | | | | | | | | | | | | | | Entries would be lost if the old tail block was partially filled. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 3.9+
| * dm mpath: requeue I/O during pg_initHannes Reinecke2013-11-051-4/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When pg_init is running no I/O can be submitted to the underlying devices, as the path priority etc might change. When using queue_io for this, requests will be piling up within multipath as the block I/O scheduler just sees a _very fast_ device. All of this queued I/O has to be resubmitted from within multipathing once pg_init is done. This approach has the problem that it's virtually impossible to abort I/O when pg_init is running, and we're adding heavy load to the devices after pg_init since all of the queued I/O needs to be resubmitted _before_ any requests can be pulled off of the request queue and normal operation continues. This patch will requeue the I/O that triggers the pg_init call, and return 'busy' when pg_init is in progress. With these changes the block I/O scheduler will stop submitting I/O during pg_init, resulting in a quicker path switch and less I/O pressure (and memory consumption) after pg_init. Signed-off-by: Hannes Reinecke <hare@suse.de> [patch header edited for clarity and typos by Mike Snitzer] Signed-off-by: Mike Snitzer <snitzer@redhat.com>
| * dm mpath: fix race condition between multipath_dtr and pg_init_doneShiva Krishna Merla2013-10-311-3/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Whenever multipath_dtr() is happening we must prevent queueing any further path activation work. Implement this by adding a new 'pg_init_disabled' flag to the multipath structure that denotes future path activation work should be skipped if it is set. By disabling pg_init and then re-enabling in flush_multipath_work() we also avoid the potential for pg_init to be initiated while suspending an mpath device. Without this patch a race condition exists that may result in a kernel panic: 1) If after pg_init_done() decrements pg_init_in_progress to 0, a call to wait_for_pg_init_completion() assumes there are no more pending path management commands. 2) If pg_init_required is set by pg_init_done(), due to retryable mode_select errors, then process_queued_ios() will again queue the path activation work. 3) If free_multipath() completes before activate_path() work is called a NULL pointer dereference like the following can be seen when accessing members of the recently destructed multipath: BUG: unable to handle kernel NULL pointer dereference at 0000000000000090 RIP: 0010:[<ffffffffa003db1b>] [<ffffffffa003db1b>] activate_path+0x1b/0x30 [dm_multipath] [<ffffffff81090ac0>] worker_thread+0x170/0x2a0 [<ffffffff81096c80>] ? autoremove_wake_function+0x0/0x40 [switch to disabling pg_init in flush_multipath_work & header edits by Mike Snitzer] Signed-off-by: Shiva Krishna Merla <shivakrishna.merla@netapp.com> Reviewed-by: Krishnasamy Somasundaram <somasundaram.krishnasamy@netapp.com> Tested-by: Speagle Andy <Andy.Speagle@netapp.com> Acked-by: Junichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
| * dm: allocate buffer for messages with small number of arguments using GFP_NOIOMikulas Patocka2013-10-311-2/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dm-mpath and dm-thin must process messages even if some device is suspended, so we allocate argv buffer with GFP_NOIO. These messages have a small fixed number of arguments. On the other hand, dm-switch needs to process bulk data using messages so excessive use of GFP_NOIO could cause trouble. The patch also lowers the default number of arguments from 64 to 8, so that there is smaller load on GFP_NOIO allocations. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Acked-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* | Merge branch 'for-3.13/core' of git://git.kernel.dk/linux-blockLinus Torvalds2013-11-144-56/+13
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block IO core updates from Jens Axboe: "This is the pull request for the core changes in the block layer for 3.13. It contains: - The new blk-mq request interface. This is a new and more scalable queueing model that marries the best part of the request based interface we currently have (which is fully featured, but scales poorly) and the bio based "interface" which the new drivers for high IOPS devices end up using because it's much faster than the request based one. The bio interface has no block layer support, since it taps into the stack much earlier. This means that drivers end up having to implement a lot of functionality on their own, like tagging, timeout handling, requeue, etc. The blk-mq interface provides all these. Some drivers even provide a switch to select bio or rq and has code to handle both, since things like merging only works in the rq model and hence is faster for some workloads. This is a huge mess. Conversion of these drivers nets us a substantial code reduction. Initial results on converting SCSI to this model even shows an 8x improvement on single queue devices. So while the model was intended to work on the newer multiqueue devices, it has substantial improvements for "classic" hardware as well. This code has gone through extensive testing and development, it's now ready to go. A pull request is coming to convert virtio-blk to this model will be will be coming as well, with more drivers scheduled for 3.14 conversion. - Two blktrace fixes from Jan and Chen Gang. - A plug merge fix from Alireza Haghdoost. - Conversion of __get_cpu_var() from Christoph Lameter. - Fix for sector_div() with 64-bit divider from Geert Uytterhoeven. - A fix for a race between request completion and the timeout handling from Jeff Moyer. This is what caused the merge conflict with blk-mq/core, in case you are looking at that. - A dm stacking fix from Mike Snitzer. - A code consolidation fix and duplicated code removal from Kent Overstreet. - A handful of block bug fixes from Mikulas Patocka, fixing a loop crash and memory corruption on blk cg. - Elevator switch bug fix from Tomoki Sekiyama. A heads-up that I had to rebase this branch. Initially the immutable bio_vecs had been queued up for inclusion, but a week later, it became clear that it wasn't fully cooked yet. So the decision was made to pull this out and postpone it until 3.14. It was a straight forward rebase, just pruning out the immutable series and the later fixes of problems with it. The rest of the patches applied directly and no further changes were made" * 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits) block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO block: Do not call sector_div() with a 64-bit divisor kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup() block: Consolidate duplicated bio_trim() implementations block: Use rw_copy_check_uvector() block: Enable sysfs nomerge control for I/O requests in the plug list block: properly stack underlying max_segment_size to DM device elevator: acquire q->sysfs_lock in elevator_change() elevator: Fix a race in elevator switching and md device initialization block: Replace __get_cpu_var uses bdi: test bdi_init failure block: fix a probe argument to blk_register_region loop: fix crash if blk_alloc_queue fails blk-core: Fix memory corruption if blkcg_init_queue fails block: fix race between request completion and timeout handling blktrace: Send BLK_TN_PROCESS events to all running traces blk-mq: don't disallow request merges for req->special being set blk-mq: mq plug list breakage blk-mq: fix for flush deadlock ...
| * | block: Consolidate duplicated bio_trim() implementationsKent Overstreet2013-11-084-56/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Someone cut and pasted md's md_trim_bio() into xen-blkfront.c. Come on, we should know better than this. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Neil Brown <neilb@suse.de> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | Merge tag 'driver-core-3.13-rc1' of ↵Linus Torvalds2013-11-073-4/+4
|\ \ \ | |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core / sysfs patches from Greg KH: "Here's the big driver core / sysfs update for 3.13-rc1. There's lots of dev_groups updates for different subsystems, as they all get slowly migrated over to the safe versions of the attribute groups (removing userspace races with the creation of the sysfs files.) Also in here are some kobject updates, devres expansions, and the first round of Tejun's sysfs reworking to enable it to be used by other subsystems as a backend for an in-kernel filesystem. All of these have been in linux-next for a while with no reported issues" * tag 'driver-core-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (83 commits) sysfs: rename sysfs_assoc_lock and explain what it's about sysfs: use generic_file_llseek() for sysfs_file_operations sysfs: return correct error code on unimplemented mmap() mdio_bus: convert bus code to use dev_groups device: Make dev_WARN/dev_WARN_ONCE print device as well as driver name sysfs: separate out dup filename warning into a separate function sysfs: move sysfs_hash_and_remove() to fs/sysfs/dir.c sysfs: remove unused sysfs_get_dentry() prototype sysfs: honor bin_attr.attr.ignore_lockdep sysfs: merge sysfs_elem_bin_attr into sysfs_elem_attr devres: restore zeroing behavior of devres_alloc() sysfs: fix sysfs_write_file for bin file input: gameport: convert bus code to use dev_groups input: serio: remove bus usage of dev_attrs input: serio: use DEVICE_ATTR_RO() i2o: convert bus code to use dev_groups memstick: convert bus code to use dev_groups tifm: convert bus code to use dev_groups virtio: convert bus code to use dev_groups ipack: convert bus code to use dev_groups ...
| * | Merge 3.12-rc6 into driver-core-nextGreg Kroah-Hartman2013-10-192-8/+13
| |\ \ | | | | | | | | | | | | | | | | | | | | We want these fixes here too. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * \ \ Merge 3.12-rc3 into driver-core-nextGreg Kroah-Hartman2013-09-2917-89/+226
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We want the driver core and sysfs fixes in here to make merges and development easier. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | sysfs: clean up sysfs_get_dirent()Tejun Heo2013-09-263-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The pre-existing sysfs interfaces which take explicit namespace argument are weird in that they place the optional @ns in front of @name which is contrary to the established convention. For example, we end up forcing vast majority of sysfs_get_dirent() users to do sysfs_get_dirent(parent, NULL, name), which is silly and error-prone especially as @ns and @name may be interchanged without causing compilation warning. This renames sysfs_get_dirent() to sysfs_get_dirent_ns() and swap the positions of @name and @ns, and sysfs_get_dirent() is now a wrapper around sysfs_get_dirent_ns(). This makes confusions a lot less likely. There are other interfaces which take @ns before @name. They'll be updated by following patches. This patch doesn't introduce any functional changes. v2: EXPORT_SYMBOL_GPL() wasn't updated leading to undefined symbol error on module builds. Reported by build test robot. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Cc: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | | | | raid5: avoid finding "discard" stripeShaohua Li2013-10-241-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SCSI discard will damage discard stripe bio setting, eg, some fields are changed. If the stripe is reused very soon, we have wrong bios setting. We remove discard stripe from hash list, so next time the strip will be fully initialized. Suitable for backport to 3.7+. Cc: <stable@vger.kernel.org> (3.7+) Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
* | | | | raid5: set bio bi_vcnt 0 for discard requestShaohua Li2013-10-241-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SCSI layer will add new payload for discard request. If two bios are merged to one, the second bio has bi_vcnt 1 which is set in raid5. This will confuse SCSI and cause oops. Suitable for backport to 3.7+ Cc: stable@vger.kernel.org (v3.7+) Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
* | | | | md: avoid deadlock when md_set_badblocks.Bian Yu2013-10-241-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When operate harddisk and hit errors, md_set_badblocks is called after scsi_restart_operations which already disabled the irq. but md_set_badblocks will call write_sequnlock_irq and enable irq. so softirq can preempt the current thread and that may cause a deadlock. I think this situation should use write_sequnlock_irqsave/irqrestore instead. I met the situation and the call trace is below: [ 638.919974] BUG: spinlock recursion on CPU#0, scsi_eh_13/1010 [ 638.921923] lock: 0xffff8800d4d51fc8, .magic: dead4ead, .owner: scsi_eh_13/1010, .owner_cpu: 0 [ 638.923890] CPU: 0 PID: 1010 Comm: scsi_eh_13 Not tainted 3.12.0-rc5+ #37 [ 638.925844] Hardware name: To be filled by O.E.M. To be filled by O.E.M./MAHOBAY, BIOS 4.6.5 03/05/2013 [ 638.927816] ffff880037ad4640 ffff880118c03d50 ffffffff8172ff85 0000000000000007 [ 638.929829] ffff8800d4d51fc8 ffff880118c03d70 ffffffff81730030 ffff8800d4d51fc8 [ 638.931848] ffffffff81a72eb0 ffff880118c03d90 ffffffff81730056 ffff8800d4d51fc8 [ 638.933884] Call Trace: [ 638.935867] <IRQ> [<ffffffff8172ff85>] dump_stack+0x55/0x76 [ 638.937878] [<ffffffff81730030>] spin_dump+0x8a/0x8f [ 638.939861] [<ffffffff81730056>] spin_bug+0x21/0x26 [ 638.941836] [<ffffffff81336de4>] do_raw_spin_lock+0xa4/0xc0 [ 638.943801] [<ffffffff8173f036>] _raw_spin_lock+0x66/0x80 [ 638.945747] [<ffffffff814a73ed>] ? scsi_device_unbusy+0x9d/0xd0 [ 638.947672] [<ffffffff8173fb1b>] ? _raw_spin_unlock+0x2b/0x50 [ 638.949595] [<ffffffff814a73ed>] scsi_device_unbusy+0x9d/0xd0 [ 638.951504] [<ffffffff8149ec47>] scsi_finish_command+0x37/0xe0 [ 638.953388] [<ffffffff814a75e8>] scsi_softirq_done+0xa8/0x140 [ 638.955248] [<ffffffff8130e32b>] blk_done_softirq+0x7b/0x90 [ 638.957116] [<ffffffff8104fddd>] __do_softirq+0xfd/0x330 [ 638.958987] [<ffffffff810b964f>] ? __lock_release+0x6f/0x100 [ 638.960861] [<ffffffff8174a5cc>] call_softirq+0x1c/0x30 [ 638.962724] [<ffffffff81004c7d>] do_softirq+0x8d/0xc0 [ 638.964565] [<ffffffff8105024e>] irq_exit+0x10e/0x150 [ 638.966390] [<ffffffff8174ad4a>] smp_apic_timer_interrupt+0x4a/0x60 [ 638.968223] [<ffffffff817499af>] apic_timer_interrupt+0x6f/0x80 [ 638.970079] <EOI> [<ffffffff810b964f>] ? __lock_release+0x6f/0x100 [ 638.971899] [<ffffffff8173fa6a>] ? _raw_spin_unlock_irq+0x3a/0x50 [ 638.973691] [<ffffffff8173fa60>] ? _raw_spin_unlock_irq+0x30/0x50 [ 638.975475] [<ffffffff81562393>] md_set_badblocks+0x1f3/0x4a0 [ 638.977243] [<ffffffff81566e07>] rdev_set_badblocks+0x27/0x80 [ 638.978988] [<ffffffffa00d97bb>] raid5_end_read_request+0x36b/0x4e0 [raid456] [ 638.980723] [<ffffffff811b5a1d>] bio_endio+0x1d/0x40 [ 638.982463] [<ffffffff81304ff3>] req_bio_endio.isra.65+0x83/0xa0 [ 638.984214] [<ffffffff81306b9f>] blk_update_request+0x7f/0x350 [ 638.985967] [<ffffffff81306ea1>] blk_update_bidi_request+0x31/0x90 [ 638.987710] [<ffffffff813085e0>] __blk_end_bidi_request+0x20/0x50 [ 638.989439] [<ffffffff8130862f>] __blk_end_request_all+0x1f/0x30 [ 638.991149] [<ffffffff81308746>] blk_peek_request+0x106/0x250 [ 638.992861] [<ffffffff814a62a9>] ? scsi_kill_request.isra.32+0xe9/0x130 [ 638.994561] [<ffffffff814a633a>] scsi_request_fn+0x4a/0x3d0 [ 638.996251] [<ffffffff813040a7>] __blk_run_queue+0x37/0x50 [ 638.997900] [<ffffffff813045af>] blk_run_queue+0x2f/0x50 [ 638.999553] [<ffffffff814a5750>] scsi_run_queue+0xe0/0x1c0 [ 639.001185] [<ffffffff814a7721>] scsi_run_host_queues+0x21/0x40 [ 639.002798] [<ffffffff814a2e87>] scsi_restart_operations+0x177/0x200 [ 639.004391] [<ffffffff814a4fe9>] scsi_error_handler+0xc9/0xe0 [ 639.005996] [<ffffffff814a4f20>] ? scsi_unjam_host+0xd0/0xd0 [ 639.007600] [<ffffffff81072f6b>] kthread+0xdb/0xe0 [ 639.009205] [<ffffffff81072e90>] ? flush_kthread_worker+0x170/0x170 [ 639.010821] [<ffffffff81748cac>] ret_from_fork+0x7c/0xb0 [ 639.012437] [<ffffffff81072e90>] ? flush_kthread_worker+0x170/0x170 This bug was introduce in commit 2e8ac30312973dd20e68073653 (the first time rdev_set_badblock was call from interrupt context), so this patch is appropriate for 3.5 and subsequent kernels. Cc: <stable@vger.kernel.org> (3.5+) Signed-off-by: Bian Yu <bianyu@kedacom.com> Reviewed-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
* | | | | md: Fix skipping recovery for read-only arrays.Lukasz Dorau2013-10-242-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since: commit 7ceb17e87bde79d285a8b988cfed9eaeebe60b86 md: Allow devices to be re-added to a read-only array. spares are activated on a read-only array. In case of raid1 and raid10 personalities it causes that not-in-sync devices are marked in-sync without checking if recovery has been finished. If a read-only array is degraded and one of its devices is not in-sync (because the array has been only partially recovered) recovery will be skipped. This patch adds checking if recovery has been finished before marking a device in-sync for raid1 and raid10 personalities. In case of raid5 personality such condition is already present (at raid5.c:6029). Bug was introduced in 3.10 and causes data corruption. Cc: stable@vger.kernel.org Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com> Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
* | | | | bcache: Fixed incorrect order of arguments to bio_alloc_bioset()Kent Overstreet2013-10-231-1/+1
| |_|/ / |/| | | | | | | | | | | | | | | | | | | Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: linux-stable <stable@vger.kernel.org> # >= v3.10 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | dm snapshot: fix data corruptionMikulas Patocka2013-10-161-6/+12
| |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes a particular type of data corruption that has been encountered when loading a snapshot's metadata from disk. When we allocate a new chunk in persistent_prepare, we increment ps->next_free and we make sure that it doesn't point to a metadata area by further incrementing it if necessary. When we load metadata from disk on device activation, ps->next_free is positioned after the last used data chunk. However, if this last used data chunk is followed by a metadata area, ps->next_free is positioned erroneously to the metadata area. A newly-allocated chunk is placed at the same location as the metadata area, resulting in data or metadata corruption. This patch changes the code so that ps->next_free skips the metadata area when metadata are loaded in function read_exceptions. The patch also moves a piece of code from persistent_prepare_exception to a separate function skip_metadata to avoid code duplication. CVE-2013-4299 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Cc: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* | | bcache: Fix a null ptr deref regressionKent Overstreet2013-10-101-2/+1
| |/ |/| | | | | | | | | | | | | | | | | | | Commit c0f04d88e46d ("bcache: Fix flushes in writeback mode") was fixing a reported data corruption bug, but it seems some last minute refactoring or rebasing introduced a null pointer deref. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: linux-stable <stable@vger.kernel.org> # >= v3.10 Reported-by: Gabriel de Perthuis <g2p.code@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge tag 'dm-3.12-fixes' of ↵Linus Torvalds2013-09-258-25/+118
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device-mapper fixes from Mike Snitzer: "A few fixes for dm-snapshot, a 32 bit fix for dm-stats, a couple error handling fixes for dm-multipath. A fix for the thin provisioning target to not expose non-zero discard limits if discards are disabled. Lastly, add two DM module parameters which allow users to tune the emergency memory reserves that DM mainatins per device -- this helps fix a long-standing issue for dm-multipath. The conservative default reserve for request-based dm-multipath devices (256) has proven problematic for users with many multipathed SCSI devices but relatively little memory. To responsibly select a smaller value users should use the new nr_bios tracepoint info (via commit 75afb352 "block: Add nr_bios to block_rq_remap tracepoint") to determine the peak number of bios their workloads create" * tag 'dm-3.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: add reserved_bio_based_ios module parameter dm: add reserved_rq_based_ios module parameter dm: lower bio-based mempool reservation dm thin: do not expose non-zero discard limits if discards disabled dm mpath: disable WRITE SAME if it fails dm-snapshot: fix performance degradation due to small hash size dm snapshot: workaround for a false positive lockdep warning dm stats: fix possible counter corruption on 32-bit systems dm mpath: do not fail path on -ENOSPC
| * | dm: add reserved_bio_based_ios module parameterMike Snitzer2013-09-233-5/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow user to change the number of IOs that are reserved by bio-based DM's mempools by writing to this file: /sys/module/dm_mod/parameters/reserved_bio_based_ios The default value is RESERVED_BIO_BASED_IOS (16). The maximum allowed value is RESERVED_MAX_IOS (1024). Export dm_get_reserved_bio_based_ios() for use by DM targets and core code. Switch to sizing dm-io's mempool and bioset using DM core's configurable 'reserved_bio_based_ios'. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Frank Mayhar <fmayhar@google.com>
| * | dm: add reserved_rq_based_ios module parameterMike Snitzer2013-09-233-4/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow user to change the number of IOs that are reserved by request-based DM's mempools by writing to this file: /sys/module/dm_mod/parameters/reserved_rq_based_ios The default value is RESERVED_REQUEST_BASED_IOS (256). The maximum allowed value is RESERVED_MAX_IOS (1024). Export dm_get_reserved_rq_based_ios() for use by DM targets and core code. Switch to sizing dm-mpath's mempool using DM core's configurable 'reserved_rq_based_ios'. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Frank Mayhar <fmayhar@google.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com>
| * | dm: lower bio-based mempool reservationMike Snitzer2013-09-231-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Bio-based device mapper processing doesn't need larger mempools (like request-based DM does), so lower the number of reserved entries for bio-based operation. 16 was already used for bio-based DM's bioset but mistakenly wasn't used for it's _io_cache. Formalize difference between bio-based and request-based defaults by introducing RESERVED_BIO_BASED_IOS and RESERVED_REQUEST_BASED_IOS. (based on older code from Mikulas Patocka) Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Frank Mayhar <fmayhar@google.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com>
| * | dm thin: do not expose non-zero discard limits if discards disabledMike Snitzer2013-09-231-3/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix issue where the block layer would stack the discard limits of the pool's data device even if the "ignore_discard" pool feature was specified. The pool and thin device(s) still had discards disabled because the QUEUE_FLAG_DISCARD request_queue flag wasn't set. But to avoid user confusion when "ignore_discard" is used: both the pool device and the thin device(s) have zeroes for all discard limits. Also, always set discard_zeroes_data_unsupported in targets because they should never advertise the 'discard_zeroes_data' capability (even if the pool's data device supports it). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Joe Thornber <ejt@redhat.com>
OpenPOWER on IntegriCloud