path: root/drivers/block
* Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client (Linus Torvalds, 2013-09-09, 1 file, -17/+19)

    Pull ceph updates from Sage Weil:
    "This includes both the first pile of Ceph patches (which I sent to
    torvalds@vger, sigh) and a few new patches that add support for
    fscache for Ceph. That includes a few fscache core fixes that David
    Howells asked to go through the Ceph tree. (Thanks go to Milosz
    Tanski for putting this feature together.)

    This first batch of patches (included here) has several important RBD
    bug fixes, hole punch support, several different cleanups in the page
    cache interactions, improvements in the truncate code (a new truncate
    mutex to avoid shenanigans with i_mutex), and a series of fixes in
    the synchronous striping read/write code.

    On top of that is a random collection of small fixes all across the
    tree (error code checks and error path cleanup, obsolete wq flags,
    etc.)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (43 commits)
      ceph: use d_invalidate() to invalidate aliases
      ceph: remove ceph_lookup_inode()
      ceph: trivial buildbot warnings fix
      ceph: Do not invalidate if the filesystem is mounted nofsc
      ceph: page still marked private_2
      ceph: ceph_readpage_to_fscache didn't check if marked
      ceph: clean PgPrivate2 on returning from readpages
      ceph: use fscache as a local persistent cache
      fscache: Netfs function for cleanup post readpages
      FS-Cache: Fix heading in documentation
      CacheFiles: Implement interface to check cache consistency
      FS-Cache: Add interface to check consistency of a cached object
      rbd: fix null dereference in dout
      rbd: fix buffer size for writes to images with snapshots
      libceph: use pg_num_mask instead of pgp_num_mask for pg.seed calc
      rbd: fix I/O error propagation for reads
      ceph: use the vfs __set_page_dirty_nobuffers interface instead of doing it inside the filesystem
      ceph: allow sync_read/write to return the partially successful size of a read/write
      ceph: fix bugs in handling short reads in sync read mode
      ceph: remove useless variable revoked_rdcache
      ...
| * rbd: fix null dereference in dout (Josh Durgin, 2013-09-03, 1 file, -3/+5)

    The order parameter is sometimes NULL in _rbd_dev_v2_snap_size(), but
    the dout() always dereferences it. Move this to another dout()
    protected by a check that order is non-NULL.

    Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
    Reviewed-by: Sage Weil <sage@inktank.com>
    Reviewed-by: Alex Elder <alex.elder@linaro.org>
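    The shape of the fix, as a sketch (the field names used here are
    illustrative assumptions, not the exact patch):

        /* Only dereference "order" under a non-NULL check; keep the
         * remaining debug output in its own unconditional dout(). */
        if (order) {
                *order = rbd_dev->header.obj_order;   /* assumed source field */
                dout("  order %u", (unsigned int)*order);
        }
        dout("  snap_size %llu\n", (unsigned long long)snap_size);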
| * rbd: fix buffer size for writes to images with snapshots (Josh Durgin, 2013-09-03, 1 file, -5/+5)

    rbd_osd_req_create() needs to know the snapshot context size to create
    a buffer large enough to send it with the message front. It gets this
    from the img_request, which was not set for the obj_request yet. This
    resulted in trying to write past the end of the front payload, hitting
    this BUG:

        libceph: BUG_ON(p > msg->front.iov_base + msg->front.iov_len);

    Fix this by associating the obj_request with its img_request
    immediately after it's created, before the osd request is created.

    Fixes: http://tracker.ceph.com/issues/5760
    Suggested-by: Alex Elder <alex.elder@linaro.org>
    Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
    Reviewed-by: Alex Elder <alex.elder@linaro.org>
| * rbd: fix I/O error propagation for reads (Josh Durgin, 2013-09-03, 1 file, -7/+7)

    When a request returns an error, the driver needs to report the entire
    extent of the request as completed. Writes already did this, since
    they always set xferred = length, but reads were skipping that step if
    an error other than -ENOENT occurred. Instead, rbd would end up
    passing 0 xferred to blk_end_request(), which would always report
    needing more data. This resulted in an assert failing when more data
    was required by the block layer, but all the object requests were
    done:

        [ 1868.719077] rbd: obj_request read result -108 xferred 0
        [ 1868.719518] end_request: I/O error, dev rbd1, sector 0
        [ 1868.719739] Assertion failure in rbd_img_obj_callback() at line 1736:
        [ 1868.719739]  rbd_assert(more ^ (which == img_request->obj_request_count));

    Without this assert, reads that hit errors would hang forever, since
    the block layer considered them incomplete.

    Fixes: http://tracker.ceph.com/issues/5647
    CC: stable@vger.kernel.org # v3.10
    Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
    Reviewed-by: Alex Elder <alex.elder@linaro.org>
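    A sketch of the completion-path rule being restored, using the field
    names described above (not the literal patch):

        /* On any read error, report the whole extent as transferred so
         * blk_end_request() can finish the request instead of waiting
         * for bytes that will never arrive. */
        if (obj_request->result < 0 && obj_request->result != -ENOENT)
                obj_request->xferred = obj_request->length;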
| *   Merge remote-tracking branch 'linus/master' into testing (Sage Weil, 2013-08-15, 28 files, -682/+2164)
| * | block: rbd: use NULL instead of 0 (Jingoo Han, 2013-08-09, 1 file, -2/+2)

    The local variables such as 'bio_list' and 'pages' are pointers; thus,
    use NULL instead of 0 to fix the following sparse warnings:

        drivers/block/rbd.c:2166:32: warning: Using plain integer as NULL pointer
        drivers/block/rbd.c:2168:31: warning: Using plain integer as NULL pointer

    Signed-off-by: Jingoo Han <jg1.han@samsung.com>
    Reviewed-by: Sage Weil <sage@inktank.com>
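    The change itself is mechanical; a before/after illustration:

        /* Before - sparse warns "Using plain integer as NULL pointer": */
        struct bio *bio_list = 0;
        struct page **pages = 0;

        /* After: */
        struct bio *bio_list = NULL;
        struct page **pages = NULL;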
* | | Merge git://git.infradead.org/users/willy/linux-nvme (Linus Torvalds, 2013-09-07, 2 files, -201/+408)

    Pull NVM Express driver update from Matthew Wilcox.

    * git://git.infradead.org/users/willy/linux-nvme:
      NVMe: Merge issue on character device bring-up
      NVMe: Handle ioremap failure
      NVMe: Add pci suspend/resume driver callbacks
      NVMe: Use normal shutdown
      NVMe: Separate controller init from disk discovery
      NVMe: Separate queue alloc/free from create/delete
      NVMe: Group pci related actions in functions
      NVMe: Disk stats for read/write commands only
      NVMe: Bring up cdev on set feature failure
      NVMe: Fix checkpatch issues
      NVMe: Namespace IDs are unsigned
      NVMe: Update nvme_id_power_state with latest spec
      NVMe: Split header file into user-visible and kernel-visible pieces
      NVMe: Call nvme_process_cq from submission path
      NVMe: Remove "process_cq did something" message
      NVMe: Return correct value from interrupt handler
      NVMe: Disk IO statistics
      NVMe: Restructure MSI / MSI-X setup
      NVMe: Use kzalloc instead of kmalloc+memset
| * | | NVMe: Merge issue on character device bring-up (Keith Busch, 2013-09-06, 1 file, -4/+8)

    A recent patch made it possible to bring up the character handle when
    the device is responsive but not accepting a set-features command.
    Another recent patch moved the initialization, which requires moving
    where the checks for this condition occur. This patch merges these two
    ideas so it works much as before.

    Signed-off-by: Keith Busch <keith.busch@intel.com>
    Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | NVMe: Handle ioremap failure (Keith Busch, 2013-09-03, 1 file, -8/+22)

    Decrement the number of queues required for doorbell remapping until
    the memory is successfully mapped for that size. Additional checks are
    done so that we don't call free_irq if it has already been freed.

    Signed-off-by: Keith Busch <keith.busch@intel.com>
    Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | NVMe: Add pci suspend/resume driver callbacks (Keith Busch, 2013-09-03, 1 file, -15/+58)

    Used for going in and out of low power states. Resuming reuses the IO
    queues from the previous initialization, freeing any allocated queues
    that are no longer usable.

    Signed-off-by: Keith Busch <keith.busch@intel.com>
    Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
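    A minimal sketch of how such callbacks plug into a PCI driver; the
    helper names nvme_dev_shutdown() and nvme_dev_start() are assumptions
    standing in for the shared teardown/bring-up paths described above:

        static int nvme_suspend(struct device *dev)
        {
                struct pci_dev *pdev = to_pci_dev(dev);
                struct nvme_dev *ndev = pci_get_drvdata(pdev);

                return nvme_dev_shutdown(ndev); /* assumed shared teardown */
        }

        static int nvme_resume(struct device *dev)
        {
                struct pci_dev *pdev = to_pci_dev(dev);
                struct nvme_dev *ndev = pci_get_drvdata(pdev);

                return nvme_dev_start(ndev);    /* assumed shared bring-up */
        }

        static SIMPLE_DEV_PM_OPS(nvme_dev_pm_ops, nvme_suspend, nvme_resume);
        /* wired up via .driver.pm = &nvme_dev_pm_ops in the pci_driver */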
| * | | NVMe: Use normal shutdown (Keith Busch, 2013-09-03, 1 file, -0/+26)

    The NVMe spec recommends using the normal shutdown sequence when
    safely taking the controller offline, instead of hitting CC.EN on the
    next start-up to reset the controller. The spec recommends a minimum
    of 1 second for the shutdown to complete. This patch waits 2 seconds
    to be on the safe side.

    Signed-off-by: Keith Busch <keith.busch@intel.com>
    Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
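    A sketch of a spec-style normal shutdown (register and flag names from
    the NVMe header; the 2-second bound mirrors the text above):

        unsigned long timeout = jiffies + 2 * HZ;
        u32 cc = readl(&dev->bar->cc);

        /* Request a normal shutdown via CC.SHN ... */
        cc = (cc & ~NVME_CC_SHN_MASK) | NVME_CC_SHN_NORMAL;
        writel(cc, &dev->bar->cc);

        /* ... then poll CSTS.SHST until the controller reports complete. */
        while ((readl(&dev->bar->csts) & NVME_CSTS_SHST_MASK) !=
                                                NVME_CSTS_SHST_CMPLT) {
                msleep(100);
                if (time_after(jiffies, timeout))
                        break;          /* give up after ~2 seconds */
        }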
| * | | NVMe: Separate controller init from disk discovery (Keith Busch, 2013-09-03, 1 file, -30/+47)

    This combines the controller initialization into one function,
    removing IO queue setup from namespace discovery, and creates
    symmetric functions for device removal. The controller start and
    shutdown functions can now be called from resume/suspend context as
    well as probe/remove.

    Signed-off-by: Keith Busch <keith.busch@intel.com>
    Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | NVMe: Separate queue alloc/free from create/delete (Keith Busch, 2013-09-03, 1 file, -39/+94)

    This separates nvme queue allocation from creation, and queue deletion
    from freeing. This is so that we may in the future temporarily disable
    queues and reuse the same memory when bringing them back online, like
    coming back from suspend state.

    Signed-off-by: Keith Busch <keith.busch@intel.com>
    Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | NVMe: Group pci related actions in functions (Keith Busch, 2013-09-03, 1 file, -46/+66)

    This will make it easier to reuse these outside probe/remove.

    Signed-off-by: Keith Busch <keith.busch@intel.com>
    Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | NVMe: Disk stats for read/write commands only (Keith Busch, 2013-09-03, 1 file, -3/+3)

    Flush and discard requests would previously mess up the accounting.

    Signed-off-by: Keith Busch <keith.busch@intel.com>
    Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | NVMe: Bring up cdev on set feature failure (Keith Busch, 2013-09-03, 1 file, -2/+2)

    This patch creates the character device as long as a device's admin
    queues are usable, so a user has an opportunity to perform
    administration tasks. A device may be in a state that does not allow
    IO, and setting the queue count feature in such a state returns an
    error. Previously the driver would bail and the controller would be
    unusable.

    Signed-off-by: Keith Busch <keith.busch@intel.com>
| * | | NVMe: Fix checkpatch issues (Keith Busch, 2013-09-03, 1 file, -5/+4)

    Signed-off-by: Keith Busch <keith.busch@intel.com>
| * | | NVMe: Namespace IDs are unsigned (Matthew Wilcox, 2013-09-03, 1 file, -2/+5)

    The 'Number of Namespaces' read from the device was being treated as
    signed, which would cause us to not scan any namespaces for a device
    with more than 2 billion namespaces. That led to noticing that the
    namespace ID was also being treated as signed, which could lead to the
    result from NVME_IOCTL_ID being treated as an error code.

    Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
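    A sketch of the signedness hazard: with a signed counter, a controller
    reporting more than 2^31 - 1 namespaces makes the loop bound negative
    and the scan never runs (nvme_alloc_ns() is an assumed per-namespace
    setup helper, ctrl_id the identify-controller data):

        unsigned int i, nn = le32_to_cpup(&ctrl_id->nn); /* Number of Namespaces */

        for (i = 1; i <= nn; i++)
                nvme_alloc_ns(dev, i);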
| * | | NVMe: Call nvme_process_cq from submission path (Matthew Wilcox, 2013-06-24, 1 file, -19/+20)

    Since we have the queue locked, it makes sense to check if there are
    any completion queue entries on the queue before we release the lock.
    If there are, it may save an interrupt and reduce latency for the I/Os
    that happened to complete. This happens fairly often for some
    workloads.

    Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | NVMe: Remove "process_cq did something" message (Matthew Wilcox, 2013-06-24, 1 file, -2/+1)

    I was originally intending to log the fact that the kthread had done
    some work, since it might help us find interrupt handling problems,
    but that hasn't been done yet, and spamming the logs with this message
    is just rude.

    Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | NVMe: Return correct value from interrupt handler (Matthew Wilcox, 2013-06-24, 1 file, -5/+9)

    The interrupt handler currently reports whether it found any new
    completion queue entries. If the completion queue is primarily being
    processed by a method other than the interrupt handler, it may return
    IRQ_NONE so often that Linux thinks that the interrupt is being
    falsely triggered. To solve this problem, report whether any
    completion queue entries have been seen since the last interrupt was
    received for this queue.

    Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
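    A sketch of the described approach (the cqe_seen flag name is an
    assumption here): latch whether any completion entries were seen since
    the last interrupt, including ones reaped from the submission path,
    and report that instead of only this handler's own progress:

        static irqreturn_t nvme_irq(int irq, void *data)
        {
                struct nvme_queue *nvmeq = data;
                irqreturn_t result;

                spin_lock(&nvmeq->q_lock);
                nvme_process_cq(nvmeq);        /* may find nothing new ... */
                result = nvmeq->cqe_seen ? IRQ_HANDLED : IRQ_NONE;
                nvmeq->cqe_seen = 0;           /* ... but earlier polls count */
                spin_unlock(&nvmeq->q_lock);
                return result;
        }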
| * | | NVMe: Disk IO statistics (Keith Busch, 2013-06-20, 1 file, -0/+28)

    Add io stats accounting for bio requests so nvme block devices show
    useful disk stats.

    Signed-off-by: Keith Busch <keith.busch@intel.com>
    Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | NVMe: Restructure MSI / MSI-X setup (Matthew Wilcox, 2013-06-20, 1 file, -21/+23)

    The current code copies 'nr_io_queues' into 'q_count', modifies
    'nr_io_queues' during MSI-X setup, then resets 'nr_io_queues' for MSI
    setup. Instead, copy 'nr_io_queues' into 'vecs' and modify 'vecs'
    during both MSI-X and MSI setup. This lets us simplify the for-loops
    that set up MSI-X and MSI, and opens the possibility of using more I/O
    queues than we have interrupt vectors, should future benchmarking
    prove that to be a useful feature.

    Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
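    A sketch of the retry shape under the old pci_enable_msix() contract,
    where a positive return value is the number of vectors actually
    available (dev->entry is assumed pre-filled with msix_entry indices):

        int vecs = nr_io_queues;

        for (;;) {
                int result = pci_enable_msix(pdev, dev->entry, vecs);

                if (result == 0)        /* got all "vecs" vectors */
                        break;
                if (result < 0) {       /* hard failure: fall back to MSI */
                        vecs = 1;
                        break;
                }
                vecs = result;          /* retry with what the hw offers */
        }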
| * | | NVMe: Use kzalloc instead of kmalloc+memset (Tushar Behera, 2013-06-19, 1 file, -16/+8)

    Use kzalloc instead of kmalloc and a subsequent memset.

    Signed-off-by: Tushar Behera <tushar.behera@linaro.org>
    Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
    Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
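    The pattern, for reference:

        /* Before: two steps, and the memset is easy to mis-size. */
        mem = kmalloc(size, GFP_KERNEL);
        if (mem)
                memset(mem, 0, size);

        /* After: one call that returns zeroed memory. */
        mem = kzalloc(size, GFP_KERNEL);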
* | | | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial (Linus Torvalds, 2013-09-06, 1 file, -1/+1)

    Pull trivial tree from Jiri Kosina:
    "The usual trivial updates all over the tree - mostly typo fixes and
    documentation updates"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits)
      doc: Documentation/cputopology.txt fix typo
      treewide: Convert retrun typos to return
      Fix comment typo for init_cma_reserved_pageblock
      Documentation/trace: Correcting and extending tracepoint documentation
      mm/hotplug: fix a typo in Documentation/memory-hotplug.txt
      power: Documentation: Update s2ram link
      doc: fix a typo in Documentation/00-INDEX
      Documentation/printk-formats.txt: No casts needed for u64/s64
      doc: Fix typo "is is" in Documentations
      treewide: Fix printks with 0x%#
      zram: doc fixes
      Documentation/kmemcheck: update kmemcheck documentation
      doc: documentation/hwspinlock.txt fix typo
      PM / Hibernate: add section for resume options
      doc: filesystems: Fix typo in Documentations/filesystems
      scsi/megaraid: fixed several typos in comments
      ppc: init_32: Fix error typo "CONFIG_START_KERNEL"
      treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
      page_isolation: Fix a comment typo in test_pages_isolated()
      doc: fix a typo about irq affinity
      ...
| * | | | treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks (Joe Perches, 2013-08-20, 1 file, -1/+1)

    Don't emit OOM warnings when k.alloc calls fail when there is a
    v.alloc immediately afterwards. Converted a kmalloc/vmalloc with
    memset to kzalloc/vzalloc.

    Signed-off-by: Joe Perches <joe@perches.com>
    Acked-by: "Theodore Ts'o" <tytso@mit.edu>
    Signed-off-by: Jiri Kosina <jkosina@suse.cz>
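    The pattern in question: the kmalloc attempt is opportunistic, so its
    OOM warning is just noise when a vmalloc fallback follows:

        ptr = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
        if (!ptr)
                ptr = vmalloc(size);

        /* and the zeroing variant the changelog mentions: */
        ptr = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
        if (!ptr)
                ptr = vzalloc(size);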
* | | | | Merge tag 'driver-core-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core (Linus Torvalds, 2013-09-03, 1 file, -5/+9)

    Pull driver core patches from Greg KH:
    "Here's the big driver core pull request for 3.12-rc1.

    Lots of tiny changes here fixing up the way sysfs attributes are
    created, to try to make drivers simpler, and fix a whole class of
    race conditions with creations of device attributes after the device
    was announced to userspace.

    All the various pieces are acked by the different subsystem
    maintainers"

    * tag 'driver-core-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (119 commits)
      firmware loader: fix pending_fw_head list corruption
      drivers/base/memory.c: introduce help macro to_memory_block
      dynamic debug: line queries failing due to uninitialized local variable
      sysfs: sysfs_create_groups returns a value.
      debugfs: provide debugfs_create_x64() when disabled
      rbd: convert bus code to use bus_groups
      firmware: dcdbas: use binary attribute groups
      sysfs: add sysfs_create/remove_groups for when SYSFS is not enabled
      driver core: add #include <linux/sysfs.h> to core files.
      HID: convert bus code to use dev_groups
      Input: serio: convert bus code to use drv_groups
      Input: gameport: convert bus code to use drv_groups
      driver core: firmware: use __ATTR_RW()
      driver core: core: use DEVICE_ATTR_RO
      driver core: bus: use DRIVER_ATTR_WO()
      driver core: create write-only attribute macros for devices and drivers
      sysfs: create __ATTR_WO()
      driver-core: platform: convert bus code to use dev_groups
      workqueue: convert bus code to use dev_groups
      MEI: convert bus code to use dev_groups
      ...
| * | | | rbd: convert bus code to use bus_groups (Greg Kroah-Hartman, 2013-08-27, 1 file, -5/+9)

    The bus_attrs field of struct bus_type is going away soon; dev_groups
    should be used instead. This converts the RBD bus code to use the
    correct field.

    Cc: Yehuda Sadeh <yehuda@inktank.com>
    Cc: Sage Weil <sage@inktank.com>
    Acked-by: Alex Elder <elder@linaro.org>
    Cc: <ceph-devel@vger.kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | | | aoe: adjust ref of head for compound page tails (Ed Cashin, 2013-08-13, 1 file, -10/+7)

    Fix a BUG which can trigger when direct I/O is used with AOE.

    As discussed previously, the fact that some users of the block layer
    provide bios that point to pages with a zero _count means that it is
    not OK for the network layer to do a put_page on the skb frags during
    an skb_linearize, so the aoe driver gets a reference to pages in bios
    and puts the reference before ending the bio. And because it cannot
    use get_page on a page with a zero _count, it manipulates the value
    directly.

    It is not OK to increment the _count of a compound page tail, though,
    since the VM layer will VM_BUG_ON a non-zero _count. Block users that
    do direct I/O can result in the aoe driver seeing compound page tails
    in bios. In that case, the same logic works as long as the head of the
    compound page is used instead of the tails. This patch handles
    compound pages and does not BUG.

    It relies on the block layer user leaving the relationship between the
    page tail and its head alone for the duration between the submission
    of the bio and its completion, whether successful or not.

    Signed-off-by: Ed Cashin <ecashin@coraid.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
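    A sketch of the head-vs-tail rule (bv is assumed to be the bio_vec
    being pinned; the raw _count access matches the pre-page_ref era this
    log describes):

        struct page *page = compound_head(bv->bv_page);

        atomic_inc(&page->_count);      /* pin across skb_linearize() */
        /* ... network send, bio completion ... */
        atomic_dec(&page->_count);      /* drop before ending the bio */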
* | | Merge branch 'for-3.11/drivers' of git://git.kernel.dk/linux-block (Linus Torvalds, 2013-07-22, 16 files, -624/+1915)

    Pull block IO driver bits from Jens Axboe:
    "As I mentioned in the core block pull request, due to real life
    circumstances the driver pull request would be late. Now it looks
    like -rc2 late... On the plus side, apart from the rsxx update, these
    are all things that I could argue could go in later in the cycle, as
    they are fixes and not features. So even though things are late, it's
    not ALL bad.

    The pull request contains:

    - Updates to bcache, all bug fixes, from Kent.
    - A pile of drbd bug fixes (no big features this time!).
    - xen blk front/back fixes.
    - rsxx driver updates, some of them deferred from 3.10, so they
      should be well cooked by now"

    * 'for-3.11/drivers' of git://git.kernel.dk/linux-block: (63 commits)
      bcache: Allocation kthread fixes
      bcache: Fix GC_SECTORS_USED() calculation
      bcache: Journal replay fix
      bcache: Shutdown fix
      bcache: Fix a sysfs splat on shutdown
      bcache: Advertise that flushes are supported
      bcache: check for allocation failures
      bcache: Fix a dumb race
      bcache: Use standard utility code
      bcache: Update email address
      bcache: Delete fuzz tester
      bcache: Document shrinker reserve better
      bcache: FUA fixes
      drbd: Allow online change of al-stripes and al-stripe-size
      drbd: Constants should be UPPERCASE
      drbd: Ignore the exit code of a fence-peer handler if it returns too late
      drbd: Fix rcu_read_lock balance on error path
      drbd: fix error return code in drbd_init()
      drbd: Do not sleep inside rcu
      bcache: Refresh usage docs
      ...
| *   Merge tag 'v3.10-rc7' into for-3.11/drivers (Jens Axboe, 2013-07-02, 8 files, -442/+647)

    Linux 3.10-rc7

    Pull this in early to avoid doing it with the bcache merge, since
    there are a number of changes to bcache between my old base (3.10-rc1)
    and the new pull request.
| * | | drbd: Allow online change of al-stripes and al-stripe-size (Philipp Reisner, 2013-06-28, 5 files, -52/+172)

    Allow the AL layout to be changed with a resize operation. For that
    the resize command gets two new fields: al_stripes and al_stripe_size.

    In order to make the operation crash safe:
    1) Lock out all IO and MD-IO
    2) Write the super block with MDF_PRIMARY_IND clear
    3) Write the bitmap to the new location (all zeros, since we allow
       this only while connected)
    4) Initialize the new AL-area
    5) Write the super block with the restored MDF_PRIMARY_IND
    6) Unfreeze all IO

    Since the AL layout has no influence on the protocol, this operation
    needs to be performed on both sides of a resource (if intended).

    Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
    Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | drbd: Constants should be UPPERCASE (Philipp Reisner, 2013-06-28, 3 files, -14/+19)

    Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
    Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | drbd: Ignore the exit code of a fence-peer handler if it returns too late (Philipp Reisner, 2013-06-28, 3 files, -3/+17)

    In case the connection was established and lost again before a
    fence-peer handler returns, ignore the exit code of this instance (and
    use the exit code of the later-started instance).

    Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
    Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | drbd: Fix rcu_read_lock balance on error path (Andreas Gruenbacher, 2013-06-28, 1 file, -7/+12)

    Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
    Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | drbd: fix error return code in drbd_init() (Wei Yongjun, 2013-06-28, 1 file, -3/+1)

    Fix to return a negative error code from the error handling case
    instead of 0, as returned elsewhere in this function.

    Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
    Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
    Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
    Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | drbd: Do not sleep inside rcu (Andreas Gruenbacher, 2013-06-28, 1 file, -2/+2)

    Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | Merge branch 'stable/for-jens-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-3.11/drivers (Jens Axboe, 2013-06-28, 4 files, -422/+1214)

    Konrad writes:

    It has the 'feature-max-indirect-segments' implemented in both backend
    and frontend. The current problem with the backend and frontend is
    that the segment size is limited to 11 pages. It means we can at most
    squeeze in 44kB per request. The ring can hold 32 (next power of two
    below 36) requests, meaning we can have 1.4MB of outstanding requests.
    Nowadays that is not enough.

    The problem was addressed in the past in two ways - but neither one
    went upstream. The first solution, proposed by Justin from
    Spectralogic, was to negotiate the segment size. This means that
    'struct blkif_sring_entry' becomes variable-sized. It can expand from
    112 bytes (covering 11 pages of data - 44kB) to 1580 bytes (256 pages
    of data - so 1MB). It is a simple extension: just make the array in
    the request expand from 11 entries to a negotiated variable size. But
    it has limits: this extension still caps the number of segments per
    request at 255 (as the total number must be specified in the request,
    which only has an 8-bit field for that purpose).

    The other solution (from Intel - Ronghui) was to create one extra ring
    that only holds 'struct blkif_request_segment' entries. The 'struct
    blkif_request' would be changed to carry an index into said 'segment
    ring'. There is only one segment ring, so the size of the initial ring
    stays the same. The requests point into the segment ring and enumerate
    how many of the indexes they want to use. The limit is of course the
    size of the segment ring. Assuming a one-page segment ring, this means
    one request can cover ~4MB. Those patches were posted as RFC and the
    author never followed up on the idea of making it more flexible.

    There is yet another mechanism that could be employed (which these
    patches implement) - and it borrows from the VirtIO protocol:
    'indirect descriptors'. This is very similar to what Intel suggested,
    but with a twist. The twist is to negotiate how many of these
    'segment' pages (aka indirect descriptor pages) we want to support (in
    reality we negotiate how many entries in the segment we want to cover,
    capping the number if it is bigger than the segment size).

    This means that with the existing 36 slots in the ring (single page)
    we can cover: 32 slots * (512 segments per blkif_request_indirect *
    4096 bytes) ~= 64MB. Since we have ample space in the
    blkif_request_indirect to span more than one indirect page, that
    number (64MB) can also be multiplied by eight = 512MB.

    Roger Pau Monne took the idea and implemented it in these patches.
    They work great, and the corner cases (migration between backends with
    and without this extension) work nicely.

    The backend has a limit right now on how many indirect entries it can
    handle: one indirect page, and at maximum 256 entries (out of 512 - so
    50% of the page is used). That comes out to 32 slots * 256 entries in
    an indirect page * 1 indirect page per request * 4096 = 32MB. This is
    a conservative number that can change in the future. Right now it
    strikes a good balance between giving excellent performance, memory
    usage in the backend, and balancing the needs of many guests.

    In the patchset there is also the split of the blkback structure to be
    per-VBD. This means that the spinlock contention we had with many
    guests trying to do I/O and all the blkback threads hitting the same
    lock has been eliminated. Also there are bug fixes to deal with oddly
    sized sectors, insane amounts of requests on the ring, and a security
    fix (posted earlier).
| | * | | xen-blkback: check the number of iovecs before allocating a bio (Roger Pau Monne, 2013-06-25, 1 file, -1/+2)

    With the introduction of indirect segments we can receive requests
    with a number of segments bigger than the maximum number of allowed
    iovecs in a bio, so make sure that blkback doesn't try to allocate a
    bio with more iovecs than BIO_MAX_PAGES.

    Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
    Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
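    The clamp, as a sketch (nseg stands in for the request's segment
    count):

        int nr_iovecs = min_t(int, nseg, BIO_MAX_PAGES);
        struct bio *bio = bio_alloc(GFP_KERNEL, nr_iovecs);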
| | * | | xen-blkfront: set blk_queue_max_hw_sectors correctly (Roger Pau Monne, 2013-06-21, 1 file, -1/+1)

    Now that indirect segments are enabled, blk_queue_max_hw_sectors must
    be set to match the maximum number of sectors we can handle in a
    request.

    Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
    Reported-by: Felipe Franciosi <felipe.franciosi@citrix.com>
    Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * | | xen-blkback: workaround compiler bug in gcc 4.1 (Roger Pau Monne, 2013-06-21, 1 file, -10/+14)

    The code generated with gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-54)
    creates an unbound loop for the second foreach_grant_safe loop in
    purge_persistent_gnt.

    The workaround is to avoid having this second loop and instead perform
    all the work inside the first loop by adding a new variable,
    clean_used, that will be set when all the desired persistent grants
    have been removed and we need to iterate over the remaining ones to
    remove the WAS_ACTIVE flag.

    Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
    Reported-by: Tom O'Neill <toneill@vmem.com>
    Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * | | xen/blkback: Check for insane amounts of requests on the ring (v6) (Konrad Rzeszutek Wilk, 2013-06-17, 3 files, -1/+16)

    Check that the ring does not have an insane amount of requests (more
    than could fit on the ring). If we detect this case we will stop
    processing the requests and wait until the XenBus disconnects the
    ring.

    The existing check, RING_REQUEST_CONS_OVERFLOW, which checks how many
    responses we have created in the past (rsp_prod_pvt) vs requests
    consumed (req_cons), and whether said difference is greater than or
    equal to the size of the ring, does not catch this case. What that
    condition does check is whether there is a need to process more, as we
    still have a backlog of responses to finish. Note that both of those
    values (rsp_prod_pvt and req_cons) are not exposed on the shared ring.

    To understand this problem, a mini crash course in ring protocol
    request/response updates is in order. There are four entries: req_prod
    and rsp_prod, plus req_event and rsp_event, to track the ring entries.
    We are only concerned about the first two - which set the tone of this
    bug. The req_prod is a value incremented by the frontend for each
    request put on the ring. Conversely, rsp_prod is a value incremented
    by the backend for each response put on the ring (rsp_prod gets set
    from rsp_prod_pvt when pushing the responses on the ring). Both values
    can wrap and are modulo the size of the ring (in the block case that
    is 32). Please see RING_GET_REQUEST and RING_GET_RESPONSE for more
    details.

    The culprit here is that if the difference between req_prod and
    req_cons is greater than the ring size, we have a problem. Fortunately
    for us, the '__do_block_io_op' loop:

        rc = blk_rings->common.req_cons;
        rp = blk_rings->common.sring->req_prod;

        while (rc != rp) {
                ..
                blk_rings->common.req_cons = ++rc; /* before make_response() */
        }

    will loop up to the point when rc == rp. The macro inside the loop
    (RING_GET_REQUEST) is smart and indexes based on the modulo of the
    ring size. If the frontend has provided a bogus req_prod value, we
    will loop until 'rc == rp' - which means we could be processing
    already-processed requests (or responses) often.

    The reason RING_REQUEST_CONS_OVERFLOW is not helping here is because
    it only tracks how many responses we have internally produced and
    whether we should process more. The astute reader will notice that the
    macro RING_REQUEST_CONS_OVERFLOW takes two arguments - more on this
    later.

    For example, if we were to enter this function with these values:

        blk_rings->common.sring->req_prod = X + 31415
            (X is the value from the last time __do_block_io_op was called)
        blk_rings->common.req_cons = X
        blk_rings->common.rsp_prod_pvt = X

    then RING_REQUEST_CONS_OVERFLOW(&blk_rings->common,
    blk_rings->common.req_cons) is doing:

        req_cons - rsp_prod_pvt >= 32

    which is X - X >= 32, or 0 >= 32. That is false, so we continue on
    looping (this bug).

    If we reuse said macro but pass in rp (sring->req_prod) instead of rc,
    the macro can do the check:

        req_prod - rsp_prod_pvt >= 32

    which is X + 31415 - X >= 32, or 31415 >= 32, which is true, so we can
    error out and break out of the function.

    Unfortunately, the difference between rsp_prod_pvt and req_prod can
    legitimately be 32 (which would error out in the macro). This
    condition exists when the backend is lagging behind with the responses
    and still has not finished responding to all of them (so make_response
    has not been called), and rsp_prod_pvt + 32 == req_cons. This ends up
    with us not being able to use said macro.

    Hence we introduce a new macro called RING_REQUEST_PROD_OVERFLOW,
    which does a simple check of:

        req_prod - rsp_prod_pvt > RING_SIZE

    And with the X values from above, X + 31415 - X > 32 returns true.
    Also note that if the ring is full (which is where
    RING_REQUEST_CONS_OVERFLOW triggered), we would not hit the same
    condition: X + 32 - X > 32 is false. Let's use that macro.

    Note that in v5 of this patchset the macro was different - we used an
    earlier version.

    Cc: stable@vger.kernel.org
    [v1: Move the check outside the loop]
    [v2: Add a pr_warn as suggested by David]
    [v3: Use RING_REQUEST_CONS_OVERFLOW as suggested by Jan]
    [v4: Move wake_up after kthread_stop as suggested by Jan]
    [v5: Use RING_REQUEST_PROD_OVERFLOW instead]
    [v6: Use RING_REQUEST_PROD_OVERFLOW - Jan's version]
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    Reviewed-by: Jan Beulich <jbeulich@suse.com>
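    The new check in macro form, matching the arithmetic above (a sketch;
    the real definition lives in the shared ring header):

        /* A frontend-supplied req_prod more than a full ring ahead of the
         * backend's private response counter cannot be legitimate. */
        #define RING_REQUEST_PROD_OVERFLOW(_r, _prod) \
                (((_prod) - (_r)->rsp_prod_pvt) > RING_SIZE(_r))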
| | * | | xen/blkback: Check device permissions before allowing OP_DISCARD (Konrad Rzeszutek Wilk, 2013-06-07, 1 file, -1/+12)

    We need to make sure that the device is not RO, and that the request
    is not past the number of sectors we want to issue the DISCARD
    operation for.

    This fixes CVE-2013-2140.

    Cc: stable@vger.kernel.org
    Acked-by: Jan Beulich <JBeulich@suse.com>
    Acked-by: Ian Campbell <Ian.Campbell@citrix.com>
    [v1: Made it pr_warn instead of pr_debug]
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * | | xen/blkback: Use physical sector size for setup (Stefan Bader, 2013-06-07, 2 files, -3/+23)

    Currently xen-blkback passes the logical sector size over xenbus, and
    xen-blkfront sets up the paravirt disk with that logical block size.
    But newer drives usually have the logical sector size set to 512 for
    compatibility reasons and show the actual sector size only in the
    physical sector size. This results in the device being partitioned and
    accessed in dom0 with the correct sector size, but the guest thinks
    512 bytes is the correct block size. And that results in poor
    performance.

    To fix this, blkback is modified to also pass physical-sector-size
    over xenbus, and blkfront uses both values to set up the paravirt
    disk. I did not just change the passed-in sector-size because I am not
    sure having a bigger logical sector size than the physical one is
    valid (and that would happen if a newer dom0 kernel hits an older domU
    kernel). Also, this way a domU set up before should still be
    accessible (just some tools might detect the unaligned setup).

    [v2: Make xenbus write failure non-fatal]
    [v3: Use xenbus_scanf instead of xenbus_gather]
    [v4: Rebased against segment changes]
    Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
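    A sketch of the frontend side: read the optional xenbus key and fall
    back to the logical sector size when an older backend doesn't publish
    it:

        unsigned int physical_sector_size;

        if (xenbus_scanf(XBT_NIL, info->xbdev->otherend,
                         "physical-sector-size", "%u",
                         &physical_sector_size) != 1)
                physical_sector_size = sector_size;     /* older dom0 */

        blk_queue_logical_block_size(info->rq, sector_size);
        blk_queue_physical_block_size(info->rq, physical_sector_size);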
| | * | | xen-blkfront: Introduce a 'max' module parameter to alter the amount of indirect segments (Konrad Rzeszutek Wilk, 2013-06-04, 1 file, -0/+2)

    The max module parameter (by default 32) is the maximum number of
    segments that the frontend will negotiate with the backend for
    indirect descriptors. A higher value means more potential throughput
    but more memory usage. The backend picks the minimum of the frontend
    value and its own default.

    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * | | xen-blkfront: use a different scatterlist for each request (Roger Pau Monne, 2013-05-08, 1 file, -25/+18)

    In blkif_queue_request blkfront iterates over the scatterlist in order
    to set the segments of the request, and in blkif_completion blkfront
    iterates over the raw request, which makes it hard to know the exact
    position of the source and destination memory positions. This can be
    solved by allocating a scatterlist for each request, which is kept
    until the request is finished, allowing us to copy the data back to
    the original memory without having to iterate over the raw request.

    Oracle-Bug: 16660413 - LARGE ASYNCHRONOUS READS APPEAR BROKEN ON 2.6.39-400
    CC: stable@vger.kernel.org
    Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
    Reported-and-Tested-by: Anne Milicia <anne.milicia@oracle.com>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * | | xen-blkback: allocate list of pending reqs in small chunks (Roger Pau Monne, 2013-05-07, 3 files, -78/+106)

    Allocate pending requests in smaller chunks instead of allocating them
    all at the same time. This change also removes the global array of
    pending_reqs; it is no longer necessary.

    Variables related to the grant mapping have been grouped into a struct
    called "grant_page"; this allows us to allocate them in smaller chunks
    and also improves memory locality.

    Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
    Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
    Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
    Reviewed-by: David Vrabel <david.vrabel@citrix.com>
    Cc: David Vrabel <david.vrabel@citrix.com>
    Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * | | xen-block: implement indirect descriptors (Roger Pau Monne, 2013-04-18, 4 files, -125/+604)

    Indirect descriptors introduce a new block operation
    (BLKIF_OP_INDIRECT) that passes grant references instead of segments
    in the request. These grant references are filled with arrays of
    blkif_request_segment_aligned; this way we can send more segments in a
    request.

    The proposed implementation sets the maximum number of indirect grefs
    (frames filled with blkif_request_segment_aligned) to 256 in the
    backend and 32 in the frontend. The value in the frontend has been
    chosen experimentally, and the backend value has been set to a sane
    value that allows expanding the maximum number of indirect descriptors
    in the frontend if needed.

    The migration code has changed from the previous implementation, in
    which we simply remapped the segments on the shared ring. Now the
    maximum number of segments allowed in a request can change depending
    on the backend, so we have to requeue all the requests in the ring and
    in the queue, and split the bios in them if they are bigger than the
    new maximum number of segments.

    [v2: Fixed minor comments by Konrad]
    [v1: Added padding to make the indirect request 64bit aligned.
     Added some BUGs, comments; fixed number of indirect pages in
     blkif_get_x86_{32/64}_req. Added description about the indirect
     operation in blkif.h]
    Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
    [v3: Fixed spaces and tabs mix ups]
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * | | xen-blkback: expand map/unmap functions (Roger Pau Monne, 2013-04-18, 1 file, -55/+86)

    Preparatory change for implementing indirect descriptors. Change
    xen_blkbk_{map/unmap} so they can map/unmap an arbitrary number of
    grants (previously limited to BLKIF_MAX_SEGMENTS_PER_REQUEST). Also,
    remove the usage of pending_req in the map/unmap functions, so we can
    map/unmap grants without needing to pass a pending_req.

    Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
    Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    Cc: xen-devel@lists.xen.org
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * | | xen-blkback: make the queue of free requests per backend (Roger Pau Monne, 2013-04-18, 3 files, -105/+74)

    Remove the last dependency on blkbk by moving the list of free
    requests into blkif. This change reduces contention on the list of
    available requests.

    Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
    Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    Cc: xen-devel@lists.xen.org
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>