path: root/drivers/block
Commit message | Author | Age | Files | Lines
* zram: replace global tb_lock with fine grain lock [Weijie Yang, 2014-08-06, 2 files, -33/+60]
Currently, we use a rwlock tb_lock to protect concurrent access to the whole zram meta table. However, according to the actual access model, there is only a small chance for upper users to access the same table[index], so the current lock granularity is too big.

The idea of this optimization is to change the lock granularity from the whole meta table to a per-entry lock (table -> table[index]), so that we protect concurrent access to the same table[index] while allowing the maximum concurrency. With this in mind, several kinds of locks which could be used as a per-entry lock were tested and compared.

Test environment: x86-64 Intel Core2 Q8400, system memory 4GB, Ubuntu 12.04, kernel v3.15.0-rc3 as base, zram with 4 max_comp_streams LZO.

iozone test:
  iozone -t 4 -R -r 16K -s 200M -I +Z
  (1GB zram with ext4 filesystem, average of 10 tests, KB/s)

  Test            base      CAS       spinlock  rwlock    bit_spinlock
  --------------------------------------------------------------------
  Initial write   1381094   1425435   1422860   1423075   1421521
  Rewrite         1529479   1641199   1668762   1672855   1654910
  Read            8468009   11324979  11305569  11117273  10997202
  Re-read         8467476   11260914  11248059  11145336  10906486
  Reverse Read    6821393   8106334   8282174   8279195   8109186
  Stride read     7191093   8994306   9153982   8961224   9004434
  Random read     7156353   8957932   9167098   8980465   8940476
  Mixed workload  4172747   5680814   5927825   5489578   5972253
  Random write    1483044   1605588   1594329   1600453   1596010
  Pwrite          1276644   1303108   1311612   1314228   1300960
  Pread           4324337   4632869   4618386   4457870   4500166

To increase the chance of accessing the same table[index] concurrently, set zram to a small disksize (10MB) and let the threads run with a large loop count.

fio test:
  fio --bs=32k --randrepeat=1 --randseed=100 --refill_buffers
      --scramble_buffers=1 --direct=1 --loops=3000 --numjobs=4
      --filename=/dev/zram0 --name=seq-write --rw=write --stonewall
      --name=seq-read --rw=read --stonewall --name=seq-readwrite
      --rw=rw --stonewall --name=rand-readwrite --rw=randrw --stonewall
  (10MB zram raw block device, average of 10 tests, KB/s)

  Test       base      CAS       spinlock  rwlock    bit_spinlock
  ----------------------------------------------------------------
  seq-write  933789    999357    1003298   995961    1001958
  seq-read   5634130   6577930   6380861   6243912   6230006
  seq-rw     1405687   1638117   1640256   1633903   1634459
  rand-rw    1386119   1614664   1617211   1609267   1612471

All the optimization methods show higher performance than the base; however, it is hard to say which method is the most appropriate. On the other hand, zram is mostly used on small embedded systems, so we don't want to increase the memory footprint. This patch picks the bit_spinlock method and packs the object size and page flags into a single unsigned long table.value, so as not to add any memory overhead on either 32-bit or 64-bit systems.

Finally, even though different kinds of locks have different performance, we can ignore the difference: if zram is used as a swap device, the swap subsystem prevents concurrent access to the same swap slot; if zram is used as a block device with a filesystem on top, the filesystem and the page cache mostly prevent concurrent access to the same block. So the performance differences among the locks can be ignored.
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
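A minimal sketch of the packing the commit above describes; the names here are illustrative, not the exact identifiers that landed in zram_drv.h. The lock is just one bit inside the same word that carries the size and flags, so the per-entry lock costs no extra memory:

  #include <linux/bit_spinlock.h>

  /* Low bits of ->value hold the compressed size; higher bits hold flags. */
  #define ZRAM_FLAG_SHIFT  24
  #define ZRAM_ACCESS_BIT  (ZRAM_FLAG_SHIFT + 1)   /* per-entry lock bit */
  #define ZRAM_SIZE_MASK   ((1UL << ZRAM_FLAG_SHIFT) - 1)

  struct zram_table_entry {
      unsigned long handle;
      unsigned long value;   /* size | flags in one word: no extra footprint */
  };

  static void zram_lock_table(struct zram_table_entry *entry)
  {
      bit_spin_lock(ZRAM_ACCESS_BIT, &entry->value);
  }

  static void zram_unlock_table(struct zram_table_entry *entry)
  {
      bit_spin_unlock(ZRAM_ACCESS_BIT, &entry->value);
  }

  static size_t zram_get_obj_size(struct zram_table_entry *entry)
  {
      return entry->value & ZRAM_SIZE_MASK;
  }

A read or write path then brackets its access with zram_lock_table(&meta->table[index]) / zram_unlock_table(...), replacing the old global rwlock.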
* zram: use size_t instead of u16 [Minchan Kim, 2014-08-06, 1 file, -1/+1]
Some architectures (e.g., hexagon and PowerPC) could use a PAGE_SHIFT of 16 or more. In these cases u16 is not sufficiently large to represent a compressed page's size, so use size_t.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Weijie Yang <weijie.yang@samsung.com>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zram: remove unused SECTOR_SIZE define [Sergey Senozhatsky, 2014-08-06, 1 file, -1/+0]
Drop the SECTOR_SIZE define, because it's not used.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Weijie Yang <weijie.yang@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zram: rename struct `table' to `zram_table_entry' [Sergey Senozhatsky, 2014-08-06, 1 file, -2/+2]
Andrew Morton has recently noted that `struct table' actually represents a table entry and, thus, should be renamed. Rename it to `zram_table_entry'.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Weijie Yang <weijie.yang@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* zram: avoid lockdep splat by revalidate_disk [Minchan Kim, 2014-07-23, 1 file, -4/+18]
Sasha reported a lockdep warning [1] introduced by [2]. It can be fixed by doing the disk revalidation outside of the init_lock. This is safe because the disk capacity change itself is protected by init_lock, so revalidate_disk always sees an up-to-date value and there is no race.

[1] https://lkml.org/lkml/2014/7/3/735
[2] zram: revalidate disk after capacity change

Fixes 2e32baea46ce ("zram: revalidate disk after capacity change").

Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: "Alexander E. Patrakov" <patrakov@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
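A condensed sketch of the resulting ordering, with simplified zram names; the point is only that revalidate_disk() runs after init_lock is dropped, since taking bdev->bd_mutex while holding init_lock is what lockdep flagged:

  static ssize_t disksize_store_sketch(struct zram *zram, u64 disksize)
  {
      down_write(&zram->init_lock);
      /* ... allocate meta, record zram->disksize ... */
      set_capacity(zram->disk, disksize >> SECTOR_SHIFT);
      up_write(&zram->init_lock);

      /*
       * Out of the lock: the capacity above was set under init_lock,
       * so revalidate always observes an up-to-date value.
       */
      revalidate_disk(zram->disk);

      return 0;
  }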
* drbd: fix regression 'out of mem, failed to invoke fence-peer helper' [Lars Ellenberg, 2014-07-10, 1 file, -0/+6]
Since Linux kernel 3.13, kthread_run() internally uses wait_for_completion_killable(). We sometimes call kthread_run() while we still have a signal pending, which we used to kick our threads out of potentially blocking network functions; this causes kthread_run() to mistake it for a new fatal signal and fail.

Fix: flush_signals() before kthread_run().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
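A sketch of the pattern this fix applies; the helper name is hypothetical (on kernels of this era flush_signals() comes via <linux/sched.h>):

  #include <linux/sched.h>
  #include <linux/kthread.h>
  #include <linux/err.h>

  static int drbd_spawn_helper(int (*fn)(void *), void *data)
  {
      struct task_struct *task;

      /* Drop our self-inflicted signal so kthread_run()'s internal
       * wait_for_completion_killable() does not see it as fatal. */
      flush_signals(current);

      task = kthread_run(fn, data, "drbd_fence_helper");
      return IS_ERR(task) ? PTR_ERR(task) : 0;
  }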
* zram: revalidate disk after capacity change [Minchan Kim, 2014-07-03, 1 file, -1/+4]
Alexander reported that mkswap on /dev/zram0 fails if another process has the block device file open. The steps are as follows:

0. Reset the unused zram device.
1. Use a program that opens /dev/zram0 with O_RDWR and sleeps until killed.
2. While that program sleeps, echo the correct value to /sys/block/zram0/disksize.
3. Verify (e.g. in /proc/partitions) that the disk size is applied correctly. It is.
4. While that program still sleeps, attempt to mkswap /dev/zram0. This fails:
   mkswap: error: swap area needs to be at least 40 KiB

When I investigated, the size returned by ioctl(fd, BLKGETSIZE64, xxx), which mkswap uses to get the size of the block device, was zero, even though zram0 had the right size after step 2. The reason is that zram didn't revalidate the disk after changing its capacity, so the size of the block device's inode stayed stale until every open file on it was closed.

This patch fixes the bug.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Alexander E. Patrakov <patrakov@gmail.com>
Tested-by: Alexander E. Patrakov <patrakov@gmail.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of git://git.kernel.dk/linux-block [Linus Torvalds, 2014-06-26, 2 files, -2/+5]
Pull block fixes from Jens Axboe:
"A small collection of fixes/changes for the current series. This contains:

 - Removal of dead code from Gu Zheng.

 - Revert of two bad fixes that went in earlier in this round, marking things as __init that were not purely used from init.

 - A fix for blk_mq_start_hw_queue() using __blk_mq_run_hw_queue(), which could place us wrongly. Make it use the non __ variant, which handles cases where we are called from the wrong CPU set. From me.

 - A fix for drbd, which allocates discard requests without room for the SCSI payload. From Lars Ellenberg.

 - A fix for a use-after-free in the blkcg code from Tejun.

 - Addition of limiting gaps in SG lists, if the hardware needs it. This is the last pre-req patch for blk-mq to enable the full NVMe conversion. Could wait until 3.17, but it's simple enough so it would be nice to have everything we need for the NVMe port in the 3.17 release. From me"

* 'for-linus' of git://git.kernel.dk/linux-block:
  drbd: fix NULL pointer deref in blk_add_request_payload
  blk-mq: blk_mq_start_hw_queue() should use blk_mq_run_hw_queue()
  block: add support for limiting gaps in SG lists
  bio: remove unused macro bip_vec_idx()
  Revert "block: add __init to elv_register"
  Revert "block: add __init to blkcg_policy_register"
  blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t
  floppy: format block0 read error message properly
| * drbd: fix NULL pointer deref in blk_add_request_payload [Lars Ellenberg, 2014-06-25, 1 file, -1/+4]
Discards don't have any payload. But the SCSI layer still expects a bio_vec it can use internally; see sd_setup_discard_cmnd() and blk_add_request_payload().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
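A sketch of the idea behind the fix, with a hypothetical helper name: allocate the discard bio with room for one bio_vec even though no data is attached, so the SCSI disk driver has somewhere to put its UNMAP parameter list:

  static struct bio *drbd_alloc_discard_bio_sketch(void)
  {
      /*
       * nr_iovecs = 1, not 0: sd_setup_discard_cmnd() will later store
       * its UNMAP parameter page in bi_io_vec[0] through
       * blk_add_request_payload(), and dereferences that slot.
       */
      return bio_alloc(GFP_NOIO, 1);
  }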
| * Merge branch 'for-jens' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/linux-block into for-linus [Jens Axboe, 2014-06-18, 1 file, -1/+1]
| | * floppy: format block0 read error message properly [Jiri Kosina, 2014-06-18, 1 file, -1/+1]
In case reading of block 0 fails, a line without a trailing newline is printed, causing dmesg to look horrible.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* | | rbd: handle parent_overlap on writes correctly [Ilya Dryomov, 2014-06-23, 1 file, -1/+9]
The following check in rbd_img_obj_request_submit()

    rbd_dev->parent_overlap <= obj_request->img_offset

allows the fall-through to the non-layered write case even if both parent_overlap and obj_request->img_offset belong to the same RADOS object. This leads to data corruption, because the area to the left of parent_overlap ends up unconditionally zero-filled instead of being populated with parent data.

Suppose we want to write 1M to offset 6M of image bar, which is a clone of foo@snap; object_size is 4M, parent_overlap is 5M:

    rbd_data.<id>.0000000000000001
     ---------------------|----------------------|------------
    | should be copyup'ed | should be zeroed out | write ...
     ---------------------|----------------------|------------
    4M                    5M                     6M
                   parent_overlap        obj_request->img_offset

4..5M should be copyup'ed from foo, yet it is zero-filled, just like 5..6M is.

Given that the only striping mode the kernel client currently supports is chunking (i.e. stripe_unit == object_size, stripe_count == 1), round parent_overlap up to the next object boundary for the purposes of the overlap check.

Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
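A sketch of the corrected check, assuming chunking so the object size is a power of two (names here are illustrative):

  #include <linux/kernel.h>   /* round_up(), requires power-of-two step */
  #include <linux/types.h>

  /* object size == 1 << obj_order */
  static bool rbd_write_needs_copyup(u64 img_offset, u64 parent_overlap,
                                     u8 obj_order)
  {
      /*
       * Round the overlap up to the next object boundary: a write in
       * the same RADOS object as the end of the overlap must still
       * take the copyup path, or 4..5M in the example above would be
       * zero-filled instead of copied from the parent.
       */
      u64 rounded = round_up(parent_overlap, 1ULL << obj_order);

      return img_offset < rounded;
  }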
* | | Merge branch 'for-linus' of git://git.kernel.dk/linux-block [Linus Torvalds, 2014-06-19, 1 file, -2/+5]
Pull block fixes from Jens Axboe:
"A smaller collection of fixes for the block core that would be nice to have in -rc2. This pull request contains:

 - Fixes for races in the wait/wakeup logic used in blk-mq from Alexander. No issues have been observed, but it is definitely a bit flakey currently. Alternatively, we may drop the cyclic wakeups going forward, but that needs more testing.

 - Some cleanups from Christoph.

 - Fix for an oops in null_blk if queue_mode=1 and softirq completions are used. From me.

 - A fix for a regression caused by the chunk size setting. It inadvertently used max_hw_sectors instead of max_sectors, which is incorrect, and causes hangs on btrfs multi-disk setups (where hw sectors apparently isn't set). From me.

 - Removal of WQ_POWER_EFFICIENT in the kblockd creation. This was a recent addition as well, but it actually breaks blk-mq, which relies on strict scheduling. If the workqueue power_efficient mode is turned on, this breaks blk-mq. From Matias.

 - null_blk module parameter description fix from Mike"

* 'for-linus' of git://git.kernel.dk/linux-block:
  blk-mq: bitmap tag: fix races in bt_get() function
  blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt
  blk-mq: bitmap tag: fix races on shared ::wake_index fields
  block: blk_max_size_offset() should check ->max_sectors
  null_blk: fix softirq completions for queue_mode == 1
  blk-mq: merge blk_mq_drain_queue and __blk_mq_drain_queue
  blk-mq: properly drain stopped queues
  block: remove WQ_POWER_EFFICIENT from kblockd
  null_blk: fix name and description of 'queue_mode' module parameter
  block: remove elv_abort_queue and blk_abort_flushes
| * | null_blk: fix softirq completions for queue_mode == 1 [Jens Axboe, 2014-06-16, 1 file, -1/+4]
Only blk-mq completions have the payload attached to the request; for request_fn mode we have stored it in req->special. This fixes an oops with queue_mode=1 and softirq completions.

Signed-off-by: Jens Axboe <axboe@fb.com>
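A sketch of where the per-command payload lives in each mode; the names follow null_blk's queue-mode convention but are illustrative:

  /* null_blk: NULL_Q_BIO = 0, NULL_Q_RQ = 1, NULL_Q_MQ = 2 */
  static struct nullb_cmd *sketch_cmd_from_rq(struct request *rq,
                                              int queue_mode)
  {
      if (queue_mode == NULL_Q_MQ)
          return blk_mq_rq_to_pdu(rq);   /* payload embedded in the rq */

      /* request_fn mode: the command was stashed here at dispatch time,
       * so the softirq completion must read it back from rq->special. */
      return rq->special;
  }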
| * | null_blk: fix name and description of 'queue_mode' module parameter [Mike Snitzer, 2014-06-11, 1 file, -1/+1]
'use_mq' is not the name of the module parameter, 'queue_mode' is.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
* | | Merge git://git.infradead.org/users/willy/linux-nvme [Linus Torvalds, 2014-06-15, 2 files, -90/+149]
Pull NVMe update from Matthew Wilcox:
"Mostly bugfixes again for the NVMe driver. I'd like to call out the exported tracepoint in the block layer; I believe Keith has cleared this with Jens. We've had a few reports from people who're really pounding on NVMe devices at scale, hence the timeout changes (and new module parameters), hotplug cpu deadlock, tracepoints, and minor performance tweaks"

[ Jens hadn't seen that tracepoint thing, but is ok with it - it will end up going away when the mq conversion happens ]

* git://git.infradead.org/users/willy/linux-nvme: (22 commits)
  NVMe: Fix START_STOP_UNIT Scsi->NVMe translation.
  NVMe: Use Log Page constants in SCSI emulation
  NVMe: Define Log Page constants
  NVMe: Fix hot cpu notification dead lock
  NVMe: Rename io_timeout to nvme_io_timeout
  NVMe: Use last bytes of f/w rev SCSI Inquiry
  NVMe: Adhere to request queue block accounting enable/disable
  NVMe: Fix nvme get/put queue semantics
  NVMe: Delete NVME_GET_FEAT_TEMP_THRESH
  NVMe: Make admin timeout a module parameter
  NVMe: Make iod bio timeout a parameter
  NVMe: Prevent possible NULL pointer dereference
  NVMe: Fix the buffer size passed in GetLogPage(CDW10.NUMD)
  NVMe: Update data structures for NVMe 1.2
  NVMe: Enable BUILD_BUG_ON checks
  NVMe: Update namespace and controller identify structures to the 1.1a spec
  NVMe: Flush with data support
  NVMe: Configure support for block flush
  NVMe: Add tracepoints
  NVMe: Protect against badly formatted CQEs
  ...
| * | | NVMe: Fix START_STOP_UNIT Scsi->NVMe translation. [Dan McLeran, 2014-06-13, 1 file, -7/+6]
This patch contains several fixes for SCSI START_STOP_UNIT. The previous code did not account for signed vs. unsigned arithmetic, which resulted in an invalid lowest power state calculation when the device only supports one power state.

The code for Power Condition == 2 (Idle) was not following the spec. The spec calls for setting the device to specific power states, depending upon the Power Condition Modifier, without accounting for the number of power states supported by the device.

The code for Power Condition == 3 (Standby) was using a hard-coded '0', which is replaced with the macro POWER_STATE_0.

Signed-off-by: Dan McLeran <daniel.mcleran@intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | NVMe: Use Log Page constants in SCSI emulation [Matthew Wilcox, 2014-06-13, 1 file, -3/+2]
The nvme-scsi file defined its own Log Page constant. Use the newly defined one from the header file instead.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | NVMe: Fix hot cpu notification dead lock [Keith Busch, 2014-06-13, 1 file, -10/+25]
There is a potential deadlock if a cpu event occurs during nvme probe, since the driver registered for hot cpu notification there. This fixes the race by having the module register for notification once, rather than having each device register during probe. The actual work is done in a scheduled work queue instead of in the notifier, since assigning IO queues has the potential to block if the driver creates additional queues.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | NVMe: Rename io_timeout to nvme_io_timeout [Matthew Wilcox, 2014-06-03, 1 file, -2/+2]
It's positively immoral to have a global variable called 'io_timeout'. Keep the module parameter called io_timeout, though.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | NVMe: Use last bytes of f/w rev SCSI Inquiry [Keith Busch, 2014-06-03, 1 file, -1/+6]
After skipping right-padded spaces, use the last four bytes of the firmware revision when reporting the Inquiry Product Revision. These are generally more indicative of what is running.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Acked-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | NVMe: Adhere to request queue block accounting enable/disable [Sam Bradshaw, 2014-06-03, 1 file, -14/+19]
Recently, a new sysfs control "iostats" was added to selectively enable or disable io statistics collection for request queues. This patch hooks that control.

IO statistics collection is rather expensive on large, multi-node machines with drives pushing millions of iops. Having the ability to disable collection if not needed can improve throughput significantly.

As a data point, on a quad E5-4640, I see more than 50% throughput improvement when io statistics accounting is disabled during heavily multi-threaded small block random read benchmarks where device performance is in the million iops+ range.

Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | NVMe: Fix nvme get/put queue semantics [Keith Busch, 2014-06-03, 1 file, -8/+21]
The routines to get and lock nvme queues required the caller to "put" or "unlock" them even if getting one returned NULL. This patch fixes that.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | NVMe: Delete NVME_GET_FEAT_TEMP_THRESH [Matthew Wilcox, 2014-06-03, 1 file, -1/+0]
This define isn't used, and any code that wanted to use it should use NVME_FEAT_TEMP_THRESH instead.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | NVMe: Make admin timeout a module parameter [Keith Busch, 2014-06-03, 1 file, -3/+7]
Signed-off-by: Keith Busch <keith.busch@intel.com>
[made admin_timeout static]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
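The usual shape of such a parameter, as a sketch (the exact name, type, and default in the patch may differ):

  static unsigned char admin_timeout = 60;
  module_param(admin_timeout, byte, 0644);
  MODULE_PARM_DESC(admin_timeout, "timeout in seconds for admin commands");

It can then be overridden at load time, e.g. `modprobe nvme admin_timeout=120`, or inspected and changed via /sys/module/nvme/parameters/.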
| * | | NVMe: Make iod bio timeout a parameter [Keith Busch, 2014-06-03, 1 file, -1/+5]
This was originally set to 4 times the IO timeout, but that was when the IO timeout was 5 seconds instead of 30. 20 seconds for total time to failure seemed more reasonable than 2 minutes for most, but other users have requested to make this a module parameter instead.

Signed-off-by: Keith Busch <keith.busch@intel.com>
[renamed the module parameter to retry_time]
[made retry_time static]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | NVMe: Prevent possible NULL pointer dereference [Santosh Y, 2014-06-03, 1 file, -1/+4]
kmalloc(), used by nvme_alloc_iod() to allocate memory for 'iod', can fail. So check the return value.

Signed-off-by: Santosh Y <santosh.sy@samsung.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | NVMe: Fix the buffer size passed in GetLogPage(CDW10.NUMD) [Indraneel Mukherjee, 2014-06-03, 1 file, -4/+4]
In GetLogPage the buffer size passed to the device is a 0's based value.

Signed-off-by: Indraneel M <indraneel.m@samsung.com>
Reported-by: Shiro Itou <shiro.itou@outlook.com>
Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
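In other words, NUMD counts dwords minus one; a sketch of the conversion, with an illustrative helper name:

  #include <linux/types.h>

  /* CDW10.NUMD is a 0's based dword count: a 16-byte buffer is NUMD=3. */
  static u32 nvme_log_page_numd(size_t buf_bytes)
  {
      return (buf_bytes / 4) - 1;
  }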
| * | | NVMe: Enable BUILD_BUG_ON checks [Matthew Wilcox, 2014-05-09, 1 file, -0/+1]
Since _nvme_check_size() wasn't being called from anywhere, the compiler was optimising it away ... along with all the link-time build failures that would result if any of the structures were the wrong size. Call it from nvme_exit() for no particular reason.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | NVMe: Flush with data support [Keith Busch, 2014-05-05, 1 file, -20/+24]
It is possible a filesystem may send a flush-flagged bio with write data. There is no such composite NVMe command, so the driver sends the flush and the write separately. The device is allowed to execute these commands in any order, so it was possible for the driver to end the bio after the write completed but while the flush was still active. We don't want to let a filesystem believe a flush succeeded before it really has; this could cause data corruption on a power loss between these events. To fix, this patch splits the flush and write into chained bios.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | NVMe: Configure support for block flush [Keith Busch, 2014-05-05, 1 file, -0/+3]
This configures an nvme request_queue as flush capable if the device has a volatile write cache present.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | NVMe: Add tracepoints [Keith Busch, 2014-05-05, 1 file, -4/+9]
Add tracepoints for bio_complete and block_split into nvme to help with gathering IO info using blktrace and blkparse.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | NVMe: Protect against badly formatted CQEs [Keith Busch, 2014-05-05, 1 file, -2/+3]
If a misbehaving device posts a CQE with a command id < depth but for one that was never allocated, the command info will have a callback function set to NULL and we don't want to try invoking that.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
| * | | NVMe: Improve error messages [Matthew Wilcox, 2014-05-05, 1 file, -3/+10]
Help people diagnose what is going wrong at initialisation time by printing out which command has gone wrong and what the device returned. Also fix the error message printed while waiting for reset.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
| * | | NVMe: Update copyright headers [Matthew Wilcox, 2014-05-05, 2 files, -9/+1]
Make the copyright dates accurate and remove the final paragraph that includes the address of the FSF.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
* | | | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client [Linus Torvalds, 2014-06-12, 1 file, -45/+197]
Pull Ceph updates from Sage Weil:
"This has a mix of bug fixes and cleanups.

Alex's patch fixes a rare race in RBD. Ilya's patches fix an ENOENT check when a second rbd image is mapped and a couple memory leaks. Zheng fixes several issues with fragmented directories and multiple MDSs. Josh fixes a spin/sleep issue, and Josh and Guangliang's patches fix setting and unsetting RBD images read-only.

Naturally there are several other cleanups mixed in for good measure"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
  rbd: only set disk to read-only once
  rbd: move calls that may sleep out of spin lock range
  rbd: add ioctl for rbd
  ceph: use truncate_pagecache() instead of truncate_inode_pages()
  ceph: include time stamp in every MDS request
  rbd: fix ida/idr memory leak
  rbd: use reference counts for image requests
  rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()
  rbd: make sure we have latest osdmap on 'rbd map'
  libceph: add ceph_monc_wait_osdmap()
  libceph: mon_get_version request infrastructure
  libceph: recognize poolop requests in debugfs
  ceph: refactor readpage_nounlock() to make the logic clearer
  mds: check cap ID when handling cap export message
  ceph: remember subtree root dirfrag's auth MDS
  ceph: introduce ceph_fill_fragtree()
  ceph: handle cap import atomically
  ceph: pre-allocate ceph_cap struct for ceph_add_cap()
  ceph: update inode fields according to issued caps
  rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
  ...
| * | | rbd: only set disk to read-only once [Josh Durgin, 2014-06-10, 1 file, -1/+1]
rbd_open(), called every time the device is opened, calls set_device_ro(). There's no reason to set the device read-only or read-write every time it is opened. Just do this once during device setup, using set_disk_ro() instead, because the struct block_device isn't available to us there.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
| * | | rbd: move calls that may sleep out of spin lock range [Josh Durgin, 2014-06-10, 1 file, -11/+18]
get_user() and set_disk_ro() may allocate memory, leading to a potential deadlock if they are called while a spin lock is held. Move the acquisition and release of rbd_dev->lock from rbd_ioctl() into rbd_ioctl_set_ro(), so it can occur between get_user() and set_disk_ro().

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
| * | | rbd: add ioctl for rbd [Guangliang Zhao, 2014-06-10, 1 file, -2/+60]
When running the following commands:

  [root@ceph0 mnt]# blockdev --setro /dev/rbd1
  [root@ceph0 mnt]# blockdev --getro /dev/rbd1
  0

the setro did not take effect, because rbd doesn't implement the block driver's ioctl interface.

This resolves: http://tracker.ceph.com/issues/6265

Signed-off-by: Guangliang Zhao <guangliang@unitedstack.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
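A sketch of the minimal BLKROSET handling this adds, simplified from the patch; see "rbd: move calls that may sleep out of spin lock range" above for the locking refinement inside rbd_ioctl_set_ro():

  #include <linux/fs.h>       /* BLKROSET */
  #include <linux/blkdev.h>

  static int rbd_ioctl(struct block_device *bdev, fmode_t mode,
                       unsigned int cmd, unsigned long arg)
  {
      struct rbd_device *rbd_dev = bdev->bd_disk->private_data;
      int val;

      switch (cmd) {
      case BLKROSET:
          if (get_user(val, (int __user *)arg))
              return -EFAULT;
          return rbd_ioctl_set_ro(rbd_dev, val);  /* helper from the patch */
      default:
          return -ENOTTY;     /* everything else stays unsupported */
      }
  }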
| * | | rbd: fix ida/idr memory leak [Ilya Dryomov, 2014-06-06, 1 file, -0/+1]
ida_destroy() needs to be called on module exit to release ida caches.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
| * | | rbd: use reference counts for image requests [Alex Elder, 2014-06-06, 1 file, -0/+9]
Each image request contains a reference count, but to date it has not actually been used. (I think this was just an oversight.) A recent report involving rbd failing an assertion shed light on why and where we need to use these reference counts.

Every OSD request associated with an object request uses rbd_osd_req_callback() as its callback function. That function will call a helper function (dependent on the type of OSD request) that will set the object request's "done" flag if appropriate. If that "done" flag is set, the object request is passed to rbd_obj_request_complete().

In rbd_obj_request_complete(), requests are processed in sequential order. So if an object request completes before one of its predecessors in the image request, the completion is deferred. Otherwise, if it's a completing object's "turn" to be completed, it is passed to rbd_img_obj_end_request(), which records the result of the operation, accumulates transferred bytes, and so on. Next, the successor to this request is checked, and if it is marked "done", (deferred) completion processing is performed on that request, and so on. If the last object request in an image request is completed, rbd_img_request_complete() is called, which (typically) destroys the image request.

There is a race here, however. The instant an object request is marked "done" it can be provided (by a thread handling completion of one of its predecessor operations) to rbd_img_obj_end_request(), which (for the last request) can then lead to the image request getting torn down. And this can happen *before* that object has itself entered rbd_img_obj_end_request(). As a result, once it *does* enter that function, the image request (and even the object request itself) may have been freed and become invalid.

All that's necessary to avoid this is to properly count references to the image requests. We tear down an image request's object requests all at once -- only when the entire image request has completed. So there's no need for an image request to count references for its object requests. However, we don't want an image request to go away until the last of its object requests has passed through rbd_img_obj_callback(). In other words, we don't want rbd_img_request_complete() to necessarily result in the image request being destroyed, because it may get called before we've finished processing all of its object requests.

So the fix is to add a reference to an image request for each of its object requests. The reference can be viewed as representing an object request that has not yet finished its call to rbd_img_obj_callback(). That is emphasized by getting the reference right after assigning that as the image object's callback function. The corresponding release of that reference is done at the end of rbd_img_obj_callback(), which every image object request passes through exactly once.

Cc: stable@vger.kernel.org
Signed-off-by: Alex Elder <elder@linaro.org>
Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>
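A sketch of the get/put pairing described above; struct rbd_img_request already embeds a kref, so the helper is a thin wrapper (simplified names, error paths omitted):

  #include <linux/kref.h>

  static void rbd_img_request_get(struct rbd_img_request *img_request)
  {
      kref_get(&img_request->kref);
  }

  /* At setup, right after the callback is assigned:
   *     obj_request->callback = rbd_img_obj_callback;
   *     rbd_img_request_get(img_request);   // one ref per object request
   */

  static void rbd_img_obj_callback(struct rbd_obj_request *obj_request)
  {
      struct rbd_img_request *img_request = obj_request->img_request;

      /* ... ordered completion processing, rbd_img_obj_end_request() ... */

      /*
       * Drop the per-object reference last: the image request cannot be
       * torn down while any object request is still inside this function.
       */
      rbd_img_request_put(img_request);
  }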
| * | | rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync() [Ilya Dryomov, 2014-06-06, 1 file, -38/+85]
osd_request, along with the r_request and r_reply messages attached to it, is leaked in __rbd_dev_header_watch_sync() if the requested image doesn't exist. This is because lingering requests are special and get an extra ref in the reply path. Fix it by unregistering the linger request on the error path, and split __rbd_dev_header_watch_sync() into two functions to make it maintainable.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
| * | | rbd: make sure we have latest osdmap on 'rbd map' [Ilya Dryomov, 2014-06-06, 1 file, -3/+33]
Given an existing idle mapping (img1), mapping an image (img2) in a newly created pool (pool2) fails:

  $ ceph osd pool create pool1 8 8
  $ rbd create --size 1000 pool1/img1
  $ sudo rbd map pool1/img1
  $ ceph osd pool create pool2 8 8
  $ rbd create --size 1000 pool2/img2
  $ sudo rbd map pool2/img2
  rbd: sysfs write failed
  rbd: map failed: (2) No such file or directory

This is because client instances are shared by default and we don't request an osdmap update when bumping a ref on an existing client. The fix is to use the mon_get_version request to see if the osdmap we have is the latest, and block until the requested update is received if it's not.

Fixes: http://tracker.ceph.com/issues/8184

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
| * | | rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO [Duan Jiong, 2014-06-06, 1 file, -1/+1]
This patch fixes a coccinelle warning about using IS_ERR and PTR_ERR where PTR_ERR_OR_ZERO suffices.

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
* | | | Merge branch 'for-linus' of git://git.kernel.dk/linux-block [Linus Torvalds, 2014-06-11, 3 files, -11/+14]
Pull block layer fixes from Jens Axboe:
"Final small batch of fixes to be included before -rc1. Some general cleanups in here as well, but some of the blk-mq fixes we need for the NVMe conversion and/or scsi-mq.

The pull request contains:

 - Support for not merging across a specified "chunk size", if set by the driver. Some NVMe devices perform poorly for IO that crosses such a chunk, so we need to support it generically as part of request merging, to avoid having to do complicated split logic. From me.

 - Bump max tag depth to 10Ki tags. Some scsi devices have a huge shared tag space. Before we failed with EINVAL if a too large tag depth was specified, now we truncate it and pass back the actual value. From me.

 - Various blk-mq rq init fixes from me and others.

 - A fix for enter on a dying queue for blk-mq from Keith. This is needed to prevent oopsing on hot device removal.

 - Fixup for blk-mq timer addition from Ming Lei.

 - Small round of performance fixes for mtip32xx from Sam Bradshaw.

 - Minor stack leak fix from Rickard Strandqvist.

 - Two __init annotations from Fabian Frederick"

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: add __init to blkcg_policy_register
  block: add __init to elv_register
  block: ensure that bio_add_page() always accepts a page for an empty bio
  blk-mq: add timer in blk_mq_start_request
  blk-mq: always initialize request->start_time
  block: blk-exec.c: Cleaning up local variable address returnd
  mtip32xx: minor performance enhancements
  blk-mq: ->timeout should be cleared in blk_mq_rq_ctx_init()
  blk-mq: don't allow queue entering for a dying queue
  blk-mq: bump max tag depth to 10K tags
  block: add blk_rq_set_block_pc()
  block: add notion of a chunk size for request merging
| * | | | mtip32xx: minor performance enhancements [Sam Bradshaw, 2014-06-06, 2 files, -10/+13]
This patch adds the following:

1) Compiler hinting in the fast path.
2) A prefetch of port->flags to eliminate moderate cpu stalling later in mtip_hw_submit_io().
3) Elimination of a redundant rq_data_dir().
4) Reordering of the members of driver_data to eliminate false cacheline sharing between irq_workers_active and unal_qdepth.

With some workload and topology configurations, I'm seeing ~1.5% throughput improvement in small block random read benchmarks, as well as improved latency std. dev.

Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>

Add include of <linux/prefetch.h>

Signed-off-by: Jens Axboe <axboe@fb.com>
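Items 1) and 2) amount to something like the following sketch; the flag name is hypothetical:

  #include <linux/prefetch.h>

  static int mtip_submit_sketch(struct mtip_port *port)
  {
      /* Pull port->flags into cache now; it is tested further down
       * the submit path, avoiding a stall there. */
      prefetch(&port->flags);

      /* likely(): tell the compiler the unblocked path is the hot one. */
      if (likely(!test_bit(MTIP_PF_PAUSE_IO_BIT, &port->flags)))
          return 0;       /* common case: issue the command */

      return -EBUSY;      /* rare case: IO is paused */
  }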
| * | | | block: add blk_rq_set_block_pc() [Jens Axboe, 2014-06-06, 1 file, -1/+1]
With the optimizations around not clearing the full request at alloc time, we are leaving some of the needed init for REQ_TYPE_BLOCK_PC up to the user allocating the request. Add a blk_rq_set_block_pc() that sets the command type to REQ_TYPE_BLOCK_PC and properly initializes the members associated with this type of request. Update callers to use this function instead of manipulating rq->cmd_type directly.

Includes fixes from Christoph Hellwig <hch@lst.de> for my half-assed attempt.

Signed-off-by: Jens Axboe <axboe@fb.com>
* | | | | nbd: zero from and len fields in NBD_CMD_DISCONNECT. [Hani Benhabiles, 2014-06-06, 1 file, -5/+2]
The len field is already set to zero, but not the from field, which is sent as 0xfffffffffffffe00. This makes no sense and may confuse server implementations doing sanity checks (qemu-nbd is an example).

Signed-off-by: Hani Benhabiles <hani@linux.com>
Cc: Paul Clements <paul.clements@us.sios.com>
Cc: Paul Clements <Paul.Clements@steeleye.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
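The shape of the fix, as a sketch against the nbd_request wire structure (helper name illustrative):

  #include <linux/nbd.h>
  #include <linux/string.h>
  #include <asm/byteorder.h>

  static void nbd_fill_disconnect(struct nbd_request *request)
  {
      /* Zero the whole struct so both .from and .len go out as 0,
       * instead of leaking whatever .from held from a prior command. */
      memset(request, 0, sizeof(*request));
      request->magic = htonl(NBD_REQUEST_MAGIC);
      request->type = htonl(NBD_CMD_DISC);
  }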
* | | | Merge branch 'akpm' (patchbomb from Andrew) into next [Linus Torvalds, 2014-06-04, 2 files, -5/+15]
Merge misc updates from Andrew Morton:

 - a few fixes for 3.16. Cc'ed to stable so they'll get there somehow.
 - various misc fixes and cleanups
 - most of the ocfs2 queue. Review is slow...
 - most of MM. The MM queue is pretty huge this time, but not much in the way of feature work.
 - some tweaks under kernel/
 - printk maintenance work
 - updates to lib/
 - checkpatch updates
 - tweaks to init/

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (276 commits)
  fs/autofs4/dev-ioctl.c: add __init to autofs_dev_ioctl_init
  fs/ncpfs/getopt.c: replace simple_strtoul by kstrtoul
  init/main.c: remove an ifdef
  kthreads: kill CLONE_KERNEL, change kernel_thread(kernel_init) to avoid CLONE_SIGHAND
  init/main.c: add initcall_blacklist kernel parameter
  init/main.c: don't use pr_debug()
  fs/binfmt_flat.c: make old_reloc() static
  fs/binfmt_elf.c: fix bool assignements
  fs/efs: convert printk(KERN_DEBUG to pr_debug
  fs/efs: add pr_fmt / use __func__
  fs/efs: convert printk to pr_foo()
  scripts/checkpatch.pl: device_initcall is not the only __initcall substitute
  checkpatch: check stable email address
  checkpatch: warn on unnecessary void function return statements
  checkpatch: prefer kstrto<foo> to sscanf(buf, "%<lhuidx>", &bar);
  checkpatch: add warning for kmalloc/kzalloc with multiply
  checkpatch: warn on #defines ending in semicolon
  checkpatch: make --strict a default for files in drivers/net and net/
  checkpatch: always warn on missing blank line after variable declaration block
  checkpatch: fix wildcard DT compatible string checking
  ...
| * | | | zram: correct offset usage in zram_bio_discard [Weijie Yang, 2014-06-04, 1 file, -2/+2]
We want to skip the physical block (PAGE_SIZE) which is only partially covered by the discard bio, so we check the remaining size and subtract it if there is a need to move on to the next physical block.

The current offset usage in zram_bio_discard is incorrect; it will break the filesystem above it. Consider the following scenario: on some architecture or config, PAGE_SIZE is 64K for example, and a filesystem is set up on the zram disk without PAGE_SIZE alignment. A discard bio arrives with offset = 4K and size = 72K. Normally, it should not discard any physical block, as it only partially covers two physical blocks. However, with the current offset usage, it discards the second physical block and frees its memory, which breaks the filesystem.

This patch corrects the offset usage in zram_bio_discard.

Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
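A sketch of the corrected loop (locking omitted; the real code takes the meta table lock around the free):

  static void zram_bio_discard_sketch(struct zram *zram, u32 index,
                                      int offset, struct bio *bio)
  {
      size_t n = bio->bi_iter.bi_size;

      /*
       * A misaligned head only partially covers the first physical
       * page: skip past it without discarding anything. In the 64K
       * example above, this consumes the 4K..64K remainder of page 0,
       * leaving only 12K, so page 1 is (correctly) not freed.
       */
      if (offset) {
          if (n <= (PAGE_SIZE - offset))
              return;
          n -= (PAGE_SIZE - offset);
          index++;
      }

      /* Discard only pages the bio covers completely. */
      while (n >= PAGE_SIZE) {
          zram_free_page(zram, index);
          index++;
          n -= PAGE_SIZE;
      }
  }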