path: root/drivers/md/bcache
Commit message | Author | Age | Files | Lines
* bcache: remove driver private bio splitting code | Kent Overstreet | 2015-08-13 | 7 | -162/+18

    The bcache driver has always accepted arbitrarily large bios and split
    them internally. Now that every driver must accept arbitrarily large
    bios, this code isn't necessary anymore.

    Cc: linux-bcache@vger.kernel.org
    Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
    [dpark: add more description in commit message]
    Signed-off-by: Dongsu Park <dpark@posteo.net>
    Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
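    A minimal sketch of what this buys a driver, assuming the
    blk_queue_split() interface added earlier in this same series (the
    example_* names are illustrative, not bcache code):

        #include <linux/blkdev.h>
        #include <linux/bio.h>

        static void example_make_request(struct request_queue *q, struct bio *bio)
        {
                /*
                 * The block core splits any oversized bio against the
                 * queue limits; no driver-private splitting needed.
                 */
                blk_queue_split(q, &bio, q->bio_split);

                /* bio now fits the queue limits; submit it normally. */
                example_submit_bio(q, bio);     /* hypothetical driver path */
        }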
* block: add a bi_error field to struct bio | Christoph Hellwig | 2015-07-29 | 8 | -43/+44

    Currently we have two different ways to signal an I/O error on a BIO:

    (1) by clearing the BIO_UPTODATE flag
    (2) by returning a Linux errno value to the bi_end_io callback

    The first one has the drawback of only communicating a single possible
    error (-EIO), and the second one has the drawback of not being
    persistent when bios are queued up, and of not being passed along from
    child to parent bio in the ever more popular chaining scenario. Having
    both mechanisms available has the additional drawback of utterly
    confusing driver authors and introducing bugs where various I/O
    submitters only deal with one of them, while the others have to add
    boilerplate code to deal with both kinds of error returns.

    So add a new bi_error field to store an errno value directly in struct
    bio and remove the existing mechanisms to clean all this up.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: NeilBrown <neilb@suse.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
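    A before/after sketch of a completion callback under this change (the
    old prototype and flag are from pre-4.3 kernels; the pr_err() message
    text is illustrative):

        #include <linux/bio.h>
        #include <linux/printk.h>

        /* before: errno passed as an argument, plus the BIO_UPTODATE flag */
        static void old_end_io(struct bio *bio, int error)
        {
                if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
                        error = -EIO;
                bio_put(bio);
        }

        /* after: the errno travels in the bio itself */
        static void new_end_io(struct bio *bio)
        {
                if (bio->bi_error)
                        pr_err("I/O failed: %d\n", bio->bi_error);
                bio_put(bio);
        }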
* block: have drivers use blk_queue_max_discard_sectors() | Jens Axboe | 2015-07-17 | 1 | -1/+1

    Some drivers use it now, others just set the limits field manually. But
    in preparation for splitting this into a hard and soft limit, ensure
    that they all call the proper function for setting the hw limit for
    discards.

    Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
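    A minimal sketch of the two styles (UINT_MAX is just a placeholder
    limit for illustration):

        #include <linux/blkdev.h>

        static void example_setup_discard(struct request_queue *q)
        {
                /* old style: poke the limits struct directly */
                /* q->limits.max_discard_sectors = UINT_MAX; */

                /* preferred: use the accessor, so the core can later
                 * distinguish hardware and soft limits */
                blk_queue_max_discard_sectors(q, UINT_MAX);
        }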
* bcache: don't embed 'return' statements in closure macros | Jens Axboe | 2015-07-11 | 4 | -6/+14

    This is horribly confusing: it breaks the flow of the code without it
    being apparent in the caller.

    Signed-off-by: Jens Axboe <axboe@fb.com>
    Acked-by: Christoph Hellwig <hch@lst.de>
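    A simplified sketch of the problem (macro body abridged from bcache's
    closure.h; the call site is illustrative):

        /* before: the macro returns from the *caller* -- invisible at
         * the call site */
        #define continue_at(_cl, _fn, _wq)                              \
        do {                                                            \
                set_closure_fn(_cl, _fn, _wq);                          \
                closure_sub(_cl, CLOSURE_RUNNING + 1);                  \
                return;                                                 \
        } while (0)

        /* after: the return is written explicitly where it happens */
        static void example_read(struct closure *cl)
        {
                continue_at(cl, example_read_done, example_wq);
                return;
        }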
* MAINTAINERS: BCACHE: Kent Overstreet has changed email address | Joe Perches | 2015-06-30 | 1 | -1/+1

    Kent's email address in MAINTAINERS seems to be invalid. This was his
    last sign-off address, so use that if appropriate. Fix the S: status
    entry while there.

    Signed-off-by: Joe Perches <joe@perches.com>
    Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* bcache: use kvfree() in various places | Pekka Enberg | 2015-06-30 | 2 | -16/+4

    Use kvfree() instead of open-coding it.

    Signed-off-by: Pekka Enberg <penberg@kernel.org>
    Cc: Kent Overstreet <kmo@daterainc.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
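    For reference, a sketch of the open-coded pattern this replaces (the
    wrapper function is illustrative):

        #include <linux/mm.h>
        #include <linux/slab.h>
        #include <linux/vmalloc.h>

        static void example_free(void *p)
        {
                /* before:
                 *      if (is_vmalloc_addr(p))
                 *              vfree(p);
                 *      else
                 *              kfree(p);
                 */
                kvfree(p);      /* handles both allocation types */
        }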
* writeback: separate out include/linux/backing-dev-defs.h | Tejun Heo | 2015-06-02 | 1 | -0/+1

    With the planned cgroup writeback support, backing-dev related
    declarations will be more widely used across block and cgroup;
    unfortunately, including backing-dev.h from include/linux/blkdev.h
    makes cyclic include dependency quite likely.

    This patch separates out backing-dev-defs.h which only has the
    essential definitions and updates blkdev.h to include it. c files which
    need access to more backing-dev details now include backing-dev.h
    directly. This takes backing-dev.h off the common include dependency
    chain making it a lot easier to use it across block and cgroup.

    v2: fs/fat build failure fixed.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Cc: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Jens Axboe <axboe@fb.com>
* block: remove management of bi_remaining when restoring original bi_end_io | Mike Snitzer | 2015-05-22 | 1 | -1/+1

    Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
    non-chains") regressed all existing callers that followed this pattern:

    1) saving a bio's original bi_end_io
    2) wiring up an intermediate bi_end_io
    3) restoring the original bi_end_io from intermediate bi_end_io
    4) calling bio_endio() to execute the restored original bi_end_io

    The regression was due to BIO_CHAIN only ever getting set if
    bio_inc_remaining() is called. For the above pattern it isn't set until
    step 3 above (step 2 would've needed to establish BIO_CHAIN). As such
    the first bio_endio(), in step 2 above, never decremented __bi_remaining
    before calling the intermediate bi_end_io -- leaving __bi_remaining with
    the value 1 instead of 0. When bio_inc_remaining() occurred during step
    3 it brought it to a value of 2. When the second bio_endio() was called,
    in step 4 above, it should've called the original bi_end_io but it
    didn't because there was an extra reference that wasn't dropped (due to
    atomic operations being optimized away since BIO_CHAIN wasn't set
    upfront).

    Fix this issue by removing the __bi_remaining management complexity for
    all callers that use the above pattern -- bio_chain() is the only
    interface that _needs_ to be concerned with __bi_remaining. For the
    above pattern callers just expect the bi_end_io they set to get called!
    Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
    that aren't associated with the bio_chain() interface. Also, the
    bio_inc_remaining() interface has been moved local to bio.c.

    Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
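    A sketch of the four-step save/wire/restore pattern the commit
    describes (struct and function names are illustrative; bio_endio()
    still took an errno argument in this kernel):

        #include <linux/bio.h>
        #include <linux/slab.h>

        struct saved_bio_state {
                bio_end_io_t    *orig_end_io;
                void            *orig_private;
        };

        static void intermediate_end_io(struct bio *bio, int error)
        {
                struct saved_bio_state *s = bio->bi_private;

                /* step 3: restore the original completion */
                bio->bi_end_io = s->orig_end_io;
                bio->bi_private = s->orig_private;
                kfree(s);

                /* step 4: run the restored original bi_end_io */
                bio_endio(bio, error);
        }

        static void hook_bio(struct bio *bio, struct saved_bio_state *s)
        {
                /* steps 1 and 2: save the original, wire the intermediate */
                s->orig_end_io = bio->bi_end_io;
                s->orig_private = bio->bi_private;
                bio->bi_end_io = intermediate_end_io;
                bio->bi_private = s;
        }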
* bio: skip atomic inc/dec of ->bi_cnt for most use cases | Jens Axboe | 2015-05-05 | 1 | -1/+1

    Struct bio has a reference count that controls when it can be freed.
    The most common use case is allocating the bio, which then returns with
    a single reference to it, doing IO, and then dropping that single
    reference. We can remove this atomic_dec_and_test() in the completion
    path if nobody else is holding a reference to the bio.

    If someone does call bio_get() on the bio, then we flag the bio as now
    having a valid count, and that we must properly honor the reference
    count when it's being put.

    Tested-by: Robert Elliott <elliott@hp.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
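    A rough sketch of the mechanism (flag and field names follow the
    commit, but the details here are simplified, not the exact upstream
    code; example_free_bio() is hypothetical):

        #include <linux/bio.h>

        static inline void example_bio_get(struct bio *bio)
        {
                /* mark the bio as genuinely reference counted */
                bio->bi_flags |= (1 << BIO_REFFED);
                smp_mb__before_atomic();
                atomic_inc(&bio->__bi_cnt);
        }

        static void example_bio_put(struct bio *bio)
        {
                if (!bio_flagged(bio, BIO_REFFED)) {
                        /* fast path: single-reference bio, no atomics */
                        example_free_bio(bio);
                } else if (atomic_dec_and_test(&bio->__bi_cnt)) {
                        example_free_bio(bio);
                }
        }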
* md/bcache: use generic io stats accounting functions to simplify io stat accounting | Gu Zheng | 2014-11-24 | 1 | -17/+6

    Use the generic io stats accounting helper functions
    (generic_{start,end}_io_acct) to simplify io stat accounting.

    Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
    Acked-by: Kent Overstreet <kmo@datera.io>
    Signed-off-by: Jens Axboe <axboe@fb.com>
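    A usage sketch of the helpers (signatures as of this 2014 commit; the
    surrounding driver context is illustrative):

        #include <linux/bio.h>
        #include <linux/genhd.h>

        static void example_account_io(struct gendisk *disk, struct bio *bio)
        {
                unsigned long start = jiffies;

                generic_start_io_acct(bio_data_dir(bio), bio_sectors(bio),
                                      &disk->part0);

                /* ... issue the bio and wait for completion ... */

                generic_end_io_acct(bio_data_dir(bio), &disk->part0, start);
        }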
* block: disable entropy contributions for nonrot devices | Mike Snitzer | 2014-10-04 | 1 | -0/+1

    Clear QUEUE_FLAG_ADD_RANDOM in all block drivers that set
    QUEUE_FLAG_NONROT.

    Historically, all block devices have automatically made entropy
    contributions. But as previously stated in commit e2e1a148 ("block: add
    sysfs knob for turning off disk entropy contributions"):

    - On SSD disks, the completion times aren't as random as they are for
      rotational drives. So it's questionable whether they should
      contribute to the random pool in the first place.

    - Calling add_disk_randomness() has a lot of overhead.

    There are more reliable sources for randomness than non-rotational
    block devices. From a security perspective it is better to err on the
    side of caution than to allow entropy contributions from unreliable
    "random" sources.

    Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
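    The resulting queue setup pattern, as a sketch:

        #include <linux/blkdev.h>

        static void example_setup_nonrot_queue(struct request_queue *q)
        {
                /* non-rotational device ... */
                queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
                /* ... so don't feed completion timings to the entropy pool */
                queue_flag_clear_unlocked(QUEUE_FLAG_ADD_RANDOM, q);
        }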
* bcache: Drop unneeded blk_sync_queue() calls | Kent Overstreet | 2014-08-04 | 1 | -8/+2

    blk_sync_queue() is needed for the queue/block device we created (and
    it's done by blk_cleanup_queue(), which we do call), but calling it for
    the block devices we only opened is pointless.

    Change-Id: I53dfded14ed15b9581d10ca8399d5e1b3abbf9f2
* bcache: add mutex lock for bch_is_open | Jianjian Huo | 2014-08-04 | 1 | -0/+2

    Since bch_is_open iterates the linked lists bch_cache_sets and
    uncached_devices, it needs to hold bch_register_lock.

    Signed-off-by: Jianjian Huo <samuel.huo@gmail.com>
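    A sketch of the locking this implies (bch_register_lock and the list
    names come from the commit message; the example_* helpers are
    assumptions):

        #include <linux/mutex.h>

        static bool example_is_open(struct block_device *bdev)
        {
                bool ret;

                mutex_lock(&bch_register_lock);
                /* safe to walk bch_cache_sets / uncached_devices now */
                ret = example_is_open_cache(bdev) ||
                      example_is_open_backing(bdev);
                mutex_unlock(&bch_register_lock);

                return ret;
        }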
* bcache: Correct printing of btree_gc_max_duration_ms | Surbhi Palande | 2014-08-04 | 1 | -2/+2

    time_stats::btree_gc_max_duration_ms is not bit shifted by 8.

    Fixes BUG #138
    Change-Id: I44fc6e1d0579674016acc533f1a546b080e5371a
    Signed-off-by: Surbhi Palande <sap@daterainc.com>
* bcache: try to set b->parent properly | Slava Pestov | 2014-08-04 | 3 | -20/+25

    bcache_flash_dev.ktest would reliably crash with 8k and 16k bucket
    sizes before; now it passes.

    Change-Id: Ib542232235e39298c3a7548fe52b645cabb823d1
* bcache: fix memory corruption in init error path | Slava Pestov | 2014-08-04 | 1 | -3/+8

    If register_cache_set() failed, we would touch ca->set after it had
    already been freed. Also, fix an assertion to catch this.

    Change-Id: I748e5f5b223e2d9b2602075dec2f997cced2394d
* bcache: fix crash with incomplete cache set | Slava Pestov | 2014-08-04 | 2 | -0/+8

    Change-Id: I6abde52afe917633480caaf4e2518f42a816d886
* bcache: Fix more early shutdown bugs | Kent Overstreet | 2014-08-04 | 1 | -3/+4

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: fix use-after-free in btree_gc_coalesce() | Slava Pestov | 2014-08-04 | 1 | -0/+1

    If we goto out_nocoalesce after we free new_nodes[0], we end up freeing
    new_nodes[0] again. This was generating a lockdep warning. The fix is
    to set new_nodes[0] to NULL, since the out_nocoalesce path safely
    ignores NULL entries in the new_nodes array.

    This regression was introduced in 2d7f9531.

    Change-Id: I76564d7257800583214376b4bacf236cda90c89c
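    The shape of the fix, as a fragment sketched from the commit message
    (context simplified, not the exact upstream code):

        /* where the node is freed before a possible goto out_nocoalesce: */
        btree_node_free(new_nodes[0]);
        new_nodes[0] = NULL;    /* prevents the double free below */

        out_nocoalesce:
        for (i = 0; i < nodes; i++)
                if (new_nodes[i])       /* NULL entries safely skipped */
                        btree_node_free(new_nodes[i]);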
* bcache: Fix an infinite loop in journal replay | Kent Overstreet | 2014-08-04 | 1 | -1/+4

    When running with multiple cache devices, if one of the devices has a
    completely empty journal but we'd already found some journal entries on
    a previous device, we'd go into an infinite loop.

    Change-Id: I1dcdc0d738192746de28f40e8b08825b0dea5e2b
    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: fix crash in bcache_btree_node_alloc_fail tracepoint | Slava Pestov | 2014-08-04 | 1 | -1/+1

    'b' was NULL.

    Change-Id: Icac0fd04afa2d23f213d96d51afd53374e6dd0c0
* bcache: bcache_write tracepoint was crashing | Slava Pestov | 2014-08-04 | 1 | -1/+2

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: fix typo in bch_bkey_equal_header | Slava Pestov | 2014-08-04 | 1 | -1/+1

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Allocate bounce buffers with GFP_NOWAIT | Kent Overstreet | 2014-08-04 | 2 | -2/+2

    There's no point in blocking on these allocations, since our fallback
    paths will probably go faster than blocking.

    Change-Id: I733ca202c25cb36bde02607a0a60552229a4241c
* bcache: Make sure to pass GFP_WAIT to mempool_alloc() | Kent Overstreet | 2014-08-04 | 1 | -1/+1

    This was very wrong - mempool_alloc() only guarantees success with
    GFP_WAIT. bcache uses GFP_NOWAIT in various other places where we have
    a fallback; circuits must've gotten crossed when writing this code or
    something.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
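    A sketch of the two patterns the last two commits distinguish (GFP_NOIO
    is used here as an example of a mask that includes __GFP_WAIT; the
    wrapper is illustrative):

        #include <linux/mempool.h>

        static void *example_alloc(mempool_t *pool, bool have_fallback)
        {
                if (have_fallback)
                        /* may return NULL; caller uses its fallback path */
                        return mempool_alloc(pool, GFP_NOWAIT);

                /* mempool_alloc() only guarantees success when it can
                 * wait, e.g. with GFP_NOIO (which includes __GFP_WAIT) */
                return mempool_alloc(pool, GFP_NOIO);
        }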
* bcache: fix uninterruptible sleep in writeback thread | Slava Pestov | 2014-08-04 | 3 | -5/+15

    There were two issues here:

    - writeback thread did not start until the device first became dirty
    - writeback thread used uninterruptible sleep once running

    Without this patch I see kernel warnings printed and a load average of
    1.52 after booting my test VM. With this patch the warnings are gone
    and the load average is near 0.00 as expected.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
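    The usual idiom for the second issue, sketched (the example_* helpers
    are hypothetical) -- a task sleeping in TASK_UNINTERRUPTIBLE counts
    toward load average, an interruptible sleep does not:

        #include <linux/kthread.h>
        #include <linux/sched.h>

        static int example_writeback_thread(void *arg)
        {
                while (!kthread_should_stop()) {
                        set_current_state(TASK_INTERRUPTIBLE);
                        if (!example_dirty_work_pending(arg)) {
                                schedule();     /* interruptible sleep */
                                continue;
                        }
                        __set_current_state(TASK_RUNNING);
                        example_do_writeback(arg);
                }
                return 0;
        }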
* bcache: wait for buckets when allocating new btree root | Slava Pestov | 2014-08-04 | 3 | -5/+12

    Tested:
    - sometimes bcache_tier test would hang on startup with a failure to
      allocate the btree root -- no longer seeing this

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: fix crash on shutdown in passthrough mode | Slava Pestov | 2014-08-04 | 1 | -1/+2

    We never started the writeback thread in this case, so don't stop it.
* bcache: fix lockdep warnings on shutdown | Slava Pestov | 2014-08-04 | 1 | -0/+4
* bcache allocator: send discards with correct size | Slava Pestov | 2014-08-04 | 1 | -1/+1
* bcache: Fix to remove the rcu_sched stalls. | Surbhi Palande | 2014-08-04 | 1 | -1/+2

    A while loop was executing infinitely; this fix ends the loop
    gracefully.

    Signed-off-by: Surbhi Palande <sap@daterainc.com>
    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Fix a journal replay bug | Kent Overstreet | 2014-08-04 | 3 | -11/+19

    Journal replay wasn't validating pointers with bch_extent_invalid()
    before dereferencing them; fixed.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Fix a bug when detaching | Kent Overstreet | 2014-08-04 | 1 | -3/+6

    After detaching a backing device from a cache set, a bit wasn't getting
    reset, meaning the second detach wouldn't work correctly.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* arch: Mass conversion of smp_mb__*() | Peter Zijlstra | 2014-04-18 | 2 | -2/+2

    Mostly scripted conversion of the smp_mb__* barriers.

    Signed-off-by: Peter Zijlstra <peterz@infradead.org>
    Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: linux-arch@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
* bcache: remove nested function usage | John Sheu | 2014-03-18 | 2 | -72/+76

    Uninlined nested functions can cause crashes when using ftrace, as they
    don't follow the normal calling convention and confuse the ftrace
    function graph tracer as it examines the stack.

    Also, nested functions are supported as a gcc extension, but may fail
    on other compilers (e.g. llvm).

    Signed-off-by: John Sheu <john.sheu@gmail.com>
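    For illustration, the kind of transformation involved (names are made
    up for this sketch, not necessarily the functions the patch touched):

        /* before: a gcc nested function, defined inside its caller */
        static void example_refill(struct example_dev *d)
        {
                bool dirty_pred(struct bkey *k)
                {
                        return KEY_DIRTY(k);
                }

                example_refill_keys(d, dirty_pred);
        }

        /* after: hoisted to an ordinary top-level static function */
        static bool example_dirty_pred(struct bkey *k)
        {
                return KEY_DIRTY(k);
        }

        static void example_refill(struct example_dev *d)
        {
                example_refill_keys(d, example_dirty_pred);
        }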
* bcache: Kill bucket->gc_gen | Kent Overstreet | 2014-03-18 | 4 | -11/+9

    gc_gen was a temporary used to recalculate last_gc, but since we only
    need bucket->last_gc when gc isn't running (gc_mark_valid = 1), we can
    just update last_gc directly.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Kill unused freelist | Kent Overstreet | 2014-03-18 | 5 | -125/+110

    This was originally added as an optimization that for various reasons
    isn't needed anymore, but it does add a lot of nasty corner cases (and
    it was responsible for some recently fixed bugs). Just get rid of it
    now.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Rework btree cache reserve handling | Kent Overstreet | 2014-03-18 | 6 | -139/+145

    This changes the bucket allocation reserves to use _real_ reserves -
    separate freelists - instead of watermarks, which if nothing else makes
    the current code saner to reason about and is going to be important in
    the future when we add support for multiple btrees.

    It also adds btree_check_reserve(), which checks (and locks) the
    reserves for both bucket allocation and memory allocation for btree
    nodes; the old code just kinda sorta assumed that since (e.g. for btree
    node splits) it had the root locked and that meant no other threads
    could try to make use of the same reserve. This technically should have
    been ok for memory allocation (we should always have a reserve for
    memory allocation, since the btree node cache is used as a reserve and
    we preallocate it), but multiple btrees will mean that locking the root
    won't be sufficient anymore, and for the bucket allocation reserve it
    was technically possible for the old code to deadlock.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Kill btree_io_wq | Kent Overstreet | 2014-03-18 | 3 | -24/+2

    With the locking rework in the last patch, this shouldn't be needed
    anymore - btree_node_write_work() only takes b->write_lock, which is
    never held for very long.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: btree locking rework | Kent Overstreet | 2014-03-18 | 4 | -52/+133

    Add a new lock, b->write_lock, which is required to actually modify -
    or write - a btree node; this lock is only held for short durations.

    This means we can write out a btree node without taking b->lock, which
    _is_ held for long durations - solving a deadlock when
    btree_flush_write() (from the journalling code) is called with a btree
    node locked. Right now this just occurs in bch_btree_set_root(), but
    with an upcoming journalling rework it's going to happen a lot more.

    This also means b->lock is now more of a read/intent lock instead of a
    read/write lock - but not completely, since it still blocks readers.
    May turn it into a real intent lock at some point in the future.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
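    A sketch of the rule this establishes (the modification and write-out
    details are placeholders, not the actual btree code):

        static void example_modify_node(struct btree *b)
        {
                /* short-duration lock: required to modify or write b */
                mutex_lock(&b->write_lock);

                example_apply_insert(b);        /* hypothetical edit */

                if (example_node_needs_write(b))
                        bch_btree_node_write(b, NULL);

                mutex_unlock(&b->write_lock);
                /* note: b->lock (long duration) was never taken here */
        }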
* bcache: Fix a race when freeing btree nodes | Kent Overstreet | 2014-03-18 | 1 | -33/+20

    This isn't a bulletproof fix; btree_node_free() -> bch_bucket_free()
    puts the bucket on the unused freelist, where it can be reused right
    away without any ordering requirements. It would be better to wait on
    at least a journal write to go down before reusing the bucket.
    bch_btree_set_root() does this, and inserting into non leaf nodes is
    completely synchronous so we should be ok, but future patches are just
    going to get rid of the unused freelist - it was needed in the past for
    various reasons but shouldn't be anymore.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Add a real GC_MARK_RECLAIMABLE | Kent Overstreet | 2014-03-18 | 4 | -14/+21

    This means the garbage collection code can better check for data and
    metadata pointers to the same buckets.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Add bch_keylist_init_single() | Kent Overstreet | 2014-03-18 | 2 | -4/+7

    This will potentially save us an allocation when we've got inode/dirent
    bkeys that don't fit in the keylist's inline keys.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
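    A plausible shape for such a helper, sketched from the name (field
    names assumed; not verified against the actual patch): it points the
    keylist at a single caller-owned key instead of allocating.

        static inline void bch_keylist_init_single(struct keylist *l,
                                                   struct bkey *k)
        {
                l->keys = k;
                l->top = bkey_next(k);
        }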
* bcache: Improve priority_stats | Kent Overstreet | 2014-03-18 | 1 | -6/+20

    Break down data into clean data/dirty data/metadata.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Better alloc tracepoints | Kent Overstreet | 2014-03-18 | 2 | -5/+12

    Change the invalidate tracepoint to indicate how much data we're
    invalidating, and change the alloc tracepoints to indicate what offset
    they're for.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Kill dead cgroup code | Kent Overstreet | 2014-03-18 | 5 | -202/+0

    This hasn't been used or even enabled in ages.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: stop moving_gc marking buckets that can't be moved. | Nicholas Swenson | 2014-03-18 | 1 | -1/+4

    Signed-off-by: Nicholas Swenson <nks@daterainc.com>
* bcache: Fix moving_pred() | Kent Overstreet | 2014-03-18 | 1 | -5/+3

    Avoid a potential NULL pointer deref (e.g. from the check keys inserted
    for cache misses).

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Fix moving_gc deadlocking with a foreground write | Nicholas Swenson | 2014-03-18 | 5 | -8/+16

    The deadlock happened because a foreground write slept, waiting for a
    bucket to be allocated. Normally the gc would mark buckets available
    for invalidation. But the moving_gc was stuck waiting for outstanding
    writes to complete. These writes used the bcache_wq, the same queue
    foreground writes used.

    This fix gives moving_gc its own work queue, so it can still finish
    moving even if foreground writes are stuck waiting for allocation. It
    also makes the work queue a parameter to the data_insert path, so
    moving_gc can use its workqueue for writes.

    Signed-off-by: Nicholas Swenson <nks@daterainc.com>
    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
* bcache: Fix discard granularity | Kent Overstreet | 2014-03-18 | 1 | -0/+1

    blk_stack_limits() doesn't like a discard granularity of 0.

    Signed-off-by: Kent Overstreet <kmo@daterainc.com>
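    The one-line shape of such a fix, sketched (the granularity value is an
    assumption for illustration; only "non-zero" matters to
    blk_stack_limits()):

        static void example_fix_discard_granularity(struct request_queue *q,
                                                    unsigned int block_bytes)
        {
                /* blk_stack_limits() warns on a discard granularity of 0 */
                q->limits.discard_granularity = block_bytes;
        }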