summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* block: blk_flush_integrity() for bio-based driversDan Williams2015-10-213-0/+16
| | | | | | | | | | | | | | | | | Since they lack requests to pin the request_queue active, synchronous bio-based drivers may have in-flight integrity work from bio_integrity_endio() that is not flushed by blk_freeze_queue(). Flush that work to prevent races to free the queue and the final usage of the blk_integrity profile. This is temporary unless/until bio-based drivers start to generically take a q_usage_counter reference while a bio is in-flight. Cc: Martin K. Petersen <martin.petersen@oracle.com> [martin: fix the CONFIG_BLK_DEV_INTEGRITY=n case] Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: move blk_integrity to request_queueDan Williams2015-10-214-10/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A trace like the following proceeds a crash in bio_integrity_process() when it goes to use an already freed blk_integrity profile. BUG: unable to handle kernel paging request at ffff8800d31b10d8 IP: [<ffff8800d31b10d8>] 0xffff8800d31b10d8 PGD 2f65067 PUD 21fffd067 PMD 80000000d30001e3 Oops: 0011 [#1] SMP Dumping ftrace buffer: --------------------------------- ndctl-2222 2.... 44526245us : disk_release: pmem1s systemd--2223 4.... 44573945us : bio_integrity_endio: pmem1s <...>-409 4.... 44574005us : bio_integrity_process: pmem1s --------------------------------- [..] Call Trace: [<ffffffff8144e0f9>] ? bio_integrity_process+0x159/0x2d0 [<ffffffff8144e4f6>] bio_integrity_verify_fn+0x36/0x60 [<ffffffff810bd2dc>] process_one_work+0x1cc/0x4e0 Given that a request_queue is pinned while i/o is in flight and that a gendisk is allowed to have a shorter lifetime, move blk_integrity to request_queue to satisfy requests arriving after the gendisk has been torn down. Cc: Christoph Hellwig <hch@lst.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> [martin: fix the CONFIG_BLK_DEV_INTEGRITY=n case] Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: generic request_queue reference countingDan Williams2015-10-217-75/+102
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow pmem, and other synchronous/bio-based block drivers, to fallback on a per-cpu reference count managed by the core for tracking queue live/dead state. The existing per-cpu reference count for the blk_mq case is promoted to be used in all block i/o scenarios. This involves initializing it by default, waiting for it to drop to zero at exit, and holding a live reference over the invocation of q->make_request_fn() in generic_make_request(). The blk_mq code continues to take its own reference per blk_mq request and retains the ability to freeze the queue, but the check that the queue is frozen is moved to generic_make_request(). This fixes crash signatures like the following: BUG: unable to handle kernel paging request at ffff880140000000 [..] Call Trace: [<ffffffff8145e8bf>] ? copy_user_handle_tail+0x5f/0x70 [<ffffffffa004e1e0>] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem] [<ffffffffa004e331>] pmem_make_request+0xd1/0x200 [nd_pmem] [<ffffffff811c3162>] ? mempool_alloc+0x72/0x1a0 [<ffffffff8141f8b6>] generic_make_request+0xd6/0x110 [<ffffffff8141f966>] submit_bio+0x76/0x170 [<ffffffff81286dff>] submit_bh_wbc+0x12f/0x160 [<ffffffff81286e62>] submit_bh+0x12/0x20 [<ffffffff813395bd>] jbd2_write_superblock+0x8d/0x170 [<ffffffff8133974d>] jbd2_mark_journal_empty+0x5d/0x90 [<ffffffff813399cb>] jbd2_journal_destroy+0x24b/0x270 [<ffffffff810bc4ca>] ? put_pwq_unlocked+0x2a/0x30 [<ffffffff810bc6f5>] ? destroy_workqueue+0x225/0x250 [<ffffffff81303494>] ext4_put_super+0x64/0x360 [<ffffffff8124ab1a>] generic_shutdown_super+0x6a/0xf0 Cc: Jens Axboe <axboe@kernel.dk> Cc: Keith Busch <keith.busch@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* nvme: suspend i/o during runtime blk_integrity_unregisterDan Williams2015-10-211-0/+2
| | | | | | | | | | | Synchronize pending i/o against a change in the integrity profile to avoid the possibility of spurious integrity errors. Cc: Matthew Wilcox <willy@linux.intel.com> Acked-by: Keith Busch <keith.busch@intel.com> [keith: also protect dynamic integrity registration] Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* md: suspend i/o during runtime blk_integrity_unregisterDan Williams2015-10-214-0/+7
| | | | | | | | | | | Synchronize pending i/o against a change in the integrity profile to avoid the possibility of spurious integrity errors. Given linear_add() is suspending the mddev before manipulating the mddev, do the same for the other personalities. Acked-by: NeilBrown <neilb@suse.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdownDan Williams2015-10-215-9/+1
| | | | | | | | | | | | | | | Now that the integrity profile is statically allocated there is no work to do when shutting down an integrity enabled block device. Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: James Bottomley <JBottomley@Odin.com> Acked-by: NeilBrown <neilb@suse.com> Acked-by: Keith Busch <keith.busch@intel.com> Acked-by: Vishal Verma <vishal.l.verma@intel.com> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: Inline blk_integrity in struct gendiskMartin K. Petersen2015-10-2110-181/+152
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Up until now the_integrity profile has been dynamically allocated and attached to struct gendisk after the disk has been made active. This causes problems because NVMe devices need to register the profile prior to the partition table being read due to a mandatory metadata buffer requirement. In addition, DM goes through hoops to deal with preallocating, but not initializing integrity profiles. Since the integrity profile is small (4 bytes + a pointer), Christoph suggested moving it to struct gendisk proper. This requires several changes: - Moving the blk_integrity definition to genhd.h. - Inlining blk_integrity in struct gendisk. - Removing the dynamic allocation code. - Adding helper functions which allow gendisk to set up and tear down the integrity sysfs dir when a disk is added/deleted. - Adding a blk_integrity_revalidate() callback for updating the stable pages bdi setting. - The calls that depend on whether a device has an integrity profile or not now key off of the bi->profile pointer. - Simplifying the integrity support routines in DM (Mike Snitzer). Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: Export integrity data interval size in sysfsMartin K. Petersen2015-10-212-0/+21
| | | | | | | | | | | The size of the data interval was not exported in the sysfs integrity directory. Export it so that userland apps can tell whether the interval is different from the device's logical block size. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: Reduce the size of struct blk_integrityMartin K. Petersen2015-10-213-9/+9
| | | | | | | | | | | | | | | The per-device properties in the blk_integrity structure were previously unsigned short. However, most of the values fit inside a char. The only exception is the data interval size and we can work around that by storing it as a power of two. This cuts the size of the dynamic portion of blk_integrity in half. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: Consolidate static integrity profile propertiesMartin K. Petersen2015-10-219-64/+65
| | | | | | | | | | | | | | | | We previously made a complete copy of a device's data integrity profile even though several of the fields inside the blk_integrity struct are pointers to fixed template entries in t10-pi.c. Split the static and per-device portions so that we can reference the template directly. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* block: Move integrity kobject to struct gendiskMartin K. Petersen2015-10-213-13/+12
| | | | | | | | | | | | | The integrity kobject purely exists to support the integrity subdirectory in sysfs and doesn't really have anything to do with the blk_integrity data structure. Move the kobject to struct gendisk where it belongs. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* NVMe: initialize error to '0'Jens Axboe2015-10-151-1/+1
| | | | | | Reported-by: Keith Busch <keith.busch@intel.com> Fixes: 1951feae88c5 ("nvme: use an integer value to Linux errno values") Signed-off-by: Jens Axboe <axboe@fb.com>
* nvme: use an integer value to Linux errno valuesChristoph Hellwig2015-10-151-5/+7
| | | | | | | | | | | | Use a separate integer variable to hold the signed Linux errno values we pass back to the block layer. Note that for pass through commands those might still be NVMe values, but those fit into the int as well. Fixes: f4829a9b7a61: ("blk-mq: fix racy updates of rq->errors") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* nvme: fix 32-bit build warningArnd Bergmann2015-10-121-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | Compiling the nvme driver on 32-bit warns about a cast from a __u64 variable to a pointer: drivers/block/nvme-core.c: In function 'nvme_submit_io': drivers/block/nvme-core.c:1847:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] (void __user *)io.addr, length, NULL, 0); The cast here is intentional and safe, so we can shut up the gcc warning by adding an intermediate cast to 'uintptr_t'. I had previously submitted a patch to fix this problem in the nvme driver, but it was accepted on the same day that two new warnings got added. For clarification, I also change the third instance of this cast to use uintptr_t instead of unsigned long now. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: d29ec8241c10e ("nvme: submit internal commands through the block layer") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* NVMe: Add explicit block config dependencyKeith Busch2015-10-121-1/+1
| | | | | | | | | | The nvme driver was moved from drivers/block, losing our implicit dependency on CONFIG_BLOCK. This makes it an explicit driver dependency. Reported-by: Jim Davis <jim.epost@gmail.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* nvme: include <linux/types.ĥ> in <linux/nvme.h>Christoph Hellwig2015-10-091-0/+2
| | | | | | | | | The buildbot complains about this even if it doesn't generate a a build warning. But it's an easy fix, so here we go: Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* nvme: move to a new drivers/nvme/host directoryJay Sternberg2015-10-0912-14/+21
| | | | | | | | | | | | | | This patch moves the NVMe driver from drivers/block/ to its own new drivers/nvme/host/ directory. This is in preparation of splitting the current monolithic driver up and add support for the upcoming NVMe over Fabrics standard. The drivers/nvme/host/ is chose to leave space for a NVMe target implementation in addition to this host side driver. Signed-off-by: Jay Sternberg <jay.e.sternberg@intel.com> [hch: rebased, renamed core.c to pci.c, slight tweaks] Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* nvme.h: add missing nvme_id_ctrl endianess annotationsChristoph Hellwig2015-10-091-2/+2
| | | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* nvme: move hardware structures out of the uapi version of nvme.hChristoph Hellwig2015-10-094-591/+590
| | | | | | | | | | | | | | Currently all NVMe command and completion structures are exposed to userspace through the uapi version of nvme.h. They are not an ABI between the kernel and userspace, and will change in C-incompatible way for future versions of the spec. Move them to the kernel version of the file and rename the uapi header to nvme_ioctl.h so that userspace can easily detect the presence of the new clean header. Nvme-cli already carries a local copy of the header, so it won't be affected by this move. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* nvme: add a local nvme.h headerChristoph Hellwig2015-10-094-116/+136
| | | | | | | | | Add a new drivers/block/nvme.h which contains all the driver internal interface. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* nvme: properly handle partially initialized queues in nvme_create_io_queuesChristoph Hellwig2015-10-091-2/+14
| | | | | | | | This avoids having to clean up later in a seemingly unrelated place. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* nvme: merge nvme_dev_start, nvme_dev_resume and nvme_async_probeChristoph Hellwig2015-10-091-33/+20
| | | | | | | | | | And give the resulting function a sensible name. This keeps all the error handling in a single place and will allow for further improvements to it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* nvme: factor reset code into a common helperChristoph Hellwig2015-10-091-24/+24
| | | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* nvme: merge nvme_dev_reset into nvme_reset_failed_devChristoph Hellwig2015-10-091-9/+3
| | | | | | | | And give the resulting function a more descriptive name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* nvme: delete dev from dev_list in nvme_resetChristoph Hellwig2015-10-091-0/+1
| | | | | | | | | | | Device resets need to delete the device from the device list before kicking of the reset an re-probe, otherwise we get the device added to the list twice. nvme_reset is the only side missing this deletion at the moment, and this patch adds it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* NVMe: Simplify device resume on io queue failureKeith Busch2015-10-092-29/+6
| | | | | | | | | | | | | | | | Releasing IO queues and disks was done in a work queue outside the controller resume context to delete namespaces if the controller failed after a resume from suspend. This is unnecessary since we can resume a device asynchronously. This patch makes resume use probe_work so it can directly remove namespaces if the device is manageable but not IO capable. Since the deleting disks was the only reason we had the convoluted "reset_workfn", this patch removes that unnecessary indirection. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* NVMe: Namespace removal simplificationsKeith Busch2015-10-091-32/+10
| | | | | | | | | | This liberates namespace removal from the device, allowing gendisk references to be closed independent of the nvme controller reference count. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* NVMe: Reference count open namespacesKeith Busch2015-10-092-9/+21
| | | | | | | | | | | Dynamic namespace attachment means the namespace may be removed at any time, so the namespace reference count can not be tied to the device reference count. This fixes a NULL dereference if an opened namespace is detached from a controller. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* Merge branch 'for-4.4/core' into for-4.4/driversJens Axboe2015-10-09690-3959/+7245
|\ | | | | | | Signed-off-by: Jens Axboe <axboe@fb.com>
| * Merge tag 'v4.3-rc4' into for-4.4/coreJens Axboe2015-10-09690-3957/+7244
| |\ | | | | | | | | | | | | | | | | | | Linux 4.3-rc4 Pulling in v4.3-rc4 to avoid conflicts with NVMe fixes that have gone in since for-4.4/core was based.
| | * Linux 4.3-rc4v4.3-rc4Linus Torvalds2015-10-041-1/+1
| | |
| | * Merge branch 'strscpy' of ↵Linus Torvalds2015-10-0425-37/+188
| | |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile Pull strscpy string copy function implementation from Chris Metcalf. Chris sent this during the merge window, but I waffled back and forth on the pull request, which is why it's going in only now. The new "strscpy()" function is definitely easier to use and more secure than either strncpy() or strlcpy(), both of which are horrible nasty interfaces that have serious and irredeemable problems. strncpy() has a useless return value, and doesn't NUL-terminate an overlong result. To make matters worse, it pads a short result with zeroes, which is a performance disaster if you have big buffers. strlcpy(), by contrast, is a mis-designed "fix" for strlcpy(), lacking the insane NUL padding, but having a differently broken return value which returns the original length of the source string. Which means that it will read characters past the count from the source buffer, and you have to trust the source to be properly terminated. It also makes error handling fragile, since the test for overflow is unnecessarily subtle. strscpy() avoids both these problems, guaranteeing the NUL termination (but not excessive padding) if the destination size wasn't zero, and making the overflow condition very obvious by returning -E2BIG. It also doesn't read past the size of the source, and can thus be used for untrusted source data too. So why did I waffle about this for so long? Every time we introduce a new-and-improved interface, people start doing these interminable series of trivial conversion patches. And every time that happens, somebody does some silly mistake, and the conversion patch to the improved interface actually makes things worse. Because the patch is mindnumbing and trivial, nobody has the attention span to look at it carefully, and it's usually done over large swatches of source code which means that not every conversion gets tested. So I'm pulling the strscpy() support because it *is* a better interface. But I will refuse to pull mindless conversion patches. Use this in places where it makes sense, but don't do trivial patches to fix things that aren't actually known to be broken. * 'strscpy' of git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile: tile: use global strscpy() rather than private copy string: provide strscpy() Make asm/word-at-a-time.h available on all architectures
| | | * tile: use global strscpy() rather than private copyChris Metcalf2015-09-101-29/+4
| | | | | | | | | | | | | | | | | | | | | | | | Now that strscpy() is a standard API, remove the local copy. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
| | | * string: provide strscpy()Chris Metcalf2015-09-102-0/+91
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The strscpy() API is intended to be used instead of strlcpy(), and instead of most uses of strncpy(). - Unlike strlcpy(), it doesn't read from memory beyond (src + size). - Unlike strlcpy() or strncpy(), the API provides an easy way to check for destination buffer overflow: an -E2BIG error return value. - The provided implementation is robust in the face of the source buffer being asynchronously changed during the copy, unlike the current implementation of strlcpy(). - Unlike strncpy(), the destination buffer will be NUL-terminated if the string in the source buffer is too long. - Also unlike strncpy(), the destination buffer will not be updated beyond the NUL termination, avoiding strncpy's behavior of zeroing the entire tail end of the destination buffer. (A memset() after the strscpy() can be used if this behavior is desired.) - The implementation should be reasonably performant on all platforms since it uses the asm/word-at-a-time.h API rather than simple byte copy. Kernel-to-kernel string copy is not considered to be performance critical in any case. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
| | | * Make asm/word-at-a-time.h available on all architecturesChris Metcalf2015-07-0822-8/+93
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Added the x86 implementation of word-at-a-time to the generic version, which previously only supported big-endian. Omitted the x86-specific load_unaligned_zeropad(), which in any case is also not present for the existing BE-only implementation of a word-at-a-time, and is only used under CONFIG_DCACHE_WORD_ACCESS. Added as a "generic-y" to the Kbuilds of all architectures that didn't previously have it. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
| | * | Merge tag 'md/4.3-fixes' of git://neil.brown.name/mdLinus Torvalds2015-10-047-26/+28
| | |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull md fixes from Neil Brown: "Assorted fixes for md in 4.3-rc. Two tagged for -stable, and one is really a cleanup to match and improve kmemcache interface. * tag 'md/4.3-fixes' of git://neil.brown.name/md: md/bitmap: don't pass -1 to bitmap_storage_alloc. md/raid1: Avoid raid1 resync getting stuck md: drop null test before destroy functions md: clear CHANGE_PENDING in readonly array md/raid0: apply base queue limits *before* disk_stack_limits md/raid5: don't index beyond end of array in need_this_block(). raid5: update analysis state for failed stripe md: wait for pending superblock updates before switching to read-only
| | | * | md/bitmap: don't pass -1 to bitmap_storage_alloc.NeilBrown2015-10-021-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Passing -1 to bitmap_storage_alloc() causes page->index to be set to -1, which is quite problematic. So only pass ->cluster_slot if mddev_is_clustered(). Fixes: b97e92574c0b ("Use separate bitmaps for each nodes in the cluster") Cc: stable@vger.kernel.org (v4.1+) Signed-off-by: NeilBrown <neilb@suse.com>
| | | * | md/raid1: Avoid raid1 resync getting stuckJes Sorensen2015-10-021-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | close_sync() needs to set conf->next_resync to a large, but safe value below MaxSector and use it to determine whether or not to set start_next_window in wait_barrier() Solution suggested by Neil Brown. Reported-by: Nate Dailey <nate.dailey@stratus.com> Tested-by: Xiao Ni <xni@redhat.com> Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>
| | | * | md: drop null test before destroy functionsJulia Lawall2015-10-024-14/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove unneeded NULL test. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x; @@ -if (x != NULL) \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: NeilBrown <neilb@suse.com>
| | | * | md: clear CHANGE_PENDING in readonly arrayShaohua Li2015-10-021-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If faulty disks of an array are more than allowed degraded number, the array enters error handling. It will be marked as read-only with MD_CHANGE_PENDING/RECOVERY_NEEDED set. But currently recovery doesn't clear CHANGE_PENDING bit for read-only array. If MD_CHANGE_PENDING is set for a raid5 array, all returned IO will be hold on a list till the bit is clear. But recovery nevery clears this bit, the IO is always in pending state and nevery finish. This has bad effects like upper layer can't get an IO error and the array can't be stopped. Fixes: c3cce6cda162 ("md/raid5: ensure device failure recorded before write request returns.") Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
| | | * | md/raid0: apply base queue limits *before* disk_stack_limitsNeilBrown2015-10-021-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Calling e.g. blk_queue_max_hw_sectors() after calls to disk_stack_limits() discards the settings determined by disk_stack_limits(). So we need to make those calls first. Fixes: 199dc6ed5179 ("md/raid0: update queue parameter in a safer location.") Cc: stable@vger.kernel.org (v2.6.35+ - please apply with 199dc6ed5179). Reported-by: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>
| | | * | md/raid5: don't index beyond end of array in need_this_block().NeilBrown2015-10-021-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When need_this_block probably shouldn't be called when there are more than 2 failed devices, we really don't want it to try indexing beyond the end of the failed_num[] of fdev[] arrays. So limit the loops to at most 2 iterations. Reported-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.de>
| | | * | raid5: update analysis state for failed stripeShaohua Li2015-10-021-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | handle_failed_stripe() makes the stripe fail, eg, all IO will return with a failure, but it doesn't update stripe_head_state. Later handle_stripe() has special handling for raid6 for handle_stripe_fill(). That check before handle_stripe_fill() doesn't skip the failed stripe and we get a kernel crash in need_this_block. This patch clear the analysis state to make sure no functions wrongly called after handle_failed_stripe() Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
| | | * | md: wait for pending superblock updates before switching to read-onlyNeilBrown2015-10-021-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a superblock update is pending, wait for it to complete before letting md_set_readonly() switch to readonly. Otherwise we might lose important information about a device having failed. For external arrays, waiting for superblock updates can wait on user-space, so in that case, just return an error. Reported-and-tested-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
| | * | | Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linusLinus Torvalds2015-10-0413-144/+82
| | |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull MIPS updates from Ralf Baechle: "This week's round of MIPS fixes: - Fix JZ4740 build - Fix fallback to GFP_DMA - FP seccomp in case of ENOSYS - Fix bootmem panic - A number of FP and CPS fixes - Wire up new syscalls - Make sure BPF assembler objects can properly be disassembled - Fix BPF assembler code for MIPS I" * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: MIPS: scall: Always run the seccomp syscall filters MIPS: Octeon: Fix kernel panic on startup from memory corruption MIPS: Fix R2300 FP context switch handling MIPS: Fix octeon FP context switch handling MIPS: BPF: Fix load delay slots. MIPS: BPF: Do all exports of symbols with FEXPORT(). MIPS: Fix the build on jz4740 after removing the custom gpio.h MIPS: CPS: #ifdef on CONFIG_MIPS_MT_SMP rather than CONFIG_MIPS_MT MIPS: CPS: Don't include MT code in non-MT kernels. MIPS: CPS: Stop dangling delay slot from has_mt. MIPS: dma-default: Fix 32-bit fall back to GFP_DMA MIPS: Wire up userfaultfd and membarrier syscalls.
| | | * | | MIPS: scall: Always run the seccomp syscall filtersMarkos Chandras2015-10-044-73/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The MIPS syscall handler code used to return -ENOSYS on invalid syscalls. Whilst this is expected, it caused problems for seccomp filters because the said filters never had the change to run since the code returned -ENOSYS before triggering them. This caused problems on the chromium testsuite for filters looking for invalid syscalls. This has now changed and the seccomp filters are always run even if the syscall is invalid. We return -ENOSYS once we return from the seccomp filters. Moreover, similar codepaths have been merged in the process which simplifies somewhat the overall syscall code. Signed-off-by: Markos Chandras <markos.chandras@imgtec.com> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/11236/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| | | * | | MIPS: Octeon: Fix kernel panic on startup from memory corruptionMatt Bennett2015-10-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During development it was found that a number of builds would panic during the kernel init process, more specifically in 'delayed_fput()'. The panic showed the kernel trying to access a memory address of '0xb7fdc00' while traversing the 'delayed_fput_list' structure. Comparing this memory address to the value of the pointer used on builds that did not panic confirmed that the pointer on crashing builds must have been corrupted at some stage earlier in the init process. By traversing the list earlier and earlier in the code it was found that 'plat_mem_setup()' was responsible for corrupting the list. Specifically the line: memory = cvmx_bootmem_phy_alloc(mem_alloc_size, __pa_symbol(&__init_end), -1, 0x100000, CVMX_BOOTMEM_FLAG_NO_LOCKING); Which would eventually call: cvmx_bootmem_phy_set_size(new_ent_addr, cvmx_bootmem_phy_get_size (ent_addr) - (desired_min_addr - ent_addr)); Where 'new_ent_addr'=0x4800000 (the address of 'delayed_fput_list') and the second argument (size)=0xb7fdc00 (the address causing the kernel panic). The job of this part of 'plat_mem_setup()' is to allocate chunks of memory for the kernel to use. At the start of each chunk of memory the size of the chunk is written, hence the value 0xb7fdc00 is written onto memory at 0x4800000, therefore the kernel panics when it goes back to access 'delayed_fput_list' later on in the initialisation process. On builds that were not crashing it was found that the compiler had placed 'delayed_fput_list' at 0x4800008, meaning it wasn't corrupted (but something else in memory was overwritten). As can be seen in the first function call above the code begins to allocate chunks of memory beginning from the symbol '__init_end'. The MIPS linker script (vmlinux.lds.S) however defines the .bss section to begin after '__init_end'. Therefore memory within the .bss section is allocated to the kernel to use (System.map shows 'delayed_fput_list' and other kernel structures to be in .bss). To stop the kernel panic (and the .bss section being corrupted) memory should begin being allocated from the symbol '_end'. Signed-off-by: Matt Bennett <matt.bennett@alliedtelesis.co.nz> Acked-by: David Daney <david.daney@cavium.com> Cc: linux-mips@linux-mips.org Cc: aleksey.makarov@auriga.com Patchwork: https://patchwork.linux-mips.org/patch/11251/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| | | * | | MIPS: Fix R2300 FP context switch handlingPaul Burton2015-10-021-27/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 1a3d59579b9f ("MIPS: Tidy up FPU context switching") removed FP context saving from the asm-written resume function in favour of reusing existing code to perform the same task. However it only removed the FP context saving code from the r4k_switch.S implementation of resume. Remove it from the r2300_switch.S implementation too in order to prevent attempting to save the FP context twice, which would likely lead to an exception from the second save because the FPU had already been disabled by the first save. This patch has only been build tested, using rbtx49xx_defconfig. Fixes: 1a3d59579b9f ("MIPS: Tidy up FPU context switching") Signed-off-by: Paul Burton <paul.burton@imgtec.com> Cc: linux-mips@linux-mips.org Cc: Maciej W. Rozycki <macro@linux-mips.org> Cc: linux-kernel@vger.kernel.org Cc: Manuel Lauss <manuel.lauss@gmail.com> Patchwork: https://patchwork.linux-mips.org/patch/11167/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| | | * | | MIPS: Fix octeon FP context switch handlingPaul Burton2015-10-021-25/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 1a3d59579b9f ("MIPS: Tidy up FPU context switching") removed FP context saving from the asm-written resume function in favour of reusing existing code to perform the same task. However it only removed the FP context saving code from the r4k_switch.S implementation of resume. Octeon uses its own implementation in octeon_switch.S, so remove FP context saving there too in order to prevent attempting to save context twice. That formerly led to an exception from the second save as follows because the FPU had already been disabled by the first save: do_cpu invoked from kernel context![#1]: CPU: 0 PID: 2 Comm: kthreadd Not tainted 4.3.0-rc2-dirty #2 task: 800000041f84a008 ti: 800000041f864000 task.ti: 800000041f864000 $ 0 : 0000000000000000 0000000010008ce1 0000000000100000 ffffffffbfffffff $ 4 : 800000041f84a008 800000041f84ac08 800000041f84c000 0000000000000004 $ 8 : 0000000000000001 0000000000000000 0000000000000000 0000000000000001 $12 : 0000000010008ce3 0000000000119c60 0000000000000036 800000041f864000 $16 : 800000041f84ac08 800000000792ce80 800000041f84a008 ffffffff81758b00 $20 : 0000000000000000 ffffffff8175ae50 0000000000000000 ffffffff8176c740 $24 : 0000000000000006 ffffffff81170300 $28 : 800000041f864000 800000041f867d90 0000000000000000 ffffffff815f3fa0 Hi : 0000000000fa8257 Lo : ffffffffe15cfc00 epc : ffffffff8112821c resume+0x9c/0x200 ra : ffffffff815f3fa0 __schedule+0x3f0/0x7d8 Status: 10008ce2 KX SX UX KERNEL EXL Cause : 1080002c (ExcCode 0b) PrId : 000d0601 (Cavium Octeon+) Modules linked in: Process kthreadd (pid: 2, threadinfo=800000041f864000, task=800000041f84a008, tls=0000000000000000) Stack : ffffffff81604218 ffffffff815f7e08 800000041f84a008 ffffffff811681b0 800000041f84a008 ffffffff817e9878 0000000000000000 ffffffff81770000 ffffffff81768340 ffffffff81161398 0000000000000001 0000000000000000 0000000000000000 ffffffff815f4424 0000000000000000 ffffffff81161d68 ffffffff81161be8 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 ffffffff8111e16c 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 ... Call Trace: [<ffffffff8112821c>] resume+0x9c/0x200 [<ffffffff815f3fa0>] __schedule+0x3f0/0x7d8 [<ffffffff815f4424>] schedule+0x34/0x98 [<ffffffff81161d68>] kthreadd+0x180/0x198 [<ffffffff8111e16c>] ret_from_kernel_thread+0x14/0x1c Tested using cavium_octeon_defconfig on an EdgeRouter Lite. Fixes: 1a3d59579b9f ("MIPS: Tidy up FPU context switching") Reported-by: Aaro Koskinen <aaro.koskinen@nokia.com> Signed-off-by: Paul Burton <paul.burton@imgtec.com> Cc: linux-mips@linux-mips.org Cc: Aleksey Makarov <aleksey.makarov@auriga.com> Cc: linux-kernel@vger.kernel.org Cc: Chandrakala Chavva <cchavva@caviumnetworks.com> Cc: David Daney <david.daney@cavium.com> Cc: Leonid Rosenboim <lrosenboim@caviumnetworks.com> Patchwork: https://patchwork.linux-mips.org/patch/11166/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
| | | * | | MIPS: BPF: Fix load delay slots.Ralf Baechle2015-10-021-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The entire bpf_jit_asm.S is written in noreorder mode because "we know better" according to a comment. This also prevented the assembler from throwing in the required NOPs for MIPS I processors which have no load-use interlock, thus the load's consumer might end up using the old value of the register from prior to the load. Fixed by putting the assembler in reorder mode for just the affected load instructions. This is not enough for gas to actually try to be clever by looking at the next instruction and inserting a nop only when needed but as the comment said "we know better", so getting gas to unconditionally emit a NOP is just right in this case and prevents adding further ifdefery. Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
OpenPOWER on IntegriCloud