summaryrefslogtreecommitdiffstats
path: root/drivers/nvme/host/pci.c
Commit message (Collapse)AuthorAgeFilesLines
...
* nvme-pci: avoid hmb desc array idx out-of-bound when hmmaxd set.Minwoo Im2017-11-201-1/+1
| | | | | | | | | | | | | | | | | | | | | hmb descriptor idx out-of-bound occurs in case of below conditions. preferred = 128MiB chunk_size = 4MiB hmmaxd = 1 Current code will not allow rmmod which will free hmb descriptors to be done successfully in above case. "descs[i]" will be set in for-loop without seeing any conditions related to "max_entries" after a single "descs" was allocated by (max_entries = 1) in this case. Added a condition into for-loop to check index of descriptors. Fixes: 044a9df1("nvme-pci: implement the HMB entry number and size limitations") Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme-pci: disable APST on Samsung SSD 960 EVO + ASUS PRIME B350M-AKai-Heng Feng2017-11-201-2/+10
| | | | | | | | | | | | | | | | | | | | The NVMe device in question drops off the PCIe bus after system suspend. I've tried several approaches to workaround this issue, but none of them works: - NVME_QUIRK_DELAY_BEFORE_CHK_RDY - NVME_QUIRK_NO_DEEPEST_PS - Disable APST before controller shutdown - Delay between controller shutdown and system suspend - Explicitly set power state to 0 before controller shutdown Fortunately it's a desktop, so disable APST won't hurt the battery. Also, change the quirk function name to reflect it's for vendor combination quirks. BugLink: https://bugs.launchpad.net/bugs/1705748 Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* Merge branch 'for-4.15/block' of git://git.kernel.dk/linux-blockLinus Torvalds2017-11-141-48/+195
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull core block layer updates from Jens Axboe: "This is the main pull request for block storage for 4.15-rc1. Nothing out of the ordinary in here, and no API changes or anything like that. Just various new features for drivers, core changes, etc. In particular, this pull request contains: - A patch series from Bart, closing the whole on blk/scsi-mq queue quescing. - A series from Christoph, building towards hidden gendisks (for multipath) and ability to move bio chains around. - NVMe - Support for native multipath for NVMe (Christoph). - Userspace notifications for AENs (Keith). - Command side-effects support (Keith). - SGL support (Chaitanya Kulkarni) - FC fixes and improvements (James Smart) - Lots of fixes and tweaks (Various) - bcache - New maintainer (Michael Lyle) - Writeback control improvements (Michael) - Various fixes (Coly, Elena, Eric, Liang, et al) - lightnvm updates, mostly centered around the pblk interface (Javier, Hans, and Rakesh). - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph) - Writeback series that fix the much discussed hundreds of millions of sync-all units. This goes all the way, as discussed previously (me). - Fix for missing wakeup on writeback timer adjustments (Yafang Shao). - Fix laptop mode on blk-mq (me). - {mq,name} tupple lookup for IO schedulers, allowing us to have alias names. This means you can use 'deadline' on both !mq and on mq (where it's called mq-deadline). (me). - blktrace race fix, oopsing on sg load (me). - blk-mq optimizations (me). - Obscure waitqueue race fix for kyber (Omar). - NBD fixes (Josef). - Disable writeback throttling by default on bfq, like we do on cfq (Luca Miccio). - Series from Ming that enable us to treat flush requests on blk-mq like any other request. This is a really nice cleanup. - Series from Ming that improves merging on blk-mq with schedulers, getting us closer to flipping the switch on scsi-mq again. - BFQ updates (Paolo). - blk-mq atomic flags memory ordering fixes (Peter Z). - Loop cgroup support (Shaohua). - Lots of minor fixes from lots of different folks, both for core and driver code" * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits) nvme: fix visibility of "uuid" ns attribute blk-mq: fixup some comment typos and lengths ide: ide-atapi: fix compile error with defining macro DEBUG blk-mq: improve tag waiting setup for non-shared tags brd: remove unused brd_mutex blk-mq: only run the hardware queue if IO is pending block: avoid null pointer dereference on null disk fs: guard_bio_eod() needs to consider partitions xtensa/simdisk: fix compile error nvme: expose subsys attribute to sysfs nvme: create 'slaves' and 'holders' entries for hidden controllers block: create 'slaves' and 'holders' entries for hidden gendisks nvme: also expose the namespace identification sysfs files for mpath nodes nvme: implement multipath access to nvme subsystems nvme: track shared namespaces nvme: introduce a nvme_ns_ids structure nvme: track subsystems block, nvme: Introduce blk_mq_req_flags_t block, scsi: Make SCSI quiesce and resume work reliably block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag ...
| * nvme-pci: avoid dereference of symbol from unloaded moduleMing Lei2017-11-101-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The 'remove_work' may be scheduled to run after nvme_remove() returns since we can't simply cancel it in nvme_remove() for avoiding deadlock. Once nvme_remove() returns, this module(nvme) can be unloaded. On the other hand, nvme_put_ctrl() calls ctr->ops->free_ctrl which may point to nvme_pci_free_ctrl() in unloaded module. This patch avoids this issue by queuing 'remove_work' via 'nvme_wq', and flush this worqueue in nvme_exit() as suggested by Sagi. Suggested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nvme: remove handling of multiple AEN requestsKeith Busch2017-11-101-2/+2
| | | | | | | | | | | | | | | | | | | | | | The driver can handle tracking only one AEN request, so this patch removes handling for multiple ones. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nvme: centralize AEN definesKeith Busch2017-11-101-13/+3
| | | | | | | | | | | | | | | | | | | | All the transports were unnecessarilly duplicating the AEN request accounting. This patch defines everything in one place. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Guan Junxiong <guanjunxiong@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * nvme: Remove unused headersKeith Busch2017-11-011-4/+0
| | | | | | | | | | | | Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme: switch controller refcounting to use struct deviceChristoph Hellwig2017-10-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of allocating a separate struct device for the character device handle embedd it into struct nvme_ctrl and use it for the main controller refcounting. This removes double refcounting and gets us an automatic reference for the character device operations. We keep ctrl->device as a pointer for now to avoid chaning printks all over, but in the future we could look into message printing helpers that take a controller structure similar to what other subsystems do. Note the delete_ctrl operation always already has a reference (either through sysfs due this change, or because every open file on the /dev/nvme-fabrics node has a refernece) when it is entered now, so we don't need to do the unless_zero variant there. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com>
| * nvme-pci: add SGL supportChaitanya Kulkarni2017-10-201-27/+187
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds SGL support for NVMe PCIe driver, based on an earlier patch from Rajiv Shanmugam Madeswaran <smrajiv15 at gmail.com>. This patch refactors the original code and adds new module parameter sgl_threshold to determine whether to use SGL or PRP for IOs. The usage of SGLs is controlled by the sgl_threshold module parameter, which allows to conditionally use SGLs if average request segment size (avg_seg_size) is greater than sgl_threshold. In the original patch, the decision of using SGLs was dependent only on the IO size, with the new approach we consider not only IO size but also the number of physical segments present in the IO. We calculate avg_seg_size based on request payload bytes and number of physical segments present in the request. For e.g.:- 1. blk_rq_nr_phys_segments = 2 blk_rq_payload_bytes = 8k avg_seg_size = 4K use sgl if avg_seg_size >= sgl_threshold. 2. blk_rq_nr_phys_segments = 2 blk_rq_payload_bytes = 64k avg_seg_size = 32K use sgl if avg_seg_size >= sgl_threshold. 3. blk_rq_nr_phys_segments = 16 blk_rq_payload_bytes = 64k avg_seg_size = 4K use sgl if avg_seg_size >= sgl_threshold. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme-pci: fix typos in commentsMinwoo Im2017-10-191-2/+2
| | | | | | | | | | | | | | | | fixed comment typos in adapter_alloc_cq() and adapter_alloc_sq(). 'the the' duplications are replaced with 'that the'. Signed-off-by: Minwoo Im <dn3108@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* | nvme-pci: Use PCI bus address for data/queues in CMBChristoph Hellwig2017-10-041-7/+7
|/ | | | | | | | | | | | | | | | | | | | | Currently, NVMe PCI host driver is programming CMB dma address as I/O SQs addresses. This results in failures on systems where 1:1 outbound mapping is not used (example Broadcom iProc SOCs) because CMB BAR will be progammed with PCI bus address but NVMe PCI EP will try to access CMB using dma address. To have CMB working on systems without 1:1 outbound mapping, we program PCI bus address for I/O SQs instead of dma address. This approach will work on systems with/without 1:1 outbound mapping. Based on a report and previous patch from Abhishek Shah. Fixes: 8ffaadf7 ("NVMe: Use CMB for the IO SQes if available") Cc: stable@vger.kernel.org Reported-by: Abhishek Shah <abhishek.shah@broadcom.com> Tested-by: Abhishek Shah <abhishek.shah@broadcom.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme-pci: Print invalid SGL only onceKeith Busch2017-09-251-12/+18
| | | | | | | | | | | | The WARN_ONCE macro returns true if the condition is true, not if the warn was raised, so we're printing the scatter list every time it's invalid. This is excessive and makes debugging harder, so this patch prints it just once. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nvme-pci: initialize queue memory before interruptsKeith Busch2017-09-251-2/+2
| | | | | | | | | | | | | | | A spurious interrupt before the nvme driver has initialized the completion queue may inadvertently cause the driver to believe it has a completion to process. This may result in a NULL dereference since the nvmeq's tags are not set at this point. The patch initializes the host's CQ memory so that a spurious interrupt isn't mistaken for a real completion. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nvme-pci: implement the HMB entry number and size limitationsChristoph Hellwig2017-09-111-1/+5
| | | | | | | | | | | Adds support for the new Host Memory Buffer Minimum Descriptor Entry Size and Host Memory Maximum Descriptors Entries field that were added in TP 4002 HMB Enhancements. These allow the controller to advertise limits for the usual number of segments in the host memory buffer, as well as a minimum usable per-segment size. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com>
* nvme-pci: propagate (some) errors from host memory buffer setupChristoph Hellwig2017-09-111-6/+12
| | | | | | | | | | | We want to catch command execution errors when resetting the device, so propagate errors from the Set Features when setting up the host memory buffer. We keep ignoring memory allocation failures, as the spec clearly says that the controller must work without a host memory buffer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Cc: stable@vger.kernel.org
* nvme-pci: use appropriate initial chunk size for HMB allocationAkinobu Mita2017-09-111-1/+1
| | | | | | | | | | | | | | The initial chunk size for host memory buffer allocation is currently PAGE_SIZE << MAX_ORDER. MAX_ORDER order allocation is usually failed without CONFIG_DMA_CMA. So the HMB allocation is retried with chunk size PAGE_SIZE << (MAX_ORDER - 1) in general, but there is no problem if the retry allocation works correctly. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> [hch: rebased] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Cc: stable@vger.kernel.org
* nvme-pci: fix host memory buffer allocation fallbackChristoph Hellwig2017-09-111-18/+30
| | | | | | | | | | | | | | nvme_alloc_host_mem currently contains two loops that are interwinded, and the outer retry loop turns out to be broken. Fix this by untangling the two. Based on a report an initial patch from Akinobu Mita. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Akinobu Mita <akinobu.mita@gmail.com> Tested-by: Akinobu Mita <akinobu.mita@gmail.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Cc: stable@vger.kernel.org
* nvme: fix lightnvm checkChristoph Hellwig2017-09-111-0/+4
| | | | | | | | | | | | | nvme_nvm_ns_supported assumes every device is a pci_dev, which leads to reading an incorrect field, or possible even a dereference of unallocated memory for fabrics controllers. Fix this by introducing a quirk for lighnvm capable devices instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matias Bjørling <mb@lightnvm.io> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
* Merge branch 'for-4.14/block-postmerge' of git://git.kernel.dk/linux-blockLinus Torvalds2017-09-091-3/+6
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull followup block layer updates from Jens Axboe: "I ended up splitting the main pull request for this series into two, mainly because of clashes between NVMe fixes that went into 4.13 after the for-4.14 branches were split off. This pull request is mostly NVMe, but not exclusively. In detail, it contains: - Two pull request for NVMe changes from Christoph. Nothing new on the feature front, basically just fixes all over the map for the core bits, transport, rdma, etc. - Series from Bart, cleaning up various bits in the BFQ scheduler. - Series of bcache fixes, which has been lingering for a release or two. Coly sent this in, but patches from various people in this area. - Set of patches for BFQ from Paolo himself, updating both documentation and fixing some corner cases in performance. - Series from Omar, attempting to now get the 4k loop support correct. Our confidence level is higher this time. - Series from Shaohua for loop as well, improving O_DIRECT performance and fixing a use-after-free" * 'for-4.14/block-postmerge' of git://git.kernel.dk/linux-block: (74 commits) bcache: initialize dirty stripes in flash_dev_run() loop: set physical block size to logical block size bcache: fix bch_hprint crash and improve output bcache: Update continue_at() documentation bcache: silence static checker warning bcache: fix for gc and write-back race bcache: increase the number of open buckets bcache: Correct return value for sysfs attach errors bcache: correct cache_dirty_target in __update_writeback_rate() bcache: gc does not work when triggering by manual command bcache: Don't reinvent the wheel but use existing llist API bcache: do not subtract sectors_to_gc for bypassed IO bcache: fix sequential large write IO bypass bcache: Fix leak of bdev reference block/loop: remove unused field block/loop: fix use after free bfq: Use icq_to_bic() consistently bfq: Suppress compiler warnings about comparisons bfq: Check kstrtoul() return value bfq: Declare local functions static ...
| * nvme/pci: Use req_op to determine DIF remappingKeith Busch2017-08-301-2/+2
| | | | | | | | | | | | | | | | | | | | Only read and write commands need DIF remapping. Everything else uses a passthrough integrity payload. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme: fix uninitialized prp2 value on small transfersJan H. Schönherr2017-08-281-1/+3
| | | | | | | | | | | | | | | | | | | | | | The value of iod->first_dma ends up as prp2 in NVMe commands. In case there is not enough data to cross a page boundary, iod->first_dma is never initialized and contains random data. Comply with the NVMe specification and fill in 0 in that case. Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme: Add admin_tagset pointer to nvme_ctrlSagi Grimberg2017-08-281-0/+1
| | | | | | | | | | | | | | Will be used when we centralize control flows. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
* | nvme-pci: use dma memory for the host memory buffer descriptorsChristoph Hellwig2017-08-301-11/+11
|/ | | | | | | | | | | | | | | | | | | | | | The NVMe 1.3 specification says in section 5.21.1.13: "After a successful completion of a Set Features enabling the host memory buffer, the host shall not write to the associated host memory region, buffer size, or descriptor list until the host memory buffer has been disabled." While this doesn't state that the descriptor list must remain accessible to the device it certainly implies it must remaing readable by the device. So switch to a dma coherent allocation for the descriptor list just to be safe - it's not like the cost for it matters compared to the actual memory buffers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Fixes: 87ad72a59a38 ("nvme-pci: implement host memory buffer support")
* nvme-pci: set cqe_seen on polled completionsKeith Busch2017-08-181-3/+2
| | | | | | | Fixes: 920d13a884 ("nvme-pci: factor out the cqe reading mechanics from __nvme_process_cq") Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme-pci: fix CMB sysfs file removal in reset pathMax Gurtovoy2017-08-101-11/+7
| | | | | | | | | | | | Currently we create the sysfs entry even if we fail mapping it. In that case, the unmapping will not remove the sysfs created file. There is no good reason to create a sysfs entry for a non working CMB and show his characteristics. Fixes: f63572dff ("nvme: unmap CMB and remove sysfs file in reset path") Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Stephen Bates <sbates@raithlin.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* nvme-pci: fix HMB size calculationChristoph Hellwig2017-07-251-3/+3
| | | | | | | | | | | | | | It's possible the preferred HMB size may not be a multiple of the chunk_size. This patch moves len to function scope and uses that in the for loop increment so the last iteration doesn't cause the total size to exceed the allocated HMB size. Based on an earlier patch from Keith Busch. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Fixes: 87ad72a59a38 ("nvme-pci: implement host memory buffer support")
* nvme-pci: Fix an error handling path in 'nvme_probe()'Christophe JAILLET2017-07-201-3/+4
| | | | | | | | | | | Release resources in the correct order in order not to miss a 'put_device()' if 'nvme_dev_map()' fails. Fixes: b00a726a9fd8 ("NVMe: Don't unmap controller registers on reset") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nvme-pci: Remove nvme_setup_prps BUG_ONKeith Busch2017-07-201-8/+25
| | | | | | | | | | | | | This patch replaces the invalid nvme SGL kernel panic with a warning, and returns an appropriate error. The warning will occur only on the first occurance, and sgl details will be printed to help debug how the request was allowed to form. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* nvme-pci: add another device ID with stripe quirkDavid Wayne Fugate2017-07-201-0/+3
| | | | | | | | | | Adds a fourth Intel controller which has the "stripe" quirk. Signed-off-by: David Wayne Fugate <david.fugate@intel.com> Acked-by: Keith Busch <keith.busch@intel.com> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds2017-07-111-38/+58
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull more block updates from Jens Axboe: "This is a followup for block changes, that didn't make the initial pull request. It's a bit of a mixed bag, this contains: - A followup pull request from Sagi for NVMe. Outside of fixups for NVMe, it also includes a series for ensuring that we properly quiesce hardware queues when browsing live tags. - Set of integrity fixes from Dmitry (mostly), fixing various issues for folks using DIF/DIX. - Fix for a bug introduced in cciss, with the req init changes. From Christoph. - Fix for a bug in BFQ, from Paolo. - Two followup fixes for lightnvm/pblk from Javier. - Depth fix from Ming for blk-mq-sched. - Also from Ming, performance fix for mtip32xx that was introduced with the dynamic initialization of commands" * 'for-linus' of git://git.kernel.dk/linux-block: (44 commits) block: call bio_uninit in bio_endio nvmet: avoid unneeded assignment of submit_bio return value nvme-pci: add module parameter for io queue depth nvme-pci: compile warnings in nvme_alloc_host_mem() nvmet_fc: Accept variable pad lengths on Create Association LS nvme_fc/nvmet_fc: revise Create Association descriptor length lightnvm: pblk: remove unnecessary checks lightnvm: pblk: control I/O flow also on tear down cciss: initialize struct scsi_req null_blk: fix error flow for shared tags during module_init block: Fix __blkdev_issue_zeroout loop nvme-rdma: unconditionally recycle the request mr nvme: split nvme_uninit_ctrl into stop and uninit virtio_blk: quiesce/unquiesce live IO when entering PM states mtip32xx: quiesce request queues to make sure no submissions are inflight nbd: quiesce request queues to make sure no submissions are inflight nvme: kick requeue list when requeueing a request instead of when starting the queues nvme-pci: quiesce/unquiesce admin_q instead of start/stop its hw queues nvme-loop: quiesce/unquiesce admin_q instead of start/stop its hw queues nvme-fc: quiesce/unquiesce admin_q instead of start/stop its hw queues ...
| * Merge branch 'nvme-4.13' of git://git.infradead.org/nvme into for-linusJens Axboe2017-07-101-38/+58
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull followup NVMe (mostly) changes from Sagi: I added the quiesce/unquiesce patches in here as it's easy for me easily apply changes on top. It has accumulated reviews and includes mostly nvme anyway, please tell me if you don't want to take them with this. This includes: - quiesce/unquiesce fixes in nvme and others from me - nvme-fc add create association padding spec updates from James - some more quirking from MKP - nvmet nit cleanup from Max - Fix nvme-rdma racy RDMA completion signalling from Marta - some centralization patches from me - add tagset nr_hw_queues updates on controller resets in nvme drivers from me - nvme-rdma fix resources recycling when doing error recovery from me - minor cleanups in nvme-fc from me
| | * nvme-pci: add module parameter for io queue depthweiping zhang2017-07-101-2/+22
| | | | | | | | | | | | | | | | | | | | | Adjust io queue depth more easily, and make sure io queue depth >= 2. Signed-off-by: weiping zhang <zhangweiping@didichuxing.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
| | * nvme-pci: compile warnings in nvme_alloc_host_mem()Dan Carpenter2017-07-101-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | "i" should be signed or it could cause a forever loop on the cleanup path. "size" can be used uninitialized. Fixes: 87ad72a59a38 ("nvme-pci: implement host memory buffer support") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
| | * nvme: split nvme_uninit_ctrl into stop and uninitSagi Grimberg2017-07-061-12/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Usually before we teardown the controller we want to: 1. complete/cancel any ctrl inflight works 2. remove ctrl namespaces (only for removal though, resets shouldn't remove any namespaces). but we do not want to destroy the controller device as we might use it for logging during the teardown stage. This patch adds nvme_start_ctrl() which queues inflight controller works (aen, ns scan, queue start and keep-alive if kato is set) and nvme_stop_ctrl() which cancels the works namespace removal is left to the callers to handle. Move nvme_uninit_ctrl after we are done with the controller device. Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
| | * nvme-pci: quiesce/unquiesce admin_q instead of start/stop its hw queuesSagi Grimberg2017-07-061-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | unlike blk_mq_stop_hw_queues and blk_mq_start_stopped_hw_queues quiescing/unquiescing respects the submission path rcu grace. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
| | * nvme-pci: rename to nvme_pci_configure_admin_queueSagi Grimberg2017-07-021-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | we are going to need the name for the core routine... Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
| | * nvme: move ctrl cap to struct nvme_ctrlSagi Grimberg2017-07-021-11/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All transports use either a private cache of controller cap or an on-stack copy, move it to the generic struct nvme_ctrl. In the future it will also be maintained by the core. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
| | * nvme: move queue_count to the nvme_ctrlSagi Grimberg2017-07-021-8/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All all transports use the queue_count in exactly the same, so move it to the generic struct nvme_ctrl. In the future it will also be maintained by the core. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-By: James Smart <james.smart@broadcom.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
| | * nvme: Quirks for PM1725 controllersMartin K. Petersen2017-07-021-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PM1725 controllers have a couple of quirks that need to be handled in the driver: - I/O queue depth must be limited to 64 entries on controllers that do not report MQES. - The host interface registers go offline briefly while resetting the chip. Thus a delay is needed before checking whether the controller is ready. Note that the admin queue depth is also limited to 64 on older versions of this board. Since our NVME_AQ_DEPTH is now 32 that is no longer an issue. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
* | | Merge tag 'pci-v4.13-changes' of ↵Linus Torvalds2017-07-081-6/+9
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: - add sysfs max_link_speed/width, current_link_speed/width (Wong Vee Khee) - make host bridge IRQ mapping much more generic (Matthew Minter, Lorenzo Pieralisi) - convert most drivers to pci_scan_root_bus_bridge() (Lorenzo Pieralisi) - mutex sriov_configure() (Jakub Kicinski) - mutex pci_error_handlers callbacks (Christoph Hellwig) - split ->reset_notify() into ->reset_prepare()/reset_done() (Christoph Hellwig) - support multiple PCIe portdrv interrupts for MSI as well as MSI-X (Gabriele Paoloni) - allocate MSI/MSI-X vector for Downstream Port Containment (Gabriele Paoloni) - fix MSI IRQ affinity pre/post/min_vecs issue (Michael Hernandez) - test INTx masking during enumeration, not at run-time (Piotr Gregor) - avoid using device_may_wakeup() for runtime PM (Rafael J. Wysocki) - restore the status of PCI devices across hibernation (Chen Yu) - keep parent resources that start at 0x0 (Ard Biesheuvel) - enable ECRC only if device supports it (Bjorn Helgaas) - restore PRI and PASID state after Function-Level Reset (CQ Tang) - skip DPC event if device is not present (Keith Busch) - check domain when matching SMBIOS info (Sujith Pandel) - mark Intel XXV710 NIC INTx masking as broken (Alex Williamson) - avoid AMD SB7xx EHCI USB wakeup defect (Kai-Heng Feng) - work around long-standing Macbook Pro poweroff issue (Bjorn Helgaas) - add Switchtec "running" status flag (Logan Gunthorpe) - fix dra7xx incorrect RW1C IRQ register usage (Arvind Yadav) - modify xilinx-nwl IRQ chip for legacy interrupts (Bharat Kumar Gogada) - move VMD SRCU cleanup after bus, child device removal (Jon Derrick) - add Faraday clock handling (Linus Walleij) - configure Rockchip MPS and reorganize (Shawn Lin) - limit Qualcomm TLP size to 2K (hardware issue) (Srinivas Kandagatla) - support Tegra MSI 64-bit addressing (Thierry Reding) - use Rockchip normal (not privileged) register bank (Shawn Lin) - add HiSilicon Kirin SoC PCIe controller driver (Xiaowei Song) - add Sigma Designs Tango SMP8759 PCIe controller driver (Marc Gonzalez) - add MediaTek PCIe host controller support (Ryder Lee) - add Qualcomm IPQ4019 support (John Crispin) - add HyperV vPCI protocol v1.2 support (Jork Loeser) - add i.MX6 regulator support (Quentin Schulz) * tag 'pci-v4.13-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (113 commits) PCI: tango: Add Sigma Designs Tango SMP8759 PCIe host bridge support PCI: Add DT binding for Sigma Designs Tango PCIe controller PCI: rockchip: Use normal register bank for config accessors dt-bindings: PCI: Add documentation for MediaTek PCIe PCI: Remove __pci_dev_reset() and pci_dev_reset() PCI: Split ->reset_notify() method into ->reset_prepare() and ->reset_done() PCI: xilinx: Make of_device_ids const PCI: xilinx-nwl: Modify IRQ chip for legacy interrupts PCI: vmd: Move SRCU cleanup after bus, child device removal PCI: vmd: Correct comment: VMD domains start at 0x10000, not 0x1000 PCI: versatile: Add local struct device pointers PCI: tegra: Do not allocate MSI target memory PCI: tegra: Support MSI 64-bit addressing PCI: rockchip: Use local struct device pointer consistently PCI: rockchip: Check for clk_prepare_enable() errors during resume MAINTAINERS: Remove Wenrui Li as Rockchip PCIe driver maintainer PCI: rockchip: Configure RC's MPS setting PCI: rockchip: Reconfigure configuration space header type PCI: rockchip: Split out rockchip_pcie_cfg_configuration_accesses() PCI: rockchip: Move configuration accesses into rockchip_pcie_cfg_atu() ...
| * | | PCI: Split ->reset_notify() method into ->reset_prepare() and ->reset_done()Christoph Hellwig2017-07-031-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The pci_error_handlers->reset_notify() method had a flag to indicate whether to prepare for or clean up after a reset. The prepare and done cases have no shared functionality whatsoever, so split them into separate methods. [bhelgaas: changelog, update locking comments] Link: http://lkml.kernel.org/r/20170601111039.8913-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
* | | | Merge branch 'irq-core-for-linus' of ↵Linus Torvalds2017-07-031-1/+1
|\ \ \ \ | |_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "The irq department delivers: - Expand the generic infrastructure handling the irq migration on CPU hotplug and convert X86 over to it. (Thomas Gleixner) Aside of consolidating code this is a preparatory change for: - Finalizing the affinity management for multi-queue devices. The main change here is to shut down interrupts which are affine to a outgoing CPU and reenabling them when the CPU comes online again. That avoids moving interrupts pointlessly around and breaking and reestablishing affinities for no value. (Christoph Hellwig) Note: This contains also the BLOCK-MQ and NVME changes which depend on the rework of the irq core infrastructure. Jens acked them and agreed that they should go with the irq changes. - Consolidation of irq domain code (Marc Zyngier) - State tracking consolidation in the core code (Jeffy Chen) - Add debug infrastructure for hierarchical irq domains (Thomas Gleixner) - Infrastructure enhancement for managing generic interrupt chips via devmem (Bartosz Golaszewski) - Constification work all over the place (Tobias Klauser) - Two new interrupt controller drivers for MVEBU (Thomas Petazzoni) - The usual set of fixes, updates and enhancements all over the place" * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits) irqchip/or1k-pic: Fix interrupt acknowledgement irqchip/irq-mvebu-gicp: Allocate enough memory for spi_bitmap irqchip/gic-v3: Fix out-of-bound access in gic_set_affinity nvme: Allocate queues for all possible CPUs blk-mq: Create hctx for each present CPU blk-mq: Include all present CPUs in the default queue mapping genirq: Avoid unnecessary low level irq function calls genirq: Set irq masked state when initializing irq_desc genirq/timings: Add infrastructure for estimating the next interrupt arrival time genirq/timings: Add infrastructure to track the interrupt timings genirq/debugfs: Remove pointless NULL pointer check irqchip/gic-v3-its: Don't assume GICv3 hardware supports 16bit INTID irqchip/gic-v3-its: Add ACPI NUMA node mapping irqchip/gic-v3-its-platform-msi: Make of_device_ids const irqchip/gic-v3-its: Make of_device_ids const irqchip/irq-mvebu-icu: Add new driver for Marvell ICU irqchip/irq-mvebu-gicp: Add new driver for Marvell GICP dt-bindings/interrupt-controller: Add DT binding for the Marvell ICU genirq/irqdomain: Remove auto-recursive hierarchy support irqchip/MSI: Use irq_domain_update_bus_token instead of an open coded access ...
| * | | nvme: Allocate queues for all possible CPUsChristoph Hellwig2017-06-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unlike most drіvers that simply pass the maximum possible vectors to pci_alloc_irq_vectors NVMe needs to configure the device before allocting the vectors, so it needs a manual update for the new scheme of using all present CPUs. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Cc: Keith Busch <keith.busch@intel.com> Cc: linux-block@vger.kernel.org Cc: linux-nvme@lists.infradead.org Link: http://lkml.kernel.org/r/20170626102058.10200-4-hch@lst.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | | | Merge branch 'for-4.13/block' of git://git.kernel.dk/linux-blockLinus Torvalds2017-07-031-252/+395
|\ \ \ \ | | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull core block/IO updates from Jens Axboe: "This is the main pull request for the block layer for 4.13. Not a huge round in terms of features, but there's a lot of churn related to some core cleanups. Note this depends on the UUID tree pull request, that Christoph already sent out. This pull request contains: - A series from Christoph, unifying the error/stats codes in the block layer. We now use blk_status_t everywhere, instead of using different schemes for different places. - Also from Christoph, some cleanups around request allocation and IO scheduler interactions in blk-mq. - And yet another series from Christoph, cleaning up how we handle and do bounce buffering in the block layer. - A blk-mq debugfs series from Bart, further improving on the support we have for exporting internal information to aid debugging IO hangs or stalls. - Also from Bart, a series that cleans up the request initialization differences across types of devices. - A series from Goldwyn Rodrigues, allowing the block layer to return failure if we will block and the user asked for non-blocking. - Patch from Hannes for supporting setting loop devices block size to that of the underlying device. - Two series of patches from Javier, fixing various issues with lightnvm, particular around pblk. - A series from me, adding support for write hints. This comes with NVMe support as well, so applications can help guide data placement on flash to improve performance, latencies, and write amplification. - A series from Ming, improving and hardening blk-mq support for stopping/starting and quiescing hardware queues. - Two pull requests for NVMe updates. Nothing major on the feature side, but lots of cleanups and bug fixes. From the usual crew. - A series from Neil Brown, greatly improving the bio rescue set support. Most notably, this kills the bio rescue work queues, if we don't really need them. - Lots of other little bug fixes that are all over the place" * 'for-4.13/block' of git://git.kernel.dk/linux-block: (217 commits) lightnvm: pblk: set line bitmap check under debug lightnvm: pblk: verify that cache read is still valid lightnvm: pblk: add initialization check lightnvm: pblk: remove target using async. I/Os lightnvm: pblk: use vmalloc for GC data buffer lightnvm: pblk: use right metadata buffer for recovery lightnvm: pblk: schedule if data is not ready lightnvm: pblk: remove unused return variable lightnvm: pblk: fix double-free on pblk init lightnvm: pblk: fix bad le64 assignations nvme: Makefile: remove dead build rule blk-mq: map all HWQ also in hyperthreaded system nvmet-rdma: register ib_client to not deadlock in device removal nvme_fc: fix error recovery on link down. nvmet_fc: fix crashes on bad opcodes nvme_fc: Fix crash when nvme controller connection fails. nvme_fc: replace ioabort msleep loop with completion nvme_fc: fix double calls to nvme_cleanup_cmd() nvme-fabrics: verify that a controller returns the correct NQN nvme: simplify nvme_dev_attrs_are_visible ...
| * | | nvme: use a single NVME_AQ_DEPTH and relax it to 32Sagi Grimberg2017-06-281-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No need to differentiate fabrics from pci/loop, also lower it to 32 as we don't really need 256 inflight admin commands. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | nvme-pci: open-code polling logic in nvme_pollSagi Grimberg2017-06-281-19/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Given that the code is simple enough it seems better then passing a tag by reference for each call site, also we can now get rid of __nvme_process_cq. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | nvme-pci: factor out the cqe reading mechanics from __nvme_process_cqSagi Grimberg2017-06-281-22/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Also, maintain a consumed counter to rely on for doorbell and cqe_seen update instead of directly relying on the cq head and phase. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | nvme-pci: factor out cqe handling into a dedicated routineSagi Grimberg2017-06-281-23/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Makes the code slightly more readable. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | nvme-pci: Introduce nvme_ring_cq_doorbellSagi Grimberg2017-06-281-4/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Nice abstraction of the actual mechanics of how to do it. Note the change that we call it after we assign nvmeq->cq_head to avoid passing it. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | nvme: move reset workqueue handling to common codeChristoph Hellwig2017-06-151-34/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This moves the nvme_reset function from the PCIe driver to common code, renaming it to nvme_reset_ctrl in the process. Additionally a new helper nvme_reset_ctrl_sync is added for the case where we want to wait for the reset. To facilitate that the reset_work work structure is move to the common nvme_ctrl structure and the ->reset_ctrl method is removed. For now the drivers initialize the reset_work with their own callback, but longer term we should move to callouts for specific parts of the reset process and move even more code to the core. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
OpenPOWER on IntegriCloud