summaryrefslogtreecommitdiffstats
path: root/drivers/nvme/host/pci.c
Commit message (Collapse)AuthorAgeFilesLines
...
| * | | nvme-pci: merge init_request methodsChristoph Hellwig2017-06-151-15/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that we get the tagset passed we can have a single implementation for the I/O and admin queues. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | | nvme: move protection information check into nvme_setup_rwChristoph Hellwig2017-06-151-12/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It only applies to read/write commands, and this way non-PCIe drivers get the check as well instead of having to duplicate it when adding metadata support. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | | nvmet: use NVME_IDENTIFY_DATA_SIZEJohannes Thumshirn2017-06-151-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use NVME_IDENTIFY_DATA_SIZE define instead of hard coding the magic 4096 value. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> [hch: converted three more users] Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | | nvme-pci: remove redundant includesSagi Grimberg2017-06-151-13/+0
| | | | | | | | | | | | | | | | | | | | Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
| * | | nvme-pci: Remove watchdog timerKeith Busch2017-06-151-67/+56
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The controller status polling was added to preemptively reset a failed controller. This early detection would allow commands that would normally timeout a chance for a retry, or find broken links when the platform didn't support hotplug. This once-per-second MMIO read, however, created more problems than it solves. This often races with PCIe Hotplug events that required complicated syncing between work queues, frequently triggered PCIe Completion Timeout errors that also lead to fatal machine checks, and unnecessarily disrupts low power modes by running on idle controllers. This patch removes the watchdog timer, and instead checks controller health only on an IO timeout when we have a reason to believe something is wrong. If the controller is failed, the driver will disable immediately and request scheduling a reset. Suggested-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | | nvme-pci: remap BAR0 to cover admin CQ doorbell for large strideXu Yu2017-06-151-22/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The existing driver initially maps 8192 bytes of BAR0 which is intended to cover doorbells of admin SQ and CQ. However, if a large stride, e.g. 10, is used, the doorbell of admin CQ will be out of 8192 bytes. Consequently, a page fault will be raised when the admin CQ doorbell is accessed in nvme_configure_admin_queue(). This patch fixes this issue by remapping BAR0 before accessing admin CQ doorbell if the initial mapping is not enough. Signed-off-by: Xu Yu <yu.a.xu@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | | nvme: Move transports to use nvme-core workqueueSagi Grimberg2017-06-151-15/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of each transport using it's own workqueue, export a single nvme-core workqueue and use that instead. In the future, this will help us moving towards some unification if controller setup/teardown flows. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | | nvme-pci: implement host memory buffer supportChristoph Hellwig2017-06-151-2/+187
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a controller supports the host memory buffer we try to provide it with the requested size up to an upper cap set as a module parameter. We try to give as few as possible descriptors, eventually working our way down. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
| * | | Merge tag 'v4.12-rc5' into for-4.13/blockJens Axboe2017-06-121-7/+6
| |\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We've already got a few conflicts and upcoming work depends on some of the changes that have gone into mainline as regression fixes for this series. Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream trees to continue working on 4.13 changes. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | blk-mq: switch ->queue_rq return value to blk_status_tChristoph Hellwig2017-06-091-22/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the same values for use for request completion errors as the return value from ->queue_rq. BLK_STS_RESOURCE is special cased to cause a requeue, and all the others are completed as-is. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | block: introduce new block status code typeChristoph Hellwig2017-06-091-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we use nornal Linux errno values in the block layer, and while we accept any error a few have overloaded magic meanings. This patch instead introduces a new blk_status_t value that holds block layer specific status codes and explicitly explains their meaning. Helpers to convert from and to the previous special meanings are provided for now, but I suspect we want to get rid of them in the long run - those drivers that have a errno input (e.g. networking) usually get errnos that don't know about the special block layer overloads, and similarly returning them to userspace will usually return somethings that strictly speaking isn't correct for file system operations, but that's left as an exercise for later. For now the set of errors is a very limited set that closely corresponds to the previous overloaded errno values, but there is some low hanging fruite to improve it. blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse typechecking, so that we can easily catch places passing the wrong values. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | | nvme/pci: Fix stuck nvme resetKeith Busch2017-06-271-1/+2
| |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The controller state is set to resetting prior to disabling the controller, so this patch accounts for that state when deciding if it needs to freeze the queues. Without this, an 'nvme reset /dev/nvme0' blocks forever because the queues were never frozen. Fixes: 82b057caefaf ("nvme-pci: fix multiple ctrl removal scheduling") Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | nvme-pci: fix multiple ctrl removal schedulingRakesh Pandit2017-06-071-7/+6
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit c5f6ce97c1210 tries to address multiple resets but fails as work_busy doesn't involve any synchronization and can fail. This is reproducible easily as can be seen by WARNING below which is triggered with line: WARN_ON(dev->ctrl.state == NVME_CTRL_RESETTING) Allowing multiple resets can result in multiple controller removal as well if different conditions inside nvme_reset_work fail and which might deadlock on device_release_driver. [ 480.327007] WARNING: CPU: 3 PID: 150 at drivers/nvme/host/pci.c:1900 nvme_reset_work+0x36c/0xec0 [ 480.327008] Modules linked in: rfcomm fuse nf_conntrack_netbios_ns nf_conntrack_broadcast... [ 480.327044] btusb videobuf2_core ghash_clmulni_intel snd_hwdep cfg80211 acer_wmi hci_uart.. [ 480.327065] CPU: 3 PID: 150 Comm: kworker/u16:2 Not tainted 4.12.0-rc1+ #13 [ 480.327065] Hardware name: Acer Predator G9-591/Mustang_SLS, BIOS V1.10 03/03/2016 [ 480.327066] Workqueue: nvme nvme_reset_work [ 480.327067] task: ffff880498ad8000 task.stack: ffffc90002218000 [ 480.327068] RIP: 0010:nvme_reset_work+0x36c/0xec0 [ 480.327069] RSP: 0018:ffffc9000221bdb8 EFLAGS: 00010246 [ 480.327070] RAX: 0000000000460000 RBX: ffff880498a98128 RCX: dead000000000200 [ 480.327070] RDX: 0000000000000001 RSI: ffff8804b1028020 RDI: ffff880498a98128 [ 480.327071] RBP: ffffc9000221be50 R08: 0000000000000000 R09: 0000000000000000 [ 480.327071] R10: ffffc90001963ce8 R11: 000000000000020d R12: ffff880498a98000 [ 480.327072] R13: ffff880498a53500 R14: ffff880498a98130 R15: ffff880498a98128 [ 480.327072] FS: 0000000000000000(0000) GS:ffff8804c1cc0000(0000) knlGS:0000000000000000 [ 480.327073] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 480.327074] CR2: 00007ffcf3c37f78 CR3: 0000000001e09000 CR4: 00000000003406e0 [ 480.327074] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 480.327075] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 480.327075] Call Trace: [ 480.327079] ? __switch_to+0x227/0x400 [ 480.327081] process_one_work+0x18c/0x3a0 [ 480.327082] worker_thread+0x4e/0x3b0 [ 480.327084] kthread+0x109/0x140 [ 480.327085] ? process_one_work+0x3a0/0x3a0 [ 480.327087] ? kthread_park+0x60/0x60 [ 480.327102] ret_from_fork+0x2c/0x40 [ 480.327103] Code: e8 5a dc ff ff 85 c0 41 89 c1 0f..... This patch addresses the problem by using state of controller to decide whether reset should be queued or not as state change is synchronizated using controller spinlock. Also cancel_work_sync is used to make sure remove cancels the reset_work and waits for it to finish. This patch also changes return value from -ENODEV to more appropriate -EBUSY if nvme_reset fails to change state. Fixes: c5f6ce97c1210 ("nvme: don't schedule multiple resets") Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
* | nvme: Quirk APST on Intel 600P/P3100 devicesAndy Lutomirski2017-05-261-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | They have known firmware bugs. A fix is apparently in the works -- once fixed firmware is available, someone from Intel (Hi, Keith!) can adjust the quirk accordingly. Cc: stable@vger.kernel.org # v4.11 Cc: Kai-Heng Feng <kai.heng.feng@canonical.com> Cc: Mario Limonciello <mario_limonciello@dell.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
* | nvme: only setup block integrity if supported by the driverChristoph Hellwig2017-05-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | Currently only the PCIe driver supports metadata, so we should not claim integrity support for the other drivers. This prevents nasty crashes with targets that advertise metadata support on fabrics. Also use the opportunity to factor out some code into a separate helper that isn't even compiled if CONFIG_BLK_DEV_INTEGRITY is disabled. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com>
* | nvme-pci: consistencly use ctrl->device for loggingChristoph Hellwig2017-05-261-6/+6
| | | | | | | | | | | | | | | | This is what most of the code already does and gives much more useful prefixes than the device embedded in the pci_dev. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com>
* | Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds2017-05-201-1/+6
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: "A small collection of fixes that should go into this cycle. - a pull request from Christoph for NVMe, which ended up being manually applied to avoid pulling in newer bits in master. Mostly fibre channel fixes from James, but also a few fixes from Jon and Vijay - a pull request from Konrad, with just a single fix for xen-blkback from Gustavo. - a fuseblk bdi fix from Jan, fixing a regression in this series with the dynamic backing devices. - a blktrace fix from Shaohua, replacing sscanf() with kstrtoull(). - a request leak fix for drbd from Lars, fixing a regression in the last series with the kref changes. This will go to stable as well" * 'for-linus' of git://git.kernel.dk/linux-block: nvmet: release the sq ref on rdma read errors nvmet-fc: remove target cpu scheduling flag nvme-fc: stop queues on error detection nvme-fc: require target or discovery role for fc-nvme targets nvme-fc: correct port role bits nvme: unmap CMB and remove sysfs file in reset path blktrace: fix integer parse fuseblk: Fix warning in super_setup_bdi_name() block: xen-blkback: add null check to avoid null pointer dereference drbd: fix request leak introduced by locking/atomic, kref: Kill kref_sub()
| * nvme: unmap CMB and remove sysfs file in reset pathJon Derrick2017-05-201-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CMB doesn't get unmapped until removal while getting remapped on every reset. Add the unmapping and sysfs file removal to the reset path in nvme_pci_disable to match the mapping path in nvme_pci_enable. Fixes: 202021c1a ("nvme : Add sysfs entry for NVMe CMBs when appropriate") Signed-off-by: Jon Derrick <jonathan.derrick@intel.com> Acked-by: Keith Busch <keith.busch@intel.com> Reviewed-By: Stephen Bates <sbates@raithlin.com> Cc: <stable@vger.kernel.org> # 4.9+ Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* | Merge tag 'pci-v4.12-changes' of ↵Linus Torvalds2017-05-081-17/+13
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: - add framework for supporting PCIe devices in Endpoint mode (Kishon Vijay Abraham I) - use non-postable PCI config space mappings when possible (Lorenzo Pieralisi) - clean up and unify mmap of PCI BARs (David Woodhouse) - export and unify Function Level Reset support (Christoph Hellwig) - avoid FLR for Intel 82579 NICs (Sasha Neftin) - add pci_request_irq() and pci_free_irq() helpers (Christoph Hellwig) - short-circuit config access failures for disconnected devices (Keith Busch) - remove D3 sleep delay when possible (Adrian Hunter) - freeze PME scan before suspending devices (Lukas Wunner) - stop disabling MSI/MSI-X in pci_device_shutdown() (Prarit Bhargava) - disable boot interrupt quirk for ASUS M2N-LR (Stefan Assmann) - add arch-specific alignment control to improve device passthrough by avoiding multiple BARs in a page (Yongji Xie) - add sysfs sriov_drivers_autoprobe to control VF driver binding (Bodong Wang) - allow slots below PCI-to-PCIe "reverse bridges" (Bjorn Helgaas) - fix crashes when unbinding host controllers that don't support removal (Brian Norris) - add driver for MicroSemi Switchtec management interface (Logan Gunthorpe) - add driver for Faraday Technology FTPCI100 host bridge (Linus Walleij) - add i.MX7D support (Andrey Smirnov) - use generic MSI support for Aardvark (Thomas Petazzoni) - make Rockchip driver modular (Brian Norris) - advertise 128-byte Read Completion Boundary support for Rockchip (Shawn Lin) - advertise PCI_EXP_LNKSTA_SLC for Rockchip root port (Shawn Lin) - convert atomic_t to refcount_t in HV driver (Elena Reshetova) - add CPU IRQ affinity in HV driver (K. Y. Srinivasan) - fix PCI bus removal in HV driver (Long Li) - add support for ThunderX2 DMA alias topology (Jayachandran C) - add ThunderX pass2.x 2nd node MCFG quirk (Tomasz Nowicki) - add ITE 8893 bridge DMA alias quirk (Jarod Wilson) - restrict Cavium ACS quirk only to CN81xx/CN83xx/CN88xx devices (Manish Jaggi) * tag 'pci-v4.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (146 commits) PCI: Don't allow unbinding host controllers that aren't prepared ARM: DRA7: clockdomain: Change the CLKTRCTRL of CM_PCIE_CLKSTCTRL to SW_WKUP MAINTAINERS: Add PCI Endpoint maintainer Documentation: PCI: Add userguide for PCI endpoint test function tools: PCI: Add sample test script to invoke pcitest tools: PCI: Add a userspace tool to test PCI endpoint Documentation: misc-devices: Add Documentation for pci-endpoint-test driver misc: Add host side PCI driver for PCI test function device PCI: Add device IDs for DRA74x and DRA72x dt-bindings: PCI: dra7xx: Add DT bindings to enable unaligned access PCI: dwc: dra7xx: Workaround for errata id i870 dt-bindings: PCI: dra7xx: Add DT bindings for PCI dra7xx EP mode PCI: dwc: dra7xx: Add EP mode support PCI: dwc: dra7xx: Facilitate wrapper and MSI interrupts to be enabled independently dt-bindings: PCI: Add DT bindings for PCI designware EP mode PCI: dwc: designware: Add EP mode support Documentation: PCI: Add binding documentation for pci-test endpoint function ixgbe: Use pcie_flr() instead of duplicating it IB/hfi1: Use pcie_flr() instead of duplicating it PCI: imx6: Fix spelling mistake: "contol" -> "control" ...
| * nvme/pci: Switch to pci_request_irq()Christoph Hellwig2017-04-181-17/+13
| | | | | | | | | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Keith Busch <keith.busch@intel.com>
* | blk-mq: update ->init_request and ->exit_request prototypesChristoph Hellwig2017-05-021-8/+7
| | | | | | | | | | | | | | | | | | | | | | | | Remove the request_idx parameter, which can't be used safely now that we support I/O schedulers with blk-mq. Except for a superflous check in mtip32xx it was unused anyway. Also pass the tag_set instead of just the driver data - this allows drivers to avoid some code duplication in a follow on cleanup. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* | Merge branch 'for-4.12/block' of git://git.kernel.dk/linux-blockLinus Torvalds2017-05-011-48/+175
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block layer updates from Jens Axboe: - Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ was initially a fork of CFQ, but subsequently changed to implement fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant to be used on desktop type single drives, providing good fairness. From Paolo. - Add Kyber IO scheduler. This is a full multiqueue aware scheduler, using a scalable token based algorithm that throttles IO based on live completion IO stats, similary to blk-wbt. From Omar. - A series from Jan, moving users to separately allocated backing devices. This continues the work of separating backing device life times, solving various problems with hot removal. - A series of updates for lightnvm, mostly from Javier. Includes a 'pblk' target that exposes an open channel SSD as a physical block device. - A series of fixes and improvements for nbd from Josef. - A series from Omar, removing queue sharing between devices on mostly legacy drivers. This helps us clean up other bits, if we know that a queue only has a single device backing. This has been overdue for more than a decade. - Fixes for the blk-stats, and improvements to unify the stats and user windows. This both improves blk-wbt, and enables other users to register a need to receive IO stats for a device. From Omar. - blk-throttle improvements from Shaohua. This provides a scalable framework for implementing scalable priotization - particularly for blk-mq, but applicable to any type of block device. The interface is marked experimental for now. - Bucketized IO stats for IO polling from Stephen Bates. This improves efficiency of polled workloads in the presence of mixed block size IO. - A few fixes for opal, from Scott. - A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics. From a variety of folks, mostly Sagi and James Smart. - A series from Bart, improving our exposed info and capabilities from the blk-mq debugfs support. - A series from Christoph, cleaning up how handle WRITE_ZEROES. - A series from Christoph, cleaning up the block layer handling of how we track errors in a request. On top of being a nice cleanup, it also shrinks the size of struct request a bit. - Removal of mg_disk and hd (sorry Linus) by Christoph. The former was never used by platforms, and the latter has outlived it's usefulness. - Various little bug fixes and cleanups from a wide variety of folks. * 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits) block: hide badblocks attribute by default blk-mq: unify hctx delay_work and run_work block: add kblock_mod_delayed_work_on() blk-mq: unify hctx delayed_run_work and run_work nbd: fix use after free on module unload MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler blk-mq-sched: alloate reserved tags out of normal pool mtip32xx: use runtime tag to initialize command header scsi: Implement blk_mq_ops.show_rq() blk-mq: Add blk_mq_ops.show_rq() blk-mq: Show operation, cmd_flags and rq_flags names blk-mq: Make blk_flags_show() callers append a newline character blk-mq: Move the "state" debugfs attribute one level down blk-mq: Unregister debugfs attributes earlier blk-mq: Only unregister hctxs for which registration succeeded blk-mq-debugfs: Rename functions for registering and unregistering the mq directory blk-mq: Let blk_mq_debugfs_register() look up the queue name blk-mq: Register <dev>/queue/mq after having registered <dev>/queue ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset ide-pm: always pass 0 error to __blk_end_request_all ..
| * | nvme/pci: Poll CQ on timeoutKeith Busch2017-04-211-3/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If an IO timeout occurs, it's helpful to know if the controller did not post a completion or the driver missed an interrupt. While we never expect the latter, this patch will make it possible to tell the difference so we don't have to guess. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
| * | nvme: improve performance for virtual NVMe devicesHelen Koike2017-04-211-2/+141
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This change provides a mechanism to reduce the number of MMIO doorbell writes for the NVMe driver. When running in a virtualized environment like QEMU, the cost of an MMIO is quite hefy here. The main idea for the patch is provide the device two memory location locations: 1) to store the doorbell values so they can be lookup without the doorbell MMIO write 2) to store an event index. I believe the doorbell value is obvious, the event index not so much. Similar to the virtio specification, the virtual device can tell the driver (guest OS) not to write MMIO unless you are writing past this value. FYI: doorbell values are written by the nvme driver (guest OS) and the event index is written by the virtual device (host OS). The patch implements a new admin command that will communicate where these two memory locations reside. If the command fails, the nvme driver will work as before without any optimizations. Contributions: Eric Northup <digitaleric@google.com> Frank Swiderski <fes@google.com> Ted Tso <tytso@mit.edu> Keith Busch <keith.busch@intel.com> Just to give an idea on the performance boost with the vendor extension: Running fio [1], a stock NVMe driver I get about 200K read IOPs with my vendor patch I get about 1000K read IOPs. This was running with a null device i.e. the backing device simply returned success on every read IO request. [1] Running on a 4 core machine: fio --time_based --name=benchmark --runtime=30 --filename=/dev/nvme0n1 --nrfiles=1 --ioengine=libaio --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=randread --blocksize=4k --randrepeat=false Signed-off-by: Rob Nelson <rlnelson@google.com> [mlin: port for upstream] Signed-off-by: Ming Lin <mlin@kernel.org> [koike: updated for upstream] Signed-off-by: Helen Koike <helen.koike@collabora.co.uk> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com>
| * | nvme/pci: Don't set reserved SQ create flagsKeith Busch2017-04-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | The QPRIO field is only valid if weighted round robin arbitration is used, and this driver doesn't enable that controller configuration option. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
| * | nvme: split nvme status from block req->errorsChristoph Hellwig2017-04-201-6/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We want our own clearly defined error field for NVMe passthrough commands, and the request errors field is going away in its current form. Just store the status and result field in the nvme_request field from hardirq completion context (using a new helper) and then generate a Linux errno for the block layer only when we actually need it. Because we can't overload the status value with a negative error code for cancelled command we now have a flags filed in struct nvme_request that contains a bit for this condition. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | nvme: implement REQ_OP_WRITE_ZEROESChristoph Hellwig2017-04-081-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | But now for the real NVMe Write Zeroes yet, just to get rid of the discard abuse for zeroing. Also rename the quirk flag to be a bit more self-explanatory. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | nvme: move ->retries setup to nvme_setup_cmdChristoph Hellwig2017-04-051-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ->retries is counting the number of times a command is resubmitted, and be cleared on the first time we see the command. We currently don't do that for non-PCIe command, which is easily fixed by moving the setup to common code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | nvme: factor request completion code into a common helperChristoph Hellwig2017-04-041-27/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This avoids duplicating the logic four times, and it also allows to keep some helpers static in core.c or just opencode them. Note that this loses printing the aborted status on completions in the PCI driver as that uses a data structure not available any more. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | blk-mq: constify struct blk_mq_opsEric Biggers2017-03-311-2/+2
| |/ | | | | | | | | | | | | Constify all instances of blk_mq_ops, as they are never modified. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* / nvme: Adjust the Samsung APST quirkAndy Lutomirski2017-04-201-1/+25
|/ | | | | | | | | | | | | | | | | | | | | | | | | | I got a couple more reports: the Samsung APST issues appears to affect multiple 950-series devices in Dell XPS 15 9550 and Precision 5510 laptops. Change the quirk: rather than blacklisting the firmware on the first problematic SSD that was reported, disable APST on all 144d:a802 devices if they're installed in the two affected Dell models. While we're at it, disable only the deepest sleep state instead of all of them -- the reporters say that this is sufficient to fix the problem. (I have a device that appears to be entirely identical to one of the affected devices, but I have a different Dell laptop, so it's not the case that all Samsung devices with firmware BXW75D0Q are broken under all circumstances.) Samsung engineers have an affected system, and hopefully they'll give us a better workaround some time soon. In the mean time, this should minimize regressions. See https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184 Cc: Kai-Heng Feng <kai.heng.feng@canonical.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
* nvme: Complete all stuck requestsKeith Busch2017-03-021-5/+28
| | | | | | | | | | | | | | | | | | | | | If the nvme driver is shutting down its controller, the drievr will not start the queues up again, preventing blk-mq's hot CPU notifier from making forward progress. To fix that, this patch starts a request_queue freeze when the driver resets a controller so no new requests may enter. The driver will wait for frozen after IO queues are restarted to ensure the queue reference can be reinitialized when nvme requests to unfreeze the queues. If the driver is doing a safe shutdown, the driver will wait for the controller to successfully complete all inflight requests so that we don't unnecessarily fail them. Once the controller has been disabled, the queues will be restarted to force remaining entered requests to end in failure so that blk-mq's hot cpu notifier may progress. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
* nvme: allocate nvme_queue in correct nodeShaohua Li2017-03-021-4/+8
| | | | | | | | | nvme_queue is per-cpu queue (mostly). Allocating it in node where blk-mq will use it. Signed-off-by: Shaohua Li <shli@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
* nvme/pci: re-check security protocol support after resetScott Bauer2017-02-231-7/+10
| | | | | | | | | | | A device may change capabilities after each reset, e.g. due to a firmware upgrade. We should thus check for Security Send/Receive and OPAL support after each reset. Based on patches from Christoph and Keith. Signed-off-by: Scott Bauer <scott.bauer@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* nvme: detect NVMe controller in recent MacBooksDaniel Roschka2017-02-221-0/+1
| | | | | | | | | | | | | Adds support for detection of the NVMe controller found in the following recent MacBooks: - Retina MacBook 2016 (MacBook9,1) - 13" MacBook Pro 2016 without Touch Bar (MacBook13,1) - 13" MacBook Pro 2016 with Touch Bar (MacBook13,2) Signed-off-by: Daniel Roschka <danielroschka@phoenitydawn.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
* nvme/pci: No special case for queue busy on IOKeith Busch2017-02-221-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | This driver previously required we have a special check for IO submitted to nvme IO queues that are temporarily suspended. That is no longer necessary since blk-mq provides a quiesce, so any IO that actually gets submitted to such a queue must be ended since the queue isn't going to start back up. This is fixing a condition where we have fewer IO queues after a controller reset. This may happen if the number of CPU's has changed, or controller firmware update changed the queue count, for example. While it may be possible to complete the IO on a different queue, the block layer does not provide a way to resubmit a request on a different hardware context once the request has entered the queue. We don't want these requests to be stuck indefinitely either, so ending them in error is our only option at the moment. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
* nvme/pci: Disable on removal when disconnectedKeith Busch2017-02-221-1/+3
| | | | | | | | | | | | | If the device is not present, the driver should disable the queues immediately. Prior to this, the driver was relying on the watchdog timer to kill the queues if requests were outstanding to the device, and that just delays removal up to one second. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
* Merge branch 'for-4.11/next' into for-4.11/linus-mergeJens Axboe2017-02-171-2/+2
|\ | | | | | | Signed-off-by: Jens Axboe <axboe@fb.com>
| * block: introduce blk_rq_is_passthroughChristoph Hellwig2017-01-311-2/+2
| | | | | | | | | | | | | | | | | | This can be used to check for fs vs non-fs requests and basically removes all knowledge of BLOCK_PC specific from the block layer, as well as preparing for removing the cmd_type field in struct request. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * Merge branch 'for-4.11/block' into for-4.11/rq-refactorJens Axboe2017-01-271-0/+1
| |\ | | | | | | | | | Signed-off-by: Jens Axboe <axboe@fb.com>
* | \ Merge branch 'for-4.11/block' into for-4.11/linus-mergeJens Axboe2017-02-171-2/+13
|\ \ \ | |/ / |/| | | | | Signed-off-by: Jens Axboe <axboe@fb.com>
| * | nvme: Check for Security send/recv support before issuing commands.Scott Bauer2017-02-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to verify that the controller supports the security commands before actually trying to issue them. Signed-off-by: Scott Bauer <scott.bauer@intel.com> [hch: moved the check so that we don't call into the OPAL code if not supported] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | block/sed-opal: allocate struct opal_dev dynamicallyChristoph Hellwig2017-02-171-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Insted of bloating the containing structure with it all the time this allocates struct opal_dev dynamically. Additionally this allows moving the definition of struct opal_dev into sed-opal.c. For this a new private data field is added to it that is passed to the send/receive callback. After that a lot of internals can be made private as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Scott Bauer <scott.bauer@intel.com> Reviewed-by: Scott Bauer <scott.bauer@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | nvme: Add Support for Opal: Unlock from S3 & Opal Allocation/IoctlsScott Bauer2017-02-061-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implements the necessary logic to unlock an Opal enabled device coming back from an S3. The patch also implements the SED/Opal allocation necessary to support the opal ioctls. Signed-off-by: Scott Bauer <scott.bauer@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | nvme/pci: Don't mark IOD as aborted if abort wasn't sentKeith Busch2017-01-311-2/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | This patch sets the aborted flag only if an abort was sent, reducing excessive kernel message spamming for completed IO that wasn't actually aborted. Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
| * blk-mq-sched: allow setting of default IO schedulerJens Axboe2017-01-171-0/+1
| | | | | | | | | | | | | | | | | | | | | | Add Kconfig entries to manage what devices get assigned an MQ scheduler, and add a blk-mq flag for drivers to opt out of scheduling. The latter is useful for admin type queues that still allocate a blk-mq queue and tag set, but aren't use for normal IO. Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com>
* | nvme: use blk_rq_payload_bytesChristoph Hellwig2017-01-131-11/+8
|/ | | | | | | | | | The new blk_rq_payload_bytes generalizes the payload length hacks that nvme_map_len did before. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
* Merge branch 'nvme-4.10' of git://git.infradead.org/nvme into for-linusJens Axboe2016-12-221-10/+3
|\ | | | | | | | | | | | | | | | | | | Christoph writes: The most significant one is that we've agreed on shared maintaince and a common repository for the PCIe NVMe driver and NVMe over Fabrics. The target code still only has a subset of the maintainers but goes through the same tree as well. Keith, Sagi and me will take turns at collecting patches and sending you pull requests.
| * nvme/pci: Delete misleading queue-wrap commentKeith Busch2016-12-211-7/+0
| | | | | | | | | | | | | | | | | | | | | | | | It is not theoretically possible for this driver to wrap twice while processing completions. The driver allocates only 'queue_depth - 1' tags, so there can never be more than that to reap when processing a completion queue. Removing this misleading comment makes it a little less likely people with broken controllers will blame the driver for their spurious interrupts. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * nvme/pci: Fix whitespace problemMax Gurtovoy2016-12-211-4/+4
| | | | | | | | | | | | | | Convert to tabs and remove unneeded whitespaces. Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
OpenPOWER on IntegriCloud