| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The block layer's timeout handling currently prevents drivers from
completing commands outside the timeout callback once blk-mq decides
they've expired. If a device breaks, this could potentially create many
thousands of timed out commands. There's nothing of value to be gleaned
from observing each of those messages, so this patch adds a rate limit
on them.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| |\ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Pull NVMe changes from Christoph:
"Here is the current batch of nvme updates for 4.18, we have a few more
patches in the queue, but I'd like to get this pile into your tree
and linux-next ASAP.
The biggest item is support for file-backed namespaces in the NVMe
target from Chaitanya, in addition to that we mostly small fixes from
all the usual suspects."
* 'nvme-4.18-2' of git://git.infradead.org/nvme:
nvme: fixup memory leak in nvme_init_identify()
nvme: fix KASAN warning when parsing host nqn
nvmet-loop: use nr_phys_segments when map rq to sgl
nvmet-fc: increase LS buffer count per fc port
nvmet: add simple file backed ns support
nvmet: remove duplicate NULL initialization for req->ns
nvmet: make a few error messages more generic
nvme-fabrics: allow duplicate connections to the discovery controller
nvme-fabrics: centralize discovery controller defaults
nvme-fabrics: remove unnecessary controller subnqn validation
nvme-fc: remove setting DNR on exception conditions
nvme-rdma: stop admin queue before freeing it
nvme-pci: Fix AER reset handling
nvme-pci: set nvmeq->cq_vector after alloc cq/sq
nvme: host: core: fix precedence of ternary operator
nvme: fix lockdep warning in nvme_mpath_clear_current_path
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The nvme timeout handling doesn't do anything if the pci channel is
offline, which is the case when recovering from PCI error event, so it
was a bad idea to sync the controller reset in this state. This patch
flushes the reset work in the error_resume callback instead when the
channel is back to online. This keeps AER handling serialized and
can recover from timeouts.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199757
Fixes: cc1d5e749a2e ("nvme/pci: Sync controller reset for AER slot_reset")
Reported-by: Alex Gagniuc <mr.nuke.me@gmail.com>
Tested-by: Alex Gagniuc <mr.nuke.me@gmail.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Set cq_vector after alloc cq/sq, otherwise nvme_suspend_queue will invoke
free_irq for it and cause a 'Trying to free already-free IRQ xxx'
warning if the create CQ/SQ command times out.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
[hch: fixed to pass a s16 and clean up the comment]
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| |/ /
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
NVMe always completes the request before returning from ->timeout, either
by polling for it, or by disabling the controller. Return BLK_EH_DONE so
that the block layer doesn't even try to complete it again.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
If polling completions are racing with the IRQ triggered by a
completion, the IRQ handler will find no work and return IRQ_NONE.
This can trigger complaints about spurious interrupts:
[ 560.169153] irq 630: nobody cared (try booting with the "irqpoll" option)
[ 560.175988] CPU: 40 PID: 0 Comm: swapper/40 Not tainted 4.17.0-rc2+ #65
[ 560.175990] Hardware name: Intel Corporation S2600STB/S2600STB, BIOS SE5C620.86B.00.01.0010.010920180151 01/09/2018
[ 560.175991] Call Trace:
[ 560.175994] <IRQ>
[ 560.176005] dump_stack+0x5c/0x7b
[ 560.176010] __report_bad_irq+0x30/0xc0
[ 560.176013] note_interrupt+0x235/0x280
[ 560.176020] handle_irq_event_percpu+0x51/0x70
[ 560.176023] handle_irq_event+0x27/0x50
[ 560.176026] handle_edge_irq+0x6d/0x180
[ 560.176031] handle_irq+0xa5/0x110
[ 560.176036] do_IRQ+0x41/0xc0
[ 560.176042] common_interrupt+0xf/0xf
[ 560.176043] </IRQ>
[ 560.176050] RIP: 0010:cpuidle_enter_state+0x9b/0x2b0
[ 560.176052] RSP: 0018:ffffa0ed4659fe98 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdd
[ 560.176055] RAX: ffff9527beb20a80 RBX: 000000826caee491 RCX: 000000000000001f
[ 560.176056] RDX: 000000826caee491 RSI: 00000000335206ee RDI: 0000000000000000
[ 560.176057] RBP: 0000000000000001 R08: 00000000ffffffff R09: 0000000000000008
[ 560.176059] R10: ffffa0ed4659fe78 R11: 0000000000000001 R12: ffff9527beb29358
[ 560.176060] R13: ffffffffa235d4b8 R14: 0000000000000000 R15: 000000826caed593
[ 560.176065] ? cpuidle_enter_state+0x8b/0x2b0
[ 560.176071] do_idle+0x1f4/0x260
[ 560.176075] cpu_startup_entry+0x6f/0x80
[ 560.176080] start_secondary+0x184/0x1d0
[ 560.176085] secondary_startup_64+0xa5/0xb0
[ 560.176088] handlers:
[ 560.178387] [<00000000efb612be>] nvme_irq [nvme]
[ 560.183019] Disabling IRQ #630
A previous commit removed ->cqe_seen that was handling this case,
but we need to handle this a bit differently due to completions
now running outside the queue lock. Return IRQ_HANDLED from the
IRQ handler, if the completion ring head was moved since we last
saw it.
Fixes: 5cb525c8315f ("nvme-pci: handle completions outside of the queue lock")
Reported-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Tested-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Since we aren't sharing the lock for completions now, we don't
have to make it IRQ safe.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This is now feasible. We protect the submission queue ring with
->sq_lock, and the completion side with ->cq_lock.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Split the completion of events into a two part process:
1) Reap the events inside the queue lock
2) Complete the events outside the queue lock
Since we never wrap the queue, we can access it locklessly after we've
updated the completion queue head. This patch started off with batching
events on the stack, but with this trick we don't have to. Keith Busch
<keith.busch@intel.com> came up with that idea.
Note that this kills the ->cqe_seen as well. I haven't been able to
trigger any ill effects of this. If we do race with polling every so
often, it should be rare enough NOT to trigger any issues.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <keith.busch@intel.com>
[hch: refactored, restored poll early exit optimization]
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
We only clear it dynamically in nvme_suspend_queue(). When we do, ensure
to do a full flush so that any nvme_queue_rq() invocation will see it.
Ideally we'd kill this check completely, but we're using it to flush
requests on a dying queue.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
We always check the completion queue after submitting, but in my testing
this isn't a win even on DRAM/xpoint devices. In some cases it's
actually worse. Kill it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
We always look at the current CQ head and phase, so don't pass these
as separate arguments, and rename the function to nvme_cqe_pending.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
AER handling expects a successful return from slot_reset means the
driver made the device functional again. The nvme driver had been using
an asynchronous reset to recover the device, so the device
may still be initializing after control is returned to the
AER handler. This creates problems for subsequent event handling,
causing the initializion to fail.
This patch fixes that by syncing the controller reset before returning
to the AER driver, and reporting the true state of the reset.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199657
Reported-by: Alex Gagniuc <mr.nuke.me@gmail.com>
Cc: Sinan Kaya <okaya@codeaurora.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org
Tested-by: Alex Gagniuc <mr.nuke.me@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
It is possible the driver's remove may have freed the controller if
the remove callback is invoked prior to the async_schedule starting
the reset_work. This patch fixes that by holding a reference on the
controller.
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch schedules the initial controller reset in an async_domain
so that it can be synchronized from wait_for_device_probe(). This way
the kernel waits for the initial nvme controller scan to complete for
all devices before proceeding with the boot sequence, which may have
nvme dependencies.
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Add a new lightnvm quirk to identify CNEX’s Granby controller.
Signed-off-by: Wei Xu <wxu@cnexlabs.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Reviewed-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Keith Busch <keith.busch@intel.com>
|
| |/
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Add Seagate Nytro Flash Storage nvme drive to quirk list for
NVME_QUIRK_DELAY_BEFORE_CHK_RDY, which solves a bug where the drive is
probed on hot-add before the firmare is ready, I/O errors are generated
while reading sector 0, and linux is "unable to read partition table".
Signed-off-by: Micah Parrish <micah.parrish@hpe.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
| |
Some P3100 drives have a bug where they think WRRU (weighted round robin)
is always enabled, even though the host doesn't set it. Since they think
it's enabled, they also look at the submission queue creation priority. We
used to set that to MEDIUM by default, but that was removed in commit
81c1cd98351b. This causes various issues on that drive. Add a quirk to
still set MEDIUM priority for that controller.
Fixes: 81c1cd98351b ("nvme/pci: Don't set reserved SQ create flags")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <keith.busch@intel.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The admin and first IO queues shared the first irq vector, which has an
affinity mask including cpu0. If a system allows cpu0 to be offlined,
the admin queue may not be usable if no other CPUs in the affinity mask
are online. This is a problem since unlike IO queues, there is only
one admin queue that always needs to be usable.
To fix, this patch allocates one pre_vector for the admin queue that
is assigned all CPUs, so will always be accessible. The IO queues are
assigned the remaining managed vectors.
In case a controller has only one interrupt vector available, the admin
and IO queues will share the pre_vector with all CPUs assigned.
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
|
|
| |
All the queue memory is allocated up front. We don't take the node
into consideration when creating queues anymore, so removing the unused
parameter.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
|
|
|
|
|
|
|
|
|
| |
User reported controller always retains CSTS.RDY to 1, which fails
controller disabling when resetting the controller. This is also before
the admin queue is allocated, and trying to disable an unallocated queue
results in a NULL dereference.
Reported-by: Alex Gagniuc <Alex_Gagniuc@Dellteam.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Pull block layer updates from Jens Axboe:
"It's a pretty quiet round this time, which is nice. This contains:
- series from Bart, cleaning up the way we set/test/clear atomic
queue flags.
- series from Bart, fixing races between gendisk and queue
registration and removal.
- set of bcache fixes and improvements from various folks, by way of
Michael Lyle.
- set of lightnvm updates from Matias, most of it being the 1.2 to
2.0 transition.
- removal of unused DIO flags from Nikolay.
- blk-mq/sbitmap memory ordering fixes from Omar.
- divide-by-zero fix for BFQ from Paolo.
- minor documentation patches from Randy.
- timeout fix from Tejun.
- Alpha "can't write a char atomically" fix from Mikulas.
- set of NVMe fixes by way of Keith.
- bsg and bsg-lib improvements from Christoph.
- a few sed-opal fixes from Jonas.
- cdrom check-disk-change deadlock fix from Maurizio.
- various little fixes, comment fixes, etc from various folks"
* tag 'for-4.17/block-20180402' of git://git.kernel.dk/linux-block: (139 commits)
blk-mq: Directly schedule q->timeout_work when aborting a request
blktrace: fix comment in blktrace_api.h
lightnvm: remove function name in strings
lightnvm: pblk: remove some unnecessary NULL checks
lightnvm: pblk: don't recover unwritten lines
lightnvm: pblk: implement 2.0 support
lightnvm: pblk: implement get log report chunk
lightnvm: pblk: rename ppaf* to addrf*
lightnvm: pblk: check for supported version
lightnvm: implement get log report chunk helpers
lightnvm: make address conversions depend on generic device
lightnvm: add support for 2.0 address format
lightnvm: normalize geometry nomenclature
lightnvm: complete geo structure with maxoc*
lightnvm: add shorten OCSSD version in geo
lightnvm: add minor version to generic geometry
lightnvm: simplify geometry structure
lightnvm: pblk: refactor init/exit sequences
lightnvm: Avoid validation of default op value
lightnvm: centralize permission check for lightnvm ioctl
...
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The PCI interrupt vectors intended to be associated with a queue may
not start at 0; a driver may allocate pre_vectors for special use. This
patch adds an offset parameter so blk-mq may find the intended affinity
mask and updates all drivers using this API accordingly.
Cc: Don Brace <don.brace@microsemi.com>
Cc: <qla2xxx-upstream@qlogic.com>
Cc: <linux-scsi@vger.kernel.org>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Yet another "incompatible" Samsung NVMe SSD 960 EVO and Asus motherboard
combination. 960 EVO device disappears from PCIe bus within few minutes
after boot-up when APST is in use and never gets back. Forcing
NVME_QUIRK_NO_APST is the only way to make this drive work with this
particular motherboard. NVME_QUIRK_NO_DEEPEST_PS doesn't work, upgrading
motherboard's BIOS didn't help either.
Since this is a desktop motherboard, the only drawback of not using APST
is increased device temperature.
Signed-off-by: Jarosław Janik <jaroslaw.janik@gmail.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The nvme-fabrics exports the controller address to sysfs, and we'd
like to have parity with this feature for PCIe. This patch provides
the appropiate callback and returns the controller address as the pci
domain:bus:device.function.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Quiesce IO queues prior to disabling device HMB accesses. A controller
using HMB may relay on it to efficiently complete IO commands.
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
has switched to do irq vectors spread among all possible CPUs, so
pass num_possible_cpus() as max vecotrs to be assigned.
For example, in a 8 cores system, 0~3 online, 4~8 offline/not present,
see 'lscpu':
[ming@box]$lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 2
NUMA node(s): 2
...
NUMA node0 CPU(s): 0-3
NUMA node1 CPU(s):
...
1) before this patch, follows the allocated vectors and their affinity:
irq 47, cpu list 0,4
irq 48, cpu list 1,6
irq 49, cpu list 2,5
irq 50, cpu list 3,7
2) after this patch, follows the allocated vectors and their affinity:
irq 43, cpu list 0
irq 44, cpu list 1
irq 45, cpu list 2
irq 46, cpu list 3
irq 47, cpu list 4
irq 48, cpu list 6
irq 49, cpu list 5
irq 50, cpu list 7
Cc: Keith Busch <keith.busch@intel.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Triggering PPC EEH detection and handling requires a memory mapped read
failure. The NVMe driver removed the periodic health check MMIO, so
there's no early detection mechanism to trigger the recovery. Instead,
the detection now happens when the nvme driver handles an IO timeout
event. This takes the pci channel offline, so we do not want the driver
to proceed with escalating its own recovery efforts that may conflict
with the EEH handler.
This patch ensures the driver will observe the channel was set to offline
after a failed MMIO read and resets the IO timer so the EEH handler has
a chance to recover the device.
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
[updated change log]
Signed-off-by: Keith Busch <keith.busch@intel.com>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Pull NVMe fixes from Keith for 4.16-rc.
* 'for-jens' of git://git.infradead.org/nvme:
nvmet: fix PSDT field check in command format
nvme-multipath: fix sysfs dangerously created links
nvme-pci: Fix nvme queue cleanup if IRQ setup fails
nvmet-loop: use blk_rq_payload_bytes for sgl selection
nvme-rdma: use blk_rq_payload_bytes instead of blk_rq_bytes
nvme-fabrics: don't check for non-NULL module in nvmf_register_transport
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This patch fixes nvme queue cleanup if requesting an IRQ handler for
the queue's vector fails. It does this by resetting the cq_vector to
the uninitialized value of -1 so it is ignored for a controller reset.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
[changelog updates, removed misc whitespace changes]
Signed-off-by: Keith Busch <keith.busch@intel.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
We need to halt the controller immediately if we haven't completed
initialization as indicated by the new "connecting" state.
Fixes: ad70062cdb ("nvme-pci: introduce RECONNECTING state to mark initializing procedure")
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The controller memory buffer is remapped into a kernel address on each
reset, but the driver was setting the submission queue base address
only on the very first queue creation. The remapped address is likely to
change after a reset, so accessing the old address will hit a kernel bug.
This patch fixes that by setting the queue's CMB base address each time
the queue is created.
Fixes: f63572dff1421 ("nvme: unmap CMB and remove sysfs file in reset path")
Reported-by: Christian Black <christian.d.black@intel.com>
Cc: Jon Derrick <jonathan.derrick@intel.com>
Cc: <stable@vger.kernel.org> # 4.9+
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|/
|
|
|
|
|
|
|
| |
In pci transport, this state is used to mark the initialization
process. This should be also used in other transports as well.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Pull block updates from Jens Axboe:
"This is the main pull request for block IO related changes for the
4.16 kernel. Nothing major in this pull request, but a good amount of
improvements and fixes all over the map. This contains:
- BFQ improvements, fixes, and cleanups from Angelo, Chiara, and
Paolo.
- Support for SMR zones for deadline and mq-deadline from Damien and
Christoph.
- Set of fixes for bcache by way of Michael Lyle, including fixes
from himself, Kent, Rui, Tang, and Coly.
- Series from Matias for lightnvm with fixes from Hans Holmberg,
Javier, and Matias. Mostly centered around pblk, and the removing
rrpc 1.2 in preparation for supporting 2.0.
- A couple of NVMe pull requests from Christoph. Nothing major in
here, just fixes and cleanups, and support for command tracing from
Johannes.
- Support for blk-throttle for tracking reads and writes separately.
From Joseph Qi. A few cleanups/fixes also for blk-throttle from
Weiping.
- Series from Mike Snitzer that enables dm to register its queue more
logically, something that's alwways been problematic on dm since
it's a stacked device.
- Series from Ming cleaning up some of the bio accessor use, in
preparation for supporting multipage bvecs.
- Various fixes from Ming closing up holes around queue mapping and
quiescing.
- BSD partition fix from Richard Narron, fixing a problem where we
can't mount newer (10/11) FreeBSD partitions.
- Series from Tejun reworking blk-mq timeout handling. The previous
scheme relied on atomic bits, but it had races where we would think
a request had timed out if it to reused at the wrong time.
- null_blk now supports faking timeouts, to enable us to better
exercise and test that functionality separately. From me.
- Kill the separate atomic poll bit in the request struct. After
this, we don't use the atomic bits on blk-mq anymore at all. From
me.
- sgl_alloc/free helpers from Bart.
- Heavily contended tag case scalability improvement from me.
- Various little fixes and cleanups from Arnd, Bart, Corentin,
Douglas, Eryu, Goldwyn, and myself"
* 'for-4.16/block' of git://git.kernel.dk/linux-block: (186 commits)
block: remove smart1,2.h
nvme: add tracepoint for nvme_complete_rq
nvme: add tracepoint for nvme_setup_cmd
nvme-pci: introduce RECONNECTING state to mark initializing procedure
nvme-rdma: remove redundant boolean for inline_data
nvme: don't free uuid pointer before printing it
nvme-pci: Suspend queues after deleting them
bsg: use pr_debug instead of hand crafted macros
blk-mq-debugfs: don't allow write on attributes with seq_operations set
nvme-pci: Fix queue double allocations
block: Set BIO_TRACE_COMPLETION on new bio during split
blk-throttle: use queue_is_rq_based
block: Remove kblockd_schedule_delayed_work{,_on}()
blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays
blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()
lib/scatterlist: Fix chaining support in sgl_alloc_order()
blk-throttle: track read and write request individually
block: add bdev_read_only() checks to common helpers
block: fail op_is_write() requests to read-only partitions
blk-throttle: export io_serviced_recursive, io_service_bytes_recursive
...
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
After Sagi's commit (nvme-rdma: fix concurrent reset and reconnect),
both nvme-fc/rdma have following pattern:
RESETTING - quiesce blk-mq queues, teardown and delete queues/
connections, clear out outstanding IO requests...
RECONNECTING - establish new queues/connections and some other
initializing things.
Introduce RECONNECTING to nvme-pci transport to do the same mark.
Then we get a coherent state definition among nvme pci/rdma/fc
transports.
Suggested-by: James Smart <james.smart@broadcom.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The driver had been abusing the cq_vector state to know if new submissions
were safe, but that was before we could quiesce blk-mq. If the controller
happens to get an interrupt through while we're suspending those queues,
'no irq handler' warnings may occur.
This patch will disable the interrupts only after the queues are deleted.
Reported-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Tested-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The queue count says the highest queue that's been allocated, so don't
reallocate a queue lower than that.
Fixes: 147b27e4bd0 ("nvme-pci: allocate device queues storage space at probe")
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Define the bit positions instead of macros using the magic values,
and move the expanded helpers to calculate the size and size unit into
the implementation C file.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Refactor the call to nvme_map_cmb, and change the conditions for probing
for the CMB. First remove the version check as NVMe TPs always apply
to earlier versions of the spec as well. Second check for the whole CMBSZ
register for support of the CMB feature instead of just the size field
inside of it to simplify the code a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
fix comment typos in nvme_create_io_queues() like below.
_aount_ to _amount_
_an_ to _can_
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
It may cause race by setting 'nvmeq' in nvme_init_request()
because .init_request is called inside switching io scheduler, which
may happen when the NVMe device is being resetted and its nvme queues
are being freed and created. We don't have any sync between the two
pathes.
This patch changes the nvmeq allocation to occur at probe time so
there is no way we can dereference it at init_request.
[ 93.268391] kernel BUG at drivers/nvme/host/pci.c:408!
[ 93.274146] invalid opcode: 0000 [#1] SMP
[ 93.278618] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss
nfsv4 dns_resolver nfs lockd grace fscache sunrpc ipmi_ssif vfat fat
intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel
kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt
intel_cstate ipmi_si iTCO_vendor_support intel_uncore mxm_wmi mei_me
ipmi_devintf intel_rapl_perf pcspkr sg ipmi_msghandler lpc_ich dcdbas mei
shpchp acpi_power_meter wmi dm_multipath ip_tables xfs libcrc32c sd_mod
mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
fb_sys_fops ttm drm ahci libahci nvme libata crc32c_intel nvme_core tg3
megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
[ 93.349071] CPU: 5 PID: 1842 Comm: sh Not tainted 4.15.0-rc2.ming+ #4
[ 93.356256] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
[ 93.364801] task: 00000000fb8abf2a task.stack: 0000000028bd82d1
[ 93.371408] RIP: 0010:nvme_init_request+0x36/0x40 [nvme]
[ 93.377333] RSP: 0018:ffffc90002537ca8 EFLAGS: 00010246
[ 93.383161] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000008
[ 93.391122] RDX: 0000000000000000 RSI: ffff880276ae0000 RDI: ffff88047bae9008
[ 93.399084] RBP: ffff88047bae9008 R08: ffff88047bae9008 R09: 0000000009dabc00
[ 93.407045] R10: 0000000000000004 R11: 000000000000299c R12: ffff880186bc1f00
[ 93.415007] R13: ffff880276ae0000 R14: 0000000000000000 R15: 0000000000000071
[ 93.422969] FS: 00007f33cf288740(0000) GS:ffff88047ba80000(0000) knlGS:0000000000000000
[ 93.431996] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 93.438407] CR2: 00007f33cf28e000 CR3: 000000047e5bb006 CR4: 00000000001606e0
[ 93.446368] Call Trace:
[ 93.449103] blk_mq_alloc_rqs+0x231/0x2a0
[ 93.453579] blk_mq_sched_alloc_tags.isra.8+0x42/0x80
[ 93.459214] blk_mq_init_sched+0x7e/0x140
[ 93.463687] elevator_switch+0x5a/0x1f0
[ 93.467966] ? elevator_get.isra.17+0x52/0xc0
[ 93.472826] elv_iosched_store+0xde/0x150
[ 93.477299] queue_attr_store+0x4e/0x90
[ 93.481580] kernfs_fop_write+0xfa/0x180
[ 93.485958] __vfs_write+0x33/0x170
[ 93.489851] ? __inode_security_revalidate+0x4c/0x60
[ 93.495390] ? selinux_file_permission+0xda/0x130
[ 93.500641] ? _cond_resched+0x15/0x30
[ 93.504815] vfs_write+0xad/0x1a0
[ 93.508512] SyS_write+0x52/0xc0
[ 93.512113] do_syscall_64+0x61/0x1a0
[ 93.516199] entry_SYSCALL64_slow_path+0x25/0x25
[ 93.521351] RIP: 0033:0x7f33ce96aab0
[ 93.525337] RSP: 002b:00007ffe57570238 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 93.533785] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f33ce96aab0
[ 93.541746] RDX: 0000000000000006 RSI: 00007f33cf28e000 RDI: 0000000000000001
[ 93.549707] RBP: 00007f33cf28e000 R08: 000000000000000a R09: 00007f33cf288740
[ 93.557669] R10: 00007f33cf288740 R11: 0000000000000246 R12: 00007f33cec42400
[ 93.565630] R13: 0000000000000006 R14: 0000000000000001 R15: 0000000000000000
[ 93.573592] Code: 4c 8d 40 08 4c 39 c7 74 16 48 8b 00 48 8b 04 08 48 85 c0
74 16 48 89 86 78 01 00 00 31 c0 c3 8d 4a 01 48 63 c9 48 c1 e1 03 eb de <0f>
0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 85 f6 53 48 89
[ 93.594676] RIP: nvme_init_request+0x36/0x40 [nvme] RSP: ffffc90002537ca8
[ 93.602273] ---[ end trace 810dde3993e5f14e ]---
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| |
| |
| |
| |
| | |
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When the io queues setup or tagset allocation failed, ctrl.tagset is
NULL. But the scan work will still be queued and executed, then panic
comes up due to NULL pointer reference of ctrl.tagset.
To fix this, add a new ctrl state NVME_CTRL_ADMIN_ONLY to inidcate only
admin queue is live. When non io queues or tagset allocation failed, ctrl
enters into this state, scan work will not be started. But async event
work and nvme dev ioctl will be still available. This will be helpful to
do further investigation and recovery.
Suggested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| |
| |
| |
| |
| | |
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| |
| |
| |
| |
| |
| |
| |
| | |
The local variable __size__ will be set a bit later in a for-loop.
Remove the explicit initialization at the beginning of this function.
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Some iommu implementations can merge physically and/or virtually
contiguous segments inside sg_map_dma. The NVMe SGL support does not take
this into account and will warn because of falling off a loop. Pass the
number of mapped segments to nvme_pci_setup_sgls so that the SGL setup
can take the number of mapped segments into account.
Reported-by: Fangjian (Turing) <f.fangjian@huawei.com>
Fixes: a7a7cbe3 ("nvme-pci: add SGL support")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@rimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The driver needs to verify there is a payload with a command before
seeing if it should use SGLs to map it.
Fixes: 955b1b5a00ba ("nvme-pci: move use_sgl initialization to nvme_init_iod()")
Reported-by: Paul Menzel <pmenzel+linux-nvme@molgen.mpg.de>
Reviewed-by: Paul Menzel <pmenzel+linux-nvme@molgen.mpg.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A flag "use_sgl" of "struct nvme_iod" has been used in nvme_init_iod()
without being set to any value. It seems like "use_sgl" has been set
in either nvme_pci_setup_prps() or nvme_pci_setup_sgls() which occur
later than nvme_init_iod().
Make "iod->use_sgl" being set in a proper place, nvme_init_iod().
Also move nvme_pci_use_sgls() up above nvme_init_iod() to make it
possible to be called by nvme_init_iod().
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Following condition which will cause NULL pointer dereference will
occur in nvme_free_host_mem() when it tries to remove pci device via
nvme_remove() especially after a failure of host memory allocation for HMB.
"(host_mem_descs == NULL) && (nr_host_mem_descs != 0)"
It's because __nr_host_mem_descs__ is not cleared to 0 unlike
__host_mem_descs__ is so.
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
|
|
|
|
|
|
| |
And increase the existing delay to cover this device as well.
Cc: stable@vger.kernel.org
Signed-off-by: Jeff Lien <jeff.lien@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|