summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* sched/numa: Update NUMA hinting faults once per scanMel Gorman2013-10-093-3/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | NUMA hinting fault counts and placement decisions are both recorded in the same array which distorts the samples in an unpredictable fashion. The values linearly accumulate during the scan and then decay creating a sawtooth-like pattern in the per-node counts. It also means that placement decisions are time sensitive. At best it means that it is very difficult to state that the buffer holds a decaying average of past faulting behaviour. At worst, it can confuse the load balancer if it sees one node with an artifically high count due to very recent faulting activity and may create a bouncing effect. This patch adds a second array. numa_faults stores the historical data which is used for placement decisions. numa_faults_buffer holds the fault activity during the current scan window. When the scan completes, numa_faults decays and the values from numa_faults_buffer are copied across. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-22-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/numa: Select a preferred node with the most numa hinting faultsMel Gorman2013-10-093-2/+17
| | | | | | | | | | | | | | | This patch selects a preferred node for a task to run on based on the NUMA hinting faults. This information is later used to migrate tasks towards the node during balancing. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-21-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/numa: Track NUMA hinting faults on per-node basisMel Gorman2013-10-094-1/+27
| | | | | | | | | | | | | | | This patch tracks what nodes numa hinting faults were incurred on. This information is later used to schedule a task on the node storing the pages most frequently faulted by the task. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-20-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/numa: Slow scan rate if no NUMA hinting faults are being recordedMel Gorman2013-10-091-0/+12
| | | | | | | | | | | | | | | | NUMA PTE scanning slows if a NUMA hinting fault was trapped and no page was migrated. For long-lived but idle processes there may be no faults but the scan rate will be high and just waste CPU. This patch will slow the scan rate for processes that are not trapping faults. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-19-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/numa: Set the scan rate proportional to the memory usage of the task ↵Mel Gorman2013-10-093-17/+83
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | being scanned The NUMA PTE scan rate is controlled with a combination of the numa_balancing_scan_period_min, numa_balancing_scan_period_max and numa_balancing_scan_size. This scan rate is independent of the size of the task and as an aside it is further complicated by the fact that numa_balancing_scan_size controls how many pages are marked pte_numa and not how much virtual memory is scanned. In combination, it is almost impossible to meaningfully tune the min and max scan periods and reasoning about performance is complex when the time to complete a full scan is is partially a function of the tasks memory size. This patch alters the semantic of the min and max tunables to be about tuning the length time it takes to complete a scan of a tasks occupied virtual address space. Conceptually this is a lot easier to understand. There is a "sanity" check to ensure the scan rate is never extremely fast based on the amount of virtual memory that should be scanned in a second. The default of 2.5G seems arbitrary but it is to have the maximum scan rate after the patch roughly match the maximum scan rate before the patch was applied. On a similar note, numa_scan_period is in milliseconds and not jiffies. Properly placed pages slow the scanning rate but adding 10 jiffies to numa_scan_period means that the rate scanning slows depends on HZ which is confusing. Get rid of the jiffies_to_msec conversion and treat it as ms. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-18-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/numa: Initialise numa_next_scan properlyMel Gorman2013-10-092-2/+9
| | | | | | | | | | | | | | | Scan delay logic and resets are currently initialised to start scanning immediately instead of delaying properly. Initialise them properly at fork time and catch when a new mm has been allocated. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-17-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Revert "mm: sched: numa: Delay PTE scanning until a task is scheduled on a ↵Mel Gorman2013-10-094-34/+1
| | | | | | | | | | | | | | | | | | | | | | new node" PTE scanning and NUMA hinting fault handling is expensive so commit 5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node") deferred the PTE scan until a task had been scheduled on another node. The problem is that in the purely shared memory case that this may never happen and no NUMA hinting fault information will be captured. We are not ruling out the possibility that something better can be done here but for now, this patch needs to be reverted and depend entirely on the scan_delay to avoid punishing short-lived processes. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-16-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/numa: Continue PTE scanning even if migrate rate limitedPeter Zijlstra2013-10-091-8/+0
| | | | | | | | | | | | | | | | | | Avoiding marking PTEs pte_numa because a particular NUMA node is migrate rate limited sees like a bad idea. Even if this node can't migrate anymore other nodes might and we want up-to-date information to do balance decisions. We already rate limit the actual migrations, this should leave enough bandwidth to allow the non-migrating scanning. I think its important we keep up-to-date information if we're going to do placement based on it. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1381141781-10992-15-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/numa: Mitigate chance that same task always updates PTEsPeter Zijlstra2013-10-091-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With a trace_printk("working\n"); right after the cmpxchg in task_numa_work() we can see that of a 4 thread process, its always the same task winning the race and doing the protection change. This is a problem since the task doing the protection change has a penalty for taking faults -- it is busy when marking the PTEs. If its always the same task the ->numa_faults[] get severely skewed. Avoid this by delaying the task doing the protection change such that it is unlikely to win the privilege again. Before: root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15 thread 0/0-3232 [022] .... 212.787402: task_numa_work: working thread 0/0-3232 [022] .... 212.888473: task_numa_work: working thread 0/0-3232 [022] .... 212.989538: task_numa_work: working thread 0/0-3232 [022] .... 213.090602: task_numa_work: working thread 0/0-3232 [022] .... 213.191667: task_numa_work: working thread 0/0-3232 [022] .... 213.292734: task_numa_work: working thread 0/0-3232 [022] .... 213.393804: task_numa_work: working thread 0/0-3232 [022] .... 213.494869: task_numa_work: working thread 0/0-3232 [022] .... 213.596937: task_numa_work: working thread 0/0-3232 [022] .... 213.699000: task_numa_work: working thread 0/0-3232 [022] .... 213.801067: task_numa_work: working thread 0/0-3232 [022] .... 213.903155: task_numa_work: working thread 0/0-3232 [022] .... 214.005201: task_numa_work: working thread 0/0-3232 [022] .... 214.107266: task_numa_work: working thread 0/0-3232 [022] .... 214.209342: task_numa_work: working After: root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15 thread 0/0-3253 [005] .... 136.865051: task_numa_work: working thread 0/2-3255 [026] .... 136.965134: task_numa_work: working thread 0/3-3256 [024] .... 137.065217: task_numa_work: working thread 0/3-3256 [024] .... 137.165302: task_numa_work: working thread 0/3-3256 [024] .... 137.265382: task_numa_work: working thread 0/0-3253 [004] .... 137.366465: task_numa_work: working thread 0/2-3255 [026] .... 137.466549: task_numa_work: working thread 0/0-3253 [004] .... 137.566629: task_numa_work: working thread 0/0-3253 [004] .... 137.666711: task_numa_work: working thread 0/1-3254 [028] .... 137.766799: task_numa_work: working thread 0/0-3253 [004] .... 137.866876: task_numa_work: working thread 0/2-3255 [026] .... 137.966960: task_numa_work: working thread 0/1-3254 [028] .... 138.067041: task_numa_work: working thread 0/2-3255 [026] .... 138.167123: task_numa_work: working thread 0/3-3256 [024] .... 138.267207: task_numa_work: working Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1381141781-10992-14-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* mm: numa: Do not migrate or account for hinting faults on the zero pageMel Gorman2013-10-092-1/+10
| | | | | | | | | | | | | | | | | | | The zero page is not replicated between nodes and is often shared between processes. The data is read-only and likely to be cached in local CPUs if heavily accessed meaning that the remote memory access cost is less of a concern. This patch prevents trapping faults on the zero pages. For tasks using the zero page this will reduce the number of PTE updates, TLB flushes and hinting faults. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> [ Correct use of is_huge_zero_page] Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-13-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* mm: Only flush TLBs if a transhuge PMD is modified for NUMA pte scanningMel Gorman2013-10-092-7/+26
| | | | | | | | | | | | | | | | | NUMA PTE scanning is expensive both in terms of the scanning itself and the TLB flush if there are any updates. The TLB flush is avoided if no PTEs are updated but there is a bug where transhuge PMDs are considered to be updated even if they were already pmd_numa. This patch addresses the problem and TLB flushes should be reduced. Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-12-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* mm: Do not flush TLB during protection change if !pte_present && ↵Mel Gorman2013-10-091-1/+2
| | | | | | | | | | | | | | | | | | | !migration_entry NUMA PTE scanning is expensive both in terms of the scanning itself and the TLB flush if there are any updates. Currently non-present PTEs are accounted for as an update and incurring a TLB flush where it is only necessary for anonymous migration entries. This patch addresses the problem and should reduce TLB flushes. Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-11-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* mm: Account for a THP NUMA hinting update as one PTE updateMel Gorman2013-10-091-1/+1
| | | | | | | | | | | | | | | | A THP PMD update is accounted for as 512 pages updated in vmstat. This is large difference when estimating the cost of automatic NUMA balancing and can be misleading when comparing results that had collapsed versus split THP. This patch addresses the accounting issue. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-10-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* mm: Close races between THP migration and PMD numa clearingMel Gorman2013-10-092-26/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | THP migration uses the page lock to guard against parallel allocations but there are cases like this still open Task A Task B --------------------- --------------------- do_huge_pmd_numa_page do_huge_pmd_numa_page lock_page mpol_misplaced == -1 unlock_page goto clear_pmdnuma lock_page mpol_misplaced == 2 migrate_misplaced_transhuge pmd = pmd_mknonnuma set_pmd_at During hours of testing, one crashed with weird errors and while I have no direct evidence, I suspect something like the race above happened. This patch extends the page lock to being held until the pmd_numa is cleared to prevent migration starting in parallel while the pmd_numa is being cleared. It also flushes the old pmd entry and orders pagetable insertion before rmap insertion. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-9-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* mm: numa: Sanitize task_numa_fault() callsitesMel Gorman2013-10-092-44/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are three callers of task_numa_fault(): - do_huge_pmd_numa_page(): Accounts against the current node, not the node where the page resides, unless we migrated, in which case it accounts against the node we migrated to. - do_numa_page(): Accounts against the current node, not the node where the page resides, unless we migrated, in which case it accounts against the node we migrated to. - do_pmd_numa_page(): Accounts not at all when the page isn't migrated, otherwise accounts against the node we migrated towards. This seems wrong to me; all three sites should have the same sementaics, furthermore we should accounts against where the page really is, we already know where the task is. So modify all three sites to always account; we did after all receive the fault; and always account to where the page is after migration, regardless of success. They all still differ on when they clear the PTE/PMD; ideally that would get sorted too. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-8-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* mm: Prevent parallel splits during THP migrationMel Gorman2013-10-091-14/+30
| | | | | | | | | | | | | | | | | THP migrations are serialised by the page lock but on its own that does not prevent THP splits. If the page is split during THP migration then the pmd_same checks will prevent page table corruption but the unlock page and other fix-ups potentially will cause corruption. This patch takes the anon_vma lock to prevent parallel splits during migration. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-7-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* mm: Wait for THP migrations to complete during NUMA hinting faultsMel Gorman2013-10-091-7/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | The locking for migrating THP is unusual. While normal page migration prevents parallel accesses using a migration PTE, THP migration relies on a combination of the page_table_lock, the page lock and the existance of the NUMA hinting PTE to guarantee safety but there is a bug in the scheme. If a THP page is currently being migrated and another thread traps a fault on the same page it checks if the page is misplaced. If it is not, then pmd_numa is cleared. The problem is that it checks if the page is misplaced without holding the page lock meaning that the racing thread can be migrating the THP when the second thread clears the NUMA bit and faults a stale page. This patch checks if the page is potentially being migrated and stalls using the lock_page if it is potentially being migrated before checking if the page is misplaced or not. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-6-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* mm: numa: Do not account for a hinting fault if we racedMel Gorman2013-10-091-1/+4
| | | | | | | | | | | | | | If another task handled a hinting fault in parallel then do not double account for it. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-5-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* sched/numa: Fix commentsPeter Zijlstra2013-10-092-5/+5
| | | | | | | | | | | | | Fix a 80 column violation and a PTE vs PMD reference. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1381141781-10992-4-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* mm: numa: Document automatic NUMA balancing sysctlsMel Gorman2013-10-091-0/+66
| | | | | | | | | | | Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-3-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Merge tag 'v3.12-rc4' into sched/coreIngo Molnar2013-10-09893-4829/+8754
|\ | | | | | | | | | | | | | | | | | | Merge Linux v3.12-rc4 to fix a conflict and also to refresh the tree before applying more scheduler patches. Conflicts: arch/avr32/include/asm/Kbuild Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * Linux 3.12-rc4v3.12-rc4Linus Torvalds2013-10-061-1/+1
| |
| * net: Update the sysctl permissions handler to test effective uid/gidEric W. Biederman2013-10-061-2/+2
| | | | | | | | | | | | | | | | | | | | | | Modify the code to use current_euid(), and in_egroup_p, as in done in fs/proc/proc_sysctl.c:test_perm() Cc: stable@vger.kernel.org Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reported-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pendingLinus Torvalds2013-10-068-25/+67
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull SCSI target fixes from Nicholas Bellinger: "Here are the outstanding target fixes queued up for v3.12-rc4 code. The highlights include: - Make vhost/scsi tag percpu_ida_alloc() use GFP_ATOMIC - Allow sess_cmd_map allocation failure fallback to use vzalloc - Fix COMPARE_AND_WRITE se_cmd->data_length bug with FILEIO backends - Fixes for COMPARE_AND_WRITE callback recursive failure OOPs + non zero scsi_status bug - Make iscsi-target do acknowledgement tag release from RX context - Setup iscsi-target with extra (cmdsn_depth / 2) percpu_ida tags Also included is a iscsi-target patch CC'ed for v3.10+ that avoids legacy wait_for_task=true release during fast-past StatSN acknowledgement, and two other SRP target related patches that address long-standing issues that are CC'ed for v3.3+. Extra thanks to Thomas Glanzmann for his testing feedback with COMPARE_AND_WRITE + EXTENDED_COPY VAAI logic" * git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: iscsi-target; Allow an extra tag_num / 2 number of percpu_ida tags iscsi-target: Perform release of acknowledged tags from RX context iscsi-target: Only perform wait_for_tasks when performing shutdown target: Fail on non zero scsi_status in compare_and_write_callback target: Fix recursive COMPARE_AND_WRITE callback failure target: Reset data_length for COMPARE_AND_WRITE to NoLB * block_size ib_srpt: always set response for task management target: Fall back to vzalloc upon ->sess_cmd_map kzalloc failure vhost/scsi: Use GFP_ATOMIC with percpu_ida_alloc for obtaining tag ib_srpt: Destroy cm_id before destroying QP. target: Fix xop->dbl assignment in target_xcopy_parse_segdesc_02
| | * iscsi-target; Allow an extra tag_num / 2 number of percpu_ida tagsNicholas Bellinger2013-10-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch bumps the default number of tags allocated per session by iscsi-target via transport_alloc_session_tags() -> percpu_ida_init() by another (tag_num / 2). This is done to take into account the tags waiting to be acknowledged and released in iscsit_ack_from_expstatsn(), but who's number are not directly limited by the CmdSN Window queue_depth being enforced by the target. Using a larger value here is also useful to prevent percpu_ida_alloc() from having to steal tags from other CPUs when no tags are available on the local CPU, while waiting for unacknowledged tags to be released. Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
| | * iscsi-target: Perform release of acknowledged tags from RX contextNicholas Bellinger2013-10-031-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch converts iscsit_ack_from_expstatsn() to populate a local ack_list of commands, and call iscsit_free_cmd() directly from RX thread context, instead of using iscsit_add_cmd_to_immediate_queue() to queue the acknowledged commands to be released from TX thread context. It is helpful to release the acknowledge commands as quickly as possible, along with the associated percpu_ida tags, in order to prevent percpu_ida_alloc() from having to steal tags from other CPUs while waiting for iscsit_free_cmd() to happen from TX thread context. Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
| | * iscsi-target: Only perform wait_for_tasks when performing shutdownNicholas Bellinger2013-10-031-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch changes transport_generic_free_cmd() to only wait_for_tasks when shutdown=true is passed to iscsit_free_cmd(). With the advent of >= v3.10 iscsi-target code using se_cmd->cmd_kref, the extra wait_for_tasks with shutdown=false is unnecessary, and may end up causing an extra context switch when releasing WRITEs. Cc: stable@vger.kernel.org # 3.10+ Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
| | * target: Fail on non zero scsi_status in compare_and_write_callbackNicholas Bellinger2013-10-031-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch addresses a bug for backends such as IBLOCK that perform asynchronous completion via transport_complete_cmd(), that will call target_complete_failure_work() -> transport_generic_request_failure(), upon exception status and invoke cmd->transport_complete_callback() -> compare_and_write_callback() incorrectly during the failure case. It adds a check for a non zero se_cmd->scsi_status within the first invocation of compare_and_write_callback(), and will jump to out plus up se_device->caw_sem before exiting the callback. Reported-by: Thomas Glanzmann <thomas@glanzmann.de> Tested-by: Thomas Glanzmann <thomas@glanzmann.de> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
| | * target: Fix recursive COMPARE_AND_WRITE callback failureNicholas Bellinger2013-10-031-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch addresses a bug when compare_and_write_callback() invoked from target_complete_ok_work() hits an failure from __target_execute_cmd() -> cmd->execute_cmd(), that ends up calling transport_generic_request_failure() -> compare_and_write_post(), thus causing SCF_COMPARE_AND_WRITE_POST to incorrectly be set. The result of this bug is that target_complete_ok_work() no longer hits the if (!rc && !(cmd->se_cmd_flags & SCF_COMPARE_AND_WRITE_POST) check that forces an immediate return, and instead double completes the se_cmd in question, triggering an OOPs in the process. This patch changes compare_and_write_post() to only set this bit when a failure has not already occured to ensure the immediate return from within target_complete_ok_work(), and thus allow transport_generic_request_failure() to handle the sending of the CHECK_CONDITION exception status. Reported-by: Thomas Glanzmann <thomas@glanzmann.de> Tested-by: Thomas Glanzmann <thomas@glanzmann.de> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
| | * target: Reset data_length for COMPARE_AND_WRITE to NoLB * block_sizeNicholas Bellinger2013-10-031-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch resets se_cmd->data_length for COMPARE_AND_WRITE emulation within sbc_compare_and_write() to NoLB * block_size in order to address a bug with FILEIO backends where a I/O failure will occur when data_length does not match the I/O size being actually dispatched for the individual per block READs + WRITEs. This is done late enough in sbc_compare_and_write() after the memory allocations have occured in transport_generic_new_cmd() to not cause any unwanted side-effects. Reported-by: Thomas Glanzmann <thomas@glanzmann.de> Tested-by: Thomas Glanzmann <thomas@glanzmann.de> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
| | * ib_srpt: always set response for task managementJack Wang2013-10-031-6/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The SRP specification requires: "Response data shall be provided in any SRP_RSP response that is sent in response to an SRP_TSK_MGMT request (see 6.7). The information in the RSP_CODE field (see table 24) shall indicate the completion status of the task management function." So fix this to avoid the SRP initiator interprets task management functions that succeeded as failed. Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> Cc: stable@vger.kernel.org # 3.3+ Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
| | * target: Fall back to vzalloc upon ->sess_cmd_map kzalloc failureNicholas Bellinger2013-10-011-5/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch changes transport_alloc_session_tags() to fall back to use vzalloc when kzalloc fails for big tag_num that end up generating larger order allocations. Also use is_vmalloc_addr() in transport_alloc_session_tags() failure path, and normal transport_free_session() path to determine when vfree() needs to be called instead of kfree(). v2 changes: - Use __GFP_NOWARN | __GFP_REPEAT for sess_cmd_map kzalloc (mst) Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Asias He <asias@redhat.com> Cc: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
| | * vhost/scsi: Use GFP_ATOMIC with percpu_ida_alloc for obtaining tagNicholas Bellinger2013-10-011-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix GFP_KERNEL -> GFP_ATOMIC usage of percpu_ida_alloc() within vhost_scsi_get_tag(), as this code is expected to be called directly from interrupt context. v2 changes: - Handle possible tag < 0 failure with GFP_ATOMIC Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Asias He <asias@redhat.com> Cc: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
| | * ib_srpt: Destroy cm_id before destroying QP.Nicholas Bellinger2013-10-011-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes a bug where ib_destroy_cm_id() was incorrectly being called after srpt_destroy_ch_ib() had destroyed the active QP. This would result in the following failed SRP_LOGIN_REQ messages: Received SRP_LOGIN_REQ with i_port_id 0x0:0x2590ffff1762bd, t_port_id 0x2c903009f8f40:0x2c903009f8f40 and it_iu_len 260 on port 1 (guid=0xfe80000000000000:0x2c903009f8f41) Received SRP_LOGIN_REQ with i_port_id 0x0:0x2590ffff1758f9, t_port_id 0x2c903009f8f40:0x2c903009f8f40 and it_iu_len 260 on port 2 (guid=0xfe80000000000000:0x2c903009f8f42) Received SRP_LOGIN_REQ with i_port_id 0x0:0x2590ffff175941, t_port_id 0x2c903009f8f40:0x2c903009f8f40 and it_iu_len 260 on port 2 (guid=0xfe80000000000000:0x2c90300a3cfb2) Received SRP_LOGIN_REQ with i_port_id 0x0:0x2590ffff176299, t_port_id 0x2c903009f8f40:0x2c903009f8f40 and it_iu_len 260 on port 1 (guid=0xfe80000000000000:0x2c90300a3cfb1) mlx4_core 0000:84:00.0: command 0x19 failed: fw status = 0x9 rejected SRP_LOGIN_REQ because creating a new RDMA channel failed. Received SRP_LOGIN_REQ with i_port_id 0x0:0x2590ffff176299, t_port_id 0x2c903009f8f40:0x2c903009f8f40 and it_iu_len 260 on port 1 (guid=0xfe80000000000000:0x2c90300a3cfb1) mlx4_core 0000:84:00.0: command 0x19 failed: fw status = 0x9 rejected SRP_LOGIN_REQ because creating a new RDMA channel failed. Received SRP_LOGIN_REQ with i_port_id 0x0:0x2590ffff176299, t_port_id 0x2c903009f8f40:0x2c903009f8f40 and it_iu_len 260 on port 1 (guid=0xfe80000000000000:0x2c90300a3cfb1) Reported-by: Navin Ahuja <navin.ahuja@saratoga-speed.com> Cc: stable@vger.kernel.org # 3.3+ Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
| | * target: Fix xop->dbl assignment in target_xcopy_parse_segdesc_02Nicholas Bellinger2013-10-011-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes up an incorrect assignment for xop->dbl within target_xcopy_parse_segdesc_02() code, as reported by Coverity here: http://marc.info/?l=linux-kernel&m=137936416618490&w=2 Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
| * | Merge branch 'fixes' of git://git.infradead.org/users/vkoul/slave-dmaLinus Torvalds2013-10-063-17/+17
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull slave-dmaengine fixes from Vinod Koul: "Here is the slave dmanegine fixes. We have the fix for deadlock issue on imx-dma by Michael and Josh's edma config fix along with author change" * 'fixes' of git://git.infradead.org/users/vkoul/slave-dma: dmaengine: imx-dma: fix callback path in tasklet dmaengine: imx-dma: fix lockdep issue between irqhandler and tasklet dmaengine: imx-dma: fix slow path issue in prep_dma_cyclic dma/Kconfig: Make TI_EDMA select TI_PRIV_EDMA edma: Update author email address
| | * | dmaengine: imx-dma: fix callback path in taskletMichael Grzeschik2013-10-041-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to free the ld_active list head before jumping into the callback routine. Otherwise the callback could run into issue_pending and change our ld_active list head we just going to free. This will run the channel list into an currupted and undefined state. Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de> Signed-off-by: Vinod Koul <vinod.koul@intel.com>
| | * | dmaengine: imx-dma: fix lockdep issue between irqhandler and taskletMichael Grzeschik2013-10-041-11/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The tasklet and irqhandler are using spin_lock while other routines are using spin_lock_irqsave/restore. This leads to lockdep issues as described bellow. This patch is changing the code to use spinlock_irq_save/restore in both code pathes. As imxdma_xfer_desc always gets called with spin_lock_irqsave lock held, this patch also removes the spare call inside the routine to avoid double locking. [ 403.358162] ================================= [ 403.362549] [ INFO: inconsistent lock state ] [ 403.366945] 3.10.0-20130823+ #904 Not tainted [ 403.371331] --------------------------------- [ 403.375721] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. [ 403.381769] swapper/0 [HC0[0]:SC1[1]:HE1:SE0] takes: [ 403.386762] (&(&imxdma->lock)->rlock){?.-...}, at: [<c019d77c>] imxdma_tasklet+0x20/0x134 [ 403.395201] {IN-HARDIRQ-W} state was registered at: [ 403.400108] [<c004b264>] mark_lock+0x2a0/0x6b4 [ 403.404798] [<c004d7c8>] __lock_acquire+0x650/0x1a64 [ 403.410004] [<c004f15c>] lock_acquire+0x94/0xa8 [ 403.414773] [<c02f74e4>] _raw_spin_lock+0x54/0x8c [ 403.419720] [<c019d094>] dma_irq_handler+0x78/0x254 [ 403.424845] [<c0061124>] handle_irq_event_percpu+0x38/0x1b4 [ 403.430670] [<c00612e4>] handle_irq_event+0x44/0x64 [ 403.435789] [<c0063a70>] handle_level_irq+0xd8/0xf0 [ 403.440903] [<c0060a20>] generic_handle_irq+0x28/0x38 [ 403.446194] [<c0009cc4>] handle_IRQ+0x68/0x8c [ 403.450789] [<c0008714>] avic_handle_irq+0x3c/0x48 [ 403.455811] [<c0008f84>] __irq_svc+0x44/0x74 [ 403.460314] [<c0040b04>] cpu_startup_entry+0x88/0xf4 [ 403.465525] [<c02f00d0>] rest_init+0xb8/0xe0 [ 403.470045] [<c03e07dc>] start_kernel+0x28c/0x2d4 [ 403.474986] [<a0008040>] 0xa0008040 [ 403.478709] irq event stamp: 50854 [ 403.482140] hardirqs last enabled at (50854): [<c001c6b8>] tasklet_action+0x38/0xdc [ 403.489954] hardirqs last disabled at (50853): [<c001c6a0>] tasklet_action+0x20/0xdc [ 403.497761] softirqs last enabled at (50850): [<c001bc64>] _local_bh_enable+0x14/0x18 [ 403.505741] softirqs last disabled at (50851): [<c001c268>] irq_exit+0x88/0xdc [ 403.513026] [ 403.513026] other info that might help us debug this: [ 403.519593] Possible unsafe locking scenario: [ 403.519593] [ 403.525548] CPU0 [ 403.528020] ---- [ 403.530491] lock(&(&imxdma->lock)->rlock); [ 403.534828] <Interrupt> [ 403.537474] lock(&(&imxdma->lock)->rlock); [ 403.541983] [ 403.541983] *** DEADLOCK *** [ 403.541983] [ 403.547951] no locks held by swapper/0. [ 403.551813] [ 403.551813] stack backtrace: [ 403.556222] CPU: 0 PID: 0 Comm: swapper Not tainted 3.10.0-20130823+ #904 [ 403.563039] Backtrace: [ 403.565581] [<c000b98c>] (dump_backtrace+0x0/0x10c) from [<c000bb28>] (show_stack+0x18/0x1c) [ 403.574054] r6:00000000 r5:c05c51d8 r4:c040bd58 r3:00200000 [ 403.579872] [<c000bb10>] (show_stack+0x0/0x1c) from [<c02f398c>] (dump_stack+0x20/0x28) [ 403.587955] [<c02f396c>] (dump_stack+0x0/0x28) from [<c02f29c8>] (print_usage_bug.part.28+0x224/0x28c) [ 403.597340] [<c02f27a4>] (print_usage_bug.part.28+0x0/0x28c) from [<c004b404>] (mark_lock+0x440/0x6b4) [ 403.606682] r8:c004a41c r7:00000000 r6:c040bd58 r5:c040c040 r4:00000002 [ 403.613566] [<c004afc4>] (mark_lock+0x0/0x6b4) from [<c004d844>] (__lock_acquire+0x6cc/0x1a64) [ 403.622244] [<c004d178>] (__lock_acquire+0x0/0x1a64) from [<c004f15c>] (lock_acquire+0x94/0xa8) [ 403.631010] [<c004f0c8>] (lock_acquire+0x0/0xa8) from [<c02f74e4>] (_raw_spin_lock+0x54/0x8c) [ 403.639614] [<c02f7490>] (_raw_spin_lock+0x0/0x8c) from [<c019d77c>] (imxdma_tasklet+0x20/0x134) [ 403.648434] r6:c3847010 r5:c040e890 r4:c38470d4 [ 403.653194] [<c019d75c>] (imxdma_tasklet+0x0/0x134) from [<c001c70c>] (tasklet_action+0x8c/0xdc) [ 403.662013] r8:c0599160 r7:00000000 r6:00000000 r5:c040e890 r4:c3847114 r3:c019d75c [ 403.670042] [<c001c680>] (tasklet_action+0x0/0xdc) from [<c001bd4c>] (__do_softirq+0xe4/0x1f0) [ 403.678687] r7:00000101 r6:c0402000 r5:c059919c r4:00000001 [ 403.684498] [<c001bc68>] (__do_softirq+0x0/0x1f0) from [<c001c268>] (irq_exit+0x88/0xdc) [ 403.692652] [<c001c1e0>] (irq_exit+0x0/0xdc) from [<c0009cc8>] (handle_IRQ+0x6c/0x8c) [ 403.700514] r4:00000030 r3:00000110 [ 403.704192] [<c0009c5c>] (handle_IRQ+0x0/0x8c) from [<c0008714>] (avic_handle_irq+0x3c/0x48) [ 403.712664] r5:c0403f28 r4:c0593ebc [ 403.716343] [<c00086d8>] (avic_handle_irq+0x0/0x48) from [<c0008f84>] (__irq_svc+0x44/0x74) [ 403.724733] Exception stack(0xc0403f28 to 0xc0403f70) [ 403.729841] 3f20: 00000001 00000004 00000000 20000013 c0402000 c04104a8 [ 403.738078] 3f40: 00000002 c0b69620 a0004000 41069264 a03fb5f4 c0403f7c c0403f40 c0403f70 [ 403.746301] 3f60: c004b92c c0009e74 20000013 ffffffff [ 403.751383] r6:ffffffff r5:20000013 r4:c0009e74 r3:c004b92c [ 403.757210] [<c0009e30>] (arch_cpu_idle+0x0/0x4c) from [<c0040b04>] (cpu_startup_entry+0x88/0xf4) [ 403.766161] [<c0040a7c>] (cpu_startup_entry+0x0/0xf4) from [<c02f00d0>] (rest_init+0xb8/0xe0) [ 403.774753] [<c02f0018>] (rest_init+0x0/0xe0) from [<c03e07dc>] (start_kernel+0x28c/0x2d4) [ 403.783051] r6:c03fc484 r5:ffffffff r4:c040a0e0 [ 403.787797] [<c03e0550>] (start_kernel+0x0/0x2d4) from [<a0008040>] (0xa0008040) Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de> Signed-off-by: Vinod Koul <vinod.koul@intel.com>
| | * | dmaengine: imx-dma: fix slow path issue in prep_dma_cyclicMichael Grzeschik2013-10-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When perparing cyclic_dma buffers by the sound layer, it will dump the following lockdep trace. The leading snd_pcm_action_single get called with read_lock_irq called. To fix this, we change the kcalloc call from GFP_KERNEL to GFP_ATOMIC. WARNING: at kernel/lockdep.c:2740 lockdep_trace_alloc+0xcc/0x114() DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)) Modules linked in: CPU: 0 PID: 832 Comm: aplay Not tainted 3.11.0-20130823+ #903 Backtrace: [<c000b98c>] (dump_backtrace+0x0/0x10c) from [<c000bb28>] (show_stack+0x18/0x1c) r6:c004c090 r5:00000009 r4:c2e0bd18 r3:00404000 [<c000bb10>] (show_stack+0x0/0x1c) from [<c02f397c>] (dump_stack+0x20/0x28) [<c02f395c>] (dump_stack+0x0/0x28) from [<c001531c>] (warn_slowpath_common+0x54/0x70) [<c00152c8>] (warn_slowpath_common+0x0/0x70) from [<c00153dc>] (warn_slowpath_fmt+0x38/0x40) r8:00004000 r7:a3b90000 r6:000080d0 r5:60000093 r4:c2e0a000 r3:00000009 [<c00153a4>] (warn_slowpath_fmt+0x0/0x40) from [<c004c090>] (lockdep_trace_alloc+0xcc/0x114) r3:c03955d8 r2:c03907db [<c004bfc4>] (lockdep_trace_alloc+0x0/0x114) from [<c008f16c>] (__kmalloc+0x34/0x118) r6:000080d0 r5:c3800120 r4:000080d0 r3:c040a0f8 [<c008f138>] (__kmalloc+0x0/0x118) from [<c019c95c>] (imxdma_prep_dma_cyclic+0x64/0x168) r7:a3b90000 r6:00000004 r5:c39d8420 r4:c3847150 [<c019c8f8>] (imxdma_prep_dma_cyclic+0x0/0x168) from [<c024618c>] (snd_dmaengine_pcm_trigger+0xa8/0x160) [<c02460e4>] (snd_dmaengine_pcm_trigger+0x0/0x160) from [<c0241fa8>] (soc_pcm_trigger+0x90/0xb4) r8:c058c7b0 r7:c3b8140c r6:c39da560 r5:00000001 r4:c3b81000 [<c0241f18>] (soc_pcm_trigger+0x0/0xb4) from [<c022ece4>] (snd_pcm_do_start+0x2c/0x38) r7:00000000 r6:00000003 r5:c058c7b0 r4:c3b81000 [<c022ecb8>] (snd_pcm_do_start+0x0/0x38) from [<c022e958>] (snd_pcm_action_single+0x40/0x6c) [<c022e918>] (snd_pcm_action_single+0x0/0x6c) from [<c022ea64>] (snd_pcm_action_lock_irq+0x7c/0x9c) r7:00000003 r6:c3b810f0 r5:c3b810f0 r4:c3b81000 [<c022e9e8>] (snd_pcm_action_lock_irq+0x0/0x9c) from [<c023009c>] (snd_pcm_common_ioctl1+0x7f8/0xfd0) r8:c3b7f888 r7:005407b8 r6:c2c991c0 r5:c3b81000 r4:c3b81000 r3:00004142 [<c022f8a4>] (snd_pcm_common_ioctl1+0x0/0xfd0) from [<c023117c>] (snd_pcm_playback_ioctl1+0x464/0x488) [<c0230d18>] (snd_pcm_playback_ioctl1+0x0/0x488) from [<c02311d4>] (snd_pcm_playback_ioctl+0x34/0x40) r8:c3b7f888 r7:00004142 r6:00000004 r5:c2c991c0 r4:005407b8 [<c02311a0>] (snd_pcm_playback_ioctl+0x0/0x40) from [<c00a14a4>] (vfs_ioctl+0x30/0x44) [<c00a1474>] (vfs_ioctl+0x0/0x44) from [<c00a1fe8>] (do_vfs_ioctl+0x55c/0x5c0) [<c00a1a8c>] (do_vfs_ioctl+0x0/0x5c0) from [<c00a208c>] (SyS_ioctl+0x40/0x68) [<c00a204c>] (SyS_ioctl+0x0/0x68) from [<c0009380>] (ret_fast_syscall+0x0/0x44) r8:c0009544 r7:00000036 r6:bedeaa58 r5:00000000 r4:000000c0 Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de> Signed-off-by: Vinod Koul <vinod.koul@intel.com>
| | * | dma/Kconfig: Make TI_EDMA select TI_PRIV_EDMAJosh Boyer2013-09-171-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The TI_EDMA driver unconditionally calls functions provided by the TI_PRIV_EDMA code, but it doesn't force that to be built-in. If that isn't otherwise enabled somewhere, you can get build errors like: linux-3.12.0-0.rc0.git1.1.fc21.armv7hl/drivers/dma/edma.c:593: undefined reference to `edma_free_slot' drivers/built-in.o: In function `edma_terminate_all': linux-3.12.0-0.rc0.git1.1.fc21.armv7hl/drivers/dma/edma.c:169: undefined reference to `edma_stop' drivers/built-in.o: In function `edma_execute': linux-3.12.0-0.rc0.git1.1.fc21.armv7hl/drivers/dma/edma.c:122: undefined reference to `edma_write_slot' linux-3.12.0-0.rc0.git1.1.fc21.armv7hl/drivers/dma/edma.c:149: undefined reference to `edma_link' linux-3.12.0-0.rc0.git1.1.fc21.armv7hl/drivers/dma/edma.c:152: undefined reference to `edma_start' Make TI_EDMA select TI_PRIV_EDMA to avoid this. Signed-off-by: Josh Boyer <jwboyer@fedoraproject.org> Acked-by: Matt Porter <matt.porter@linaro.org> Signed-off-by: Vinod Koul <vinod.koul@intel.com>
| | * | edma: Update author email addressJosh Boyer2013-09-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Matt's @ti.com address bounces. Update the MODULE_AUTHOR information in edma.c to his Linaro address. Signed-off-by: Josh Boyer <jwboyer@fedoraproject.org> Acked-by: Matt Porter <matt.porter@linaro.org> Signed-off-by: Vinod Koul <vinod.koul@intel.com>
| * | | Merge branch 'for-linus' of ↵Linus Torvalds2013-10-056-17/+39
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This is a small collection of fixes, including a regression fix from Liu Bo that solves rare crashes with compression on. I've merged my for-linus up to 3.12-rc3 because the top commit is only meant for 3.12. The rest of the fixes are also available in my master branch on top of my last 3.11 based pull" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: Fix crash due to not allocating integrity data for a bioset Btrfs: fix a use-after-free bug in btrfs_dev_replace_finishing Btrfs: eliminate races in worker stopping code Btrfs: fix crash of compressed writes Btrfs: fix transid verify errors when recovering log tree
| | * | | btrfs: Fix crash due to not allocating integrity data for a biosetDarrick J. Wong2013-10-051-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When btrfs creates a bioset, we must also allocate the integrity data pool. Otherwise btrfs will crash when it tries to submit a bio to a checksumming disk: BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 IP: [<ffffffff8111e28a>] mempool_alloc+0x4a/0x150 PGD 2305e4067 PUD 23063d067 PMD 0 Oops: 0000 [#1] PREEMPT SMP Modules linked in: btrfs scsi_debug xfs ext4 jbd2 ext3 jbd mbcache sch_fq_codel eeprom lpc_ich mfd_core nfsd exportfs auth_rpcgss af_packet raid6_pq xor zlib_deflate libcrc32c [last unloaded: scsi_debug] CPU: 1 PID: 4486 Comm: mount Not tainted 3.12.0-rc1-mcsum #2 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff8802451c9720 ti: ffff880230698000 task.ti: ffff880230698000 RIP: 0010:[<ffffffff8111e28a>] [<ffffffff8111e28a>] mempool_alloc+0x4a/0x150 RSP: 0018:ffff880230699688 EFLAGS: 00010286 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 00000000005f8445 RDX: 0000000000000001 RSI: 0000000000000010 RDI: 0000000000000000 RBP: ffff8802306996f8 R08: 0000000000011200 R09: 0000000000000008 R10: 0000000000000020 R11: ffff88009d6e8000 R12: 0000000000011210 R13: 0000000000000030 R14: ffff8802306996b8 R15: ffff8802451c9720 FS: 00007f25b8a16800(0000) GS:ffff88024fc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000018 CR3: 0000000230576000 CR4: 00000000000007e0 Stack: ffff8802451c9720 0000000000000002 ffffffff81a97100 0000000000281250 ffffffff81a96480 ffff88024fc99150 ffff880228d18200 0000000000000000 0000000000000000 0000000000000040 ffff880230e8c2e8 ffff8802459dc900 Call Trace: [<ffffffff811b2208>] bio_integrity_alloc+0x48/0x1b0 [<ffffffff811b26fc>] bio_integrity_prep+0xac/0x360 [<ffffffff8111e298>] ? mempool_alloc+0x58/0x150 [<ffffffffa03e8041>] ? alloc_extent_state+0x31/0x110 [btrfs] [<ffffffff81241579>] blk_queue_bio+0x1c9/0x460 [<ffffffff8123e58a>] generic_make_request+0xca/0x100 [<ffffffff8123e639>] submit_bio+0x79/0x160 [<ffffffffa03f865e>] btrfs_map_bio+0x48e/0x5b0 [btrfs] [<ffffffffa03c821a>] btree_submit_bio_hook+0xda/0x110 [btrfs] [<ffffffffa03e7eba>] submit_one_bio+0x6a/0xa0 [btrfs] [<ffffffffa03ef450>] read_extent_buffer_pages+0x250/0x310 [btrfs] [<ffffffff8125eef6>] ? __radix_tree_preload+0x66/0xf0 [<ffffffff8125f1c5>] ? radix_tree_insert+0x95/0x260 [<ffffffffa03c66f6>] btree_read_extent_buffer_pages.constprop.128+0xb6/0x120 [btrfs] [<ffffffffa03c8c1a>] read_tree_block+0x3a/0x60 [btrfs] [<ffffffffa03caefd>] open_ctree+0x139d/0x2030 [btrfs] [<ffffffffa03a282a>] btrfs_mount+0x53a/0x7d0 [btrfs] [<ffffffff8113ab0b>] ? pcpu_alloc+0x8eb/0x9f0 [<ffffffff81167305>] ? __kmalloc_track_caller+0x35/0x1e0 [<ffffffff81176ba0>] mount_fs+0x20/0xd0 [<ffffffff81191096>] vfs_kern_mount+0x76/0x120 [<ffffffff81193320>] do_mount+0x200/0xa40 [<ffffffff81135cdb>] ? strndup_user+0x5b/0x80 [<ffffffff81193bf0>] SyS_mount+0x90/0xe0 [<ffffffff8156d31d>] system_call_fastpath+0x1a/0x1f Code: 4c 8d 75 a8 4c 89 6d e8 45 89 e0 4c 8d 6f 30 48 89 5d d8 41 83 e0 af 48 89 fb 49 83 c6 18 4c 89 7d f8 65 4c 8b 3c 25 c0 b8 00 00 <48> 8b 73 18 44 89 c7 44 89 45 98 ff 53 20 48 85 c0 48 89 c2 74 RIP [<ffffffff8111e28a>] mempool_alloc+0x4a/0x150 RSP <ffff880230699688> CR2: 0000000000000018 ---[ end trace 7a96042017ed21e2 ]--- Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
| | * | | Merge branch 'for-linus' into for-linus-3.12Chris Mason2013-10-056-17/+31
| | |\ \ \ | | | |_|/ | | |/| |
| | | * | Btrfs: fix a use-after-free bug in btrfs_dev_replace_finishingIlya Dryomov2013-10-042-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | free_device rcu callback, scheduled from btrfs_rm_dev_replace_srcdev, can be processed before btrfs_scratch_superblock is called, which would result in a use-after-free on btrfs_device contents. Fix this by zeroing the superblock before the rcu callback is registered. Cc: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| | | * | Btrfs: eliminate races in worker stopping codeIlya Dryomov2013-10-042-6/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current implementation of worker threads in Btrfs has races in worker stopping code, which cause all kinds of panics and lockups when running btrfs/011 xfstest in a loop. The problem is that btrfs_stop_workers is unsynchronized with respect to check_idle_worker, check_busy_worker and __btrfs_start_workers. E.g., check_idle_worker race flow: btrfs_stop_workers(): check_idle_worker(aworker): - grabs the lock - splices the idle list into the working list - removes the first worker from the working list - releases the lock to wait for its kthread's completion - grabs the lock - if aworker is on the working list, moves aworker from the working list to the idle list - releases the lock - grabs the lock - puts the worker - removes the second worker from the working list ...... btrfs_stop_workers returns, aworker is on the idle list FS is umounted, memory is freed ...... aworker is waken up, fireworks ensue With this applied, I wasn't able to trigger the problem in 48 hours, whereas previously I could reliably reproduce at least one of these races within an hour. Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| | | * | Btrfs: fix crash of compressed writesLiu Bo2013-10-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The crash[1] is found by xfstests/generic/208 with "-o compress", it's not reproduced everytime, but it does panic. The bug is quite interesting, it's actually introduced by a recent commit (573aecafca1cf7a974231b759197a1aebcf39c2a, Btrfs: actually limit the size of delalloc range). Btrfs implements delay allocation, so during writeback, we (1) get a page A and lock it (2) search the state tree for delalloc bytes and lock all pages within the range (3) process the delalloc range, including find disk space and create ordered extent and so on. (4) submit the page A. It runs well in normal cases, but if we're in a racy case, eg. buffered compressed writes and aio-dio writes, sometimes we may fail to lock all pages in the 'delalloc' range, in which case, we need to fall back to search the state tree again with a smaller range limit(max_bytes = PAGE_CACHE_SIZE - offset). The mentioned commit has a side effect, that is, in the fallback case, we can find delalloc bytes before the index of the page we already have locked, so we're in the case of (delalloc_end <= *start) and return with (found > 0). This ends with not locking delalloc pages but making ->writepage still process them, and the crash happens. This fixes it by just thinking that we find nothing and returning to caller as the caller knows how to deal with it properly. [1]: ------------[ cut here ]------------ kernel BUG at mm/page-writeback.c:2170! [...] CPU: 2 PID: 11755 Comm: btrfs-delalloc- Tainted: G O 3.11.0+ #8 [...] RIP: 0010:[<ffffffff810f5093>] [<ffffffff810f5093>] clear_page_dirty_for_io+0x1e/0x83 [...] [ 4934.248731] Stack: [ 4934.248731] ffff8801477e5dc8 ffffea00049b9f00 ffff8801869f9ce8 ffffffffa02b841a [ 4934.248731] 0000000000000000 0000000000000000 0000000000000fff 0000000000000620 [ 4934.248731] ffff88018db59c78 ffffea0005da8d40 ffffffffa02ff860 00000001810016c0 [ 4934.248731] Call Trace: [ 4934.248731] [<ffffffffa02b841a>] extent_range_clear_dirty_for_io+0xcf/0xf5 [btrfs] [ 4934.248731] [<ffffffffa02a8889>] compress_file_range+0x1dc/0x4cb [btrfs] [ 4934.248731] [<ffffffff8104f7af>] ? detach_if_pending+0x22/0x4b [ 4934.248731] [<ffffffffa02a8bad>] async_cow_start+0x35/0x53 [btrfs] [ 4934.248731] [<ffffffffa02c694b>] worker_loop+0x14b/0x48c [btrfs] [ 4934.248731] [<ffffffffa02c6800>] ? btrfs_queue_worker+0x25c/0x25c [btrfs] [ 4934.248731] [<ffffffff810608f5>] kthread+0x8d/0x95 [ 4934.248731] [<ffffffff81060868>] ? kthread_freezable_should_stop+0x43/0x43 [ 4934.248731] [<ffffffff814fe09c>] ret_from_fork+0x7c/0xb0 [ 4934.248731] [<ffffffff81060868>] ? kthread_freezable_should_stop+0x43/0x43 [ 4934.248731] Code: ff 85 c0 0f 94 c0 0f b6 c0 59 5b 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 2c de 00 00 49 89 c4 48 8b 03 a8 01 75 02 <0f> 0b 4d 85 e4 74 52 49 8b 84 24 80 00 00 00 f6 40 20 01 75 44 [ 4934.248731] RIP [<ffffffff810f5093>] clear_page_dirty_for_io+0x1e/0x83 [ 4934.248731] RSP <ffff8801869f9c48> [ 4934.280307] ---[ end trace 36f06d3f8750236a ]--- Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| | | * | Btrfs: fix transid verify errors when recovering log treeJosef Bacik2013-10-041-5/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we crash with a log, remount and recover that log, and then crash before we can commit another transaction we will get transid verify errors on the next mount. This is because we were not zero'ing out the log when we committed the transaction after recovery. This is ok as long as we commit another transaction at some point in the future, but if you abort or something else goes wrong you can end up in this weird state because the recovery stuff says that the tree log should have a generation+1 of the super generation, which won't be the case of the transaction that was started for recovery. Fix this by removing the check and _always_ zero out the log portion of the super when we commit a transaction. This fixes the transid verify issues I was seeing with my force errors tests. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
| * | | | Merge tag 'gpio-v3.12-2' of ↵Linus Torvalds2013-10-051-57/+101
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio Pull GPIO fixes from Linus Walleij: "Two patches for the OMAP driver, dealing with setting up IRQs properly on the device tree boot path" * tag 'gpio-v3.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: gpio/omap: auto-setup a GPIO when used as an IRQ gpio/omap: maintain GPIO and IRQ usage separately
| | * | | | gpio/omap: auto-setup a GPIO when used as an IRQJavier Martinez Canillas2013-10-011-46/+83
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The OMAP GPIO controller HW requires a pin to be configured in GPIO input mode in order to operate as an interrupt input. Since drivers should not be aware of whether an interrupt pin is also a GPIO or not, the HW should be fully configured/enabled as an IRQ if a driver solely uses IRQ APIs such as request_irq(), and never calls any GPIO-related APIs. As such, add the missing HW setup to the OMAP GPIO controller's irq_chip driver. Since this bypasses the GPIO subsystem we have to ensure that another driver won't be able to request the same GPIO pin that is used as an IRQ and set its direction as output. Requesting the GPIO and setting its direction as input is allowed though. This fixes smsc911x ethernet support for tobi and igep OMAP3 boards and OMAP4 SDP SPI based ethernet that use a GPIO as an interrupt line. Cc: stable@vger.kernel.org Acked-by: Stephen Warren <swarren@nvidia.com> Tested-by: George Cherian <george.cherian@ti.com> Tested-by: Aaro Koskinen <aaro.koskinen@iki.fi> Tested-by: Lars Poeschel <poeschel@lemonage.de> Reviewed-by: Kevin Hilman <khilman@linaro.org> Tested-by: Kevin Hilman <khilman@linaro.org> Acked-by: Santosh Shilimkar <santosh.shilimkar@ti.com> Acked-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Javier Martinez Canillas <javier.martinez@collabora.co.uk> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
OpenPOWER on IntegriCloud