summaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'locking-core-for-linus' of ↵Linus Torvalds2015-09-0310-558/+306
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking and atomic updates from Ingo Molnar: "Main changes in this cycle are: - Extend atomic primitives with coherent logic op primitives (atomic_{or,and,xor}()) and deprecate the old partial APIs (atomic_{set,clear}_mask()) The old ops were incoherent with incompatible signatures across architectures and with incomplete support. Now every architecture supports the primitives consistently (by Peter Zijlstra) - Generic support for 'relaxed atomics': - _acquire/release/relaxed() flavours of xchg(), cmpxchg() and {add,sub}_return() - atomic_read_acquire() - atomic_set_release() This came out of porting qwrlock code to arm64 (by Will Deacon) - Clean up the fragile static_key APIs that were causing repeat bugs, by introducing a new one: DEFINE_STATIC_KEY_TRUE(name); DEFINE_STATIC_KEY_FALSE(name); which define a key of different types with an initial true/false value. Then allow: static_branch_likely() static_branch_unlikely() to take a key of either type and emit the right instruction for the case. To be able to know the 'type' of the static key we encode it in the jump entry (by Peter Zijlstra) - Static key self-tests (by Jason Baron) - qrwlock optimizations (by Waiman Long) - small futex enhancements (by Davidlohr Bueso) - ... and misc other changes" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (63 commits) jump_label/x86: Work around asm build bug on older/backported GCCs locking, ARM, atomics: Define our SMP atomics in terms of _relaxed() operations locking, include/llist: Use linux/atomic.h instead of asm/cmpxchg.h locking/qrwlock: Make use of _{acquire|release|relaxed}() atomics locking/qrwlock: Implement queue_write_unlock() using smp_store_release() locking/lockref: Remove homebrew cmpxchg64_relaxed() macro definition locking, asm-generic: Add _{relaxed|acquire|release}() variants for 'atomic_long_t' locking, asm-generic: Rework atomic-long.h to avoid bulk code duplication locking/atomics: Add _{acquire|release|relaxed}() variants of some atomic operations locking, compiler.h: Cast away attributes in the WRITE_ONCE() magic locking/static_keys: Make verify_keys() static jump label, locking/static_keys: Update docs locking/static_keys: Provide a selftest jump_label: Provide a self-test s390/uaccess, locking/static_keys: employ static_branch_likely() x86, tsc, locking/static_keys: Employ static_branch_likely() locking/static_keys: Add selftest locking/static_keys: Add a new static_key interface locking/static_keys: Rework update logic locking/static_keys: Add static_key_{en,dis}able() helpers ...
| * locking/qrwlock: Make use of _{acquire|release|relaxed}() atomicsWill Deacon2015-08-121-12/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The qrwlock implementation is slightly heavy in its use of memory barriers, mainly through the use of _cmpxchg() and _return() atomics, which imply full barrier semantics. This patch modifies the qrwlock code to use the more relaxed atomic routines so that we can reduce the unnecessary barrier overhead on weakly-ordered architectures. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman.Long@hp.com Cc: paulmck@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1438880084-18856-7-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * locking/static_keys: Add selftestPeter Zijlstra2015-08-031-1/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | Add a little selftest that validates all combinations. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * locking/static_keys: Add a new static_key interfacePeter Zijlstra2015-08-031-7/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are various problems and short-comings with the current static_key interface: - static_key_{true,false}() read like a branch depending on the key value, instead of the actual likely/unlikely branch depending on init value. - static_key_{true,false}() are, as stated above, tied to the static_key init values STATIC_KEY_INIT_{TRUE,FALSE}. - we're limited to the 2 (out of 4) possible options that compile to a default NOP because that's what our arch_static_branch() assembly emits. So provide a new static_key interface: DEFINE_STATIC_KEY_TRUE(name); DEFINE_STATIC_KEY_FALSE(name); Which define a key of different types with an initial true/false value. Then allow: static_branch_likely() static_branch_unlikely() to take a key of either type and emit the right instruction for the case. This means adding a second arch_static_branch_jump() assembly helper which emits a JMP per default. In order to determine the right instruction for the right state, encode the branch type in the LSB of jump_entry::key. This is the final step in removing the naming confusion that has led to a stream of avoidable bugs such as: a833581e372a ("x86, perf: Fix static_key bug in load_mm_cr4()") ... but it also allows new static key combinations that will give us performance enhancements in the subsequent patches. Tested-by: Rabin Vincent <rabin@rab.in> # arm Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> # ppc Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # s390 Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * locking/static_keys: Rework update logicPeter Zijlstra2015-08-031-50/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of spreading the branch_default logic all over the place, concentrate it into the one jump_label_type() function. This does mean we need to actually increment/decrement the enabled count _before_ calling the update path, otherwise jump_label_type() will not see the right state. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * locking/static_keys: Add static_key_{en,dis}able() helpersPeter Zijlstra2015-08-031-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Add two helpers to make it easier to treat the refcount as boolean. Suggested-by: Jason Baron <jasonbaron0@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * jump_label: Add jump_entry_key() helperPeter Zijlstra2015-08-031-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Avoid some casting with a helper, also prepares the way for overloading the LSB of jump_entry::key. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * jump_label, locking/static_keys: Rename JUMP_LABEL_TYPE_* and related ↵Peter Zijlstra2015-08-031-9/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | helpers to the static_key* pattern Rename the JUMP_LABEL_TYPE_* macros to be JUMP_TYPE_* and move the inline helpers into kernel/jump_label.c, since that's the only place they're ever used. Also rename the helpers where it's all about static keys. This is the second step in removing the naming confusion that has led to a stream of avoidable bugs such as: a833581e372a ("x86, perf: Fix static_key bug in load_mm_cr4()") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * jump_label: Rename JUMP_LABEL_{EN,DIS}ABLE to JUMP_LABEL_{JMP,NOP}Peter Zijlstra2015-08-031-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since we've already stepped away from ENABLE is a JMP and DISABLE is a NOP with the branch_default bits, and are going to make it even worse, rename it to make it all clearer. This way we don't mix multiple levels of logic attributes, but have a plain 'physical' name for what the current instruction patching status of a jump label is. This is a first step in removing the naming confusion that has led to a stream of avoidable bugs such as: a833581e372a ("x86, perf: Fix static_key bug in load_mm_cr4()") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org [ Beefed up the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * Merge branch 'x86/asm' into locking/coreIngo Molnar2015-08-032-0/+3
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Upcoming changes to static keys is interacting/conflicting with the following pending TSC commits in tip:x86/asm: 4ea1636b04db x86/asm/tsc: Rename native_read_tsc() to rdtsc() ... So merge it into the locking tree to have a smoother resolution. Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | locking/pvqspinlock: Only kick CPU at unlock timeWaiman Long2015-08-032-21/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For an over-committed guest with more vCPUs than physical CPUs available, it is possible that a vCPU may be kicked twice before getting the lock - once before it becomes queue head and once again before it gets the lock. All these CPU kicking and halting (VMEXIT) can be expensive and slow down system performance. This patch adds a new vCPU state (vcpu_hashed) which enables the code to delay CPU kicking until at unlock time. Once this state is set, the new lock holder will set _Q_SLOW_VAL and fill in the hash table on behalf of the halted queue head vCPU. The original vcpu_halted state will be used by pv_wait_node() only to differentiate other queue nodes from the qeue head. Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1436647018-49734-2-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | locking/qrwlock: Reduce reader/writer to reader lock transfer latencyWaiman Long2015-08-031-8/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, a reader will check first to make sure that the writer mode byte is cleared before incrementing the reader count. That waiting is not really necessary. It increases the latency in the reader/writer to reader transition and reduces readers performance. This patch eliminates that waiting. It also has the side effect of reducing the chance of writer lock stealing and improving the fairness of the lock. Using a locking microbenchmark, a 10-threads 5M locking loop of mostly readers (RW ratio = 10,000:1) has the following performance numbers in a Haswell-EX box: Kernel Locking Rate (Kops/s) ------ --------------------- 4.1.1 15,063,081 4.1.1+patch 17,241,552 (+14.4%) Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/1436459543-29126-2-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | locking/pvqspinlock: Order pv_unhash() after cmpxchg() on unlock slowpathWill Deacon2015-08-031-5/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we unlock in __pv_queued_spin_unlock(), a failed cmpxchg() on the lock value indicates that we need to take the slow-path and unhash the corresponding node blocked on the lock. Since a failed cmpxchg() does not provide any memory-ordering guarantees, it is possible that the node data could be read before the cmpxchg() on weakly-ordered architectures and therefore return a stale value, leading to hash corruption and/or a BUG(). This patch adds an smb_rmb() following the failed cmpxchg operation, so that the unhashing is ordered after the lock has been checked. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Will Deacon <will.deacon@arm.com> [ Added more comments] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <Waiman.Long@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Steve Capper <Steve.Capper@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20150713155830.GL2632@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | locking: Clean up pvqspinlock warningPeter Zijlstra2015-08-031-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Rename the on-stack variable to match the datastructure variable, - place the cmpxchg back under the comment that explains it, - clean up the WARN() statement to avoid superfluous conditionals and line-breaks. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <Waiman.Long@hp.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | |
| | \
| *-. \ Merge branch 'locking/urgent', tag 'v4.2-rc5' into locking/core, to pick up ↵Ingo Molnar2015-08-0319-124/+276
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | fixes before applying new changes Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | futex: Make should_fail_futex() statickbuild test robot2015-07-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Cc: kbuild-all@01.org Cc: tipbuild@zytor.com Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Darren Hart <darren@dvhart.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Brian Silverman <bsilver16384@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | rtmutex: Delete scriptable testerDavidlohr Bueso2015-07-204-444/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No one uses this anymore, and this is not the first time the idea of replacing it with a (now possible) userspace side. Lock stealing logic was removed long ago in when the lock was granted to the highest prio. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Darren Hart <dvhart@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1435782588-4177-2-git-send-email-dave@stgolabs.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | futex: Fault/error injection capabilitiesDavidlohr Bueso2015-07-201-2/+87
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Although futexes are well known for being a royal pita, we really have very little debugging capabilities - except for relying on tglx's eye half the time. By simply making use of the existing fault-injection machinery, we can improve this situation, allowing generating artificial uaddress faults and deadlock scenarios. Of course, when this is disabled in production systems, the overhead for failure checks is practically zero -- so this is very cheap at the same time. Future work would be nice to now enhance trinity to make use of this. There is a special tunable 'ignore-private', which can filter out private futexes. Given the tsk->make_it_fail filter and this option, pi futexes can be narrowed down pretty closely. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Darren Hart <darren@dvhart.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Link: http://lkml.kernel.org/r/1435645562-975-3-git-send-email-dave@stgolabs.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | futex: Enhance comments in futex_lock_pi() for blocking pathsDavidlohr Bueso2015-07-201-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ... serves a bit better to clarify between blocking and non-blocking code paths. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Darren Hart <darren@dvhart.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Link: http://lkml.kernel.org/r/1435645562-975-2-git-send-email-dave@stgolabs.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | locking/qrwlock: Better optimization for interrupt context readersWaiman Long2015-07-061-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The qrwlock is fair in the process context, but becoming unfair when in the interrupt context to support use cases like the tasklist_lock. The current code isn't that well-documented on what happens when in the interrupt context. The rspin_until_writer_unlock() will only spin if the writer has gotten the lock. If the writer is still in the waiting state, the increment in the reader count will cause the writer to remain in the waiting state and the new interrupt context reader will get the lock and return immediately. The current code, however, does an additional read of the lock value which is not necessary as the information has already been there in the fast path. This may sometime cause an additional cacheline transfer when the lock is highly contended. This patch passes the lock value information gotten in the fast path to the slow path to eliminate the additional read. It also documents the action for the interrupt context readers more clearly. Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Will Deacon <will.deacon@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1434729002-57724-3-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | | locking/qrwlock: Rename functions to queued_*()Waiman Long2015-07-061-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To sync up with the naming convention used in qspinlock, all the qrwlock functions were renamed to started with "queued" instead of "queue". Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Douglas Hatch <doug.hatch@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Scott J Norton <scott.norton@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/1434729002-57724-2-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds2015-09-037-68/+295
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking updates from David Miller: "Another merge window, another set of networking changes. I've heard rumblings that the lightweight tunnels infrastructure has been voted networking change of the year. But what do I know? 1) Add conntrack support to openvswitch, from Joe Stringer. 2) Initial support for VRF (Virtual Routing and Forwarding), which allows the segmentation of routing paths without using multiple devices. There are some semantic kinks to work out still, but this is a reasonably strong foundation. From David Ahern. 3) Remove spinlock fro act_bpf fast path, from Alexei Starovoitov. 4) Ignore route nexthops with a link down state in ipv6, just like ipv4. From Andy Gospodarek. 5) Remove spinlock from fast path of act_gact and act_mirred, from Eric Dumazet. 6) Document the DSA layer, from Florian Fainelli. 7) Add netconsole support to bcmgenet, systemport, and DSA. Also from Florian Fainelli. 8) Add Mellanox Switch Driver and core infrastructure, from Jiri Pirko. 9) Add support for "light weight tunnels", which allow for encapsulation and decapsulation without bearing the overhead of a full blown netdevice. From Thomas Graf, Jiri Benc, and a cast of others. 10) Add Identifier Locator Addressing support for ipv6, from Tom Herbert. 11) Support fragmented SKBs in iwlwifi, from Johannes Berg. 12) Allow perf PMUs to be accessed from eBPF programs, from Kaixu Xia. 13) Add BQL support to 3c59x driver, from Loganaden Velvindron. 14) Stop using a zero TX queue length to mean that a device shouldn't have a qdisc attached, use an explicit flag instead. From Phil Sutter. 15) Use generic geneve netdevice infrastructure in openvswitch, from Pravin B Shelar. 16) Add infrastructure to avoid re-forwarding a packet in software that was already forwarded by a hardware switch. From Scott Feldman. 17) Allow AF_PACKET fanout function to be implemented in a bpf program, from Willem de Bruijn" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1458 commits) netfilter: nf_conntrack: make nf_ct_zone_dflt built-in netfilter: nf_dup{4, 6}: fix build error when nf_conntrack disabled net: fec: clear receive interrupts before processing a packet ipv6: fix exthdrs offload registration in out_rt path xen-netback: add support for multicast control bgmac: Update fixed_phy_register() sock, diag: fix panic in sock_diag_put_filterinfo flow_dissector: Use 'const' where possible. flow_dissector: Fix function argument ordering dependency ixgbe: Resolve "initialized field overwritten" warnings ixgbe: Remove bimodal SR-IOV disabling ixgbe: Add support for reporting 2.5G link speed ixgbe: fix bounds checking in ixgbe_setup_tc for 82598 ixgbe: support for ethtool set_rxfh ixgbe: Avoid needless PHY access on copper phys ixgbe: cleanup to use cached mask value ixgbe: Remove second instance of lan_id variable ixgbe: use kzalloc for allocating one thing flow: Move __get_hash_from_flowi{4,6} into flow_dissector.c ixgbe: Remove unused PCI bus types ...
| * | | | | bpf: add support for %s specifier to bpf_trace_printk()Alexei Starovoitov2015-08-281-2/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | %s specifier makes bpf program and kernel debugging easier. To make sure that trace_printk won't crash the unsafe string is copied into stack and unsafe pointer is substituted. The following C program: #include <linux/fs.h> int foo(struct pt_regs *ctx, struct filename *filename) { void *name = 0; bpf_probe_read(&name, sizeof(name), &filename->name); bpf_trace_printk("executed %s\n", name); return 0; } when attached to kprobe do_execve() will produce output in /sys/kernel/debug/tracing/trace_pipe : make-13492 [002] d..1 3250.997277: : executed /bin/sh sh-13493 [004] d..1 3250.998716: : executed /usr/bin/gcc gcc-13494 [002] d..1 3250.999822: : executed /usr/lib/gcc/x86_64-linux-gnu/4.7/cc1 gcc-13495 [002] d..1 3251.006731: : executed /usr/bin/as gcc-13496 [002] d..1 3251.011831: : executed /usr/lib/gcc/x86_64-linux-gnu/4.7/collect2 collect2-13497 [000] d..1 3251.012941: : executed /usr/bin/ld Suggested-by: Brendan Gregg <brendan.d.gregg@gmail.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | lib: introduce strncpy_from_unsafe()Alexei Starovoitov2015-08-281-16/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | generalize FETCH_FUNC_NAME(memory, string) into strncpy_from_unsafe() and fix sparse warnings that were present in original implementation. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-08-272-3/+20
| |\ \ \ \ \
| * \ \ \ \ \ Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-08-214-30/+84
| |\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/usb/qmi_wwan.c Overlapping additions of new device IDs to qmi_wwan.c Signed-off-by: David S. Miller <davem@davemloft.net>
| * \ \ \ \ \ \ Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-08-133-7/+18
| |\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ethernet/cavium/Kconfig The cavium conflict was overlapping dependency changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | bpf: fix bpf_perf_event_read() loop upper boundWei-Chun Chao2015-08-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Verifier rejects programs incorrectly. Fixes: 35578d798400 ("bpf: Implement function bpf_perf_event_read()") Cc: Kaixu Xia <xiakaixu@huawei.com> Cc: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: Wei-Chun Chao <weichunc@plumgrid.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | bpf: Implement function bpf_perf_event_read() that get the selected hardware ↵Kaixu Xia2015-08-092-15/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PMU conuter According to the perf_event_map_fd and index, the function bpf_perf_event_read() can convert the corresponding map value to the pointer to struct perf_event and return the Hardware PMU counter value. Signed-off-by: Kaixu Xia <xiakaixu@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | bpf: Add new bpf map type to store the pointer to struct perf_eventKaixu Xia2015-08-091-0/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce a new bpf map type 'BPF_MAP_TYPE_PERF_EVENT_ARRAY'. This map only stores the pointer to struct perf_event. The user space event FDs from perf_event_open() syscall are converted to the pointer to struct perf_event and stored in map. Signed-off-by: Kaixu Xia <xiakaixu@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | bpf: Make the bpf_prog_array_map more genericWang Nan2015-08-093-33/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All the map backends are of generic nature. In order to avoid adding much special code into the eBPF core, rewrite part of the bpf_prog_array map code and make it more generic. So the new perf_event_array map type can reuse most of code with bpf_prog_array map and add fewer lines of special code. Signed-off-by: Wang Nan <wangnan0@huawei.com> Signed-off-by: Kaixu Xia <xiakaixu@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | perf: add the necessary core perf APIs when accessing events counters in ↵Kaixu Xia2015-08-091-0/+78
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | eBPF programs This patch add three core perf APIs: - perf_event_attrs(): export the struct perf_event_attr from struct perf_event; - perf_event_get(): get the struct perf_event from the given fd; - perf_event_read_local(): read the events counters active on the current CPU; These APIs are needed when accessing events counters in eBPF programs. The API perf_event_read_local() comes from Peter and I add the corresponding SOB. Signed-off-by: Kaixu Xia <xiakaixu@huawei.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-07-312-21/+37
| |\ \ \ \ \ \ \ \ | | | |_|_|_|_|/ / | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: arch/s390/net/bpf_jit_comp.c drivers/net/ethernet/ti/netcp_ethss.c net/bridge/br_multicast.c net/ipv4/ip_fragment.c All four conflicts were cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | ebpf: Allow dereferences of PTR_TO_STACK registersAlex Gartrell2015-07-271-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mov %rsp, %r1 ; r1 = rsp add $-8, %r1 ; r1 = rsp - 8 store_q $123, -8(%rsp) ; *(u64*)r1 = 123 <- valid store_q $123, (%r1) ; *(u64*)r1 = 123 <- previously invalid mov $0, %r0 exit ; Always need to exit And we'd get the following error: 0: (bf) r1 = r10 1: (07) r1 += -8 2: (7a) *(u64 *)(r10 -8) = 999 3: (7a) *(u64 *)(r1 +0) = 999 R1 invalid mem access 'fp' Unable to load program We already know that a register is a stack address and the appropriate offset, so we should be able to validate those references as well. Signed-off-by: Alex Gartrell <agartrell@fb.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-07-238-24/+32
| |\ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: net/bridge/br_mdb.c br_mdb.c conflict was a function call being removed to fix a bug in 'net' but whose signature was changed in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | | test_bpf: add bpf_skb_vlan_push/pop() testsAlexei Starovoitov2015-07-201-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | improve accuracy of timing in test_bpf and add two stress tests: - {skb->data[0], get_smp_processor_id} repeated 2k times - {skb->data[0], vlan_push} x 68 followed by {skb->data[0], vlan_pop} x 68 1st test is useful to test performance of JIT implementation of BPF_LD_ABS together with BPF_CALL instructions. 2nd test is stressing skb_vlan_push/pop logic together with skb->data access via BPF_LD_ABS insn which checks that re-caching of skb->data is done correctly. In order to call bpf_skb_vlan_push() from test_bpf.ko have to add three export_symbol_gpl. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-07-1319-130/+283
| |\ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: net/bridge/br_mdb.c Minor conflict in br_mdb.c, in 'net' we added a memset of the on-stack 'ip' variable whereas in 'net-next' we assign a new member 'vid'. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | | | ebpf: remove self-assignment in interpreter's tail callDaniel Borkmann2015-07-131-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ARG1 = BPF_R1 as it stands, evaluates to regs[BPF_REG_1] = regs[BPF_REG_1] and thus has no effect. Add a comment instead, explaining what happens and why it's okay to just remove it. Since from user space side, a tail call is invoked as a pseudo helper function via bpf_tail_call_proto, the verifier checks the arguments just like with any other helper function and makes sure that the first argument (regs[BPF_REG_1])'s type is ARG_PTR_TO_CTX. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | | | | | Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-blockLinus Torvalds2015-09-022-15/+7
|\ \ \ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull core block updates from Jens Axboe: "This first core part of the block IO changes contains: - Cleanup of the bio IO error signaling from Christoph. We used to rely on the uptodate bit and passing around of an error, now we store the error in the bio itself. - Improvement of the above from myself, by shrinking the bio size down again to fit in two cachelines on x86-64. - Revert of the max_hw_sectors cap removal from a revision again, from Jeff Moyer. This caused performance regressions in various tests. Reinstate the limit, bump it to a more reasonable size instead. - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me. Most devices have huge trim limits, which can cause nasty latencies when deleting files. Enable the admin to configure the size down. We will look into having a more sane default instead of UINT_MAX sectors. - Improvement of the SGP gaps logic from Keith Busch. - Enable the block core to handle arbitrarily sized bios, which enables a nice simplification of bio_add_page() (which is an IO hot path). From Kent. - Improvements to the partition io stats accounting, making it faster. From Ming Lei. - Also from Ming Lei, a basic fixup for overflow of the sysfs pending file in blk-mq, as well as a fix for a blk-mq timeout race condition. - Ming Lin has been carrying Kents above mentioned patches forward for a while, and testing them. Ming also did a few fixes around that. - Sasha Levin found and fixed a use-after-free problem introduced by the bio->bi_error changes from Christoph. - Small blk cgroup cleanup from Viresh Kumar" * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits) blk: Fix bio_io_vec index when checking bvec gaps block: Replace SG_GAPS with new queue limits mask block: bump BLK_DEF_MAX_SECTORS to 2560 Revert "block: remove artifical max_hw_sectors cap" blk-mq: fix race between timeout and freeing request blk-mq: fix buffer overflow when reading sysfs file of 'pending' Documentation: update notes in biovecs about arbitrarily sized bios block: remove bio_get_nr_vecs() fs: use helper bio_add_page() instead of open coding on bi_io_vec block: kill merge_bvec_fn() completely md/raid5: get rid of bio_fits_rdev() md/raid5: split bio for chunk_aligned_read block: remove split code in blkdev_issue_{discard,write_same} btrfs: remove bio splitting and merge_bvec_fn() calls bcache: remove driver private bio splitting code block: simplify bio_add_page() block: make generic_make_request handle arbitrarily sized bios blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL) block: don't access bio->bi_error after bio_put() block: shrink struct bio down to 2 cache lines again ...
| * | | | | | | | | | | block: add a bi_error field to struct bioChristoph Hellwig2015-07-292-15/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not beeing persistent when bios are queued up, and are not passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | | | | | | | | | | Merge branch 'for-4.3' of ↵Linus Torvalds2015-09-026-25/+474
|\ \ \ \ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - a new PIDs controller is added. It turns out that PIDs are actually an independent resource from kmem due to the limited PID space. - more core preparations for the v2 interface. Once cpu side interface is settled, it should be ready for lifting the devel mask. for-4.3-unified-base was temporarily branched so that other trees (block) can pull cgroup core changes that blkcg changes depend on. - a non-critical idr_preload usage bug fix. * 'for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: pids: fix invalid get/put usage cgroup: introduce cgroup_subsys->legacy_name cgroup: don't print subsystems for the default hierarchy cgroup: make cftype->private a unsigned long cgroup: export cgrp_dfl_root cgroup: define controller file conventions cgroup: fix idr_preload usage cgroup: add documentation for the PIDs controller cgroup: implement the PIDs subsystem cgroup: allow a cgroup subsystem to reject a fork
| * \ \ \ \ \ \ \ \ \ \ \ Merge branch 'for-4.3-unified-base' into for-4.3Tejun Heo2015-08-251-15/+26
| |\ \ \ \ \ \ \ \ \ \ \ \
| | * | | | | | | | | | | | cgroup: introduce cgroup_subsys->legacy_nameTejun Heo2015-08-181-11/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This allows cgroup subsystems to use a different name on the unified hierarchy. cgroup_subsys->name is used on the unified hierarchy, ->legacy_name elsewhere. If ->legacy_name is not explicitly set, it's automatically set to ->name and the userland visible behavior remains unchanged. v2: Make parse_cgroupfs_options() only consider ->legacy_name as mount options are used only on legacy hierarchies. Suggested by Li Zefan. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: cgroups@vger.kernel.org
| | * | | | | | | | | | | | cgroup: don't print subsystems for the default hierarchyTejun Heo2015-08-181-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It doesn't make sense to print subsystems on mount option or /proc/PID/cgroup for the default hierarchy. * cgroup.controllers file at the root of the default hierarchy lists the currently attached controllers. * The default hierarchy is catch-all for unmounted subsystems. * The default hierarchy doesn't accept any mount options. Suppress subsystem printing on mount options and /proc/PID/cgroup for the default hierarchy. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: cgroups@vger.kernel.org
| | * | | | | | | | | | | | cgroup: export cgrp_dfl_rootTejun Heo2015-08-051-0/+1
| | | |_|_|_|_|_|_|_|/ / / | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While cgroup subsystems can't be modules, blkcg supports dynamically loadable policies which interact with cgroup core. Export cgrp_dfl_root so that cgroup_on_dfl() can be used in those modules. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org>
| * | | | | | | | | | | | cgroup: pids: fix invalid get/put usageAleksa Sarai2015-08-251-15/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix incorrect usage of css_get and css_put to put a different css in pids_{cancel_,}attach() than the one grabbed in pids_can_attach(). This could lead to quite serious memory leakage (and unsafe operations on the putted css). tj: minor comment update Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | | | | | | | | | | cgroup: fix idr_preload usageVladimir Davydov2015-08-031-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It does not make much sense to call idr_preload with the same gfp mask as the following idr_alloc, but this is what we do in cgroup_idr_alloc. This patch fixes the idr_preload usage by making cgroup_idr_alloc call idr_alloc w/o __GFP_WAIT. Since it is now safe to call cgroup_idr_alloc with GFP_KERNEL, the patch also fixes all its callers appropriately. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | | | | | | | | | | cgroup: implement the PIDs subsystemAleksa Sarai2015-07-142-0/+367
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adds a new single-purpose PIDs subsystem to limit the number of tasks that can be forked inside a cgroup. Essentially this is an implementation of RLIMIT_NPROC that applies to a cgroup rather than a process tree. However, it should be noted that organisational operations (adding and removing tasks from a PIDs hierarchy) will *not* be prevented. Rather, the number of tasks in the hierarchy cannot exceed the limit through forking. This is due to the fact that, in the unified hierarchy, attach cannot fail (and it is not possible for a task to overcome its PIDs cgroup policy limit by attaching to a child cgroup -- even if migrating mid-fork it must be able to fork in the parent first). PIDs are fundamentally a global resource, and it is possible to reach PID exhaustion inside a cgroup without hitting any reasonable kmemcg policy. Once you've hit PID exhaustion, you're only in a marginally better state than OOM. This subsystem allows PID exhaustion inside a cgroup to be prevented. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | | | | | | | | | | | cgroup: allow a cgroup subsystem to reject a forkAleksa Sarai2015-07-144-6/+88
| |/ / / / / / / / / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new cgroup subsystem callback can_fork that conditionally states whether or not the fork is accepted or rejected by a cgroup policy. In addition, add a cancel_fork callback so that if an error occurs later in the forking process, any state modified by can_fork can be reverted. Allow for a private opaque pointer to be passed from cgroup_can_fork to cgroup_post_fork, allowing for the fork state to be stored by each subsystem separately. Also add a tagging system for cgroup_subsys.h to allow for CGROUP_<TAG> enumerations to be be defined and used. In addition, explicitly add a CGROUP_CANFORK_COUNT macro to make arrays easier to define. This is in preparation for implementing the pids cgroup subsystem. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* | | | | | | | | | | | Merge branch 'for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds2015-09-021-1/+1
|\ \ \ \ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull workqueue updates from Tejun Heo: "Only three trivial changes for workqueue this time - doc, MAINTAINERS and EXPORT_SYMBOL updates" * 'for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix some docbook warnings workqueue: Make flush_workqueue() available again to non GPL modules workqueue: add myself as a dedicated reviwer
OpenPOWER on IntegriCloud