summaryrefslogtreecommitdiffstats
path: root/mm
Commit message (Collapse)AuthorAgeFilesLines
* mm: change invalidatepage prototype to accept lengthLukas Czerner2013-05-212-7/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently there is no way to truncate partial page where the end truncate point is not at the end of the page. This is because it was not needed and the functionality was enough for file system truncate operation to work properly. However more file systems now support punch hole feature and it can benefit from mm supporting truncating page just up to the certain point. Specifically, with this functionality truncate_inode_pages_range() can be changed so it supports truncating partial page at the end of the range (currently it will BUG_ON() if 'end' is not at the end of the page). This commit changes the invalidatepage() address space operation prototype to accept range to be invalidated and update all the instances for it. We also change the block_invalidatepage() in the same way and actually make a use of the new length argument implementing range invalidation. Actual file system implementations will follow except the file systems where the changes are really simple and should not change the behaviour in any way .Implementation for truncate_page_range() which will be able to accept page unaligned ranges will follow as well. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com>
* shm: fix null pointer deref when userspace specifies invalid hugepage sizeLi Zefan2013-05-091-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | Dave reported an oops triggered by trinity: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: newseg+0x10d/0x390 PGD cf8c1067 PUD cf8c2067 PMD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU: 2 PID: 7636 Comm: trinity-child2 Not tainted 3.9.0+#67 ... Call Trace: ipcget+0x182/0x380 SyS_shmget+0x5a/0x60 tracesys+0xdd/0xe2 This bug was introduced by commit af73e4d9506d ("hugetlbfs: fix mmap failure in unaligned size request"). Reported-by: Dave Jones <davej@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Li Zefan <lizfan@huawei.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/slab: Fix crash during slab initChris Mason2013-05-081-10/+10
| | | | | | | | | | | | | | | | | | | | | | Commit 8a965b3baa89 ("mm, slab_common: Fix bootstrap creation of kmalloc caches") introduced a regression that caused us to crash early during boot. The commit was introducing ordering of slab creation, making sure two odd-sized slabs were created after specific powers of two sizes. But, if any of the power of two slabs were created earlier during boot, slabs at index 1 or 2 might not get created at all. This patch makes sure none of the slabs get skipped. Tony Lindgren bisected this down to the offending commit, which really helped because bisect kept bringing me to almost but not quite this one. Signed-off-by: Chris Mason <chris.mason@fusionio.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Tony Lindgren <tony@atomide.com> Acked-by: Soren Brinkmann <soren.brinkmann@xilinx.com> Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-blockLinus Torvalds2013-05-083-286/+49
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block core updates from Jens Axboe: - Major bit is Kents prep work for immutable bio vecs. - Stable candidate fix for a scheduling-while-atomic in the queue bypass operation. - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging discard bios. - Tejuns changes to convert the writeback thread pool to the generic workqueue mechanism. - Runtime PM framework, SCSI patches exists on top of these in James' tree. - A few random fixes. * 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits) relay: move remove_buf_file inside relay_close_buf partitions/efi.c: replace useless kzalloc's by kmalloc's fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read() block: fix max discard sectors limit blkcg: fix "scheduling while atomic" in blk_queue_bypass_start Documentation: cfq-iosched: update documentation help for cfq tunables writeback: expose the bdi_wq workqueue writeback: replace custom worker pool implementation with unbound workqueue writeback: remove unused bdi_pending_list aoe: Fix unitialized var usage bio-integrity: Add explicit field for owner of bip_buf block: Add an explicit bio flag for bios that own their bvec block: Add bio_alloc_pages() block: Convert some code to bio_for_each_segment_all() block: Add bio_for_each_segment_all() bounce: Refactor __blk_queue_bounce to not use bi_io_vec raid1: use bio_copy_data() pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage pktcdvd: use bio_copy_data() block: Add bio_copy_data() ...
| * Merge branch 'writeback-workqueue' of ↵Jens Axboe2013-04-026-250/+50
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into for-3.10/core Tejun writes: ----- This is the pull request for the earlier patchset[1] with the same name. It's only three patches (the first one was committed to workqueue tree) but the merge strategy is a bit involved due to the dependencies. * Because the conversion needs features from wq/for-3.10, block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those workqueue conflicts from flaring up in block tree. * Resolving the issue that Jan and Dave raised about debugging requires arch-wide changes. The patchset is being worked on[2] but it'll have to go through -mm after these changes show up in -next, and not included in this pull request. The three commits are located in the following git branch. git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue Pulling it into block/for-3.10/core produces a conflict in drivers/md/raid5.c between the following two commits. e3620a3ad5 ("MD RAID5: Avoid accessing gendisk or queue structs when not available") 2f6db2a707 ("raid5: use bio_reset()") The conflict is trivial - one removes an "if ()" conditional while the other removes "rbi->bi_next = NULL" right above it. We just need to remove both. The merged branch is available at git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge so that you can use it for verification. The test merge commit has proper merge description. While these changes are a bit of pain to route, they make code simpler and even have, while minute, measureable performance gain[3] even on a workload which isn't particularly favorable to showing the benefits of this conversion. ---- Fixed up the conflict. Conflicts: drivers/md/raid5.c Signed-off-by: Jens Axboe <axboe@kernel.dk>
| | * writeback: expose the bdi_wq workqueueTejun Heo2013-04-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are cases where userland wants to tweak the priority and affinity of writeback flushers. Expose bdi_wq to userland by setting WQ_SYSFS. It appears under /sys/bus/workqueue/devices/writeback/ and allows adjusting maximum concurrency level, cpumask and nice level. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| | * writeback: replace custom worker pool implementation with unbound workqueueTejun Heo2013-04-011-227/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Writeback implements its own worker pool - each bdi can be associated with a worker thread which is created and destroyed dynamically. The worker thread for the default bdi is always present and serves as the "forker" thread which forks off worker threads for other bdis. there's no reason for writeback to implement its own worker pool when using unbound workqueue instead is much simpler and more efficient. This patch replaces custom worker pool implementation in writeback with an unbound workqueue. The conversion isn't too complicated but the followings are worth mentioning. * bdi_writeback->last_active, task and wakeup_timer are removed. delayed_work ->dwork is added instead. Explicit timer handling is no longer necessary. Everything works by either queueing / modding / flushing / canceling the delayed_work item. * bdi_writeback_thread() becomes bdi_writeback_workfn() which runs off bdi_writeback->dwork. On each execution, it processes bdi->work_list and reschedules itself if there are more things to do. The function also handles low-mem condition, which used to be handled by the forker thread. If the function is running off a rescuer thread, it only writes out limited number of pages so that the rescuer can serve other bdis too. This preserves the flusher creation failure behavior of the forker thread. * INIT_LIST_HEAD(&bdi->bdi_list) is used to tell bdi_writeback_workfn() about on-going bdi unregistration so that it always drains work_list even if it's running off the rescuer. Note that the original code was broken in this regard. Under memory pressure, a bdi could finish unregistration with non-empty work_list. * The default bdi is no longer special. It now is treated the same as any other bdi and bdi_cap_flush_forker() is removed. * BDI_pending is no longer used. Removed. * Some tracepoints become non-applicable. The following TPs are removed - writeback_nothread, writeback_wake_thread, writeback_wake_forker_thread, writeback_thread_start, writeback_thread_stop. Everything, including devices coming and going away and rescuer operation under simulated memory pressure, seems to work fine in my test setup. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Jeff Moyer <jmoyer@redhat.com>
| | * writeback: remove unused bdi_pending_listTejun Heo2013-04-011-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | There's no user left. Remove it. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Fengguang Wu <fengguang.wu@intel.com>
| * | block: Convert some code to bio_for_each_segment_all()Kent Overstreet2013-03-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | More prep work for immutable bvecs: A few places in the code were either open coding or using the wrong version - fix. After we introduce the bvec iter, it'll no longer be possible to modify the biovec through bio_for_each_segment_all() - it doesn't increment a pointer to the current bvec, you pass in a struct bio_vec (not a pointer) which is updated with what the current biovec would be (taking into account bi_bvec_done and bi_size). So because of that it's more worthwhile to be consistent about bio_for_each_segment()/bio_for_each_segment_all() usage. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: NeilBrown <neilb@suse.de> CC: Alasdair Kergon <agk@redhat.com> CC: dm-devel@redhat.com CC: Alexander Viro <viro@zeniv.linux.org.uk>
| * | block: Add bio_for_each_segment_all()Kent Overstreet2013-03-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __bio_for_each_segment() iterates bvecs from the specified index instead of bio->bv_idx. Currently, the only usage is to walk all the bvecs after the bio has been advanced by specifying 0 index. For immutable bvecs, we need to split these apart; bio_for_each_segment() is going to have a different implementation. This will also help document the intent of code that's using it - bio_for_each_segment_all() is only legal to use for code that owns the bio. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: Neil Brown <neilb@suse.de> CC: Boaz Harrosh <bharrosh@panasas.com>
| * | bounce: Refactor __blk_queue_bounce to not use bi_io_vecKent Overstreet2013-03-231-54/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A bunch of what __blk_queue_bounce() was doing was problematic for the immutable bvec work; this cleans that up and the code is quite a bit smaller, too. The __bio_for_each_segment() in copy_to_high_bio_irq() was changed because that one's looping over the original bio, not the bounce bio - a later patch renames __bio_for_each_segment() -> bio_for_each_segment_all(), and documents that bio_for_each_segment_all() is only for code that owns the bio. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk>
| * | block: Remove bi_idx referencesKent Overstreet2013-03-231-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For immutable bvecs, all bi_idx usage needs to be audited - so here we're removing all the unnecessary uses. Most of these are places where it was being initialized on a bio that was just allocated, a few others are conversions to standard macros. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk>
* | | aio: don't include aio.h in sched.hKent Overstreet2013-05-073-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Faster kernel compiles by way of fewer unnecessary includes. [akpm@linux-foundation.org: fix fallout] [akpm@linux-foundation.org: fix build] Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | mm: remove old aio use_mm() commentZach Brown2013-05-071-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Bunch of performance improvements and cleanups Zach Brown and I have been working on. The code should be pretty solid at this point, though it could of course use more review and testing. The results in my testing are pretty impressive, particularly when an ioctx is being shared between multiple threads. In my crappy synthetic benchmark, with 4 threads submitting and one thread reaping completions, I saw overhead in the aio code go from ~50% (mostly ioctx lock contention) to low single digits. Performance with ioctx per thread improved too, but I'd have to rerun those benchmarks. The reason I've been focused on performance when the ioctx is shared is that for a fair number of real world completions, userspace needs the completions aggregated somehow - in practice people just end up implementing this aggregation in userspace today, but if it's done right we can do it much more efficiently in the kernel. Performance wise, the end result of this patch series is that submitting a kiocb writes to _no_ shared cachelines - the penalty for sharing an ioctx is gone there. There's still going to be some cacheline contention when we deliver the completions to the aio ringbuffer (at least if you have interrupts being delivered on multiple cores, which for high end stuff you do) but I have a couple more patches not in this series that implement coalescing for that (by taking advantage of interrupt coalescing). With that, there's basically no bottlenecks or performance issues to speak of in the aio code. This patch: use_mm() is used in more places than just aio. There's no need to mention callers when describing the function. Signed-off-by: Zach Brown <zab@redhat.com> Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | mm/vmalloc.c: add vfree commentAndrew Morton2013-05-071-0/+2
| | | | | | | | | | | | | | | | | | Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | hugetlbfs: fix mmap failure in unaligned size requestNaoya Horiguchi2013-05-071-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current kernel returns -EINVAL unless a given mmap length is "almost" hugepage aligned. This is because in sys_mmap_pgoff() the given length is passed to vm_mmap_pgoff() as it is without being aligned with hugepage boundary. This is a regression introduced in commit 40716e29243d ("hugetlbfs: fix alignment of huge page requests"), where alignment code is pushed into hugetlb_file_setup() and the variable len in caller side is not changed. To fix this, this patch partially reverts that commit, and adds alignment code in caller side. And it also introduces hstate_sizelog() in order to get proper hstate to specified hugepage size. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=56881 [akpm@linux-foundation.org: fix warning when CONFIG_HUGETLB_PAGE=n] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: <iceman_dvd@yahoo.com> Cc: Steven Truelove <steven.truelove@utoronto.ca> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | mm, memcg: add rss_huge stat to memory.statDavid Rientjes2013-05-071-10/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This exports the amount of anonymous transparent hugepages for each memcg via the new "rss_huge" stat in memory.stat. The units are in bytes. This is helpful to determine the hugepage utilization for individual jobs on the system in comparison to rss and opportunities where MADV_HUGEPAGE may be helpful. The amount of anonymous transparent hugepages is also included in "rss" for backwards compatibility. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge branch 'slab/for-linus' of ↵Linus Torvalds2013-05-074-639/+589
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux Pull slab changes from Pekka Enberg: "The bulk of the changes are more slab unification from Christoph. There's also few fixes from Aaron, Glauber, and Joonsoo thrown into the mix." * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: (24 commits) mm, slab_common: Fix bootstrap creation of kmalloc caches slab: Return NULL for oversized allocations mm: slab: Verify the nodeid passed to ____cache_alloc_node slub: tid must be retrieved from the percpu area of the current processor slub: Do not dereference NULL pointer in node_match slub: add 'likely' macro to inc_slabs_node() slub: correct to calculate num of acquired objects in get_partial_node() slub: correctly bootstrap boot caches mm/sl[au]b: correct allocation type check in kmalloc_slab() slab: Fixup CONFIG_PAGE_ALLOC/DEBUG_SLAB_LEAK sections slab: Handle ARCH_DMA_MINALIGN correctly slab: Common definition for kmem_cache_node slab: Rename list3/l3 to node slab: Common Kmalloc cache determination stat: Use size_t for sizes instead of unsigned slab: Common function to create the kmalloc array slab: Common definition for the array of kmalloc caches slab: Common constants for kmalloc boundaries slab: Rename nodelists to node slab: Common name for the per node structures ...
| * \ \ Merge branch 'slab/next' into slab/for-linusPekka Enberg2013-05-074-639/+589
| |\ \ \
| | * | | mm, slab_common: Fix bootstrap creation of kmalloc cachesChristoph Lameter2013-05-061-9/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For SLAB the kmalloc caches must be created in ascending sizes in order for the OFF_SLAB sub-slab cache to work properly. Create the non power of two caches immediately after the prior power of two kmalloc cache. Do not create the non power of two caches before all other caches. Reported-and-tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Christoph Lamete <cl@linux.com> Link: http://lkml.kernel.org/r/201305040348.CIF81716.OStQOHFJMFLOVF@I-love.SAKURA.ne.jp Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slab: Return NULL for oversized allocationsChristoph Lameter2013-05-061-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The inline path seems to have changed the SLAB behavior for very large kmalloc allocations with commit e3366016 ("slab: Use common kmalloc_index/kmalloc_size functions"). This patch restores the old behavior but also adds diagnostics so that we can figure where in the code these large allocations occur. Reported-and-tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Christoph Lameter <cl@linux.com> Link: http://lkml.kernel.org/r/201305040348.CIF81716.OStQOHFJMFLOVF@I-love.SAKURA.ne.jp [ penberg@kernel.org: use WARN_ON_ONCE ] Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | mm: slab: Verify the nodeid passed to ____cache_alloc_nodeAaron Tomlin2013-05-011-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the nodeid is > num_online_nodes() this can cause an Oops and a panic(). The purpose of this patch is to assert if this condition is true to aid debugging efforts rather than some random NULL pointer dereference or page fault. This patch is in response to BZ#42967 [1]. Using VM_BUG_ON so it's used only when CONFIG_DEBUG_VM is set, given that ____cache_alloc_node() is a hot code path. [1]: https://bugzilla.kernel.org/show_bug.cgi?id=42967 Signed-off-by: Aaron Tomlin <atomlin@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Rafael Aquini <aquini@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slub: tid must be retrieved from the percpu area of the current processorChristoph Lameter2013-04-051-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As Steven Rostedt has pointer out: rescheduling could occur on a different processor after the determination of the per cpu pointer and before the tid is retrieved. This could result in allocation from the wrong node in slab_alloc(). The effect is much more severe in slab_free() where we could free to the freelist of the wrong page. The window for something like that occurring is pretty small but it is possible. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slub: Do not dereference NULL pointer in node_matchChristoph Lameter2013-04-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The variables accessed in slab_alloc are volatile and therefore the page pointer passed to node_match can be NULL. The processing of data in slab_alloc is tentative until either the cmpxhchg succeeds or the __slab_alloc slowpath is invoked. Both are able to perform the same allocation from the freelist. Check for the NULL pointer in node_match. A false positive will lead to a retry of the loop in __slab_alloc. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slub: add 'likely' macro to inc_slabs_node()Joonsoo Kim2013-04-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After boot phase, 'n' always exist. So add 'likely' macro for helping compiler. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slub: correct to calculate num of acquired objects in get_partial_node()Joonsoo Kim2013-04-021-8/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a subtle bug when calculating a number of acquired objects. Currently, we calculate "available = page->objects - page->inuse", after acquire_slab() is called in get_partial_node(). In acquire_slab() with mode = 1, we always set new.inuse = page->objects. So, acquire_slab(s, n, page, object == NULL); if (!object) { c->page = page; stat(s, ALLOC_FROM_PARTIAL); object = t; available = page->objects - page->inuse; !!! availabe is always 0 !!! ... Therfore, "available > s->cpu_partial / 2" is always false and we always go to second iteration. This patch correct this problem. After that, we don't need return value of put_cpu_partial(). So remove it. Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slub: correctly bootstrap boot cachesGlauber Costa2013-02-281-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After we create a boot cache, we may allocate from it until it is bootstraped. This will move the page from the partial list to the cpu slab list. If this happens, the loop: list_for_each_entry(p, &n->partial, lru) that we use to scan for all partial pages will yield nothing, and the pages will keep pointing to the boot cpu cache, which is of course, invalid. To do that, we should flush the cache to make sure that the cpu slab is back to the partial list. Signed-off-by: Glauber Costa <glommer@parallels.com> Reported-by: Steffen Michalke <StMichalke@web.de> Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | mm/sl[au]b: correct allocation type check in kmalloc_slab()Joonsoo Kim2013-02-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit "slab: Common Kmalloc cache determination" made mistake in kmalloc_slab(). SLAB_CACHE_DMA is for kmem_cache creation, not for allocation. For allocation, we should use GFP_XXX to identify type of allocation. So, change SLAB_CACHE_DMA to GFP_DMA. Acked-by: Christoph Lameter <cl@linux.com> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slab: Fixup CONFIG_PAGE_ALLOC/DEBUG_SLAB_LEAK sectionsChristoph Lameter2013-02-061-14/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Variables were not properly converted and the conversion caused a naming conflict. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slab: Common definition for kmem_cache_nodeChristoph Lameter2013-02-012-17/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Put the definitions for the kmem_cache_node structures together so that we have one structure. That will allow us to create more common fields in the future which could yield more opportunities to share code. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slab: Rename list3/l3 to nodeChristoph Lameter2013-02-012-259/+259
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The list3 or l3 pointers are pointing to per node structures. Reflect that in the names of variables used. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slab: Common Kmalloc cache determinationChristoph Lameter2013-02-014-142/+114
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Extract the optimized lookup functions from slub and put them into slab_common.c. Then make slab use these functions as well. Joonsoo notes that this fixes some issues with constant folding which also reduces the code size for slub. https://lkml.org/lkml/2012/10/20/82 Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slab: Common function to create the kmalloc arrayChristoph Lameter2013-02-014-99/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kmalloc array is created in similar ways in both SLAB and SLUB. Create a common function and have both allocators call that function. V1->V2: Whitespace cleanup Reviewed-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slab: Common definition for the array of kmalloc cachesChristoph Lameter2013-02-013-15/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Have a common definition fo the kmalloc cache arrays in SLAB and SLUB Acked-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slab: Common constants for kmalloc boundariesChristoph Lameter2013-02-011-11/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Standardize the constants that describe the smallest and largest object kept in the kmalloc arrays for SLAB and SLUB. Differentiate between the maximum size for which a slab cache is used (KMALLOC_MAX_CACHE_SIZE) and the maximum allocatable size (KMALLOC_MAX_SIZE, KMALLOC_MAX_ORDER). Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slab: Rename nodelists to nodeChristoph Lameter2013-02-011-68/+67
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Have a common naming between both slab caches for future changes. Acked-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slab: Common name for the per node structuresChristoph Lameter2013-02-011-44/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rename the structure used for the per node structures in slab to have a name that expresses that fact. Acked-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slab: Use common kmalloc_index/kmalloc_size functionsChristoph Lameter2013-02-011-93/+76
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make slab use the common functions. We can get rid of a lot of old ugly stuff as a results. Among them the sizes array and the weird include/linux/kmalloc_sizes file and some pretty bad #include statements in slab_def.h. The one thing that is different in slab is that the 32 byte cache will also be created for arches that have page sizes larger than 4K. There are numerous smaller allocations that SLOB and SLUB can handle better because of their support for smaller allocation sizes so lets keep the 32 byte slab also for arches with > 4K pages. Reviewed-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
| | * | | slab: Use proper formatting specs for unsigned size_tChristoph Lameter2013-02-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
* | | | | Merge branch 'for-linus' of ↵Linus Torvalds2013-05-013-10/+40
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS updates from Al Viro, Misc cleanups all over the place, mainly wrt /proc interfaces (switch create_proc_entry to proc_create(), get rid of the deprecated create_proc_read_entry() in favor of using proc_create_data() and seq_file etc). 7kloc removed. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits) don't bother with deferred freeing of fdtables proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h proc: Make the PROC_I() and PDE() macros internal to procfs proc: Supply a function to remove a proc entry by PDE take cgroup_open() and cpuset_open() to fs/proc/base.c ppc: Clean up scanlog ppc: Clean up rtas_flash driver somewhat hostap: proc: Use remove_proc_subtree() drm: proc: Use remove_proc_subtree() drm: proc: Use minor->index to label things, not PDE->name drm: Constify drm_proc_list[] zoran: Don't print proc_dir_entry data in debug reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show() proc: Supply an accessor for getting the data from a PDE's parent airo: Use remove_proc_subtree() rtl8192u: Don't need to save device proc dir PDE rtl8187se: Use a dir under /proc/net/r8180/ proc: Add proc_mkdir_data() proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h} proc: Move PDE_NET() to fs/proc/proc_net.c ...
| * \ \ \ \ Merge branch 'vfree' into for-nextAl Viro2013-05-011-5/+40
| |\ \ \ \ \
| | * | | | | make vfree() safe to call from interrupt contextsAl Viro2013-03-101-5/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A bunch of RCU callbacks want to be able to do vfree() and end up with rather kludgy schemes. Just let vfree() do the right thing - put the victim on llist and schedule actual __vunmap() via schedule_work(), so that it runs from non-interrupt context. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | | | lift sb_start_write() out of ->write()Al Viro2013-04-091-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | | | lift sb_start_write/sb_end_write out of ->aio_write()Al Viro2013-04-091-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | | | | | Merge branch 'for-linus' of ↵Linus Torvalds2013-05-013-26/+4
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal Pull compat cleanup from Al Viro: "Mostly about syscall wrappers this time; there will be another pile with patches in the same general area from various people, but I'd rather push those after both that and vfs.git pile are in." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: syscalls.h: slightly reduce the jungles of macros get rid of union semop in sys_semctl(2) arguments make do_mremap() static sparc: no need to sign-extend in sync_file_range() wrapper ppc compat wrappers for add_key(2) and request_key(2) are pointless x86: trim sys_ia32.h x86: sys32_kill and sys32_mprotect are pointless get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC merge compat sys_ipc instances consolidate compat lookup_dcookie() convert vmsplice to COMPAT_SYSCALL_DEFINE switch getrusage() to COMPAT_SYSCALL_DEFINE switch epoll_pwait to COMPAT_SYSCALL_DEFINE convert sendfile{,64} to COMPAT_SYSCALL_DEFINE switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect make HAVE_SYSCALL_WRAPPERS unconditional consolidate cond_syscall and SYSCALL_ALIAS declarations teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long get rid of duplicate logics in __SC_....[1-6] definitions
| * | | | | | | make do_mremap() staticAl Viro2013-03-041-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The extern in sys_sparc_64.c was a rudiment of time when do_mremap() used to exist in MMU case (it doesn't anymore). As for !MMU one, nothing uses it outside of mm/nommu.c... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | | | | teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long longAl Viro2013-03-032-24/+3
| | |/ / / / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ... and convert a bunch of SYSCALL_DEFINE ones to SYSCALL_DEFINE<n>, killing the boilerplate crap around them. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | | | | | mm: cleancache: clean up cleancache_enabledBob Liu2013-04-301-11/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cleancache_ops is used to decide whether backend is registered. So now cleancache_enabled is always true if defined CONFIG_CLEANCACHE. Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Stefan Hengelein <ilendir@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | | | cleancache: Make cleancache_init use a pointer for the opsKonrad Rzeszutek Wilk2013-04-301-30/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of using a backend_registered to determine whether a backend is enabled. This allows us to remove the backend_register check and just do 'if (cleancache_ops)' [v1: Rebase on top of b97c4b430b0a (ramster->zcache move] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Stefan Hengelein <ilendir@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | | | mm: cleancache: lazy initialization to allow tmem backends to build/run as ↵Dan Magenheimer2013-04-301-21/+219
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | modules With the goal of allowing tmem backends (zcache, ramster, Xen tmem) to be built/loaded as modules rather than built-in and enabled by a boot parameter, this patch provides "lazy initialization", allowing backends to register to cleancache even after filesystems were mounted. Calls to init_fs and init_shared_fs are remembered as fake poolids but no real tmem_pools created. On backend registration the fake poolids are mapped to real poolids and respective tmem_pools. Signed-off-by: Stefan Hengelein <ilendir@googlemail.com> Signed-off-by: Florian Schmaus <fschmaus@gmail.com> Signed-off-by: Andor Daam <andor.daam@googlemail.com> Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com> [v1: Minor fixes: used #define for some values and bools] [v2: Removed CLEANCACHE_HAS_LAZY_INIT] [v3: Added more comments, added a lock for [shared_|]fs_poolid_map] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
OpenPOWER on IntegriCloud