summaryrefslogtreecommitdiffstats
path: root/include/linux/fs.h
Commit message (Collapse)AuthorAgeFilesLines
...
* | Merge tag 'locks-v3.20-2' of git://git.samba.org/jlayton/linuxLinus Torvalds2015-02-181-3/+0
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | Pull file locking fixes from Jeff Layton: "A small set of patches to fix problems with the recent file locking changes that we discussed earlier this week" " * tag 'locks-v3.20-2' of git://git.samba.org/jlayton/linux: locks: fix list insertion when lock is split in two locks: remove conditional lock release in middle of flock_lock_file locks: only remove leases associated with the file being closed Revert "locks: keep a count of locks on the flctx lists"
| * Revert "locks: keep a count of locks on the flctx lists"Jeff Layton2015-02-161-3/+0
| | | | | | | | | | | | | | | | | | | | | | This reverts commit 9bd0f45b7037fcfa8b575c7e27d0431d6e6dc3bb. Linus rightly pointed out that I failed to initialize the counters when adding them, so they don't work as expected. Just revert this patch for now. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
* | Merge branch 'lazytime' of ↵Linus Torvalds2015-02-171-0/+10
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull lazytime mount option support from Al Viro: "Lazytime stuff from tytso" * 'lazytime' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ext4: add optimization for the lazytime mount option vfs: add find_inode_nowait() function vfs: add support for a lazytime mount option
| * | vfs: add find_inode_nowait() functionTheodore Ts'o2015-02-051-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new function find_inode_nowait() which is an even more general version of ilookup5_nowait(). It is designed for callers which need very fine grained control over when the function is allowed to block or increment the inode's reference count. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | vfs: add support for a lazytime mount optionTheodore Ts'o2015-02-051-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new mount option which enables a new "lazytime" mode. This mode causes atime, mtime, and ctime updates to only be made to the in-memory version of the inode. The on-disk times will only get updated when (a) if the inode needs to be updated for some non-time related change, (b) if userspace calls fsync(), syncfs() or sync(), or (c) just before an undeleted inode is evicted from memory. This is OK according to POSIX because there are no guarantees after a crash unless userspace explicitly requests via a fsync(2) call. For workloads which feature a large number of random write to a preallocated file, the lazytime mount option significantly reduces writes to the inode table. The repeated 4k writes to a single block will result in undesirable stress on flash devices and SMR disk drives. Even on conventional HDD's, the repeated writes to the inode table block will trigger Adjacent Track Interference (ATI) remediation latencies, which very negatively impact long tail latencies --- which is a very big deal for web serving tiers (for example). Google-Bug-Id: 18297052 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | Merge branch 'iov_iter' of ↵Linus Torvalds2015-02-171-0/+3
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull iov_iter updates from Al Viro: "More iov_iter work - missing counterpart of iov_iter_init() for bvec-backed ones and vfs_read_iter()/vfs_write_iter() - wrappers for sync calls of ->read_iter()/->write_iter()" * 'iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: add vfs_iter_{read,write} helpers new helper: iov_iter_bvec()
| * | | fs: add vfs_iter_{read,write} helpersChristoph Hellwig2015-01-291-0/+3
| |/ / | | | | | | | | | | | | | | | | | | Simple helpers that pass an arbitrary iov_iter to filesystems. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | Merge branch 'getname2' of ↵Linus Torvalds2015-02-171-7/+2
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull getname/putname updates from Al Viro: "Rework of getname/getname_kernel/etc., mostly from Paul Moore. Gets rid of quite a pile of kludges between namei and audit..." * 'getname2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: audit: replace getname()/putname() hacks with reference counters audit: fix filename matching in __audit_inode() and __audit_inode_child() audit: enable filename recording via getname_kernel() simpler calling conventions for filename_mountpoint() fs: create proper filename objects using getname_kernel() fs: rework getname_kernel to handle up to PATH_MAX sized filenames cut down the number of do_path_lookup() callers
| * | | audit: replace getname()/putname() hacks with reference countersPaul Moore2015-01-231-7/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to ensure that filenames are not released before the audit subsystem is done with the strings there are a number of hacks built into the fs and audit subsystems around getname() and putname(). To say these hacks are "ugly" would be kind. This patch removes the filename hackery in favor of a more conventional reference count based approach. The diffstat below tells most of the story; lots of audit/fs specific code is replaced with a traditional reference count based approach that is easily understood, even by those not familiar with the audit and/or fs subsystems. CC: viro@zeniv.linux.org.uk CC: linux-fsdevel@vger.kernel.org Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | | dax: add dax_zero_page_rangeMatthew Wilcox2015-02-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This new function allows us to support hole-punch for DAX files by zeroing a partial page, as opposed to the dax_truncate_page() function which can only truncate to the end of the page. Reimplement dax_truncate_page() to call dax_zero_page_range(). [ross.zwisler@linux.intel.com: ported to 3.13-rc2] [akpm@linux-foundation.org: fix typos in comments] Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Andreas Dilger <andreas.dilger@intel.com> Cc: Boaz Harrosh <boaz@plexistor.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | vfs,ext2: remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAXMatthew Wilcox2015-02-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The fewer Kconfig options we have the better. Use the generic CONFIG_FS_DAX to enable XIP support in ext2 as well as in the core. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Andreas Dilger <andreas.dilger@intel.com> Cc: Boaz Harrosh <boaz@plexistor.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | vfs: remove get_xip_memMatthew Wilcox2015-02-161-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All callers of get_xip_mem() are now gone. Remove checks for it, initialisers of it, documentation of it and the only implementation of it. Also remove mm/filemap_xip.c as it is now empty. Also remove documentation of the long-gone get_xip_page(). Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Andreas Dilger <andreas.dilger@intel.com> Cc: Boaz Harrosh <boaz@plexistor.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | dax,ext2: replace xip_truncate_page with dax_truncate_pageMatthew Wilcox2015-02-161-9/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It takes a get_block parameter just like nobh_truncate_page() and block_truncate_page() Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andreas Dilger <andreas.dilger@intel.com> Cc: Boaz Harrosh <boaz@plexistor.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | dax,ext2: replace the XIP page fault handler with the DAX page fault handlerMatthew Wilcox2015-02-161-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of calling aops->get_xip_mem from the fault handler, the filesystem passes a get_block_t that is used to find the appropriate blocks. This requires that all architectures implement copy_user_page(). At the time of writing, mips and arm do not. Patches exist and are in progress. [akpm@linux-foundation.org: remap_file_pages went away] Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Andreas Dilger <andreas.dilger@intel.com> Cc: Boaz Harrosh <boaz@plexistor.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | dax,ext2: replace ext2_clear_xip_target with dax_clear_blocksMatthew Wilcox2015-02-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is practically generic code; other filesystems will want to call it from other places, but there's nothing ext2-specific about it. Make it a little more generic by allowing it to take a count of the number of bytes to zero rather than fixing it to a single page. Thanks to Dave Hansen for suggesting that I need to call cond_resched() if zeroing more than one page. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Andreas Dilger <andreas.dilger@intel.com> Cc: Boaz Harrosh <boaz@plexistor.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | dax,ext2: replace XIP read and write with DAX I/OMatthew Wilcox2015-02-161-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the generic AIO infrastructure instead of custom read and write methods. In addition to giving us support for AIO, this adds the missing locking between read() and truncate(). Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Andreas Dilger <andreas.dilger@intel.com> Cc: Boaz Harrosh <boaz@plexistor.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | vfs,ext2: introduce IS_DAX(inode)Matthew Wilcox2015-02-161-0/+6
| |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use an inode flag to tag inodes which should avoid using the page cache. Convert ext2 to use it instead of mapping_is_xip(). Prevent I/Os to files tagged with the DAX flag from falling back to buffered I/O. Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andreas Dilger <andreas.dilger@intel.com> Cc: Boaz Harrosh <boaz@plexistor.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2015-02-121-2/+4
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge third set of updates from Andrew Morton: - the rest of MM [ This includes getting rid of the numa hinting bits, in favor of just generic protnone logic. Yay. - Linus ] - core kernel - procfs - some of lib/ (lots of lib/ material this time) * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (104 commits) lib/lcm.c: replace include lib/percpu_ida.c: remove redundant includes lib/strncpy_from_user.c: replace module.h include lib/stmp_device.c: replace module.h include lib/sort.c: move include inside #if 0 lib/show_mem.c: remove redundant include lib/radix-tree.c: change to simpler include lib/plist.c: remove redundant include lib/nlattr.c: remove redundant include lib/kobject_uevent.c: remove redundant include lib/llist.c: remove redundant include lib/md5.c: simplify include lib/list_sort.c: rearrange includes lib/genalloc.c: remove redundant include lib/idr.c: remove redundant include lib/halfmd4.c: simplify includes lib/dynamic_queue_limits.c: simplify includes lib/sort.c: use simpler includes lib/interval_tree.c: simplify includes hexdump: make it return number of bytes placed in buffer ...
| * | | fs: consolidate {nr,free}_cached_objects args in shrink_controlVladimir Davydov2015-02-121-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We are going to make FS shrinkers memcg-aware. To achieve that, we will have to pass the memcg to scan to the nr_cached_objects and free_cached_objects VFS methods, which currently take only the NUMA node to scan. Since the shrink_control structure already holds the node, and the memcg to scan will be added to it when we introduce memcg-aware vmscan, let us consolidate the methods' arguments in this structure to keep things clean. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Suggested-by: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Glauber Costa <glommer@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | Merge branch 'for-3.20/bdi' of git://git.kernel.dk/linux-blockLinus Torvalds2015-02-121-4/+24
|\ \ \ \ | |/ / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull backing device changes from Jens Axboe: "This contains a cleanup of how the backing device is handled, in preparation for a rework of the life time rules. In this part, the most important change is to split the unrelated nommu mmap flags from it, but also removing a backing_dev_info pointer from the address_space (and inode), and a cleanup of other various minor bits. Christoph did all the work here, I just fixed an oops with pages that have a swap backing. Arnd fixed a missing export, and Oleg killed the lustre backing_dev_info from staging. Last patch was from Al, unexporting parts that are now no longer needed outside" * 'for-3.20/bdi' of git://git.kernel.dk/linux-block: Make super_blocks and sb_lock static mtd: export new mtd_mmap_capabilities fs: make inode_to_bdi() handle NULL inode staging/lustre/llite: get rid of backing_dev_info fs: remove default_backing_dev_info fs: don't reassign dirty inodes to default_backing_dev_info nfs: don't call bdi_unregister ceph: remove call to bdi_unregister fs: remove mapping->backing_dev_info fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info nilfs2: set up s_bdi like the generic mount_bdev code block_dev: get bdev inode bdi directly from the block device block_dev: only write bdev inode on close fs: introduce f_op->mmap_capabilities for nommu mmap support fs: kill BDI_CAP_SWAP_BACKED fs: deduplicate noop_backing_dev_info
| * | | Make super_blocks and sb_lock staticAl Viro2015-02-021-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The only user outside of fs/super.c is gone now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | fs: remove mapping->backing_dev_infoChristoph Hellwig2015-01-201-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that we never use the backing_dev_info pointer in struct address_space we can simply remove it and save 4 to 8 bytes in every inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Reviewed-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
| * | | fs: introduce f_op->mmap_capabilities for nommu mmap supportChristoph Hellwig2015-01-201-0/+23
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since "BDI: Provide backing device capability information [try #3]" the backing_dev_info structure also provides flags for the kind of mmap operation available in a nommu environment, which is entirely unrelated to it's original purpose. Introduce a new nommu-only file operation to provide this information to the nommu mmap code instead. Splitting this from the backing_dev_info structure allows to remove lots of backing_dev_info instance that aren't otherwise needed, and entirely gets rid of the concept of providing a backing_dev_info for a character device. It also removes the need for the mtd_inodefs filesystem. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Brian Norris <computersforpeace@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
* | | Merge branch 'for-3.20' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2015-02-121-0/+16
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull nfsd updates from Bruce Fields: "The main change is the pNFS block server support from Christoph, which allows an NFS client connected to shared disk to do block IO to the shared disk in place of NFS reads and writes. This also requires xfs patches, which should arrive soon through the xfs tree, barring unexpected problems. Support for other filesystems is also possible if there's interest. Thanks also to Chuck Lever for continuing work to get NFS/RDMA into shape" * 'for-3.20' of git://linux-nfs.org/~bfields/linux: (32 commits) nfsd: default NFSv4.2 to on nfsd: pNFS block layout driver exportfs: add methods for block layout exports nfsd: add trace events nfsd: update documentation for pNFS support nfsd: implement pNFS layout recalls nfsd: implement pNFS operations nfsd: make find_any_file available outside nfs4state.c nfsd: make find/get/put file available outside nfs4state.c nfsd: make lookup/alloc/unhash_stid available outside nfs4state.c nfsd: add fh_fsid_match helper nfsd: move nfsd_fh_match to nfsfh.h fs: add FL_LAYOUT lease type fs: track fl_owner for leases nfs: add LAYOUT_TYPE_MAX enum value nfsd: factor out a helper to decode nfstime4 values sunrpc/lockd: fix references to the BKL nfsd: fix year-2038 nfs4 state problem svcrdma: Handle additional inline content svcrdma: Move read list XDR round-up logic ...
| * | | fs: add FL_LAYOUT lease typeChristoph Hellwig2015-02-021-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This (ab-)uses the file locking code to allow filesystems to recall outstanding pNFS layouts on a file. This new lease type is similar but not quite the same as FL_DELEG. A FL_LAYOUT lease can always be granted, an a per-filesystem lock (XFS iolock for the initial implementation) ensures not FL_LAYOUT leases granted when we would need to recall them. Also included are changes that allow multiple outstanding read leases of different types on the same file as long as they have a differnt owner. This wasn't a problem until now as nfsd never set FL_LEASE leases, and no one else used FL_DELEG leases, but given that nfsd will also issues FL_LAYOUT leases we will have to handle it now. Signed-off-by: Christoph Hellwig <hch@lst.de>
* | | | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2015-02-101-5/+1
|\ \ \ \ | |/ / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge misc updates from Andrew Morton: "Bite-sized chunks this time, to avoid the MTA ratelimiting woes. - fs/notify updates - ocfs2 - some of MM" That laconic "some MM" is mainly the removal of remap_file_pages(), which is a big simplification of the VM, and which gets rid of a *lot* of random cruft and special cases because we no longer support the non-linear mappings that it used. From a user interface perspective, nothing has changed, because the remap_file_pages() syscall still exists, it's just done by emulating the old behavior by creating a lot of individual small mappings instead of one non-linear one. The emulation is slower than the old "native" non-linear mappings, but nobody really uses or cares about remap_file_pages(), and simplifying the VM is a big advantage. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (78 commits) memcg: zap memcg_slab_caches and memcg_slab_mutex memcg: zap memcg_name argument of memcg_create_kmem_cache memcg: zap __memcg_{charge,uncharge}_slab mm/page_alloc.c: place zone_id check before VM_BUG_ON_PAGE check mm: hugetlb: fix type of hugetlb_treat_as_movable variable mm, hugetlb: remove unnecessary lower bound on sysctl handlers"? mm: memory: merge shared-writable dirtying branches in do_wp_page() mm: memory: remove ->vm_file check on shared writable vmas xtensa: drop _PAGE_FILE and pte_file()-related helpers x86: drop _PAGE_FILE and pte_file()-related helpers unicore32: drop pte_file()-related helpers um: drop _PAGE_FILE and pte_file()-related helpers tile: drop pte_file()-related helpers sparc: drop pte_file()-related helpers sh: drop _PAGE_FILE and pte_file()-related helpers score: drop _PAGE_FILE and pte_file()-related helpers s390: drop pte_file()-related helpers parisc: drop _PAGE_FILE and pte_file()-related helpers openrisc: drop _PAGE_FILE and pte_file()-related helpers nios2: drop _PAGE_FILE and pte_file()-related helpers ...
| * | | rmap: drop support of non-linear mappingsKirill A. Shutemov2015-02-101-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We don't create non-linear mappings anymore. Let's drop code which handles them in rmap. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | mm: drop vm_ops->remap_pages and generic_file_remap_pages() stubKirill A. Shutemov2015-02-101-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Nobody uses it anymore. [akpm@linux-foundation.org: fix filemap_xip.c] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | mm: replace remap_file_pages() syscall with emulationKirill A. Shutemov2015-02-101-2/+6
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | remap_file_pages(2) was invented to be able efficiently map parts of huge file into limited 32-bit virtual address space such as in database workloads. Nonlinear mappings are pain to support and it seems there's no legitimate use-cases nowadays since 64-bit systems are widely available. Let's drop it and get rid of all these special-cased code. The patch replaces the syscall with emulation which creates new VMA on each remap_file_pages(), unless they it can be merged with an adjacent one. I didn't find *any* real code that uses remap_file_pages(2) to test emulation impact on. I've checked Debian code search and source of all packages in ALT Linux. No real users: libc wrappers, mentions in strace, gdb, valgrind and this kind of stuff. There are few basic tests in LTP for the syscall. They work just fine with emulation. To test performance impact, I've written small test case which demonstrate pretty much worst case scenario: map 4G shmfs file, write to begin of every page pgoff of the page, remap pages in reverse order, read every page. The test creates 1 million of VMAs if emulation is in use, so I had to set vm.max_map_count to 1100000 to avoid -ENOMEM. Before: 23.3 ( +- 4.31% ) seconds After: 43.9 ( +- 0.85% ) seconds Slowdown: 1.88x I believe we can live with that. Test case: #define _GNU_SOURCE #include <assert.h> #include <stdlib.h> #include <stdio.h> #include <sys/mman.h> #define MB (1024UL * 1024) #define SIZE (4096 * MB) int main(int argc, char **argv) { unsigned long *p; long i, pass; for (pass = 0; pass < 10; pass++) { p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); if (p == MAP_FAILED) { perror("mmap"); return -1; } for (i = 0; i < SIZE / 4096; i++) p[i * 4096 / sizeof(*p)] = i; for (i = 0; i < SIZE / 4096; i++) { if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096, 0, (SIZE - 4096 * (i + 1)) >> 12, 0)) { perror("remap_file_pages"); return -1; } } for (i = SIZE / 4096 - 1; i >= 0; i--) assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1); munmap(p, SIZE); } return 0; } [akpm@linux-foundation.org: fix spello] [sasha.levin@oracle.com: initialize populate before usage] [sasha.levin@oracle.com: grab file ref to prevent race while mmaping] Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Dave Jones <davej@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Armin Rigo <arigo@tunes.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | locks: update comments that refer to inode->i_flockJeff Layton2015-01-211-9/+10
| | | | | | | | | | | | Signed-off-by: Jeff Layton <jlayton@primarydata.com>
* | | locks: keep a count of locks on the flctx listsJeff Layton2015-01-161-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | This makes things a bit more efficient in the cifs and ceph lock pushing code. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Acked-by: Christoph Hellwig <hch@lst.de>
* | | locks: clean up the lm_change prototypeJeff Layton2015-01-161-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | Now that we use standard list_heads for tracking leases, we can have lm_change take a pointer to the lease to be modified instead of a double pointer. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Acked-by: Christoph Hellwig <hch@lst.de>
* | | locks: add a dedicated spinlock to protect i_flctx listsJeff Layton2015-01-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | We can now add a dedicated spinlock without expanding struct inode. Change to using that to protect the various i_flctx lists. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Acked-by: Christoph Hellwig <hch@lst.de>
* | | locks: remove i_flock field from struct inodeJeff Layton2015-01-161-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | Nothing uses it anymore. Also add a forward declaration for struct file_lock to silence some compiler warnings that the removal triggers. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Acked-by: Christoph Hellwig <hch@lst.de>
* | | locks: convert lease handling to file_lock_contextJeff Layton2015-01-161-2/+3
| | | | | | | | | | | | | | | Signed-off-by: Jeff Layton <jlayton@primarydata.com> Acked-by: Christoph Hellwig <hch@lst.de>
* | | locks: convert posix locks to file_lock_contextJeff Layton2015-01-161-1/+2
| | | | | | | | | | | | | | | Signed-off-by: Jeff Layton <jlayton@primarydata.com> Acked-by: Christoph Hellwig <hch@lst.de>
* | | locks: add a new struct file_locking_context pointer to struct inodeJeff Layton2015-01-161-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current scheme of using the i_flock list is really difficult to manage. There is also a legitimate desire for a per-inode spinlock to manage these lists that isn't the i_lock. Start conversion to a new scheme to eventually replace the old i_flock list with a new "file_lock_context" object. We start by adding a new i_flctx to struct inode. For now, it lives in parallel with i_flock list, but will eventually replace it. The idea is to allocate a structure to sit in that pointer and act as a locus for all things file locking. We allocate a file_lock_context for an inode when the first lock is added to it, and it's only freed when the inode is freed. We use the i_lock to protect the assignment, but afterward it should mostly be accessed locklessly. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Acked-by: Christoph Hellwig <hch@lst.de>
* | | locks: add new struct list_head to struct file_lockJeff Layton2015-01-161-0/+1
|/ / | | | | | | | | | | | | | | | | ...that we can use to queue file_locks to per-ctx list_heads. Go ahead and convert locks_delete_lock and locks_dispose_list to use it instead of the fl_block list. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Acked-by: Christoph Hellwig <hch@lst.de>
* / vfs: renumber FMODE_NONOTIFY and add to uniqueness checkDavid Drysdale2015-01-081-1/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix clashing values for O_PATH and FMODE_NONOTIFY on sparc. The clashing O_PATH value was added in commit 5229645bdc35 ("vfs: add nonconflicting values for O_PATH") but this can't be changed as it is user-visible. FMODE_NONOTIFY is only used internally in the kernel, but it is in the same numbering space as the other O_* flags, as indicated by the comment at the top of include/uapi/asm-generic/fcntl.h (and its use in fs/notify/fanotify/fanotify_user.c). So renumber it to avoid the clash. All of this has happened before (commit 12ed2e36c98a: "fanotify: FMODE_NONOTIFY and __O_SYNC in sparc conflict"), and all of this will happen again -- so update the uniqueness check in fcntl_init() to include __FMODE_NONOTIFY. Signed-off-by: David Drysdale <drysdale@google.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Jan Kara <jack@suse.cz> Cc: Heinrich Schuchardt <xypron.glpk@gmx.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of ↵Linus Torvalds2014-12-161-1/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile #2 from Al Viro: "Next pile (and there'll be one or two more). The large piece in this one is getting rid of /proc/*/ns/* weirdness; among other things, it allows to (finally) make nameidata completely opaque outside of fs/namei.c, making for easier further cleanups in there" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: coda_venus_readdir(): use file_inode() fs/namei.c: fold link_path_walk() call into path_init() path_init(): don't bother with LOOKUP_PARENT in argument fs/namei.c: new helper (path_cleanup()) path_init(): store the "base" pointer to file in nameidata itself make default ->i_fop have ->open() fail with ENXIO make nameidata completely opaque outside of fs/namei.c kill proc_ns completely take the targets of /proc/*/ns/* symlinks to separate fs bury struct proc_ns in fs/proc copy address of proc_ns_ops into ns_common new helpers: ns_alloc_inum/ns_free_inum make proc_ns_operations work with struct ns_common * instead of void * switch the rest of proc_ns_operations to working with &...->ns netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns common object embedded into various struct ....ns
| * make default ->i_fop have ->open() fail with ENXIOAl Viro2014-12-101-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As it is, default ->i_fop has NULL ->open() (along with all other methods). The only case where it matters is reopening (via procfs symlink) a file that didn't get its ->f_op from ->i_fop - anything else will have ->i_fop assigned to something sane (default would fail on read/write/ioctl/etc.). Unfortunately, such case exists - alloc_file() users, especially anon_get_file() ones. There we have tons of opened files of very different kinds sharing the same inode. As the result, attempt to reopen those via procfs succeeds and you get a descriptor you can't do anything with. Moreover, in case of sockets we set ->i_fop that will only be used on such reopen attempts - and put a failing ->open() into it to make sure those do not succeed. It would be simpler to put such ->open() into default ->i_fop and leave it unchanged both for anon inode (as we do anyway) and for socket ones. Result: * everything going through do_dentry_open() works as it used to * sock_no_open() kludge is gone * attempts to reopen anon-inode files fail as they really ought to * ditto for aio_private_file() * ditto for perfmon - this one actually tried to imitate sock_no_open() trick, but failed to set ->i_fop, so in the current tree reopens succeed and yield completely useless descriptor. Intent clearly had been to fail with -ENXIO on such reopens; now it actually does. * everything else that used alloc_file() keeps working - it has ->i_fop set for its inodes anyway Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge branch 'for-3.19' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2014-12-161-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull nfsd updates from Bruce Fields: "A comparatively quieter cycle for nfsd this time, but still with two larger changes: - RPC server scalability improvements from Jeff Layton (using RCU instead of a spinlock to find idle threads). - server-side NFSv4.2 ALLOCATE/DEALLOCATE support from Anna Schumaker, enabling fallocate on new clients" * 'for-3.19' of git://linux-nfs.org/~bfields/linux: (32 commits) nfsd4: fix xdr4 count of server in fs_location4 nfsd4: fix xdr4 inclusion of escaped char sunrpc/cache: convert to use string_escape_str() sunrpc: only call test_bit once in svc_xprt_received fs: nfsd: Fix signedness bug in compare_blob sunrpc: add some tracepoints around enqueue and dequeue of svc_xprt sunrpc: convert to lockless lookup of queued server threads sunrpc: fix potential races in pool_stats collection sunrpc: add a rcu_head to svc_rqst and use kfree_rcu to free it sunrpc: require svc_create callers to pass in meaningful shutdown routine sunrpc: have svc_wake_up only deal with pool 0 sunrpc: convert sp_task_pending flag to use atomic bitops sunrpc: move rq_cachetype field to better optimize space sunrpc: move rq_splice_ok flag into rq_flags sunrpc: move rq_dropme flag into rq_flags sunrpc: move rq_usedeferral flag to rq_flags sunrpc: move rq_local field to rq_flags sunrpc: add a generic rq_flags field to svc_rqst and move rq_secure to it nfsd: minor off by one checks in __write_versions() sunrpc: release svc_pool_map reference when serv allocation fails ...
| * \ merge nfs bugfixes into nfsd for-3.19 branchJ. Bruce Fields2014-11-191-3/+46
| |\ \ | | | | | | | | | | | | | | | | In addition to nfsd bugfixes, there are some fixes in -rc5 for client bugs that can interfere with my testing.
| * | | VFS: Rename do_fallocate() to vfs_fallocate()Anna Schumaker2014-11-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This function needs to be exported so it can be used by the NFSD module when responding to the new ALLOCATE and DEALLOCATE operations in NFS v4.2. Christoph Hellwig suggested renaming the function to stay consistent with how other vfs functions are named. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* | | | Merge branch 'next' of ↵Linus Torvalds2014-12-141-0/+1
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security layer updates from James Morris: "In terms of changes, there's general maintenance to the Smack, SELinux, and integrity code. The IMA code adds a new kconfig option, IMA_APPRAISE_SIGNED_INIT, which allows IMA appraisal to require signatures. Support for reading keys from rootfs before init is call is also added" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (23 commits) selinux: Remove security_ops extern security: smack: fix out-of-bounds access in smk_parse_smack() VFS: refactor vfs_read() ima: require signature based appraisal integrity: provide a hook to load keys when rootfs is ready ima: load x509 certificate from the kernel integrity: provide a function to load x509 certificate from the kernel integrity: define a new function integrity_read_file() Security: smack: replace kzalloc with kmem_cache for inode_smack Smack: Lock mode for the floor and hat labels ima: added support for new kernel cmdline parameter ima_template_fmt ima: allocate field pointers array on demand in template_desc_init_fields() ima: don't allocate a copy of template_fmt in template_desc_init_fields() ima: display template format in meas. list if template name length is zero ima: added error messages to template-related functions ima: use atomic bit operations to protect policy update interface ima: ignore empty and with whitespaces policy lines ima: no need to allocate entry for comment ima: report policy load status ima: use path names cache ...
| * \ \ \ Merge branch 'next' of ↵James Morris2014-11-191-0/+1
| |\ \ \ \ | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity into next
| | * | | | VFS: refactor vfs_read()Dmitry Kasatkin2014-11-171-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | integrity_kernel_read() duplicates the file read operations code in vfs_read(). This patch refactors vfs_read() code creating a helper function __vfs_read(). It is used by both vfs_read() and integrity_kernel_read(). Signed-off-by: Dmitry Kasatkin <d.kasatkin@samsung.com> Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
* | | | | | Merge git://git.kvack.org/~bcrl/aio-nextLinus Torvalds2014-12-141-0/+1
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull aio updates from Benjamin LaHaise. * git://git.kvack.org/~bcrl/aio-next: aio: Skip timer for io_getevents if timeout=0 aio: Make it possible to remap aio ring
| * | | | | | aio: Make it possible to remap aio ringPavel Emelyanov2014-12-131-0/+1
| | |_|_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are actually two issues this patch addresses. Let me start with the one I tried to solve in the beginning. So, in the checkpoint-restore project (criu) we try to dump tasks' state and restore one back exactly as it was. One of the tasks' state bits is rings set up with io_setup() call. There's (almost) no problems in dumping them, there's a problem restoring them -- if I dump a task with aio ring originally mapped at address A, I want to restore one back at exactly the same address A. Unfortunately, the io_setup() does not allow for that -- it mmaps the ring at whatever place mm finds appropriate (it calls do_mmap_pgoff() with zero address and without the MAP_FIXED flag). To make restore possible I'm going to mremap() the freshly created ring into the address A (under which it was seen before dump). The problem is that the ring's virtual address is passed back to the user-space as the context ID and this ID is then used as search key by all the other io_foo() calls. Reworking this ID to be just some integer doesn't seem to work, as this value is already used by libaio as a pointer using which this library accesses memory for aio meta-data. So, to make restore work we need to make sure that a) ring is mapped at desired virtual address b) kioctx->user_id matches this value Having said that, the patch makes mremap() on aio region update the kioctx's user_id and mmap_base values. Here appears the 2nd issue I mentioned in the beginning of this mail. If (regardless of the C/R dances I do) someone creates an io context with io_setup(), then mremap()-s the ring and then destroys the context, the kill_ioctx() routine will call munmap() on wrong (old) address. This will result in a) aio ring remaining in memory and b) some other vma get unexpectedly unmapped. What do you think? Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
* | | | | | syscalls: implement execveat() system callDavid Drysdale2014-12-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patchset adds execveat(2) for x86, and is derived from Meredydd Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528). The primary aim of adding an execveat syscall is to allow an implementation of fexecve(3) that does not rely on the /proc filesystem, at least for executables (rather than scripts). The current glibc version of fexecve(3) is implemented via /proc, which causes problems in sandboxed or otherwise restricted environments. Given the desire for a /proc-free fexecve() implementation, HPA suggested (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be an appropriate generalization. Also, having a new syscall means that it can take a flags argument without back-compatibility concerns. The current implementation just defines the AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be added in future -- for example, flags for new namespaces (as suggested at https://lkml.org/lkml/2006/7/11/474). Related history: - https://lkml.org/lkml/2006/12/27/123 is an example of someone realizing that fexecve() is likely to fail in a chroot environment. - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered documenting the /proc requirement of fexecve(3) in its manpage, to "prevent other people from wasting their time". - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a problem where a process that did setuid() could not fexecve() because it no longer had access to /proc/self/fd; this has since been fixed. This patch (of 4): Add a new execveat(2) system call. execveat() is to execve() as openat() is to open(): it takes a file descriptor that refers to a directory, and resolves the filename relative to that. In addition, if the filename is empty and AT_EMPTY_PATH is specified, execveat() executes the file to which the file descriptor refers. This replicates the functionality of fexecve(), which is a system call in other UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and so relies on /proc being mounted). The filename fed to the executed program as argv[0] (or the name of the script fed to a script interpreter) will be of the form "/dev/fd/<fd>" (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively reflecting how the executable was found. This does however mean that execution of a script in a /proc-less environment won't work; also, script execution via an O_CLOEXEC file descriptor fails (as the file will not be accessible after exec). Based on patches by Meredydd Luff. Signed-off-by: David Drysdale <drysdale@google.com> Cc: Meredydd Luff <meredydd@senatehouse.org> Cc: Shuah Khan <shuah.kh@samsung.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Rich Felker <dalias@aerifal.cx> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
OpenPOWER on IntegriCloud