summaryrefslogtreecommitdiffstats
path: root/fs/ext4
Commit message (Collapse)AuthorAgeFilesLines
...
| | * | | ext4: Avoid unnecessary revokes in ext4_alloc_branch()Jan Kara2019-11-051-9/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Error cleanup path in ext4_alloc_branch() calls ext4_forget() on freshly allocated indirect blocks with 'metadata' set to 1. This results in generating revoke records for these blocks. However this is unnecessary as the freed blocks are only allocated in the current transaction and thus they will never be journalled. Make this cleanup path similar to e.g. cleanup in ext4_splice_branch() and use ext4_free_blocks() to handle block forgetting by passing EXT4_FREE_BLOCKS_FORGET and not EXT4_FREE_BLOCKS_METADATA to ext4_free_blocks(). This also allows allocating transaction not to reserve any credits for revoke records. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-9-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| | * | | ext4: Use ext4_journal_extend() instead of jbd2_journal_extend()Jan Kara2019-11-051-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use ext4 helper ext4_journal_extend() instead of opencoding it in ext4_try_to_expand_extra_isize(). Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-8-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| | * | | ext4: Fix ext4_should_journal_data() for EA inodesJan Kara2019-11-051-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similarly to directories, EA inodes do only journalled modifications to their data. Change ext4_should_journal_data() to return true for them so that we don't have to special-case them during truncate. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-7-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| | * | | ext4: Fix credit estimate for final inode freeingJan Kara2019-11-051-2/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Estimate for the number of credits needed for final freeing of inode in ext4_evict_inode() was to small. We may modify 4 blocks (inode & sb for orphan deletion, bitmap & group descriptor for inode freeing) and not just 3. [ Fixed minor whitespace nit. -- TYT ] Fixes: e50e5129f384 ("ext4: xattr-in-inode support") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-6-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| | * | | ext4: Do not iput inode under running transactionJan Kara2019-11-051-6/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When ext4_mkdir(), ext4_symlink(), ext4_create(), or ext4_mknod() fail to add entry into directory, it ends up dropping freshly created inode under the running transaction and thus inode truncation happens under that transaction. That breaks assumptions that evict() does not get called from a transaction context and at least in ext4_symlink() case it can result in inode eviction deadlocking in inode_wait_for_writeback() when flush worker finds symlink inode, starts to write it back and blocks on starting a transaction. So change the code in ext4_mkdir() and ext4_add_nondir() to drop inode reference only after the transaction is stopped. We also have to add inode to the orphan list in that case as otherwise the inode would get leaked in case we crash before inode deletion is committed. CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-5-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| | * | | ext4: Move marking of handle as sync to ext4_add_nondir()Jan Kara2019-11-051-7/+3
| | | |/ | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Every caller of ext4_add_nondir() marks handle as sync if directory has DIRSYNC set. Move this marking to ext4_add_nondir() so reduce some duplication. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: Enable blocksize < pagesize for dioread_nolockRitesh Harjani2019-10-221-10/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All support is now added for blocksize < pagesize for dioread_nolock. This patch removes those checks which disables dioread_nolock feature for blocksize != pagesize. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20191016073711.4141-6-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: Add support for blocksize < pagesize in dioread_nolockRitesh Harjani2019-10-224-29/+82
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds the support for blocksize < pagesize for dioread_nolock feature. Since in case of blocksize < pagesize, we can have multiple small buffers of page as unwritten extents, we need to maintain a vector of these unwritten extents which needs the conversion after the IO is complete. Thus, we maintain a list of tuple <offset, size> pair (io_end_vec) for this & traverse this list to do the unwritten to written conversion. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20191016073711.4141-5-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: Refactor mpage_map_and_submit_buffers functionRitesh Harjani2019-10-221-42/+83
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch refactors mpage_map_and_submit_buffers to take out the page buffers processing, as a separate function. This will be required to add support for blocksize < pagesize for dioread_nolock feature. No functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20191016073711.4141-4-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: Add API to bring in support for unwritten io_end_vec conversionRitesh Harjani2019-10-223-19/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch just brings in the API for conversion of unwritten io_end_vec extents which will be required for blocksize < pagesize support for dioread_nolock feature. No functional changes in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20191016073711.4141-3-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | | ext4: keep uniform naming convention for io & io_end variablesRitesh Harjani2019-10-221-27/+28
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | Let's keep uniform naming convention for ext4_submit_io (io) & ext4_end_io_t (io_end) structures, to avoid any confusion. No functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20191016073711.4141-2-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | | Merge tag 'iomap-5.5-merge-11' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2019-11-301-1/+1
|\ \ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull iomap updates from Darrick Wong: "In this release, we hoisted as much of XFS' writeback code into iomap as was practicable, refactored the unshare file data function, added the ability to perform buffered io copy on write, and tweaked various parts of the directio implementation as needed to port ext4's directio code (that will be a separate pull). Summary: - Make iomap_dio_rw callers explicitly tell us if they want us to wait - Port the xfs writeback code to iomap to complete the buffered io library functions - Refactor the unshare code to share common pieces - Add support for performing copy on write with buffered writes - Other minor fixes - Fix unchecked return in iomap_bmap - Fix a type casting bug in a ternary statement in iomap_dio_bio_actor - Improve tracepoints for easier diagnostic ability - Fix pipe page leakage in directio reads" * tag 'iomap-5.5-merge-11' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (31 commits) iomap: Fix pipe page leakage during splicing iomap: trace iomap_appply results iomap: fix return value of iomap_dio_bio_actor on 32bit systems iomap: iomap_bmap should check iomap_apply return value iomap: Fix overflow in iomap_page_mkwrite fs/iomap: remove redundant check in iomap_dio_rw() iomap: use a srcmap for a read-modify-write I/O iomap: renumber IOMAP_HOLE to 0 iomap: use write_begin to read pages to unshare iomap: move the zeroing case out of iomap_read_page_sync iomap: ignore non-shared or non-data blocks in xfs_file_dirty iomap: always use AOP_FLAG_NOFS in iomap_write_begin iomap: remove the unused iomap argument to __iomap_write_end iomap: better document the IOMAP_F_* flags iomap: enhance writeback error message iomap: pass a struct page to iomap_finish_page_writeback iomap: cleanup iomap_ioend_compare iomap: move struct iomap_page out of iomap.h iomap: warn on inline maps in iomap_writepage_map iomap: lift the xfs writeback code to iomap ...
| * | iomap: use a srcmap for a read-modify-write I/OGoldwyn Rodrigues2019-10-211-1/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | The srcmap is used to identify where the read is to be performed from. It is passed to ->iomap_begin, which can fill it in if we need to read data for partially written blocks from a different location than the write target. The srcmap is only supported for buffered writes so far. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> [hch: merged two patches, removed the IOMAP_F_COW flag, use iomap as srcmap if not set, adjust length down to srcmap end as well] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
* | Merge tag 'linux-kselftest-5.5-rc1-kunit' of ↵Linus Torvalds2019-11-253-0/+290
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull kselftest KUnit support gtom Shuah Khan: "This adds KUnit, a lightweight unit testing and mocking framework for the Linux kernel from Brendan Higgins. KUnit is not an end-to-end testing framework. It is currently supported on UML and sub-systems can write unit tests and run them in UML env. KUnit documentation is included in this update. In addition, this Kunit update adds 3 new kunit tests: - proc sysctl test from Iurii Zaikin - the 'list' doubly linked list test from David Gow - ext4 tests for decoding extended timestamps from Iurii Zaikin In the future KUnit will be linked to Kselftest framework to provide a way to trigger KUnit tests from user-space" * tag 'linux-kselftest-5.5-rc1-kunit' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (23 commits) lib/list-test: add a test for the 'list' doubly linked list ext4: add kunit test for decoding extended timestamps Documentation: kunit: Fix verification command kunit: Fix '--build_dir' option kunit: fix failure to build without printk MAINTAINERS: add proc sysctl KUnit test to PROC SYSCTL section kernel/sysctl-test: Add null pointer test for sysctl.c:proc_dointvec() MAINTAINERS: add entry for KUnit the unit testing framework Documentation: kunit: add documentation for KUnit kunit: defconfig: add defconfigs for building KUnit tests kunit: tool: add Python wrappers for running KUnit tests kunit: test: add tests for KUnit managed resources kunit: test: add the concept of assertions kunit: test: add tests for kunit test abort kunit: test: add support for test abort objtool: add kunit_try_catch_throw to the noreturn list kunit: test: add initial tests lib: enable building KUnit in lib/ kunit: test: add the concept of expectations kunit: test: add assertion printing library ...
| * | ext4: add kunit test for decoding extended timestampsIurii Zaikin2019-10-233-0/+290
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KUnit tests for decoding extended 64 bit timestamps that verify the seconds part of [a/c/m] timestamps in ext4 inode structs are decoded correctly. Test data is derived from the table in the Inode Timestamps section of Documentation/filesystems/ext4/inodes.rst. KUnit tests run during boot and output the results to the debug log in TAP format (http://testanything.org/). Only useful for kernel devs running KUnit test harness and are not for inclusion into a production build. Signed-off-by: Iurii Zaikin <yzaikin@google.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Brendan Higgins <brendanhiggins@google.com> Tested-by: Brendan Higgins <brendanhiggins@google.com> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
* | Merge tag 'fsverity-for-linus' of ↵Linus Torvalds2019-11-251-1/+4
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt Pull fsverity updates from Eric Biggers: "Expose the fs-verity bit through statx()" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: docs: fs-verity: mention statx() support f2fs: support STATX_ATTR_VERITY ext4: support STATX_ATTR_VERITY statx: define STATX_ATTR_VERITY docs: fs-verity: document first supported kernel version
| * | ext4: support STATX_ATTR_VERITYEric Biggers2019-11-131-1/+4
| |/ | | | | | | | | | | | | | | Set the STATX_ATTR_VERITY bit when the statx() system call is used on a verity file on ext4. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Eric Biggers <ebiggers@google.com>
* / ext4: add support for IV_INO_LBLK_64 encryption policiesEric Biggers2019-11-062-0/+16
|/ | | | | | | | | | | | | | | | | | | | | | | | IV_INO_LBLK_64 encryption policies have special requirements from the filesystem beyond those of the existing encryption policies: - Inode numbers must never change, even if the filesystem is resized. - Inode numbers must be <= 32 bits. - File logical block numbers must be <= 32 bits. ext4 has 32-bit inode and file logical block numbers. However, resize2fs can re-number inodes when shrinking an ext4 filesystem. However, typically the people who would want to use this format don't care about filesystem shrinking. They'd be fine with a solution that just prevents the filesystem from being shrunk. Therefore, add a new feature flag EXT4_FEATURE_COMPAT_STABLE_INODES that will do exactly that. Then wire up the fscrypt_operations to expose this flag to fs/crypto/, so that it allows IV_INO_LBLK_64 policies when this flag is set. Acked-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
* Merge branch 'entropy'Linus Torvalds2019-09-291-0/+3
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge active entropy generation updates. This is admittedly partly "for discussion". We need to have a way forward for the boot time deadlocks where user space ends up waiting for more entropy, but no entropy is forthcoming because the system is entirely idle just waiting for something to happen. While this was triggered by what is arguably a user space bug with GDM/gnome-session asking for secure randomness during early boot, when they didn't even need any such truly secure thing, the issue ends up being that our "getrandom()" interface is prone to that kind of confusion, because people don't think very hard about whether they want to block for sufficient amounts of entropy. The approach here-in is to decide to not just passively wait for entropy to happen, but to start actively collecting it if it is missing. This is not necessarily always possible, but if the architecture has a CPU cycle counter, there is a fair amount of noise in the exact timings of reasonably complex loads. We may end up tweaking the load and the entropy estimates, but this should be at least a reasonable starting point. As part of this, we also revert the revert of the ext4 IO pattern improvement that ended up triggering the reported lack of external entropy. * getrandom() active entropy waiting: Revert "Revert "ext4: make __ext4_get_inode_loc plug"" random: try to actively add entropy rather than passively wait for it
| * Revert "Revert "ext4: make __ext4_get_inode_loc plug""Linus Torvalds2019-09-291-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 72dbcf72156641fde4d8ea401e977341bfd35a05. Instead of waiting forever for entropy that may just not happen, we now try to actively generate entropy when required, and are thus hopefully avoiding the problem that caused the nice ext4 IO pattern fix to be reverted. So revert the revert. Cc: Ahmed S. Darwish <darwish.07@gmail.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Willy Tarreau <w@1wt.eu> Cc: Alexander E. Patrakov <patrakov@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge tag 'ext4_for_linus' of ↵Linus Torvalds2019-09-2113-275/+830
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Added new ext4 debugging ioctls to allow userspace to get information about the state of the extent status cache. Dropped workaround for pre-1970 dates which were encoded incorrectly in pre-4.4 kernels. Since both the kernel correctly generates, and e2fsck detects and fixes this issue for the past four years, it'e time to drop the workaround. (Also, it's not like files with dates in the distant past were all that common in the first place.) A lot of miscellaneous bug fixes and cleanups, including some ext4 Documentation fixes. Also included are two minor bug fixes in fs/unicode" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits) unicode: make array 'token' static const, makes object smaller unicode: Move static keyword to the front of declarations ext4: add missing bigalloc documentation. ext4: fix kernel oops caused by spurious casefold flag ext4: fix integer overflow when calculating commit interval ext4: use percpu_counters for extent_status cache hits/misses ext4: fix potential use after free after remounting with noblock_validity jbd2: add missing tracepoint for reserved handle ext4: fix punch hole for inline_data file systems ext4: rework reserved cluster accounting when invalidating pages ext4: documentation fixes ext4: treat buffers with write errors as containing valid data ext4: fix warning inside ext4_convert_unwritten_extents_endio ext4: set error return correctly when ext4_htree_store_dirent fails ext4: drop legacy pre-1970 encoding workaround ext4: add new ioctl EXT4_IOC_GET_ES_CACHE ext4: add a new ioctl EXT4_IOC_GETSTATE ext4: add a new ioctl EXT4_IOC_CLEAR_ES_CACHE jbd2: flush_descriptor(): Do not decrease buffer head's ref count ext4: remove unnecessary error check ...
| * | ext4: fix kernel oops caused by spurious casefold flagTheodore Ts'o2019-09-034-6/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If an directory has the a casefold flag set without the casefold feature set, s_encoding will not be initialized, and this will cause the kernel to dereference a NULL pointer. In addition to adding checks to avoid these kernel oops, attempts to load inodes with the casefold flag when the casefold feature is not enable will cause the file system to be declared corrupted. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | ext4: fix integer overflow when calculating commit intervalzhangyi (F)2019-08-281-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If user specify a large enough value of "commit=" option, it may trigger signed integer overflow which may lead to sbi->s_commit_interval becomes a large or small value, zero in particular. UBSAN: Undefined behaviour in ../fs/ext4/super.c:1592:31 signed integer overflow: 536870912 * 1000 cannot be represented in type 'int' [...] Call trace: [...] [<ffffff9008a2d120>] ubsan_epilogue+0x34/0x9c lib/ubsan.c:166 [<ffffff9008a2d8b8>] handle_overflow+0x228/0x280 lib/ubsan.c:197 [<ffffff9008a2d95c>] __ubsan_handle_mul_overflow+0x4c/0x68 lib/ubsan.c:218 [<ffffff90086d070c>] handle_mount_opt fs/ext4/super.c:1592 [inline] [<ffffff90086d070c>] parse_options+0x1724/0x1a40 fs/ext4/super.c:1773 [<ffffff90086d51c4>] ext4_remount+0x2ec/0x14a0 fs/ext4/super.c:4834 [...] Although it is not a big deal, still silence the UBSAN by limit the input value. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
| * | ext4: use percpu_counters for extent_status cache hits/missesYang Guo2019-08-282-15/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | @es_stats_cache_hits and @es_stats_cache_misses are accessed frequently in ext4_es_lookup_extent function, it would influence the ext4 read/write performance in NUMA system. Let's optimize it using percpu_counter, it is profitable for the performance. The test command is as below: fio -name=randwrite -numjobs=8 -filename=/mnt/test1 -rw=randwrite -ioengine=libaio -direct=1 -iodepth=64 -sync=0 -norandommap -group_reporting -runtime=120 -time_based -bs=4k -size=5G And the result is better 10% than the initial implement: without the patch,IOPS=197k, BW=770MiB/s (808MB/s)(90.3GiB/120002msec) with the patch, IOPS=218k, BW=852MiB/s (894MB/s)(99.9GiB/120002msec) Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Yang Guo <guoyang2@huawei.com> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
| * | ext4: fix potential use after free after remounting with noblock_validityzhangyi (F)2019-08-282-52/+147
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remount process will release system zone which was allocated before if "noblock_validity" is specified. If we mount an ext4 file system to two mountpoints with default mount options, and then remount one of them with "noblock_validity", it may trigger a use after free problem when someone accessing the other one. # mount /dev/sda foo # mount /dev/sda bar User access mountpoint "foo" | Remount mountpoint "bar" | ext4_map_blocks() | ext4_remount() check_block_validity() | ext4_setup_system_zone() ext4_data_block_valid() | ext4_release_system_zone() | free system_blks rb nodes access system_blks rb nodes | trigger use after free | This problem can also be reproduced by one mountpint, At the same time, add_system_zone() can get called during remount as well so there can be racing ext4_data_block_valid() reading the rbtree at the same time. This patch add RCU to protect system zone from releasing or building when doing a remount which inverse current "noblock_validity" mount option. It assign the rbtree after the whole tree was complete and do actual freeing after rcu grace period, avoid any intermediate state. Reported-by: syzbot+1e470567330b7ad711d5@syzkaller.appspotmail.com Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
| * | ext4: fix punch hole for inline_data file systemsTheodore Ts'o2019-08-231-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a program attempts to punch a hole on an inline data file, we need to convert it to a normal file first. This was detected using ext4/032 using the adv configuration. Simple reproducer: mke2fs -Fq -t ext4 -O inline_data /dev/vdc mount /vdc echo "" > /vdc/testfile xfs_io -c 'truncate 33554432' /vdc/testfile xfs_io -c 'fpunch 0 1048576' /vdc/testfile umount /vdc e2fsck -fy /dev/vdc Cc: stable@vger.kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | ext4: rework reserved cluster accounting when invalidating pagesEric Whitney2019-08-224-161/+353
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The goal of this patch is to remove two references to the buffer delay bit in ext4_da_page_release_reservation() as part of a larger effort to remove all such references from ext4. These two references are principally used to reduce the reserved block/cluster count when pages are invalidated as a result of truncating, punching holes, or collapsing a block range in a file. The entire function is removed and replaced with code in ext4_es_remove_extent() that reduces the reserved count as a side effect of removing a block range from delayed and not unwritten extents in the extent status tree as is done when truncating, punching holes, or collapsing ranges. The code is written to minimize the number of searches descending from rb tree roots for scalability. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | ext4: treat buffers with write errors as containing valid dataZhangXiaoxu2019-08-222-2/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I got some errors when I repair an ext4 volume which stacked by an iscsi target: Entry 'test60' in / (2) has deleted/unused inode 73750. Clear? It can be reproduced when the network not good enough. When I debug this I found ext4 will read entry buffer from disk and the buffer is marked with write_io_error. If the buffer is marked with write_io_error, it means it already wroten to journal, and not checked out to disk. IOW, the journal is newer than the data in disk. If this journal record 'delete test60', it means the 'test60' still on the disk metadata. In this case, if we read the buffer from disk successfully and create file continue, the new journal record will overwrite the journal which record 'delete test60', then the entry corruptioned. So, use the buffer rather than read from disk if the buffer is marked with write_io_error. Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | ext4: fix warning inside ext4_convert_unwritten_extents_endioRakesh Pandit2019-08-221-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Really enable warning when CONFIG_EXT4_DEBUG is set and fix missing first argument. This was introduced in commit ff95ec22cd7f ("ext4: add warning to ext4_convert_unwritten_extents_endio") and splitting extents inside endio would trigger it. Fixes: ff95ec22cd7f ("ext4: add warning to ext4_convert_unwritten_extents_endio") Signed-off-by: Rakesh Pandit <rakesh@tuxera.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
| * | ext4: set error return correctly when ext4_htree_store_dirent failsColin Ian King2019-08-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently when the call to ext4_htree_store_dirent fails the error return variable 'ret' is is not being set to the error code and variable count is instead, hence the error code is not being returned. Fix this by assigning ret to the error return code. Addresses-Coverity: ("Unused value") Fixes: 8af0f0822797 ("ext4: fix readdir error in the case of inline_data+dir_index") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | ext4: drop legacy pre-1970 encoding workaroundTheodore Ts'o2019-08-121-14/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Originally, support for expanded timestamps had a bug in that pre-1970 times were erroneously encoded as being in the the 24th century. This was fixed in commit a4dad1ae24f8 ("ext4: Fix handling of extended tv_sec") which landed in 4.4. Starting with 4.4, pre-1970 timestamps were correctly encoded, but for backwards compatibility those incorrectly encoded timestamps were mapped back to the pre-1970 dates. Given that backwards compatibility workaround has been around for 4 years, and given that running e2fsck from e2fsprogs 1.43.2 and later will offer to fix these timestamps (which has been released for 3 years), it's past time to drop the legacy workaround from the kernel. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | ext4: add new ioctl EXT4_IOC_GET_ES_CACHETheodore Ts'o2019-08-116-11/+182
| | | | | | | | | | | | | | | | | | | | | | | | For debugging reasons, it's useful to know the contents of the extent cache. Since the extent cache contains much of what is in the fiemap ioctl, use an fiemap-style interface to return this information. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | ext4: add a new ioctl EXT4_IOC_GETSTATETheodore Ts'o2019-08-112-0/+28
| | | | | | | | | | | | | | | | | | | | | The new ioctl EXT4_IOC_GETSTATE returns some of the dynamic state of an ext4 inode for debugging purposes. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | ext4: add a new ioctl EXT4_IOC_CLEAR_ES_CACHETheodore Ts'o2019-08-114-0/+40
| | | | | | | | | | | | | | | | | | | | | | | | The new ioctl EXT4_IOC_CLEAR_ES_CACHE will force an inode's extent status cache to be cleared out. This is intended for use for debugging. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | ext4: remove unnecessary error checkShi Siyuan2019-08-111-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove unnecessary error check in ext4_file_write_iter(), because this check will be done in upcoming later function -- ext4_write_checks() -> generic_write_checks() Change-Id: I7b0ab27f693a50765c15b5eaa3f4e7c38f42e01e Signed-off-by: shisiyuan <shisiyuan@xiaomi.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
| * | ext4: fix warning when turn on dioread_nolock and inline_datayangerkun2019-08-111-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mkfs.ext4 -O inline_data /dev/vdb mount -o dioread_nolock /dev/vdb /mnt echo "some inline data..." >> /mnt/test-file echo "some inline data..." >> /mnt/test-file sync The above script will trigger "WARN_ON(!io_end->handle && sbi->s_journal)" because ext4_should_dioread_nolock() returns false for a file with inline data. Move the check to a place after we have already removed the inline data and prepared inode to write normal pages. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | | Merge tag 'y2038-vfs' of ↵Linus Torvalds2019-09-192-3/+22
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground Pull y2038 vfs updates from Arnd Bergmann: "Add inode timestamp clamping. This series from Deepa Dinamani adds a per-superblock minimum/maximum timestamp limit for a file system, and clamps timestamps as they are written, to avoid random behavior from integer overflow as well as having different time stamps on disk vs in memory. At mount time, a warning is now printed for any file system that can represent current timestamps but not future timestamps more than 30 years into the future, similar to the arbitrary 30 year limit that was added to settimeofday(). This was picked as a compromise to warn users to migrate to other file systems (e.g. ext4 instead of ext3) when they need the file system to survive beyond 2038 (or similar limits in other file systems), but not get in the way of normal usage" * tag 'y2038-vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground: ext4: Reduce ext4 timestamp warnings isofs: Initialize filesystem timestamp ranges pstore: fs superblock limits fs: omfs: Initialize filesystem timestamp ranges fs: hpfs: Initialize filesystem timestamp ranges fs: ceph: Initialize filesystem timestamp ranges fs: sysv: Initialize filesystem timestamp ranges fs: affs: Initialize filesystem timestamp ranges fs: fat: Initialize filesystem timestamp ranges fs: cifs: Initialize filesystem timestamp ranges fs: nfs: Initialize filesystem timestamp ranges ext4: Initialize timestamps limits 9p: Fill min and max timestamps in sb fs: Fill in max and min timestamps in superblock utimes: Clamp the timestamps before update mount: Add mount warning for impending timestamp expiry timestamp_truncate: Replace users of timespec64_trunc vfs: Add timestamp_truncate() api vfs: Add file timestamp range support
| * | | ext4: Reduce ext4 timestamp warningsDeepa Dinamani2019-09-041-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When ext4 file systems were created intentionally with 128 byte inodes, the rate-limited warning of eventual possible timestamp overflow are still emitted rather frequently. Remove the warning for now. Discussion for whether any warning is needed, and where it should be emitted, can be found at https://lore.kernel.org/lkml/1567523922.5576.57.camel@lca.pw/. I can post a separate follow-up patch after the conclusion. Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
| * | | ext4: Initialize timestamps limitsDeepa Dinamani2019-08-302-3/+24
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ext4 has different overflow limits for max filesystem timestamps based on the extra bytes available. The timestamp limits are calculated according to the encoding table in a4dad1ae24f85i(ext4: Fix handling of extended tv_sec): * extra msb of adjust for signed * epoch 32-bit 32-bit tv_sec to * bits time decoded 64-bit tv_sec 64-bit tv_sec valid time range * 0 0 1 -0x80000000..-0x00000001 0x000000000 1901-12-13..1969-12-31 * 0 0 0 0x000000000..0x07fffffff 0x000000000 1970-01-01..2038-01-19 * 0 1 1 0x080000000..0x0ffffffff 0x100000000 2038-01-19..2106-02-07 * 0 1 0 0x100000000..0x17fffffff 0x100000000 2106-02-07..2174-02-25 * 1 0 1 0x180000000..0x1ffffffff 0x200000000 2174-02-25..2242-03-16 * 1 0 0 0x200000000..0x27fffffff 0x200000000 2242-03-16..2310-04-04 * 1 1 1 0x280000000..0x2ffffffff 0x300000000 2310-04-04..2378-04-22 * 1 1 0 0x300000000..0x37fffffff 0x300000000 2378-04-22..2446-05-10 Note that the time limits are not correct for deletion times. Added a warn when an inode cannot be extended to incorporate an extended timestamp. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Acked-by: Jeff Layton <jlayton@kernel.org> Cc: tytso@mit.edu Cc: adilger.kernel@dilger.ca Cc: linux-ext4@vger.kernel.org
* | | Merge tag 'fsverity-for-linus' of ↵Linus Torvalds2019-09-189-53/+645
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt Pull fs-verity support from Eric Biggers: "fs-verity is a filesystem feature that provides Merkle tree based hashing (similar to dm-verity) for individual readonly files, mainly for the purpose of efficient authenticity verification. This pull request includes: (a) The fs/verity/ support layer and documentation. (b) fs-verity support for ext4 and f2fs. Compared to the original fs-verity patchset from last year, the UAPI to enable fs-verity on a file has been greatly simplified. Lots of other things were cleaned up too. fs-verity is planned to be used by two different projects on Android; most of the userspace code is in place already. Another userspace tool ("fsverity-utils"), and xfstests, are also available. e2fsprogs and f2fs-tools already have fs-verity support. Other people have shown interest in using fs-verity too. I've tested this on ext4 and f2fs with xfstests, both the existing tests and the new fs-verity tests. This has also been in linux-next since July 30 with no reported issues except a couple minor ones I found myself and folded in fixes for. Ted and I will be co-maintaining fs-verity" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: f2fs: add fs-verity support ext4: update on-disk format documentation for fs-verity ext4: add fs-verity read support ext4: add basic fs-verity support fs-verity: support builtin file signatures fs-verity: add SHA-512 support fs-verity: implement FS_IOC_MEASURE_VERITY ioctl fs-verity: implement FS_IOC_ENABLE_VERITY ioctl fs-verity: add data verification hooks for ->readpages() fs-verity: add the hook for file ->setattr() fs-verity: add the hook for file ->open() fs-verity: add inode and superblock fields fs-verity: add Kconfig and the helper functions for hashing fs: uapi: define verity bit for FS_IOC_GETFLAGS fs-verity: add UAPI header fs-verity: add MAINTAINERS file entry fs-verity: add a documentation file
| * | | ext4: add fs-verity read supportEric Biggers2019-08-124-36/+188
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make ext4_mpage_readpages() verify data as it is read from fs-verity files, using the helper functions from fs/verity/. To support both encryption and verity simultaneously, this required refactoring the decryption workflow into a generic "post-read processing" workflow which can do decryption, verification, or both. The case where the ext4 block size is not equal to the PAGE_SIZE is not supported yet, since in that case ext4_mpage_readpages() sometimes falls back to block_read_full_page(), which does not support fs-verity yet. Co-developed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
| * | | ext4: add basic fs-verity supportEric Biggers2019-08-128-17/+457
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add most of fs-verity support to ext4. fs-verity is a filesystem feature that enables transparent integrity protection and authentication of read-only files. It uses a dm-verity like mechanism at the file level: a Merkle tree is used to verify any block in the file in log(filesize) time. It is implemented mainly by helper functions in fs/verity/. See Documentation/filesystems/fsverity.rst for the full documentation. This commit adds all of ext4 fs-verity support except for the actual data verification, including: - Adding a filesystem feature flag and an inode flag for fs-verity. - Implementing the fsverity_operations to support enabling verity on an inode and reading/writing the verity metadata. - Updating ->write_begin(), ->write_end(), and ->writepages() to support writing verity metadata pages. - Calling the fs-verity hooks for ->open(), ->setattr(), and ->ioctl(). ext4 stores the verity metadata (Merkle tree and fsverity_descriptor) past the end of the file, starting at the first 64K boundary beyond i_size. This approach works because (a) verity files are readonly, and (b) pages fully beyond i_size aren't visible to userspace but can be read/written internally by ext4 with only some relatively small changes to ext4. This approach avoids having to depend on the EA_INODE feature and on rearchitecturing ext4's xattr support to support paging multi-gigabyte xattrs into memory, and to support encrypting xattrs. Note that the verity metadata *must* be encrypted when the file is, since it contains hashes of the plaintext data. This patch incorporates work by Theodore Ts'o and Chandan Rajendra. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
* | | Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscryptLinus Torvalds2019-09-182-0/+35
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull fscrypt updates from Eric Biggers: "This is a large update to fs/crypto/ which includes: - Add ioctls that add/remove encryption keys to/from a filesystem-level keyring. These fix user-reported issues where e.g. an encrypted home directory can break NetworkManager, sshd, Docker, etc. because they don't get access to the needed keyring. These ioctls also provide a way to lock encrypted directories that doesn't use the vm.drop_caches sysctl, so is faster, more reliable, and doesn't always need root. - Add a new encryption policy version ("v2") which switches to a more standard, secure, and flexible key derivation function, and starts verifying that the correct key was supplied before using it. The key derivation improvement is needed for its own sake as well as for ongoing feature work for which the current way is too inflexible. Work is in progress to update both Android and the 'fscrypt' userspace tool to use both these features. (Working patches are available and just need to be reviewed+merged.) Chrome OS will likely use them too. This has also been tested on ext4, f2fs, and ubifs with xfstests -- both the existing encryption tests, and the new tests for this. This has also been in linux-next since Aug 16 with no reported issues. I'm also using an fscrypt v2-encrypted home directory on my personal desktop" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: (27 commits) ext4 crypto: fix to check feature status before get policy fscrypt: document the new ioctls and policy version ubifs: wire up new fscrypt ioctls f2fs: wire up new fscrypt ioctls ext4: wire up new fscrypt ioctls fscrypt: require that key be added when setting a v2 encryption policy fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY_ALL_USERS ioctl fscrypt: allow unprivileged users to add/remove keys for v2 policies fscrypt: v2 encryption policy support fscrypt: add an HKDF-SHA512 implementation fscrypt: add FS_IOC_GET_ENCRYPTION_KEY_STATUS ioctl fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctl fscrypt: rename keyinfo.c to keysetup.c fscrypt: move v1 policy key setup to keysetup_v1.c fscrypt: refactor key setup code in preparation for v2 policies fscrypt: rename fscrypt_master_key to fscrypt_direct_key fscrypt: add ->ci_inode to fscrypt_info fscrypt: use FSCRYPT_* definitions, not FS_* fscrypt: use FSCRYPT_ prefix for uapi constants ...
| * | ext4 crypto: fix to check feature status before get policyChao Yu2019-08-311-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When getting fscrypt policy via EXT4_IOC_GET_ENCRYPTION_POLICY, if encryption feature is off, it's better to return EOPNOTSUPP instead of ENODATA, so let's add ext4_has_feature_encrypt() to do the check for that. This makes it so that all fscrypt ioctls consistently check for the encryption feature, and makes ext4 consistent with f2fs in this regard. Signed-off-by: Chao Yu <yuchao0@huawei.com> [EB - removed unneeded braces, updated the documentation, and added more explanation to commit message] Signed-off-by: Eric Biggers <ebiggers@google.com>
| * | ext4: wire up new fscrypt ioctlsEric Biggers2019-08-122-0/+33
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | Wire up the new ioctls for adding and removing fscrypt keys to/from the filesystem, and the new ioctl for retrieving v2 encryption policies. The key removal ioctls also required making ext4_drop_inode() call fscrypt_drop_inode(). For more details see Documentation/filesystems/fscrypt.rst and the fscrypt patches that added the implementation of these ioctls. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
* / Revert "ext4: make __ext4_get_inode_loc plug"Linus Torvalds2019-09-151-3/+0
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit b03755ad6f33b7b8cd7312a3596a2dbf496de6e7. This is sad, and done for all the wrong reasons. Because that commit is good, and does exactly what it says: avoids a lot of small disk requests for the inode table read-ahead. However, it turns out that it causes an entirely unrelated problem: the getrandom() system call was introduced back in 2014 by commit c6e9d6f38894 ("random: introduce getrandom(2) system call"), and people use it as a convenient source of good random numbers. But part of the current semantics for getrandom() is that it waits for the entropy pool to fill at least partially (unlike /dev/urandom). And at least ArchLinux apparently has a systemd that uses getrandom() at boot time, and the improvements in IO patterns means that existing installations suddenly start hanging, waiting for entropy that will never happen. It seems to be an unlucky combination of not _quite_ enough entropy, together with a particular systemd version and configuration. Lennart says that the systemd-random-seed process (which is what does this early access) is supposed to not block any other boot activity, but sadly that doesn't actually seem to be the case (possibly due bogus dependencies on cryptsetup for encrypted swapspace). The correct fix is to fix getrandom() to not block when it's not appropriate, but that fix is going to take a lot more discussion. Do we just make it act like /dev/urandom by default, and add a new flag for "wait for entropy"? Do we add a boot-time option? Or do we just limit the amount of time it will wait for entropy? So in the meantime, we do the revert to give us time to discuss the eventual fix for the fundamental problem, at which point we can re-apply the ext4 inode table access optimization. Reported-by: Ahmed S. Darwish <darwish.07@gmail.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Willy Tarreau <w@1wt.eu> Cc: Alexander E. Patrakov <patrakov@gmail.com> Cc: Lennart Poettering <mzxreary@0pointer.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'libnvdimm-for-5.3' of ↵Linus Torvalds2019-07-181-4/+6
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "Primarily just the virtio_pmem driver: - virtio_pmem The new virtio_pmem facility introduces a paravirtualized persistent memory device that allows a guest VM to use DAX mechanisms to access a host-file with host-page-cache. It arranges for MAP_SYNC to be disabled and instead triggers a host fsync() when a 'write-cache flush' command is sent to the virtual disk device. - Miscellaneous small fixups" * tag 'libnvdimm-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: virtio_pmem: fix sparse warning xfs: disable map_sync for async flush ext4: disable map_sync for async flush dax: check synchronous mapping is supported dm: enable synchronous dax libnvdimm: add dax_dev sync flag virtio-pmem: Add virtio pmem driver libnvdimm: nd_region flush callback support libnvdimm, namespace: Drop uuid_t implementation detail
| * ext4: disable map_sync for async flushPankaj Gupta2019-07-051-4/+6
| | | | | | | | | | | | | | | | | | | | | | Dont support 'MAP_SYNC' with non-DAX files and DAX files with asynchronous dax_device. Virtio pmem provides asynchronous host page cache flush mechanism. We don't support 'MAP_SYNC' with virtio pmem and ext4. Signed-off-by: Pankaj Gupta <pagupta@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
* | Merge tag 'for-linus-20190715' of git://git.kernel.dk/linux-blockLinus Torvalds2019-07-151-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull more block updates from Jens Axboe: "A later pull request with some followup items. I had some vacation coming up to the merge window, so certain things items were delayed a bit. This pull request also contains fixes that came in within the last few days of the merge window, which I didn't want to push right before sending you a pull request. This contains: - NVMe pull request, mostly fixes, but also a few minor items on the feature side that were timing constrained (Christoph et al) - Report zones fixes (Damien) - Removal of dead code (Damien) - Turn on cgroup psi memstall (Josef) - block cgroup MAINTAINERS entry (Konstantin) - Flush init fix (Josef) - blk-throttle low iops timing fix (Konstantin) - nbd resize fixes (Mike) - nbd 0 blocksize crash fix (Xiubo) - block integrity error leak fix (Wenwen) - blk-cgroup writeback and priority inheritance fixes (Tejun)" * tag 'for-linus-20190715' of git://git.kernel.dk/linux-block: (42 commits) MAINTAINERS: add entry for block io cgroup null_blk: fixup ->report_zones() for !CONFIG_BLK_DEV_ZONED block: Limit zone array allocation size sd_zbc: Fix report zones buffer allocation block: Kill gfp_t argument of blkdev_report_zones() block: Allow mapping of vmalloc-ed buffers block/bio-integrity: fix a memory leak bug nvme: fix NULL deref for fabrics options nbd: add netlink reconfigure resize support nbd: fix crash when the blksize is zero block: Disable write plugging for zoned block devices block: Fix elevator name declaration block: Remove unused definitions nvme: fix regression upon hot device removal and insertion blk-throttle: fix zero wait time for iops throttled group block: Fix potential overflow in blk_report_zones() blkcg: implement REQ_CGROUP_PUNT blkcg, writeback: Implement wbc_blkcg_css() blkcg, writeback: Add wbc->no_cgroup_owner blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner() ...
| * | blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner()Tejun Heo2019-07-101-1/+1
| |/ | | | | | | | | | | | | | | | | | | | | wbc_account_io() does a very specific job - try to see which cgroup is actually dirtying an inode and transfer its ownership to the majority dirtier if needed. The name is too generic and confusing. Let's rename it to something more specific. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
OpenPOWER on IntegriCloud