| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A simple test-case from Kirill Shutemov
cat /proc/self/maps >/dev/null
chmod +x /proc/self/net/packet
exec /proc/self/net/packet
makes lockdep unhappy, cat/exec take seq_file->lock + cred_guard_mutex in
the opposite order.
It's a false positive and probably we should not allow "chmod +x" on proc
files. Still I think that we should avoid mm_access() and cred_guard_mutex
in sys_read() paths, security checking should happen at open time. Besides,
this doesn't even look right if the task changes its ->mm between m_stop()
and m_start().
Add the new "mm_struct *mm" member into struct proc_maps_private and change
proc_maps_open() to initialize it using proc_mem_open(). Change m_start() to
use priv->mm if atomic_inc_not_zero(mm_users) succeeds or return NULL (eof)
otherwise.
The only complication is that proc_maps_open() users should additionally do
mmdrop() in fop->release(), add the new proc_map_release() helper for that.
Note: this is the user-visible change, if the task execs after open("maps")
the new ->mm won't be visible via this file. I hope this is fine, and this
matches /proc/pid/mem bahaviour.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Extract the mm_access() code from __mem_open() into the new helper,
proc_mem_open(), the next patch will add another caller.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
do_maps_open() and numa_maps_open() are overcomplicated, they could use
__seq_open_private(). Plus they do the same, just sizeof(*priv)
Change them to use a new simple helper, proc_maps_open(ops, psize). This
simplifies the code and allows us to do the next changes.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
get_gate_vma(priv->task->mm) looks ugly and wrong, task->mm can be NULL or
it can changed by exec right after mm_access().
And in theory this race is not harmless, the task can exec and then later
exit and free the new mm_struct. In this case get_task_mm(oldmm) can't
help, get_gate_vma(task->mm) can read the freed/unmapped memory.
I think that priv->task should simply die and hold_task_mempolicy() logic
can be simplified. tail_vma logic asks for cleanups too.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For commit ocfs2 journal, ocfs2 journal thread will acquire the mutex
osb->journal->j_trans_barrier and wake up jbd2 commit thread, then it
will wait until jbd2 commit thread done. In order journal mode, jbd2
needs flushing dirty data pages first, and this needs get page lock.
So osb->journal->j_trans_barrier should be got before page lock.
But ocfs2_write_zero_page() and ocfs2_write_begin_inline() obey this
locking order, and this will cause deadlock and hung the whole cluster.
One deadlock catched is the following:
PID: 13449 TASK: ffff8802e2f08180 CPU: 31 COMMAND: "oracle"
#0 [ffff8802ee3f79b0] __schedule at ffffffff8150a524
#1 [ffff8802ee3f7a58] schedule at ffffffff8150acbf
#2 [ffff8802ee3f7a68] rwsem_down_failed_common at ffffffff8150cb85
#3 [ffff8802ee3f7ad8] rwsem_down_read_failed at ffffffff8150cc55
#4 [ffff8802ee3f7ae8] call_rwsem_down_read_failed at ffffffff812617a4
#5 [ffff8802ee3f7b50] ocfs2_start_trans at ffffffffa0498919 [ocfs2]
#6 [ffff8802ee3f7ba0] ocfs2_zero_start_ordered_transaction at ffffffffa048b2b8 [ocfs2]
#7 [ffff8802ee3f7bf0] ocfs2_write_zero_page at ffffffffa048e9bd [ocfs2]
#8 [ffff8802ee3f7c80] ocfs2_zero_extend_range at ffffffffa048ec83 [ocfs2]
#9 [ffff8802ee3f7ce0] ocfs2_zero_extend at ffffffffa048edfd [ocfs2]
#10 [ffff8802ee3f7d50] ocfs2_extend_file at ffffffffa049079e [ocfs2]
#11 [ffff8802ee3f7da0] ocfs2_setattr at ffffffffa04910ed [ocfs2]
#12 [ffff8802ee3f7e70] notify_change at ffffffff81187d29
#13 [ffff8802ee3f7ee0] do_truncate at ffffffff8116bbc1
#14 [ffff8802ee3f7f50] sys_ftruncate at ffffffff8116bcbd
#15 [ffff8802ee3f7f80] system_call_fastpath at ffffffff81515142
RIP: 00007f8de750c6f7 RSP: 00007fffe786e478 RFLAGS: 00000206
RAX: 000000000000004d RBX: ffffffff81515142 RCX: 0000000000000000
RDX: 0000000000000200 RSI: 0000000000028400 RDI: 000000000000000d
RBP: 00007fffe786e040 R8: 0000000000000000 R9: 000000000000000d
R10: 0000000000000000 R11: 0000000000000206 R12: 000000000000000d
R13: 00007fffe786e710 R14: 00007f8de70f8340 R15: 0000000000028400
ORIG_RAX: 000000000000004d CS: 0033 SS: 002b
crash64> bt
PID: 7610 TASK: ffff88100fd56140 CPU: 1 COMMAND: "ocfs2cmt"
#0 [ffff88100f4d1c50] __schedule at ffffffff8150a524
#1 [ffff88100f4d1cf8] schedule at ffffffff8150acbf
#2 [ffff88100f4d1d08] jbd2_log_wait_commit at ffffffffa01274fd [jbd2]
#3 [ffff88100f4d1d98] jbd2_journal_flush at ffffffffa01280b4 [jbd2]
#4 [ffff88100f4d1dd8] ocfs2_commit_cache at ffffffffa0499b14 [ocfs2]
#5 [ffff88100f4d1e38] ocfs2_commit_thread at ffffffffa0499d38 [ocfs2]
#6 [ffff88100f4d1ee8] kthread at ffffffff81090db6
#7 [ffff88100f4d1f48] kernel_thread_helper at ffffffff81516284
crash64> bt
PID: 7609 TASK: ffff88100f2d4480 CPU: 0 COMMAND: "jbd2/dm-20-86"
#0 [ffff88100def3920] __schedule at ffffffff8150a524
#1 [ffff88100def39c8] schedule at ffffffff8150acbf
#2 [ffff88100def39d8] io_schedule at ffffffff8150ad6c
#3 [ffff88100def39f8] sleep_on_page at ffffffff8111069e
#4 [ffff88100def3a08] __wait_on_bit_lock at ffffffff8150b30a
#5 [ffff88100def3a58] __lock_page at ffffffff81110687
#6 [ffff88100def3ab8] write_cache_pages at ffffffff8111b752
#7 [ffff88100def3be8] generic_writepages at ffffffff8111b901
#8 [ffff88100def3c48] journal_submit_data_buffers at ffffffffa0120f67 [jbd2]
#9 [ffff88100def3cf8] jbd2_journal_commit_transaction at ffffffffa0121372[jbd2]
#10 [ffff88100def3e68] kjournald2 at ffffffffa0127a86 [jbd2]
#11 [ffff88100def3ee8] kthread at ffffffff81090db6
#12 [ffff88100def3f48] kernel_thread_helper at ffffffff81516284
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Alex Chen <alex.chen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The following case may lead to o2net_wq and o2hb thread deadlock on
o2hb_callback_sem.
Currently there are 2 nodes say N1, N2 in the cluster. And N2 down, at
the same time, N3 tries to join the cluster. So N1 will handle node
down (N2) and join (N3) simultaneously.
o2hb o2net_wq
->o2hb_do_disk_heartbeat
->o2hb_check_slot
->o2hb_run_event_list
->o2hb_fire_callbacks
->down_write(&o2hb_callback_sem)
->o2net_hb_node_down_cb
->flush_workqueue(o2net_wq)
->o2net_process_message
->dlm_query_join_handler
->o2hb_check_node_heartbeating
->o2hb_fill_node_map
->down_read(&o2hb_callback_sem)
No need to take o2hb_callback_sem in dlm_query_join_handler,
o2hb_live_lock is enough to protect live node map.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: xMark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: jiangyiwen <jiangyiwen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Firing quorum before connection established can cause unexpected node to
reboot.
Assume there are 3 nodes in the cluster, Node 1, 2, 3. Node 2 and 3 have
wrong ip address of Node 1 in cluster.conf and global heartbeat is enabled
in the cluster. After the heatbeats are started on these three nodes,
Node 1 will reboot due to quorum fencing. It is similar case if Node 1's
networking is not ready when starting the global heartbeat.
The reboot is not friendly as customer is not fully ready for ocfs2 to
work. Fix it by not allowing firing quorum before the connection is
established. In this case, ocfs2 will wait until the wrong configuration
is fixed or networking is up to continue. Also update the log to guide
the user where to check when connection is not built for a long time.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
| |
Reduce boilerplate code by using seq_open_private() instead of seq_open()
Signed-off-by: Rob Jones <rob.jones@codethink.co.uk>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Reduce boilerplate code by using seq_open_private() instead of seq_open()
Note that the code in and using sc_common_open() has been quite
extensively changed. Not least because there was a latent memory leak in
the code as was: if sc_common_open() failed, the previously allocated
buffer was not freed.
Signed-off-by: Rob Jones <rob.jones@codethink.co.uk>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
| |
Reduce boilerplate code by using seq_open_private() instead of seq_open()
Signed-off-by: Rob Jones <rob.jones@codethink.co.uk>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
| |
Remove the branch that free res->lockname.name because the condition
is never satisfied when jump to label error.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
dlm_lockres_put() should be called without &res->spinlock, otherwise a
deadlock case may happen.
spin_lock(&res->spinlock)
...
dlm_lockres_put
->dlm_lockres_release
->dlm_print_one_lock_resource
->spin_lock(&res->spinlock)
Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
| |
In o2net_init, if malloc failed, it directly returns -ENOMEM. Then
o2quo_exit won't be called in init_o2nm.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ocfs2_inode_info->ip_clusters and ocfs2_dinode->id1.bitmap1.i_total are
defined as type u32, so the shift left operations may overflow if volume
size is large, for example, 2TB and cluster size is 1MB.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
| |
Refactoring error handling in dlm_alloc_ctxt to simplify code.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
| |
It is supposed to zero pv_minor.
Reported-by: Himangi Saraogi <himangi774@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
| |
fs/ntfs/debug.c:124: WARNING: space prohibited between function name and
open parenthesis '('
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
marked later on
Mel Gorman's commit 2457aec63745 ("mm: non-atomically mark page accessed
during page cache allocation where possible") removed mark_page_accessed()
calls from NTFS without updating the matching find_lock_page() to
find_get_page_flags(GFP_LOCK | FGP_ACCESSED) thus causing the page to
never be marked accessed.
This patch fixes that.
Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
According to commit 80af258867648 ("fanotify: groups can specify their
f_flags for new fd"), file descriptors created as part of file access
notification events inherit flags from the event_f_flags argument passed
to syscall fanotify_init(2)[1].
Unfortunately O_CLOEXEC is currently silently ignored.
Indeed, event_f_flags are only given to dentry_open(), which only seems to
care about O_ACCMODE and O_PATH in do_dentry_open(), O_DIRECT in
open_check_o_direct() and O_LARGEFILE in generic_file_open().
It's a pity, since, according to some lookup on various search engines and
http://codesearch.debian.net/, there's already some userspace code which
use O_CLOEXEC:
- in systemd's readahead[2]:
fanotify_fd = fanotify_init(FAN_CLOEXEC|FAN_NONBLOCK, O_RDONLY|O_LARGEFILE|O_CLOEXEC|O_NOATIME);
- in clsync[3]:
#define FANOTIFY_EVFLAGS (O_LARGEFILE|O_RDONLY|O_CLOEXEC)
int fanotify_d = fanotify_init(FANOTIFY_FLAGS, FANOTIFY_EVFLAGS);
- in examples [4] from "Filesystem monitoring in the Linux
kernel" article[5] by Aleksander Morgado:
if ((fanotify_fd = fanotify_init (FAN_CLOEXEC,
O_RDONLY | O_CLOEXEC | O_LARGEFILE)) < 0)
Additionally, since commit 48149e9d3a7e ("fanotify: check file flags
passed in fanotify_init"). having O_CLOEXEC as part of fanotify_init()
second argument is expressly allowed.
So it seems expected to set close-on-exec flag on the file descriptors if
userspace is allowed to request it with O_CLOEXEC.
But Andrew Morton raised[6] the concern that enabling now close-on-exec
might break existing applications which ask for O_CLOEXEC but expect the
file descriptor to be inherited across exec().
In the other hand, as reported by Mihai Dontu[7] close-on-exec on the file
descriptor returned as part of file access notify can break applications
due to deadlock. So close-on-exec is needed for most applications.
More, applications asking for close-on-exec are likely expecting it to be
enabled, relying on O_CLOEXEC being effective. If not, it might weaken
their security, as noted by Jan Kara[8].
So this patch replaces call to macro get_unused_fd() by a call to function
get_unused_fd_flags() with event_f_flags value as argument. This way
O_CLOEXEC flag in the second argument of fanotify_init(2) syscall is
interpreted and close-on-exec get enabled when requested.
[1] http://man7.org/linux/man-pages/man2/fanotify_init.2.html
[2] http://cgit.freedesktop.org/systemd/systemd/tree/src/readahead/readahead-collect.c?id=v208#n294
[3] https://github.com/xaionaro/clsync/blob/v0.2.1/sync.c#L1631
https://github.com/xaionaro/clsync/blob/v0.2.1/configuration.h#L38
[4] http://www.lanedo.com/~aleksander/fanotify/fanotify-example.c
[5] http://www.lanedo.com/2013/filesystem-monitoring-linux-kernel/
[6] http://lkml.kernel.org/r/20141001153621.65e9258e65a6167bf2e4cb50@linux-foundation.org
[7] http://lkml.kernel.org/r/20141002095046.3715eb69@mdontu-l
[8] http://lkml.kernel.org/r/20141002104410.GB19748@quack.suse.cz
Link: http://lkml.kernel.org/r/cover.1411562410.git.ydroneaud@opteya.com
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Tested-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Mihai Don\u021bu <mihai.dontu@gmail.com>
Cc: Pádraig Brady <P@draigBrady.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Michael Kerrisk-manpages <mtk.manpages@gmail.com>
Cc: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: Richard Guy Briggs <rgb@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
On some failure paths we may attempt to free user context even if it
wasn't assigned yet. This will cause a NULL ptr deref and a kernel BUG.
The path I was looking at is in inotify_new_group():
oevent = kmalloc(sizeof(struct inotify_event_info), GFP_KERNEL);
if (unlikely(!oevent)) {
fsnotify_destroy_group(group);
return ERR_PTR(-ENOMEM);
}
fsnotify_destroy_group() would get called here, but
group->inotify_data.user is only getting assigned later:
group->inotify_data.user = get_current_user();
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: John McCutchan <john@johnmccutchan.com>
Cc: Robert Love <rlove@rlove.org>
Cc: Eric Paris <eparis@parisplace.org>
Reviewed-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
| |
No callers outside this file.
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Nothing really exciting this time:
- a few fixlets in the NOHZ code
- a new ARM SoC timer abomination. One should expect that we have
enough of them already, but they insist on inventing new ones.
- the usual bunch of ARM SoC timer updates. That feels like herding
cats"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource: arm_arch_timer: Consolidate arch_timer_evtstrm_enable
clocksource: arm_arch_timer: Enable counter access for 32-bit ARM
clocksource: arm_arch_timer: Change clocksource name if CP15 unavailable
clocksource: sirf: Disable counter before re-setting it
clocksource: cadence_ttc: Add support for 32bit mode
clocksource: tcb_clksrc: Sanitize IRQ request
clocksource: arm_arch_timer: Discard unavailable timers correctly
clocksource: vf_pit_timer: Support shutdown mode
ARM: meson6: clocksource: Add Meson6 timer support
ARM: meson: documentation: Add timer documentation
clocksource: sh_tmu: Document r8a7779 binding
clocksource: sh_mtu2: Document r7s72100 binding
clocksource: sh_cmt: Document SoC specific bindings
timerfd: Remove an always true check
nohz: Avoid tick's double reprogramming in highres mode
nohz: Fix spurious periodic tick behaviour in low-res dynticks mode
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
We would have returned -EINVAL earlier if ticks wasn't set.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Link: http://lkml.kernel.org/r/20140801082848.GF28869@mwanda
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"This patch-set introduces a couple of new features such as large
sector size, FITRIM, and atomic/volatile writes.
Several patches enhance power-off recovery and checkpoint routines.
The fsck.f2fs starts to support fixing corrupted partitions with
recovery hints provided by this patch-set.
Summary:
- retain some recovery information for fsck.f2fs
- enhance checkpoint speed
- enhance flush command management
- bug fix for lseek
- tune in-place-update policies
- enhance roll-forward speed
- revisit all the roll-forward and fsync rules
- support larget sector size
- support FITRIM
- support atomic and volatile writes
And several clean-ups and bug fixes are included"
* tag 'f2fs-for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (42 commits)
f2fs: support volatile operations for transient data
f2fs: support atomic writes
f2fs: remove unused return value
f2fs: clean up f2fs_ioctl functions
f2fs: potential shift wrapping buf in f2fs_trim_fs()
f2fs: call f2fs_unlock_op after error was handled
f2fs: check the use of macros on block counts and addresses
f2fs: refactor flush_nat_entries to remove costly reorganizing ops
f2fs: introduce FITRIM in f2fs_ioctl
f2fs: introduce cp_control structure
f2fs: use more free segments until SSR is activated
f2fs: change the ipu_policy option to enable combinations
f2fs: fix to search whole dirty segmap when get_victim
f2fs: fix to clean previous mount option when remount_fs
f2fs: skip punching hole in special condition
f2fs: support large sector size
f2fs: fix to truncate blocks past EOF in ->setattr
f2fs: update i_size when __allocate_data_block
f2fs: use MAX_BIO_BLOCKS(sbi)
f2fs: remove redundant operation during roll-forward recovery
...
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch adds support for volatile writes which keep data pages in memory
until f2fs_evict_inode is called by iput.
For instance, we can use this feature for the sqlite database as follows.
While supporting atomic writes for main database file, we can keep its journal
data temporarily in the page cache by the following sequence.
1. open
-> ioctl(F2FS_IOC_START_VOLATILE_WRITE);
2. writes
: keep all the data in the page cache.
3. flush to the database file with atomic writes
a. ioctl(F2FS_IOC_START_ATOMIC_WRITE);
b. writes
c. ioctl(F2FS_IOC_COMMIT_ATOMIC_WRITE);
4. close
-> drop the cached data
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch introduces a very limited functionality for atomic write support.
In order to support atomic write, this patch adds two ioctls:
o F2FS_IOC_START_ATOMIC_WRITE
o F2FS_IOC_COMMIT_ATOMIC_WRITE
The database engine should be aware of the following sequence.
1. open
-> ioctl(F2FS_IOC_START_ATOMIC_WRITE);
2. writes
: all the written data will be treated as atomic pages.
3. commit
-> ioctl(F2FS_IOC_COMMIT_ATOMIC_WRITE);
: this flushes all the data blocks to the disk, which will be shown all or
nothing by f2fs recovery procedure.
4. repeat to #2.
The IO pattens should be:
,- START_ATOMIC_WRITE ,- COMMIT_ATOMIC_WRITE
CP | D D D D D D | FSYNC | D D D D | FSYNC ...
`- COMMIT_ATOMIC_WRITE
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | | |
Don't return any value without any usage.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | | |
This patch cleans up f2fs_ioctl functions for better readability.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
My static checker complains that segment is a u64 but only the lower 31
bits can be used before we hit a shift wrapping bug.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch relocates f2fs_unlock_op in every directory operations to be called
after any error was processed.
Otherwise, the checkpoint can be entered with valid node ids without its
dentry when -ENOSPC is occurred.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch cleans up the existing and new macros for readability.
Rule is like this.
,-----------------------------------------> MAX_BLKADDR -,
| ,------------- TOTAL_BLKS ----------------------------,
| | |
| ,- seg0_blkaddr ,----- sit/nat/ssa/main blkaddress |
block | | (SEG0_BLKADDR) | | | | (e.g., MAIN_BLKADDR) |
address 0..x................ a b c d .............................
| |
global seg# 0...................... m .............................
| | |
| `------- MAIN_SEGS -----------'
`-------------- TOTAL_SEGS ---------------------------'
| |
seg# 0..........xx..................
= Note =
o GET_SEGNO_FROM_SEG0 : blk address -> global segno
o GET_SEGNO : blk address -> segno
o START_BLOCK : segno -> starting block address
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Previously, f2fs tries to reorganize the dirty nat entries into multiple sets
according to its nid ranges. This can improve the flushing nat pages, however,
if there are a lot of cached nat entries, it becomes a bottleneck.
This patch introduces a new set management flow by removing dirty nat list and
adding a series of set operations when the nat entry becomes dirty.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch introduces FITRIM in f2fs_ioctl.
In this case, f2fs will issue small discards and prefree discards as many as
possible for the given area.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch add a new data structure to control checkpoint parameters.
Currently, it presents the reason of checkpoint such as is_umount and normal
sync.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Previously, f2fs activates SSR if the # of free segments reaches to the # of
overprovisioned segments.
In this case, SSR starts to use dirty segments only, so that the overprovisoned
space cannot be selected for new data.
This means that we have no chance to utilizae the overprovisioned space at all.
This patch fixes that by allowing LFS allocations until the # of free segments
reaches to the last threshold, reserved space.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch changes the ipu_policy setting to use any combination of orthogonal policies.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
In ->get_victim we get max_search value from dirty_i->nr_dirty without
protection of seglist_lock, after that, nr_dirty can be increased/decreased
before we hold seglist_lock lock.
Then in main loop we attempt to traverse all dirty section one time to find
victim section, but it's not accurate to use max_search as the total loop count,
because we might lose checking several sections or check sections redundantly
for the case of nr_dirty are increased or decreased previously.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
In manual of mount, we descript remount as below:
"mount -o remount,rw /dev/foo /dir
After this call all old mount options are replaced and arbitrary stuff from
fstab is ignored, except the loop= option which is internally generated and
maintained by the mount command."
Previously f2fs do not clear up old mount options when remount_fs, so we have no
chance of disabling previous option (e.g. flush_merge). Fix it.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Now punching hole in directory is not supported in f2fs, so let's limit file
type in punch_hole().
In addition, in punch_hole if offset is exceed file size, we should skip
punching hole.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Block size in f2fs is 4096 bytes, so theoretically, f2fs can support 4096 bytes
sector device at maximum. But now f2fs only support 512 bytes size sector, so
block device such as zRAM which uses page cache as its block storage space will
not be mounted successfully as mismatch between sector size of zRAM and sector
size of f2fs supported.
In this patch we support large sector size in f2fs, so block device with sector
size of 512/1024/2048/4096 bytes can be supported in f2fs.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
By using FALLOC_FL_KEEP_SIZE in ->fallocate of f2fs, we can fallocate block past
EOF without changing i_size of inode. These blocks past EOF will not be
truncated in ->setattr as we truncate them only when change the file size.
We should give a chance to truncate blocks out of filesize in setattr().
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The f2fs_direct_IO uses __allocate_data_block, but inside the allocation path,
we should update i_size at the changed time to update its inode page.
Otherwise, we can get wrong i_size after roll-forward recovery.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | | |
This patch cleans up a simple macro.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
If same data is updated multiple times, we don't need to redo whole the
operations.
Let's just update the lastest one.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
In f2fs_sync_file, if there is no written appended writes, it skips
to write its node blocks.
But, if there is up-to-date inode page, we should write it to update
its metadata during the roll-forward recovery.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
We can summarize the roll forward recovery scenarios as follows.
[Term] F: fsync_mark, D: dentry_mark
1. inode(x) | CP | inode(x) | dnode(F)
-> Update the latest inode(x).
2. inode(x) | CP | inode(F) | dnode(F)
-> No problem.
3. inode(x) | CP | dnode(F) | inode(x)
-> Recover to the latest dnode(F), and drop the last inode(x)
4. inode(x) | CP | dnode(F) | inode(F)
-> No problem.
5. CP | inode(x) | dnode(F)
-> The inode(DF) was missing. Should drop this dnode(F).
6. CP | inode(DF) | dnode(F)
-> No problem.
7. CP | dnode(F) | inode(DF)
-> If f2fs_iget fails, then goto next to find inode(DF).
8. CP | dnode(F) | inode(x)
-> If f2fs_iget fails, then goto next to find inode(DF).
But it will fail due to no inode(DF).
So, this patch adds some missing points such as #1, #5, #7, and #8.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch revisited whole the recovery information during the f2fs_sync_file.
In this patch, there are three information to make a decision.
a) IS_CHECKPOINTED, /* is it checkpointed before? */
b) HAS_FSYNCED_INODE, /* is the inode fsynced before? */
c) HAS_LAST_FSYNC, /* has the latest node fsync mark? */
And, the scenarios for our rule are based on:
[Term] F: fsync_mark, D: dentry_mark
1. inode(x) | CP | inode(x) | dnode(F)
2. inode(x) | CP | inode(F) | dnode(F)
3. inode(x) | CP | dnode(F) | inode(x) | inode(F)
4. inode(x) | CP | dnode(F) | inode(F)
5. CP | inode(x) | dnode(F) | inode(DF)
6. CP | inode(DF) | dnode(F)
7. CP | dnode(F) | inode(DF)
8. CP | dnode(F) | inode(x) | inode(DF)
For example, #3, the three conditions should be changed as follows.
inode(x) | CP | dnode(F) | inode(x) | inode(F)
a) x o o o o
b) x x x x o
c) x o o x o
If f2fs_sync_file stops ------^,
it should write inode(F) --------------^
So, the need_inode_block_update should return true, since
c) get_nat_flag(e, HAS_LAST_FSYNC), is false.
For example, #8,
CP | alloc | dnode(F) | inode(x) | inode(DF)
a) o x x x x
b) x x x o
c) o o x o
If f2fs_sync_file stops -------^,
it should write inode(DF) --------------^
Note that, the roll-forward policy should follow this rule, which means,
if there are any missing blocks, we doesn't need to recover that inode.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch introduces a flag in the nat entry structure to merge various
information such as checkpointed and fsync_done marks.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Previously, all the dnode pages should be read during the roll-forward recovery.
Even worsely, whole the chain was traversed twice.
This patch removes that redundant and costly read operations by using page cache
of meta_inode and readahead function as well.
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
If the inode is same and its data index are needed to truncate, we can fall into
double lock for its inode page via get_dnode_of_data.
Error case is like this.
1. write data 1, 2, 3, 4, 5 in inode #4.
2. write data 100, 102, 103, 104, 105 in dnode #6 of inode #4.
3. sync
4. update data 100->106 in dnode #6.
5. fsync inode #4.
6. power-cut
-> Then,
1. go back to #3's checkpoint
2. in do_recover_data, get_dnode_of_data() gets inode #4.
3. detect 100->106 in dnode #6.
4. check_index_in_prev_nodes tries to truncate 100 in dnode #6.
5. to trigger truncate_hole, get_dnode_of_data should grab inode #4.
6. detect *kernel hang*
This patch should resolve that bug.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|