From 143dfe8611a63030ce0c79419dc362f7838be557 Mon Sep 17 00:00:00 2001 From: Wu Fengguang Date: Fri, 27 Aug 2010 18:45:12 -0600 Subject: writeback: IO-less balance_dirty_pages() As proposed by Chris, Dave and Jan, don't start foreground writeback IO inside balance_dirty_pages(). Instead, simply let it idle sleep for some time to throttle the dirtying task. In the mean while, kick off the per-bdi flusher thread to do background writeback IO. RATIONALS ========= - disk seeks on concurrent writeback of multiple inodes (Dave Chinner) If every thread doing writes and being throttled start foreground writeback, it leads to N IO submitters from at least N different inodes at the same time, end up with N different sets of IO being issued with potentially zero locality to each other, resulting in much lower elevator sort/merge efficiency and hence we seek the disk all over the place to service the different sets of IO. OTOH, if there is only one submission thread, it doesn't jump between inodes in the same way when congestion clears - it keeps writing to the same inode, resulting in large related chunks of sequential IOs being issued to the disk. This is more efficient than the above foreground writeback because the elevator works better and the disk seeks less. - lock contention and cache bouncing on concurrent IO submitters (Dave Chinner) With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention". * "CPU usage has dropped by ~55%", "it certainly appears that most of the CPU time saving comes from the removal of contention on the inode_wb_list_lock" (IMHO at least 10% comes from the reduction of cacheline bouncing, because the new code is able to call much less frequently into balance_dirty_pages() and hence access the global page states) * the user space "App overhead" is reduced by 20%, by avoiding the cacheline pollution by the complex writeback code path * "for a ~5% throughput reduction", "the number of write IOs have dropped by ~25%", and the elapsed time reduced from 41:42.17 to 40:53.23. * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%, and improves IO throughput from 38MB/s to 42MB/s. - IO size too small for fast arrays and too large for slow USB sticks The write_chunk used by current balance_dirty_pages() cannot be directly set to some large value (eg. 128MB) for better IO efficiency. Because it could lead to more than 1 second user perceivable stalls. Even the current 4MB write size may be too large for slow USB sticks. The fact that balance_dirty_pages() starts IO on itself couples the IO size to wait time, which makes it hard to do suitable IO size while keeping the wait time under control. Now it's possible to increase writeback chunk size proportional to the disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram, the larger writeback size dramatically reduces the seek count to 1/10 (far beyond my expectation) and improves the write throughput by 24%. - long block time in balance_dirty_pages() hurts desktop responsiveness Many of us may have the experience: it often takes a couple of seconds or even long time to stop a heavy writing dd/cp/tar command with Ctrl-C or "kill -9". - IO pipeline broken by bumpy write() progress There are a broad class of "loop {read(buf); write(buf);}" applications whose read() pipeline will be under-utilized or even come to a stop if the write()s have long latencies _or_ don't progress in a constant rate. The current threshold based throttling inherently transfers the large low level IO completion fluctuations to bumpy application write()s, and further deteriorates with increasing number of dirtiers and/or bdi's. For example, when doing 50 dd's + 1 remote rsync to an XFS partition, the rsync progresses very bumpy in legacy kernel, and throughput is improved by 67% by this patchset. (plus the larger write chunk size, it will be 93% speedup). The new rate based throttling can support 1000+ dd's with excellent smoothness, low latency and low overheads. For the above reasons, it's much better to do IO-less and low latency pauses in balance_dirty_pages(). Jan Kara, Dave Chinner and me explored the scheme to let balance_dirty_pages() wait for enough writeback IO completions to safeguard the dirty limit. However it's found to have two problems: - in large NUMA systems, the per-cpu counters may have big accounting errors, leading to big throttle wait time and jitters. - NFS may kill large amount of unstable pages with one single COMMIT. Because NFS server serves COMMIT with expensive fsync() IOs, it is desirable to delay and reduce the number of COMMITs. So it's not likely to optimize away such kind of bursty IO completions, and the resulted large (and tiny) stall times in IO completion based throttling. So here is a pause time oriented approach, which tries to control the pause time in each balance_dirty_pages() invocations, by controlling the number of pages dirtied before calling balance_dirty_pages(), for smooth and efficient dirty throttling: - avoid useless (eg. zero pause time) balance_dirty_pages() calls - avoid too small pause time (less than 4ms, which burns CPU power) - avoid too large pause time (more than 200ms, which hurts responsiveness) - avoid big fluctuations of pause times It can control pause times at will. The default policy (in a followup patch) will be to do ~10ms pauses in 1-dd case, and increase to ~100ms in 1000-dd case. BEHAVIOR CHANGE =============== (1) dirty threshold Users will notice that the applications will get throttled once crossing the global (background + dirty)/2=15% threshold, and then balanced around 17.5%. Before patch, the behavior is to just throttle it at 20% dirtyable memory in 1-dd case. Since the task will be soft throttled earlier than before, it may be perceived by end users as performance "slow down" if his application happens to dirty more than 15% dirtyable memory. (2) smoothness/responsiveness Users will notice a more responsive system during heavy writeback. "killall dd" will take effect instantly. Signed-off-by: Wu Fengguang --- include/trace/events/writeback.h | 24 ------------------------ 1 file changed, 24 deletions(-) (limited to 'include/trace') diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index 5f172703eb4f..178c23508d3d 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -104,30 +104,6 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_register); DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister); DEFINE_WRITEBACK_EVENT(writeback_thread_start); DEFINE_WRITEBACK_EVENT(writeback_thread_stop); -DEFINE_WRITEBACK_EVENT(balance_dirty_start); -DEFINE_WRITEBACK_EVENT(balance_dirty_wait); - -TRACE_EVENT(balance_dirty_written, - - TP_PROTO(struct backing_dev_info *bdi, int written), - - TP_ARGS(bdi, written), - - TP_STRUCT__entry( - __array(char, name, 32) - __field(int, written) - ), - - TP_fast_assign( - strncpy(__entry->name, dev_name(bdi->dev), 32); - __entry->written = written; - ), - - TP_printk("bdi %s written %d", - __entry->name, - __entry->written - ) -); DECLARE_EVENT_CLASS(wbc_class, TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi), -- cgit v1.2.1 From b48c104d2211b0ac881a71f5f76a3816225f8111 Mon Sep 17 00:00:00 2001 From: Wu Fengguang Date: Wed, 2 Mar 2011 17:22:49 -0600 Subject: writeback: trace event bdi_dirty_ratelimit It helps understand how various throttle bandwidths are updated. Signed-off-by: Wu Fengguang --- include/trace/events/writeback.h | 45 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 45 insertions(+) (limited to 'include/trace') diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index 178c23508d3d..ffb5deb77ca9 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -226,6 +226,51 @@ TRACE_EVENT(global_dirty_state, ) ); +#define KBps(x) ((x) << (PAGE_SHIFT - 10)) + +TRACE_EVENT(bdi_dirty_ratelimit, + + TP_PROTO(struct backing_dev_info *bdi, + unsigned long dirty_rate, + unsigned long task_ratelimit), + + TP_ARGS(bdi, dirty_rate, task_ratelimit), + + TP_STRUCT__entry( + __array(char, bdi, 32) + __field(unsigned long, write_bw) + __field(unsigned long, avg_write_bw) + __field(unsigned long, dirty_rate) + __field(unsigned long, dirty_ratelimit) + __field(unsigned long, task_ratelimit) + __field(unsigned long, balanced_dirty_ratelimit) + ), + + TP_fast_assign( + strlcpy(__entry->bdi, dev_name(bdi->dev), 32); + __entry->write_bw = KBps(bdi->write_bandwidth); + __entry->avg_write_bw = KBps(bdi->avg_write_bandwidth); + __entry->dirty_rate = KBps(dirty_rate); + __entry->dirty_ratelimit = KBps(bdi->dirty_ratelimit); + __entry->task_ratelimit = KBps(task_ratelimit); + __entry->balanced_dirty_ratelimit = + KBps(bdi->balanced_dirty_ratelimit); + ), + + TP_printk("bdi %s: " + "write_bw=%lu awrite_bw=%lu dirty_rate=%lu " + "dirty_ratelimit=%lu task_ratelimit=%lu " + "balanced_dirty_ratelimit=%lu", + __entry->bdi, + __entry->write_bw, /* write bandwidth */ + __entry->avg_write_bw, /* avg write bandwidth */ + __entry->dirty_rate, /* bdi dirty rate */ + __entry->dirty_ratelimit, /* base ratelimit */ + __entry->task_ratelimit, /* ratelimit with position control */ + __entry->balanced_dirty_ratelimit /* the balanced ratelimit */ + ) +); + DECLARE_EVENT_CLASS(writeback_congest_waited_template, TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed), -- cgit v1.2.1 From ece13ac31bbe492d940ba0bc4ade2ae1521f46a5 Mon Sep 17 00:00:00 2001 From: Wu Fengguang Date: Sun, 29 Aug 2010 23:33:20 -0600 Subject: writeback: trace event balance_dirty_pages Useful for analyzing the dynamics of the throttling algorithms and debugging user reported problems. Signed-off-by: Wu Fengguang --- include/trace/events/writeback.h | 73 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 73 insertions(+) (limited to 'include/trace') diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index ffb5deb77ca9..0ce9f06f58c2 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -271,6 +271,79 @@ TRACE_EVENT(bdi_dirty_ratelimit, ) ); +TRACE_EVENT(balance_dirty_pages, + + TP_PROTO(struct backing_dev_info *bdi, + unsigned long thresh, + unsigned long bg_thresh, + unsigned long dirty, + unsigned long bdi_thresh, + unsigned long bdi_dirty, + unsigned long dirty_ratelimit, + unsigned long task_ratelimit, + unsigned long dirtied, + long pause, + unsigned long start_time), + + TP_ARGS(bdi, thresh, bg_thresh, dirty, bdi_thresh, bdi_dirty, + dirty_ratelimit, task_ratelimit, + dirtied, pause, start_time), + + TP_STRUCT__entry( + __array( char, bdi, 32) + __field(unsigned long, limit) + __field(unsigned long, setpoint) + __field(unsigned long, dirty) + __field(unsigned long, bdi_setpoint) + __field(unsigned long, bdi_dirty) + __field(unsigned long, dirty_ratelimit) + __field(unsigned long, task_ratelimit) + __field(unsigned int, dirtied) + __field(unsigned int, dirtied_pause) + __field(unsigned long, paused) + __field( long, pause) + ), + + TP_fast_assign( + unsigned long freerun = (thresh + bg_thresh) / 2; + strlcpy(__entry->bdi, dev_name(bdi->dev), 32); + + __entry->limit = global_dirty_limit; + __entry->setpoint = (global_dirty_limit + freerun) / 2; + __entry->dirty = dirty; + __entry->bdi_setpoint = __entry->setpoint * + bdi_thresh / (thresh + 1); + __entry->bdi_dirty = bdi_dirty; + __entry->dirty_ratelimit = KBps(dirty_ratelimit); + __entry->task_ratelimit = KBps(task_ratelimit); + __entry->dirtied = dirtied; + __entry->dirtied_pause = current->nr_dirtied_pause; + __entry->pause = pause * 1000 / HZ; + __entry->paused = (jiffies - start_time) * 1000 / HZ; + ), + + + TP_printk("bdi %s: " + "limit=%lu setpoint=%lu dirty=%lu " + "bdi_setpoint=%lu bdi_dirty=%lu " + "dirty_ratelimit=%lu task_ratelimit=%lu " + "dirtied=%u dirtied_pause=%u " + "paused=%lu pause=%ld", + __entry->bdi, + __entry->limit, + __entry->setpoint, + __entry->dirty, + __entry->bdi_setpoint, + __entry->bdi_dirty, + __entry->dirty_ratelimit, + __entry->task_ratelimit, + __entry->dirtied, + __entry->dirtied_pause, + __entry->paused, /* ms */ + __entry->pause /* ms */ + ) +); + DECLARE_EVENT_CLASS(writeback_congest_waited_template, TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed), -- cgit v1.2.1 From ad4e38dd6a33bb3a4882c487d7abe621e583b982 Mon Sep 17 00:00:00 2001 From: Curt Wohlgemuth Date: Fri, 7 Oct 2011 21:51:56 -0600 Subject: writeback: send work item to queue_io, move_expired_inodes Instead of sending ->older_than_this to queue_io() and move_expired_inodes(), send the entire wb_writeback_work structure. There are other fields of a work item that are useful in these routines and in tracepoints. Acked-by: Jan Kara Signed-off-by: Curt Wohlgemuth Signed-off-by: Wu Fengguang --- include/trace/events/writeback.h | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'include/trace') diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index 0ce9f06f58c2..1261db3916cc 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -157,9 +157,9 @@ DEFINE_WBC_EVENT(wbc_writepage); TRACE_EVENT(writeback_queue_io, TP_PROTO(struct bdi_writeback *wb, - unsigned long *older_than_this, + struct wb_writeback_work *work, int moved), - TP_ARGS(wb, older_than_this, moved), + TP_ARGS(wb, work, moved), TP_STRUCT__entry( __array(char, name, 32) __field(unsigned long, older) @@ -167,6 +167,7 @@ TRACE_EVENT(writeback_queue_io, __field(int, moved) ), TP_fast_assign( + unsigned long *older_than_this = work->older_than_this; strncpy(__entry->name, dev_name(wb->bdi->dev), 32); __entry->older = older_than_this ? *older_than_this : 0; __entry->age = older_than_this ? -- cgit v1.2.1 From 0e175a1835ffc979e55787774e58ec79e41957d7 Mon Sep 17 00:00:00 2001 From: Curt Wohlgemuth Date: Fri, 7 Oct 2011 21:54:10 -0600 Subject: writeback: Add a 'reason' to wb_writeback_work This creates a new 'reason' field in a wb_writeback_work structure, which unambiguously identifies who initiates writeback activity. A 'wb_reason' enumeration has been added to writeback.h, to enumerate the possible reasons. The 'writeback_work_class' and tracepoint event class and 'writeback_queue_io' tracepoints are updated to include the symbolic 'reason' in all trace events. And the 'writeback_inodes_sbXXX' family of routines has had a wb_stats parameter added to them, so callers can specify why writeback is being started. Acked-by: Jan Kara Signed-off-by: Curt Wohlgemuth Signed-off-by: Wu Fengguang --- include/trace/events/writeback.h | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) (limited to 'include/trace') diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index 1261db3916cc..b99caa8b780c 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -34,6 +34,7 @@ DECLARE_EVENT_CLASS(writeback_work_class, __field(int, for_kupdate) __field(int, range_cyclic) __field(int, for_background) + __field(int, reason) ), TP_fast_assign( strncpy(__entry->name, dev_name(bdi->dev), 32); @@ -43,16 +44,18 @@ DECLARE_EVENT_CLASS(writeback_work_class, __entry->for_kupdate = work->for_kupdate; __entry->range_cyclic = work->range_cyclic; __entry->for_background = work->for_background; + __entry->reason = work->reason; ), TP_printk("bdi %s: sb_dev %d:%d nr_pages=%ld sync_mode=%d " - "kupdate=%d range_cyclic=%d background=%d", + "kupdate=%d range_cyclic=%d background=%d reason=%s", __entry->name, MAJOR(__entry->sb_dev), MINOR(__entry->sb_dev), __entry->nr_pages, __entry->sync_mode, __entry->for_kupdate, __entry->range_cyclic, - __entry->for_background + __entry->for_background, + wb_reason_name[__entry->reason] ) ); #define DEFINE_WRITEBACK_WORK_EVENT(name) \ @@ -165,6 +168,7 @@ TRACE_EVENT(writeback_queue_io, __field(unsigned long, older) __field(long, age) __field(int, moved) + __field(int, reason) ), TP_fast_assign( unsigned long *older_than_this = work->older_than_this; @@ -173,12 +177,14 @@ TRACE_EVENT(writeback_queue_io, __entry->age = older_than_this ? (jiffies - *older_than_this) * 1000 / HZ : -1; __entry->moved = moved; + __entry->reason = work->reason; ), - TP_printk("bdi %s: older=%lu age=%ld enqueue=%d", + TP_printk("bdi %s: older=%lu age=%ld enqueue=%d reason=%s", __entry->name, __entry->older, /* older_than_this in jiffies */ __entry->age, /* older_than_this in relative milliseconds */ - __entry->moved) + __entry->moved, + wb_reason_name[__entry->reason]) ); TRACE_EVENT(global_dirty_state, -- cgit v1.2.1