<feed xmlns='http://www.w3.org/2005/Atom'>
<title>blackbird-op-linux/fs/btrfs, branch master</title>
<subtitle>Blackbird™ Linux sources for OpenPOWER</subtitle>
<id>https://git.raptorcs.com/git/blackbird-op-linux/atom?h=master</id>
<link rel='self' href='https://git.raptorcs.com/git/blackbird-op-linux/atom?h=master'/>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/blackbird-op-linux/'/>
<updated>2020-02-17T21:26:30+00:00</updated>
<entry>
<title>Merge tag 'for-5.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux</title>
<updated>2020-02-17T21:26:30+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-02-17T21:26:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/blackbird-op-linux/commit/?id=eaea2947063ac694cddff1787d43e7807490dbc7'/>
<id>urn:sha1:eaea2947063ac694cddff1787d43e7807490dbc7</id>
<content type='text'>
Pull btrfs fix from David Sterba:
 "This is the fix for sleeping in a locked section bug reported by Dave
  Jones, caused by a patch dependence in development and pulled
  branches.

  I picked the existing patch over the fixup that Filipe sent, as it's a
  bit more generic fix. I've verified it with a specific test case, some
  rsync stress and one round of fstests"

* tag 'for-5.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: don't set path-&gt;leave_spinning for truncate
</content>
</entry>
<entry>
<title>btrfs: don't set path-&gt;leave_spinning for truncate</title>
<updated>2020-02-17T15:23:06+00:00</updated>
<author>
<name>Josef Bacik</name>
<email>josef@toxicpanda.com</email>
</author>
<published>2020-01-17T14:02:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/blackbird-op-linux/commit/?id=52e29e331070cd7d52a64cbf1b0958212a340e28'/>
<id>urn:sha1:52e29e331070cd7d52a64cbf1b0958212a340e28</id>
<content type='text'>
The only time we actually leave the path spinning is if we're truncating
a small amount and don't actually free an extent, which is not a common
occurrence.  We have to set the path blocking in order to add the
delayed ref anyway, so the first extent we find we set the path to
blocking and stay blocking for the duration of the operation.  With the
upcoming file extent map stuff there will be another case that we have
to have the path blocking, so just swap to blocking always.

Note: this patch also fixes a warning after 28553fa992cb ("Btrfs: fix
race between shrinking truncate and fiemap") got merged that inserts
extent locks around truncation so the path must not leave spinning locks
after btrfs_search_slot.

  [70.794783] BUG: sleeping function called from invalid context at mm/slab.h:565
  [70.794834] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1141, name: rsync
  [70.794863] 5 locks held by rsync/1141:
  [70.794876]  #0: ffff888417b9c408 (sb_writers#17){.+.+}, at: mnt_want_write+0x20/0x50
  [70.795030]  #1: ffff888428de28e8 (&amp;type-&gt;i_mutex_dir_key#13/1){+.+.}, at: lock_rename+0xf1/0x100
  [70.795051]  #2: ffff888417b9c608 (sb_internal#2){.+.+}, at: start_transaction+0x394/0x560
  [70.795124]  #3: ffff888403081768 (btrfs-fs-01){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
  [70.795203]  #4: ffff888403086568 (btrfs-fs-00){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
  [70.795222] CPU: 5 PID: 1141 Comm: rsync Not tainted 5.6.0-rc2-backup+ #2
  [70.795362] Call Trace:
  [70.795374]  dump_stack+0x71/0xa0
  [70.795445]  ___might_sleep.part.96.cold.106+0xa6/0xb6
  [70.795459]  kmem_cache_alloc+0x1d3/0x290
  [70.795471]  alloc_extent_state+0x22/0x1c0
  [70.795544]  __clear_extent_bit+0x3ba/0x580
  [70.795557]  ? _raw_spin_unlock_irq+0x24/0x30
  [70.795569]  btrfs_truncate_inode_items+0x339/0xe50
  [70.795647]  btrfs_evict_inode+0x269/0x540
  [70.795659]  ? dput.part.38+0x29/0x460
  [70.795671]  evict+0xcd/0x190
  [70.795682]  __dentry_kill+0xd6/0x180
  [70.795754]  dput.part.38+0x2ad/0x460
  [70.795765]  do_renameat2+0x3cb/0x540
  [70.795777]  __x64_sys_rename+0x1c/0x20

Reported-by: Dave Jones &lt;davej@codemonkey.org.uk&gt;
Fixes: 28553fa992cb ("Btrfs: fix race between shrinking truncate and fiemap")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
[ add note ]
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>Merge tag 'for-5.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux</title>
<updated>2020-02-16T19:43:45+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-02-16T19:43:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/blackbird-op-linux/commit/?id=713db356041071d16360e82247de3107ec9ed57f'/>
<id>urn:sha1:713db356041071d16360e82247de3107ec9ed57f</id>
<content type='text'>
Pull btrfs fixes from David Sterba:
 "Two races fixed, memory leak fix, sysfs directory fixup and two new
  log messages:

   - two fixed race conditions: extent map merging and truncate vs
     fiemap

   - create the right sysfs directory with device information and move
     the individual device dirs under it

   - print messages when the tree-log is replayed at mount time or
     cannot be replayed on remount"

* tag 'for-5.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: sysfs, move device id directories to UUID/devinfo
  btrfs: sysfs, add UUID/devinfo kobject
  Btrfs: fix race between shrinking truncate and fiemap
  btrfs: log message when rw remount is attempted with unclean tree-log
  btrfs: print message when tree-log replay starts
  Btrfs: fix race between using extent maps and merging them
  btrfs: ref-verify: fix memory leaks
</content>
</entry>
<entry>
<title>btrfs: sysfs, move device id directories to UUID/devinfo</title>
<updated>2020-02-12T17:28:22+00:00</updated>
<author>
<name>Anand Jain</name>
<email>anand.jain@oracle.com</email>
</author>
<published>2020-02-12T09:28:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/blackbird-op-linux/commit/?id=1b9867eb6120db85f8dca8ff42789d9ec9ee16a5'/>
<id>urn:sha1:1b9867eb6120db85f8dca8ff42789d9ec9ee16a5</id>
<content type='text'>
Originally it was planned to create device id directories under
UUID/devinfo, but it got under UUID/devices by mistake. We really want
it under definfo so the bare device node names are not mixed with device
ids and are easy to enumerate.

Fixes: 668e48af7a94 ("btrfs: sysfs, add devid/dev_state kobject and device attributes")
Signed-off-by: Anand Jain &lt;anand.jain@oracle.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: sysfs, add UUID/devinfo kobject</title>
<updated>2020-02-12T17:28:18+00:00</updated>
<author>
<name>Anand Jain</name>
<email>anand.jain@oracle.com</email>
</author>
<published>2020-02-12T09:28:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/blackbird-op-linux/commit/?id=a013d141eceee0f7747385e900da2858141aa0f3'/>
<id>urn:sha1:a013d141eceee0f7747385e900da2858141aa0f3</id>
<content type='text'>
Create directory /sys/fs/btrfs/UUID/devinfo to hold devices directories
by the id (unlike /devices).

Signed-off-by: Anand Jain &lt;anand.jain@oracle.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>Btrfs: fix race between shrinking truncate and fiemap</title>
<updated>2020-02-12T16:17:10+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2020-02-07T12:23:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/blackbird-op-linux/commit/?id=28553fa992cb28be6a65566681aac6cafabb4f2d'/>
<id>urn:sha1:28553fa992cb28be6a65566681aac6cafabb4f2d</id>
<content type='text'>
When there is a fiemap executing in parallel with a shrinking truncate
we can end up in a situation where we have extent maps for which we no
longer have corresponding file extent items. This is generally harmless
and at the moment the only consequences are missing file extent items
representing holes after we expand the file size again after the
truncate operation removed the prealloc extent items, and stale
information for future fiemap calls (reporting extents that no longer
exist or may have been reallocated to other files for example).

Consider the following example:

1) Our inode has a size of 128KiB, one 128KiB extent at file offset 0
   and a 1MiB prealloc extent at file offset 128KiB;

2) Task A starts doing a shrinking truncate of our inode to reduce it to
   a size of 64KiB. Before it searches the subvolume tree for file
   extent items to delete, it drops all the extent maps in the range
   from 64KiB to (u64)-1 by calling btrfs_drop_extent_cache();

3) Task B starts doing a fiemap against our inode. When looking up for
   the inode's extent maps in the range from 128KiB to (u64)-1, it
   doesn't find any in the inode's extent map tree, since they were
   removed by task A.  Because it didn't find any in the extent map
   tree, it scans the inode's subvolume tree for file extent items, and
   it finds the 1MiB prealloc extent at file offset 128KiB, then it
   creates an extent map based on that file extent item and adds it to
   inode's extent map tree (this ends up being done by
   btrfs_get_extent() &lt;- btrfs_get_extent_fiemap() &lt;-
   get_extent_skip_holes());

4) Task A then drops the prealloc extent at file offset 128KiB and
   shrinks the 128KiB extent file offset 0 to a length of 64KiB. The
   truncation operation finishes and we end up with an extent map
   representing a 1MiB prealloc extent at file offset 128KiB, despite we
   don't have any more that extent;

After this the two types of problems we have are:

1) Future calls to fiemap always report that a 1MiB prealloc extent
   exists at file offset 128KiB. This is stale information, no longer
   correct;

2) If the size of the file is increased, by a truncate operation that
   increases the file size or by a write into a file offset &gt; 64KiB for
   example, we end up not inserting file extent items to represent holes
   for any range between 128KiB and 128KiB + 1MiB, since the hole
   expansion function, btrfs_cont_expand() will skip hole insertion for
   any range for which an extent map exists that represents a prealloc
   extent. This causes fsck to complain about missing file extent items
   when not using the NO_HOLES feature.

The second issue could be often triggered by test case generic/561 from
fstests, which runs fsstress and duperemove in parallel, and duperemove
does frequent fiemap calls.

Essentially the problems happens because fiemap does not acquire the
inode's lock while truncate does, and fiemap locks the file range in the
inode's iotree while truncate does not. So fix the issue by making
btrfs_truncate_inode_items() lock the file range from the new file size
to (u64)-1, so that it serializes with fiemap.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: log message when rw remount is attempted with unclean tree-log</title>
<updated>2020-02-12T16:17:00+00:00</updated>
<author>
<name>David Sterba</name>
<email>dsterba@suse.com</email>
</author>
<published>2020-02-05T16:12:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/blackbird-op-linux/commit/?id=10a3a3edc5b89a8cd095bc63495fb1e0f42047d9'/>
<id>urn:sha1:10a3a3edc5b89a8cd095bc63495fb1e0f42047d9</id>
<content type='text'>
A remount to a read-write filesystem is not safe when there's tree-log
to be replayed. Files that could be opened until now might be affected
by the changes in the tree-log.

A regular mount is needed to replay the log so the filesystem presents
the consistent view with the pending changes included.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Anand Jain &lt;anand.jain@oracle.com&gt;
Reviewed-by: Johannes Thumshirn &lt;johannes.thumshirn@wdc.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: print message when tree-log replay starts</title>
<updated>2020-02-12T16:16:57+00:00</updated>
<author>
<name>David Sterba</name>
<email>dsterba@suse.com</email>
</author>
<published>2020-02-05T16:12:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/blackbird-op-linux/commit/?id=e8294f2f6aa6208ed0923aa6d70cea3be178309a'/>
<id>urn:sha1:e8294f2f6aa6208ed0923aa6d70cea3be178309a</id>
<content type='text'>
There's no logged information about tree-log replay although this is
something that points to previous unclean unmount. Other filesystems
report that as well.

Suggested-by: Chris Murphy &lt;lists@colorremedies.com&gt;
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Anand Jain &lt;anand.jain@oracle.com&gt;
Reviewed-by: Johannes Thumshirn &lt;johannes.thumshirn@wdc.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>Btrfs: fix race between using extent maps and merging them</title>
<updated>2020-02-12T16:16:46+00:00</updated>
<author>
<name>Filipe Manana</name>
<email>fdmanana@suse.com</email>
</author>
<published>2020-01-31T14:06:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/blackbird-op-linux/commit/?id=ac05ca913e9f3871126d61da275bfe8516ff01ca'/>
<id>urn:sha1:ac05ca913e9f3871126d61da275bfe8516ff01ca</id>
<content type='text'>
We have a few cases where we allow an extent map that is in an extent map
tree to be merged with other extents in the tree. Such cases include the
unpinning of an extent after the respective ordered extent completed or
after logging an extent during a fast fsync. This can lead to subtle and
dangerous problems because when doing the merge some other task might be
using the same extent map and as consequence see an inconsistent state of
the extent map - for example sees the new length but has seen the old start
offset.

With luck this triggers a BUG_ON(), and not some silent bug, such as the
following one in __do_readpage():

  $ cat -n fs/btrfs/extent_io.c
  3061  static int __do_readpage(struct extent_io_tree *tree,
  3062                           struct page *page,
  (...)
  3127                  em = __get_extent_map(inode, page, pg_offset, cur,
  3128                                        end - cur + 1, get_extent, em_cached);
  3129                  if (IS_ERR_OR_NULL(em)) {
  3130                          SetPageError(page);
  3131                          unlock_extent(tree, cur, end);
  3132                          break;
  3133                  }
  3134                  extent_offset = cur - em-&gt;start;
  3135                  BUG_ON(extent_map_end(em) &lt;= cur);
  (...)

Consider the following example scenario, where we end up hitting the
BUG_ON() in __do_readpage().

We have an inode with a size of 8KiB and 2 extent maps:

  extent A: file offset 0, length 4KiB, disk_bytenr = X, persisted on disk by
            a previous transaction

  extent B: file offset 4KiB, length 4KiB, disk_bytenr = X + 4KiB, not yet
            persisted but writeback started for it already. The extent map
	    is pinned since there's writeback and an ordered extent in
	    progress, so it can not be merged with extent map A yet

The following sequence of steps leads to the BUG_ON():

1) The ordered extent for extent B completes, the respective page gets its
   writeback bit cleared and the extent map is unpinned, at that point it
   is not yet merged with extent map A because it's in the list of modified
   extents;

2) Due to memory pressure, or some other reason, the MM subsystem releases
   the page corresponding to extent B - btrfs_releasepage() is called and
   returns 1, meaning the page can be released as it's not dirty, not under
   writeback anymore and the extent range is not locked in the inode's
   iotree. However the extent map is not released, either because we are
   not in a context that allows memory allocations to block or because the
   inode's size is smaller than 16MiB - in this case our inode has a size
   of 8KiB;

3) Task B needs to read extent B and ends up __do_readpage() through the
   btrfs_readpage() callback. At __do_readpage() it gets a reference to
   extent map B;

4) Task A, doing a fast fsync, calls clear_em_loggin() against extent map B
   while holding the write lock on the inode's extent map tree - this
   results in try_merge_map() being called and since it's possible to merge
   extent map B with extent map A now (the extent map B was removed from
   the list of modified extents), the merging begins - it sets extent map
   B's start offset to 0 (was 4KiB), but before it increments the map's
   length to 8KiB (4kb + 4KiB), task A is at:

   BUG_ON(extent_map_end(em) &lt;= cur);

   The call to extent_map_end() sees the extent map has a start of 0
   and a length still at 4KiB, so it returns 4KiB and 'cur' is 4KiB, so
   the BUG_ON() is triggered.

So it's dangerous to modify an extent map that is in the tree, because some
other task might have got a reference to it before and still using it, and
needs to see a consistent map while using it. Generally this is very rare
since most paths that lookup and use extent maps also have the file range
locked in the inode's iotree. The fsync path is pretty much the only
exception where we don't do it to avoid serialization with concurrent
reads.

Fix this by not allowing an extent map do be merged if if it's being used
by tasks other then the one attempting to merge the extent map (when the
reference count of the extent map is greater than 2).

Reported-by: ryusuke1925 &lt;st13s20@gm.ibaraki-ct.ac.jp&gt;
Reported-by: Koki Mitani &lt;koki.mitani.xg@hco.ntt.co.jp&gt;
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206211
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik &lt;josef@toxicpanda.com&gt;
Signed-off-by: Filipe Manana &lt;fdmanana@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
<entry>
<title>btrfs: ref-verify: fix memory leaks</title>
<updated>2020-02-12T16:16:31+00:00</updated>
<author>
<name>Wenwen Wang</name>
<email>wenwen@cs.uga.edu</email>
</author>
<published>2020-02-01T20:38:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/blackbird-op-linux/commit/?id=f311ade3a7adf31658ed882aaab9f9879fdccef7'/>
<id>urn:sha1:f311ade3a7adf31658ed882aaab9f9879fdccef7</id>
<content type='text'>
In btrfs_ref_tree_mod(), 'ref' and 'ra' are allocated through kzalloc() and
kmalloc(), respectively. In the following code, if an error occurs, the
execution will be redirected to 'out' or 'out_unlock' and the function will
be exited. However, on some of the paths, 'ref' and 'ra' are not
deallocated, leading to memory leaks. For example, if 'action' is
BTRFS_ADD_DELAYED_EXTENT, add_block_entry() will be invoked. If the return
value indicates an error, the execution will be redirected to 'out'. But,
'ref' is not deallocated on this path, causing a memory leak.

To fix the above issues, deallocate both 'ref' and 'ra' before exiting from
the function when an error is encountered.

CC: stable@vger.kernel.org # 4.15+
Signed-off-by: Wenwen Wang &lt;wenwen@cs.uga.edu&gt;
Reviewed-by: David Sterba &lt;dsterba@suse.com&gt;
Signed-off-by: David Sterba &lt;dsterba@suse.com&gt;
</content>
</entry>
</feed>
