<feed xmlns='http://www.w3.org/2005/Atom'>
<title>talos-op-linux/mm, branch v4.4.4</title>
<subtitle>Talos™ II Linux sources for OpenPOWER</subtitle>
<id>https://git.raptorcs.com/git/talos-op-linux/atom?h=v4.4.4</id>
<link rel='self' href='https://git.raptorcs.com/git/talos-op-linux/atom?h=v4.4.4'/>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/talos-op-linux/'/>
<updated>2016-03-03T23:07:23+00:00</updated>
<entry>
<title>make sure that freeing shmem fast symlinks is RCU-delayed</title>
<updated>2016-03-03T23:07:23+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2016-01-22T23:08:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/talos-op-linux/commit/?id=bcb1875a069043c70af27dc9f0f5e075a09d76b1'/>
<id>urn:sha1:bcb1875a069043c70af27dc9f0f5e075a09d76b1</id>
<content type='text'>
commit 3ed47db34f480df7caf44436e3e63e555351ae9a upstream.

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>virtio_balloon: fix race between migration and ballooning</title>
<updated>2016-03-03T23:07:18+00:00</updated>
<author>
<name>Minchan Kim</name>
<email>minchan@kernel.org</email>
</author>
<published>2015-12-27T23:35:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/talos-op-linux/commit/?id=afb02993539468a15700481b54968edc10940b0d'/>
<id>urn:sha1:afb02993539468a15700481b54968edc10940b0d</id>
<content type='text'>
commit 21ea9fb69e7c4b1b1559c3e410943d3ff248ffcb upstream.

In balloon_page_dequeue, pages_lock should cover the loop
(ie, list_for_each_entry_safe). Otherwise, the cursor page could
be isolated by compaction and then list_del by isolation could
poison the page-&gt;lru.{prev,next} so the loop finally could
access wrong address like this. This patch fixes the bug.

general protection fault: 0000 [#1] SMP
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 2 PID: 82 Comm: vballoon Not tainted 4.4.0-rc5-mm1-access_bit+ #1906
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff8800a7ff0000 ti: ffff8800a7fec000 task.ti: ffff8800a7fec000
RIP: 0010:[&lt;ffffffff8115e754&gt;]  [&lt;ffffffff8115e754&gt;] balloon_page_dequeue+0x54/0x130
RSP: 0018:ffff8800a7fefdc0  EFLAGS: 00010246
RAX: ffff88013fff9a70 RBX: ffffea000056fe00 RCX: 0000000000002b7d
RDX: ffff88013fff9a70 RSI: ffffea000056fe00 RDI: ffff88013fff9a68
RBP: ffff8800a7fefde8 R08: ffffea000056fda0 R09: 0000000000000000
R10: ffff8800a7fefd90 R11: 0000000000000001 R12: dead0000000000e0
R13: ffffea000056fe20 R14: ffff880138809070 R15: ffff880138809060
FS:  0000000000000000(0000) GS:ffff88013fc40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f229c10e000 CR3: 00000000b8b53000 CR4: 00000000000006a0
Stack:
 0000000000000100 ffff880138809088 ffff880138809000 ffff880138809060
 0000000000000046 ffff8800a7fefe28 ffffffff812c86d3 ffff880138809020
 ffff880138809000 fffffffffff91900 0000000000000100 ffff880138809060
Call Trace:
 [&lt;ffffffff812c86d3&gt;] leak_balloon+0x93/0x1a0
 [&lt;ffffffff812c8bc7&gt;] balloon+0x217/0x2a0
 [&lt;ffffffff8143739e&gt;] ? __schedule+0x31e/0x8b0
 [&lt;ffffffff81078160&gt;] ? abort_exclusive_wait+0xb0/0xb0
 [&lt;ffffffff812c89b0&gt;] ? update_balloon_stats+0xf0/0xf0
 [&lt;ffffffff8105b6e9&gt;] kthread+0xc9/0xe0
 [&lt;ffffffff8105b620&gt;] ? kthread_park+0x60/0x60
 [&lt;ffffffff8143b4af&gt;] ret_from_fork+0x3f/0x70
 [&lt;ffffffff8105b620&gt;] ? kthread_park+0x60/0x60
Code: 8d 60 e0 0f 84 af 00 00 00 48 8b 43 20 a8 01 75 3b 48 89 d8 f0 0f ba 28 00 72 10 48 8b 03 f6 c4 08 75 2f 48 89 df e8 8c 83 f9 ff &lt;49&gt; 8b 44 24 20 4d 8d 6c 24 20 48 83 e8 20 4d 39 f5 74 7a 4c 89
RIP  [&lt;ffffffff8115e754&gt;] balloon_page_dequeue+0x54/0x130
 RSP &lt;ffff8800a7fefdc0&gt;
---[ end trace 43cf28060d708d5f ]---
Kernel panic - not syncing: Fatal exception
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled

Signed-off-by: Minchan Kim &lt;minchan@kernel.org&gt;
Signed-off-by: Michael S. Tsirkin &lt;mst@redhat.com&gt;
Acked-by: Rafael Aquini &lt;aquini@redhat.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm: numa: quickly fail allocations for NUMA balancing on full nodes</title>
<updated>2016-03-03T23:07:11+00:00</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@techsingularity.net</email>
</author>
<published>2016-02-26T23:19:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/talos-op-linux/commit/?id=18b75e0bdc6c1413a44db3d37b6d1d82d02eb84b'/>
<id>urn:sha1:18b75e0bdc6c1413a44db3d37b6d1d82d02eb84b</id>
<content type='text'>
commit 8479eba7781fa9ffb28268840de6facfc12c35a7 upstream.

Commit 4167e9b2cf10 ("mm: remove GFP_THISNODE") removed the GFP_THISNODE
flag combination due to confusing semantics.  It noted that
alloc_misplaced_dst_page() was one such user after changes made by
commit e97ca8e5b864 ("mm: fix GFP_THISNODE callers and clarify").

Unfortunately when GFP_THISNODE was removed, users of
alloc_misplaced_dst_page() started waking kswapd and entering direct
reclaim because the wrong GFP flags are cleared.  The consequence is
that workloads that used to fit into memory now get reclaimed which is
addressed by this patch.

The problem can be demonstrated with "mutilate" that exercises memcached
which is software dedicated to memory object caching.  The configuration
uses 80% of memory and is run 3 times for varying numbers of clients.
The results on a 4-socket NUMA box are

mutilate
                            4.4.0                 4.4.0
                          vanilla           numaswap-v1
Hmean    1      8394.71 (  0.00%)     8395.32 (  0.01%)
Hmean    4     30024.62 (  0.00%)    34513.54 ( 14.95%)
Hmean    7     32821.08 (  0.00%)    70542.96 (114.93%)
Hmean    12    55229.67 (  0.00%)    93866.34 ( 69.96%)
Hmean    21    39438.96 (  0.00%)    85749.21 (117.42%)
Hmean    30    37796.10 (  0.00%)    50231.49 ( 32.90%)
Hmean    47    18070.91 (  0.00%)    38530.13 (113.22%)

The metric is queries/second with the more the better.  The results are
way outside of the noise and the reason for the improvement is obvious
from some of the vmstats

                                 4.4.0       4.4.0
                               vanillanumaswap-v1r1
Minor Faults                1929399272  2146148218
Major Faults                  19746529        3567
Swap Ins                      57307366        9913
Swap Outs                     50623229       17094
Allocation stalls                35909         443
DMA allocs                           0           0
DMA32 allocs                  72976349   170567396
Normal allocs               5306640898  5310651252
Movable allocs                       0           0
Direct pages scanned         404130893      799577
Kswapd pages scanned         160230174           0
Kswapd pages reclaimed        55928786           0
Direct pages reclaimed         1843936       41921
Page writes file                  2391           0
Page writes anon              50623229       17094

The vanilla kernel is swapping like crazy with large amounts of direct
reclaim and kswapd activity.  The figures are aggregate but it's known
that the bad activity is throughout the entire test.

Note that simple streaming anon/file memory consumers also see this
problem but it's not as obvious.  In those cases, kswapd is awake when
it should not be.

As there are at least two reclaim-related bugs out there, it's worth
spelling out the user-visible impact.  This patch only addresses bugs
related to excessive reclaim on NUMA hardware when the working set is
larger than a NUMA node.  There is a bug related to high kswapd CPU
usage but the reports are against laptops and other UMA hardware and is
not addressed by this patch.

Signed-off-by: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEED</title>
<updated>2016-03-03T23:07:10+00:00</updated>
<author>
<name>Andrea Arcangeli</name>
<email>aarcange@redhat.com</email>
</author>
<published>2016-02-26T23:19:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/talos-op-linux/commit/?id=915d02457e74344bcd99fe64b159de2f6074b2c6'/>
<id>urn:sha1:915d02457e74344bcd99fe64b159de2f6074b2c6</id>
<content type='text'>
commit ad33bb04b2a6cee6c1f99fabb15cddbf93ff0433 upstream.

pmd_trans_unstable()/pmd_none_or_trans_huge_or_clear_bad() were
introduced to locklessy (but atomically) detect when a pmd is a regular
(stable) pmd or when the pmd is unstable and can infinitely transition
from pmd_none() and pmd_trans_huge() from under us, while only holding
the mmap_sem for reading (for writing not).

While holding the mmap_sem only for reading, MADV_DONTNEED can run from
under us and so before we can assume the pmd to be a regular stable pmd
we need to compare it against pmd_none() and pmd_trans_huge() in an
atomic way, with pmd_trans_unstable().  The old pmd_trans_huge() left a
tiny window for a race.

Useful applications are unlikely to notice the difference as doing
MADV_DONTNEED concurrently with a page fault would lead to undefined
behavior.

[akpm@linux-foundation.org: tidy up comment grammar/layout]
Signed-off-by: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Reported-by: Kirill A. Shutemov &lt;kirill.shutemov@linux.intel.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm,thp: khugepaged: call pte flush at the time of collapse</title>
<updated>2016-02-25T20:01:23+00:00</updated>
<author>
<name>Vineet Gupta</name>
<email>Vineet.Gupta1@synopsys.com</email>
</author>
<published>2016-02-12T00:13:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/talos-op-linux/commit/?id=edfde263bd8a70a0b93b11996fd2a77526ff3b80'/>
<id>urn:sha1:edfde263bd8a70a0b93b11996fd2a77526ff3b80</id>
<content type='text'>
commit 6a6ac72fd6ea32594b316513e1826c3f6db4cc93 upstream.

This showed up on ARC when running LMBench bw_mem tests as Overlapping
TLB Machine Check Exception triggered due to STLB entry (2M pages)
overlapping some NTLB entry (regular 8K page).

bw_mem 2m touches a large chunk of vaddr creating NTLB entries.  In the
interim khugepaged kicks in, collapsing the contiguous ptes into a
single pmd.  pmdp_collapse_flush()-&gt;flush_pmd_tlb_range() is called to
flush out NTLB entries for the ptes.  This for ARC (by design) can only
shootdown STLB entries (for pmd).  The stray NTLB entries cause the
overlap with the subsequent STLB entry for collapsed page.  So make
pmdp_collapse_flush() call pte flush interface not pmd flush.

Note that originally all thp flush call sites in generic code called
flush_tlb_range() leaving it to architecture to implement the flush for
pte and/or pmd.  Commit 12ebc1581ad11454 changed this by calling a new
opt-in API flush_pmd_tlb_range() which made the semantics more explicit
but failed to distinguish the pte vs pmd flush in generic code, which is
what this patch fixes.

Note that ARC can fixed w/o touching the generic pmdp_collapse_flush()
by defining a ARC version, but that defeats the purpose of generic
version, plus sementically this is the right thing to do.

Fixes STAR 9000961194: LMBench on AXS103 triggering duplicate TLB
exceptions with super pages

Fixes: 12ebc1581ad11454 ("mm,thp: introduce flush_pmd_tlb_range")
Signed-off-by: Vineet Gupta &lt;vgupta@synopsys.com&gt;
Reviewed-by: Aneesh Kumar K.V &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Acked-by: Kirill A. Shutemov &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>memcg: only free spare array when readers are done</title>
<updated>2016-02-25T20:01:22+00:00</updated>
<author>
<name>Martijn Coenen</name>
<email>maco@google.com</email>
</author>
<published>2016-01-16T00:57:49+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/talos-op-linux/commit/?id=ececa3ebe27f61db28161bc8c6ef2e1612dbafbe'/>
<id>urn:sha1:ececa3ebe27f61db28161bc8c6ef2e1612dbafbe</id>
<content type='text'>
commit 6611d8d76132f86faa501de9451a89bf23fb2371 upstream.

A spare array holding mem cgroup threshold events is kept around to make
sure we can always safely deregister an event and have an array to store
the new set of events in.

In the scenario where we're going from 1 to 0 registered events, the
pointer to the primary array containing 1 event is copied to the spare
slot, and then the spare slot is freed because no events are left.
However, it is freed before calling synchronize_rcu(), which means
readers may still be accessing threshold-&gt;primary after it is freed.

Fixed by only freeing after synchronize_rcu().

Signed-off-by: Martijn Coenen &lt;maco@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Vladimir Davydov &lt;vdavydov@virtuozzo.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm: fix regression in remap_file_pages() emulation</title>
<updated>2016-02-25T20:01:21+00:00</updated>
<author>
<name>Kirill A. Shutemov</name>
<email>kirill.shutemov@linux.intel.com</email>
</author>
<published>2016-02-17T21:11:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/talos-op-linux/commit/?id=d1f8217a9a6e9b84df4cf6a0d3c99c7bbe1430a7'/>
<id>urn:sha1:d1f8217a9a6e9b84df4cf6a0d3c99c7bbe1430a7</id>
<content type='text'>
commit 48f7df329474b49d83d0dffec1b6186647f11976 upstream.

Grazvydas Ignotas has reported a regression in remap_file_pages()
emulation.

Testcase:
	#define _GNU_SOURCE
	#include &lt;assert.h&gt;
	#include &lt;stdlib.h&gt;
	#include &lt;stdio.h&gt;
	#include &lt;sys/mman.h&gt;

	#define SIZE    (4096 * 3)

	int main(int argc, char **argv)
	{
		unsigned long *p;
		long i;

		p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
				MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return -1;
		}

		for (i = 0; i &lt; SIZE / 4096; i++)
			p[i * 4096 / sizeof(*p)] = i;

		if (remap_file_pages(p, 4096, 0, 1, 0)) {
			perror("remap_file_pages");
			return -1;
		}

		if (remap_file_pages(p, 4096 * 2, 0, 1, 0)) {
			perror("remap_file_pages");
			return -1;
		}

		assert(p[0] == 1);

		munmap(p, SIZE);

		return 0;
	}

The second remap_file_pages() fails with -EINVAL.

The reason is that remap_file_pages() emulation assumes that the target
vma covers whole area we want to over map.  That assumption is broken by
first remap_file_pages() call: it split the area into two vma.

The solution is to check next adjacent vmas, if they map the same file
with the same flags.

Fixes: c8d78c1823f4 ("mm: replace remap_file_pages() syscall with emulation")
Signed-off-by: Kirill A. Shutemov &lt;kirill.shutemov@linux.intel.com&gt;
Reported-by: Grazvydas Ignotas &lt;notasas@gmail.com&gt;
Tested-by: Grazvydas Ignotas &lt;notasas@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm: replace vma_lock_anon_vma with anon_vma_lock_read/write</title>
<updated>2016-02-25T20:01:21+00:00</updated>
<author>
<name>Konstantin Khlebnikov</name>
<email>koct9i@gmail.com</email>
</author>
<published>2016-02-05T23:36:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/talos-op-linux/commit/?id=413aab16bc7b9b58fb1a52010eb3af3d227d7007'/>
<id>urn:sha1:413aab16bc7b9b58fb1a52010eb3af3d227d7007</id>
<content type='text'>
commit 12352d3cae2cebe18805a91fab34b534d7444231 upstream.

Sequence vma_lock_anon_vma() - vma_unlock_anon_vma() isn't safe if
anon_vma appeared between lock and unlock.  We have to check anon_vma
first or call anon_vma_prepare() to be sure that it's here.  There are
only few users of these legacy helpers.  Let's get rid of them.

This patch fixes anon_vma lock imbalance in validate_mm().  Write lock
isn't required here, read lock is enough.

And reorders expand_downwards/expand_upwards: security_mmap_addr() and
wrapping-around check don't have to be under anon vma lock.

Link: https://lkml.kernel.org/r/CACT4Y+Y908EjM2z=706dv4rV6dWtxTLK9nFg9_7DhRMLppBo2g@mail.gmail.com
Signed-off-by: Konstantin Khlebnikov &lt;koct9i@gmail.com&gt;
Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Acked-by: Kirill A. Shutemov &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm: fix mlock accouting</title>
<updated>2016-02-25T20:01:21+00:00</updated>
<author>
<name>Kirill A. Shutemov</name>
<email>kirill.shutemov@linux.intel.com</email>
</author>
<published>2016-01-22T00:40:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/talos-op-linux/commit/?id=918a2c388ed7828fa4b6ccb5cacd9e422f30938c'/>
<id>urn:sha1:918a2c388ed7828fa4b6ccb5cacd9e422f30938c</id>
<content type='text'>
commit 7162a1e87b3e380133dadc7909081bb70d0a7041 upstream.

Tetsuo Handa reported underflow of NR_MLOCK on munlock.

Testcase:

    #include &lt;stdio.h&gt;
    #include &lt;stdlib.h&gt;
    #include &lt;sys/mman.h&gt;

    #define BASE ((void *)0x400000000000)
    #define SIZE (1UL &lt;&lt; 21)

    int main(int argc, char *argv[])
    {
        void *addr;

        system("grep Mlocked /proc/meminfo");
        addr = mmap(BASE, SIZE, PROT_READ | PROT_WRITE,
                MAP_ANONYMOUS | MAP_PRIVATE | MAP_LOCKED | MAP_FIXED,
                -1, 0);
        if (addr == MAP_FAILED)
            printf("mmap() failed\n"), exit(1);
        munmap(addr, SIZE);
        system("grep Mlocked /proc/meminfo");
        return 0;
    }

It happens on munlock_vma_page() due to unfortunate choice of nr_pages
data type:

    __mod_zone_page_state(zone, NR_MLOCK, -nr_pages);

For unsigned int nr_pages, implicitly casted to long in
__mod_zone_page_state(), it becomes something around UINT_MAX.

munlock_vma_page() usually called for THP as small pages go though
pagevec.

Let's make nr_pages signed int.

Similar fixes in 6cdb18ad98a4 ("mm/vmstat: fix overflow in
mod_zone_page_state()") used `long' type, but `int' here is OK for a
count of the number of sub-pages in a huge page.

Fixes: ff6a6da60b89 ("mm: accelerate munlock() treatment of THP pages")
Signed-off-by: Kirill A. Shutemov &lt;kirill.shutemov@linux.intel.com&gt;
Reported-by: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
Tested-by: Tetsuo Handa &lt;penguin-kernel@I-love.SAKURA.ne.jp&gt;
Cc: Michel Lespinasse &lt;walken@google.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm: soft-offline: check return value in second __get_any_page() call</title>
<updated>2016-02-25T20:01:21+00:00</updated>
<author>
<name>Naoya Horiguchi</name>
<email>n-horiguchi@ah.jp.nec.com</email>
</author>
<published>2016-01-16T00:54:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/talos-op-linux/commit/?id=bd55913cf20804d2c3d4d83e79d9867963af2ff1'/>
<id>urn:sha1:bd55913cf20804d2c3d4d83e79d9867963af2ff1</id>
<content type='text'>
commit d96b339f453997f2f08c52da3f41423be48c978f upstream.

I saw the following BUG_ON triggered in a testcase where a process calls
madvise(MADV_SOFT_OFFLINE) on thps, along with a background process that
calls migratepages command repeatedly (doing ping-pong among different
NUMA nodes) for the first process:

   Soft offlining page 0x60000 at 0x700000600000
   __get_any_page: 0x60000 free buddy page
   page:ffffea0001800000 count:0 mapcount:-127 mapping:          (null) index:0x1
   flags: 0x1fffc0000000000()
   page dumped because: VM_BUG_ON_PAGE(atomic_read(&amp;page-&gt;_count) == 0)
   ------------[ cut here ]------------
   kernel BUG at /src/linux-dev/include/linux/mm.h:342!
   invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
   Modules linked in: cfg80211 rfkill crc32c_intel serio_raw virtio_balloon i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi
   CPU: 3 PID: 3035 Comm: test_alloc_gene Tainted: G           O    4.4.0-rc8-v4.4-rc8-160107-1501-00000-rc8+ #74
   Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
   task: ffff88007c63d5c0 ti: ffff88007c210000 task.ti: ffff88007c210000
   RIP: 0010:[&lt;ffffffff8118998c&gt;]  [&lt;ffffffff8118998c&gt;] put_page+0x5c/0x60
   RSP: 0018:ffff88007c213e00  EFLAGS: 00010246
   Call Trace:
     put_hwpoison_page+0x4e/0x80
     soft_offline_page+0x501/0x520
     SyS_madvise+0x6bc/0x6f0
     entry_SYSCALL_64_fastpath+0x12/0x6a
   Code: 8b fc ff ff 5b 5d c3 48 89 df e8 b0 fa ff ff 48 89 df 31 f6 e8 c6 7d ff ff 5b 5d c3 48 c7 c6 08 54 a2 81 48 89 df e8 a4 c5 01 00 &lt;0f&gt; 0b 66 90 66 66 66 66 90 55 48 89 e5 41 55 41 54 53 48 8b 47
   RIP  [&lt;ffffffff8118998c&gt;] put_page+0x5c/0x60
    RSP &lt;ffff88007c213e00&gt;

The root cause resides in get_any_page() which retries to get a refcount
of the page to be soft-offlined.  This function calls
put_hwpoison_page(), expecting that the target page is putback to LRU
list.  But it can be also freed to buddy.  So the second check need to
care about such case.

Fixes: af8fae7c0886 ("mm/memory-failure.c: clean up soft_offline_page()")
Signed-off-by: Naoya Horiguchi &lt;n-horiguchi@ah.jp.nec.com&gt;
Cc: Sasha Levin &lt;sasha.levin@oracle.com&gt;
Cc: Aneesh Kumar K.V &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Jerome Marchand &lt;jmarchan@redhat.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Dave Hansen &lt;dave.hansen@intel.com&gt;
Cc: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Steve Capper &lt;steve.capper@linaro.org&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
</feed>
