<feed xmlns='http://www.w3.org/2005/Atom'>
<title>talos-obmc-linux/kernel/bpf/hashtab.c, branch dev-4.10</title>
<subtitle>Talos™ II Linux sources for OpenBMC</subtitle>
<id>https://git.raptorcs.com/git/talos-obmc-linux/atom?h=dev-4.10</id>
<link rel='self' href='https://git.raptorcs.com/git/talos-obmc-linux/atom?h=dev-4.10'/>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/talos-obmc-linux/'/>
<updated>2017-01-18T22:12:26+00:00</updated>
<entry>
<title>bpf: don't trigger OOM killer under pressure with map alloc</title>
<updated>2017-01-18T22:12:26+00:00</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2017-01-18T14:14:17+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/talos-obmc-linux/commit/?id=d407bd25a204bd66b7346dde24bd3d37ef0e0b05'/>
<id>urn:sha1:d407bd25a204bd66b7346dde24bd3d37ef0e0b05</id>
<content type='text'>
This patch adds two helpers, bpf_map_area_alloc() and bpf_map_area_free(),
that are to be used for map allocations. Using kmalloc() for very large
allocations can cause excessive work within the page allocator, so i) fall
back earlier to vmalloc() when the attempt is considered costly anyway,
and even more importantly ii) don't trigger OOM killer with any of the
allocators.

Since this is based on a user space request, for example, when creating
maps with element pre-allocation, we really want such requests to fail
instead of killing other user space processes.

Also, don't spam the kernel log with warnings should any of the allocations
fail under pressure. Given that, we can make backend selection in
bpf_map_area_alloc() generic, and convert all maps over to use this API
for spots with potentially large allocation requests.

Note, replacing the one kmalloc_array() call is fine, as overflow checks
happen earlier in htab_map_alloc(): they must also protect the multiplication
for the vmalloc() fallback should kmalloc_array() fail.
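
A minimal sketch of what such helpers can look like, assuming a costly-order
threshold for the kmalloc() attempt and __GFP_NOWARN/__GFP_NORETRY to keep
the allocators quiet; the exact flags and fallback path are illustrative
assumptions, not the literal diff:
  #include &lt;linux/mm.h&gt;
  #include &lt;linux/slab.h&gt;
  #include &lt;linux/vmalloc.h&gt;

  void *bpf_map_area_alloc(size_t size)
  {
      /* Try kmalloc() only for non-costly sizes, and never warn,
       * never retry hard, never wake the OOM killer.
       */
      if (size &lt;= (PAGE_SIZE &lt;&lt; PAGE_ALLOC_COSTLY_ORDER)) {
          void *area = kmalloc(size, GFP_USER | __GFP_NOWARN |
                                     __GFP_NORETRY);
          if (area)
              return area;
      }
      /* Costly or failed request: fall back to vmalloc(), again
       * without warning or invoking the OOM killer on failure.
       */
      return __vmalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
                       PAGE_KERNEL);
  }

  void bpf_map_area_free(void *area)
  {
      kvfree(area);  /* frees both kmalloc()'ed and vmalloc()'ed areas */
  }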

Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: do not use KMALLOC_SHIFT_MAX</title>
<updated>2017-01-11T02:31:54+00:00</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.com</email>
</author>
<published>2017-01-11T00:57:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/talos-obmc-linux/commit/?id=7984c27c2c5cd3298de8afdba3e1bd46f884e934'/>
<id>urn:sha1:7984c27c2c5cd3298de8afdba3e1bd46f884e934</id>
<content type='text'>
Commit 01b3f52157ff ("bpf: fix allocation warnings in bpf maps and
integer overflow") has added checks for the maximum allocateable size.
It (ab)used KMALLOC_SHIFT_MAX for that purpose.

While this is not incorrect, it is not very clean, because we already have
KMALLOC_MAX_SIZE for this very purpose, so let's change both checks to use
KMALLOC_MAX_SIZE instead.

The original motivation for using KMALLOC_SHIFT_MAX was to work around
an incorrect KMALLOC_MAX_SIZE which could lead to allocation warnings
but it is no longer needed since "slab: make sure that KMALLOC_MAX_SIZE
will fit into MAX_ORDER".
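
For illustration, the kind of bound check in htab_map_alloc() that this
affects looks roughly as follows (simplified sketch, not the literal diff):
  /* before: open-coded via the shift */
  if (htab-&gt;map.value_size &gt;= (1 &lt;&lt; (KMALLOC_SHIFT_MAX - 1)) -
      MAX_BPF_STACK - sizeof(struct htab_elem))
      goto free_htab;

  /* after: use the named limit directly */
  if (htab-&gt;map.value_size &gt;= KMALLOC_MAX_SIZE -
      MAX_BPF_STACK - sizeof(struct htab_elem))
      goto free_htab;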

Link: http://lkml.kernel.org/r/20161220130659.16461-3-mhocko@kernel.org
Signed-off-by: Michal Hocko &lt;mhocko@suse.com&gt;
Acked-by: Christoph Lameter &lt;cl@linux.com&gt;
Cc: Alexei Starovoitov &lt;ast@kernel.org&gt;
Cc: Andrey Konovalov &lt;andreyknvl@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>bpf: Add BPF_MAP_TYPE_LRU_PERCPU_HASH</title>
<updated>2016-11-15T16:50:20+00:00</updated>
<author>
<name>Martin KaFai Lau</name>
<email>kafai@fb.com</email>
</author>
<published>2016-11-11T18:55:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/talos-obmc-linux/commit/?id=8f8449384ec364ba2a654f11f94e754e4ff719e0'/>
<id>urn:sha1:8f8449384ec364ba2a654f11f94e754e4ff719e0</id>
<content type='text'>
Provide an LRU version of the existing BPF_MAP_TYPE_PERCPU_HASH.

Signed-off-by: Martin KaFai Lau &lt;kafai@fb.com&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: Add BPF_MAP_TYPE_LRU_HASH</title>
<updated>2016-11-15T16:50:20+00:00</updated>
<author>
<name>Martin KaFai Lau</name>
<email>kafai@fb.com</email>
</author>
<published>2016-11-11T18:55:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/talos-obmc-linux/commit/?id=29ba732acbeece1e34c68483d1ec1f3720fa1bb3'/>
<id>urn:sha1:29ba732acbeece1e34c68483d1ec1f3720fa1bb3</id>
<content type='text'>
Provide an LRU version of the existing BPF_MAP_TYPE_HASH.

Signed-off-by: Martin KaFai Lau &lt;kafai@fb.com&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: Refactor code handling percpu map</title>
<updated>2016-11-15T16:50:20+00:00</updated>
<author>
<name>Martin KaFai Lau</name>
<email>kafai@fb.com</email>
</author>
<published>2016-11-11T18:55:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/talos-obmc-linux/commit/?id=fd91de7b3c69a7f108b92521e1115df3e058af55'/>
<id>urn:sha1:fd91de7b3c69a7f108b92521e1115df3e058af55</id>
<content type='text'>
Refactor the code that populates the value
of an htab_elem in a BPF_MAP_TYPE_PERCPU_HASH
typed bpf_map.

Signed-off-by: Martin KaFai Lau &lt;kafai@fb.com&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: fix htab map destruction when extra reserve is in use</title>
<updated>2016-11-07T18:20:52+00:00</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2016-11-03T23:01:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/talos-obmc-linux/commit/?id=483bed2b0ddd12ec33fc9407e0c6e1088e77a97c'/>
<id>urn:sha1:483bed2b0ddd12ec33fc9407e0c6e1088e77a97c</id>
<content type='text'>
Commit a6ed3ea65d98 ("bpf: restore behavior of bpf_map_update_elem")
added an extra per-cpu reserve to the hash table map to restore the old
behaviour from pre-prealloc times. When non-prealloc is in use for a
map, the problem is that once an element from the extra reserve has been
linked into the hash table and the hash table is later destroyed because
its refcount drops to zero, htab_map_free() -&gt; delete_all_elements()
will walk the whole hash table and drop all elements via htab_elem_free().
The element from the extra reserve is thus first fed to the wrong backend
allocator and eventually freed twice.
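
One illustrative way to guard the teardown walk against this (a sketch only;
htab_is_extra_elem() is a hypothetical name for the extra-reserve
bookkeeping, not necessarily how the upstream fix spells it):
  static void delete_all_elements(struct bpf_htab *htab)
  {
      int i;

      for (i = 0; i &lt; htab-&gt;n_buckets; i++) {
          struct hlist_head *head = select_bucket(htab, i);
          struct hlist_node *n;
          struct htab_elem *l;

          hlist_for_each_entry_safe(l, n, head, hash_node) {
              hlist_del_rcu(&amp;l-&gt;hash_node);
              /* Elements handed out from the per-cpu extra reserve
               * are not owned by the regular backend allocator; skip
               * them here and free the reserve itself separately in
               * htab_map_free().
               */
              if (!htab_is_extra_elem(htab, l))
                  htab_elem_free(htab, l);
          }
      }
  }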

Fixes: a6ed3ea65d98 ("bpf: restore behavior of bpf_map_update_elem")
Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: restore behavior of bpf_map_update_elem</title>
<updated>2016-08-07T00:49:19+00:00</updated>
<author>
<name>Alexei Starovoitov</name>
<email>ast@fb.com</email>
</author>
<published>2016-08-05T21:01:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/talos-obmc-linux/commit/?id=a6ed3ea65d9868fdf9eff84e6fe4f666b8d14b02'/>
<id>urn:sha1:a6ed3ea65d9868fdf9eff84e6fe4f666b8d14b02</id>
<content type='text'>
The introduction of pre-allocated hash elements inadvertently broke
the behavior of bpf hash maps where users expected to call
bpf_map_update_elem() without considering that the map can be full.
Some programs do:
old_value = bpf_map_lookup_elem(map, key);
if (old_value) {
  ... prepare new_value on stack ...
  bpf_map_update_elem(map, key, new_value);
}
Before pre-alloc, update() for an existing element would work even
in the 'map full' condition. Restore this behavior.

The above program could have updated old_value in place (sketched below)
instead of calling update(), which would be faster, and most programs use
that approach, but sometimes the values are large and the programs use the
update() helper to do an atomic replacement of the element.
Note we cannot simply update the element's value in place like the percpu
hash map does; instead we have to allocate num_possible_cpus() extra
elements and use this extra reserve when the map is full.
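
For comparison, the in-place variant mentioned above looks roughly like this
from the bpf program's side (the value struct and its 'counter' field are
assumptions made up for the example):
struct val *old_value = bpf_map_lookup_elem(map, key);
if (old_value) {
  /* mutate the looked-up value directly through the returned pointer;
   * no stack copy and no update() call needed
   */
  __sync_fetch_and_add(&amp;old_value-&gt;counter, 1);
}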

Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements")
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: pre-allocate hash map elements</title>
<updated>2016-03-08T20:28:31+00:00</updated>
<author>
<name>Alexei Starovoitov</name>
<email>ast@fb.com</email>
</author>
<published>2016-03-08T05:57:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/talos-obmc-linux/commit/?id=6c90598174322b8888029e40dd84a4eb01f56afe'/>
<id>urn:sha1:6c90598174322b8888029e40dd84a4eb01f56afe</id>
<content type='text'>
If a kprobe is placed on spin_unlock, then calling kmalloc/kfree from
bpf programs is not safe, since the following deadlock is possible:
kfree-&gt;spin_lock(kmem_cache_node-&gt;lock)...spin_unlock-&gt;kprobe-&gt;
bpf_prog-&gt;map_update-&gt;kmalloc-&gt;spin_lock(of the same kmem_cache_node-&gt;lock).

The following solutions were considered, and some implemented, but
eventually discarded:
- kmem_cache_create for every map
- add recursion check to slow-path of slub
- use reserved memory in bpf_map_update for in_irq or in preempt_disabled
- kmalloc via irq_work

In the end, pre-allocation of all map elements turned out to be the simplest
solution, and since the user is charged upfront for all the memory, such
pre-allocation doesn't affect user-space-visible behavior.

Since it's impossible to tell whether a kprobe is triggered in a safe
location from the kmalloc point of view, use pre-allocation by default
and introduce a new BPF_F_NO_PREALLOC flag.
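
A minimal user-space sketch of opting out of pre-allocation when creating a
hash map; the raw bpf(2) syscall wrapper below is shown only for
illustration, real programs typically go through a library helper:
  #include &lt;linux/bpf.h&gt;
  #include &lt;string.h&gt;
  #include &lt;sys/syscall.h&gt;
  #include &lt;unistd.h&gt;

  int create_hash_map(unsigned int max_entries)
  {
      union bpf_attr attr;

      memset(&amp;attr, 0, sizeof(attr));
      attr.map_type    = BPF_MAP_TYPE_HASH;
      attr.key_size    = sizeof(int);
      attr.value_size  = sizeof(long);
      attr.max_entries = max_entries;
      attr.map_flags   = BPF_F_NO_PREALLOC; /* per-element kmalloc instead */

      /* BPF_MAP_CREATE returns a new map fd, or -1 with errno set */
      return syscall(__NR_bpf, BPF_MAP_CREATE, &amp;attr, sizeof(attr));
  }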

While testing per-cpu hash maps it was discovered
that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
fails to allocate memory even when 90% of it is free.
The pre-allocation of per-cpu hash elements solves this problem as well.

It turned out that bpf_map_update() quickly followed by
bpf_map_lookup()+bpf_map_delete() is a very common pattern used
in many of the iovisor/bcc/tools, so there is an additional benefit to
pre-allocation, since such use cases become much faster.

Since all hash map elements are now pre-allocated, we can remove the
atomic increment of htab-&gt;count and save a few more cycles.

Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
large malloc/free done by users who don't have sufficient limits.

Pre-allocation is done with vmalloc and alloc/free is done
via percpu_freelist. Here are performance numbers for different
pre-allocation algorithms that were implemented, but discarded
in favor of percpu_freelist:

1 cpu:
pcpu_ida	2.1M
pcpu_ida nolock	2.3M
bt		2.4M
kmalloc		1.8M
hlist+spinlock	2.3M
pcpu_freelist	2.6M

4 cpu:
pcpu_ida	1.5M
pcpu_ida nolock	1.8M
bt w/smp_align	1.7M
bt no/smp_align	1.1M
kmalloc		0.7M
hlist+spinlock	0.2M
pcpu_freelist	2.0M

8 cpu:
pcpu_ida	0.7M
bt w/smp_align	0.8M
kmalloc		0.4M
pcpu_freelist	1.5M

32 cpu:
kmalloc		0.13M
pcpu_freelist	0.49M

pcpu_ida nolock is a modified percpu_ida algorithm without
percpu_ida_cpu locks and without cross-cpu tag stealing.
It's faster than the existing percpu_ida, but not as fast as pcpu_freelist.

bt is a variant of block/blk-mq-tag.c simplified and customized for the
bpf use case. bt w/smp_align uses a cache line for every 'long'
(similar to blk-mq-tag). bt no/smp_align allocates the 'long'
bitmasks contiguously to save memory. It's comparable to percpu_ida
and in some cases faster, but slower than percpu_freelist.

hlist+spinlock is the simplest free list with a single spinlock. As
expected it has very bad scaling in SMP.

kmalloc is the existing implementation, which is still available via the
BPF_F_NO_PREALLOC flag. It's significantly slower on a single cpu, and in
the 8 cpu setup it's 3 times slower than pre-allocation with pcpu_freelist,
but it saves memory, so in cases where map-&gt;max_entries can be large and
the number of map updates/deletes per second is low, it may make sense
to use it.

Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: grab rcu read lock for bpf_percpu_hash_update</title>
<updated>2016-02-19T19:37:43+00:00</updated>
<author>
<name>Sasha Levin</name>
<email>sasha.levin@oracle.com</email>
</author>
<published>2016-02-19T18:53:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/talos-obmc-linux/commit/?id=6bbd9a05a1f9839873a9290b5b7c6fafde8447ba'/>
<id>urn:sha1:6bbd9a05a1f9839873a9290b5b7c6fafde8447ba</id>
<content type='text'>
bpf_percpu_hash_update() expects the rcu read lock to be held and warns if
it's not, which pointed out a missing rcu read lock.
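
Conceptually, the fix amounts to wrapping the call in an rcu read-side
critical section on the calling side (sketch, not the literal diff):
  rcu_read_lock();
  err = bpf_percpu_hash_update(map, key, value, flags);
  rcu_read_unlock();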

Fixes: 15a07b338 ("bpf: add lookup/update support for per-cpu hash and array maps")
Signed-off-by: Sasha Levin &lt;sasha.levin@oracle.com&gt;
Acked-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Acked-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
<entry>
<title>bpf: add lookup/update support for per-cpu hash and array maps</title>
<updated>2016-02-06T08:34:36+00:00</updated>
<author>
<name>Alexei Starovoitov</name>
<email>ast@fb.com</email>
</author>
<published>2016-02-02T06:39:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.raptorcs.com/git/talos-obmc-linux/commit/?id=15a07b33814d14ca817887dbea8530728dc0fbe4'/>
<id>urn:sha1:15a07b33814d14ca817887dbea8530728dc0fbe4</id>
<content type='text'>
The functions bpf_map_lookup_elem(map, key, value) and
bpf_map_update_elem(map, key, value, flags) need to get/set
values from all cpus for per-cpu hash and array maps,
so that user space can aggregate/update them as necessary.

Example of single counter aggregation in user space:
  unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
  long values[nr_cpus];
  long value = 0;

  bpf_lookup_elem(fd, key, values);
  for (i = 0; i &lt; nr_cpus; i++)
    value += values[i];

User space must provide a round_up(value_size, 8) * nr_cpus byte
array to get/set the values, since the kernel copies the per-cpu values
a 'long' at a time to try to copy good counters atomically.
It's best-effort, since bpf programs and user space are racing
to access the same memory.
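
For an arbitrary value_size, the user-space buffer would therefore be sized
along these lines (sketch; the rounding is spelled out since user space has
no kernel round_up() helper):
  unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
  size_t stride = (value_size + 7) &amp; ~(size_t)7;  /* round_up(value_size, 8) */
  void *values = malloc(stride * nr_cpus);        /* one slot per cpu */

  bpf_lookup_elem(fd, key, values);               /* kernel fills every slot */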

Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
</entry>
</feed>
