author: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> 2019-06-27 13:08:35 -0300
committer: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> 2019-07-15 11:03:02 -0300
commit: da82c92f1150f66afabf78d2c85ef9ac18dc6d38
tree: c8f1c200188e0edc8391fa14d02eff60c6347adb /Documentation/cgroup-v1/memcg_test.rst
parent: 83bbf6e103544d65f17f4b2ccea1c6a51c0b0769
docs: cgroup-v1: add it to the admin-guide book

Those files belong to the admin guide, so add them.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Diffstat (limited to 'Documentation/cgroup-v1/memcg_test.rst')
-rw-r--r-- Documentation/cgroup-v1/memcg_test.rst 355
1 files changed, 0 insertions, 355 deletions
diff --git a/Documentation/cgroup-v1/memcg_test.rst b/Documentation/cgroup-v1/memcg_test.rst
deleted file mode 100644
index 91bd18c6a514..000000000000
--- a/Documentation/cgroup-v1/memcg_test.rst
+++ /dev/null
@@ -1,355 +0,0 @@
-=====================================================
-Memory Resource Controller(Memcg) Implementation Memo
-=====================================================
-
-Last Updated: 2010/2
-
-Base Kernel Version: based on 2.6.33-rc7-mm(candidate for 34).
-
-Because the VM is getting complex (one of the reasons is memcg...), memcg's
-behavior is complex too. This document describes memcg's internal behavior.
-Please note that implementation details may change.
-
-(*) Topics on the API should be in Documentation/cgroup-v1/memory.rst.
-
-0. How to record usage ?
-========================
-
- 2 objects are used.
-
- page_cgroup ....an object per page.
-
- Allocated at boot or memory hotplug. Freed at memory hot removal.
-
- swap_cgroup ... an entry per swp_entry.
-
- Allocated at swapon(). Freed at swapoff().
-
- The page_cgroup has a USED bit, and double counting against a page_cgroup
- never occurs. swap_cgroup is used only when a charged page is swapped out.
-
-1. Charge
-=========
-
- a page/swp_entry may be charged (usage += PAGE_SIZE) at
-
- mem_cgroup_try_charge()
-
-2. Uncharge
-===========
-
- a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
-
- mem_cgroup_uncharge()
- Called when a page's refcount goes down to 0.
-
- mem_cgroup_uncharge_swap()
- Called when swp_entry's refcnt goes down to 0. A charge against swap
- disappears.
-
-3. charge-commit-cancel
-=======================
-
- Memcg pages are charged in two steps:
-
- - mem_cgroup_try_charge()
- - mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()
-
- At try_charge(), there are no flags to say "this page is charged".
- At this point, usage += PAGE_SIZE.
-
- At commit(), the page is associated with the memcg.
-
- At cancel(), simply usage -= PAGE_SIZE.
-
-In the explanation below, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
-
-4. Anonymous
-============
-
- Anonymous page is newly allocated at
- - page fault into MAP_ANONYMOUS mapping.
- - Copy-On-Write.
-
- 4.1 Swap-in.
- At swap-in, the page is taken from swap-cache. There are 2 cases.
-
- (a) If the SwapCache is newly allocated and read, it has no charges.
- (b) If the SwapCache has been mapped by processes, it has been
- charged already.
-
- 4.2 Swap-out.
- At swap-out, typical state transition is below.
-
- (a) add to swap cache. (marked as SwapCache)
- swp_entry's refcnt += 1.
- (b) fully unmapped.
- swp_entry's refcnt += # of ptes.
- (c) write back to swap.
- (d) delete from swap cache. (remove from SwapCache)
- swp_entry's refcnt -= 1.
-
-
- Finally, at task exit,
- (e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
-
-5. Page Cache
-=============
-
- Page Cache is charged at
- - add_to_page_cache_locked().
-
- The logic is very clear. (About migration, see below)
-
- Note:
- __remove_from_page_cache() is called by remove_from_page_cache()
- and __remove_mapping().
-
-6. Shmem(tmpfs) Page Cache
-===========================
-
- The best way to understand shmem's page state transition is to read
- mm/shmem.c.
-
- But a brief explanation of memcg's behavior around shmem will
- help in understanding the logic.
-
- Shmem's page (just leaf page, not direct/indirect block) can be on
-
- - radix-tree of shmem's inode.
- - SwapCache.
- - Both on radix-tree and SwapCache. This happens at swap-in
- and swap-out,
-
- It's charged when...
-
- - A new page is added to shmem's radix-tree.
- - A swp page is read. (move a charge from swap_cgroup to page_cgroup)
-
-7. Page Migration
-=================
-
- mem_cgroup_migrate()
-
-8. LRU
-======
- Each memcg has its own private LRU. Currently, it is handled under the
- global VM's control (that is, under the global pgdat->lru_lock).
- Almost all routines around memcg's LRU are called by the global LRU's
- list management functions under pgdat->lru_lock.
-
- A special function is mem_cgroup_isolate_pages(). This scans
- memcg's private LRU and calls __isolate_lru_page() to extract a page
- from the LRU.
-
- (__isolate_lru_page() removes the page from both the global and the
- private LRU.)
-
-
-9. Typical Tests.
-=================
-
- Tests for racy cases.
-
-9.1 Small limit to memcg.
--------------------------
-
- When testing racy cases, it's a good idea to set memcg's limit
- to be very small rather than in GB. Many races were found in tests under
- xKB or xxMB limits.
-
- (Memory behavior under a GB limit and under an MB limit is very
- different.)
-
-9.2 Shmem
----------
-
- Historically, memcg's shmem handling was poor and we saw a fair amount
- of trouble here. This is because shmem is page cache but can also be
- SwapCache. Testing with shmem/tmpfs is always a good test.
-
-9.3 Migration
--------------
-
- For NUMA, migration is another special case. For easy testing, cpuset
- is useful. The following is a sample script to do migration::
-
- mount -t cgroup -o cpuset none /opt/cpuset
-
- mkdir /opt/cpuset/01
- echo 1 > /opt/cpuset/01/cpuset.cpus
- echo 0 > /opt/cpuset/01/cpuset.mems
- echo 1 > /opt/cpuset/01/cpuset.memory_migrate
- mkdir /opt/cpuset/02
- echo 1 > /opt/cpuset/02/cpuset.cpus
- echo 1 > /opt/cpuset/02/cpuset.mems
- echo 1 > /opt/cpuset/02/cpuset.memory_migrate
-
- In the above setup, when you move a task from 01 to 02, page migration
- from node 0 to node 1 will occur. The following is a script to migrate
- all tasks under a cpuset::
-
- --
- move_task()
- {
- for pid in $1
- do
- /bin/echo $pid >$2/tasks 2>/dev/null
- echo -n $pid
- echo -n " "
- done
- echo END
- }
-
- G1_TASK=`cat ${G1}/tasks`
- G2_TASK=`cat ${G2}/tasks`
- move_task "${G1_TASK}" ${G2} &
- --
-
-9.4 Memory hotplug
-------------------
-
- A memory hotplug test is also a good test.
-
- To offline memory, do the following::
-
- # echo offline > /sys/devices/system/memory/memoryXXX/state
-
- (XXX is the place of memory)
-
- This is an easy way to test page migration, too.
-
-9.5 mkdir/rmdir
----------------
-
- When using hierarchy, mkdir/rmdir test should be done.
- Use tests like the following::
-
- echo 1 >/opt/cgroup/01/memory/use_hierarchy
- mkdir /opt/cgroup/01/child_a
- mkdir /opt/cgroup/01/child_b
-
- set limit to 01.
- add limit to 01/child_b
- run jobs under child_a and child_b
-
- create/delete following groups at random while jobs are running::
-
- /opt/cgroup/01/child_a/child_aa
- /opt/cgroup/01/child_b/child_bb
- /opt/cgroup/01/child_c
-
- Running new jobs in a new group is also good.
-
-9.6 Mount with other subsystems
--------------------------------
-
- Mounting with other subsystems is a good test because there is a
- race and lock dependency with other cgroup subsystems.
-
- example::
-
- # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices
-
- Then do task moves, mkdir, rmdir, etc. under this.
-
-9.7 swapoff
------------
-
- Besides the fact that swap management is one of the complicated parts
- of memcg, the swap-in call path at swapoff differs from the usual swap-in
- path. It's worth testing explicitly.
-
- For example, a test like the following is good:
-
- (Shell-A)::
-
- # mount -t cgroup none /cgroup -o memory
- # mkdir /cgroup/test
- # echo 40M > /cgroup/test/memory.limit_in_bytes
- # echo 0 > /cgroup/test/tasks
-
- Run a malloc(100M) program under this. You'll see 60M of swap.
-
- (Shell-B)::
-
- # move all tasks in /cgroup/test to /cgroup
- # /sbin/swapoff -a
- # rmdir /cgroup/test
- # kill malloc task.
-
- Of course, tmpfs vs. swapoff should be tested, too.
-
-9.8 OOM-Killer
---------------
-
- Out-of-memory caused by memcg's limit will kill tasks under
- the memcg. When hierarchy is used, a task under hierarchy
- will be killed by the kernel.
-
- In this case, panic_on_oom shouldn't be invoked and tasks
- in other groups shouldn't be killed.
-
- It's not difficult to cause OOM under memcg, as follows.
-
- Case A) when you can swapoff::
-
- #swapoff -a
- #echo 50M > /memory.limit_in_bytes
-
- run 51M of malloc
-
- Case B) when you use mem+swap limitation::
-
- #echo 50M > memory.limit_in_bytes
- #echo 50M > memory.memsw.limit_in_bytes
-
- run 51M of malloc
-
-9.9 Move charges at task migration
-----------------------------------
-
- Charges associated with a task can be moved along with task migration.
-
- (Shell-A)::
-
- #mkdir /cgroup/A
- #echo $$ >/cgroup/A/tasks
-
- Run some programs which use some amount of memory in /cgroup/A.
-
- (Shell-B)::
-
- #mkdir /cgroup/B
- #echo 1 >/cgroup/B/memory.move_charge_at_immigrate
- #echo "pid of the program running in group A" >/cgroup/B/tasks
-
- You can see charges have been moved by reading ``*.usage_in_bytes`` or
- memory.stat of both A and B.
-
- See 8.2 of Documentation/cgroup-v1/memory.rst to see what value should
- be written to move_charge_at_immigrate.
-
-9.10 Memory thresholds
-----------------------
-
- The memory controller implements memory thresholds using the cgroups
- notification API. You can use tools/cgroup/cgroup_event_listener.c to test it.
-
- (Shell-A) Create cgroup and run event listener::
-
- # mkdir /cgroup/A
- # ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M
-
- (Shell-B) Add task to cgroup and try to allocate and free memory::
-
- # echo $$ >/cgroup/A/tasks
- # a="$(dd if=/dev/zero bs=1M count=10)"
- # a=
-
- You will see a message from cgroup_event_listener every time you cross
- the thresholds.
-
- Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds.
-
- It's a good idea to test the root cgroup as well.