OOM

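Code structure

The core structure is oom_control, which is supplied by the caller.

The main entry point is out_of_memory, which is called from three places.

There is also this path:

Details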

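  1. Why can a cpuset cause an OOM?
    struct oom_control {
     /* Used to determine cpuset */
     struct zonelist *zonelist;
     /* ... */
    };

    Because the memory nodes that the process is allowed to use have run out of memory.

  2. What does the reaper do?

In oom_kill_process, the memory held by processes that have already been killed is freed directly[^1].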

pagefault

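Every architecture needs to wire in pagefault_out_of_memory.

The comments did not help me understand it; it seems to be just the kernel's ...

Two sysrq mechanisms

These can be used as a debugging aid.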

static const struct sysrq_key_op sysrq_showmem_op = {
	.handler	= sysrq_handle_showmem,
	.help_msg	= "show-memory-usage(m)",
	.action_msg	= "Show Memory",
	.enable_mask	= SYSRQ_ENABLE_DUMP,
};

static const struct sysrq_key_op sysrq_moom_op = {
	.handler	= sysrq_handle_moom,
	.help_msg	= "memory-full-oom-kill(f)",
	.action_msg	= "Manual OOM execution",
	.enable_mask	= SYSRQ_ENABLE_SIGNAL,
};

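Why does an OOM take so long?

Because when vm.oom_kill_allocating_task is 0, the kernel has to scan every process, and that scan is very slow. But then why is an OOM inside a cgroup still very fast?

echo f > /proc/sysrq-trigger completes instantly.

The key seems to be the call into:

Possibly it is caused by too many processes in the cgroup.

Or by too many huge pages?

TODO: write a program that dumps the current stack.

First we need to see the call path into out_of_memory.

out_of_memory

Everything recorded in __alloc_pages_slowpath needs to be printed.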

What a single OOM report tells us

static void dump_header(struct oom_control *oc, struct task_struct *p)
[  112.939900] Out of memory: Killed process 5515 (a.out) total-vm:22022692kB, anon-rss:10875852kB, file-rss:1336kB, shmem-rss:0kB, UID:0 pgtables:41296kB oom_score_adj:0
[  113.314803] oom_reaper: reaped process 5515 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  130.772276] a.out invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[  130.773033] CPU: 17 PID: 5532 Comm: a.out Kdump: loaded Not tainted 5.10.0-60.18.0.50.oe2203.x86_64 #1
[  130.773692] Hardware name: Martins3 Inc Hacking Alpine, BIOS 12 2022-2-2
[  130.774184] Call Trace:
[  130.774403]  dump_stack+0x57/0x6a
[  130.774676]  dump_header+0x4a/0x1f0
[  130.774954]  oom_kill_process.cold+0xb/0x10
[  130.775276]  out_of_memory+0x100/0x310
[  130.775569]  __alloc_pages+0xe78/0xf50
[  130.775866]  pagecache_get_page+0x1cc/0x380
[  130.776188]  filemap_fault+0x2f1/0x510
[  130.776494]  ext4_filemap_fault+0x2d/0x40 [ext4]
[  130.776850]  __do_fault+0x38/0x110
[  130.777122]  do_read_fault+0x31/0xc0
[  130.777405]  do_fault+0x71/0x150
[  130.777667]  __handle_mm_fault+0x3dd/0x6d0
[  130.777983]  ? _copy_from_user+0x3c/0x80
[  130.778287]  handle_mm_fault+0xbe/0x290
[  130.778586]  exc_page_fault+0x273/0x550
[  130.778887]  ? asm_exc_page_fault+0x8/0x30
[  130.779546]  asm_exc_page_fault+0x1e/0x30
[  130.779872] RIP: 0033:0x4011cb
[  130.780126] Code: Unable to access opcode bytes at RIP 0x4011a1.
[  130.780557] RSP: 002b:00007ffdb61e5780 EFLAGS: 00010206
[  130.780944] RAX: 00007f48aad81000 RBX: 0000000000000000 RCX: 00007f4b41007887
[  130.781455] RDX: 00007f4880f08000 RSI: 0000000000000001 RDI: 00007f4b410fd570
[  130.781969] RBP: 00007ffdb61e57a0 R08: 00007f4b410fd570 R09: 000000000000000b
[  130.782482] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdb61e58b8
[  130.782997] R13: 0000000000401142 R14: 0000000000403e08 R15: 00007f4b41150000
[  130.783519] Mem-Info:
[  130.783726] active_anon:285 inactive_anon:2820935 isolated_anon:0
                active_file:28 inactive_file:59 isolated_file:0
                unevictable:2873 dirty:29 writeback:0
                slab_reclaimable:8117 slab_unreclaimable:18581
                mapped:2936 shmem:2213 pagetables:6126 bounce:0
                free:40346 free_pcp:62 free_cma:0
[  130.786048] Node 0 active_anon:1140kB inactive_anon:11283740kB active_file:112kB inactive_file:236kB unevictable:11492kB isolated(anon):0kB isolated(file):0kB mapped:11744kB dirty:116kB writeback:0kB shmem:8852kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 10727424kB writeback_tmp:0kB kernel_stack:6896kB all_unreclaimable? yes
[  130.787958] Node 0 DMA free:13296kB min:144kB low:180kB high:216kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  130.789699] lowmem_reserve[]: 0 2412 11403 11403 11403
[  130.790083] Node 0 DMA32 free:59900kB min:24044kB low:30052kB high:36060kB reserved_highatomic:0KB active_anon:0kB inactive_anon:2439692kB active_file:172kB inactive_file:84kB unevictable:0kB writepending:0kB present:3129204kB managed:2503388kB mlocked:0kB pagetables:80kB bounce:0kB free_pcp:248kB local_pcp:248kB free_cma:0kB
[  130.791976] lowmem_reserve[]: 0 0 8991 8991 8991
[  130.792325] Node 0 Normal free:88188kB min:88444kB low:110552kB high:132660kB reserved_highatomic:0KB active_anon:1140kB inactive_anon:8843776kB active_file:268kB inactive_file:68kB unevictable:11492kB writepending:116kB present:9437184kB managed:9207308kB mlocked:11492kB pagetables:24424kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  130.794295] lowmem_reserve[]: 0 0 0 0 0
[  130.794597] Node 0 DMA: 0*4kB 0*8kB 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (U) 0*1024kB 2*2048kB (UM) 2*4096kB (M) = 13296kB
[  130.795862] Node 0 DMA32: 14*4kB (UM) 46*8kB (U) 50*16kB (UME) 40*32kB (UE) 31*64kB (UE) 25*128kB (UME) 17*256kB (UME) 17*512kB (UM) 39*1024kB (UME) 0*2048kB 0*4096kB = 60680kB
[  130.796931] Node 0 Normal: 1371*4kB (UME) 343*8kB (UE) 600*16kB (UE) 394*32kB (UME) 222*64kB (UME) 153*128kB (UME) 60*256kB (UE) 19*512kB (UME) 3*1024kB (U) 0*2048kB 0*4096kB = 92388kB
[  130.798040] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  130.798654] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  130.799250] 4778 total pagecache pages
[  130.799543] 0 pages in swap cache
[  130.799816] Swap cache stats: add 2567968, delete 2567967, find 5630/8293
[  130.800313] Free swap  = 0kB
[  130.800553] Total swap = 0kB
[  130.800796] 3145595 pages RAM
[  130.801042] 0 pages HighMem/MovableOnly
[  130.801341] 214081 pages reserved
[  130.801608] 0 pages hwpoisoned
[  130.801864] Tasks state (memory values in pages):
[  130.802219] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[  130.802859] [    784]     0   784     6662     1221    77824        0          -250 systemd-journal
[  130.803502] [    818]     0   818     7430     1714    77824        0         -1000 systemd-udevd
[  130.804128] [    964]    32   964     2722     1139    61440        0             0 rpcbind
[  130.804724] [   1007]     0  1007      812       27    45056        0             0 mdadm
[  130.805303] [   1022]     0  1022     4442      290    49152        0         -1000 auditd
[  130.805892] [   1128]    81  1128     2187      770    53248        0          -900 dbus-daemon
[  130.806504] [   1133]   997  1133      678      450    45056        0             0 lsmd
[  130.807084] [   1137]     0  1137    19916       87    57344        0          -500 irqbalance
[  130.807694] [   1141]   987  1141    58815      991    90112        0             0 polkitd
[  130.808292] [   1149]     0  1149    77789     1292   110592        0             0 rngd
[  130.808873] [   1154]   984  1154    19541      410    61440        0             0 chronyd
[  130.809464] [   1157]     0  1157     4193     1177    69632        0             0 systemd-logind
[  130.810097] [   1161]     0  1161     3573      739    65536        0             0 systemd-machine
[  130.810735] [   1163]     0  1163     2786      939    65536        0             0 restorecond
[  130.811621] [   1204]     0  1204    36925     7350   176128        0             0 firewalld
[  130.812293] [   1209]     0  1209    86700     1758   151552        0             0 NetworkManager
[  130.812927] [   1228]     0  1228     3476     1486    69632        0         -1000 sshd
[  130.813502] [   1232]     0  1232    52252     5367   151552        0             0 targetclid
[  130.814114] [   1234]     0  1234    69191     3357   143360        0             0 tuned
[  130.814696] [   1235]     0  1235    17609      578    90112        0             0 gssproxy
[  130.815292] [   1513]     0  1513     2256      922    53248        0             0 dhclient
[  130.815890] [   1531]     0  1531     2890     2819    61440        0           -17 iscsid
[  130.816475] [   1534]     0  1534    57333     1142    90112       34             0 rsyslogd
[  130.817073] [   1574]     0  1574      947      526    45056        0             0 atd
[  130.817641] [   1576]     0  1576     5845      686    69632        0             0 crond
[  130.818219] [   1582]     0  1582     5435      416    53248        0             0 agetty
[  130.818803] [   1584]     0  1584     5341      472    57344        0             0 agetty
[  130.819387] [   2033]   992  2033     2188      275    57344        0             0 dnsmasq
[  130.819979] [   2034]     0  2034     2181       90    57344        0             0 dnsmasq
[  130.820565] [   5259]     0  5259     3970     1603    73728        0             0 sshd
[  130.821139] [   5264]     0  5264     4637     1624    77824        0             0 systemd
[  130.821729] [   5267]     0  5267     6025     1512    90112        0             0 (sd-pam)
[  130.822323] [   5273]     0  5273    21817      597    73728        0             0 gcr-ssh-agent
[  130.822948] [   5274]     0  5274     3934      891    73728        0             0 sshd
[  130.823521] [   5276]     0  5276     6879     1214    69632        0             0 zsh
[  130.824089] [   5406]     0  5406     3970     1597    69632        0             0 sshd
[  130.824663] [   5409]     0  5409     3934      908    69632       30             0 sshd
[  130.825237] [   5410]     0  5410     6915     1253    69632        0             0 zsh
[  130.825810] [   5532]     0  5532  2884233  2793370 22437888        0             0 a.out
[  130.826389] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-3.scope,task=a.out,pid=5532,uid=0
[  130.827731] Out of memory: Killed process 5532 (a.out) total-vm:11536932kB, anon-rss:11172224kB, file-rss:1256kB, shmem-rss:0kB, UID:0 pgtables:21912kB oom_score_adj:0
[  130.843472] oom_reaper: reaped process 5532 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
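
The report above can be cross-checked mechanically. The following is a minimal sketch, assuming the kill line keeps the layout shown in this log and that pages are 4 KiB; parse_oom_kill_line, pages_to_kb and struct oom_kill_info are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields of one "Out of memory: Killed process ..." line, sizes in kB. */
struct oom_kill_info {
    int pid;
    char comm[64];
    unsigned long total_vm_kb;
    unsigned long anon_rss_kb;
    unsigned long file_rss_kb;
    unsigned long shmem_rss_kb;
};

/* Returns 0 on success, -1 if the line does not match the assumed layout. */
int parse_oom_kill_line(const char *line, struct oom_kill_info *out)
{
    assert(line != NULL && out != NULL);

    /* Skip the dmesg timestamp and the "Out of memory: " prefix. */
    const char *p = strstr(line, "Killed process ");
    if (p == NULL)
        return -1;

    int n = sscanf(p,
                   "Killed process %d (%63[^)]) total-vm:%lukB, anon-rss:%lukB,"
                   " file-rss:%lukB, shmem-rss:%lukB",
                   &out->pid, out->comm, &out->total_vm_kb, &out->anon_rss_kb,
                   &out->file_rss_kb, &out->shmem_rss_kb);
    return n == 6 ? 0 : -1;
}

/* Convert a page count from the "Tasks state" table to kB.
 * Assumption: 4 KiB pages, as on the x86_64 machine in this log. */
unsigned long pages_to_kb(unsigned long pages)
{
    return pages * 4;
}
```

For pid 5532 the table lists rss as 2793370 pages and total_vm as 2884233 pages. Multiplied by 4 these give 11173480 kB and 11536932 kB, which match anon-rss + file-rss + shmem-rss and total-vm on the kill line.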

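oomd and earlyoom

There is nothing particularly impressive here technically: both periodically poll a handful of kernel metrics, and oomd watches more of them than earlyoom does.

What the OOM score means

The deprecated /proc/$pid/oom_adj takes values from -17 to 15. The larger the value, the more likely the OOM killer is to pick the process; the smaller the value, the less likely. A value of -17 means the process will never be selected. oom_adj is superseded by oom_score_adj and is kept only for backward compatibility; it will eventually be removed.

Looking at the kernel code: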

	ONE("oom_score",  S_IRUGO, proc_oom_score),
	REG("oom_adj",    S_IRUGO|S_IWUSR, proc_oom_adj_operations),
	REG("oom_score_adj", S_IRUGO|S_IWUSR, proc_oom_score_adj_operations),

proc_oom_score_adj_operations and proc_oom_adj_operations both end up calling __set_oom_adj; the only difference between them is a numeric conversion.
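
The conversion can be sketched in user space as follows. This is my reading of the scaling applied on the oom_adj write path, not a copy of the kernel source: the constants are meant to mirror OOM_DISABLE, OOM_ADJUST_MAX and OOM_SCORE_ADJ_MAX from the kernel's oom.h, and the function name oom_adj_to_score_adj is made up for this note. Check it against the kernel version you care about.

```c
#include <assert.h>

/* Assumed to match the kernel's uapi oom.h values. */
#define OOM_DISABLE        (-17)
#define OOM_ADJUST_MAX     15
#define OOM_SCORE_ADJ_MAX  1000

/* Map a legacy oom_adj value (-17..15) onto the oom_score_adj scale (-1000..1000). */
int oom_adj_to_score_adj(int oom_adj)
{
    assert(oom_adj >= OOM_DISABLE && oom_adj <= OOM_ADJUST_MAX);

    /* The maximum is special-cased so that 15 maps exactly to 1000. */
    if (oom_adj == OOM_ADJUST_MAX)
        return OOM_SCORE_ADJ_MAX;

    /* Everything else is scaled linearly; -17 becomes -1000. */
    return (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
}
```

With integer division, oom_adj 8 becomes 470 and oom_adj -17 becomes -1000, which is the "never kill" value on the new scale.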

How to adjust the OOM score:
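
From user space the adjustment is a plain write to the proc file. Below is a minimal sketch; write_oom_score_adj is a hypothetical helper, and the path is a parameter only so that the function can be tried against a scratch file before pointing it at /proc/PID/oom_score_adj.

```c
#include <assert.h>
#include <stdio.h>

/* Write value (-1000..1000) to an oom_score_adj file.
 * Returns 0 on success, -1 on a range error or an I/O error. */
int write_oom_score_adj(const char *path, int value)
{
    assert(path != NULL);

    if (value < -1000 || value > 1000)
        return -1;

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = fprintf(f, "%d\n", value) > 0;

    /* For procfs the write happens on flush, so errors such as EACCES
     * only show up here. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

A call such as write_oom_score_adj("/proc/self/oom_score_adj", 500) makes the current process a more likely victim. As far as I know, raising the value needs no privilege, while lowering it below the previously set minimum requires CAP_SYS_RESOURCE.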

process_mrelease

I do not fully understand this yet; I will come back to it later.

mm: introduce process_mrelease system call

In modern systems it’s not unusual to have a system component monitoring memory conditions of the system and tasked with keeping system memory pressure under control. One way to accomplish that is to kill non-essential processes to free up memory for more important ones. Examples of this are Facebook’s OOM killer daemon called oomd and Android’s low memory killer daemon called lmkd.

For such system component it’s important to be able to free memory quickly and efficiently. Unfortunately the time process takes to free up its memory after receiving a SIGKILL might vary based on the state of the process (uninterruptible sleep), size and OPP level of the core the process is running. A mechanism to free resources of the target process in a more predictable way would improve system’s ability to control its memory pressure.

Introduce process_mrelease system call that releases memory of a dying process from the context of the caller. This way the memory is freed in a more controllable way with CPU affinity and priority of the caller. The workload of freeing the memory will also be charged to the caller. The operation is allowed only on a dying process.

After previous discussions [1, 2, 3] the decision was made [4] to introduce a dedicated system call to cover this use case.

The API is as follows,

      int process_mrelease(int pidfd, unsigned int flags);

    DESCRIPTION
      The process_mrelease() system call is used to free the memory of
      an exiting process.

      The pidfd selects the process referred to by the PID file
      descriptor.
      (See pidfd_open(2) for further information)

      The flags argument is reserved for future use; currently, this
      argument must be specified as 0.

    RETURN VALUE
      On success, process_mrelease() returns 0. On error, -1 is
      returned and errno is set to indicate the error.

    ERRORS
      EBADF  pidfd is not a valid PID file descriptor.

      EAGAIN Failed to release part of the address space.

      EINTR  The call was interrupted by a signal; see signal(7).

      EINVAL flags is not 0.

      EINVAL The memory of the task cannot be released because the
             process is not exiting, the address space is shared
             with another live process or there is a core dump in
             progress.

      ENOSYS This system call is not supported, for example, without
             MMU support built into Linux.

      ESRCH  The target process does not exist (i.e., it has terminated
             and been waited on).

[1] https://lore.kernel.org/lkml/20190411014353.113252-3-surenb@google.com/
[2] https://lore.kernel.org/linux-api/20201113173448.1863419-1-surenb@google.com/
[3] https://lore.kernel.org/linux-api/20201124053943.1684874-3-surenb@google.com/
[4] https://lore.kernel.org/linux-api/20201223075712.GA4719@lst.de/

Link: https://lkml.kernel.org/r/20210809185259.405936-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan surenb@google.com
Reviewed-by: Shakeel Butt shakeelb@google.com
Acked-by: David Hildenbrand david@redhat.com
Acked-by: Michal Hocko mhocko@suse.com
Acked-by: Christian Brauner christian.brauner@ubuntu.com
Cc: David Rientjes rientjes@google.com
Cc: Matthew Wilcox (Oracle) willy@infradead.org
Cc: Johannes Weiner hannes@cmpxchg.org
Cc: Roman Gushchin guro@fb.com
Cc: Rik van Riel riel@surriel.com
Cc: Minchan Kim minchan@kernel.org
Cc: Christoph Hellwig hch@infradead.org
Cc: Oleg Nesterov oleg@redhat.com
Cc: Jann Horn jannh@google.com
Cc: Geert Uytterhoeven geert@linux-m68k.org
Cc: Andy Lutomirski luto@kernel.org
Cc: Christian Brauner christian.brauner@ubuntu.com
Cc: Florian Weimer fweimer@redhat.com
Cc: Jan Engelhardt jengelh@inai.de
Cc: Tim Murray timmurray@google.com
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Linus Torvalds torvalds@linux-foundation.org
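
To make the API described in the commit message concrete, here is a rough sketch of how a user-space killer might use it: open a pidfd, send SIGKILL through it, then ask the kernel to reap the victim's memory in the caller's context. kill_and_release is a hypothetical helper. The fallback syscall numbers are the ones I believe x86_64 and most other architectures use, and they only apply when the libc headers do not define them. process_mrelease needs Linux 5.15 or newer; on older kernels the call fails with ENOSYS.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>   /* for the waitpid() that the parent still has to do */
#include <unistd.h>

/* Fallbacks for older headers (assumed numbers, see the note above). */
#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif
#ifndef __NR_pidfd_send_signal
#define __NR_pidfd_send_signal 424
#endif
#ifndef __NR_process_mrelease
#define __NR_process_mrelease 448
#endif

/* Kill pid and try to release its memory right away.
 * Returns 0 on success, -1 with errno set if any step failed. */
int kill_and_release(pid_t pid)
{
    assert(pid > 0);

    /* A pidfd pins the target, so the signal cannot hit a recycled PID. */
    int pidfd = (int)syscall(__NR_pidfd_open, pid, 0);
    if (pidfd < 0)
        return -1;

    int rc = -1;
    if (syscall(__NR_pidfd_send_signal, pidfd, SIGKILL, NULL, 0) == 0) {
        /* Only allowed on a dying process; flags must be 0. */
        rc = (int)syscall(__NR_process_mrelease, pidfd, 0);
    }

    int saved_errno = errno;
    close(pidfd);
    errno = saved_errno;
    return rc;
}
```

Even when process_mrelease succeeds, the victim remains a zombie until its parent calls waitpid(); only the address space is torn down early.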


## How cgroup hooks into the OOM mechanism
<!-- 5b6dcb97-f246-409d-ae62-f8ca43dd46e7 -->

In a modern environment every program lives inside a cgroup,
but an OOM is not always triggered from within a cgroup. For example:

```txt
  b'out_of_memory'
  b'__alloc_pages_slowpath'
  b'__alloc_pages_nodemask'
  b'filemap_fault'
  b'ext4_filemap_fault'
  b'__do_fault'
  b'do_fault'
  b'__handle_mm_fault'
  b'handle_mm_fault'
  b'__do_page_fault'
  b'do_page_fault'
  b'async_page_fault'
    1
```

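This is not hard to understand: by default (for example on a Fedora desktop) no memory limit is set, so an OOM is only triggered once all memory in the system has been used up.

By the time try_charge_memcg runs, the page has already been allocated. The kernel then charges that page to the cgroup, and if usage now exceeds the limit, it tries to reclaim memory.

Put more simply, every memory allocation passes two checks:

  1. One by the cgroup (unlimited by default; when a limit is set, the failure shows up with a cgroup-related backtrace)
  2. One against total system memory

RSS has a precise definition

// Total RSS of a process (aggregated over all NUMA nodes)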
static inline unsigned long get_mm_rss(struct mm_struct *mm)
{
    return get_mm_counter(mm, MM_FILEPAGES) +
           get_mm_counter(mm, MM_ANONPAGES) +
           get_mm_counter(mm, MM_SHMEMPAGES);
}
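The three page types that make up RSS

 Counter         Meaning               Typical cases
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 MM_ANONPAGES    Anonymous pages       malloc heap memory, stack, private anonymous mappings
 MM_FILEPAGES    File-backed pages     mmap'ed files, text segments of executables
 MM_SHMEMPAGES   Shared-memory pages   shared anonymous mappings (MAP_SHARED|MAP_ANON), tmpfs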

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Accounting: when RSS goes up and down

RSS increases (on page fault):

// mm/memory.c - page-fault handling
inc_mm_counter(vma->vm_mm, mm_counter_file(folio));  // file-backed page
inc_mm_counter(vma->vm_mm, MM_ANONPAGES);            // anonymous page

RSS decreases (when a page is freed or swapped out):

// mm/rmap.c - page reclaim
dec_mm_counter(mm, mm_counter(folio));        // page released

// Swap-out: the page moves from ANONPAGES to SWAPENTS
dec_mm_counter(mm, MM_ANONPAGES);
inc_mm_counter(mm, MM_SWAPENTS);              // note: not counted in RSS

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Inspecting RSS from user space

# /proc/PID/status - detailed breakdown
$ grep -E 'VmRSS|Rss' /proc/self/status
VmRSS:      1772 kB      # total RSS
RssAnon:     108 kB      # anonymous pages
RssFile:    1652 kB      # file-backed pages
RssShmem:     12 kB      # shared-memory pages

# /proc/PID/statm - compact format (unit: pages)
$ cat /proc/self/statm
665 443 88 0 0 0 0
  |   |  |
  |   |  +--- shared (file + shmem)
  |   +------ resident (RSS)
  +---------- size (VSZ)
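
The /proc/PID/status breakdown above mirrors get_mm_rss(): VmRSS is the sum of the three Rss* lines. A small sketch that pulls the four fields out of a status dump and lets you verify that; parse_status_rss, read_field and struct rss_kb are names invented for this example, and the parser assumes the "Key:<whitespace>value kB" layout shown above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* RSS figures from /proc/PID/status, all in kB. */
struct rss_kb {
    unsigned long total, anon, file, shmem;
};

/* Find "key:" at the start of a line and read the number that follows. */
static int read_field(const char *text, const char *key, unsigned long *out)
{
    size_t klen = strlen(key);

    for (const char *line = text; line != NULL && *line != '\0'; ) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':')
            return sscanf(line + klen + 1, "%lu", out) == 1 ? 0 : -1;

        /* Advance to the character after the next newline, if any. */
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return -1;
}

/* Returns 0 if all four fields were found, -1 otherwise. */
int parse_status_rss(const char *status_text, struct rss_kb *out)
{
    assert(status_text != NULL && out != NULL);

    if (read_field(status_text, "VmRSS", &out->total) != 0 ||
        read_field(status_text, "RssAnon", &out->anon) != 0 ||
        read_field(status_text, "RssFile", &out->file) != 0 ||
        read_field(status_text, "RssShmem", &out->shmem) != 0)
        return -1;
    return 0;
}
```

To use it on a live process, read /proc/self/status into a NUL-terminated buffer with fread() and pass the buffer in. For the sample output above, 108 + 1652 + 12 equals the reported VmRSS of 1772 kB.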

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
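RSS vs VSZ

 Metric                      Meaning                                      Includes                                                      Excludes
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 VSZ (Virtual Memory Size)   Total size of the virtual address space      All mapped regions (whether or not backed by physical pages)  -
 RSS (Resident Set Size)     Pages actually resident in physical memory   File + anonymous + shmem pages that have physical memory      Swapped-out pages, untouched zero pages

Together with the output of ./verify/rss.c, I think the picture is quite clear: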

╔════════════════════════════════════════════════════════════╗
║                      SUMMARY                               ║
╠════════════════════════════════════════════════════════════╣
║  File Type    │  Map Type   │  After Read  │  After Write ║
╠═══════════════╪═════════════╪══════════════╪══════════════╣
║  Regular File │  PRIVATE    │  RssFile     │  RssAnon     ║
║  Regular File │  SHARED     │  RssFile     │  RssFile     ║
║  tmpfs File   │  PRIVATE    │  RssShmem    │  RssAnon     ║
║  tmpfs File   │  SHARED     │  RssShmem    │  RssShmem    ║
║  Anonymous    │  SHARED     │  N/A         │  RssShmem    ║
╚═══════════════╧═════════════╧══════════════╧══════════════╝

Reposting any article from this site to CSDN will be treated as infringement and pursued legally; any other use is fine.