Skip to the content.

4

- asm_exc_page_fault
  - exc_page_fault
    - handle_page_fault
      - do_user_addr_fault
        - handle_mm_fault
          - __handle_mm_fault
            - handle_pte_fault
              - do_anonymous_page
                - mem_cgroup_charge
                  - __mem_cgroup_charge
                    - charge_memcg
                      - try_charge
                        - try_charge_memcg
                          - try_to_free_mem_cgroup_pages
                            - do_try_to_free_pages
                              - shrink_zones
                                - shrink_node
                                  - shrink_node_memcgs
                                    - shrink_lruvec
                                      - shrink_list

可以观测的一个的统计事件:

memcg_memory_event(mem_over_limit, MEMCG_MAX);

从 shrink_folio_list 中看,只有 free 页面才会是算作 nr_reclaimed 的。

5

如果使用 fault injection 机制,就不会堵塞在哪里,将会是 100% CPU 消耗,但是阻塞的位置不是在 try_charge_memcg

shrink_folio_list 中,发现所有的 page 都没有办法写回,

			switch (pageout(folio, mapping, &plug)) {
			case PAGE_KEEP:
				goto keep_locked;
			case PAGE_ACTIVATE:
				goto activate_locked;
			case PAGE_SUCCESS:
				stat->nr_pageout += nr_pages;

				if (folio_test_writeback(folio)){
					goto keep;
				}
				if (folio_test_dirty(folio)){
                    // 从这里直接跳转到 while 循环的尾部
					goto keep;
				}

shrink_inactive_list 中调用了 mem_cgroup_uncharge_list, 实际上,这些 page 还是在被使用的,folio_list 中都是没有释放的, 然后就被释放了,也许对于 mem_cgroup_uncharge_list

	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
					nr[LRU_INACTIVE_FILE]) {
		unsigned long nr_anon, nr_file, percentage;
		unsigned long nr_scanned;

		for_each_evictable_lru(lru) {
			if (nr[lru]) {
				nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
				nr[lru] -= nr_to_scan;

				nr_reclaimed += shrink_list(lru, nr_to_scan,
							    lruvec, sc);
			}
		}

		cond_resched();
		if (nr_reclaimed < nr_to_reclaim || proportional_reclaim)
			continue;

而且是无法杀掉的。

		pr_info("[->:%s:%d] %ld %ld\n", __FUNCTION__, __LINE__, nr_reclaimed, nr_to_reclaim);
Dec 28 15:00:09 localhost.localdomain kernel: vmscan: [->:shrink_lruvec:5958] 1 32

按道理来说,是一个都不行的,但是 nr_reclaimed 居然还可以回收一个。

这里并不是死循环才对的,而且跳出还很快,细节之后分析吧。

scsi debug 注入错误

如果是使用 scsi debug 的方式,会在 wbt 上阻塞。

其实是类似的方式,但是这个 write 循环太过于频繁,甚至导致 wbt 都其作用了。

在 end_swap_bio_write 的地方释放内存:

#0  end_swap_bio_write (bio=0xffff888101892300) at mm/page_io.c:32
#1  0xffffffff816ccfcc in req_bio_endio (error=10 '\n', nbytes=524288, bio=0xffff888101892300, rq=0xffff88812793c800) at block/blk-mq.c:794
#2  blk_update_request (req=req@entry=0xffff88812793c800, error=error@entry=10 '\n', nr_bytes=524288) at block/blk-mq.c:926
#3  0xffffffff81ad5b42 in scsi_end_request (req=req@entry=0xffff88812793c800, error=error@entry=10 '\n', bytes=<optimized out>) at drivers/scsi/scsi_lib.c:539
#4  0xffffffff81ad6a9b in scsi_io_completion_action (result=<optimized out>, cmd=0xffff88812793c910) at drivers/scsi/scsi_lib.c:841
#5  scsi_io_completion (cmd=0xffff88812793c910, good_bytes=<optimized out>) at drivers/scsi/scsi_lib.c:996
#6  0xffffffff816c9c78 in blk_complete_reqs (list=<optimized out>) at block/blk-mq.c:1131
#7  0xffffffff8219fe14 in __do_softirq () at kernel/softirq.c:571
#8  0xffffffff811318ca in invoke_softirq () at kernel/softirq.c:445
#9  __irq_exit_rcu () at kernel/softirq.c:650
#10 0xffffffff8218cfa6 in sysvec_apic_timer_interrupt (regs=0xffffc90042adfae8) at arch/x86/kernel/apic/apic.c:1107
#0  __wbt_done (wb_acct=WBT_TRACKED, rqos=0xffff888354881558) at block/blk-wbt.c:176
#1  wbt_done (rqos=0xffff888354881558, rq=0xffff88812793c800) at block/blk-wbt.c:201
#2  0xffffffff816dc6f0 in __rq_qos_done (rqos=0xffff888354881558, rq=rq@entry=0xffff88812793c800) at block/blk-rq-qos.c:39
#3  0xffffffff816cbdf1 in rq_qos_done (rq=0xffff88812793c800, q=0xffff8883413b3390) at block/blk-rq-qos.h:178
#4  blk_mq_free_request (rq=rq@entry=0xffff88812793c800) at block/blk-mq.c:742
#5  0xffffffff816cbf03 in __blk_mq_end_request (rq=rq@entry=0xffff88812793c800, error=error@entry=10 '\n') at block/blk-mq.c:1044
#6  0xffffffff81ad5bf9 in scsi_end_request (req=req@entry=0xffff88812793c800, error=error@entry=10 '\n', bytes=<optimized out>) at drivers/scsi/scsi_lib.c:574
#7  0xffffffff81ad6a9b in scsi_io_completion_action (result=<optimized out>, cmd=0xffff88812793c910) at drivers/scsi/scsi_lib.c:841
#8  scsi_io_completion (cmd=0xffff88812793c910, good_bytes=<optimized out>) at drivers/scsi/scsi_lib.c:996
#9  0xffffffff816c9c78 in blk_complete_reqs (list=<optimized out>) at block/blk-mq.c:1131
#10 0xffffffff8219fe14 in __do_softirq () at kernel/softirq.c:571
#11 0xffffffff811318ca in invoke_softirq () at kernel/softirq.c:445
#12 __irq_exit_rcu () at kernel/softirq.c:650
#13 0xffffffff8218cfa6 in sysvec_apic_timer_interrupt (regs=0xffffc90042adfae8) at arch/x86/kernel/apic/apic.c:1107

虽然是迅速的返回失败,但是在 wbt 的眼中,这个还是相当于 write 提交太快了。

scsi debug 没有 wbt,可以立刻 OOM 掉

去掉 wbt 机制,对于 fault injection 是没有影响的。

./fault/inject.sh: line 16:  1597 Bus error               (core dumped) cgexec -g memory:mem ./a.out

内核日志显示,死掉了,然后进行 read 的时候,进一步触发死亡的:

[   31.408189] vmscan: [->:shrink_lruvec:5958] 0 32
[   31.408378] vmscan: [->:shrink_lruvec:5958] 0 32
[   31.408563] vmscan: [->:shrink_lruvec:5958] 0 32
[   31.408748] vmscan: [->:shrink_lruvec:5958] 0 32
[   31.408935] vmscan: [->:shrink_lruvec:5958] 0 32
[   31.409120] vmscan: [->:shrink_lruvec:5958] 0 32
[   31.410510] systemd-coredum (1552) used greatest stack depth: 12880 bytes left
[   31.588199] Read-error on swap-device (8:64:131160)
[   31.590922] a.out (1549) used greatest stack depth: 11224 bytes left
➜  ~ dmesg | grep try_charge_memcg
[   31.221360] [->:try_charge_memcg:2743]

这,是不是有点没啥关联?

#0  martins3 () at mm/memcontrol.c:2633
#1  0xffffffff82122868 in try_charge_memcg (memcg=memcg@entry=0xffff88810632e000, gfp_mask=gfp_mask@entry=1843402, nr_pages=512) at mm/memcontrol.c:2687
#2  0xffffffff813a1b3d in try_charge (nr_pages=512, gfp_mask=1843402, memcg=0xffff88810632e000) at mm/memcontrol.c:2845
#3  charge_memcg (folio=folio@entry=0xffffea0004c10000, memcg=memcg@entry=0xffff88810632e000, gfp=gfp@entry=1843402) at mm/memcontrol.c:6962
#4  0xffffffff813a3438 in __mem_cgroup_charge (folio=0xffffea0004c10000, mm=<optimized out>, gfp=gfp@entry=1843402) at mm/memcontrol.c:6983
#5  0xffffffff8138a7af in mem_cgroup_charge (gfp=1843402, mm=<optimized out>, folio=<optimized out>) at ./include/linux/memcontrol.h:671
#6  __do_huge_pmd_anonymous_page (gfp=1843402, page=0xffffea0004c10000, vmf=0xffffc90043187df8) at mm/huge_memory.c:662
#7  do_huge_pmd_anonymous_page (vmf=vmf@entry=0xffffc90043187df8) at mm/huge_memory.c:837
#8  0xffffffff8132a8ab in create_huge_pmd (vmf=0xffffc90043187df8) at mm/memory.c:4790
#9  __handle_mm_fault (vma=vma@entry=0xffff8881134f7390, address=address@entry=139697269506048, flags=flags@entry=597) at mm/memory.c:5043
#10 0xffffffff8132b0a4 in handle_mm_fault (vma=0xffff8881134f7390, address=address@entry=139697269506048, flags=<optimized out>, flags@entry=597, regs=regs@entry=0xffffc90043187f58) at mm/memory.c:5219
#11 0xffffffff8110d937 in do_user_addr_fault (regs=regs@entry=0xffffc90043187f58, error_code=error_code@entry=6, address=address@entry=139697269506048) at arch/x86/mm/fault.c:1428
#12 0xffffffff8218b6d6 in handle_page_fault (address=139697269506048, error_code=6, regs=0xffffc90043187f58) at arch/x86/mm/fault.c:1519
#13 exc_page_fault (regs=0xffffc90043187f58, error_code=6) at arch/x86/mm/fault.c:1575
#14 0xffffffff82201286 in asm_exc_page_fault () at ./arch/x86/include/asm/idtentry.h:570

是的,但是纯粹就是巧合而已,当超过 100M 之后,恰好使用 transparent hugepage 了,实际上。

如果将 wbt 关掉,那么我们就可以看到 memory 总是可以下写的。

3 如果只是单纯的写入过慢,应该存在不一样的效果

  1. 难道说,为什么 a.out 的代码段成为读取的内容了 ?
    • 这只是随机现象中的一种而已

将几个 printk 去掉之后,几乎立刻是 OOM 的

这个才是正常的,在

实际上 wbt 只是让时间推迟了而已,最后,会按照完全相同的剧本死掉

本站所有文章转发 CSDN 将按侵权追究法律责任,其它情况随意。