slub 调试记录

https://access.redhat.com/solutions/358933

https://serverfault.com/questions/1020241/debugging-kmalloc-64-slab-allocations-memory-leak

https://www.kernel.org/doc/html/latest/mm/slub.html

perf

docs/kernel/mm/mm-slub-perf.sh

启动参数

F Sanity checks checks basic details about slab objects, such as making sure the current amount of objects on a slab is not greater than the maximum amount of objects that slab can hold, or the current amount of used slab objects is not more than the amount of objects on the slab (and various other small checks). This options is known to mitigate the freelist pointer corruption issue in Red Hat Enterprise Linux 7 systems.
Z Red zoning adds a small amount of internal fragmentation to slabs known as “Redzones”. These zones have an unlikely value (defined in include/linux/poison.h) written to them and these values are checked for consistency. Useful to detect times where part of a slab is overwritten.
P Posioning writes an unlikely “poison” value to a slab object on allocation or freeing a slab object. On allocation, the slab object is “poisoned” with this value until the object is actually stored in the slab. On free, the object is overwritten with the poison value. Useful to detect “use-after-free” situations.
U User tracking stores the PID and PID’s kernel stack when that PID allocates or frees a slab object in an internal structure. Requires a vmcore to view.
Specific slabs can be targeted for enabling debugging by adding one or more slab cache names after the flags as a comma-separated list. This also takes wildcard (*) to match multiple slab caches.

slub_debug=F                    # Enables sanity checks for all slabs.
slub_debug=FZUP,dentry          # Enables noted debugging for just the dentry slab cache
slub_debug=U,kmalloc-*,filp     # Enables user tracking for all slab caches whose name begins with 'kmalloc' and the file pointer slab cache

额外推荐的参数：

no_hash_pointers log_buf_len=10M

slub_debug=U,kmalloc-* no_hash_pointers log_buf_len=10M

测试了一下，打开之后，系统直接卡死，真的很绝望。

通过 /sys/kernel/slab/dentry/trace

以前，可以通过打开关闭:

echo 1 | sudo tee /sys/kernel/slab/dentry/trace

static ssize_t trace_show(struct kmem_cache *s, char *buf)
{
	return sysfs_emit(buf, "%d\n", !!(s->flags & SLAB_TRACE));
}

不过这个功能被 drop 掉了，只能从开机的时候打开

History:        #0
Commit:         060807f841ac94d3826ce6fa3b4f3831cd0c015b
Author:         Vlastimil Babka <vbabka@suse.cz>
Committer:      Linus Torvalds <torvalds@linux-foundation.org>
Author Date:    Fri 07 Aug 2020 02:18:45 PM CST
Committer Date: Sat 08 Aug 2020 02:33:22 AM CST

mm, slub: make remaining slub_debug related attributes read-only

SLUB_DEBUG creates several files under /sys/kernel/slab/<cache>/ that can
be read to check if the respective debugging options are enabled for given
cache.  Some options, namely sanity_checks, trace, and failslab can be
also enabled and disabled at runtime by writing into the files.

The runtime toggling is racy.  Some options disable __CMPXCHG_DOUBLE when
enabled, which means that in case of concurrent allocations, some can
still use __CMPXCHG_DOUBLE and some not, leading to potential corruption.
The s->flags field is also not updated or checked atomically.  The
simplest solution is to remove the runtime toggling.  The extended
slub_debug boot parameter syntax introduced by earlier patch should allow
to fine-tune the debugging configuration during boot with same
granularity.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Jann Horn <jannh@google.com>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Link: http://lkml.kernel.org/r/20200610163135.17364-5-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

slub_debug : https://www.kernel.org/doc/Documentation/vm/slub.txt

slub_nomerge

The kernel often works to merge other slab caches into pre-existing kmalloc slab caches which can complicate troubleshooting. If necessary for troubleshooting, the merging behavior can be disabled with the kernel parameter slub_nomerge.

For example, the kmalloc-512 slab cache below has a number of other slab caches merged with it;

Raw
crash> tree -t rbtree  0xffff93e5537e0a20 -s kernfs_node -o kernfs_node.rb | grep -e ^fff -e name -e smylink -e target_kn | grep -B4 0xffff94449c7eb528 | grep name
  name = 0xffff9444abfc7ec0 "posix_timers_cache",
  name = 0xffff94449c7f2460 "sgpool-16",
  name = 0xffff94449c7f2160 "xfrm_dst_cache",
  name = 0xffff93e5593982a0 "khugepaged_mm_slot",
  name = 0xffff93e5537c3f00 "kmalloc-512",
  name = 0xffff94449c7f2220 "file_lock_cache",
  name = 0xffff93e5593981c0 "skbuff_fclone_cache",

没有关闭的时候:

 Active / Total Objects (% used)    : 220351 / 225206 (97.8%)
 Active / Total Slabs (% used)      : 6900 / 6900 (100.0%)
 Active / Total Caches (% used)     : 115 / 155 (74.2%)
 Active / Total Size (% used)       : 54184.28K / 55486.18K (97.7%)
 Minimum / Average / Maximum Object : 0.01K / 0.25K / 8.31K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 29016  29016 100%    0.10K    744       39      2976K buffer_head
 24870  24870 100%    0.13K    829       30      3316K kernfs_node_cache
 23961  23826  99%    0.19K   1141       21      4564K dentry
 14366  14366 100%    0.18K    653       22      2612K nfs_direct_cache
  9773   9773 100%    1.09K    337       29     10784K ext4_inode_cache
  8064   8064 100%    0.57K    288       28      4608K radix_tree_node
  7936   7790  98%    0.02K     31      256       124K kmalloc-16
  6834   6816  99%    0.04K     67      102       268K extent_status
  6144   6144 100%    0.01K     12      512        48K kmalloc-8
  5954   5954 100%    0.59K    229       26      3664K inode_cache
  5632   5481  97%    0.03K     44      128       176K kmalloc-32
  5548   5548 100%    0.05K     76       73       304K ftrace_event_field
  5264   5264 100%    0.07K     94       56       376K vmap_area
  5120   4858  94%    0.06K     80       64       320K kmalloc-64
  4998   4577  91%    0.04K     49      102       196K vma_lock
  4646   4483  96%    0.17K    202       23       808K vm_area_struct
  3990   3453  86%    0.09K     95       42       380K kmalloc-96
  3456   2950  85%    0.06K     54       64       216K kmalloc-cg-64
  2882   2816  97%    0.72K    131       22      2096K shmem_inode_cache
  2688   2688 100%    0.19K    128       21       512K kmalloc-192
  2320   2294  98%    0.50K    145       16      1160K kmalloc-512
  2144   2144 100%    0.25K    134       16       536K kmalloc-256
  2048   2048 100%    0.01K      4      512        16K kmalloc-cg-8
  2016   2016 100%    0.09K     48       42       192K trace_event_file
  1904   1053  55%    0.25K    119       16       476K maple_node
  1872   1621  86%    0.10K     48       39       192K anon_vma
  1472   1472 100%    0.06K     23       64        92K iommu_iova
  1296   1274  98%    1.00K     81       16      1296K kmalloc-1k

关闭之后

 Active / Total Objects (% used)    : 201313 / 205157 (98.1%)
 Active / Total Slabs (% used)      : 6255 / 6255 (100.0%)
 Active / Total Caches (% used)     : 162 / 227 (71.4%)
 Active / Total Size (% used)       : 46714.48K / 47648.04K (98.0%)
 Minimum / Average / Maximum Object : 0.01K / 0.23K / 8.31K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 26880  26880 100%    0.13K    896       30      3584K kernfs_node_cache
 22308  22308 100%    0.10K    572       39      2288K buffer_head
 17388  17340  99%    0.19K    828       21      3312K dentry
 14366  14366 100%    0.18K    653       22      2612K ext4_groupinfo_4k
  7424   7424 100%    0.02K     29      256       116K kmalloc-16
  7336   7336 100%    0.57K    262       28      4192K radix_tree_node
  6144   6140  99%    0.01K     12      512        48K kmalloc-8
  5632   5632 100%    0.03K     44      128       176K kmalloc-32
  5610   5186  92%    0.04K     55      102       220K vma_lock
  5460   5460 100%    0.59K    210       26      3360K inode_cache
  5336   4982  93%    0.17K    232       23       928K vm_area_struct
  5329   5329 100%    0.05K     73       73       292K ftrace_event_field
  5120   4753  92%    0.06K     80       64       320K kmalloc-64
  4756   4756 100%    1.09K    164       29      5248K ext4_inode_cache
  3864   3282  84%    0.09K     92       42       368K kmalloc-96
  3584   3584 100%    0.07K     64       56       256K vmap_area
  3468   3468 100%    0.04K     34      102       136K extent_status
  3264   3057  93%    0.06K     51       64       204K anon_vma_chain

的确是有区别的。

当使用 slub_nomerge 的时候，结果为: ls -la /sys/kernel/slab/ 下软链接会消失。

slub_nomerge 的基本作用

find_mergeable : 构建 slab cache 的时候才有用

slabinfo -a，源码位于内核源码树下的 tools/mm/slabinfo.c

http://raverstern.site/en/posts/slab-merging/

其实就是多个 slab cache 公用一个

:0000024     <- avtab_node audit_buffer fsnotify_mark_connector hashtab_node
:0000032     <- ext4_io_end_vec extended_perms_data i915_lut_handle lsm_file_cache sw_flow_stats numa_policy
:0000040     <- khugepaged_mm_slot bio_crypt_ctx ext4_system_zone avtab_extended_perms nfs4_clnt_odstate
:0000048     <- shared_policy_node xfs_log_ticket i915_priolist ksm_mm_slot xfs_refc_intent xfs_ifork Acpi-Namespace avc_xperms_decision_node
:0000056     <- damon_region Acpi-Parse ftrace_event_field xfs_extfree_intent avc_xperms_node zswap_entry file_lock_ctx zspage-zswap1
:0000064     <- io fanotify_path_event ksm_stable_node iommu_iova jbd2_inode ebitmap_node ksm_rmap_item xfs_defer_pending dmaengine-unmap-2
:0000072     <- vmap_area nf_conncount_tuple fanotify_fid_event avc_node lsm_inode_cache drm_buddy_block Acpi-Operand xfs_bmap_intent
:0000080     <- kernfs_iattrs_cache Acpi-State nfsd_file_mark audit_tree_mark Acpi-ParseExt xfs_exchmaps_intent xfs_rmap_intent
:0000088     <- configfs_dir_cache blkdev_ioc xfs_attr_intent
:0000096     <- nf_conncount_rb trace_event_file
:0000128     <- nfsd_file iwl_cmd_pool:0000:00:14.3 xe_hw_fence active_node net_bridge_fdb_entry nfsd_cacherep btree_node backing_aio
:0000136     <- kvm_async_pf kernfs_node_cache
:0000160     <- fuse_request file_lease_cache
:0000176     <- xfs_bud_item xfs_iul_item xfs_xmd_item xfs_cud_item xfs_rud_item xfs_attrd_item
:0000184     <- nf-frags nfs4_ol_stateid ip6-frags xfs_icr
:0000192     <- aio_kiocb mfc_cache sdebug_queued_cmd uid_cache skbuff_ext_cache mfc6_cache bio_integrity_payload rtable inet_peer dmaengine-unmap-16 drm_sched_fence file_lock_cache
:0000208     <- xfs_bui_item xfs_attri_item nf_conntrack_expect
:0000216     <- xfs_refcbt_cur xfs_rtrefcountbt_cur xfs_inobt_cur
:0000232     <- xfs_trans xfs_bnobt_cur
:0000256     <- maple_node sgpool-8 key_jar biovec-16 rpc_tasks nvmet-bvec xe_sched_job
:0000280     <- xfs_rmapbt_cur nfs4_file
:0000320     <- xfrm_dst i915_vma_resource
:0000432     <- xfs_cui_item xfs_efi_item
:0000440     <- nfs4_delegation xfs_efd_item
:0000512     <- sgpool-16 skbuff_fclone_cache pool_workqueue
:0000640     <- i915_vma task_group dio
:0001024     <- sgpool-32 iommu_iova_magazine biovec-64
:0001328     <- nfs4_client perf_event
:0002048     <- rpc_buffers sgpool-64 biovec-128
:0004096     <- fgraph_stack sgpool-128 nvme-chap-buf-cache biovec-max
:A-0000032   <- io_buffer dnotify_struct
:a-0000032   <- pending_reservation jbd2_revoke_record_s
:A-0000040   <- pde_opener vma_lock
:A-0000048   <- fasync_cache ip_fib_trie
:a-0000056   <- jbd2_journal_handle mb_cache_entry ext4_free_data
:A-0000064   <- anon_vma_chain fs_cache eventpoll_pwq
:A-0000080   <- sigqueue dnotify_mark inotify_inode_mark fanotify_mark
:a-0000104   <- buffer_head ext4_fc_dentry_update
:A-0000128   <- eventpoll_epi pte_list_desc tcp_bind_bucket tcp_bind2_bucket fib6_node
:A-0000192   <- cred pid pid_2
:A-0000256   <- ip6_dst_cache
:a-0000256   <- jbd2_transaction_s
:A-0000256   <- task_delay_info
:a-0000256   <- dquot
:A-0001024   <- UNIX UNIX-STREAM PING
:A-0001152   <- signal_cache UDP UDP-Lite
:A-0001344   <- UDPv6 UDPLITEv6

原来只是如此而已。

另外一个问题

构建一个 bpftrace 类似的 ./trace -I ‘linux/slab.h’ -I ‘linux/slub_def.h’
‘kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags) (STRCMP(“task_delay_info”, s->name))’

用这个观察，所有的 cache 的分配的都是谁。

经典

https://access.redhat.com/solutions/5375971

为什么 cpu_slabs 会搞过限制 cpu_partial

[root@bogon kmalloc-512]# cat cpu_slabs
86 N0=86
[root@bogon kmalloc-512]# cat cpu_partial
52

使用这个命令:

sudo bpftrace -e 'kfunc:vmlinux:cpu_partial_show { printf("%d\n", args->s->cpu_partial_slabs); }'

输出的就是 4

补充的就是两个地方:

__slab_free
____slab_alloc -> get_partial_node -> put_cpu_partial

好吧，是我智障了，cpu_partial 显示的是一个 numa 节点的，是这个 numa 节点所有的 CPU 的 partial 之和。如果想要看每一个 CPU 有多少个 partial ，需要看 slab_cpu_pa

在 32 core 机器上，可以看到 slabs_cpu_partial 记录为 65 个 cpu partial slab + 32 ，正好是 cpu_slabs 97

[root@bogon kmalloc-512]# cat slabs_cpu_partial
1040(65) C0=32(2) C1=64(4) C2=16(1) C3=64(4) C4=32(2) C5=16(1) C6=32(2) C7=16(1) C8=48(3) C9=16(1) C10=32(2) C11=32(2) C12=32(2) C13=64(4) C14=16(1) C15=16(1) C16=32(2) C17=16(1) C18=32(2) C19=16(1) C20=64(4) C21=32(2) C22=48(3) C23=48(3) C24=16(1) C25=64(4) C26=32(2) C27=32(2) C29=16(1) C30=48(3) C31=16(1)
[root@bogon kmalloc-512]# cat cpu_slabs
97 N0=97

commit 9198ffbd2b49 (“mm/slub: Reduce memory consumption in extreme scenarios”)

这个是之前没有理解的一个代码，当时感觉到两个事情非常奇怪:

为什么 slub 会不先清理本地的 page cache ，而是会直接去另外的 numa node 中找一个 slab
为什么不去直接用其他的 node 中的 partial ，而是会分配一个完整的 slab 出来

[!NOTE] 参考神奇海螺的意见，有待验证

(认为基本上是对的，但是我没时间核查了)

第一个问题的回答:

get_partial() / get_any_partial() 是 SLUB 自己的“对象级复用”逻辑，成本很低，只是在已有 slab 里找空对象。
“释放 page cache”属于更下面 page allocator / reclaim 的事情，成本高很多，可能触发 direct reclaim、writeback、stall，甚至不保证马上在目标 node 上拿到可用页。

第二个问题的回答:

旧路径实际上是：

先找目标 node 的 partial slab
没找到就结束 partial 搜索
直接进入 new_slab(s, gfpflags, node)
new_slab() 再去 page allocator 要页
因为 gfpflags 没有 __GFP_THISNODE，page allocator 可以 fallback 到别的 node 分配页
结果就是“在别的 node 上新建一个完整 slab”

也就是说，旧实现不是“故意更喜欢新建完整 slab，而不喜欢远端 partial”；而是代码结构上把“远端 partial 搜索”这条路直接跳过去了。

这正是 9198ffbd2b49 修的点：

先仍然只看目标 node partial
如果失败，先试一次只在目标 node 新建 slab
再失败，才允许回到原始 gfpflags
这时 get_partial() 才会继续走到 get_any_partial()，于是会去其他 node 找 partial，而不是立刻新建远端 slab

本站所有文章转发 CSDN 将按侵权追究法律责任，其它情况随意。