Skip to the content.

HugeTLB

TODO

设计思路

相对于普通页,可以被简化的部分:

依旧保留的部分:

首先,注意区分一下

obj-$(CONFIG_HUGETLBFS) += hugetlb.o
obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o

如何检查系统中,到底那些进程在使用 hugepages

grep huge /proc/*/numa_maps

接口

free_hugepages resv_hugepages surplus_hugepages


sys/devices/system/node/node1/hugepages/hugepages-2048kB
```txt
nr_hugepages


free_hugepages
surplus_hugepages

/proc/sys/vm

nr_hugepages
nr_hugepages_mempolicy
nr_overcommit_hugepages

不同层次的接口出现在各种位置,因为 /proc/meminfo 中存在 free resv 和 surplus 的,所以 /proc/sys/vm 少来三个。

数值基本含义

|——|——————————–|——————–| | surp | 该 node 中,如果不用,就会释放 | surplus_huge_pages | | free | 暂时没有用 | free_huge_pages | | resv | 暂时没用,在 free 中 | resv_huge_pages | | nr | 总数 | nr_huge_pages | | used | 已经被使用的 | |

当 free - resv > 0

+------ used --------+------------free----------+
|                    |                          |
+--------------------+--------------------+-resv+
|                    |                    |     |
|                    |                    |     |
+--------------------+--------------------+-----+

当 free == resv 的时候

+-----------------------+--------------------------+
|                       |                          |
|   surplus             |  "persistent"            |
+---------+-------------+--------------------------+
|         |                                        |
+--resv---+                                        |
|         |                                        |
+--free---+--------------used----------------------+

hugepages 的初始化过程

一共三个参数

  1. hugepages=
  2. hugepagesz=
  3. default_hugepagesz= 其中 hugepages= hugepagesz= 是配合使用的,使用方法为 hugepagesz=X hugepages=Y

三个都会调用 hugetlb_hstate_alloc_pages,但是如果是

hugepagesz= 中会将数值放到 parsed_hstate 中:

static struct hstate * __initdata parsed_hstate;

首先初始化内核参数,然后

测试数据

测试的是虚拟机中编译内核,可见,当让 Guest 使用大页之后,存在明显的性能提升。

file :

make -j31  1933.04s user 251.73s system 2139% cpu 1:42.10 total

memory:

make -j31  1912.18s user 249.34s system 2131% cpu 1:41.42 total

huge:

make -j31  1840.78s user 208.05s system 2168% cpu 1:34.49 total

常用脚本

配置其 zsh

function write() {
  node=$1
  num=$2
  echo $num >/sys/devices/system/node/node$node/hugepages/hugepages-2048kB/nr_hugepages
  numastat -m | grep -E "Node|HugePages_"
}

function global() {
  echo $1 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
  numastat -m | grep -E "Node|HugePages_"
}

使用大页的好处

使用大页的方法

mount

默认的 mount 点:

hugetlbfs /dev/hugepages hugetlbfs rw,seclabel,relatime,pagesize=1024M 0 0

libhugetlbfs 中有个函数 : hugetlbfs_find_path_for_size

mount -t hugetlbfs \
-o mode=777,pagesize=2048K,size=1G,min_size=500M none /mnt/huge

可以得到如下结果 /proc/mounts

hugetlbfs /dev/hugepages hugetlbfs rw,seclabel,relatime,pagesize=2M 0 0

然后映射其中的文件。 TODO

mmap

#include <assert.h> // assert
#include <errno.h>
#include <fcntl.h>   // open
#include <limits.h>  // INT_MAX
#include <math.h>    // sqrt
#include <stdbool.h> // bool false true
#include <stdio.h>
#include <stdlib.h> // malloc sort
#include <string.h> // strcmp ..
#include <unistd.h> // sleep

#include <sys/mman.h>

#define PG_SIZE (1 << 21)
int main(int argc, char *argv[]) {
  int num = 100;
  printf("[->:%s:%d] \n", __FUNCTION__, __LINE__);
  char *ptr = (char *)mmap(NULL, num * PG_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (ptr == MAP_FAILED) {
    printf("no hugepages  %s\n", strerror(errno));
    exit(1);
  }
  // prealloc
  for (int i = 0; i < num; ++i) {
    char *addr = ptr + i * PG_SIZE;
    *addr = 'x';
  }
  sleep(1000);
  return 0;
}

shmem

参考 linux/tools/testing/selftests/vm/hugepage-shm.c

struct hugetlbfs_inode_info {
    struct shared_policy policy;
    struct inode vfs_inode;
    unsigned int seals;
};

常用接口触发的经典路径

通过触发 /proc/sys/vm/nr_hugepages 来控制数量:

#0  remove_pool_huge_page (h=h@entry=0xffffffff834abe20 <hstates>, nodes_allowed=nodes_allowed@entry=0xffffffff82cf8218 <node_states+24>, acct_surplus=acct_surplu
s@entry=false) at mm/hugetlb.c:2041
#1  0xffffffff812f0321 in set_max_huge_pages (h=h@entry=0xffffffff834abe20 <hstates>, count=count@entry=4, nid=nid@entry=-1, nodes_allowed=0xffffffff82cf8218 <nod
e_states+24>) at mm/hugetlb.c:3393
#2  0xffffffff812f05c8 in __nr_hugepages_store_common (len=2, count=4, nid=-1, h=0xffffffff834abe20 <hstates>, obey_mempolicy=false) at mm/hugetlb.c:3582
#3  hugetlb_sysctl_handler_common (obey_mempolicy=<optimized out>, table=<optimized out>, write=<optimized out>, buffer=<optimized out>, length=<optimized out>, p
pos=<optimized out>) at mm/hugetlb.c:4385
#4  0xffffffff813a9c2e in proc_sys_call_handler (iocb=0xffffc90000a63ea0, iter=0xffffc90000a63e78, write=<optimized out>) at fs/proc/proc_sysctl.c:611
#5  0xffffffff8131d309 in call_write_iter (iter=0xffffffff82cf8218 <node_states+24>, kio=0xffffffff834abe20 <hstates>, file=0xffff88830ec92200) at include/linux/f
s.h:2187
#6  new_sync_write (ppos=0xffffc90000a63f08, len=2, buf=0x7f90eb632000 "4\n", filp=0xffff88830ec92200) at fs/read_write.c:491
#7  vfs_write (file=file@entry=0xffff88830ec92200, buf=buf@entry=0x7f90eb632000 "4\n", count=count@entry=2, pos=pos@entry=0xffffc90000a63f08) at fs/read_write.c:5
78
#8  0xffffffff8131d6da in ksys_write (fd=<optimized out>, buf=0x7f90eb632000 "4\n", count=2) at fs/read_write.c:631
#9  0xffffffff81ea93c8 in do_syscall_x64 (nr=<optimized out>, regs=0xffffc90000a63f58) at arch/x86/entry/common.c:50
#10 do_syscall_64 (regs=0xffffc90000a63f58, nr=<optimized out>) at arch/x86/entry/common.c:80
#11 0xffffffff8200009b in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120
#12 0x0000000000000000 in ?? ()

如果获取到一个进程的使用的大页的分布情况

numastat -p $pid

hugetlb 是如何影响文件系统的

alloc_huge_page && hugetlb_reserve_pages

都是 hugepage_subpool_get_pages 打交道,只是一个在 page fault 的时候处理,一个是在 mmap 的时候

alloc_huge_page 调用位置 : hugetlb_cow(被 hugetlb_no_page 调用) hugetlb_no_page(被 hugetlb_fault 调用)

hugetlb_reserve_pages : hugetlb_file_setup 和 hugetlbfs_file_mmap

file_operations::mmap 和 vm_area_struct::vm_operations_struct::fault 的关系

调用 hugetlb_reserve_pages 的两个位置:

普通的 page 和 hugepage 是如何转换的

#define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)

最后全部在:

free 2M 的

#0  __remove_hugetlb_page (h=h@entry=0xffffffff834abe20 <hstates>, page=page@entry=0xffffea0004838000, adjust_surplus=adjust_surplus@entry=false, demote=demote@entry=false) at mm/hugetlb.c:1434
#1  0xffffffff812eef0d in remove_hugetlb_page (adjust_surplus=false, page=0xffffea0004838000, h=0xffffffff834abe20 <hstates>, h@entry=0xffffea0004838000) at mm/hugetlb.c:1479
#2  remove_pool_huge_page (h=h@entry=0xffffffff834abe20 <hstates>, nodes_allowed=nodes_allowed@entry=0xffffffff82cf8218 <node_states+24>, acct_surplus=acct_surplus@entry=false) at mm/hugetlb.c:2050
#3  0xffffffff812f0521 in set_max_huge_pages (h=h@entry=0xffffffff834abe20 <hstates>, count=count@entry=1, nid=nid@entry=-1, nodes_allowed=0xffffffff82cf8218 <node_states+24>) at mm/hugetlb.c:3393
#4  0xffffffff812f07c8 in __nr_hugepages_store_common (len=2, count=1, nid=-1, h=0xffffffff834abe20 <hstates>, obey_mempolicy=false) at mm/hugetlb.c:3582
#5  hugetlb_sysctl_handler_common (obey_mempolicy=<optimized out>, table=<optimized out>, write=<optimized out>, buffer=<optimized out>, length=<optimized out>, ppos=<optimized out>) at mm/hugetlb.c:4385
#6  0xffffffff813a9e2e in proc_sys_call_handler (iocb=0xffffc900015d3ea0, iter=0xffffc900015d3e78, write=<optimized out>) at fs/proc/proc_sysctl.c:611
#7  0xffffffff8131d509 in call_write_iter (iter=0xffffea0004838000, kio=0xffffffff834abe20 <hstates>, file=0xffff8881225c5100) at include/linux/fs.h:2187
#8  new_sync_write (ppos=0xffffc900015d3f08, len=2, buf=0x7ff38a997000 "1\n", filp=0xffff8881225c5100) at fs/read_write.c:491
#9  vfs_write (file=file@entry=0xffff8881225c5100, buf=buf@entry=0x7ff38a997000 "1\n", count=count@entry=2, pos=pos@entry=0xffffc900015d3f08) at fs/read_write.c:578
#10 0xffffffff8131d8da in ksys_write (fd=<optimized out>, buf=0x7ff38a997000 "1\n", count=2) at fs/read_write.c:631
#11 0xffffffff81ea93c8 in do_syscall_x64 (nr=<optimized out>, regs=0xffffc900015d3f58) at arch/x86/entry/common.c:50
#12 do_syscall_64 (regs=0xffffc900015d3f58, nr=<optimized out>) at arch/x86/entry/common.c:80
#13 0xffffffff8200009b in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120

free 1G 也是类似的,但是需要注意的,当把数值降低之后,是没有办法再将数值升高的。 如果 hugetlb 永远不会增加了。

[ ] 大页如何 KSM

一个 mmap 的时候,其中是否可以同时包含两种 size 大小的 page

不可以

理解一下核心结构体

struct hstate {
    struct mutex resize_lock;
    int next_nid_to_alloc;
    int next_nid_to_free;
    unsigned int order;
    unsigned int demote_order;
    unsigned long mask;
    unsigned long max_huge_pages;
    unsigned long nr_huge_pages;
    unsigned long free_huge_pages;
    unsigned long resv_huge_pages;
    unsigned long surplus_huge_pages;
    unsigned long nr_overcommit_huge_pages;
    struct list_head hugepage_activelist;
    struct list_head hugepage_freelists[MAX_NUMNODES];
    unsigned int max_huge_pages_node[MAX_NUMNODES];
    unsigned int nr_huge_pages_node[MAX_NUMNODES];
    unsigned int free_huge_pages_node[MAX_NUMNODES];
    unsigned int surplus_huge_pages_node[MAX_NUMNODES];
#ifdef CONFIG_CGROUP_HUGETLB
    /* cgroup control files */
    struct cftype cgroup_files_dfl[8];
    struct cftype cgroup_files_legacy[10];
#endif
    char name[HSTATE_NAME_LEN];
};

总共的统计数据和所有的统计数据:

hstate_is_gigantic

[ ] demote 如何工作的

[ ] 大页应该更加小心的处理 memory poison 才对吧,毕竟更加容易出现错误

[ ] folio 是如何影响 hugetlb 的

cgroup 是如何影响的

hugetlb.2MB.current
hugetlb.2MB.max
hugetlb.2MB.events
hugetlb.2MB.events.local
hugetlb.2MB.numa_stat
hugetlb.2MB.rsvd.current
hugetlb.2MB.rsvd.max

cpuset 是如何影响的

reservation

CONFIG_CONTIG_ALLOC 是做什么

config CONTIG_ALLOC
    def_bool (MEMORY_ISOLATION && COMPACTION) || CMA

是否可以分配连续的内存

[ ] dissolve_free_huge_page 被 memory_failure 和 memory hotplug 有关的

[ ] 是在什么时间点预留的

应该很清楚了,但是整理一下

分配的时候如何考虑 numa 的

通过 interleave 的方法分配的:

                          Node 0          Node 1          Node 2           Total
                 --------------- --------------- --------------- ---------------
MemTotal                 6990.34          972.48            0.00         7962.82
MemFree                  3596.47           39.74            0.00         3636.21
MemUsed                  3393.88          932.74            0.00         4326.61
/*
 * Allocates a fresh page to the hugetlb allocator pool in the node interleaved
 * manner.
 */
static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
				nodemask_t *node_alloc_noretry)

[x] 如果 hugetlb_file_setup 是入口,那个文件系统还需要使用吗

ksys_mmap_pgoff 中处理过

[ ] vm_operations_struct

#0  hugetlb_vm_op_close (vma=0xffff8883086d03c0) at include/linux/hugetlb.h:720
#1  0xffffffff812c58c9 in remove_vma (vma=vma@entry=0xffff8883086d03c0) at mm/mmap.c:143
#2  0xffffffff812c8363 in exit_mmap (mm=mm@entry=0xffff888300244800) at mm/mmap.c:3121
#3  0xffffffff810ff6ed in __mmput (mm=0xffff888300244800) at kernel/fork.c:1187
#4  mmput (mm=mm@entry=0xffff888300244800) at kernel/fork.c:1208
#5  0xffffffff811085db in exit_mm () at kernel/exit.c:510
#6  do_exit (code=code@entry=0) at kernel/exit.c:782
#7  0xffffffff81108e58 in do_group_exit (exit_code=0) at kernel/exit.c:925
#8  0xffffffff81108ecf in __do_sys_exit_group (error_code=<optimized out>) at kernel/exit.c:936
#9  __se_sys_exit_group (error_code=<optimized out>) at kernel/exit.c:934
#10 __x64_sys_exit_group (regs=<optimized out>) at kernel/exit.c:934
#11 0xffffffff81ea93c8 in do_syscall_x64 (nr=<optimized out>, regs=0xffffc900005a7f58) at arch/x86/entry/common.c:50
#12 do_syscall_64 (regs=0xffffc900005a7f58, nr=<optimized out>) at arch/x86/entry/common.c:80
#13 0xffffffff8200009b in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120
#14 0x0000000000000000 in ?? ()
const struct vm_operations_struct hugetlb_vm_ops = {
    .fault = hugetlb_vm_op_fault,
    .open = hugetlb_vm_op_open,
    .close = hugetlb_vm_op_close,
    .may_split = hugetlb_vm_op_split,
    .pagesize = hugetlb_vm_op_pagesize,
};

下面几个都是什么意思:

gup

mempolicy

memory policy 只是在调整 nr_hugepages 的时候有用,在 fault 的时候应该也是效果的才对吧

[ ] hugetlbfs_inode_info::policy 似乎根本没有用过

struct hugetlbfs_inode_info {
    struct shared_policy policy;
    struct inode vfs_inode;
    unsigned int seals;
};

[ ] alloc_buddy_huge_page_with_mpol 为什么正好在 dequeue_huge_page_vma 分配不出来的时候进行

dequeue_huge_page_vma 中还是有 memory policy 的代码的啊

之后还有

    hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page);

nr_hugepages_mempolicy 的含义

numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy

echo 20 > /proc/sys/vm/nr_hugepages_mempolicy

的区别是什么 ?

从 nr_hugepages_mempolicy 中 echo 的时候,在函数 __nr_hugepages_store_common 中调用 init_nodemask_of_mempolicy 会根据 current 的 mempolicy 来构建。

echo 0 > /proc/sys/vm/nr_hugepages
# echo 1000 > /proc/sys/vm/nr_hugepages
numactl -m 1 echo 200 > /proc/sys/vm/nr_hugepages_mempolicy
numastat -m | grep -E "Node|HugePages_"

得到的结果:

                          Node 0          Node 1          Node 2          Node 3           Total
HugePages_Total             0.00          400.00            0.00            0.00          400.00
HugePages_Free              0.00          400.00            0.00            0.00          400.00
HugePages_Surp              0.00            0.00            0.00            0.00            0.00
echo 0 > /proc/sys/vm/nr_hugepages
numactl -m 1 echo 200 >  /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages_mempolicy
numastat -m | grep -E "Node|HugePages_"
                          Node 0          Node 1          Node 2          Node 3           Total
HugePages_Total             0.00          400.00            0.00            0.00          400.00
HugePages_Free              0.00          400.00            0.00            0.00          400.00
HugePages_Surp              0.00            0.00            0.00            0.00            0.00

当然,/sys/devices/system/node/node1/hugepages/hugepages-2048kB 根本就没有

free_hugepages
nr_hugepages
surplus_hugepages

为什么当已经存在 page 的时候,这会将已经存在的大页清理掉? 这是因为希望将所有的 hugepages 设置为 0,但是现在只能在特性的 numa 上操作,所以最后会导致为 0 。

会出现 /proc/sys/vm/nr_hugepages 和 /proc/sys/vm/nr_hugepages_mempolicy 的数值不同的情况吗?

[ ] copy_hugetlb_page_range

seal

hugetlbfs_inode_info 中的 policy 是没有用吧的发

struct hugetlbfs_inode_info {
    struct shared_policy policy;
    struct inode vfs_inode;
    unsigned int seals;
};

hugepage 的动态伸缩

最终进入到 set_max_huge_pages 中。

#0  set_max_huge_pages (h=h@entry=0xffffffff834abe20 <hstates>, count=2, nid=0, nodes_allowed=0xffffc90001617e10) at mm/hugetlb.c:3271
#1  0xffffffff812f087c in __nr_hugepages_store_common (len=2, count=<optimized out>, nid=<optimized out>, h=0xffffffff834abe20 <hstates>, obey_mempolicy=false) atmm/hugetlb.c:3582
#2  nr_hugepages_store_common (obey_mempolicy=<optimized out>, kobj=<optimized out>, buf=<optimized out>, len=2) at mm/hugetlb.c:3601
#3  0xffffffff813b212b in kernfs_fop_write_iter (iocb=0xffffc90001617ea0, iter=<optimized out>) at fs/kernfs/file.c:354
#4  0xffffffff8131d509 in call_write_iter (iter=0x2 <fixed_percpu_data+2>, kio=0xffffffff834abe20 <hstates>, file=0xffff888205e3a000) at include/linux/fs.h:2187
#5  new_sync_write (ppos=0xffffc90001617f08, len=2, buf=0x7fd94a3e8000 "2\n", filp=0xffff888205e3a000) at fs/read_write.c:491
#6  vfs_write (file=file@entry=0xffff888205e3a000, buf=buf@entry=0x7fd94a3e8000 "2\n", count=count@entry=2, pos=pos@entry=0xffffc90001617f08) at fs/read_write.c:578
#7  0xffffffff8131d8da in ksys_write (fd=<optimized out>, buf=0x7fd94a3e8000 "2\n", count=2) at fs/read_write.c:631
#8  0xffffffff81ea93c8 in do_syscall_x64 (nr=<optimized out>, regs=0xffffc90001617f58) at arch/x86/entry/common.c:50
#9  do_syscall_64 (regs=0xffffc90001617f58, nr=<optimized out>) at arch/x86/entry/common.c:80
#10 0xffffffff8200009b in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120
#11 0x0000000000000000 in ?? ()

分配和回收是如何进行的

#0  dequeue_huge_page_nodemask (h=h@entry=0xffffffff834abe20 <hstates>, gfp_mask=1051842, nid=0, nmask=nmask@entry=0x0 <fixed_percpu_data>) at include/linux/gfp.h
:170
#1  0xffffffff812f1f74 in dequeue_huge_page_vma (chg=<optimized out>, avoid_reserve=<optimized out>, address=140629040431104, vma=0xffff88830ed12000, h=0xffffffff
834abe20 <hstates>) at mm/hugetlb.c:1220
#2  alloc_huge_page (vma=vma@entry=0xffff88830ed12000, addr=addr@entry=140629040431104, avoid_reserve=avoid_reserve@entry=0) at mm/hugetlb.c:2925
#3  0xffffffff812f5a25 in hugetlb_no_page (flags=597, old_pte=..., ptep=0xffff888300351cd8, address=140629040431104, idx=0, mapping=0xffff8883003f4188, vma=0xffff
88830ed12000, mm=0xffff888301a8bc00) at mm/hugetlb.c:5545
#4  hugetlb_fault (mm=0xffff888301a8bc00, vma=vma@entry=0xffff88830ed12000, address=address@entry=140629040431104, flags=flags@entry=597) at mm/hugetlb.c:5763
#5  0xffffffff812bdf9b in handle_mm_fault (vma=0xffff88830ed12000, address=address@entry=140629040431104, flags=flags@entry=597, regs=regs@entry=0xffffc9000057ff5
8) at mm/memory.c:5149
#6  0xffffffff810f29a3 in do_user_addr_fault (regs=regs@entry=0xffffc9000057ff58, error_code=error_code@entry=6, address=address@entry=140629040431104) at arch/x8
6/mm/fault.c:1397
#7  0xffffffff81ead672 in handle_page_fault (address=140629040431104, error_code=6, regs=0xffffc9000057ff58) at arch/x86/mm/fault.c:1488
#8  exc_page_fault (regs=0xffffc9000057ff58, error_code=6) at arch/x86/mm/fault.c:1544
#9  0xffffffff82000b62 in asm_exc_page_fault () at ./arch/x86/include/asm/idtentry.h:570
#10 0x0000000000000000 in ?? ()

hugetlb_vm_op_fault 是一定不会触发的,因为很早的时候就已经被 handle_mm_fault 中被劫持了。

使用 QEMU 分析一下 reservation

arg_mem_cpu="$arg_mem_cpu -object memory-backend-file,size=1G,prealloc=off,share=on,id=m2,mem-path=/dev/hugepages -numa node,memdev=m2,cpus=5-7,nodeid=2"

分析 on

on 的时候:

HugePages_Total:    1000
HugePages_Free:      741
HugePages_Rsvd:        0
HugePages_Surp:        0

使用: 259

但是在 guest 中 stress-ng 的时候:

HugePages_Total:    1000
HugePages_Free:      488
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

echo 0 之后:

HugePages_Total:     205
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:      205
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0

看来,我们对于 prealloc = on 的参数理解有问题,以为是直接就会使用掉,但是实际上,并没有。

分析 off 的时候

off 的时候:

HugePages_Total:    1000
HugePages_Free:      847
HugePages_Rsvd:      359
HugePages_Surp:        0
Hugepagesize:       2048 kB

153 + 359 = 512 的

echo 0 之后:

HugePages_Total:     512
HugePages_Free:      303
HugePages_Rsvd:      303
HugePages_Surp:      512
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0

似乎 QEMU 实现的有问题

QEMU 中的 backtrace :

#0  qemu_prealloc_mem (fd=22, area=area@entry=0x7ffda3a00000 "", sz=sz@entry=1073741824, max_threads=max_threads@entry=8, tc=tc@entry=0x0, errp=errp@en
try=0x7fffffffac80) at ../util/oslib-posix.c:508
#1  0x00005555559ef30b in host_memory_backend_memory_complete (uc=<optimized out>, errp=0x7fffffffacd0) at ../backends/hostmem.c:387
#2  0x0000555555c1e9c3 in user_creatable_complete (uc=0x5555567ef800, errp=errp@entry=0x7fffffffad18) at ../qom/object_interfaces.c:28
#3  0x0000555555c1ec7f in user_creatable_add_type (type=<optimized out>, id=id@entry=0x55555664a620 "m2", qdict=qdict@entry=0x555556647f10, v=v@entry=0
x5555568bd850, errp=0x7fffffffad20, errp@entry=0x5555565b9bb0 <error_fatal>) at ../qom/object_interfaces.c:125
#4  0x0000555555c1eec6 in user_creatable_add_qapi (options=<optimized out>, errp=errp@entry=0x5555565b9bb0 <error_fatal>) at ../qom/object_interfaces.c
:157
#5  0x00005555559e5c60 in object_option_foreach_add (type_opt_predicate=type_opt_predicate@entry=0x5555559e63f0 <object_create_late>) at ../softmmu/vl.
c:1714
#6  0x00005555559ead04 in qemu_create_late_backends () at ../softmmu/vl.c:1935
#7  qemu_init (argc=<optimized out>, argv=<optimized out>) at ../softmmu/vl.c:3581
#8  0x00005555558301e9 in main (argc=<optimized out>, argv=<optimized out>) at ../softmmu/main.c:47

似乎最近有些 update:

History:        #0
Commit:         a384bfa32ed8d616d766cb33360011157ae2f5c7
Author:         David Hildenbrand <david@redhat.com>
Committer:      Michael S. Tsirkin <mst@redhat.com>
Author Date:    Fri 17 Dec 2021 09:46:05 PM CST
Committer Date: Fri 07 Jan 2022 06:19:55 PM CST

util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()

Let's sense support and use it for preallocation. MADV_POPULATE_WRITE
does not require a SIGBUS handler, doesn't actually touch page content,
and avoids context switches; it is, therefore, faster and easier to handle
than our current approach.

While MADV_POPULATE_WRITE is, in general, faster than manual
prefaulting, and especially faster with 4k pages, there is still value in
prefaulting using multiple threads to speed up preallocation.

More details on MADV_POPULATE_WRITE can be found in the Linux commits
4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault
page tables") and eb2faa513c24 ("mm/madvise: report SIGBUS as -EFAULT for
MADV_POPULATE_(READ|WRITE)"), and in the man page proposal [1].

This resolves the TODO in do_touch_pages().

In the future, we might want to look into using fallocate(), eventually
combined with MADV_POPULATE_READ, when dealing with shared file/fd
mappings and not caring about memory bindings.

[1] https://lkml.kernel.org/r/20210816081922.5155-1-david@redhat.com

 Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
 Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
 Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
 Signed-off-by: David Hildenbrand <david@redhat.com>
 Message-Id: <20211217134611.31172-3-david@redhat.com>
 Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

似乎也不是多线程的问题。

并不是 QEMU 的问题

似乎开始的

@[
    enqueue_huge_page+1
    free_huge_page+548
    release_pages+491
    __pagevec_release+27
    remove_inode_hugepages+776
    hugetlbfs_fallocate+1152
    vfs_fallocate+328
    __x64_sys_fallocate+60
    do_syscall_64+56
    entry_SYSCALL_64_after_hwframe+99
]: 310
@[
    enqueue_huge_page+1
    free_huge_page+548
    release_pages+491
    __pagevec_release+27
    remove_inode_hugepages+776
    hugetlbfs_evict_inode+26
    evict+204
    __dentry_kill+223
    dput+324
    __fput+221
    task_work_run+89
    do_exit+805
    do_group_exit+45
    get_signal+2520
    arch_do_signal_or_restart+54
    exit_to_user_mode_prepare+267
    syscall_exit_to_user_mode+23
    do_syscall_64+72
    entry_SYSCALL_64_after_hwframe+99
]: 319

谁在调用 hugetlbfs_fallocate 啊?

#0  0x00007ffff6b03300 in fallocate64 () from /nix/store/4nlgxhb09sdr51nc9hdm8az5b08vzkgx-glibc-2.35-163/lib/libc.so.6
#1  0x0000555555b8c732 in ram_block_discard_range (rb=rb@entry=0x555556646220, start=486539264, length=length@entry=4194304) at ../softmmu/physmem.c:36
17
#2  0x0000555555b5a1f5 in virtio_balloon_handle_report (vdev=0x555557868970, vq=0x7fffe3eb1270) at ../hw/virtio/virtio-balloon.c:380
#3  0x0000555555b4410c in virtio_queue_notify (vdev=0x555557868970, n=<optimized out>) at ../hw/virtio/virtio.c:2867
#4  0x0000555555b79de0 in memory_region_write_accessor (mr=mr@entry=0x555557861440, addr=16, value=value@entry=0x7fffa3dfe518, size=size@entry=2, shift
=<optimized out>, mask=mask@entry=65535, attrs=...) at ../softmmu/memory.c:493
#5  0x0000555555b775c6 in access_with_adjusted_size (addr=addr@entry=16, value=value@entry=0x7fffa3dfe518, size=size@entry=2, access_size_min=<optimize
d out>, access_size_max=<optimized out>, access_fn=0x555555b79d60 <memory_region_write_accessor>, mr=0x555557861440, attrs=...) at ../softmmu/memory.c:
555
#6  0x0000555555b7b88a in memory_region_dispatch_write (mr=mr@entry=0x555557861440, addr=16, data=<optimized out>, op=<optimized out>, attrs=attrs@entr
y=...) at ../softmmu/memory.c:1522
#7  0x0000555555b82bc0 in flatview_write_continue (fv=fv@entry=0x7fff947f4490, addr=addr@entry=4263555088, attrs=..., attrs@entry=..., ptr=ptr@entry=0x
7ffff53ef028, len=len@entry=2, addr1=<optimized out>, l=<optimized out>, mr=0x555557861440) at /home/martins3/core/qemu/include/qemu/host-utils.h:166
#8  0x0000555555b82e80 in flatview_write (fv=0x7fff947f4490, addr=addr@entry=4263555088, attrs=attrs@entry=..., buf=buf@entry=0x7ffff53ef028, len=len@e
ntry=2) at ../softmmu/physmem.c:2867
#9  0x0000555555b865e9 in address_space_write (len=2, buf=0x7ffff53ef028, attrs=..., addr=4263555088, as=0x55555659cc80 <address_space_memory>) at ../s
oftmmu/physmem.c:2963
#10 address_space_rw (as=0x55555659cc80 <address_space_memory>, addr=4263555088, attrs=attrs@entry=..., buf=buf@entry=0x7ffff53ef028, len=2, is_write=<
optimized out>) at ../softmmu/physmem.c:2973
#11 0x0000555555c08cae in kvm_cpu_exec (cpu=cpu@entry=0x5555568e1b40) at ../accel/kvm/kvm-all.c:2900
#12 0x0000555555c0a19d in kvm_vcpu_thread_fn (arg=arg@entry=0x5555568e1b40) at ../accel/kvm/kvm-accel-ops.c:51
#13 0x0000555555d78fe9 in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:505
#14 0x00007ffff6a88e86 in start_thread () from /nix/store/4nlgxhb09sdr51nc9hdm8az5b08vzkgx-glibc-2.35-163/lib/libc.so.6
#15 0x00007ffff6b0fc60 in clone3 () from /nix/store/4nlgxhb09sdr51nc9hdm8az5b08vzkgx-glibc-2.35-163/lib/libc.so.6

内核中处理 reservation 的过程

#0  region_add (resv=resv@entry=0xffff8883063e73c0, f=0, t=1, in_regions_needed=in_regions_needed@entry=1, h=h@entry=0x0 <fixed_percpu_data>, h_cg=h_cg@entry=0x0
<fixed_percpu_data>) at mm/hugetlb.c:531
#1  0xffffffff812ef70c in __vma_reservation_common (h=<optimized out>, vma=0xffff88830631de40, addr=<optimized out>, mode=VMA_COMMIT_RESV) at mm/hugetlb.c:2532
#2  0xffffffff812f210e in vma_commit_reservation (addr=140508781346816, vma=0xffff88830631de40, h=0xffffffff834abe20 <hstates>) at mm/hugetlb.c:2597
#3  alloc_huge_page (vma=vma@entry=0xffff88830631de40, addr=addr@entry=140508781346816, avoid_reserve=avoid_reserve@entry=0) at mm/hugetlb.c:2952
#4  0xffffffff812f5a25 in hugetlb_no_page (flags=597, old_pte=..., ptep=0xffff88830085a958, address=140508781346816, idx=0, mapping=0xffff888302fbc410, vma=0xffff
88830631de40, mm=0xffff888300246c00) at mm/hugetlb.c:5545
#5  hugetlb_fault (mm=0xffff888300246c00, vma=vma@entry=0xffff88830631de40, address=address@entry=140508781346816, flags=flags@entry=597) at mm/hugetlb.c:5763
#6  0xffffffff812bdf9b in handle_mm_fault (vma=0xffff88830631de40, address=address@entry=140508781346816, flags=flags@entry=597, regs=regs@entry=0xffffc900005c7f5
8) at mm/memory.c:5149
#7  0xffffffff810f29a3 in do_user_addr_fault (regs=regs@entry=0xffffc900005c7f58, error_code=error_code@entry=6, address=address@entry=140508781346816) at arch/x8
6/mm/fault.c:1397
#8  0xffffffff81ead672 in handle_page_fault (address=140508781346816, error_code=6, regs=0xffffc900005c7f58) at arch/x86/mm/fault.c:1488

vma_resv_map 到底是做啥用的

#0  0xffffffff812ef611 in vma_resv_map (vma=<optimized out>) at mm/hugetlb.c:974
#1  __vma_reservation_common (h=0xffffffff834abe20 <hstates>, vma=0xffff88830631de40, addr=140508781346816, mode=VMA_NEEDS_RESV) at mm/hugetlb.c:2517
#2  0xffffffff812f623a in vma_needs_reservation (addr=140508781346816, vma=0xffff88830631de40, h=0xffffffff834abe20 <hstates>) at mm/hugetlb.c:2591
#3  hugetlb_no_page (flags=597, old_pte=..., ptep=0xffff88830085a958, address=140508781346816, idx=<optimized out>, mapping=0xffff888302fbc410, vma=0xffff88830631
de40, mm=0xffff888300246c00) at mm/hugetlb.c:5617
#4  hugetlb_fault (mm=0xffff888300246c00, vma=vma@entry=0xffff88830631de40, address=address@entry=140508781346816, flags=flags@entry=597) at mm/hugetlb.c:5763
#5  0xffffffff812bdf9b in handle_mm_fault (vma=0xffff88830631de40, address=address@entry=140508781346816, flags=flags@entry=597, regs=regs@entry=0xffffc900005c7f5
8) at mm/memory.c:5149
#6  0xffffffff810f29a3 in do_user_addr_fault (regs=regs@entry=0xffffc900005c7f58, error_code=error_code@entry=6, address=address@entry=140508781346816) at arch/x8
6/mm/fault.c:1397
#7  0xffffffff81ead672 in handle_page_fault (address=140508781346816, error_code=6, regs=0xffffc900005c7f58) at arch/x86/mm/fault.c:1488
#8  exc_page_fault (regs=0xffffc900005c7f58, error_code=6) at arch/x86/mm/fault.c:1544
#9  0xffffffff82000b62 in asm_exc_page_fault () at ./arch/x86/include/asm/idtentry.h:570
#10 0x0000000000000000 in ?? ()
#0  0xffffffff812ef611 in vma_resv_map (vma=<optimized out>) at mm/hugetlb.c:974
#1  __vma_reservation_common (h=0xffffffff834abe20 <hstates>, vma=0xffff88830631de40, addr=140508781346816, mode=VMA_END_RESV) at mm/hugetlb.c:2517
#2  0xffffffff812f6252 in vma_end_reservation (addr=140508781346816, vma=0xffff88830631de40, h=0xffffffff834abe20 <hstates>) at mm/hugetlb.c:2603
#3  hugetlb_no_page (flags=597, old_pte=..., ptep=0xffff88830085a958, address=140508781346816, idx=<optimized out>, mapping=0xffff888302fbc410, vma=0xffff88830631
de40, mm=0xffff888300246c00) at mm/hugetlb.c:5622
#4  hugetlb_fault (mm=0xffff888300246c00, vma=vma@entry=0xffff88830631de40, address=address@entry=140508781346816, flags=flags@entry=597) at mm/hugetlb.c:5763
#5  0xffffffff812bdf9b in handle_mm_fault (vma=0xffff88830631de40, address=address@entry=140508781346816, flags=flags@entry=597, regs=regs@entry=0xffffc900005c7f5
8) at mm/memory.c:5149
#6  0xffffffff810f29a3 in do_user_addr_fault (regs=regs@entry=0xffffc900005c7f58, error_code=error_code@entry=6, address=address@entry=140508781346816) at arch/x8
6/mm/fault.c:1397
#7  0xffffffff81ead672 in handle_page_fault (address=140508781346816, error_code=6, regs=0xffffc900005c7f58) at arch/x86/mm/fault.c:1488
#8  exc_page_fault (regs=0xffffc900005c7f58, error_code=6) at arch/x86/mm/fault.c:1544
#9  0xffffffff82000b62 in asm_exc_page_fault () at ./arch/x86/include/asm/idtentry.h:570
#10 0x0000000000000000 in ?? ()
#0  vma_resv_map (vma=0xffff88830631de40) at mm/hugetlb.c:974
#1  hugetlb_vm_op_close (vma=0xffff88830631de40) at mm/hugetlb.c:4589
#2  0xffffffff812c58c9 in remove_vma (vma=vma@entry=0xffff88830631de40) at mm/mmap.c:143
#3  0xffffffff812c8363 in exit_mmap (mm=mm@entry=0xffff888300246c00) at mm/mmap.c:3121
#4  0xffffffff810ff6ed in __mmput (mm=0xffff888300246c00) at kernel/fork.c:1187
#5  mmput (mm=mm@entry=0xffff888300246c00) at kernel/fork.c:1208
#6  0xffffffff811085db in exit_mm () at kernel/exit.c:510
#7  do_exit (code=code@entry=0) at kernel/exit.c:782
#8  0xffffffff81108e58 in do_group_exit (exit_code=0) at kernel/exit.c:925
#9  0xffffffff81108ecf in __do_sys_exit_group (error_code=<optimized out>) at kernel/exit.c:936
#10 __se_sys_exit_group (error_code=<optimized out>) at kernel/exit.c:934
#11 __x64_sys_exit_group (regs=<optimized out>) at kernel/exit.c:934
#12 0xffffffff81ea93c8 in do_syscall_x64 (nr=<optimized out>, regs=0xffffc900005c7f58) at arch/x86/entry/common.c:50
#13 do_syscall_64 (regs=0xffffc900005c7f58, nr=<optimized out>) at arch/x86/entry/common.c:80
#14 0xffffffff8200009b in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120
#15 0x0000000000000000 in ?? ()

reservation 应该是创建的时候就存在的,但是为什么要设计出来 commit 之类的操作

alloc_huge_page 中间干了什么

#0  __vma_reservation_common (h=0xffffffff834abe20 <hstates>, vma=0xffff88822998a180, addr=140245337112576, mode=VMA_COMMIT_RESV) at mm/hugetlb.c:2511
#1  0xffffffff812f230e in vma_commit_reservation (addr=140245337112576, vma=0xffff88822998a180, h=0xffffffff834abe20 <hstates>) at mm/hugetlb.c:2597
#2  alloc_huge_page (vma=vma@entry=0xffff88822998a180, addr=addr@entry=140245337112576, avoid_reserve=avoid_reserve@entry=0) at mm/hugetlb.c:2952
#3  0xffffffff812f5c25 in hugetlb_no_page (flags=597, old_pte=..., ptep=0xffff8882281d1a60, address=140245337112576, idx=0, mapping=0xffff888228304188, vma=0xffff88822998a180, mm=0xffff88823951d400) at mm/hugetlb.c:5545
#4  hugetlb_fault (mm=0xffff88823951d400, vma=vma@entry=0xffff88822998a180, address=address@entry=140245337112576, flags=flags@entry=597) at mm/hugetlb.c:5763
#5  0xffffffff812be19b in handle_mm_fault (vma=0xffff88822998a180, address=address@entry=140245337112576, flags=flags@entry=597, regs=regs@entry=0xffffc90001d4ff58) at mm/memory.c:5149
#6  0xffffffff810f2983 in do_user_addr_fault (regs=regs@entry=0xffffc90001d4ff58, error_code=error_code@entry=6, address=address@entry=140245337112576) at arch/x86/mm/fault.c:1397
#7  0xffffffff81ead672 in handle_page_fault (address=140245337112576, error_code=6, regs=0xffffc90001d4ff58) at arch/x86/mm/fault.c:1488
#8  exc_page_fault (regs=0xffffc90001d4ff58, error_code=6) at arch/x86/mm/fault.c:1544
#9  0xffffffff82000b62 in asm_exc_page_fault () at ./arch/x86/include/asm/idtentry.h:570
#10 0x0000000000000000 in ?? ()

的确是下面的场景:

[ ] region_chg && region_add

#0  region_chg (resv=resv@entry=0xffff888228c272a0, f=0, t=1, out_regions_needed=out_regions_needed@entry=0xffffc90001abfd60) at include/linux/spinlock.h:349
#1  0xffffffff812ef935 in __vma_reservation_common (h=<optimized out>, vma=0xffff8882257ec480, addr=<optimized out>, mode=VMA_NEEDS_RESV) at mm/hugetlb.c:2524
#2  0xffffffff812f1f0f in vma_needs_reservation (addr=<optimized out>, vma=0xffff8882257ec480, h=0xffffffff834abe20 <hstates>) at mm/hugetlb.c:2591
#3  alloc_huge_page (vma=vma@entry=0xffff8882257ec480, addr=addr@entry=139629120454656, avoid_reserve=avoid_reserve@entry=0) at mm/hugetlb.c:2875
#4  0xffffffff812f5c25 in hugetlb_no_page (flags=597, old_pte=..., ptep=0xffff88822a000c08, address=139629120454656, idx=0, mapping=0xffff888228c48410, vma=0xffff8882257ec480, mm=0xffff8881226ee800) at mm/hugetlb.c:5545
#5  hugetlb_fault (mm=0xffff8881226ee800, vma=vma@entry=0xffff8882257ec480, address=address@entry=139629120454656, flags=flags@entry=597) at mm/hugetlb.c:5763
#6  0xffffffff812be19b in handle_mm_fault (vma=0xffff8882257ec480, address=address@entry=139629120454656, flags=flags@entry=597, regs=regs@entry=0xffffc90001abff58) at mm/memory.c:5149
#7  0xffffffff810f2983 in do_user_addr_fault (regs=regs@entry=0xffffc90001abff58, error_code=error_code@entry=6, address=address@entry=139629120454656) at arch/x86/mm/fault.c:1397
#8  0xffffffff81ead672 in handle_page_fault (address=139629120454656, error_code=6, regs=0xffffc90001abff58) at arch/x86/mm/fault.c:1488
#9  exc_page_fault (regs=0xffffc90001abff58, error_code=6) at arch/x86/mm/fault.c:1544
#10 0xffffffff82000b62 in asm_exc_page_fault () at ./arch/x86/include/asm/idtentry.h:570
#11 0x0000000000000000 in ?? ()
#0  region_add (resv=resv@entry=0xffff888228c272a0, f=0, t=1, in_regions_needed=in_regions_needed@entry=1, h=h@entry=0x0 <fixed_percpu_data>, h_cg=h_cg@entry=0x0 <fixed_percpu_data>) at mm/hugetlb.c:531
#1  0xffffffff812ef90c in __vma_reservation_common (h=<optimized out>, vma=0xffff8882257ec480, addr=<optimized out>, mode=VMA_COMMIT_RESV) at mm/hugetlb.c:2532
#2  0xffffffff812f230e in vma_commit_reservation (addr=139629120454656, vma=0xffff8882257ec480, h=0xffffffff834abe20 <hstates>) at mm/hugetlb.c:2597
#3  alloc_huge_page (vma=vma@entry=0xffff8882257ec480, addr=addr@entry=139629120454656, avoid_reserve=avoid_reserve@entry=0) at mm/hugetlb.c:2952
#4  0xffffffff812f5c25 in hugetlb_no_page (flags=597, old_pte=..., ptep=0xffff88822a000c08, address=139629120454656, idx=0, mapping=0xffff888228c48410, vma=0xffff8882257ec480, mm=0xffff8881226ee800) at mm/hugetlb.c:5545
#5  hugetlb_fault (mm=0xffff8881226ee800, vma=vma@entry=0xffff8882257ec480, address=address@entry=139629120454656, flags=flags@entry=597) at mm/hugetlb.c:5763
#6  0xffffffff812be19b in handle_mm_fault (vma=0xffff8882257ec480, address=address@entry=139629120454656, flags=flags@entry=597, regs=regs@entry=0xffffc90001abff58) at mm/memory.c:5149
#7  0xffffffff810f2983 in do_user_addr_fault (regs=regs@entry=0xffffc90001abff58, error_code=error_code@entry=6, address=address@entry=139629120454656) at arch/x86/mm/fault.c:1397
#8  0xffffffff81ead672 in handle_page_fault (address=139629120454656, error_code=6, regs=0xffffc90001abff58) at arch/x86/mm/fault.c:1488
#9  exc_page_fault (regs=0xffffc90001abff58, error_code=6) at arch/x86/mm/fault.c:1544
#10 0xffffffff82000b62 in asm_exc_page_fault () at ./arch/x86/include/asm/idtentry.h:570
#11 0x0000000000000000 in ?? ()

分析 hugetlb_reserve_pages 中调用 region_chg 的位置,可以发现,如果是之所以创建出来各种 region 出来,是为了处理 共享的情况,如果一个 mmap 一个共享的,第一个进程使用一部分,这需要 charge,然后第二个函数使用一部分,那么如果和之前的重叠了。

在 hugetlb_no_page 中的这个位置,也有两个,调用这两个函数的位置:

    /*
     * If we are going to COW a private mapping later, we examine the
     * pending reservations for this page now. This will ensure that
     * any allocations necessary to record that reservation occur outside
     * the spinlock.
     */
    if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
        if (vma_needs_reservation(h, vma, haddr) < 0) {
            ret = VM_FAULT_OOM;
            goto backout_unlocked;
        }
        /* Just decrements count, does not deallocate */
        vma_end_reservation(h, vma, haddr);
    }

为什么设置出来

[ ] unmap_ref_private

mremap 带来的挑战

mremap 可以修改映射的虚拟地址和大小,其粒度是 page 的,也就是可以将一部分 vma 的一部分修改位置。

vma_resv_map

    if (vma->vm_flags & VM_MAYSHARE) {
#0  hugetlbfs_file_mmap (file=0xffff8882213f7f00, vma=0xffff88822579b900) at include/linux/fs.h:1333
#1  0xffffffff812c9a00 in call_mmap (vma=0xffff88822579b900, file=0xffff8882213f7f00) at include/linux/fs.h:2192
#2  mmap_region (file=file@entry=0xffff8882213f7f00, addr=addr@entry=139783313555456, len=len@entry=16777216, vm_flags=vm_flags@entry=115, pgoff=<optimized out>, uf=uf@entry=0xffffc90001c0feb0) at mm/mmap.c:1749
#3  0xffffffff812c9fbe in do_mmap (file=file@entry=0xffff8882213f7f00, addr=139783313555456, addr@entry=0, len=len@entry=16777216, prot=<optimized out>, prot@entry=3, flags=flags@entry=262178, pgoff=<optimized out>, pgoff@entry=0, populate=0xffffc90001c0fea8, uf=0xffffc90001c0feb0) at mm/mmap.c:1540
#4  0xffffffff8129ee35 in vm_mmap_pgoff (file=file@entry=0xffff8882213f7f00, addr=addr@entry=0, len=len@entry=16777216, prot=prot@entry=3, flag=flag@entry=262178, pgoff=pgoff@entry=0) at mm/util.c:552
#5  0xffffffff812c7333 in ksys_mmap_pgoff (addr=0, len=16777216, prot=3, flags=262178, fd=<optimized out>, pgoff=0) at mm/mmap.c:1586
#6  0xffffffff81ea93c8 in do_syscall_x64 (nr=<optimized out>, regs=0xffffc90001c0ff58) at arch/x86/entry/common.c:50
#7  do_syscall_64 (regs=0xffffc90001c0ff58, nr=<optimized out>) at arch/x86/entry/common.c:80
#8  0xffffffff8200009b in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120
#9  0x0000000000000003 in fixed_percpu_data ()
#10 0xffffffffffffffff in ?? ()
#11 0x0000000000000000 in ?? ()

vma_has_reserves : 可以理解这个吗

huge_add_to_page_cache

两个位置调用 huge_add_to_page_cache

  1. hugetlbfs_fallocate : 的确是对于每一个 hugepage 都是需要调用的
  2. hugetlb_no_page : 分配的时候,如果发现其所在的 VM_MAYSHARE 的,那么将会

在 fallocate 的时候,就会分配 page,但是没有建立 mapping 的,然后 hugetlbfs_file_mmap 之后,那么会出现

最后,当 page fault 的时候,hugetlb_no_page 中,首先在 page table 中查询。

到底是如何分配的

启动过程中,如何预留的

这是参数解析的时候

#0  hugetlb_hstate_alloc_pages (h=0xffffffff834abe20 <hstates>) at mm/hugetlb.c:3088
#1  0xffffffff832fbd53 in hugepages_setup (s=0xffff88833fff186a "8") at mm/hugetlb.c:4221
#2  0xffffffff832cf89b in obsolete_checksetup (line=0xffff88833fff1860 "hugepages=8") at init/main.c:219
#3  unknown_bootoption (param=0xffff88833fff1860 "hugepages=8", val=val@entry=0xffff88833fff186a "8", unused=unused@entry=0xffffffff827eb41c "Booting kernel", arg
=arg@entry=0x0 <fixed_percpu_data>) at init/main.c:539
#4  0xffffffff81127893 in parse_one (handle_unknown=0xffffffff832cf801 <unknown_bootoption>, arg=0x0 <fixed_percpu_data>, max_level=-1, min_level=-1, num_params=5
84, params=0xffffffff829aecb8 <__param_initcall_debug>, doing=0xffffffff827eb41c "Booting kernel", val=0xffff88833fff186a "8", param=0xffff88833fff1860 "hugepages
=8") at kernel/params.c:153
#5  parse_args (doing=doing@entry=0xffffffff827eb41c "Booting kernel", args=0xffff88833fff186b "", params=0xffffffff829aecb8 <__param_initcall_debug>, num=584, mi
n_level=min_level@entry=-1, max_level=max_level@entry=-1, arg=0x0 <fixed_percpu_data>, unknown=0xffffffff832cf801 <unknown_bootoption>) at kernel/params.c:188
#6  0xffffffff832cfe00 in start_kernel () at init/main.c:967
#7  0xffffffff81000145 in secondary_startup_64 () at arch/x86/kernel/head_64.S:358
#8  0x0000000000000000 in ?? ()

这是 initcalls 得到的

#0  hugetlb_hstate_alloc_pages (h=h@entry=0xffffffff834ad5e8 <hstates+6088>) at mm/hugetlb.c:3088
#1  0xffffffff832fc0d1 in hugetlb_init_hstates () at mm/hugetlb.c:3157
#2  hugetlb_init () at mm/hugetlb.c:4069
#3  0xffffffff81000e7c in do_one_initcall (fn=0xffffffff832fbf41 <hugetlb_init>) at init/main.c:1296
#4  0xffffffff832d0491 in do_initcall_level (command_line=0xffff888300804180 "root", level=4) at init/main.c:1369
#5  do_initcalls () at init/main.c:1385
#6  do_basic_setup () at init/main.c:1404
#7  kernel_init_freeable () at init/main.c:1611
#8  0xffffffff81eae271 in kernel_init (unused=<optimized out>) at init/main.c:1500
#9  0xffffffff81001a8f in ret_from_fork () at arch/x86/entry/entry_64.S:306
#10 0x0000000000000000 in ?? ()

hugetlb

  1. 为了实现简单,那么 hugetlb 减少处理什么东西 ?

https://www.ibm.com/developerworks/cn/linux/l-cn-hugetlb/ https://www.ibm.com/developerworks/cn/linux/1305_zhangli_hugepage/index.html

总结一下 :

  1. subpool, resv_map , enqueue 机制
  2. hugetlb_file_setup hugetlb_fault 和对外提供的关键接口
  3. 利用 sys 提供了很多接口

Huge pages can improve performance through reduced page faults (a single fault brings in a large chunk of memory at once) and by reducing the cost of virtual to physical address translation (fewer levels of page tables must be traversed to get to the physical address).

用户层 : https://lwn.net/Articles/375096/ 中间的使用首先理解清楚吧 !

https://github.com/libhugetlbfs/libhugetlbfs

其中包含有大量的测试 The library provides support for automatically backing text, data, heap and shared memory segments with huge pages. In addition, this package also provides a programming API and manual pages. The behaviour of the library is controlled by environment variables (as described in the libhugetlbfs.7 manual page) with a launcher utility hugectl that knows how to configure almost all of the variables. hugeadm, hugeedit and pagesize provide information about the system and provide support to system administration. tlbmiss_cost.sh automatically calculates the average cost of a TLB miss. cpupcstat and oprofile_start.sh provide help with monitoring the current behaviour of the system. Manual pages are available describing in further detail each utility.

  1. shmget() : SHM_HUGETLB
  2. hugetlbfs : 似乎用户共享的,同时可以用于实现
       #include <hugetlbfs.h>
       int hugetlbfs_unlinked_fd(void);
       int hugetlbfs_unlinked_fd_for_size(long page_size);
       // hugetlbfs_unlinked_fd, hugetlbfs_unlinked_fd_for_size - Obtain a file descriptor for a new unlinked file in hugetlbfs

One important common point between them all is how huge pages are faulted and when the huge pages are allocated. Further, there are important differences between shared and private mappings depending on the exact kernel version used. [^1]

重点处理的方面

  1. fault
  2. shared/private
  3. hugetlb 不处理 swap

和正常大小的 page 的比较

  1. hugetlb_fault

  2. include/asm-generic/hugetlb.h : 如果架构含有关于 page table 的不同处理, 那么就可以使用

HugeTLB Pages 的阅读结果 :

/proc/sys/vm/nr_hugepages indicates the current number of “persistent” huge pages in the kernel’s huge page pool. “Persistent” huge pages will be returned to the huge page pool when freed by a task. A user with root privileges can dynamically allocate more or free some persistent huge pages by increasing or decreasing the value of nr_hugepages.

Pages that are used as huge pages are reserved inside the kernel and cannot be used for other purposes. Huge pages cannot be swapped out under memory pressure.

Once a number of huge pages have been pre-allocated to the kernel huge page pool, a user with appropriate privilege can use either the mmap system call or shared memory system calls to use the huge pages.

TO BE CONTINUE

hugetlbfs

hugepage_subpool

hugepage_put_subpool 和 hugepage_new_subpool 是对应的:

#0  hugepage_new_subpool (h=0xffffffff834abe20 <hstates>, max_hpages=-1, min_hpages=2) at include/linux/slab.h:600
#1  0xffffffff81432d2b in hugetlbfs_fill_super (sb=0xffff888302e7e000, fc=<optimized out>) at fs/hugetlbfs/inode.c:1359
#2  0xffffffff813207c9 in vfs_get_super (fill_super=0xffffffff81432c90 <hugetlbfs_fill_super>, keying=vfs_get_independent_super, fc=0xffff8883042c29c0) at fs/super.c:1168
#3  get_tree_nodev (fc=0xffff8883042c29c0, fill_super=0xffffffff81432c90 <hugetlbfs_fill_super>) at fs/super.c:1198
#4  0xffffffff8131ee8d in vfs_get_tree (fc=0xffffffff834abe20 <hstates>, fc@entry=0xffff8883042c29c0) at fs/super.c:1530
#5  0xffffffff813482d3 in do_new_mount (data=0xffff88830a771000, name=0xffff888300d273d0 "none", mnt_flags=32, sb_flags=<optimized out>, fstype=0x20 <fixed_percpu_data+32> <error: Cannot acc
ess memory at address 0x20>, path=0xffffc90000cb7ef8) at fs/namespace.c:3040
#6  path_mount (dev_name=dev_name@entry=0xffff888300d273d0 "none", path=path@entry=0xffffc90000cb7ef8, type_page=type_page@entry=0xffff8883086e9f10 "hugetlbfs", flags=<optimized out>, flags@
entry=3236757504, data_page=data_page@entry=0xffff88830a771000) at fs/namespace.c:3370
#7  0xffffffff81348b72 in do_mount (data_page=0xffff88830a771000, flags=3236757504, type_page=0xffff8883086e9f10 "hugetlbfs", dir_name=0x555d2fa642f0 "/mnt/huge", dev_name=0xffff888300d273d0
 "none") at fs/namespace.c:3383
#8  __do_sys_mount (data=<optimized out>, flags=3236757504, type=<optimized out>, dir_name=0x555d2fa642f0 "/mnt/huge", dev_name=<optimized out>) at fs/namespace.c:3591
#9  __se_sys_mount (data=<optimized out>, flags=3236757504, type=<optimized out>, dir_name=93858719744752, dev_name=<optimized out>) at fs/namespace.c:3568
#10 __x64_sys_mount (regs=<optimized out>) at fs/namespace.c:3568
#11 0xffffffff81ea93c8 in do_syscall_x64 (nr=<optimized out>, regs=0xffffc90000cb7f58) at arch/x86/entry/common.c:50

文件的创建

#0  hugetlbfs_create (mnt_userns=0xffffffff82a618e0 <init_user_ns>, dir=0xffff88830c764010, dentry=0xffff8883081b39c0, mode=33188, excl=false) at fs/hugetlbfs/inode.c:931
#1  0xffffffff8132eb98 in lookup_open (op=0xffffc9000237fedc, op=0xffffc9000237fedc, got_write=true, file=0xffff88830591ff00, nd=0xffffc9000237fdc0) at fs/namei.c :3413
#2  open_last_lookups (op=0xffffc9000237fedc, file=0xffff88830591ff00, nd=0xffffc9000237fdc0) at fs/namei.c:3481
#3  path_openat (nd=nd@entry=0xffffc9000237fdc0, op=op@entry=0xffffc9000237fedc, flags=flags@entry=65) at fs/namei.c:3688
#4  0xffffffff8132fd0d in do_filp_open (dfd=dfd@entry=-100, pathname=pathname@entry=0xffff8883020a5000, op=op@entry=0xffffc9000237fedc) at fs/namei.c:3718
#5  0xffffffff813198d5 in do_sys_openat2 (dfd=dfd@entry=-100, filename=<optimized out>, how=how@entry=0xffffc9000237ff18) at fs/open.c:1311
#6  0xffffffff81319cb0 in do_sys_open (mode=<optimized out>, flags=<optimized out>, filename=<optimized out>, dfd=-100) at fs/open.c:1327
#7  __do_sys_open (mode=<optimized out>, flags=<optimized out>, filename=<optimized out>) at fs/open.c:1335
#8  __se_sys_open (mode=<optimized out>, flags=<optimized out>, filename=<optimized out>) at fs/open.c:1331
#9  __x64_sys_open (regs=<optimized out>) at fs/open.c:1331
#10 0xffffffff81ea93c8 in do_syscall_x64 (nr=<optimized out>, regs=0xffffc9000237ff58) at arch/x86/entry/common.c:50
#11 do_syscall_64 (regs=0xffffc9000237ff58, nr=<optimized out>) at arch/x86/entry/common.c:80
#12 0xffffffff8200009b in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120

使用 echo aaa > a 会失败的

#0  hugetlbfs_read_iter (iocb=0xffffc90002e67e98, to=0xffffc90002e67e70) at fs/hugetlbfs/inode.c:290
#1  0xffffffff8131ce0c in call_read_iter (iter=0xffffc90002e67e70, kio=0xffffc90002e67e98, file=0xffff88830c7d5300) at include/linux/fs.h:2181
#2  new_sync_read (ppos=0xffffc90002e67f08, len=1073741824, buf=0x7f8f4d05f000 <error: Cannot access memory at address 0x7f8f4d05f000>, filp=0xffff88830c7d5300) at fs/read_write.c:389
#3  vfs_read (file=file@entry=0xffff88830c7d5300, buf=buf@entry=0x7f8f4d05f000 <error: Cannot access memory at address 0x7f8f4d05f000>, count=count@entry=1073741824, pos=pos@entry=0xffffc90002e67f08) at fs/read_write.c:470
#4  0xffffffff8131d5da in ksys_read (fd=<optimized out>, buf=0x7f8f4d05f000 <error: Cannot access memory at address 0x7f8f4d05f000>, count=1073741824) at fs/read_write.c:607
#5  0xffffffff81ea93c8 in do_syscall_x64 (nr=<optimized out>, regs=0xffffc90002e67f58) at arch/x86/entry/common.c:50
#6  do_syscall_64 (regs=0xffffc90002e67f58, nr=<optimized out>) at arch/x86/entry/common.c:80
#7  0xffffffff8200009b in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120
#8  0x0000000000000fff in ?? ()
#9  0x0000000000000000 in ?? ()

resv_map 面对文件的伸缩,如何处理

在 remap.c 中的确是处理过

zero page

hugetlb_reserve_pages 中有一个注释

    /*
     * Shared mappings base their reservation on the number of pages that
     * are already allocated on behalf of the file. Private mappings need
     * to reserve the full area even if read-only as mprotect() may be
     * called to make the mapping read-write. Assume !vma is a shm mapping
     */

所以,无论是 fallcoate 还是 anon 映射,都是有 reserve 的。

syscontrol

int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
int hugetlb_treat_movable_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);

#ifdef CONFIG_NUMA
int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int,
                    void __user *, size_t *, loff_t *);
#endif

unsigned long hugetlb_total_pages(void);

misc

int hugetlb_reserve_pages(struct inode *inode, long from, long to,
                        struct vm_area_struct *vma,
                        vm_flags_t vm_flags);
long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
                        long freed);
bool isolate_huge_page(struct page *page, struct list_head *list);
void putback_active_hugepage(struct page *page);
void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
void free_huge_page(struct page *page);
void hugetlb_fix_reserve_counts(struct inode *inode);

从用户层角度分析

static inline bool is_file_hugepages(struct file *file)
{
    if (file->f_op == &hugetlbfs_file_operations) // 首先阅读一下 hugetlbfs 的作用吧
        return true;

    return is_file_shm_hugepages(file);
}

hugetlbfs_setattr 的操作比想象的复杂

和 vm 打交道的关键的外部接口

#0  __unmap_hugepage_range (tlb=tlb@entry=0xffffc90001d27df8, vma=vma@entry=0xffff888228c85480, start=start@entry=139726484930560, end=end@entry=139726501707776, ref_page=ref_page@entry=0x0 <fixed_percpu_data>, zap_flags=<optimized out>) at mm/hugetlb.c:4997
#1  0xffffffff812f6c59 in __unmap_hugepage_range_final (tlb=tlb@entry=0xffffc90001d27df8, vma=vma@entry=0xffff888228c85480, start=start@entry=139726484930560, end=end@entry=139726501707776, ref_page=ref_page@entry=0x0 <fixed_percpu_data>, zap_flags=zap_flags@entry=1) at mm/hugetlb.c:5140
#2  0xffffffff812bad8a in unmap_single_vma (tlb=tlb@entry=0xffffc90001d27df8, vma=vma@entry=0xffff888228c85480, start_addr=start_addr@entry=0, end_addr=end_addr@entry=18446744073709551615, details=details@entry=0xffffc90001d27d88) at mm/memory.c:1689
#3  0xffffffff812bb21d in unmap_vmas (tlb=tlb@entry=0xffffc90001d27df8, vma=0xffff888228c85480, vma@entry=0xffff888228c859c0, start_addr=start_addr@entry=0, end_addr=end_addr@entry=18446744073709551615) at mm/memory.c:1731
#4  0xffffffff812c852f in exit_mmap (mm=mm@entry=0xffff8881001ad800) at mm/mmap.c:3113
#5  0xffffffff810ff6cd in __mmput (mm=0xffff8881001ad800) at kernel/fork.c:1187
#6  mmput (mm=mm@entry=0xffff8881001ad800) at kernel/fork.c:1208
#7  0xffffffff811085bb in exit_mm () at kernel/exit.c:510
#8  do_exit (code=code@entry=0) at kernel/exit.c:782
#9  0xffffffff81108e38 in do_group_exit (exit_code=0) at kernel/exit.c:925
#10 0xffffffff81108eaf in __do_sys_exit_group (error_code=<optimized out>) at kernel/exit.c:936
#11 __se_sys_exit_group (error_code=<optimized out>) at kernel/exit.c:934
#12 __x64_sys_exit_group (regs=<optimized out>) at kernel/exit.c:934
#13 0xffffffff81ea93c8 in do_syscall_x64 (nr=<optimized out>, regs=0xffffc90001d27f58) at arch/x86/entry/common.c:50
#14 do_syscall_64 (regs=0xffffc90001d27f58, nr=<optimized out>) at arch/x86/entry/common.c:80
#15 0xffffffff8200009b in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120
#16 0x0000000000000000 in ?? ()

应该是在 mem_cgroup_uncharge 中处理的,当释放 page 的时候,是可以知道该 page 被那个 memcg 管理,从而知道进行释放。

#0  mem_cgroup_uncharge (folio=0xffffea0004897740) at include/linux/memcontrol.h:706
#1  __folio_put_small (folio=0xffffea0004897740) at mm/swap.c:104
#2  __folio_put (folio=0xffffea0004897740) at mm/swap.c:128
#3  0xffffffff812e648b in folio_put (folio=<optimized out>) at include/linux/mm.h:1125
#4  0xffffffff812ca7be in __tlb_remove_table (table=<optimized out>) at arch/x86/include/asm/tlb.h:34
#5  __tlb_remove_table_free (batch=0xffff88812251d000) at mm/mmu_gather.c:114
#6  tlb_remove_table_rcu (head=0xffff88812251d000) at mm/mmu_gather.c:169
#7  0xffffffff811812b4 in rcu_do_batch (rdp=0xffff88813bc2bd00) at kernel/rcu/tree.c:2245
#8  rcu_core () at kernel/rcu/tree.c:2505
#9  0xffffffff822000e1 in __do_softirq () at kernel/softirq.c:571
#10 0xffffffff81109dda in invoke_softirq () at kernel/softirq.c:445
#11 __irq_exit_rcu () at kernel/softirq.c:650
#12 0xffffffff81ead0b2 in sysvec_apic_timer_interrupt (regs=0xffffc90001d27d98) at arch/x86/kernel/apic/apic.c:1106

但是,由于 hugetlb 的 free 过程是自营的,也就是 __unmap_hugepage_range,所以需要 vm_operations_struct::close

细节

是否支持 1G 大页

通过这个可以检查是否支持 1GB 大页:

[ ] 为什么不支持 1G 大页

但是不知道为什么,现在的 alpine.sh 拉起来的虚拟机中并不能支持 1gb 大页

hugepage anon mmap 的时候,大页的大小是不可以混合使用的

虽然 mmap 的时候,可以创建多个 flags 的:

#define MAP_HUGE_2MB    (21 << MAP_HUGE_SHIFT)
#define MAP_HUGE_1GB    (30 << MAP_HUGE_SHIFT)

动态申请

// 继续分析其调用者
unsigned long reclaim_clean_pages_from_list(struct zone *zone,
                        struct list_head *page_list)

测试一下性能

测试一下,这集中方法是不是可以帮助分配更加多的大页

看错了,构建大页的过程中,会导致 cache 的出现的

但是也不是将所有的内存都赶到 swap 中,现在不知道是什么原因

[ ] 大页的 migration 如何

深入理解一下 reserve 机制吧

理解 shmem hugepage 的基本工作流程

其实流程和 mmap 几乎相同的:

申请:

#0  dequeue_huge_page_node_exact (nid=1, h=0xffffffff83523428 <hstates+6088>) at ./arch/x86/include/asm/current.h:15
#1  dequeue_huge_page_nodemask (h=h@entry=0xffffffff83523428 <hstates+6088>, gfp_mask=1051850, nid=<optimized out>, nmask=0x0 <fixed_percpu_data>) at m
m/hugetlb.c:1193
#2  0xffffffff81316e94 in dequeue_huge_page_vma (chg=<optimized out>, avoid_reserve=<optimized out>, address=140111190687744, vma=0xffff888102ee2688, h
=0xffffffff83523428 <hstates+6088>) at mm/hugetlb.c:1242
#3  alloc_huge_page (vma=vma@entry=0xffff888102ee2688, addr=addr@entry=140111190687744, avoid_reserve=avoid_reserve@entry=0) at mm/hugetlb.c:2931
#4  0xffffffff8131b6bc in hugetlb_no_page (flags=597, old_pte=..., ptep=0xffff88821f029b70, address=140111190687744, idx=0, mapping=0xffff888101318410,
 vma=0xffff888102ee2688, mm=0xffff8881018ef2c0) at mm/hugetlb.c:5649
#5  hugetlb_fault (mm=0xffff8881018ef2c0, vma=vma@entry=0xffff888102ee2688, address=address@entry=140111190687744, flags=flags@entry=597) at mm/hugetlb
.c:5881
#6  0xffffffff812dde7b in handle_mm_fault (vma=0xffff888102ee2688, address=address@entry=140111190687744, flags=flags@entry=597, regs=regs@entry=0xffff
c90001e67f58) at mm/memory.c:5215
#7  0xffffffff810f3ca3 in do_user_addr_fault (regs=regs@entry=0xffffc90001e67f58, error_code=error_code@entry=6, address=address@entry=140111190687744)
 at arch/x86/mm/fault.c:1428
#8  0xffffffff81fb2342 in handle_page_fault (address=140111190687744, error_code=6, regs=0xffffc90001e67f58) at arch/x86/mm/fault.c:1519
#9  exc_page_fault (regs=0xffffc90001e67f58, error_code=6) at arch/x86/mm/fault.c:1575
#10 0xffffffff82000b62 in asm_exc_page_fault () at ./arch/x86/include/asm/idtentry.h:570

释放:

#0  enqueue_huge_page (h=h@entry=0xffffffff83523428 <hstates+6088>, page=page@entry=0xffffea00040f8000) at mm/hugetlb.c:1132
#1  0xffffffff8131506f in free_huge_page (page=0xffffea00040f8000) at mm/hugetlb.c:1761
#2  0xffffffff812a767d in __folio_put_large (folio=<optimized out>) at mm/swap.c:118
#3  release_pages (pages=pages@entry=0xffffc90001e67db8, nr=<optimized out>) at mm/swap.c:1021
#4  0xffffffff812a8406 in __pagevec_release (pvec=pvec@entry=0xffffc90001e67db0) at mm/swap.c:1075
#5  0xffffffff81486ed8 in pagevec_release (pvec=0xffffc90001e67db0) at ./include/linux/pagevec.h:71
#6  folio_batch_release (fbatch=0xffffc90001e67db0) at ./include/linux/pagevec.h:135
#7  remove_inode_hugepages (inode=inode@entry=0xffff888101318298, lstart=lstart@entry=0, lend=lend@entry=9223372036854775807) at fs/hugetlbfs/inode.c:6
48
#8  0xffffffff81487305 in hugetlbfs_evict_inode (inode=0xffff888101318298) at fs/hugetlbfs/inode.c:660
#9  0xffffffff8138dedb in evict (inode=0xffff888101318298) at fs/inode.c:664
#10 0xffffffff81389cb1 in __dentry_kill (dentry=0xffff888008afecc0) at fs/dcache.c:607
#11 0xffffffff8138a888 in dentry_kill (dentry=<optimized out>) at fs/dcache.c:733
#12 0xffffffff8136e748 in __fput (file=0xffff888100687b00) at fs/file_table.c:328
#13 0xffffffff81130e11 in task_work_run () at kernel/task_work.c:179
#14 0xffffffff8119bf62 in resume_user_mode_work (regs=0xffffc90001e67f58) at ./include/linux/resume_user_mode.h:49
#15 exit_to_user_mode_loop (ti_work=<optimized out>, regs=<optimized out>) at kernel/entry/common.c:171
#16 exit_to_user_mode_prepare (regs=0xffffc90001e67f58) at kernel/entry/common.c:203
#17 0xffffffff81fb253d in __syscall_exit_to_user_mode_work (regs=regs@entry=0xffffc90001e67f58) at kernel/entry/common.c:285
#18 syscall_exit_to_user_mode (regs=regs@entry=0xffffc90001e67f58) at kernel/entry/common.c:296
#19 0xffffffff81fae028 in do_syscall_64 (regs=0xffffc90001e67f58, nr=<optimized out>) at arch/x86/entry/common.c:86
#20 0xffffffff8200009b in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120

reserve 的基本过程

do_mmap 中,这到底会成为什么的计算内容:

		/* hugetlb applies strict overcommit unless MAP_NORESERVE */
		if (file && is_file_hugepages(file))
			vm_flags |= VM_NORESERVE;

QEMU 无法申请大页,结果如何?

pool 和 subpool

通过 vma 必然找到其关联的文件,然后找到对应的 fs, 然后就是关联的 subpool

static inline struct hugepage_subpool *subpool_inode(struct inode *inode)
{
    return HUGETLBFS_SB(inode->i_sb)->spool;
}

static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
{
    return subpool_inode(file_inode(vma->vm_file));
}

从 hugetlbfs_fill_super 看,默认的 subpool 为 NULL,这导致 hugepage_subpool_get_pages 中检查必然 成功。

mmap 的时候,如果该文件是在 mount 点构建的,那么需要将:

#0  hugepage_subpool_get_pages (delta=1, spool=0x0 <fixed_percpu_data>) at mm/hugetlb.c:169
#1  hugetlb_reserve_pages (inode=inode@entry=0xffff888302fbc298, from=0, to=1, vma=vma@entry=0xffff888309e5b240, vm_flags=<optimized out>) at mm/hugetlb.c:6510
#2  0xffffffff81432f48 in hugetlbfs_file_mmap (file=0xffff888301e4cc00, vma=0xffff888309e5b240) at fs/hugetlbfs/inode.c:167
#3  0xffffffff812c9800 in call_mmap (vma=0xffff888309e5b240, file=0xffff888301e4cc00) at include/linux/fs.h:2192
#4  mmap_region (file=file@entry=0xffff888301e4cc00, addr=addr@entry=139922518310912, len=len@entry=1073741824, vm_flags=vm_flags@entry=115, pgoff=<optimized out>
, uf=uf@entry=0xffffc90000c43eb0) at mm/mmap.c:1749
#5  0xffffffff812c9dbe in do_mmap (file=file@entry=0xffff888301e4cc00, addr=139922518310912, addr@entry=0, len=len@entry=1073741824, prot=<optimized out>, prot@en
try=3, flags=flags@entry=262178, pgoff=<optimized out>, pgoff@entry=0, populate=0xffffc90000c43ea8, uf=0xffffc90000c43eb0) at mm/mmap.c:1540
#6  0xffffffff8129ec35 in vm_mmap_pgoff (file=file@entry=0xffff888301e4cc00, addr=addr@entry=0, len=len@entry=1073741824, prot=prot@entry=3, flag=flag@entry=26217
8, pgoff=pgoff@entry=0) at mm/util.c:552
#7  0xffffffff812c7133 in ksys_mmap_pgoff (addr=0, len=1073741824, prot=3, flags=262178, fd=<optimized out>, pgoff=0) at mm/mmap.c:1586
#8  0xffffffff81ea93c8 in do_syscall_x64 (nr=<optimized out>, regs=0xffffc90000c43f58) at arch/x86/entry/common.c:50
#9  do_syscall_64 (regs=0xffffc90000c43f58, nr=<optimized out>) at arch/x86/entry/common.c:80
#10 0xffffffff8200009b in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120
#11 0x0000000000000003 in fixed_percpu_data ()
#12 0xffffffffffffffff in ?? ()
#13 0x0000000000000000 in ?? ()

这个 backtrace 这个结果:

  void *ptr = mmap(NULL, 8 * TwoMega, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

但是,的确是有逻辑的,hugetlb 的 mapping 总是带有文件,所以,总是会掉用到 hugepage_subpool_get_pages 上。

各个字段的还是相当简单的:

struct hugepage_subpool {
	spinlock_t lock;
	long count;
	long max_hpages;	/* Maximum huge pages or -1 if no maximum. */
	long used_hpages;	/* Used count against maximum, includes */
				/* both allocated and reserved pages. */
	struct hstate *hstate;
	long min_hpages;	/* Minimum huge pages or -1 if no minimum. */
	long rsv_hpages;	/* Pages reserved against global pool to */
				/* satisfy minimum size. */
};

如果让内存 FALLOC_FL_PUNCH_HOLE 了,如果是用户态程序,将会 SIGBUS 的

如果映射的是普通文件,但是携带了 HUGETLB 的参数

mmap 直接失败

从的位置找找,具体是哪里失败?

TODO

有件事情无法理解,对于普通页的虚拟地址空间也是存在 reserver 的,为什么唯独大页的 reservation 单独的设计这么复杂?

subpool 让数值接口的终极离谱的现象

mkdir -p /mnt/huge
mount -t hugetlbfs -o min_size=2M none /mnt/huge
echo 0 | sudo tee /proc/sys/vm/nr_hugepages
HugePages_Total:       1
HugePages_Free:        1
HugePages_Rsvd:        1
HugePages_Surp:        1

[ ] 深入理解下文档 : 感觉文档也是有模糊和左右矛盾的

Documentation/admin-guide/mm/hugetlbpage.rst

/proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are requested by applications. Writing any non-zero value into this file

Caveat: Shrinking the persistent huge page pool via nr_hugepages such that it becomes less than the number of huge pages in use will convert the balance of the in-use huge pages to surplus huge pages. This will occur even if the number of surplus pages would exceed the overcommit value. As long as this condition holds–that is, until nr_hugepages+nr_overcommit_hugepages is increased sufficiently, or the surplus huge pages go out of use and are freed– no more surplus huge pages will be allowed to be allocated.

分析下 hugetalb 中持有的锁

  1. 例如访问 hugetlbfs_size_to_hpages 中访问了 hstate 中的数值,但是此时此刻,正在被修改,怎么办?

理解下 remove_pool_huge_page

/*
 * This routine has two main purposes:
 * 1) Decrement the reservation count (resv_huge_pages) by the value passed
 *    in unused_resv_pages.  This corresponds to the prior adjustments made
 *    to the associated reservation map.
 * 2) Free any unused surplus pages that may have been allocated to satisfy
 *    the reservation.  As many as unused_resv_pages may be freed.
 */
static void return_unused_surplus_pages(struct hstate *h,
					unsigned long unused_resv_pages)

为什么会存在需求,释放掉 surplus 的 reservation 啊?

因为 reservation 制造了大量的 surplus ,当释放 reservation 的时候,自动释放掉对应的 surplus

分析 hugetlb_acct_memory 的实现 - gather_surplus_pages : 如果想要分配 - return_unused_surplus_pages : 如果想要释放

其调用位置为:

看来这些位置都是 fs 或者 subpool ,以及 vma 增加的时候,需要进行 surplus 以及 reservation 的修改。

所以,其实这不是一个 bug 的。就是每一个都需要分担一部分。 因为这些都是 free 的。

这是什么鬼路径

#0  0xffffffff813de4a9 in remove_hugetlb_folio (adjust_surplus=<optimized out>, folio=<optimized out>, h=<optimized out>) at mm/hugetlb.c:1648
#1  free_huge_page (page=<optimized out>) at mm/hugetlb.c:1902
#2  0xffffffff835dff44 in gather_bootmem_prealloc () at mm/hugetlb.c:3207
#3  hugetlb_init () at mm/hugetlb.c:4262
#4  0xffffffff81001b7a in do_one_initcall (fn=0xffffffff835dfcd0 <hugetlb_init>) at init/main.c:1232
#5  0xffffffff8359b6ff in do_initcall_level (command_line=0xffff88810089d000 "root", level=4) at init/main.c:1294
#6  do_initcalls () at init/main.c:1310
#7  do_basic_setup () at init/main.c:1329
#8  kernel_init_freeable () at init/main.c:1546
#9  0xffffffff82173fea in kernel_init (unused=<optimized out>) at init/main.c:1437
#10 0xffffffff810e39a1 in ret_from_fork (prev=<optimized out>, regs=0xffffc900000a3f58, fn=0xffffffff82173fd0 <kernel_init>, fn_arg=0x0 <fixed_percpu_data>) at arch/x86/kernel/process.c:145
#11 0xffffffff8100259b in ret_from_fork_asm () at arch/x86/entry/entry_64.S:304

分析下当使用了大页的程序自动释放

映射了 anon 上

#0  0xffffffff813dda7d in remove_hugetlb_folio (adjust_surplus=true, folio=<optimized out>, h=0xffffffff83a46728 <hstates+6088>) at mm/hugetlb.c:1648
#1  remove_pool_huge_page (h=h@entry=0xffffffff83a46728 <hstates+6088>, nodes_allowed=0xffffffff82f5d9f8 <node_states+24>, acct_surplus=acct_surplus@entry=true) at mm/hugetlb.c:2255
#2  0xffffffff813ddb58 in return_unused_surplus_pages (h=h@entry=0xffffffff83a46728 <hstates+6088>, unused_resv_pages=unused_resv_pages@entry=100) at mm/hugetlb.c:2640
#3  0xffffffff813de850 in hugetlb_acct_memory (h=h@entry=0xffffffff83a46728 <hstates+6088>, delta=-100) at mm/hugetlb.c:4807
#4  0xffffffff813ded8e in hugetlb_acct_memory (delta=<optimized out>, h=0xffffffff83a46728 <hstates+6088>) at mm/hugetlb.c:4768
#5  hugetlb_vm_op_close (vma=<optimized out>) at mm/hugetlb.c:4877
#6  0xffffffff813a5798 in remove_vma (vma=0xffff888104b40450, unreachable=<optimized out>) at mm/mmap.c:141
#7  0xffffffff813a8ab5 in exit_mmap (mm=mm@entry=0xffff888106f52100) at mm/mmap.c:3225
#8  0xffffffff8113c7bd in __mmput (mm=0xffff888106f52100) at kernel/fork.c:1348
#9  0xffffffff8113c8da in mmput (mm=<optimized out>) at kernel/fork.c:1370
#10 0xffffffff811478e0 in exit_mm () at kernel/exit.c:567
#11 do_exit (code=code@entry=2) at kernel/exit.c:861
#12 0xffffffff81148321 in do_group_exit (exit_code=2) at kernel/exit.c:1024
#13 0xffffffff8115a700 in get_signal (ksig=ksig@entry=0xffffc90003c83e80) at kernel/signal.c:2881
#14 0xffffffff810d2d9e in arch_do_signal_or_restart (regs=0xffffc90003c83f58) at arch/x86/kernel/signal.c:308
#15 0xffffffff811feb30 in exit_to_user_mode_loop (ti_work=16390, regs=<optimized out>) at kernel/entry/common.c:168
#16 exit_to_user_mode_prepare (regs=regs@entry=0xffffc90003c83f58) at kernel/entry/common.c:204
#17 0xffffffff82171ead in __syscall_exit_to_user_mode_work (regs=0xffffc90003c83f58) at kernel/entry/common.c:286
#18 syscall_exit_to_user_mode (regs=regs@entry=0xffffc90003c83f58) at kernel/entry/common.c:297
#19 0xffffffff8216cdaa in do_syscall_64 (regs=0xffffc90003c83f58, nr=<optimized out>) at arch/x86/entry/common.c:86
#20 0xffffffff822000aa in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120

umap hugetlbfs 上的文件,可以自动释放吗?

mount hugetlbfs 的 min 会使用 reservation 吗?

是的,需要会增加 hugetlbfs 的

➜  /mnt   mount -t hugetlbfs -o min_size=2M none /mnt/huge

➜  /mnt cat /proc/meminfo| grep Huge
AnonHugePages:     20480 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       1
HugePages_Free:        1
HugePages_Rsvd:        1
HugePages_Surp:        1
Hugepagesize:       2048 kB
Hugetlb:         2099200 kB

hugetlbfs 上创建文件,这个时候是 reservation 吗?

才知道存在 vm.hugetlb_shm_group

THP

https://mp.weixin.qq.com/s/hHP1Pkxyg6dOtVYTZgU_9Q

本站所有文章转发 CSDN 将按侵权追究法律责任,其它情况随意。