
Get User Pages

TODO

Need to read the official documentation: Documentation/core-api/pin_user_pages.rst

pin_user_pages() pin_user_pages_fast() pin_user_pages_remote()

What is the difference between these three interfaces?

The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN uses a different reference counting technique.

follow_page

Why does the walk sometimes start from the p4d and sometimes directly from the pmd?

- entry_SYSCALL_64
  - do_syscall_64
    - do_syscall_x64
      - __x64_sys_execve
        - __se_sys_execve
          - __do_sys_execve
            - do_execve
              - do_execveat_common
                - copy_string_kernel
                  - get_arg_page
                    - get_user_pages_remote
                      - __get_user_pages_remote
                        - __get_user_pages_locked
                          - __get_user_pages
                            - follow_page_mask
                              - follow_p4d_mask
- ??
  - entry_SYSCALL_64
    - do_syscall_64
      - do_syscall_x64
        - __x64_sys_execve
          - __se_sys_execve
            - __do_sys_execve
              - do_execve
                - do_execveat_common
                  - copy_string_kernel
                    - get_arg_page
                      - get_user_pages_remote
                        - __get_user_pages_remote
                          - __get_user_pages_locked
                            - __get_user_pages
                              - follow_pmd_mask
                                - follow_page_pte

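The gup interfaces

Functions like these:

A classic call path:

How is it guaranteed that a pinned page cannot be swapped out?

Does mlock use the same approach?

Use cases

The implementation of /proc/$pid/mem

How gup and COW together make it possible to modify a page without write permission

kernel

Relationship with COW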

https://docs.kernel.org/core-api/pin_user_pages.html

The kernel documentation[^8] mainly covers the following three interfaces:

pin_user_pages()
pin_user_pages_fast()
pin_user_pages_remote()
  1. In other words, use pin_user_pages*() for DMA-pinned pages, and get_user_pages*() for other cases.
  2. The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN uses a different reference counting technique.
  3. FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is, FOLL_LONGTERM is a specific case, more restrictive case of FOLL_PIN. ... // TO BE CONTINUED

https://lwn.net/Articles/807108/

What the LWN article[^9] says:

To simplify the situation somewhat, the problems with get_user_pages() come about in two ways. One of those happens when the kernel thinks that the contents of a page will not change, but some peripheral device writes new data there. The other arises with memory that is located on persistent-memory devices managed by a filesystem; pinning pages into memory deprives the filesystem of the ability to make layout changes involving those pages. The latter problem has been “solved” for now by disallowing long-lasting page pins on persistent-memory devices, but there are use cases calling for creating just that kind of pin, so better solutions are being sought.

So the problems with get_user_pages come from two sources:

  1. The kernel assumes the contents of a pinned page have not changed, but a peripheral has in fact quietly written to that memory. (Isn't it rather the user who would be surprised by such an unannounced modification? And doesn't the same thing happen whenever memory is mapped to a device?)
  2. Persistent memory devices: hmm, this appears to be DAX-related.

Part of the problem comes down to the fact that get_user_pages() does not perform any sort of special tracking of the pages it pins into RAM. It does increment the reference count for each page, preventing it from being evicted from memory, but pages that have been pinned in this way are indistinguishable from pages that have acquired references in any of a vast number of other ways. So, while one can ask whether a page has references, it is not possible for kernel code to ask whether a page has been pinned for purposes like DMA I/O.

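  1. Holding a reference does prevent the page from being evicted (does "evicted" mean reclaimed? But this page was allocated by the user through brk; if the user process exits, how does the kernel know how to reclaim it?)
  2. And why does this need to be distinguished separately at all? ... // TO BE CONTINUED. The article is well written, but I ran out of patience.

A guess at the process:

  1. The pages are all allocated by the user, so each page's kernel virtual address is naturally determined.
  2. The question is whether the kernel maps the whole physical address space or only the part that is physical memory.
    1. My guess is that only the physical-memory part is mapped; otherwise gup.c would be much simpler
    2. Mapping everything would also waste a lot of memory
    3. In that case, where is the boundary of the kernel's physical memory?

Minor question: #if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA): why do DAX and CMA have an effect here?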

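Analysis of __get_user_pages:

  1. Determine the vma
  2. follow_page_mask: walk the page tables level by level until the page is found or the walk fails
    • On failure, call faultin_page
    • Store the page obtained into the pages parameter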

From gup to Dirty COW

In particular, each page frame corresponding to a private, writable page (VM_SHARED flag off and VM_MAYWRITE flag on) is marked as read-only for both the parent and the child, so that it will be handled with the Copy On Write mechanism.

https://stackoverflow.com/questions/48241187/memory-region-flags-in-linux-why-both-vm-write-and-vm-maywrite-are-needed

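What is the effect of writing to /proc/self/mem?

Conceptually, mmapping a file and then writing to it must involve an original page cache page. Because there is no write permission, the write can only go through a COW page: one is created and the write lands on it. If that COW page is dropped (for example by madvise(MADV_DONTNEED)) and the access faults again, the request at that point no longer carries FOLL_WRITE, so it is treated as a read access and does not trigger COW. This fault fills the pte with the original physical page, and the next follow_page call successfully returns the original page. So in the normal case reads and writes go to the COWed page, while in the race the write ends up on the original page cache page.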

/*
 * FOLL_FORCE can write to even unwritable pte's, but only
 * after we've gone through a COW cycle and they are dirty.
 */
static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
{
    return pte_write(pte) ||
        ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
}

The reason for removing the FOLL_WRITE flag is to take into account the case the FOLL_FORCE flag is applied on a read-only VMA (when the VM_MAYWRITE flag is set in the VMA). In that case, the pte_maybe_mkwrite function won’t set the write bit, however the faulted-in page is indeed ready for writing.

       MADV_DONTNEED
              Do not expect access in the near future.  (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.)

              After a successful MADV_DONTNEED operation, the semantics of memory access in the specified region are changed: subsequent accesses of pages in the range will succeed, but will result in either repopulating the memory contents from the up-to-date contents of the
              underlying mapped file (for shared file mappings, shared anonymous mappings, and shmem-based techniques such as System V shared memory segments) or zero-fill-on-demand pages for anonymous private mappings.

              Note that, when applied to shared mappings, MADV_DONTNEED might not lead to immediate freeing of the pages in the range.  The kernel is free to delay freeing the pages until an appropriate moment.  The resident set size (RSS) of the calling process will be immediately reduced however.

              MADV_DONTNEED cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages.  (Pages marked with the kernel-internal VM_PFNMAP flag are special memory areas that are not managed by the virtual memory subsystem.  Such pages are typically created by device
              drivers that map the pages into user space.)
/*
 * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
 * servicing faults for write access.  In the normal case, do always want
 * pte_mkwrite.  But get_user_pages can cause write faults for mappings
 * that do not have writing enabled, when used by access_process_vm.
 */
static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
{
    if (likely(vma->vm_flags & VM_WRITE))
        pte = pte_mkwrite(pte);
    return pte;
}

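Analysis of check_vma_flags:

  1. If a write is requested and the vma lacks write permission, the call fails unless FOLL_FORCE is set
  2. Even with force, the page must be COW-able (the file must not be mapped SHARED; footnote 1 can serve as a test). In other words, force has a precondition: it creates a COW page and presents the illusion that the file has been modified. After all, modifying a tracee's binary and then forgetting to restore it would be very bad
    • check_vma_flags calls is_cow_mapping, which checks the VM_MAYWRITE flag. Why does that check succeed, and where is this flag set?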

I did not expect gup to be called at this point

- ??
  - entry_SYSCALL_64
    - do_syscall_64
      - do_syscall_x64
        - __x64_sys_execve
          - __se_sys_execve
            - __do_sys_execve
              - do_execve
                - do_execveat_common
                  - copy_string_kernel
                    - get_arg_page
                      - get_user_pages_remote
                        - __get_user_pages_remote
                          - __get_user_pages_locked
                            - __get_user_pages
                              - follow_pmd_mask
                                - follow_page_pte
                                  - mark_page_accessed

From KVM to the zero page

commit b14b2690c50e02145bb867dfcde8845eb17aa8a4
Author: Sean Christopherson <seanjc@google.com>
Date:   Fri Apr 29 01:04:15 2022 +0000

    KVM: Rename/refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page()

    Rename and refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page()
    to better reflect what KVM is actually checking, and to eliminate extra
    pfn_to_page() lookups.  The kvm_release_pfn_*() an kvm_try_get_pfn()
    helpers in particular benefit from "refouncted" nomenclature, as it's not
    all that obvious why KVM needs to get/put refcounts for some PG_reserved
    pages (ZERO_PAGE and ZONE_DEVICE).

    Add a comment to call out that the list of exceptions to PG_reserved is
    all but guaranteed to be incomplete.  The list has mostly been compiled
    by people throwing noodles at KVM and finding out they stick a little too
    well, e.g. the ZERO_PAGE's refcount overflowed and ZONE_DEVICE pages
    didn't get freed.

    No functional change intended.

    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Message-Id: <20220429010416.2788472-10-seanjc@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

How should the meaning of PageReserved be understood?

Why does kvm_pfn_to_refcounted_page need to handle this?

	if (!PageReserved(page))
		return page;

How should this be understood:

	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
			 (FOLL_PIN | FOLL_GET)))

Further, in __get_user_pages_locked:

	/*
	 * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
	 * is to set FOLL_GET if the caller wants pages[] filled in (but has
	 * carelessly failed to specify FOLL_GET), so keep doing that, but only
	 * for FOLL_GET, not for the newer FOLL_PIN.
	 *
	 * FOLL_PIN always expects pages to be non-null, but no need to assert
	 * that here, as any failures will be obvious enough.
	 */
	if (pages && !(flags & FOLL_PIN))
		flags |= FOLL_GET;

How should this be understood?

How should this part of follow_page_pte be understood:

	/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
	ret = try_grab_page(page, flags);
	if (unlikely(ret)) {
		page = ERR_PTR(ret);
		goto out;
	}

/**
 * try_grab_folio() - Attempt to get or pin a folio.
 * @page:  pointer to page to be grabbed
 * @refs:  the value to (effectively) add to the folio's refcount
 * @flags: gup flags: these are the FOLL_* flag values.
 *
 * "grab" names in this file mean, "look at flags to decide whether to use
 * FOLL_PIN or FOLL_GET behavior, when incrementing the folio's refcount.
 *
 * Either FOLL_PIN or FOLL_GET (or neither) must be set, but not both at the
 * same time. (That's true throughout the get_user_pages*() and
 * pin_user_pages*() APIs.) Cases:
 *
 *    FOLL_GET: folio's refcount will be incremented by @refs.
 *
 *    FOLL_PIN on large folios: folio's refcount will be incremented by
 *    @refs, and its pincount will be incremented by @refs.
 *
 *    FOLL_PIN on single-page folios: folio's refcount will be incremented by
 *    @refs * GUP_PIN_COUNTING_BIAS.
 *
 * Return: The folio containing @page (with refcount appropriately
 * incremented) for success, or NULL upon failure. If neither FOLL_GET
 * nor FOLL_PIN was set, that's considered failure, and furthermore,
 * a likely bug in the caller, so a warning is also emitted.
 */
struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)

Understanding FOLL_PIN

Surprisingly, this flag is internal

enum {
	/* mark page accessed */
	FOLL_TOUCH = 1 << 16,
	/* a retry, previous pass started an IO */
	FOLL_TRIED = 1 << 17,
	/* we are working on non-current tsk/mm */
	FOLL_REMOTE = 1 << 18,
	/* pages must be released via unpin_user_page */
	FOLL_PIN = 1 << 19,
	/* gup_fast: prevent fall-back to slow gup */
	FOLL_FAST_ONLY = 1 << 20,
	/* allow unlocking the mmap lock */
	FOLL_UNLOCKABLE = 1 << 21,
};

FOLL_PIN is only used by internal users inside mm

/*
 * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
 * other. Here is what they mean, and how to use them:
 *
 *
 * FIXME: For pages which are part of a filesystem, mappings are subject to the
 * lifetime enforced by the filesystem and we need guarantees that longterm
 * users like RDMA and V4L2 only establish mappings which coordinate usage with
 * the filesystem.  Ideas for this coordination include revoking the longterm
 * pin, delaying writeback, bounce buffer page writeback, etc.  As FS DAX was
 * added after the problem with filesystems was found FS DAX VMAs are
 * specifically failed.  Filesystem pages are still subject to bugs and use of
 * FOLL_LONGTERM should be avoided on those pages.
 *
 * In the CMA case: long term pins in a CMA region would unnecessarily fragment
 * that region.  And so, CMA attempts to migrate the page before pinning, when
 * FOLL_LONGTERM is specified.
 *
 * FOLL_PIN indicates that a special kind of tracking (not just page->_refcount,
 * but an additional pin counting system) will be invoked. This is intended for
 * anything that gets a page reference and then touches page data (for example,
 * Direct IO). This lets the filesystem know that some non-file-system entity is
 * potentially changing the pages' data. In contrast to FOLL_GET (whose pages
 * are released via put_page()), FOLL_PIN pages must be released, ultimately, by
 * a call to unpin_user_page().
 *
 * FOLL_PIN is similar to FOLL_GET: both of these pin pages. They use different
 * and separate refcounting mechanisms, however, and that means that each has
 * its own acquire and release mechanisms:
 *
 *     FOLL_GET: get_user_pages*() to acquire, and put_page() to release.
 *
 *     FOLL_PIN: pin_user_pages*() to acquire, and unpin_user_pages to release.
 *
 * FOLL_PIN and FOLL_GET are mutually exclusive for a given function call.
 * (The underlying pages may experience both FOLL_GET-based and FOLL_PIN-based
 * calls applied to them, and that's perfectly OK. This is a constraint on the
 * callers, not on the pages.)
 *
 * FOLL_PIN should be set internally by the pin_user_pages*() APIs, never
 * directly by the caller. That's in order to help avoid mismatches when
 * releasing pages: get_user_pages*() pages must be released via put_page(),
 * while pin_user_pages*() pages must be released via unpin_user_page().
 *
 * Please see Documentation/core-api/pin_user_pages.rst for more information.
 */

enum {
	/* check pte is writable */
	FOLL_WRITE = 1 << 0,
	/* do get_page on page */
	FOLL_GET = 1 << 1,
	/* give error on hole if it would be zero */
	FOLL_DUMP = 1 << 2,
	/* get_user_pages read/write w/o permission */
	FOLL_FORCE = 1 << 3,
	/*
	 * if a disk transfer is needed, start the IO and return without waiting
	 * upon it
	 */
	FOLL_NOWAIT = 1 << 4,
	/* do not fault in pages */
	FOLL_NOFAULT = 1 << 5,
	/* check page is hwpoisoned */
	FOLL_HWPOISON = 1 << 6,
	/* don't do file mappings */
	FOLL_ANON = 1 << 7,
	/*
	 * FOLL_LONGTERM indicates that the page will be held for an indefinite
	 * time period _often_ under userspace control.  This is in contrast to
	 * iov_iter_get_pages(), whose usages are transient.
	 */
	FOLL_LONGTERM = 1 << 8,
	/* split huge pmd before returning */
	FOLL_SPLIT_PMD = 1 << 9,
	/* allow returning PCI P2PDMA pages */
	FOLL_PCI_P2PDMA = 1 << 10,
	/* allow interrupts from generic signals */
	FOLL_INTERRUPTIBLE = 1 << 11,

	/* See also internal only FOLL flags in mm/internal.h */
};

is_valid_gup_args(): further notes on the FOLL flags

But how should this be understood?

	/* Pages input must be given if using GET/PIN */
	if (WARN_ON_ONCE((gup_flags & (FOLL_GET | FOLL_PIN)) && !pages))
		return false;

The ongoing trouble with get_user_pages()

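LWN: get_user_pages() is still causing trouble!

Even more confused after reading it.

If a gup call takes an extra reference on a page, does the page need to be taken off the LRU?

A security bug makes a complicated thing even more complicated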

9471f1f2f50282b9e8f59198ec6bb738b4ccc009

[ 8371.232795] ------------[ cut here ]------------
[ 8371.232796] WARNING: CPU: 4 PID: 6509 at mm/gup.c:1101 __get_user_pages+0x577/0x670

Take a look at this one

c8070b78751955e59b42457b974bea4a4fe00187

Analyze this one; it is really impressive

https://github.com/lrh2000/StackRot

This example is great

  1. During gup, the page fault is handled by __get_user_pages calling handle_mm_fault directly, which is effectively a software emulation of a hardware page fault.

  2. ext4 also uses iomap

@[
    kvm_flush_tlb_multi+5
    flush_tlb_mm_range+287
    ptep_clear_flush+65
    do_wp_page+3201
    handle_mm_fault+2921
    __get_user_pages+1130
    __gup_longterm_locked+1333
    gup_fast_fallback+4269
    iov_iter_extract_pages+296
    bio_iov_iter_get_pages+282
    iomap_dio_bio_iter+748
    __iomap_dio_rw+639
    iomap_dio_rw+18
    ext4_file_read_iter+242
    aio_read+456
    io_submit_one+1608
    __se_sys_io_submit+189
    do_syscall_64+237
    entry_SYSCALL_64_after_hwframe+119
]: 41100

Why do so many of the pages obtained in hva_to_pfn_slow have a refcount of 3?

Using the following diff:

@@ -2821,6 +2821,8 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
 	if (npages != 1)
 		return npages;

+	trace_hi(page_ref_count(page));
+
 	/* map read fault as writable if possible */
 	if (unlikely(!write_fault) && writable) {
 		struct page *wpage;

This gives the following result

80.80%  hi : 3
19.20%  hi : 2

At least we now know that pages mapped by KVM can be released because of the following call:

@[
    kvm_release_pfn_clean+5
    kvm_tdp_page_fault+213
    kvm_mmu_do_page_fault+482
    kvm_mmu_page_fault+137
    vmx_handle_exit+300
    kvm_arch_vcpu_ioctl_run+2832
    kvm_vcpu_ioctl+558
    __x64_sys_ioctl+145
    do_syscall_64+188
    entry_SYSCALL_64_after_hwframe+119
]: 247319

Look carefully at the comment in hva_to_pfn_remapped:

	/*
	 * Get a reference here because callers of *hva_to_pfn* and
	 * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
	 * returned pfn.  This is only needed if the VMA has VM_MIXEDMAP
	 * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will
	 * simply do nothing for reserved pfns.
	 *
	 * Whoever called remap_pfn_range is also going to call e.g.
	 * unmap_mapping_range before the underlying pages are freed,
	 * causing a call to our MMU notifier.
	 *
	 * Certain IO or PFNMAP mappings can be backed with valid
	 * struct pages, but be allocated without refcounting e.g.,
	 * tail pages of non-compound higher order allocations, which
	 * would then underflow the refcount when the caller does the
	 * required put_page. Don't allow those pages here.
	 */

The question is: what is the difference between pin_user_pages_remote and pin_user_pages_fast?

This needs further confirmation.

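pin_user_pages_remote() and pin_user_pages_fast() are both Linux kernel interfaces for pinning user-space pages (page pinning), long-term when combined with FOLL_LONGTERM. Typical users include RDMA, VFIO, vDPA, GPU passthrough and other subsystems that must keep user pages from being reclaimed or migrated. The two differ mainly in which address space they operate on, which implementation path they take, and their blocking and performance characteristics.

The following describes them along three dimensions: semantics, implementation, and usage scenarios.

I. Basic semantic comparison

| Dimension | pin_user_pages_fast() | pin_user_pages_remote() |
| --- | --- | --- |
| Address space | current process (current->mm) | a specified process (struct mm_struct *) |
| mmap_lock | not needed on the lockless fast path; taken internally when it falls back to the slow path | needed; the caller must hold it |
| May sleep | not on the fast path itself; yes once it falls back to the slow path | yes |
| Performance | high when the fast path succeeds (no page fault, no lock) | lower (full GUP path) |
| Can handle missing pages | yes, through the slow-path fallback | yes |
| Typical use | high-performance I/O, hot paths | cross-process access, cold paths, management operations |

II. pin_user_pages_fast()

1. Core characteristics

2. Typical usage scenarios

For example:

3. Usage example (simplified)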

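ret = pin_user_pages_fast(uaddr, npages,
                          FOLL_WRITE | FOLL_LONGTERM,
                          pages);
if (ret < npages) {
    /* fall back or handle the failure; a partial pin (0 < ret < npages) must still be unpinned */
}

In practice the fallback is built in: when the lockless walk fails, pin_user_pages_fast() itself falls back to the slow path (see gup_fast_fallback in the ext4 direct I/O trace above).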


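III. pin_user_pages_remote()

1. Core characteristics

2. Typical usage scenarios

For example:

3. Usage example (simplified)

/* Recent kernels take six arguments; with locked == NULL the caller holds mmap_lock. */
mmap_read_lock(mm);
ret = pin_user_pages_remote(mm, uaddr, npages,
                            FOLL_WRITE | FOLL_LONGTERM,
                            pages, NULL);
mmap_read_unlock(mm);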

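IV. Why are two sets of interfaces needed?

1. Performance vs. semantic completeness

2. Kernel design principle

The fast path must not sleep or take heavyweight locks; the slow path guarantees correct semantics.

This suggests the following pattern, although pin_user_pages_fast() already falls back to the slow path internally, so callers rarely need to spell it out:

ret = pin_user_pages_fast(...);
if (ret < npages)
    ret = pin_user_pages_remote(...);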

V. Relationship to get_user_pages* (important)

This is meant to solve:


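VI. Recommendations (practical summary)

  1. I/O hot paths on the current process's memory

    • Prefer: pin_user_pages_fast()
    • The fallback must be taken into account
  2. Cross-process access / ioctl / management paths

    • Use pin_user_pages_remote() directly
  3. Long-term DMA pins

    • FOLL_LONGTERM is mandatory
    • Watch for conflicts with filesystem DAX and page migration

VII. One-sentence summary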


(Searching the kernel for actual callers, I cannot make out any pattern in which one gets used)

There is also a get_user_pages_remote() function

🧀  rg -l pin_user_pages_remote
kernel/trace/trace_events_user.c
Documentation/core-api/pin_user_pages.rst
drivers/vfio/vfio_iommu_type1.c
drivers/iommu/iommufd/pages.c
mm/process_vm_access.c
mm/gup.c
include/linux/mm.h

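What exactly are the semantics of FOLL_LONGTERM?

Reposting any article from this site to CSDN will be pursued as copyright infringement; any other use is fine.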

  1. https://github.com/dirtycow/dirtycow.github.io/blob/master/dirtyc0w.c