
kvm track mode

This seems close to what I am looking for, so it is worth analyzing further.

 KVM: page track: add the framework of guest page tracking

 The array, gfn_track[mode][gfn], is introduced in memory slot for every
 guest page, this is the tracking count for the guest page on different
 modes. If the page is tracked then the count is increased, the page is
 not tracked after the count reaches zero

 We use 'unsigned short' as the tracking count which should be enough as
 shadow page table only can use 2^14 (2^3 for level, 2^1 for cr4_pae, 2^2
 for quadrant, 2^3 for access, 2^1 for nxe, 2^1 for cr0_wp, 2^1 for
 smep_andnot_wp, 2^1 for smap_andnot_wp, and 2^1 for smm) at most, there
 is enough room for other trackers

 Two callbacks, kvm_page_track_create_memslot() and
 kvm_page_track_free_memslot() are implemented in this patch, they are
 internally used to initialize and reclaim the memory of the array

 Currently, only write track mode is supported
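The counting scheme in this commit message can be illustrated with a small user-space sketch. Everything named `toy_*` is hypothetical and only mirrors the idea of `gfn_track[mode][gfn]`: one `unsigned short` counter per guest page and per mode, where a page is tracked while its counter is non-zero. It is not the kernel implementation and omits all locking.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical names throughout; only the counting idea comes from the commit. */
enum toy_track_mode { TOY_TRACK_WRITE, TOY_TRACK_MAX };

struct toy_slot {
	unsigned long base_gfn;                   /* first gfn covered by the slot */
	unsigned long npages;                     /* number of guest pages in the slot */
	unsigned short *gfn_track[TOY_TRACK_MAX]; /* gfn_track[mode][gfn - base_gfn] */
};

/* Allocate zeroed counters for every mode; returns 0 on success, -1 on failure. */
static int toy_track_create(struct toy_slot *slot, unsigned long base_gfn,
			    unsigned long npages)
{
	slot->base_gfn = base_gfn;
	slot->npages = npages;
	for (int m = 0; m < TOY_TRACK_MAX; m++) {
		slot->gfn_track[m] = calloc(npages, sizeof(unsigned short));
		if (!slot->gfn_track[m]) {
			while (m-- > 0)
				free(slot->gfn_track[m]);
			return -1;
		}
	}
	return 0;
}

/* Release the counter arrays again. */
static void toy_track_free(struct toy_slot *slot)
{
	for (int m = 0; m < TOY_TRACK_MAX; m++) {
		free(slot->gfn_track[m]);
		slot->gfn_track[m] = NULL;
	}
}

/* Add delta (+1 or -1) to the tracking count of one guest page. */
static void toy_update_gfn_track(struct toy_slot *slot, unsigned long gfn,
				 enum toy_track_mode mode, int delta)
{
	unsigned long index = gfn - slot->base_gfn;
	int val;

	assert(index < slot->npages);
	val = slot->gfn_track[mode][index] + delta;
	assert(val >= 0 && val <= USHRT_MAX); /* the counter must not wrap */
	slot->gfn_track[mode][index] = (unsigned short)val;
}

/* A page is tracked for as long as its counter is non-zero. */
static bool toy_is_tracked(const struct toy_slot *slot, unsigned long gfn,
			   enum toy_track_mode mode)
{
	return slot->gfn_track[mode][gfn - slot->base_gfn] != 0;
}
```

On the size argument: the role fields listed in the message add up to 3 + 1 + 2 + 3 + 1 + 1 + 1 + 1 + 1 = 14 bits, so at most 2^14 = 16384 shadow pages can track one gfn, which is well below the 65535 that an `unsigned short` can count.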

https://terenceli.github.io/%E6%8A%80%E6%9C%AF/2018/08/11/dirty-pages-tracking-in-migration

So here, for every gfn, we remove write access. After this ioctl returns, the guest’s RAM has been marked as not writable, so every write to it exits to KVM, which then marks the page dirty. This is what ‘start the dirty log’ means.

gfn_track

 History:        #0
 Commit:         3d0c27ad6ee465f174b09ee99fcaf189c57d567a
 Author:         Xiao Guangrong <guangrong.xiao@linux.intel.com>
 Committer:      Paolo Bonzini <pbonzini@redhat.com>
 Author Date:    Wed 24 Feb 2016 09:51:11 AM UTC
 Committer Date: Thu 03 Mar 2016 01:36:21 PM UTC

 KVM: MMU: let page fault handler be aware tracked page

 The page fault caused by write access on the write tracked page can not
 be fixed, it always need to be emulated. page_fault_handle_page_track()
 is the fast path we introduce here to skip holding mmu-lock and shadow
 page table walking

 However, if the page table is not present, it is worth making the page
 table entry present and readonly to make the read access happy

 mmu_need_write_protect() need to be cooked to avoid page becoming writable
 when making page table present or sync/prefetch shadow page table entries

 Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
 Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

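There is nothing special about gfn_track: it only records that a page is tracked. kvm_mmu_page_fault then calls x86_emulate_instruction to handle the access. After the emulated write, it seems that kvm_mmu_pte_write is invoked to bring the shadow page table in line with the write to the guest page table. Note that this goes through the page-track notifier registered in kvm_mmu_init_vm, not through an mmu_notifier.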

page_fault_handle_page_track

Called from direct_page_fault and FNAME(page_fault). It seems that if the page is tracked, both functions return RET_PF_EMULATE.
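A rough sketch of that fast path, under stated assumptions: the `TOY_PFERR_*` values mirror the x86 page-fault error code bits (present, write, reserved), while the function names and the two-value result enum are hypothetical. The condition follows the commit message quoted above: only a write to a present, write-tracked page is sent to the emulator, while a not-present fault still goes through the normal path so the entry can be made present and read-only.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 page-fault error code bits; the TOY_ prefix marks these as local names. */
#define TOY_PFERR_PRESENT (1u << 0)
#define TOY_PFERR_WRITE   (1u << 1)
#define TOY_PFERR_RSVD    (1u << 3)

/* Hypothetical result type: either emulate, or everything else. */
enum toy_pf_ret { TOY_RET_PF_OTHER, TOY_RET_PF_EMULATE };

/* True when the fault is a write to a present page that is write-tracked. */
static bool toy_fault_hits_tracked_page(uint32_t error_code, bool gfn_write_tracked)
{
	if (error_code & TOY_PFERR_RSVD)
		return false;
	if (!(error_code & TOY_PFERR_PRESENT) || !(error_code & TOY_PFERR_WRITE))
		return false;
	return gfn_write_tracked;
}

/* The tracked-page check runs first, before any lock or table walk. */
static enum toy_pf_ret toy_page_fault(uint32_t error_code, bool gfn_write_tracked)
{
	if (toy_fault_hits_tracked_page(error_code, gfn_write_tracked))
		return TOY_RET_PF_EMULATE;

	/* Normal path (lock, walk, install mapping) is elided in this sketch. */
	return TOY_RET_PF_OTHER;
}
```

In this sketch a read fault on a tracked page, or a write fault on an untracked page, never reaches the emulator, which matches the idea that tracking only intercepts writes.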

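The track mechanism

Page tracking and the dirty bitmap are really two separate things!

The counters are maintained by: kvm_slot_page_track_add_page / kvm_slot_page_track_remove_page ==> update_gfn_track

These are called from account_shadowed and unaccount_shadowed respectively.

__kvm_mmu_prepare_zap_page : called from the various zap-page paths and used together with commit_zap => unaccount_shadowed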

kvm_mmu_get_page : => account_shadowed

  1. kvm_mmu_pte_write
void kvm_mmu_init_vm(struct kvm *kvm)
{
    struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;

    node->track_write = kvm_mmu_pte_write;
    node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
    kvm_page_track_register_notifier(kvm, node);
}

kvm_mmu_get_page: when not in direct mode, the page obtained from kvm_mmu_alloc_page has to be accounted => account_shadowed => kvm_slot_page_track_add_page

So what is being protected is the shadow page table?

To be continued.

gfn_to_rmap

RMAP_RECYCLE_THRESHOLD is, surprisingly, 1000.

parent_ptes

static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
{
    u64 *sptep;
    struct rmap_iterator iter;

    for_each_rmap_spte(&sp->parent_ptes, &iter, sptep) {
        mark_unsync(sptep);
    }
}

static void mark_unsync(u64 *spte)
{
    struct kvm_mmu_page *sp;
    unsigned int index;

    sp = sptep_to_sp(spte);
    index = spte - sp->spt;
    if (__test_and_set_bit(index, sp->unsync_child_bitmap))
        return;
    if (sp->unsync_children++)
        return;
    kvm_mmu_mark_parents_unsync(sp);
}

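The recursion walks upward. In each parent, the bit for this child is set in unsync_child_bitmap. If that bit was already set, or the parent already had other unsync children, the walk stops there because the ancestors have been marked before; otherwise it continues to the parent's own parents.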
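The early-exit behaviour of that upward walk is easier to see in a reduced model. The sketch below is a simplification with hypothetical names: each page has a single parent and eight entries, whereas the kernel walks every parent pte through the rmap. The two return conditions are the same as in mark_unsync above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_ENTRIES 8 /* entries per page in this toy model */

/* Hypothetical shadow page with exactly one parent. */
struct toy_usp {
	struct toy_usp *parent;            /* NULL for the root */
	unsigned int index_in_parent;      /* which parent entry points at us */
	unsigned char unsync_child_bitmap; /* one bit per entry */
	unsigned int unsync_children;      /* number of bits set above */
	bool unsync;
};

static void toy_mark_parents_unsync(struct toy_usp *sp);

/* Record in the parent that the child behind entry 'index' is unsync. */
static void toy_mark_unsync(struct toy_usp *parent, unsigned int index)
{
	unsigned char bit;

	assert(index < TOY_ENTRIES);
	bit = (unsigned char)(1u << index);

	if (parent->unsync_child_bitmap & bit)
		return; /* this child was already recorded */
	parent->unsync_child_bitmap |= bit;
	if (parent->unsync_children++)
		return; /* ancestors already know about this parent */
	toy_mark_parents_unsync(parent);
}

/* Propagate one level up, if there is a parent. */
static void toy_mark_parents_unsync(struct toy_usp *sp)
{
	if (sp->parent)
		toy_mark_unsync(sp->parent, sp->index_in_parent);
}

/* Entry point: mark a page unsync and notify its ancestors. */
static void toy_unsync_page(struct toy_usp *sp)
{
	sp->unsync = true;
	toy_mark_parents_unsync(sp);
}
```

With a root, one middle page and two leaves, marking the first leaf updates both the middle page and the root; marking the second leaf only touches the middle page, because its unsync_children count was already non-zero.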

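link_shadow_page : the only call site of mark_unsync (apart from the recursion through kvm_mmu_mark_parents_unsync)

kvm_unsync_page : the only call site of kvm_mmu_mark_parents_unsync (apart from mark_unsync itself)

mmu_need_write_protect : operates on an sp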

mmu_need_write_protect

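for_each_gfn_indirect_valid_sp : a single gfn can correspond to several shadow pages at the same time, because one guest page can be shadowed by more than one shadow page (for example with different roles)

hash : implements the mapping from a guest page table to its shadow pages

rmap_add : maintains the relation between a gfn and the sptes that map it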

role.quadrant

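Purpose: when one guest page table is split across several shadow pages, the quadrant selects which shadow page covers a given guest address.

get_written_sptes : the quadrant is computed from the page offset of the gpa and then compared with sp->role.quadrant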
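The offset arithmetic can be sketched as follows. This is modelled on my reading of get_written_sptes for the case of 4-byte guest entries and should be checked against the kernel source; the struct, the function names and the `is_pde_level` flag (standing in for the 32-bit page directory level) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  (1u << TOY_PAGE_SHIFT)

/* Where a guest write lands, expressed in shadow-page terms. */
struct toy_written {
	unsigned int quadrant;   /* which shadow page of the set */
	unsigned int spte_index; /* first affected 8-byte entry in that page */
	unsigned int nspte;      /* number of affected entries */
};

/*
 * Locate a write at guest physical address gpa into a guest page table
 * with 4-byte entries that is shadowed with 8-byte entries.
 */
static struct toy_written toy_locate_write(uint64_t gpa, bool is_pde_level)
{
	struct toy_written w;
	unsigned int page_offset = (unsigned int)(gpa & (TOY_PAGE_SIZE - 1));

	w.nspte = 1;
	page_offset <<= 1; /* 4-byte guest entry -> 8-byte shadow entry */
	if (is_pde_level) {
		/* One 4 MiB guest pde is shadowed by two 2 MiB shadow pdes. */
		page_offset &= ~7u;
		page_offset <<= 1;
		w.nspte = 2;
	}
	w.quadrant = page_offset >> TOY_PAGE_SHIFT;
	page_offset &= TOY_PAGE_SIZE - 1;
	w.spte_index = page_offset / 8;
	return w;
}

/* A shadow page is affected only if its quadrant matches the write. */
static bool toy_sp_covers_write(unsigned int sp_quadrant, uint64_t gpa,
				bool is_pde_level)
{
	return toy_locate_write(gpa, is_pde_level).quadrant == sp_quadrant;
}
```

For instance, a write to guest entry 512 of a last-level table (page offset 0x800) belongs to quadrant 1, entry 0, so a shadow page with quadrant 0 can ignore it.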

obsolete sp

static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
{
    return sp->role.invalid ||
           unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
}
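The generation check lends itself to a tiny model. In the sketch below only the comparison comes from is_obsolete_sp; the types, the helper names and the idea of bumping the generation to invalidate every page at once are assumptions made for illustration.

```c
#include <stdbool.h>

/* Hypothetical per-VM state: the current valid generation. */
struct toy_gen_mmu {
	unsigned long valid_gen;
};

/* Hypothetical shadow page: remembers the generation it was created in. */
struct toy_gen_sp {
	unsigned long valid_gen;
	bool invalid;
};

/* A new page inherits the generation that is current at creation time. */
static void toy_gen_sp_init(const struct toy_gen_mmu *mmu, struct toy_gen_sp *sp)
{
	sp->valid_gen = mmu->valid_gen;
	sp->invalid = false;
}

/* Same test as is_obsolete_sp: explicitly invalid, or from an old generation. */
static bool toy_gen_is_obsolete(const struct toy_gen_mmu *mmu,
				const struct toy_gen_sp *sp)
{
	return sp->invalid || sp->valid_gen != mmu->valid_gen;
}

/* Assumed usage: changing the generation makes all existing pages obsolete. */
static void toy_gen_invalidate_all(struct toy_gen_mmu *mmu)
{
	mmu->valid_gen++;
}
```

The appeal of this design is that invalidating everything is a single store, and the stale pages can be reclaimed later whenever they are encountered.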

Questions

  1. kvm_page_track_create_memslot
  2. kvm_page_track_write_tracking_alloc

kvm_page_track_create_memslot

Is this the same thing as dirty page tracking?

config DRM_I915_GVT_KVMGT
	tristate "Enable KVM host support Intel GVT-g graphics virtualization"
	depends on DRM_I915
	depends on KVM_X86
	depends on 64BIT
	depends on VFIO
	select DRM_I915_GVT
	select KVM_EXTERNAL_WRITE_TRACKING
	select VFIO_MDEV

	help
	  Choose this option if you want to enable Intel GVT-g graphics
	  virtualization technology host support with integrated graphics.
	  With GVT-g, it's possible to have one integrated graphics
	  device shared by multiple VMs under KVM.

	  Note that this driver only supports newer device from Broadwell on.
	  For further information and setup guide, you can visit:
	  https://github.com/intel/gvt-linux/wiki.

	  If in doubt, say "N".

KVM_EXTERNAL_WRITE_TRACKING

config KVM_EXTERNAL_WRITE_TRACKING
	bool
History:        #0
Commit:         deae4a10f16649d9c8bfb89f38b61930fb938284
Author:         David Stevens <stevensd@chromium.org>
Committer:      Paolo Bonzini <pbonzini@redhat.com>
Author Date:    Wed 22 Sep 2021 12:58:59 PM CST
Committer Date: Fri 01 Oct 2021 03:44:58 PM CST

KVM: x86: only allocate gfn_track when necessary

Avoid allocating the gfn_track arrays if nothing needs them. If there
are no external to KVM users of the API (i.e. no GVT-g), then page
tracking is only needed for shadow page tables. This means that when tdp
is enabled and there are no external users, then the gfn_track arrays
can be lazily allocated when the shadow MMU is actually used. This avoid
allocations equal to .05% of guest memory when nested virtualization is
not used, if the kernel is compiled without GVT-g.

Signed-off-by: David Stevens <stevensd@chromium.org>
Message-Id: <20210922045859.2011227-3-stevensd@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
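The ".05% of guest memory" figure can be reproduced with a short calculation. The helper below is a hypothetical sketch that assumes 4 KiB guest pages, a 2-byte counter per page, and a single tracking mode (write).

```c
#include <stdint.h>

#define TOY_GUEST_PAGE_SIZE 4096u /* assumption: 4 KiB guest pages */

/* Bytes needed for one gfn_track array covering guest_bytes of memory. */
static uint64_t toy_gfn_track_bytes(uint64_t guest_bytes)
{
	uint64_t npages = (guest_bytes + TOY_GUEST_PAGE_SIZE - 1) / TOY_GUEST_PAGE_SIZE;

	return npages * sizeof(uint16_t); /* one 16-bit counter per page */
}
```

For a 4 GiB guest this gives 1048576 pages and therefore 2 MiB of counters; 2 bytes per 4096-byte page is about 0.049%, which rounds to the .05% quoted in the commit message.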

page track

kvm_mmu_get_page ==> account_shadowed ==> kvm_slot_page_track_add_page ==> update_gfn_track (the removal side is unaccount_shadowed ==> kvm_slot_page_track_remove_page ==> update_gfn_track)

static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
{
	struct kvm_memslots *slots;
	struct kvm_memory_slot *slot;
	gfn_t gfn;

	kvm->arch.indirect_shadow_pages++;
	gfn = sp->gfn;
	slots = kvm_memslots_for_spte_role(kvm, sp->role);
	slot = __gfn_to_memslot(slots, gfn);

	/* the non-leaf shadow pages are keeping readonly. */
	if (sp->role.level > PG_LEVEL_4K)
		return kvm_slot_page_track_add_page(kvm, slot, gfn,
						    KVM_PAGE_TRACK_WRITE);

	kvm_mmu_gfn_disallow_lpage(slot, gfn);
}
/*
 * add guest page to the tracking pool so that corresponding access on that
 * page will be intercepted.
 *
 * It should be called under the protection both of mmu-lock and kvm->srcu
 * or kvm->slots_lock.
 *
 * @kvm: the guest instance we are interested in.
 * @slot: the @gfn belongs to.
 * @gfn: the guest page.
 * @mode: tracking mode, currently only write track is supported.
 */
void kvm_slot_page_track_add_page(struct kvm *kvm,
				  struct kvm_memory_slot *slot, gfn_t gfn,
				  enum kvm_page_track_mode mode)
{

	if (WARN_ON(!page_track_mode_is_valid(mode)))
		return;

	update_gfn_track(slot, gfn, mode, 1);

	/*
	 * new track stops large page mapping for the
	 * tracked page.
	 */
	kvm_mmu_gfn_disallow_lpage(slot, gfn);

	/* Always taken: write tracking is currently the only mode. */
	if (mode == KVM_PAGE_TRACK_WRITE)
		/* Write-protect every spte associated with this gfn. */
		if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn))
			kvm_flush_remote_tlbs(kvm);
}
struct kvm_page_track_notifier_node {
	struct hlist_node node;

	/*
	 * It is called when guest is writing the write-tracked page
	 * and write emulation is finished at that time.
	 *
	 * @vcpu: the vcpu where the write access happened.
	 * @gpa: the physical address written by guest.
	 * @new: the data was written to the address.
	 * @bytes: the written length.
	 * @node: this node
	 */
	void (*track_write)(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
			    int bytes, struct kvm_page_track_notifier_node *node);
	/*
	 * It is called when memory slot is being moved or removed
	 * users can drop write-protection for the pages in that memory slot
	 *
	 * @kvm: the kvm where memory slot being moved or removed
	 * @slot: the memory slot being moved or removed
	 * @node: this node
	 */
	void (*track_flush_slot)(struct kvm *kvm, struct kvm_memory_slot *slot,
			    struct kvm_page_track_notifier_node *node);
};
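How such a notifier node is typically used can be sketched with a plain singly linked list. All `toy_*` names are hypothetical, and the kernel's hlist and SRCU protection are replaced by an unlocked list, so this only shows the register and dispatch pattern, not the real API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical notifier node: a list link plus the write callback. */
struct toy_track_node {
	struct toy_track_node *next;
	void (*track_write)(uint64_t gpa, const uint8_t *new_data, int bytes,
			    struct toy_track_node *node);
};

struct toy_track_head {
	struct toy_track_node *first;
};

/* Add a node at the front of the list (no locking in this sketch). */
static void toy_track_register(struct toy_track_head *head,
			       struct toy_track_node *n)
{
	n->next = head->first;
	head->first = n;
}

/* Remove a node from the list if it is present. */
static void toy_track_unregister(struct toy_track_head *head,
				 struct toy_track_node *n)
{
	for (struct toy_track_node **pp = &head->first; *pp; pp = &(*pp)->next) {
		if (*pp == n) {
			*pp = n->next;
			n->next = NULL;
			return;
		}
	}
}

/* Called once an emulated write has completed: notify every listener. */
static void toy_track_write(struct toy_track_head *head, uint64_t gpa,
			    const uint8_t *new_data, int bytes)
{
	for (struct toy_track_node *n = head->first; n; n = n->next)
		if (n->track_write)
			n->track_write(gpa, new_data, bytes, n);
}

/* Example listener: embeds the node as first member and counts writes. */
struct toy_write_counter {
	struct toy_track_node node;
	int writes;
	uint64_t last_gpa;
};

static void toy_count_write(uint64_t gpa, const uint8_t *new_data, int bytes,
			    struct toy_track_node *node)
{
	/* Valid because 'node' is the first member of toy_write_counter. */
	struct toy_write_counter *c = (struct toy_write_counter *)node;

	(void)new_data;
	(void)bytes;
	c->writes++;
	c->last_gpa = gpa;
}
```

Passing the node back to the callback is what lets each listener recover its own state, the same reason the kernel callbacks receive a `node` argument.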
  1. kvm_page_track_notifier_node::track_write
  1. Call sites
    • emulate_ops::write_emulated ==> emulator_write_emulated ==> emulator_read_write ==> emulator_read_write_onepage ==> write_emultor::read_write_emulate ==> write_emulate ==> emulator_write_phys
    • emulate_ops::cmpxchg_emulated ==> emulator_cmpxchg_emulated

Following one of these branches: segmented_write ==> emulate_ops::write_emulated

Both direct_page_fault and paging64_page_fault call page_fault_handle_page_track. If the accessed page is tracked, the access has to be emulated, which happens in kvm_mmu_page_fault.

/*
 * Currently, we have two sorts of write-protection, a) the first one
 * write-protects guest page to sync the guest modification, b) another one is
 * used to sync dirty bitmap when we do KVM_GET_DIRTY_LOG. The differences
 * between these two sorts are:
 * 1) the first case clears SPTE_MMU_WRITEABLE bit.
 * 2) the first case requires flushing tlb immediately avoiding corrupting
 *    shadow page table between all vcpus so it should be in the protection of
 *    mmu-lock. And the another case does not need to flush tlb until returning
 *    the dirty bitmap to userspace since it only write-protects the page
 *    logged in the bitmap, that means the page in the dirty bitmap is not
 *    missed, so it can flush tlb out of mmu-lock.
 *
 * So, there is the problem: the first case can meet the corrupted tlb caused
 * by another case which write-protects pages but without flush tlb
 * immediately. In order to making the first case be aware this problem we let
 * it flush tlb if we try to write-protect a spte whose SPTE_MMU_WRITEABLE bit
 * is set, it works since another case never touches SPTE_MMU_WRITEABLE bit.
 *
 * Anyway, whenever a spte is updated (only permission and status bits are
 * changed) we need to check whether the spte with SPTE_MMU_WRITEABLE becomes
 * readonly, if that happens, we need to flush tlb. Fortunately,
 * mmu_spte_update() has already handled it perfectly.
 *
 * The rules to use SPTE_MMU_WRITEABLE and PT_WRITABLE_MASK:
 * - if we want to see if it has writable tlb entry or if the spte can be
 *   writable on the mmu mapping, check SPTE_MMU_WRITEABLE, this is the most
 *   case, otherwise
 * - if we fix page fault on the spte or do write-protection by dirty logging,
 *   check PT_WRITABLE_MASK.
 *
 * TODO: introduce APIs to split these two cases.
 */
static inline int is_writable_pte(unsigned long pte)
{
	return pte & PT_WRITABLE_MASK;
}

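What can be read from this comment:

  1. SPTE_MMU_WRITEABLE is cleared to write-protect guest pages whose modifications must be synced (the shadow page table case)
  2. PT_WRITABLE_MASK is what write protection for the dirty bitmap (dirty logging) operates on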
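The difference between the two kinds of write protection can be made concrete with a toy spte. This is my paraphrase of the kernel comment, not kernel code: `TOY_PT_WRITABLE` uses the x86 R/W bit position (bit 1), the position chosen for `TOY_SPTE_MMU_WRITEABLE` is a made-up placeholder, and the flush rule is a simplified reading of the comment.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PT_WRITABLE        (1ull << 1)  /* x86 hardware R/W bit */
#define TOY_SPTE_MMU_WRITEABLE (1ull << 58) /* hypothetical software-bit position */

/* Case a): protect a shadowed guest page table; clears both bits. */
static uint64_t toy_wrprot_for_shadow(uint64_t spte)
{
	return spte & ~(TOY_PT_WRITABLE | TOY_SPTE_MMU_WRITEABLE);
}

/* Case b): dirty logging; only the hardware bit is cleared. */
static uint64_t toy_wrprot_for_dirty_log(uint64_t spte)
{
	return spte & ~TOY_PT_WRITABLE;
}

/* A write fault may simply restore write access only while this bit is set. */
static bool toy_can_restore_write(uint64_t spte)
{
	return (spte & TOY_SPTE_MMU_WRITEABLE) != 0;
}

/*
 * A TLB flush is owed when an spte that may still have a writable TLB
 * entry ends up read-only. The dirty-log path may defer that flush;
 * the shadow path has to perform it immediately.
 */
static bool toy_needs_tlb_flush(uint64_t old_spte, uint64_t new_spte)
{
	return (old_spte & TOY_SPTE_MMU_WRITEABLE) && !(new_spte & TOY_PT_WRITABLE);
}
```

The interesting sequence is dirty-log protection followed by shadow protection: the hardware bit is already clear when the second step runs, yet a flush is still required, which is why the check looks at the software bit of the old value rather than at the hardware bit.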

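So page_track is really the mechanism for handling write-protected pages.

Looking at the handlers currently registered for kvm_page_track_write, it is clear that the tracking goes through gfn_track: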

drivers/gpu/drm/i915/gvt/kvmgt.c
109:static void kvmgt_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa,
667:    vgpu->track_node.track_write = kvmgt_page_track_write;
1607:static void kvmgt_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa,

Reposting any article from this site to CSDN will be pursued as copyright infringement; any other use is free.