
posted interrupt

Basic posted-interrupt testing

enable_device_posted_irqs

On newer kernels this can be observed directly through the enable_device_posted_irqs parameter:

$ cat /sys/module/kvm_intel/parameters/enable_device_posted_irqs
N

It can also be turned off manually:

sudo modprobe kvm_intel enable_device_posted_irqs=0

The capability is read from hardware in map_iommu():

	iommu->cap = dmar_readq(iommu->reg + DMAR_CAP_REG);

set_irq_posting_cap() clears the posted-interrupt capability when it cannot be used, and KVM then checks the result via:

bool kvm_arch_has_irq_bypass(void)
{
	return enable_apicv && irq_remapping_cap(IRQ_POSTING_CAP);
}
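
For reference, irq_remapping_cap(IRQ_POSTING_CAP) reflects what set_irq_posting_cap() computed when the IOMMUs were initialized. The following is only a rough sketch of that logic, based on drivers/iommu/intel/irq_remapping.c; the exact conditions vary across kernel versions:

/* Sketch of set_irq_posting_cap(); approximate, not verbatim. */
static void set_irq_posting_cap(void)
{
	struct dmar_drhd_unit *drhd;
	struct intel_iommu *iommu;

	if (!disable_irq_post) {
		/* Posted-format IRTEs need atomic 128-bit updates (cmpxchg16b). */
		if (boot_cpu_has(X86_FEATURE_CX16))
			intel_irq_remap_ops.capability |= 1 << IRQ_POSTING_CAP;

		/* Every IOMMU must advertise PI support in DMAR_CAP_REG
		 * (cap_pi_support); otherwise the capability is cleared again. */
		for_each_iommu(iommu, drhd)
			if (!cap_pi_support(iommu->cap)) {
				intel_irq_remap_ops.capability &=
						~(1 << IRQ_POSTING_CAP);
				break;
			}
	}
}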

Performance comparison

The test environment is a 13900K with an ASUS motherboard; the passthrough device behind this board's IOMMU is:

 lspci -s 0000:02:00.0 -vv
02:00.0 Non-Volatile memory controller: Yangtze Memory Technologies Co.,Ltd ZHITAI TiPro7000 (rev 01) (prog-if 02 [NVM Express])
        Subsystem: Yangtze Memory Technologies Co.,Ltd ZHITAI TiPro7000
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 16

Inside the VM:

jobs: 1 (f=1): [r(1)][0.5%][r=971MiB/s][r=249k IOPS][eta 16m:36s]

On the physical host the same test reaches about 360k IOPS.

So the performance overhead is still quite significant.

Does SR-IOV also support posted interrupts?

No problem at all. SR-IOV appears to be a completely orthogonal feature; a VF simply shows up as one more vfio device:

 663:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0 IR-PCI-MSIX-0000:17:00.2    0-edge      vfio-msix[0](0000:17:00.2)

Key code and paths

How does an emulated device such as virtio-blk use posted interrupts when they are enabled?

vmx_deliver_interrupt -> kvm_vcpu_trigger_posted_interrupt

static inline void kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
						     int pi_vec)
{
#ifdef CONFIG_SMP
	if (vcpu->mode == IN_GUEST_MODE) {
		/*
		 * The vector of the virtual has already been set in the PIR.
		 * Send a notification event to deliver the virtual interrupt
		 * unless the vCPU is the currently running vCPU, i.e. the
		 * event is being sent from a fastpath VM-Exit handler, in
		 * which case the PIR will be synced to the vIRR before
		 * re-entering the guest.
		 *
		 * When the target is not the running vCPU, the following
		 * possibilities emerge:
		 *
		 * Case 1: vCPU stays in non-root mode. Sending a notification
		 * event posts the interrupt to the vCPU.
		 *
		 * Case 2: vCPU exits to root mode and is still runnable. The
		 * PIR will be synced to the vIRR before re-entering the guest.
		 * Sending a notification event is ok as the host IRQ handler
		 * will ignore the spurious event.
		 *
		 * Case 3: vCPU exits to root mode and is blocked. vcpu_block()
		 * has already synced PIR to vIRR and never blocks the vCPU if
		 * the vIRR is not empty. Therefore, a blocked vCPU here does
		 * not wait for any requested interrupts in PIR, and sending a
		 * notification event also results in a benign, spurious event.
		 */

		if (vcpu != kvm_get_running_vcpu())
			__apic_send_IPI_mask(get_cpu_mask(vcpu->cpu), pi_vec);
		return;
	}
#endif
	/*
	 * The vCPU isn't in the guest; wake the vCPU in case it is blocking,
	 * otherwise do nothing as KVM will grab the highest priority pending
	 * IRQ via ->sync_pir_to_irr() in vcpu_enter_guest().
	 */
	kvm_vcpu_wake_up(vcpu);
}

Here we still need to pin down what IN_GUEST_MODE and kvm_get_running_vcpu() actually mean.

So the three cases here are:

  1. vcpu->mode == IN_GUEST_MODE && vcpu != kvm_get_running_vcpu()
    • The target vCPU is running in guest mode on another pCPU, so a posted-interrupt notification IPI is needed. This shows up in /proc/interrupts as PIN: Posted-interrupt notification event, i.e. kvm_posted_intr_ipi() being invoked. The handler itself barely has to do anything, because the sender has already set the vector bit via pi_test_and_set_pir() (see the sketch after this list); the vCPU picks that bit up when it resumes execution.
  2. vcpu->mode == IN_GUEST_MODE && vcpu == kvm_get_running_vcpu()
    • The vCPU is in a fastpath VM-exit handler and is about to re-enter the guest on this pCPU, so nothing needs to be done.
  3. vcpu->mode != IN_GUEST_MODE
    • The vCPU has to be woken up, which goes through the ordinary process wake-up mechanism.
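
For the sender side, here is a simplified sketch based on vmx_deliver_posted_interrupt() in arch/x86/kvm/vmx/vmx.c (not the verbatim source; details differ across kernel versions):

/* Simplified sketch of the sender side, showing where pi_test_and_set_pir()
 * and the ON bit fit before kvm_vcpu_trigger_posted_interrupt() is called. */
static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
{
	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);

	/* 1. Record the vector in the PIR; if the bit was already set, done. */
	if (pi_test_and_set_pir(vector, pi_desc))
		return;

	/* 2. Set ON (outstanding notification); if it was already set, a
	 *    notification for an earlier vector is still pending and the
	 *    target will sync the whole PIR anyway. */
	if (pi_test_and_set_on(pi_desc))
		return;

	/* 3. Notify: IPI the pCPU running the vCPU, or wake a blocked vCPU. */
	kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_VECTOR);
}

On the receiving pCPU, the handler for the notification vector (sysvec_kvm_posted_intr_ipi, which is what the PIN counter in /proc/interrupts counts) essentially only acknowledges the APIC; the pending vectors are delivered either by hardware while the vCPU stays in non-root mode, or via sync_pir_to_irr on the next VM entry.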

So, as one might guess, interrupt injection by the IOMMU is a very similar process, except that the notification interrupt is triggered by the IOMMU hardware instead of by software.

To stress it once more: posted interrupts only exist when the IOMMU and APICv work together.

Is pi_desc just ordinary memory?

Yes.

/* Posted-Interrupt Descriptor */
struct pi_desc {
	unsigned long pir[NR_PIR_WORDS];     /* Posted interrupt requested */
	union {
		struct {
			u16	notifications; /* Suppress and outstanding bits */
			u8	nv;
			u8	rsvd_2;
			u32	ndst;
		};
		u64 control;
	};
	u32 rsvd[6];
} __aligned(64);
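
Both KVM and the IOMMU update this descriptor with ordinary atomic operations on plain, cache-coherent memory. As a sketch of the accessors in arch/x86/kvm/vmx/posted_intr.h (approximate, not verbatim):

/* The descriptor is just memory, updated with normal atomic bitops. */
static inline int pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
{
	/* One bit per vector (0..255) in the 256-bit PIR. */
	return test_and_set_bit(vector, (unsigned long *)pi_desc->pir);
}

static inline int pi_test_and_set_on(struct pi_desc *pi_desc)
{
	/* ON = "outstanding notification", bit 0 of the control word. */
	return test_and_set_bit(POSTED_INTR_ON, (unsigned long *)&pi_desc->control);
}

static inline void pi_set_sn(struct pi_desc *pi_desc)
{
	/* SN = "suppress notification", set while the vCPU is preempted. */
	set_bit(POSTED_INTR_SN, (unsigned long *)&pi_desc->control);
}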

In vmx_pi_update_irte:

		struct intel_iommu_pi_data pi_data = {
			.pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu)),
			.vector = vector,
		};
static struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
{
	return &(to_vt(vcpu)->pi_desc);
}
struct vcpu_vt {
	/* Posted interrupt descriptor */
	struct pi_desc pi_desc;

	/* Used if this vCPU is waiting for PI notification wakeup. */
	struct list_head pi_wakeup_list;

	union vmx_exit_reason exit_reason;

	unsigned long	exit_qualification;
	u32		exit_intr_info;

	/*
	 * If true, guest state has been loaded into hardware, and host state
	 * saved into vcpu_{vt,vmx,tdx}.  If false, host state is loaded into
	 * hardware.
	 */
	bool		guest_state_loaded;
	bool		emulation_required;

#ifdef CONFIG_X86_64
	u64		msr_host_kernel_gs_base;
#endif
};

Every vCPU has its own pi_desc, whose address has to be written into two places:

  1. In arch/x86/kvm/vmx/vmx.c:init_vmcs(), hand the address to the VMCS:
         vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->vt.pi_desc)));
    
  2. In drivers/iommu/intel/irq_remapping.c:
         irte_pi.pda_l = (pi_data->pi_desc_addr >>
                 (32 - PDA_LOW_BIT)) & ~(-1UL << PDA_LOW_BIT);
         irte_pi.pda_h = (pi_data->pi_desc_addr >> 32) &
                 ~(-1UL << PDA_HIGH_BIT);
    

The overall flow of interrupt injection is roughly as follows:

[!NOTE] Based on the magic conch's (an AI assistant's) explanation; still to be verified.

Device MSI
  ↓
VT-d Interrupt Remapping
  ↓  (look up the IRTE)
IRTE.p == 1 ?
  ↓ yes
IRTE.PDA --> PID (in memory)
  ↓
atomically set the vector bit in PID.PIR
  ↓
(if SN == 0) send the posted notification vector
  ↓
target pCPU's LAPIC
  ↓
VMCS.POSTED_INTR_DESC_ADDR points at the same PID

SN stands for Suppress Notification.

So note that the pi_desc is a per-vCPU scratchpad that the IRTE points to: no matter which pCPU the vCPU happens to be running on, the IRTE always points at the same fixed pi_desc, and interrupts are recorded into it.

Every vCPU also has its own VMCS, and at initialization time the VMCS simply gets associated with that vCPU's pi_desc.

How posted interrupts handle vCPU migration and interrupt binding inside the VM

vCPU migration

First, think about a simple question: what does vmx_vcpu_pi_put() decide when a vCPU is about to be put aside (put / block / scheduled out)?

vmx_vcpu_pi_put() is reached, for example, via the following path:

@[
        vmx_vcpu_pi_put+5
        vmx_vcpu_put+18
        kvm_arch_vcpu_put+297
        vcpu_put+25
        kvm_arch_vcpu_ioctl_run+525
        kvm_vcpu_ioctl+276
        __x64_sys_ioctl+150
        do_syscall_64+97
        entry_SYSCALL_64_after_hwframe+118
]: 53060
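
Roughly, what it decides is the following; this is a simplified sketch based on recent upstream arch/x86/kvm/vmx/posted_intr.c, and details vary by kernel version:

/* Simplified sketch of vmx_vcpu_pi_put(). */
void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
{
	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);

	/* Only relevant when device posted interrupts can target this vCPU. */
	if (!vmx_needs_pi_wakeup(vcpu))
		return;

	/* Blocking (HLT etc.): switch the notification vector to the wakeup
	 * vector and put the vCPU on this pCPU's wakeup_vcpus_on_cpu list. */
	if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
		pi_enable_wakeup_handler(vcpu);

	/* Preempted: suppress notifications (SN) until the vCPU is loaded
	 * again; the PIR still accumulates pending vectors meanwhile. */
	if (vcpu->preempted)
		pi_set_sn(pi_desc);
}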

Next, this vCPU is placed on the pCPU's wakeup_vcpus_on_cpu list; when the wakeup notification interrupt later arrives, sysvec_kvm_posted_intr_wakeup_ipi runs on that pCPU (the CPU the vCPU was last running on) and performs the wake-up:

	list_add_tail(&vt->pi_wakeup_list,
		      &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
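
When the wakeup vector fires, sysvec_kvm_posted_intr_wakeup_ipi ends up in pi_wakeup_handler(), which walks that per-CPU list. A sketch follows; older kernels keep these fields in struct vcpu_vmx, newer ones moved them into struct vcpu_vt as shown earlier:

/* Sketch of pi_wakeup_handler(), run on the pCPU that owns the list. */
void pi_wakeup_handler(void)
{
	int cpu = smp_processor_id();
	struct list_head *wakeup_list = &per_cpu(wakeup_vcpus_on_cpu, cpu);
	struct vcpu_vmx *vmx;

	raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
	list_for_each_entry(vmx, wakeup_list, pi_wakeup_list) {
		/* Wake any blocked vCPU whose descriptor has ON set. */
		if (pi_test_on(&vmx->pi_desc))
			kvm_vcpu_wake_up(&vmx->vcpu);
	}
	raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
}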

Now back to the question: what happens when a vCPU moves to a different pCPU?

vmx_vcpu_pi_load() updates ndst accordingly:

	do {
		new.control = old.control;

		/*
		 * Clear SN (as above) and refresh the destination APIC ID to
		 * handle task migration (@cpu != vcpu->cpu).
		 */
		new.ndst = dest;
		__pi_clear_sn(&new);

		/*
		 * Restore the notification vector; in the blocking case, the
		 * descriptor was modified on "put" to use the wakeup vector.
		 */
		new.nv = POSTED_INTR_VECTOR;
	} while (pi_try_set_control(pi_desc, &old.control, new.control));

How interrupt binding for a passthrough device works inside the VM

It is exactly the same mechanism: the guest writes the MSI-X table, and those writes are trapped and acted upon.

A merely passed-through device (vfio_msihandler): vCPU migration and interrupt binding inside the VM

The irq handler's argument is an eventfd_ctx, and injection then goes through the irqfd/GSI machinery:

@[
        vmx_deliver_interrupt+5
        __apic_accept_irq+251
        kvm_irq_delivery_to_apic_fast+336
        kvm_arch_set_irq_inatomic+217
        irqfd_wakeup+275
        __wake_up_common+120
        eventfd_signal_mask+112
        vfio_msihandler+19
        __handle_irq_event_percpu+85
        handle_irq_event+56
        handle_edge_irq+199
        __common_interrupt+76
        common_interrupt+128
        asm_common_interrupt+38
        cpuidle_enter_state+211
        cpuidle_enter+45
        cpuidle_idle_call+241
        do_idle+119
        cpu_startup_entry+41
        start_secondary+296
        common_startup_64+318
]: 393269
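
For context on how that eventfd is tied to a guest GSI in the first place: user space (QEMU's VFIO code) registers the eventfd with KVM through the KVM_IRQFD ioctl. Below is an illustrative user-space sketch; vm_fd and guest_gsi are placeholders, not names from the original text:

#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Illustrative only: vm_fd is an already-created KVM VM fd, guest_gsi the
 * GSI that the guest's MSI/MSI-X entry was routed to. */
static int wire_irqfd(int vm_fd, int guest_gsi)
{
	int irq_eventfd = eventfd(0, EFD_CLOEXEC);
	struct kvm_irqfd irqfd;

	memset(&irqfd, 0, sizeof(irqfd));
	irqfd.fd = irq_eventfd;   /* same eventfd handed to VFIO for this MSI vector */
	irqfd.gsi = guest_gsi;    /* signaling the eventfd now injects this GSI */

	return ioctl(vm_fd, KVM_IRQFD, &irqfd);
}

When irq bypass / posted interrupts are available, VFIO and KVM additionally connect as producer and consumer so that vmx_pi_update_irte() can switch the IRTE into posted format, and this host-side eventfd hop is skipped on the hot path.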

So vCPU migration clearly has no impact here: the injection target is the vCPU itself, and which pCPU that vCPU currently runs on is vmx_deliver_interrupt's problem.

If the interrupt affinity of the passthrough device is changed inside the VM, the result is exactly what we would expect.

Observed in the host kernel:

@[
        irqfd_update+1
        kvm_irq_routing_update+167
        kvm_set_irq_routing+494
        kvm_vm_ioctl+1543
        __x64_sys_ioctl+150
        do_syscall_64+97
        entry_SYSCALL_64_after_hwframe+118
]: 20

How the IRTE is shared

If the interrupt affinity is changed:

echo 10 | sudo tee /proc/irq/368/smp_affinity_list
@[
        __modify_irte_ga.isra.0+1
        irte_ga_set_affinity+72
        amd_ir_set_affinity+122
        msi_domain_set_affinity+79
        irq_do_set_affinity+207
        irq_set_affinity_locked+235
        __irq_set_affinity+72
        write_irq_affinity.isra.0+229
        proc_reg_write+89
        vfs_write+207
        ksys_write+99
        do_syscall_64+97
        entry_SYSCALL_64_after_hwframe+118
]: 2

If the interrupt is in posted mode, the modification goes through the path mentioned above.

What is the relationship between intel_ir_set_affinity and intel_ir_set_vcpu_affinity?

Documentation
