posted interrupt
Basic posted interrupt tests
enable_device_posted_irqs
On newer kernels this can be observed directly through the enable_device_posted_irqs parameter:
cat /sys/module/kvm_intel/parameters/enable_device_posted_irqs
N
It can also be turned off manually:
sudo modprobe kvm_intel enable_device_posted_irqs=0
The capability register is read in map_iommu:
iommu->cap = dmar_readq(iommu->reg + DMAR_CAP_REG);
The posted-interrupt capability is then set or cleared in set_irq_posting_cap.
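A condensed sketch of that logic, simplified from drivers/iommu/intel/irq_remapping.c (details vary by kernel version):

static void set_irq_posting_cap(void)
{
        struct dmar_drhd_unit *drhd;
        struct intel_iommu *iommu;

        if (!disable_irq_post) {
                /* Assume posting is supported... */
                intel_irq_remap_ops.capability |= 1 << IRQ_POSTING_CAP;

                /* ...then clear it if any IOMMU lacks PI support in CAP_REG. */
                for_each_iommu(iommu, drhd)
                        if (!cap_pi_support(iommu->cap)) {
                                intel_irq_remap_ops.capability &=
                                                ~(1 << IRQ_POSTING_CAP);
                                break;
                        }
        }
}

KVM in turn gates IRQ bypass on exactly this capability: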
bool kvm_arch_has_irq_bypass(void)
{
return enable_apicv && irq_remapping_cap(IRQ_POSTING_CAP);
}
Performance comparison
The test environment is a 13900K on an ASUS motherboard. The NVMe device used for testing:
lspci -s 0000:02:00.0 -vv
02:00.0 Non-Volatile memory controller: Yangtze Memory Technologies Co.,Ltd ZHITAI TiPro7000 (rev 01) (prog-if 02 [NVM Express])
Subsystem: Yangtze Memory Technologies Co.,Ltd ZHITAI TiPro7000
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 16
Inside the VM:
jobs: 1 (f=1): [r(1)][0.5%][r=971MiB/s][r=249k IOPS][eta 16m:36s]
On the bare-metal host: 360k IOPS.
So the performance overhead is still quite significant.
Does SR-IOV also support posted interrupts?
No problem at all. SR-IOV looks like a completely decoupled feature: a VF simply shows up as one more VFIO device.
663: 0 0 0 ... 0   IR-PCI-MSIX-0000:17:00.2   0-edge   vfio-msix[0](0000:17:00.2)   (all per-CPU counts are 0)
Key code and paths
- arch/x86/kvm/vmx/posted_intr.c : not much code in here
- arch/x86/include/asm/posted_intr.h
When posted interrupts are enabled:
- pi_update_irte
  - irq_set_vcpu_affinity : kernel/irq/manage.c
    - chip->irq_set_vcpu_affinity, which dispatches to one of:
      - intel_ir_set_vcpu_affinity
        - modify_irte
      - amd_ir_set_vcpu_affinity
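For reference, a simplified sketch of how the generic layer reaches the remapping driver's callback, condensed from irq_set_vcpu_affinity() in kernel/irq/manage.c (the irq_data hierarchy walk and some error handling are trimmed):

int irq_set_vcpu_affinity(unsigned int irq, void *vcpu_info)
{
        unsigned long flags;
        struct irq_desc *desc = irq_get_desc_lock(irq, &flags, 0);
        struct irq_data *data;
        struct irq_chip *chip;
        int ret = -ENOSYS;

        if (!desc)
                return -EINVAL;

        data = irq_desc_get_irq_data(desc);
        chip = irq_data_get_irq_chip(data);
        /* On Intel this is intel_ir_set_vcpu_affinity(); on AMD,
         * amd_ir_set_vcpu_affinity(). */
        if (chip && chip->irq_set_vcpu_affinity)
                ret = chip->irq_set_vcpu_affinity(data, vcpu_info);

        irq_put_desc_unlock(desc, flags);
        return ret;
}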
How a virtual device such as virtio-blk uses posted interrupts
vmx_deliver_interrupt -> kvm_vcpu_trigger_posted_interrupt
static inline void kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
                                                     int pi_vec)
{
#ifdef CONFIG_SMP
        if (vcpu->mode == IN_GUEST_MODE) {
                /*
                 * The vector of the virtual has already been set in the PIR.
                 * Send a notification event to deliver the virtual interrupt
                 * unless the vCPU is the currently running vCPU, i.e. the
                 * event is being sent from a fastpath VM-Exit handler, in
                 * which case the PIR will be synced to the vIRR before
                 * re-entering the guest.
                 *
                 * When the target is not the running vCPU, the following
                 * possibilities emerge:
                 *
                 * Case 1: vCPU stays in non-root mode. Sending a notification
                 * event posts the interrupt to the vCPU.
                 *
                 * Case 2: vCPU exits to root mode and is still runnable. The
                 * PIR will be synced to the vIRR before re-entering the guest.
                 * Sending a notification event is ok as the host IRQ handler
                 * will ignore the spurious event.
                 *
                 * Case 3: vCPU exits to root mode and is blocked. vcpu_block()
                 * has already synced PIR to vIRR and never blocks the vCPU if
                 * the vIRR is not empty. Therefore, a blocked vCPU here does
                 * not wait for any requested interrupts in PIR, and sending a
                 * notification event also results in a benign, spurious event.
                 */
                if (vcpu != kvm_get_running_vcpu())
                        __apic_send_IPI_mask(get_cpu_mask(vcpu->cpu), pi_vec);
                return;
        }
#endif
        /*
         * The vCPU isn't in the guest; wake the vCPU in case it is blocking,
         * otherwise do nothing as KVM will grab the highest priority pending
         * IRQ via ->sync_pir_to_irr() in vcpu_enter_guest().
         */
        kvm_vcpu_wake_up(vcpu);
}
What IN_GUEST_MODE and kvm_get_running_vcpu() mean here still needs to be pinned down:
- vcpu->mode == IN_GUEST_MODE does not simply mean the vCPU is executing guest code at this instant; it is set around VM-entry and stays set while a fastpath VM-exit handler runs
- kvm_get_running_vcpu() does not mean that either; it returns the vCPU loaded on the current pCPU, i.e. the one whose context this code runs in. Otherwise it could never happen that the target vCPU is "in guest mode" while the current CPU is that same vCPU and is executing this very function
So the three cases are:
- vcpu->mode == IN_GUEST_MODE && vcpu != kvm_get_running_vcpu()
  - The interrupt must be delivered to a vCPU running on another pCPU, so a posted-interrupt notification IPI is required. This can be observed in /proc/interrupts under "PIN: Posted-interrupt notification event", i.e. kvm_posted_intr_ipi() being invoked. The handler itself needs to do almost nothing, because pi_test_and_set_pir() has already recorded the vector; the vCPU will notice the bit when it next starts executing.
- vcpu->mode == IN_GUEST_MODE && vcpu == kvm_get_running_vcpu()
  - The vCPU is in a fastpath VM-exit handler and will immediately re-enter the guest on this pCPU, so nothing needs to be done; the PIR is synced to the vIRR before re-entry.
- vcpu->mode != IN_GUEST_MODE
  - The vCPU has to be woken up through the ordinary process wakeup path.
So interrupt injection by the IOMMU is essentially the same process; the only difference is that the notification is triggered by the IOMMU rather than by an IPI from another pCPU.
To emphasize it once more: posted interrupts exist only because two features, the IOMMU and APICv, cooperate.
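Putting the pieces together, here is a condensed sketch of the posting step that runs before kvm_vcpu_trigger_posted_interrupt(), based on the older upstream vmx_deliver_posted_interrupt() (newer kernels fold this into vmx_deliver_interrupt, so treat it as illustrative):

static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
{
        struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);

        /* Record the vector in the PIR; if the bit was already set,
         * a previous posting is still pending and we are done. */
        if (pi_test_and_set_pir(vector, pi_desc))
                return 0;

        /* If ON was already set, a notification is already in flight. */
        if (pi_test_and_set_on(pi_desc))
                return 0;

        /* Otherwise notify: IPI the target pCPU or wake the vCPU,
         * exactly the three cases discussed above. */
        kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_VECTOR);
        return 0;
}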
Is pi_desc just ordinary memory?
Yes.
/* Posted-Interrupt Descriptor */
struct pi_desc {
        unsigned long pir[NR_PIR_WORDS];     /* Posted interrupt requested */
        union {
                struct {
                        u16     notifications; /* Suppress and outstanding bits */
                        u8      nv;
                        u8      rsvd_2;
                        u32     ndst;
                };
                u64 control;
        };
        u32 rsvd[6];
} __aligned(64);
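Consistent with that, the helpers that manipulate the PIR are plain atomic bitops on this memory; paraphrased from arch/x86/include/asm/posted_intr.h (exact signatures vary by kernel version):

static inline int pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
{
        return test_and_set_bit(vector, (unsigned long *)pi_desc->pir);
}

static inline bool pi_test_and_set_on(struct pi_desc *pi_desc)
{
        return test_and_set_bit(POSTED_INTR_ON,
                                (unsigned long *)&pi_desc->control);
}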
In vmx_pi_update_irte:
struct intel_iommu_pi_data pi_data = {
        .pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu)),
        .vector = vector,
};
static struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
{
        return &(to_vt(vcpu)->pi_desc);
}
struct vcpu_vt {
        /* Posted interrupt descriptor */
        struct pi_desc pi_desc;

        /* Used if this vCPU is waiting for PI notification wakeup. */
        struct list_head pi_wakeup_list;

        union vmx_exit_reason exit_reason;

        unsigned long   exit_qualification;
        u32             exit_intr_info;

        /*
         * If true, guest state has been loaded into hardware, and host state
         * saved into vcpu_{vt,vmx,tdx}. If false, host state is loaded into
         * hardware.
         */
        bool            guest_state_loaded;
        bool            emulation_required;

#ifdef CONFIG_X86_64
        u64             msr_host_kernel_gs_base;
#endif
};
Each vCPU has one pi_desc, and its address has to be written into two places:
- arch/x86/kvm/vmx/vmx.c:init_vmcs(), which simply hands the address to the VMCS:
  vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->vt.pi_desc)));
- drivers/iommu/intel/irq_remapping.c, which encodes it into the posted-format IRTE:
  irte_pi.pda_l = (pi_data->pi_desc_addr >> (32 - PDA_LOW_BIT)) & ~(-1UL << PDA_LOW_BIT);
  irte_pi.pda_h = (pi_data->pi_desc_addr >> 32) & ~(-1UL << PDA_HIGH_BIT);

The whole interrupt-injection flow looks roughly like this:
[!NOTE] Based on advice from the magic conch (an LLM); still to be verified.
Device MSI
    ↓
VT-d Interrupt Remapping
    ↓ (IRTE lookup)
IRTE.p == 1 ?
    ↓ yes
IRTE.PDA --> PID (in memory)
    ↓
atomically set the bit in PID.PIR
    ↓
(if SN == 0) send the posted notification vector
    ↓
target pCPU's LAPIC
    ↓
VMCS.POSTED_INTR_DESC_ADDR points at the same PID
SN is the Suppress Notification bit.
So note that the pi_desc is a per-vCPU notepad that the IRTE points to: no matter which pCPU the vCPU is running on, the IRTE always points at the same fixed pi_desc, and interrupts get recorded into it.
Each vCPU also has its own VMCS; at initialization time the VMCS is simply associated with that vCPU's pi_desc.
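As a sanity check on the PDA encoding above, here is a hypothetical helper (not in the kernel) that reassembles the descriptor's physical address from the IRTE fields; it assumes PDA_LOW_BIT == 26 as in drivers/iommu/intel/irq_remapping.c, and relies on the descriptor being 64-byte aligned so the low 6 bits are implicitly zero:

static u64 irte_pi_desc_addr(const struct irte *irte)
{
        /* pda_l holds bits 6..31 of the address, pda_h holds bits 32..63. */
        return ((u64)irte->pda_h << 32) |
               ((u64)irte->pda_l << (32 - PDA_LOW_BIT));
}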
How posted interrupts handle vCPU migration and in-guest interrupt affinity
vCPU migration
Start with a simpler question:
When a vCPU is about to be descheduled (put / block / schedule out), vmx_vcpu_pi_put() decides whether to:
- keep allowing a posted interrupt to wake the vCPU through the notification IRQ, i.e. pi_enable_wakeup_handler, or
- set Suppress Notification so that only the PIR bitmap is updated and the host CPU is left alone, i.e. pi_set_sn
vmx_vcpu_pi_put keeps the wakeup path only for a vCPU that is all of the following (a sketch of the logic follows the list):
- not preempted by the host (i.e. its timeslice did not just run out)
- sleeping inside the guest (the VM executed mwait / hlt)
- currently able to accept interrupts
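A condensed sketch of that decision, from vmx_vcpu_pi_put() in arch/x86/kvm/vmx/posted_intr.c (helper names are upstream, but the exact conditions shift a little between kernel versions):

void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
{
        struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);

        if (!vmx_needs_pi_wakeup(vcpu))
                return;

        /* Blocking in the guest with interrupts enabled: switch the
         * notification vector to the wakeup vector so a posted interrupt
         * can wake the vCPU. */
        if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
                pi_enable_wakeup_handler(vcpu);

        /* Merely preempted: suppress notifications and just let vectors
         * accumulate in the PIR. */
        if (vcpu->preempted)
                pi_set_sn(pi_desc);
}

The bpftrace stack below shows this firing on every vcpu_put: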
@[
vmx_vcpu_pi_put+5
vmx_vcpu_put+18
kvm_arch_vcpu_put+297
vcpu_put+25
kvm_arch_vcpu_ioctl_run+525
kvm_vcpu_ioctl+276
__x64_sys_ioctl+150
do_syscall_64+97
entry_SYSCALL_64_after_hwframe+118
]: 53060
Next, this vCPU is added to the pCPU's wakeup_vcpus_on_cpu list; when the notification interrupt arrives, sysvec_kvm_posted_intr_wakeup_ipi runs and performs the wakeup. That pCPU is the one the vCPU was last running on:
list_add_tail(&vt->pi_wakeup_list,
              &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
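The wakeup IPI handler then walks this per-CPU list; condensed from pi_wakeup_handler() in arch/x86/kvm/vmx/posted_intr.c (the per-CPU spinlock is omitted, and vt_to_vcpu() stands in for whatever container_of mapping the kernel version at hand uses):

void pi_wakeup_handler(void)
{
        int cpu = smp_processor_id();
        struct list_head *wakeup_list = &per_cpu(wakeup_vcpus_on_cpu, cpu);
        struct vcpu_vt *vt;

        list_for_each_entry(vt, wakeup_list, pi_wakeup_list) {
                /* Only wake vCPUs whose descriptor has a pending notification. */
                if (pi_test_on(&vt->pi_desc))
                        kvm_vcpu_wake_up(vt_to_vcpu(vt));
        }
}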
So, back to the question: what happens when a vCPU switches to another pCPU?
vmx_vcpu_pi_load updates ndst:
do {
        new.control = old.control;

        /*
         * Clear SN (as above) and refresh the destination APIC ID to
         * handle task migration (@cpu != vcpu->cpu).
         */
        new.ndst = dest;
        __pi_clear_sn(&new);

        /*
         * Restore the notification vector; in the blocking case, the
         * descriptor was modified on "put" to use the wakeup vector.
         */
        new.nv = POSTED_INTR_VECTOR;
} while (pi_try_set_control(pi_desc, &old.control, new.control));
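pi_try_set_control() returns true when the cmpxchg failed, so the do/while loop retries until the 64-bit control word is swapped atomically (paraphrased from arch/x86/include/asm/posted_intr.h):

static inline bool pi_try_set_control(struct pi_desc *pi, u64 *pold, u64 new)
{
        /* The IOMMU may be setting PIR bits concurrently; only the
         * control word is swapped here. */
        return !try_cmpxchg64(&pi->control, pold, new);
}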
The flow when the guest binds a passthrough device's interrupt
It is exactly the same mechanism: let the guest write the MSI-X table, and trap and react to those writes:
- __clone3
- start_thread
- qemu_thread_start
- kvm_vcpu_thread_fn
- kvm_cpu_exec
- address_space_rw
- address_space_write
- flatview_write
- flatview_write_continue
- flatview_write_continue_step
- memory_region_dispatch_write
- access_with_adjusted_size
- memory_region_write_accessor
- vfio_msix_vector_release
- vfio_device_irq_set_signaling
- vfio_device_io_set_irqs
@[
intel_ir_set_vcpu_affinity+5
irq_set_vcpu_affinity+206
kvm_pi_update_irte+131
kvm_arch_irq_bypass_del_producer+95
__disconnect+59
irq_bypass_unregister_producer+51
vfio_msi_set_vector_signal+104
vfio_msi_set_block+89
vfio_pci_core_ioctl+684
vfio_device_fops_unl_ioctl+140
__x64_sys_ioctl+150
do_syscall_64+97
entry_SYSCALL_64_after_hwframe+118
]: 1
@[
intel_ir_set_vcpu_affinity+5
irq_set_vcpu_affinity+206
vmx_pi_update_irte+84
kvm_pi_update_irte+131
kvm_arch_irq_bypass_add_producer+143
__connect+88
irq_bypass_register_producer+181
vfio_msi_set_vector_signal+427
vfio_msi_set_block+89
vfio_pci_core_ioctl+684
vfio_device_fops_unl_ioctl+140
__x64_sys_ioctl+150
do_syscall_64+97
entry_SYSCALL_64_after_hwframe+118
]: 1
In other words, the IRTE is simply reprogrammed, which is entirely reasonable.
Plain passthrough (vfio_msihandler): vCPU migration and in-guest interrupt binding
The irq handler's argument is the eventfd_ctx, and delivery still goes through the GSI mechanism:
@[
vmx_deliver_interrupt+5
__apic_accept_irq+251
kvm_irq_delivery_to_apic_fast+336
kvm_arch_set_irq_inatomic+217
irqfd_wakeup+275
__wake_up_common+120
eventfd_signal_mask+112
vfio_msihandler+19
__handle_irq_event_percpu+85
handle_irq_event+56
handle_edge_irq+199
__common_interrupt+76
common_interrupt+128
asm_common_interrupt+38
cpuidle_enter_state+211
cpuidle_enter+45
cpuidle_idle_call+241
do_idle+119
cpu_startup_entry+41
start_secondary+296
common_startup_64+318
]: 393269
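The hard-IRQ half is tiny; a paraphrase of vfio_msihandler() from drivers/vfio/pci/vfio_pci_intrs.c (older kernels pass an extra count argument to eventfd_signal()):

static irqreturn_t vfio_msihandler(int irq, void *arg)
{
        struct eventfd_ctx *trigger = arg;

        /* Signal the irqfd; KVM's irqfd_wakeup() then injects the vector. */
        eventfd_signal(trigger);
        return IRQ_HANDLED;
}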
So vCPU migration clearly has no effect here: the injection target is the vCPU itself, and finding whichever pCPU the vCPU is on is vmx_deliver_interrupt's job.
If the guest changes the interrupt affinity of the passthrough device, the result matches our expectations exactly:
- __clone3
- start_thread
- qemu_thread_start
- kvm_vcpu_thread_fn
- kvm_cpu_exec
- address_space_rw
- address_space_write
- flatview_write
- flatview_write_continue
- flatview_write_continue_step
- memory_region_dispatch_write
- access_with_adjusted_size
- memory_region_write_accessor
- msix_handle_mask_update
- msix_fire_vector_notifier
- vfio_msix_vector_use
- vfio_msix_vector_do_use
- vfio_update_kvm_msi_virq
- kvm_irqchip_commit_routes : issues KVM_SET_GSI_ROUTING
Observed in the kernel:
@[
irqfd_update+1
kvm_irq_routing_update+167
kvm_set_irq_routing+494
kvm_vm_ioctl+1543
__x64_sys_ioctl+150
do_syscall_64+97
entry_SYSCALL_64_after_hwframe+118
]: 20
How the IRTE is shared
If the interrupt affinity is changed on the host:
echo 10 | sudo tee /proc/irq/368/smp_affinity_list
@[
__modify_irte_ga.isra.0+1
irte_ga_set_affinity+72
amd_ir_set_affinity+122
msi_domain_set_affinity+79
irq_do_set_affinity+207
irq_set_affinity_locked+235
__irq_set_affinity+72
write_irq_affinity.isra.0+229
proc_reg_write+89
vfs_write+207
ksys_write+99
do_syscall_64+97
entry_SYSCALL_64_after_hwframe+118
]: 2
If instead the interrupt being modified is one with posted interrupts enabled, the change goes through the intel_ir_set_vcpu_affinity path shown above.
Then what exactly is the relationship between intel_ir_set_affinity and intel_ir_set_vcpu_affinity?
References
- Intel SDM Vol 3, Section 29.6 “Posted-Interrupt Processing”