The number of mq queues equals the number of hctx (hardware contexts).

QEMU sets the number of queues based on the number of CPUs:

static void virtio_blk_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
{
    VirtIOBlkPCI *dev = VIRTIO_BLK_PCI(vpci_dev);
    DeviceState *vdev = DEVICE(&dev->vdev);
    VirtIOBlkConf *conf = &dev->vdev.conf;

    if (conf->num_queues == VIRTIO_BLK_AUTO_NUM_QUEUES) {
        conf->num_queues = virtio_pci_optimal_num_queues(0);
    }

    if (vpci_dev->nvectors == DEV_NVECTORS_UNSPECIFIED) {
        vpci_dev->nvectors = conf->num_queues + 1;
    }

    qdev_realize(vdev, BUS(&vpci_dev->bus), errp);
}

The virtio_blk driver reads num_queues from QEMU:

static int init_vq(struct virtio_blk *vblk)
{
	int err;
	unsigned short i;
	vq_callback_t **callbacks;
	const char **names;
	struct virtqueue **vqs;
	unsigned short num_vqs;
	unsigned short num_poll_vqs;
	struct virtio_device *vdev = vblk->vdev;
	struct irq_affinity desc = { 0, };

	err = virtio_cread_feature(vdev, VIRTIO_BLK_F_MQ,
				   struct virtio_blk_config, num_queues,
				   &num_vqs);
	if (err)
		num_vqs = 1;

virtblk_probe then initializes the hardware queues from num_queues:


	vblk->tag_set.nr_hw_queues = vblk->num_vqs;

How to get the number of NVMe hardware queues

Reference: https://news.ycombinator.com/item?id=28706762

Samsung’s current flagship 980 PRO consumer drive supports 128 queues, but the previous generation (970 PRO/EVO/EVO Plus) only supported 32. Their first two generations were limited to 8 and 7 queues. I wouldn’t be surprised if these limits also applied to their entry-level enterprise SSDs that used the same controllers.

🤒  sudo nvme get-feature /dev/nvme0n1 -f 7
get-feature:0x07 (Number of Queues), Current value:0x00070007 # TiPro7100  # to be confirmed

🧀  sudo nvme get-feature /dev/nvme1n1  -f 7
get-feature:0x07 (Number of Queues), Current value:0x00400040 # TiPro7000  # 32 hctx

🧀  sudo nvme get-feature /dev/nvme2n1  -f 7
get-feature:0x07 (Number of Queues), Current value:0x000f000f # MAXIO 1602 # 16 hctx

From the software side, hctx = min(number of hardware queues, number of CPUs).

🧀  sudo nvme get-feature /dev/nvme1n1 -f 7
get-feature:0x07 (Number of Queues), Current value:0x003e003e

🧀  lspci -nn | grep -i non
c8:00.0 Non-Volatile memory controller [0108]: PETAIO INC PETA8118 NVMe SSD Series [1ee4:1180] (rev 01)

🧀  cat /proc/interrupts | grep -i nvme | wc -l
60

[!NOTE] Based on the Magic Conch's opinion; not yet verified

Number of Submission Queues: NSQ = 0x3e + 1 = 63
Number of Completion Queues: NCQ = 0x3e + 1 = 63

So this all matches expectations.
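The decoding used above can be written down as a small helper. This is a minimal sketch, assuming the usual layout of the Number of Queues feature value (submission queue count in the low 16 bits, completion queue count in the high 16 bits, both zero-based). The names `nvme_queue_counts`, `decode_num_queues` and `expected_hctx` are made up for illustration and are not part of nvme-cli or the kernel.

```c
#include <stdint.h>

/* Decoded result of the NVMe "Number of Queues" feature (FID 0x07). */
struct nvme_queue_counts {
	unsigned int nsq;	/* I/O submission queues allocated */
	unsigned int ncq;	/* I/O completion queues allocated */
};

/*
 * Assumed layout: bits 15:0 hold the submission queue count and
 * bits 31:16 the completion queue count. Both are zero-based, so add 1.
 */
static struct nvme_queue_counts decode_num_queues(uint32_t value)
{
	struct nvme_queue_counts c;

	c.nsq = (value & 0xffff) + 1;
	c.ncq = (value >> 16) + 1;
	return c;
}

/* Hypothetical helper for hctx = min(hardware queues, CPUs). */
static unsigned int expected_hctx(struct nvme_queue_counts c,
				  unsigned int nr_cpus)
{
	unsigned int hwq = c.nsq < c.ncq ? c.nsq : c.ncq;

	return hwq < nr_cpus ? hwq : nr_cpus;
}
```

With the values shown earlier, 0x003e003e decodes to 63 queues of each kind and 0x00070007 to 8. 0x00400040 decodes to 65, which a machine with 32 CPUs would cap at 32 hctx.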

/sys/block/*/inflight

Confirmed: this counts the requests that have been sent to the device.

Every block device has this file, but why is it not located under queue/:

cat /sys/block/sda/inflight

The kernel collects the counters in:
struct mq_inflight {
	struct block_device *part;
	unsigned int inflight[2];
};

The corresponding backtrace is:

#0  blk_mq_in_flight_rw (q=0xffff8881047a14a0, part=part@entry=0xffff888101a0ce00, inflight=inflight@entry=0xffffc900400a3db8) at block/blk-mq.c:156
#1  0xffffffff816d862a in part_inflight_show (dev=0xffff888101a0ce48, attr=<optimized out>, buf=0xffff888145862000 "") at block/genhd.c:1009
#2  0xffffffff81a6d9e8 in dev_attr_show (kobj=<optimized out>, attr=0xffffffff82df5740 <dev_attr_inflight>, buf=<optimized out>) at drivers/base/core.c:2196
#3  0xffffffff8146dc8b in sysfs_kf_seq_show (sf=0xffff8881460b1258, v=<optimized out>) at fs/sysfs/file.c:59
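To watch this counter from user space, the file can be parsed into the same two slots that struct mq_inflight holds. A minimal sketch, assuming the file contains two unsigned integers (reads in flight, then writes in flight) on one line; `inflight_counts`, `parse_inflight` and `read_inflight` are illustrative names, not an existing API.

```c
#include <stdio.h>

/* Mirrors the two counters of struct mq_inflight: [0] reads, [1] writes. */
struct inflight_counts {
	unsigned int reads;
	unsigned int writes;
};

/* Parse one line as printed by /sys/block/<disk>/inflight; 0 on success. */
static int parse_inflight(const char *line, struct inflight_counts *out)
{
	if (sscanf(line, "%u %u", &out->reads, &out->writes) != 2)
		return -1;
	return 0;
}

/* Read and parse a sysfs inflight file; the path is given by the caller. */
static int read_inflight(const char *path, struct inflight_counts *out)
{
	char line[64];
	FILE *f = fopen(path, "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(line, sizeof(line), f))
		ret = parse_inflight(line, out);
	fclose(f);
	return ret;
}
```

Calling `read_inflight("/sys/block/sda/inflight", &c)` in a loop while running an I/O benchmark is one way to check that the numbers follow the load on the device.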

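Meaning of /sys/block/*/queue/nr_requests

nr_requests is the number of requests that one hardware queue of this disk accepts. When a scheduler is in use, nr_requests controls the scheduler; when no scheduler is in use, it sets the size of the hardware queue.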

Documentation/ABI/stable/sysfs-block

nr_requests (RW)
----------------
This controls how many requests may be allocated in the block layer for
read or write requests. Note that the total allocated number may be twice
this amount, since it applies only to reads or writes (not the accumulated
sum).

To avoid priority inversion through request starvation, a request
queue maintains a separate request pool per each cgroup when
CONFIG_BLK_CGROUP is enabled, and this parameter applies to each such
per-block-cgroup request pool.  IOW, if there are N block cgroups,
each request queue may have up to N request pools, each independently
regulated by nr_requests.

This field is updated in blk_mq_update_nr_requests.

cmd_per_lun

This looks closely related, doesn't it?

  find /sys -name cmd_per_lun -exec cat {} \;
find: ‘/sys/kernel/debug’: Permission denied
0
0
0
0
0
0
0
0
0
0
0
0
256

can_queue

find /sys -name can_queue -exec cat {} \;
/sys/devices/pci0000:00/0000:00:17.0/ata1/host0/scsi_host/host0/can_queue
/sys/devices/pci0000:00/0000:00:17.0/ata8/host7/scsi_host/host7/can_queue
/sys/devices/pci0000:00/0000:00:17.0/ata6/host5/scsi_host/host5/can_queue
/sys/devices/pci0000:00/0000:00:17.0/ata4/host3/scsi_host/host3/can_queue
/sys/devices/pci0000:00/0000:00:17.0/ata2/host1/scsi_host/host1/can_queue
/sys/devices/pci0000:00/0000:00:17.0/ata7/host6/scsi_host/host6/can_queue
/sys/devices/pci0000:00/0000:00:17.0/ata5/host4/scsi_host/host4/can_queue
/sys/devices/pci0000:00/0000:00:17.0/ata3/host2/scsi_host/host2/can_queue

Observed on a server:

find /sys -name can_queue -exec cat {} \;
find: ‘/sys/kernel/debug’: Permission denied
32
32
32
32
32
32
32
32
32
32
32
32
5089
  1. Obtain it

  2. Pass it to the block layer

can_queue in SCSI is the queue_depth of the tag set; scsi_mq_setup_tags assigns it:

	tag_set->queue_depth = shost->can_queue;

nr_requests affects the sbitmap

- queue_requests_store
  - blk_mq_update_nr_requests
    - blk_mq_tag_update_depth
      - blk_mq_alloc_map_and_rqs
      - sbitmap_queue_resize
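The effect of the call chain above can be modelled in a few lines. This is a toy model, not kernel code. It assumes that without a scheduler the driver tags cannot grow beyond the depth the driver allocated, while with a scheduler only the scheduler tags are resized. All names (`toy_queue`, `toy_update_nr_requests`) are hypothetical, and the real limits differ between kernel versions.

```c
#include <stdbool.h>

/* Toy model of one request queue; field names are illustrative only. */
struct toy_queue {
	bool has_scheduler;		/* an I/O scheduler is attached */
	unsigned int hw_depth;		/* tag_set queue_depth, set by the driver */
	unsigned int nr_requests;	/* value shown in sysfs */
};

/*
 * Model of writing nr_requests. Returns 0 on success, -1 if rejected.
 * Assumption: without a scheduler the request cannot exceed hw_depth;
 * with a scheduler the hardware depth is left untouched.
 */
static int toy_update_nr_requests(struct toy_queue *q, unsigned int nr)
{
	if (nr == 0)
		return -1;
	if (!q->has_scheduler && nr > q->hw_depth)
		return -1;
	q->nr_requests = nr;
	return 0;
}
```

In this model a queue with hw_depth 32 and no scheduler rejects a write of 64 but accepts 16; the same queue with a scheduler accepts 64 because only the scheduler side is resized.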

scsi_device::queue_depth and scsi_device::max_queue_depth

Let's continue by analyzing this scenario:

echo 1 > /sys/block/sda/device/queue_depth

queue_depth takes effect in the SCSI budget mechanism:

#0  scsi_dev_queue_ready (q=0xffff88810b428a98, sdev=0xffff88810c46d000) at drivers/scsi/scsi_lib.c:1253
#1  scsi_mq_get_budget (q=0xffff88810b428a98) at drivers/scsi/scsi_lib.c:1662
#2  0xffffffff8190f9ac in blk_mq_get_dispatch_budget (q=0xffff88810b428a98) at block/blk-mq.h:254
#3  blk_mq_get_budget_and_tag (rq=rq@entry=0xffff88810c0d6800) at block/blk-mq.c:2634
#4  0xffffffff81910863 in blk_mq_try_issue_directly (hctx=hctx@entry=0xffff88810bcfc200, rq=rq@entry=0xffff88810c0d6800) at block/blk-mq.c:2665
#5  0xffffffff81911754 in blk_mq_submit_bio (bio=<optimized out>) at block/blk-mq.c:3047
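The budget idea in this backtrace can be illustrated with a counter. A minimal sketch under the assumption that a command must take one budget slot before dispatch and that the number of slots is bounded by queue_depth. The kernel tracks this differently; `toy_sdev`, `toy_get_budget` and `toy_put_budget` are illustrative names only.

```c
#include <stdbool.h>

/* Toy per-device state; names are illustrative, not the kernel's layout. */
struct toy_sdev {
	unsigned int queue_depth;	/* limit set via sysfs or the driver */
	unsigned int busy;		/* commands currently holding a budget */
};

/* Take one budget slot before dispatch; fails once queue_depth is reached. */
static bool toy_get_budget(struct toy_sdev *sdev)
{
	if (sdev->busy >= sdev->queue_depth)
		return false;
	sdev->busy++;
	return true;
}

/* Return the slot when the command completes or dispatch is abandoned. */
static void toy_put_budget(struct toy_sdev *sdev)
{
	if (sdev->busy > 0)
		sdev->busy--;
}
```

With queue_depth set to 1, as in the echo command above, the second `toy_get_budget` call fails until the first command returns its slot, so requests are issued to the device one at a time.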

Where queue_depth is initialized

Initialized in virtscsi_probe:

	shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);

The initial value passed to scsi_change_queue_depth (in scsi_alloc_sdev) is:

	depth = sdev->host->cmd_per_lun ?: 1;
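The two expressions can be checked outside the kernel. A small sketch with hypothetical helper names; `a ?: b` is the GNU shorthand for `a ? a : b`, spelled out here so the code is standard C.

```c
#include <stdint.h>

/* Same clamp as min_t(u32, cmd_per_lun, can_queue). */
static uint32_t toy_host_cmd_per_lun(uint32_t cmd_per_lun, uint32_t can_queue)
{
	return cmd_per_lun < can_queue ? cmd_per_lun : can_queue;
}

/* Same fallback as "cmd_per_lun ?: 1": a host value of 0 becomes depth 1. */
static uint32_t toy_initial_queue_depth(uint32_t host_cmd_per_lun)
{
	return host_cmd_per_lun ? host_cmd_per_lun : 1;
}
```

So a host that reports cmd_per_lun as 0 starts each device with a depth of 1 under this rule, unless the driver changes it later.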

Initialization and use of max_queue_depth

Both initialization (scsi_add_lun) and modification through sysfs (sdev_store_queue_depth) end up setting max_queue_depth = queue_depth.

max_queue_depth is used in scsi_handle_queue_ramp_up.

In scsi_decide_disposition and scsi_eh_completed_normally, when a command completes with good status and the queue can therefore take more, scsi_handle_queue_ramp_up raises queue_depth, with max_queue_depth as the upper limit.

scsi_handle_queue_full, in contrast, lowers queue_depth.
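The interplay of the two fields can be sketched as a tiny state machine. This is a simplified model: it assumes steps of one and leaves out the time-based rate limiting and the driver callbacks that the real code goes through. All names are hypothetical.

```c
/* Toy model of the two depth fields of a SCSI device. */
struct toy_depth {
	unsigned int queue_depth;	/* current limit */
	unsigned int max_queue_depth;	/* ceiling for ramp-up */
};

/* Model of a sysfs write: the current depth and the ceiling are both set. */
static void toy_store_queue_depth(struct toy_depth *d, unsigned int depth)
{
	d->queue_depth = depth;
	d->max_queue_depth = depth;
}

/* Device reported queue full: back off by one, never below 1. */
static void toy_handle_queue_full(struct toy_depth *d)
{
	if (d->queue_depth > 1)
		d->queue_depth--;
}

/* Successful completion: creep back up by one, never above the ceiling. */
static void toy_handle_ramp_up(struct toy_depth *d)
{
	if (d->queue_depth < d->max_queue_depth)
		d->queue_depth++;
}
```

After `toy_store_queue_depth(&d, 4)`, two queue-full events bring the depth to 2, and later successful completions raise it back to 4 but never beyond.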

request_queue::queue_depth

Changing scsi_device::queue_depth also updates request_queue::queue_depth.

It is actually used in only a few places, and changing scsi_device::queue_depth does not affect nr_requests.

blk_mq_tag_set

This is worth organizing as well:

One HBA or NVMe controller corresponds to one tag_set.

For NVMe, the blk_mq_tag_set is embedded in nvme_dev:

struct nvme_dev {
	struct nvme_queue *queues;
	struct blk_mq_tag_set tagset;

Similarly:

struct Scsi_Host {

	/* Area to keep a shared tag map */
	struct blk_mq_tag_set	tag_set;

The creation process, using NVMe as an example:

blk_mq_tag_set::nr_hw_queues

nvme_alloc_io_tag_set derives it from nvme_ctrl::queue_count. The latter is incremented by 1 on each call to nvme_alloc_queue, and nvme_create_io_queues calls nvme_alloc_queue repeatedly (this part has not been fully analyzed).

blk_mq_tag_set::queue_depth: hardware queue depth

nvme_alloc_io_tag_set takes it from nvme_ctrl::sqsize, which is initialized in nvme_setup_io_queues.

What is the relationship between request_queue::queue_depth and request_queue::nr_requests

/**
 * blk_set_queue_depth - tell the block layer about the device queue depth
 * @q:		the request queue for the device
 * @depth:		queue depth
 *
 */
void blk_set_queue_depth(struct request_queue *q, unsigned int depth)
{
	q->queue_depth = depth;
	rq_qos_queue_depth_changed(q);
}

https://patchwork.kernel.org/project/linux-scsi/patch/1597850436-116171-18-git-send-email-john.garry@huawei.com/

With this backtrace it is easy to see that the number of tags is initialized from tag_set::queue_depth:

[    0.904009]  dump_stack_lvl+0x64/0x80
[    0.904009]  blk_mq_init_bitmaps+0x3b/0xc0
[    0.904009]  blk_mq_init_tags+0x7d/0xb0
[    0.904009]  blk_mq_alloc_map_and_rqs+0x4e/0x380
[    0.904009]  blk_mq_alloc_tag_set+0x1a4/0x3f0
[    0.904009]  scsi_add_host_with_dma+0xd0/0x370
[    0.904009]  virtscsi_probe+0x2ba/0x390
[    0.904009]  virtio_dev_probe+0x1b2/0x270
[    0.904009]  really_probe+0xbc/0x2c0
[    0.904009]  ? __pfx___driver_attach+0x10/0x10
[    0.904009]  __driver_probe_device+0x73/0x120
[    0.904009]  driver_probe_device+0x1f/0xe0
[    0.904009]  __driver_attach+0x88/0x180
[    0.904009]  bus_for_each_dev+0x85/0xd0
[    0.904009]  bus_add_driver+0xec/0x1f0
[    0.904009]  driver_register+0x59/0x100
[    0.904009]  ? __pfx_virtio_scsi_init+0x10/0x10
[    0.904009]  virtio_scsi_init+0x64/0xd0
[    0.904009]  ? __pfx_virtio_scsi_init+0x10/0x10
[    0.904009]  do_one_initcall+0x58/0x230
[    0.904009]  kernel_init_freeable+0x1c4/0x300
[    0.904009]  ? __pfx_kernel_init+0x10/0x10
[    0.904009]  kernel_init+0x1a/0x1c0
[    0.904009]  ret_from_fork+0x31/0x50
[    0.904009]  ? __pfx_kernel_init+0x10/0x10
[    0.904009]  ret_from_fork_asm+0x1b/0x30
[    0.904009]  </TASK>
[    0.904009] blk-mq: depth=256
[    0.904009] blk-mq: depth=256

NVMe does something similar:

Typical mq functions

1.

#define queue_for_each_hw_ctx(q, hctx, i)				\
	xa_for_each(&(q)->hctx_table, (i), (hctx))

#define hctx_for_each_ctx(hctx, ctx, i)					\
	for ((i) = 0; (i) < (hctx)->nr_ctx &&				\
	     ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++)
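  2. blk_mq_queue_tag_busy_iter iterates over the requests of one disk that have been submitted to a hardware queue or have gone beyond that stage. blk_mq_tagset_busy_iter does the same for a whole tag set: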
int scsi_host_busy(struct Scsi_Host *shost)
{
	int cnt = 0;

	blk_mq_tagset_busy_iter(&shost->tag_set,
				scsi_host_check_in_flight, &cnt);
	return cnt;
}

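Tags are per hctx

(2026-04-15: apparently not always. Multiple hctx can in fact share tags, which seems to be related to shared tags?)

Tag allocation actually relies on sbitmap, which provides the implementation.

The real locking granularity is at __sbitmap_get_word, that is, one word of the bitmap.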
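To make the word-level granularity concrete, here is a user-space sketch of claiming a bit in a single word with C11 atomics. It is not the kernel's sbitmap code. It only illustrates that contention is resolved per word with an atomic read-modify-write instead of a lock around the whole map; `toy_get_word` and `toy_put_word` are made-up names.

```c
#include <limits.h>
#include <stdatomic.h>

#define TOY_BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

/*
 * Try to claim a free bit among the first 'depth' bits of one word.
 * Returns the bit index, or -1 if all of them are taken.
 */
static int toy_get_word(atomic_ulong *word, unsigned int depth)
{
	if (depth > TOY_BITS_PER_WORD)
		depth = TOY_BITS_PER_WORD;

	for (;;) {
		unsigned long cur = atomic_load(word);
		unsigned int bit;

		/* Find the first zero bit in our snapshot of the word. */
		for (bit = 0; bit < depth; bit++)
			if (!(cur & (1UL << bit)))
				break;
		if (bit == depth)
			return -1;

		/* Another thread may have taken the bit; retry if so. */
		if (!(atomic_fetch_or(word, 1UL << bit) & (1UL << bit)))
			return (int)bit;
	}
}

/* Release a previously claimed bit. */
static void toy_put_word(atomic_ulong *word, unsigned int bit)
{
	atomic_fetch_and(word, ~(1UL << bit));
}
```

A full map would be an array of such words, so threads working on different words do not contend with each other at all.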

Each blk_mq_tag_set is per HBA, while each blk_mq_tags is per hardware queue.

/*
 * @tags:	   Tag sets. One tag set per hardware queue. Has @nr_hw_queues
 *		   elements.
 */
struct blk_mq_tag_set {
  // ...
  unsigned int		nr_hw_queues;
  // ...
	struct blk_mq_tags	**tags;

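A disk can have at most nr_hw_queues * blk_mq_tags::nr_tags requests outstanding.

Two requests that are in flight at the same time can carry the same tag number, as long as they belong to different hardware queues.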
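Because a tag is only unique within its hardware queue, a globally unique identifier has to combine the queue index with the tag. The kernel offers blk_mq_unique_tag() for this purpose; the sketch below is a user-space model of the same idea, assuming a 16-bit split, with `toy_` names that are purely illustrative.

```c
#include <stdint.h>

#define TOY_UNIQUE_TAG_BITS	16
#define TOY_UNIQUE_TAG_MASK	((1U << TOY_UNIQUE_TAG_BITS) - 1)

/* Combine hardware queue index (high half) and per-queue tag (low half). */
static uint32_t toy_unique_tag(uint16_t hwq, uint16_t tag)
{
	return ((uint32_t)hwq << TOY_UNIQUE_TAG_BITS) |
	       (tag & TOY_UNIQUE_TAG_MASK);
}

static uint16_t toy_unique_tag_to_hwq(uint32_t unique_tag)
{
	return (uint16_t)(unique_tag >> TOY_UNIQUE_TAG_BITS);
}

static uint16_t toy_unique_tag_to_tag(uint32_t unique_tag)
{
	return (uint16_t)(unique_tag & TOY_UNIQUE_TAG_MASK);
}

/* Upper bound on outstanding requests: one tag space per hardware queue. */
static unsigned int toy_max_inflight(unsigned int nr_hw_queues,
				     unsigned int nr_tags)
{
	return nr_hw_queues * nr_tags;
}
```

Tag 5 on queue 0 and tag 5 on queue 1 therefore map to different unique values, and 4 queues with 256 tags each allow up to 1024 outstanding requests in this model.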
