Skip to the content.

bio request request_queue 三者的关系

[!NOTE] 参考 Deepseeek ,有待验证 实话实说,总结的很好:

bio (Block I/O) 是块设备I/O的基本容器。当内核的上层(如文件系统、虚拟内存/交换系统、LVM等)需要读写数据时,它们不关心磁盘的物理特性或如何优化性能,它们只关心一件事: “我需要把这些内存页(pages)的数据,写入到这个块设备的这个逻辑扇区(sector)上,长度是多少。”

在 blk-mq 框架下,I/O的路径发生了变化:

Request queues keep track of outstanding block I/O requests.

Request queues also implement a plug-in interface that allows the use of multiple I/O schedulers (or elevators) to be used.

TODO

bio bio_vec 和 bio_set 关系是什么?

https://zhuanlan.zhihu.com/p/545906763 中的总结:

struct 磁盘物理地址 用户文件地址 内存物理地址
bio_vec 地址连续 地址连续 地址连续
bio 地址连续 地址连续 地址不连续
request 地址连续 地址不连续 地址不连续

bio request 面对了三个层次,也就是物理内存,文件和物理磁盘

bio 贴近上层,所以可以知道文件地址的情况,所以其工作是合并连续的文件 而 request 是驱动,知道磁盘布局。

bio_set 不是这个体系下讨论的话题,是主要用于 bio 内存分配的,类似 mempool

/**
 * struct bio_vec - a contiguous range of physical memory addresses
 * @bv_page:   First page associated with the address range.
 * @bv_len:    Number of bytes in the address range.
 * @bv_offset: Start of the address range relative to the start of @bv_page.
 *
 * The following holds for a bvec if n * PAGE_SIZE < bv_offset + bv_len:
 *
 *   nth_page(@bv_page, n) == @bv_page + n
 *
 * This holds because page_is_mergeable() checks the above property.
 */
struct bio_vec {
	struct page	*bv_page;
	unsigned int	bv_len;
	unsigned int	bv_offset;
};

继续思考的问题:

  1. 一个 page 比一个 block 更大,导致一个 page 可能映射到多个不连续的 disk block 中。
  2. 从 iovec 到 bio 的转换过程是什么样子的

request 容易理解,如果多个 bio 对应的 disk 地址连续,那么将他们都加进来。

__blkdev_direct_IO 中,分配的 bio vector 的数量正好是 nr_pages 的数量:

		bio = bio_alloc(bdev, nr_pages, opf, GFP_KERNEL);

似乎需要首先理解 iomap 和 buffer head 才可以

submit_bio

  1. 这是 fs 和 dev 进行 io 的唯一的接口吗 ?
  2. submit_bio 和 generic_make_request 的关系是什么 ? 两者都是类似接口,但是 submit_bio 做出部分检查,最后调用 generic_make_request

iomap 直接来调用 submit_bio,有趣

bio

/*
 * main unit of I/O for the block layer and lower layers (ie drivers and
 * stacking drivers)
 */
struct bio {
	struct bio		*bi_next;	/* request queue link */
	struct block_device	*bi_bdev;
	blk_opf_t		bi_opf;		/* bottom bits REQ_OP, top bits
						 * req_flags.
						 */
	unsigned short		bi_flags;	/* BIO_* below */
	unsigned short		bi_ioprio;
	blk_status_t		bi_status;
	atomic_t		__bi_remaining;

	struct bvec_iter	bi_iter;

	blk_qc_t		bi_cookie;
	bio_end_io_t		*bi_end_io;
	void			*bi_private;
#ifdef CONFIG_BLK_CGROUP
	/*
	 * Represents the association of the css and request_queue for the bio.
	 * If a bio goes direct to device, it will not have a blkg as it will
	 * not have a request_queue associated with it.  The reference is put
	 * on release of the bio.
	 */
	struct blkcg_gq		*bi_blkg;
	struct bio_issue	bi_issue;
#ifdef CONFIG_BLK_CGROUP_IOCOST
	u64			bi_iocost_cost;
#endif
#endif

#ifdef CONFIG_BLK_INLINE_ENCRYPTION
	struct bio_crypt_ctx	*bi_crypt_context;
#endif

	union {
#if defined(CONFIG_BLK_DEV_INTEGRITY)
		struct bio_integrity_payload *bi_integrity; /* data integrity */
#endif
	};

	unsigned short		bi_vcnt;	/* how many bio_vec's */

	/*
	 * Everything starting with bi_max_vecs will be preserved by bio_reset()
	 */

	unsigned short		bi_max_vecs;	/* max bvl_vecs we can hold */

	atomic_t		__bi_cnt;	/* pin count */

	struct bio_vec		*bi_io_vec;	/* the actual vec list */

	struct bio_set		*bi_pool;

	/*
	 * We can inline a number of vecs at the end of the bio, to avoid
	 * double allocations for a small number of bio_vecs. This member
	 * MUST obviously be kept at the very end of the bio.
	 */
	struct bio_vec		bi_inline_vecs[];
};

[ ] bio:bi_opf

enum req_flag_bits {
	__REQ_FAILFAST_DEV =	/* no driver retries of device errors */
		REQ_OP_BITS,
	__REQ_FAILFAST_TRANSPORT, /* no driver retries of transport errors */
	__REQ_FAILFAST_DRIVER,	/* no driver retries of driver errors */
	__REQ_SYNC,		/* request is sync (sync write or read) */
	__REQ_META,		/* metadata io request */
	__REQ_PRIO,		/* boost priority in cfq */
	__REQ_NOMERGE,		/* don't touch this for merging */
	__REQ_IDLE,		/* anticipate more IO after this one */
	__REQ_INTEGRITY,	/* I/O includes block integrity payload */
	__REQ_FUA,		/* forced unit access */
	__REQ_PREFLUSH,		/* request for cache flush */
	__REQ_RAHEAD,		/* read ahead, can fail anytime */
	__REQ_BACKGROUND,	/* background IO */
	__REQ_NOWAIT,           /* Don't wait if request will block */
	/*
	 * When a shared kthread needs to issue a bio for a cgroup, doing
	 * so synchronously can lead to priority inversions as the kthread
	 * can be trapped waiting for that cgroup.  CGROUP_PUNT flag makes
	 * submit_bio() punt the actual issuing to a dedicated per-blkcg
	 * work item to avoid such priority inversions.
	 */
	__REQ_CGROUP_PUNT,
	__REQ_POLLED,		/* caller polls for completion using bio_poll */
	__REQ_ALLOC_CACHE,	/* allocate IO from cache if available */
	__REQ_SWAP,		/* swap I/O */
	__REQ_DRV,		/* for driver use */

	/*
	 * Command specific flags, keep last:
	 */
	/* for REQ_OP_WRITE_ZEROES: */
	__REQ_NOUNMAP,		/* do not free blocks when zeroing */

	__REQ_NR_BITS,		/* stops here */
};

bio::bio_vec

/**
 * struct bio_vec - a contiguous range of physical memory addresses
 * @bv_page:   First page associated with the address range.
 * @bv_len:    Number of bytes in the address range.
 * @bv_offset: Start of the address range relative to the start of @bv_page.
 *
 * The following holds for a bvec if n * PAGE_SIZE < bv_offset + bv_len:
 *
 *   nth_page(@bv_page, n) == @bv_page + n
 *
 * This holds because page_is_mergeable() checks the above property.
 */
struct bio_vec {
	struct page	*bv_page;
	unsigned int	bv_len;
	unsigned int	bv_offset;
};

确实就是图上的样子。

bio::bi_iter

在一个静态环境中的时间:

@[
    bio_alloc_bioset+5
    ext4_mpage_readpages+1100
    read_pages+88
    page_cache_ra_unbounded+315
    filemap_fault+1459
    __do_fault+49
    do_fault+435
    __handle_mm_fault+1581
    handle_mm_fault+259
    do_user_addr_fault+464
    exc_page_fault+107
    asm_exc_page_fault+38
]: 2742
struct bvec_iter {
	sector_t		bi_sector;	/* device address in 512 byte
						   sectors */
	unsigned int		bi_size;	/* residual I/O count */

	unsigned int		bi_idx;		/* current index into bvl_vec */

	unsigned int            bi_bvec_done;	/* number of bytes completed in
						   current bvec */
} __packed;

似乎是 bi_idxbi_bvec_done 的确是记录提交的顺序的,但是没有绝对的证据

好奇怪,

#0  bvec_iter_advance (bytes=1310720, iter=0xffff88810d641f20, bv=0xffff888105bb3000) at ./include/linux/bvec.h:143
#1  bio_advance_iter (bytes=1310720, iter=0xffff88810d641f20, bio=0xffff88810d641f00) at ./include/linux/bio.h:106
#2  __bio_advance (bio=bio@entry=0xffff88810d641f00, bytes=bytes@entry=1310720) at block/bio.c:1392
#3  0xffffffff81739441 in bio_advance (nbytes=<optimized out>, bio=0xffff88810d641f00) at ./include/linux/bio.h:142
#4  bio_split (bio=bio@entry=0xffff88810d641f00, sectors=<optimized out>, gfp=gfp@entry=3072, bs=bs@entry=0xffffffff81738fa4 <bio_alloc_bioset+628>) at block/bio.c:1646
#5  0xffffffff8174505f in bio_split_rw (bio=bio@entry=0xffff88810d641f00, lim=<optimized out>, segs=0xffff88810d641f00, segs@entry=0xffffc90001c87804, bs=0xffffffff81738fa4 <bio_alloc_bioset+628>, max_bytes=<optimized out>) at block/blk-merge.c:337
#6  0xffffffff81745a45 in __bio_split_to_limits (bio=bio@entry=0xffff88810d641f00, lim=lim@entry=0xffff888107591e08, nr_segs=nr_segs@entry=0xffffc90001c87804) at block/blk-mer ge.c:370
#7  0xffffffff8174e863 in blk_mq_submit_bio (bio=0xffff88810d641f00) at block/blk-mq.c:2943
#8  0xffffffff8173e817 in __submit_bio (bio=bio@entry=0xffff88810d641f00) at block/blk-core.c:602
#9  0xffffffff8173ee92 in __submit_bio_noacct_mq (bio=0xffff88810d641f00) at block/blk-core.c:679
#10 submit_bio_noacct_nocheck (bio=<optimized out>) at block/blk-core.c:708
#11 0xffffffff8173f060 in submit_bio_noacct (bio=<optimized out>) at block/blk-core.c:807
#12 0xffffffff81526f0e in ext4_mpage_readpages (inode=<optimized out>, rac=<optimized out>, page=<optimized out>) at fs/ext4/readpage.c:402
#13 0xffffffff8133f428 in read_pages (rac=rac@entry=0xffffc90001c87ab0) at mm/readahead.c:161
#14 0xffffffff8133f73b in page_cache_ra_unbounded (ractl=ractl@entry=0xffffc90001c87ab0, nr_to_read=513, lookahead_size=<optimized out>) at mm/readahead.c:270
#15 0xffffffff8133fd00 in do_page_cache_ra (lookahead_size=<optimized out>, nr_to_read=<optimized out>, ractl=0xffffc90001c87ab0) at mm/readahead.c:300
#16 0xffffffff81332d24 in do_sync_mmap_readahead (vmf=0xffffc90001c87b98) at mm/filemap.c:3190
#17 filemap_fault (vmf=0xffffc90001c87b98) at mm/filemap.c:3282
#18 0xffffffff81374d61 in __do_fault (vmf=vmf@entry=0xffffc90001c87b98) at mm/memory.c:4155
#19 0xffffffff81379505 in do_cow_fault (vmf=0xffffc90001c87b98) at mm/memory.c:4536
#20 do_fault (vmf=vmf@entry=0xffffc90001c87b98) at mm/memory.c:4637
#21 0xffffffff8137e1df in handle_pte_fault (vmf=0xffffc90001c87b98) at mm/memory.c:4923
#22 __handle_mm_fault (vma=vma@entry=0xffff88810611d688, address=address@entry=94523413734384, flags=flags@entry=533) at mm/memory.c:5065
#23 0xffffffff8137eeea in handle_mm_fault (vma=0xffff88810611d688, address=address@entry=94523413734384, flags=<optimized out>, flags@entry=533, regs=regs@entry=0xffffc90001c87cf8) at mm/memory.c:5211
#24 0xffffffff811211f6 in do_user_addr_fault (regs=0xffffc90001c87cf8, error_code=2, address=94523413734384) at arch/x86/mm/fault.c:1407
#25 0xffffffff82293c80 in handle_page_fault (address=94523413734384, error_code=2, regs=0xffffc90001c87cf8) at arch/x86/mm/fault.c:1498
#26 exc_page_fault (regs=0xffffc90001c87cf8, error_code=2) at arch/x86/mm/fault.c:1554
#27 0xffffffff82401286 in asm_exc_page_fault () at ./arch/x86/include/asm/idtentry.h:570

这个是证据,在

#0  req_bio_endio (error=0 '\000', nbytes=4096, bio=0xffff88810cbd6000, rq=0xffff8881099e6480) at block/blk-mq.c:789
#1  blk_update_request (req=req@entry=0xffff8881099e6480, error=error@entry=0 '\000', nr_bytes=4096) at block/blk-mq.c:927
#2  0xffffffff8174c29e in blk_mq_end_request (rq=rq@entry=0xffff8881099e6480, error=error@entry=0 '\000') at block/blk-mq.c:1054
#3  0xffffffff81b72c6a in virtblk_request_done (req=0xffff8881099e6480) at drivers/block/virtio_blk.c:348
#4  virtblk_handle_req (vq=vq@entry=0xffff8881095fc180, iob=iob@entry=0xffffc900002e0ef8) at drivers/block/virtio_blk.c:376
#5  0xffffffff81b72d9f in virtblk_done (vq=0xffff888107860200) at drivers/block/virtio_blk.c:394
#6  0xffffffff818b39ab in vring_interrupt (irq=<optimized out>, _vq=0x1 <fixed_percpu_data+1>) at drivers/virtio/virtio_ring.c:2491
#7  vring_interrupt (irq=<optimized out>, _vq=0x1 <fixed_percpu_data+1>) at drivers/virtio/virtio_ring.c:2466
#8  0xffffffff811c926d in __handle_irq_event_percpu (desc=desc@entry=0xffff888109678000) at kernel/irq/handle.c:158
#9  0xffffffff811c9478 in handle_irq_event_percpu (desc=0xffff888109678000) at kernel/irq/handle.c:193
#10 handle_irq_event (desc=desc@entry=0xffff888109678000) at kernel/irq/handle.c:210
#11 0xffffffff811ce39b in handle_edge_irq (desc=0xffff888109678000) at kernel/irq/chip.c:819
#12 0xffffffff810d9d9f in generic_handle_irq_desc (desc=0xffff888109678000) at ./include/linux/irqdesc.h:158
#13 handle_irq (regs=<optimized out>, desc=0xffff888109678000) at arch/x86/kernel/irq.c:231
#14 __common_interrupt (regs=<optimized out>, vector=8) at arch/x86/kernel/irq.c:250
#15 0xffffffff82290e7f in common_interrupt (regs=0xffffc900000d7e38, error_code=<optimized out>) at arch/x86/kernel/irq.c:240

而且从 raid1 中使用的 bio_split 也可以发现 bio 的返回就是做左向右的。

bio::bi_end_io

submit_bio 是异步的,但是可以修改为同步的,通过设置 bio::bi_end_io 来提醒结束,否则始终等待在 上面。

#0  io_schedule_timeout (timeout=timeout@entry=9223372036854775807) at kernel/sched/core.c:8794
#1  0xffffffff821906a3 in do_wait_for_common (state=2, timeout=9223372036854775807, action=0xffffffff8218ea70 <io_schedule_timeout>, x=0xffffc9003f94fe20) at kernel/sched/build_utility.c:3653
#2  __wait_for_common (x=x@entry=0xffffc9003f94fe20, action=0xffffffff8218ea70 <io_schedule_timeout>, timeout=timeout@entry=9223372036854775807, state=state@entry=2) at kernel/sched/build_utility.c:3674
#3  0xffffffff8219099f in wait_for_common_io (state=2, timeout=9223372036854775807, x=0xffffc9003f94fe20) at kernel/sched/build_utility.c:3691
#4  0xffffffff816ba46a in submit_bio_wait (bio=bio@entry=0xffffc9003f94fe58) at block/bio.c:1388
#5  0xffffffff816c408b in blkdev_issue_flush (bdev=<optimized out>) at block/blk-flush.c:467
#6  0xffffffff8148bf39 in ext4_sync_file (file=0xffff888145091e00, start=<optimized out>, end=<optimized out>, datasync=<optimized out>) at fs/ext4/fsync.c:177
#7  0xffffffff8140af15 in vfs_fsync_range (datasync=1, end=9223372036854775807, start=0, file=0xffff888145091e00) at fs/sync.c:188
#8  vfs_fsync (datasync=1, file=0xffff888145091e00) at fs/sync.c:202
#9  do_fsync (datasync=1, fd=<optimized out>) at fs/sync.c:212
#10 __do_sys_fdatasync (fd=<optimized out>) at fs/sync.c:225
#11 __se_sys_fdatasync (fd=<optimized out>) at fs/sync.c:223
#12 __x64_sys_fdatasync (regs=<optimized out>) at fs/sync.c:223
#13 0xffffffff82180c3f in do_syscall_x64 (nr=<optimized out>, regs=0xffffc9003f94ff58) at arch/x86/entry/common.c:50
#14 do_syscall_64 (regs=0xffffc9003f94ff58, nr=<optimized out>) at arch/x86/entry/common.c:80
#15 0xffffffff822000ae in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120

通知 bio 结束的执行流程:

#0  bio_endio (bio=bio@entry=0xffff888104430100) at block/bio.c:1577
#1  0xffffffff816cd00c in req_bio_endio (error=0 '\000', nbytes=4096, bio=0xffff888104430100, rq=0xffff8881042d1680) at block/blk-mq.c:794
#2  blk_update_request (req=req@entry=0xffff8881042d1680, error=error@entry=0 '\000', nr_bytes=4096) at block/blk-mq.c:926
#3  0xffffffff816cd349 in blk_mq_end_request (rq=0xffff8881042d1680, error=0 '\000') at block/blk-mq.c:1053
#4  0xffffffff81aa9e99 in virtblk_done (vq=0xffff8881008a7000) at drivers/block/virtio_blk.c:291
#5  0xffffffff818134f9 in vring_interrupt (irq=<optimized out>, _vq=<optimized out>) at drivers/virtio/virtio_ring.c:2470
#6  vring_interrupt (irq=<optimized out>, _vq=<optimized out>) at drivers/virtio/virtio_ring.c:2445
#7  0xffffffff811a3d35 in __handle_irq_event_percpu (desc=desc@entry=0xffff888100cb7a00) at kernel/irq/handle.c:158
#8  0xffffffff811a3f13 in handle_irq_event_percpu (desc=0xffff888100cb7a00) at kernel/irq/handle.c:193
#9  handle_irq_event (desc=desc@entry=0xffff888100cb7a00) at kernel/irq/handle.c:210
#10 0xffffffff811a8bee in handle_edge_irq (desc=0xffff888100cb7a00) at kernel/irq/chip.c:819
#11 0xffffffff810ce1d8 in generic_handle_irq_desc (desc=0xffff888100cb7a00) at ./include/linux/irqdesc.h:158
#12 handle_irq (regs=<optimized out>, desc=0xffff888100cb7a00) at arch/x86/kernel/irq.c:231
#13 __common_interrupt (regs=<optimized out>, vector=35) at arch/x86/kernel/irq.c:250
#14 0xffffffff8218a897 in common_interrupt (regs=0xffffc900000a3e38, error_code=<optimized out>) at arch/x86/kernel/irq.c:240

几乎每一个和 bio 打交道的都会对应的注册 bio::bi_end_io

request

	/*
	 * The hash is used inside the scheduler, and killed once the
	 * request reaches the dispatch list. The ipi_list is only used
	 * to queue the request for softirq completion, which is long
	 * after the request has been unhashed (and even removed from
	 * the dispatch list).
	 */
	union {
		struct hlist_node hash;	/* merge hash */
		struct llist_node ipi_list;
	};

task_struct::bio_list

/*
 * The loop in this function may be a bit non-obvious, and so deserves some
 * explanation:
 *
 *  - Before entering the loop, bio->bi_next is NULL (as all callers ensure
 *    that), so we have a list with a single bio.
 *  - We pretend that we have just taken it off a longer list, so we assign
 *    bio_list to a pointer to the bio_list_on_stack, thus initialising the
 *    bio_list of new bios to be added.  ->submit_bio() may indeed add some more
 *    bios through a recursive call to submit_bio_noacct.  If it did, we find a
 *    non-NULL value in bio_list and re-enter the loop from the top.
 *  - In this case we really did just take the bio of the top of the list (no
 *    pretending) and so remove it from bio_list, and call into ->submit_bio()
 *    again.
 *
 * bio_list_on_stack[0] contains bios submitted by the current ->submit_bio.
 * bio_list_on_stack[1] contains bios that were submitted before the current
 *	->submit_bio, but that haven't been processed yet.
 */
static void __submit_bio_noacct(struct bio *bio)
{
	struct bio_list bio_list_on_stack[2];

正如 submit_bio_noacct_nocheck 中间的注释所说,这个是为了处理 ->submit_bio 的递归调用的。

关键参考

以 mddev::queue 到 mddev::gendisk::queue 的转变

之前的时候 mddev 中引用了 request_queue ,但是后来移除掉了

struct mddev {

  struct request_queue                *queue;        /* for plugging ... */

commit 396799eb5b6f (“md: remove mddev->queue”) commit 176df894d797 (“md: add a mddev_is_dm helper”)

后来直接变为使用 mddev::gendisk

/*
 * MD devices can be used undeneath by DM, in which case ->gendisk is NULL.
 */
static inline bool mddev_is_dm(struct mddev *mddev)
{
	return !mddev->gendisk;
}

gendisk::queue

首先,gendisk 是对应一个盘,而不是一个分区的。

struct gendisk {
	/*
	 * major/first_minor/minors should not be set by any new driver, the
	 * block core will take care of allocating them automatically.
	 */
	int major;
	int first_minor;
	int minors;
	// ...
}

正如我们常说的,一个盘对应一个 request_queue

热插一个 virtio scsi disk 的结果:

当 scsi_alloc_sdev 的时候,通过 blk_mq_alloc_queue 从 tag_set 哪里分配一个 mq ,最后提供给 __alloc_disk_node

@[
        __alloc_disk_node+5
        blk_mq_alloc_disk_for_queue+51
        sd_probe+173
        really_probe+219
        __driver_probe_device+120
        driver_probe_device+31
        __device_attach_driver+137
        bus_for_each_drv+144
        __device_attach_async_helper+163
        async_run_entry_fn+49
        process_one_work+368
        worker_thread+597
        kthread+236
        ret_from_fork+49
        ret_from_fork_asm+26
]: 1

mddev::gendisk 的意义

md_alloc 中看到

	disk = blk_alloc_disk(NULL, NUMA_NO_NODE);

这个 gendisk 就是我们看到的 /dev/md127

raid5_unplug 中

	if (mddev->queue)
		trace_block_unplug(mddev->queue, cnt, !from_schedule);
static void mddev_detach(struct mddev *mddev)
{
	mddev->bitmap_ops->wait_behind_writes(mddev);
	if (mddev->pers && mddev->pers->quiesce && !is_md_suspended(mddev)) {
		mddev->pers->quiesce(mddev, 1);
		mddev->pers->quiesce(mddev, 0);
	}
	md_unregister_thread(mddev, &mddev->thread);

	/* the unplug fn references 'conf' */
	if (!mddev_is_dm(mddev))
		blk_sync_queue(mddev->gendisk->queue);
}

虽然 gendisk 中存在其对应的 request_queue ,但是 应该不在于提交 io ,而是一些通用机制了。

bio_add_folio

本站所有文章转发 CSDN 将按侵权追究法律责任,其它情况随意。