buffer.c

什么时候时候需要使用 buffer head

在一个使用 ext4 的系统中，buffer head :

3010086 1755653  58%    0.22K  83614       36    668912K dentry
2683902 1631823  60%    0.10K  68818       39    275272K buffer_head
1188864 1178796  99%    0.03K   9288      128     37152K kmalloc-32
1099320 1083166  98%    0.13K  18322       60    146576K kernfs_node_cache
826112 294509  35%    0.06K  12908       64     51632K kmalloc-rcl-64
619650 583479  94%    0.70K  13770       45    440640K proc_inode_cache
556500 511105  91%    0.63K  11130       50    356160K inode_cache
539021 249230  46%    0.57K   9642       56    308544K radix_tree_node
381824 211601  55%    0.50K   5966       64    190912K kmalloc-512
354688  81444  22%    1.13K  12671       28    405472K ext4_inode_cache
305536 260047  85%    0.06K   4774       64     19096K lsm_inode_cache
301504 288972  95%    0.06K   4711       64     18844K kmalloc-64

当然，如果当前的文件系统使用的是 xfs ，就只能观察到很少的东西了。

首先，第一个问题，就是我们会在哪里来分配:

文件读写时（最常见）

int block_read_full_folio(struct folio *folio, get_block_t *get_block)
{
 // 为整个 folio 创建 buffer head 链
 head = folio_create_buffers(folio, inode, 0);  // ← 这里分配
 // ...
}
struct buffer_head *folio_buffers(struct folio *folio)
{
 bh = folio_buffers(folio);
 if (!bh)
     bh = create_empty_buffers(folio, blocksize, 0);  // ← 这里分配
 return bh;
}

直接操作块设备时：getblk/bdev_getblk

struct buffer_head *bdev_getblk(struct block_device *bdev, sector_t block,
        unsigned size, gfp_t gfp)
{
    // 1. 先在缓存中查找
    bh = __find_get_block(bdev, block, size);
    if (bh)
        return bh;

    // 2. 没找到，分配新的
    return __getblk_slow(bdev, block, size, gfp);  // ← 这里分配
}

最基本的总结

buffer head 需要解决两个问题:

page size 比一个 block 的 size 要大，所以一个 struct buffer_head 来记录一个 page 中的内容和 block 的映射关系
metadata 不会加载到 page cache 中，有的文件系统利用 bh 机制来加载 metadata

从 plka 中记录下来的东西，应为 buffer_head 很多年没有变化，其中的内容基本正确:

dependent cache

参考 struct buffer_head 的注释，如下三个功能:

extracting block mappings (via a get_block_t call),
- 构建内存和 disk 的映射关系，记录到 buffer_head:b_page 和 buffer_head:b_blocknr
- 一个 page 有多个 buffer_head ，buffer_head 之间通过 b_assoc_buffers 联系起来
for tracking state within a page (via a page_mapping)
and for wrapping bio submission for backward compatibility reasons

/*
 * Historically, a buffer_head was used to map a single block
 * within a page, and of course as the unit of I/O through the
 * filesystem and block layers.  Nowadays the basic I/O unit
 * is the bio, and buffer_heads are used for extracting block
 * mappings (via a get_block_t call), for tracking state within
 * a page (via a page_mapping) and for wrapping bio submission
 * for backward compatibility reasons (e.g. submit_bh).
 */
struct buffer_head

Buffers are kept for small I/O transfers with block size granularity. This is often required by filesystems to handle their metadata. Transfer of raw data is done in a page-centric fashion, and the implementation of buffers is also on top of the page cache.

The buffer cache consists of two structural units:

A buffer head holds all management data relating to the state of the buffer including information on block number, block size, access counter, and so on, discussed below. These data are not stored directly after the buffer head but in a separate area of RAM memory indicated by a corresponding pointer in the buffer head structure.
The useful data are held in specially reserved pages that may also reside in the page cache.

The buffer cache operates independently of the page cache, not in addition to it.

In this case, private(struct page) points to the first buffer head used to split the page into smaller units. The various buffer heads are linked in a cyclic list by means of b_this_page.

The kernel provides the create_empty_buffers and link_dev_buffers functions for this purpose, both of which are implemented in fs/buffer.c. The latter serves to associate an existing set of buffer heads with a page, whereas create_empty_buffers generates a completely new set of buffers for association with the page. For example, create_empty_buffers is invoked when reading and writing complete pages with block_read_full_page and __block_write_full_page.

As already noted, some transfer operations to and from block devices may need to be performed in units whose size depends on the block size of the underlying devices, whereas many parts of the kernel prefer to carry out I/O operations with page granularity as this makes things much easier — especially in terms of memory management.

In this scenario, buffers act as intermediaries between the two worlds

引用 The future of the page cache 中内容 Initially, the page and buffer caches were entirely separate, but Ingo Molnar unified them in 1999. Now, the buffer cache still exists, but its entries point into the page cache.

The buffer cache is used not only as an add-on to the page cache but also as an independent cache for objects that are not handled in pages but in blocks.

independent cache

处理 page cache 之外的东西

Buffers are used not only in the context of pages. However, there are still situations in which access to block device data is performed on the block level and not on the page level in the view of higher-level code. To help speed up such operations, the kernel provides yet another cache known as an LRU buffer cache discussed below.

This cache for independent buffers is not totally divorced from the page cache. Since RAM memory is always managed in pages, buffered blocks must also be held in pages, with the result that there are some points of contact with the page cache. These cannot and should not be ignored — after all, access to individual blocks is still possible via the buffer cache without having to worry about the organization of the blocks into pages.

When is it necessary to read individual blocks? There are not too many points in the kernel where this must be done, but these are nevertheless of great importance. Filesystems in particular make use of the routines described above when reading superblocks or management blocks.

The kernel defines two functions to simplify the work of filesystems with individual blocks:

static inline struct buffer_head *
sb_bread(struct super_block *sb, sector_t block)
{
	return __bread_gfp(sb->s_bdev, block, sb->s_blocksize, __GFP_MOVABLE);
}

细节代码分析

dependent cache 的 io 过程

主要是 block_read_full_folio 和 block_write_full_folio 他们的一个关键参数是 get_block_t

block_read_full_page reads a full page in three steps:

The buffers are set up and their state is checked.
The buffers are locked to rule out interference by other kernel threads in the next step.
The data are transferred to the buffers.

创建，上锁，读取。创建包括的内容 : buffer 和 get_block

ext4_block_write_begin : 从 page cache 层提供了 folio 的，然后在
- create_empty_buffers
  - folio_alloc_buffers
try_to_free_buffers : 释放 page 及其关联的 buffer head 等
- free_buffer_head

释放 buffer_head 的位置:

@[
    try_to_free_buffers+5
    shrink_folio_list+1776
    evict_folios+600
    try_to_shrink_lruvec+420
    shrink_one+253
    shrink_node+2749
    balance_pgdat+1226
    kswapd+492
    kthread+220
    ret_from_fork+49
    ret_from_fork_asm+26
]: 3378

在 pageout 中，这里是很经典的，folio 中的 private 关联了 buffer head

	if (!mapping) {
		/*
		 * Some data journaling orphaned folios can have
		 * folio->mapping == NULL while being dirty with clean buffers.
		 */
		if (folio_test_private(folio)) {
			if (try_to_free_buffers(folio)) {
				folio_clear_dirty(folio);
				pr_info("%s: orphaned folio\n", __func__);
				return PAGE_CLEAN;
			}
		}
		return PAGE_KEEP;
	}

independent cache 的 io 过程

sb_bread

sb_bread
- __bread_gfp
  - bdev_getblk : Get a buffer_head in a block device’s buffer cache ，folio 是这个时候释放的
    - __find_get_block
      - lookup_bh_lru : 先在一个小的 lru cache 中查询
      - __find_get_block_slow : 然后在 radix 中查询，和 page cache 的效果很像
    - __getblk_slow
  - __bread_slow : 使用 submit_bio 来提交 io

@[
    __bread_gfp+5
    ext2_get_inode+231
    __ext2_write_inode+118
    __writeback_single_inode+668
    writeback_single_inode+175
    sync_inode_metadata+71
    ext2_add_link+1096
    ext2_create+113
    path_openat+2214
    do_filp_open+196
    do_sys_openat2+171
    __x64_sys_openat+87
    do_syscall_64+188
    entry_SYSCALL_64_after_hwframe+119
]: 1

brelse

brelse 不是真的释放，而是一个

/**
 * brelse - Release a buffer.
 * @bh: The buffer to release.
 *
 * Decrement a buffer_head's reference count.  If @bh is NULL, this
 * function is a no-op.
 *
 * If all buffers on a folio have zero reference count, are clean
 * and unlocked, and if the folio is unlocked and not under writeback
 * then try_to_free_buffers() may strip the buffers from the folio in
 * preparation for freeing it (sometimes, rarely, buffers are removed
 * from a folio but it ends up not being freed, and buffers may later
 * be reattached).
 *
 * Context: Any context.
 */
static inline void brelse(struct buffer_head *bh)

try_to_free_buffers

其他细节

BH_Mapped

BH_Mapped means that there is a mapping of the buffer contents on a secondary storage device, as is the case with all buffers that originate from filesystems or from direct accesses to block devices

告诉 buffer_head 他里面的内容指向到那里:

static inline void
map_bh(struct buffer_head *bh, struct super_block *sb, sector_t block)
{
	set_buffer_mapped(bh);
	bh->b_bdev = sb->s_bdev;
	bh->b_blocknr = block;
	bh->b_size = sb->s_blocksize;
}

理解这个问题的典型位置，这里的 buffer head 都是从上面传递过来的，通过 ext2_get_block 获取的 bno ，然后让 buffer head 指向到这个获取的 bno

int ext2_get_block(struct inode *inode, sector_t iblock,
		struct buffer_head *bh_result, int create)
{
	unsigned max_blocks = bh_result->b_size >> inode->i_blkbits;
	bool new = false, boundary = false;
	u32 bno;
	int ret;

	ret = ext2_get_blocks(inode, iblock, max_blocks, &bno, &new, &boundary,
			create);
	if (ret <= 0)
		return ret;

	map_bh(bh_result, inode->i_sb, bno);
	bh_result->b_size = (ret << inode->i_blkbits);
	if (new)
		set_buffer_new(bh_result);
	if (boundary)
		set_buffer_boundary(bh_result);
	return 0;

}

mark_buffer_dirty

BH_Dirty 和 page cache 中间 page 的 dirty 的功能不是重复，而是具有相关性:

@[
    mark_buffer_dirty+151
    ext2_new_blocks+1636
    ext2_get_blocks+746
    ext2_get_block+94
    __block_write_begin_int+350
    block_write_begin+81
    ext2_write_begin+48
    generic_perform_write+220
    generic_file_write_iter+98
    vfs_write+673
    ksys_write+110
    do_syscall_64+188
    entry_SYSCALL_64_after_hwframe+119
]: 1

mark_buffer_dirty 的时候会标记到其关联的 folio 上

blockdev 的 writeback 还是 buffer head 的:

@[
        __block_write_full_folio+5
        blkdev_writepages+110
        do_writepages+199
        __writeback_single_inode+65
        writeback_sb_inodes+534
        __writeback_inodes_wb+76
        wb_writeback+427
        wb_workfn+795
        process_one_work+394
        worker_thread+598
        kthread+249
        ret_from_fork+242
        ret_from_fork_asm+26
]: 5

为什么 nfs 和 fuse 可以不依赖 buffer head

[!NOTE] 参考神奇海螺的意见，有待验证

NFS 和 FUSE 不依赖 buffer_head 机制，根本原因在于它们不是面向本地块设备（block device）的文件系统，而是面向“远程”或“用户态抽象后端”的文件系统。

buffer_head 最初是 Linux 块设备缓存（buffer cache） 的核心数据结构，用于：

表达 “文件偏移 ↔ 块设备扇区” 的映射（b_blocknr）
管理 每个 fs-block 的状态（BH_Uptodate, BH_Dirty, BH_Lock 等）
协调 页缓存（page cache）与块设备 I/O 之间的细粒度状态（尤其当 fs block size < page size 时）

这个说的看上去显然有点扯了，但是，既然 buffer_head 是管理文件偏移到 block 之间的关系，为什么 fuse 还有 iomap 的支持，或者说，buffer_head 不是和 iomap 是相同的生态位的吗?

那么 nfs 也有 iomap 的支持吗?

buffer_head 中到底存储了什么东西

简而言之，描述内核映射到磁盘的哪一个块上:

struct buffer_head {
    unsigned long b_state;        // 状态标志（BH_Uptodate, BH_Dirty, BH_Lock 等）
    struct block_device *b_bdev;  // 所属块设备
    sector_t b_blocknr;           // 在设备上的块号（sector_t，按 b_size 对齐）
    size_t b_size;                // 块大小（如 1KB, 4KB）
    char *b_data;                 // 指向页内偏移的数据指针
    struct folio *b_folio;        // 所属 folio
    struct buffer_head *b_this_page; // 同一页内 buffer_head 链表
    struct list_head b_assoc_buffers; // 关联到 inode 的私有链表（用于 fsync）
    struct address_space *b_assoc_map; // 所属 inode 的 mapping
    bh_end_io_t *b_end_io;        // I/O 完成回调
    void *b_private;              // 私有数据
    atomic_t b_count;             // 引用计数
    // ...
};

为什么 buffer head 需要一个 lru

系统中到底有多少个 buffer head

3015450 1760850  58%    0.22K  83763       36    670104K dentry
2684136 1634245  60%    0.10K  68824       39    275296K buffer_head
1188864 1178901  99%    0.03K   9288      128     37152K kmalloc-32
1099260 1084157  98%    0.13K  18321       60    146568K kernfs_node_cache
828160 296498  35%    0.06K  12940       64     51760K kmalloc-rcl-64
618480 582007  94%    0.70K  13744       45    439808K proc_inode_cache
559050 513208  91%    0.63K  11181       50    357792K inode_cache
539413 250020  46%    0.57K   9649       56    30876 radix_tree_node
385280 212026  55%    0.50K   6020       64    192640K kmalloc-512
354940  81852  23%    1.13K  12680       28    405760K ext4_inode_cache
305536 260221  85%    0.06K   4774       64     19096K lsm_inode_cache
301312 294252  97%    0.02K   1177      256      4708K lsm_file_cache
300672 287872  95%    0.06K   4698       64     18792K kmalloc-64
290368 258285  88%    0.25K   4537       64     72592K filp
288256 275888  95%    0.01K    563      512      2252K kmalloc-8
247488 222758  90%    0.06K   3867       64     15468K kmem_cache_node
209728 189431  90%    0.25K   3277       64     52432K skbuff_head_cache
209100 205815  98%    0.04K   2050      102      8200K ext4_extent_status
200270 182174  90%    0.23K   2861       70     45776K vm_area_struct
188412 162335  86%    0.09K   4486       42     17944K kmalloc-96
181566 154183  84%    0.09K   4323       42     17292K kmalloc-rcl-96
171904 154567  89%    0.06K   2686       64     10744K anon_vma_chain
156672 146521  93%    0.02K    612      256      2448K kmalloc-16
132864 132624  99%    0.12K   2076       64     16608K scsi_sense_cache
127488 125440  98%    0.12K   1992       64     15936K nfs_page
120768 118592  98%    0.12K   1887       64     15096K seq_file
120512 118108  98%    0.12K   1883       64     15064K eventpoll_epi
112768  97455  86%    1.00K   3524       32    112768K kmalloc-1k
 95706  86140  90%    0.10K   2454       39      9816K anon_vma
 91520  90705  99%    0.06K   1430       64      5720K jbd2_inode
 66822  57132  85%    0.74K   1554       43     49728K shmem_inode_cache
 59552  47855  80%    2.00K   3722       16    119104K kmalloc-2k
 52736  49235  93%    0.12K    824       64      6592K pid
 52224  52224 100%    0.04K    512      102      2048K pde_opener

在 13900k 上，就完全观察不到了，因为没有挂载 ext4

4096596 4095866  99%    0.19K  97538       42    780304K dentry
3887934 3887330  99%    0.08K  76234       51    304936K lsm_inode_cache
3826496 3826488  99%    1.00K 119578       32   3826496K xfs_inode
3189760 3189074  99%    0.02K  12460      256     49840K kmalloc-rnd-09-16
1515640 1407521  92%    0.57K  54130       28    866080K radix_tree_node
1465800 1440803  98%    0.20K  36645       40    293160K xfs_ili
595840 595840 100%    0.06K   9310       64     37240K kmalloc-rnd-04-64
567035 566572  99%    0.05K   6671       85     26684K xfs_ifork
522400 469399  89%    1.00K  16325       32    522400K kmalloc-rnd-01-1k
496768 496768 100%    0.03K   3881      128     15524K kmalloc-rnd-08-32
245700 245700 100%    0.38K   5850       42     93600K xfs_buf
174528 174518  99%    0.25K   5454       32     43632K kmalloc-rnd-03-256
154112 154112 100%    0.03K   1204      128      4816K kmalloc-rnd-04-32
105856 105699  99%    0.50K   3308       32     52928K kmalloc-rnd-03-512
 82944  20755  25%    0.50K   2592       32     41472K kmalloc-rnd-01-512
 81396  81396 100%    0.09K   1938       42      7752K kmalloc-rcl-96
 77580  77365  99%    0.13K   2586       30     10344K kernfs_node_cache
 71424  34615  48%    0.06K   1116       64      4464K kmalloc-rnd-08-64
 66048  66045  99%    0.01K    129      512       516K zs_handle-zram0
 60928  60928 100%    0.03K    476      128      1904K kmalloc-rnd-09-32
 55062  53978  98%    0.09K   1311       42      5244K kmalloc-rnd-04-96
 38148  26808  70%    0.62K    748       51     23936K inode_cache
 33264  33264 100%    0.19K    792       42      6336K kmalloc-rnd-04-192
 32046  31618  98%    0.09K    763       42      3052K kmalloc-rnd-08-96
 31936  31936 100%    0.25K    998       32      7984K kmalloc-rnd-09-256
 30240  26881  88%    0.07K    540       56      2160K vmap_area

buffer head 的作用到底是什么?

因为一般的流程都是这样的:

┌─────────────────────────────────────────────┐
│  1. 先在当前 CPU 的 bh_lru 中查找           │
│     (O(16) = O(1)，无锁或轻量级锁)          │
│         ↓                                   │
│  2. 命中 → 直接返回，移到队首 (MRU)         │
│         ↓ 未命中                            │
│  3. 在 page cache 中查找 (慢路径)           │
│     需要锁，遍历 radix tree                 │
│         ↓                                   │
│  4. 找到后加入 bh_lru，淘汰最久未用的       │
└─────────────────────────────────────────────┘

这个过程也就是:

/*
 * Perform a pagecache lookup for the matching buffer.  If it's there, refresh
 * it in the LRU and mark it as accessed.  If it is not present then return
 * NULL. Atomic context callers may also return NULL if the buffer is being
 * migrated; similarly the page is not marked accessed either.
 */
static struct buffer_head *
find_get_block_common(struct block_device *bdev, sector_t block,
			unsigned size, bool atomic)
{
	struct buffer_head *bh = lookup_bh_lru(bdev, block, size);

	if (bh == NULL) {
		/* __find_get_block_slow will mark the page accessed */
		bh = __find_get_block_slow(bdev, block, atomic);
		if (bh)
			bh_lru_install(bh);
	} else
		touch_buffer(bh);

	return bh;
}

find_get_block_common 就在那个经典的 get_block_t 的路径上。

慢速路径具体是，在 radix tree 中查询:

	folio = __filemap_get_folio(bd_mapping, index, FGP_ACCESSED, 0);

  #define BH_LRU_SIZE 16
  struct bh_lru {
      struct buffer_head *bhs[BH_LRU_SIZE];
  };
  static DEFINE_PER_CPU(struct bh_lru, bh_lrus);

每 CPU 一份，无需全局锁。
容量小（仅 16 个），适合缓存最近频繁访问的 bh（如 superblock、inode table 块）。
命中时直接返回，避免 folio 查找。

sb_bread()
  → __bread_gfp()
    → bdev_getblk()
      → __find_get_block_nonatomic()  // 先查 LRU
        → find_get_block_common()
          → lookup_bh_lru()           // LRU 命中？→ 返回
          → __find_get_block_slow()   // 未命中 → 慢路径
            → __filemap_get_folio()   // 获取 folio
            → 遍历/创建 buffer_head
      → 若 bh 未 uptodate，调用 __bread_slow() 提交 bio 读数据

本站所有文章转发 CSDN 将按侵权追究法律责任，其它情况随意。