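Key source locations
- fs/namei.c : filesystem path resolution, plus operations such as hard links and symbolic links
- fs/d_path.c
  - provides the helper d_path
  - provides the getcwd system call
- fs/dcache.c
- fs/inode.c (inode management: evict and related operations)
- ext2/inode.c (surprisingly, this is where the address_space_operations live)
- ext2/namei.c (lookup support, plus what happens once a lookup succeeds)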
Documentation
Basic flow
int kern_path(const char *name, unsigned int flags, struct path *path)
{
return filename_lookup(AT_FDCWD, getname_kernel(name),
flags, path, NULL);
}
Note that this goes from a name to a struct path.
- kern_path
  - filename_lookup : sets up nameidata
    - path_lookupat
      - link_path_walk
        - walk_component
          - handle_dots
          - lookup_fast
          - lookup_slow : unlike __lookup_slow, holds the parent directory's inode lock
            - __lookup_slow
              - d_alloc_parallel
                - d_alloc : just the basic allocation and initialization; the complicated part is in d_alloc_parallel
              - simplefs_lookup : reads the on-disk directory, compares names, and obtains the inode number
                - simplefs_iget
                  - iget_locked : first tries the inode cache
          - step_into : handles symbolic links
In path_lookupat:
while (!(err = link_path_walk(s, nd)) &&
(s = lookup_last(nd)) != NULL)
;
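The loop above consumes the path one component at a time. As a rough user-space sketch (not kernel code), the same idea can be shown with a purely lexical walk: split on '/', ignore ".", and pop one level on "..". The name toy_walk is hypothetical, and unlike link_path_walk it does no dcache lookup, no permission check, and no symlink or mount handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy, purely lexical walk of an absolute path (hypothetical helper).
 * Writes the normalized path into out (size n) and returns the number of
 * components left, or -1 if the path is relative or out is too small.
 */
static int toy_walk(const char *path, char *out, size_t n)
{
    size_t len = 0;   /* bytes used in out, excluding the NUL */
    int depth = 0;    /* components currently stored in out */

    assert(path && out);
    if (path[0] != '/' || n < 2)
        return -1;

    while (*path) {
        /* Skip any run of slashes between components. */
        while (*path == '/')
            path++;
        if (!*path)
            break;

        /* The next component is [path, path + clen). */
        size_t clen = strcspn(path, "/");

        if (clen == 1 && path[0] == '.') {
            /* "." stays where it is. */
        } else if (clen == 2 && path[0] == '.' && path[1] == '.') {
            /* ".." drops the last component; at the root it is a no-op. */
            if (depth > 0) {
                while (out[len - 1] != '/')
                    len--;
                len--;            /* drop the '/' in front of it too */
                depth--;
            }
        } else {
            /* Need room for '/', the component, and the final NUL. */
            if (len + 1 + clen + 1 > n)
                return -1;
            out[len++] = '/';
            memcpy(out + len, path, clen);
            len += clen;
            depth++;
        }
        path += clen;
    }

    if (len == 0)
        out[len++] = '/';         /* everything cancelled out: the root */
    out[len] = '\0';
    return depth;
}
```

Calling toy_walk("/a/./b/../c", buf, sizeof buf) leaves "/a/c" in buf. The real walk resolves ".." through the parent dentry rather than by editing a string, so the two only agree when no symlinks or mount points are involved.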
It turns out openat goes through the same kind of path walk:
@[
vfs_open+5
path_openat+2820
do_filp_open+215
do_sys_openat2+138
__x64_sys_openat+84
do_syscall_64+95
entry_SYSCALL_64_after_hwframe+118
]: 28
In path_openat:
while (!(error = link_path_walk(s, nd)) &&
(s = open_last_lookups(nd, file, op)) != NULL)
;
if (!error)
error = do_open(nd, file, op); // calls vfs_open
Key structures
struct path
struct path {
struct vfsmount *mnt;
struct dentry *dentry;
} __randomize_layout;
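This makes sense: a dentry alone does not pin down a location, because the same dentry can be visible under several mounts, so a path is the pair (vfsmount, dentry).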
struct nameidata
struct nameidata {
struct path path;
struct qstr last;
struct path root;
struct inode *inode; /* path.dentry.d_inode */
unsigned int flags;
unsigned seq, m_seq, r_seq;
int last_type;
unsigned depth;
int total_link_count;
struct saved {
struct path link;
struct delayed_call done;
const char *name;
unsigned seq;
} *stack, internal[EMBEDDED_LEVELS];
struct filename *name;
struct nameidata *saved;
unsigned root_seq;
int dfd;
kuid_t dir_uid;
umode_t dir_mode;
} __randomize_layout;
struct dentry
Three key pieces:
- a component name,
- a pointer to a parent dentry,
- and a pointer to the “inode” which contains further information about the object in that parent with the given name.
Test with the file /sys/kernel/debug/block/vdb/state: set a breakpoint in blk_mq_debugfs_show, and gdb shows:
$ p *m->file->f_path.dentry
$4 = {
d_flags = 4194312,
d_seq = {
seqcount = {
sequence = 2
}
},
d_hash = {
next = 0x0 <fixed_percpu_data>,
pprev = 0xffffc90000154970
},
d_parent = 0xffff88800581d780,
d_name = {
{
{
hash = 2664144965,
len = 5
},
hash_len = 24138981445
},
name = 0xffff888004bdecf8 "state"
},
d_inode = 0xffff88800b78b288,
d_iname = "state\000-switch-root.service", '\000' <repeats 13 times>,
d_op = 0xffffffff8246a100 <debugfs_dops>,
d_sb = 0xffff8880045d6800,
d_time = 0,
Then print its parent:
$ p *(struct dentry *)0xffff88800581d780
$6 = {
d_flags = 2097160,
d_seq = {
seqcount = {
sequence = 2
}
},
d_hash = {
next = 0x0 <fixed_percpu_data>,
pprev = 0xffffc9000013ae18
},
d_parent = 0xffff88804081c000,
d_name = {
{
{
hash = 2448476182,
len = 3
},
hash_len = 15333378070
},
name = 0xffff88800581d7b8 "vdb"
},
d_inode = 0xffff88800b78a8e8,
d_iname = "vdb\000MSIX-0000:00:01.0\000p.gz", '\000' <repeats 13 times>,
d_op = 0xffffffff8246a100 <debugfs_dops>,
d_sb = 0xffff8880045d6800,
d_time = 0,
As shown, a dentry builds up the whole path simply by recording which dentry is its parent.
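To make the parent-pointer idea concrete, here is a minimal user-space sketch that rebuilds a path by following parent pointers up to the root. struct toy_dentry and toy_dentry_path are hypothetical names; the sketch assumes, as in the kernel, that the root's parent points to itself, and it ignores mounts entirely, so for the debugfs example above it would only produce the part of the path inside debugfs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for struct dentry: only a name and a parent pointer. */
struct toy_dentry {
    const char *name;
    struct toy_dentry *parent;   /* the root points to itself */
};

/*
 * Rebuild the path of d by walking parent pointers, filling buf from the
 * end backwards. Returns a pointer into buf, or NULL if buf is too small.
 */
static char *toy_dentry_path(const struct toy_dentry *d, char *buf, size_t n)
{
    char *p;

    assert(d && buf);
    if (n < 2)
        return NULL;
    p = buf + n;
    *--p = '\0';

    if (d->parent == d) {        /* the root itself */
        *--p = '/';
        return p;
    }
    while (d->parent != d) {
        size_t len = strlen(d->name);

        /* Need room for the name plus the '/' in front of it. */
        if ((size_t)(p - buf) < len + 1)
            return NULL;
        p -= len;
        memcpy(p, d->name, len);
        *--p = '/';
        d = d->parent;
    }
    return p;
}
```

With a chain root <- block <- vdb <- state, the function yields "/block/vdb/state". Writing from the end of the buffer avoids reversing the components afterwards.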
Path lookup
[ ] Core : link_path_walk
link_path_walk calls walk_component; walk_component handles a single path component.
Core : walk_component
static int walk_component(struct nameidata *nd, int flags)
{
err = lookup_fast(nd, &path, &inode, &seq); // todo this dcache, what
path.dentry = lookup_slow(&nd->last, nd->path.dentry, // path.dentry is parent !
err = follow_managed(&path, nd); // todo ?
return step_into(nd, &path, flags, inode, seq); // todo
}
/* Fast lookup failed, do it the slow way */
static struct dentry *__lookup_slow(const struct qstr *name,
struct dentry *dir,
unsigned int flags)
{
dentry = d_alloc_parallel(dir, name, &wq); // create and link : struct dentry *new = d_alloc(parent, name); and then insert into the file name
// we will not create dentry for every item in the directory ! only one
old = inode->i_op->lookup(inode, dentry, flags);
}
link
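These inode operations are related to links:
| get_link | resolve a symlink |
| readlink | seems to be used mostly by /proc |
| link | create a hard link |
| unlink | remove a link |
| symlink | create a symbolic link |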
@[
ext4_get_link+5
step_into+1392
path_openat+348
do_filp_open+215
do_open_execat+91
alloc_bprm+36
do_execveat_common+147
__x64_sys_execve+52
do_syscall_64+95
entry_SYSCALL_64_after_hwframe+118
]: 90
About struct dentry::d_alias:
d_alias links the dentry objects of identical files. This situation arises
when links are used to make the file available under two different names. This
list is linked from the corresponding inode by using its i_dentry element as a
list head. The individual dentry objects are linked by d_alias
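- d_splice_alias
  - ignoring the alias handling, it is equivalent to __d_add
  - for directories, the VFS in principle does not allow multiple aliases (i.e. hard links)
[!NOTE] Taken from DeepSeek, still to be verified
There is one important exception, though: when a directory serves as the root of a filesystem mounted somewhere else (for example a USB stick mounted on /mnt/usb), the inode of the stick's root directory would have two aliases, namely the root dentry (/) inside its own filesystem and the mount-point dentry (/mnt/usb) in the host system.
NFS example: the server exports a directory, say /export/data.
On the NFS server, the inode of /export/data has a dentry, and that dentry is marked IS_ROOT (because it is the root of the exported filesystem).
When an NFS client asks for /export/data/file.txt, the server's VFS tries to build this path in the dcache.
When the lookup reaches the name data, the VFS is about to create a new dentry and finds that the inode it points to already has an IS_ROOT alias dentry (new).
I can just about accept the NFS scenario; the other one I find hard to believe.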
Basic inode maintenance
- new_inode(): creates a new inode, sets the i_nlink field to 1 and initializes i_blkbits, i_sb and i_dev. How does it relate to super_operations::alloc_inode? new_inode calls alloc_inode, which in turn calls the filesystem's super_operations::alloc_inode.
@[
new_inode+5
__ext4_new_inode+234
ext4_create+227
path_openat+2357
do_filp_open+215
do_sys_openat2+138
__x64_sys_openat+84
do_syscall_64+95
entry_SYSCALL_64_after_hwframe+118
]: 5
@[
iget_locked+5
__ext4_iget+310
ext4_lookup+258
__lookup_slow+133
walk_component+219
path_lookupat+103
filename_lookup+241
user_path_at+55
do_faccessat+255
__x64_sys_access+28
do_syscall_64+95
entry_SYSCALL_64_after_hwframe+118
]: 2
- unlock_new_inode(): used in conjunction with iget_locked(), releases the lock on the inode;
- iput(): tells the kernel that the work on the inode is finished; if no one else uses it, it will be destroyed (after being written to disk if it is marked as dirty);
@[
iput+5
__dentry_kill+113
dput+235
__fput+302
__x64_sys_close+61
do_syscall_64+95
entry_SYSCALL_64_after_hwframe+118
]: 1377
How an inode gets tied to a struct file
In short: path resolution comes first and yields a dentry, and the file is then opened in do_dentry_open:
@[
do_dentry_open+1
vfs_open+46
path_openat+2820
do_filp_open+215
do_sys_openat2+138
__x64_sys_openat+84
do_syscall_64+95
entry_SYSCALL_64_after_hwframe+118
]: 7370
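How is an inode tied to the underlying device?
Through the superblock; it is probably initialized in set_bdev_super.
Writing inodes to disk
Inodes are usually read from disk through the lookup inode operation: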
static const struct inode_operations simplefs_inode_ops = {
.lookup = simplefs_lookup,
.create = simplefs_create,
.mkdir = simplefs_mkdir,
};
Dirty inodes, in turn, are written back through struct super_operations:
@[
__mark_inode_dirty+5
block_commit_write+77
block_write_end+59
ext4_da_write_end+137
generic_perform_write+276
ext4_buffered_write_iter+104
vfs_write+663
__x64_sys_pwrite64+157
do_syscall_64+95
entry_SYSCALL_64_after_hwframe+118
]: 202
@[
inode_dio_wait+5
ext4_setattr+1276
notify_change+881
do_truncate+148
path_openat+2903
do_filp_open+215
do_sys_openat2+138
__x64_sys_openat+84
do_syscall_64+95
entry_SYSCALL_64_after_hwframe+118
]: 178
@[
evict+1
__dentry_kill+113
shrink_dentry_list+162
shrink_dcache_parent+215
d_invalidate+104
proc_invalidate_siblings_dcache+317
release_task+847
wait_consider_task+1276
__do_wait+162
do_wait+106
kernel_wait4+182
__do_sys_wait4+71
do_syscall_64+95
entry_SYSCALL_64_after_hwframe+118
]: 25
@[
ext4_write_inode+5
__writeback_single_inode+655
writeback_sb_inodes+539
__writeback_inodes_wb+76
wb_writeback+427
wb_workfn+822
process_one_work+346
worker_thread+826
kthread+251
ret_from_fork+49
ret_from_fork_asm+26
]: 4942
dcache sysfs
slabtop
🤒 sudo slabtop --once | grep dentry
239316 239223 99% 0.19K 5698 42 45584K dentry
/proc/sys/fs/dentry-state
- implemented in proc_nr_dentry
struct dentry_stat_t {
long nr_dentry;
long nr_unused;
long age_limit; /* age in seconds */
long want_pages; /* pages requested by system */
long nr_negative; /* # of unused negative dentries */
long dummy; /* Reserved for future use */
};
How each field is accounted for: d_lru_add
- retain_dentry
- the decision is only made at d_lru_add time
🧀 cat /proc/sys/fs/dentry-state
766073 706656 45 0 271345 0
After running echo 3 | sudo tee /proc/sys/vm/drop_caches:
🧀 cat /proc/sys/fs/dentry-state
40562 3267 45 0 778 0
Similarly:
cat /proc/sys/fs/inode-state
41713 437 0 0 0 0 0
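- Does a single run of Disk Usage Analyzer really produce several GB of dcache / icache?
- Running ncdu once in /home/martins3/ adds roughly 1 million dentries; at a bit over 100 bytes per dentry (slabtop above reports 0.19K per object), that is on the order of a couple of hundred MB.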
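The estimate above is simple enough to check in code. The sketch below parses a line in the /proc/sys/fs/dentry-state format into the six fields of struct dentry_stat_t and multiplies a dentry count by an object size. The helper names are hypothetical, and the 192-byte object size used in the usage note is an assumption (my reading of the 0.19K slabtop figure), not something the notes state exactly.

```c
#include <assert.h>
#include <stdio.h>

/* Mirrors the six fields of struct dentry_stat_t shown above. */
struct toy_dentry_stat {
    long nr_dentry;
    long nr_unused;
    long age_limit;
    long want_pages;
    long nr_negative;
    long dummy;
};

/* Parse one line in the /proc/sys/fs/dentry-state format; 0 on success. */
static int parse_dentry_state(const char *line, struct toy_dentry_stat *st)
{
    int n;

    assert(line && st);
    n = sscanf(line, "%ld %ld %ld %ld %ld %ld",
               &st->nr_dentry, &st->nr_unused, &st->age_limit,
               &st->want_pages, &st->nr_negative, &st->dummy);
    return n == 6 ? 0 : -1;
}

/* Rough slab footprint in bytes for nr objects of obj_size bytes each. */
static long long dentry_bytes(long nr, long obj_size)
{
    return (long long)nr * obj_size;
}
```

With 1,000,000 dentries and an assumed 192 bytes each, dentry_bytes gives 192,000,000 bytes, which matches the "couple of hundred MB" guess; inode cache growth would come on top of that.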
Basic dentry maintenance
The basic way to analyze a cache:
- insertion
- removal
- lookup
There are a number of functions defined which permit a filesystem to manipulate dentries:
- dget : open a new handle for an existing dentry (this just increments the usage count)
- dput (reference counting) : close a handle for a dentry (decrements the usage count). If the usage count drops to 0, and the dentry is still in its parent’s hash, the “d_delete” method is called to check whether it should be cached. If it should not be cached, or if the dentry is not hashed, it is deleted. Otherwise cached dentries are put into an LRU list to be reclaimed on memory shortage.
- d_drop : this unhashes a dentry from its parent's hash list. A subsequent call to dput() will deallocate the dentry if its usage count drops to 0
- d_delete (file deletion) : delete a dentry. If there are no other open references to the dentry then the dentry is turned into a negative dentry (the d_iput() method is called). If there are other references, then d_drop() is called instead
- d_add (sets up a new dentry, does not create a new file) : add a dentry to its parent's hash list and then calls d_instantiate()
- d_instantiate : add a dentry to the alias hash list for the inode and updates the “d_inode” member. The “i_count” member in the inode structure should be set/incremented. If the inode pointer is NULL, the dentry is called a “negative dentry”. This function is commonly called when an inode is created for an existing negative dentry
- d_lookup : look up a dentry given its parent and path name component. It looks up the child of that given name from the dcache hash table. If it is found, the reference count is incremented and the dentry is returned. The caller must use dput() to free the dentry when it finishes using it.
- d_move : implements rename
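The dget / dput / d_drop rules quoted above can be modelled as a tiny state machine. This is only a sketch under simplifying assumptions: the struct and function names are hypothetical, the filesystem's d_delete method is assumed to always say "keep it cached", and locking, RCU and the real LRU list are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the dentry life-cycle state touched by dget/dput/d_drop. */
struct toy_ref_dentry {
    int count;      /* usage count */
    bool hashed;    /* still reachable through the parent's hash */
    bool on_lru;    /* parked on the LRU, reclaimable under memory pressure */
    bool freed;     /* stands in for actually freeing the object */
};

/* dget: take another reference. */
static void toy_dget(struct toy_ref_dentry *d)
{
    assert(!d->freed);
    d->count++;
}

/* d_drop: unhash, so the final dput frees instead of caching. */
static void toy_d_drop(struct toy_ref_dentry *d)
{
    d->hashed = false;
}

/* dput: drop a reference; on the last one, either cache or free. */
static void toy_dput(struct toy_ref_dentry *d)
{
    assert(d->count > 0);
    if (--d->count > 0)
        return;
    if (d->hashed)
        d->on_lru = true;   /* unused but still cached */
    else
        d->freed = true;    /* unhashed: nothing can find it any more */
}
```

A hashed dentry whose count drops to zero ends up on the LRU, which is what the nr_unused field of dentry-state counts; dropping it first with toy_d_drop makes the last toy_dput free it instead.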
@[
d_delete+5
vfs_unlink+539
do_unlinkat+657
__x64_sys_unlinkat+53
do_syscall_64+188
entry_SYSCALL_64_after_hwframe+119
]: 100000
@[
d_lru_add+1
dput+404
path_put+22
vfs_statx+218
vfs_fstatat+107
__do_sys_newfstatat+59
do_syscall_64+95
entry_SYSCALL_64_after_hwframe+118
]: 5661
@[
dput+5
__fput+299
task_work_run+89
do_exit+753
do_group_exit+48
__x64_sys_exit_group+24
x64_sys_call+6131
do_syscall_64+188
entry_SYSCALL_64_after_hwframe+119
]: 17437
static inline struct dentry *dget(struct dentry *dentry)
{
if (dentry)
lockref_get(&dentry->d_lockref);
return dentry;
}
@[
dput+5
terminate_walk+88
path_lookupat+150
filename_lookup+220
vfs_statx+143
vfs_fstatat+123
__do_sys_newfstatat+63
do_syscall_64+188
entry_SYSCALL_64_after_hwframe+119
]: 11015
extern void d_instantiate(struct dentry *, struct inode *);
struct dentry * d_alloc_anon(struct inode *);
struct dentry * d_splice_alias(struct inode *, struct dentry *);
static inline void d_add(struct dentry *entry, struct inode *inode);
void dput(struct dentry *dentry);
static inline struct dentry *dget(struct dentry *dentry);
struct dentry * d_lookup(struct dentry *, struct qstr *);
static struct dentry *__dentry_kill(struct dentry *dentry);
void d_drop(struct dentry *dentry);
void d_delete(struct dentry * dentry);
- d_make_root: allocates the root dentry. It is generally used in the function that is called to read the superblock (fill_super), which must initialize the root directory. So the root inode is obtained from the superblock and is used as an argument to this function, to fill the s_root field from the struct super_block structure.
- d_add: associates a dentry with an inode; the dentry received as a parameter in the calls discussed above signifies the entry (name, length) that needs to be created. This function will be used when creating/loading a new inode that does not have a dentry associated with it and has not yet been introduced to the hash table of inodes (at lookup). In other words, it ties a newly created inode to its dentry.
- d_instantiate: the lighter version of the previous call, in which the dentry was previously added in the hash table.
What does dentry_kill do?
- d_delete : deletes a file
- dput -> __dentry_kill
d_instantiate and d_add are the functions that tie a dentry to an inode: d_instantiate only attaches the inode, while d_add also hashes the dentry.
d_add
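- d_add has very few call sites; the more generic ones are in libfs.
- Its core work is shown below. In other words, a dentry sits on two lists: the global hash table and the per-inode alias list.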
hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
- In practice, hlist_bl_add_head_rcu adds d_hash to dentry_hashtable
dcache
The hash is computed from the parent pointer together with the name:
struct dentry *d_alloc_name(struct dentry *parent, const char *name)
{
struct qstr q;
q.name = name;
q.hash_len = hashlen_string(parent, name);
return d_alloc(parent, &q);
}
Why is there not one hash table per mount point? My guess is that the number of dentries is not that large, and many of them get evicted anyway.
static struct hlist_bl_head *dentry_hashtable __read_mostly;
static inline struct hlist_bl_head *d_hash(unsigned int hash)
{
return dentry_hashtable + (hash >> d_hash_shift);
}
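To illustrate a single global table keyed by (parent, name), here is a small user-space sketch. Everything in it is hypothetical: the names, the 64-bucket table, and the hash function (FNV-1a over the name with the parent pointer mixed into the seed), which is not the kernel's hash. Like d_hash above, it picks the bucket from the top bits of the hash, and it treats ino == 0 as a negative entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_HASH_BITS 6
#define TOY_BUCKETS   (1u << TOY_HASH_BITS)

struct toy_dcache_entry {
    const struct toy_dcache_entry *parent;
    const char *name;
    unsigned long ino;               /* 0 marks a negative entry */
    struct toy_dcache_entry *next;   /* bucket chain */
};

static struct toy_dcache_entry *toy_table[TOY_BUCKETS];

/* FNV-1a over the name, seeded with the parent pointer (illustrative only). */
static uint32_t toy_hash(const struct toy_dcache_entry *parent, const char *name)
{
    uint32_t h = 2166136261u ^ (uint32_t)(uintptr_t)parent;

    while (*name) {
        h ^= (unsigned char)*name++;
        h *= 16777619u;
    }
    return h;
}

/* Same shape as d_hash(): the top bits of the hash select the bucket. */
static struct toy_dcache_entry **toy_bucket(uint32_t hash)
{
    return &toy_table[hash >> (32 - TOY_HASH_BITS)];
}

/* Insert at the head of the bucket chain. */
static void toy_add(struct toy_dcache_entry *d)
{
    struct toy_dcache_entry **b = toy_bucket(toy_hash(d->parent, d->name));

    assert(d->name);
    d->next = *b;
    *b = d;
}

/* Find the child called name under parent, or NULL if it is not cached. */
static struct toy_dcache_entry *toy_lookup(const struct toy_dcache_entry *parent,
                                           const char *name)
{
    struct toy_dcache_entry *d = *toy_bucket(toy_hash(parent, name));

    for (; d; d = d->next)
        if (d->parent == parent && strcmp(d->name, name) == 0)
            return d;
    return NULL;
}
```

Because the parent pointer is part of the key, two entries named "etc" under different parents do not collide logically even in one shared table. A hit on an entry with ino == 0 answers "this name does not exist" without asking the filesystem again, which is the point of a negative dentry.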
@[
simplefs_lookup+5
__lookup_slow+131
walk_component+219
path_lookupat+106
filename_lookup+220
vfs_statx+143
do_statx+102
__x64_sys_statx+154
do_syscall_64+188
entry_SYSCALL_64_after_hwframe+119
]: 2
lru
- alloc_super
- super_cache_scan
- super_cache_count
struct super_operations contains two hooks (TODO: presumably for caching filesystem-specific objects):
long (*nr_cached_objects)(struct super_block *,
struct shrink_control *);
long (*free_cached_objects)(struct super_block *,
struct shrink_control *);
- super_cache_scan calls two functions
- prune_dcache_sb
- prune_icache_sb
long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc)
{
LIST_HEAD(dispose);
long freed;
freed = list_lru_shrink_walk(&sb->s_dentry_lru, sc,
dentry_lru_isolate, &dispose);
shrink_dentry_list(&dispose);
return freed;
}
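prune_icache_sb : TODO does this free the inode itself, or the file data the inode holds? Or is an inode simply never freed while it is open?
list_lru_shrink_walk seems to simply walk the list and pull out the entries that can be reclaimed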
static enum lru_status dentry_lru_isolate(struct list_head *item,
struct list_lru_one *lru, spinlock_t *lru_lock, void *arg)
prune_dcache_sb then immediately calls shrink_dentry_list, which actually frees what was just isolated:
- unmount is another source of shrinking
How does dcache / icache shrinking fit into the overall shrink machinery?
shrink_node_memcgs ==> shrink_slab ==> for every struct shrinker, call do_shrink_slab
The shrinker for the dentry and inode caches is set up during alloc_super initialization. x86 KVM also registers a shrinker of its own:
static struct shrinker mmu_shrinker = {
.count_objects = mmu_shrink_count,
.scan_objects = mmu_shrink_scan,
.seeks = DEFAULT_SEEKS * 10,
};
@[
inode_add_lru+5
delete_from_page_cache_batch+794
truncate_inode_pages_range+298
ext4_evict_inode+296
evict+259
__dentry_kill+113
dput+235
__fput+302
task_work_run+89
do_exit+717
do_group_exit+48
get_signal+2075
arch_do_signal_or_restart+58
syscall_exit_to_user_mode+173
do_syscall_64+107
entry_SYSCALL_64_after_hwframe+118
]: 5
inode cache
Lookup from an inode number to an inode, within a superblock:
Like the dcache, it defines a hash table, but here it is guarded by a global spinlock:
static struct hlist_head *inode_hashtable __ro_after_init;
static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
In inode_insert5:
struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
spin_lock(&inode_hash_lock);
old = find_inode(inode->i_sb, head, test, data, true);
Why does find_inode still need rcu_read_lock protection? For details see
commit 7180f8d91fcb ("vfs: add rcu-based find_inode variants for iget ops")
iget_locked -> find_inode_fast
How to test: tree is not enough, use yazi; the files actually have to be opened:
@[
iget_locked+5
__ext4_iget+310
ext4_lookup+258
__lookup_slow+133
walk_component+219
path_lookupat+103
filename_lookup+241
vfs_statx+128
do_statx+98
__x64_sys_statx+165
do_syscall_64+95
entry_SYSCALL_64_after_hwframe+118
]: 855
negative dentry cache
- Negative dentries, 20 years later : 2022
- https://news.ycombinator.com/item?id=30993527
- Dentry negativity : 2020
- Why do negative dentries speed up lookups?
- Does every lookup create a new negative dentry?
  - for an answer, see
    - https://unix.stackexchange.com/questions/236914/negative-dentry
[ ] Analyze fs/readdir.c
- this file implements what man getdents(2) describes
- the example code in the man page behaves counter-intuitively : linux_dirent::d_off
- ccls indexes __x64_sys_getdents, but nothing calls it
@[
__x64_sys_getdents64+5
do_syscall_64+59
entry_SYSCALL_64_after_hwframe+114
]: 3390
This callback is registered for the concrete filesystem to use:
struct getdents_callback64 buf = {
.ctx.actor = filldir64,
.count = count,
.current_dir = dirent
};
- __x64_sys_getdents64
  - iterate_dir : carries getdents_callback64 as its argument
    - file->f_op->iterate_shared(file, ctx);
      - xfs_dir_file_operations
    - file->f_op->iterate(file, ctx); # never called in the current configuration
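From user space, the call chain above is driven by the getdents64 system call. The sketch below counts directory entries by calling it directly through syscall(2). count_dir_entries and struct toy_linux_dirent64 are hypothetical names; the record layout follows the one documented in getdents(2), which glibc does not export under a kernel-style name. d_off is best treated as an opaque, filesystem-defined position of the next entry, which may be what makes the man page example look counter-intuitive.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Record layout as documented in getdents(2). */
struct toy_linux_dirent64 {
    uint64_t       d_ino;     /* inode number */
    int64_t        d_off;     /* opaque position of the next entry */
    unsigned short d_reclen;  /* size of this whole record */
    unsigned char  d_type;    /* DT_REG, DT_DIR, ... */
    char           d_name[];  /* NUL-terminated name */
};

/* Count entries in a directory (including "." and ".."); -1 on error. */
static long count_dir_entries(const char *path)
{
    char buf[4096] __attribute__((aligned(8)));
    long total = 0;
    int fd = open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);

    if (fd < 0)
        return -1;

    for (;;) {
        /* Each call fills buf with as many whole records as fit. */
        long n = syscall(SYS_getdents64, fd, buf, sizeof(buf));

        if (n < 0) {
            close(fd);
            return -1;
        }
        if (n == 0)
            break;            /* end of directory */

        for (long pos = 0; pos < n; ) {
            struct toy_linux_dirent64 *d =
                (struct toy_linux_dirent64 *)(buf + pos);

            assert(d->d_reclen > 0);
            total++;
            pos += d->d_reclen;   /* records are variable-length */
        }
    }
    close(fd);
    return total;
}
```

Tracing __x64_sys_getdents64 while calling count_dir_entries on a large directory should show one syscall per buffer-full of records rather than one per entry.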
Classic questions
- look into how openat resolves paths relative to the current working directory
Basic flags of the open system call
Three kinds of open flags according to the man page:
- access modes : O_RDONLY, O_WRONLY, or O_RDWR
- file creation flags : O_CLOEXEC, O_CREAT, O_DIRECTORY, O_EXCL, O_NOCTTY, O_NOFOLLOW, O_TMPFILE
- file status flags : O_APPEND, O_ASYNC
- how are these translated into e.g. FMODE_WRITE ?
- Why is openat always used rather than open?
- O_CLOEXEC
By default, the new file descriptor is set to remain open across an execve(2) (i.e., the FD_CLOEXEC file descriptor flag described in fcntl(2) is initially disabled); the O_CLOEXEC flag, described below, can be used to change this default. The file offset is set to the beginning of the file (see lseek(2)).
Reposting any article from this site to CSDN will be pursued as copyright infringement; any other use is fine.