ftrace 输出的格式
https://www.kernel.org/doc/html/latest/trace/ftrace.html#output-format
ftrace 是如何防止自己 trace 自己的
linux/kernel/trace/ 下的函数都没有办法 trace,
sudo perf probe –add register_ftrace_command 可以观察到
[29020.876213] trace_kprobe: Could not probe notrace function _text
[29288.128622] trace_kprobe: Could not probe notrace function _text
#0 __register_trace_kprobe (tk=0xffff888128bd6700) at kernel/trace/trace_kprobe.c:482
#1 0xffffffff812c7379 in register_trace_kprobe (tk=0xffff888128bd6700) at kernel/trace/trace_kprobe.c:657
#2 __trace_kprobe_create (argc=<optimized out>, argv=<optimized out>) at kernel/trace/trace_kprobe.c:957
#3 0xffffffff812cf6b8 in trace_probe_create (raw_command=<optimized out>, createfn=createfn@entry=0xffffffff812c69d0 <__trace_kprobe_create>) at kernel/trace/trace_probe.c:1948
#4 0xffffffff812c4a45 in trace_kprobe_create (raw_command=<optimized out>) at kernel/trace/trace_kprobe.c:985
#5 create_or_delete_trace_kprobe (raw_command=<optimized out>) at kernel/trace/trace_kprobe.c:995
#6 0xffffffff8129b5ca in trace_parse_run_command (file=<optimized out>, buffer=0x555d3cf78520 "p:probe/__register_trace_kprobe _text+2902656", count=45, ppos=<optimized out>, createfn=0xffffffff812c4a30 <create_or_delete_trace_kprobe>) at kernel/trace/trace.c:10478
#7 0xffffffff8146795b in vfs_write (file=file@entry=0xffff8881355ae200, buf=buf@entry=0x555d3cf78520 "p:probe/__register_trace_kprobe _text+2902656", count=count@entry=45, pos=pos@entry=0xffffc900023ebef8) at fs/read_write.c:588
#8 0xffffffff8146808f in ksys_write (fd=<optimized out>, buf=0x555d3cf78520 "p:probe/__register_trace_kprobe _text+2902656", count=45) at fs/read_write.c:643
#9 0xffffffff824088aa in do_syscall_x64 (nr=<optimized out>, regs=0xffffc900023ebf58) at arch/x86/entry/common.c:52
#10 do_syscall_64 (regs=0xffffc900023ebf58, nr=<optimized out>) at arch/x86/entry/common.c:83
#11 0xffffffff826000eb in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120
__register_trace_kprobe 中检查跟踪的符号的位置:
/* Internal register function - just handle k*probes and flags */
static int __register_trace_kprobe(struct trace_kprobe *tk)
{
int i, ret;
ret = security_locked_down(LOCKDOWN_KPROBES);
if (ret)
return ret;
if (trace_kprobe_is_registered(tk))
return -EINVAL;
if (within_notrace_func(tk)) {
pr_warn("Could not probe notrace function %ps\n",
(void *)trace_kprobe_address(tk));
return -EINVAL;
}
kernel/trace/Kconfig
config KPROBE_EVENTS_ON_NOTRACE
bool "Do NOT protect notrace function from kprobe events"
depends on KPROBE_EVENTS
depends on DYNAMIC_FTRACE
default n
help
This is only for the developers who want to debug ftrace itself
using kprobe events.
If kprobes can use ftrace instead of breakpoint, ftrace related
functions are protected from kprobe-events to prevent an infinite
recursion or any unexpected execution path which leads to a kernel
crash.
This option disables such protection and allows you to put kprobe
events on ftrace functions for debugging ftrace by itself.
Note that this might let you shoot yourself in the foot.
If unsure, say N.
确定一个符号是否是 notrace 的比想象的要复杂
- within_notrace_func
__within_notrace_func- ftrace_location_range
- lookup_rec : 在这里,利用
ftrace_pages_start记录的数据来查询,这和我想象的不一样,本来以为链接信息的
- lookup_rec : 在这里,利用
- ftrace_location_range
tracer
- timerlat
- osnoise
- blk
- mmiotrace
- function_graph
- wakeup_dl
- wakeup_rt
- wakeup
- preemptirqsoff
- preemptoff
- irqsoff
- function
- nop
[x] timerlat
Documentation/trace/timerlat-tracer.rst
和 osnoise 很像
[x] mmiotrace
Documentation/trace/mmiotrace.rst
感觉长期没有人维护了
- 自己构建内核,使用之后 guest kernel 直接 crash ,应该是存在什么依赖的选项没有打上,应该是和 cpuoffline 相关的
echo 0 > options/function-trace echo preemptoff > current_tracer echo 1 > tracing_on echo 0 > tracing_max_latency - 打开浏览器,可以观察到,但是看上去 header 和输出不是匹配的
# tracer: mmiotrace # # entries-in-buffer/entries-written: 15/15 #P:1 # # _-----=> irqs-off/BH-disabled # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / _-=> migrate-disable # |||| / delay # TASK-PID CPU# ||||| TIMESTAMP FUNCTION # | | | ||||| | | W 4 411.390479 4 0x3ffa01c90048 0xa1690a20 0x0 0 W 4 411.390483 4 0x3ffa01c9004c 0x17b4dbba 0x0 0 W 4 411.390484 4 0x3ffa01c90040 0xbbbbbbbb 0x0 0 W 4 411.407167 4 0x3ffa01c90048 0xa267c040 0x0 0 W 4 411.407168 4 0x3ffa01c9004c 0x17b4dbba 0x0 0 W 4 411.407169 4 0x3ffa01c90040 0xbbbbbbbb 0x0 0 W 4 411.423834 4 0x3ffa01c90048 0xa3661bc0 0x0 0 W 4 411.423835 4 0x3ffa01c9004c 0x17b4dbba 0x0 0 W 4 411.423836 4 0x3ffa01c90040 0xbbbbbbbb 0x0 0 W 4 411.440515 4 0x3ffa01c90048 0xa464b900 0x0 0 W 4 411.440516 4 0x3ffa01c9004c 0x17b4dbba 0x0 0 W 4 411.440517 4 0x3ffa01c90040 0xbbbbbbbb 0x0 0 W 4 411.457201 4 0x3ffa01c90048 0xa5634d20 0x0 0 W 4 411.457203 4 0x3ffa01c9004c 0x17b4dbba 0x0 0 W 4 411.457204 4 0x3ffa01c90040 0xbbbbbbbb 0x0 0
打开 mmiotrace 之后,除了 CPU 0 之外,其他的 CPU 都会被 offline 掉的
[x] preemptirqsoff preemptoff irqsoff
Documentation/trace/ftrace.rst
wakeup wakeup_dl wakeup_rt
Documentation/trace/ftrace.rst
实现的代码: kernel/trace/trace_sched_wakeup.c
当使用 fio 测试的时候
echo wakeup > current_tracer cat trace
fio-1745 11d..5. 160174482us : 1745:120:R + [011] 379:100:R kworker/11:1H
fio-1745 11d..5. 160174484us : <stack trace>
=> __ftrace_trace_stack
=> probe_wakeup
=> ttwu_do_activate
=> try_to_wake_up
=> kick_pool
=> __queue_work
=> mod_delayed_work_on
=> kblockd_mod_delayed_work_on
=> blk_mq_submit_bio
=> submit_bio_noacct_nocheck
=> blkdev_direct_IO.part.0
=> blkdev_read_iter
=> aio_read
=> io_submit_one
=> __x64_sys_io_submit
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
fio-1745 11d..5. 160174484us : 0
fio-1745 11d..3. 160174692us : __schedule
fio-1745 11d..3. 160174693us : 1745:120:S ==> [011] 379:100:R kworker/11:1H
fio-1745 11d..3. 160174694us : <stack trace>
=> __ftrace_trace_stack
=> probe_wakeup_sched_switch
=> __schedule
=> schedule
=> read_events
=> do_io_getevents
=> __x64_sys_io_getevents
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
其实这个 tracer 主要是 realtime 的时候有用,用于观察,从进入就绪到真的可以执行的时候一共花费了多长时间。
其他阅读材料
- https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_for_real_time/7/html/tuning_guide/using_the_ftrace_utility_for_tracing_latencies
- https://static.lwn.net/images/conf/rtlws11/papers/proc/p02.pdf
- https://www.kernel.org/doc/Documentation/trace/ftrace.txt
sched_switch 和 sched_wakeup
其实并不存在这个 tracer ,而是存在对应这个 event
config SCHED_TRACER
bool "Scheduling Latency Tracer"
select GENERIC_TRACER
select CONTEXT_SWITCH_TRACER
select TRACER_MAX_TRACE
select TRACER_SNAPSHOT
help
This tracer tracks the latency of the highest priority task
to be scheduled in, starting from the point it has woken up.
- kernel/trace/trace_sched_switch.c
echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable 可以得到这个 backtrace 出来:
#0 tracing_start_cmdline_record () at kernel/trace/trace_sched_switch.c:133
#1 0xffffffff812b32c2 in __ftrace_event_enable_disable (file=file@entry=0xffff888100815420, enable=<optimized out>, soft_disable=soft_disable@entry=0) at kernel/trace/trace_events.c:690
#2 0xffffffff812b37f7 in ftrace_event_enable_disable (enable=<optimized out>, file=0xffff888100815420) at kernel/trace/trace_events.c:730
#3 event_enable_write (filp=<optimized out>, ubuf=<optimized out>, cnt=2, ppos=0xffffc90002157ef8) at kernel/trace/trace_events.c:1434
#4 0xffffffff8146867b in vfs_write (file=file@entry=0xffff8881064e4b00, buf=buf@entry=0x55aafd83f980 "1\n1;echo\a1 > enable\asys/kernel/tracing/events/sched/sched_switch \a", count=count@entry=2, pos=pos@entry=0xffffc90002157ef8) at fs/read_write.c:588
#5 0xffffffff81468daf in ksys_write (fd=<optimized out>, buf=0x55aafd83f980 "1\n1;echo\a1 > enable\asys/kernel/tracing/events/sched/sched_switch\a", count=2) at fs/read_write.c:643
#6 0xffffffff8240a8ca in do_syscall_x64 (nr=<optimized out>, regs=0xffffc90002157f58) at arch/x86/entry/common.c:52
#7 do_syscall_64 (regs=0xffffc90002157f58, nr=<optimized out>) at arch/x86/entry/common.c:83
但是谁来调用 tracing_start_tgid_record
[x] branch
Documentation/trace/ftrace.rst
- kernel/trace/trace_branch.c : 实现的文件
三种方式来实现,实现的方式令人震惊
CONFIG_BRANCH_PROFILE_NONE=y
CONFIG_PROFILE_ANNOTATED_BRANCHES
CONFIG_PROFILE_ALL_BRANCHES
include/linux/compiler.h
#ifdef CONFIG_PROFILE_ALL_BRANCHES
/*
* "Define 'is'", Bill Clinton
* "Define 'if'", Steven Rostedt
*/
#define if(cond, ...) if ( __trace_if_var( !!(cond , ## __VA_ARGS__) ) )
#define __trace_if_var(cond) (__builtin_constant_p(cond) ? (cond) : __trace_if_value(cond))
#define __trace_if_value(cond) ({ \
static struct ftrace_branch_data \
__aligned(4) \
__section("_ftrace_branch") \
__if_trace = { \
.func = __func__, \
.file = __FILE__, \
.line = __LINE__, \
}; \
(cond) ? \
(__if_trace.miss_hit[1]++,1) : \
(__if_trace.miss_hit[0]++,0); \
})
#endif /* CONFIG_PROFILE_ALL_BRANCHES */
#else
# define likely(x) __builtin_expect(!!(x), 1)
# define unlikely(x) __builtin_expect(!!(x), 0)
# define likely_notrace(x) likely(x)
# define unlikely_notrace(x) unlikely(x)
#endif
[ ] OSNOISE_TRACER 和 HWLAT_TRACER 的关系是什么?
boottime tracing
检查内核选项: CONFIG_BOOTTIME_TRACING
设置如下等效于
ftrace_notrace=rcu_read_lock,rcu_read_unlock,spin_lock,spin_unlock
ftrace_filter=kfree,kmalloc,schedule,vmalloc_fault,spurious_fault
在开机的时候设置 ftrace_notrace 和 ftrace_filter 中的内容。
- https://docs.kernel.org/trace/boottime-trace.html : 几乎什么都可以做到的
-
https://lpc.events/event/4/contributions/445/attachments/289/487/LPC_2019_-_boottime_tracing.pdf
- 使用 “kprobe_event=” 来实现 kprobes
似乎不容易使用啊
ftrace=function ftrace_filter=request_firmware
🧀 iperf3 -c 10.0.0.2
Connecting to host 10.0.0.2, port 5201
[ 5] local 10.0.0.2 port 46030 connected to 10.0.0.2 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 2.52 GBytes 21.6 Gbits/sec 0 1.06 MBytes
[ 5] 1.00-2.00 sec 2.56 GBytes 22.0 Gbits/sec 0 1.25 MBytes
[ 5] 2.00-3.00 sec 2.48 GBytes 21.3 Gbits/sec 0 1.25 MBytes
[ 5] 3.00-4.00 sec 2.28 GBytes 19.6 Gbits/sec 0 1.25 MBytes
[ 5] 4.00-5.00 sec 2.42 GBytes 20.8 Gbits/sec 0 1.25 MBytes
[ 5] 5.00-6.00 sec 2.38 GBytes 20.5 Gbits/sec 0 1.25 MBytes
[ 5] 6.00-7.00 sec 2.48 GBytes 21.3 Gbits/sec 0 1.25 MBytes
[ 5] 7.00-8.00 sec 2.55 GBytes 21.9 Gbits/sec 0 3.06 MBytes
[ 5] 8.00-9.00 sec 2.50 GBytes 21.4 Gbits/sec 0 3.06 MBytes
[ 5] 9.00-10.00 sec 2.50 GBytes 21.5 Gbits/sec 0 3.06 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 25.4 GBytes 21.8 Gbits/sec 0 sender
[ 5] 0.00-10.00 sec 25.4 GBytes 21.8 Gbits/sec receiver
iperf Done.
调整之后:
🧀 iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 10.0.0.2, port 36052
[ 5] local 10.0.0.2 port 5201 connected to 10.0.0.2 port 36056
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 14.7 GBytes 126 Gbits/sec
[ 5] 1.00-2.00 sec 14.1 GBytes 121 Gbits/sec
[ 5] 2.00-3.00 sec 14.5 GBytes 124 Gbits/sec
[ 5] 3.00-4.00 sec 14.6 GBytes 126 Gbits/sec
[ 5] 4.00-5.00 sec 14.6 GBytes 125 Gbits/sec
[ 5] 4.00-5.00 sec 14.6 GBytes 125 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-5.00 sec 80.8 GBytes 139 Gbits/sec receiver
CONFIG_FUNCTION_PROFILER
echo 1 > /sys/kernel/debug/tracing/function_profile_enabled
cd /sys/kernel/debug/tracing/trace_stat
其中每一个文件对应一个 CPU 的函数统计:
trace/ftrace.rst 中 function_profile_enabled 这个章节:
文摘
- A look at ftrace
- 价值不大
- Secrets of the Ftrace function tracer
- 基本的,最核心的使用,很有价值,常看常新
- Debugging the kernel using Ftrace - part 1
- 介绍了 trace_prink
- Debugging the kernel using Ftrace - part 2
- 介绍了 stack tracking 的技术
- tracing_off() 在内核中直接调控 tracing_on
- 通过设置 /proc/sys/kernel/ftrace_dump_on_oops : 可以实现在内核挂掉的时候,让 ftrace buffer 的内容输出到 printk 中
- You can also trigger a dump of the Ftrace buffer to the console with sysrq-z.
/sys/kernel/debug/tracing/README
- trace_marker : 给 trace 增加标记的
- set_ftrace_filter 和 events/
/ /trigger 的差别 - 其中的 stacktrace snapshot dump cpudump 的操作效果检查下
Documentation/trace/ftrace.rst
- 参考
The File Systemsection 和cat /sys/kernel/tracing/README来看看 options 都是做什么的- buffer_size_kb : 每一个 CPU 的 buffer size
- buffer_total_size_kb : readonly 的,展示所有 CPU 的 buffer 的和
- buffer_subbuf_size_kb : 和超大 log entry 有关,系统底层了,不关心了.
- 一共到底存在那些 tracer
- 讲解 filter-commands 的各种技巧
Documentation/trace/kprobetrace.rst
You can also use /sys/kernel/tracing/dynamic_events instead of kprobe_events. That interface will provide unified access to other dynamic events too.
[root@nixos:/sys/kernel/debug/tracing]# cat dynamic_events
p:kprobes/myprobe2 do_sys_openat2 dfd=%ax filename=%dx open_how=%cx usize=+4($stack)
这样可以:
echo 'f:myprobe vfs_read' > dynamic_events
然后 events 目录中会增加 fprobes
Documentation/trace/kprobes.rst
分析 kprobe 的实现原理
Documentation/trace/kprobetrace.rst
分析 kprobe 在 ftrace 接口中如何使用
材料
-
https://stackoverflow.com/questions/58080787/ftrace-function-enabling-fails-with-resource-busy
-
通过 dyn_ftrace_total_info 可以知道有多少函数其实含有 nop 指令的
64566 pages:472 groups: 273
brendangregg/perf-tools
https://github.com/brendangregg/perf-tools : 被废弃的项目,基本说明了 ftrace 是怎么操作的 作为 trace 技术的考古还是非常有意思的。
其他特殊功能
- instances 目录 : 在这个目录中操作会记录到单独的 trace buffer
-
snapshot : 对于 trace buffer 做 snapshot
tracing_on和/proc/sys/kernel/ftrace_enabled的含义: 没太搞懂差别,但是一般只是使用 tracing_on 就够了
options
记录下让人好奇的 trace options
-
display-graph
When set, the latency tracers (irqsoff, wakeup, etc) will use function graph tracing instead of function tracing.
-
function-trace
The latency tracers will enable function tracing if this option is enabled (default it is). When it is disabled, the latency tracers do not trace functions. This keeps the overhead of the tracer down when performing latency tests.
-
funcgraph-irqs
When disabled, functions that happen inside an interrupt will not be traced.
-
funcgraph-abstime
When set, the timestamp is displayed at each line.
-
funcgraph-proc
Unlike other tracers, the process’ command line is not displayed by default, but instead only when a task is traced in and out during a context switch. Enabling this options has the command of each process displayed at every line.
trace_printk
关于 trace_printk 的内容,参考 https://lwn.net/articles/365835/
这这个图是怎么做的?
https://mp.weixin.qq.com/s/-qSXrEa1NsUxjtqxF8BvpA
我的记忆是有 ftrace 本来就有 backtrace 的,一个 option ?
man trace-cmd-record
--func-stack
Enable a stack trace on all functions. Note this is only applicable for the "function" plugin tracer, and will only take
effect if the -l option is used and succeeds in limiting functions. If the function tracer is not filtered, and the
stack trace is enabled, you can live lock the machine.
这里有一堆的 function-graph 的 option ,也许可以让问题更加简单
[root@nixos:/sys/kernel/debug/tracing]# ls options/
annotate disable_on_free funcgraph-irqs function-fork markers record-tgid test_nop_refuse
bin display-graph funcgraph-overhead function-trace overwrite sleep-time trace_printk
blk_cgname event-fork funcgraph-overrun graph-time pause-on-trace stacktrace userstacktrace
blk_cgroup fields funcgraph-proc hash-ptr printk-msg-only sym-addr verbose
blk_classic funcgraph-abstime funcgraph-tail hex print-parent sym-offset
block funcgraph-cpu func-no-repeats irq-info raw sym-userobj
context-info funcgraph-duration func_stack_trace latency-format record-cmd test_nop_accept
TODO
Documentation/trace/ftrace.rst 中分析了几个非常关键的技术
preemptirqsoff
--------------
Knowing the locations that have interrupts disabled or
preemption disabled for the longest times is helpful. But
sometimes we would like to know when either preemption and/or
interrupts are disabled.
利用这个好好看看 ftrace 来如何分析内核,尤其是 preemption
什么时候才会产生 stacktrace ,之前还一直以为内核是不会产生 backtrace 的:
➜ tracing cat trace
# tracer: preemptirqsoff
#
# preemptirqsoff latency trace v1.1.5 on 6.7.0-rc8-dirty
# --------------------------------------------------------------------
# latency: 102 us, #4/4, CPU#28 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:32)
# -----------------
# | task: irqbalance-963 (uid:0 nice:0 policy:0 rt_prio:0)
# -----------------
# => started at: mnt_want_write
# => ended at: mnt_want_write
#
#
# _------=> CPU#
# / _-----=> irqs-off/BH-disabled
# | / _----=> need-resched
# || / _---=> hardirq/softirq
# ||| / _--=> preempt-depth
# |||| / _-=> migrate-disable
# ||||| / delay
# cmd pid |||||| time | caller
# \ / |||||| \ | /
irqbalan-963 28...1. 0us : mnt_want_write <-mnt_want_write
irqbalan-963 28...1. 0us!: mnt_want_write <-mnt_want_write
irqbalan-963 28...1. 103us+: tracer_preempt_on <-mnt_want_write
irqbalan-963 28...1. 117us : <stack trace>
=> path_openat
=> do_filp_open
=> do_sys_openat2
=> __x64_sys_openat
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
https://hn.algolia.com/?dateRange=all&page=0&prefix=false&query=nvme&sort=byPopularity&type=story
这个案例不错 https://stackoverflow.com/questions/72476553/how-to-monitor-the-io-queue-depth
实现相关的文件,都是有 perf_event_open 对应的实现
- kernel/trace/trace_kprobe.c
- kernel/trace/trace_uprobe.c
- kernel/trace/trace_kprobe.c
- kernel/trace/trace_eprobe.c
本站所有文章转发 CSDN 将按侵权追究法律责任,其它情况随意。