Skip to the content.

Memory Failure

基本操作

按钮:

/proc/sys/vm/memory_failure_early_kill  (Disabled by default, Will immediately kill processes mapped to the page)
/proc/sys/vm/memory_failure_recovery  (Enabled by default, The page is unmapped/Isolated and when the page is referenced back then it will kill process)

操作:

文档

https://www.kernel.org/doc/html/latest/mm/hwpoison.html

sudo modprobe hwpoison-inject echo 0x3e8 > /sys/kernel/debug/hwpoison/corrupt-pfn

# 随便注入一个
[   83.857915] Injecting memory failure at pfn 0x3e8
[   83.858191] Memory failure: 0x3e8: recovery action for free buddy page: Recovered

# 通过 docs/kernel/mm-failure.sh 注入一个进程的页
[ 7165.324423] Injecting memory failure at pfn 0x17bcde
[ 7165.324824] Memory failure: 0x17bcde: recovery action for dirty LRU page: Recovered

但是如果程序去写,那么就可以看到这些内容:

[197425.968665] Memory failure: 0x40b7719: recovery action for dirty LRU page: Recovered
[197426.771495] MCE: Killing a.out:299207 due to hardware memory corruption fault at 7fea7092d000

还引入了 MCE ,那么 aarch64 中恐怕就有不同的效果了。

参考

https://blogs.oracle.com/linux/post/memory-ue-handling-and-hwpoison-injection

如何注入 zcat /proc/config.gz | egrep ‘CONFIG_DEBUG_FS=y|CONFIG_ACPI_APEI=y|CONFIG_ACPI_APEI_EINJ=m’

CONFIG_ACPI_APEI=y
CONFIG_ACPI_APEI_GHES=y
CONFIG_ACPI_APEI_MEMORY_FAILURE=y
CONFIG_ACPI_APEI_EINJ=m

hygon 机器是可以的,kunpeng 也是可以的:

🧀  dmesg | grep EINJ
[    0.007486] ACPI: EINJ 0x000000007861A048 000150 (v01 HYGON  HGN EINJ 00000001 HGN  00000001)
[    0.007523] ACPI: Reserving EINJ table memory at [mem 0x7861a048-0x7861a197]

hygon 测试

只能这些内容:

[root@localhost 14:38:29 einj]$ cat available_error_type
0x00000001      Processor Correctable
0x00000002      Processor Uncorrectable non-fatal
0x00000004      Processor Uncorrectable fatal

hygon 注入一个,结果为,机器直接宕机

sudo modprobe einj
echo 1 | sudo tee /sys/kernel/debug/apei/einj/error_type
echo 1 | sudo tee /sys/kernel/debug/apei/einj/notrigger
echo 1 | sudo tee /sys/kernel/debug/apei/einj/error_inject

如果触发 1 ,那么没有任何工作。

intel

而 Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz 上有这些:

[root@n-201 14:55:08 einj]$ cat available_error_type
0x00000008      Memory Correctable
0x00000010      Memory Uncorrectable non-fatal
0x00000020      Memory Uncorrectable fatal
0x00000040      PCI Express Correctable
0x00000080      PCI Express Uncorrectable non-fatal
0x00000100      PCI Express Uncorrectable fatal
echo 0x10 > /sys/kernel/debug/apei/einj/error_type
echo 0x40e38ca000   > /sys/kernel/debug/apei/einj/param1
echo 0xffffffffffffff00 > /sys/kernel/debug/apei/einj/param2
echo 2 > /sys/kernel/debug/apei/einj/flags
echo 1 > /sys/kernel/debug/apei/einj/notrigger
echo 1 > /sys/kernel/debug/apei/einj/error_inject

原来虚拟机中是无法模拟这个的,似乎说的有道理: https://unix.stackexchange.com/questions/715963/cant-modprobe-einj-in-vm-cloud-services

但是似乎 ARM 是支持的?

https://patchwork.kernel.org/project/qemu-devel/cover/20190906083152.25716-1-zhengxiang9@huawei.com/#22903991

67.88 环境

/sys/kernel/debug/apei/einj # cat available_error_type                                                                                   root@6788
0x00000008      Memory Correctable
0x00000010      Memory Uncorrectable non-fatal
0x00000020      Memory Uncorrectable fatal
echo 0x10 > /sys/kernel/debug/apei/einj/error_type
echo 0x40e38ca000   > /sys/kernel/debug/apei/einj/param1
echo 0xffffffffffffff00 > /sys/kernel/debug/apei/einj/param2
echo 2 > /sys/kernel/debug/apei/einj/flags
echo 1 > /sys/kernel/debug/apei/einj/notrigger
echo 1 > /sys/kernel/debug/apei/einj/error_inject

可以得到:

[198348.652036] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[198348.653106] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[198348.654146] {1}[Hardware Error]: event severity: corrected
[198348.655142] {1}[Hardware Error]:  Error 0, type: corrected
[198348.656123] {1}[Hardware Error]:  fru_text: B1

机器直接宕机,因为让 trigger 了:

echo 0x10 > /sys/kernel/debug/apei/einj/error_type
echo 0x40e38ca000   > /sys/kernel/debug/apei/einj/param1
echo 0xffffffffffffff00 > /sys/kernel/debug/apei/einj/param2
echo 2 > /sys/kernel/debug/apei/einj/flags
# echo 1 > /sys/kernel/debug/apei/einj/notrigger
echo 1 > /sys/kernel/debug/apei/einj/error_inject

重启之后:

这个没有任何反应的,差别为使用 error_type

echo 0x08 > /sys/kernel/debug/apei/einj/error_type
echo 0x40c4ee9000 > /sys/kernel/debug/apei/einj/param1
echo 0xffffffffffffff00 > /sys/kernel/debug/apei/einj/param2
echo 2 > /sys/kernel/debug/apei/einj/flags
echo 1 > /sys/kernel/debug/apei/einj/notrigger
echo 1 > /sys/kernel/debug/apei/einj/error_inject

修改为 0x20 之后,系统直接宕机。

echo 0x20 > /sys/kernel/debug/apei/einj/error_type
echo 0x40c4ee9000 > /sys/kernel/debug/apei/einj/param1
echo 0xfffffffffffff000 > /sys/kernel/debug/apei/einj/param2
echo 2 > /sys/kernel/debug/apei/einj/flags
echo 1 > /sys/kernel/debug/apei/einj/notrigger
echo 1 > /sys/kernel/debug/apei/einj/error_inject

在 bmc 的界面中也可以观察到错误。

kunpeng

[root@localhost einj]# cat available_error_type
0x00000001      Processor Correctable
0x00000002      Processor Uncorrectable non-fatal
0x00000004      Processor Uncorrectable fatal
0x00000008      Memory Correctable
0x00000010      Memory Uncorrectable non-fatal
0x00000020      Memory Uncorrectable fatal
0x00000040      PCI Express Correctable
0x00000080      PCI Express Uncorrectable non-fatal
0x00000100      PCI Express Uncorrectable fatal
0x00000200      Platform Correctable
0x00000400      Platform Uncorrectable non-fatal
0x00000800      Platform Uncorrectable fatal

似乎之前不是这个样子的

echo 0x08 > /sys/kernel/debug/apei/einj/error_type echo 0x203d1b744000 > /sys/kernel/debug/apei/einj/param1 echo 0xfffffffffffff000 > /sys/kernel/debug/apei/einj/param2 echo 2 > /sys/kernel/debug/apei/einj/flags echo 1 > /sys/kernel/debug/apei/einj/notrigger echo 1 > /sys/kernel/debug/apei/einj/error_inject

cat /proc/meminfo 中的: HardwareCorrupted: 0 kB 也不变

补充一个 ACPI

解析 cat /sys/firmware/acpi/tables/EINJ 中的内容:

page poison 似乎和这个不是一个东西吧

https://lwn.net/Articles/753261/

触发的代码:

static void
do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
	  vm_fault_t fault)
{
	/* Kernel mode? Handle exceptions or die: */
	if (!user_mode(regs)) {
		kernelmode_fixup_or_oops(regs, error_code, address,
					 SIGBUS, BUS_ADRERR, ARCH_DEFAULT_PKEY);
		return;
	}

	/* User-space => ok to do another page fault: */
	if (is_prefetch(regs, error_code, address))
		return;

	sanitize_error_code(address, &error_code);

	if (fixup_vdso_exception(regs, X86_TRAP_PF, error_code, address))
		return;

	set_signal_archinfo(address, error_code);

#ifdef CONFIG_MEMORY_FAILURE
	if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) {
		struct task_struct *tsk = current;
		unsigned lsb = 0;

		pr_err(
	"MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
			tsk->comm, tsk->pid, address);
		if (fault & VM_FAULT_HWPOISON_LARGE)
			lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
		if (fault & VM_FAULT_HWPOISON)
			lsb = PAGE_SHIFT;
		force_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb);
		return;
	}
#endif
	force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *)address);
}

问题,如果一个 page 坏了,是如何隔离的

hwpoison

对应的文件: mm/memory-failure.c

kernel doc

url: https://www.kernel.org/doc/html/latest/vm/hwpoison.html

The code consists of a the high level handler in mm/memory-failure.c, a new page poison bit and various checks in the VM to handle poisoned pages.

The main target right now is KVM guests, but it works for all kinds of applications. KVM support requires a recent qemu-kvm release.

For the KVM use there was need for a new signal type so that KVM can inject the machine check into the guest with the proper address. This in theory allows other applications to handle memory failures too. The expection is that near all applications won’t do that, but some very specialized ones might.

HWPOISON

url : https://lwn.net/Articles/348886/

While, HWPOISON adopted Intel’s usage of the term “poisoning”, this should not be confused with the unrelated Linux kernel concept of poisoning: writing a pattern to memory to catch uninitialized memory.

In either case, the hardware doesn’t immediately cause a machine check but rather flags the data unit as poisoned until read (or consumed). Later, when erroneous data is read by executing software, a machine check is initiated. If the erroneous data is never read, no machine check is necessary.

将错误推迟,这好吗?

Thus, HWPOISON focuses on memory containment at the page granularity rather than the low granularity supported by Intel’s MCA Recovery hardware.

HWPOISON finds the page containing the poisoned data and attempts to isolate this page from further use. Potentially corrupted processes can then be located by finding all processes that have the corrupted page mapped.

To enable the HWPOISON handler, the kernel configuration parameter MEMORY_FAILURE must be set. Otherwise, hardware poisoning will cause a system panic. Additionally, the architecture must support data poisoning. As of this writing, HWPOISON is enabled for all architectures to make testing on any machine possible via a user-mode fault injector, which is detailed below.

The poisoned bit in the flags field serves as a lock allowing rapid-fire poisoning machine checks on the same page to be handled only once by ignoring subsequent calls to the handler.

Since faulty hardware that supports data poisoning is not easy to come by, a fault injection test harness mm/hwpoison-inject.c has also been developed. This simple harness uses debugfs to allow failures at an arbitrary page to be injected.

Two VM sysctl parameters are supported by HWPOISON with respect to killing user processes: vm.memory_failure_early_kill and vm.memory_failure_recovery. Setting the vm.memory_failure_early_kill parameter causes an immediate SIGBUS to be sent to the user process(es).

Memory poison recovery in khugepaged

url : https://lwn.net/Articles/889134/

We are also careful to unwind operations khuagepaged has performed before it detects memory failures. For example, before copying and collapsing a group of anonymous pages into a huge page, the source pages will be isolated and their page table is unlinked from their PMD. These operations need to be undone in order to ensure these pages are not changed/lost from the perspective of other threads (both user and kernel space). As for file backed memory pages, there already exists a rollback case. This patch just extends it so that khugepaged also correctly rolls back when it fails to copy poisoned 4K pages.

没看代码,这段不知道什么意思,但是这个 patch 描述的含义非常清楚,khugepage 经常扫描内存,如果遇到,那么就负责将其保留起来。

测试工具

https://lwn.net/Articles/893565/

这个是需要看看的

https://www.youtube.com/watch?v=YVow-qoOJJE

如何理解这个代码?

/*
 * Someone wants to read @bytes from a HWPOISON hugetlb @page from @offset.
 * Returns the maximum number of bytes one can read without touching the 1st raw
 * HWPOISON subpage.
 *
 * The implementation borrows the iteration logic from copy_page_to_iter*.
 */
static size_t adjust_range_hwpoison(struct page *page, size_t offset, size_t bytes)
{
	size_t n = 0;
	size_t res = 0;

	/* First subpage to start the loop. */
	page = nth_page(page, offset / PAGE_SIZE);
	offset %= PAGE_SIZE;
	while (1) {
		if (is_raw_hwpoison_page_in_hugepage(page))
			break;

		/* Safe to read n bytes without touching HWPOISON subpage. */
		n = min(bytes, (size_t)PAGE_SIZE - offset);
		res += n;
		bytes -= n;
		if (!bytes || !n)
			break;
		offset += n;
		if (offset == PAGE_SIZE) {
			page = nth_page(page, 1);
			offset = 0;
		}
	}

	return res;
}

看看

https://blogs.oracle.com/linux/post/memory-ue-handling-and-hwpoison-injection

本站所有文章转发 CSDN 将按侵权追究法律责任,其它情况随意。