Fix capture-kernel failure with notsc!
Kernel Bug:
If 'notsc' is specified for the capture kernel and a crash is then triggered, the capture kernel hangs at "Calibrating delay loop...".
The serial console log is as follows:
............
[ 0.000000] Linux version 4.7.0-rc2+
(root@localhost.localdomain)
(gcc version 4.8.2 20140120 (Red Hat 4.8.2-16) (GCC) ) #2 SMP Wed Jun 156
[ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-4.7.0-rc2+ root=/dev/mapper/centos-root ro rd.lvm.lv=centos/swap vconsole.font=latarcyrheb-sun16 rd.lvm.lv=centos/root crashkernel=256M vconsole.keymap=us console=tty0 console=ttyS0,115200n8 LANG=en_US.UTF-8 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off panic=10 rootflags=nofail acpi_no_memhotplug notsc
............
[ 0.000000] tsc: Kernel compiled with CONFIG_X86_TSC, cannot disable TSC completely
............
[ 0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns
[ 0.000000] tsc: Fast TSC calibration using PIT
[ 0.000000] tsc: Detected 3192.714 MHz processor
[ 0.000000] Calibrating delay loop...
Note:
- This bug was found in the upstream kernel, so it is best to compile the latest kernel source. You can refer to my blog post Debug kernel by qemu! for how to compile it.
- Of course, you must enable the Kdump-related configuration options before running make -j<n>. You can refer to Documentation/kdump.txt.
- Specify 'notsc' for the capture kernel in /etc/sysconfig/kdump. The append line in that file looks like this on my machine:
KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail"
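For example, appending notsc to that line gives (adjust to your own kdump setup):
KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail notsc"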
- Trigger a crash as follows:
# echo 1 > /proc/sys/kernel/sysrq
# echo c > /proc/sysrq-trigger
Investigation and Report:
- By browsing the code, I found that the capture kernel hangs in calibrate_delay_converge():
        /* wait for "start of" clock tick */
        ticks = jiffies;
        while (ticks == jiffies)
                ; /* nothing */
It seems that jiffies is not being incremented; it only advances when timer interrupts arrive, and none are.
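For reference, jiffies only moves forward from the timer-tick path; simplified from kernel/time/timekeeping.c:
/*
 * Simplified: the periodic tick ends up in do_timer(), which is the only
 * place the jiffies counter advances. With no timer interrupts arriving,
 * the while loop above spins forever.
 */
void do_timer(unsigned long ticks)
{
        jiffies_64 += ticks;            /* jiffies aliases the low bits of jiffies_64 */
        calc_global_load(ticks);
}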
Two proposals; you can refer to the RFC email I sent:
- Take a look at its caller, calibrate_delay(). If I specify an 'lpj' value, the calibration that hangs can be skipped:
void calibrate_delay(void)
{
        unsigned long lpj;
        static bool printed;
        int this_cpu = smp_processor_id();

        if (per_cpu(cpu_loops_per_jiffy, this_cpu)) {
                lpj = per_cpu(cpu_loops_per_jiffy, this_cpu);
                if (!printed)
                        pr_info("Calibrating delay loop (skipped) "
                                "already calibrated this CPU");
        } else if (preset_lpj) {
                lpj = preset_lpj;
                if (!printed)
                        pr_info("Calibrating delay loop (skipped) "
                                "preset value.. ");
        } else if ((!printed) && lpj_fine) {
                lpj = lpj_fine;
                pr_info("Calibrating delay loop (skipped), "
                        "value calculated using timer frequency.. ");
        } else if ((lpj = calibrate_delay_is_known())) {
                ;
        } else if ((lpj = calibrate_delay_direct()) != 0) {
                if (!printed)
                        pr_info("Calibrating delay using timer "
                                "specific routine.. ");
        } else {
                if (!printed)
                        pr_info("Calibrating delay loop... ");
                lpj = calibrate_delay_converge();
        }
        per_cpu(cpu_loops_per_jiffy, this_cpu) = lpj;
        if (!printed)
                pr_cont("%lu.%02lu BogoMIPS (lpj=%lu)\n",
                        lpj/(500000/HZ),
                        (lpj/(5000/HZ)) % 100, lpj);

        loops_per_jiffy = lpj;
        printed = true;

        calibration_delay_done();
}
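The preset_lpj branch above is filled from the 'lpj=' boot parameter, which init/calibrate.c parses roughly like this (simplified):
static unsigned long preset_lpj;

static int __init lpj_setup(char *str)
{
        /* Whatever value is passed on the command line becomes preset_lpj,
         * so calibrate_delay() never reaches calibrate_delay_converge(). */
        preset_lpj = simple_strtoul(str, NULL, 0);
        return 1;
}
__setup("lpj=", lpj_setup);
So appending lpj=<value> (for instance, the lpj value printed by the first kernel's "Calibrating delay loop" line) to the capture kernel's command line avoids the hang. It is only a workaround, though, not a fix.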
- Revert commit 70de9a9. IMO, the flow for getting tsc_khz is tsc_init() -> x86_platform.calibrate_tsc() -> native_calibrate_tsc() -> quick_pit_calibrate(), so I think tsc_khz is still available to calculate lpj even with 'notsc'. But it was denied: Denied Message.
commit 70de9a97049e0ba79dc040868564408d5ce697f9
Author: Alok Kataria <akataria@vmware.com>
Date: Mon Nov 3 11:18:47 2008 -0800
x86: don't use tsc_khz to calculate lpj if notsc is passed
Impact: fix udelay when "notsc" boot parameter is passed
With notsc passed on commandline, tsc may not be used for
udelays, make sure that we do not use tsc_khz to calculate
the lpj value in such cases.
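For reference, reverting it would mean tsc_init() again derives lpj_fine from tsc_khz even when 'notsc' is passed; the computation itself looks roughly like this (simplified from arch/x86/kernel/tsc.c):
        /* Once tsc_khz is known, loops-per-jiffy can be derived directly
         * instead of being converged on by the delay loop. */
        lpj = ((u64)tsc_khz * 1000);
        do_div(lpj, HZ);
        lpj_fine = lpj;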
The Arch-Criminal
- I found the first bad commit <522e66464467> by bisecting. It was meant to fix erratum AVR31 from the "Intel Atom Processor C2000 Product Family Specification Update". You can find the Spec.
commit 522e66464467543c0d88d023336eec4df03ad40b
Author: Fenghua Yu <fenghua.yu@intel.com>
Date: Wed Oct 23 18:30:12 2013 -0700
x86/apic: Disable I/O APIC before shutdown of the local APIC
In reboot and crash path, when we shut down the local APIC, the I/O APIC is
still active. This may cause issues because external interrupts
can still come in and disturb the local APIC during shutdown process.
To quiet external interrupts, disable I/O APIC before shutdown local APIC.
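In other words, after this commit the crash/reboot shutdown path quiesces the I/O APIC before the local APIC, roughly like this (illustrative ordering, not the literal diff):
        /* Ordering introduced by 522e6646 (illustrative, not the literal diff):
         * quiet external interrupts first, then shut down the local APIC. */
        disable_IO_APIC();
        lapic_shutdown();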
Solution 1st [denied]
- It doesn't make sense to me to change the order of disabling the I/O APIC and the local APIC just for one particular model (C2000), and I couldn't find any related description in the Intel 64 and IA-32 architecture manuals. So I sent PATCH v1 to revert it. The feedback I got was: "By reverting the change can paper over the bug, but re-introduce the bug that can result in certain CPUs hanging if IO-APIC sends an APIC message if the lapic is disabled prematurely".
Solution 2nd [denied]
- The local APIC is disabled in the reboot and crash path by lapic_shutdown(), so no timer interrupts are delivered to the BSP through the APIC and jiffies is never updated. We need to put the APIC back into legacy mode in the kexec jump path (i.e. put the system back onto the PIT before entering the crash kernel): PATCH v2. But the folks on LKML suggested it should be fixed in the boot path of the dump kernel, not in the crash/reboot path.
Solution 3rd [denied]
- In fact, the local APIC and its timer are not yet set up at the point where the dump-capture kernel waits for them to update jiffies. So I suggested temporarily putting the APIC into legacy mode in local_apic_timer_interrupt(), which is in the boot path of the dump kernel.
Solution 4th [discussing]
- Generally speaking, the local APIC state is initialized by the BIOS after a power-up or reset, but that does not apply to the kdump case. So the kernel has to be responsible for initializing the interrupt mode properly, according to the current APIC state, in the boot path. We also have to consider the worst case in which no effective interrupt mode has been set and choose a sane one ourselves. Refer to PATCH v2 for more details.
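To make the idea concrete, here is a rough sketch of selecting the interrupt mode from the current hardware state in the boot path (the names and structure are illustrative only, not the actual patch):
/* Illustrative only -- not the actual patch. The point is that the boot
 * path decides the interrupt delivery mode itself instead of trusting
 * whatever state the crashed kernel left behind. */
enum intr_mode { INTR_MODE_PIC, INTR_MODE_VIRTUAL_WIRE, INTR_MODE_SYMMETRIC_IO };

static enum intr_mode select_intr_mode(void)
{
        if (!boot_cpu_has(X86_FEATURE_APIC))
                return INTR_MODE_PIC;           /* no usable local APIC */
        if (!smp_found_config)
                return INTR_MODE_VIRTUAL_WIRE;  /* no MP/ACPI IRQ routing info */
        return INTR_MODE_SYMMETRIC_IO;          /* normal I/O APIC operation */
}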
Nothing seek, nothing find!