Opened 3 years ago
Closed 3 years ago
#829 closed defect (fixed)
System deadlocks early in boot on ia32 with FPU lazy context switching off
Reported by: | Jiri Svoboda | Owned by: | Jiri Svoboda |
---|---|---|---|
Priority: | major | Milestone: | 0.12.1 |
Component: | helenos/unspecified | Version: | mainline |
Keywords: | Cc: | ||
Blocker for: | Depends on: | ||
See also: |
Description
Steps to reproduce:
(1) Build the system with the default ia32 build config, except with lazy FPU context switching disabled
(2) tools/ew.py
Attachments (1)
Change History (13)
by , 3 years ago
Attachment: | lazyfpuoff_deadlock.png added |
---|
comment:2 by , 3 years ago
Milestone: | 0.11.2 → 0.12.1 |
---|
comment:3 by , 3 years ago
amd64 + lazy FPU context switching = Off works fine (i.e. problem NOT reproduced)
comment:4 by , 3 years ago
Owner: | set to |
---|---|
Status: | new → assigned |
Stack trace:
0x81967e20: 0x8012a8d5 stack_trace+0x13
0x81967e70: 0x8013c761 spinlock_lock_debug+0x84
0x81967eb0: 0x8013c8fd irq_spinlock_lock+0x24
0x81967f10: 0x8010c044 exc_dispatch+0x3d
0x81967f3c: 0x801189ef int_7+0x69
0x81967f74: 0x8011e318 fpu_init+0xf
0x81967ff4: 0x80138adb scheduler_separated_stack+0x59c
8011e309 <fpu_init>:
8011e309: 55                    push %ebp
8011e30a: 89 e5                 mov %esp,%ebp
8011e30c: 83 ec 10              sub $0x10,%esp
8011e30f: c7 45 fc 00 00 00 00  movl $0x0,-0x4(%ebp)
8011e316: 31 c0                 xor %eax,%eax
**** 8011e318: db e3            fninit ****
8011e31a: 0f ae 5d fc           stmxcsr -0x4(%ebp)
8011e31e: 8b 45 fc              mov -0x4(%ebp),%eax
8011e321: 0d 80 1f 00 00        or $0x1f80,%eax
8011e326: 89 45 fc              mov %eax,-0x4(%ebp)
8011e329: 0f ae 55 fc           ldmxcsr -0x4(%ebp)
8011e32d: c9                    leave
8011e32e: c3                    ret
comment:5 by , 3 years ago
While executing the fninit instruction in fpu_init() we get a #NM (7) exception. exc_dispatch() then attempts to lock THREAD->lock, which is already held, leading to deadlock.
fpu_init() is called from before_thread_runs() from scheduler_separated_stack() while THREAD->lock is being held.
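To make the failure mode concrete, here is a minimal user-space sketch of the same pattern: a path that holds a non-recursive lock takes a trap, and the trap handler needs that same lock. Every `toy_*` name is hypothetical and only stands in for the kernel code named above. The sketch uses a try-lock so that it can report the problem instead of spinning forever as a real spinlock would.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical non-recursive lock; stands in for THREAD->lock. */
typedef struct {
    bool held;
} toy_lock_t;

static toy_lock_t toy_thread_lock;

/* Returns true if the lock was taken, false where a real spinlock would spin. */
bool toy_trylock(toy_lock_t *lock)
{
    if (lock->held)
        return false;
    lock->held = true;
    return true;
}

void toy_unlock(toy_lock_t *lock)
{
    lock->held = false;
}

/* Stand-in for exc_dispatch(): takes the thread lock for accounting. */
bool toy_exc_dispatch(void)
{
    if (!toy_trylock(&toy_thread_lock))
        return false; /* the real handler would spin here forever */
    /* ... thread accounting would be updated here ... */
    toy_unlock(&toy_thread_lock);
    return true;
}

/*
 * Stand-in for the scheduler path: it holds the thread lock and then
 * runs an FPU instruction that may or may not trap.
 */
bool toy_scheduler_path(bool fpu_insn_traps)
{
    bool acquired = toy_trylock(&toy_thread_lock);
    assert(acquired);
    (void) acquired;

    bool ok = true;
    if (fpu_insn_traps)
        ok = toy_exc_dispatch(); /* handler runs with the lock already held */

    toy_unlock(&toy_thread_lock);
    return ok;
}
```

In this sketch `toy_scheduler_path(false)` succeeds, while `toy_scheduler_path(true)` reports failure at exactly the point where the stack trace shows spinlock_lock_debug().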
comment:6 by , 3 years ago
I might be wrong, but it seems to me that we do not expect to get an exception while executing fninit. Here are a few interesting tidbits from the Intel 64 and IA-32 Architectures Software Developer's Manual:
FINIT/FNINIT Instruction
The FINIT instruction checks for and handles any pending unmasked floating-point exceptions before performing the initialization; the FNINIT instruction does not.
and also
When operating a Pentium or Intel486 processor in MS-DOS compatibility mode, it is possible (under unusual circumstances) for an FNINIT instruction to be interrupted prior to being executed to handle a pending FPU exception. See the section titled “No-Wait FPU Instructions Can Get FPU Interrupt in Window” in Appendix D of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for a description of these circumstances. An FNINIT instruction cannot be interrupted in this way on later Intel processors, except for the Intel Quark™ X1000 processor.
and also
8.7.2 MS-DOS* Compatibility Sub-mode
If CR0.NE[bit 5] is 0, the MS-DOS compatibility mode for handling floating-point exceptions is selected. In this mode, the software exception handler for floating-point exceptions is invoked externally using the processor’s FERR#, INTR, and IGNNE# pins. This method of reporting floating-point errors and invoking an exception handler is provided to support the floating-point exception handling mechanism used in PC systems that are running the MS-DOS or Windows* 95 operating system.
comment:7 by , 3 years ago
Also forgot to mention that THREAD->lock locking was added to exc_dispatch() for the purpose of thread accounting.
comment:8 by , 3 years ago
If I run the ia32 build as usual, except replace qemu-system-i386 with qemu-system-x86_64, the system still deadlocks. (I thought maybe the CPU is different and it might make some difference).
Passing different -cpu flags to qemu-system-x86_64 (e.g. 486, Broadwell, Haswell, host, max) does not seem to make any difference either.
comment:9 by , 3 years ago
I think the immediate cause here is that CR0.TS is set and, for some reason, the conditions are such that with CR0.TS set fninit causes a #NM exception. I think I've verified this: I added a "clts" instruction to the beginning of fpu_init(), which allowed the system to start booting before it deadlocked again. I then added a "clts" instruction to the beginning of fpu_context_f_restore() and fpu_context_fx_restore(), and the system booted up correctly.
That said, I am not sure whether this is the correct fix. I'd like to look at the conditions needed for fninit to generate an exception (e.g. the MS-DOS compatibility mode); perhaps preventing these conditions could be used as a solution as well.
comment:10 by , 3 years ago
Ok so MS-DOS Compatibility mode means CR0.NE == 0. We do not set CR0.NE (which is 0 by default), i.e. we do not enable internal FPU exceptions (CR0.NE is available on 486+). Now I tried setting CR0.NE = 1 in pm_init(), but this did not help (note this should change the exception vector number, although I did not notice any problem even with lazy FPU switching enabled). Even if it did help, it would be no help on an actual 386 system.
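For reference, the CR0.NE experiment amounts to setting one bit. The sketch below works on a plain integer rather than the real register, so it runs in user space. The helper names are hypothetical. The bit position comes from the manual excerpt quoted earlier, and the sample value 0x60000010 is the CR0 reset value as I recall it from the Intel manual, so treat it as an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CR0.NE (Numeric Error) is bit 5, as in the manual excerpt above. */
#define CR0_NE (UINT32_C(1) << 5)

/* True if this CR0 value selects MS-DOS compatibility FPU error reporting. */
bool cr0_msdos_fpu_mode(uint32_t cr0)
{
    return (cr0 & CR0_NE) == 0;
}

/* Returns the CR0 value with native (internal) FPU error reporting selected. */
uint32_t cr0_select_native_fpu_errors(uint32_t cr0)
{
    return cr0 | CR0_NE;
}

/* Small self-check on the assumed reset value of CR0. */
void cr0_ne_selfcheck(void)
{
    uint32_t cr0 = UINT32_C(0x60000010); /* assumed reset value, NE clear */

    assert(cr0_msdos_fpu_mode(cr0));
    cr0 = cr0_select_native_fpu_errors(cr0);
    assert(!cr0_msdos_fpu_mode(cr0));
}
```

Note that CR0.NE only selects how x87 floating-point errors are reported. It does not affect the #NM exception raised because of CR0.TS, which is consistent with the experiment not helping.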
Now I was looking at kernel/arch/ia32/src/cpu/cpu.c's fpu_enable()/fpu_disable() and before_thread_runs() in kernel/generic/src/proc/scheduler.c and I just could not understand what the hell we are doing. Then I looked at amd64's cpu.c (amd64 works, after all), and the problem becomes obvious:
kernel/arch/amd64/src/cpu/cpu.c:
void fpu_disable(void)
{
    write_cr0(read_cr0() | CR0_TS);
}

void fpu_enable(void)
{
    write_cr0(read_cr0() & ~CR0_TS);
}
and compare to
kernel/arch/ia32/src/cpu/cpu.c:
void fpu_disable(void)
{
    write_cr0(read_cr0() & ~CR0_TS);
}

void fpu_enable(void)
{
    write_cr0(read_cr0() | CR0_TS);
}
We can see that on ia32 we are setting the CR0.TS bit in reverse (one when it should be zero and vice versa)!
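A minimal user-space sketch of the intended semantics follows. It mirrors the amd64 version quoted above and is not the text of the actual fix. The simulated register `sim_cr0`, the helper `fpu_insn_would_trap()`, and the self-check are hypothetical additions so the logic can be exercised without touching the real CR0. The helper ignores CR0.EM, which also causes #NM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CR0.TS (Task Switched) is bit 3 of CR0. */
#define CR0_TS (UINT32_C(1) << 3)

/* Hypothetical stand-in for the real CR0 register. */
static uint32_t sim_cr0;

uint32_t read_cr0(void)
{
    return sim_cr0;
}

void write_cr0(uint32_t value)
{
    sim_cr0 = value;
}

/* Disabling the FPU arms the #NM trap: set TS. */
void fpu_disable(void)
{
    write_cr0(read_cr0() | CR0_TS);
}

/* Enabling the FPU disarms the trap: clear TS (same effect as clts). */
void fpu_enable(void)
{
    write_cr0(read_cr0() & ~CR0_TS);
}

/* True when an x87 instruction such as fninit would raise #NM due to TS. */
bool fpu_insn_would_trap(void)
{
    return (read_cr0() & CR0_TS) != 0;
}

/* Self-check: fninit must be safe after fpu_enable(), not after fpu_disable(). */
void fpu_ts_selfcheck(void)
{
    fpu_enable();
    assert(!fpu_insn_would_trap());
    fpu_disable();
    assert(fpu_insn_would_trap());
}
```

With the reversed ia32 definitions, the first assertion in `fpu_ts_selfcheck()` would fail: fpu_enable() would leave TS set, so a following fninit would trap, which matches the stack trace above.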
comment:11 by , 3 years ago
This problem was introduced by this changeset:
57c2a87b03a0b6c08cef4f64f0cf52a7d8b38b62 (Avoid even more magic numbers) in May 2016 (i.e. 0.7.0).
comment:12 by , 3 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Fixed in changeset 5e629ad488e0e5d27b616c452352f3cedc9f2db5.
System deadlocked early in boot with ia32/lazy FPU context switching = off