Opened 5 months ago
Closed 4 months ago
#871 closed defect (fixed)
Tasks crash during boot of 486 build
Reported by: | Jiri Svoboda | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | 0.14.2 |
Component: | helenos/unspecified | Version: | mainline |
Keywords: | Cc: | ||
Blocker for: | Depends on: | ||
See also: |
Description
When I configure ia32 preset and change processor to 486, build the system and boot it using tools/ew.py, the system starts to boot, but a number of tasks crash.
Change History (19)
comment:1 by , 5 months ago
comment:2 by , 5 months ago
The first task to crash is /srv/hid/input:
Task /srv/hid/input (55) killed due to an exception at program counter 0x00000000. cs =0x0000001b eip=0x00000000 efl=0x00210202 err=0x00000004 ds =0x00000023 es =0x00000023 fs =0x00000023 gs =0x00000030 ss =0x00000023 eax=0x701705c4 ebx=0x7016f6dc ecx=0x00000001 edx=0x0804fc20 esi=0x70014ac0 edi=0x70014ac0 ebp=0x7012ee48 esp=0x7012ee1c 0x7012ee48: 0x00000000() 0x7012ee68: 0x7014cea0() 0x7012ee98: 0x70144e0c() 0x7012eeb8: 0x0804b151() 0x7012eec4: 0x0804b0c2() Kill message: Page fault: 0x00000000. [/srv/taskmon(16)] taskmon: Task 55 fault in thread 0x836d73b0. [/srv/taskmon(16)] taskmon: Executing /app/taskdump -t 55 [/app/taskdump(56)] Task Dump Utility [/app/taskdump(56)] Dumping task '/srv/hid/input' (task ID 55). [/app/taskdump(56)] Loaded symbol table from /srv/hid/input [/app/taskdump(56)] Threads: [/app/taskdump(56)] [1] hash: 0x836d73b0 [/app/taskdump(56)] Thread 0x836d73b0: PC = 0x00000000 (loc_category_get_id+0). FP = 0x7012ee48 [/app/taskdump(56)] 0x7012ee48: 0x00000000 (loc_category_get_id+0) [/app/taskdump(56)] 0x7012ee68: 0x7014cea0 (_end+1745867220) [/app/taskdump(56)] 0x7012ee98: 0x70144e0c (_end+1745834304) [/app/taskdump(56)] 0x7012eeb8: 0x0804b151 (__c_start+129) [/app/taskdump(56)] 0x7012eec4: 0x0804b0c2 (_start+21) [/app/taskdump(56)] Address space areas: [/app/taskdump(56)] [1] flags: R-XC base: 0x08048000 size: 16384 [/app/taskdump(56)] [2] flags: RW-C base: 0x0804c000 size: 16384 [/app/taskdump(56)] [3] flags: R-XC base: 0x70000000 size: 81920 [/app/taskdump(56)] [4] flags: RW-C base: 0x70014000 size: 69632 [/app/taskdump(56)] [5] flags: RW-C base: 0x70025000 size: 4096 [/app/taskdump(56)] [6] flags: RW-C base: 0x70026000 size: 4096 [/app/taskdump(56)] [7] flags: RW-C base: 0x70028000 size: 4096 [/app/taskdump(56)] [8] flags: RW-C base: 0x7002a000 size: 4096 [/app/taskdump(56)] [9] flags: R--C base: 0x7002c000 size: 4096 [/app/taskdump(56)] [10] flags: RW-C base: 0x7002d000 size: 4096 [/app/taskdump(56)] [11] flags: RW-C base: 0x7002f000 size: 1048576 [/app/taskdump(56)] [12] flags: RW-C base: 0x70130000 size: 258048 [/app/taskdump(56)] [13] flags: RW-C base: 0x7016f000 size: 73728 [/app/taskdump(56)] [14] flags: RW-C base: 0x7ff00000 size: 1048576 [/app/taskdump(56)] Failed dumping fibrils.
comment:3 by , 5 months ago
I extracted core file from /srv/input and started analyzing it:
core55 (gdb) print /x *0x70130e48 $4 = 0x70146e0c (gdb) print /x *0x70130e68 $5 = 0x70130e98 (gdb) print /x *0x70130e98 $6 = 0x70130eb8 (gdb) print /x *0x70130eb8 $7 = 0x70130ec4 (gdb) print /x *0x70130ec4 $8 = 0x0 Stack trace constructed by hand: 0x70130e48 0x0 0x70130e68 0x7014ef70 0x70130e98 0x70146e0c 0x70130eb8 0x804b161 0x70130ec4 0x804b0d2 (gdb) backtrace #0 0x00000000 in ?? () #1 0x701593d8 in ?? () #2 0x7014ef70 in ?? () #3 0x70146e0c in ?? () #4 0x0804b161 in __c_start () #5 0x0804b0d2 in _start () (gdb) disas 0x0804b0d2 Dump of assembler code for function _start: 0x0804b0bd <+0>: mov %ss,%ax 0x0804b0c0 <+3>: mov %eax,%ds 0x0804b0c2 <+5>: mov %eax,%es 0x0804b0c4 <+7>: mov %eax,%fs 0x0804b0c6 <+9>: push $0x0 0x0804b0c8 <+11>: push $0x0 0x0804b0ca <+13>: mov %esp,%ebp 0x0804b0cc <+15>: push %edi 0x0804b0cd <+16>: call 0x804b0e0 <__c_start> 0x0804b0d2 <+21>: xchg %ax,%ax 0x0804b0d4 <+23>: xchg %ax,%ax 0x0804b0d6 <+25>: xchg %ax,%ax 0x0804b0d8 <+27>: xchg %ax,%ax 0x0804b0da <+29>: xchg %ax,%ax 0x0804b0dc <+31>: xchg %ax,%ax 0x0804b0de <+33>: xchg %ax,%ax End of assembler dump. (gdb) (gdb) disas 0x804b0e0 Dump of assembler code for function __c_start: 0x0804b0e0 <+0>: push %ebp 0x0804b0e1 <+1>: mov %esp,%ebp 0x0804b0e3 <+3>: sub $0x14,%esp 0x0804b0e6 <+6>: movl $0x8048dc0,0x804fc80 0x0804b0f0 <+16>: movl $0x8048000,0x804fc84 0x0804b0fa <+26>: movl $0x804fd0c,0x804fc88 0x0804b104 <+36>: movl $0x804c7ec,0x804fc8c 0x0804b10e <+46>: mov $0x804c7ec,%eax 0x0804b113 <+51>: sub $0x804c7ec,%eax 0x0804b118 <+56>: sar $0x2,%eax 0x0804b11b <+59>: mov %eax,0x804fc90 0x0804b120 <+64>: movl $0x804c7ec,0x804fc94 0x0804b12a <+74>: mov $0x804c7ec,%eax 0x0804b12f <+79>: sub $0x804c7ec,%eax 0x0804b134 <+84>: sar $0x2,%eax 0x0804b137 <+87>: mov %eax,0x804fc98 0x0804b13c <+92>: movl $0x804c7ec,0x804fc9c 0x0804b146 <+102>: mov $0x804c7ec,%eax 0x0804b14b <+107>: sub $0x804c7ec,%eax 0x0804b150 <+112>: sar $0x2,%eax --Type <RET> for more, q to quit, c to continue without paging-- 0x0804b153 <+115>: mov %eax,0x804fca0 0x0804b158 <+120>: mov 0x8(%ebp),%eax 0x0804b15b <+123>: push %eax 0x0804b15c <+124>: call 0x8048d20 <__libc_main@plt> End of assembler dump. (gdb) (gdb) disas 0x8048d20 Dump of assembler code for function __libc_main@plt: 0x08048d20 <+0>: jmp *0x804c940 0x08048d26 <+6>: push $0x140 0x08048d2b <+11>: jmp 0x8048a90 End of assembler dump. gdb) print /x *0x804c940 $27 = 0x70146df0
Where is this?
comment:4 by , 5 months ago
If we objdump srv/loader, we can see its segments:
LOAD off 0x00000000 vaddr 0x70000000 paddr 0x70000000 align 2**12 filesz 0x00013460 memsz 0x00013460 flags r-x LOAD off 0x00014000 vaddr 0x70014000 paddr 0x70014000 align 2**12 filesz 0x0000023c memsz 0x00010f28 flags rw- TLS off 0x00014000 vaddr 0x70014000 paddr 0x70014000 align 2**2 filesz 0x00000000 memsz 0x00000010 flags r-- [loader(55)] libc_main: main=0x70000120 [loader(55)] Load segment v_addr=0x8048000 at addr 0x08048000, size 0x37ec, flags r-x [loader(55)] Load segment v_addr=0x804c7ec at addr 0x0804c7ec, size 0x3520, flags rw- [loader(55)] Parse program .dynamic section at 0x0804c7ec [loader(55)] prog tdata at 0x00000000 size 0, tbss size 0 [loader(55)] module_load_deps('[program]') [loader(55)] [program] needs libc.so.0 [loader(55)] Load segment v_addr=0x0 at addr 0x70132000, size 0x3e3b4, flags r-x [loader(55)] Load segment v_addr=0x3f3b4 at addr 0x701713b4, size 0x118f4, flags rw- [loader(55)] loaded '/lib/libc.so.0' at 0x70132000 [loader(55)] tdata at 0x701713b4 size 0, tbss size 80 [loader(55)] module_load_deps('libc.so.0') Task /srv/hid/input (55) killed due to an exception at program counter 0x00000000. 0x70130e48 0x0 0x70130e68 0x7014ef70 in __kio_init 0x70130e98 0x70146e0c in __libc_main 0x70130eb8 0x804b161 in __c_start 0x70130ec4 0x804b0d2 in _start 0001cf50 <__kio_init>: 1cf50: 55 push %ebp 1cf51: 89 e5 mov %esp,%ebp 1cf53: 53 push %ebx 1cf54: 83 ec 0c sub $0xc,%esp 1cf57: e8 43 e7 fe ff call b69f <__x86.get_pc_thunk.bx> 1cf5c: 81 c3 e0 2a 02 00 add $0x22ae0,%ebx 1cf62: 6a 01 push $0x1 1cf64: 8d 83 e4 0e 00 00 lea 0xee4(%ebx),%eax 1cf6a: 50 push %eax 1cf6b: e8 d0 d0 fe ff call a040 <futex_initialize@plt> 1cf70: 83 c4 10 add $0x10,%esp 1cf73: 85 c0 test %eax,%eax 1cf75: 75 07 jne 1cf7e <__kio_init+0x2e> 1cf77: 8b 5d fc mov -0x4(%ebp),%ebx 1cf7a: 89 ec mov %ebp,%esp 1cf7c: 5d pop %ebp 1cf7d: c3 ret 1cf7e: e8 dd d3 fe ff call a360 <abort@plt> 1cf83: 66 90 xchg %ax,%ax 1cf85: 66 90 xchg %ax,%ax 1cf87: 66 90 xchg %ax,%ax 1cf89: 66 90 xchg %ax,%ax 1cf8b: 66 90 xchg %ax,%ax 1cf8d: 66 90 xchg %ax,%ax 1cf8f: 90 nop
then it calls futex_initialize@plt
comment:5 by , 5 months ago
0001cf50 <__kio_init>: 1cf50: 55 push %ebp 1cf51: 89 e5 mov %esp,%ebp 1cf53: 53 push %ebx 1cf54: 83 ec 0c sub $0xc,%esp 1cf57: e8 43 e7 fe ff call b69f <__x86.get_pc_thunk.bx> -- ebx := 1cf5c+0x70132000 == 0x7014ef5c 1cf5c: 81 c3 e0 2a 02 00 add $0x22ae0,%ebx ... ebx := 0x7014ef5c + 0x22ae0 == 0x70171a3c 1cf62: 6a 01 push $0x1 1cf64: 8d 83 e4 0e 00 00 lea 0xee4(%ebx),%eax 1cf6a: 50 push %eax 1cf6b: e8 d0 d0 fe ff call a040 <futex_initialize@plt> 0000a040 <futex_initialize@plt>: a040: ff a3 d0 01 00 00 jmp *0x1d0(%ebx) a046: 68 88 03 00 00 push $0x388 a04b: e9 d0 f8 ff ff jmp 9920 <failed_asserts+0x98d4> (gdb) print /x *(0x7014ef5c+0x22ae0+0x1d0) $16 = 0x701593a0 (gdb) print /x 0x701593a0 - 0x70132000 $17 = 0x273a0 000273a0 <futex_initialize>: 273a0: 55 push %ebp 273a1: 89 e5 mov %esp,%ebp 273a3: 83 ec 0c sub $0xc,%esp 273a6: e8 8d 46 fe ff call ba38 <__x86.get_pc_thunk.dx> 273ab: 81 c2 91 86 01 00 add $0x18691,%edx 273b1: 8b 45 08 mov 0x8(%ebp),%eax 273b4: 8b 4d 0c mov 0xc(%ebp),%ecx
So we can see that we should have loaded the correct address of futex_initialize from the PLT
comment:6 by , 5 months ago
(gdb) info registers eax 0x70172924 1880566052 ecx 0x1 1 edx 0x804fc60 134544480 ebx 0x70171a3c 1880562236 .... contains correct address of GOT entry for futex_initialize esp 0x70130e1c 0x70130e1c ebp 0x70130e48 0x70130e48 esi 0x70014ac0 1879132864 edi 0x70014ac0 1879132864 eip 0x0 0x0 ... WHAT IS GOING ON ??? eflags 0x210206 [ PF IF RF ID ] cs 0x1b 27 ss 0x23 35 ds 0x23 35 es 0x23 35 fs 0x23 35 gs 0x30 48 (gdb)
comment:7 by , 5 months ago
The problem is equally reproducible with Qemu and KVM disbled, and in VirtualBox.
comment:8 by , 5 months ago
Stack trace might not be complete. Let's look at the top of the stack for a return address:
(gdb) print /x 0x701593d8 - 0x70132000
$2 = 0x273d8
000273a0 <futex_initialize>:
273a0: 55 push %ebp
273a1: 89 e5 mov %esp,%ebp
273a3: 83 ec 0c sub $0xc,%esp
273a6: e8 8d 46 fe ff call ba38 <x86.get_pc_thunk.dx>
273ab: 81 c2 91 86 01 00 add $0x18691,%edx
273b1: 8b 45 08 mov 0x8(%ebp),%eax
273b4: 8b 4d 0c mov 0xc(%ebp),%ecx
273b7: 89 08 mov %ecx,(%eax)
273b9: c7 40 04 00 00 00 00 movl $0x0,0x4(%eax)
273c0: 8b 92 e0 ff ff ff mov -0x20(%edx),%edx
273c6: 6a 0b push $0xb
273c8: 6a 00 push $0x0
273ca: 6a 00 push $0x0
273cc: 6a 00 push $0x0
273ce: 6a 00 push $0x0
273d0: 6a 00 push $0x0
273d2: 83 c0 04 add $0x4,%eax
273d5: 50 push %eax
273d6: ff 12 call *(%edx)
273d8: 89 ec mov %ebp,%esp ←-return here
comment:9 by , 5 months ago
Note that we ran call *(%edx).
edx contains 0x804fc60
which is
$ objdump -D input .... 0804fc60 <__syscall_fast_func>: ...
and:
(gdb) print /x *0x804fc60 $16 = 0x0 (gdb)
So syscall_fast_func is NULL which seems to be the reason for the core dump.
comment:10 by , 5 months ago
I think it's because of how we define syscalls for 486 in uspace/lib/c/arch/ia32/include/libarch/syscall.h:
#define __syscall0 __syscall_fast_func #define __syscall1 __syscall_fast_func #define __syscall2 __syscall_fast_func #define __syscall3 __syscall_fast_func #define __syscall4 __syscall_fast_func #define __syscall5 __syscall_slow #define __syscall6 __syscall_slow
On 486 we can only use syscall_slow.
comment:11 by , 5 months ago
syscall_fast_func is a global variable / function pointer that is initialized to syscall_slow. On processors > 486 we then detect if we can use SYSENTER/SYSEXIT and if so, we set syscall_fast_func to syscall_fast.
Note however that 0x804fc60 is inside the main program. While libc is at 0x70130000.
[jirka@omelette dist]$ objdump -T lib/libc.so.0 | grep syscall_fast 0003fd60 g DO .data 00000004 __syscall_fast_func 0000b577 g DF .text 00000034 __syscall_fast [jirka@omelette dist]$ objdump -T srv/hid/input | grep syscall_fast 0804fc60 g DO .bss 00000004 __syscall_fast_func
The problem is that __syscall_fast_func
is defined both in libc.so.0
and input
.
In the binary/library the original value of the variable is zero and it needs to be fixed by the dynamic linker using R_386_32 relocation. It looks like we access the main executable's version of the variable, which is not correctly relocated.
The solution is to make sure we have just one instance of the variable.
comment:12 by , 5 months ago
To make things even more confusing, not all binaries contain a definition of __syscall_fast_func
, only some!
Specificaly, only these three binaries:
/app/bdsh /srv/hid/console /srv/hid/input
comment:13 by , 5 months ago
It seems that what these have in common is that they depend on the console
library. IN case of /app/bdsh, it depends on libclui, which itself depends on libconsole.
comment:14 by , 5 months ago
Here's the culprit in libconsole's console.c:
bool console_kcon(void) { return __SYSCALL0(SYS_DEBUG_CONSOLE); }
Removing this line removes the definition of __syscall_fast_func
variable from /srv/hid/input. So it was the move of this function outside libc that caused the problem. Now why does this create a definition instead of a declaration? Need to look at the preprocessed version of this file.
comment:15 by , 5 months ago
OK so this is even more complicated. There is nothing wrong with this code. It simply accesses an extern variable that happens to be defined in a shared library. However, variables and shared libraries work differently than one might think.
For functions it is straightforward. If we call a function foo that is defined in the main program, we will call foo
. If foo is defined in a shared library, we will call foo@plt
. So far so good.
If there is a (non-static) variable "v" defined in a shared library, the library code will always access it via its address in GOT, for which there is a relocation entry (e.g R_386_GLOB_DAT) with "v" symbol name. The dynamic linker will patch the GOT entry with the address of "v" in the shared library. This is how it works if the main program never refers to the variable "v".
If the main program contains a reference to "v", something surprising will happen. The main program will *also* contain a definition of "v" in bss. The main program will refer to the variable directly just as to any other variable defined in the main program.
Now the dynamic linker, when relocating the library's GOT, will find the definition of "v" in the main program first (due to the symbol lookup rules) and will patch it in. So now both the main program and the library will use the defition of "v" in the main program and the instance in the library is unused.
What if "v" is initialized? The shared library will contain a COPY relocation that copies the value of library's "v" to the main program's "v".
Now we are getting to the meat of things.
comment:16 by , 5 months ago
__syscall_fast_func
is initialized, but not with a constant. It's initialized with the value of the symbol __syscall_slow
. This means there must be a relocation entry that initializes __syscall_fast_func
.
So first we must relocate libc's __syscall_fast_func
to contain the address of __syscall_slow
.
Then we must COPY value of libc's __syscall_fast_func
to the main program's __syscall_fast_func
.
But we can see from the debug output that in our case the relocations were performed in the opposite order:
[loader(55)] name '__syscall_fast_func', value 0x804f840, size 0x4 [loader(55)] rel_type=5 r_ptr=0x0804f840 [loader(55)] dest name: '[program]' [loader(55)] dest bias: 0x0 [loader(55)] symbol definition found, addr=0x804f840, value=804f840, size=4, bind=1, type=1 [loader(55)] copy to 0x0804f840 from 70160de0 size 4 bytes [loader(55)] byte[0]=0x0 [loader(55)] byte[1]=0x0 [loader(55)] byte[2]=0x0 [loader(55)] byte[3]=0x0 [loader(55)] name '__syscall_slow', value 0xb5cd, size 0x27 [loader(55)] rel_type=7 r_ptr=0x70160bc8 [loader(55)] dest name: 'libc.so.0' [loader(55)] dest bias: 0x7012d000 [loader(55)] symbol definition found, addr=0x701385cd, value=b5cd, size=39, bind=1, type=2 [loader(55)] name '__syscall_fast_func', value 0x33de0, size 0x4 [loader(55)] rel_type=6 r_ptr=0x701606c4 [loader(55)] dest name: '[program]' [loader(55)] dest bias: 0x0 [loader(55)] symbol definition found, addr=0x804f840, value=804f840, size=4, bind=1, type=1 [loader(55)] name '__syscall_slow', value 0xb5cd, size 0x27 [loader(55)] rel_type=1 r_ptr=0x70160de0 [loader(55)] dest name: 'libc.so.0' [loader(55)] dest bias: 0x7012d000 [loader(55)] symbol definition found, addr=0x701385cd, value=b5cd, size=39, bind=1, type=2 [loader(55)] fixup R_386_32 (b+v+a) sym_addr=701385cd [loader(55)] now *rptr=701385cd
Which means they appear in the opposite order in the relocation table.
What is not clear to me is whether the relocation table is incorrect, or whether the dynamic linker is supposed to handle this situation and figure out the correct relocation order.
comment:17 by , 5 months ago
I am not entirely sure whether I want to pursue this or just work around it. I think I would need to ask in the linker community as I didn't find any information in this regard in the System V ABI (generic or 386 supplement).
comment:18 by , 4 months ago
Alright, as was pointed out to me the COPY relocation is in the main executable, not the shared library! So the two relocations are in different relocation tables. All we need is to make sure we process the main program's relocations LAST (i.e. after the shared libraries).
comment:19 by , 4 months ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Fixed in changeset c576800952e07575202b2da8af33796cbbc9ebe9.
Bisecting the changeset history revealed that this problem starts with changeset a635535739c9e7901c0b9b5b1890f38965383f17: