#326 closed defect (fixed)
Assert on (addr >= ALIGN_DOWN(entry->p_vaddr, PAGE_SIZE)) && (addr < entry->p_vaddr + entry->p_memsz)
Reported by: | Jakub Jermář | Owned by: | Jakub Jermář |
---|---|---|---|
Priority: | major | Milestone: | 0.5.0 |
Component: | helenos/kernel/ia64 | Version: | |
Keywords: | Cc: | ||
Blocker for: | Depends on: | ||
See also: |
Description (last modified by )
Mainline revision 904, default build using the up-to-date toolchain, HelenOS/ia64/Ski, crashes during boot:
    SPARTAN kernel, release 0.4.3 (Sashimi), revision 904M (martin@medusa.d3s.hide.ms.mff.cuni.cz-20110329215955-sayovtbd4vuolf4q)
    Built on 2011-03-30 00:52:09 for ia64
    Copyright (c) 2001-2010 HelenOS project
    Detected 1 CPU(s), 64 MiB free memory
    Kernel console ready (press any key to activate)
    ######> Kernel panic on cpu0 due to a failed assertion: <######
    elf_page_fault() at generic/src/mm/backend_elf.c:95:
    (addr >= ALIGN_DOWN(entry->p_vaddr, PAGE_SIZE)) && (addr < entry->p_vaddr + entry->p_memsz)
    cpu0: halted
Change History (14)
comment:1 by , 14 years ago
Description: | modified (diff) |
---|
comment:2 by , 14 years ago
Description: | modified (diff) |
---|
comment:3 by , 14 years ago
comment:4 by , 14 years ago
I wonder whether the new toolchain puts something into .rodata*. This changed in the normal app linker script and also in the loader linker script.
comment:5 by , 14 years ago
The kernel seems to be unhappy about the page fault address, which does not seem to fit within the [vaddr, vaddr + memsz) of the data segment.
Here is the situation of the ns server built with the new toolchain:
    addr=365b0 entry->p_vaddr=34250, entry->p_memsz=1840
    Task init:ns (2) killed due to an exception at program counter 0x0000000000024870.
    Kill message: Page fault at 0x00000000000365b0.
Extract from objdump -x:
    Program Header:
        LOAD off    0x00000000000000b0 vaddr 0x00000000000040b0 paddr 0x00000000000040b0 align 2**6
             filesz 0x000000000002c1a0 memsz 0x000000000002c1a0 flags r-x
        LOAD off    0x000000000002c250 vaddr 0x0000000000034250 paddr 0x0000000000034250 align 2**4
             filesz 0x000000000000058c memsz 0x0000000000000730 flags rw-
comment:6 by , 14 years ago
Yes. The question is how something like this is even possible. The linker script and everything else about the linking process seem to be OK (as far as I can tell), and it works fine with the previous version of binutils. Perhaps a bug in the linker?
BTW, the program counter in ns points to:
    0000000000024840 <as_get_mappable_page>:
    24840: 08 10 2d 08 80 05 [MMI] alloc r34=ar.pfs,11,4,0
    24846: 40 02 07 8c 48 20 addl r36=9056,r1
    2484c: 04 00 c4 00 mov r33=b0
    24850: 09 28 01 40 00 21 [MMI] mov r37=r32
    24856: 00 00 00 02 00 60 nop.m 0x0
    2485c: 04 08 00 84 mov r35=r1;;
    24860: 08 30 01 00 00 21 [MMI] mov r38=r0
    24866: 70 02 00 00 42 00 mov r39=r0
    2486c: 05 00 00 84 mov r40=r0
    24870: 09 48 01 00 00 21 [MMI] mov r41=r0 <======= PC
    24876: 40 02 90 30 20 40 ld8 r36=[r36]
    2487c: 25 01 00 90 mov r42=18;;
    24880: 11 00 00 00 01 00 [MIB] nop.m 0x0
    24886: 00 00 00 02 00 00 nop.i 0x0
    2488c: 38 f4 ff 58 br.call.sptk.many b0=23cb0 <__syscall>;;
    24890: 09 08 00 46 00 21 [MMI] mov r1=r35
    24896: 00 00 00 02 00 00 nop.m 0x0
    2489c: 20 02 aa 00 mov.i ar.pfs=r34;;
    248a0: 11 00 00 00 01 00 [MIB] nop.m 0x0
    248a6: 00 08 05 80 03 80 mov b0=r33
    248ac: 08 00 84 00 br.ret.sptk.many b0;;
comment:7 by , 14 years ago
The address in PC is the address of the instruction bundle, so I guess the "offending" instruction is:
    24870: 09 48 01 00 00 21 [MMI] mov r41=r0 <======= PC
    24876: 40 02 90 30 20 40 ld8 r36=[r36] <======= offending instruction
    2487c: 25 01 00 90 mov r42=18;;
Address in r36 is computed as gp + 9056.
Could be that there is something wrong with the gp, GOT or even the as area btree (why is the pagefault being associated with the ELF-backed area?).
comment:8 by , 14 years ago
The ELF backend is involved, because the faulting address is still on the page mapped by the ELF backend, but already beyond the vaddr + memsz limit. So the backend and the btree are ok.
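The two halves of the failed assertion can be evaluated separately with the values from comment 5. The following is only a sketch: the 16 KiB page size is an assumption (the ticket does not state it), and the function names are hypothetical, not taken from the HelenOS sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 16 KiB pages; the ticket itself does not state PAGE_SIZE. */
#define PAGE_SIZE 0x4000ULL
#define ALIGN_DOWN(addr, align) ((addr) & ~((uint64_t)(align) - 1))

/* Values reported in comment 5 for the data segment of ns. */
static const uint64_t fault_addr = 0x365b0;
static const uint64_t p_vaddr = 0x34250;
static const uint64_t p_memsz = 0x730; /* 1840 decimal, as printed by the kernel */

/* First half of the assertion: the fault is not below the segment's first page. */
bool fault_not_below_segment_page(void)
{
    return fault_addr >= ALIGN_DOWN(p_vaddr, PAGE_SIZE);
}

/* Second half of the assertion: the fault lies before the end of the segment. */
bool fault_below_segment_end(void)
{
    return fault_addr < p_vaddr + p_memsz;
}

/* True when the fault shares a page with the last byte of the segment. */
bool fault_on_segment_page(void)
{
    return ALIGN_DOWN(fault_addr, PAGE_SIZE) ==
        ALIGN_DOWN(p_vaddr + p_memsz - 1, PAGE_SIZE);
}
```

Under that page-size assumption the first check holds and the second does not, while the fault still shares a page with the end of the segment, which matches the explanation of why the ELF backend handles the fault at all.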
There is a difference in how the r36 value is computed in the good and the bad version.
Good version:
    0000000000028800 <as_get_mappable_page>:
    28800: 08 10 2d 08 80 05 [MMI] alloc r34=ar.pfs,11,4,0
    28806: 40 02 04 00 48 20 addl r36=0,r1
    2880c: 04 00 c4 00 mov r33=b0
    28810: 09 28 01 40 00 21 [MMI] mov r37=r32
    28816: 00 00 00 02 00 60 nop.m 0x0
    2881c: 04 08 00 84 mov r35=r1;;
    28820: 08 30 01 00 00 21 [MMI] mov r38=r0
    28826: 70 02 00 00 42 00 mov r39=r0
    2882c: 05 00 00 84 mov r40=r0
    28830: 09 48 01 00 00 21 [MMI] mov r41=r0
    28836: 40 02 90 30 20 40 ld8 r36=[r36]
Bad version:
    0000000000024840 <as_get_mappable_page>:
    24840: 08 10 2d 08 80 05 [MMI] alloc r34=ar.pfs,11,4,0
    24846: 40 02 07 8c 48 20 addl r36=9056,r1
    2484c: 04 00 c4 00 mov r33=b0
    24850: 09 28 01 40 00 21 [MMI] mov r37=r32
    24856: 00 00 00 02 00 60 nop.m 0x0
    2485c: 04 08 00 84 mov r35=r1;;
    24860: 08 30 01 00 00 21 [MMI] mov r38=r0
    24866: 70 02 00 00 42 00 mov r39=r0
    2486c: 05 00 00 84 mov r40=r0
    24870: 09 48 01 00 00 21 [MMI] mov r41=r0
    24876: 40 02 90 30 20 40 ld8 r36=[r36]
So the bad version seems to be adding an extra 9056 bytes to the gp register.
Excerpt from ns.map reveals that:
    .got 0x0000000000034250 0x58
         0x0000000000034250 _gp = .
From here, we can make an interesting observation that:
page_fault_address = .got + 9056
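The observation is plain address arithmetic and can be checked directly. In the sketch below the constants come from ns.map and from the two disassemblies; the helper names are hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* From ns.map: .got starts at this address, is 0x58 bytes long,
 * and _gp is defined at its start. */
#define GOT_ADDR 0x34250ULL
#define GOT_SIZE 0x58ULL

/* Immediates of the addl instruction in the two builds. */
#define GOOD_LTOFF 0ULL
#define BAD_LTOFF  9056ULL

/* Address read by `ld8 r36=[r36]` after `addl r36=<ltoff>,r1`. */
uint64_t got_slot_address(uint64_t gp, uint64_t ltoff)
{
    return gp + ltoff;
}

/* True when the computed slot lies inside the .got section. */
bool slot_in_got(uint64_t slot)
{
    return slot >= GOT_ADDR && slot < GOT_ADDR + GOT_SIZE;
}
```

With gp equal to the start of .got, the good offset stays inside the 0x58-byte section, whereas the bad offset lands on 0x365b0, the reported page fault address.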
comment:9 by , 14 years ago
This is the assembly generated by GCC:
Good version:
    .global as_get_mappable_page#
    .type as_get_mappable_page#, @function
    .proc as_get_mappable_page#
    as_get_mappable_page:
    [.LFB9:]
    .loc 1 111 0
    [.LVL4:]
    .mmi
    alloc r34 = ar.pfs, 1, 3, 7, 0
    .loc 1 112 0
    addl r36 = @ltoff(@fptr(__entry#)), gp
    .loc 1 111 0
    mov r33 = b0
    .loc 1 112 0
    .mmi
    mov r37 = r32
    .loc 1 111 0
    nop 0
    mov r35 = r1
    .loc 1 112 0
    ;;
    .mmi
    mov r38 = r0
    mov r39 = r0
    mov r40 = r0
    .mmi
    mov r41 = r0
    ld8 r36 = [r36]
Bad version:
    .global as_get_mappable_page#
    .type as_get_mappable_page#, @function
    .proc as_get_mappable_page#
    as_get_mappable_page:
    [.LFB9:]
    .loc 1 111 0
    [.LVL4:]
    .mmi
    alloc r34 = ar.pfs, 1, 3, 7, 0
    [.LCFI8:]
    .loc 1 112 0
    addl r36 = @ltoff(@fptr(__entry#)), gp
    .loc 1 111 0
    mov r33 = b0
    [.LCFI9:]
    .loc 1 112 0
    .mmi
    mov r37 = r32
    .loc 1 111 0
    nop 0
    mov r35 = r1
    .loc 1 112 0
    ;;
    .mmi
    mov r38 = r0
    mov r39 = r0
    mov r40 = r0
    .mmi
    mov r41 = r0
    ld8 r36 = [r36]
Both versions are de facto identical. They both do:
addl r36 = @ltoff(@fptr(__entry#)), gp
What happens here is that the code is assuming that it is possible to read the full 64-bit address of the __entry symbol using a gp-relative offset. Since the two versions are identical, the problem must be in either the assembler phase or the linker phase.
comment:10 by , 14 years ago
I ran a simple experiment. I built HelenOS/ia64/ski with the new toolchain. Afterwards I changed my CROSS_PREFIX to point to the old toolchain and removed the binaries. I then ran make again so that only the link phase was repeated, reusing the .o files built by the new toolchain. This resulted in a functional HelenOS image which booted fine into the bdsh prompt. This suggests that the problematic component of the new toolchain is the linker (the assembler and gcc were shown to generate functional code).
follow-up: 14 comment:12 by , 14 years ago
Looks like the problem goes away if we use __gp instead of _gp. __gp is a special symbol to ld which tells it where the binary wants its GP register to point, whereas plain _gp is just a normal HelenOS symbol without any special meaning. We will need to fix this for other architectures as well, since without the enforcement through __gp, the linker will pick the location for GP arbitrarily. It is still not clear why the linker picked a location beyond the end of the image without producing at least a warning.
comment:13 by , 14 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Fixed in changeset:mainline,925.
comment:14 by , 14 years ago
Replying to jermar:
We will need to fix this also for other architectures since without the enforcement through __gp, the linker will pick location for GP arbitrarily.
Hm, this is not as unified as it might have seemed. While the symbol needs to be called __gp on ia64, it must be called _gp on mips32.
Confirmed. However, if compiled with the previous toolchain (GCC 4.5.1), the system boots and runs fine in Ski.