Opened 13 years ago

Closed 13 years ago

Last modified 13 years ago

#326 closed defect (fixed)

Assert on (addr >= ALIGN_DOWN(entry->p_vaddr, PAGE_SIZE)) && (addr < entry->p_vaddr + entry->p_memsz)

Reported by: Jakub Jermář
Owned by: Jakub Jermář
Priority: major
Milestone: 0.5.0
Component: helenos/kernel/ia64
Version:
Keywords:
Cc:
Blocker for:
Depends on:
See also:

Description (last modified by Jakub Jermář)

Mainline revision 904 (default build with the up-to-date toolchain) for HelenOS/ia64/Ski crashes during boot:

SPARTAN kernel, release 0.4.3 (Sashimi), revision 904M (martin@medusa.d3s.hide.ms.mff.cuni.cz-20110329215955-sayovtbd4vuolf4q)
Built on 2011-03-30 00:52:09 for ia64
Copyright (c) 2001-2010 HelenOS project
Detected 1 CPU(s), 64 MiB free memory
Kernel console ready (press any key to activate)

######> Kernel panic on cpu0 due to a failed assertion: <######
elf_page_fault() at generic/src/mm/backend_elf.c:95:
(addr >= ALIGN_DOWN(entry->p_vaddr, PAGE_SIZE)) && (addr < entry->p_vaddr + entry->p_memsz)

cpu0: halted

Change History (14)

comment:1 by Jakub Jermář, 13 years ago

Description: modified (diff)

comment:2 by Jakub Jermář, 13 years ago

Description: modified (diff)

comment:3 by Martin Decky, 13 years ago

Confirmed. However, if compiled with the previous toolchain (GCC 4.5.1), the system boots and runs fine in Ski.

comment:4 by Jakub Jermář, 13 years ago

I wonder whether the new toolchain puts something into .rodata*. This changed in the normal app linker script and also in the loader linker script.

comment:5 by Jakub Jermář, 13 years ago

The kernel seems to be unhappy about the page fault address, which does not fit within the range [vaddr, vaddr + memsz) of the data segment.

Here is the situation of the ns server built with the new toolchain:

addr=365b0 entry->p_vaddr=34250, entry->p_memsz=1840
Task init:ns (2) killed due to an exception at program counter 0x0000000000024870.
Kill message: Page fault at 0x00000000000365b0.

Extract from objdump -x:

Program Header:
    LOAD off    0x00000000000000b0 vaddr 0x00000000000040b0 paddr 0x00000000000040b0 align 2**6
         filesz 0x000000000002c1a0 memsz 0x000000000002c1a0 flags r-x
    LOAD off    0x000000000002c250 vaddr 0x0000000000034250 paddr 0x0000000000034250 align 2**4
         filesz 0x000000000000058c memsz 0x0000000000000730 flags rw-
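
The failing assertion can be replayed with the numbers above. The following is a minimal sketch in Python, not HelenOS code. It assumes a 16 KiB page size for ia64 (the ticket does not state PAGE_SIZE) and reads the logged p_memsz=1840 as decimal, which matches the memsz of 0x730 reported by objdump. The names align_down and elf_assert_holds are illustrative only.

```python
PAGE_SIZE = 0x4000  # assumption: 16 KiB pages; not stated in the ticket


def align_down(value, alignment):
    """Round value down to a multiple of alignment (a power of two)."""
    return value & ~(alignment - 1)


def elf_assert_holds(addr, p_vaddr, p_memsz, page_size=PAGE_SIZE):
    """Mirror the condition asserted in elf_page_fault()."""
    return align_down(p_vaddr, page_size) <= addr < p_vaddr + p_memsz


addr = 0x365B0     # faulting address from the kill message
p_vaddr = 0x34250  # vaddr of the rw- LOAD segment
p_memsz = 0x730    # memsz of that segment (1840 in decimal)

segment_end = p_vaddr + p_memsz
print(hex(segment_end))                          # first address past the segment
print(elf_assert_holds(addr, p_vaddr, p_memsz))  # whether the assertion would pass
```

With these inputs the faulting address lies past the end of the data segment, so the condition evaluates to false regardless of the assumed page size.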

comment:6 by Martin Decky, 13 years ago

Yes. The question is how something like this is even possible. The linker script and everything about the linking process seem to be OK (as far as I can tell), and it works fine with the previous version of binutils. Perhaps a bug in the linker?

BTW, the program counter in ns points to:

0000000000024840 <as_get_mappable_page>:
   24840:   08 10 2d 08 80 05   [MMI]       alloc r34=ar.pfs,11,4,0
   24846:   40 02 07 8c 48 20               addl r36=9056,r1
   2484c:   04 00 c4 00                     mov r33=b0
   24850:   09 28 01 40 00 21   [MMI]       mov r37=r32
   24856:   00 00 00 02 00 60               nop.m 0x0
   2485c:   04 08 00 84                     mov r35=r1;;
   24860:   08 30 01 00 00 21   [MMI]       mov r38=r0
   24866:   70 02 00 00 42 00               mov r39=r0
   2486c:   05 00 00 84                     mov r40=r0
   24870:   09 48 01 00 00 21   [MMI]       mov r41=r0       <======= PC
   24876:   40 02 90 30 20 40               ld8 r36=[r36]
   2487c:   25 01 00 90                     mov r42=18;;
   24880:   11 00 00 00 01 00   [MIB]       nop.m 0x0
   24886:   00 00 00 02 00 00               nop.i 0x0
   2488c:   38 f4 ff 58                     br.call.sptk.many b0=23cb0 <__syscall>;;
   24890:   09 08 00 46 00 21   [MMI]       mov r1=r35
   24896:   00 00 00 02 00 00               nop.m 0x0
   2489c:   20 02 aa 00                     mov.i ar.pfs=r34;;
   248a0:   11 00 00 00 01 00   [MIB]       nop.m 0x0
   248a6:   00 08 05 80 03 80               mov b0=r33
   248ac:   08 00 84 00                     br.ret.sptk.many b0;;

comment:7 by Jakub Jermář, 13 years ago

The address in PC is the address of the instruction bundle, so I guess the "offending" instruction is:

24870:   09 48 01 00 00 21   [MMI]       mov r41=r0       <======= PC
24876:   40 02 90 30 20 40               ld8 r36=[r36]    <======= offending instruction
2487c:   25 01 00 90                     mov r42=18;;

Address in r36 is computed as gp + 9056.

It could be that there is something wrong with the gp, the GOT, or even the address space area B-tree (why is the page fault being associated with the ELF-backed area?).
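
The relation between the reported PC and the individual instructions can be sketched as follows. This is a hedged illustration in Python: it relies on IA-64 bundles being 128 bits (16 bytes) wide and on the listing above labelling the three slots of a bundle at offsets 0x0, 0x6 and 0xc. The helper names bundle_base and slot_index are hypothetical.

```python
BUNDLE_SIZE = 16                # an IA-64 bundle is 128 bits wide
SLOT_OFFSETS = (0x0, 0x6, 0xC)  # slot labels as they appear in the listing


def bundle_base(address):
    """Return the address of the bundle containing a listed instruction."""
    return address & ~(BUNDLE_SIZE - 1)


def slot_index(address):
    """Return which slot (0, 1 or 2) a listed instruction address refers to."""
    return SLOT_OFFSETS.index(address - bundle_base(address))


ld8_address = 0x24876  # 'ld8 r36=[r36]' in the listing
print(hex(bundle_base(ld8_address)), slot_index(ld8_address))
```

The ld8 instruction sits in the middle slot of the bundle whose address was reported as the program counter, which is consistent with the reasoning above.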

comment:8 by Jakub Jermář, 13 years ago

The ELF backend is involved, because the faulting address is still on the page mapped by the ELF backend, but already beyond the vaddr + memsz limit. So the backend and the btree are ok.

There is a difference in how the r36 value is computed in the good and the bad version.

Good version:

0000000000028800 <as_get_mappable_page>:
   28800:       08 10 2d 08 80 05       [MMI]       alloc r34=ar.pfs,11,4,0
   28806:       40 02 04 00 48 20                   addl r36=0,r1
   2880c:       04 00 c4 00                         mov r33=b0
   28810:       09 28 01 40 00 21       [MMI]       mov r37=r32
   28816:       00 00 00 02 00 60                   nop.m 0x0
   2881c:       04 08 00 84                         mov r35=r1;;
   28820:       08 30 01 00 00 21       [MMI]       mov r38=r0
   28826:       70 02 00 00 42 00                   mov r39=r0
   2882c:       05 00 00 84                         mov r40=r0
   28830:       09 48 01 00 00 21       [MMI]       mov r41=r0
   28836:       40 02 90 30 20 40                   ld8 r36=[r36]

Bad version:

0000000000024840 <as_get_mappable_page>:
   24840:       08 10 2d 08 80 05       [MMI]       alloc r34=ar.pfs,11,4,0
   24846:       40 02 07 8c 48 20                   addl r36=9056,r1
   2484c:       04 00 c4 00                         mov r33=b0
   24850:       09 28 01 40 00 21       [MMI]       mov r37=r32
   24856:       00 00 00 02 00 60                   nop.m 0x0
   2485c:       04 08 00 84                         mov r35=r1;;
   24860:       08 30 01 00 00 21       [MMI]       mov r38=r0
   24866:       70 02 00 00 42 00                   mov r39=r0
   2486c:       05 00 00 84                         mov r40=r0
   24870:       09 48 01 00 00 21       [MMI]       mov r41=r0
   24876:       40 02 90 30 20 40                   ld8 r36=[r36]

So the bad version seems to be adding an extra 9056 bytes to the gp register.

Excerpt from ns.map reveals that:

.got            0x0000000000034250       0x58
                0x0000000000034250                _gp = .

From here, we can make an interesting observation:

page_fault_address = .got + 9056
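
The arithmetic behind this observation can be checked with a short Python sketch. The addresses and sizes come from the ns.map excerpt, the disassembly and the objdump program header in this ticket. The 16 KiB page size is an assumption, and the helper names are illustrative only.

```python
GOT_START = 0x34250           # .got address, also the value of _gp in ns.map
GOT_SIZE = 0x58               # size of .got in ns.map
DATA_END = 0x34250 + 0x730    # vaddr + memsz of the rw- LOAD segment
PAGE_SIZE = 0x4000            # assumption: 16 KiB pages; not stated in the ticket

GOOD_OFFSET = 0               # 'addl r36=0,r1' in the good version
BAD_OFFSET = 9056             # 'addl r36=9056,r1' in the bad version


def inside_got(address):
    """True if the address falls within the .got section."""
    return GOT_START <= address < GOT_START + GOT_SIZE


def same_page(a, b, page_size=PAGE_SIZE):
    """True if both addresses fall on the same page."""
    return a // page_size == b // page_size


bad_address = GOT_START + BAD_OFFSET
print(hex(bad_address))                    # address the bad version loads from
print(inside_got(GOT_START + GOOD_OFFSET)) # good version stays inside .got
print(inside_got(bad_address))             # bad version does not
print(bad_address >= DATA_END, same_page(bad_address, GOT_START))
```

Under the page size assumption, the bad load address is beyond the end of the data segment yet still on the same page as the GOT, which matches the explanation of why the ELF backend handles the fault.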

comment:9 by Jakub Jermář, 13 years ago

This is the assembly generated by GCC:

Good version:

        .global as_get_mappable_page#
        .type   as_get_mappable_page#, @function
        .proc as_get_mappable_page#
as_get_mappable_page:
[.LFB9:]
        .loc 1 111 0
[.LVL4:]
        .mmi
        alloc r34 = ar.pfs, 1, 3, 7, 0
        .loc 1 112 0
        addl r36 = @ltoff(@fptr(__entry#)), gp
        .loc 1 111 0
        mov r33 = b0
        .loc 1 112 0
        .mmi
        mov r37 = r32
        .loc 1 111 0
        nop 0
        mov r35 = r1
        .loc 1 112 0
        ;;
        .mmi
        mov r38 = r0
        mov r39 = r0
        mov r40 = r0
        .mmi
        mov r41 = r0
        ld8 r36 = [r36]

Bad version:

        .global as_get_mappable_page#
        .type   as_get_mappable_page#, @function
        .proc as_get_mappable_page#
as_get_mappable_page:
[.LFB9:]
        .loc 1 111 0
[.LVL4:]
        .mmi
        alloc r34 = ar.pfs, 1, 3, 7, 0
[.LCFI8:]
        .loc 1 112 0
        addl r36 = @ltoff(@fptr(__entry#)), gp
        .loc 1 111 0
        mov r33 = b0
[.LCFI9:]
        .loc 1 112 0
        .mmi
        mov r37 = r32
        .loc 1 111 0
        nop 0
        mov r35 = r1
        .loc 1 112 0
        ;;
        .mmi
        mov r38 = r0
        mov r39 = r0
        mov r40 = r0
        .mmi
        mov r41 = r0
        ld8 r36 = [r36]

Both versions are de facto identical. They both do:

        addl r36 = @ltoff(@fptr(__entry#)), gp

What happens here is that the code assumes it is possible to read the full 64-bit address of the __entry symbol using a gp-relative offset. Since the two versions are identical, the problem must be in either the assembler phase or the linker phase.

Last edited 13 years ago by Jakub Jermář (previous) (diff)

comment:10 by Jakub Jermář, 13 years ago

I made a simple experiment. I built HelenOS/ia64/ski with the new toolchain. Afterwards I changed my CROSS_PREFIX to point to the old toolchain and removed the binaries. I then ran make again so that only the link phase was repeated, reusing the .o files built by the new toolchain. This resulted in a functional HelenOS image which booted fine into the bdsh prompt. This suggests that the problematic component of the new toolchain is the linker (the assembler and gcc were shown to generate functional code).

comment:11 by Jakub Jermář, 13 years ago

I filed bug report 12669 in binutils' Bugzilla.

comment:12 by Jakub Jermář, 13 years ago

Looks like the problem goes away if we use __gp instead of _gp. __gp is a special symbol to ld which tells it where the binary wants its GP register to point, while mere _gp is just a normal HelenOS symbol without any special meaning. We will need to fix this for other architectures as well, since without the enforcement through __gp, the linker will pick a location for GP arbitrarily. It is still not clear why the linker picked a location beyond the end of the image without producing at least a warning.

comment:13 by Jakub Jermář, 13 years ago

Resolution: fixed
Status: new → closed

in reply to:  12 comment:14 by Jakub Jermář, 13 years ago

Replying to jermar:

We will need to fix this for other architectures as well, since without the enforcement through __gp, the linker will pick a location for GP arbitrarily.

Hm, this is not as unified as it might have seemed. While the symbol needs to be called __gp on ia64, it must be called _gp on mips32.
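
One way to guard against this kind of mismatch is to check a linker map for the GP symbol expected on a given architecture. The Python sketch below is only an illustration. The per-architecture names are the ones reported in this ticket. The parser handles only assignment lines shaped like the ns.map excerpt in comment 8, and the function names are hypothetical.

```python
import re

# GP symbol name per architecture, as reported in this ticket.
GP_SYMBOL = {"ia64": "__gp", "mips32": "_gp"}

# Matches map lines such as: "    0x0000000000034250    _gp = ."
ASSIGNMENT = re.compile(r"^\s*(0x[0-9a-fA-F]+)\s+(\w+)\s*=")


def defined_symbols(map_text):
    """Collect symbol assignments (name -> address) from linker map text."""
    symbols = {}
    for line in map_text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            symbols[match.group(2)] = int(match.group(1), 16)
    return symbols


def gp_symbol_defined(map_text, arch):
    """True if the map defines the GP symbol expected for the architecture."""
    return GP_SYMBOL[arch] in defined_symbols(map_text)


# Excerpt from ns.map as quoted in comment 8.
NS_MAP_EXCERPT = """\
.got            0x0000000000034250       0x58
                0x0000000000034250                _gp = .
"""

print(defined_symbols(NS_MAP_EXCERPT))
print(gp_symbol_defined(NS_MAP_EXCERPT, "ia64"))
```

For the excerpt from the broken ia64 build, only _gp is found, so the check for ia64 reports that the expected __gp symbol is missing.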
