Opened 9 years ago

Closed 8 years ago

#606 closed defect (fixed)

VFS sometimes crashes in fibril_switch() on sun4v

Reported by: Jakub Jermář
Owned by: Jakub Jermář
Priority: major
Milestone: 0.7.0
Component: helenos-build/sparc64
Version: mainline
Keywords: sun4v
Cc:
Blocker for:
Depends on:
See also: #324

Description

After mainline,1921, mainline,1922 and mainline,1923, HelenOS/sun4v can make it quite far into userspace initialization. As far as stability is concerned, the only problem seems to be around this area in fibril.c:

        fibril_t *srcf = __tcb_get()->fibril_data;
        if (stype != FIBRIL_FROM_DEAD) {

                /* Save current state */
                if (!context_save(&srcf->ctx)) {
                        if (serialization_count)
                                srcf->flags &= ~FIBRIL_SERIALIZED;

                        if (srcf->clean_after_me) {           <========== HERE
                                /*
                                 * Cleanup after the dead fibril from which we
                                 * restored context here.
                                 */
                                void *stack = srcf->clean_after_me->stack;     <=========== or HERE
                                if (stack) {
                                        /*
                                         * This check is necessary because a
                                         * thread could have exited like a
                                         * normal fibril using the
                                         * FIBRIL_FROM_DEAD switch type. In that
                                         * case, its fibril will not have the
                                         * stack member filled.
                                         */

Either srcf->clean_after_me or srcf->clean_after_me->stack contains garbage (an unaligned or unmapped address).
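
When chasing this kind of corruption, it can help to check the pointers for alignment before they are dereferenced, since a 64-bit load (ldx) on SPARC V9 needs an 8-byte-aligned address. The following is only a rough sketch: ptr_is_aligned(), first_misaligned() and sketch_fibril_t are hypothetical names made up for illustration, not HelenOS code, and the struct is a simplified stand-in for the two fibril_t members involved. It detects misaligned values only; an aligned but unmapped address would still fault.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical debugging helper, not part of HelenOS.
 * Returns true if ptr is a multiple of align; align must be a power of two.
 */
static bool ptr_is_aligned(const void *ptr, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return ((uintptr_t) ptr & (align - 1)) == 0;
}

/* Simplified stand-in for the two fibril_t members involved in the crash. */
typedef struct sketch_fibril {
	struct sketch_fibril *clean_after_me;
	void *stack;
} sketch_fibril_t;

/*
 * Returns 0 if both pointers on the path to clean_after_me->stack are
 * 8-byte aligned, 1 if srcf itself is misaligned, and 2 if
 * srcf->clean_after_me is misaligned. NULL counts as aligned, because
 * the original code tests clean_after_me for NULL anyway.
 */
static int first_misaligned(const void *srcf)
{
	if (!ptr_is_aligned(srcf, sizeof(uint64_t)))
		return 1;

	const sketch_fibril_t *f = srcf;
	if (!ptr_is_aligned(f->clean_after_me, sizeof(uint64_t)))
		return 2;

	return 0;
}
```

A check like this could be placed right after context_save() returns zero, so that a bad value is reported where it is first observed rather than at the faulting load.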

The corresponding disassembly:

    c6b4:       82 10 00 07     mov  %g7, %g1
    c6b8:       c4 5f a8 7f     ldx  [ %fp + 0x87f ], %g2
    c6bc:       c2 58 60 08     ldx  [ %g1 + 8 ], %g1
    c6c0:       80 a0 a0 03     cmp  %g2, 3
    c6c4:       02 40 00 73     be,pn   %icc, c890 <fibril_switch+0x250>
    c6c8:       c2 77 a7 f7     stx  %g1, [ %fp + 0x7f7 ]
    c6cc:       40 00 53 f5     call  216a0 <context_save>
    c6d0:       90 00 60 10     add  %g1, 0x10, %o0
    c6d4:       80 a2 20 00     cmp  %o0, 0
    c6d8:       12 40 00 a1     bne,pn   %icc, c95c <fibril_switch+0x31c>
    c6dc:       03 00 00 00     sethi  %hi(0), %g1
    c6e0:       82 18 7f e8     xor  %g1, -24, %g1
    c6e4:       c2 01 c0 01     ld  [ %g7 + %g1 ], %g1
    c6e8:       80 a0 60 00     cmp  %g1, 0
    c6ec:       12 48 00 2f     bne  %icc, c7a8 <fibril_switch+0x168>
    c6f0:       c8 5f a7 f7     ldx  [ %fp + 0x7f7 ], %g4
    c6f4:       ca 5f a7 f7     ldx  [ %fp + 0x7f7 ], %g5
    c6f8:       fa 59 60 c8     ldx  [ %g5 + 0xc8 ], %i5            <======== here %g5 is misaligned
    c6fc:       22 c7 40 10     brz,a,pn   %i5, c73c <fibril_switch+0xfc>
    c700:       b0 10 20 01     mov  1, %i0
    c704:       d0 5f 60 a8     ldx  [ %i5 + 0xa8 ], %o0
    c708:       02 c2 00 06     brz,pn   %o0, c720 <fibril_switch+0xe0>
    c70c:       01 00 00 00     nop 
    c710:       7f ff e8 b4     call  69e0 <as_area_destroy>

This crash can still occasionally be encountered in the CHT pre-integration branch as well:

http://bazaar.launchpad.net/~jakub/helenos/cht-preintegration/revision/2291

Change History (1)

comment:1 by Jakub Jermář, 8 years ago

Component: helenos/srv/vfs → helenos-build/sparc64
Resolution: fixed
Status: new → closed

There was a bug in tlb_invalidate_pages(), fixed by mainline,2409, which was most likely causing this issue. As of mainline,2409, I have been unable to reproduce the problem either under gem5 or on a real T1000.
