Opened 8 months ago

Last modified 6 months ago

#866 new defect

console input freezes on ia64/ski after GCC upgrade

Reported by: Jiří Zárevúcky Owned by:
Priority: major Milestone: 0.14.2
Component: helenos/kernel/ia64 Version: mainline
Keywords: Cc:
Blocker for: Depends on:
See also:

Description

It seems that after we upgraded GCC to version 13.2, console input in the IA-64 ski emulator always freezes a couple of seconds after boot. With old GCC 8.2.0, this doesn't happen as far as I can tell.

Change History (7)

comment:1 by Jiri Svoboda, 6 months ago

Milestone: 0.13.1 → 0.14.1

Milestone renamed

comment:2 by Jiri Svoboda, 6 months ago

I can confirm this. I booted up and ran 'ls' and it hung halfway through listing the directory.

However, you can still switch to kconsole and it works. You can still switch between virtual terminals with Fn (although none of them respond to keyboard).

Maybe some server crashed? (VFS?)

comment:3 by Jiri Svoboda, 6 months ago

HelenOS release 0.12.1 (Cathode), revision a29af3775
Built on 2024-03-19 08:45:14
Running on ia64 (term/vc0)
Copyright (c) 2001-2022 HelenOS project

Welcome to HelenOS!
http://www.helenos.org/

Type 'help' [Enter] to see a few survival tips.

/ # ls                                                                          
app                                                     <dir>                   
boot                                                    <dir>                   
cfg                                                     <dir>                   
data                                                    <dir>                   
drv                                                     <dir>                   
gzx                                                     <dir>                   
kconsole> tasks                                                                 
[id    ] [name        ] [ctn] [address         ] [as              ]             
1        kernel         0     0xe0000000040ec000 0xe00000000406c000             
2        init:ns        0     0xe0000000040ec488 0xe00000000406c0a8             
4        init:locsrv    0     0xe0000000040ecd98 0xe00000000406c1f8             
5        init:rd        0     0xe0000000040ed220 0xe00000000406c2a0             
6        init:vfs       0     0xe0000000040ed6a8 0xe00000000406c348             
7        init:logger    0     0xe0000000040edb30 0xe00000000406c3f0
8        init:ext4fs    0     0xe0000000040edfb8 0xe00000000406c498
9        /srv/fs/tmpfs  0     0xe0000000040ee440 0xe00000000406c540
10       /srv/fs/exfat  0     0xe0000000040ee8c8 0xe00000000406c5e8
11       /srv/fs/fat    0     0xe0000000040eed50 0xe00000000406c690
12       /srv/fs/cdfs   0     0xe0000000040ef1d8 0xe00000000406c738
13       /srv/fs/mfs    0     0xe0000000040ef660 0xe00000000406c7e0
14       /srv/klog      0     0xe0000000040efae8 0xe00000000406c888
15       /srv/fs/locfs  0     0xe000000007ec0000 0xe00000000406c930
16       /srv/taskmon   0     0xe000000007ec0488 0xe00000000406c9d8
17       /srv/devman    0     0xe000000007ec0910 0xe00000000406ca80
18       /drv/root/root 0     0xe000000007ec0d98 0xe00000000406cb28
19       /srv/hid/s3c24xx_ts 0     0xe000000007ec1220 0xe00000000406cbd0
20       /drv/virt/virt 0     0xe000000007ec16a8 0xe00000000406cc78
21       /drv/ski/ski   0     0xe000000007ec1b30 0xe00000000406cd20
22       /srv/bd/vbd    0     0xe000000007ec1fb8 0xe00000000406cdc8
23       /drv/kfb/kfb   0     0xe000000007ec2440 0xe00000000406ce70
24       /drv/ski-con/ski-con 0     0xe000000007ec28c8 0xe00000000406cf18
25       /srv/volsrv    0     0xe000000007ec2d50 0xe00000000406cfc0
26       /srv/net/loopip 0     0xe000000007ec31d8 0xe00000000406d068
27       /srv/net/ethip 0     0xe000000007ec3660 0xe00000000406d110
28       /srv/net/inetsrv 0     0xe000000007ec3ae8 0xe00000000406d1b8
29       /srv/net/tcp   0     0xe000000008ed0000 0xe00000000406d260
30       /srv/net/udp   0     0xe000000008ed0488 0xe00000000406d308
31       /srv/net/dnsrsrv 0     0xe000000008ed0910 0xe00000000406d3b0
32       /srv/net/dhcp  0     0xe000000008ed0d98 0xe00000000406d458
33       /srv/net/nconfsrv 0     0xe000000008ed1220 0xe00000000406d500
34       /srv/clipboard 0     0xe000000008ed16a8 0xe00000000406d5a8
35       /srv/hid/remcons 0     0xe000000008ed1b30 0xe00000000406d650
36       /srv/hid/input 0     0xe000000008ed1fb8 0xe00000000406d6f8
37       /srv/hid/output 0     0xe000000008ed2440 0xe00000000406d7a0
38       /srv/audio/hound 0     0xe000000008ed28c8 0xe00000000406d848
39       /srv/hid/console 0     0xe000000008ed2d50 0xe00000000406d8f0
40       /app/getterm   0     0xe000000008ed31d8 0xe00000000406d998
41       /app/getterm   0     0xe000000008ed3660 0xe00000000406da40
42       /app/bdsh      0     0xe000000008ed3ae8 0xe00000000406dae8
43       /app/getterm   0     0xe000000009ebc000 0xe00000000406db90
44       /app/bdsh      0     0xe000000009ebc488 0xe00000000406dc38
45       /app/getterm   0     0xe000000009ebc910 0xe00000000406dce0
46       /app/bdsh      0     0xe000000009ebcd98 0xe00000000406dd88
47       /app/getterm   0     0xe000000009ebd220 0xe00000000406de30
48       loader         0     0xe000000009ebd6a8 0xe00000000406ded8
49       /app/getterm   0     0xe000000009ebdb30 0xe00000000406df80
50       loader         0     0xe000000009ebdfb8 0xe00000000406e028
51       loader         0     0xe0000000040ec910 0xe00000000406c150
kconsole> 

comment:4 by Jiri Svoboda, 6 months ago

In the listing above vfs is still present.

However, if I try to minimize the list of tasks run from init, the behavior changes.
Starting fewer VTs leads to not even getting a command line.
If I remove most other things (networking, filesystems, etc.), I get different behavior when the problem occurs: typing a character (e.g. 'A') into an active VT prints a space. Typing Enter prints a number of spaces (tab?).

If I try to start a UI application, the system usually freezes before it starts (except Hello). But if I start it from init instead of bdsh, it starts and continues working (e.g. Task Bar, Calculator). Note: These don't use stdout/vfs for output.

However, the taskbar then fails to start an application (the call returns right away).
In these cases if I press F12 and print the list of tasks, vfs is not listed.
This suggests it crashed.

The problem is that the ski kernel console doesn't keep history, so we can't see the crash.
Unless…
If I press F12 before the problem occurs, I get:

Task init:vfs (6) killed due to an exception at program counter 0x400000000001d240.
ar.bsp=0xe00000000416c0b8       ar.bspstore=0x60000000002f4018
ar.rnat=0x0     ar.rsc=0xc
ar.ifs=0x8000000000000288       ar.pfs=0xc000000000000288
cr.isr=0x400000000      cr.ipsr=0x1013080a6010
cr.iip=0x400000000001d240, #0   (<unknown>)
cr.iipa=0x400000000001f2e0      (<unknown>)
cr.ifa=0x400    (<unknown>)
Kill message: Page fault: 0x0000000000000400.
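
For reference, the frame marker saved in cr.ifs in the dump above can be decoded by hand. This is a minimal sketch, assuming the usual IA-64 CFM field layout (sof in bits 0-6, sol in bits 7-13, sor in bits 14-17, valid bit in bit 63 of cr.ifs). The names cfm_fields_t and decode_cfm are made up for this illustration and are not part of HelenOS.

```c
#include <stdint.h>

/*
 * Fields of the IA-64 current frame marker (CFM) as saved in
 * cr.ifs / ar.pfs. Only the three size fields are decoded here.
 */
typedef struct {
	unsigned sof;  /* size of frame (bits 0-6) */
	unsigned sol;  /* size of locals, i.e. inputs + locals (bits 7-13) */
	unsigned sor;  /* size of rotating portion, in units of 8 (bits 14-17) */
} cfm_fields_t;

/* Extract the size fields; higher bits (rrb.*, valid bit) are ignored. */
static cfm_fields_t decode_cfm(uint64_t value)
{
	cfm_fields_t f;

	f.sof = (unsigned) (value & 0x7f);
	f.sol = (unsigned) ((value >> 7) & 0x7f);
	f.sor = (unsigned) ((value >> 14) & 0xf);
	return f;
}
```

Applied to cr.ifs=0x8000000000000288 this yields sof=8, sol=5, sor=0, i.e. the interrupted function had a frame of 8 stacked registers (r32-r39), 5 of which are inputs/locals.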

comment:5 by Jiri Svoboda, 6 months ago

0x400000000001d240 is inside fibril_mutex_unlock:

400000000001d200 <fibril_mutex_unlock>:
400000000001d200:       04 18 21 0a 80 05       [MLX]       alloc r35=ar.pfs,8,5,0
400000000001d206:       00 00 00 00 60 c0                   movl r38=0xe00000000001d1a8
400000000001d20c:       84 22 8c 6e 
400000000001d210:       05 20 01 02 00 21       [MLX]       mov r36=r1
400000000001d216:       00 00 00 00 00 a0                   movl r37=0x10738;;
400000000001d21c:       84 23 38 60 
400000000001d220:       08 30 05 4c 00 20       [MMI]       add r38=r1,r38
400000000001d226:       50 0a 94 00 40 40                   add r37=r1,r37
400000000001d22c:       04 00 c4 00                         mov r34=b0
400000000001d230:       19 00 00 00 01 00       [MMB]       nop.m 0x0
400000000001d236:       00 00 00 02 00 00                   nop.m 0x0
400000000001d23c:       18 1f 00 50                         br.call.sptk.many b0=400000000001f140 <__futex_lock>;;
400000000001d240:       11 08 01 40 18 10       [MIB]       ld8 r33=[r32]
400000000001d246:       10 00 90 00 42 00                   mov r1=r36
400000000001d24c:       08 e2 ff 58                         br.call.sptk.many b0=400000000001b440 <fibril_get_id>;;
400000000001d250:       09 30 20 42 07 38       [MMI]       cmp.eq p6,p7=r8,r33
400000000001d256:       00 00 00 02 00 20                   nop.m 0x0
400000000001d25c:       00 20 01 84                         mov r1=r36;;
400000000001d260:       04 00 00 00 01 00       [MLX]       nop.m 0x0
400000000001d266:       00 00 00 00 e0 c3             (p07) movl r38=0xe00000000001d260
400000000001d26c:       04 26 90 6e 
400000000001d270:       e5 38 2d 01 01 24       [MLX] (p07) mov r39=203

comment:6 by Jiri Svoboda, 6 months ago

I think it's crashing here:

static void _fibril_mutex_unlock_unsafe(fibril_mutex_t *fm)
{
        assert(fm->oi.owned_by == (fibril_t *) fibril_get_id());   <--- NULL pointer dereference??

        if (fm->counter++ < 0) {
                awaiter_t *wdp = list_pop(&fm->waiters, awaiter_t, link);
                assert(wdp);

                fibril_t *f = (fibril_t *) wdp->fid;
                fm->oi.owned_by = f;
                f->waits_for = NULL;

                fibril_notify(&wdp->event);
        } else {
                fm->oi.owned_by = NULL;
        }
}
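
A note on the "NULL pointer" question: the reported fault address is 0x400, not 0, and the faulting load is ld8 r33=[r32], where r32 should hold the first argument (fm). The pointer the code was using can be estimated as fault address minus the offset of the accessed member. The sketch below shows that arithmetic. demo_mutex_t is a HYPOTHETICAL stand-in, assuming owned_by sits at offset 0 of the mutex; the real fibril_mutex_t layout should be checked before drawing conclusions.

```c
#include <stddef.h>
#include <stdint.h>

/* HYPOTHETICAL stand-in for fibril_mutex_t; the real layout may differ. */
typedef struct {
	struct {
		void *owned_by;
	} oi;
	int counter;
} demo_mutex_t;

/*
 * Given the faulting data address and the offset of the member being
 * accessed, recover the base pointer the code must have been using.
 */
static uintptr_t implied_base(uintptr_t fault_addr, size_t member_offset)
{
	return fault_addr - member_offset;
}
```

With owned_by at offset 0, implied_base(0x400, offsetof(demo_mutex_t, oi.owned_by)) is 0x400. Under that assumption fm itself would have been 0x400, which looks more like a corrupted or stale register value than a plain NULL argument.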

comment:7 by Jiri Svoboda, 6 months ago

Milestone: 0.14.1 → 0.14.2

When I try to instrument the code to get more information, it fails in a different spot. We need a better way of debugging, e.g. at least dumping the raw stack when a task crashes.
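
As a rough idea of what a raw stack dump could look like, here is a minimal sketch of a word-dumping helper. dump_words is a hypothetical name, not an existing HelenOS function. The sketch formats into a caller-supplied buffer with standard snprintf so it can be tried in userspace. A real crash handler would print through the kernel's own output routines and would have to guard against faulting while reading the dead task's stack.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format `count` 64-bit words starting at `words` into `out`, one
 * "<address>: <value>" line per word. Stops early if the buffer is
 * full. Returns the number of characters written, excluding the
 * terminating NUL.
 */
static size_t dump_words(char *out, size_t out_size,
    const uint64_t *words, size_t count)
{
	size_t used = 0;

	if (out_size == 0)
		return 0;
	out[0] = '\0';

	for (size_t i = 0; i < count; i++) {
		int n = snprintf(out + used, out_size - used,
		    "%p: 0x%016" PRIx64 "\n",
		    (const void *) &words[i], words[i]);
		if (n < 0 || (size_t) n >= out_size - used) {
			/* Line did not fit; drop the partial line. */
			out[used] = '\0';
			break;
		}
		used += (size_t) n;
	}
	return used;
}
```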

With -O2 this problem goes away. I tried to determine which exact optimization is causing this, but failed. When I compile with -O3, but specifically disable all the extra optimizations listed in the man page as being turned on by -O3, the problem persists.

I will use -O2 in the build defaults as a workaround so that we can make the release. Then I will revert it to -O3 so that we don't forget about this - I would like to continue investigating.
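
One possible way to narrow this down while keeping -O3 as the global default is to override the optimization level for individual functions and bisect from there. This is a minimal sketch using GCC's option pragmas. suspect_function and other_function are placeholders, not real HelenOS code. GCC documents this mechanism as intended for debugging only, and not every option can be controlled this way.

```c
/*
 * GCC-specific: everything between push_options and pop_options is
 * compiled as if -O2 had been given, regardless of the command line.
 */
#pragma GCC push_options
#pragma GCC optimize ("O2")

/* Placeholder for a function suspected of being miscompiled at -O3. */
static int suspect_function(int x)
{
	return x * 2 + 1;
}

#pragma GCC pop_options

/* Compiled with the command-line optimization level again. */
static int other_function(int x)
{
	return suspect_function(x) - 1;
}
```

Moving the pragma pair around (for example around the fibril synchronization code first) would show whether the failure follows a particular function or the file as a whole.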
