Opened 9 years ago

Closed 9 years ago

#260 closed defect (fixed)

Booting process sometimes gets stuck while starting shells on VCs

Reported by: Jiri Svoboda
Owned by: Jiri Svoboda
Priority: major
Milestone: 0.4.3
Component: helenos/fs/fat
Version: mainline
Keywords:
Cc: jakub@…
Blocker for:
Depends on:
See also:

Description

Booting sometimes gets stuck at the point where the first four VCs contain the getterm banner. No more banners are printed (no other VCs are active) and the command line is not reached on any VC. Keyboard and mouse input on the console still work, and it is possible to enter the kernel console.

Reproduced on revision: mainline,644
Config: defaults/ia32
Qemu version: 0.10.3
Qemu command line: qemu -m 32 -cdrom image.iso -boot d
Reproducibility: non-deterministic, in about 50% of attempts

Change History (11)

comment:1 Changed 9 years ago by Jiri Svoboda

Owner: set to Jiri Svoboda
Status: new → accepted

comment:2 Changed 9 years ago by Jiri Svoboda

Running 'tasks' in kcon shows four tasks with the name 'getterm' and five tasks with the name 'loader'. On a non-debug build we can see 7x getterm and 7x loader.

comment:3 Changed 9 years ago by Martin Decky

Just a guess: What about available memory? The non-determinism could simply be caused by race conditions on allocating memory (mapping and unmapping of address space areas). Then the second question would be why all the tasks don't get unblocked eventually.

comment:4 Changed 9 years ago by Jiri Svoboda

That was my first guess as well, but no, it's not an OOM. Increasing memory does not help.

A tiny bit of further investigation: The loader tasks are waiting for VFS (1x vfs_in_read, 4x vfs_in_open), VFS is waiting for FAT (1x vfs_out_read, 4x vfs_out_lookup). FAT is not waiting for any other server.

With tmpfs root filesystem, the problem does not occur.

comment:5 Changed 9 years ago by Jakub Jermář

Cc: jakub@… added

comment:6 Changed 9 years ago by Jakub Jermář

Component: unspecified → fs/fat

I have quickly prototyped a deadlock detection mechanism for fibril synchronization primitives (only mutexes as of now). It can be found in the lp:~jakub/helenos/deadlock-detection branch, where it will soak until it is ready for mainline.

Nevertheless, the detection mechanism is useful even now as it detected the following deadlock between two fibrils in fat:

fibril A:
fibril_mutex_lock()
fat_idx_get_by_pos()
fat_match()
libfs_lookup()
fat_lookup()

fibril B:
fibril_mutex_lock()
fat_idx_get_by_index()
fat_root_get()
libfs_lookup()
fat_lookup()
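
The detection idea can be sketched in a few lines of C. This is only a minimal model, assuming each fibril blocks on at most one mutex and each mutex has at most one owner. The types model_fibril and model_mutex and the function would_deadlock() are hypothetical names made up for illustration; they are not the HelenOS fibril API and not necessarily how the deadlock-detection branch works.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types for illustration; not the HelenOS fibril API. */
struct model_mutex;

struct model_fibril {
    const char *name;
    struct model_mutex *waits_for;  /* mutex this fibril is blocked on, or NULL */
};

struct model_mutex {
    const char *name;
    struct model_fibril *owner;     /* fibril holding the mutex, or NULL */
};

/* Upper bound on the chain walk, so a pre-existing cycle that does not
 * involve the caller cannot make the loop spin forever. */
#define MODEL_MAX_CHAIN 64

/*
 * Returns true if letting `self` block on `wanted` would close a cycle:
 * follow wanted -> owner -> mutex the owner waits for -> its owner -> ...
 * and report a deadlock if the walk arrives back at `self`.
 */
bool would_deadlock(const struct model_fibril *self,
    const struct model_mutex *wanted)
{
    assert(self != NULL);

    const struct model_mutex *m = wanted;
    for (int steps = 0; m != NULL && m->owner != NULL &&
        steps < MODEL_MAX_CHAIN; steps++) {
        if (m->owner == self)
            return true;
        m = m->owner->waits_for;
    }
    return false;
}
```

In this model, a check like would_deadlock(current_fibril, mutex) would run just before a fibril goes to sleep on a contended mutex, which is the point where the two stacks above were captured.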

comment:7 Changed 9 years ago by Jakub Jermář

OK, I think I know what the problem is, based on the stacks above. In fat_match(), we first lock parent->idx->lock and then call fat_idx_get_by_pos(), which wants to lock used_lock. Meanwhile, another fibril has managed to lock used_lock first in fat_idx_get_by_index(), but it cannot get the idx lock for parent, because that lock is already held by the first fibril. Each fibril is therefore waiting for a lock that the other one holds.
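
A common remedy for this kind of lock-order inversion is to give the locks a fixed global order and always acquire them in that order. The sketch below is a hypothetical illustration of such a check, not the actual fix in the fs branch (which is not described in this ticket). The rank values and the function order_respected() are assumptions made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical lock ranks: a lock may only be taken while holding locks
 * of strictly lower rank. Ranking used_lock below the idx lock is an
 * arbitrary choice for this example.
 */
enum lock_rank {
    RANK_USED_LOCK = 1,
    RANK_IDX_LOCK = 2
};

/*
 * Returns true if the locks in `seq` (listed in acquisition order, all
 * held at the same time) are taken in strictly increasing rank.
 */
bool order_respected(const enum lock_rank *seq, size_t n)
{
    assert(seq != NULL || n == 0);

    for (size_t i = 1; i < n; i++) {
        if (seq[i] <= seq[i - 1])
            return false;
    }
    return true;
}
```

Under this ranking, the fat_match() path described above (idx lock first, then used_lock) would be flagged, while the fat_idx_get_by_index() path (used_lock first, then idx lock) would pass. Two paths that both pass such a check cannot deadlock against each other on these two locks.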

comment:8 Changed 9 years ago by Jakub Jermář

Jiri, I have just (hopefully) fixed this in lp:~jakub/helenos/fs. Can you merge from there and verify the issue is no longer reproducible?

Thanks,
Jakub

comment:9 Changed 9 years ago by Jiri Svoboda

Yes, I can confirm it fixes the issue :-)

comment:10 Changed 9 years ago by Jiri Svoboda

Nice work!

comment:11 Changed 9 years ago by Jakub Jermář

Resolution: fixed
Status: accepted → closed