Opened 12 years ago
Closed 12 years ago
#458 closed defect (fixed)
Deadlocks when memory management is under pressure
Reported by: | Jakub Jermář | Owned by: | Jakub Jermář |
---|---|---|---|
Priority: | major | Milestone: | 0.5.0 |
Component: | helenos/kernel/generic | Version: | mainline |
Keywords: | mm | Cc: | |
Blocker for: | | Depends on: | |
See also: | #445 | | |
Description
As of mainline,1486, running `tester malloc1` and kconsole's `test *`, or two instances of `tester malloc1` on an SMP system, may deadlock the kernel in various ways.
One such deadlock is depicted in the attached picture (courtesy of Maurizio).
Some of these deadlocks are caused by a TLB shootdown sequence spinning on some non-IRQ-spinlock which is held by another CPU interrupted by the TLB shootdown IPI.
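The circular wait described above can be sketched as a tiny wait-for graph. This is only an illustrative model, not HelenOS code: the names `shootdown_deadlocks`, `CPU_INITIATOR` and `CPU_VICTIM` are hypothetical, and it assumes that the IPI handler on the interrupted CPU spins until the initiator finishes the shootdown sequence.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical two-CPU model; none of these names come from HelenOS. */
enum { CPU_INITIATOR, CPU_VICTIM, NCPUS };
#define NOBODY (-1)

/* waits_for[i] is the CPU that CPU i is spinning on, or NOBODY. */
static bool has_cycle(const int waits_for[NCPUS])
{
    for (int start = 0; start < NCPUS; start++) {
        int cur = start;
        for (int steps = 0; steps < NCPUS; steps++) {
            assert(cur >= 0 && cur < NCPUS);
            cur = waits_for[cur];
            if (cur == NOBODY)
                break;          /* chain ends, no cycle through start */
            if (cur == start)
                return true;    /* we came back: circular wait */
        }
    }
    return false;
}

/*
 * The victim CPU holds lock L. The initiator sends the shootdown IPI and
 * then spins on L as part of the shootdown sequence.
 */
bool shootdown_deadlocks(bool lock_disables_irqs)
{
    int waits_for[NCPUS] = { NOBODY, NOBODY };

    /* Initiator spins on L, which the victim holds. */
    waits_for[CPU_INITIATOR] = CPU_VICTIM;

    /*
     * Assumption: if L was taken with interrupts enabled, the IPI
     * preempts the lock holder, and the handler spins until the
     * initiator completes the shootdown. With an IRQ spinlock the IPI
     * is only taken after L has been released.
     */
    if (!lock_disables_irqs)
        waits_for[CPU_VICTIM] = CPU_INITIATOR;

    return has_cycle(waits_for);
}
```

In this model, `shootdown_deadlocks(false)` reports a cycle while `shootdown_deadlocks(true)` does not, which matches the observation that only non-IRQ spinlocks are a problem here.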
Other deadlocks do not seem to be related to TLB shootdown, but are most likely related to #445 and the fact that the system is running low on memory.
Some deadlocks are not even fully reported in the current mainline because they involve mutexes and possibly other synchronization primitives.
Attachments (1)
Change History (5)
by , 12 years ago
Attachment: | spinlock_loop.png added |
---|
comment:1 by , 12 years ago
Status: | new → accepted |
---|
comment:2 by , 12 years ago
comment:3 by , 12 years ago
So far, after mainline,1489, I only noticed hangs and deadlocks of the following kind:
- exception → CPU lock acquired → FPU lazy context switch → page fault → deadlock on CPU lock acquired
- two instances of `tester malloc1` in combination with kconsole's `test *` hung shortly after printing the message about waiting for N frames. This could happen as a result of the two testers reserving memory and kconsole allocating that memory (the kernel never reserves it in advance). It is then sufficient for the kernel to use any sort of blocking memory allocation (e.g. as part of syscall processing) and there will be no forward progress.
The former suggests there is some problem with maintaining the FPU context (wrong or corrupted thread pointer).
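To make the first hang easier to picture, here is a minimal single-threaded sketch of the nested path (exception, CPU lock, lazy FPU switch, page fault, CPU lock again). It is not the kernel's code: `dbg_lock_t`, `dbg_lock_try` and the handler names are hypothetical, and the page fault is simply assumed to happen whenever the FPU context is not mapped.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_OWNER (-1)

/* Hypothetical owner-tracking lock, used only to expose re-acquisition. */
typedef struct {
    int owner;  /* id of the CPU holding the lock, or NO_OWNER */
} dbg_lock_t;

typedef enum { ACQ_OK, ACQ_WOULD_SPIN, ACQ_SELF_DEADLOCK } acq_result_t;

acq_result_t dbg_lock_try(dbg_lock_t *lock, int cpu)
{
    if (lock->owner == cpu)
        return ACQ_SELF_DEADLOCK;   /* a real spinlock would spin forever */
    if (lock->owner != NO_OWNER)
        return ACQ_WOULD_SPIN;      /* held by another CPU */
    lock->owner = cpu;
    return ACQ_OK;
}

void dbg_lock_release(dbg_lock_t *lock, int cpu)
{
    assert(lock->owner == cpu);
    lock->owner = NO_OWNER;
}

/* The page fault handler needs the same CPU lock. */
acq_result_t page_fault_handler(dbg_lock_t *cpu_lock, int cpu)
{
    acq_result_t rc = dbg_lock_try(cpu_lock, cpu);
    if (rc == ACQ_OK)
        dbg_lock_release(cpu_lock, cpu);
    return rc;
}

/* Touching an unmapped FPU context is assumed to raise a page fault. */
acq_result_t fpu_lazy_switch(dbg_lock_t *cpu_lock, int cpu, bool fpu_ctx_mapped)
{
    if (!fpu_ctx_mapped)
        return page_fault_handler(cpu_lock, cpu);
    return ACQ_OK;
}

/* Exception entry: takes the CPU lock, then performs the lazy switch. */
acq_result_t exception_handler(int cpu, bool fpu_ctx_mapped)
{
    dbg_lock_t cpu_lock = { NO_OWNER };
    acq_result_t rc = dbg_lock_try(&cpu_lock, cpu);
    assert(rc == ACQ_OK);
    rc = fpu_lazy_switch(&cpu_lock, cpu, fpu_ctx_mapped);
    dbg_lock_release(&cpu_lock, cpu);
    return rc;
}
```

With a valid, mapped FPU context the path completes; with an unmapped one (for example because of a wrong thread pointer, as suspected above) the nested acquisition is flagged as a self-deadlock.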
The latter suggests there may still be some blocking allocations for secondary structures (i.e. other than user pages) in the syscall callpaths; some of the kernel tests may use blocking allocations too. The syscall paths need to be cleaned up. Running uspace tests together with kernel tests when both try to allocate as much memory as possible is a bad idea: since both parties may block, it may inherently lead to this kind of pathological behavior.
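One possible shape of such a cleanup, as a rough sketch only: a syscall path that takes its secondary memory with a non-blocking allocation and reports failure instead of sleeping. The frame pool, `pool_alloc_atomic` and `syscall_needing_scratch` are hypothetical stand-ins and do not correspond to the actual HelenOS allocator interface.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical frame pool; only counts frames, no real memory involved. */
typedef struct {
    size_t total;
    size_t used;
} frame_pool_t;

typedef enum { RC_OK = 0, RC_NOMEM = -1 } rc_t;

/* Non-blocking allocation: fails immediately when frames are exhausted. */
bool pool_alloc_atomic(frame_pool_t *pool, size_t count)
{
    if (count > pool->total - pool->used)
        return false;
    pool->used += count;
    return true;
}

void pool_free(frame_pool_t *pool, size_t count)
{
    pool->used -= (count <= pool->used) ? count : pool->used;
}

/*
 * A syscall that needs scratch frames for a secondary structure.
 * Instead of blocking until memory appears (which may never happen if
 * every party that could free memory is blocked too), it returns an
 * error to the caller.
 */
rc_t syscall_needing_scratch(frame_pool_t *pool, size_t scratch_frames)
{
    if (!pool_alloc_atomic(pool, scratch_frames))
        return RC_NOMEM;

    /* ... the actual syscall work would use the scratch frames here ... */

    pool_free(pool, scratch_frames);
    return RC_OK;
}
```

If two testers have taken all frames of the pool, the syscall fails with `RC_NOMEM` right away and the caller can still run and eventually free memory, so forward progress remains possible.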
If no other issues are reported soon, I will be inclined to close this ticket as fixed because the TLB-related deadlocks don't seem to be reproducible after mainline,1489.
comment:4 by , 12 years ago
Resolution: | → fixed |
---|---|
Status: | accepted → closed |
Ok, closing as fixed. Please file a new ticket if a new issue occurs.
In mainline,1489, I merged a couple of fixes that ensure that the TLB shootdown sequences will not spin on any range, slab or frame allocator lock. Let us see if there are any other memory management related deadlocks now.