Opened 12 years ago

Closed 12 years ago

#458 closed defect (fixed)

Deadlocks when memory management is under pressure

Reported by: Jakub Jermář
Owned by: Jakub Jermář
Priority: major
Milestone: 0.5.0
Component: helenos/kernel/generic
Version: mainline
Keywords: mm
Cc:
Blocker for:
Depends on:
See also: #445

Description

As of mainline,1486, running tester malloc1 and kconsole's test *, or two instances of tester malloc1 on an SMP system, may deadlock the kernel in various ways.

One such deadlock is depicted in the attached picture (courtesy of Maurizio).

Some of these deadlocks are caused by a TLB shootdown sequence spinning on some non-IRQ-spinlock which is held by another CPU interrupted by the TLB shootdown IPI.
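
The circular wait described above can be illustrated with a small wait-for model. This is only a sketch and not HelenOS code: the names wait_graph_t, wait_graph_add, wait_graph_deadlocked and shootdown_scenario are hypothetical, and the model assumes that a CPU receiving the shootdown IPI spins in its handler until the initiating CPU finishes the shootdown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CPUS 8
#define NO_CPU   (-1)

/* Hypothetical model: waits_for[i] is the CPU that CPU i spins on, or NO_CPU. */
typedef struct {
    int waits_for[MAX_CPUS];
} wait_graph_t;

void wait_graph_init(wait_graph_t *g)
{
    for (size_t i = 0; i < MAX_CPUS; i++)
        g->waits_for[i] = NO_CPU;
}

void wait_graph_add(wait_graph_t *g, int waiter, int target)
{
    assert(waiter >= 0 && waiter < MAX_CPUS);
    assert(target >= 0 && target < MAX_CPUS);
    g->waits_for[waiter] = target;
}

/* Follow the wait chain from start; returning to start means a deadlock. */
bool wait_graph_deadlocked(const wait_graph_t *g, int start)
{
    assert(start >= 0 && start < MAX_CPUS);
    int cur = g->waits_for[start];
    for (int steps = 0; cur != NO_CPU && steps < MAX_CPUS; steps++) {
        if (cur == start)
            return true;
        cur = g->waits_for[cur];
    }
    return false;
}

/*
 * CPU 0 holds a lock; CPU 1 initiates a TLB shootdown and then spins on
 * that lock. If the lock does not disable interrupts, the IPI interrupts
 * CPU 0 inside its critical section and CPU 0 waits for CPU 1 in turn.
 */
bool shootdown_scenario(bool lock_disables_interrupts)
{
    wait_graph_t g;
    wait_graph_init(&g);
    wait_graph_add(&g, 1, 0);      /* initiator spins on the held lock */
    if (!lock_disables_interrupts)
        wait_graph_add(&g, 0, 1);  /* interrupted holder waits for initiator */
    return wait_graph_deadlocked(&g, 0);
}
```

In this model, shootdown_scenario(false) reports a cycle, while shootdown_scenario(true) does not, because a holder that keeps interrupts disabled can finish its critical section and release the lock before it services the IPI.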

Other deadlocks do not seem to be related to TLB shootdown, but are most likely related to #445 and the fact that the system is running low on memory.

Some deadlocks are not even fully reported in the current mainline because they involve mutexes and possibly other synchronization primitives.
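
For illustration, a spinning lock can report a suspected deadlock by counting failed acquisition attempts, which is something a sleeping primitive such as a mutex does not get for free. The following is a minimal sketch using C11 atomics, not the kernel's actual implementation; dbg_spinlock_t, DBG_SPINLOCK_INIT, dbg_spinlock_lock, dbg_spinlock_unlock and the threshold parameter are all hypothetical names, and giving up after the threshold is a simplification chosen to keep the example testable.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical debug spinlock, for illustration only. */
typedef struct {
    atomic_flag held;
    const char *name;
} dbg_spinlock_t;

#define DBG_SPINLOCK_INIT(n) { ATOMIC_FLAG_INIT, (n) }

/*
 * Spin until the lock is acquired. After `threshold` failed attempts,
 * report a suspected deadlock and give up, returning false.
 */
bool dbg_spinlock_lock(dbg_spinlock_t *lock, unsigned long threshold)
{
    unsigned long spins = 0;

    while (atomic_flag_test_and_set_explicit(&lock->held,
        memory_order_acquire)) {
        if (++spins >= threshold) {
            fprintf(stderr, "suspected deadlock on lock %s\n", lock->name);
            return false;
        }
    }
    return true;
}

void dbg_spinlock_unlock(dbg_spinlock_t *lock)
{
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

With this sketch, a second dbg_spinlock_lock call on a lock that is already held returns false once the threshold is reached, so a caller can log the event instead of hanging silently.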

Attachments (1)

spinlock_loop.png (17.9 KB) - added by Jakub Jermář 12 years ago.


Change History (5)

by Jakub Jermář, 12 years ago

Attachment: spinlock_loop.png added

comment:1 by Jakub Jermář, 12 years ago

Status: new → accepted

comment:2 by Jakub Jermář, 12 years ago

In mainline,1489, I merged a couple of fixes that ensure that the TLB shootdown sequences will not spin on any range, slab or frame allocator lock. Let us see if there are any other memory management related deadlocks now.

comment:3 by Jakub Jermář, 12 years ago

So far, after mainline,1489, I only noticed hangs and deadlocks of the following kind:

  • exception → CPU lock acquired → FPU lazy context switch → page fault → deadlock on CPU lock acquired
  • two instances of tester malloc1 in combination with kconsole's test *: the system hung shortly after printing the message about waiting for N frames. This could happen as a result of the two testers reserving memory and the kconsole then allocating that memory (the kernel never reserves it in advance). After that, it is sufficient for the kernel to use any sort of blocking memory allocation (e.g. as part of syscall processing) and there will be no forward progress.

The former suggests there is some problem with maintaining the FPU context (wrong or corrupted thread pointer).

The latter suggests there may still be some blocking allocations for secondary structures (i.e. other than user pages) in the syscall callpaths; some of the kernel tests may use blocking allocations too. The syscall paths need to be cleaned up. Running uspace tests together with kernel tests, when both try to allocate as much memory as possible, is a bad idea: since both parties may block, it may inherently lead to this kind of pathological behavior.

If no other issues are reported soon, I will be inclined to close this ticket as fixed because the TLB-related deadlocks don't seem to be reproducible after mainline,1489.

comment:4 by Jakub Jermář, 12 years ago

Resolution: fixed
Status: accepted → closed

Ok, closing as fixed. Please file a new ticket if a new issue occurs.
