Opened 12 years ago
Closed 12 years ago
#451 closed defect (duplicate)
kernel panic when running tester malloc1/2 on 4-way SMP
Reported by: | Maurizio Lombardi | Owned by: | Jakub Jermář |
---|---|---|---|
Priority: | major | Milestone: | 0.5.0 |
Component: | helenos/kernel/generic | Version: | mainline |
Keywords: | slab, virtual memory, kernel panic, amd64 | Cc: | |
Blocker for: | | Depends on: | |
See also: | #396 |
Description
Running some concurrent instances of "tester malloc1" and "tester malloc2" sometimes triggers a kernel panic in the kernel slab allocator.
The bug appears quite often on 4-way SMP qemu machines.
Attachments (4)
Change History (9)
by , 12 years ago
Attachment: | kpanic-smp4.png added |
---|
by , 12 years ago
Attachment: | kernel.raw.gz added |
---|
comment:1 by , 12 years ago
What we see here is a failed assertion which suggests that the kernel was holding a spinlock while it attempted to sleep. In this case, the spinlock is held in slab_reclaim():

    irq_spinlock_lock(&slab_cache_lock, true);

    size_t frames = 0;
    list_foreach(slab_cache_list, cur) {
        slab_cache_t *cache = list_get_instance(cur, slab_cache_t, link);
        frames += _slab_reclaim(cache, flags);
    }

    irq_spinlock_unlock(&slab_cache_lock, true);

and another one might be held in _slab_reclaim() (mind the pe=2 in the panic message):

    spinlock_lock(&cache->mag_cache[i].lock);

    mag = cache->mag_cache[i].current;
    if (mag)
        frames += magazine_destroy(cache, mag);
    cache->mag_cache[i].current = NULL;

    mag = cache->mag_cache[i].last;
    if (mag)
        frames += magazine_destroy(cache, mag);
    cache->mag_cache[i].last = NULL;

    spinlock_unlock(&cache->mag_cache[i].lock);

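The rule behind the failed assertion can be illustrated with a small stand-alone model. This is only a sketch, not HelenOS code: cpu_model_t, spinlock_model_t and the model_* functions are hypothetical names, and it assumes that the pe value in the panic message is a counter that grows by one for every spinlock taken. Under that assumption, two nested locks give a count of 2, and any attempt to block in that state is illegal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU bookkeeping; not the real HelenOS structures. */
typedef struct {
    size_t preemption_disabled;  /* assumed: +1 for every spinlock held */
} cpu_model_t;

typedef struct {
    bool locked;
} spinlock_model_t;

/* Take a lock and record that this CPU must not be preempted. */
static void model_spinlock_lock(cpu_model_t *cpu, spinlock_model_t *lock)
{
    assert(!lock->locked);  /* single-CPU model, so no contention */
    lock->locked = true;
    cpu->preemption_disabled++;
}

/* Release a lock and drop the matching preemption count. */
static void model_spinlock_unlock(cpu_model_t *cpu, spinlock_model_t *lock)
{
    assert(lock->locked);
    lock->locked = false;
    cpu->preemption_disabled--;
}

/* Blocking (for example to service a page fault) is only legal
 * when no spinlock is held on this CPU. */
static bool model_may_sleep(const cpu_model_t *cpu)
{
    return cpu->preemption_disabled == 0;
}
```

In this model, taking an outer lock and then an inner one leaves the counter at 2, so model_may_sleep() returns false until both are released. A page fault taken in between would hit exactly this kind of check.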
The reason for blocking is processing of a page fault, which according to the stack trace happened at the beginning of slab_obj_destroy():

    if (!slab)
        slab = obj2slab(obj);

    ASSERT(slab->cache == cache);    <=== HERE

So the slab variable points to memory which is not mapped. This looks like the actual problem here, as all slab_t objects are allocated from low memory and are identity mapped. Clearly, all of them are therefore always mapped in the page tables.
The question is what value can we expect in slab? From the sources, we see that the function is called from magazine_destroy(), which passes NULL to slab_obj_destroy() in place of its third argument, i.e. slab:

    for (i = 0; i < mag->busy; i++) {
        frames += slab_obj_destroy(cache, mag->objs[i], NULL);
        atomic_dec(&cache->cached_objs);
    }

This gives us reasons to believe that the value found in slab was assigned to the variable by a call to obj2slab(obj) just before the page fault:

    if (!slab)
        slab = obj2slab(obj);

Now, what does obj2slab() do?

    NO_TRACE static slab_t *obj2slab(void *obj)
    {
        return (slab_t *) frame_get_parent(ADDR2PFN(KA2PA(obj)), 0);
    }

Simply put, it finds the slab_t structure for the given buffer by inspecting frame allocator structures. Given a kernel address of the slab buffer, it finds the corresponding memory zone and the respective frame_t structure. The frame_t structure describes the physical memory frame which contains the allocator object. Its parent member is supposed to point to the slab_t which allocated it, as set in slab_space_alloc():

    for (i = 0; i < ((size_t) 1 << cache->order); i++)
        frame_set_parent(ADDR2PFN(KA2PA(data)) + i, slab, zone);

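The lookup described above can be sketched as a simplified stand-alone model. None of this is the HelenOS implementation: slab_model_t, frame_model_t, model_frames and the model_* functions are hypothetical. The sketch also assumes 4 KiB frames, a single zone of 16 frames, and physical addresses used directly (no KA2PA translation).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_FRAME_WIDTH  12   /* assumption: 4 KiB frames */
#define MODEL_FRAME_COUNT  16   /* assumption: one tiny zone */

/* Hypothetical stand-ins for slab_t and frame_t. */
typedef struct {
    int id;
} slab_model_t;

typedef struct {
    void *parent;   /* owner of the frame, NULL when unowned */
} frame_model_t;

/* Zero-initialized, so every frame starts without a parent. */
static frame_model_t model_frames[MODEL_FRAME_COUNT];

/* Physical address to frame number. */
static size_t model_addr2pfn(uintptr_t pa)
{
    return (size_t) (pa >> MODEL_FRAME_WIDTH);
}

/* Record the owning slab in all 2^order frames of the buffer. */
static void model_slab_space_alloc(uintptr_t data, unsigned order,
    slab_model_t *slab)
{
    for (size_t i = 0; i < ((size_t) 1 << order); i++)
        model_frames[model_addr2pfn(data) + i].parent = slab;
}

/* Map an object address back to its slab via the frame table.
 * Whatever is stored in parent is returned unchecked. */
static slab_model_t *model_obj2slab(uintptr_t obj)
{
    return (slab_model_t *) model_frames[model_addr2pfn(obj)].parent;
}
```

The model makes the weak point visible: the lookup trusts the parent field completely, so if anything overwrites it or the frame is reused without updating it, the caller receives a bogus pointer and faults on the first dereference.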
Summing up what we already know, it looks as if the underlying frame_t contained an invalid parent pointer instead of a pointer to the slab_t which allocated it. In order to continue with the investigation, we need to determine how the parent pointer can become invalid.
comment:2 by , 12 years ago
Keywords: | amd64 added |
---|
by , 12 years ago
Attachment: | slab_is_null.png added |
---|
In the slab_obj_destroy() function, if slab is NULL, then the following statement is executed: "slab = obj2slab(obj);". I added an assert to raise a kernel panic if slab is still NULL afterwards.
comment:4 by , 12 years ago
Status: | new → accepted |
---|
comment:5 by , 12 years ago
Resolution: | → duplicate |
---|---|
See also: | → #396 |
Status: | accepted → closed |
I think this family of panics is in fact a duplicate of #396.
First, the panics were reproducible only on MP amd64, same as #396.
When #396 was fixed, the panics went away too.
Some of the panics happen in functions of scheduler_fpu_lazy_request(), which also smells of #396.
For now, I am closing this as a duplicate of #396. Feel free to reopen if you can still reproduce some of these panics.