Opened 12 years ago

Closed 12 years ago

#451 closed defect (duplicate)

kernel panic when running tester malloc1/2 on 4-way SMP

Reported by: Maurizio Lombardi
Owned by: Jakub Jermář
Priority: major
Milestone: 0.5.0
Component: helenos/kernel/generic
Version: mainline
Keywords: slab, virtual memory, kernel panic, amd64
Cc:
Blocker for:
Depends on:
See also: #396

Description

Running several concurrent instances of "tester malloc1" and "tester malloc2" sometimes triggers a kernel panic in the slab allocator.
The bug appears quite often on 4-way SMP QEMU machines.

Attachments (4)

kpanic-smp4.png (34.1 KB ) - added by Maurizio Lombardi 12 years ago.
kernel panic screenshot
kernel.raw.gz (188.6 KB ) - added by Maurizio Lombardi 12 years ago.
slab_is_null.png (33.8 KB ) - added by Maurizio Lombardi 12 years ago.
In the slab_obj_destroy() function, if slab is NULL then the following instruction is executed: "slab = obj2slab(obj);". I added an assert to raise a kernel panic if slab is still NULL.
kpanic_slab_create.png (54.2 KB ) - added by Maurizio Lombardi 12 years ago.
kernel panic in slab_obj_create()


Change History (9)

by Maurizio Lombardi, 12 years ago

Attachment: kpanic-smp4.png added

kernel panic screenshot

by Maurizio Lombardi, 12 years ago

Attachment: kernel.raw.gz added

comment:1 by Jakub Jermář, 12 years ago

What we see here is a failed assertion which suggests that the kernel was holding a spinlock while it attempted to sleep. In this case, the spinlock is held in slab_reclaim():

        irq_spinlock_lock(&slab_cache_lock, true);

        size_t frames = 0;
        list_foreach(slab_cache_list, cur) {
                slab_cache_t *cache = list_get_instance(cur, slab_cache_t, link);
                frames += _slab_reclaim(cache, flags);
        }

        irq_spinlock_unlock(&slab_cache_lock, true);

and another one might be held in _slab_reclaim() (mind the pe=2 in the panic message):

        spinlock_lock(&cache->mag_cache[i].lock);

        mag = cache->mag_cache[i].current;
        if (mag)
                frames += magazine_destroy(cache, mag);
        cache->mag_cache[i].current = NULL;

        mag = cache->mag_cache[i].last;
        if (mag)
                frames += magazine_destroy(cache, mag);
        cache->mag_cache[i].last = NULL;

        spinlock_unlock(&cache->mag_cache[i].lock);
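Why the two locks above matter can be shown with a minimal sketch. It assumes that pe in the panic message is a per-CPU nesting count of preemption-disabling sections and that every spinlock acquisition increments it; that reading is an assumption, not something the panic output confirms. All toy_* names are hypothetical and are not HelenOS APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU state: how many spinlocks are currently held. */
typedef struct {
    size_t preemption_disabled;
} toy_cpu_t;

/* Hypothetical spinlock; only tracks whether it is taken. */
typedef struct {
    bool locked;
} toy_spinlock_t;

/* Taking a spinlock disables preemption one level deeper. */
static void toy_spinlock_lock(toy_cpu_t *cpu, toy_spinlock_t *lock)
{
    assert(!lock->locked);
    lock->locked = true;
    cpu->preemption_disabled++;
}

/* Releasing a spinlock undoes one level of preemption disabling. */
static void toy_spinlock_unlock(toy_cpu_t *cpu, toy_spinlock_t *lock)
{
    assert(lock->locked);
    assert(cpu->preemption_disabled > 0);
    lock->locked = false;
    cpu->preemption_disabled--;
}

/* Blocking (for example to service a page fault) is only legal
 * when no spinlock is held on this CPU. */
static bool toy_may_sleep(const toy_cpu_t *cpu)
{
    return cpu->preemption_disabled == 0;
}
```

With one lock taken in a slab_reclaim()-like caller and a second in a _slab_reclaim()-like callee, the counter in this model reaches 2 and toy_may_sleep() returns false, which is the condition the failed assertion would be reporting.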

The reason for blocking is processing of a page fault, which according to the stack trace happened at the beginning of slab_obj_destroy():

        if (!slab)
                slab = obj2slab(obj);

        ASSERT(slab->cache == cache);       /* <=== page fault HERE */

So the slab variable points to memory which is not mapped. This looks like the actual problem here, because all slab_t objects are allocated from low memory and are identity mapped. A valid slab_t pointer is therefore always mapped in the page tables, so the pointer itself must be wrong.
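As a rough illustration of what "identity mapped" implies for a slab_t pointer, here is a sketch that assumes the kernel maps low physical memory at a fixed offset. The base address, the low-memory size passed in, and all toy_* names are assumptions made for this example, not values or functions taken from the HelenOS sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed base of the kernel identity mapping (hypothetical value). */
#define TOY_KERNEL_BASE UINT64_C(0xffff800000000000)

/* Kernel address to physical address: subtract the fixed offset. */
static uint64_t toy_ka2pa(uint64_t ka)
{
    assert(ka >= TOY_KERNEL_BASE);
    return ka - TOY_KERNEL_BASE;
}

/* Physical address to kernel address: add the fixed offset. */
static uint64_t toy_pa2ka(uint64_t pa)
{
    return pa + TOY_KERNEL_BASE;
}

/* A pointer can only refer to identity-mapped low memory if it falls
 * inside [TOY_KERNEL_BASE, TOY_KERNEL_BASE + lowmem_size). */
static bool toy_is_lowmem_ka(uint64_t ka, uint64_t lowmem_size)
{
    return ka >= TOY_KERNEL_BASE && (ka - TOY_KERNEL_BASE) < lowmem_size;
}
```

Under these assumptions, a slab value that fails a check like toy_is_lowmem_ka() cannot be a genuine slab_t pointer, which is consistent with the page fault seen in the stack trace.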

The question is what value we can expect in slab. From the sources, we see that the function is called from magazine_destroy(), which passes NULL to slab_obj_destroy() in place of its third argument, i.e. slab:

        for (i = 0; i < mag->busy; i++) {
                frames += slab_obj_destroy(cache, mag->objs[i], NULL);
                atomic_dec(&cache->cached_objs);
        }

This gives us reason to believe that the value found in slab was assigned to the variable by a call to obj2slab(obj) just before the page fault:

        if (!slab)
                slab = obj2slab(obj);

Now, what does obj2slab() do?

NO_TRACE static slab_t *obj2slab(void *obj)
{
        return (slab_t *) frame_get_parent(ADDR2PFN(KA2PA(obj)), 0);
}

Simply put, it finds the slab_t structure for the given buffer by inspecting frame allocator structures. Given a kernel address of the slab buffer, it finds the corresponding memory zone and the respective frame_t structure. The frame_t structure describes the physical memory frame which contains the allocator object. Its parent member is supposed to point to the slab_t which allocated it, as set in slab_space_alloc():

        for (i = 0; i < ((size_t) 1 << cache->order); i++)
                frame_set_parent(ADDR2PFN(KA2PA(data)) + i, slab, zone);
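The relationship between slab_space_alloc() and obj2slab() can be modelled in a few lines. This is only a sketch: it uses a single flat array of frames instead of memory zones, works on physical addresses directly, assumes 4 KiB frames, and every toy_* name is hypothetical rather than a HelenOS function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_FRAME_WIDTH 12   /* assumed 4 KiB frames */
#define TOY_FRAMES      16   /* size of the toy physical memory */

/* Stand-in for frame_t: only the parent back-pointer is modelled. */
typedef struct {
    void *parent;
} toy_frame_t;

/* Stand-in for slab_t. */
typedef struct {
    int id;
} toy_slab_t;

static toy_frame_t toy_frames[TOY_FRAMES];

/* Physical address to frame number. */
static size_t toy_addr2pfn(uintptr_t pa)
{
    return pa >> TOY_FRAME_WIDTH;
}

static void toy_frame_set_parent(size_t pfn, void *parent)
{
    assert(pfn < TOY_FRAMES);
    toy_frames[pfn].parent = parent;
}

static void *toy_frame_get_parent(size_t pfn)
{
    assert(pfn < TOY_FRAMES);
    return toy_frames[pfn].parent;
}

/* Like the loop in slab_space_alloc(): mark all 2^order frames
 * of the slab's buffer as owned by the slab. */
static void toy_slab_space_alloc(toy_slab_t *slab, uintptr_t pa, unsigned order)
{
    for (size_t i = 0; i < ((size_t) 1 << order); i++)
        toy_frame_set_parent(toy_addr2pfn(pa) + i, slab);
}

/* Like obj2slab(): recover the owning slab from an object's address. */
static toy_slab_t *toy_obj2slab(uintptr_t obj_pa)
{
    return (toy_slab_t *) toy_frame_get_parent(toy_addr2pfn(obj_pa));
}
```

In this model the lookup is only as good as the stored parent pointer: if a frame's parent was never set, was overwritten, or the frame was reused by someone else, toy_obj2slab() returns whatever happens to be there, and the caller dereferences it.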

Summing up what we already know, it looks as if the underlying frame_t contained an invalid parent pointer for the slab_t which allocated it. In order to continue with the investigation, we need to determine how the parent pointer can become invalid.

comment:2 by Jakub Jermář, 12 years ago

Keywords: amd64 added

by Maurizio Lombardi, 12 years ago

Attachment: slab_is_null.png added

In the slab_obj_destroy() function, if slab is NULL then the following instruction is executed: "slab = obj2slab(obj);". I added an assert to raise a kernel panic if slab is still NULL.

comment:3 by Jakub Jermář, 12 years ago

I created a dedicated ticket (#458) for the deadlock bug.

comment:4 by Jakub Jermář, 12 years ago

Status: new → accepted

by Maurizio Lombardi, 12 years ago

Attachment: kpanic_slab_create.png added

kernel panic in slab_obj_create()

comment:5 by Jakub Jermář, 12 years ago

Resolution: duplicate
See also: #396
Status: accepted → closed

I think this family of panics is in fact a duplicate of #396.

First, the panics were reproducible only on MP amd64, same as #396.
When #396 was fixed, the panics went away too.

Some of the panics happen in functions of scheduler_fpu_lazy_request(), which also smells of #396.

For now, I am closing this as a duplicate of #396. Feel free to reopen if you can still reproduce some of these panics.
