Opened 12 years ago

Closed 12 years ago

#396 closed defect (fixed)

tester malloc1 sometimes fails — at Version 9

Reported by: Maurizio Lombardi
Owned by: Jakub Jermář
Priority: major
Milestone: 0.5.0
Component: helenos/lib/c
Version: mainline
Keywords: malloc, amd64, fpu
Cc:
Blocker for:
Depends on:
See also:

Description (last modified by Jakub Jermář)

The tester malloc1 command sometimes fails with a "corrupted content of a data block" error.

There is no systematic way to reproduce this bug, but it appears quite often when executing multiple instances of the "tester malloc1" command on a 4-way SMP configuration.

Change History (11)

by Maurizio Lombardi, 12 years ago

Attachment: memory_corruption.png added

screenshot

by Maurizio Lombardi, 12 years ago

Attachment: tester.tar.bz2 added

tester binary

comment:1 by Jakub Jermář, 12 years ago

Component: helenos/unspecified → helenos/uspace/libc
Keywords: malloc added
Owner: set to Jakub Jermář

comment:2 by Jakub Jermář, 12 years ago

Hi Maurizio,

can you still reproduce this on top of the current mainline? I had no luck with 1484. In your experience, what is the chance that the bug will show when emulating on a 4-way QEMU machine? How many tester tasks need to run? What QEMU version was this, and did you use KVM?

Thanks,
Jakub

comment:3 by Jakub Jermář, 12 years ago

Keywords: amd64 added

comment:4 by Jakub Jermář, 12 years ago

I managed to reproduce this on amd64.

The function that causes the test to fail is check_block(). It detects a mismatch between a value stored somewhere in the block and the expected value. I instrumented the function to provide additional information and to perform the heap allocator consistency check. The consistency check did not reveal any heap metadata corruption, so the problem is probably only in the data itself.

Printing at most 20 bytes, starting with the first byte in the block that differs from the expected value, reveals an interesting pattern. The pattern suggests the data is not corrupted entirely, but just a little bit. Here is an example:

block=0x28230, base=0x805020, size=228516, ofs=180800, pos=base+ofs=0x831260

[actual value|expected value]:
[f0|50] [f1|51] [f2|52] ... [e0|40] [e1|41] [e2|42] [e3|43]

The expected value (=fill pattern) for each byte is computed as ((block XOR pos) AND 0xff).

We can speculate that the part of block 0x28230 with base 0x805020 at offset 180800 was overwritten by the contents of block 0x??????????????90 (0x90 XOR 0x60 == 0xF0). The fill patterns of both blocks appear to have the same phase (pos mod 16).
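
For illustration, the fill/verify logic boils down to something like the following. This is a minimal C sketch reconstructed from the formula above, not the actual tester source; the mem_block_t layout and function bodies are assumptions.

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint8_t *base;  /* start of the allocated data */
        size_t size;    /* usable size in bytes */
    } mem_block_t;

    /* Fill every byte with ((block XOR pos) AND 0xff). */
    static void fill_block(mem_block_t *block)
    {
        for (uint8_t *pos = block->base; pos < block->base + block->size; pos++)
            *pos = ((uintptr_t) block ^ (uintptr_t) pos) & 0xff;
    }

    /* Verify the pattern; print [actual|expected] on the first mismatch. */
    static int check_block(mem_block_t *block)
    {
        for (uint8_t *pos = block->base; pos < block->base + block->size; pos++) {
            uint8_t expected = ((uintptr_t) block ^ (uintptr_t) pos) & 0xff;
            if (*pos != expected) {
                printf("mismatch at %p: [%02x|%02x]\n",
                    (void *) pos, *pos, expected);
                return -1;
            }
        }
        return 0;
    }

    int main(void)
    {
        static uint8_t buffer[4096];
        mem_block_t block = { buffer, sizeof(buffer) };

        fill_block(&block);
        return check_block(&block) != 0;
    }

Under this scheme, an overlap with another filled block at the same pos changes only the bits in which the two block handles differ (here 0x50 XOR 0xF0 == 0x30 XOR 0x90 == 0xA0), which matches the [f0|50] … [e3|43] pattern above.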

comment:5 by Jakub Jermář, 12 years ago

Status: new → accepted

comment:6 by Jakub Jermář, 12 years ago

So we are in a situation where two entities write to one physical frame. Here is a list of dead ends that I have already verified:

  • allocated frame_t's are not re-allocated to someone else
  • the range allocator does not return out-of-span addresses
  • TLB invalidation instructions are not optimized out by the compiler (see the sketch after this list)
  • the system with the default 256M does not have any high memory, hence the kernel uses identity-mapped addresses during these tests
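
Regarding the third item: the concern was that the compiler could elide or reorder a TLB invalidation unless it is expressed as a volatile asm with a memory clobber. A generic amd64 illustration, not a quote of the HelenOS source:

    #include <stdint.h>

    /* Invalidate the TLB entry covering one page. The volatile
     * qualifier and the "memory" clobber prevent the compiler from
     * removing or reordering the instruction. */
    static inline void tlb_invalidate_page(uintptr_t addr)
    {
        asm volatile ("invlpg (%0)" : : "r" (addr) : "memory");
    }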

comment:7 by Jakub Jermář, 12 years ago

This might not be a memory-management-related bug at all. The bug is reproducible with only one instance of tester malloc1, a couple of tester loop1 instances, and top on a 7-way QEMU/amd64.

comment:8 by Jakub Jermář, 12 years ago

I noticed that the amd64 version of the tester's fill_block() function extensively uses FPU and XMM registers to populate the block with the fill pattern:

00000000000043c0 <fill_block>:
    43c0:       55                      push   %rbp
    43c1:       48 89 e5                mov    %rsp,%rbp
    43c4:       48 83 ec 10             sub    $0x10,%rsp
    43c8:       48 8b 37                mov    (%rdi),%rsi
    43cb:       49 89 f0                mov    %rsi,%r8
    43ce:       4c 03 47 08             add    0x8(%rdi),%r8
    43d2:       4c 39 c6                cmp    %r8,%rsi
    43d5:       0f 83 a4 01 00 00       jae    457f <fill_block+0x1bf>
    43db:       48 89 f1                mov    %rsi,%rcx
    43de:       4d 89 c2                mov    %r8,%r10
    43e1:       48 89 f0                mov    %rsi,%rax
    43e4:       48 f7 d9                neg    %rcx
    43e7:       49 29 f2                sub    %rsi,%r10
    43ea:       83 e1 0f                and    $0xf,%ecx
    43ed:       4d 89 d1                mov    %r10,%r9
    43f0:       49 39 ca                cmp    %rcx,%r10
    43f3:       49 0f 46 ca             cmovbe %r10,%rcx
    43f7:       49 83 fa 10             cmp    $0x10,%r10
    43fb:       49 0f 46 ca             cmovbe %r10,%rcx
    43ff:       48 85 c9                test   %rcx,%rcx
    4402:       74 22                   je     4426 <fill_block+0x66>
    4404:       0f 1f 40 00             nopl   0x0(%rax)
    4408:       89 fa                   mov    %edi,%edx
    440a:       31 c2                   xor    %eax,%edx
    440c:       88 10                   mov    %dl,(%rax)
    440e:       48 83 c0 01             add    $0x1,%rax
    4412:       48 89 c2                mov    %rax,%rdx
    4415:       48 29 f2                sub    %rsi,%rdx
    4418:       48 39 d1                cmp    %rdx,%rcx
    441b:       77 eb                   ja     4408 <fill_block+0x48>
    441d:       49 39 ca                cmp    %rcx,%r10
    4420:       0f 84 59 01 00 00       je     457f <fill_block+0x1bf>
    4426:       4d 89 cb                mov    %r9,%r11
    4429:       49 29 cb                sub    %rcx,%r11
    442c:       4d 89 d9                mov    %r11,%r9
    442f:       49 c1 e9 04             shr    $0x4,%r9
    4433:       4d 89 ca                mov    %r9,%r10
    4436:       49 c1 e2 04             shl    $0x4,%r10
    443a:       4d 85 d2                test   %r10,%r10
    443d:       0f 84 2d 01 00 00       je     4570 <fill_block+0x1b0>
    4443:       66 0f 6e f7             movd   %edi,%xmm6
    4447:       48 8d 50 01             lea    0x1(%rax),%rdx
    444b:       48 89 45 f0             mov    %rax,-0x10(%rbp)
    444f:       f3 0f 7e 45 f0          movq   -0x10(%rbp),%xmm0
    4454:       48 01 f1                add    %rsi,%rcx
    4457:       66 0f 60 f6             punpcklbw %xmm6,%xmm6
    445b:       48 89 55 f8             mov    %rdx,-0x8(%rbp)
    445f:       0f 16 45 f8             movhps -0x8(%rbp),%xmm0
    4463:       66 0f 6f 3d 45 d7 01    movdqa 0x1d745(%rip),%xmm7        # 21bb0 <ipc_connect_kbox+0x1d20>
    446a:       00 
    446b:       31 d2                   xor    %edx,%edx
    446d:       66 0f 61 f6             punpcklwd %xmm6,%xmm6
    4471:       66 0f 6f d8             movdqa %xmm0,%xmm3
    4475:       66 0f 6f 15 43 d7 01    movdqa 0x1d743(%rip),%xmm2        # 21bc0 <ipc_connect_kbox+0x1d30>
    447c:       00 
    447d:       66 0f 70 f6 00          pshufd $0x0,%xmm6,%xmm6
    4482:       66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
    4488:       66 0f 6f cb             movdqa %xmm3,%xmm1
    448c:       48 83 c2 01             add    $0x1,%rdx
    4490:       66 44 0f 6f d3          movdqa %xmm3,%xmm10
    4495:       66 0f d4 df             paddq  %xmm7,%xmm3
    4499:       66 0f d4 ca             paddq  %xmm2,%xmm1
    449d:       66 44 0f 6f c9          movdqa %xmm1,%xmm9
    44a2:       44 0f c6 d1 88          shufps $0x88,%xmm1,%xmm10
    44a7:       66 44 0f d4 ca          paddq  %xmm2,%xmm9
    44ac:       66 45 0f 6f c1          movdqa %xmm9,%xmm8
    44b1:       66 44 0f d4 c2          paddq  %xmm2,%xmm8
    44b6:       66 41 0f 6f c0          movdqa %xmm8,%xmm0
    44bb:       45 0f c6 c8 88          shufps $0x88,%xmm8,%xmm9
    44c0:       66 41 0f 6f ca          movdqa %xmm10,%xmm1
    44c5:       66 0f d4 c2             paddq  %xmm2,%xmm0
    44c9:       66 0f 6f e0             movdqa %xmm0,%xmm4
    44cd:       66 45 0f 6f c2          movdqa %xmm10,%xmm8
    44d2:       66 0f d4 e2             paddq  %xmm2,%xmm4
    44d6:       66 0f 6f ec             movdqa %xmm4,%xmm5
    44da:       0f c6 c4 88             shufps $0x88,%xmm4,%xmm0
    44de:       66 41 0f 61 c9          punpcklwd %xmm9,%xmm1
    44e3:       66 0f d4 ea             paddq  %xmm2,%xmm5
    44e7:       66 0f 6f e5             movdqa %xmm5,%xmm4
    44eb:       66 45 0f 69 c1          punpckhwd %xmm9,%xmm8
    44f0:       66 0f d4 e2             paddq  %xmm2,%xmm4
    44f4:       0f c6 ec 88             shufps $0x88,%xmm4,%xmm5
    44f8:       66 0f 6f e0             movdqa %xmm0,%xmm4
    44fc:       66 44 0f 6f c9          movdqa %xmm1,%xmm9
    4501:       66 41 0f 61 c8          punpcklwd %xmm8,%xmm1
    4506:       66 0f 61 c5             punpcklwd %xmm5,%xmm0
    450a:       66 0f 69 e5             punpckhwd %xmm5,%xmm4
    450e:       66 45 0f 69 c8          punpckhwd %xmm8,%xmm9
    4513:       66 0f 6f e8             movdqa %xmm0,%xmm5
    4517:       66 0f 61 c4             punpcklwd %xmm4,%xmm0
    451b:       66 0f 69 ec             punpckhwd %xmm4,%xmm5
    451f:       66 41 0f 61 c9          punpcklwd %xmm9,%xmm1
    4524:       66 0f 61 c5             punpcklwd %xmm5,%xmm0
    4528:       66 0f 6f e1             movdqa %xmm1,%xmm4
    452c:       66 0f 60 c8             punpcklbw %xmm0,%xmm1
    4530:       66 0f 68 e0             punpckhbw %xmm0,%xmm4
    4534:       66 0f 6f c1             movdqa %xmm1,%xmm0
    4538:       66 0f 60 cc             punpcklbw %xmm4,%xmm1
    453c:       66 0f 68 c4             punpckhbw %xmm4,%xmm0
    4540:       66 0f 6f e1             movdqa %xmm1,%xmm4
    4544:       66 0f 60 c8             punpcklbw %xmm0,%xmm1
    4548:       66 0f 68 e0             punpckhbw %xmm0,%xmm4
    454c:       66 0f 60 cc             punpcklbw %xmm4,%xmm1
    4550:       66 0f ef ce             pxor   %xmm6,%xmm1
    4554:       66 0f 7f 09             movdqa %xmm1,(%rcx)
    4558:       48 83 c1 10             add    $0x10,%rcx
    455c:       4c 39 ca                cmp    %r9,%rdx
    455f:       0f 82 23 ff ff ff       jb     4488 <fill_block+0xc8>
    4565:       4c 01 d0                add    %r10,%rax
    4568:       4d 39 d3                cmp    %r10,%r11
    456b:       74 12                   je     457f <fill_block+0x1bf>
    456d:       0f 1f 00                nopl   (%rax)
    4570:       89 fa                   mov    %edi,%edx
    4572:       31 c2                   xor    %eax,%edx
    4574:       88 10                   mov    %dl,(%rax)
    4576:       48 83 c0 01             add    $0x1,%rax
    457a:       49 39 c0                cmp    %rax,%r8
    457d:       77 f1                   ja     4570 <fill_block+0x1b0>
    457f:       e8 cc 1e 01 00          callq  16450 <heap_check>
    4584:       48 85 c0                test   %rax,%rax
    4587:       74 10                   je     4599 <fill_block+0x1d9>
    4589:       80 3d 50 10 02 00 00    cmpb   $0x0,0x21050(%rip)        # 255e0 <_tbss_start>
    4590:       74 0e                   je     45a0 <fill_block+0x1e0>
    4592:       c6 05 d7 10 02 00 01    movb   $0x1,0x210d7(%rip)        # 25670 <error_flag>
    4599:       c9                      leaveq 
    459a:       c3                      retq   
    459b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
    45a0:       48 8b 3d 59 11 02 00    mov    0x21159(%rip),%rdi        # 25700 <stderr>
    45a7:       48 89 c2                mov    %rax,%rdx
    45aa:       b9 98 1a 02 00          mov    $0x21a98,%ecx
    45af:       31 c0                   xor    %eax,%eax
    45b1:       be c0 1a 02 00          mov    $0x21ac0,%esi
    45b6:       e8 c5 ea 00 00          callq  13080 <fprintf>
    45bb:       80 3d 1e 10 02 00 00    cmpb   $0x0,0x2101e(%rip)        # 255e0 <_tbss_start>
    45c2:       75 ce                   jne    4592 <fill_block+0x1d2>
    45c4:       e8 67 a2 01 00          callq  1e830 <stacktrace_print>
    45c9:       eb c7                   jmp    4592 <fill_block+0x1d2>
    45cb:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

It seems to be writing 16-byte-aligned blocks of 16 bytes at a time. This corresponds to the observation that the block becomes corrupted on a 16-byte boundary. So, could the problem be a bug in the handling of FPU context switching? Can the FPU context of a thread somehow get corrupted while the thread does not run? I was not able to reproduce the issue when I configured the kernel not to use lazy FPU context switching, but that may only have made the bug less reproducible.

Here are the unique FPU instructions used in the above code snippet:

movd
movdqa
movhps
movq
paddq
pshufd
punpckhbw
punpckhwd
punpcklbw
punpcklwd
pxor
shufps

The explicitly used FPU / XMM registers range from %xmm0 to %xmm10.
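
For readers unfamiliar with the mechanism under suspicion: with lazy FPU context switching, the scheduler does not save and restore FPU state on every context switch. Instead it disables the FPU (CR0.TS on amd64) when switching to a thread that does not own the on-CPU FPU state; the thread's first FPU/SSE instruction then traps, and the handler saves and restores contexts on demand. A simplified sketch with hypothetical names and stubbed primitives, not the actual HelenOS code:

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct { unsigned char area[512]; } fpu_context_t;

    typedef struct thread {
        fpu_context_t saved_fpu_context; /* in-memory copy */
        bool fpu_context_engaged;        /* live state is in some CPU's registers */
        struct cpu *cpu;                 /* last CPU that executed the thread */
    } thread_t;

    typedef struct cpu {
        thread_t *fpu_owner;  /* thread whose live state is in this CPU's FPU */
    } cpu_t;

    /* Low-level primitives, stubbed for the sketch; on amd64 these map
     * to clts, fxsave and fxrstor. */
    static void fpu_enable(void) { }
    static void fpu_context_save(fpu_context_t *ctx) { (void) ctx; }
    static void fpu_context_restore(fpu_context_t *ctx) { (void) ctx; }

    /* Called from the Device Not Available (#NM) trap: the current
     * thread executed its first FPU instruction since the last context
     * switch on this CPU. */
    void fpu_lazy_request(cpu_t *cpu, thread_t *current)
    {
        fpu_enable();

        if (cpu->fpu_owner != NULL)
            fpu_context_save(&cpu->fpu_owner->saved_fpu_context);

        /* Restoring saved_fpu_context is correct only if the live
         * state of 'current' was saved on this CPU first. If the
         * thread was silently migrated here while fpu_context_engaged
         * was true, the in-memory copy is stale -- and stale XMM
         * contents are exactly what fill_block() would then write. */
        fpu_context_restore(&current->saved_fpu_context);

        cpu->fpu_owner = current;
        current->fpu_context_engaged = true;
    }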

comment:9 by Jakub Jermář, 12 years ago

Description: modified (diff)
Keywords: fpu added
Resolution: fixed
Status: accepted → closed

Ok, I think I can explain the mechanism of this corruption, at last!

Threads that have an active FPU context (fpu_context_engaged is true) may not be migrated. The kcpulb load-balancing thread honours this, but, as it turns out, that is not enough to prevent a thread with an active FPU context from being migrated to another CPU. Imagine that a thread whose FPU context is on the CPU goes to sleep. The CPU that wakes the thread up again (for whatever reason) will put it into its own run queue and thereby effectively migrate the thread. Since the lazy FPU context switching code relies on the fact that the FPU context owner will not be migrated under any circumstances, this is a severe flaw in our thread_ready() function.

Now, as for what is probably happening, let us assume that the tester malloc1 thread was running on CPU1 for some time. Its saved_fpu_context contains the last saved version of its FPU context. The saved FPU context becomes obsolete as soon as the thread runs again; the current version of the FPU context then exists only on CPU1, the last CPU that executed the thread. Now let us further assume that the thread got blocked in the kernel, e.g. on some mutex. CPU2 later wakes up our thread, which effectively migrates it to CPU2, thanks to the error in thread_ready(). When the thread executes the first FPU instruction in fill_block(), it triggers an exception on CPU2, which results in a call to scheduler_fpu_lazy_request(). In that function, the FPU on CPU2 loads the thread's saved FPU context, which is now obsolete (but may be quite similar to the current one). The current FPU context for the thread is still on CPU1. When the thread resumes execution of userspace code, it works with a stale FPU context. Had the thread not been migrated, the behavior would have been correct, because the on-CPU FPU state would first be saved in the thread's saved_fpu_context, so that scheduler_fpu_lazy_request() would load the correct context a little later.
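
The shape of a fix follows directly from this analysis: thread_ready() must not insert a thread whose FPU context is still engaged into an arbitrary CPU's run queue. A minimal sketch, reusing the hypothetical types from the sketch in comment:8 and not quoting the actual changeset (see mainline,1532 for the real fix):

    /* Hypothetical stand-ins for the kernel's per-CPU pointer and
     * run-queue insertion. */
    extern cpu_t *CPU;
    void runq_insert(cpu_t *cpu, thread_t *thread);

    /* Illustrative sketch only. */
    void thread_ready(thread_t *thread)
    {
        cpu_t *cpu = CPU;  /* the CPU performing the wakeup */

        /* An engaged FPU context pins the thread to the CPU whose
         * registers still hold its live state. */
        if (thread->fpu_context_engaged)
            cpu = thread->cpu;

        runq_insert(cpu, thread);
    }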

Note that this could also have negatively impacted the destruction of threads, because thread_destroy() checks only the current CPU's record of the FPU owner thread. Running this code on the wrong CPU defeats the algorithm and has the potential to cause further kernel memory corruption.

Fixed in mainline,1532.
