Opened 13 years ago
Closed 12 years ago
#396 closed defect (fixed)
tester malloc1 sometimes fails
Reported by: Maurizio Lombardi
Owned by: Jakub Jermář
Priority: major
Milestone: 0.5.0
Component: helenos/lib/c
Version: mainline
Keywords: malloc, amd64, fpu
Cc:
Blocker for:
Depends on:
See also:
Description
The tester malloc1 command sometimes fails with a "corrupted content of a data block" error.
There is no systematic way to reproduce this bug, but it appears quite often when multiple instances of the "tester malloc1" command are executed on a 4-way SMP configuration.
Attachments (2)
Change History (11)
by , 13 years ago
Attachment: memory_corruption.png added
comment:1 by , 13 years ago
Component: helenos/unspecified → helenos/uspace/libc
Keywords: malloc added
Owner: set to
comment:2 by , 12 years ago
Hi Maurizio,
can you still reproduce this on top of the current mainline? I had no luck with 1484. In your experience, what is the chance that the bug will show when emulating on a 4-way QEMU machine? How many tester tasks need to run? What QEMU version was this, and did you use KVM?
Thanks,
Jakub
comment:3 by , 12 years ago
Keywords: amd64 added
comment:4 by , 12 years ago
I managed to reproduce this on amd64.
The function which causes the test to fail is check_block(). It detects a mismatch between a value stored somewhere in the block and the expected value. I instrumented the function to provide additional information and to perform the heap allocator consistency check. The consistency check did not reveal any heap metadata corruption, so the problem is probably only in the data itself.
Printing at most 20 bytes, starting with the first byte in the block that differs from the expected value, reveals an interesting pattern. The pattern suggests the data is not corrupted entirely, but just a little bit. Here is an example:
block=0x28230, base=0x805020, size=228516, ofs=180800, pos=base+ofs=0x831260 [actual value|expected value]: [f0|50] [f1|51] [f2|52] ... [e0|40] [e1|41] [e2|42] [e3|43]
The expected value (i.e. the fill pattern) for each byte is computed as ((block XOR pos) AND 0xff).
We can speculate that the part of block 0x28230 with base 0x805020 at offset 180800 was overwritten by the contents of a block 0x??????????????90 (0x90 XOR 0x60 == 0xf0). The patterns of both blocks seem to have the same phase (pos % 16).
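For reference, the fill/check relation described above is easy to restate in C. The following is a minimal sketch derived only from the formula above; fill_pattern(), fill() and check() are hypothetical names, not the tester's actual code:

#include <stdint.h>
#include <stddef.h>

/* Expected byte at address pos inside the block with handle 'block':
 * ((block XOR pos) AND 0xff), exactly as described above. */
static uint8_t fill_pattern(uintptr_t block, uintptr_t pos)
{
	return (uint8_t) ((block ^ pos) & 0xff);
}

/* Fill the block's data area with the pattern. */
static void fill(uintptr_t block, uint8_t *base, size_t size)
{
	for (size_t ofs = 0; ofs < size; ofs++)
		base[ofs] = fill_pattern(block, (uintptr_t) base + ofs);
}

/* Return the offset of the first mismatching byte, or size if the
 * whole block matches. */
static size_t check(uintptr_t block, const uint8_t *base, size_t size)
{
	size_t ofs;
	for (ofs = 0; ofs < size; ofs++) {
		if (base[ofs] != fill_pattern(block, (uintptr_t) base + ofs))
			break;
	}
	return ofs;
}

With this pattern, a region overwritten by another block's pattern shows a constant XOR difference between actual and expected bytes (here 0xf0 XOR 0x50 == 0xa0 == 0x30 XOR 0x90), which is how the low byte 0x90 of the other block's handle was recovered above.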
comment:5 by , 12 years ago
Status: new → accepted
comment:6 by , 12 years ago
So we are in a situation where two entities write to one physical frame. Here is a list of dead ends that I have already verified:
- allocated frame_t's are not re-allocated to someone else
- the range allocator does not return out-of-span addresses
- TLB invalidation instructions are not optimized out by the compiler
- the system with the default 256M does not have any high memory, hence the kernel uses identity-mapped addresses during these tests
comment:7 by , 12 years ago
This might not be a memory-management-related bug at all. The bug is reproducible with only one instance of tester malloc1, a couple of tester loop1 instances, and top on a 7-way QEMU/amd64.
comment:8 by , 12 years ago
I noticed the amd64 version of tester's fill_block() function extensively uses FPU and XMM registers to populate the block with the fill pattern:
00000000000043c0 <fill_block>:
    43c0: 55                       push   %rbp
    43c1: 48 89 e5                 mov    %rsp,%rbp
    43c4: 48 83 ec 10              sub    $0x10,%rsp
    43c8: 48 8b 37                 mov    (%rdi),%rsi
    43cb: 49 89 f0                 mov    %rsi,%r8
    43ce: 4c 03 47 08              add    0x8(%rdi),%r8
    43d2: 4c 39 c6                 cmp    %r8,%rsi
    43d5: 0f 83 a4 01 00 00        jae    457f <fill_block+0x1bf>
    43db: 48 89 f1                 mov    %rsi,%rcx
    43de: 4d 89 c2                 mov    %r8,%r10
    43e1: 48 89 f0                 mov    %rsi,%rax
    43e4: 48 f7 d9                 neg    %rcx
    43e7: 49 29 f2                 sub    %rsi,%r10
    43ea: 83 e1 0f                 and    $0xf,%ecx
    43ed: 4d 89 d1                 mov    %r10,%r9
    43f0: 49 39 ca                 cmp    %rcx,%r10
    43f3: 49 0f 46 ca              cmovbe %r10,%rcx
    43f7: 49 83 fa 10              cmp    $0x10,%r10
    43fb: 49 0f 46 ca              cmovbe %r10,%rcx
    43ff: 48 85 c9                 test   %rcx,%rcx
    4402: 74 22                    je     4426 <fill_block+0x66>
    4404: 0f 1f 40 00              nopl   0x0(%rax)
    4408: 89 fa                    mov    %edi,%edx
    440a: 31 c2                    xor    %eax,%edx
    440c: 88 10                    mov    %dl,(%rax)
    440e: 48 83 c0 01              add    $0x1,%rax
    4412: 48 89 c2                 mov    %rax,%rdx
    4415: 48 29 f2                 sub    %rsi,%rdx
    4418: 48 39 d1                 cmp    %rdx,%rcx
    441b: 77 eb                    ja     4408 <fill_block+0x48>
    441d: 49 39 ca                 cmp    %rcx,%r10
    4420: 0f 84 59 01 00 00        je     457f <fill_block+0x1bf>
    4426: 4d 89 cb                 mov    %r9,%r11
    4429: 49 29 cb                 sub    %rcx,%r11
    442c: 4d 89 d9                 mov    %r11,%r9
    442f: 49 c1 e9 04              shr    $0x4,%r9
    4433: 4d 89 ca                 mov    %r9,%r10
    4436: 49 c1 e2 04              shl    $0x4,%r10
    443a: 4d 85 d2                 test   %r10,%r10
    443d: 0f 84 2d 01 00 00        je     4570 <fill_block+0x1b0>
    4443: 66 0f 6e f7              movd   %edi,%xmm6
    4447: 48 8d 50 01              lea    0x1(%rax),%rdx
    444b: 48 89 45 f0              mov    %rax,-0x10(%rbp)
    444f: f3 0f 7e 45 f0           movq   -0x10(%rbp),%xmm0
    4454: 48 01 f1                 add    %rsi,%rcx
    4457: 66 0f 60 f6              punpcklbw %xmm6,%xmm6
    445b: 48 89 55 f8              mov    %rdx,-0x8(%rbp)
    445f: 0f 16 45 f8              movhps -0x8(%rbp),%xmm0
    4463: 66 0f 6f 3d 45 d7 01 00  movdqa 0x1d745(%rip),%xmm7  # 21bb0 <ipc_connect_kbox+0x1d20>
    446b: 31 d2                    xor    %edx,%edx
    446d: 66 0f 61 f6              punpcklwd %xmm6,%xmm6
    4471: 66 0f 6f d8              movdqa %xmm0,%xmm3
    4475: 66 0f 6f 15 43 d7 01 00  movdqa 0x1d743(%rip),%xmm2  # 21bc0 <ipc_connect_kbox+0x1d30>
    447d: 66 0f 70 f6 00           pshufd $0x0,%xmm6,%xmm6
    4482: 66 0f 1f 44 00 00        nopw   0x0(%rax,%rax,1)
    4488: 66 0f 6f cb              movdqa %xmm3,%xmm1
    448c: 48 83 c2 01              add    $0x1,%rdx
    4490: 66 44 0f 6f d3           movdqa %xmm3,%xmm10
    4495: 66 0f d4 df              paddq  %xmm7,%xmm3
    4499: 66 0f d4 ca              paddq  %xmm2,%xmm1
    449d: 66 44 0f 6f c9           movdqa %xmm1,%xmm9
    44a2: 44 0f c6 d1 88           shufps $0x88,%xmm1,%xmm10
    44a7: 66 44 0f d4 ca           paddq  %xmm2,%xmm9
    44ac: 66 45 0f 6f c1           movdqa %xmm9,%xmm8
    44b1: 66 44 0f d4 c2           paddq  %xmm2,%xmm8
    44b6: 66 41 0f 6f c0           movdqa %xmm8,%xmm0
    44bb: 45 0f c6 c8 88           shufps $0x88,%xmm8,%xmm9
    44c0: 66 41 0f 6f ca           movdqa %xmm10,%xmm1
    44c5: 66 0f d4 c2              paddq  %xmm2,%xmm0
    44c9: 66 0f 6f e0              movdqa %xmm0,%xmm4
    44cd: 66 45 0f 6f c2           movdqa %xmm10,%xmm8
    44d2: 66 0f d4 e2              paddq  %xmm2,%xmm4
    44d6: 66 0f 6f ec              movdqa %xmm4,%xmm5
    44da: 0f c6 c4 88              shufps $0x88,%xmm4,%xmm0
    44de: 66 41 0f 61 c9           punpcklwd %xmm9,%xmm1
    44e3: 66 0f d4 ea              paddq  %xmm2,%xmm5
    44e7: 66 0f 6f e5              movdqa %xmm5,%xmm4
    44eb: 66 45 0f 69 c1           punpckhwd %xmm9,%xmm8
    44f0: 66 0f d4 e2              paddq  %xmm2,%xmm4
    44f4: 0f c6 ec 88              shufps $0x88,%xmm4,%xmm5
    44f8: 66 0f 6f e0              movdqa %xmm0,%xmm4
    44fc: 66 44 0f 6f c9           movdqa %xmm1,%xmm9
    4501: 66 41 0f 61 c8           punpcklwd %xmm8,%xmm1
    4506: 66 0f 61 c5              punpcklwd %xmm5,%xmm0
    450a: 66 0f 69 e5              punpckhwd %xmm5,%xmm4
    450e: 66 45 0f 69 c8           punpckhwd %xmm8,%xmm9
    4513: 66 0f 6f e8              movdqa %xmm0,%xmm5
    4517: 66 0f 61 c4              punpcklwd %xmm4,%xmm0
    451b: 66 0f 69 ec              punpckhwd %xmm4,%xmm5
    451f: 66 41 0f 61 c9           punpcklwd %xmm9,%xmm1
    4524: 66 0f 61 c5              punpcklwd %xmm5,%xmm0
    4528: 66 0f 6f e1              movdqa %xmm1,%xmm4
    452c: 66 0f 60 c8              punpcklbw %xmm0,%xmm1
    4530: 66 0f 68 e0              punpckhbw %xmm0,%xmm4
    4534: 66 0f 6f c1              movdqa %xmm1,%xmm0
    4538: 66 0f 60 cc              punpcklbw %xmm4,%xmm1
    453c: 66 0f 68 c4              punpckhbw %xmm4,%xmm0
    4540: 66 0f 6f e1              movdqa %xmm1,%xmm4
    4544: 66 0f 60 c8              punpcklbw %xmm0,%xmm1
    4548: 66 0f 68 e0              punpckhbw %xmm0,%xmm4
    454c: 66 0f 60 cc              punpcklbw %xmm4,%xmm1
    4550: 66 0f ef ce              pxor   %xmm6,%xmm1
    4554: 66 0f 7f 09              movdqa %xmm1,(%rcx)
    4558: 48 83 c1 10              add    $0x10,%rcx
    455c: 4c 39 ca                 cmp    %r9,%rdx
    455f: 0f 82 23 ff ff ff        jb     4488 <fill_block+0xc8>
    4565: 4c 01 d0                 add    %r10,%rax
    4568: 4d 39 d3                 cmp    %r10,%r11
    456b: 74 12                    je     457f <fill_block+0x1bf>
    456d: 0f 1f 00                 nopl   (%rax)
    4570: 89 fa                    mov    %edi,%edx
    4572: 31 c2                    xor    %eax,%edx
    4574: 88 10                    mov    %dl,(%rax)
    4576: 48 83 c0 01              add    $0x1,%rax
    457a: 49 39 c0                 cmp    %rax,%r8
    457d: 77 f1                    ja     4570 <fill_block+0x1b0>
    457f: e8 cc 1e 01 00           callq  16450 <heap_check>
    4584: 48 85 c0                 test   %rax,%rax
    4587: 74 10                    je     4599 <fill_block+0x1d9>
    4589: 80 3d 50 10 02 00 00     cmpb   $0x0,0x21050(%rip)  # 255e0 <_tbss_start>
    4590: 74 0e                    je     45a0 <fill_block+0x1e0>
    4592: c6 05 d7 10 02 00 01     movb   $0x1,0x210d7(%rip)  # 25670 <error_flag>
    4599: c9                       leaveq
    459a: c3                       retq
    459b: 0f 1f 44 00 00           nopl   0x0(%rax,%rax,1)
    45a0: 48 8b 3d 59 11 02 00     mov    0x21159(%rip),%rdi  # 25700 <stderr>
    45a7: 48 89 c2                 mov    %rax,%rdx
    45aa: b9 98 1a 02 00           mov    $0x21a98,%ecx
    45af: 31 c0                    xor    %eax,%eax
    45b1: be c0 1a 02 00           mov    $0x21ac0,%esi
    45b6: e8 c5 ea 00 00           callq  13080 <fprintf>
    45bb: 80 3d 1e 10 02 00 00     cmpb   $0x0,0x2101e(%rip)  # 255e0 <_tbss_start>
    45c2: 75 ce                    jne    4592 <fill_block+0x1d2>
    45c4: e8 67 a2 01 00           callq  1e830 <stacktrace_print>
    45c9: eb c7                    jmp    4592 <fill_block+0x1d2>
    45cb: 0f 1f 44 00 00           nopl   0x0(%rax,%rax,1)
It seems to be writing 16-byte aligned blocks of 16 bytes at a time. This corresponds to the observation that the block becomes corrupted on a 16-byte boundary. So, could the problem be a bug in handling the FPU context switching? Can the FPU context of a thread somehow get corrupted while the thread does not run? I was not able to reproduce the issue when I configured the kernel not to use lazy FPU context switching, but that may only have made the bug less reproducible.
Here are the unique FPU instructions used in the above code snippet:
movd movdqa movhps movq paddq pshufd punpckhbw punpckhwd punpcklbw punpcklwd pxor shufps
The explicitly used FPU / XMM registers range from %xmm0 to %xmm10.
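For context, here is a rough sketch of how lazy FPU context switching works in general. The types and hooks below (cpu_t, thread_t, fpu_enable(), fpu_context_save()/fpu_context_restore()) mirror the names discussed in this ticket, but the code is an illustrative simplification, not the kernel's actual implementation:

#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in types; the real kernel structures are richer. */
typedef struct { unsigned char data[512]; } fpu_context_t;

typedef struct thread {
	fpu_context_t saved_fpu_context;
	bool fpu_context_engaged;
	struct cpu *cpu;        /* CPU that last executed the thread */
} thread_t;

typedef struct cpu {
	thread_t *fpu_owner;    /* thread whose live FPU state sits here */
} cpu_t;

/* Illustrative hardware hooks (on amd64, roughly: clts/stts and
 * fxsave/fxrstor). */
void fpu_enable(void);
void fpu_disable(void);
void fpu_context_save(fpu_context_t *ctx);
void fpu_context_restore(fpu_context_t *ctx);

/* On a context switch, the scheduler does not save or restore the FPU
 * state; it merely disables the FPU so that the next FPU/XMM
 * instruction traps (#NM, device-not-available, on amd64). */
void scheduler_retire_fpu(void)
{
	fpu_disable();
}

/* Trap handler for that first FPU/XMM instruction. */
void fpu_lazy_request(cpu_t *cpu, thread_t *thread)
{
	fpu_enable();

	/* Evict the previous owner's live state into its thread structure. */
	if (cpu->fpu_owner != NULL) {
		fpu_context_save(&cpu->fpu_owner->saved_fpu_context);
		cpu->fpu_owner->fpu_context_engaged = false;
	}

	/* Load the trapping thread's last saved state. This is correct
	 * only if the thread cannot have left newer, unsaved state on a
	 * different CPU, i.e. only if FPU owners are never migrated. */
	fpu_context_restore(&thread->saved_fpu_context);
	cpu->fpu_owner = thread;
	thread->fpu_context_engaged = true;
}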
comment:9 by , 12 years ago
Description: modified (diff)
Keywords: fpu added
Resolution: → fixed
Status: accepted → closed
Ok, I think I can explain the mechanism of this corruption, at last!
Threads that have an active FPU context (fpu_context_engaged is true) may not be migrated. The kcpulb load-balancing thread honours this, but, as it turns out, that is not enough to prevent a thread with an active FPU context from being migrated to another CPU. Imagine that a thread whose FPU context lives on the CPU goes to sleep. The CPU which later wakes the thread up (for whatever reason) will put it into its own run queue and thereby effectively migrate the thread. Since the lazy FPU context switching code relies on the fact that the FPU context owner will not be migrated under any circumstances, this is a severe flaw in our thread_ready() function.
Now, as for what is probably happening, let us assume that the tester malloc1 thread was running on CPU1 for some time. Its saved_fpu_context contains the last saved version of its FPU context. The saved FPU context becomes obsolete as soon as the thread runs again; the current version of the FPU context exists on CPU1, the last CPU which executed the thread. Let us further assume that the thread got blocked in the kernel, e.g. on some mutex. CPU2 later wakes our thread up, which effectively migrates it to CPU2, thanks to the error in thread_ready(). When the thread executes the first FPU instruction in fill_block(), it triggers an exception on CPU2, which results in a call to scheduler_fpu_lazy_request(). In that function, the FPU on CPU2 loads the thread's saved FPU context, which is now obsolete (but may be quite similar to the current one). The current FPU context for the thread is still on CPU1. When the thread resumes execution of userspace code, it works with a stale FPU context. Had the thread not been migrated, the behavior would have been correct, because the on-CPU FPU state would first be saved into the thread's saved_fpu_context, so that scheduler_fpu_lazy_request() would load the correct context a little bit later.
Note that this could also have negatively impacted the destruction of threads, because thread_destroy() checks only the current CPU for the remembered FPU owner thread. Running this code on the wrong CPU defeats the algorithm and has the potential to cause further kernel memory corruption.
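Conceptually, the fix must ensure that a thread with an engaged FPU context is made ready on the CPU that still holds its live FPU state, rather than on whichever CPU happens to wake it up. A minimal sketch of such a check, reusing the stand-in types from the sketch in comment:8 (run_queue_insert() and CURRENT_CPU are illustrative, not actual kernel symbols, and this is not the exact mainline change):

void run_queue_insert(cpu_t *cpu, thread_t *thread);
extern cpu_t *CURRENT_CPU;

void thread_ready(thread_t *thread)
{
	cpu_t *cpu = CURRENT_CPU;   /* by default, queue on the waking CPU */

	/* A thread whose live FPU state is still resident on the CPU
	 * that last ran it may not be migrated: queue it back on that
	 * CPU instead of the one that happens to wake it up. */
	if (thread->fpu_context_engaged)
		cpu = thread->cpu;

	run_queue_insert(cpu, thread);
}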
Fixed in mainline,1532.