Opened 14 years ago
Closed 13 years ago
#396 closed defect (fixed)
tester malloc1 sometimes fails
| Reported by: | Maurizio Lombardi | Owned by: | Jakub Jermář |
|---|---|---|---|
| Priority: | major | Milestone: | 0.5.0 |
| Component: | helenos/lib/c | Version: | mainline |
| Keywords: | malloc, amd64, fpu | Cc: | |
| Blocker for: | | Depends on: | |
| See also: | | | |
Description
The tester malloc1 command sometimes fails with a "corrupted content of a data block" error.
There is no systematic way to reproduce this bug, but it appears quite often while executing multiple instances of the "tester malloc1" command on a 4-way SMP configuration.
Attachments (2)
Change History (11)
by , 14 years ago
| Attachment: | memory_corruption.png added |
|---|---|
comment:1 by , 14 years ago
| Component: | helenos/unspecified → helenos/uspace/libc |
|---|---|
| Keywords: | malloc added |
| Owner: | set to |
comment:2 by , 14 years ago
Hi Maurizio,
can you still reproduce this on top of the current mainline? I had no luck with 1484. In your experience, what is the chance that the bug will show when emulating on a 4-way QEMU machine? How many tester tasks need to run? What QEMU version was this? Did you use KVM?
Thanks,
Jakub
comment:3 by , 14 years ago
| Keywords: | amd64 added |
|---|---|
comment:4 by , 14 years ago
I managed to reproduce this on amd64.
The function which causes the test to fail is check_block(). It detects a mismatch between a value stored somewhere in the block and the expected value. I instrumented the function to provide additional information and to perform the heap allocator consistency check. The consistency check did not reveal any heap metadata corruption, so the problem is probably confined to the data itself.
Printing a maximum of 20 bytes, starting with the first byte in the block that differs from the expected value, reveals an interesting pattern. The pattern suggests the data is not corrupted entirely, but only slightly. Here is an example:
```
block=0x28230, base=0x805020, size=228516, ofs=180800, pos=base+ofs=0x831260 [actual value|expected value]: [f0|50] [f1|51] [f2|52] ... [e0|40] [e1|41] [e2|42] [e3|43]
```
The expected value (=fill pattern) for each byte is computed as ((block XOR pos) AND 0xff).
We can speculate that the part of block 0x28230 with base 0x805020 at offset 180800 was overwritten by the contents of a block 0x??????????????90 (0x90 XOR 0x60 == 0xF0). The patterns of both blocks seem to have the same phase (pos % 16).
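For reference, a minimal C sketch of the fill/verify pattern described above (identifiers are illustrative, not the tester's verbatim source):

```c
#include <stdint.h>
#include <stddef.h>

/* Expected fill value for one byte: derived from the block handle
 * and the byte's absolute position, i.e. ((block XOR pos) AND 0xff)
 * as stated above. */
static uint8_t pattern(uintptr_t block, uintptr_t pos)
{
	return (uint8_t) ((block ^ pos) & 0xff);
}

/* Verify a block; returns the offset of the first mismatch, or size
 * if the whole block matches. */
static size_t check(uintptr_t block, uint8_t *base, size_t size)
{
	for (size_t ofs = 0; ofs < size; ofs++) {
		if (base[ofs] != pattern(block, (uintptr_t) base + ofs))
			return ofs;
	}
	return size;
}
```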
comment:5 by , 14 years ago
| Status: | new → accepted |
|---|---|
comment:6 by , 14 years ago
So we are in a situation where two entities write to one physical frame. Here is a list of dead ends that I have already verified:
- allocated frame_t's are not re-allocated to someone else
- the range allocator does not return out-of-span addresses
- TLB invalidation instructions are not optimized out by the compiler
- the system with the default 256M does not have any high memory, hence the kernel uses identity-mapped addresses during these tests
comment:7 by , 13 years ago
This might not be a memory-management-related bug at all. The bug is reproducible with only one instance of tester malloc1, a couple of tester loop1 instances, and top on a 7-way QEMU/amd64.
comment:8 by , 13 years ago
I noticed that the amd64 version of the tester's fill_block() function makes extensive use of FPU/XMM registers to populate the block with the fill pattern:
```
00000000000043c0 <fill_block>:
43c0: 55 push %rbp
43c1: 48 89 e5 mov %rsp,%rbp
43c4: 48 83 ec 10 sub $0x10,%rsp
43c8: 48 8b 37 mov (%rdi),%rsi
43cb: 49 89 f0 mov %rsi,%r8
43ce: 4c 03 47 08 add 0x8(%rdi),%r8
43d2: 4c 39 c6 cmp %r8,%rsi
43d5: 0f 83 a4 01 00 00 jae 457f <fill_block+0x1bf>
43db: 48 89 f1 mov %rsi,%rcx
43de: 4d 89 c2 mov %r8,%r10
43e1: 48 89 f0 mov %rsi,%rax
43e4: 48 f7 d9 neg %rcx
43e7: 49 29 f2 sub %rsi,%r10
43ea: 83 e1 0f and $0xf,%ecx
43ed: 4d 89 d1 mov %r10,%r9
43f0: 49 39 ca cmp %rcx,%r10
43f3: 49 0f 46 ca cmovbe %r10,%rcx
43f7: 49 83 fa 10 cmp $0x10,%r10
43fb: 49 0f 46 ca cmovbe %r10,%rcx
43ff: 48 85 c9 test %rcx,%rcx
4402: 74 22 je 4426 <fill_block+0x66>
4404: 0f 1f 40 00 nopl 0x0(%rax)
4408: 89 fa mov %edi,%edx
440a: 31 c2 xor %eax,%edx
440c: 88 10 mov %dl,(%rax)
440e: 48 83 c0 01 add $0x1,%rax
4412: 48 89 c2 mov %rax,%rdx
4415: 48 29 f2 sub %rsi,%rdx
4418: 48 39 d1 cmp %rdx,%rcx
441b: 77 eb ja 4408 <fill_block+0x48>
441d: 49 39 ca cmp %rcx,%r10
4420: 0f 84 59 01 00 00 je 457f <fill_block+0x1bf>
4426: 4d 89 cb mov %r9,%r11
4429: 49 29 cb sub %rcx,%r11
442c: 4d 89 d9 mov %r11,%r9
442f: 49 c1 e9 04 shr $0x4,%r9
4433: 4d 89 ca mov %r9,%r10
4436: 49 c1 e2 04 shl $0x4,%r10
443a: 4d 85 d2 test %r10,%r10
443d: 0f 84 2d 01 00 00 je 4570 <fill_block+0x1b0>
4443: 66 0f 6e f7 movd %edi,%xmm6
4447: 48 8d 50 01 lea 0x1(%rax),%rdx
444b: 48 89 45 f0 mov %rax,-0x10(%rbp)
444f: f3 0f 7e 45 f0 movq -0x10(%rbp),%xmm0
4454: 48 01 f1 add %rsi,%rcx
4457: 66 0f 60 f6 punpcklbw %xmm6,%xmm6
445b: 48 89 55 f8 mov %rdx,-0x8(%rbp)
445f: 0f 16 45 f8 movhps -0x8(%rbp),%xmm0
4463: 66 0f 6f 3d 45 d7 01 movdqa 0x1d745(%rip),%xmm7 # 21bb0 <ipc_connect_kbox+0x1d20>
446a: 00
446b: 31 d2 xor %edx,%edx
446d: 66 0f 61 f6 punpcklwd %xmm6,%xmm6
4471: 66 0f 6f d8 movdqa %xmm0,%xmm3
4475: 66 0f 6f 15 43 d7 01 movdqa 0x1d743(%rip),%xmm2 # 21bc0 <ipc_connect_kbox+0x1d30>
447c: 00
447d: 66 0f 70 f6 00 pshufd $0x0,%xmm6,%xmm6
4482: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
4488: 66 0f 6f cb movdqa %xmm3,%xmm1
448c: 48 83 c2 01 add $0x1,%rdx
4490: 66 44 0f 6f d3 movdqa %xmm3,%xmm10
4495: 66 0f d4 df paddq %xmm7,%xmm3
4499: 66 0f d4 ca paddq %xmm2,%xmm1
449d: 66 44 0f 6f c9 movdqa %xmm1,%xmm9
44a2: 44 0f c6 d1 88 shufps $0x88,%xmm1,%xmm10
44a7: 66 44 0f d4 ca paddq %xmm2,%xmm9
44ac: 66 45 0f 6f c1 movdqa %xmm9,%xmm8
44b1: 66 44 0f d4 c2 paddq %xmm2,%xmm8
44b6: 66 41 0f 6f c0 movdqa %xmm8,%xmm0
44bb: 45 0f c6 c8 88 shufps $0x88,%xmm8,%xmm9
44c0: 66 41 0f 6f ca movdqa %xmm10,%xmm1
44c5: 66 0f d4 c2 paddq %xmm2,%xmm0
44c9: 66 0f 6f e0 movdqa %xmm0,%xmm4
44cd: 66 45 0f 6f c2 movdqa %xmm10,%xmm8
44d2: 66 0f d4 e2 paddq %xmm2,%xmm4
44d6: 66 0f 6f ec movdqa %xmm4,%xmm5
44da: 0f c6 c4 88 shufps $0x88,%xmm4,%xmm0
44de: 66 41 0f 61 c9 punpcklwd %xmm9,%xmm1
44e3: 66 0f d4 ea paddq %xmm2,%xmm5
44e7: 66 0f 6f e5 movdqa %xmm5,%xmm4
44eb: 66 45 0f 69 c1 punpckhwd %xmm9,%xmm8
44f0: 66 0f d4 e2 paddq %xmm2,%xmm4
44f4: 0f c6 ec 88 shufps $0x88,%xmm4,%xmm5
44f8: 66 0f 6f e0 movdqa %xmm0,%xmm4
44fc: 66 44 0f 6f c9 movdqa %xmm1,%xmm9
4501: 66 41 0f 61 c8 punpcklwd %xmm8,%xmm1
4506: 66 0f 61 c5 punpcklwd %xmm5,%xmm0
450a: 66 0f 69 e5 punpckhwd %xmm5,%xmm4
450e: 66 45 0f 69 c8 punpckhwd %xmm8,%xmm9
4513: 66 0f 6f e8 movdqa %xmm0,%xmm5
4517: 66 0f 61 c4 punpcklwd %xmm4,%xmm0
451b: 66 0f 69 ec punpckhwd %xmm4,%xmm5
451f: 66 41 0f 61 c9 punpcklwd %xmm9,%xmm1
4524: 66 0f 61 c5 punpcklwd %xmm5,%xmm0
4528: 66 0f 6f e1 movdqa %xmm1,%xmm4
452c: 66 0f 60 c8 punpcklbw %xmm0,%xmm1
4530: 66 0f 68 e0 punpckhbw %xmm0,%xmm4
4534: 66 0f 6f c1 movdqa %xmm1,%xmm0
4538: 66 0f 60 cc punpcklbw %xmm4,%xmm1
453c: 66 0f 68 c4 punpckhbw %xmm4,%xmm0
4540: 66 0f 6f e1 movdqa %xmm1,%xmm4
4544: 66 0f 60 c8 punpcklbw %xmm0,%xmm1
4548: 66 0f 68 e0 punpckhbw %xmm0,%xmm4
454c: 66 0f 60 cc punpcklbw %xmm4,%xmm1
4550: 66 0f ef ce pxor %xmm6,%xmm1
4554: 66 0f 7f 09 movdqa %xmm1,(%rcx)
4558: 48 83 c1 10 add $0x10,%rcx
455c: 4c 39 ca cmp %r9,%rdx
455f: 0f 82 23 ff ff ff jb 4488 <fill_block+0xc8>
4565: 4c 01 d0 add %r10,%rax
4568: 4d 39 d3 cmp %r10,%r11
456b: 74 12 je 457f <fill_block+0x1bf>
456d: 0f 1f 00 nopl (%rax)
4570: 89 fa mov %edi,%edx
4572: 31 c2 xor %eax,%edx
4574: 88 10 mov %dl,(%rax)
4576: 48 83 c0 01 add $0x1,%rax
457a: 49 39 c0 cmp %rax,%r8
457d: 77 f1 ja 4570 <fill_block+0x1b0>
457f: e8 cc 1e 01 00 callq 16450 <heap_check>
4584: 48 85 c0 test %rax,%rax
4587: 74 10 je 4599 <fill_block+0x1d9>
4589: 80 3d 50 10 02 00 00 cmpb $0x0,0x21050(%rip) # 255e0 <_tbss_start>
4590: 74 0e je 45a0 <fill_block+0x1e0>
4592: c6 05 d7 10 02 00 01 movb $0x1,0x210d7(%rip) # 25670 <error_flag>
4599: c9 leaveq
459a: c3 retq
459b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
45a0: 48 8b 3d 59 11 02 00 mov 0x21159(%rip),%rdi # 25700 <stderr>
45a7: 48 89 c2 mov %rax,%rdx
45aa: b9 98 1a 02 00 mov $0x21a98,%ecx
45af: 31 c0 xor %eax,%eax
45b1: be c0 1a 02 00 mov $0x21ac0,%esi
45b6: e8 c5 ea 00 00 callq 13080 <fprintf>
45bb: 80 3d 1e 10 02 00 00 cmpb $0x0,0x2101e(%rip) # 255e0 <_tbss_start>
45c2: 75 ce jne 4592 <fill_block+0x1d2>
45c4: e8 67 a2 01 00 callq 1e830 <stacktrace_print>
45c9: eb c7 jmp 4592 <fill_block+0x1d2>
45cb: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
```
It seems to be writing 16-byte-aligned blocks of 16 bytes at a time. This corresponds to the observation that the block becomes corrupted on a 16-byte boundary. So, could the problem be a bug in handling the FPU context switching? Can the FPU context of a thread somehow get corrupted while the thread is not running? I was not able to reproduce the issue when I configured the kernel not to use lazy FPU context switching, but that might only have made the bug less reproducible.
Here are the unique FPU instructions used in the above code snippet:
movd movdqa movhps movq paddq pshufd punpckhbw punpckhwd punpcklbw punpcklwd pxor shufps
The explicitly used FPU / XMM registers range from %xmm0 to %xmm10.
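For background, lazy FPU context switching in this style works roughly as sketched below. This is an illustrative outline, not the actual HelenOS source; CPU, THREAD, and fpu_owner are assumed names.

```c
/* Sketch: on every context switch the scheduler merely disables the
 * FPU unless the incoming thread already owns this CPU's FPU state.
 * The first FPU/SSE instruction executed afterwards then traps
 * (e.g. #NM via CR0.TS on amd64), and only in that trap handler are
 * FPU contexts actually saved and restored. */
static void fpu_lazy_switch_hook(void)
{
	if (THREAD == CPU->fpu_owner)
		fpu_enable();   /* context is already live on this CPU */
	else
		fpu_disable();  /* arm the trap for a lazy restore later */
}
```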
comment:9 by , 13 years ago
| Description: | modified (diff) |
|---|---|
| Keywords: | fpu added |
| Resolution: | → fixed |
| Status: | accepted → closed |
Ok, I think I can explain the mechanism of this corruption, at last!
Threads that have an active FPU context (fpu_context_engaged is true) may not be migrated. The kcpulb load-balancing thread honours this, but, as it turns out, that is not enough to prevent a thread with an active FPU context from being migrated to another CPU. Imagine that a thread whose FPU context lives on its current CPU goes to sleep. The CPU which later wakes the thread up (for whatever reason) will put it into its own run queue and thereby effectively migrate the thread. Since the lazy FPU context switching code relies on the fact that the FPU context owner will not be migrated under any circumstances, this is a severe flaw in our thread_ready() function.
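A hedged sketch of the wakeup-path constraint this implies (thread->cpu, fpu_context_engaged, and runq_insert() are assumed or hypothetical names, not the committed fix):

```c
/* Route a woken FPU-context owner back to the CPU that holds its
 * live FPU state, rather than enqueueing it on the waking CPU. */
static void thread_ready_sketch(thread_t *thread)
{
	cpu_t *cpu = CPU;            /* the CPU performing the wakeup */

	if (thread->fpu_context_engaged)
		cpu = thread->cpu;   /* do not migrate the FPU owner */

	runq_insert(cpu, thread);    /* enqueue on the chosen CPU */
}
```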
Now, as for what is probably happening, let us assume that the tester malloc1 thread was running on CPU1 for some time. Its saved_fpu_context contains the last saved version of its FPU context. The saved FPU context becomes obsolete as soon as the thread runs again. The current version of the FPU context exists on CPU1, the last CPU which executed the thread. Let us further assume that the thread got blocked in the kernel, e.g. on some mutex. CPU2 later wakes our thread up, which effectively migrates it to CPU2, thanks to the error in thread_ready(). When the thread executes the first FPU instruction in fill_block(), it triggers an exception on CPU2, which results in a call to scheduler_fpu_lazy_request(). In that function, the FPU on CPU2 loads the thread's saved FPU context, which is now obsolete (but may be quite similar to the current one). The current FPU context of the thread is still on CPU1. When the thread resumes execution of userspace code, it works with a stale FPU context. Had the thread not been migrated, the behavior would have been correct, because the on-CPU FPU state would first have been saved into the thread's saved_fpu_context, so that scheduler_fpu_lazy_request() would load the correct context a little later.
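To make the failure mode concrete, here is an illustrative sketch of the lazy-restore trap handler as described (names are assumed); the marked line is where a stale context gets loaded once the thread has been migrated:

```c
void scheduler_fpu_lazy_request(void)
{
	fpu_enable();

	/* Evict the previous owner's live state into its save area. */
	if (CPU->fpu_owner != NULL)
		fpu_context_save(CPU->fpu_owner->saved_fpu_context);

	/* STALE if the thread was migrated: its current FPU state is
	 * still sitting in the registers of the previous CPU. */
	fpu_context_restore(THREAD->saved_fpu_context);

	CPU->fpu_owner = THREAD;
	THREAD->fpu_context_engaged = true;
}
```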
Note that this could also have negatively impacted the destruction of threads, because thread_destroy() checks only the current CPU when clearing the remembered FPU owner thread. Running this code on the wrong CPU defeats the algorithm and has the potential to cause further kernel memory corruption.
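The thread_destroy() hazard, sketched under the same assumed names:

```c
/* Only the current CPU's bookkeeping is inspected; if the dying
 * thread owns the FPU on another CPU, that CPU is left holding a
 * dangling fpu_owner pointer, inviting further corruption. */
if (CPU->fpu_owner == thread)
	CPU->fpu_owner = NULL;  /* misses an owner recorded on another CPU */
```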
Fixed in mainline,1532.
