Invalidating Instruction Caches and Flushing Write Buffers

Let's consider two scenarios:

Scenario 1:

write memory block M
execute memory block M

What can go wrong:
- the data can stay in the processor's write buffers and not propagate into main memory (from where instructions are fetched)
- if the data cache is write-back (and separate from the instruction cache), the data will stay in the data cache and stale data will be fetched into the instruction cache (as cache coherency is not maintained between data and instruction caches)

Scenario 2:

execute memory block M (previously filled with instructions)
modify memory block M
execute memory block M

What can go wrong:
- anything from scenario 1
- even if the data is written to main memory, the processor's instruction cache may contain stale data (again, no cache coherency is maintained)
- even if the caches are coherent, stale instructions may linger in the CPU's instruction prefetch buffers (instruction queues)
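Both scenarios come down to the same two steps: write the new instructions as data, then synchronize the instruction side before jumping to them. A minimal sketch in C, assuming a GCC- or Clang-compatible compiler that provides the `__builtin___clear_cache` builtin; the helper name `install_code` is hypothetical and is not part of HelenOS:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (illustration only): copy freshly generated
 * instructions into a code buffer and make them safe to execute.
 */
static void install_code(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);

    /* Step 1: ordinary data writes. These may still sit in write
     * buffers or in a write-back data cache afterwards. */
    memcpy(dst, src, len);

    /* Step 2: ask the compiler to emit whatever the target needs to
     * make the range visible to instruction fetch. On targets with
     * hardware coherency this may compile to nothing. */
    __builtin___clear_cache((char *) dst, (char *) dst + len);
}
```

The builtin hides the per-architecture sequences described in the notes below. Kernel code usually cannot rely on it and has to issue the architecture's own cache and barrier instructions instead.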

Architecture Overview

Architecture	Unified cache	D$ write-through	VAC	Scenario 1	Scenario 2	VAC aliases	Cache shootdown
amd64	Can be both	Configured write-back	Yes	HW solution	HW solution	HW solution	No
arm32	Can be both	Can be both	Both	Susceptible	Susceptible	Susceptible	Yes
ia32	Can be both	Configured write-back		HW solution	HW solution		No
ia64	No		No	Susceptible	Susceptible	Immune	No
mips32	No	No (R4000), Yes (4K)	Yes	Susceptible	Susceptible	Virtual Coherency Exception (??)	??
ppc32	Can be both	Configurable per page	??	Susceptible	Susceptible	??	Yes
sparc64	No	Yes	Yes	Immune	Susceptible	HelenOS avoids them	No

Notes

amd64

See AMD x86-64 Architecture Programmer's Manual, Volume 2, System Programming, page 211, Self-Modifying Code:

Software that writes into a code segment is classified as self-modifying code. To avoid cache-coherency problems due to self-modifying code, a check is made during data writes to see whether the data-memory location corresponds to a code-segment memory location. If it does, implementations of the AMD64 architecture invalidate the corresponding instruction-cache line(s) during the data-memory write. Entries in the data cache are not invalidated, and it is possible for the modified instruction to be cached by the data cache following the memory write. A subsequent fetch of the modified instruction goes to main memory to get the coherent version of the instruction. If the data cache holds the most recent copy of the instruction rather than main memory, it provides that copy.

The processor determines whether a write is in a code segment by internally probing the instruction cache and prefetched instructions. If the internal probe returns a hit, the instruction-cache line and prefetched instructions are invalidated. The internal probes into the instruction cache and prefetch hardware are always performed using the physical address of an instruction in order to avoid potential aliasing problems associated with using virtual (linear) addresses.

The solution might be to simply issue a write barrier after the code has been modified. Are we really sure that in SMP systems the store buffers cannot forward writes to other caches? In that case a no-op would be necessary to handle self-modifying code.

arm32

All parameters (whether the cache is unified, write-through and/or virtually addressed) are implementation-defined. No real ARM machine is currently supported by HelenOS. Hardware coherency checking is minimal.

See ARM Architecture Reference Manual, page A2-28 (59-60). Prefetching and self-modifying code, Instruction Memory Barriers (IMBs)

See ARM Architecture Reference Manual, page B5-10 (552). Chapter 5-5, Memory Coherency

ia32

See IA-32 Intel Architecture Software Developer's Manual, Volume 3, 10.6. Self-Modifying Code:

A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated. This check is based on the physical address of the instruction. In addition, the P6 family and Pentium processors check whether a write to a code segment may modify an instruction that has been prefetched for execution. If the write affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on the linear address of the instruction. For the Pentium 4 and Intel Xeon processors, a write or a snoop of an instruction in a code segment, where the target instruction is already decoded and resident in the trace cache, invalidates the entire trace cache. The latter behavior means that programs that self-modify code can cause severe degradation of performance when run on the Pentium 4 and Intel Xeon processors.

In practice, the check on linear addresses should not create compatibility problems among IA-32 processors. Applications that include self-modifying code use the same linear address for modifying and fetching the instruction. Systems software, such as a debugger, that might possibly modify an instruction using a different linear address than that used to fetch the instruction, will execute a serializing operation, such as a CPUID instruction, before the modified instruction is executed, which will automatically resynchronize the instruction cache and prefetch queue. (See Section 7.1.3., “Handling Self- and Cross-Modifying Code”, for more information about the use of self-modifying code.)

The solution is the same as for amd64.
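The Intel text above singles out one case that software must handle itself: modifying an instruction through a different linear address than the one used to fetch it, followed by a serializing operation such as CPUID. A minimal sketch of that pattern, assuming GCC-style inline assembly; the helper name `patch_byte_and_serialize` is hypothetical, and on non-x86 targets the serializing step is simply omitted:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (illustration only): patch one byte of code
 * through an alias mapping, then execute a serializing instruction
 * so that the prefetch queue is resynchronized.
 */
static void patch_byte_and_serialize(volatile uint8_t *alias,
    size_t offset, uint8_t value)
{
    assert(alias != NULL);

    /* Data write through the alias (a different linear address). */
    alias[offset] = value;

#if defined(__i386__) || defined(__x86_64__)
    /* CPUID is a serializing instruction; its results are ignored. */
    uint32_t eax = 0, ebx, ecx = 0, edx;
    __asm__ volatile ("cpuid"
        : "+a" (eax), "=b" (ebx), "+c" (ecx), "=d" (edx)
        :
        : "memory");
    (void) ebx;
    (void) edx;
#endif
}
```

Code that modifies and fetches the instruction through the same linear address does not need the CPUID step according to the quoted text.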

ia64

See Intel Itanium Architecture Software Developer's Manual, 2.5 Updating Code Images, page 2:404.

mips32

For info on 4K cores, see MIPS32 4K™ Processor Core Family Software User’s Manual, 7.5 Memory Coherence Issues:

A cache presents coherency issues within the memory hierarchy which must be considered in the system design. Since a cache holds a copy of memory data, it is possible for another memory master to modify a memory location, thus making other copies of that location stale if those copies are still in use. A detailed discussion of memory coherence is beyond the scope of this document, but following are a few related comments.

A 4K processor contains no direct hardware support for managing coherency with respect to its caches, so it must be handled via system design or software. The 4K caches are write-through, so all data writes will eventually be sent to memory. Due to write buffers, however, there could be a delay in how long it takes for the write to memory to actually occur. If another memory master updates cacheable memory which could also be in the 4K caches, then those locations may need to be flushed from the cache. The only way to accomplish this invalidation is by use of the CACHE instruction.

The SYNC instruction may also be useful to software enforcing memory coherence, as it flushes the 4K’s write buffers.

ppc32

See PowerPC Microprocessor Family: The Programming Environments for 32-Bit Microprocessors, chapter 5, page 209: Cache Model and Memory Coherency

Instruction caches, if they exist, are not required to be consistent with data caches, memory, or I/O data transfers. Software must use the appropriate cache management instructions to ensure that instruction caches are kept coherent when instructions are modified by the processor or by input data transfer. When a processor alters a memory location that may be contained in an instruction cache, software must ensure that updates to memory are visible to the instruction fetching mechanism. Although the instructions to enforce consistency vary among implementations, the following sequence for a uniprocessor system is typical:

dcbst (update memory)
sync (wait for update)
icbi (invalidate copy in instruction cache)
isync (perform context synchronization)
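The sequence above operates on one cache line at a time, so a range of modified code has to be walked line by line. A minimal sketch, assuming GCC-style inline assembly and a 32-byte cache line (an assumption; real code must use the line size of the actual processor). The function name `sync_icache_range` is hypothetical, and on non-PowerPC targets the sketch falls back to the compiler builtin `__builtin___clear_cache`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-byte cache lines. */
#define CACHE_LINE_SIZE 32u

/*
 * Hypothetical helper (illustration only): make [start, start + len)
 * visible to instruction fetch. Returns the number of cache lines
 * that the range covers.
 */
static size_t sync_icache_range(void *start, size_t len)
{
    assert(start != NULL);
    if (len == 0)
        return 0;

    uintptr_t first = (uintptr_t) start & ~(uintptr_t) (CACHE_LINE_SIZE - 1);
    uintptr_t end = (uintptr_t) start + len;
    size_t lines = 0;
    uintptr_t p;

    for (p = first; p < end; p += CACHE_LINE_SIZE)
        lines++;

#if defined(__powerpc__) || defined(__ppc__)
    /* dcbst: push modified data cache lines out to memory. */
    for (p = first; p < end; p += CACHE_LINE_SIZE)
        __asm__ volatile ("dcbst 0, %0" : : "r" (p) : "memory");
    /* sync: wait for the updates to complete. */
    __asm__ volatile ("sync" : : : "memory");
    /* icbi: invalidate the stale copies in the instruction cache. */
    for (p = first; p < end; p += CACHE_LINE_SIZE)
        __asm__ volatile ("icbi 0, %0" : : "r" (p) : "memory");
    /* isync: context synchronization, discards prefetched instructions. */
    __asm__ volatile ("isync" : : : "memory");
#else
    /* Other targets: let the compiler pick the sequence. */
    __builtin___clear_cache((char *) start, (char *) start + len);
#endif

    return lines;
}
```

The sketch follows the uniprocessor sequence quoted above. As the manual notes, the exact instructions vary among implementations.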

sparc64

On UltraSPARC (sun4u) processors, the D$ is write-through.

See UltraSPARC User's Manual, 14.4.4. FLUSH and Self-Modifying Code (Impdep #122), page 247:

FLUSH is needed to synchronize code and data spaces after code space is modified during program execution.

…

SPARC-V9 specifies that the FLUSH instruction has no latency on the issuing processor. In other words, a store to instruction space prior to the FLUSH instruction is visible immediately after the completion of FLUSH. MEMBAR #StoreStore is required to ensure proper ordering in multi-processing system when the memory model is not TSO. When a MEMBAR #StoreStore, FLUSH sequence is performed, UltraSPARC guarantees that earlier code modifications will be visible across the whole system.
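The MEMBAR #StoreStore, FLUSH sequence described above can be sketched as follows, assuming GCC-style inline assembly and assuming that each FLUSH covers one aligned 8-byte doubleword, as in SPARC-V9. The function name `flush_code_range` and the preprocessor checks used to detect SPARC-V9 are assumptions; on other targets the sketch falls back to the compiler builtin `__builtin___clear_cache`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: FLUSH granularity of one 8-byte doubleword. */
#define FLUSH_GRANULE 8u

/*
 * Hypothetical helper (illustration only): synchronize code and data
 * spaces for [start, start + len). Returns the number of doublewords
 * that the range covers.
 */
static size_t flush_code_range(void *start, size_t len)
{
    assert(start != NULL);
    if (len == 0)
        return 0;

    uintptr_t first = (uintptr_t) start & ~(uintptr_t) (FLUSH_GRANULE - 1);
    uintptr_t end = (uintptr_t) start + len;
    size_t granules = 0;
    uintptr_t p;

    for (p = first; p < end; p += FLUSH_GRANULE)
        granules++;

#if defined(__sparc_v9__) || defined(__sparcv9)
    /* Order the earlier code-modifying stores before the flushes. */
    __asm__ volatile ("membar #StoreStore" : : : "memory");
    /* FLUSH each doubleword of the modified range. */
    for (p = first; p < end; p += FLUSH_GRANULE)
        __asm__ volatile ("flush %0" : : "r" (p) : "memory");
#else
    /* Other targets: let the compiler pick the sequence. */
    __builtin___clear_cache((char *) start, (char *) start + len);
#endif

    return granules;
}
```

According to the quoted text, the MEMBAR is only required when the memory model is not TSO; the sketch issues it unconditionally for simplicity.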