Cache Coherence

Coherence

Accesses to a single memory location in different caches should be the same

Consistency

Apparent ordering between multiple locations

Multicore Caches

3-State Coherence Protocol

Modified

One cache has a valid copy
- Dirty, needs write back
Out of date copies in other caches are stale
Invalidate all other copies before entering this state

Shared

One or more cache have valid copy

Invalid

Doesn’t contain up to date data

Transitions take 100 cc on smaller machines and up to 2000 cc on larger ones

Core and Bus Actions

Each core has three possible actions that affect the cache

Read (load)

No intent to modify
Enter shared state

Write (store)

Intent to modify
- Invalidate all other cache copies
Modified state

Evict

Writeback to memory if modified
- Only if in modified state

Performance problems

Every transition requires bus communications
Avoid state transitions where possible

Implications for Multithreaded Design

Avoid false sharing
- Avoid placing data used by different threads in the same cache line
Align structures to cache lines
- Place related data you need to access together
Pad data structures
- Add unused fields to ensure alignment
Avoid contending on cache lines
- Reduce costly cache coherence traffic

Real World Coherence Costs

Example

Assume access time for an Intel Xeon processor caches and RAM:

3 cycles for L1 caches
11 cycles for L2
44 cycles for last level caches (LLC)
355 cycles for RAM

For different cores on the same processor

Load: 109 CC
Store: 115 CC
Atomic compare-and-swap: 120 CC

Non-uniform memory access (NUMA)

Multiple processors, each processor has local memory

Able to access local memory fast
For cores accessing data on different processors:
NUMA load: 289 CC
NUMA store: 320 CC
NUMA atomic compare-and-swap: 324 CC