Cache Coherence

Coherence

Consistency

Multicore Caches

3-State Coherence Protocol

Modified

Shared

Invalid

Transitions take about 100 clock cycles (cc) on smaller machines and up to 2000 cc on larger ones

Core and Bus Actions

Each core has three possible actions that affect the cache

Read (load)

Write (store)

Evict

Performance problems

  • Every transition requires bus communication
  • Avoid state transitions where possible
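
The states and core actions above can be sketched as a small state machine. This is a minimal sketch of a textbook MSI protocol, not any particular processor's implementation. The type and function names, and the two bus actions (BUS_READ, BUS_WRITE), are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* The three states a cache line can be in. */
typedef enum { INVALID, SHARED, MODIFIED } line_state;

/* Actions issued by the core that owns this cache. */
typedef enum { CORE_READ, CORE_WRITE, CORE_EVICT } core_action;

/* Actions seen on the bus, caused by another core (assumed names). */
typedef enum { BUS_READ, BUS_WRITE } bus_action;

/* Next state of a line after an action by the local core. */
line_state on_core_action(line_state s, core_action a)
{
    switch (a) {
    case CORE_READ:  return s == INVALID ? SHARED : s; /* a miss fetches a shared copy */
    case CORE_WRITE: return MODIFIED;                  /* a writer needs sole ownership */
    case CORE_EVICT: return INVALID;                   /* the line leaves the cache */
    }
    return s;
}

/* Next state of a line after observing another core's action on the bus. */
line_state on_bus_action(line_state s, bus_action a)
{
    switch (a) {
    case BUS_READ:  return s == MODIFIED ? SHARED : s; /* give up exclusivity, keep a copy */
    case BUS_WRITE: return INVALID;                    /* another core takes ownership */
    }
    return s;
}

/* True if the local action changes the line's state. Under the simplification
   used in these notes, every such transition costs bus communication. */
bool causes_transition(line_state s, core_action a)
{
    return on_core_action(s, a) != s;
}
```

Repeated reads of a Shared line and repeated writes to a Modified line cause no transition, which is why keeping a line in one state is cheap and bouncing it between cores is not.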

Implications for Multithreaded Design

  1. Avoid false sharing
    • Avoid placing data used by different threads in the same cache line
  2. Align structures to cache lines
    • Place related data you need to access together
  3. Pad data structures
    • Add unused fields to ensure alignment
  4. Avoid contending on cache lines
    • Reduce costly cache coherence traffic
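
The alignment and padding guidelines above can be illustrated with C11 struct layouts. This is a minimal sketch that assumes a 64-byte cache line (CACHE_LINE is an assumed constant, so check the target CPU). The struct names are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumed line size in bytes */

/* Unpadded: a and b are adjacent, so they will usually share one cache
   line. Two threads writing them separately suffer false sharing. */
struct counters_unpadded {
    long a; /* written by thread A */
    long b; /* written by thread B */
};

/* Aligned: each counter starts on its own cache line. */
struct counters_aligned {
    alignas(CACHE_LINE) long a;
    alignas(CACHE_LINE) long b;
};

/* Explicitly padded: unused bytes round the struct up to one full line.
   In an array of these, each element fills exactly one line, provided
   the array itself starts on a line boundary. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)]; /* unused, only occupies space */
};

/* Compile-time checks of the layouts described above. */
static_assert(offsetof(struct counters_aligned, b) == CACHE_LINE,
              "b must start on the second cache line");
static_assert(sizeof(struct counters_aligned) == 2 * CACHE_LINE,
              "aligned struct spans exactly two lines");
static_assert(sizeof(struct padded_counter) == CACHE_LINE,
              "padded counter fills exactly one line");
```

The trade-off is memory: each counter now occupies a full line instead of a few bytes, which is usually acceptable for a small number of heavily contended fields.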

Real World Coherence Costs

Example

Assume the following access times for the caches and RAM of an Intel Xeon processor:

  • 3 cycles for L1 caches
  • 11 cycles for L2
  • 44 cycles for last level caches (LLC)
  • 355 cycles for RAM

For an access to a cache line held by a different core on the same processor:

  • Load: 109 cycles
  • Store: 115 cycles
  • Atomic compare-and-swap: 120 cycles
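
To get a feel for these numbers, the sketch below estimates the cost of a run of stores when some fraction of them hit a line owned by another core. It is a deliberately simple model that assumes every other store costs the L1 access time and ignores store buffers and overlap. The function store_cost and the constant names are hypothetical.

```c
#include <assert.h>

/* Cycle counts taken from the example figures above. */
enum {
    L1_CYCLES = 3,
    REMOTE_STORE_CYCLES = 115
};

/* Estimated cycles for n_stores stores, of which remote_pct percent
   touch a cache line currently held by a different core. */
long store_cost(long n_stores, int remote_pct)
{
    long remote = n_stores * remote_pct / 100; /* stores that need a coherence transfer */
    long local = n_stores - remote;            /* stores assumed to hit in L1 */
    return remote * REMOTE_STORE_CYCLES + local * L1_CYCLES;
}
```

Under this model, 1000 stores cost 3000 cycles with no sharing and 14200 cycles when only 10% of them touch a remotely held line, so even occasional contention dominates the total.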

Non-uniform memory access (NUMA)

Multiple processors, each with its own local memory; accessing another processor's memory is slower than accessing local memory