Cache performance measurement and metric

A CPU cache is a piece of hardware that reduces access time to data in memory by keeping some part of the frequently used data of the main memory in a 'cache' of smaller and faster memory.

The performance of a computer system depends on the performance of all individual units—which include execution units like integer, branch and floating point, I/O units, bus, caches and memory systems. The gap between processor speed and main memory speed has grown exponentially. Until 2001–05, CPU speed, as measured by clock frequency, grew annually by 55%, whereas memory speed only grew by 7%.^[1] This problem is known as the memory wall. The motivation for a cache and its hierarchy is to bridge this speed gap and overcome the memory wall.

The critical component in most high-performance computers is the cache. Since the cache exists to bridge the speed gap, its performance measurement and metrics are important in designing and choosing various parameters like cache size, associativity, replacement policy, etc. Cache performance depends on cache hits and cache misses, which are the factors that create constraints to system performance. Cache hits are the number of accesses to the cache that actually find that data in the cache, and cache misses are those accesses that don't find the block in the cache. These cache hits and misses contribute to the term average access time (AAT) also known as AMAT (average memory access time), which, as the name suggests, is the average time it takes to access the memory. This is one major metric for cache performance measurement, because this number becomes highly significant and critical as processor speed increases.

Another useful metric to test the performance is Power law of cache misses. It gives you the number of misses when you change the size of the cache, given that the number of misses for one of the cache sizes is known. Similarly, when you want to test the performance of the cache in terms of misses across different associativities, Stack distance profiling is used.

Introduction to types of cache misses

Processor performance increase due to cache hierarchy depends on the number of accesses to the cache that satisfy block requests from the cache (cache hits) versus those that do not. Unsuccessful attempts to read or write data from the cache (cache misses) result in lower level or main memory access, which increases latency. There are three basic types of cache misses known as the 3Cs ^[2] and some other less popular cache misses.

Compulsory misses

Each memory block when first referenced causes a compulsory miss. This implies that the number of compulsory misses is the number of distinct memory blocks ever referenced. They are sometimes called cold misses too. Cold misses cannot be avoided unless the block is prefetched.

It has been observed that an increase in block size to a certain extent to exploit spatial locality leads to a decrease in cold misses. Increasing block size leads to prefetching of nearby words in a block and preventing future cold misses. Increasing the block size too much can lead to prefetching of useless data, thus increasing the number of cold misses.

Conflict misses

Conflict misses occur when the data required was in the cache previously, but got evicted. These evictions occur because another request was mapped to the same cache line. Generally, conflict misses are measured by subtracting the number of misses in a cache with limited associativity by the number of misses of a fully associative cache of the same size and cache block size.

Since conflict misses can be attributed to the lack of sufficient associativity, increasing the associativity to a certain extent (8-way associativity almost as effective as fully-associative) decreases the amount of conflict misses, however, such an approach increases the cache access time and consumes a lot more power than a set associative cache.

Capacity misses

A capacity miss occurs due to the limited size of a cache and not the cache's mapping function. When the working set, i.e., the data that is currently important to the program, is bigger than the cache, capacity misses occur frequently. Out of the 3Cs capacity misses are the hardest to identify, and can be thought of as non-compulsory misses in a fully associative cache. In a single processor system, the misses that exist after subtracting the number of compulsory misses and conflict misses can be categorized as capacity misses.

Since capacity misses can be attributed to the limited size of a cache, a simple way to reduce the number of such misses is to increase the cache size. Although this method is very intuitive, it leads to a longer access time and an increase in cache area and its power consumption. Also, after a certain cache size, the number of misses saturate and do not decrease even on increasing the cache size.

Effect of manipulating basic cache parameters on cache misses.^[2]
Parameters	Compulsory misses	Conflict misses	Capacity misses
Larger cache size	No effect	No effect	Decrease
Larger block size	Decrease	Uncertain effect	Uncertain effect
Larger associativity	No effect	Decrease	No effect

The above three kinds of misses only address uni-processor misses.

Coherence misses

The 3Cs group of cache misses can be extended to 4Cs when a multi-processor system with cache is involved, the fourth C being coherence misses. The coherence miss count is the number of memory accesses that miss because a cache line that would otherwise be present in the thread's cache has been invalidated by a write from another thread.^[3] Coherence in a multi-processor system is maintained if only one copy of a memory block is present or all the copies have the same value. Even if all the copies of memory block do not have the same value, it doesn't necessarily lead to a coherence miss. A coherence miss occurs when threads execute loads such that they observe the different values of the memory block.^[4]

The coherence problem is complex and affects the scalability of parallel programs. A global order of all memory accesses to the same location must exist across the system to tackle this problem.

Coverage misses

The 4Cs group of cache misses can be further extended to 5Cs when the multi-processor system includes a coherence directory organized as a cache, i.e., that can replace entries. This fifth C stands for Coverage.^[5] The coverage miss count is the number of memory accesses that miss because a cache line that would otherwise be present in the processor's cache has been invalidated as a consequence of a directory eviction. If the directory is not able to track a cache line due to its limited capacity, the line must be invalidated from the processors' cache to maintain Coherence.

System-related misses

System activities such as interrupts, context switches and system calls lead to the process being suspended and its cache state being altered. When the process execution is resumed, it suffers cache misses to restore the cache state that was altered. These misses are called system-related misses.^[2]

Furthermore, cache misses due to context switching may be placed into two categories described below.

Replaced misses

When a context switch occurs, the cache state is modified and some of its blocks are replaced. The misses on access to these blocks are called replaced misses.

Reordered misses

Some blocks in the cache may not be replaced due to context switching but their recency is changed. Reordered misses are said to occur when misses occur due to change in recency and not due to the blocks being replaced. However, when the suspended process resumes execution, reordered blocks don't lead to context switch misses when no other misses cause these reordered blocks to be evicted.

System-related misses become significant when context switching occurs regularly. Increasing the cache size leads to a decrease in capacity and conflict misses but it has been observed that it leads to an increase in system-related misses if the cache is still smaller than the working set of the processes sharing the cache. Hence reducing the number of system-related misses presents a challenge.

Average memory access time

These cache misses directly correlate to the increase in cycles per instruction (CPI). However the amount of effect the cache misses have on the CPI also depends on how much of the cache miss can be overlapped with computations due to the ILP ( Instruction-level parallelism ) and how much of it can be overlapped with other cache misses due to Memory-level parallelism.^[2] If we ignore both these effects, then the average memory access time becomes an important metric. It provides a measure of the performance of the memory systems and hierarchies. It refers to the average time it takes to perform a memory access. It is the addition of the execution time for the memory instructions and the memory stall cycles. The execution time is the time for a cache access, and the memory stall cycles include the time to service a cache miss and access lower levels of memory. If the access latency, miss rate and miss penalty are known, the average memory access time can be calculated with:

$AMAT=T_{L1}+MR_{L1}\cdot MP_{L1}$

where $T_{L1}$ is the access latency of level one cache, $MR_{L1}$ is the miss rate of level one cache and $MP_{L1}$ is the additional cycles a miss at a higher level takes to be serviced compared to a hit at higher level, and is calculated with:

$MP_{L1}=T_{L2}+MR_{L2}\cdot MP_{L2}$

this formula can be expanded further and used recursively for all the further levels in the memory hierarchy to get the $AMAT$ . ^[6]

Power law of cache misses

The Power law of cache misses shows a trend in the capacity misses in a particular application of the program as affected by the cache size. This empirical observation led to the mathematical form of power law, which shows the relation between the miss rate and the cache size. It can be stated as

$M=M_{0}*C^{-\alpha }$

where M is the miss rate for a cache of size C and M₀ is the miss rate of a baseline cache. The exponent α is workload-specific and typically ranges from 0.3 to 0.7, with an average of 0.5. The power law was validated on quite a few of real-world benchmarks.^[7]

This relation shows that only a small fraction of cache misses can be eliminated for constant increase in cache size. This law holds true only for a certain finite range of cache sizes, up to which the miss rate doesn't flatten out. The miss rate eventually becomes stagnant at a certain, large enough cache size, and after that the relation does not give correct estimates.

Stack distance profile

The stack distance profile is a better representation of how the cache misses are affected by the cache size. The power law of cache misses just showed an rough approximation of the same. A stack distance profile captures the temporal reuse behavior of an application in a fully or set-associative cache.^[8]

Applications that exhibit more temporal reuse behavior generally access data that is more recently used. Let us assume the associativity of a cache to be $A$ . To collect the stack distance profile information of this cache, assuming it has LRU replacement policy, $A+1$ counters are used starting from $C_{0}$ to $C_{A}$ and one additional counter $C_{<}A$ , which keeps count of the misses. The counter $C_{i}$ increments when there is a hit in the $i^{th}$ way and the counter $C_{<}A$ is incremented on every miss. The stack distance profile shows the trend of hits, decreasing from the most recently used data to the least recently used. Using this stack distance profile information, the cache miss for a cache with associativity $A{'}$ and LRU replacement policy, where $A{'}<A$ can be computed as

$miss=C_{<}A+\sum _{i=A{'}+1}^{A}C_{i}$

This profiling information has a limitation that it can only capture the temporal reuse across different associativities. For other purposes, the temporal reuse must be studied in greater detail.

Notes

^ Hennessy, J. and Patterson, D. (2003). Computer Architecture : a Quantitative Approach,3rd edition. Morgan-Kaufmann Publishers, Inc. ISBN 9781558607248.{{cite book}}: CS1 maint: multiple names: authors list (link)
^ ^a ^b ^c ^d Solihin, Yan (2015-11-17). Fundamentals of Parallel Multicore Architecture, 2016 edition. Chapman & Hall. ISBN 978-1482211184.
^ "Modeling Cache Coherence Misses on Multicores" (PDF). {{cite journal}}: Cite journal requires |journal= (help)
^ Sweden, Michel Dubois, University of Southern California, USA, Murali Annavaram, University of Southern California, USA, Per Stenström, Chalmers University of Technology (2012). Parallel computer organization and design. Cambridge: Cambridge University Press. ISBN 9781139051224.{{cite book}}: CS1 maint: multiple names: authors list (link)
^ Ros, Alberto; Cuesta, Blas; Fernández-Pascual, Ricardo; Gómez, María E.; Acacio, Manuel E.; Robles, Antonio; García, José M.; Duato, José (2010). EMC2: Extending Magny-Cours Coherence for Large-Scale Servers. 17th International Conference on High Performance Computing (HiPC). pp. 1–10. doi:10.1109/HIPC.2010.5713176. ISBN 978-1-4244-8518-5.
^ Patterson, John L. Hennessy, David A. (2011). Computer architecture : a quantitative approach (5th ed.). San Francisco, Calif.: Morgan Kaufmann. ISBN 978-0-12-383872-8.{{cite book}}: CS1 maint: multiple names: authors list (link)
^ Hartstein, A.; Srinivasan, V.; Puzak, T. R.; Emma, P. G. (2006-01-01). "Cache miss behavior". Proceedings of the 3rd conference on Computing frontiers. CF '06. pp. 313–320. doi:10.1145/1128022.1128064. ISBN 978-1595933027. S2CID 17728397.
^ Mattson, R.L.; Gecsei, J.; Slutz, D. R.; Traiger, I (1970). "Evaluation Techniques for Storage Hierarchies". IBM Systems Journal. 9 (2): 78–117. doi:10.1147/sj.92.0078.

[1] Hennessy, J. and Patterson, D. (2003). Computer Architecture : a Quantitative Approach,3rd edition. Morgan-Kaufmann Publishers, Inc. ISBN 9781558607248.{{cite book}}: CS1 maint: multiple names: authors list (link)

[:0-2] Solihin, Yan (2015-11-17). Fundamentals of Parallel Multicore Architecture, 2016 edition. Chapman & Hall. ISBN 978-1482211184.

[3] "Modeling Cache Coherence Misses on Multicores" (PDF). {{cite journal}}: Cite journal requires |journal= (help)

[4] Sweden, Michel Dubois, University of Southern California, USA, Murali Annavaram, University of Southern California, USA, Per Stenström, Chalmers University of Technology (2012). Parallel computer organization and design. Cambridge: Cambridge University Press. ISBN 9781139051224.{{cite book}}: CS1 maint: multiple names: authors list (link)

[5] Ros, Alberto; Cuesta, Blas; Fernández-Pascual, Ricardo; Gómez, María E.; Acacio, Manuel E.; Robles, Antonio; García, José M.; Duato, José (2010). EMC2: Extending Magny-Cours Coherence for Large-Scale Servers. 17th International Conference on High Performance Computing (HiPC). pp. 1–10. doi:10.1109/HIPC.2010.5713176. ISBN 978-1-4244-8518-5.

[6] Patterson, John L. Hennessy, David A. (2011). Computer architecture : a quantitative approach (5th ed.). San Francisco, Calif.: Morgan Kaufmann. ISBN 978-0-12-383872-8.{{cite book}}: CS1 maint: multiple names: authors list (link)

[:2-7] Hartstein, A.; Srinivasan, V.; Puzak, T. R.; Emma, P. G. (2006-01-01). "Cache miss behavior". Proceedings of the 3rd conference on Computing frontiers. CF '06. pp. 313–320. doi:10.1145/1128022.1128064. ISBN 978-1595933027. S2CID 17728397.

[:1-8] Mattson, R.L.; Gecsei, J.; Slutz, D. R.; Traiger, I (1970). "Evaluation Techniques for Storage Hierarchies". IBM Systems Journal. 9 (2): 78–117. doi:10.1147/sj.92.0078.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]