The hierarchy of memory & caches

by Carl Burch, Hendrix College, October 2011

Creative Commons License
The hierarchy of memory & caches by Carl Burch is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.
Based on a work at www.cburch.com/books/cache/.

Contents

1. Memory hierarchy
2. Direct-mapped cache
2.1. Structure
2.2. Example
2.3. Analysis
3. Other cache types
3.1. Fully associative
3.2. Set associative
4. Real-world caching
4.1. Arrandale cache structure
4.2. Issues with writing
5. Performance issues
5.1. The ijk implementation
5.2. The kij implementation
5.3. Tests

Computing systems use memory very heavily, and so performance depends in part on efficiently accessing memory. In this document, we investigate how modern systems reduce memory access time using caches.

1. Memory hierarchy

Today's computing systems include four important technologies for remembering information.

The table below summarizes each technology's access time and cost per gigabyte as of 2011.

            access time        cost per GB
SRAM        1 nanosecond       $500.00
DRAM        10 nanoseconds     $6.00
flash       50 microseconds    $1.50
disk        4 milliseconds     $0.10

To make sense of this table and our discussion, you need to know your scientific units! Below is a table of the most common units appearing in computing; it's worth making sure that you're familiar with each of them.

milli- (m)  10^-3: one thousandth     kilo- (K)  10^3:  one thousand
micro- (μ)  10^-6: one millionth      mega- (M)  10^6:  one million
nano-  (n)  10^-9: one billionth      giga- (G)  10^9:  one billion
                                      tera- (T)  10^12: one trillion

A CPU will try to complete an instruction every clock cycle. A typical clock rate is 2 GHz, which means 2 billion clock cycles each second; thus, each clock cycle takes half of a billionth of a second, or 0.5 ns (nanoseconds). At that rate, SRAM can nearly keep up, DRAM is significantly slower, and the disk is slothlike.

While we would ideally use SRAM for all our memory, its price is so high that we cannot afford much of it. DRAM is the only option that is both affordable and reasonably fast in the quantity we require. Computing systems nonetheless take advantage of SRAM's speed by building a cache out of SRAM to hold data that a program appears to be using heavily.

A cache holds a subset of the memory. It is built transparently into the system: The programmer manipulates what is in memory, with usually no way to manipulate the contents of the cache. Instead, the processor itself decides what to place into each cache, and this process is usually invisible to the programmer — though there are exceptions, as we'll see in Section 5.

A cache depends on the fact that programs display locality, which means that whenever a program accesses a particular memory location, the program is likely to access a nearby memory location in the near future. This has been tested numerous times, and invariably programs display this property in abundance.

2. Direct-mapped cache

There are three basic categories of caches: direct mapped, set associative, and fully associative. In this section, we'll examine direct-mapped caches.

2.1. Structure

In a direct-mapped cache, each memory address has only one possible location in the cache where its data might be found; that is, each memory address maps directly to a cache location.

A direct-mapped cache is divided into several lines, each holding several adjacent bytes of cached data. Each line is the same size, and the processor designer decides how long each line should be; a typical choice is 32 or 64 bytes. The lines are numbered starting at 0. Of course, the information stored in each line includes the actual memory data being cached; but each line also includes a valid bit representing whether the line holds any useful data and a tag identifying the range of memory addresses that the line holds.

A realistic direct-mapped cache might have 512 lines, each holding 64 bytes, for a total of 32 KB.

line    valid   tag   data
0       ?       ?     ??????
1       ?       ?     ??????
2       ?       ?     ??????
:       :       :     :
510     ?       ?     ??????
511     ?       ?     ??????

To see where a memory location is to be found in the cache, we take the memory address and divide it into three pieces: the tag, the line number, and the offset. For our example cache with 512 lines of 64 bytes each, a 32-bit memory address would include 6 bits for the offset, since 2^6 is 64, the number of bytes per line; it would include 9 bits for the line number, since 2^9 is 512, the number of lines; and it would include 17 bits for the tag, since this is the number of remaining bits (32 − 9 − 6). Thus, our example cache would regard each memory address as having three pieces, from most significant to least significant bits:

[ tag: 17 bits | line: 9 bits | offset: 6 bits ]

To find a memory item in the cache, the direct-mapped cache breaks the desired memory address into these three pieces, tag, line, and offset. If the memory item is in the cache, it must be offset bytes into the line numbered line. To determine whether the data is present, the cache will verify that this line's valid bit is 1 and that the tag in the requested address matches the tag stored in the line. If this test reveals that the data is in the cache, the cache can respond immediately. If not, the cache loads from memory the sequence of adjacent bytes including the requested address to fill a cache line, sets the line's valid bit and tag, and then returns the requested data.
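
As a concrete illustration of this address decomposition, here is a small C sketch of my own (not from the original text); the field widths follow from the example geometry of 512 lines of 64 bytes each, and the sample address is arbitrary.

#include <stdint.h>
#include <stdio.h>

/* Example geometry: a 32 KB direct-mapped cache with 512 lines of 64 bytes each. */
#define OFFSET_BITS 6                  /* 2^6 = 64 bytes per line */
#define LINE_BITS   9                  /* 2^9 = 512 lines         */

int main(void) {
    uint32_t addr = 0x12345678;        /* an arbitrary 32-bit address */

    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t line   = (addr >> OFFSET_BITS) & ((1u << LINE_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + LINE_BITS);

    /* The cache would compare tag against the tag stored in line number
       'line' and, on a hit, return the byte at position 'offset' there. */
    printf("tag=0x%x  line=%u  offset=%u\n",
           (unsigned)tag, (unsigned)line, (unsigned)offset);
    return 0;
}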

2.2. Example

To get a better handle on how this works, let's look at an example. Suppose we're working in a computer using 6-bit memory addresses, and our computer has a cache with two lines, each holding four bytes. Thus, of each 6-bit memory address, one bit will specify the line number and two will specify the offset within the line; that leaves 6 − 1 − 2 = 3 bits for the tag.

The cache looks like the following at start-up time.

line    valid   tag   data
0       0       ???   (none)
1       0       ???   (none)

Let's see what happens if a program accesses the following sequence of bytes from memory: M[19], M[20], M[16], M[34], M[18].

  1. Read M[19]. We break 010011 into its component pieces, getting 010 for the tag, 0 for the line number, and 11 for the offset. We look at line 0 and find that the valid bit is 0; we conclude the information isn't yet in the cache, calling it a cache miss. On a cache miss, the cache loads all bytes with the matching tag and line number into the cache. Thus, we'll load the memory at all addresses of the form 0100xx into line 0; this includes memory addresses 16 through 19.

    line    valid   tag   data
    0       1       010   M[16–19]
    1       0       ???   (none)

    (Despite our diagram, the memory addresses 16, 17, 18, and 19 don't actually appear in the cache. The data column actually contains the four bytes found in memory at these addresses.)

  2. Read M[20]. We break 010100 into 010 for the tag, 1 for the line number, and 00 for the offset. We look at line 1, find that the valid bit is 0, and conclude that this reference is also a cache miss. We load all memory addresses of the form 0101xx (addresses 20 through 23) into line 1.

    line    valid   tag   data
    0       1       010   M[16–19]
    1       1       010   M[20–23]
  3. Read M[16]. We break 010000 into 010 for the tag, 0 for the line number, and 00 for the offset. We look at line 0, and we see that the valid bit is 1 and the line's tag matches our tag. Thus we have the desired information already in cache; this is called a cache hit. We serve the request using the information starting at offset 00 (binary), that is, offset 0, in the cache line.

  4. Read M[34]. We break 100010 into 100 for the tag, 0 for the line number, and 10 for the offset. We look at line 0, and we see that the valid bit is 1 but the line's tag 010 doesn't match our tag 100. We discard the information currently in line 0, replacing it with memory at addresses 32–35.

    line    valid   tag   data
    0       1       100   M[32–35]
    1       1       010   M[20–23]
  5. Read M[18]. We break 010010 into 010 for the tag, 0 for the line number, and 10 for the offset. We look at line 0, and we see that the valid bit is 1 but the line's tag 100 doesn't match the address's tag 010. We discard the information currently in line 0, replacing it with memory addresses 16–19.

    line    valid   tag   data
    0       1       010   M[16–19]
    1       1       010   M[20–23]

Thus, in serving this sequence of five memory accesses, we found one cache hit and four cache misses. In real-world sequences, though, there are many more hits than misses: typically, more than 99% of memory accesses are hits. This proportion of cache hits to total memory references is called the cache hit ratio.
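
As a sanity check on this trace, the following C sketch (my own, not part of the original example) simulates the two-line, four-byte direct-mapped cache and replays the same five accesses; running it reports the same one hit and four misses.

#include <stdio.h>

/* A two-line direct-mapped cache with four bytes per line; addresses are 6 bits. */
struct line { int valid; int tag; };

int main(void) {
    struct line cache[2] = { {0, 0}, {0, 0} };
    int trace[] = { 19, 20, 16, 34, 18 };
    int hits = 0, misses = 0;

    for (int i = 0; i < 5; i++) {
        int addr = trace[i];
        int line_no = (addr >> 2) & 1;   /* the 1 line-number bit above the 2 offset bits */
        int tag     = addr >> 3;         /* the remaining 3 bits */

        if (cache[line_no].valid && cache[line_no].tag == tag) {
            hits++;
            printf("M[%d]: hit\n", addr);
        } else {
            misses++;
            printf("M[%d]: miss, loading M[%d]-M[%d]\n", addr, addr & ~3, (addr & ~3) + 3);
            cache[line_no].valid = 1;
            cache[line_no].tag = tag;
        }
    }
    printf("%d hit(s), %d miss(es)\n", hits, misses);
    return 0;
}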

2.3. Analysis

The primary thing to consider about a cache is how well it takes advantage of locality. We'll concentrate on two important cases.

How does it handle subsequent accesses to two locations that are somewhat close?

Usually this goes quite well. Our example trace illustrates this: When we load M[19], we also load M[16], because it is in the same line. Thus, when we access M[16] later, it is already in the cache.

One common situation where this scenario is relevant is in accessing an array in order. Such a situation is illustrated by the following code to add the elements of an array.

sum = 0;
for (i = 0; i < n; i++) {
    sum += a[i];
}

If our cache holds 64 bytes for each line, and if each array element is 8 bytes long, then each cache line holds eight array elements. Whenever the above code accesses an array element and misses the cache, the cache will load not only the referenced element but also the seven succeeding values, and so the miss will be followed by seven hits. Thus, this example would miss the cache only one eighth of the time, giving an 87.5% cache hit ratio.

How does it handle two accesses to the same location that are somewhat close together in time?

A direct-mapped cache usually does this well, since it will hold a line of data indefinitely, until it is ejected due to a reference to a distant memory location that happens to map to the same line.

The direct-mapped cache does, however, have a very bad case. For our above example, suppose that we have a program that happens to alternate between accessing memory addresses 16 and 32. Both memory addresses map to the same line (line 0), so each memory access would miss the cache. This is unfortunate, because frequent accesses to the same memory locations should be an easy case for the cache. This is a shortcoming of direct-mapped caches, and indeed it is the primary factor leading us to other cache designs.

For a larger cache, this outcome would be less likely. In a 16 KB direct-mapped cache, for example, two addresses conflict only when they are a multiple of 16 KB apart (or nearly so), and a program would have to be heavily interested in two such addresses. In practice, this doesn't happen all that often. But it does happen enough to inspire the other cache types we will study.
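
To see the arithmetic behind this conflict, the following small C sketch (mine; the 16 KB size and 64-byte lines are assumptions) computes the line index for two addresses exactly 16 KB apart and shows that they map to the same line.

#include <stdio.h>
#include <stdint.h>

#define CACHE_SIZE (16 * 1024)     /* 16 KB direct-mapped cache (assumed) */
#define LINE_SIZE  64              /* 64-byte lines (assumed)             */
#define NUM_LINES  (CACHE_SIZE / LINE_SIZE)

int main(void) {
    uint32_t a = 0x10040;                  /* two arbitrary addresses ...  */
    uint32_t b = a + CACHE_SIZE;           /* ... exactly 16 KB apart      */

    /* Line index = (address / line size) mod (number of lines). */
    uint32_t line_a = (a / LINE_SIZE) % NUM_LINES;
    uint32_t line_b = (b / LINE_SIZE) % NUM_LINES;

    printf("address 0x%x -> line %u\n", a, line_a);
    printf("address 0x%x -> line %u\n", b, line_b);   /* same line: they conflict */
    return 0;
}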

3. Other cache types

Caches come in three basic designs, and the direct-mapped design is only one of them. The others are set associative and fully associative. Once you understand direct-mapped caches, the other two are relatively easy to grasp.

3.1. Fully associative

In a direct-mapped cache, each memory address maps to only one possible location in the cache where that address's data might appear. As we've seen, this can lead to problems where a program frequently accesses two distant memory locations that happen to map to the same line of the cache.

In a fully associative cache, an address can be cached in any line of the cache. We no longer have reason to number the cache lines: The cache simply consists of several lines, all of which are available for caching data from any address.

Given a memory address, then, a fully associative cache will divide the address into two pieces: the tag and the line offset. The cache will search through all of the cache's lines to see if the tag for any of the lines matches the requested tag. If so, then it simply needs to return the data found in that line at the offset given in the requested address. If the tag is not found, then the cache needs to make room for the requested line. It will use an algorithm such as LRU or Clock to decide which line to remove; these algorithms are beyond the scope of this document.

You might object: Doesn't the process of searching through all lines for the requested tag take a lot of time? After all, an implementation of such a cache in C would involve a loop that iterates through one line after another until it finds the desired tag. However, we need to remember that the cache is a circuit, not a program. Each line of the cache's circuit can include its own collection of logic gates for comparing the line's tag to the requested tag, so that all of the lines can be compared against the requested tag at the same time.
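
To make the lookup logic concrete, here is a C sketch of my own (not from the original text) of a fully associative lookup; the loop is only a software stand-in for the hardware's parallel comparators, and the geometry of 512 lines of 64 bytes is an assumption.

#include <stddef.h>
#include <stdint.h>

#define NUM_LINES   512          /* assumed: 512 lines ... */
#define OFFSET_BITS 6            /* ... of 64 bytes each   */

struct line {
    int      valid;
    uint32_t tag;
    uint8_t  data[1 << OFFSET_BITS];
};

static struct line cache[NUM_LINES];

/* Returns a pointer to the cached byte, or NULL on a miss.  A real fully
   associative cache performs all NUM_LINES tag comparisons simultaneously
   in dedicated circuitry rather than one at a time. */
uint8_t *lookup(uint32_t addr) {
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t tag    = addr >> OFFSET_BITS;     /* no line-number field at all */

    for (int i = 0; i < NUM_LINES; i++) {
        if (cache[i].valid && cache[i].tag == tag)
            return &cache[i].data[offset];
    }
    return NULL;   /* miss: a victim line would now be chosen (e.g., by LRU) */
}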

If our goal were purely to avoid cache misses, then we'd certainly choose a fully associative cache: Nothing is more effective. Nonetheless, CPU designers rarely use them because the added effectiveness is rarely worth the added circuitry. Designers work with a fixed budget of transistors, and a given number of transistors supports a larger direct-mapped cache than a fully associative cache. The designer might judge that the bad behavior of direct-mapped caches is rare enough that spending gates on this possibility is a poor use of the limited budget of transistors that can be built into a CPU.

In practice, then, modern computers rarely use fully associative caches for caching a subset of memory. However, there are some other applications where fully associative caches are common. One prominent example found in most common processors is the translation lookaside buffer (TLB) used for supporting virtual memory — though virtual memory is beyond the scope of this document.

3.2. Set associative

The set-associative cache is a compromise between a direct-mapped cache and a fully associative cache. Here, we group the lines into sets. Typically, a set holds four to sixteen lines; a cache with eight lines per set is called 8-way set-associative. The cache maps each requested address directly to a set (akin to how a direct-mapped cache maps an address to a line), and then it treats the set as a fully associative cache.

As an example, suppose we have a 2-way set-associative cache that holds 32 bytes with 8 bytes per line. Because the cache's total of 32 bytes is divided into 8 bytes per line, our cache has 32 ÷ 8 = 4 lines. Because this is a 2-way set-associative cache, these 4 lines are divided into sets of 2, so there are 4 ÷ 2 = 2 sets. Thus, given a memory request, we'll break an 8-bit address into three portions: The bottom 3 bits provide the offset (since the line size is 2^3 = 8); the next 1 bit identifies the set into which the request maps (since the number of sets is 2^1 = 2); and the remaining 4 bits provide the tag.

Now let us suppose that we have the following sequence of five memory references: M[18], M[25], M[51], M[16], M[37].
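
Before walking through these references by hand, here is a small C sketch (my own, under this example's assumptions of 8-bit addresses, 8-byte lines, and 2 sets) that computes the tag, set, and offset for each of the five addresses.

#include <stdio.h>

/* Example geometry: a 32-byte, 2-way set-associative cache with
   8-byte lines, giving 4 lines grouped into 2 sets. */
#define OFFSET_BITS 3   /* 2^3 = 8 bytes per line */
#define SET_BITS    1   /* 2^1 = 2 sets           */

int main(void) {
    int refs[] = { 18, 25, 51, 16, 37 };
    for (int i = 0; i < 5; i++) {
        int addr   = refs[i];
        int offset = addr & ((1 << OFFSET_BITS) - 1);
        int set    = (addr >> OFFSET_BITS) & ((1 << SET_BITS) - 1);
        int tag    = addr >> (OFFSET_BITS + SET_BITS);
        /* Tags print in decimal: 1 is 0001, 2 is 0010, 3 is 0011 in binary. */
        printf("M[%d]: tag %d, set %d, offset %d\n", addr, tag, set, offset);
    }
    return 0;
}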

M[18]

The memory address 00010010 maps to tag 0001, set 0, offset 010. Looking at set 0, we see that neither line yet has its valid bit set, so the set holds no useful information, and thus this is a cache miss. We load memory addresses of the form 00010xxx into a line of set 0.

        valid   tag    data
set 0   1       0001   M[16–23]
        0       ?      ?
set 1   0       ?      ?
        0       ?      ?
M[25]

The memory address 00011001 maps to tag 0001, set 1, offset 001. Looking at set 1, we see that neither line is valid, so this is a cache miss. We load memory addresses of the form 00011xxx into a line of set 1.

M[51]

The memory address 00110011 maps to tag 0011, set 0, offset 011. Looking at set 0, we see that the first line's tag of 0001 doesn't match the requested 0011 tag, and the second line is invalid; we thus have a cache miss. We load memory addresses of the form 00110xxx into a line of set 0; the best candidate would be the invalid line.

        valid   tag    data
set 0   1       0001   M[16–23]
        1       0011   M[48–55]
set 1   1       0001   M[24–31]
        0       ?      ?
M[16]

The memory address 00010000 maps to tag 0001, set 0, offset 000. Looking through set 0, we see that one of the lines already has this tag (0001), so this is a cache hit, and we can find the requested data at offset 000 within this line.

Note that a direct-mapped cache of the same size would have missed this memory reference: The previous reference to M[51] would have mapped to the same line of the cache, so M[16] would have been discarded.

M[37]

The memory address 00100101 maps to tag 0010, set 0, offset 101. Looking through set 0, we see that neither line's tag matches, so this is a cache miss. We must load a new line into set 0, but the decision of which line to replace is not straightforward: The M[16–23] line is older, but it was also more recently accessed. Let's suppose we discard the line that was least recently accessed; this is called the LRU policy (for Least Recently Used). Under LRU we replace the M[48–55] line, loading M[32–39] in its place.

        valid   tag    data
set 0   1       0001   M[16–23]
        1       0010   M[32–39]
set 1   1       0001   M[24–31]
        0       ?      ?

As you would expect, a set-associative cache requires more logic gates per line than a direct-mapped cache but fewer than a fully associative cache. On the other hand, it avoids the bad behavior of a direct-mapped cache when a program repeatedly accesses two memory locations that are far apart but happen to map to the same line. It still has a problem when a program accesses many distant memory locations mapping to the same set: For example, with a four-way set-associative cache, cycling through five different memory addresses that map into the same set would cause problems. But such a case is relatively rare.

4. Real-world caching

The distinction between different types of caches is important in theory, but it is also helpful in understanding their real-life deployment.

4.1. Arrandale cache structure

To see an example of caches in real life, let's look at the Arrandale architecture deployed by Intel in 2010. This architecture appears in several processor lines, with different numbers of processing units and slightly different cache designs: We'll look in particular at how Arrandale's architecture appeared in the Intel Core i5 line.

In the Intel Core i5 line, the Arrandale architecture supports two identical processing units called cores. Each core can independently execute a different program at the same time. Across these two cores, the chip incorporates seven different caches in total; all of them use cache lines of 64 bytes.

4.2. Issues with writing

So far, our work has been with programs whose only interaction with memory is to read its values. But programs also write to memory, which brings up two additional issues for caches.

Should the cache defer writing stores into memory?

If we answer no, then we have what is called a write-through cache: Each request to store is written into the cache and also forwarded immediately to memory. If we answer yes, then we have a write-back cache: Each line has a dirty bit, which is set when any data in the line is changed; before replacing the data in a line, the cache must check whether its dirty bit is set, and if so save its modified data back into memory.
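
As a rough sketch of the mechanics (my own illustration, not drawn from the Arrandale design or the original text), the following C fragment shows how a write-back cache's store path might manage the dirty bit; the toy memory array and the structure names are hypothetical.

#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64
#define MEM_SIZE  4096

static uint8_t memory[MEM_SIZE];        /* a toy stand-in for main memory */

struct line {
    int      valid;
    int      dirty;                     /* set when the cached copy differs from memory */
    uint32_t base;                      /* address of the first byte this line caches   */
    uint8_t  data[LINE_SIZE];
};

/* Write-back store: the write stays in the cache; memory is updated only
   when the line is later evicted with its dirty bit set. */
void store_byte(struct line *ln, uint32_t offset, uint8_t value) {
    ln->data[offset] = value;
    ln->dirty = 1;                      /* a write-through cache would instead
                                           copy the byte to memory right here */
}

/* Before reusing a line for a different address, flush it if it is dirty. */
void evict(struct line *ln) {
    if (ln->valid && ln->dirty)
        memcpy(&memory[ln->base], ln->data, LINE_SIZE);
    ln->valid = 0;
    ln->dirty = 0;
}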

Of course, write-back caches communicate less with memory, but they require more circuitry. CPU designers have to decide whether they feel that the benefit is worth spending circuitry on it. Basically, it boils down to whether a single write is a strong indication that the program will write to the same line in the near future. Whether this is true depends on what sorts of programs run on the computer.

In general, with the increasing amount of logic that can be placed on the chip, additional logic becomes less expensive. Thus, designers have an increasing tendency to choose write-back caches for their designs.

If a program writes to an address not yet represented in the cache, should we reserve space in cache for that data?

If we answer yes, then we have a write-allocate cache; otherwise, we have a no-write-allocate cache.

Deciding between these choices is more subtle than choosing between write-back and write-through caches, because it depends heavily on the access patterns of typical programs. Generally, write-allocate makes more sense in conjunction with a write-back cache: a write to a location suggests that more writes to nearby locations are likely in the near future, and a write-back cache can absorb those later writes. With a write-through cache, every write goes to memory anyway, so allocating a line on a write miss brings less benefit.

In our real-life example of the Arrandale architecture, the designers chose a write-back, write-allocate policy for the L1 caches.

5. Performance issues

As programmers, we don't need to consider the cache often: The processor handles the cache transparently for us, and modern caches work well enough that we don't need to worry much about it. But the cache's behavior does occasionally impact how programs behave. To see an example, let's look at matrix multiplication.

We need to begin by reminding ourselves of how matrix multiplication works. In the below example, we multiply two 3 × 3 matrices to arrive at a third 3 × 3 matrix.

[ 3 0 1 ]   [ 0 2 0 ]   [ 0 7 1 ]
[ 0 0 2 ] × [ 1 0 0 ] = [ 0 2 2 ]
[ 1 1 0 ]   [ 0 1 1 ]   [ 1 2 0 ]

In this example, the entry of 7 (in the second column and first row of the product matrix) comes from adding together corresponding products of the first row of the first matrix <3, 0, 1> and the second column of the second matrix <2, 0, 1>: 3 ⋅ 2 + 0 ⋅ 0 + 1 ⋅ 1 = 7.

Algebraically, we can express this as follows: If A and B are the two matrices to multiply, and we want to determine entry (i, j) of the product matrix C, we should sum the products of the corresponding numbers of the ith row of A and the jth column of B.

C_ij = ∑_{k=1}^{n} A_ik ⋅ B_kj

5.1. The ijk implementation

The following C code is a straightforward translation of the algebraic expression.

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

Our question is: How does a cache behave as a computer executes this code? We'll answer this question by concentrating on the innermost for loop. For each memory access in this innermost loop, we'll ask: How often does this memory access miss the cache on average? We'll assume that we're dealing with very large matrices, so n is very large.

Let's look at the access to C[i][j] first. As we iterate through the k's, this address doesn't change. Thus, though C[i][j] may not be in the cache for the first iteration of the innermost for loop, it will be in the cache thereafter. Thus, the average number of cache misses per iteration is 1/n, which is basically 0, since we're assuming n is large.

The accesses to A[i][k] go through a row of the array. Adjacent elements of the row are stored in adjacent elements of memory. Thus, when A[i][k] is loaded into the cache, the elements that fall on the same cache line will be in the same row. If we are dealing with a matrix of double values, then each array element uses the 64-bit IEEE format, so there are 8 bytes per array element. If our cache has lines of 64 bytes, then there are eight array elements per line. Whenever we load an element for A into a cache line, we also load seven other elements from the same row of A into the same line. As we go through the A[i][k], then, we'll miss the cache once every eight times through the loop, and so the average number of cache misses per iteration is 0.125.

Finally, let's look at the accesses to B[k][j]. Here, we're going through a column, and each element is 8 ⋅ n bytes beyond the previous element in the column. We'll never find B[k][j] in the cache unless it was loaded during the iteration through the previous row. But, if n is large enough, the bottom of the previous column will flush from the cache anything cached during the top of the previous column. And by the time we get to the bottom of the current column, the top of that column would flush from the cache anything that might have been loaded in the previous column. Thus, we'll miss the cache on each access to B[k][j].

Let's tabulate these results.

Average misses per iteration of innermost loop
A[i][k] misses:  0.125
B[k][j] misses:  1.000
C[i][j] misses:  0.000
total misses:    1.125

On average, we'll miss the cache 1.125 times every time through the loop.

5.2. The kij implementation

Now let's consider a slightly different implementation. The only thing that has changed here is that we have placed the k loop first, so that the innermost loop iterates through values of j instead. The body of the innermost loops is unchanged. We call this the kij implementation after the order in which the loop variables appear.

for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

If we think about how many misses we have here, we see that accesses to C[i][j] miss the cache 0.125 times per iteration of the inner loop on average (since the inner loop now walks through a single row of C), the access to A[i][k] does not miss the cache at all (it doesn't change within the inner loop), and the access to B[k][j] misses 0.125 times per iteration on average (since we walk through a single row of B).

Average misses per iteration of innermost loop
A[i][k] misses:  0.000
B[k][j] misses:  0.125
C[i][j] misses:  0.125
total misses:    0.250

If cache misses were all that mattered, we would expect this alternative implementation to take as little as 0.25 / 1.125 = 22% as much time as our first ijk implementation.

This may not be the entire story, since the second implementation both reads and writes to C[i][j] each time through the loop, while the accesses to C[i][j] in the first implementation could be factored out entirely. Thus, it wouldn't be entirely surprising if the second implementation were actually slower. But at the very least, we can conclude that caching will lead these two very similar approaches to perform very differently.

5.3. Tests

We can perform some tests to see what happens. The following table reports the results of some tests performed on a Core i5 750 chip. The numbers represent the number of nanoseconds per iteration of the inner loop; thus, the given measurement comes from dividing the total amount of time spent on matrix multiplication by n3.

n ijk kij
32 116 116
64 117 120
128 175 149
256 270 148
512 335 147
1024 858 158
2048 929 171

For small n, we don't expect to see a huge difference, and that's reflected in our numbers: If the matrix is small, then a full column will fit into the cache. As a result, when we go through the inner loop for one value of j in our ijk implementation, the cache loads the entire jth column of B (along with the neighboring columns that share its cache lines), so for the next value of j the cache already holds every entry of B that it needs.

But for larger n, an entire column will not fit into the cache. And indeed, ijk becomes much slower as n reaches 256. When n is 1,024, the kij approach takes 18% as much time as the ijk approach, which is pretty close to the 22% predicted by our analysis.
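
For readers who want to try this themselves, here is a minimal timing harness of my own; it is not the program used to produce the table above. It assumes a POSIX system providing clock_gettime, uses flattened one-dimensional arrays, and reports nanoseconds per iteration of the innermost loop.

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    int n = 1024;                             /* matrix dimension (an assumption) */
    double *A = calloc((size_t)n * n, sizeof(double));
    double *B = calloc((size_t)n * n, sizeof(double));
    double *C = calloc((size_t)n * n, sizeof(double));

    double start = seconds();
    /* ijk order; reorder the loops as k, i, j to time the kij version instead */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
    double elapsed = seconds() - start;

    printf("%.1f ns per iteration of the innermost loop\n",
           elapsed * 1e9 / ((double)n * n * n));
    free(A); free(B); free(C);
    return 0;
}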