by Carl Burch, Hendrix College, October 2011
Computing systems use memory very heavily, and so performance depends in part on efficiently accessing memory. In this document, we investigate how modern systems reduce memory access time using caches.
Today's computing systems include four important technologies for remembering information.
disks use a magnetic material and represent a bit using the magnetic polarity at an individual location on the disk. Although some disks are portable, like the old floppy disks, we'll use the word disk to refer to a computer's hard drive.
flash memory uses special transistors that can hold their state even in the absence of any power. Like disks, then, flash memory retains any stored values even when power is off; this distinguishes it from the other two technologies (DRAM and SRAM), though in this document we're not concerned with this property.
DRAM (for Dynamic Random Access Memory) uses electrical capacitors for holding data. A capacitor holds an electric charge only briefly, so storing a bit long-term requires recharging each depleted capacitor many times a second.
DRAM is so central to a computer's operation that the terms memory and RAM almost always refer to DRAM.
(Manufacturers often distinguish between varieties of DRAM. The S in SDRAM, for example, stands for synchronous.)
SRAM (for Static Random Access Memory) is built using transistors, working on the same principle as an SR latch built out of two NOR gates (or two NAND gates).
The below table summarizes each technology's access time and cost as of 2011.

          access time       cost per GB
SRAM      1 nanosecond      $500.00
DRAM      10 nanoseconds    $6.00
flash     50 microseconds   $1.50
disk      4 milliseconds    $0.10
To make sense of this table and our discussion, you need to know your scientific units! Below is a table of the most common units appearing in computing; it's worth making sure that you're familiar with each of them.
milli-  (m)  10⁻³:  one thousandth      kilo-  (K)  10³:   one thousand
micro-  (μ)  10⁻⁶:  one millionth       mega-  (M)  10⁶:   one million
nano-   (n)  10⁻⁹:  one billionth       giga-  (G)  10⁹:   one billion
                                        tera-  (T)  10¹²:  one trillion
A CPU will try to complete an instruction every clock cycle. A typical clock rate is 2 GHz, which means 2 billion clock cycles each second; thus, each clock cycle takes one two-billionth of a second, or 0.5 ns (nanoseconds). At that pace, SRAM can nearly keep up, while DRAM is significantly slower, and the disk is slothlike.
While we would ideally use SRAM for all our memory, the price of SRAM is so high that we cannot afford much of it. DRAM is the only affordable and reasonably fast option in the quantity we require. But computing systems nonetheless take advantage of SRAM's speed by building a cache out of SRAM to hold the data that a program seems to be using heavily.
A cache holds a subset of the memory. It is built transparently into the system: the programmer manipulates what is in memory and usually has no direct way to manipulate the contents of the cache. Instead, the processor itself decides what to place into the cache, and this process is usually invisible to the programmer, though there are exceptions, as we'll see in Section 5.
A cache depends on the fact that programs display locality, which means that whenever a program accesses a particular memory location, it is likely to access the same or a nearby memory location again in the near future. This has been tested numerous times, and programs invariably display this property in abundance.
There are three basic categories of caches: direct mapped, set associative, and fully associative. In this section, we'll examine direct-mapped caches.
In a direct-mapped cache, each memory address has only one possible location in the cache where its data might be found; that is, each memory address maps directly to a cache location.
A direct-mapped cache is divided into several lines, each holding several adjacent bytes of cached data. Each line is the same size, and the processor designer decides how long each line should be; a typical choice is 32 or 64 bytes. The lines are numbered starting at 0. Of course, the information stored in each line includes the actual memory data being cached; but each line also includes a valid bit for representing whether the line holds any useful data and a tag to identify the range of memory addresses that the cache holds.
A realistic direct-mapped cache might have 512 lines, each holding 64 bytes, for a total of 32 KB.
line   valid   tag   data
0      ?       ???   ???
1      ?       ???   ???
2      ?       ???   ???
⋮      ⋮       ⋮     ⋮
510    ?       ???   ???
511    ?       ???   ???
To see where a memory location is to be found in the cache, we take the memory address and divide it into three pieces: the tag, the line number, and the offset. For our example cache with 512 lines of 64 bytes each, a 32-bit memory address would include 6 bits for the offset, since 2⁶ is 64, the number of bytes per line; it would include 9 bits for the line number, since 2⁹ is 512, the number of lines; and it would include 17 bits for the tag, since this is the number of remaining bits (32 − 9 − 6). Thus, our example cache would regard each memory address as having three pieces: the top 17 bits form the tag, the next 9 bits give the line number, and the bottom 6 bits give the offset.
To find a memory item in the cache, the direct-mapped cache breaks the desired memory address into these three pieces, tag, line, and offset. If the memory item is in the cache, it must be offset bytes into the line numbered line. To determine whether the data is present, the cache will verify that this line's valid bit is 1 and that the tag in the requested address matches the tag stored in the line. If this test reveals that the data is in the cache, the cache can respond immediately. If not, the cache loads from memory the sequence of adjacent bytes including the requested address to fill a cache line, sets the line's valid bit and tag, and then returns the requested data.
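To make the bit manipulation concrete, here is a minimal C sketch of how this example cache (512 lines of 64 bytes each, so 6 offset bits, 9 line bits, and 17 tag bits) might decompose an address and test for a hit. The struct and function names are hypothetical; a real cache does all of this in hardware rather than software.

#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES   64   /* 2^6 bytes per line  -> 6 offset bits */
#define NUM_LINES   512   /* 2^9 lines           -> 9 line bits   */

/* Hypothetical model of one cache line: valid bit, tag, and data. */
struct cache_line {
    bool     valid;
    uint32_t tag;
    uint8_t  data[LINE_BYTES];
};

static struct cache_line cache[NUM_LINES];

/* Returns true (a hit) and copies the requested byte into *out if the
   address is cached; returns false (a miss) otherwise. */
bool lookup(uint32_t addr, uint8_t *out) {
    uint32_t offset = addr & (LINE_BYTES - 1);        /* low 6 bits    */
    uint32_t line   = (addr >> 6) & (NUM_LINES - 1);  /* next 9 bits   */
    uint32_t tag    = addr >> 15;                     /* remaining 17  */

    if (cache[line].valid && cache[line].tag == tag) {
        *out = cache[line].data[offset];
        return true;   /* cache hit */
    }
    return false;      /* cache miss: the hardware would now fill the line */
}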
To get a better handle on how this works, let's look at an example. Suppose we're working in a computer using 6-bit memory addresses, and our computer has a cache with two lines, each holding four bytes. Thus, of each 6-bit memory address, one bit will specify the line number and two will specify the offset within the line; that leaves 6 − 1 − 2 = 3 bits for the tag.
The cache looks like the following at start-up time.
line   valid   tag   data
0      0       ???   (none)
1      0       ???   (none)
Let's see what happens if a program accesses the following sequence of bytes from memory: M[19], M[20], M[16], M[34], M[18].
Read M[19]. We break 010011 into its component pieces, getting 010 for the tag, 0 for the line number, and 11 for the offset. We look at line 0 and find that the valid bit is 0; we conclude the information isn't yet in the cache, calling it a cache miss. On a cache miss, the cache loads all bytes with the matching tag and line number into the cache. Thus, we load the bytes at all memory addresses of the form 0100xx into line 0; this includes memory addresses 16 through 19.
line   valid   tag   data
0      1       010   M[16–19]
1      0       ???   (none)
(Despite our diagram, the memory addresses 16, 17, 18, and 19 don't actually appear in the cache. The data column actually contains the four bytes found in memory at these addresses.)
Read M[20]. We break 010100 into 010 for the tag, 1 for the line number, and 00 for the offset. We look at line 1, find that the valid bit is 0, and conclude that this reference is also a cache miss. We load all memory addresses of the form 0101xx into line 1; this includes memory addresses 20 through 23.
line   valid   tag   data
0      1       010   M[16–19]
1      1       010   M[20–23]
Read M[16]. We break 010000 into 010 for the tag, 0 for the line number, and 00 for the offset. We look at line 0, and we see that the valid bit is 1 and the line's tag matches our tag. Thus we have the desired information already in the cache; this is called a cache hit. We serve the request using the information starting at offset 00₂ = 0 in the cache line.
Read M[34]. We break 100010 into 100 for the tag, 0 for the line number, and 10 for the offset. We look at line 0, and we see that the valid bit is 1 but the line's tag 010 doesn't match our tag 100. We discard the information currently in line 0, replacing it with memory at addresses 32–35.
line   valid   tag   data
0      1       100   M[32–35]
1      1       010   M[20–23]
Read M[18]. We break 010010 into 010 for the tag, 0 for the line number, and 10 for the offset. We look at line 0, and we see that the valid bit is 1 but the line's tag 100 doesn't match the address's tag 010. We discard the information currently in line 0, replacing it with memory addresses 16–19.
line   valid   tag   data
0      1       010   M[16–19]
1      1       010   M[20–23]
Thus, in serving this sequence of five memory accesses, we found one cache hit and four cache misses. In real-world sequences, though, there are many more hits than misses: typically, more than 99% of memory accesses are hits. This proportion of cache hits to total memory references is called the cache hit ratio.
The primary thing to consider about a cache is how well it takes advantage of locality. We'll concentrate on two important cases: accesses to memory near a recently accessed location, and repeated accesses to the same location.
The first case, accessing memory near a recently accessed location, usually goes quite well. Our example trace illustrates this some: When we load M[19], we also load M[16], because it is in the same line. Thus, when we access M[16] later, it is already in the cache.
One common situation where this scenario is relevant is in accessing an array in order. Such a situation is illustrated by the following code to add the elements of an array.
sum = 0;
for (i = 0; i < n; i++) {
sum += a[i];
}
If our cache holds 64 bytes for each line, and if each array element is 8 bytes long, then each cache line holds eight array elements. Whenever the above code accesses an array element and misses the cache, the cache will load not only the referenced element but also the seven succeeding values, and so the miss will be followed by seven hits. Thus, this example would miss the cache only one eighth of the time, giving an 87.5% cache hit ratio.
The second case, accessing the same memory location again later, is one that a direct-mapped cache usually handles well, since it will hold a line of data indefinitely, until the line is ejected by a reference to a distant memory location that happens to map to the same line.
The direct-mapped cache does, however, have a very bad case. For our above example, suppose that we have a program that happens to alternate between accessing memory addresses 16 and 32. Both memory addresses map to the same line (line 0), so each memory access would miss the cache. This is unfortunate, because frequent accesses to the same memory locations should be an easy case for the cache. This is a shortcoming of direct-mapped caches, and indeed it is the primary factor leading us to other cache designs.
For a larger cache, this outcome would be less likely. With a 16 KB cache, for example, our program would have to be unusually interested in two memory addresses that lie a multiple of 16 KB apart (or nearly so). In practice, this doesn't happen all that often. But it does happen enough to inspire the other cache types we will study.
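To make the bad case concrete, here is a small, hypothetical C sketch of the kind of access pattern that defeats a direct-mapped cache. The 16 KB figure is the cache size assumed above; real processors, whose data caches are usually set associative, would not necessarily behave this badly.

#define CACHE_BYTES (16 * 1024)   /* hypothetical direct-mapped cache size */

double a[2 * CACHE_BYTES / sizeof(double)];

double ping_pong(int iterations) {
    double sum = 0.0;
    /* a[0] and a[CACHE_BYTES / sizeof(double)] lie exactly 16 KB apart, so
       in a 16 KB direct-mapped cache they map to the same line and keep
       evicting each other: in the worst case, every access is a miss. */
    for (int i = 0; i < iterations; i++) {
        sum += a[0];
        sum += a[CACHE_BYTES / sizeof(double)];
    }
    return sum;
}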
Caches come in three basic designs, and the direct-mapped design is only one of them. The others are set associative and fully associative. Once you understand direct-mapped caches, these other two are relatively easy to understand.
In a direct-mapped cache, each memory address maps to only one possible location in the cache where that address's data might appear. As we've seen, this can lead to problems where a program frequently accesses two distant memory locations that happen to map to the same line of the cache.
In a fully associative cache, an address can be cached in any line of the cache. We no longer have reason to number the cache lines: The cache simply consists of several lines, all of which are available for caching data from any address.
Given a memory address, then, a fully associative cache will divide the address into two pieces: the tag and the line offset. The cache will search through all of the cache's lines to see if the tag for any of the lines matches the requested tag. If so, then it simply needs to return the data found in that line at the offset given in the requested address. If the tag is not found, then the cache needs to make room for the requested line. It will use an algorithm such as LRU or Clock to decide which line to remove; these algorithms are beyond the scope of this document.
You might object: Doesn't the process of searching through all lines for the requested tag take a lot of time? After all, an implementation of such a cache in C would involve a loop that iterates through one line after another until it finds the desired tag. However, we need to remember that the cache is a circuit, not a program. Each line of the cache's circuit can include its own collection of logic gates for comparing the line's tag to the requested tag, so that all of the lines can be compared to the requested tag at the same time.
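To picture that software loop, here is a rough C sketch of what it would look like; the names are hypothetical, and the whole point is that hardware replaces this sequential scan with one tag comparator per line, all operating at once.

#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES  512
#define LINE_BYTES  64

struct fa_line {
    bool     valid;
    uint32_t tag;
    uint8_t  data[LINE_BYTES];
};

static struct fa_line lines[NUM_LINES];

/* Software model of a fully associative lookup: scan every line for a
   matching tag.  The hardware performs all NUM_LINES comparisons at once. */
bool fa_lookup(uint32_t tag, uint32_t offset, uint8_t *out) {
    for (int i = 0; i < NUM_LINES; i++) {
        if (lines[i].valid && lines[i].tag == tag) {
            *out = lines[i].data[offset];
            return true;   /* hit */
        }
    }
    return false;          /* miss: choose a victim line via LRU, Clock, etc. */
}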
If our goal were purely to avoid cache misses, then we'd certainly choose a fully associative cache: Nothing is more effective. Nonetheless, CPU designers rarely use them because the added effectiveness is rarely worth the added circuitry. Designers work with a fixed budget of transistors, and a given number of transistors supports a larger direct-mapped cache than a fully associative cache. The designer might judge that the bad behavior of direct-mapped caches is rare enough that spending gates on this possibility is a poor use of the limited budget of transistors that can be built into a CPU.
In practice, then, modern computers rarely use fully associative caches for caching a subset of memory. However, there are some other applications where fully associative caches are common. One prominent example found in most common processors is the translation lookaside buffer (TLB) used for supporting virtual memory — though virtual memory is beyond the scope of this document.
The set-associative cache is a compromise between a direct-mapped cache and a fully associative cache. Here, we group the lines into sets. Typically, a set holds four to sixteen lines; a cache with eight lines per set is called 8-way set-associative. The cache maps each requested address directly to a set (akin to how a direct-mapped cache maps an address to a line), and then it treats the set as a fully associative cache.
As an example, suppose we have a 2-way set-associative cache that holds 32 bytes with 8 bytes per line. Because the cache's total of 32 bytes is divided into 8 bytes per line, our cache has 32 ÷ 8 = 4 lines. Because this is a 2-way set-associative cache, these 4 lines are divided into sets of 2, so there are 4 ÷ 2 = 2 sets. Thus, given a memory request, we'll break an 8-bit address into three portions: The bottom 3 bits provide the offset (since the line size is 2³ = 8); the next 1 bit identifies the set into which the request maps (since the number of sets is 2¹ = 2); and the remaining 4 bits provide the tag.
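Here is a hedged C sketch of that address breakdown for this hypothetical 2-way cache; the shift and mask amounts come directly from the parameters just described (8-byte lines, 2 sets).

#include <stdint.h>

/* Example parameters: a 32-byte, 2-way set-associative cache with
   8-byte lines, so 4 lines grouped into 2 sets. */
void split_address(uint8_t addr, uint8_t *tag, uint8_t *set, uint8_t *offset) {
    *offset = addr & 0x7;          /* bottom 3 bits (2^3 = 8 bytes per line) */
    *set    = (addr >> 3) & 0x1;   /* next 1 bit    (2^1 = 2 sets)           */
    *tag    = addr >> 4;           /* remaining 4 bits                       */
}

/* For example, splitting address 18 (00010010) yields tag 0001, set 0,
   offset 010, matching the walkthrough that follows. */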
Now let us suppose that we have the following sequence of five memory references: M[18], M[25], M[51], M[16], M[37].
Read M[18]. The memory address 00010010 maps to tag 0001, set 0, offset 010. Looking at set 0, we see that neither line yet has its valid bit set, so the set holds no useful information, and thus this is a cache miss. We load memory addresses of the form 00010xxx into a line of set 0.

Read M[25]. The memory address 00011001 maps to tag 0001, set 1, offset 001. Looking at set 1, we see that neither line is valid, so this is a cache miss. We load memory addresses of the form 00011xxx into a line of set 1.

Read M[51]. The memory address 00110011 maps to tag 0011, set 0, offset 011. Looking at set 0, we see that the first line's tag of 0001 doesn't match the requested 0011 tag, and the second line is invalid; we thus have a cache miss. We load memory addresses of the form 00110xxx into a line of set 0; the best candidate would be the invalid line.

Read M[16]. The memory address 00010000 maps to tag 0001, set 0, offset 000. Looking through set 0, we see that one of the lines already has this tag (0001), so this is a cache hit, and we can find the requested data at offset 000 within this line. Note that a direct-mapped cache of the same size would have missed this memory reference: The previous reference to M[51] would have mapped to the same line of the cache, so M[16] would have been discarded.
Read M[37]. The memory address 00100101 maps to tag 0010, set 0, offset 101. Looking through set 0, we see that neither line's tag matches, so this is a cache miss. We must load a new line into set 0, but the decision of which line to replace is not straightforward: The M[16–23] line is older, but it was also more recently accessed. Let's suppose we discard the line that was least recently accessed, so we'd replace the M[48–55] line with the newly loaded M[32–39]; this is called the LRU policy (for Least Recently Used).
As you would expect, a set-associative cache requires more logic gates per line than a direct-mapped cache but fewer than a fully associative cache. On the other hand, it avoids the bad behavior of a direct-mapped cache when a program repeatedly accesses two memory locations that are far apart but happen to map to the same line. It still has a problem when a program accesses many distant memory locations mapping to the same set: For example, with a four-way set-associative cache, cycling through five different memory addresses that map into the same set would cause problems. But such a case is relatively rare.
The distinction between different types of caches is important in theory, but it is also helpful in understanding their real-life deployment.
To see an example of caches in real life, let's look at the Arrandale architecture deployed by Intel in 2010. This architecture appears in several processor lines, with different numbers of processing units and slightly different cache designs: We'll look in particular at how Arrandale's architecture appeared in the Intel Core i5 line.
In the Intel Core i5 line, the Arrandale architecture supports two identical processing units called cores. Each core can independently execute different programs simultaneously. Between these two cores, the chip incorporates seven different caches; all caches use cache lines of 64 bytes.
Each of the two cores has a 32 KB cache that caches instructions; this is called its L1 i-cache. This cache is 4-way set associative.
Each core also has a 32 KB cache that caches the data accessed in executing instructions; this is called its L1 d-cache. This cache is 8-way set associative.
Processors often have separate caches for instructions and data, since these two kinds of memory contents are normally disjoint, and since a processor frequently has reason to fetch the next instruction at the same time that it executes an instruction that accesses data from memory. Having two caches permits both memory references to occur simultaneously.
Each core has a 256 KB 8-way set associative cache called its L2 cache. Both L1 caches store a subset of what is found in the L2 cache. Because it is smaller, the L1 cache is slightly faster; but its smaller size also leads it to have more cache misses, which is why it pays to have an L2 cache behind it.
The two cores share a 3 MB 12-way set associative cache called the L3 cache. Each of the L2 caches stores a subset of what is found in the L3 cache. The L3 cache is slower than either of the L2 caches, both because of its larger size and because conflicts must occasionally be resolved when both cores want to use the L3 cache at once.
So far, our work has been with programs whose only interaction with memory is to read its values. But programs also write to memory, which brings up two additional issues for caches.
The first issue: When the program writes to a location whose data is in the cache, can the cache postpone updating the copy in memory? If we answer no, then we have what is called a write-through cache: Each request to store is written into the cache and also forwarded immediately to memory. If we answer yes, then we have a write-back cache: Each line has a dirty bit, which is set when any data in the line is changed; before replacing the data in a line, the cache must check whether its dirty bit is set, and if so save the modified data back into memory.
Of course, write-back caches communicate less with memory, but they require more circuitry. CPU designers have to decide whether they feel that the benefit is worth spending circuitry on it. Basically, it boils down to whether a single write is a strong indication that the program will write to the same line in the near future. Whether this is true depends on what sorts of programs run on the computer.
In general, with the increasing amount of logic that can be placed on the chip, additional logic becomes less expensive. Thus, designers have an increasing tendency to choose write-back caches for their designs.
The second issue: When the program writes to a location that is not currently in the cache, should the cache load the corresponding line? If we answer yes, then we have a write-allocate cache; otherwise, we have a no-write-allocate cache.
Deciding between these choices is more subtle than choosing between write-back and write-through caches, because it depends heavily on the access patterns of typical programs. Generally, write-allocate makes more sense in conjunction with a write-back cache, since a write to a location indicates that a change to a nearby location is likely in the near future. With a write-through cache, though, a later write to a nearby location must go to memory anyway, so the benefit of loading the line is less clear.
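As a rough illustration of how these two policies interact, here is a hedged C sketch of the store path in a write-back, write-allocate cache. The helper functions (find_line, choose_victim, write_line_to_memory, load_line_from_memory) are hypothetical placeholders for the rest of the cache machinery and are left as declarations.

#include <stdbool.h>
#include <stdint.h>

struct line {
    bool     valid;
    bool     dirty;   /* set when the cached copy differs from memory */
    uint32_t tag;
    uint8_t  data[64];
};

/* Hypothetical helpers standing in for the rest of the cache. */
struct line *find_line(uint32_t addr);        /* NULL on a miss   */
struct line *choose_victim(uint32_t addr);    /* line to replace  */
void write_line_to_memory(struct line *ln);
void load_line_from_memory(struct line *ln, uint32_t addr);

/* Store one byte under a write-back, write-allocate policy. */
void store_byte(uint32_t addr, uint8_t value) {
    struct line *ln = find_line(addr);
    if (ln == NULL) {                        /* write miss: allocate the line */
        ln = choose_victim(addr);
        if (ln->valid && ln->dirty)
            write_line_to_memory(ln);        /* write back the modified victim */
        load_line_from_memory(ln, addr);
    }
    ln->data[addr & 63] = value;             /* update only the cached copy   */
    ln->dirty = true;                        /* memory gets updated later     */
}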
In our real-life example of the Arrandale architecture, the designers chose a write-back, write-allocate policy for the L1 caches.
As programmers, we don't need to consider the cache often: The processor handles the cache transparently for us, and modern caches work well enough that we don't need to worry much about it. But the cache's behavior does occasionally impact how programs behave. To see an example, let's look at matrix multiplication.
We need to begin by reminding ourselves of how matrix multiplication works. In the below example, we multiply two 3 × 3 matrices to arrive at a third 3 × 3 matrix.
[ 3 0 1 ]   [ 0 2 0 ]   [ 0 7 1 ]
[ 0 0 2 ] × [ 1 0 0 ] = [ 0 2 2 ]
[ 1 1 0 ]   [ 0 1 1 ]   [ 1 2 0 ]
In this example, the entry of 7 (in the second column and first row of the product matrix) comes from adding together corresponding products of the first row of the first matrix <3, 0, 1> and the second column of the second matrix <2, 0, 1>: 3 ⋅ 2 + 0 ⋅ 0 + 1 ⋅ 1 = 7.
Algebraically, we can express this as follows: If A and B are the two matrices to multiply, and we want to determine entry (i, j) of the product matrix C, we should sum the products of the corresponding numbers of the ith row of A and the jth column of B.
C_{i,j} = \sum_{k=1}^{n} A_{i,k} \cdot B_{k,j}
The i–j–k implementation

The following C code is a straightforward translation of the algebraic expression.
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
for (k = 0; k < n; k++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
Our question is: How does a cache behave as a computer executes this code? We'll answer this question by concentrating on the innermost for loop. For each memory access in this innermost loop, we'll ask: How often does this memory access miss the cache on average? We'll assume that we're dealing with very large matrices, and so n is very large.
Let's look at the access to C[i][j] first. As we iterate through the k's, this address doesn't change. Thus, though C[i][j] may not be in the cache for the first iteration of the innermost for loop, it will be in the cache thereafter. The average number of cache misses per iteration is therefore 1/n, which is basically 0, since we're assuming n is large.
The accesses to A[i][k] go through a row of the array. Adjacent elements of the row are stored in adjacent elements of memory. Thus, when A[i][k] is loaded into the cache, the elements that fall on the same cache line will be in the same row. If we are dealing with a matrix of double values, then each array element uses the 64-bit IEEE format, so there are 8 bytes per array element. If our cache has lines of 64 bytes, then there are eight array elements per line. Whenever we load an element of A into a cache line, we also load seven other elements from the same row of A into the same line. As we go through the accesses to A[i][k], then, we'll miss the cache once every eight times through the loop, and so the average number of cache misses per iteration is 0.125.
Finally, let's look at the accesses to B[k][j]. Here, we're going through a column, and each element is 8 ⋅ n bytes beyond the previous element in the column. We'll never find B[k][j] in the cache unless it was loaded during the iteration through the previous column. But, if n is large enough, the bottom of the previous column will flush from the cache anything cached during the top of the previous column. And by the time we get to the bottom of the current column, the top of that column would flush from the cache anything that might have been loaded in the previous column. Thus, we'll miss the cache on each access to B[k][j].
Let's tabulate these results.
Average misses per iteration of innermost loop

A[i][k] misses:   0.125
B[k][j] misses:   1.000
C[i][j] misses:   0.000
total misses:     1.125

On average, we'll miss the cache 1.125 times every time through the loop.
The k–i–j implementation

Now let's consider a slightly different implementation. The only thing that has changed here is that we have placed the k loop first, so that the innermost loop iterates through values of j instead. The body of the innermost loop is unchanged. We call this the k–i–j implementation after the order in which the loop variables appear.
for (k = 0; k < n; k++) {
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
If we think about how many misses we have here, we see that accesses to C[i][j] miss on average 0.125 times per iteration of the inner loop over j (since we're going through a single row of C), the access to A[i][k] does not miss the cache (its address doesn't change within the inner loop), and the access to B[k][j] misses on average 0.125 times per iteration (since we're going through a single row of B).
Average misses per iteration of innermost loop

A[i][k] misses:   0.000
B[k][j] misses:   0.125
C[i][j] misses:   0.125
total misses:     0.250
If cache misses were all that mattered, we would expect this alternative implementation to take as little as 0.25 / 1.125 = 22% as much time as our first i–j–k implementation. This may not be the entire story, since the second implementation both reads and writes C[i][j] each time through the loop, while the accesses to C[i][j] in the first implementation could be factored out entirely. Thus, it wouldn't be entirely surprising if the second implementation were actually slower. But at the very least, we can conclude that caching will lead these two very similar approaches to perform very differently.
We can perform some tests to see what happens. The following table reports the results of some tests performed on a Core i5 750 chip. The numbers represent the number of nanoseconds per iteration of the inner loop; thus, the given measurement comes from dividing the total amount of time spent on matrix multiplication by n³.
n       i–j–k   k–i–j
32      116     116
64      117     120
128     175     149
256     270     148
512     335     147
1024    858     158
2048    929     171
For small n, we don't expect to see a huge difference, and that's reflected in our numbers: If the matrix is small, then a full column will fit into the cache. As a result, when we go through the inner loop for one value of j in our i–j–k implementation, the cache will load the lines holding the entire jth column of B, and for the next value of j the cache will already hold every entry of B that it needs.
But for larger n, an entire column will not fit into the cache. And, indeed, we see that i–j–k becomes much slower as n reaches 256. In fact, when n is 1,024, the k–i–j approach takes 18% as much time as the i–j–k approach, which is pretty close to the 22% predicted by our analysis.
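For readers who want to try a measurement of this sort themselves, below is a rough sketch of a timing harness for the i–j–k implementation, assuming POSIX clock_gettime and n × n matrices of doubles. It is not the harness used for the table above, and the numbers it reports will depend on the machine.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* i-j-k matrix multiplication over n x n matrices stored row-major. */
static void multiply_ijk(int n, double *A, double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

int main(void) {
    int n = 1024;                      /* matrix dimension under test */
    double *A = calloc((size_t)n * n, sizeof(double));
    double *B = calloc((size_t)n * n, sizeof(double));
    double *C = calloc((size_t)n * n, sizeof(double));
    if (!A || !B || !C) return 1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    multiply_ijk(n, A, B, C);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* Divide by n^3 iterations of the innermost loop, as in the table. */
    printf("%.1f ns per inner-loop iteration\n", ns / ((double)n * n * n));

    free(A); free(B); free(C);
    return 0;
}

Reordering the loops inside multiply_ijk (or adding a separate multiply_kij variant) lets you reproduce a comparison like the one in the table.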