Review H: The hierarchy of memory & caches: Questions

H1.1.

Identify the power of 10 to which each of the following corresponds. For example, if asked about kilo-, you would write 10³.

a. mega-
b. milli-
c. tera-
H1.2.

For each of the following unit prefix symbols, identify the prefix and power of 10 to which it corresponds. For example, if asked about K, you would respond kilo- and 10³.

a. μ

b. G

H1.3.

Identify the power of 10 to which each of the following corresponds. For example, if asked about kilo-, you would write 10³.

a. nano-
b. giga-
c. micro-
H1.4.

Estimate the amount of time for each of the following on a modern computer.

a. clock cycle (Roughly one instruction is completed per clock cycle per core.)

b. access hard disk

c. access DRAM

H1.5.

Explain how a hard disk stores and retrieves data.

H2.1.

Suppose we have a system with 12-bit memory addresses using a direct-mapped cache holding a total of 1 kilobyte, using lines of 16 bytes. For each of the following memory addresses, identify the tag, line number, and offset that the direct-mapped cache will use.

a. 1100 1010 1011
b. 0010 0011 0101
H2.2.

Suppose we have a system using six-bit addresses and a direct-mapped cache with two lines, each holding two bytes. And suppose we have the following sequence of one-byte accesses: M[10], M[2], M[1], M[3], M[10], M[2] (where the addresses are in base 10).

a. Which of the accesses in the sequence hit the cache?
b. Draw a picture of the contents of the cache after completing this sequence.
H2.3.

Suppose we have a computer system using five-bit addresses. It includes a direct-mapped cache with two lines, each with four bytes. We have the following sequence of one-byte accesses. (All addresses are in binary.)

read M[10011]
read M[10000]
read M[10100]
read M[00010]
read M[10000]
read M[10111]

a. Diagram the cache contents after completing the access sequence.

b. Which of the memory accesses hit the cache?

H2.4.

Suppose we have a computer system using five-bit addresses. It includes a direct-mapped cache with four lines, each with four bytes. We have the following sequence of one-byte accesses. (All addresses are in binary.)

M[00000]
M[11111]
M[00111]
M[00010]
M[01010]
M[10000]
M[00100]
M[00011]
    

a. Which of the memory accesses hit the cache?

b. Give the tag stored with each cache line after the computer completes the memory accesses.

H2.5.

Suppose we have two systems that are identical except for the caches. Both caches hold the same number of bytes and are direct-mapped, but one has a line size of 4 bytes while the other has a line size of 16 bytes. Identify which you would expect to perform better, and describe a typical access pattern to justify your answer.

H2.6.

Suppose we have the following global declarations in C for representing memory and a direct-mapped cache with 16 lines of 8 bytes each.

struct cache_line {
    int tag;
    char data[8];
};

struct cache_line cache[16];
char mem[1024];

Complete the following C function to retrieve a value from cache; if it is not already present in cache, it should load the relevant line from mem to cache first. Your program should make use of the C bit operators to divide the address into its relevant pieces.

char fetch(int addr) {
H2.7.

Suppose we are using a direct-mapped cache with four cache lines, each with eight bytes. Give a sequence of memory accesses in which the contents of memory address 000000 would be loaded into cache and then removed from cache, even though memory address 000000 never appears in your access sequence.

H2.8.

Consider the following C code to add the numbers within an array.

int i;
int sum = 0;
for (i = 0; i < n; i++) {
    sum += a[i];
}

Note that this program never accesses the same array element twice. Explain how a direct-mapped memory cache helps this code run faster.

H2.9.

Consider the following C code fragment.

for (i = 0; i < n; i++) {
    j = a[i];
    sum += count[j];
}

Suppose that the CPU uses a direct-mapped cache in which each line holds four array entries. For each iteration of this loop, how many times, on average, will the CPU miss the cache? Explain your answer. (Assume that the array a holds random integers between 0 and 1,000,000.)

H2.10.

When we studied the direct-mapped cache, we regarded each memory address as having three pieces — a tag, then a line number, then an offset. Suppose we inverted this order: The offset would come first, then the line number, and then the tag. The cache would still work, but why would this have undesirable effects on the cache's performance?

H3.1.

Compare the virtues of a direct-mapped cache versus a fully associative cache.

H3.2.

Why do many CPUs implement a set-associative cache rather than the direct-mapped cache? Why use a set-associative cache rather than the fully associative cache?

H4.1.

Describe the difference between the write-through and write-back approaches to handling writes in a cache, and contrast their relative advantages.

H5.1.

Which of the below C fragments is likely to execute more quickly? Explain your answer in detail.

a.

sum = 0;
for (i = 0; i < 1000; i++) {
    for (j = 0; j < 1000; j++) {
        sum += a[i][j];
    }
}

b.

sum = 0;
for (j = 0; j < 1000; j++) {
    for (i = 0; i < 1000; i++) {
        sum += a[i][j];
    }
}
H5.2.

Consider the following C code to multiply two matrices.

for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        for (int k = 0; k < n; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

Note that the innermost loop here iterates over k.

Suppose that our CPU has a direct-mapped cache with 128 lines, each holding 64 bytes, and suppose that each array element takes 8 bytes. On average, how many times per iteration through the innermost loop will the CPU miss the cache in accessing each of the following?

a. C[i][j]
b. A[i][k]
c. B[k][j]
H5.3.

Each of the following two code fragments computes the sum of list elements found in a two-dimensional array used to represent a linked list. Each node of the list consists of an integer datum and an integer giving the index of the next node within the array (or −1 if the node is the final node).

int list[10000][2];

int n = 0;
int sum = 0;
while (n >= 0) {
    sum += list[n][0];
    n = list[n][1];
}
int list[2][10000];

int n = 0;
int sum = 0;
while (n >= 0) {
    sum += list[0][n];
    n = list[1][n];
}

In view of caching effects, which will be faster? Explain your answer, with specific reference to how the cache works.

H5.4.

Tests on an Intel Pentium III processor reveal that the performance of the below code depends heavily on the value of skip.

int sum(char *arr, int skip) {
    int i, k, total;
    total = 0;
    for (k = 0; k < 4000; k++) {
        for (i = k; i < 4000000; i += skip) {
            total += arr[i];
        }
    }
    return total;
}
  
skip      time
16,352    40 ms
16,384    480 ms

Surprisingly, with a larger value of skip, for which the code will add fewer numbers, the code takes twelve times as long.

Explain how the cache is making a difference here. (A Pentium III's L1 cache is 4-way set associative, with 512 lines, each holding 32 bytes. Note that the cache holds 512 × 32 = 16,384 bytes.)

Review H: The hierarchy of memory & caches: Solutions

H1.1.
a. mega- corresponds to 10⁶
b. milli- corresponds to 10⁻³
c. tera- corresponds to 10¹²
H1.2.
a. μ: micro-, 10⁻⁶
b. G: giga-, 10⁹
H1.3.
a. nano- corresponds to 10⁻⁹
b. giga- corresponds to 10⁹
c. micro- corresponds to 10⁻⁶
H1.4.

a. 0.4 nanoseconds

b. 4 milliseconds

c. 20 nanoseconds

H1.5.

A hard disk consists of one or more platters of magnetic material, and each bit is stored at one particular location on the platter, represented by the magnetic polarity at that location.

To retrieve data, the disk determines the location on the disk — that is, how far the location is from the disk's center and at what angle the data may be found. There is a head that hovers over the disk, attached to an arm, and the disk uses motors to move the arm so that the head is the appropriate distance from the disk's center. Then the disk waits until the always-rotating platter has the requested bit beneath the head, whereupon the head senses the magnetic polarity to retrieve the requested bit. (This is slightly simplified, in that the head never reads just a single bit: It always reads a fixed-size group of adjacent bits, typically 512 bytes.)

H2.1.
   address           tag   line     offset
a. 1100 1010 1011    11    001010   1011
b. 0010 0011 0101    00    100011   0101
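
To check such a decomposition mechanically, here is a minimal C sketch (an addition, not part of the original solution) that splits a 12-bit address with shifts and masks; the constants follow from the cache geometry (16-byte lines give 4 offset bits, and 1024/16 = 64 lines give 6 line bits):

#include <stdio.h>

int main(void) {
    int addrs[] = { 0xCAB, 0x235 };   /* 1100 1010 1011 and 0010 0011 0101 */
    int i;
    for (i = 0; i < 2; i++) {
        int addr = addrs[i];
        int offs = addr & 0xF;          /* low 4 bits: offset */
        int line = (addr >> 4) & 0x3F;  /* next 6 bits: line number */
        int tag  = addr >> 10;          /* top 2 bits: tag */
        printf("addr %03X: tag %X, line %02X, offset %X\n",
               addr, tag, line, offs);
    }
    return 0;
}
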
H2.2.
a. The fourth access (M[3]) only.
b.
   line num   tag    line data
   0          0000   M[0], M[1]
   1          0000   M[2], M[3]
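
Traces like this are easy to verify mechanically. The following sketch (an addition, not part of the original solutions) replays the H2.2 access sequence in C, using one offset bit and one line-number bit as the geometry dictates:

#include <stdio.h>

int main(void) {
    int tags[2]  = { 0, 0 };
    int valid[2] = { 0, 0 };
    int seq[] = { 10, 2, 1, 3, 10, 2 };
    int i;

    for (i = 0; i < 6; i++) {
        int addr = seq[i];
        int line = (addr >> 1) & 1;  /* 1 offset bit, then 1 line bit */
        int tag  = addr >> 2;        /* the remaining 4 bits are the tag */
        if (valid[line] && tags[line] == tag) {
            printf("M[%d]: hit\n", addr);
        } else {
            printf("M[%d]: miss\n", addr);
            valid[line] = 1;
            tags[line]  = tag;
        }
    }
    return 0;
}

Adjusting the shift amounts and table sizes to the geometries of H2.3 (2 offset bits, 1 line bit) and H2.4 (2 offset bits, 2 line bits) replays those traces the same way.
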
H2.3.
a.
   line num   tag   line data
   0          10    M[10000] to M[10011]
   1          10    M[10100] to M[10111]
b. The first read of M[10000] (the second access) and the read of M[10111] (the last access) hit the cache; the second read of M[10000] misses, since M[00010] has ejected that line in the meantime.
H2.4.
a. M[00010] and M[00100] hit the cache.
b. line 0: 0
   line 1: 0
   line 2: 0
   line 3: 1
H2.5.

The cache with 16-byte lines will perform better. For example, consider a program that iterates through successive integers in an array, where the array is initially entirely absent from the cache. With 4-byte lines, each integer maps to a different line, so every access misses. With 16-byte lines, each miss also fetches the next three integers into the same line, so the next three array accesses hit the cache. Thus we miss only 25% of the time with 16-byte lines but 100% of the time with 4-byte lines.
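
For concreteness, the access pattern described is a sequential walk like the following sketch (the array a and the bound N are hypothetical, and 4-byte ints are assumed):

/* With 4-byte lines, each a[i] occupies its own line: every access
 * misses.  With 16-byte lines, one miss loads a[i] through a[i+3]:
 * only one access in four misses. */
int sum = 0;
for (int i = 0; i < N; i++) {
    sum += a[i];
}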

H2.6.
char fetch(int addr) {
    int line;
    int tag;
    int offs;
    int i;

    tag  = addr >> 7;          /* top 3 bits of the 10-bit address */
    line = (addr >> 3) & 0xF;  /* next 4 bits select one of the 16 lines */
    offs = addr & 0x7;         /* low 3 bits select a byte within the line */
    if (cache[line].tag != tag) {
        /* Miss: load the entire 8-byte line from memory. */
        cache[line].tag = tag;
        for (i = 0; i < 8; i++) {
            cache[line].data[i] = mem[(tag << 7) | (line << 3) | i];
        }
    }
    return cache[line].data[offs];
}
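
As a usage sketch (hypothetical, relying on the global declarations given in the question), the first fetch of an address misses and loads its whole 8-byte line, so a second fetch within the same line is served without touching mem again:

#include <stdio.h>

int main(void) {
    int i;
    for (i = 0; i < 1024; i++)
        mem[i] = (char) i;
    printf("%d\n", fetch(0x1A5));  /* miss: loads the line 0x1A0-0x1A7 */
    printf("%d\n", fetch(0x1A7));  /* same tag and line: served from cache */
    return 0;
}
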
H2.7.

M[000001], M[100000]. The access to M[000001] misses and loads the eight-byte line holding M[000000] through M[000111], which includes M[000000]. The access to M[100000] maps to the same cache line (line 0) but with a different tag, so it ejects that line, even though M[000000] itself never appears in the sequence.

H2.8.

This code goes through the array in order, and several array elements will map to a single line in the cache. When the code misses the cache, the cache will load the entire line containing the desired array entry, including the next several adjacent memory locations. Thus, the next several iterations of the loop would hit the cache, saving a memory access each of these times.

H2.9.

1.25. This loop goes through the a array in sequence, so when we load one element of a into the cache, the CPU also loads the next three elements. Thus, this code misses the cache when accessing a[i] an average of 0.25 times per iteration. The pattern of accesses to count, however, is random, and it is highly unlikely that any access would be to an element already in the cache. Thus, the CPU is likely to miss the cache in accessing count[j] each time through the loop. We add these two numbers together (0.25 + 1.00) to get the total of 1.25 misses per iteration.

H2.10.

Each line of the cache holds memory with the same tag and line number. If the offset were first in the address, then the addresses loaded into the line would be far apart. Programs often access memory addresses that are close together (for example, when they iterate through an array). In the standard scheme with the offset last, loading a line means loading several close-together addresses, so that one miss leads to some future hits of close-together addresses. With the offset as the first portion in the address, this would be far less likely.

H3.1.

With a direct-mapped cache, in which every memory address can be cached in only one location within the cache, there is the possibility that an access to a distant memory address will eject a very recently accessed line that is likely to be used again soon. A fully associative cache doesn't share this occasional poor behavior, but it requires more logic circuitry per cache line, so large, fast fully associative caches are unaffordable.

H3.2.

A direct-mapped cache has the disadvantage that if a program frequently accesses memory locations that happen to map to the same cache line, then these memory accesses will frequently miss. This is much less likely to happen with a set-associative cache.

With a fully associative cache, such cache misses are even less likely, but fully associative increases the circuit complexity, and set-associative caches work well enough in practice that reducing misses by using a fully associative cache instead is hardly worth the added circuit complexity.

H4.1.

In a write-through cache, each write is sent immediately to the underlying memory, whereas in a write-back cache, writes are not sent to the underlying memory until the corresponding line is ejected from the cache.

Because the write-through approach is simpler than the write-back approach, it requires fewer transistors to support. But a write-back cache will result in less traffic with memory. [Also, in multi-processor systems with separate caches, the write-through approach is particularly advantageous.]
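
As an illustration of the write-back bookkeeping (a sketch, not from the original text; it reuses the 16-line, 8-byte geometry of H2.6 and adds hypothetical valid and dirty flags), a store routine writes only into the cache and flushes a dirty line back to mem just before evicting it:

struct wb_line {
    int valid, dirty, tag;
    char data[8];
};

struct wb_line wb_cache[16];
extern char mem[1024];   /* the memory array declared in H2.6 */

void store(int addr, char value) {
    int tag  = addr >> 7;
    int line = (addr >> 3) & 0xF;
    struct wb_line *l = &wb_cache[line];
    int i;

    if (!l->valid || l->tag != tag) {
        if (l->valid && l->dirty) {
            /* Evicting a dirty line: flush it to memory first. */
            int old = (l->tag << 7) | (line << 3);
            for (i = 0; i < 8; i++)
                mem[old | i] = l->data[i];
        }
        /* Fill the line from memory, as fetch does. */
        for (i = 0; i < 8; i++)
            l->data[i] = mem[(tag << 7) | (line << 3) | i];
        l->valid = 1;
        l->tag = tag;
    }
    l->data[addr & 0x7] = value;  /* the write stays in the cache... */
    l->dirty = 1;                 /* ...until this line is evicted */
}

A write-through version would instead send each byte to mem immediately and would need no dirty flag.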

H5.1.

Fragment a. is likely to execute more quickly because it maps better to cache: It goes through the two-dimensional array at successive memory addresses, so a cache miss for one array entry will lead to several successive cache hits at subsequent array entries. By contrast, fragment b. goes through the array column by column; loading one entry in the column will not lead the next column entry to be simultaneously loaded.

H5.2.
a. 0 (This is the same memory location each time through the innermost loop, so there will be no misses after the first iteration.)
b. 0.125 (As we go through the innermost loop, we are walking through successive memory locations. Each time we miss the cache, we load eight successive elements of the row; thus this access misses the cache one time in eight.)
c. 1.0 (Each array access is in a different row, so the accesses are widely separated. Loading one element of the column does not help in reaching another. Thus each access to B will be a miss.)
H5.3.

The first fragment will be faster. With each iteration of the loop, this fragment accesses two successive memory locations, likely to fall on the same line in the cache. Consequently, there will be just one miss for each iteration.

With the second fragment, the two memory accesses are far apart and so will be on different lines; moreover, since n skips around throughout the array, the entries will rarely be in the cache from previous iterations. As a result, the second fragment will typically see two cache misses per iteration, and so it will take roughly twice as long.

H5.4.

When skip is 16,384, each access to arr[i] in the inner loop maps to the same set of cache lines, so the lines accessed at the beginning of the inner loop will have been dropped from the cache by the time the inner loop finishes. When k is incremented and the program reads the array entries following those accessed in the previous pass, those lines are no longer in the cache. As a result, all array accesses miss the cache.

When skip is 16,352, the accesses to arr[i] map to a variety of cache lines, and all the requested lines can fit into the cache. (The code would access 4000000/16352 ≈ 244 lines, and since they map to different sets, they would all fit into the cache.) When the code continues to the next value of k, the data is still in the cache from before, and so the array accesses usually hit the cache.
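
To see the set-index arithmetic concretely, here is a small sketch (an addition, not part of the original solution): with 32-byte lines the set index is bits 5 through 11 of the address, so a stride equal to the 16,384-byte cache size leaves those bits unchanged, while a stride of 16,352 does not:

#include <stdio.h>

int main(void) {
    int skip, i;
    for (skip = 16352; skip <= 16384; skip += 32) {
        printf("skip %d: sets", skip);
        for (i = 0; i < 5; i++)
            printf(" %d", ((i * skip) >> 5) & 0x7F);  /* bits 5-11 */
        printf("\n");
    }
    return 0;
}

With skip 16,384 every access lands in set 0; with skip 16,352 the sets printed are 0, 127, 126, 125, 124, spreading the accesses across the cache.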