Exam 1 Review B: Questions

R1b.1.

Suppose we can divide the execution of an instruction into eight possible pipeline stages:

S0: 0.20 ns

S1: 0.05 ns

S2: 0.10 ns

S3: 0.15 ns

S4: 0.05 ns

S5: 0.20 ns

S6: 0.10 ns

S7: 0.15 ns

However, inserting pipeline registers between stages requires adding 0.05 ns to the time for the stage.

a. If we use this proposed 8-stage pipeline, how long must each clock cycle be?

b. Now suppose that you want to combine stages while maintaining the same clock rate. Which pipeline stages would you combine? How many stages result?

c. Now suppose that you want to combine stages with each clock cycle taking 0.05 ns longer. Which pipeline stages would you combine? How many stages result?

d. Usually more stages enables a faster clock, which means that instructions are completed more quickly. Explain why sometimes fewer stages will enable instructions to complete more quickly.

R1b.2.

Explain data hazards in the context of pipelining, and describe a technique for reducing their impact.

R1b.3.

Complete the following diagram of how instructions progress through the classic five-stage pipeline, assuming that the processor supports data forwarding. Draw arrows indicating points where forwarding is necessary.

`lw $t0, ($sp)`	IF	ID	EX	MEM	WB
`add $sp, $sp, 4`
`lw $t1, ($sp)`
`xor $t2, $t0, $t1`
`sub $t3, $t0, $t1`
`andi $t3, $t3, 0xFFFF`

R1b.4.

Explain control hazards in the context of pipelining, and describe two techniques to reduce their impact.

R1b.5.

Suppose we have an 8-stage pipeline that can complete one instruction each clock cycle in the absence of any jumps or branches. (We assume there are no data or structural hazards.) For jumps or branches, the destination of the jump is known at the end of the pipeline's second stage, whereas the decision of whether a branch is taken is determined at the end of the pipeline's fourth stage.

Assume that 5% of instructions in a typical program are jump instructions, while 15% of instructions are branch instructions, of which 2/3 are taken.

a. What is the average number of clock cycles per instruction if we always stall for every jump or branch until the next instruction's location is known?

b. What is the average number of clock cycles per instruction if we speculatively execute assuming that branches are not taken?

c. What is the average number of clock cycles per instruction if we speculatively execute assuming that branches are taken?

R1b.6.

Suppose a processor uses a a basic 4096-entry branch prediction buffer, where each entry is exactly one bit long. Explain how the processor accesses and uses the buffer when it reaches a branch instruction.

R1b.7.

Suppose we have the following global declarations in C for representing memory with 16 lines of 8 bytes each.

struct cache_line { int tag; char data[8]; }; struct cache_line cache[16]; char mem[1024];

The following function uses this structure to simulate the retrieval of a value from a direct-mapped cache.

char fetch(int addr) {

    int tag  = addr >> 7;

    int indx = (addr >> 3) & 0xF;

    int offs = addr & 0x7;

    int i;

    if (cache[indx].tag == tag) {

        return cache[indx].data[offs];

    } else {

        cache[indx].tag = tag;

        for (i = 0; i < 8; i++) {

            cache[indx].data[i] = mem[(tag << 7) | (indx << 3) | i];

        }

        return cache[indx].data[offs];

    }

}

How can we modify this to simulate a two-way set-associative cache instead? Presume that we are using a random-replacement policy, using a function randbit() taking no parameters to generate a random bit.

R1b.8.

Increasing the associativity of a cache (such as going from 2-way to 4-way set-associativity) always reduces the miss rate. Why, then, do caches usually have a low degree of associativity?

R1b.9.

Suppose that benchmarks indicate the following cache miss rates per 1,000 instructions on a pipelined processor.

1.4 in fetching instructions from a dedicated 32KB instruction cache

38.4 in accessing data from a dedicated 32KB data cache

39.4 in accessing instructions and data from a unified 64KB cache

Suppose that the miss penalty is 150 clock cycles, and that 40% of instructions access data from memory (or cache). We assume that the processor does not stall for any reason unrelated to memory access, and we ignore the rare occurrence of missing the cache simultaneously in fetching an instruction and in accessing data.

a. If we have separate 32KB caches, what is the average number of clock cycles per instruction?

b. What if we have a unified 64KB cache instead?

R1b.10.

As we saw, for a fixed-size cache, the miss rate tends to decrease as we grow the number of bytes stored per cache block, until some point where it begins to increase. Why do small block sizes tend to lead to larger miss rates? And why do large block sizes tend to lead to larger miss rates?

R1b.11.

As we saw, many processors use a multi-layer approach to caching: They might have a 64KB L1 cache backed by a 256KB L2 cache. If we can afford a single-layer 512KB cache instead, why would we not simply prefer the single-layer approach?

Exam 1 Review B: Solutions

R1b.1.

a. 0.25 ns (the length of the longest stage, plus 0.05 ns)

b. Combine S1/S2 and S3/S4, giving a six-stage pipeline.

c. Combine S0/S1, S2/S3, S4/S5, S6/S7, giving a four-stage pipeline.

d. A pipeline with fewer stages can lead to shorter stalls: For example, if an add instruction needs its data starting in stage S4, and it depends on the result of the previous instruction, which doesn't become available until that instruction reaches S7, then the eight-stage pipeline will require stalling the add instruction for three clock cycles; the four-stage pipeline of part c would only require stalling for one clock cycle.

R1b.2.

A data hazard occurs when one instruction depends on data being computed by a previous instruction that is still in the pipeline (and thus not yet complete). For example, in the following sequence, the add instruction requires $t0 from the preceding lw instruction; but a na¨i;ve pipeline implementation would retrieve the value of $t0 for add before the lw instruction manages to write the new value back into a register.

lw $t0, ($sp) add $t1, $t1, $t0

The primary way to reduce the impact of data hazards is to use forwarding: We add data pathways into the pipeline so that the computed value can be sent back to an earlier stage where another instruction might want it. In our example, we'd arrange for the add instruction to be in the EX stage when lw reaches the WB stage, and we'd have a pathway for the $t0 value to be sent directly into the EX stage.

(An alternative technique for reducing the impact of data hazards is to use out-of-order execution.)

R1b.3.

`lw $t0, ($sp)`	IF	ID	EX	MEM	WB
`add $sp, $sp, 4`		IF	ID	EX	MEM	WB
`lw $t1, ($sp)`			IF	ID	EX	MEM	WB
`xor $t2, $t0, $t1`				IF	ID	stl	EX	MEM	WB
`sub $t3, $t0, $t1`					IF	stl	ID	EX	MEM	WB
`andi $t3, $t3, 0xFFFF`							IF	ID	EX	MEM	WB

The following arrows would need to be indicated:

from after the add's EX stage to before the second lw's EX stage
from after the second lw's MEM stage to before the xor's EX stage
from after the sub's EX stage to before the andi's EX stage

R1b.4.

Control hazards are situations where new instructions cannot be fetched into the pipeline since the future sequence of instructions is unknown. This is typically due to a branch or jump instructions. Their impact is reduced by any of the following.

Move the computation of which instruction is next to come earlier in the pipeline.
In speculative execution, the processor guesses what the next instruction will be, so that it can immediately begin executing succeeding instructions accordingly. If the guess is wrong, the succeeding instructions can promptly be canceled; but if the guess is correct, the pipeline will be fully utilized throughout the branch.
Sophisticated branch prediction will improve the performance of speculative execution further, by keeping track of past behavior to improve the accuracy of the guesses.
A delay slot, where the instruction after each branch will be executed regardless of the branch decision, allows at least one instruction after each branch to not be wasted.

R1b.5.

Note that 10% of instructions are branches that are taken, while 5% of instructions or branches that are not taken.

a. 0.8 ⋅ 1 + 0.05 ⋅ 2 + 0.15 ⋅ 4 = 0.8 + 0.1 + 0.6 = 1.5

b. 0.8 ⋅ 1 + 0.05 ⋅ 2 + 0.1 ⋅ 4 + 0.05 ⋅ 1 = 0.8 + 0.1 + 0.4 + 0.05 = 1.35

c. 0.8 ⋅ 1 + 0.05 ⋅ 2 + 0.1 ⋅ 2 + 0.05 ⋅ 4 = 0.8 + 0.1 + 0.2 + 0.2 = 1.3

R1b.6.

The processor uses the lower 12 bits of the PC as an index into the buffer. If it finds a 0 there, it predicts that the branch will not be taken (so the next instruction fetched will be the one immediately after the branch). If it finds a 1, it predicts the branch will be taken (so the next instruction fetched will be the one indicated by the branch instruction). Later in the pipeline, when the branch decision is made, the bit is updated to reflect the more recent decision.

R1b.7.

char fetch(int addr) {

    int tag  = addr >> 6;

    int indx = (addr >> 3) & 0x7;

    int offs = addr & 0x7;

    int opt, i;

    for (opt = 2 * indx; opt < 2 * indx + 1; opt++) {

        if (cache[opt].tag == tag) {

            return cache[opt].data[offs];

        }

    }

    opt = 2 * indx + randbit();

    cache[opt].tag = tag;

    for (i = 0; i < 8; i++) {

        cache[opt].data[i] = mem[(tag << 6) | (indx << 3) | i];

    }

    return cache[opt].data[offs];

}

R1b.8.

First, the reduction in miss rate from increasing associativity becomes quite small, particularly beyond an 8-way cache. However, having greater associativity increases circuit complexity (i.e., requires more transistors), and at some point it is no longer a cost-efficient way to spend transistors.

Second, greater associativity does require a slightly deeper circuit, and so increasing associativity will require slightly more time for each cache access. This increased time per cache access eventually outweighs the improvement in miss rate.

R1b.9.

a. 1 + 0.0014 ⋅ 150 + 0.0384 ⋅ 150 = 6.97 cycles.

b. 1 + 0.0394 ⋅ 150 + 0.4 = 7.31 cycles.

R1b.10.

With small block sizes, the cache does not work well for programs that access adjacent instructions/data in memory. A very basic example is stepping through an array: A smaller block size will lead the cache to fetch fewer array entries on a miss, so it will miss more frequently.

With large block sizes, there will be fewer blocks; particularly with small caches, larger blocks will lead to the increased possibility of throwing out a block that may be in current use.

R1b.11.

It takes more time to retrieve data from a larger cache. So it pays to look quickly into the 64KB L1 cache first; only rarely will we end up needing to spend the extra time to look into the backup 256KB cache.

S0:	0.20 ns
S1:	0.05 ns
S2:	0.10 ns
S3:	0.15 ns
S4:	0.05 ns
S5:	0.20 ns
S6:	0.10 ns
S7:	0.15 ns

1.4	in fetching instructions from a dedicated 32KB instruction cache
38.4	in accessing data from a dedicated 32KB data cache
39.4	in accessing instructions and data from a unified 64KB cache