Exam 1 Review B: Questions
Suppose we can divide the execution of an instruction into eight possible pipeline stages:
S0: 0.20 ns S1: 0.05 ns S2: 0.10 ns S3: 0.15 ns S4: 0.05 ns S5: 0.20 ns S6: 0.10 ns S7: 0.15 ns
However, inserting pipeline registers between stages requires adding 0.05 ns to the time for the stage.
a. If we use this proposed 8-stage pipeline, how long must each clock cycle be?
b. Now suppose that you want to combine stages while maintaining the same clock rate. Which pipeline stages would you combine? How many stages result?
c. Now suppose that you want to combine stages with each clock cycle taking 0.05 ns longer. Which pipeline stages would you combine? How many stages result?
d. Usually more stages enables a faster clock, which means that instructions are completed more quickly. Explain why sometimes fewer stages will enable instructions to complete more quickly.
Explain data hazards in the context of pipelining, and describe a technique for reducing their impact.
Complete the following diagram of how instructions progress through the classic five-stage pipeline, assuming that the processor supports data forwarding. Draw arrows indicating points where forwarding is necessary.
lw $t0, ($sp) |
IF | ID | EX | MEM | WB |
add $sp, $sp, 4 | |||||
lw $t1, ($sp) | |||||
xor $t2, $t0, $t1 | |||||
sub $t3, $t0, $t1 | |||||
andi $t3, $t3, 0xFFFF |
Explain control hazards in the context of pipelining, and describe two techniques to reduce their impact.
Suppose we have an 8-stage pipeline that can complete one instruction each clock cycle in the absence of any jumps or branches. (We assume there are no data or structural hazards.) For jumps or branches, the destination of the jump is known at the end of the pipeline's second stage, whereas the decision of whether a branch is taken is determined at the end of the pipeline's fourth stage.
Assume that 5% of instructions in a typical program are jump instructions, while 15% of instructions are branch instructions, of which 2/3 are taken.
a. What is the average number of clock cycles per instruction if we always stall for every jump or branch until the next instruction's location is known?
b. What is the average number of clock cycles per instruction if we speculatively execute assuming that branches are not taken?
c. What is the average number of clock cycles per instruction if we speculatively execute assuming that branches are taken?
Suppose a processor uses a a basic 4096-entry branch prediction buffer, where each entry is exactly one bit long. Explain how the processor accesses and uses the buffer when it reaches a branch instruction.
Suppose we have the following global declarations in C for representing memory with 16 lines of 8 bytes each.
struct cache_line {
int tag;
char data[8];
};
struct cache_line cache[16];
char mem[1024];
The following function uses this structure to simulate the retrieval of a value from a direct-mapped cache.
char fetch(int addr) {
int tag = addr >> 7;
int indx = (addr >> 3) & 0xF;
int offs = addr & 0x7;
int i;
if (cache[indx].tag == tag) {
return cache[indx].data[offs];
} else {
cache[indx].tag = tag;
for (i = 0; i < 8; i++) {
cache[indx].data[i] = mem[(tag << 7) | (indx << 3) | i];
}
return cache[indx].data[offs];
}
}
How can we modify this to simulate a two-way set-associative
cache instead? Presume that we are using a random-replacement
policy, using a function randbit()
taking no parameters
to generate a random bit.
Increasing the associativity of a cache (such as going from 2-way to 4-way set-associativity) always reduces the miss rate. Why, then, do caches usually have a low degree of associativity?
Suppose that benchmarks indicate the following cache miss rates per 1,000 instructions on a pipelined processor.
1.4 in fetching instructions from a dedicated 32KB instruction cache 38.4 in accessing data from a dedicated 32KB data cache 39.4 in accessing instructions and data from a unified 64KB cache
Suppose that the miss penalty is 150 clock cycles, and that 40% of instructions access data from memory (or cache). We assume that the processor does not stall for any reason unrelated to memory access, and we ignore the rare occurrence of missing the cache simultaneously in fetching an instruction and in accessing data.
a. If we have separate 32KB caches, what is the average number of clock cycles per instruction?
b. What if we have a unified 64KB cache instead?
As we saw, for a fixed-size cache, the miss rate tends to decrease as we grow the number of bytes stored per cache block, until some point where it begins to increase. Why do small block sizes tend to lead to larger miss rates? And why do large block sizes tend to lead to larger miss rates?
As we saw, many processors use a multi-layer approach to caching: They might have a 64KB L1 cache backed by a 256KB L2 cache. If we can afford a single-layer 512KB cache instead, why would we not simply prefer the single-layer approach?
Exam 1 Review B: Solutions
a. 0.25 ns (the length of the longest stage, plus 0.05 ns)
b. Combine S1/S2 and S3/S4, giving a six-stage pipeline.
c. Combine S0/S1, S2/S3, S4/S5, S6/S7, giving a four-stage pipeline.
d. A pipeline with fewer stages can lead to shorter
stalls: For example, if an add
instruction needs its data
starting in stage S4, and it depends on the result of the
previous instruction, which doesn't become
available until that instruction reaches S7, then the eight-stage
pipeline will require stalling the add
instruction for
three clock cycles; the four-stage pipeline of part c
would only require stalling for one clock cycle.
A data hazard occurs when one instruction depends on data
being computed by a previous instruction that is still in the
pipeline (and thus not yet complete). For example, in the
following sequence, the add
instruction requires
$t0
from the preceding lw
instruction;
but a na¨i;ve pipeline implementation would retrieve the
value of $t0
for add
before the lw
instruction manages to write the new value back into a register.
lw $t0, ($sp)
add $t1, $t1, $t0
The primary way to reduce the impact of data hazards
is to use forwarding: We add data pathways into the
pipeline so that the computed value can be sent back to an
earlier stage where another instruction might want it.
In our example, we'd arrange for the add
instruction to
be in the EX stage when lw
reaches the WB stage, and
we'd have a pathway for the $t0
value to be sent
directly into the EX stage.
(An alternative technique for reducing the impact of data hazards is to use out-of-order execution.)
lw $t0, ($sp) |
IF | ID | EX | MEM | WB | ||||||
add $sp, $sp, 4 | IF | ID | EX | MEM | WB | ||||||
lw $t1, ($sp) | IF | ID | EX | MEM | WB | ||||||
xor $t2, $t0, $t1 | IF | ID | stl | EX | MEM | WB | |||||
sub $t3, $t0, $t1 | IF | stl | ID | EX | MEM | WB | |||||
andi $t3, $t3, 0xFFFF | IF | ID | EX | MEM | WB |
The following arrows would need to be indicated:
- from after the
add
's EX stage to before the secondlw
's EX stage - from after the second
lw
's MEM stage to before thexor
's EX stage - from after the
sub
's EX stage to before theandi
's EX stage
Control hazards are situations where new instructions cannot be fetched into the pipeline since the future sequence of instructions is unknown. This is typically due to a branch or jump instructions. Their impact is reduced by any of the following.
Move the computation of which instruction is next to come earlier in the pipeline.
In speculative execution, the processor guesses what the next instruction will be, so that it can immediately begin executing succeeding instructions accordingly. If the guess is wrong, the succeeding instructions can promptly be canceled; but if the guess is correct, the pipeline will be fully utilized throughout the branch.
Sophisticated branch prediction will improve the performance of speculative execution further, by keeping track of past behavior to improve the accuracy of the guesses.
A delay slot, where the instruction after each branch will be executed regardless of the branch decision, allows at least one instruction after each branch to not be wasted.
Note that 10% of instructions are branches that are taken, while 5% of instructions or branches that are not taken.
a. 0.8 ⋅ 1 + 0.05 ⋅ 2 + 0.15 ⋅ 4 = 0.8 + 0.1 + 0.6 = 1.5
b. 0.8 ⋅ 1 + 0.05 ⋅ 2 + 0.1 ⋅ 4 + 0.05 ⋅ 1 = 0.8 + 0.1 + 0.4 + 0.05 = 1.35
c. 0.8 ⋅ 1 + 0.05 ⋅ 2 + 0.1 ⋅ 2 + 0.05 ⋅ 4 = 0.8 + 0.1 + 0.2 + 0.2 = 1.3
The processor uses the lower 12 bits of the PC as an index into the buffer. If it finds a 0 there, it predicts that the branch will not be taken (so the next instruction fetched will be the one immediately after the branch). If it finds a 1, it predicts the branch will be taken (so the next instruction fetched will be the one indicated by the branch instruction). Later in the pipeline, when the branch decision is made, the bit is updated to reflect the more recent decision.
char fetch(int addr) {
int tag = addr >> 6;
int indx = (addr >> 3) & 0x7;
int offs = addr & 0x7;
int opt, i;
for (opt = 2 * indx; opt < 2 * indx + 1; opt++) {
if (cache[opt].tag == tag) {
return cache[opt].data[offs];
}
}
opt = 2 * indx + randbit();
cache[opt].tag = tag;
for (i = 0; i < 8; i++) {
cache[opt].data[i] = mem[(tag << 6) | (indx << 3) | i];
}
return cache[opt].data[offs];
}
First, the reduction in miss rate from increasing associativity becomes quite small, particularly beyond an 8-way cache. However, having greater associativity increases circuit complexity (i.e., requires more transistors), and at some point it is no longer a cost-efficient way to spend transistors.
Second, greater associativity does require a slightly deeper circuit, and so increasing associativity will require slightly more time for each cache access. This increased time per cache access eventually outweighs the improvement in miss rate.
a. 1 + 0.0014 ⋅ 150 + 0.0384 ⋅ 150 = 6.97 cycles.
b. 1 + 0.0394 ⋅ 150 + 0.4 = 7.31 cycles.
With small block sizes, the cache does not work well for programs that access adjacent instructions/data in memory. A very basic example is stepping through an array: A smaller block size will lead the cache to fetch fewer array entries on a miss, so it will miss more frequently.
With large block sizes, there will be fewer blocks; particularly with small caches, larger blocks will lead to the increased possibility of throwing out a block that may be in current use.
It takes more time to retrieve data from a larger cache. So it pays to look quickly into the 64KB L1 cache first; only rarely will we end up needing to spend the extra time to look into the backup 256KB cache.