Exam 1 Review B
Problem R1b.1.
Suppose we can divide the execution of an
instruction into eight possible pipeline stages:
S0: 0.20 ns
S1: 0.05 ns
S2: 0.10 ns
S3: 0.15 ns
S4: 0.05 ns
S5: 0.20 ns
S6: 0.10 ns
S7: 0.15 ns
However, inserting pipeline registers between stages requires
adding 0.05 ns to the time for the stage.
a. If we use this proposed 8-stage pipeline, how long
must each clock cycle be?
b. Now suppose that you want to combine stages while
maintaining the same clock rate. Which pipeline stages would you
combine? How many stages result?
c. Now suppose that you want to combine stages with
each clock cycle taking 0.05 ns longer.
Which pipeline stages would you
combine? How many stages result?
d. Usually more stages enable a faster clock, which
means that instructions are completed more quickly.
Explain why fewer stages will sometimes enable instructions to
complete more quickly.
a. 0.25 ns (the length of the longest stage, plus
0.05 ns)
b. Combine S1/S2 and S3/S4, giving a six-stage
pipeline.
c. Combine S0/S1, S2/S3, S4/S5, S6/S7, giving a
four-stage pipeline.
d. A pipeline with fewer stages can lead to shorter stalls:
For example, if an add instruction needs its data starting in
stage S4, and it depends on the result of the previous
instruction, which doesn't become available until that
instruction reaches S7, then the eight-stage pipeline will
require stalling the add instruction for three clock cycles;
the four-stage pipeline of part c would require stalling for
only one clock cycle.
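As a quick check on parts b and c, here is a minimal C sketch (my own
addition, not part of the original problem) that merges adjacent stages
greedily, starting a new combined stage whenever the accumulated raw
latency would exceed the budget implied by the target clock (the clock
period minus the 0.05 ns register overhead). Latencies are kept in
hundredths of a nanosecond so the comparisons stay exact.

#include <stdio.h>

#define NSTAGES 8

int main(void) {
    /* stage latencies in hundredths of a nanosecond */
    int stage[NSTAGES] = {20, 5, 10, 15, 5, 20, 10, 15};
    int budgets[2] = {20, 25};   /* raw-latency budgets for parts b and c */

    for (int b = 0; b < 2; b++) {
        int current = 0, nstages = 1;
        printf("clock 0.%02d ns: ", budgets[b] + 5);  /* + register overhead */
        for (int i = 0; i < NSTAGES; i++) {
            if (current + stage[i] > budgets[b]) {  /* start a new combined stage */
                nstages++;
                current = 0;
                printf("| ");
            }
            current += stage[i];
            printf("S%d ", i);
        }
        printf("-> %d stages\n", nstages);
    }
    return 0;
}

For the 0.20 ns budget it prints the six-stage split
S0 | S1 S2 | S3 S4 | S5 | S6 | S7 of part b, and for the 0.25 ns budget
the four pairs of part c.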
Problem R1b.2.
Explain data hazards in the context of pipelining,
and describe a technique for reducing their impact.
A data hazard occurs when one instruction depends on data
being computed by a previous instruction that is still in the
pipeline (and thus not yet complete). For example, in the
following sequence, the add instruction requires $t0 from the
preceding lw instruction; but a naïve pipeline implementation
would retrieve the value of $t0 for add before the lw
instruction manages to write the new value back into a register.
lw $t0, ($sp)
add $t1, $t1, $t0
The primary way to reduce the impact of data hazards
is to use forwarding: We add data pathways into the
pipeline so that the computed value can be sent back to an
earlier stage where another instruction might want it.
In our example, we'd arrange for the add instruction to be in
the EX stage when lw reaches the WB stage, and we'd have a
pathway for the $t0 value to be sent directly into the EX stage.
(An alternative technique for reducing the impact of data
hazards is to use out-of-order execution.)
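To make the timing concrete, here is a minimal C sketch (my own framing,
not from the text) of the arithmetic behind forwarding stalls: if a value
becomes available at the end of stage number avail, and a consumer issued
dist cycles after the producer needs it entering stage need, the consumer
must stall max(0, (avail + 1) - (need + dist)) cycles.

#include <stdio.h>

/* bubbles needed between a producer and a consumer, with forwarding */
static int stalls(int avail, int need, int dist) {
    int s = (avail + 1) - (need + dist);
    return s > 0 ? s : 0;
}

int main(void) {
    /* classic five-stage pipeline: IF=0, ID=1, EX=2, MEM=3, WB=4.
       lw's value is ready at the end of MEM (3); the adjacent add
       needs it entering EX (2), so one bubble is required -- which is
       why add sits in EX during the same cycle that lw sits in WB. */
    printf("load-use stall: %d cycle(s)\n", stalls(3, 2, 1));
    /* the eight-stage example from problem R1b.1d: ready at the end of
       S7, needed entering S4, adjacent instructions -> three bubbles */
    printf("R1b.1d stall:   %d cycle(s)\n", stalls(7, 4, 1));
    return 0;
}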
Problem R1b.3.
Complete the following diagram of how instructions progress
through the classic five-stage pipeline,
assuming that the processor supports data forwarding.
Draw arrows indicating points where forwarding is necessary.
lw   $t0, ($sp)       | IF | ID | EX | MEM | WB |
add  $sp, $sp, 4      |
lw   $t1, ($sp)       |
xor  $t2, $t0, $t1    |
sub  $t3, $t0, $t1    |
andi $t3, $t3, 0xFFFF |
lw   $t0, ($sp)       | IF  | ID  | EX  | MEM | WB  |
add  $sp, $sp, 4      |     | IF  | ID  | EX  | MEM | WB  |
lw   $t1, ($sp)       |     |     | IF  | ID  | EX  | MEM | WB  |
xor  $t2, $t0, $t1    |     |     |     | IF  | ID  | stl | EX  | MEM | WB  |
sub  $t3, $t0, $t1    |     |     |     |     | IF  | stl | ID  | EX  | MEM | WB  |
andi $t3, $t3, 0xFFFF |     |     |     |     |     |     | IF  | ID  | EX  | MEM | WB  |
The following arrows would need to be indicated:
- from after the add's EX stage to before the second lw's EX stage (for $sp)
- from after the second lw's MEM stage to before the xor's EX stage (for $t1)
- from after the sub's EX stage to before the andi's EX stage (for $t3)
Problem R1b.4.
Explain control hazards in the context of
pipelining, and describe two techniques to reduce their
impact.
Control hazards are situations where new instructions cannot
be fetched into the pipeline because the future sequence of
instructions is unknown. This is typically due to a branch or jump
instruction. Their impact can be reduced by any of the following.
- Move the computation of the next instruction's address to an
earlier stage in the pipeline.
- In speculative execution, the processor guesses what the next
instruction will be, so that it can immediately begin executing
succeeding instructions accordingly. If the guess is wrong, the
succeeding instructions can promptly be canceled; but if the guess
is correct, the pipeline stays fully utilized through the branch.
- Sophisticated branch prediction improves the performance of
speculative execution further by keeping track of past behavior
to improve the accuracy of the guesses.
- A delay slot, where the instruction after each branch is
executed regardless of the branch decision, allows at least one
instruction after each branch not to be wasted.
Problem R1b.5.
Suppose we have an 8-stage pipeline that can
complete one instruction each clock cycle
in the absence of any jumps or branches.
(We assume there are no data or structural hazards.)
For jumps or branches, the destination of the jump is known at
the end of the pipeline's second stage, whereas
the decision of whether a branch is taken is determined
at the end of the pipeline's fourth stage.
Assume that 5% of instructions in a typical program are jump
instructions, while 15% of instructions are
branch instructions, of which 2/3 are taken.
a. What is the average number of clock cycles per
instruction if we always stall for every jump or branch until
the next instruction's location is known?
b. What is the average number of clock cycles per
instruction if we speculatively execute assuming that branches
are not taken?
c. What is the average number of clock cycles per
instruction if we speculatively execute assuming that branches
are taken?
Note that 10% of instructions are branches that are taken,
while 5% of instructions are branches that are not taken.
a. 0.8 ⋅ 1 + 0.05 ⋅ 2 + 0.15 ⋅ 4
= 0.8 + 0.1 + 0.6 = 1.5
b. 0.8 ⋅ 1 + 0.05 ⋅ 2 + 0.1 ⋅ 4 + 0.05 ⋅ 1
= 0.8 + 0.1 + 0.4 + 0.05 = 1.35
c. 0.8 ⋅ 1 + 0.05 ⋅ 2 + 0.1 ⋅ 2 + 0.05 ⋅ 4
= 0.8 + 0.1 + 0.2 + 0.2 = 1.3
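Since the three parts differ only in which instructions pay which
penalty, the averages are easy to check mechanically; here is a
throwaway C sketch (my own, not from the original solution):

#include <stdio.h>

int main(void) {
    double jump = 0.05, taken = 0.10, not_taken = 0.05;
    double plain = 1.0 - jump - taken - not_taken;   /* 0.80 */

    /* a: stall until the next address is known
       (jumps: stage 2; branches: stage 4) */
    double a = plain * 1 + jump * 2 + (taken + not_taken) * 4;
    /* b: predict not taken; only taken branches pay the 4-cycle cost */
    double b = plain * 1 + jump * 2 + taken * 4 + not_taken * 1;
    /* c: predict taken; the target is known at stage 2, so taken
       branches cost 2 cycles, mispredicted not-taken branches cost 4 */
    double c = plain * 1 + jump * 2 + taken * 2 + not_taken * 4;

    printf("a = %.2f, b = %.2f, c = %.2f\n", a, b, c);  /* 1.50, 1.35, 1.30 */
    return 0;
}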
Problem R1b.6.
Suppose a processor uses a basic 4096-entry branch prediction buffer,
where each entry is exactly one bit long.
Explain how the processor accesses and uses the buffer when it reaches a
branch instruction.
The processor uses the lower 12 bits of the PC as an index
into the buffer. If it finds a 0 there, it predicts that the
branch will not be taken (so the next instruction fetched will
be the one immediately after the branch). If it finds a 1, it
predicts the branch will be taken (so the next instruction
fetched will be the one indicated by the branch instruction).
Later in the pipeline, when the branch decision is made,
the bit is updated to reflect the more recent decision.
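To make the mechanism concrete, here is a minimal C sketch of such a
one-bit buffer; the function names predict and update are my own
invention, not from the text.

#include <stdint.h>
#include <stdbool.h>

static uint8_t bpb[4096];   /* one prediction bit per entry */

/* look up the prediction for the branch at address pc */
bool predict(uint32_t pc) {
    return bpb[pc & 0xFFF] != 0;   /* index with the lower 12 bits of the PC */
}

/* once the branch resolves later in the pipeline, record the outcome */
void update(uint32_t pc, bool taken) {
    bpb[pc & 0xFFF] = taken ? 1 : 0;
}

Note that two branches whose addresses agree in their lower 12 bits
share an entry; the predictor simply tolerates such aliasing.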
Problem R1b.7.
Suppose we have the following global declarations in C for
representing a cache with 16 lines of 8 bytes each, together
with a 1,024-byte memory.
struct cache_line {
    int tag;
    char data[8];
};

struct cache_line cache[16];
char mem[1024];
The following function uses this structure
to simulate the retrieval of a value from a direct-mapped cache.
char fetch(int addr) {
    int tag  = addr >> 7;          /* bits 7 and up */
    int indx = (addr >> 3) & 0xF;  /* bits 6-3 select one of 16 lines */
    int offs = addr & 0x7;         /* bits 2-0 select a byte in the line */
    int i;

    if (cache[indx].tag == tag) {          /* hit */
        return cache[indx].data[offs];
    } else {                               /* miss: refill the line */
        cache[indx].tag = tag;
        for (i = 0; i < 8; i++) {
            cache[indx].data[i] = mem[(tag << 7) | (indx << 3) | i];
        }
        return cache[indx].data[offs];
    }
}
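To trace through a hypothetical example: for addr = 0x1AB, the code
computes tag = 0x1AB >> 7 = 3, indx = (0x1AB >> 3) & 0xF = 5, and
offs = 0x1AB & 0x7 = 3; so the function consults cache line 5 and, on a
miss, refills it from mem[0x1A8] through mem[0x1AF] before returning the
byte at offset 3.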
How can we modify this to simulate a two-way set-associative
cache instead? Presume that we are using a random-replacement
policy, using a function randbit()
taking no parameters
to generate a random bit.
char fetch(int addr) {
    int tag  = addr >> 6;          /* one former index bit joins the tag */
    int indx = (addr >> 3) & 0x7;  /* 8 sets of two lines each */
    int offs = addr & 0x7;
    int opt, i;

    /* check both lines of the set for a hit */
    for (opt = 2 * indx; opt <= 2 * indx + 1; opt++) {
        if (cache[opt].tag == tag) {
            return cache[opt].data[offs];
        }
    }

    /* miss: choose one of the set's two lines at random and refill it */
    opt = 2 * indx + randbit();
    cache[opt].tag = tag;
    for (i = 0; i < 8; i++) {
        cache[opt].data[i] = mem[(tag << 6) | (indx << 3) | i];
    }
    return cache[opt].data[offs];
}
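The only substantive changes from the direct-mapped version: one index
bit moves into the tag (shifts of 6 rather than 7, and an index mask of
0x7 rather than 0xF, giving eight sets of two lines each), the lookup
loop checks both lines of the set, and on a miss randbit() picks which
of the two lines to evict, implementing the random-replacement policy.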
Problem R1b.8.
Increasing the associativity of a cache (such as going from
2-way to 4-way set-associativity) always reduces the
miss rate. Why, then, do caches usually have a low degree of
associativity?
First, the reduction in miss rate from increasing associativity
becomes quite small, particularly beyond 8-way associativity,
while greater associativity increases circuit complexity
(i.e., requires more transistors); at some point it is no longer
a cost-efficient way to spend transistors.
Second, greater associativity requires a slightly deeper circuit,
so each cache access takes slightly more time. This increased
time per access eventually outweighs the improvement in miss rate.
Problem R1b.9.
Suppose that benchmarks indicate the following cache miss
rates per 1,000 instructions on a pipelined processor.
1.4  | in fetching instructions from a dedicated 32KB instruction cache |
38.4 | in accessing data from a dedicated 32KB data cache |
39.4 | in accessing instructions and data from a unified 64KB cache |
Suppose that the miss penalty is 150 clock cycles, and that
40% of instructions access data from memory (or cache).
We assume that the
processor does not stall for any reason unrelated to memory access,
and we ignore the rare occurrence of missing the cache
simultaneously in fetching an instruction and in accessing data.
a. If we have separate 32KB caches, what is the average number
of clock cycles per instruction?
b. What if we have a unified 64KB cache instead?
a. 1 + 0.0014 ⋅ 150 + 0.0384 ⋅ 150
= 6.97 cycles.
b. 1 + 0.0394 ⋅ 150 + 0.4
= 7.31 cycles.
(The extra 0.4 arises because the unified cache creates a
structural hazard: the 40% of instructions that access data must
each stall one cycle while contending with instruction fetch for
the single cache.)
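As with the earlier CPI problem, the arithmetic is easy to check
mechanically; a throwaway C sketch (my own) under the same assumptions:

#include <stdio.h>

int main(void) {
    double penalty = 150.0;   /* clock cycles per miss */

    /* split caches: instruction and data miss rates per instruction */
    double split = 1.0 + (1.4 / 1000) * penalty + (38.4 / 1000) * penalty;

    /* unified cache: one miss rate, plus a one-cycle structural-hazard
       stall for the 40% of instructions that access data */
    double unified = 1.0 + (39.4 / 1000) * penalty + 0.4;

    printf("split: %.2f  unified: %.2f\n", split, unified);  /* 6.97, 7.31 */
    return 0;
}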
Problem R1b.10.
As we saw, for a fixed-size cache,
the miss rate tends to decrease as we grow
the number of bytes stored per cache block,
until some point where it begins to increase.
Why do small block sizes tend to lead to larger miss rates?
And why do large block sizes tend to lead to larger miss rates?
With small block sizes, the cache does not work well for
programs that access adjacent instructions/data in memory.
A very basic example is stepping through an array:
A smaller block size will lead the cache to fetch fewer array
entries on a miss, so it will miss more frequently.
With large block sizes, there will be fewer blocks;
particularly with small caches, larger blocks increase the
chance of evicting a block that is still in use.
Problem R1b.11.
As we saw, many processors use a multi-level approach to
caching: They might have a 64KB L1 cache backed by a 256KB L2 cache.
If we can afford a single-level 512KB cache
instead, why would we not simply prefer the single-level approach?
It takes more time to retrieve data from a larger cache.
So it pays to look quickly into the 64KB L1 cache first; only rarely
will we end up needing to spend the extra time to look into the backup
256KB cache.