Exam 2 Review A
Problem R2a.1.
Why is virtual memory a concern for L1 cache performance?
How can the performance penalty be reduced for cache accesses?
With virtual memory, each memory address a program issues must be
translated to a physical address before a straightforward cache can
use it. This translation adds at least one additional clock cycle
to the memory access time.
To reduce the performance penalty, a processor can
use the fact that the lower 12 bits of the physical address match
the lower 12 bits of the virtual address (with 4 KiB pages, the
page offset passes through translation unchanged). If the cache needs
only the lower 12 bits of the physical address to determine which set
contains the requested data, it can retrieve the requested set
using the lower 12 bits of the virtual address, without waiting
for the address to be translated. The virtual address
translation can proceed at the same time this set is
being retrieved.
For this reason, a CPU designer might choose the
L1 cache parameters so that the set is fully determined by the
lower 12 bits; for example, with 64-byte cache lines, we
could have at most 2^12 / 64 = 64 sets.
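As a concrete illustration, here is a minimal Python sketch (assuming 4 KiB pages and the 64-byte-line, 64-set configuration above; the example page numbers and offset are made up). Because translation never changes the low 12 bits, the set index computed from the virtual address equals the one computed from the physical address:

```python
# Sketch: with 4 KiB pages, the low 12 bits of the virtual and the
# physical address are identical, so a cache whose set index fits
# entirely in those bits can be indexed before translation finishes.
PAGE_BITS = 12
LINE_SIZE = 64                              # bytes per cache line
NUM_SETS = (1 << PAGE_BITS) // LINE_SIZE    # 2^12 / 64 = 64 sets

def set_index(addr):
    """Set index = (addr / line size) mod (number of sets)."""
    return (addr // LINE_SIZE) % NUM_SETS

# Hypothetical translation: virtual page 5 maps to physical page 9;
# the page offset (low 12 bits) is unchanged by translation.
virt = (5 << PAGE_BITS) | 0x2A4
phys = (9 << PAGE_BITS) | 0x2A4

assert set_index(virt) == set_index(phys)   # same set either way
```

The assertion holds for any choice of pages, which is exactly why the set lookup can overlap with translation.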
Problem R2a.2.
For the following instruction sequence, identify
all (a.) RAW, (b.) WAR, and (c.) WAW
data dependencies.
a: add $t0, $s0, $s1
b: xor $t1, $s0, $s1
c: and $t2, $t0, $t1
d: or $t0, $s0, $t2
e: sub $t2, $t0, $s1
a. RAW: a→c, b→c, c→d, d→e
b. WAR: c→d, d→e
c. WAW: a→d, c→e
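The classification above can also be checked mechanically. This is a small Python sketch; the instruction list is hand-encoded from the sequence above, and the scan stops tracking an instruction once its destination is redefined (a simplification that is exact for this sequence):

```python
# Sketch: classifying data dependencies in the five-instruction
# sequence. Each instruction is (label, destination, sources).
insns = [
    ('a', 't0', ['s0', 's1']),   # add $t0, $s0, $s1
    ('b', 't1', ['s0', 's1']),   # xor $t1, $s0, $s1
    ('c', 't2', ['t0', 't1']),   # and $t2, $t0, $t1
    ('d', 't0', ['s0', 't2']),   # or  $t0, $s0, $t2
    ('e', 't2', ['t0', 's1']),   # sub $t2, $t0, $s1
]

raw, war, waw = [], [], []
for i, (li, di, si) in enumerate(insns):
    for lj, dj, sj in insns[i + 1:]:
        if di in sj:
            raw.append((li, lj))   # j reads i's result
        if dj in si:
            war.append((li, lj))   # j overwrites a register i reads
        if dj == di:
            waw.append((li, lj))   # j overwrites i's destination
            break                  # di is redefined; stop tracking i
```

Running this reproduces the three lists in the answer above.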
Problem R2a.3.
Even in a single-pipeline processor, executing instructions out
of order can sometimes be beneficial. Provide an example of this
based on the classic five-stage MIPS pipeline,
and explain how out-of-order execution would help.
Suppose we had the following instruction sequence.
lw $t1, 0($a0)
ori $t2, $t1, 1
addi $a0, $a0, 4
In a pipelined processor doing in-order execution, the lw
instruction would reach the MEM stage at the same
time that the ori instruction reaches the EX stage.
The ori instruction would have to stall at this point,
because the lw instruction will not have determined
the value for $t1 (which the EX stage needs to execute
the ori instruction) until the MEM stage completes.
This stall adds one clock cycle to the time needed to complete
this sequence.
This wasted clock cycle could be avoided, however, if the processor
executes the addi instruction first. Then the
addi instruction (which doesn't depend on the result of
the lw instruction) could complete the EX stage at the time
that the lw instruction completes the MEM stage.
The ori instruction would complete the EX stage at the
same time as it otherwise would have done.
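A rough way to see the one-cycle saving is to count issue cycles plus load-use bubbles. The Python sketch below is a simplification, not a full pipeline model: it assumes full forwarding, so the only stall it counts is the single bubble when an instruction uses a load result in the very next cycle, and it adds four cycles at the end to drain the five-stage pipeline:

```python
# Sketch: cycle counts for the three-instruction sequence, with and
# without reordering. Instructions are (opcode, dest, sources).
def total_cycles(seq):
    cycle = 0
    prev_load_dest = None
    for op, dest, srcs in seq:
        if prev_load_dest in srcs:
            cycle += 1               # one bubble: load-use stall
        cycle += 1                   # one issue slot per instruction
        prev_load_dest = dest if op == 'lw' else None
    return cycle + 4                 # drain the remaining 4 stages

in_order = [('lw',   't1', ['a0']),
            ('ori',  't2', ['t1']),   # uses $t1 right after the lw
            ('addi', 'a0', ['a0'])]
reordered = [('lw',   't1', ['a0']),
             ('addi', 'a0', ['a0']),  # independent work fills the slot
             ('ori',  't2', ['t1'])]
```

Under these assumptions the in-order sequence takes one cycle more than the reordered one.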
Problem R2a.4.
Consider the following instruction sequence to add all
elements in an array.
loop: lw $t1, 0($a0)
add $s2, $s2, $t1
addi $a0, $a0, 4
bne $a0, $a1, loop
Since the lw instruction takes 1 + 1 cycles
to get its result (one cycle plus a one-cycle load-use stall),
this instruction sequence takes 5 cycles per
array element. Using loop unrolling, show how the sequence can be
rewritten to take just 3 cycles per array element, and explain.
loop: lw $t1, 0($a0)
lw $t2, 4($a0)
add $s2, $s2, $t1
add $s2, $s2, $t2
addi $a0, $a0, 8
bne $a0, $a1, loop
For this sequence, no stalls will be necessary: The first
lw doesn't lead to a stall since its result isn't
needed until two instructions later, and similarly for the
second lw. So each iteration takes 6 cycles to handle
two array elements, which translates to 3 cycles per array element.
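The 5-versus-3 figure can be confirmed by counting issue cycles plus load-use bubbles per iteration. This Python sketch is a simplification (instruction lists hand-encoded, one issue per cycle, the branch costing one cycle, and a single bubble only when an instruction uses the result of the immediately preceding lw):

```python
# Sketch: steady-state cycles per array element for one loop body.
def cycles_per_element(body, elems_per_iter):
    cycles = 0
    prev_load_dest = None
    for op, dest, srcs in body:
        if prev_load_dest in srcs:
            cycles += 1              # load-use stall bubble
        cycles += 1                  # one issue slot per instruction
        prev_load_dest = dest if op == 'lw' else None
    return cycles / elems_per_iter

original = [('lw',   't1', ['a0']),
            ('add',  's2', ['s2', 't1']),   # stalls behind the lw
            ('addi', 'a0', ['a0']),
            ('bne',  '',   ['a0', 'a1'])]
unrolled = [('lw',   't1', ['a0']),
            ('lw',   't2', ['a0']),
            ('add',  's2', ['s2', 't1']),   # $t1 ready: no stall
            ('add',  's2', ['s2', 't2']),   # $t2 ready: no stall
            ('addi', 'a0', ['a0']),
            ('bne',  '',   ['a0', 'a1'])]
```

Under these assumptions the original body costs 5 cycles per element and the unrolled body 3.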
Problem R2a.5.
A dynamic out-of-order processor uses
register renaming to reduce the number of apparent conflicts between
instructions. Rewrite the MIPS code below to reflect how it
would look after renaming; in renaming, use the register names
$v0 through $v15. The $zero register should not
be renamed.
addi $t0, $zero, 0
addi $t1, $zero, 4
add $t0, $t0, $t1
addi $t1, $t1, -1
add $t0, $t0, $t1
addi $t1, $t1, -1
add $t0, $t0, $t1
addi $t1, $t1, -1
add $t0, $t0, $t1
addi $v0, $zero, 0
addi $v1, $zero, 4
add $v2, $v0, $v1
addi $v3, $v1, -1
add $v4, $v2, $v3
addi $v5, $v3, -1
add $v6, $v4, $v5
addi $v7, $v5, -1
add $v8, $v6, $v7
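The renaming above can be expressed as a table mapping each architectural register to its most recent physical name. This is a minimal Python sketch, assuming (as the problem specifies) that every write allocates the next fresh $v register and that $zero is never renamed:

```python
# Sketch: register renaming with a fresh physical register per write.
def rename(insns):
    mapping = {}                                # arch -> physical name
    fresh = iter(f'$v{i}' for i in range(16))
    out = []
    for op, dest, srcs in insns:
        # Sources (including immediates, passed through) use the
        # most recent mapping; then the destination gets a new name.
        srcs = [mapping.get(s, s) for s in srcs]
        if dest != '$zero':
            mapping[dest] = next(fresh)
        out.append((op, mapping.get(dest, dest), srcs))
    return out

code = [('addi', '$t0', ['$zero', '0']),
        ('addi', '$t1', ['$zero', '4']),
        ('add',  '$t0', ['$t0', '$t1']),
        ('addi', '$t1', ['$t1', '-1']),
        ('add',  '$t0', ['$t0', '$t1']),
        ('addi', '$t1', ['$t1', '-1']),
        ('add',  '$t0', ['$t0', '$t1']),
        ('addi', '$t1', ['$t1', '-1']),
        ('add',  '$t0', ['$t0', '$t1'])]
renamed = rename(code)
```

The output matches the answer above instruction for instruction.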
Problem R2a.6.
Explain why a dynamic out-of-order processor should use some
technique for renaming registers, with an example where register
renaming helps processor performance.
Without register renaming, the processor must unnecessarily delay
any instruction that results in a WAR or WAW dependency;
this dramatically reduces the opportunities for out-of-order execution.
Consider the following example.
add.d $f6, $f0, $f2
add.d $f8, $f6, $f4
add.d $f6, $f0, $f2
add.d $f10, $f6, $f4
Note that the third instruction results in a WAR dependency:
The second instruction reads $f6, and the third
instruction writes to it.
The processor would need to wait until it is sure the second
instruction has retrieved the proper value for $f6
before it allows the third (or fourth) instruction to begin.
However, renaming would lead the third instruction to write to a
different register, so the processor could safely begin the third
instruction immediately.
Problem R2a.7.
What is the function of a “reservation station”
in Tomasulo's algorithm (without speculation)?
Whenever an instruction is issued, all information necessary
to execute the instruction is stored in its reservation
station. The reservation station's ID also identifies which
value should be used by subsequent instructions that depend on
the instruction's result.
Problem R2a.8.
Suppose our processor uses Tomasulo's algorithm without
speculation, incorporating a 4-cycle addition unit and
a 7-cycle multiplication unit. Reservation stations 0 and 1 are
for the addition unit, while stations 2 and 3 are for the
multiplication unit; when choosing among available stations, the
one with the least ID is chosen.
Complete the following table showing how
each instruction proceeds through the processor.
instruction           | res. stat. | issue cycle | dispatch cycle | commit cycle
mul.d $f6, $f0, $f2   |            |      0      |                |
add.d $f8, $f4, $f6   |            |             |                |
add.d $f10, $f0, $f2  |            |             |                |
add.d $f12, $f4, $f10 |            |             |                |
instruction           | res. stat. | issue cycle | dispatch cycle | commit cycle
mul.d $f6, $f0, $f2   |     2      |      0      |       1        |      8
add.d $f8, $f4, $f6   |     0      |      1      |       9        |     13
add.d $f10, $f0, $f2  |     1      |      2      |       3        |      7
add.d $f12, $f4, $f10 |     1      |      8      |      10        |     14
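The filled table can be checked with simple cycle arithmetic. The sketch below encodes one reasonable set of assumptions consistent with the table: an instruction dispatches the cycle after issue if its operands are ready; a result broadcasts (and commits) its unit's latency in cycles after dispatch; a dependent instruction dispatches at earliest the cycle after the broadcast it awaits; and only one instruction can dispatch to the adder per cycle:

```python
# Sketch: reproducing the Tomasulo timing table's arithmetic.
ADD_LAT, MUL_LAT = 4, 7

# mul.d $f6, $f0, $f2: issues in 0, operands ready, dispatches in 1.
mul_commit = 1 + MUL_LAT                 # commits in cycle 8

# add.d $f8, $f4, $f6: issues in 1, but must wait for $f6 (cycle 8),
# so it dispatches the cycle after that broadcast.
add1_dispatch = mul_commit + 1           # cycle 9
add1_commit = add1_dispatch + ADD_LAT    # cycle 13

# add.d $f10, $f0, $f2: issues in 2, operands ready, dispatches in 3.
add2_commit = 3 + ADD_LAT                # cycle 7

# add.d $f12, $f4, $f10: station 1 frees when add.d $f10 commits in 7,
# so it issues in 8; $f10 is ready, but the adder's dispatch slot in
# cycle 9 goes to the older add.d $f8, so it dispatches in 10.
add3_commit = 10 + ADD_LAT               # cycle 14

assert (mul_commit, add1_commit, add2_commit, add3_commit) == (8, 13, 7, 14)
```

Each computed commit cycle matches the table's last column.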
Problem R2a.9.
How could the answer to the preceding question change if
the processor had two addition units (both drawing
from the same reservation stations)? You can assume two common
data buses.
In this case, the second and final instructions could be dispatched
simultaneously, in cycle 9, since add.d $f12 would no longer wait
for the single adder. They would then both complete in cycle 13.
(If there were only one common data bus, then
only one of the instructions could commit in cycle 13; the
other would wait until cycle 14 to commit.)
Problem R2a.10.
What is the shortcoming of Tomasulo's algorithm that leads to
the introduction of speculation? How is the reorder
buffer used to introduce speculation?
Without speculation, any conditional branch will prevent a processor
from issuing instructions predicted to follow the branch,
since those instructions cannot be allowed to complete until
we are sure that they will execute.
The reorder buffer is a queue of speculated results. Upon
completion of an instruction's computation, the result is
written into the reorder buffer rather than directly to the
instruction's destination register.
The value is moved from the reorder buffer to the
destination register, in program order, only once all branch
predictions preceding the instruction are confirmed.
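This commit discipline can be sketched in a few lines of Python (the class and field names below are invented for illustration): results arrive out of order, but nothing leaves the buffer past an unfinished entry, including an unresolved branch, and retirement is strictly in program order:

```python
from collections import deque

class ReorderBuffer:
    """Minimal sketch: entries are allocated in program order at issue;
    results arrive out of order; commit proceeds strictly from the
    head and never passes an unresolved branch."""
    def __init__(self):
        self.buf = deque()               # entries in program order

    def issue(self, dest, is_branch=False):
        entry = {'dest': dest, 'value': None,
                 'done': False, 'branch': is_branch}
        self.buf.append(entry)
        return entry

    def commit(self, regfile):
        # Retire completed entries from the head; stop at the first
        # unfinished entry (which may be an unresolved branch).
        while self.buf and self.buf[0]['done']:
            e = self.buf.popleft()
            if e['dest'] is not None:
                regfile[e['dest']] = e['value']

regs = {}
rob = ReorderBuffer()
e1 = rob.issue('$f6')
br = rob.issue(None, is_branch=True)     # a conditional branch
e2 = rob.issue('$f8')                    # speculated past the branch

# $f8's result arrives first, but it cannot commit while the branch
# ahead of it is unresolved.
e2['done'], e2['value'] = True, 2.0
rob.commit(regs)
assert '$f8' not in regs

e1['done'], e1['value'] = True, 1.0
br['done'] = True                        # prediction confirmed
rob.commit(regs)                         # now everything retires in order
```

Squashing on a mispredicted branch would simply discard the entries after the branch instead of committing them.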