Exam 2 Review A: Questions

R2a.1.

Why is virtual memory a concern for L1 cache performance? How can the performance penalty be reduced for cache accesses?

R2a.2.

For the following instruction sequence, identify all (a.) RAW, (b.) WAR, and (c.) WAW data dependencies.

a: add $t0, $s0, $s1
b: xor $t1, $s0, $s1
c: and $t2, $t0, $t1
d: or  $t0, $s0, $t2
e: sub $t2, $t0, $s1
R2a.3.

Even in a single-pipeline processor, executing instructions out of order can sometimes be beneficial. Provide an example of this based on the classic five-stage MIPS pipeline, and explain how out-of-order execution would help.

R2a.4.

Consider the following instruction sequence to add all elements in an array.

loop: lw $t1, 0($a0)
      add $s2, $s2, $t1
      addi $a0, $a0, 4
      bne $a0, $a1, loop

Since the lw instruction takes 1 + 1 cycles to get its result, this instruction sequence takes 5 cycles per array element. Using loop unrolling, show how the sequence can be rewritten to take just 3 cycles per array element, and explain.

R2a.5.

A dynamic out-of-order processor uses register renaming to reduce the number of apparent conflicts between instructions. Rewrite the MIPS code below to reflect how it would look after renaming; in renaming, use the register values $v0 to $v15. The $zero register should not be renamed.

addi  $t0, $zero, 0
addi  $t1, $zero, 4
add   $t0, $t0, $t1
addi  $t1, $t1, -1
add   $t0, $t0, $t1
addi  $t1, $t1, -1
add   $t0, $t0, $t1
addi  $t1, $t1, -1
add   $t0, $t0, $t1
R2a.6.

Explain why a dynamic out-of-order processor should use some technique for renaming registers, with an example where register renaming helps processor performance.

R2a.7.

What is the function of a “reservation station” in Tomasulo's algorithm (without speculation)?

R2a.8.

Suppose our processor uses Tomasulo's algorithm without speculation, incorporating a 4-cycle addition unit and a 7-cycle multiplication unit. Reservation stations 0 and 1 are for the addition unit, while stations 2 and 3 are for the multiplication unit; when choosing among available stations, the one with the least ID is chosen. Complete the following table showing how each instruction proceeds through the processor.

res.issuedispatchcommit
instructionstat.cyclecyclecycle
mul.d $f6, $f0, $f2 0
add.d $f8, $f4, $f6
add.d $f10, $f0, $f2
add.d $f12, $f4, $f10
R2a.9.

How could the answer to the preceding question change if the processor had two addition units (both drawing from the same reservation stations)? You can assume two common data buses.

R2a.10.

What is the shortcoming of Tomasulo's algorithm that leads to the introduction of speculation? How is the reorder buffer used to introduce speculation?

Exam 2 Review A: Solutions

R2a.1.

Virtual memory demands that each memory address be translated to the physical address that a straightforward cache would need. This translation process adds at least one additional clock cycle to the memory access time.

To reduce the performance penalty, a processor can use the fact that the lower 12 bits of the physical address matches the lower 12 bits of the virtual address. If the cache needs the lower 12 bits of the physical address to determine which set contains the requested data, it can retrieve the requested set using the lower 12 bits of the virtual address, without waiting for the address to be translated. The virtual address translation process can proceed at the same time this set is being retrieved.

For this reason, a CPU designer might choose the L1 cache parameters so that the set is fully determined by the lower 12 bits; for example, for 64-byte cache lines, we could have only 212 / 64 = 64 sets.

R2a.2.

a. ac, bc, cd, de

b. cd, de

c. ad, ce

R2a.3.

Suppose we had the following instruction sequence.

lw   $t1, 0($a0)
ori  $t2, $t1, 1
addi $a0, $a0, 4

In a pipelined processor doing in-order execution, the lw instruction would reach the MEM stage at the same time that the ori instruction reaches the EX stage. The ori instruction would have to stall at this point, because the lw instruction will not have determined the value for $t1 (which the EX stage needs to execute ori instruction) until the MEM stage completes. This stall adds one clock cycle to the time needed to complete this sequence.

This wasted clock cycle could be avoided, however, if the processor executes the addi instruction first. Then the addi instruction (which doesn't depend on the result of the lw instruction) could complete the EX stage at the time that the lw instruction completes the MEM stage. The ori instruction would complete the EX stage at the same time as it otherwise would have done.

R2a.4.
loop: lw $t1, 0($a0)
      lw $t2, 4($a0)
      add $s2, $s2, $t1
      add $s2, $s2, $t2
      addi $a0, $a0, 8
      bne $a0, $a1, loop

For this sequence, no stalls will be necessary: The first lw doesn't lead to a stall since its result isn't needed until two instructions later, and similarly for the second lw. So each iteration takes 6 cycles to handle two array elements, which translates to 3 cycles per array element.

R2a.5.
addi  $v0, $zero, 0
addi  $v1, $zero, 4
add   $v2, $v0, $v1
addi  $v3, $v1, -1
add   $v4, $v2, $v3
addi  $v5, $v3, -1
add   $v6, $v4, $v5
addi  $v7, $v5, -1
add   $v8, $v6, $v7
R2a.6.

Without register renaming, the processor must unnecessarily delay any instruction that results in a WAR or WAW dependency; this dramatically reduces the opportunities for out-of-order execution. Consider the following example.

add.d $f6, $f0, $f2
add.d $f8, $f6, $s4
add.d $f6, $f0, $f2
add.d $f10, $f6, $s4

Note that the third instruction results in a WAR dependency: The second instruction reads $f6, and the third instruction writes to it. The processor would need to wait until it is sure the second instruction has retrieved the proper value for $f6 before it allows the third (or fourth) instruction to begin. However, renaming would lead the third instruction to write to a different register, so the processor could safely begin the third instruction immediately.

R2a.7.

Whenever an instruction is issued, all information necessary to executing the instruction in stored in its reservation station. The reservation station's ID also identifies what value should be used by subsequent instructions depending on the instruction's result.

R2a.8.
res.issuedispatchcommit
instructionstat.cyclecyclecycle
mul.d $f6, $f0, $f2 2018
add.d $f8, $f4, $f6 01913
add.d $f10, $f0, $f2 1237
add.d $f12, $f4, $f10 181014
R2a.9.

In this case, the second and final instruction could be dispatched simultaneously, in cycle 9. They would then both complete in cycle 13. (If there were only one common data bus, then only one of the instructions could commit in cycle 13; the other would wait until 14 to commit.)

R2a.10.

Without speculation, any conditional branch will prevent a processor from issuing instructions predicted to follow the branch, since those instructions cannot be allowed to complete until we are sure that they will execute.

The reorder buffer is a queue of speculated results. Upon completion of an instruction's computation, the result is written into the reorder buffer rather than directly to the instruction's destination register. The value is moved from the reorder buffer to the destination register only once branch predictions preceding the instruction are confirmed.