Exam 2 Review A
Problem R2a.1.
Why is virtual memory a concern for L1 cache performance?
How can the performance penalty be reduced for cache accesses?
With virtual memory, each memory address a program issues must be
translated to a physical address before a straightforward cache can
use it. This translation adds at least one additional clock cycle
to the memory access time.
To reduce the performance penalty, a processor can
use the fact that the lower 12 bits of the physical address match
the lower 12 bits of the virtual address (with 4 KiB pages, the
page offset passes through translation unchanged). If the cache needs
only the lower 12 bits of the physical address to determine which set
contains the requested data, it can retrieve the requested set
using the lower 12 bits of the virtual address, without waiting
for the address to be translated. The virtual address
translation can proceed at the same time this set is
being retrieved.
For this reason, a CPU designer might choose the
L1 cache parameters so that the set is fully determined by the
lower 12 bits; for example, with 64-byte cache lines, we
could have at most 2^12 / 64 = 64 sets.
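As a concrete illustration, here is a minimal Python sketch (assuming 4 KiB pages and the 64-byte-line, 64-set configuration above; the example page numbers and offset are made up). Because translation never changes the low 12 bits, the set index computed from the virtual address equals the one computed from the physical address:

```python
# Sketch: with 4 KiB pages, the low 12 bits of the virtual and the
# physical address are identical, so a cache whose set index fits
# entirely in those bits can be indexed before translation finishes.
PAGE_BITS = 12
LINE_SIZE = 64                              # bytes per cache line
NUM_SETS = (1 << PAGE_BITS) // LINE_SIZE    # 2^12 / 64 = 64 sets

def set_index(addr):
    """Set index = (addr / line size) mod (number of sets)."""
    return (addr // LINE_SIZE) % NUM_SETS

# Hypothetical translation: virtual page 5 maps to physical page 9;
# the page offset (low 12 bits) is unchanged by translation.
virt = (5 << PAGE_BITS) | 0x2A4
phys = (9 << PAGE_BITS) | 0x2A4

assert set_index(virt) == set_index(phys)   # same set either way
```

The assertion holds for any choice of pages, which is exactly why the set lookup can overlap with translation.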
Problem R2a.2.
For the following instruction sequence, identify
all (a.) RAW, (b.) WAR, and (c.) WAW
data dependencies.
a: add $t0, $s0, $s1
b: xor $t1, $s0, $s1
c: and $t2, $t0, $t1
d: or $t0, $s0, $t2
e: sub $t2, $t0, $s1
a. RAW: a→c, b→c, c→d, d→e
b. WAR: c→d, d→e
c. WAW: a→d, c→e
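The classification above can also be checked mechanically. This is a small Python sketch; the instruction list is hand-encoded from the sequence above, and the scan stops tracking an instruction once its destination is redefined (a simplification that is exact for this sequence):

```python
# Sketch: classifying data dependencies in the five-instruction
# sequence. Each instruction is (label, destination, sources).
insns = [
    ('a', 't0', ['s0', 's1']),   # add $t0, $s0, $s1
    ('b', 't1', ['s0', 's1']),   # xor $t1, $s0, $s1
    ('c', 't2', ['t0', 't1']),   # and $t2, $t0, $t1
    ('d', 't0', ['s0', 't2']),   # or  $t0, $s0, $t2
    ('e', 't2', ['t0', 's1']),   # sub $t2, $t0, $s1
]

raw, war, waw = [], [], []
for i, (li, di, si) in enumerate(insns):
    for lj, dj, sj in insns[i + 1:]:
        if di in sj:
            raw.append((li, lj))   # j reads i's result
        if dj in si:
            war.append((li, lj))   # j overwrites a register i reads
        if dj == di:
            waw.append((li, lj))   # j overwrites i's destination
            break                  # di is redefined; stop tracking i
```

Running this reproduces the three lists in the answer above.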
Problem R2a.3.
Even in a single-pipeline processor, executing instructions out
of order can sometimes be beneficial. Provide an example of this
based on the classic five-stage MIPS pipeline,
and explain how out-of-order execution would help.
Suppose we had the following instruction sequence.
lw $t1, 0($a0)
ori $t2, $t1, 1
addi $a0, $a0, 4
In a pipelined processor doing in-order execution, the lw
instruction would reach the MEM stage at the same
time that the ori instruction reaches the EX stage.
The ori instruction would have to stall at this point,
because the lw instruction will not have determined
the value for $t1 (which the EX stage needs to execute
the ori instruction) until the MEM stage completes.
This stall adds one clock cycle to the time needed to complete
this sequence.
This wasted clock cycle could be avoided, however, if the processor
executes the addi instruction first. Then the
addi instruction (which doesn't depend on the result of
the lw instruction) could complete the EX stage at the time
that the lw instruction completes the MEM stage.
The ori instruction would complete the EX stage at the
same time as it otherwise would have done.
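A rough way to see the one-cycle saving is to count issue cycles plus load-use bubbles. The Python sketch below is a simplification, not a full pipeline model: it assumes full forwarding, so the only stall it counts is the single bubble when an instruction uses a load result in the very next cycle, and it adds four cycles at the end to drain the five-stage pipeline:

```python
# Sketch: cycle counts for the three-instruction sequence, with and
# without reordering. Instructions are (opcode, dest, sources).
def total_cycles(seq):
    cycle = 0
    prev_load_dest = None
    for op, dest, srcs in seq:
        if prev_load_dest in srcs:
            cycle += 1               # one bubble: load-use stall
        cycle += 1                   # one issue slot per instruction
        prev_load_dest = dest if op == 'lw' else None
    return cycle + 4                 # drain the remaining 4 stages

in_order = [('lw',   't1', ['a0']),
            ('ori',  't2', ['t1']),   # uses $t1 right after the lw
            ('addi', 'a0', ['a0'])]
reordered = [('lw',   't1', ['a0']),
             ('addi', 'a0', ['a0']),  # independent work fills the slot
             ('ori',  't2', ['t1'])]
```

Under these assumptions the in-order sequence takes one cycle more than the reordered one.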
Problem R2a.4.
Consider the following instruction sequence to add all
elements in an array.
loop: lw $t1, 0($a0)
add $s2, $s2, $t1
addi $a0, $a0, 4
bne $a0, $a1, loop
Since the lw instruction takes 1 + 1 cycles
to get its result (one cycle plus a one-cycle load-use stall),
this instruction sequence takes 5 cycles per
array element. Using loop unrolling, show how the sequence can be
rewritten to take just 3 cycles per array element, and explain.
loop: lw $t1, 0($a0)
lw $t2, 4($a0)
add $s2, $s2, $t1
add $s2, $s2, $t2
addi $a0, $a0, 8
bne $a0, $a1, loop
For this sequence, no stalls will be necessary: The first
lw doesn't lead to a stall since its result isn't
needed until two instructions later, and similarly for the
second lw. So each iteration takes 6 cycles to handle
two array elements, which translates to 3 cycles per array element.
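The 5-versus-3 figure can be confirmed by counting issue cycles plus load-use bubbles per iteration. This Python sketch is a simplification (instruction lists hand-encoded, one issue per cycle, the branch costing one cycle, and a single bubble only when an instruction uses the result of the immediately preceding lw):

```python
# Sketch: steady-state cycles per array element for one loop body.
def cycles_per_element(body, elems_per_iter):
    cycles = 0
    prev_load_dest = None
    for op, dest, srcs in body:
        if prev_load_dest in srcs:
            cycles += 1              # load-use stall bubble
        cycles += 1                  # one issue slot per instruction
        prev_load_dest = dest if op == 'lw' else None
    return cycles / elems_per_iter

original = [('lw',   't1', ['a0']),
            ('add',  's2', ['s2', 't1']),   # stalls behind the lw
            ('addi', 'a0', ['a0']),
            ('bne',  '',   ['a0', 'a1'])]
unrolled = [('lw',   't1', ['a0']),
            ('lw',   't2', ['a0']),
            ('add',  's2', ['s2', 't1']),   # $t1 ready: no stall
            ('add',  's2', ['s2', 't2']),   # $t2 ready: no stall
            ('addi', 'a0', ['a0']),
            ('bne',  '',   ['a0', 'a1'])]
```

Under these assumptions the original body costs 5 cycles per element and the unrolled body 3.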
Problem R2a.5.
A dynamic out-of-order processor uses
register renaming to reduce the number of apparent conflicts between
instructions. Rewrite the MIPS code below to reflect how it
would look after renaming; in renaming, use the register names
$v0 through $v15. The $zero register should not
be renamed.
addi $t0, $zero, 0
addi $t1, $zero, 4
add $t0, $t0, $t1
addi $t1, $t1, -1
add $t0, $t0, $t1
addi $t1, $t1, -1
add $t0, $t0, $t1
addi $t1, $t1, -1
add $t0, $t0, $t1
addi $v0, $zero, 0
addi $v1, $zero, 4
add $v2, $v0, $v1
addi $v3, $v1, -1
add $v4, $v2, $v3
addi $v5, $v3, -1
add $v6, $v4, $v5
addi $v7, $v5, -1
add $v8, $v6, $v7
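The renaming above can be expressed as a table mapping each architectural register to its most recent physical name. This is a minimal Python sketch, assuming (as the problem specifies) that every write allocates the next fresh $v register and that $zero is never renamed:

```python
# Sketch: register renaming with a fresh physical register per write.
def rename(insns):
    mapping = {}                                # arch -> physical name
    fresh = iter(f'$v{i}' for i in range(16))
    out = []
    for op, dest, srcs in insns:
        # Sources (including immediates, passed through) use the
        # most recent mapping; then the destination gets a new name.
        srcs = [mapping.get(s, s) for s in srcs]
        if dest != '$zero':
            mapping[dest] = next(fresh)
        out.append((op, mapping.get(dest, dest), srcs))
    return out

code = [('addi', '$t0', ['$zero', '0']),
        ('addi', '$t1', ['$zero', '4']),
        ('add',  '$t0', ['$t0', '$t1']),
        ('addi', '$t1', ['$t1', '-1']),
        ('add',  '$t0', ['$t0', '$t1']),
        ('addi', '$t1', ['$t1', '-1']),
        ('add',  '$t0', ['$t0', '$t1']),
        ('addi', '$t1', ['$t1', '-1']),
        ('add',  '$t0', ['$t0', '$t1'])]
renamed = rename(code)
```

The output matches the answer above instruction for instruction.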
Problem R2a.6.
Explain why a dynamic out-of-order processor should use some
technique for renaming registers, with an example where register
renaming helps processor performance.
Without register renaming, the processor must unnecessarily delay
any instruction that results in a WAR or WAW dependency;
this dramatically reduces the opportunities for out-of-order execution.
Consider the following example.
add.d $f6, $f0, $f2
add.d $f8, $f6, $f4
add.d $f6, $f0, $f2
add.d $f10, $f6, $f4
Note that the third instruction results in a WAR dependency:
The second instruction reads $f6, and the third
instruction writes to it.
The processor would need to wait until it is sure the second
instruction has retrieved the proper value for $f6
before it allows the third (or fourth) instruction to begin.
However, renaming would lead the third instruction to write to a
different register, so the processor could safely begin the third
instruction immediately.
Problem R2a.7.
What is the function of a “reservation station”
in Tomasulo's algorithm (without speculation)?
Whenever an instruction is issued, all information necessary
to execute the instruction is stored in its reservation
station. The reservation station's ID also identifies which
value should be used by subsequent instructions that depend on
the instruction's result.
Problem R2a.8.
Suppose our processor uses Tomasulo's algorithm without
speculation, incorporating a 4-cycle addition unit and
a 7-cycle multiplication unit. Reservation stations 0 and 1 are
for the addition unit, while stations 2 and 3 are for the
multiplication unit; when choosing among available stations, the
one with the least ID is chosen.
Complete the following table showing how
each instruction proceeds through the processor.
instruction           | res. stat. | issue cycle | dispatch cycle | commit cycle
mul.d $f6, $f0, $f2   |            |      0      |                |
add.d $f8, $f4, $f6   |            |             |                |
add.d $f10, $f0, $f2  |            |             |                |
add.d $f12, $f4, $f10 |            |             |                |
instruction           | res. stat. | issue cycle | dispatch cycle | commit cycle
mul.d $f6, $f0, $f2   |     2      |      0      |       1        |      8
add.d $f8, $f4, $f6   |     0      |      1      |       9        |     13
add.d $f10, $f0, $f2  |     1      |      2      |       3        |      7
add.d $f12, $f4, $f10 |     1      |      8      |      10        |     14
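The filled table can be checked with simple cycle arithmetic. The sketch below encodes one reasonable set of assumptions consistent with the table: an instruction dispatches the cycle after issue if its operands are ready; a result broadcasts (and commits) its unit's latency in cycles after dispatch; a dependent instruction dispatches at earliest the cycle after the broadcast it awaits; and only one instruction can dispatch to the adder per cycle:

```python
# Sketch: reproducing the Tomasulo timing table's arithmetic.
ADD_LAT, MUL_LAT = 4, 7

# mul.d $f6, $f0, $f2: issues in 0, operands ready, dispatches in 1.
mul_commit = 1 + MUL_LAT                 # commits in cycle 8

# add.d $f8, $f4, $f6: issues in 1, but must wait for $f6 (cycle 8),
# so it dispatches the cycle after that broadcast.
add1_dispatch = mul_commit + 1           # cycle 9
add1_commit = add1_dispatch + ADD_LAT    # cycle 13

# add.d $f10, $f0, $f2: issues in 2, operands ready, dispatches in 3.
add2_commit = 3 + ADD_LAT                # cycle 7

# add.d $f12, $f4, $f10: station 1 frees when add.d $f10 commits in 7,
# so it issues in 8; $f10 is ready, but the adder's dispatch slot in
# cycle 9 goes to the older add.d $f8, so it dispatches in 10.
add3_commit = 10 + ADD_LAT               # cycle 14

assert (mul_commit, add1_commit, add2_commit, add3_commit) == (8, 13, 7, 14)
```

Each computed commit cycle matches the table's last column.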
Problem R2a.9.
How could the answer to the preceding question change if
the processor had two addition units (both drawing
from the same reservation stations)? You can assume two common
data buses.
In this case, the second and final instructions could be dispatched
simultaneously, in cycle 9, since add.d $f12 would no longer wait
for the single adder. They would then both complete in cycle 13.
(If there were only one common data bus, then
only one of the instructions could commit in cycle 13; the
other would wait until cycle 14 to commit.)
Problem R2a.10.
What is the shortcoming of Tomasulo's algorithm that leads to
the introduction of speculation? How is the reorder
buffer used to introduce speculation?
Without speculation, any conditional branch will prevent a processor
from issuing instructions predicted to follow the branch,
since those instructions cannot be allowed to complete until
we are sure that they will execute.
The reorder buffer is a queue of speculated results. Upon
completion of an instruction's computation, the result is
written into the reorder buffer rather than directly to the
instruction's destination register.
The value is moved from the reorder buffer to the
destination register, in program order, only once all branch
predictions preceding the instruction are confirmed.
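This commit discipline can be sketched in a few lines of Python (the class and field names below are invented for illustration): results arrive out of order, but nothing leaves the buffer past an unfinished entry, including an unresolved branch, and retirement is strictly in program order:

```python
from collections import deque

class ReorderBuffer:
    """Minimal sketch: entries are allocated in program order at issue;
    results arrive out of order; commit proceeds strictly from the
    head and never passes an unresolved branch."""
    def __init__(self):
        self.buf = deque()               # entries in program order

    def issue(self, dest, is_branch=False):
        entry = {'dest': dest, 'value': None,
                 'done': False, 'branch': is_branch}
        self.buf.append(entry)
        return entry

    def commit(self, regfile):
        # Retire completed entries from the head; stop at the first
        # unfinished entry (which may be an unresolved branch).
        while self.buf and self.buf[0]['done']:
            e = self.buf.popleft()
            if e['dest'] is not None:
                regfile[e['dest']] = e['value']

regs = {}
rob = ReorderBuffer()
e1 = rob.issue('$f6')
br = rob.issue(None, is_branch=True)     # a conditional branch
e2 = rob.issue('$f8')                    # speculated past the branch

# $f8's result arrives first, but it cannot commit while the branch
# ahead of it is unresolved.
e2['done'], e2['value'] = True, 2.0
rob.commit(regs)
assert '$f8' not in regs

e1['done'], e1['value'] = True, 1.0
br['done'] = True                        # prediction confirmed
rob.commit(regs)                         # now everything retires in order
```

Squashing on a mispredicted branch would simply discard the entries after the branch instead of committing them.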