Exam 2 Review B

[1] [2] [3] [4] [5] [6] [7] [8] [9]

Problem R2b.1.

An architecture applying Tomasulo's algorithm with multiple common data busses can dispatch multiple instructions into execution units in each clock cycle, and it can complete execution of multiple instructions in each clock cycle. Nonetheless, such a CPU cannot achieve an average time per instruction below one clock cycle. Why, and what can be done about it?

A straightforward implementation will issue only one instruction in each clock cycle from the instruction queue into a reservation station. Since only one instruction will be consumed from the instruction queue per clock cycle, the average number of instructions being processed cannot exceed one per clock cycle.

To address this, we must pull multiple instructions from the queue in each clock cycle. Doing this requires careful attention to data dependencies between instructions, pulling only enough instructions to avoid pulling dependencies.

Problem R2b.2.

In a VLIW instruction set architecture (such as the Itanium's IA-64), multiple operations are grouped into “bundles.” Explain how a processor can use a VLIW bundle to process instructions more efficiently.

A VLIW instruction set requires that the instructions in a bundle can be executed independently of each other, so that all instructions in a bundle can be issued and dispatched simultaneously (without the complexity of determining dependencies and other conflicts, as in a superscalar architecture).

Problem R2b.3.

What does the term simultaneous multithreading (equivalently, hyperthreading) mean in processor design? What about fine-grained multithreading?

In multithreaded processors, the processor is aware of different threads of execution (instruction queues manipulating different sets of registers).

In a simultaneous multithreaded processor, instructions can be issued into execution units in the same clock cycle, even if they originate from different threads. Thus, the processor truly can execute two (or more) different processes at once, by issuing their instructions into different execution units.

In a fine-grained multithreaded processor, the processor rotates among threads with each clock cycle. Within a particular cycle, it dispatches all available reservation stations for that cycle's thread.

Problem R2b.4.

The following program to sum the factors of 9 leads to the sequence of branches listed at right.

i = 1; total = 0; while (true) { if (9 % i == 0) { // A total += i; } else if (2 * i > 9) { // B break; } i++; }

1. A: Y

2. A: N

3. B: N

4. A: Y

5. A: N

6. B: N

7. A: N

8. B: Y

a. Suppose we predict branches using a two-entry local prediction buffer, where each entry is a two-bit predictor initially configured to predict “weakly no” (that is, predict “no” though the last observation was “yes”). Which branches will be predicted incorrectly?

b. Suppose we predict branches using a two-bit global branch history, where each entry is again a two-bit predictor initially configured to predict “weakly no.” The previous several branches have been “yes.” Which branches will be predicted incorrectly?

c. Suppose we use a correlating predictor using a two-bit global branch history, where each entry is again a two-bit predictor initially configured to predict “weakly no.” The previous several branches have been “yes.” Which branches will be predicted incorrectly?

a. 1, 2, 5, 7, 8

b. 1, 2, 4, 7

c. 1, 2, 4, 7, 8

Problem R2b.5.

The ARM Cortex A-8 does uses a pipelined architecture (not superscalar), but it does permit completion of multiple instructions in each clock cycle. It does this by introducing a second pipeline and pulling two instructions into the two pipelines in each clock cycle when possible.

The advantage of the dual-pipeline architecture is that it is a simpler design requiring fewer transistors and hence less power. The disadvantage is that it provides less performance. Why is the Cortex A-8's performance less with this dual-pipeline architecture that with a dual-issue superscalar architecture?

The dual-pipeline architecture does in-order execution of instructions. That is, if the second instruction to be pulled depends on the first, only the first instruction is issued into a pipeline; the processor will not go beyond the second instruction until it is finished stalling for the first instruction's result. By contrast, a superscalar architecture will execute instructions out of order in order to do useful work while stalling another instruction due to dependencies.

Problem R2b.6.

What do the letters of the acronym SIMD stand for? What does the acronym mean?

SIMD stands for “Single Instruction, Multiple Data.” This refers to a system where the same instruction sequence is executed on different streams of data. Vector processors work this way: You can conceptualize such a processor as 64 (say) different processors each working on different data (in particular, the respective vector elements corresponding to each processor).

Problem R2b.7.

Consider the following sequence of instructions to compute x + y + x y for two parallel vectors V0 and V1.

addvv.d V2, V0, V1 multvv.d V3, V0, V1 addvv.d V4, V2, V3

Assume that we have a vector processor with one vector multiplication unit whose latency is 7 and one vector addition unit whose latency is 6. Let n represent the length of the vector supported on our processor, where n ≥ 16. If the processor does not avoid latency between subsequent instructions, but it does support chaining, how many clock cycles will this instruction sequence take?

5 + 2 n.

first `addvv`:	first elt:	0	…	5
first `addvv`:	last elt:	`n` − 1	…	5 + `n` − 1
`multvv`:	first elt:	6	…	12
`multvv`:	last elt:	6 + `n` − 1	…	12 + `n` − 1
second `addvv`:	first elt:	`n`	…	5 + `n`
second `addvv`:	last elt:	2 `n` − 1	…	5 + 2 `n` − 1

Problem R2b.8.

Consider the following sequence of instructions to compute x² + 4x + 1 for each element x in a vector stored in V0.

multvv.d V1, V0, V0 # V1 = x^2 multvs.d V2, V0, 4 # V2 = 4 x addvv.d V3, V1, V2 # V3 = x^2 + 4 x addvs.d V4, V3, 1 # V4 = x^2 + 4 x + 1

Provide a better instruction sequence accomplishing the same thing but requiring fewer clock cycles.

12 + 3 n.

`multvv`:	first elt:	0	…	6
`multvv`:	last elt:	`n` − 1	…	6 + `n` − 1
`multvs`:	first elt:	`n`	…	6 + `n`
`multvs`:	last elt:	2 `n` − 1	…	6 + 2 `n` − 1
`addvv`:	first elt:	7 + `n`	…	12 + `n`
`addvv`:	last elt:	7 + 2 `n` − 1	…	12 + 2 `n` − 1
`addvs`:	first elt:	7 + 2 `n`	…	12 + 2 `n`
`addvs`:	last elt:	7 + 3 `n` − 1	…	12 + 3 `n` − 1

Reordering the computation to compute the same result using only two convoys leads to the following sequence, which requires only 17 + 2 n cycles:

multvs.d V2, V0, 4 # V2 = 4 x addvs.d V3, V2, 1 # V3 = 4 x + 1 multvv.d V1, V0, V0 # V1 = x^2 addvv.d V4, V1, V3 # V4 = x^2 + 4 x + 1

(Even better is the following based on the identity x² + 4n + 1 = (x+2)² − 3.

addvs.d V1, V0, 2 # V1 = x + 2 multvv.d V2, V1, V1 # V2 = (x + 2)^2 addvs.d V4, V2, -3 # V4 = (x + 2)^2 - 3

This requires only 5 + 2 n cycles.)

Problem R2b.9.

Suppose $s0 contains the address of the first element of a 128-double array. Recall that VMIPS vectors hold only 64 elements. Using vector instructions effectively, write a VMIPS program that places into $s1 the number of entries in the array that are positive.

mtc1    $f0, $zero
lv      V0, ($s0)
sgtvs.d V0, $f0
pop     $t0, VM
lv      V1, 512($s0)    # 512 = 64 * 8
sgtvs.d V1, $f0
pop     $t1, VM
add     $s1, $t0, $t1