Exam 2 Review A: Questions

R2a.1.

[8 pts] Contrast the terms latency and bandwidth, explaining them in terms of delivering water buckets to a fire.

R2a.2.

[8 pts] Suppose we have a stack architecture supporting LOAD (direct addressing), STORE (direct addressing), ADD, and MULT instructions. Supposing that x is stored at address 1000 and y is stored at address 1004, write a sequence of instructions corresponding to the C statement “x = x * (x + y);”.

R2a.3.

[8 pts] Disassemble each of the following MIPS machine language instructions.

a. 000000 00000 10100 00100 01010 000011

b. 101011 01011 10101 00000 00000 010100

R2a.4.

[10 pts] Using standard MIPS calling conventions, write a MIPS subroutine max that takes two integer parameters and returns the larger of the two parameter values.

R2a.5.

[8 pts] Complete the following diagram of how instructions progress through the classic five-stage pipeline, assuming that the processor supports data forwarding. Draw arrows indicating where forwarding is necessary.

lw  $t0, 4($s0)     IF   ID   EX   MEM  WB
add $s2, $s1, $t0
add $s3, $s2, $t0
R2a.6.

[8 pts] Consider the following instruction sequence.

a: add $t0, $s0, $s1
b: sub $t1, $s2, $t0
c: xor $t0, $s0, $s1
d: or  $t2, $t1, $t0

a. Identify all RAW dependencies.

b. Identify all WAR dependencies.

c. Identify all WAW dependencies.

R2a.7.

[10 pts] Suppose our processor uses Tomasulo's algorithm without speculation, incorporating a 3-cycle addition unit and a 5-cycle multiplication unit. Reservation stations 0 and 1 are for the addition unit, while stations 2 and 3 are for the multiplication unit; when choosing among available stations, the one with the least ID is chosen. Complete the following table showing how each instruction proceeds through the processor.

                      res.   issue  dispatch  commit
instruction           stat.  cycle  cycle     cycle
mul.d $f6, $f0, $f2          0
add.d $f8, $f4, $f6
mul.d $f10, $f6, $f8
mul.d $f12, $f0, $f4
R2a.8.

[8 pts] In a fine-grained multithreaded processor, the processor is aware of different threads, and with each clock cycle it rotates to a different thread to dispatch whatever instructions that thread has available. By contrast, a single-threaded processor focuses exclusively on one thread until the operating system switches it to focus on a different thread.

Explain why a fine-grained multithreaded processor can potentially complete more instructions per clock cycle than a single-threaded processor, even in the absence of stalls due to L1 cache misses or interrupts.

R2a.9.

[8 pts] What distinguishes a fine-grained multithreaded processor from a simultaneous multithreaded processor (also known as hyperthreading)?

R2a.10.

[8 pts] Recall that a branch target buffer is a cache for remembering the target address (i.e., the address of the instruction being branched to) of each branch instruction encountered. Why is a branch target buffer useful for enhancing processor performance?

R2a.11.

[10 pts] Suppose we have three arrays a, b, and c, each holding 64 doubles; register $s0 contains the address of a's first element, $s1 references b's first element, and $s2 references c's first element. Moreover, suppose $f0 is pre-initialized to hold the floating-point value 4.

Without using any branches or jumps, write VMIPS assembly code that corresponds to the following C fragment.

for (i = 0; i < 64; i++) {
    a[i] += 4 * b[i] + c[i];
}

Exam 2 Review A: Solutions

R2a.1.

Latency is the amount of time to handle a single operation, whereas bandwidth is the overall rate at which operations can be completed. A basic example is putting out a fire using a bucket brigade: latency would be the amount of time for one bucket to get from the water source to the fire, whereas bandwidth would be the rate at which buckets are poured onto the fire.
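The distinction can be made concrete with a toy brigade model. The numbers below (5 people, 2 seconds per handoff) are illustrative assumptions, not from the problem.

```python
# Toy bucket-brigade model (illustrative numbers).
N = 5          # people in the brigade
HANDOFF = 2.0  # seconds for each hand-to-hand pass

# Latency: time for ONE bucket to travel from the water source to the fire.
latency = N * HANDOFF          # 10.0 seconds

# Bandwidth: once the brigade is full, a bucket arrives every HANDOFF
# seconds, regardless of how long the line is.
bandwidth = 1.0 / HANDOFF      # 0.5 buckets per second

print(latency, bandwidth)
```

Note that adding more people lengthens the latency without changing the bandwidth, which is exactly the latency/bandwidth trade-off in a deep pipeline.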

R2a.2.
LOAD 1000
LOAD 1000
LOAD 1004
ADD
MULT
STORE 1000
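The sequence can be sanity-checked with a tiny stack-machine interpreter. The addresses follow the problem, but the initial values x = 3 and y = 4 are made up for the check.

```python
# Minimal stack-machine interpreter for the LOAD/STORE/ADD/MULT sequence.
def run(program, memory):
    stack = []
    for op, *arg in program:
        if op == "LOAD":       # push memory[addr]
            stack.append(memory[arg[0]])
        elif op == "STORE":    # pop top of stack into memory[addr]
            memory[arg[0]] = stack.pop()
        elif op == "ADD":      # pop two values, push their sum
            stack.append(stack.pop() + stack.pop())
        elif op == "MULT":     # pop two values, push their product
            stack.append(stack.pop() * stack.pop())
    return memory

mem = run([("LOAD", 1000), ("LOAD", 1000), ("LOAD", 1004),
           ("ADD",), ("MULT",), ("STORE", 1000)],
          {1000: 3, 1004: 4})    # x = 3, y = 4 (made-up test values)
print(mem[1000])                 # x = 3 * (3 + 4) = 21
```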
R2a.3.

a. sra $a0, $s4, 10

b. sw $s5, 20($t3)
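The bit fields can also be decoded mechanically. This Python sketch splits each word into opcode/rs/rt/(rd, shamt, funct or immediate) fields using the standard MIPS register numbering; in particular it shows that in the sw encoding, rs ($t3) is the base register and rt ($s5) is the register being stored.

```python
# Split a 32-bit MIPS word (given as a bit string) into its fields.
REG = ["$zero","$at","$v0","$v1","$a0","$a1","$a2","$a3",
       "$t0","$t1","$t2","$t3","$t4","$t5","$t6","$t7",
       "$s0","$s1","$s2","$s3","$s4","$s5","$s6","$s7",
       "$t8","$t9","$k0","$k1","$gp","$sp","$fp","$ra"]

def fields(bits):
    bits = bits.replace(" ", "")
    op = int(bits[0:6], 2)
    rs, rt = int(bits[6:11], 2), int(bits[11:16], 2)
    if op == 0:   # R-format: rd, shamt, funct
        rd, shamt, funct = int(bits[16:21], 2), int(bits[21:26], 2), int(bits[26:32], 2)
        return op, REG[rs], REG[rt], REG[rd], shamt, funct
    # I-format: 16-bit immediate
    return op, REG[rs], REG[rt], int(bits[16:32], 2)

print(fields("000000 00000 10100 00100 01010 000011"))
# opcode 0, funct 3 (sra): rd = $a0, rt = $s4, shamt = 10
print(fields("101011 01011 10101 00000 00000 010100"))
# opcode 43 (sw): base rs = $t3, source rt = $s5, immediate = 20
```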

R2a.4.
max:  slt $t0, $a0, $a1
      or  $v0, $zero, $a1
      bne $t0, $zero, done
      or  $v0, $zero, $a0
done: jr $ra
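A register-level Python mirror of the routine (an illustration, not part of the expected answer) shows why the branch leaves the correct value in $v0.

```python
# Register-level mirror of the MIPS max routine.
def mips_max(a0, a1):
    t0 = 1 if a0 < a1 else 0   # slt $t0, $a0, $a1
    v0 = a1                    # $v0 <- $a1 (tentatively return $a1)
    if t0 != 0:                # bne $t0, $zero, done
        return v0              # $a0 < $a1, so $a1 is the answer
    v0 = a0                    # fall through: $v0 <- $a0
    return v0                  # done: jr $ra

print(mips_max(3, 7), mips_max(7, 3), mips_max(5, 5))
```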
R2a.5.
lw  $t0, 4($s0)     IF   ID   EX   MEM  WB
add $s2, $s1, $t0        IF   ID   --   EX   MEM  WB
add $s3, $s2, $t0             IF   --   ID   EX   MEM  WB

We would need arrows from after lw's MEM stage to the beginning of the first add's EX stage, and from after the first add's EX stage to the beginning of the second add's EX stage. Note that even with forwarding, the first add must stall one cycle (the load-use hazard): lw's value is not available until after its MEM stage, so it can feed an EX stage no earlier than the following cycle.
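The EX-stage timing can be computed with a small model. The forwarding assumptions are the standard ones for this pipeline: ALU results become forwardable after EX, load results after MEM, and operands are needed at the start of EX.

```python
# Compute the cycle in which each instruction's EX stage runs,
# given forwarding. ready[r] = cycle in which r's value is produced;
# a consumer's EX must start at least one cycle later.
instrs = [
    ("lw",  "$t0", ["$s0"]),          # lw  $t0, 4($s0)
    ("add", "$s2", ["$s1", "$t0"]),   # add $s2, $s1, $t0
    ("add", "$s3", ["$s2", "$t0"]),   # add $s3, $s2, $t0
]

ready = {}
schedule = []
ex = 0
for i, (op, dest, srcs) in enumerate(instrs):
    ex = max(ex + 1,                              # EX stages stay in order
             i + 3,                               # IF at i+1, ID, then EX
             *(ready.get(r, 0) + 1 for r in srcs))
    # loads produce their value after MEM (ex + 1); ALU ops after EX (ex)
    ready[dest] = ex + 1 if op == "lw" else ex
    schedule.append((op, dest, ex))

print(schedule)   # EX cycles: lw at 3, first add at 5 (one stall), second add at 6
```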

R2a.6.

a. RAW: a→b, b→d, c→d

b. WAR: b→c

c. WAW: a→c
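The same pairs can be found mechanically. The one subtlety the sketch below captures: for RAW, a read depends only on the most recent earlier writer of the register, so d's read of $t0 depends on c, not on a.

```python
# Find RAW/WAR/WAW pairs in the four-instruction sequence.
insns = [  # (label, destination, source registers)
    ("a", "$t0", ["$s0", "$s1"]),   # add $t0, $s0, $s1
    ("b", "$t1", ["$s2", "$t0"]),   # sub $t1, $s2, $t0
    ("c", "$t0", ["$s0", "$s1"]),   # xor $t0, $s0, $s1
    ("d", "$t2", ["$t1", "$t0"]),   # or  $t2, $t1, $t0
]

raw, war, waw = [], [], []
for j, (ly, dy, sy) in enumerate(insns):
    for reg in sy:
        # RAW: the read depends on the MOST RECENT earlier writer of reg
        writers = [lx for lx, dx, _ in insns[:j] if dx == reg]
        if writers:
            raw.append((writers[-1], ly))
    for lx, dx, sx in insns[:j]:
        if dy in sx:
            war.append((lx, ly))    # earlier read, later write
        if dy == dx:
            waw.append((lx, ly))    # two writes to the same register

print(raw, war, waw)
```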

R2a.7.
                      res.   issue  dispatch  commit
instruction           stat.  cycle  cycle     cycle
mul.d $f6, $f0, $f2   2      0      1         5
add.d $f8, $f4, $f6   0      1      6         8
mul.d $f10, $f6, $f8  3      2      9         14
mul.d $f12, $f0, $f4  2      6      7         12
R2a.8.

Even without long-term stalls, there are still short-term stalls due to data dependencies and control dependencies. By rotating rapidly among threads, a fine-grained multithreaded processor increases the time between successive instructions drawn from the same thread; thus, by the time a thread's turn comes around again, its next instructions are more often ready to dispatch.

For example, suppose we have two threads executing “lw $t0, 0($s0)” followed by “addi $t1, $t0, 1” using a classic 5-stage RISC pipeline. If the pipeline focuses on one thread at a time, the second instruction would need to stall for one cycle while lw loads $t0 to be used by addi. So the processor would take three cycles to process these two instructions in one thread, then three cycles for the second thread, for a total of six cycles. With fine-grained multithreading, the processor would execute one thread's lw, then the other's, then the first thread's addi, then the other's. There would be no need to stall, since the first thread's value for $t0 would be available by the time the processor got back around to its addi; the four instructions would take four cycles, not six as with a single-threaded processor.
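The six-versus-four cycle count can be reproduced with a toy issue model. The requirement that a dependent addi issue at least two cycles after its thread's lw is an assumption standing in for the one-cycle load-use bubble.

```python
# Count issue cycles for the two-thread lw/addi example.
def cycles(order):
    """order: list of (thread, instr); returns the cycle of the last issue."""
    lw_cycle = {}
    cycle = 0
    for thread, instr in order:
        cycle += 1
        if instr == "addi" and cycle < lw_cycle[thread] + 2:
            cycle = lw_cycle[thread] + 2   # stall: wait for the lw result
        if instr == "lw":
            lw_cycle[thread] = cycle
    return cycle

single = cycles([(0, "lw"), (0, "addi"), (1, "lw"), (1, "addi")])
fgmt   = cycles([(0, "lw"), (1, "lw"), (0, "addi"), (1, "addi")])
print(single, fgmt)   # 6 4
```

Interleaving the two threads hides each load-use bubble behind the other thread's instruction, so no issue slot is wasted.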

R2a.9.

With fine-grained multithreading, the processor considers dispatching instructions from only one thread in each cycle, though it rotates which thread it considers after each cycle. With simultaneous multithreading, the processor may dispatch instructions from any of several threads within the same clock cycle. This means instructions from two or more different threads can be dispatched in exactly the same cycle, which cannot happen with fine-grained multithreading.

R2a.10.

When the branch predictor predicts that a branch will be taken, the processor needs to know which instruction to fetch next. Even a simple branch typically adds a displacement to the PC to compute its target, so loading the target from a cache lets the next instruction's fetch begin even before that addition has taken place. The delay in determining the branch's target is particularly problematic when the target comes from a register (as with jr or jalr in MIPS); a branch target buffer predicts that the target will be the same as the previous time the branch executed, allowing instructions to be fetched and executed before the target is finally confirmed.
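A branch target buffer can be sketched as a small map from a branch's PC to its last-seen target, consulted during fetch. The PC and target values below are made up for illustration.

```python
# Sketch of a branch target buffer: maps a branch's PC to the target
# address it jumped to the last time it executed.
class BTB:
    def __init__(self):
        self.table = {}            # branch PC -> last-seen target

    def predict(self, pc):
        # During IF: if this branch has been seen before, start fetching
        # from its remembered target instead of waiting for EX.
        return self.table.get(pc)  # None = no prediction; fetch pc + 4

    def update(self, pc, target):
        # When the branch resolves, remember (or correct) its target.
        self.table[pc] = target

btb = BTB()
print(btb.predict(0x400100))       # None: first encounter, no prediction
btb.update(0x400100, 0x400200)     # target resolved later in the pipeline
print(btb.predict(0x400100))       # next fetch predicts 0x400200 immediately
```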

R2a.11.
lv      V1, ($s0)
lv      V2, ($s1)
mulvs.d V2, V2, $f0
addvv.d V1, V1, V2
lv      V3, ($s2)
addvv.d V1, V1, V3
sv      V1, ($s0)
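A scalar Python rendering of the vector code (with small made-up arrays standing in for the 64-element ones) confirms that it computes a[i] += 4*b[i] + c[i].

```python
# Scalar check of the vector sequence above; test values are made up.
a = [1.0, 2.0, 3.0]
b = [10.0, 20.0, 30.0]
c = [0.5, 0.5, 0.5]

v1 = a[:]                              # lv      V1, ($s0)
v2 = b[:]                              # lv      V2, ($s1)
v2 = [4.0 * x for x in v2]             # mulvs.d V2, V2, $f0   ($f0 holds 4)
v1 = [x + y for x, y in zip(v1, v2)]   # addvv.d V1, V1, V2
v3 = c[:]                              # lv      V3, ($s2)
v1 = [x + y for x, y in zip(v1, v3)]   # addvv.d V1, V1, V3
a = v1                                 # sv      V1, ($s0)

print(a)   # [41.5, 82.5, 123.5]
```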