
Exam 2 Review A
Problem R2a.1.
[8 pts] Contrast the terms latency and bandwidth, explaining them in terms of delivering water buckets to a fire.
Latency is the amount of time to handle a single operation, whereas bandwidth is the overall rate at which operations can be completed. A basic example is putting out a fire using a bucket brigade: Latency would be the amount of time for one bucket to get from the water source to the fire, whereas bandwidth would be the rate at which buckets are poured onto the fire.
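To put hypothetical numbers on this: if a brigade of ten people takes six seconds per handoff, then the latency is 60 seconds for any one bucket to travel from the source to the fire, but since a new bucket enters the line every six seconds, the bandwidth is ten buckets per minute.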
Problem R2a.2.
[8 pts] Suppose we have a stack architecture supporting LOAD (direct addressing), STORE (direct addressing), ADD, and MULT instructions. Supposing that x is stored at address 1000 and y is stored at address 1004, write a sequence of instructions corresponding to the C statement “x = x * (x + y);”.
LOAD 1000   # push x                      (stack: x)
LOAD 1000   # push x again                (stack: x, x)
LOAD 1004   # push y                      (stack: x, x, y)
ADD         # pop x and y, push x + y     (stack: x, x + y)
MULT        # pop both, push x * (x + y)  (stack: x * (x + y))
STORE 1000  # pop result into x
Problem R2a.3.
[8 pts] Disassemble each of the following MIPS machine language instructions.
a. 000000 00000 10100 00100 01010 000011
b. 101011 01011 10101 00000 00000 010100
a. sra $a0, $s4, 10
b. sw $s5, 20($t3)
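As a check on the decodings, the bit fields break down as follows. In (a), the opcode 000000 marks an R-format instruction, and the funct field 000011 is sra; rt = 10100 is register 20 ($s4), rd = 00100 is register 4 ($a0), and the shift amount is 01010 = 10. In (b), the opcode 101011 is sw, an I-format instruction; rs = 01011 is register 11 ($t3), which serves as the base address, rt = 10101 is register 21 ($s5), the register being stored, and the immediate is 010100 = 20.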
Problem R2a.4.
[10 pts] Using standard MIPS calling conventions, write a MIPS subroutine max that takes two integer parameters and returns the larger of the two parameter values.
max:  slt $t0, $a0, $a1     # $t0 = 1 if $a0 < $a1
      or  $v0, $zero, $a1   # tentatively return $a1
      bne $t0, $zero, done  # if $a0 < $a1, then $a1 is the answer
      or  $v0, $zero, $a0   # otherwise return $a0
done: jr  $ra
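For illustration, here is a hypothetical call site (the argument values 3 and 7 are arbitrary):

ori $a0, $zero, 3   # first parameter
ori $a1, $zero, 7   # second parameter
jal max             # on return, $v0 holds 7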
Problem R2a.5.
[8 pts] Complete the following diagram of how instructions progress through the classic five-stage pipeline, assuming that the processor supports data forwarding. Draw arrows indicating where forwarding is necessary.
                  |  1 |  2 |  3 |  4  |  5  |  6  |  7  |  8 |
lw $t0, 4($s0)    | IF | ID | EX | MEM | WB  |     |     |    |
add $s2, $s1, $t0 |    |    |    |     |     |     |     |    |
add $s3, $s2, $t0 |    |    |    |     |     |     |     |    |

                  |  1 |  2 |  3 |  4  |  5  |  6  |  7  |  8 |
lw $t0, 4($s0)    | IF | ID | EX | MEM | WB  |     |     |    |
add $s2, $s1, $t0 |    | IF | ID |  *  | EX  | MEM | WB  |    |
add $s3, $s2, $t0 |    |    | IF |  *  | ID  | EX  | MEM | WB |
(* marks a stall cycle)
We would need an arrow from the end of lw's MEM stage to the beginning of the first add's EX stage, and another from the end of the first add's EX stage to the beginning of the second add's EX stage. Note that even with forwarding, the first add must stall one cycle: the loaded value does not exist until lw's MEM stage completes, so it cannot be forwarded into an EX stage occurring in that same cycle.
Problem R2a.6.
[8 pts] Consider the following instruction sequence.
a: add $t0, $s0, $s1
b: sub $t1, $s2, $t0
c: xor $t0, $s0, $s1
d: or $t2, $t1, $t0
a. Identify all RAW dependencies.
b. Identify all WAR dependencies.
c. Identify all WAW dependencies.
a. RAW: a→b (via $t0), b→d (via $t1), c→d (via $t0)
b. WAR: b→c (b reads $t0 before c overwrites it)
c. WAW: a→c (both write $t0)
Note that there is no RAW dependence a→d: the $t0 that d reads is the one written by c, which is why c→d is the RAW pair.
Problem R2a.7.
[10 pts] Suppose our processor uses Tomasulo's algorithm without speculation, incorporating a 3-cycle addition unit and a 5-cycle multiplication unit. Reservation stations 0 and 1 are for the addition unit, while stations 2 and 3 are for the multiplication unit; when choosing among available stations, the one with the smallest ID is chosen. Complete the following table showing how each instruction proceeds through the processor.
instruction          | res. stat. | issue cycle | dispatch cycle | commit cycle
mul.d $f6, $f0, $f2  |            | 0           |                |
add.d $f8, $f4, $f6  |            |             |                |
mul.d $f10, $f6, $f8 |            |             |                |
mul.d $f12, $f0, $f4 |            |             |                |
instruction          | res. stat. | issue cycle | dispatch cycle | commit cycle
mul.d $f6, $f0, $f2  | 2          | 0           | 1              | 5
add.d $f8, $f4, $f6  | 0          | 1           | 6              | 8
mul.d $f10, $f6, $f8 | 3          | 2           | 9              | 14
mul.d $f12, $f0, $f4 | 2          | 6           | 7              | 12
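To spell out the reasoning behind the table: the first mul.d takes station 2 (the smaller-numbered multiplication station) and, dispatching at cycle 1, finishes its 5-cycle multiply at cycle 5. The add.d must wait for $f6, so it dispatches at cycle 6 and commits at cycle 8 after its 3-cycle add. The second mul.d waits for both $f6 and $f8, dispatching at cycle 9 and committing at cycle 14. The final mul.d has no data dependences, but both multiplication stations are occupied until the first mul.d commits at cycle 5 and frees station 2, so it cannot issue until cycle 6.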
Problem R2a.8.
[8 pts] In a fine-grained multithreaded processor, the processor is aware of different threads, and with each clock cycle it rotates to a different thread to dispatch whatever instructions that thread has available. By contrast, a single-threaded processor focuses exclusively on one thread until the operating system switches it to focus on a different thread.
Explain why a fine-grained multithreaded processor can potentially complete more instructions per clock cycle than a single-threaded processor, even in the absence of stalls due to L1 cache misses or interrupts.
Even without long-term stalls, there are still short-term stalls due to data and control dependencies. By switching rapidly between threads, a fine-grained multithreaded processor lengthens the time between successive checks of any one thread for dispatchable instructions; thus, a thread's instructions are more often ready when its turn comes around.
For example, suppose we have two threads each executing “lw $t0, 0($s0)” followed by “addi $t1, $t0, 1” on a classic 5-stage RISC pipeline. If the pipeline focuses on one thread at a time, the second instruction must stall for one cycle while lw loads the $t0 value that addi uses. So the processor would take three cycles to process these two instructions in one thread, then three cycles for the second thread, for a total of six cycles.
With fine-grained multithreading, the processor would execute one thread's lw, then the other's, then the first thread's addi, then the other's. There would be no need to stall, since the first thread's value for $t0 would be available by the time the processor got back around to its addi; so it would take four cycles rather than six.
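Sketching the issue slots for this scenario (threads labeled A and B for illustration):

single-threaded:  A.lw, stall, A.addi, B.lw, stall, B.addi   (6 cycles)
fine-grained:     A.lw, B.lw, A.addi, B.addi                 (4 cycles)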
Problem R2a.9.
[8 pts] What distinguishes a fine-grained multithreaded processor from a simultaneous multithreaded processor (also known as hyperthreading)?
With fine-grained multithreading, the processor considers dispatching instructions from only one thread in each cycle, though it rotates which thread it considers from cycle to cycle. With simultaneous multithreading, the processor may consider dispatching instructions from several threads in the same clock cycle. This leads to the possibility that instructions from two (or more) different threads are dispatched in exactly the same cycle, which cannot happen with fine-grained multithreading.
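To illustrate with a hypothetical two-issue processor running threads A and B (Ai meaning thread A's ith instruction):

fine-grained:  cycle 1: A1 A2   cycle 2: B1 B2   cycle 3: A3 A4
simultaneous:  cycle 1: A1 B1   cycle 2: A2 A3   cycle 3: B2 B3

Only the simultaneous processor ever mixes the two threads within a single cycle.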
Problem R2a.10.
[8 pts] Recall that a branch target buffer is a cache for remembering the target address (i.e., the address of the instruction being branched to) of each branch instruction encountered. Why is a branch target buffer useful for enhancing processor performance?
When the branch predictor predicts that a branch will be taken, the processor needs to know which instruction comes next. Even a simple PC-relative branch must add an offset to the PC to compute its target, so reading the target from a cache lets the processor start fetching the next instruction before that addition completes. The delay in determining the target is particularly problematic, though, when the target comes from a register (as with jr or jalr in MIPS); a branch target buffer can predict the register's value from the previous time the branch executed, allowing instructions to be fetched and executed before the target is finally confirmed.
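As a hypothetical illustration: if a jr $ra at address 0x00400100 jumped to 0x00401000 the last time it executed, the branch target buffer can map 0x00400100 to 0x00401000, and the processor can begin fetching from 0x00401000 as soon as it fetches the jr, squashing the speculative work only if the register turns out to hold a different address.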
Problem R2a.11.
[10 pts] Suppose we have three 64-element arrays of doubles, a, b, and c; register $s0 contains the address of a's first element, $s1 references b's first element, and $s2 references c's first element. Moreover, suppose $f0 is pre-initialized to hold the floating-point value 4.
Without using any branches or jumps, write VMIPS assembly code that corresponds to the following C fragment.
for (i = 0; i < 64; i++) {
a[i] += 4 * b[i] + c[i];
}
lv      V1, ($s0)     # V1 = a[0..63]
lv      V2, ($s1)     # V2 = b[0..63]
mulvs.d V2, V2, $f0   # V2 = 4 * b[i]
addvv.d V1, V1, V2    # V1 = a[i] + 4 * b[i]
lv      V3, ($s2)     # V3 = c[0..63]
addvv.d V1, V1, V3    # V1 = a[i] + 4 * b[i] + c[i]
sv      V1, ($s0)     # a[i] = result
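A note on why no loop is needed: each VMIPS vector register holds 64 doubles, the maximum vector length, so (assuming the vector-length register is at its maximum of 64) each vector instruction above processes an entire array in one instruction. A scalar version would instead need a 64-iteration loop with a branch, which the problem forbids.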