[8 pts] Microprocessor clock rates grew at a quick clip from 1988 to 2003, going from 16 MHz to 3.2 GHz. But then the race for faster clock rates essentially stopped. Why?
A faster clock rate means transistors switch state more often, and each state transition generates heat. Up to 3.2 GHz, this increase in heat could be overcome through more sophisticated heat sinks and fans, but beyond that point architects found little room for improving heat dissipation affordably, so pushing clock rates further would make chips too hot to cool economically.
[8 pts] Ben Bitdiddle has been charged with improving the performance of a single-threaded program. He identifies two phases to the program: It starts in the first phase, spending 40% of its time there, before proceeding to the second, where it spends the remaining 60% of its time. Furthermore, he notices that the first phase is very amenable to parallelization, and he proposes to rewrite it so it will go four times as fast on a quad-core computer.
a. Fill in the blank: On a quad-core computer, Ben's revised program with the first phase parallelized will be at most ________ times as fast as the old program. (Show your work.)
b. Name the law that applies to this scenario of speeding up a portion of a system.
a. 1.429. Tmodified = (0.4 / 4 + 0.6) Tbase = 0.7 Tbase; the speedup is Tbase / Tmodified = 1 / 0.7 = 1.429.
b. Amdahl's Law
[8 pts] Disassemble each of the following MIPS machine language instructions into assembly language.
|a.||000000 01001 10010 01100 00000 101010|
|b.||100011 00001 00010 11111 11111 111100|
slt $t4, $t1, $s2
lw $v0, -4($at)
[8 pts] Translate the following MIPS “program” into machine language (binary).
loop: addi $t0, $t0, -1
bne $t0, $zero, loop
|001000 01000 01000 11111 11111 111111|
|000101 01000 00000 11111 11111 111110|
|000000 00000 00000 00000 00000 000000| (a nop filling the branch delay slot)
[8 pts] Define the assembly language term “delay slot,” and explain why some instruction set architectures include it.
The delay slot is the instruction following any jump or branch instruction, which is executed regardless of whether the computer transfers execution to the instruction indicated in the jump or branch. Some ISAs include a delay slot because it enables the processor to commence execution of an instruction even while it is in the middle of processing the jump or branch; this removes one possible pipeline stall that could otherwise result due to the control hazard that a jump or branch introduces.
Suppose we decide to add a new instruction lwo to the MIPS instruction set to allow accessing memory based on the sum of a base register and an offset register. In an assembly program, it would appear as “lwo $t0, $t1($s0)”. Define how to encode such an instruction while keeping our circuit implementation simple; explain your reasoning.
This would be an R-type instruction. To be parallel with the other R-type instructions, we should use 0 for the uppermost six bits of the instruction, and we should choose some unallocated six-bit code for the lowermost six bits (according to the green sheet, anything starting with 0x3 is available; to match the lower four bits of lw's operation code, perhaps we'd pick 0x33). Like other R-type instructions, the destination register would go in the “rd” slot. We would place the base register (in the parentheses) into the “rs” slot to be congruent to the other load instructions, and then the “rt” slot would correspond to the offset register.
According to this scheme, then, we'd encode the instruction as follows:
000000 10000 01001 01000 00000 110011
Write a MIPS subroutine that takes one parameter providing the address where a subroutine f begins; this subroutine f (which you can call using jalr) takes an integer parameter and returns an integer. Your subroutine should repeatedly invoke f starting from 1 until the result exceeds 1,000, and it should return the number of times f was invoked. For example, if f is the quadrupling function (f(n) = 4 × n), then your subroutine would return 5, since f(f(f(f(1)))) is 256 while f(256) is 1024.
The provided code in the template below handles saving some registers to the stack and restoring them at the end.
appsTo1000: addi $sp, $sp, -12
sw $s0, ($sp)
sw $s1, 4($sp)
sw $ra, 8($sp)
# insert your code here
lw $s0, ($sp)
lw $s1, 4($sp)
lw $ra, 8($sp)
addi $sp, $sp, 12
jr $ra
move $s0, $a0        # $s0 = address of f
li   $s1, 0          # $s1 = invocation count
li   $a0, 1          # first argument to f
loop: jalr $s0       # $v0 = f($a0)
move $a0, $v0        # feed the result back in
slti $t1, $a0, 1001  # $t1 = 1 if result <= 1000
addi $s1, $s1, 1     # count this invocation
bne  $t1, $zero, loop
move $v0, $s1        # return the count
[8 pts] Define the pipelining term structural hazard, and provide an example where a structural hazard arises.
A structural hazard occurs whenever two instructions in different pipeline stages simultaneously need a hardware resource that can only support one computation at a time. An example is memory access: Memory is accessed both in the IF stage and the MEM stage; this is normally dealt with through separate caches, but if it happens that both caches miss, then we have simultaneous loads from memory. Another potential example is an unpipelined, multi-cycle division unit: If one division instruction is in progress, then another division instruction would not be able to continue until the first one completes — a structural hazard.
[10 pts] Complete the following diagram of how instructions progress through the classic five-stage pipeline, assuming that the processor supports data forwarding. Draw arrows indicating points where forwarding is necessary.
We would have the following arrows indicating forwarding:
- from after ori's EX stage to the first lw's EX stage.
- from after ori's MEM stage to the second lw's EX stage.
- from after the second lw's MEM stage to add's EX stage.
- from after add's MEM stage to sw's MEM stage.
[16 pts] The following table summarizes an 8-stage pipeline with branch prediction.
|S0:||0.20 ns||By S0's end, branch prediction is determined.|
|S1:||0.10 ns||By S1's end, jump/branch destination is computed.|
|S3:||0.15 ns||By S3's end, branch decision is determined.|
However, inserting pipeline registers requires adding 0.05 ns to the time for the stage.
a. If we use this eight-stage pipeline, what is the best time per clock cycle? Explain your answer.
b. If we combine pairs of stages into one (S0 and S1 collapse together, then S2 and S3, then S4 and S5, then S6 and S7), what is the best time per clock cycle? Explain your answer.
c. Returning to the original eight-stage pipeline, assume that 5% of instructions executed are jumps or calls, while 20% are branches (of which 3/4 are taken). And you should assume that the branch prediction algorithm is correct 90% of the time (whether the branch is taken or not taken). How many clock cycles does the average instruction take in the 8-stage system? (Ignore data and structural hazards, as well as memory access delays.) Show your work.
a. 0.25 ns, since the longest stage is 0.20 ns, and we add 0.05 ns to account for pipeline registers.
b. 0.35 ns. The four combined stages have 0.30 (adding 0.2 and 0.1 for S0 and S1), 0.20, 0.25, 0.25 as their latencies. The longest among these is 0.3, and we add 0.05 to account for pipeline registers.
c. 0.75 ⋅ 1 + 0.05 ⋅ 2 + 0.15 ⋅ (0.9 ⋅ 2 + 0.1 ⋅ 4) + 0.05 ⋅ (0.9 ⋅ 1 + 0.1 ⋅ 4) = 0.75 + 0.1 + 0.33 + 0.065 = 1.245 cycles
[10 pts] Suppose we have a two-way set-associative cache with eight lines, each containing sixteen bytes. The cache uses a least-recently-used policy to replace lines; in the tables below, the “MRU?” column represents which line is most recently used within each set.
If the cache has the initial state diagrammed below, and we then access the address sequence 0xFC, 0x84, 0x18, 0x78, 0xB0, 0xDC, 0x00, what will be the final status of the cache?
[8 pts] The following table summarizes a two-layer cache system, providing the size and associativity of each cache, along with the time to access the cache and the fraction of overall program instructions that incur a miss on that cache while accessing data.
|L1:||16 KB||two-way||1 cycle||4.1%|
|L2:||256 KB||four-way||3 cycles||1.2%|
|RAM:||4 GB||n/a||150 cycles||0.0%|
a. Ignoring the time to fetch instructions, and ignoring any pipeline hazards, what is the average time per instruction? Show your work.
b. If we removed the L2 cache, keeping only the 16 KB L1 cache and RAM, what would be the average time per instruction? Show your work.
a. 1 + 0.041 ⋅ 3 + 0.012 ⋅ 150 = 2.923 cycles
b. 1 + 0.041 ⋅ 150 = 7.15 cycles
[24 pts] After studying the 5-stage RISC pipeline that we've been using, Betsy Bitdiddle observes that a large fraction (maybe 25%) of load/store instructions use a displacement of 0. Since these load/store instructions don't need the EX stage for computing the memory address, she suggests that they should skip directly from the ID stage to MEM. Of course, the ID stage would need to identify such instructions, but that is straightforward enough that it doesn't lead the ID stage to take longer.
a. What complexities does Betsy's design entail, and how can the circuit be modified to support her idea?
b. If Betsy's idea were implemented, in what circumstances will the modified CPU be able to complete programs faster? Your explanation should include a specific example of a code sequence, and it should discuss how performance is improved in that example.