Test 1: Questions

X1.1.

[8 pts] Microprocessor clock rates grew at a quick clip from 1988 to 2003, going from 16 MHz to 3.2 GHz. But then the race for faster clock rates essentially stopped. Why?

X1.2.

[8 pts] Ben Bitdiddle has been charged with improving the performance of a single-threaded program. He identifies two phases to the program: It starts in the first phase, spending 40% of its time there, before proceeding to the second, where it spends the remaining 60% of its time. Furthermore, he notices that the first phase is very amenable to parallelization, and he proposes to rewrite it so it will go four times as fast on a quad-core computer.

a. Fill in the blank: On a quad-core computer, Ben's revised program with the first phase parallelized will be at most    times as fast as the old program. (Show your work.)

b. Name the law that applies to this scenario of speeding up a portion of a system.

X1.3.

[8 pts] Disassemble each of the following MIPS machine language instructions into assembly language.

a. 000000 01001 10010 01100 00000 101010
b. 100011 00001 00010 11111 11111 111100
X1.4.

[8 pts] Translate the following MIPS “program” into machine language (binary).

loop: addi $t0, $t0, -1
      bne $t0, $zero, loop
      nop
X1.5.

[8 pts] Define the assembly language term “delay slot,” and explain why some instruction set architectures include it.

X1.6.

[12 pts] Suppose we decide to add a new instruction lwo to the MIPS instruction set to allow accessing memory based on the sum of a base register and an offset register. In an assembly program, it would appear as “lwo $t0, $t1($s0)”. Define how to encode such an instruction while keeping our circuit implementation simple; explain your answer.

X1.7.

[12 pts] Write a MIPS subroutine that takes one parameter providing the address where a subroutine f begins; this subroutine f (which you can call using jalr) takes an integer parameter and returns an integer. Your subroutine should repeatedly invoke f, starting from 1, until the result exceeds 1,000, and it should return the number of times f was invoked. For example, if f is the quadrupling function (f(n) = 4 × n), then your subroutine would return 5, since f(f(f(f(1)))) is 256 while f(256) is 1024.

The provided code in the template below handles saving some registers to the stack and restoring them at the end.

appsTo1000: addi $sp, $sp, -12
            sw $s0, ($sp)
            sw $s1, 4($sp)
            sw $ra, 8($sp)

            # insert your code here

            lw $s0, ($sp)
            lw $s1, 4($sp)
            lw $ra, 8($sp)
            addi $sp, $sp, 12
            jr $ra
X1.8.

[8 pts] Define the pipelining term structural hazard, and provide an example where a structural hazard arises.

X1.9.

[10 pts] Complete the following diagram of how instructions progress through the classic five-stage pipeline, assuming that the processor supports data forwarding. Draw arrows indicating points where forwarding is necessary.

ori $s0, $zero, 0xFF0   IF  ID  EX  MEM WB
lw  $t0, ($s0)
lw  $t1, 4($s0)
add $t2, $t0, $t1
sw  $t2, ($s0)
X1.10.

[16 pts] The following table summarizes an 8-stage pipeline with branch prediction.

S0: 0.20 ns   By S0's end, branch prediction is determined.
S1: 0.10 ns   By S1's end, jump/branch destination is computed.
S2: 0.05 ns
S3: 0.15 ns   By S3's end, branch decision is determined.
S4: 0.05 ns
S5: 0.20 ns
S6: 0.10 ns
S7: 0.15 ns

However, inserting pipeline registers requires adding 0.05 ns to the time for the stage.

a. If we use this eight-stage pipeline, what is the best time per clock cycle? Explain your answer.

b. If we combine pairs of stages into one (S0 and S1 collapse together, then S2 and S3, then S4 and S5, then S6 and S7), what is the best time per clock cycle? Explain your answer.

c. Returning to the original eight-stage pipeline, assume that 5% of instructions executed are jumps or calls, while 20% are branches (of which 3/4 are taken), and that the branch prediction algorithm is correct 90% of the time (whether the branch is taken or not). How many clock cycles does the average instruction take in the 8-stage system? (Ignore data and structural hazards, as well as memory access delays.) Show your work.

X1.11.

[10 pts] Suppose we have a two-way set-associative cache with eight lines, each containing sixteen bytes. The cache uses a least-recently-used policy to replace lines; in the tables below, the “MRU?” column represents which line is most recently used within each set.

If the cache has the initial state diagrammed below, and we then access the address sequence 0xFC, 0x84, 0x18, 0x78, 0xB0, 0xDC, 0x00, what will be the final status of the cache?

set   MRU?   data
0     0      M[0x00–0x0F]
      1      M[0x40–0x4F]
1     0      M[0x10–0x1F]
      1      M[0x50–0x5F]
2     0      M[0x20–0x2F]
      1      M[0x60–0x6F]
3     0      M[0x30–0x3F]
      1      M[0x70–0x7F]
X1.12.

[8 pts] The following table summarizes a two-layer cache system, providing the size and associativity of each cache, along with the time to access the cache and the fraction of overall program instructions that incur a miss on that cache while accessing data.

      size     assoc.     access time   miss rate
L1:   16 KB    two-way    1 cycle       4.1%
L2:   256 KB   four-way   3 cycles      1.2%
RAM:  4 GB     n/a        150 cycles    0.0%

a. Ignoring the time to fetch instructions, and ignoring any pipeline hazards, what is the average time per instruction? Show your work.

b. If we removed the L2 cache, keeping only the 16 KB L1 cache and RAM, what would be the average time per instruction? Show your work.

X1.13.

[24 pts] After studying the 5-stage RISC pipeline that we've been using, Betsy Bitdiddle observes that a large fraction (maybe 25%) of load/store instructions use a displacement of 0. Since these load/store instructions don't need the EX stage for computing the memory address, she suggests that they should skip directly from the ID stage to MEM. Of course, the ID stage would need to identify such instructions, but that is straightforward enough that it doesn't lead the ID stage to take longer.

a. What complexities does Betsy's design entail, and how can the circuit be modified to support her idea?

b. If Betsy's idea were implemented, in what circumstances will the modified CPU be able to complete programs faster? Your explanation should include a specific example of a code sequence, and it should discuss how performance is improved in that example.

Test 1: Solutions

X1.1.

A faster clock rate causes transistors to switch state more often, and each state transition generates heat. Up to 3.2 GHz, this increase in heat could be overcome through more sophisticated heat sinks and fans, but beyond that point architects have found little room for improving heat dissipation affordably.

X1.2.

a. 1.429. Tmodified = (0.4 / 4 + 0.6) Tbase = 0.7 Tbase; the speedup is Tbase / Tmodified = 1 / 0.7 = 1.429.

b. Amdahl's Law
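The arithmetic behind part (a) can be double-checked with a short Python sketch (the helper name amdahl_speedup is illustrative):

```python
def amdahl_speedup(parallel_fraction, factor):
    """Amdahl's Law: overall speedup when only part of the
    runtime is accelerated by the given factor."""
    return 1.0 / (parallel_fraction / factor + (1.0 - parallel_fraction))

# 40% of the runtime sped up 4x on a quad-core machine:
print(round(amdahl_speedup(0.4, 4), 3))  # 1.429
```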

X1.3.

a. slt $t4, $t1, $s2

b. lw $v0, -4($at)

X1.4.
001000 01000 01000 11111 11111 111111
000101 01000 00000 11111 11111 111110
000000 00000 00000 00000 00000 000000
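These encodings follow the I-type layout; a small Python sketch (the helper name itype is ours) packs the fields, with negative immediates in two's complement. Note that bne compares its operands symmetrically, so swapping the rs and rt fields yields an equivalent instruction:

```python
def itype(opcode, rs, rt, imm):
    """Pack a MIPS I-type instruction into a 32-bit binary string.
    The 16-bit immediate/offset field uses two's complement."""
    return f"{opcode:06b}{rs:05b}{rt:05b}{imm & 0xFFFF:016b}"

# addi $t0, $t0, -1  (opcode 8; $t0 is register 8)
print(itype(8, 8, 8, -1))
# bne $t0, $zero, loop  (opcode 5; PC-relative offset of -2 words)
print(itype(5, 8, 0, -2))
```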
X1.5.

The delay slot is the instruction following any jump or branch instruction, which is executed regardless of whether the computer transfers execution to the instruction indicated in the jump or branch. Some ISAs include a delay slot because it enables the processor to commence execution of an instruction even while it is in the middle of processing the jump or branch; this removes one possible pipeline stall that could otherwise result due to the control hazard that a jump or branch introduces.

X1.6.

This would be an R-type instruction. To be parallel with the other R-type instructions, we should use 0 for the uppermost six bits of the instruction, and we should choose some unallocated six-bit code for the lowermost six bits (according to the green sheet, anything starting with 0x3 is available; to match the lower four bits of lw's operation code, perhaps we'd choose 0x33). Like other R-type instructions, the destination register would go in the “rd” slot. We would place the base register (in the parentheses) into the “rs” slot to be consistent with the other load instructions, and then the “rt” slot would correspond to the offset register.

According to this scheme, then, we'd encode the instruction as follows:

000000  $s0    $t1    $t0    00000  110011
000000  10000  01001  01000  00000  110011
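Under the proposed scheme, any lwo encoding reduces to packing five R-type fields; this Python sketch (the helper name rtype is ours) reproduces the bit pattern above:

```python
def rtype(rs, rt, rd, shamt, funct):
    """Pack a MIPS R-type instruction (opcode 0) into 32 bits."""
    return f"{0:06b}{rs:05b}{rt:05b}{rd:05b}{shamt:05b}{funct:06b}"

# lwo $t0, $t1($s0): rs = $s0 (16), rt = $t1 (9), rd = $t0 (8), funct = 0x33
print(rtype(16, 9, 8, 0, 0x33))  # 00000010000010010100000000110011
```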
X1.7.
            move $s0, $a0
            li   $s1, 0
            li   $a0, 1
loop:       jalr $s0
            move $a0, $v0
            slti $t1, $a0, 1001
            addi $s1, $s1, 1
            bne  $t1, $zero, loop
            nop
            move $v0, $s1
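The control flow of this solution can be modeled at a high level in Python (the function name apps_to_1000 mirrors the label in the template):

```python
def apps_to_1000(f):
    """Apply f repeatedly starting from 1 until the result exceeds
    1000, mirroring the loop structure of the MIPS solution;
    return the number of calls made."""
    count, n = 0, 1
    while True:
        n = f(n)           # jalr $s0; move $a0, $v0
        count += 1         # addi $s1, $s1, 1
        if n > 1000:       # slti/bne test fails once n >= 1001
            return count   # move $v0, $s1

print(apps_to_1000(lambda n: 4 * n))  # 5, matching the quadrupling example
```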
X1.8.

A structural hazard occurs whenever two instructions in different pipeline stages simultaneously need a hardware resource that can only support one computation at a time. An example is memory access: Memory is accessed both in the IF stage and the MEM stage; this is normally dealt with through separate caches, but if it happens that both caches miss, then we have simultaneous loads from memory. Another potential example is an unpipelined, multi-cycle division unit: If one division instruction is in progress, then another division instruction would not be able to continue until the first one completes — a structural hazard.

X1.9.
ori $s0, $zero, 0xFF0   IF  ID  EX  MEM WB
lw  $t0, ($s0)              IF  ID  EX  MEM WB
lw  $t1, 4($s0)                 IF  ID  EX  MEM WB
add $t2, $t0, $t1                   IF  ID  stl EX  MEM WB
sw  $t2, ($s0)                          IF  stl ID  EX  MEM WB

We would have the following arrows indicating forwarding:

  • from after ori's EX stage to the first lw's EX stage.
  • from after ori's MEM stage to the second lw's EX stage.
  • from after the second lw's MEM stage to add's EX stage.
  • from after add's MEM stage to sw's MEM stage.
X1.10.

a. 0.25 ns, since the longest stage is 0.20 ns, and we add 0.05 ns to account for pipeline registers.

b. 0.35 ns. The four combined stages have 0.30 (adding 0.2 and 0.1 for S0 and S1), 0.20, 0.25, 0.25 as their latencies. The longest among these is 0.3, and we add 0.05 to account for pipeline registers.
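Parts (a) and (b) reduce to finding the slowest stage and adding the register overhead; a quick Python check:

```python
stages = [0.20, 0.10, 0.05, 0.15, 0.05, 0.20, 0.10, 0.15]  # S0..S7 latencies (ns)
REG = 0.05  # pipeline-register overhead added to each stage

# a. Eight-stage pipeline: the slowest stage sets the clock period.
print(round(max(stages) + REG, 2))  # 0.25

# b. Collapse adjacent pairs (S0+S1, S2+S3, S4+S5, S6+S7) into four stages.
pairs = [stages[i] + stages[i + 1] for i in range(0, 8, 2)]
print(round(max(pairs) + REG, 2))  # 0.35
```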

c. 0.75 ⋅ 1 + 0.05 ⋅ 2 + 0.15 ⋅ (0.9 ⋅ 2 + 0.1 ⋅ 4) + 0.05 ⋅ (0.9 ⋅ 1 + 0.1 ⋅ 4) = 0.75 + 0.1 + 0.33 + 0.065 = 1.245 cycles
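The weighted sum in part (c) can be checked directly:

```python
# Cycle costs: jumps resolve after S1 (2 cycles); a correctly predicted
# taken branch also costs 2, a misprediction resolves after S3 (4 cycles),
# and a correctly predicted not-taken branch costs 1.
others   = 0.75 * 1
jumps    = 0.05 * 2
taken    = 0.15 * (0.9 * 2 + 0.1 * 4)
nottaken = 0.05 * (0.9 * 1 + 0.1 * 4)
cpi = others + jumps + taken + nottaken
print(round(cpi, 3))  # 1.245
```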

X1.11.
set   MRU?   data
0     0      M[0x80–0x8F]
      1      M[0x00–0x0F]
1     0      M[0x10–0x1F]
      1      M[0xD0–0xDF]
2     0      M[0x20–0x2F]
      1      M[0x60–0x6F]
3     1      M[0xB0–0xBF]
      0      M[0x70–0x7F]
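The final state can be reproduced with a small LRU cache simulation in Python (all names below are ours); each set is kept as a list ordered LRU-first, so the last entry corresponds to MRU? = 1 in the tables:

```python
LINE_SIZE, NUM_SETS = 16, 4   # 8 lines, two-way => 4 sets of 16-byte lines

# Initial state from the problem: LRU line first, MRU line last.
sets = [[0x00, 0x40], [0x10, 0x50], [0x20, 0x60], [0x30, 0x70]]

for addr in [0xFC, 0x84, 0x18, 0x78, 0xB0, 0xDC, 0x00]:
    base = addr // LINE_SIZE * LINE_SIZE      # start of the 16-byte block
    s = (addr // LINE_SIZE) % NUM_SETS        # set index
    if base in sets[s]:
        sets[s].remove(base)                  # hit: promote to MRU
    else:
        sets[s].pop(0)                        # miss: evict the LRU line
    sets[s].append(base)

for s, (lru, mru) in enumerate(sets):
    print(f"set {s}: LRU M[0x{lru:02X}-0x{lru + 15:02X}], "
          f"MRU M[0x{mru:02X}-0x{mru + 15:02X}]")
```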
X1.12.

a. 1 + 0.041 ⋅ 3 + 0.012 ⋅ 150 = 2.923 cycles

b. 1 + 0.041 ⋅ 150 = 7.15 cycles
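Both averages follow the same pattern: the base cycle plus each miss rate times the next level's access time. In Python:

```python
l1_miss, l2_miss = 0.041, 0.012   # per-instruction miss fractions
l2_time, ram_time = 3, 150        # access times in cycles

# a. With both cache levels in place.
with_l2 = 1 + l1_miss * l2_time + l2_miss * ram_time
print(round(with_l2, 3))  # 2.923

# b. L1 misses go straight to RAM once L2 is removed.
without_l2 = 1 + l1_miss * ram_time
print(round(without_l2, 2))  # 7.15
```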

X1.13.
TODO