[8 pts] Define the term instruction-level parallelism (ILP), and include at least one example of a processor design technique illustrating it.
Instruction-level parallelism refers to opportunities to execute simultaneously different instructions from the same instruction stream, even though the stream is written as if it is sequential. Pipelined and superscalar architectures are two techniques that use ILP to enhance processor performance.
Suppose we have a stack architecture including LOAD (direct addressing), STORE (direct addressing), ADD, and MULT instructions. Supposing that x is stored at address 1000 and y is stored at address 1004, write a sequence of instructions corresponding to the C statement “x = x * x + y * y;”.
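One possible sequence (a sketch; the instruction names LOAD and ADD are assumptions, since the question names only STORE and MULT) is: LOAD 1000, LOAD 1000, MULT, LOAD 1004, LOAD 1004, MULT, ADD, STORE 1000. A small Python simulation of this hypothetical stack machine:

```python
# A sketch simulating the hypothetical stack machine; the LOAD and ADD
# instruction names are assumptions.
def run(program, memory):
    stack = []
    for op, *args in program:
        if op == "LOAD":           # push the word at the given address
            stack.append(memory[args[0]])
        elif op == "STORE":        # pop the top of stack into memory
            memory[args[0]] = stack.pop()
        elif op == "MULT":         # pop two operands, push their product
            stack.append(stack.pop() * stack.pop())
        elif op == "ADD":          # pop two operands, push their sum
            stack.append(stack.pop() + stack.pop())
    return memory

# x = x * x + y * y, with x at address 1000 and y at address 1004
program = [("LOAD", 1000), ("LOAD", 1000), ("MULT",),
           ("LOAD", 1004), ("LOAD", 1004), ("MULT",),
           ("ADD",), ("STORE", 1000)]

mem = run(program, {1000: 3, 1004: 4})
print(mem[1000])  # 3*3 + 4*4 = 25
```

Each LOAD pushes a fresh copy of the named memory word, which is why x must be loaded twice to compute x * x.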
[10 pts] Translate the following MIPS “program” into machine language (binary).
loop: lw $t0, 0($s0)
addi $s0, $s0, -4
bne $s0, $s1, loop
|lw $t0, 0($s0)||100011 10000 01000 0000000000000000|
|addi $s0, $s0, -4||001000 10000 10000 1111111111111100|
|bne $s0, $s1, loop||000101 10000 10001 1111111111111101|
The below MIPS subroutine accepts a subroutine parameter f and counts how many times f can be applied to its own result before we reach zero, starting with 1000. For example, if f is integer division by 10, it would return 4 (1000 / 10 / 10 / 10 / 10 = 0). Though this is an admirable attempt, testing demonstrates that it does not work correctly. Explain what is wrong, and show how to repair the code.
move $t0, $a0 # place f into $t0
li $t1, 1000 # $t1 is number we're currently at
li $t2, 0 # $t2 counts number of applications of f
cnext: move $a0, $t1 # call f(current), result into $t1
jalr $t0
move $t1, $v0
addi $t2, $t2, 1 # increment counter
bne $t1, $zero, cnext # repeat if current is still not zero
move $v0, $t2 # return counter
jr $ra
The subroutine f called within the loop can potentially change all caller-save registers (and it will certainly change $ra). But the subroutine as written depends upon $t0, $t1, $t2, and $ra all retaining their original value, even though they are caller-save. A partial solution is to use callee-save registers $s0, $s1, and $s2 wherever the original code uses $t0, $t1, and $t2. But using these requires that the subroutine restore these registers' original values (as well as $ra) before returning. So another required step would be to add code at the beginning to push the old values onto the stack, and to add code before “jr $ra” to restore the pushed values.
The added code at the beginning would be:
addi $sp, $sp, -16
sw $ra, 12($sp)
sw $s0, 8($sp)
sw $s1, 4($sp)
sw $s2, 0($sp)
move $s0, $a0 # original first instruction, using $s0 not $t0
And at the end would be:
move $v0, $s2 # original next-to-last instruction, using $s2 not $t2
lw $ra, 12($sp)
lw $s0, 8($sp)
lw $s1, 4($sp)
lw $s2, 0($sp)
addi $sp, $sp, 16
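For reference, the behavior the repaired subroutine should have can be modeled in a few lines of Python (f here is the integer-division-by-10 example from the question):

```python
# A Python model of what the subroutine is meant to compute: the number of
# times f can be applied to its own result, starting from 1000, before
# reaching zero.
def count_applications(f, start=1000):
    current, count = start, 0
    while True:
        current = f(current)   # apply f to the running value
        count += 1             # one more application of f
        if current == 0:
            return count

print(count_applications(lambda n: n // 10))  # 1000 -> 100 -> 10 -> 1 -> 0: 4
```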
Suppose we decide to add a new instruction pop to the MIPS instruction set to compute the number of 1 bits in the binary representation of a register's value, placing the result into a register. In an assembly program, it would appear as “pop $t0, $s1”. Define a system for encoding such an instruction while keeping our circuit implementation simple; explain your answer.
Since this instruction only has register arguments, it should be an R-type instruction. The first six bits should therefore be 0's. Looking at the available R-type opcodes, the last six bits might be 0x28 (which places it next to the other bit-twiddling instructions, the last of which is nor at 0x27). The R-type instructions all place the destination register ($t0 in the example) in bits 11–15, while the R-type instructions that use just one register place that register into bits 21–25. Bits 16–20 and 6–10 should be 0's. For “pop $t0, $s1”, then, the encoding would be 000000 10001 00000 01000 00000 101000.
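A quick Python sketch of both the operation and the proposed encoding ($s1 = register 17 in bits 21–25, $t0 = register 8 in bits 11–15; the funct value 0x28 is the assumption made in the answer):

```python
# What pop would compute, plus the proposed R-type encoding.
def popcount(value):
    return bin(value & 0xFFFFFFFF).count("1")   # number of 1 bits

def encode_pop(rd, rs):
    # opcode, rt, and shamt fields are all zero; funct 0x28 is an assumption
    return (rs << 21) | (rd << 11) | 0x28

print(popcount(0b1011))             # 3
print(f"{encode_pop(8, 17):032b}")  # matches the grouped bits above
```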
[8 pts] Why do some instruction architectures choose to encode instructions into 16 bits, even though this severely restricts both the number of operations that the architecture can support and the number of registers?
A 16-bit encoding allows programs to be encoded in less memory. A 32-bit encoding tends to lead to fewer instructions per program, and hence a greater opportunity for increased performance; but for low-cost systems, performance is not as high a priority as cost and power. (Less memory in the system means a lower power requirement.)
[8 pts] Assume that we can divide a pipeline for processing instructions into as many equal-sized pipeline stages as we want. Why might we decide it better to have fewer pipeline stages than possible?
One reason is that a more finely separated pipeline requires running the clock faster to get any advantage, but running the clock beyond 4 GHz is basically impossible due to the amount of heat generated by the fast-switching transistors.
Another is that increasing the pipeline length does not increase the clock speed proportionately, due to the delay introduced by each pipeline register. In itself, this is not a problem, until one considers data and control hazards that lead to pipeline delays. For instance, suppose that 10% of instructions turn out to be mispredicted branches, and branch predictions aren't determined until halfway through the pipeline. If an original instruction takes T time and a pipeline register introduces a delay of p, then a 10-stage pipeline would have a clock cycle of T/10 + p; but 10% of instructions would complete after 5 clock cycles, so the average clocks per instruction would be 0.9 ⋅ 1 + 0.1 ⋅ 5 = 1.4, for an average time per instruction of 1.4 (T/10 + p) = 0.14 T + 1.4 p. For a 20-stage pipeline, a clock cycle takes T/20 + p time, while the average clocks per instruction is 0.9 ⋅ 1 + 0.1 ⋅ 10 = 1.9, for an average time of 1.9 (T/20 + p) = 0.095 T + 1.9 p. If p is at least 9% of T, then the 10-stage pipeline will end up completing instructions more quickly.
A final reason — though not a major concern — is that the forwarding logic becomes increasingly unwieldy for longer pipelines, as logic must exist basically to forward data from every stage to every earlier stage. Eventually this logic becomes too unwieldy to be a practical use of transistors.
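The 10- versus 20-stage comparison can be reproduced with a short Python sketch (T is the unpipelined instruction time, p the pipeline-register delay; 10% of instructions are mispredicted branches assumed resolved halfway down the pipeline):

```python
# Average time per instruction for an s-stage pipeline, following the
# arithmetic above: clock = T/s + p, and 10% of instructions (mispredicted
# branches) complete only after s/2 cycles.
def avg_time(stages, T, p):
    clock = T / stages + p
    cpi = 0.9 * 1 + 0.1 * (stages // 2)
    return cpi * clock

T = 1.0
for p in (0.05, 0.10, 0.15):   # pipeline-register delay as a fraction of T
    print(p, avg_time(10, T, p), avg_time(20, T, p))
```

With these numbers the two pipelines break even near p = 0.09 T, matching the 9% figure above: below that, the 20-stage pipeline wins; above it, the 10-stage pipeline wins.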
[12 pts] Suppose we have a 6-stage pipeline that can complete one instruction each clock cycle in the absence of any jumps or branches. (We assume there are no data or structural hazards.) For jumps or branches, the destination of the jump is known at the end of the pipeline's third stage, whereas the decision of whether a branch is taken is determined at the end of the pipeline's fourth stage.
Assume that 5% of instructions in a typical program are jump instructions, while 15% of instructions are branch instructions, of which 2/3 are taken.
a. What is the average number of clock cycles per instruction if we always stall for every jump or branch until the next instruction's location is known? Show your work.
b. What is the average number of clock cycles per instruction if we speculatively execute assuming that branches are not taken? Show your work.
c. What is the average number of clock cycles per instruction if we speculatively execute assuming that branches are taken? Show your work.
a. Each jump effectively takes 3 cycles (its destination is known only after stage 3) and each branch 4 cycles (its decision is known only after stage 4): 0.8 * 1 + 0.05 * 3 + 0.15 * 4 = 1.55.
b. Taken branches (10% of instructions) are mispredicted and take 4 cycles, while not-taken branches take 1 cycle: 0.8 * 1 + 0.05 * 3 + 0.10 * 4 + 0.05 * 1 = 1.4.
c. Taken branches can be fetched from the destination after stage 3 (3 cycles), while not-taken branches are mispredicted until stage 4 (4 cycles): 0.8 * 1 + 0.05 * 3 + 0.10 * 3 + 0.05 * 4 = 1.45.
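The three computations can be checked with a few lines of Python (percentages as given in the problem):

```python
# CPI under the three policies: 5% jumps resolved after stage 3, 15% branches
# resolved after stage 4, of which 2/3 (10% of all instructions) are taken.
other, jump, taken, not_taken = 0.80, 0.05, 0.10, 0.05

stall_always  = other * 1 + jump * 3 + (taken + not_taken) * 4
predict_not   = other * 1 + jump * 3 + taken * 4 + not_taken * 1
predict_taken = other * 1 + jump * 3 + taken * 3 + not_taken * 4

print(round(stall_always, 2), round(predict_not, 2), round(predict_taken, 2))
# 1.55 1.4 1.45
```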
[10 pts] When executing the following program, which happens to count the number of e's in the word eve, the first six branches occur as listed at right.
a. Suppose we predict branches using a local prediction buffer, where each entry is a two-bit predictor initially configured to predict “weakly no” (that is, predict “no” though the last observation was “yes”). Which branches will be predicted incorrectly? Show your work.
b. Suppose we predict branches using a three-bit global branch history, where each entry is again a two-bit predictor initially configured to predict “weakly no.” The previous several branches have been “yes.” Which branches will be predicted incorrectly? Show your work.
a. 2, 3, 5, 6. The predictions would be N, N, N, N, N, Y. The table below shows how each buffer entry changes after each prediction; the number following each state is the branch that produced it (branches 1, 2, 4, and 5 map to entry A, branches 3 and 6 to entry B).

|entry||state sequence|
|A||WN/0, SN/1, WN/2, SN/4, WN/5|
|B||WN/0, SY/3, WY/6|
b. 2, 3, 6. The predictions would be N, N, N, N, Y, Y. The table below shows how each history-table entry changes after each prediction; a row's label lists the three previous branch results with the most recent at the right, so “YYN” represents the situation where the previous branch result was “no” but the two before it were “yes.”

|history||state sequence|
|NNN||WN/0|
|NNY||WN/0|
|NYN||WN/0|
|NYY||WN/0, SN/4|
|YNN||WN/0|
|YNY||WN/0, SY/3, WY/6|
|YYN||WN/0, SY/2, SY/5|
|YYY||WN/0, SN/1|
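For reference, here is a standard saturating two-bit predictor in Python. (Note this is a sketch of the common saturating-counter scheme; the tables above follow a slightly different convention in which a misprediction from a weak state jumps to the strong opposite state, while the counter below moves one step at a time.)

```python
# A saturating two-bit predictor (states: 0 = SN, 1 = WN, 2 = WY, 3 = SY).
class TwoBitPredictor:
    def __init__(self):
        self.state = 1                     # start at "weakly no"

    def predict(self):
        return self.state >= 2             # True means predict "yes"

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
wrong = []
for i, taken in enumerate([False, True, True, False], start=1):
    if p.predict() != taken:
        wrong.append(i)                    # record mispredicted branches
    p.update(taken)
print(wrong)  # [2, 3, 4]
```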
[10 pts] Give an example of a MIPS instruction sequence where out-of-order execution improves performance on a 5-stage pipeline. Explain how out-of-order execution would help.
Suppose we had the following instruction sequence.
lw $t1, 0($a0)
ori $t2, $t1, 1
addi $a0, $a0, 4
In a pipelined processor doing in-order execution, the lw instruction would reach the MEM stage at the same time that the ori instruction reaches the EX stage. The ori instruction would have to stall at this point, since the lw instruction will not have determined the value for $t1 (which the EX stage needs to execute the ori instruction) until the MEM stage completes. This stall adds one clock cycle to the time needed to complete the sequence.

This wasted clock cycle could be avoided, however, if the processor executed the addi instruction first. Then the addi instruction (which doesn't depend on the result of the lw instruction) could complete the EX stage at the time the lw instruction completes the MEM stage. The ori instruction would complete the EX stage at the same time as it otherwise would have done.
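A rough Python model of the two schedules for the lw/ori/addi sequence (the EX-completion cycle for each instruction, assuming one instruction enters the pipeline per cycle and lw's result is forwarded only after its MEM stage):

```python
# EX-completion cycle for each instruction in issue order. Stage timing:
# an unstalled instruction issued at slot i reaches EX in cycle i + 3, and
# lw's MEM stage completes in cycle (lw's slot) + 4.
def ex_finish(order, lw_index=0):
    finishes, prev = [], 0
    lw_mem = lw_index + 4
    for i, needs_lw in enumerate(order):
        ex = max(i + 3, prev + 1)      # in-order issue, one EX per cycle
        if needs_lw:
            ex = max(ex, lw_mem + 1)   # wait for the value forwarded from MEM
        finishes.append(ex)
        prev = ex
    return finishes

# booleans mark whether an instruction needs lw's result ($t1)
print(ex_finish([False, True, False]))   # lw, ori, addi -> [3, 5, 6]
print(ex_finish([False, False, True]))   # lw, addi, ori -> [3, 4, 5]
```

The reordered schedule finishes one cycle earlier, and ori still completes EX in cycle 5 either way, matching the prose above.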
[8 pts] Define the term simultaneous multithreading in processor design.
In a simultaneous multithreaded processor, the processor is aware of different threads of execution (instruction queues manipulating different sets of registers), and instructions from different threads can be dispatched into execution units in the same clock cycle.
[10 pts] Assume we have a vector processor with vector registers of length n where n ≥ 16. The processor has one load/store unit (latency 7) and one addition unit (latency 3). Assuming the processor does not avoid latency between consecutive instructions, but it does support chaining, complete the following table showing in which clock cycles each indicated element is in an execution unit.
|instruction||first elt||last elt|
|first load||0…6||n − 1…6 + n − 1|
|second load||n…6 + n||2 n − 1…6 + 2 n − 1|
|addition (chained)||6 + n…8 + n||6 + 2 n − 1…8 + 2 n − 1|
|store||2 n…6 + 2 n||3 n − 1…6 + 3 n − 1|
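The table's entries can be generated for a concrete n with a short Python sketch (the instruction sequence — load, load, chained add, store — is inferred from the latencies; each pair gives the first and last cycle an element spends in its execution unit):

```python
# Cycle ranges from the table for a concrete n. Loads/stores occupy their
# unit for 7 cycles, the addition for 3; the add is chained off the second
# load, and the store waits for the load/store unit and the add results.
def ranges(n):
    load1 = ((0, 6), (n - 1, 6 + n - 1))
    load2 = ((n, 6 + n), (2*n - 1, 6 + 2*n - 1))
    add   = ((6 + n, 8 + n), (6 + 2*n - 1, 8 + 2*n - 1))
    store = ((2*n, 6 + 2*n), (3*n - 1, 6 + 3*n - 1))
    return load1, load2, add, store

for first_elt, last_elt in ranges(16):
    print(first_elt, last_elt)
```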
[8 pts] In a graphics processing unit (GPU), distinguish the terms “private memory,” “local memory,” and “global memory.”
“Private” memory is available to just one computation unit (“stream processor”). “Local” memory is available to one group of computation units (a “multiprocessor,” driving multiple stream processors with the same instruction stream). And “global” memory is available across the GPU (and to the CPU).
Using the MIPS ll and sc instructions, write a fragment of MIPS assembly code that atomically waits until the value at the memory address contained in $s0 is 0 before setting that memory to 1. (This is a basic “lock” operation.)
lock: ll $t0, ($s0)
bne $t0, $zero, lock # keep waiting while the value is nonzero
li $t1, 1
sc $t1, ($s0)
beq $t1, $zero, lock # the sc failed, so start over
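A single-threaded Python model of an ll/sc acquire loop (the link field stands in for the processor's load-linked bookkeeping, so sc succeeds only when it follows a matching ll):

```python
# A model of ll/sc semantics: sc stores only if the reservation made by the
# most recent ll is still valid for that address.
class Memory:
    def __init__(self):
        self.data = {}
        self.link = None                  # address of the active ll, if any

    def ll(self, addr):                   # load-linked
        self.link = addr
        return self.data.get(addr, 0)

    def sc(self, addr, value):            # store-conditional: 1 on success
        if self.link != addr:
            return 0
        self.data[addr] = value
        self.link = None
        return 1

def acquire(mem, addr):                   # the lock loop in Python form
    while True:
        if mem.ll(addr) != 0:             # value nonzero: keep waiting
            continue                      # (model assumes it becomes free)
        if mem.sc(addr, 1):               # try to claim; retry on sc failure
            return

mem = Memory()
acquire(mem, 0x1000)                      # lock starts free (value 0)
print(mem.data[0x1000])  # 1
```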
[10 pts] Define the snoopy approach to cache coherence, and explain why it is usually applicable only to systems with fewer than eight cores.
The snoopy approach involves connecting each core to a common bus, so that all communication between a core and memory is visible to all cores. Individual cores can update their private caches based on the traffic they see on the bus.
This system tends to break down beyond eight cores because each core needs a certain bandwidth to retrieve memory, but a bus is limited to handling only one message per clock cycle. The amount of traffic that the bus can handle ends up becoming a bottleneck beyond eight cores.
[8 pts] Distinguish the terms cache coherency and cache consistency in multicore processors.
Cache coherency considers each memory address in isolation, looking for a guarantee that the sequence of accesses to that individual memory address is reasonable; it also allows for some fudging where individual processors' accesses are “near-simultaneous.” Cache consistency looks for a more precise ordering of instructions between different threads, and it looks for consistency in the order of access across all memory addresses.
Suppose two cores execute the following threads, which have access to the shared-memory variables a and b, initially holding 1 and 3 respectively.

|Thread A||Thread B|
|Store 6 into b||Load a|
|Store 2 into a||Load b, then display the sum of the two loaded values|

Assuming our cache is sequentially consistent, what are the possible results displayed by Thread B? Explain each answer.
4: We could execute all of B before executing any of A, so B would get the initial values 1 and 3 for a and b.

7: We could execute B's first instruction (getting 1), then A's first instruction (setting b to 6), then B's second instruction (getting 6).

8: We could execute all of A before executing any of B, so B would get the values for a and b as set by A: 2 + 6.
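These three results can be confirmed by enumerating every sequentially consistent interleaving in Python (B's two loads are placed in every possible position relative to A's two stores, with each thread's own order preserved):

```python
# Enumerate sequentially consistent interleavings: A performs its stores in
# order, B performs its loads in order; only their relative placement varies.
from itertools import combinations

results = set()
for b_slots in combinations(range(4), 2):  # positions of B's two loads
    mem = {"a": 1, "b": 3}                 # initial values from the problem
    a_ops = [("b", 6), ("a", 2)]           # A: store 6 into b, then 2 into a
    loads, ai = [], 0
    for slot in range(4):
        if slot in b_slots:
            loads.append(mem["a"] if not loads else mem["b"])
        else:
            var, val = a_ops[ai]
            mem[var] = val
            ai += 1
    results.add(loads[0] + loads[1])

print(sorted(results))  # [4, 7, 8]
```

Notice that 5 (reading a = 2 but b = 3) never appears: once B sees A's second store, A's first store must already have happened.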
[10 pts] Suppose we have a set of text files, each corresponding to a class offered at Hendrix, where each text file includes one student ID per line followed by that student's letter grade in the course. Describe how to use MapReduce/Hadoop to compute each student's GPA (grade point average): We average the grades where an A is worth 4, B 3, C 2, D 1, and F 0. (You can assume all grades are A/B/C/D/F, that all courses have the same credit value, and that students never re-take a course to replace a former grade.) For example, if a student earned an A in one course, a B in two other courses, and an F in another course, then that student's GPA is computed as (4 + 3 + 3 + 0) / 4 = 2.5 — we divide by four since this student has taken four courses.
In the Map phase, we would decode the file and emit (studentName, gradeValue) for each student/grade pair found in the file. (We'd translate the letter grade into a number between 0 and 4 at this point.)
In the Reduce phase, we'd take the list of gradeValues received and compute both the sum of the list and the length of the list, reporting the quotient as our result for that student.
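A plain-Python sketch of the two phases (the line format “studentID grade” is an assumption about the files; a real Hadoop job would shuffle the emitted pairs from all files to the reducers by key):

```python
# Map: emit (studentID, gradeValue) for every line; Reduce: average the
# grade values collected for one student.
POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def map_phase(file_lines):
    for line in file_lines:
        student, grade = line.split()
        yield student, POINTS[grade]       # translate the letter grade here

def reduce_phase(student, grade_values):
    return student, sum(grade_values) / len(grade_values)

# a tiny stand-in for the shuffled input of one reducer
pairs = list(map_phase(["s1 A", "s1 B", "s1 B", "s1 F"]))
values = [v for _, v in pairs]
print(reduce_phase("s1", values))  # ('s1', 2.5)
```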