Final Review A
What are CUDA and OpenCL for?
CUDA and OpenCL both exist to provide an interface for writing programs that will execute directly on the graphics processing unit (GPU). A GPU can provide additional processing power, particularly for computational problems that are amenable to parallelization.
Characterize some differences between a typical GPU architecture and a typical CPU.
A GPU has several computation units (“thread processors”) that are all driven with the same instructions, though each has its own register set.
A GPU distinguishes between three types of memory: “private” memory available to just one computation unit, “local” memory that is available to one group of computation units (all driven by the same instruction stream), and “global” memory that is available across the processor. By contrast, a CPU has only registers (analogous to a GPU's private memory) and memory (analogous to a GPU's global memory).
A GPU includes aggressive multithreading, with 96 threads per multiprocessor being typical. By contrast, a CPU handles fewer threads at any one time, but it works more aggressively to discover instruction-level parallelism so that each individual thread can be completed more quickly.
Using the MIPS ll and sc instructions, write code that atomically swaps the value in register $s0 with the value stored at the memory address found in $s1.
again:  ll   $t1, ($s1)        # load linked: fetch the old memory value
        move $t2, $s0          # value we want to store
        sc   $t2, ($s1)        # store conditional: $t2 becomes 1 on success, 0 on failure
        beq  $t2, $zero, again # retry if another thread intervened
        move $s0, $t1          # complete the swap: old memory value into $s0
Suppose we want to copy the value at the memory address found in $s0 to the memory address found in $s1. The natural approach is to use “lw $t0, ($s0)” followed by “sw $t0, ($s1)”. But in a multi-threaded application, other threads' work may be inserted between these two instructions, and those threads could change the memory at ($s0), so the sw would be working with an old value when it is finally executed.
Using the MIPS ll and sc instructions, write code that accomplishes this copy atomically; that is, the store should take effect atomically with respect to the load.
again:  ll   $t1, ($s1)        # link to the destination address
        lw   $t0, ($s0)        # load the value to copy
        sc   $t0, ($s1)        # store succeeds only if ($s1) is untouched since the ll
        beq  $t0, $zero, again # retry if the store conditional failed
Symmetric multiprocessors are simpler to implement than distributed shared memory. Why, then, is distributed shared memory generally a better option when there are more than eight processors?
A symmetric multiprocessor relies on a single memory bus shared among all processors for communication with the single shared memory. Beyond eight processors, the traffic on this bus and the traffic with the main memory become a bottleneck impeding the system's speed. A distributed system allows direct communication between processors along a switching network.
Suppose two cores execute the following threads, which have access to the shared-memory variables a and b.

Thread A        Thread B
a = 1;          b = 12;
b = 11;         a = 2;

Assuming our cache is coherent, which of the following are possible final values for a and b?

a. a = 1, b = 11    b. a = 1, b = 12    c. a = 2, b = 11    d. a = 2, b = 12
All four possibilities could occur. It is easy to come up with cases that lead to (a.), (c.), and (d.): we could execute all of B before A, or execute the first instruction of each thread before the second instruction of each, or execute all of A before B. By contrast, (b.) is sequentially inconsistent, so there is no such sequence. However, if A and B occur at virtually the same time, the definition of coherence alone could still permit a to eventually become 1 while b eventually becomes 12.
Describe the write invalidate protocol for cache coherency in shared-memory systems.
In the write invalidate protocol, whenever a processor wishes to write to a memory location, the cache first broadcasts onto the bus that other caches should invalidate their copies. From then on, the cache considers itself to have an exclusive copy of that data, so that subsequent writes to that memory location can reside purely in the cache. When another processor requests the memory location, or when the processor decides that it is time to use the cache location for another purpose, then the processor sends the actual value written onto the bus.
There are two basic ways of implementing snooping to support cache coherency in shared memory systems: write broadcast and write invalidate. What is the primary advantage of write invalidate over write broadcast?
Write invalidate's primary advantage is that it requires less traffic for multiple writes to the same cache line: With write broadcast, every write to a cache line must be broadcast onto the bus, but with write invalidate, only the first write needs a broadcast on the bus. (Of course, the value must be broadcast later when the cache decides no longer to include that memory location, or when another cache needs to read from that memory location.)
Explain what the directory system for cache coherence entails. Why is it particularly applicable for a DSM (distributed shared memory) system?
In a directory system, there is a central location for each line of memory that tracks which processors' caches hold a copy of that line. Having a central location allows processors to determine how a line is shared. This is particularly relevant for a DSM system: since messages are not broadcast to all other processors (across a bus), a processor needs to know which other processors to communicate with.
In class we studied four potential network topologies: fully connected, grid, torus, and hypercube. Another alternative is a ring topology, where processors are essentially lined up in a circle. Suppose we have a ring topology over n processors.
a. What is the diameter of this topology?
b. What is the average distance between any two nodes of this topology?
c. How many direct connections does this topology include?
a. n / 2 (rounded down when n is odd), since for any node the farthest node is halfway around the ring.
b. n / 4: for any node (taking n to be even), there is one node that is 0 away, one that is n / 2 away, and two nodes at each integer distance between 0 and n / 2. The distances therefore total n² / 4, for an average of n / 4.
c. n, since each processor has exactly one direct link to the next processor around the ring.