Final Review A
What are CUDA and OpenCL for?
CUDA and OpenCL both exist to provide an interface for writing programs that will execute directly on the graphics processing unit (GPU). A GPU can provide additional processing power, particularly for computational problems that are amenable to parallelization.
Characterize some differences between a typical GPU architecture and a typical CPU.
A GPU has several computation units (“thread processors”) that are all driven with the same instructions, though each has its own register set.
A GPU distinguishes between three types of memory: “private” memory available to just one computation unit, “local” memory that is available to one group of computation units (all driven by the same instruction stream), and “global” memory that is available across the processor. By contrast, a CPU has only registers (analogous to a GPU's private memory) and memory (analogous to a GPU's global memory).
A GPU includes aggressive multithreading, with 96 threads per multiprocessor being typical. By contrast, a CPU handles fewer threads at any one time, but it works more aggressively to discover instruction-level parallelism so that each individual thread can be completed more quickly.
Using the MIPS ll and sc instructions, write code that atomically swaps the value in register $s0 with the value stored at the memory address found in $s1.
again:  ll   $t1, ($s1)        # load linked: fetch the old memory value
        move $t2, $s0          # value we want to store
        sc   $t2, ($s1)        # store conditional: $t2 becomes 1 on success, 0 on failure
        beq  $t2, $zero, again # retry if another thread intervened
        move $s0, $t1          # complete the swap: old memory value into $s0
Suppose we want to copy the value at the memory address found in $s0 to the memory address found in $s1. The natural approach is to use “lw $t0, ($s0)” followed by “sw $t0, ($s1)”. But in a multi-threaded application, other threads' work may be inserted between these two instructions, and those threads could change the memory at ($s0), so the sw would be working with an old value when it is finally executed.
Using the MIPS ll and sc instructions, write code that accomplishes this copy atomically; that is, the store should take effect atomically with respect to the load.
again:  ll   $t1, ($s1)        # link to the destination address
        lw   $t0, ($s0)        # load the value to copy
        sc   $t0, ($s1)        # store succeeds only if ($s1) is untouched since the ll
        beq  $t0, $zero, again # retry if the store conditional failed
Symmetric multiprocessors are simpler to implement than distributed shared memory. Why, then, is distributed shared memory generally a better option when there are more than eight processors?
A symmetric multiprocessor relies on a single memory bus shared among all processors for communication with the single shared memory. Beyond eight processors, the traffic on this bus and the traffic with the main memory become a bottleneck impeding the system's speed. A distributed system allows direct communication between processors along a switching network.
Suppose two cores execute the following threads, which have access to the shared-memory variables a and b.

Thread A        Thread B
a = 1;          b = 12;
b = 11;         a = 2;

Assuming our cache is coherent, which of the following are possible final values for a and b?

a. a = 1, b = 11    b. a = 1, b = 12    c. a = 2, b = 11    d. a = 2, b = 12
All four possibilities could occur. It is easy to come up with cases that lead to (a.), (c.), and (d.): we could execute all of B before A, or execute the first instruction of each thread before the second instruction of each, or execute all of A before B. By contrast, (b.) is sequentially inconsistent, so there is no such sequence. However, if A and B occur at virtually the same time, the definition of coherence alone could still permit a to eventually become 1 while b eventually becomes 12.
Describe the write invalidate protocol for cache coherency in shared-memory systems.
In the write invalidate protocol, whenever a processor wishes to write to a memory location, the cache first broadcasts onto the bus that other caches should invalidate their copies. From then on, the cache considers itself to have an exclusive copy of that data, so that subsequent writes to that memory location can reside purely in the cache. When another processor requests the memory location, or when the processor decides that it is time to use the cache location for another purpose, then the processor sends the actual value written onto the bus.
There are two basic ways of implementing snooping to support cache coherency in shared memory systems: write broadcast and write invalidate. What is the primary advantage of write invalidate over write broadcast?
Write invalidate's primary advantage is that it requires less traffic for multiple writes to the same cache line: With write broadcast, every write to a cache line must be broadcast onto the bus, but with write invalidate, only the first write needs a broadcast on the bus. (Of course, the value must be broadcast later when the cache decides no longer to include that memory location, or when another cache needs to read from that memory location.)
Explain what the directory system for cache coherence entails. Why is it particularly applicable for a DSM (distributed shared memory) system?
In a directory system, there is a central location for each line of memory that tracks which processors' caches hold a copy of that line. Having a central location allows processors to determine how a line is shared. This is particularly relevant for a DSM system: since messages are not broadcast to all other processors (across a bus), a processor needs to know which other processors to communicate with.
In class we studied four potential network topologies: fully connected, grid, torus, and hypercube. Another alternative is a ring topology, where processors are essentially lined up in a circle. Suppose we have a ring topology over n processors.
a. What is the diameter of this topology?
b. What is the average distance between any two nodes of this topology?
c. How many direct connections does this topology include?
a. n / 2 (rounded down when n is odd), since for any node the farthest node is halfway around the ring.
b. n / 4: for any node (taking n to be even), there is one node that is 0 away, one that is n / 2 away, and two nodes at each integer distance between 0 and n / 2. The distances therefore total n² / 4, for an average of n / 4.
c. n, since each processor has exactly one direct link to the next processor around the ring.