Assignment 10: Parallel clustering
Due: 5:00pm, Friday, November 30. Value: 40 pts.
We studied the message-passing approach to concurrent computing, particularly with the actor model, but our examples have been “toy examples” without much sophistication. In this assignment we'll investigate how to adapt a more complex algorithm to message-passing.
Clustering algorithm
In particular, we'll study an algorithm for clustering: Given a set of points pts that aren't evenly distributed, we want to partition the points into clusters so that each cluster is reasonably compact. One important application of clustering is in marketing, where one gets a massive amount of purchase data with the hope of clustering customers into target groups; another application is in search engines, where we may want to cluster documents to pick representatives of different categories.
Lloyd's algorithm is an popular process for computing clusters. Given the set P of points to be clustered and a number k of clusters to find, Lloyd's algorithm begins by randomly selecting k “centers.” From there the algorithm iterates the following refinement process: For each point from P, we associate the point with the center to which it is closest; then for each center we average the points associated with it, and we move the center to the average. We continue iterating through this the refinement process until it barely moves any of the centers (say, each moves at most 0.0001 away).
The following Scala method implements this algorithm for two-dimensional points.
def cluster(pts: Array[(Double, Double)], numClusters: Int):
Array[(Double, Double)] = {
// pick initial centers randomly
val centers: Array[(Double, Double)] = new Array(numClusters)
for (i <- 0 until numClusters) {
centers(i) = pts(i)
}
var biggestMove = 1.0
while (biggestMove > 0.0001) {
val centerData: Array[(Int, Double, Double)] = new Array(numClusters)
for (i <- 0 until numClusters) {
centerData(i) = (0, 0.0, 0.0)
}
// Phase A: for each point, determine nearest center; update sums for center
for (pt <- pts) {
var bestCenter: Int = -1
var bestDist: Double = 1e100
for (i <- 0 until numClusters) {
val dist = distSquared(pt, centers(i))
if (dist < bestDist) {
bestCenter = i
bestDist = dist
}
}
val (num, sumx, sumy) = centerData(bestCenter)
centerData(bestCenter) = (num + 1, sumx + pt._1, sumy + pt._2)
}
// Phase B: for each center, move to the average of associated points
biggestMove = 0.0
for (i <- 0 until numClusters) {
val (num, sumx, sumy) = centerData(i)
if (num > 0) {
val old = centers(i)
centers(i) = (sumx / num, sumy / num)
val delta = distSquared(old, centers(i))
if (delta > biggestMove) {
biggestMove = delta
}
}
}
}
return centers
}
A distributed version
Translating this algorithm to the actor model begins with reimagining its duties spread among several actors. We'll work with actors of three different types.
The
ManagerActor
is a single actor that coordinates the work of the other actors. It sendsGroupClassify
messages toGroupActor
s to kick off Phase A of each iteration, and it sendsCenterUpdate
messages toCenterActor
s when it's time to initiate Phase B of each iteration.Each
GroupActor
is allocated a subset of the query points to be clustered, and the actor basically handles the algorithm's Phase A: When it is given a list of the algorithm's current centers (using aGroupClassify
message), it determines which center is closest to each of its points, and it notifies each center's actor about the points closest to it. Once it has completed, it notifies theManagerActor
with aManagerGroupDone
message. (The manager counts these messages it receives, and once it receives as many as there areGroupActor
s it proceeds to Phase B.)Each
CenterActor
manages an individual center from the algorithm. It receivesCenterAddPoint
messages from individualGroupActor
s telling it about the points closest to it. Later, it will receive aCenterUpdate
message from theManagerActor
, which tells it to update the center's position and then send back aManagerCenterDone
message. (Again, the manager counts how many such messages it receives, and once it receives as many as there areCenterActor
s it proceeds back to Phase A — unless the maximum distance moved is below 0.0001.)
Your job
Two files are distributed as part of this assignment.
cluster.scala contains a framework for the actor-based version of Lloyd's algorithm. This is the file that you should modify for this assignment.
main.scala contains code for generating randomly clustered points, executing the sequential algorithm given in the Scala code provided above, and invoking the parallel algorithm to be implemented in cluster.scala. You should not modify or submit this file.
You can use the command
“scalac main.scala cluster.scala
”
to compile your program.
After compiling successfully, you can execute it with the command
“scala Main
”.
The distributed program's output first includes the centers expected based on how the points are generated; you can basically ignore these. Then the distributed program executes the sequential code provided above; the output will include the centers as it completes each iteration, followed by the final result. Finally, it calls your parallel version, and it displays the centers computed by that. If your implementation is correct, the final parallel result should match the final sequential result almost exactly. (There may be minor differences in the twelfth decimal place or later due to floating-point arithmetic issues.)