Assignment 10: Parallel clustering

Due: 5:00pm, Friday, November 30. Value: 40 pts.

We studied the message-passing approach to concurrent computing, particularly with the actor model, but our examples have been “toy examples” without much sophistication. In this assignment we'll investigate how to adapt a more complex algorithm to message-passing.

Clustering algorithm

In particular, we'll study an algorithm for clustering: Given a set of points pts that aren't evenly distributed, we want to partition the points into clusters so that each cluster is reasonably compact. One important application of clustering is in marketing, where one gets a massive amount of purchase data with the hope of clustering customers into target groups; another application is in search engines, where we may want to cluster documents to pick representatives of different categories.

Lloyd's algorithm is an popular process for computing clusters. Given the set P of points to be clustered and a number k of clusters to find, Lloyd's algorithm begins by randomly selecting k “centers.” From there the algorithm iterates the following refinement process: For each point from P, we associate the point with the center to which it is closest; then for each center we average the points associated with it, and we move the center to the average. We continue iterating through this the refinement process until it barely moves any of the centers (say, each moves at most 0.0001 away).

The following Scala method implements this algorithm for two-dimensional points.

def cluster(pts: Array[(Double, Double)], numClusters: Int):

        Array[(Double, Double)] = {

    // pick initial centers randomly

    val centers: Array[(Double, Double)] = new Array(numClusters)

    for (i <- 0 until numClusters) {

        centers(i) = pts(i)

    }



    var biggestMove = 1.0

    while (biggestMove > 0.0001) {

        val centerData: Array[(Int, Double, Double)] = new Array(numClusters)

        for (i <- 0 until numClusters) {

            centerData(i) = (0, 0.0, 0.0)

        }



        // Phase A: for each point, determine nearest center; update sums for center

        for (pt <- pts) {

            var bestCenter: Int = -1

            var bestDist: Double = 1e100

            for (i <- 0 until numClusters) {

                val dist = distSquared(pt, centers(i))

                if (dist < bestDist) {

                    bestCenter = i

                    bestDist = dist

                }

            }

            val (num, sumx, sumy) = centerData(bestCenter)

            centerData(bestCenter) = (num + 1, sumx + pt._1, sumy + pt._2)

        }



        // Phase B: for each center, move to the average of associated points

        biggestMove = 0.0

        for (i <- 0 until numClusters) {

            val (num, sumx, sumy) = centerData(i)

            if (num > 0) {

                val old = centers(i)

                centers(i) = (sumx / num, sumy / num)

                val delta = distSquared(old, centers(i))



                if (delta > biggestMove) {

                    biggestMove = delta

                }

            }

        }

    }

    return centers

}

A distributed version

Translating this algorithm to the actor model begins with reimagining its duties spread among several actors. We'll work with actors of three different types.

The ManagerActor is a single actor that coordinates the work of the other actors. It sends GroupClassify messages to GroupActors to kick off Phase A of each iteration, and it sends CenterUpdate messages to CenterActors when it's time to initiate Phase B of each iteration.
Each GroupActor is allocated a subset of the query points to be clustered, and the actor basically handles the algorithm's Phase A: When it is given a list of the algorithm's current centers (using a GroupClassify message), it determines which center is closest to each of its points, and it notifies each center's actor about the points closest to it. Once it has completed, it notifies the ManagerActor with a ManagerGroupDone message. (The manager counts these messages it receives, and once it receives as many as there are GroupActors it proceeds to Phase B.)
Each CenterActor manages an individual center from the algorithm. It receives CenterAddPoint messages from individual GroupActors telling it about the points closest to it. Later, it will receive a CenterUpdate message from the ManagerActor, which tells it to update the center's position and then send back a ManagerCenterDone message. (Again, the manager counts how many such messages it receives, and once it receives as many as there are CenterActors it proceeds back to Phase A — unless the maximum distance moved is below 0.0001.)

Your job

Two files are distributed as part of this assignment.

cluster.scala contains a framework for the actor-based version of Lloyd's algorithm. This is the file that you should modify for this assignment.
main.scala contains code for generating randomly clustered points, executing the sequential algorithm given in the Scala code provided above, and invoking the parallel algorithm to be implemented in cluster.scala. You should not modify or submit this file.

You can use the command “scalac main.scala cluster.scala” to compile your program. After compiling successfully, you can execute it with the command “scala Main”.

The distributed program's output first includes the centers expected based on how the points are generated; you can basically ignore these. Then the distributed program executes the sequential code provided above; the output will include the centers as it completes each iteration, followed by the final result. Finally, it calls your parallel version, and it displays the centers computed by that. If your implementation is correct, the final parallel result should match the final sequential result almost exactly. (There may be minor differences in the twelfth decimal place or later due to floating-point arithmetic issues.)