# Assignment 10: Parallel clustering

**Due:** 5:00pm, Friday, November 30.
**Value:** 40 pts.

We studied the message-passing approach to concurrent computing, particularly with the actor model, but our examples have been “toy examples” without much sophistication. In this assignment we'll investigate how to adapt a more complex algorithm to message-passing.

## Clustering algorithm

In particular, we'll study an algorithm for *clustering*:
Given a set of points `pts` that
aren't evenly distributed, we want to partition the points into
*clusters* so that each cluster is reasonably compact.
One important application of clustering is
in marketing, where one gets a massive amount of purchase data
with the hope of clustering customers into target groups;
another application is in search engines, where we may want to
cluster documents to pick representatives of different categories.

*Lloyd's algorithm* is an popular process for
computing clusters.
Given the set `P` of points to be clustered
and a number `k` of clusters to find,
Lloyd's algorithm begins by randomly selecting `k`
“centers.”
From there the algorithm iterates the following refinement process:
For each point from `P`, we associate the point with the center to
which it is closest;
then for each center we average the points associated with it,
and we move the center to the average.
We continue iterating through this the refinement process until it barely moves
any of the centers (say, each moves at most 0.0001 away).

The following Scala method implements this algorithm for two-dimensional points.

**def** `cluster`(`pts`: `Array`[(`Double`, `Double`)], `numClusters`: `Int`):

`Array`[(`Double`, `Double`)] = {

*// pick initial centers randomly*

**val** `centers`: `Array`[(`Double`, `Double`)] = **new** `Array`(`numClusters`)

**for** (`i` <- `0` `until numClusters`) {

`centers`(`i`) = `pts`(`i`)

}

**var** `biggestMove` = `1.0`

**while** (`biggestMove` > `0.0001`) {

**val** `centerData`: `Array`[(`Int`, `Double`, `Double`)] = **new** `Array`(`numClusters`)

**for** (`i` <- `0` `until numClusters`) {

`centerData`(`i`) = (`0`, `0.0`, `0.0`)

}

*// Phase A: for each point, determine nearest center; update sums for center*

**for** (`pt` <- `pts`) {

**var** `bestCenter`: `Int` = `-1`

**var** `bestDist`: `Double` = `1``e100`

**for** (`i` <- `0` `until numClusters`) {

**val** `dist` = `distSquared`(`pt`, `centers`(`i`))

**if** (`dist` < `bestDist`) {

`bestCenter` = `i`

` bestDist` = `dist`

}

}

**val** (`num`, `sumx`, `sumy`) = `centerData`(`bestCenter`)

`centerData`(`bestCenter`) = (`num` + `1`, `sumx` + `pt`.`_1`, `sumy` + `pt`.`_2`)

}

*// Phase B: for each center, move to the average of associated points*

`biggestMove` = `0.0`

**for** (`i` <- `0` `until numClusters`) {

**val** (`num`, `sumx`, `sumy`) = `centerData`(`i`)

**if** (`num` > `0`) {

**val** `old` = `centers`(`i`)

`centers`(`i`) = (`sumx` **/** `num`, `sumy` **/** `num`)

**val** `delta` = `distSquared`(`old`, `centers`(`i`))

**if** (`delta` > `biggestMove`) {

`biggestMove` = `delta`

}

}

}

}

**return** `centers`

}

## A distributed version

Translating this algorithm to the actor model begins with reimagining its duties spread among several actors. We'll work with actors of three different types.

The

is a single actor that coordinates the work of the other actors. It sends`ManagerActor`

messages to`GroupClassify`

s to kick off Phase A of each iteration, and it sends`GroupActor`

messages to`CenterUpdate`

s when it's time to initiate Phase B of each iteration.`CenterActor`Each

is allocated a subset of the query points to be clustered, and the actor basically handles the algorithm's Phase A: When it is given a list of the algorithm's current centers (using a`GroupActor`

message), it determines which center is closest to each of its points, and it notifies each center's actor about the points closest to it. Once it has completed, it notifies the`GroupClassify`

with a`ManagerActor`

message. (The manager counts these messages it receives, and once it receives as many as there are`ManagerGroupDone`

s it proceeds to Phase B.)`GroupActor`Each

manages an individual center from the algorithm. It receives`CenterActor`

messages from individual`CenterAddPoint`

s telling it about the points closest to it. Later, it will receive a`GroupActor`

message from the`CenterUpdate`

, which tells it to update the center's position and then send back a`ManagerActor`

message. (Again, the manager counts how many such messages it receives, and once it receives as many as there are`ManagerCenterDone`

s it proceeds back to Phase A — unless the maximum distance moved is below 0.0001.)`CenterActor`

## Your job

Two files are distributed as part of this assignment.

`cluster.scala`contains a framework for the actor-based version of Lloyd's algorithm. This is the file that you should modify for this assignment.`main.scala`contains code for generating randomly clustered points, executing the sequential algorithm given in the Scala code provided above, and invoking the parallel algorithm to be implemented in`cluster.scala`. You should not modify or submit this file.

You can use the command
“`scalac main.scala cluster.scala`

”
to compile your program.
After compiling successfully, you can execute it with the command
“`scala Main`

”.

The distributed program's output first includes the centers expected based on how the points are generated; you can basically ignore these. Then the distributed program executes the sequential code provided above; the output will include the centers as it completes each iteration, followed by the final result. Finally, it calls your parallel version, and it displays the centers computed by that. If your implementation is correct, the final parallel result should match the final sequential result almost exactly. (There may be minor differences in the twelfth decimal place or later due to floating-point arithmetic issues.)