Homework 2 (Programming)

In this collaborative experiment, we will evaluate a few initialization routines that are used to find the initial k centers using which the Lloyd's
algorithm is executed. We will evaluate these initialization routines on the following criteria:

Criteria
Comments
Initial Cost
This is the cost of the solution produced by the initialization procedure.
Final Cost
This is the cost of the solution after running the Lloyd's algorithm (up to convergence) using the initial centers produced by the initialization routine.
Number of Lloyd's iterations
This is the number of iterations of the Lloyd's algorithm (up to convergence) using the initial centers produced by the initialization routine.
CPU time
This is the time for the initialization procedure.

Here are the initialization procedures that you need to evaluate. Please work in groups of size 2. The group working on Random will also
have to evaluate Maximin and similarly the group working on k-means++ seeding will also have to evaluate greedy k-means++.
For randomized procedures, you should run the procedure for 100 times and record mean and variance. For deterministic algorithms,
you just need to run the algorithm once.

Initialization Method
Comments
Random
Randomly pick k points to be the initial k centers. This is called MacQueen Method.
Maximin
Pick the first center arbitrarily. The ith center is chosen to be the point that has the greatest minimum-distance to the previously chosen centers.
k-means++ seeding
This refers to the seeding procedure discussed in the class. Note that in the original paper, k-means++ refers to the seeding followed by Lloyd's steps. Here, we are only interested in the seeding procedure.
Greedy k-means++
This algorithm selects log(k) centers in each of the k steps of the k-means++ seeding procedure and picks the center that reduces the cost the most.
Scalable k-means++
Please refer this paper for the algorithm. The algorithm is stated as Algorithm 2 in this paper. For the re-clustering step 8, use k-means++ seeding procedure.
Local Search
This is the local search procedure that we discussed in the class. You may also use this paper as reference.
PCA-Part
This is a hierarchical clustering procedure. You may find this algorithm in this paper.
Var-Part
Another Hierarchical clustering procedure. You may find this algorithm in this paper.
Streaming using k-means++
Recall the 2-level streaming algorithm that we discussed in the lecture that uses O(\sqrt{n}) space. You will have to implement this algorithm using k-means++ initialization at both the levels.

Here are the datasets for evaluation.

Dataset
n
d
k
Breast Cancer Wisconsin (CSV)
683
9
2
Clean 2 (CSV)
6598
166
2
Cloud (CSV)
1024
10
8
Concrete (CSV)
1030
9
8
Covertype (CSV)
581012
10
7
Ecoli (CSV)
336
7
8
Isolet (CSV)
7797
617
26
Landsat (CSV)
6435
36
6
Letter Recognition (CSV)
20000
16
26
MAGIC (CSV)
19020
10
2
MiniBooNE (CSV)
130064
50
2
Optical Digits (CSV)
5620
64
10
Pen Digits (CSV)
10992
16
10
Person Activity (CSV)
164860
3
11
Shuttle (CSV)
58000
9
7
Steel Plates (CSV)
1941
27
7