Homework 2 (Programming)

In this collaborative experiment, we will evaluate a few initialization routines that are used to find the initial k centers using which the Lloyd's
algorithm is executed. We will evaluate these initialization routines on the following criteria:

Criteria	Comments
Initial Cost	This is the cost of the solution produced by the initialization procedure.
Final Cost	This is the cost of the solution after running the Lloyd's algorithm (up to convergence) using the initial centers produced by the initialization routine.
Number of Lloyd's iterations	This is the number of iterations of the Lloyd's algorithm (up to convergence) using the initial centers produced by the initialization routine.
CPU time	This is the time for the initialization procedure.

Here are the initialization procedures that you need to evaluate. Please work in groups of size 2. The group working on Random will also
have to evaluate Maximin and similarly the group working on k-means++ seeding will also have to evaluate greedy k-means++.
For randomized procedures, you should run the procedure for 100 times and record mean and variance. For deterministic algorithms,
you just need to run the algorithm once.

Initialization Method	Comments
Random	Randomly pick k points to be the initial k centers. This is called MacQueen Method.
Maximin	Pick the first center arbitrarily. The ith center is chosen to be the point that has the greatest minimum-distance to the previously chosen centers.
k-means++ seeding	This refers to the seeding procedure discussed in the class. Note that in the original paper, k-means++ refers to the seeding followed by Lloyd's steps. Here, we are only interested in the seeding procedure.
Greedy k-means++	This algorithm selects log(k) centers in each of the k steps of the k-means++ seeding procedure and picks the center that reduces the cost the most.
Scalable k-means++	Please refer this paper for the algorithm. The algorithm is stated as Algorithm 2 in this paper. For the re-clustering step 8, use k-means++ seeding procedure.
Local Search	This is the local search procedure that we discussed in the class. You may also use this paper as reference.
PCA-Part	This is a hierarchical clustering procedure. You may find this algorithm in this paper.
Var-Part	Another Hierarchical clustering procedure. You may find this algorithm in this paper.
Streaming using k-means++	Recall the 2-level streaming algorithm that we discussed in the lecture that uses O(\sqrt{n}) space. You will have to implement this algorithm using k-means++ initialization at both the levels.

Here are the datasets for evaluation.

Dataset	n	d	k
Breast Cancer Wisconsin (CSV)	683	9	2
Clean 2 (CSV)	6598	166	2
Cloud (CSV)	1024	10	8
Concrete (CSV)	1030	9	8
Covertype (CSV)	581012	10	7
Ecoli (CSV)	336	7	8
Isolet (CSV)	7797	617	26
Landsat (CSV)	6435	36	6
Letter Recognition (CSV)	20000	16	26
MAGIC (CSV)	19020	10	2
MiniBooNE (CSV)	130064	50	2
Optical Digits (CSV)	5620	64	10
Pen Digits (CSV)	10992	16	10
Person Activity (CSV)	164860	3	11
Shuttle (CSV)	58000	9	7
Steel Plates (CSV)	1941	27	7