Criteria |
Comments |
Initial Cost |
This is the cost of the solution produced by
the initialization procedure. |
Final Cost |
This is the cost of the solution after
running the Lloyd's algorithm (up to convergence) using the
initial centers produced by the initialization routine. |
Number of Lloyd's iterations |
This is the number of iterations of the Lloyd's algorithm (up to convergence) using the initial centers produced by the initialization routine. |
CPU time |
This is the time for the initialization
procedure. |
Initialization Method |
Comments |
Random |
Randomly pick k points to be the
initial k centers. This is called MacQueen Method. |
Maximin |
Pick the first center arbitrarily. The ith
center is chosen to be the point that has the greatest
minimum-distance to the previously chosen centers. |
k-means++ seeding |
This refers to the seeding
procedure discussed in the class. Note that in the original paper,
k-means++ refers to the seeding followed by Lloyd's steps.
Here, we are only interested in the seeding procedure. |
Greedy k-means++ |
This algorithm selects
log(k) centers in each of the k steps of the k-means++
seeding procedure and picks the center that reduces the cost
the most. |
Scalable k-means++ |
Please refer this paper for
the algorithm. The algorithm is stated as Algorithm 2 in
this paper. For the re-clustering step 8, use k-means++
seeding procedure. |
Local Search |
This is the local search
procedure that we discussed in the class. You may also use this paper as
reference. |
PCA-Part |
This is a hierarchical clustering procedure.
You may find this algorithm in this paper. |
Var-Part |
Another Hierarchical clustering procedure. You may find this algorithm in this paper. |
Streaming using k-means++ |
Recall the 2-level streaming algorithm that
we discussed in the lecture that uses O(\sqrt{n}) space. You
will have to implement this algorithm using k-means++
initialization at both the levels. |
Dataset |
n |
d |
k |
Breast Cancer Wisconsin (CSV) |
683 |
9 |
2 |
Clean 2 (CSV) |
6598 |
166 |
2 |
Cloud (CSV) |
1024 |
10 |
8 |
Concrete (CSV) |
1030 |
9 |
8 |
Covertype (CSV) |
581012 |
10 |
7 |
Ecoli (CSV) |
336 |
7 |
8 |
Isolet (CSV) |
7797 |
617 |
26 |
Landsat (CSV) |
6435 |
36 |
6 |
Letter Recognition (CSV) |
20000 |
16 |
26 |
MAGIC (CSV) |
19020 |
10 |
2 |
MiniBooNE (CSV) |
130064 |
50 |
2 |
Optical Digits (CSV) |
5620 |
64 |
10 |
Pen Digits (CSV) |
10992 |
16 |
10 |
Person Activity (CSV) |
164860 |
3 |
11 |
Shuttle (CSV) |
58000 |
9 |
7 |
Steel Plates (CSV) |
1941 |
27 |
7 |