Skip to content

Commit 569fd95

Browse files
committedDec 19, 2020
Simplify k-Means clustering algorithm.
1 parent b7cd425 commit 569fd95

File tree

5 files changed

+107
-116
lines changed

5 files changed

+107
-116
lines changed
 

‎README.md

+1-1
Original file line numberDiff line numberDiff line change
@@ -147,7 +147,7 @@ a set of rules that precisely define a sequence of operations.
147147
* **Machine Learning**
148148
* `B` [NanoNeuron](https://github.com/trekhleb/nano-neuron) - 7 simple JS functions that illustrate how machines can actually learn (forward/backward propagation)
149149
* `B` [k-NN](src/algorithms/ml/knn) - k-nearest neighbors classification algorithm
150-
* `B` [k-Means](src/algorithms/ml/kmeans) - k-Means clustering algorithm
150+
* `B` [k-Means](src/algorithms/ml/k-means) - k-Means clustering algorithm
151151
* **Uncategorized**
152152
* `B` [Tower of Hanoi](src/algorithms/uncategorized/hanoi-tower)
153153
* `B` [Square Matrix Rotation](src/algorithms/uncategorized/square-matrix-rotation) - in-place algorithm

‎src/algorithms/ml/kmeans/README.md ‎src/algorithms/ml/k-means/README.md

+7-7
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,10 @@
11
# k-Means Algorithm
22

3-
The **k-Means algorithm** is an unsupervised Machine Learning algorithm. It's a clustering algorithm, which groups the sample data on the basis of similarity between dimentions of vectors.
3+
The **k-Means algorithm** is an unsupervised Machine Learning algorithm. It's a clustering algorithm, which groups the sample data on the basis of similarity between dimensions of vectors.
44

5-
In k-Means classification, the output is a set of classess asssigned to each vector. Each cluster location is continously optimized in order to get the accurate locations of each cluster such that they represent each group clearly.
5+
In k-Means classification, the output is a set of classes assigned to each vector. Each cluster location is continuously optimized in order to get the accurate locations of each cluster such that they represent each group clearly.
66

7-
The idea is to calculate the similarity between cluster location and data vectors, and reassign clusters based on it. [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) is used mostly for this task.
7+
The idea is to calculate the similarity between cluster location and data vectors, and reassign clusters based on it. [Euclidean distance](https://github.com/trekhleb/javascript-algorithms/tree/master/src/algorithms/math/euclidean-distance) is used mostly for this task.
88

99
![Euclidean distance between two points](https://upload.wikimedia.org/wikipedia/commons/5/55/Euclidean_distance_2d.svg)
1010

@@ -13,19 +13,19 @@ _Image source: [Wikipedia](https://en.wikipedia.org/wiki/Euclidean_distance)_
1313
The algorithm is as follows:
1414

1515
1. Check for errors like invalid/inconsistent data
16-
2. Initialize the k cluster locations with initial/random k points
16+
2. Initialize the `k` cluster locations with initial/random `k` points
1717
3. Calculate the distance of each data point from each cluster
18-
4. Assign the cluster label of each data point equal to that of the cluster at it's minimum distance
18+
4. Assign the cluster label of each data point equal to that of the cluster at its minimum distance
1919
5. Calculate the centroid of each cluster based on the data points it contains
2020
6. Repeat each of the above steps until the centroid locations are varying
2121

2222
Here is a visualization of k-Means clustering for better understanding:
2323

2424
![KNN Visualization 1](https://upload.wikimedia.org/wikipedia/commons/e/ea/K-means_convergence.gif)
2525

26-
_Image source: [Wikipedia](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)_
26+
_Image source: [Wikipedia](https://en.wikipedia.org/wiki/K-means_clustering)_
2727

28-
The centroids are moving continously in order to create better distinction between the different set of data points. As we can see, after a few iterations, the difference in centroids is quite low between iterations. For example between itrations `13` and `14` the difference is quite small because there the optimizer is tuning boundary cases.
28+
The centroids are moving continuously in order to create better distinction between the different set of data points. As we can see, after a few iterations, the difference in centroids is quite low between iterations. For example between iterations `13` and `14` the difference is quite small because there the optimizer is tuning boundary cases.
2929

3030
## References
3131

Original file line numberDiff line numberDiff line change
@@ -1,36 +1,40 @@
1-
import kMeans from '../kmeans';
1+
import KMeans from '../kMeans';
22

33
describe('kMeans', () => {
44
it('should throw an error on invalid data', () => {
55
expect(() => {
6-
kMeans();
7-
}).toThrowError('Either dataSet or labels or toClassify were not set');
6+
KMeans();
7+
}).toThrowError('The data is empty');
88
});
99

1010
it('should throw an error on inconsistent data', () => {
1111
expect(() => {
12-
kMeans([[1, 2], [1]], 2);
13-
}).toThrowError('Inconsistent vector lengths');
12+
KMeans([[1, 2], [1]], 2);
13+
}).toThrowError('Matrices have different shapes');
1414
});
1515

1616
it('should find the nearest neighbour', () => {
17-
const dataSet = [[1, 1], [6, 2], [3, 3], [4, 5], [9, 2], [2, 4], [8, 7]];
17+
const data = [[1, 1], [6, 2], [3, 3], [4, 5], [9, 2], [2, 4], [8, 7]];
1818
const k = 2;
19-
const expectedCluster = [0, 1, 0, 1, 1, 0, 1];
20-
expect(kMeans(dataSet, k)).toEqual(expectedCluster);
19+
const expectedClusters = [0, 1, 0, 1, 1, 0, 1];
20+
expect(KMeans(data, k)).toEqual(expectedClusters);
21+
22+
expect(KMeans([[0, 0], [0, 1], [10, 10]], 2)).toEqual(
23+
[0, 0, 1],
24+
);
2125
});
2226

2327
it('should find the clusters with equal distances', () => {
2428
const dataSet = [[0, 0], [1, 1], [2, 2]];
2529
const k = 3;
2630
const expectedCluster = [0, 1, 2];
27-
expect(kMeans(dataSet, k)).toEqual(expectedCluster);
31+
expect(KMeans(dataSet, k)).toEqual(expectedCluster);
2832
});
2933

3034
it('should find the nearest neighbour in 3D space', () => {
3135
const dataSet = [[0, 0, 0], [0, 1, 0], [2, 0, 2]];
3236
const k = 2;
3337
const expectedCluster = [1, 1, 0];
34-
expect(kMeans(dataSet, k)).toEqual(expectedCluster);
38+
expect(KMeans(dataSet, k)).toEqual(expectedCluster);
3539
});
3640
});

‎src/algorithms/ml/k-means/kMeans.js

+85
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,85 @@
1+
import * as mtrx from '../../math/matrix/Matrix';
2+
import euclideanDistance from '../../math/euclidean-distance/euclideanDistance';
3+
4+
/**
5+
* Classifies the point in space based on k-Means algorithm.
6+
*
7+
* @param {number[][]} data - array of dataSet points, i.e. [[0, 1], [3, 4], [5, 7]]
8+
* @param {number} k - number of clusters
9+
* @return {number[]} - the class of the point
10+
*/
11+
export default function KMeans(
12+
data,
13+
k = 1,
14+
) {
15+
if (!data) {
16+
throw new Error('The data is empty');
17+
}
18+
19+
// Assign k clusters locations equal to the location of initial k points.
20+
const dataDim = data[0].length;
21+
const clusterCenters = data.slice(0, k);
22+
23+
// Continue optimization till convergence.
24+
// Centroids should not be moving once optimized.
25+
// Calculate distance of each candidate vector from each cluster center.
26+
// Assign cluster number to each data vector according to minimum distance.
27+
28+
// Matrix of distance from each data point to each cluster centroid.
29+
const distances = mtrx.zeros([data.length, k]);
30+
31+
// Vector data points' classes. The value of -1 means that no class has bee assigned yet.
32+
const classes = Array(data.length).fill(-1);
33+
34+
let iterate = true;
35+
while (iterate) {
36+
iterate = false;
37+
38+
// Calculate and store the distance of each data point from each cluster.
39+
for (let dataIndex = 0; dataIndex < data.length; dataIndex += 1) {
40+
for (let clusterIndex = 0; clusterIndex < k; clusterIndex += 1) {
41+
distances[dataIndex][clusterIndex] = euclideanDistance(
42+
[clusterCenters[clusterIndex]],
43+
[data[dataIndex]],
44+
);
45+
}
46+
// Assign the closest cluster number to each dataSet point.
47+
const closestClusterIdx = distances[dataIndex].indexOf(
48+
Math.min(...distances[dataIndex]),
49+
);
50+
51+
// Check if data point class has been changed and we still need to re-iterate.
52+
if (classes[dataIndex] !== closestClusterIdx) {
53+
iterate = true;
54+
}
55+
56+
classes[dataIndex] = closestClusterIdx;
57+
}
58+
59+
// Recalculate cluster centroid values via all dimensions of the points under it.
60+
for (let clusterIndex = 0; clusterIndex < k; clusterIndex += 1) {
61+
// Reset cluster center coordinates since we need to recalculate them.
62+
clusterCenters[clusterIndex] = Array(dataDim).fill(0);
63+
let clusterSize = 0;
64+
for (let dataIndex = 0; dataIndex < data.length; dataIndex += 1) {
65+
if (classes[dataIndex] === clusterIndex) {
66+
// Register one more data point of current cluster.
67+
clusterSize += 1;
68+
for (let dimensionIndex = 0; dimensionIndex < dataDim; dimensionIndex += 1) {
69+
// Add data point coordinates to the cluster center coordinates.
70+
clusterCenters[clusterIndex][dimensionIndex] += data[dataIndex][dimensionIndex];
71+
}
72+
}
73+
}
74+
// Calculate the average for each cluster center coordinate.
75+
for (let dimensionIndex = 0; dimensionIndex < dataDim; dimensionIndex += 1) {
76+
clusterCenters[clusterIndex][dimensionIndex] = parseFloat(Number(
77+
clusterCenters[clusterIndex][dimensionIndex] / clusterSize,
78+
).toFixed(2));
79+
}
80+
}
81+
}
82+
83+
// Return the clusters assigned.
84+
return classes;
85+
}

‎src/algorithms/ml/kmeans/kmeans.js

-98
This file was deleted.

0 commit comments

Comments
 (0)
Please sign in to comment.