K Means
A fast, online, centroid-based hard clustering algorithm capable of grouping linearly separable data points, given prior knowledge of the target number of clusters (defined by k). K Means is trained using adaptive Mini Batch Gradient Descent and minimizes the inertia cost function at each epoch. Inertia is defined as the average distance between each sample and its nearest cluster centroid.
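To make the cost function concrete, the following is a minimal plain-PHP sketch of inertia as described above, together with one mini batch update in the style of Sculley (2010), where each centroid moves toward its assigned samples with a per-centroid step size of 1 / count. The function names `euclidean`, `inertia`, and `miniBatchStep` are hypothetical helpers for illustration, and the update shown is not necessarily the library's exact rule.

```php
<?php

declare(strict_types=1);

/** Euclidean distance between two equal-length numeric vectors. */
function euclidean(array $a, array $b) : float
{
    $sum = 0.0;

    foreach ($a as $i => $value) {
        $sum += ($value - $b[$i]) ** 2;
    }

    return sqrt($sum);
}

/** Index of the centroid closest to the sample. */
function nearest(array $sample, array $centroids) : int
{
    $best = 0;
    $bestDistance = INF;

    foreach ($centroids as $j => $centroid) {
        $distance = euclidean($sample, $centroid);

        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $best = $j;
        }
    }

    return $best;
}

/** Mean distance from each sample to its nearest centroid. */
function inertia(array $samples, array $centroids) : float
{
    $total = 0.0;

    foreach ($samples as $sample) {
        $total += euclidean($sample, $centroids[nearest($sample, $centroids)]);
    }

    return $total / count($samples);
}

/**
 * One mini batch step (illustrative): assign the whole batch first, then
 * move each centroid toward its samples with step size 1 / count.
 */
function miniBatchStep(array $batch, array $centroids, array &$counts) : array
{
    $assignments = [];

    foreach ($batch as $i => $sample) {
        $assignments[$i] = nearest($sample, $centroids);
    }

    foreach ($batch as $i => $sample) {
        $j = $assignments[$i];

        $counts[$j]++;

        $rate = 1.0 / $counts[$j];

        foreach ($centroids[$j] as $d => $value) {
            $centroids[$j][$d] = (1.0 - $rate) * $value + $rate * $sample[$d];
        }
    }

    return $centroids;
}

// Hypothetical toy data: three samples and two centroids.
$samples = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]];
$centroids = [[0.0, 1.0], [10.0, 0.0]];

echo inertia($samples, $centroids), PHP_EOL;
```

Because the step size shrinks as a centroid accumulates samples, early batches move the centroids a long way and later batches only refine them.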
Interfaces: Estimator, Learner, Online, Probabilistic, Persistable, Verbose
Data Type Compatibility: Continuous
Parameters
# | Name | Default | Type | Description |
---|---|---|---|---|
1 | k | | int | The number of target clusters. |
2 | batch size | 128 | int | The size of each mini batch in samples. |
3 | epochs | 1000 | int | The maximum number of training rounds to execute. |
4 | min change | 1e-4 | float | The minimum change in the inertia for training to continue. |
5 | window | 5 | int | The number of epochs without improvement in the inertia to wait before considering an early stop. |
6 | kernel | Euclidean | Distance | The distance kernel used to compute the distance between sample points. |
7 | seeder | PlusPlus | Seeder | The seeder used to initialize the cluster centroids. |
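The `epochs`, `min change`, and `window` parameters together decide when training ends. The sketch below shows one plausible way these rules could interact, given a hypothetical sequence of per-epoch inertia values. The function `epochsRun` and the order in which the rules are checked are assumptions for illustration, not the library's source.

```php
<?php

declare(strict_types=1);

/**
 * Return how many epochs would run for a given loss sequence under the
 * assumed stopping rules. The length of $losses stands in for `epochs`.
 *
 * @param float[] $losses hypothetical inertia values, one per epoch
 */
function epochsRun(array $losses, float $minChange = 1e-4, int $window = 5) : int
{
    $prevLoss = INF;
    $bestLoss = INF;
    $stale = 0;

    foreach (array_values($losses) as $i => $loss) {
        $epoch = $i + 1;

        // Rule 1: the inertia changed by less than `min change`.
        if (abs($prevLoss - $loss) < $minChange) {
            return $epoch;
        }

        // Rule 2: no new best inertia for `window` consecutive epochs.
        if ($loss < $bestLoss) {
            $bestLoss = $loss;
            $stale = 0;
        } else {
            $stale++;
        }

        if ($stale >= $window) {
            return $epoch;
        }

        $prevLoss = $loss;
    }

    // Rule 3: the maximum number of epochs was reached.
    return count($losses);
}

// The third epoch changes the inertia by less than 1e-4, so training stops there.
echo epochsRun([5.0, 4.0, 3.99995, 3.0]), PHP_EOL;
```

A larger `window` tolerates noisier mini batch losses at the cost of more epochs, while a larger `min change` stops sooner once the inertia plateaus.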
Example
use Rubix\ML\Clusterers\KMeans;
use Rubix\ML\Kernels\Distance\Euclidean;
use Rubix\ML\Clusterers\Seeders\PlusPlus;
$estimator = new KMeans(3, 128, 300, 10.0, 10, new Euclidean(), new PlusPlus());
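Once constructed, the estimator can be used through the interfaces listed at the top of this page. The following is a minimal sketch, assuming Rubix ML is installed via Composer and using a small hypothetical dataset. `train()`, `predict()`, `proba()`, and `partial()` come from the Learner, Estimator, Probabilistic, and Online interfaces respectively.

```php
<?php

// Assumes the Composer autoloader is available at this path.
require 'vendor/autoload.php';

use Rubix\ML\Clusterers\KMeans;
use Rubix\ML\Datasets\Unlabeled;

// Hypothetical toy data: two well-separated groups of 2-D points.
$samples = [
    [1.0, 1.2], [0.8, 1.1], [1.1, 0.9],
    [8.0, 8.3], [7.9, 8.1], [8.2, 7.8],
];

$dataset = new Unlabeled($samples);

// Only k is required; the remaining parameters fall back to their defaults.
$estimator = new KMeans(2);

$estimator->train($dataset);

// One cluster number per sample.
$predictions = $estimator->predict($dataset);

// One array of cluster probabilities per sample.
$probabilities = $estimator->proba($dataset);

// Online learning: update the centroids with new samples without retraining.
$estimator->partial(new Unlabeled([[1.0, 1.0], [8.1, 8.0]]));
```

Cluster numbers are arbitrary labels, so the same group of points may receive a different number from one training run to the next.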
Additional Methods
Return the k computed centroids of the training set:
public centroids() : array[]
Return the number of training samples that each centroid is responsible for:
public sizes() : int[]
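As an illustration, the two methods above can be combined into a short cluster summary. This is a sketch only: `describeClusters` is a hypothetical helper, and it assumes that the estimator has already been trained and that both arrays are indexed by cluster number.

```php
<?php

use Rubix\ML\Clusterers\KMeans;

/**
 * Build one human-readable line per cluster from a trained estimator.
 *
 * @return string[]
 */
function describeClusters(KMeans $estimator) : array
{
    $sizes = $estimator->sizes();

    $lines = [];

    foreach ($estimator->centroids() as $cluster => $centroid) {
        $lines[] = sprintf(
            'Cluster %d (%d samples): [%s]',
            $cluster,
            $sizes[$cluster] ?? 0, // Assumed to share keys with centroids()
            implode(', ', $centroid)
        );
    }

    return $lines;
}

// Usage, given a trained $estimator:
// echo implode(PHP_EOL, describeClusters($estimator)), PHP_EOL;
```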
Return an iterable progress table with the steps from the last training session:
public steps() : iterable
use Rubix\ML\Extractors\CSV;
$extractor = new CSV('progress.csv', true);
$extractor->export($estimator->steps());
Return the loss for each epoch from the last training session:
public losses() : float[]|null
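Because `losses()` may return null, callers should check the result before using it. A minimal sketch, where `summarizeLosses` is a hypothetical helper that accepts the value returned by `losses()`:

```php
<?php

declare(strict_types=1);

/**
 * Summarize a training history, handling the case where none exists.
 *
 * @param float[]|null $losses the value returned by losses()
 */
function summarizeLosses(?array $losses) : string
{
    if ($losses === null || $losses === []) {
        return 'No training history available.';
    }

    $first = reset($losses);
    $last = end($losses);

    return sprintf('%d epochs, inertia %.4f -> %.4f', count($losses), $first, $last);
}

// Usage, given an estimator instance:
// echo summarizeLosses($estimator->losses()), PHP_EOL;
```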
References
- D. Sculley. (2010). Web-Scale K-Means Clustering.