Skip to content

[source]

K Means#

A fast online centroid-based hard clustering algorithm capable of grouping linearly separable data points given some prior knowledge of the target number of clusters (defined by k). K Means is trained using adaptive Mini Batch Gradient Descent and minimizes the inertia cost function at each epoch. Inertia is defined as the average sum of distances between each sample and its nearest cluster centroid.

Interfaces: Estimator, Learner, Online, Probabilistic, Persistable, Verbose

Data Type Compatibility: Continuous

Parameters#

# Name Default Type Description
1 k int The number of target clusters.
2 batch size 128 int The size of each mini batch in samples.
3 epochs 1000 int The maximum number of training rounds to execute.
4 min change 1e-4 float The minimum change in the inertia for training to continue.
5 window 5 int The number of epochs without improvement in the validation score to wait before considering an early stop.
6 kernel Euclidean Distance The distance kernel used to compute the distance between sample points.
7 seeder PlusPlus Seeder The seeder used to initialize the cluster centroids.

Example#

use Rubix\ML\Clusterers\KMeans;
use Rubix\ML\Kernels\Distance\Euclidean;
use Rubix\ML\Clusterers\Seeders\PlusPlus;

$estimator = new KMeans(3, 128, 300, 10.0, 10, new Euclidean(), new PlusPlus());

Additional Methods#

Return the k computed centroids of the training set:

public centroids() : array[]

Return the number of training samples that each centroid is responsible for:

public sizes() : int[]

Return an iterable progress table with the steps from the last training session:

public steps() : iterable

use Rubix\ML\Extractors\CSV;

$extractor = new CSV('progress.csv', true);

$extractor->export($estimator->steps());

Return the loss for each epoch from the last training session:

public losses() : float[]|null

References#


  1. D. Sculley. (2010). Web-Scale K-Means Clustering.