Choosing an Estimator#

Estimators make up the core of the Rubix ML library and include classifiers, regressors, clusterers, anomaly detectors, and meta-estimators organized into their own namespaces. They are responsible for making predictions and are usually trained with data. Most estimators allow tuning by adjusting their user-defined hyper-parameters. Hyper-parameters are arguments to the learning algorithm that effect its behavior during training and inference. The values for the hyper-parameters can be chosen by intuition, tuning, or completely at random. The defaults provided by the library are a good place to start for most problems. To instantiate a new estimator, pass the desired values of the hyper-parameters to the estimator's constructor like in the example below.

use Rubix\ML\Classifiers\KNearestNeighbors;
use Rubix\ML\Kernels\Distance\Minkowski;

$estimator = new KNearestNeighbors(10, false, new Minkowski(2.5));

Classifiers#

Classifiers are supervised learners that predict a categorical class label. They can be used to recognize (cat, dog, turtle), differentiate (spam, not spam), or describe (running, walking) the samples in a dataset based on the labels they were trained on. In addition, classifiers that implement the Probabilistic interface can infer the joint probability distribution of each possible class given an unclassified sample.

Name	Flexibility	Proba	Online	Ranks Features	Verbose	Data Compatibility
AdaBoost	High	●			●	Depends on base learner
Classification Tree	Medium	●		●		Categorical, Continuous
Extra Tree Classifier	Medium	●		●		Categorical, Continuous
Gaussian Naive Bayes	Medium	●	●			Continuous
K-d Neighbors	Medium	●				Depends on distance kernel
K Nearest Neighbors	Medium	●	●			Depends on distance kernel
Logistic Regression	Low	●	●	●	●	Continuous
Logit Boost	High	●		●	●	Categorical, Continuous
Multilayer Perceptron	High	●	●		●	Continuous
Naive Bayes	Medium	●	●			Categorical
Radius Neighbors	Medium	●				Depends on distance kernel
Random Forest	High	●		●		Categorical, Continuous
Softmax Classifier	Low	●	●		●	Continuous
SVC	High					Continuous

Regressors#

Regressors are a type of supervised learner that predict a continuous-valued outcome such as 1.275 or 655. They can be used to quantify a sample such as its credit score, age, or steering wheel position in units of degrees. Unlike classifiers whose range of predictions is bounded by the number of possible classes in the training set, a regressor's range is unbounded - meaning, the number of possible values a regressor could predict is infinite.

Name	Flexibility	Online	Ranks Features	Verbose	Persistable	Data Compatibility
Adaline	Low	●	●	●	●	Continuous
Extra Tree Regressor	Medium		●		●	Categorical, Continuous
Gradient Boost	High		●	●	●	Categorical, Continuous
K-d Neighbors Regressor	Medium				●	Depends on distance kernel
KNN Regressor	Medium	●			●	Depends on distance kernel
MLP Regressor	High	●		●	●	Continuous
Radius Neighbors Regressor	Medium				●	Depends on distance kernerl
Regression Tree	Medium		●		●	Categorical, Continuous
Ridge	Low		●		●	Continuous
SVR	High					Continuous

Clusterers#

Clusterers are unsupervised learners that predict an integer-valued cluster number such as 0, 1, ..., n. They are similar to classifiers, however since they lack a supervised training signal, they cannot be used to recognize or describe samples. Instead, clusterers differentiate and group samples using only the information found within the structure of the samples without their labels.

Name	Flexibility	Proba	Online	Verbose	Persistable	Data Compatibility
DBSCAN	High					Depends on distance kernel
Fuzzy C Means	Low	●		●	●	Continuous
Gaussian Mixture	Medium	●		●	●	Continuous
K Means	Low	●	●	●	●	Continuous
Mean Shift	Medium	●		●	●	Continuous

Anomaly Detectors#

Anomaly Detectors are unsupervised learners that predict whether a sample should be classified as an anomaly or not. We use the value 1 to indicate an outlier and 0 for a regular sample and the predictions can be cast to their boolean equivalent if needed. Anomaly detectors that implement the Scoring interface can output an anomaly score that can be used to sort the samples by their degree of anomalousness.

Name	Scope	Scoring	Online	Persistable	Data Compatibility
Gaussian MLE	Global	●	●	●	Continuous
Isolation Forest	Local	●		●	Categorical, Continuous
Local Outlier Factor	Local	●		●	Depends on distance kernel
Loda	Local	●	●	●	Continuous
One Class SVM	Global			●	Continuous
Robust Z-Score	Global	●		●	Continuous

Model Flexibility Tradeoff#

A characteristic of most estimator types is the notion of flexibility. Flexibility can be expressed in different ways but greater flexibility usually comes with the capacity to handle more complex tasks. The tradeoff for flexibility is increased computational complexity, reduced model interpretability, and greater susceptibility to overfitting. In contrast, low flexibility models tend to be easier to interpret and quicker to train but are more prone to underfitting. In general, we recommend choosing the simplest model that does not underfit the training data for your project.

No Free Lunch Theorem#

At some point you may ask yourself "Why do we need so many different learning algorithms?" The answer to that question can be understood by the No Free Lunch Theorem which states that, when averaged over the space of all possible problems, no algorithm performs any better than the next. Perhaps a more useful way of stating NFL is that certain learners perform better at certain tasks and worse in others. This is explained by the fact that all learning algorithms have some prior knowledge inherent in them whether it be via the choice of hyper-parameters or the design of the algorithm itself. Another consequence of No Free Lunch is that there exists no single estimator that performs better for all problems.

Last update: 2021-07-05