# Choosing an Estimator
Estimators make up the core of the Rubix ML library and include classifiers, regressors, clusterers, anomaly detectors, and meta-estimators organized into their own namespaces. They are responsible for making predictions and are usually trained with data. Some meta-estimators such as Pipeline and Grid Search are polymorphic, i.e. they bear the type of the base estimator they wrap. Most estimators allow tuning by adjusting their hyper-parameters. To instantiate a new estimator, pass the desired values of the hyper-parameters to the estimator's constructor as in the example below.
```php
use Rubix\ML\Classifiers\KNearestNeighbors;
use Rubix\ML\Kernels\Distance\Minkowski;

// K Nearest Neighbors with k = 10, unweighted voting, and the Minkowski distance kernel.
$estimator = new KNearestNeighbors(10, false, new Minkowski(2.0));
```
It is important to note that not all estimators are created equal, and choosing the right estimator for your project is crucial to achieving the best results. In the following sections, we'll break down the estimators available to you in Rubix ML and point out some of their advantages and disadvantages.
## Classifiers

Classifiers can often be graded on their ability to form decision boundaries between areas that define the classes. Simple linear classifiers such as Logistic Regression can only handle classes that are linearly separable. On the other hand, highly flexible models such as the Multilayer Perceptron can theoretically handle any decision boundary. The tradeoff for increased flexibility is reduced interpretability, increased computational complexity, and greater susceptibility to overfitting.
| Classifier | Flexibility | Proba | Online | Advantages | Disadvantages |
|---|---|---|---|---|---|
| AdaBoost | High | ● | | Boosts most classifiers, Learns influences and sample weights | Sensitive to noise, Susceptible to overfitting |
| Classification Tree | Moderate | ● | | Interpretable model, Automatic feature selection | High variance, Susceptible to overfitting |
| Extra Tree Classifier | Moderate | ● | | Faster training, Lower variance | Similar to Classification Tree |
| Gaussian Naive Bayes | Moderate | ● | ● | Requires little data, Highly scalable | Strong Gaussian and feature independence assumption, Sensitive to noise |
| K-d Neighbors | Moderate | ● | | Faster inference | Not compatible with certain distance kernels |
| K Nearest Neighbors | Moderate | ● | ● | Intuitable model, Zero-cost training | Slower inference, Suffers from the curse of dimensionality |
| Logistic Regression | Low | ● | ● | Interpretable model, Highly scalable | Prone to underfitting, Only handles 2 classes |
| Multilayer Perceptron | High | ● | ● | Handles very high dimensional data, Universal function approximator | High computation and memory cost, Black box |
| Naive Bayes | Moderate | ● | ● | Requires little data, Highly scalable | Strong feature independence assumption |
| Radius Neighbors | Moderate | ● | | Robust to outliers, Quasi-anomaly detector | Not guaranteed to return a prediction |
| Random Forest | High | ● | | Stable, Computes reliable feature importances | High computation and memory cost |
| Softmax Classifier | Low | ● | ● | Highly scalable | Prone to underfitting |
| SVC | High | | | Handles high dimensional data | Difficult to tune, Not suitable for large datasets |
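To make the flexibility tradeoff concrete, here's a minimal sketch that contrasts a simple linear classifier with a more flexible neural network. The hyper-parameter values and network layout below are illustrative only and would need tuning for a real dataset.

```php
use Rubix\ML\Classifiers\LogisticRegression;
use Rubix\ML\Classifiers\MultilayerPerceptron;
use Rubix\ML\NeuralNet\Layers\Dense;
use Rubix\ML\NeuralNet\Layers\Activation;
use Rubix\ML\NeuralNet\ActivationFunctions\ReLU;

// A simple linear classifier - interpretable and scalable, but limited to
// linearly separable problems with 2 classes.
$estimator = new LogisticRegression();

// A flexible neural network that can learn non-linear decision boundaries at the
// cost of interpretability, compute, and a greater risk of overfitting. The hidden
// layer sizes (64) are placeholders.
$estimator = new MultilayerPerceptron([
    new Dense(64),
    new Activation(new ReLU()),
    new Dense(64),
    new Activation(new ReLU()),
]);
```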
## Regressors

In terms of regression, flexibility is expressed as the ability of a model to fit a regression line to potentially complex non-linear data. Linear models such as Ridge tend to underfit data that is non-linear, while more flexible models such as Gradient Boost are prone to overfit the training data if not tuned properly. In general, it's best to choose the simplest regressor that doesn't underfit your dataset.
| Regressor | Flexibility | Online | Verbose | Advantages | Disadvantages |
|---|---|---|---|---|---|
| Adaline | Low | ● | ● | Interpretable model, Highly scalable | Prone to underfitting |
| Extra Tree Regressor | Moderate | | | Faster training, Lower variance | Similar to Regression Tree |
| Gradient Boost | High | | ● | High precision, Computes reliable feature importances | Prone to overfitting, High computation and memory cost |
| K-d Neighbors Regressor | Moderate | | | Faster inference | Not compatible with certain distance kernels |
| KNN Regressor | Moderate | ● | | Intuitable model, Zero-cost training | Slower inference, Suffers from the curse of dimensionality |
| MLP Regressor | High | ● | ● | Handles very high dimensional data, Universal function approximator | High computation and memory cost, Black box |
| Radius Neighbors Regressor | Moderate | | | Robust to outliers, Quasi-anomaly detector | Not guaranteed to return a prediction |
| Regression Tree | Moderate | | | Interpretable model, Automatic feature selection | High variance, Susceptible to overfitting |
| Ridge | Low | | | Interpretable model | Prone to underfitting |
| SVR | High | | | Handles high dimensional data | Difficult to tune, Not suitable for large datasets |
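Following that advice, we might start with a simple linear model and only reach for a more flexible ensemble if it underfits. The sketch below illustrates the idea; the hyper-parameter values are placeholders rather than recommendations.

```php
use Rubix\ML\Regressors\Ridge;
use Rubix\ML\Regressors\GradientBoost;
use Rubix\ML\Regressors\RegressionTree;

// A simple, interpretable linear model with L2 regularization - try this first.
$estimator = new Ridge(1.0);

// A more flexible boosted ensemble of shallow Regression Trees for non-linear data.
$estimator = new GradientBoost(new RegressionTree(3), 0.1);
```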
## Clusterers

Clusterers can be rated by their ability to represent an outer hull surrounding the samples in the cluster. Simple centroid-based models such as K Means establish a uniform hypersphere around the clusters. More flexible clusterers such as Gaussian Mixture can better conform to the outer shape of the cluster by allowing the surface of the hull to be irregular and bumpy. The tradeoff for this flexibility is typically more model parameters and, with them, increased computational complexity.
| Clusterer | Flexibility | Proba | Online | Advantages | Disadvantages |
|---|---|---|---|---|---|
| DBSCAN | High | | | Finds arbitrarily-shaped clusters, Quasi-anomaly detector | Cannot be trained, Slower inference |
| Fuzzy C Means | Low | ● | | Fast training and inference, Soft clustering | Solution highly depends on initialization, Not suitable for large datasets |
| Gaussian Mixture | Moderate | ● | | Captures non-spherical clusters | Higher memory cost |
| K Means | Low | ● | ● | Fast training and inference, Highly scalable | Has local minima |
| Mean Shift | Moderate | ● | | Handles non-convex clusters, No local minima | Slower training |
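As an illustration of the difference in hull flexibility, the sketch below instantiates a centroid-based clusterer next to a mixture model. The target number of clusters (3) is an assumption made for the example.

```php
use Rubix\ML\Clusterers\KMeans;
use Rubix\ML\Clusterers\GaussianMixture;

// Centroid-based clusterer that fits a uniform hypersphere around each cluster.
$estimator = new KMeans(3);

// Mixture model that can capture non-spherical cluster shapes.
$estimator = new GaussianMixture(3);
```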
## Anomaly Detectors

Anomaly detectors can be thought of as belonging to one of two groups. There are the anomaly detectors that consider the entire training set when making a prediction, and there are those that only consider a local region of the training set. Local anomaly detectors are typically more accurate but come with higher computational complexity. Global anomaly detectors are better suited to real-time applications but may produce a higher number of false positives and/or negatives.
| Anomaly Detector | Scope | Scoring | Online | Advantages | Disadvantages |
|---|---|---|---|---|---|
| Gaussian MLE | Global | ● | ● | Fast training and inference, Highly scalable | Strong Gaussian and feature independence assumption, Sensitive to noise |
| Isolation Forest | Local | ● | | Fast training, Handles high dimensional data | Slower inference |
| Local Outlier Factor | Local | ● | | Intuitable model, Finds anomalies within clusters | Suffers from the curse of dimensionality |
| Loda | Global | ● | ● | Highly scalable | High memory cost |
| One Class SVM | Global | | | Handles high dimensional data | Difficult to tune, Not suitable for large datasets |
| Robust Z-Score | Global | ● | | Requires little data, Robust to outliers | Problems with skewed datasets |
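For instance, we could compare a fast global detector with a more precise local one, as in the sketch below. The neighborhood size (20) is illustrative and should be tuned to your data.

```php
use Rubix\ML\AnomalyDetectors\GaussianMLE;
use Rubix\ML\AnomalyDetectors\LocalOutlierFactor;

// Global detector - fast and highly scalable, models the entire training set.
$estimator = new GaussianMLE();

// Local detector - compares each sample to its k nearest neighbors.
$estimator = new LocalOutlierFactor(20);
```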
## Meta-estimators

Meta-estimators enhance other estimators with their own added functionality. They include ensembles, model selectors, and other model enhancers that wrap a compatible base estimator.
| Meta-estimator | Usage | Parallel | Verbose | Compatibility |
|---|---|---|---|---|
| Bootstrap Aggregator | Ensemble | ● | | Classifiers, Regressors, Anomaly Detectors |
| Committee Machine | Ensemble | ● | ● | Classifiers, Regressors, Anomaly Detectors |
| Grid Search | Model Selection | ● | ● | Any |
| Persistent Model | Model Persistence | | | Any persistable model |
In the example below, we'll use the Bootstrap Aggregator meta-estimator to wrap a Regression Tree.
```php
use Rubix\ML\BootstrapAggregator;
use Rubix\ML\Regressors\RegressionTree;

// Bag an ensemble of 1000 Regression Trees with a maximum depth of 4.
$estimator = new BootstrapAggregator(new RegressionTree(4), 1000);
```
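Likewise, a model-selection meta-estimator such as Grid Search wraps the class of a base estimator and searches over combinations of its constructor arguments. Here's a minimal sketch assuming we want to tune K Nearest Neighbors; the candidate values below are arbitrary.

```php
use Rubix\ML\GridSearch;
use Rubix\ML\Classifiers\KNearestNeighbors;
use Rubix\ML\Kernels\Distance\Euclidean;
use Rubix\ML\Kernels\Distance\Manhattan;

// Candidate values for each of K Nearest Neighbors' constructor arguments.
$params = [
    [1, 5, 10],                         // k
    [true, false],                      // weighted
    [new Euclidean(), new Manhattan()], // distance kernel
];

$estimator = new GridSearch(KNearestNeighbors::class, $params);
```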
## No Free Lunch Theorem
At some point you may ask yourself "Why do we need so many different learning algorithms? Can't we just use one that works all the time?" The answer to those questions can be understood through the No Free Lunch (NFL) theorem, which states that, when averaged over all possible problems, no learner performs any better than the next. Put another way, certain learners perform better at some tasks and worse at others. This is explained by the fact that every learning algorithm carries some prior knowledge, whether through the selection of certain hyper-parameters or the design of the algorithm itself. A consequence of No Free Lunch is that there exists no single estimator that performs best for all problems.