What is Machine Learning?

Machine learning (ML) is the ability of a program to progressively improve its performance on a task through training on data, without being explicitly programmed. It is a way of programming with data. Once a learner has been trained, we can use it to make predictions about future outcomes (referred to as inference). Rubix ML supports two types of machine learning out of the box: Supervised and Unsupervised.

Supervised Learning

Supervised learning is a type of ML that incorporates a training signal in the form of human annotations called labels along with the training samples. There are two types of supervised learning to consider in Rubix ML.

Classification

For classification problems, a supervised learner is trained to differentiate samples among a set of k possible discrete classes. In this type of problem, the training labels are the classes that each sample belongs to. Examples of class labels include "cat", "dog", "ship", "human", etc. Classification problems range from simple to very complex and include image recognition, text sentiment analysis, and Iris flower classification.

Regression

Regression is a learning problem that aims to predict a continuous-valued outcome. In this case, the training labels are continuous data types such as integers and floating point numbers. Unlike classifiers, a regressor can predict infinitely many real values. Regression problems include determining the angle of an automobile steering wheel, estimating the sale price of a home, and credit scoring.

Note: By convention in Rubix ML, discrete (referred to as categorical) variables are always denoted by a string, whereas continuous variables are given as either integers or floating point numbers.
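To make the convention above concrete, here is a minimal sketch in plain PHP. The helper variableType() is hypothetical and is not part of Rubix ML; it only mirrors the rule that strings are treated as categorical and integers or floats as continuous. The label values are made up for illustration.

```php
<?php

// Hypothetical helper (not a Rubix ML function): name the variable type
// that a raw PHP value would represent under the convention above.
function variableType($value): string
{
    if (is_string($value)) {
        return 'categorical';
    }

    if (is_int($value) || is_float($value)) {
        return 'continuous';
    }

    return 'unsupported';
}

$classLabels = ['cat', 'dog', 'ship'];      // labels for a classification problem
$priceLabels = [249000, 315500.5, 189999];  // labels for a regression problem

echo variableType($classLabels[0]), PHP_EOL; // categorical
echo variableType($priceLabels[1]), PHP_EOL; // continuous
echo variableType('5'), PHP_EOL;             // categorical, because it is a string
```

Note that a number stored as a string, such as '5', counts as categorical under this rule, so numeric data read from text files usually needs to be converted before it is treated as continuous.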

Unsupervised Learning

A form of learning that does not require training labels is called unsupervised learning. Unsupervised learners aim to detect patterns using just raw data. Since it is not always easy or possible to obtain labeled data, an unsupervised method is often the first step in discovering information about your data. There are three types of unsupervised learning to consider in Rubix ML.

Clustering

Clustering takes a dataset of unlabeled samples and assigns each sample a cluster number based on its similarity to other samples in the training set. Samples that are most similar will be assigned to the same cluster. Clustering is used for tissue differentiation in PET scan images, market segmentation of customer databases, and the discovery of communities within social networks.
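As a rough illustration of assigning cluster numbers by similarity, the sketch below assigns each sample to the closest of a set of fixed centroids. This is plain PHP and not the Rubix ML implementation: the function names, the samples, and the hand-picked centroids are all assumptions made for the example, and a real clusterer would also learn the centroids from the data.

```php
<?php

// Squared Euclidean distance between two equal-length numeric vectors.
function squaredDistance(array $a, array $b): float
{
    $sum = 0.0;

    foreach ($a as $i => $value) {
        $sum += ($value - $b[$i]) ** 2;
    }

    return $sum;
}

// Assign each sample the number (index) of its closest centroid.
function assignClusters(array $samples, array $centroids): array
{
    $assignments = [];

    foreach ($samples as $sample) {
        $best = null;
        $bestDistance = INF;

        foreach ($centroids as $cluster => $centroid) {
            $distance = squaredDistance($sample, $centroid);

            if ($distance < $bestDistance) {
                $bestDistance = $distance;
                $best = $cluster;
            }
        }

        $assignments[] = $best;
    }

    return $assignments;
}

// Made-up 2-dimensional samples and hand-picked centroids.
$samples = [[1.0, 1.2], [0.8, 0.9], [5.1, 4.9], [4.7, 5.3]];
$centroids = [[1.0, 1.0], [5.0, 5.0]];

print_r(assignClusters($samples, $centroids)); // cluster numbers 0, 0, 1, 1
```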

Anomaly Detection

Anomalies are samples that have been generated by a different process than normal or those that do not conform to the expected distribution of the training data. Samples can either be flagged or ranked based on their anomaly score. Anomaly detection is used in security for intrusion and denial of service detection, and in the financial industry to detect fraud.
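The difference between flagging and ranking can be sketched with a very simple anomaly score. The example below uses the absolute z-score (distance from the mean in standard deviations) as a stand-in score; it is not one of the Rubix ML anomaly detectors, and the zScores() helper, the transaction amounts, and the threshold of 2.0 are assumptions chosen for illustration.

```php
<?php

// Score each value by its absolute z-score: the distance from the mean
// measured in (population) standard deviations.
function zScores(array $values): array
{
    $n = count($values);
    $mean = array_sum($values) / $n;

    $variance = 0.0;

    foreach ($values as $value) {
        $variance += ($value - $mean) ** 2;
    }

    $stdDev = sqrt($variance / $n);

    return array_map(
        fn ($value) => $stdDev > 0.0 ? abs($value - $mean) / $stdDev : 0.0,
        $values
    );
}

// Made-up transaction amounts; the last one is unusually large.
$amounts = [20.0, 22.5, 19.0, 21.0, 23.5, 250.0];

$scores = zScores($amounts);

// Flagging: mark samples whose score exceeds an assumed threshold of 2.0.
$flags = array_map(fn ($score) => $score > 2.0, $scores);

// Ranking: order the sample indices from most to least anomalous.
arsort($scores);
$ranking = array_keys($scores);

echo 'Most anomalous sample index: ', $ranking[0], PHP_EOL; // 5
```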

Manifold Learning

Manifold learning is a type of unsupervised non-linear dimensionality reduction used for embedding datasets into dense feature representations. Embedders are used for visualizing high dimensional datasets in low (1 to 3) dimensions, and for compressing samples before input to a learning algorithm.

Obtaining Data

Machine learning projects typically begin with a question. For example, you might want to answer the question "which of my friends are most likely to stay married to their partner?" One way to go about answering this question with machine learning would be to go out and ask a number of happily married and divorced couples the same set of questions about their partner and then use that data to build a model to predict successful relationships based on the answers they gave you. In ML terms, the answers you collect are the values of the features that constitute measurements of the phenomena being observed. The number of features in a sample is called the dimensionality of the sample. For example, a sample with 20 features is said to be 20 dimensional.

As an alternative to collecting data yourself, you can access one of the many open datasets that are free to use from a public repository. The advantage of using a public dataset is that the data has most likely already been cleaned and prepared for you. We recommend the University of California Irvine Machine Learning Repository as a great place to get started with using open datasets.

Hint: See the 'Extracting Data' section to learn more about extracting data from different storage formats.
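To make the idea of dimensionality concrete, the plain PHP sketch below counts the features in a set of samples and checks that every sample has the same number of them. The dimensionality() helper and the sample values are hypothetical and only serve to illustrate the definition given earlier in this section.

```php
<?php

// Hypothetical helper: return the number of features per sample, and
// fail if the samples do not all have the same number of features.
function dimensionality(array $samples): int
{
    $sizes = array_unique(array_map('count', $samples));

    if (count($sizes) !== 1) {
        throw new InvalidArgumentException(
            'Samples must all have the same number of features.'
        );
    }

    return reset($sizes);
}

// Made-up answers from two couples to the same three questions.
$samples = [
    [3, 4, 50.5],
    [1, 5, 24.7],
];

echo dimensionality($samples), PHP_EOL; // 3
```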

The Dataset Object

In Rubix ML, data are passed in specialized containers called Dataset objects. Dataset objects handle selecting, subsampling, transforming, randomizing, and sorting of the samples and labels for you. In general, there are two types of datasets, Labeled and Unlabeled. Labeled datasets are used for supervised learning and for providing the ground-truth during testing. Unlabeled datasets are used for unsupervised learning and for making predictions (inference) on unknown samples.

Suppose that you went out and asked 4 couples (2 married and 2 divorced) to rate their partner's communication skills (between 1 and 5) and attractiveness (between 1 and 5), and to report the time they spend together (in hours per week). You could construct a Labeled dataset from this data by passing the samples and labels into the constructor as in the example below.

use Rubix\ML\Datasets\Labeled;

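// Each sample: [communication (1-5), attractiveness (1-5), hours together per week]
$samples = [
    [3, 4, 50.5], [1, 5, 24.7], [4, 4, 62.0], [3, 2, 31.1]
];

// One label per sample, in the same order as the samples
$labels = ['married', 'divorced', 'married', 'divorced'];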

$dataset = new Labeled($samples, $labels);

Hint: See the 'Representing your Data' section for an in-depth description of how Rubix ML treats various forms of data.

Choosing an Estimator

Estimators make up the core of the Rubix ML library. They provide the predict() API and are responsible for making predictions on unknown samples. Estimators that can be trained with data are called Learners and must be trained before making predictions.

For our example we will focus on an intuitive distance-based supervised learner called K Nearest Neighbors. KNN is a type of estimator called a Classifier because it takes unknown samples and assigns them a class label. In our example the output of KNN will either be married or divorced since those are the class labels that we train it with.

Creating the Estimator Instance

The K Nearest Neighbors classifier works by locating the closest training samples to an unknown sample and choosing the class label that is most common. Like most estimators, the K Nearest Neighbors (KNN) classifier requires a set of parameters (called hyper-parameters) to be chosen up-front by the user. These parameters control how the learner behaves during training and inference. These parameters can be selected based on some prior knowledge of the problem space, or completely at random. The defaults provided in Rubix ML are a good place to start for most problems.

In KNN, the hyper-parameter k is the number of nearest points from the training set to compare an unknown sample to in order to infer its class label. For example, if the 5 closest neighbors to a given unknown sample have 4 married and 1 divorced label, then the algorithm will output a prediction of married with a probability of 0.8.

use Rubix\ML\Classifiers\KNearestNeighbors;

$estimator = new KNearestNeighbors(5);
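The vote described above (4 married and 1 divorced neighbor giving married with a probability of 0.8) can be worked through in a few lines of plain PHP. This is only a sketch of an unweighted majority vote; the variable names are ours, and the library's classifier may differ in details such as weighting neighbors by distance.

```php
<?php

// Labels of the k = 5 nearest neighbors from the example in the text.
$neighborLabels = ['married', 'married', 'divorced', 'married', 'married'];

// Count the votes for each class label.
$votes = array_count_values($neighborLabels);

// Turn the vote counts into probabilities by dividing by k.
$k = count($neighborLabels);
$probabilities = array_map(fn ($count) => $count / $k, $votes);

// The prediction is the label with the highest share of the votes.
arsort($probabilities);
$prediction = array_key_first($probabilities);

echo $prediction, ' ', $probabilities[$prediction], PHP_EOL; // married 0.8
```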

Training the Learner

Training is the process of feeding the learning algorithm data so that it can build an internal representation of the problem space. This representation is often called a model and it consists of all of the parameters (except hyper-parameters) that are required to make a prediction. In the case of K Nearest Neighbors, this representation is a Euclidean space with one dimension per feature, in which each sample is considered a point.
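To see what treating each sample as a point in Euclidean space means, the sketch below computes the distance from a hypothetical unknown sample to each training sample from our example and sorts the training samples from nearest to farthest. It is written in plain PHP with a hand-rolled euclidean() helper, which is an assumption for illustration rather than the library's own distance kernel.

```php
<?php

// Euclidean distance between two samples treated as points in space.
function euclidean(array $a, array $b): float
{
    $sum = 0.0;

    foreach ($a as $i => $value) {
        $sum += ($value - $b[$i]) ** 2;
    }

    return sqrt($sum);
}

// Training samples and labels from the example dataset.
$samples = [[3, 4, 50.5], [1, 5, 24.7], [4, 4, 62.0], [3, 2, 31.1]];
$labels = ['married', 'divorced', 'married', 'divorced'];

// A hypothetical unknown sample to compare against the training set.
$unknown = [4, 3, 44.2];

$distances = [];

foreach ($samples as $i => $sample) {
    $distances[$i] = euclidean($sample, $unknown);
}

// Sort by distance while keeping the original sample indices as keys.
asort($distances);

$nearest = array_key_first($distances);

echo 'Nearest neighbor is labeled: ', $labels[$nearest], PHP_EOL; // married
```

Because the hours-per-week feature has a much larger range than the two ratings, it dominates the distances in this example.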

Note: If you try to make predictions using an untrained learner, it will throw an exception.

$estimator->train($dataset);

We can verify that the learner has been trained by calling the trained() method:

var_dump($estimator->trained());
bool(true)

For our small training set, the training process should only take a matter of microseconds, but larger datasets with higher dimensionality can take much longer. Once the learner has been trained, we can feed in some unknown samples to see what the model predicts.

Hint: See the 'Training' section for a closer look at training a learner.

Making Predictions

Suppose that we went out and collected 4 new data points from our friends using the same questions we asked the couples we interviewed for our training set. We could predict whether or not they will stay married by taking their answers and running them through the trained KNN estimator in an Unlabeled dataset. The process of making predictions is called inference because the estimator uses the model constructed during training to infer the label of the unknown samples.

use Rubix\ML\Datasets\Unlabeled;

$samples = [
    [4, 3, 44.2], [2, 2, 16.7], [2, 4, 19.5], [3, 3, 55.0],
];

$dataset = new Unlabeled($samples);

$predictions = $estimator->predict($dataset);

var_dump($predictions);
array(4) {
    [0] => 'married'
    [1] => 'divorced'
    [2] => 'divorced'
    [3] => 'married'
}

The output of the KNN classifier is the list of predicted class labels of the unknown samples, in the order they were fed to the estimator. We could either trust these predictions or we could proceed to further evaluate the model. In the next section, we'll learn how to test the generalization performance of our estimator.

Hint: Check out the section on 'Inference' for more info on making predictions with an estimator.

Model Evaluation

To test that the estimator can correctly generalize what it has learned during training to the real world, we use a process called cross validation. The goal of cross validation is to train and test the learner on different subsets of the dataset in order to produce a validation score. For the purposes of the introduction, we will use the Hold Out validator which takes a portion of the dataset for testing and leaves the rest for training. The reason we do not use all of the data for training is that we want to test the estimator on samples that it has never seen before.

The Hold Out validator requires the user to set the proportion of the dataset to hold out for testing as a constructor parameter. Let's use a ratio of 0.2 (20%) of the dataset for testing, leaving the rest (80%) for training.

Note: Typically, 0.2 is a good default choice; however, your mileage may vary. The important thing to note here is the trade-off between having more data for training and having more data to produce a more reliable test score.
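The idea of holding out a portion of the data can be sketched in plain PHP as follows. The holdOutSplit() function and the toy dataset are assumptions made for illustration; the library's validator performs the split for you and may differ in details such as how it shuffles or balances the classes.

```php
<?php

// Split parallel arrays of samples and labels into training and testing
// sets, holding out the given ratio of the data for testing.
function holdOutSplit(array $samples, array $labels, float $ratio): array
{
    $indices = array_keys($samples);

    // Shuffle so that the split does not depend on the original order.
    shuffle($indices);

    $numTesting = (int) round($ratio * count($indices));

    $testing = array_slice($indices, 0, $numTesting);
    $training = array_slice($indices, $numTesting);

    // Select the rows of $source at the given indices.
    $pick = fn (array $source, array $keys) => array_map(fn ($i) => $source[$i], $keys);

    return [
        'training' => [$pick($samples, $training), $pick($labels, $training)],
        'testing' => [$pick($samples, $testing), $pick($labels, $testing)],
    ];
}

// A toy dataset of 10 one-dimensional samples with made-up labels.
$samples = array_map(fn ($i) => [$i], range(1, 10));
$labels = array_map(fn ($i) => $i % 2 === 0 ? 'even' : 'odd', range(1, 10));

$split = holdOutSplit($samples, $labels, 0.2);

[$trainSamples, $trainLabels] = $split['training'];
[$testSamples, $testLabels] = $split['testing'];

echo count($trainSamples), ' training, ', count($testSamples), ' testing', PHP_EOL;
// 8 training, 2 testing
```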

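To return a score from the Hold Out validator using the Accuracy metric, pass in the estimator instance along with the entire labeled dataset. The validator trains the estimator on the training portion itself, so the estimator does not need to be trained beforehand.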

use Rubix\ML\CrossValidation\HoldOut;
use Rubix\ML\CrossValidation\Metrics\Accuracy;

$validator = new HoldOut(0.2);

// $dataset must be a Labeled dataset here so that predictions can be compared to the ground truth
$score = $validator->test($estimator, $dataset, new Accuracy());

var_dump($score);
float(0.945)

The output of the cross validator is a validation score that can be interpreted as the degree to which the learner is able to accurately generalize its training to unknown data. In the example above, our model is about 95% accurate according to our chosen metric.
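For intuition about what an accuracy score measures, it can be computed by hand as the fraction of predictions that match the ground-truth labels. The accuracy() function below is a plain PHP sketch and not the library's metric class, and the ground-truth labels are made up for illustration.

```php
<?php

// Fraction of predictions that exactly match the ground-truth labels.
function accuracy(array $predictions, array $labels): float
{
    if (count($labels) === 0 || count($predictions) !== count($labels)) {
        throw new InvalidArgumentException(
            'Predictions and labels must be non-empty and of equal length.'
        );
    }

    $correct = 0;

    foreach ($predictions as $i => $prediction) {
        if ($prediction === $labels[$i]) {
            $correct++;
        }
    }

    return $correct / count($labels);
}

$predictions = ['married', 'divorced', 'divorced', 'married'];

// Hypothetical ground truth for the same four samples.
$groundTruth = ['married', 'divorced', 'married', 'married'];

echo accuracy($predictions, $groundTruth), PHP_EOL; // 0.75
```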

Next Steps

Congratulations! You've completed the basic introduction to machine learning in PHP with Rubix ML. For a more in-depth tutorial using the K Nearest Neighbors classifier, check out the Iris Flower example project. We highly recommend browsing the rest of the documentation and the other example projects which range from beginner to advanced skill level. Have fun and stay curious!