What is Machine Learning?#

Machine learning (ML) is the ability of a program to progressively improve its performance on a task through training on data, without being explicitly programmed. It is a way of programming with data. Once a learner has been trained, we can use it to make predictions about future outcomes, a process referred to as inference. Rubix ML supports two types of machine learning out of the box: supervised and unsupervised.

Supervised Learning#

Supervised learning is a type of ML that incorporates a training signal in the form of human annotations called labels along with the training samples. There are two types of supervised learning to consider in Rubix ML.

Classification#

For classification problems, a supervised learner is trained to differentiate samples among a set of k possible discrete classes. In this type of problem, the training labels are the classes that each sample belongs to. Examples of class labels include "cat", "dog", "ship", "human", etc. Classification problems range from simple to very complex and include image recognition, text sentiment analysis, and Iris flower classification.

Regression#

Regression is a learning problem that aims to predict a continuous-valued outcome. In this case, the training labels are continuous data types such as integers and floating point numbers. Unlike classifiers, a regressor can predict infinitely many real values. Regression problems include determining the angle of an automobile steering wheel, estimating the sale price of a home, and credit scoring.

Note: By convention in Rubix ML, discrete (referred to as categorical) variables are always denoted by a string, whereas continuous variables are given as either integers or floating point numbers.
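As an illustration of this convention, the following plain-PHP sketch tags each feature of a sample as categorical or continuous based on its PHP type. The helper name featureType() and the sample values are hypothetical and are not part of the Rubix ML API.

```php
// Hypothetical helper (not part of Rubix ML): classify a feature value
// according to the convention described in the note above.
function featureType($value): string
{
    if (is_string($value)) {
        return 'categorical';
    }

    if (is_int($value) || is_float($value)) {
        return 'continuous';
    }

    return 'unsupported';
}

// A made-up sample that mixes both kinds of features.
$sample = ['red', 4, 62.0, 'married'];

$types = array_map('featureType', $sample);

print_r($types);
```

One consequence worth keeping in mind is that a numeric string such as '5' counts as categorical under this convention, so values read from text sources may need to be cast to a number first.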

Unsupervised Learning#

Unsupervised learning is a form of learning that does not require training labels. Unsupervised learners aim to detect patterns using just raw data. Since it is not always easy or possible to obtain labeled data, an unsupervised method is often the first step in discovering information about your data. There are three types of unsupervised learning to consider in Rubix ML.

Clustering#

Clustering takes a dataset of unlabeled samples and assigns each sample a cluster number based on its similarity to other samples in the training set. Samples that are most similar will be assigned to the same cluster. Clustering is used for tissue differentiation in PET scan images, market segmentation of customer databases, and community discovery within social networks.

Anomaly Detection#

Anomalies are samples that have been generated by a different process than normal or those that do not conform to the expected distribution of the training data. Samples can either be flagged or ranked based on their anomaly score. Anomaly detection is used in security for intrusion and denial of service detection, and in the financial industry to detect fraud.

Manifold Learning#

Manifold learning is a type of unsupervised non-linear dimensionality reduction used for embedding datasets into dense feature representations. Embedders are used for visualizing high dimensional datasets in low (1 to 3) dimensions, and for compressing samples before input to a learning algorithm.

Obtaining Data#

Machine learning projects typically begin with a question. For example, you might want to answer the question "which of my friends are most likely to stay married to their partner?" One way to go about answering this question with machine learning would be to ask a number of happily married and divorced couples the same set of questions about their partner and then use that data to build a model that predicts successful relationships based on the answers they gave you. In ML terms, the answers you collect are the values of the features that constitute measurements of the phenomenon being observed. The number of features in a sample is called the dimensionality of the sample. For example, a sample with 20 features is said to be 20 dimensional.

As an alternative to collecting data yourself, you can use one of the many open datasets that are free to use from a public repository. The advantage of using a public dataset is that the data has most likely already been cleaned and prepared for you. We recommend the University of California Irvine Machine Learning Repository as a great place to get started with open datasets.

Extracting Data#

Before our data can become useful, we need to load it into our script from its stored format. There are many PHP libraries that help make extracting data from various sources easy and intuitive, and we recommend checking them out as a great place to start.

In addition, PHP has a number of built-in functions and extensions that allow you to access data stored in various formats, including CSV files and databases.
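As a rough sketch of extraction using only built-in PHP functions, the snippet below parses CSV text into an array of samples and an array of labels. The column names and values are made up for illustration, and an inline string stands in for a real file.

```php
// Inline CSV text standing in for a file; columns and values are made up.
$csv = <<<CSV
communication,attractiveness,hours_together,status
3,4,50.5,married
1,5,24.7,divorced
CSV;

$lines = explode("\n", trim($csv));

// Discard the header row.
array_shift($lines);

$samples = $labels = [];

foreach ($lines as $line) {
    // Delimiter, enclosure, and escape characters are passed explicitly.
    $row = str_getcsv($line, ',', '"', '\\');

    // The last column is the label; the remaining columns are features.
    $labels[] = array_pop($row);

    // CSV fields arrive as strings, so cast the numeric features to
    // floats to mark them as continuous.
    $samples[] = array_map('floatval', $row);
}

var_dump($samples, $labels);
```

The cast matters because every field parsed from CSV is a string, and string features are treated as categorical by the convention described earlier.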

The Dataset Object#

In Rubix ML, data are passed in specialized containers called Dataset objects. Dataset objects handle all the selecting, splitting, folding, transforming, randomizing, and sorting of the samples and labels while keeping their indices aligned. In general, there are two types of datasets, Labeled and Unlabeled. Labeled datasets are used for supervised learning and for providing the ground-truth during testing. Unlabeled datasets are used for unsupervised learning and for making predictions (inference) on unknown samples.

Suppose that you went out and asked 100 couples (50 married and 50 divorced) to rate their partner's communication skills (between 1 and 5), attractiveness (between 1 and 5), and time spent together per week (hours per week). You could construct a Labeled dataset from this data by passing the samples and labels into the constructor.

use Rubix\ML\Datasets\Labeled;

$samples = [
    [3, 4, 50.5], [1, 5, 24.7], [4, 4, 62.0], [3, 2, 31.1]
];

$labels = ['married', 'divorced', 'married', 'divorced'];

$dataset = new Labeled($samples, $labels);
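To illustrate what keeping the indices of samples and labels aligned means, here is a plain-PHP sketch of the bookkeeping involved in randomizing a labeled dataset by hand. It is not how the Dataset object is implemented, and the variable names are hypothetical; the point is only that both arrays must be reordered identically.

```php
$samples = [
    [3, 4, 50.5], [1, 5, 24.7], [4, 4, 62.0], [3, 2, 31.1],
];

$labels = ['married', 'divorced', 'married', 'divorced'];

// Shuffle a list of row indices rather than the arrays themselves.
$order = range(0, count($samples) - 1);

shuffle($order);

$shuffledSamples = $shuffledLabels = [];

// Apply the same ordering to both arrays so each sample keeps its label.
foreach ($order as $index) {
    $shuffledSamples[] = $samples[$index];
    $shuffledLabels[] = $labels[$index];
}

print_r($shuffledLabels);
```

Shuffling the two arrays independently would silently pair samples with the wrong labels, which is the kind of mistake a dataset container is meant to rule out.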

Choosing an Estimator#

Estimators make up the core of the Rubix ML library. They provide the predict() API and are responsible for making predictions on samples. Estimators that can be trained with data are called Learners and must be trained before making predictions.

For our example we will focus on an intuitive distance-based classifier called K Nearest Neighbors. Since the label of each training sample will be a discrete class (married or divorced), the type of estimator we need is a classifier.

Note: In practice, you will test out a number of different estimators to get the best sense of what works for your particular dataset.

Creating the Estimator Instance#

The K Nearest Neighbors classifier works by locating the closest training samples to an unknown sample and choosing the class label that is most common. Like most estimators, the K Nearest Neighbors (KNN) classifier requires a set of parameters (called hyper-parameters) to be chosen up front by the user. These parameters control how the learner behaves during training and inference. These parameters can be selected based on some prior knowledge of the problem space, or completely at random. The defaults provided in Rubix ML are a good place to start for most problems.

In KNN, the hyper-parameter k is the number of nearest points from the training set to compare an unknown sample to in order to infer its class label. For example, if the 5 closest neighbors to a given unknown sample have 4 married and 1 divorced label, then the algorithm will output a prediction of married with a probability of 0.8.

use Rubix\ML\Classifiers\KNearestNeighbors;

$estimator = new KNearestNeighbors(5);
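To make the voting step described above concrete, the following plain-PHP sketch finds the k nearest training samples by Euclidean distance and turns their labels into class proportions. It is a simplified illustration under stated assumptions, not the library's implementation: the function names euclidean() and knnVote() are hypothetical, and the choice of Euclidean distance is an assumption made for this example.

```php
// Training data reused from the Labeled dataset example.
$samples = [
    [3, 4, 50.5], [1, 5, 24.7], [4, 4, 62.0], [3, 2, 31.1],
];

$labels = ['married', 'divorced', 'married', 'divorced'];

// Euclidean distance between two samples of equal dimensionality.
function euclidean(array $a, array $b): float
{
    $sum = 0.0;

    foreach ($a as $i => $value) {
        $sum += ($value - $b[$i]) ** 2;
    }

    return sqrt($sum);
}

// Return the share of votes for each class among the k nearest neighbors.
function knnVote(array $samples, array $labels, array $unknown, int $k): array
{
    $distances = [];

    foreach ($samples as $i => $sample) {
        $distances[$i] = euclidean($sample, $unknown);
    }

    // Sort by distance while preserving the sample indices.
    asort($distances);

    $nearest = array_slice(array_keys($distances), 0, $k);

    $votes = [];

    foreach ($nearest as $i) {
        $votes[$labels[$i]] = ($votes[$labels[$i]] ?? 0) + 1;
    }

    // Order classes by vote count, then convert counts to proportions.
    arsort($votes);

    return array_map(fn ($count) => $count / count($nearest), $votes);
}

$probabilities = knnVote($samples, $labels, [4, 3, 44.2], 3);

print_r($probabilities);
```

With k set to 3, two of the three nearest neighbors of the unknown sample are labeled married, so the sketch predicts married with a proportion of 2/3. The hours-per-week feature dominates the distance here because its scale is much larger than the 1 to 5 ratings.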

Training and Prediction#

Training is the process of feeding the learning algorithm data so that it can build an internal representation of the problem. This representation is called a model and it consists of all of the parameters (except hyper-parameters) that are required for the estimator to make a prediction. If you try to make a prediction using an untrained learner, it will throw an exception.

$estimator->train($dataset);

We can verify that the learner has been trained by calling the trained() method:

var_dump($estimator->trained());
bool(true)

For our small training set, the training process should only take a matter of microseconds, but larger datasets with higher dimensionality can take much longer. Once the learner has been fully trained, we can feed in some unknown samples to see what it predicts.

Suppose that we went out and collected 5 new data points from our friends using the same questions we asked the couples we interviewed for our training set. We could predict whether or not they will stay married by taking their answers and passing them to the trained KNN estimator in an Unlabeled dataset.

use Rubix\ML\Datasets\Unlabeled;

$samples = [
    [4, 3, 44.2], [2, 2, 16.7], [2, 4, 19.5], [1, 5, 8.6], [3, 3, 55.0],
];

$dataset = new Unlabeled($samples);

$predictions = $estimator->predict($dataset);

var_dump($predictions);
array(5) {
    [0] => 'married'
    [1] => 'divorced'
    [2] => 'divorced'
    [3] => 'divorced'
    [4] => 'married'
}

Model Evaluation#

To test that the estimator can correctly generalize what it has learned during training to the real world, we use a process called cross validation. The goal of cross validation is to train and test the learner on different subsets of the dataset so as to produce a validation score. For the purposes of this introduction, we will use the Hold Out validator, which takes a portion of the dataset for testing and leaves the rest for training. The reason we do not use all of the data for training is that we want to test the estimator on samples that it has never seen before.

The Hold Out validator takes the proportion of the dataset to hold out for testing as a constructor parameter. Let's use 0.2 (20%) of the dataset for testing, leaving the rest (80%) for training.

Note: Typically, 0.2 is a good default choice, but your mileage may vary. The important thing to note here is the trade-off between having more data for training and having more data for testing, which produces a more reliable validation score.
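As a minimal sketch of what a 0.2 hold-out ratio means in practice, the plain-PHP snippet below shuffles a stand-in dataset of 10 rows and sets aside 20% of them for testing. The variable names are hypothetical, and the rounding rule is an assumption of this sketch; the validator may divide the rows differently.

```php
// Hypothetical stand-in for 10 labeled rows; only the row ids matter here.
$rows = range(1, 10);

$ratio = 0.2;

// Randomize the order so the testing set is not biased by row position.
shuffle($rows);

// Number of rows held out for testing.
$n = (int) round($ratio * count($rows));

$testing = array_slice($rows, 0, $n);
$training = array_slice($rows, $n);

echo count($testing) . ' testing, ' . count($training) . ' training';
```

Raising the ratio gives a more trustworthy score at the cost of leaving the learner with fewer samples to train on.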

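To return a score from the Hold Out validator using the Accuracy metric, pass in an estimator instance along with a Labeled dataset and the metric. The validator trains the estimator on the training portion itself, so the estimator does not need to be trained beforehand.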

use Rubix\ML\CrossValidation\HoldOut;
use Rubix\ML\CrossValidation\Metrics\Accuracy;

$validator = new HoldOut(0.2);

$score = $validator->test($estimator, $dataset, new Accuracy());

var_dump($score);
float(0.945)
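For intuition about what the score measures, accuracy is the fraction of testing samples whose predicted label matches the ground truth. The plain-PHP sketch below computes it by hand; the predictions and labels are made up for illustration and are unrelated to the score shown above.

```php
// Made-up predictions and ground-truth labels for 5 testing samples.
$predictions = ['married', 'divorced', 'divorced', 'divorced', 'married'];
$labels = ['married', 'divorced', 'married', 'divorced', 'married'];

$correct = 0;

foreach ($predictions as $i => $prediction) {
    if ($prediction === $labels[$i]) {
        ++$correct;
    }
}

// 4 of the 5 predictions match, giving an accuracy of 0.8.
$accuracy = $correct / count($predictions);

var_dump($accuracy);
```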

Congratulations! You're done with the basic introduction to machine learning in Rubix ML.

Next Steps#

For a more in-depth tutorial using the K Nearest Neighbors classifier, check out the Iris Flower example project. We highly recommend browsing the rest of the documentation and the other example projects which range from beginner to advanced skill level.