What is Machine Learning?#
Machine Learning (ML) is a form of Artificial Intelligence (AI) that uses data to train a computer to perform tasks. Unlike traditional programming, in which rules are programmed explicitly, machine learning uses algorithms to build rulesets automatically. At a high level, machine learning is a collection of techniques borrowed from many disciplines, including statistics, probability theory, and neuroscience, combined with novel ideas for the purpose of gaining insight through data and computation. Machine learning is further broken down into subcategories based on how the learners are trained and the tasks they handle.
Supervised Learning#
Supervised learning is a type of machine learning that incorporates a training signal in the form of labels, which are often determined by a human expert. Labels are the desired output of a learner given the sample we are showing it. For this reason, you can think of supervised learning as learning by example. There are two types of supervised learning to consider in Rubix ML.
Classification#
For classification problems, a learner is trained to differentiate samples among a set of k possible discrete classes. In this type of problem, the training labels are the classes that each sample belongs to. Examples of class labels include cat, dog, human, or any other categorical label. Classification problems include image recognition, text sentiment analysis, and Iris flower classification.
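To make the idea of learning from labeled examples concrete, here is a minimal sketch of a k-nearest neighbors classifier in plain Python. It is an illustration only and not Rubix ML's API; the function name `knn_predict` and the toy measurements are assumptions made up for this example.

```python
from collections import Counter
import math


def knn_predict(samples, labels, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest training samples."""
    # Pair each training sample's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(sample, query), label)
        for sample, label in zip(samples, labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the k nearest neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: [weight in kg, height in cm] with one class label per sample.
samples = [[4.0, 25.0], [5.0, 23.0], [30.0, 60.0], [25.0, 55.0], [70.0, 175.0], [80.0, 180.0]]
labels = ["cat", "cat", "dog", "dog", "human", "human"]

prediction = knn_predict(samples, labels, [28.0, 58.0], k=3)
print(prediction)  # dog
```

The learner never receives an explicit rule for what makes a dog; the class is inferred entirely from the labeled samples closest to the query.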
Regression#
Regression is a learning problem that aims to predict a continuous-valued outcome. In this case, the training labels are numerical values such as integers and floating point numbers. Regression problems include estimating house prices, credit scoring, and predicting the steering angle of an autonomous vehicle.
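As a rough sketch of regression, the snippet below fits a straight line to one feature using ordinary least squares and then predicts a continuous value for an unseen sample. The helper `fit_line` and the floor-area and price figures are hypothetical and chosen only for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares on a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: floor area in square meters and price in thousands.
areas = [50.0, 80.0, 100.0, 120.0]
prices = [150.0, 240.0, 300.0, 360.0]

slope, intercept = fit_line(areas, prices)

# Predict a continuous label for an unseen 90 square meter house.
estimate = slope * 90.0 + intercept
print(estimate)
```

Unlike the classifier, the output here is a number on a continuous scale rather than one of a fixed set of classes.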
Unsupervised Learning#
Unsupervised learning is a form of learning that does not require labeled data. Unsupervised learners focus on finding patterns within the samples alone. There are three types of unsupervised learning offered in Rubix ML.
Clustering#
Clustering takes a dataset and assigns each of the samples a discrete cluster number based on its similarity to other samples from the training set. It can be viewed as a weaker form of classification where the class names are unknown. For example, clustering is used to group colors, segment customer databases, and discover communities within social networks.
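The following is a minimal sketch of the K-means algorithm, one common way to cluster, applied to the color grouping use case. It is not Rubix ML's implementation; seeding the centroids with the first k samples is a simplifying assumption, and the RGB values are made up.

```python
import math


def k_means(samples, k, iterations=10):
    """Assign each sample a cluster number by alternating assignment and update steps."""
    # Assumption: seed the centroids with the first k samples to keep the sketch deterministic.
    centroids = [list(sample) for sample in samples[:k]]
    assignments = [0] * len(samples)
    for _ in range(iterations):
        # Assignment step: each sample joins the cluster with the nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(sample, centroids[c]))
            for sample in samples
        ]
        # Update step: move each centroid to the mean of its current members.
        for c in range(k):
            members = [s for s, a in zip(samples, assignments) if a == c]
            if members:
                centroids[c] = [sum(column) / len(members) for column in zip(*members)]
    return assignments, centroids


# Hypothetical unlabeled data: RGB colors that are either reddish or bluish.
colors = [[250, 10, 10], [10, 10, 240], [240, 30, 20], [20, 5, 250], [255, 0, 0], [0, 20, 230]]

assignments, centroids = k_means(colors, k=2)
print(assignments)  # [0, 1, 0, 1, 0, 1]
```

The cluster numbers carry no meaning of their own. Nothing tells the learner that cluster 0 is "red"; it only discovers that those samples are similar to one another.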
Anomaly Detection#
Anomalies are defined as samples that have been generated by a different process than normal. Samples can either be flagged or ranked based on their anomaly score. Anomaly detection is used in information security for intrusion and denial of service detection, and in the financial industry to detect fraud.
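Both flagging and ranking can be sketched with a simple z-score, which measures how many standard deviations a value lies from the mean. This is a stand-in technique chosen for brevity, not a description of any particular Rubix ML detector, and the traffic numbers and the threshold of 2.0 are assumptions.

```python
import statistics


def anomaly_scores(values):
    """Score each value by its absolute distance from the mean in standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [abs(value - mean) / stdev for value in values]


# Hypothetical data: requests per minute, with one burst resembling a denial of service attack.
traffic = [98, 102, 101, 97, 100, 99, 450, 103]

scores = anomaly_scores(traffic)

# Ranking: sample indices ordered from most to least anomalous.
ranked = sorted(range(len(traffic)), key=lambda i: scores[i], reverse=True)

# Flagging: samples whose score exceeds an assumed threshold of 2.0.
flagged = [traffic[i] for i, score in enumerate(scores) if score > 2.0]
print(flagged)  # [450]
```

Ranking keeps every sample and lets a human review the most suspicious ones first, whereas flagging commits to a binary decision through the threshold.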
Manifold Learning#
Manifold learning is a type of unsupervised non-linear dimensionality reduction used for embedding datasets into dense feature representations. Embedders can be used for visualizing high-dimensional (4 or more) datasets in low (1 to 3) dimensions, or for compressing the information within the samples before they are input to a learning algorithm.
Deep Learning#
Deep Learning is a subset of machine learning that incorporates layers of computation that form feature representations of greater and greater complexity. It is a paradigm shift from feature engineering to letting the learner construct its own features from the raw data. Deep Learning is used in image recognition, natural language processing (NLP), and for other tasks involving very high-dimensional raw inputs.
AutoML#
Automated Machine Learning (AutoML) is the application of automated tools when designing machine learning models. The goal of AutoML is to simplify the machine learning lifecycle for non-experts and to facilitate rapid prototyping. In addition, AutoML can aid in the discovery of simpler and more accurate solutions than could otherwise be discovered by human intuition alone. Rubix provides a number of tools to help automate the machine learning process including hyper-parameter optimizers and feature selectors.
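Hyper-parameter optimization can be sketched as a grid search that scores every candidate setting and keeps the best one. The example below tunes k for a tiny one-dimensional nearest neighbors classifier using leave-one-out validation. All function names and the dataset are hypothetical, and this is not the interface of Rubix ML's optimizers.

```python
from collections import Counter


def knn_classify(train, query, k):
    """Classify `query` by majority vote among the k closest (value, label) pairs."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def leave_one_out_accuracy(data, k):
    """Hold out each sample in turn, train on the rest, and report the fraction predicted correctly."""
    correct = 0
    for i, (value, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_classify(rest, value, k) == label:
            correct += 1
    return correct / len(data)


def grid_search(data, candidates):
    """Score every candidate k and return the first one with the highest validation accuracy."""
    scores = {k: leave_one_out_accuracy(data, k) for k in candidates}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical data: two well-separated groups plus one noisy "low" label at 3.15.
data = [
    (1.0, "low"), (1.1, "low"), (1.2, "low"), (1.3, "low"),
    (3.0, "high"), (3.1, "high"), (3.2, "high"), (3.3, "high"),
    (3.15, "low"),
]

best_k, scores = grid_search(data, candidates=[1, 3, 5])
print(best_k)  # 3
```

With k set to 1, the single noisy label misleads its neighbors, so a larger k validates better. The search finds that out without a human having to reason about it.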
Other Forms of ML#
Although the supervised and unsupervised learning framework covers a substantial number of problems, there are other types of machine learning that the library does not support out of the box.
Reinforcement Learning#
Reinforcement Learning (RL) is a type of machine learning that aims to learn the optimal control of an agent within an environment through cumulative reward. The data used to train an RL learner are the states obtained by performing some action and then observing the response. If supervised learning is learning by example then reinforcement learning is learning from mistakes. Reinforcement learning is used to train AIs to play games such as Go, Chess, and Starcraft 2, and in robotics for movement planning.
Sequence Learning#
Sequence Learning is a type of ML that aims to predict the next value in a sequence, such as the next word in a sentence or a future stock price. It differs from learning from sets of data in that the order of the samples matters. Time-series analysis is a special case of sequence learning where the sequences are ordered by time. The term sequence-to-sequence learning is used when the output is not just the next value but the next sequence of values.
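A very small sketch of next-value prediction is a bigram model, which counts which word follows each word and predicts the most frequent successor. The corpus and function names below are made up for illustration; practical sequence learners are far more sophisticated.

```python
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Count how often each token is immediately followed by each other token."""
    transitions = defaultdict(Counter)
    # Order matters: each pair is (token at position i, token at position i + 1).
    for current, following in zip(tokens, tokens[1:]):
        transitions[current][following] += 1
    return transitions


def predict_next(model, token):
    """Return the most frequent successor of `token`, or None if it was never followed by anything."""
    if token not in model:
        return None
    return model[token].most_common(1)[0][0]


# Hypothetical training sequence.
corpus = "the cat sat on the mat and the cat slept".split()

model = train_bigram_model(corpus)
next_word = predict_next(model, "the")
print(next_word)  # cat
```

Shuffling the corpus would change the counts and therefore the predictions, which is exactly what separates sequence learning from learning on unordered sets of samples.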
Self-supervised Learning#
Self-supervised learning is a hybrid approach in which a learner is trained to predict parts of a sample that were deliberately omitted during training. Because the omitted parts serve as the labels, supervised methods can be employed on unlabeled data to learn representations. Self-supervised learning is used in language models such as GPT-3 to generate sequences of text and in autonomous robots to learn from ancillary sensor feedback.
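The sketch below shows one way unlabeled text can be turned into supervised training pairs: hide one token at a time and use the hidden token as the label. The masking scheme, the `[MASK]` placeholder, and the function name are illustrative assumptions and do not describe how any specific model such as GPT-3 is trained.

```python
def masked_examples(tokens, mask="[MASK]"):
    """Turn one unlabeled token sequence into (masked input, hidden token) training pairs."""
    examples = []
    for i, target in enumerate(tokens):
        # Replace the token at position i with the mask; the original token becomes the label.
        masked = tokens[:i] + [mask] + tokens[i + 1:]
        examples.append((masked, target))
    return examples


# Hypothetical unlabeled sample.
sentence = "machine learning uses data".split()

pairs = masked_examples(sentence)
for masked, target in pairs:
    print(masked, "->", target)
```

No human annotation was needed to produce these pairs. The labels come from the data itself, and any supervised learner could then be trained on them.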