Preprocessing

Sometimes, one or more preprocessing steps may need to be taken to transform the dataset before handing it off to a Learner. Some examples of preprocessing include feature extraction, standardization, normalization, imputation, and dimensionality reduction.

Transformers

Transformers are objects that perform various preprocessing steps on the samples in a dataset. Stateful transformers are a type of transformer that must be fitted to a dataset before use. Fitting a transformer to a dataset is much like training a learner, but in the context of preprocessing rather than inference. After fitting, a Stateful transformer expects the features to be present in the same order when transforming subsequent datasets. A few transformers are supervised, meaning they must be fitted with a Labeled dataset. Elastic transformers can have their fittings updated with new data after the initial fitting.

Transform a Dataset

An example of a transformation is one that converts the categorical features of a dataset to continuous ones using a one hot encoding. To accomplish this with the library, pass a One Hot Encoder instance as an argument to the Dataset object's apply() method. Note that the apply() method also handles fitting a Stateful transformer automatically.

use Rubix\ML\Transformers\OneHotEncoder;

$dataset->apply(new OneHotEncoder());

Transformations can be chained by calling the apply() method fluently.

use Rubix\ML\Transformers\RandomHotDeckImputer;
use Rubix\ML\Transformers\OneHotEncoder;
use Rubix\ML\Transformers\MinMaxNormalizer;

$dataset->apply(new RandomHotDeckImputer(5))
    ->apply(new OneHotEncoder())
    ->apply(new MinMaxNormalizer());

Note: Transformers do not alter the labels in a dataset. Instead, you can use the transformLabels() method on a Labeled dataset instance.
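For example, here is a quick sketch that uses transformLabels() with a callback to map continuous labels to categorical ones. The 0.5 threshold is hypothetical.

$dataset->transformLabels(function ($label) {
    return $label > 0.5 ? 'positive' : 'negative'; // Hypothetical cutoff
});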

Manually Fitting

If you need to fit a Stateful transformer to a dataset other than the one it was meant to transform, you can fit the transformer manually by calling the fit() method before applying the transformation.

use Rubix\ML\Transformers\WordCountVectorizer;

$transformer = new WordCountVectorizer(5000);

$transformer->fit($dataset1);

$dataset2->apply($transformer);

Update Fitting

To update the fitting of an Elastic transformer, call the update() method with a new dataset.

$transformer->update($dataset);
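For instance, a minimal sketch that gives a Z Scale Standardizer (a Stateful transformer that is also Elastic) an initial fitting and then updates it as new data arrives:

use Rubix\ML\Transformers\ZScaleStandardizer;

$transformer = new ZScaleStandardizer();

$transformer->fit($dataset1); // Initial fitting

$transformer->update($dataset2); // Update the fitting with new data

$dataset3->apply($transformer);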

Transform a Single Column

Sometimes, we might want to transform just a single column of the dataset. In the example below, we use the transformColumn() method on the dataset to log-transform the column at a given offset.

$dataset->transformColumn(6, 'log1p'); // Apply log(1 + x) to each value in column 6
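The callback can also be a closure. As a sketch, the following caps the values of a column at 1,000; the column offset and cutoff are hypothetical.

$dataset->transformColumn(9, function ($value) {
    return min($value, 1000); // Hypothetical cap
});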

Standardization and Normalization

Oftentimes, the continuous features of a dataset will be on different scales because they were measured by different methods. For example, age (0 - 100) and income (0 - 9,999,999) are on two widely different scales. Standardization is the process of transforming a dataset such that the features are all on one common scale. Normalization is the special case where the transformed features have a range between 0 and 1. Depending on the transformer, it may operate on the columns or the rows of the dataset.

Transformer            Operates On   Range        Stateful   Elastic
L1 Normalizer          Rows          [0, 1]
L2 Normalizer          Rows          [0, 1]
Max Absolute Scaler    Columns       [-1, 1]      ●          ●
Min Max Normalizer     Columns       [min, max]   ●          ●
Robust Standardizer    Columns       [-∞, ∞]      ●
Z Scale Standardizer   Columns       [-∞, ∞]      ●          ●
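For example, a brief sketch that standardizes the columns of a dataset using the Z Scale Standardizer from the table above:

use Rubix\ML\Transformers\ZScaleStandardizer;

$dataset->apply(new ZScaleStandardizer()); // Center each column at 0 with unit variance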

Feature Conversion

Feature converters are transformers that convert feature columns of one data type to another by changing their representation.

Transformer                From          To            Stateful   Elastic
Interval Discretizer       Continuous    Categorical   ●
One Hot Encoder            Categorical   Continuous    ●
Numeric String Converter   Categorical   Continuous
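For example, a short sketch that converts the continuous features of a dataset into categorical bins using the Interval Discretizer; the number of bins is arbitrary.

use Rubix\ML\Transformers\IntervalDiscretizer;

$dataset->apply(new IntervalDiscretizer(5)); // Discretize continuous features into 5 bins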

Dimensionality Reduction

Dimensionality reduction is a preprocessing technique for embedding a dataset into a lower-dimensional vector space. It allows a learner to train and infer faster by producing a dataset with fewer but more informative features.

Transformer                    Supervised   Stateful   Elastic
Gaussian Random Projector                   ●
Linear Discriminant Analysis   ●            ●
Principal Component Analysis                ●
Sparse Random Projector                     ●
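As an illustration, a minimal sketch that projects a dataset onto its top 10 principal components; the target dimensionality is arbitrary.

use Rubix\ML\Transformers\PrincipalComponentAnalysis;

$dataset->apply(new PrincipalComponentAnalysis(10)); // Embed the samples into 10 dimensions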

Feature Selection

Similar to dimensionality reduction, feature selection aims to reduce the number of features in a dataset. However, instead of producing new features, feature selection keeps the most informative features as-is and drops the less informative ones entirely. Adding feature selection can help speed up training and inference by producing a more parsimonious model. It can also improve the performance of the model by removing features that are noisy or uncorrelated with the outcome.

Transformer                    Supervised   Stateful   Elastic
Recursive Feature Eliminator   ●            ●
Variance Threshold Filter                   ●
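For instance, a rough sketch that drops features whose variance falls below a minimum threshold; the cutoff value is hypothetical.

use Rubix\ML\Transformers\VarianceThresholdFilter;

$dataset->apply(new VarianceThresholdFilter(0.01)); // Drop near-constant features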

Imputation

Imputation is a preprocessing technique for handling missing values in your dataset by replacing them with a pretty good guess.

Transformer               Continuous   Categorical   Stateful   Elastic
KNN Imputer               ●            ●             ●
Missing Data Imputer      ●            ●             ●
Random Hot Deck Imputer   ●            ●             ●
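For example, a small sketch that imputes each missing value using the values of its k nearest neighbors; the neighborhood size is arbitrary.

use Rubix\ML\Transformers\KNNImputer;

$dataset->apply(new KNNImputer(5)); // Impute missing values from the 5 nearest neighbors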

Text Transformers

The library provides a number of transformers for natural language processing (NLP) and information retrieval (IR) such as those for text cleaning, normalization, and feature extraction from raw text blobs.

Transformer                 Stateful   Elastic
HTML Stripper
Regex Filter
Text Normalizer
Multibyte Text Normalizer
Stop Word Filter
TF-IDF Transformer          ●          ●
Whitespace Trimmer
Word Count Vectorizer       ●
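To illustrate, here is a hedged sketch of a typical text preprocessing chain built from the transformers above; the stop word list and vocabulary size are hypothetical.

use Rubix\ML\Transformers\HTMLStripper;
use Rubix\ML\Transformers\TextNormalizer;
use Rubix\ML\Transformers\StopWordFilter;
use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Transformers\TfIdfTransformer;

$dataset->apply(new HTMLStripper())
    ->apply(new TextNormalizer())
    ->apply(new StopWordFilter(['the', 'a', 'is'])) // Hypothetical stop words
    ->apply(new WordCountVectorizer(10000))
    ->apply(new TfIdfTransformer());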

Image Transformers

Since images have their own high-level data type, they can be preprocessed in a dataset by applying any number of image transformers.

Transformer        Stateful   Elastic
Image Resizer
Image Vectorizer   ●
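For example, a brief sketch that resizes the images in a dataset and then converts them into continuous features; the target size is arbitrary.

use Rubix\ML\Transformers\ImageResizer;
use Rubix\ML\Transformers\ImageVectorizer;

$dataset->apply(new ImageResizer(32, 32)) // Resize each image to 32 x 32
    ->apply(new ImageVectorizer()); // Convert pixel data to continuous features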

Transformer Pipelines

Pipeline meta-estimators help you automate a series of transformations. In addition, Pipeline objects are Persistable, allowing you to save and load transformer fittings between processes. Whenever a dataset object is passed to a learner wrapped in a Pipeline, it will automatically be fitted and/or transformed before it arrives in the learner's context.

Let's apply three transformers, similar to the chaining example above but with a Z Scale Standardizer in place of the Min Max Normalizer, by passing the transformer instances in the order we want them applied, along with a base estimator, to the constructor of Pipeline like in the example below.

use Rubix\ML\Pipeline;
use Rubix\ML\Transformers\RandomHotDeckImputer;
use Rubix\ML\Transformers\OneHotEncoder;
use Rubix\ML\Transformers\ZScaleStandardizer;
use Rubix\ML\Clusterers\KMeans;

$estimator = new Pipeline([
    new RandomHotDeckImputer(5),
    new OneHotEncoder(),
    new ZScaleStandardizer(),
], new KMeans(10, 256));

Calling train() or partial() will fit or update the transformers before the dataset is handed off to the K Means clusterer.

$estimator->train($dataset); // Transformers fitted and applied

$estimator->partial($dataset); // Transformers updated and applied

Any time a dataset is passed to the Pipeline it will automatically be transformed before being handed to the underlying estimator.

$predictions = $estimator->predict($dataset); // Dataset transformed automatically

Advanced Preprocessing

In some cases, certain features of a dataset may require a different set of preprocessing steps than the others. In such a case, we can extract only certain features, preprocess them, and then join them to another set of features. In the example below, we'll extract just the text reviews and their sentiment labels into one dataset object and put each sample's category, number of clicks, and rating into another using two Column Pickers. Then, we can apply a separate set of transformations to each set of features and use the join() method to combine them into one dataset. We can even apply another set of transformations to the joined dataset after that.

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Extractors\ColumnPicker;
use Rubix\ML\Extractors\NDJSON;
use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Transformers\TextNormalizer;
use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Transformers\TfIdfTransformer;
use Rubix\ML\Transformers\OneHotEncoder;
use Rubix\ML\Transformers\ZScaleStandardizer;

$extractor1 = new ColumnPicker(new NDJSON('dataset.ndjson'), [
    'review', 'sentiment',
]);

$extractor2 = new ColumnPicker(new NDJSON('dataset.ndjson'), [
    'category', 'clicks', 'rating',
]);

$dataset1 = Labeled::fromIterator($extractor1)
    ->apply(new TextNormalizer())
    ->apply(new WordCountVectorizer(5000))
    ->apply(new TfIdfTransformer());

$dataset2 = Unlabeled::fromIterator($extractor2)
    ->apply(new OneHotEncoder());

$dataset = $dataset1->join($dataset2)
    ->apply(new ZScaleStandardizer());

Filtering Records

In some cases, you may want to remove entire rows from the dataset. For example, you may want to remove records that contain features with abnormally low/high values, as these samples can be interpreted as noise. The filterByColumn() method on the dataset object uses a callback function to determine whether or not to include a row in the new dataset based on the value of the feature at a given column offset.

$tallPeople = $dataset->filterByColumn(3, function ($value) {
    return $value > 178.5;
});

De-duplication

When it is undesirable for a dataset to contain duplicate records, you can remove all duplicates by calling the deduplicate() method on the dataset object.

$dataset->deduplicate();

Note: De-duplication of large datasets may take a significant amount of processing time.

Saving a Dataset

If you ever want to preprocess a dataset and then save it for later, you can do so by calling one of the conversion methods (toCSV(), toNDJSON()) on the Dataset object. Then, call the write() method on the returned encoding object to save the data to a file at a given path like in the example below.

use Rubix\ML\Transformers\MissingDataImputer;

$dataset->apply(new MissingDataImputer())->toCSV()->write('dataset.csv');
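Similarly, a quick sketch that saves the same dataset as newline-delimited JSON instead; the file name is arbitrary.

$dataset->toNDJSON()->write('dataset.ndjson');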