Dataset Objects#

Data are passed in specialized in-memory containers called Dataset objects. Dataset objects are table-like data structures that provide operations for data manipulation. They can hold a heterogeneous mix of data types and make it easy to transport data in a canonical way. A dataset consists of a matrix of samples in which each row is a sample and each column holds the values of a single feature. Datasets have the additional constraint that each feature column must contain values of the same high-level data type. Some datasets can also contain labels for training or cross validation. In the example below, we instantiate a new Labeled dataset object by passing the samples and their labels as arguments to the constructor.

use Rubix\ML\Datasets\Labeled;

$samples = [
    [0.1, 20, 'furry'],
    [2.0, -5, 'rough'],
];

$labels = ['not monster', 'monster'];

$dataset = new Labeled($samples, $labels);

Factory Methods#

Build a dataset with the records of a 2-dimensional iterable data table:

public static fromIterator(Traversable $iterator) : self

Note

When building a Labeled dataset, the label values should be in the last column of the data table.

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Extractors\CSV;

$dataset = Labeled::fromIterator(new CSV('example.csv'));
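
To illustrate the note above, here is a minimal sketch that builds a Labeled dataset from an in-memory table wrapped in PHP's built-in ArrayIterator instead of a CSV file. The table values are made up for illustration, and the Composer autoloader path is an assumption about your project layout.

```php
require 'vendor/autoload.php'; // assumed Composer autoloader path

use Rubix\ML\Datasets\Labeled;

// Each row holds three feature values followed by the label in the last column.
$table = new ArrayIterator([
    [0.1, 20, 'furry', 'not monster'],
    [2.0, -5, 'rough', 'monster'],
]);

$dataset = Labeled::fromIterator($table);

// The label column should not be counted among the features.
echo $dataset->numFeatures() . PHP_EOL;

print_r($dataset->labels());
```

The last column of every row should end up in the labels, leaving three feature columns in the samples matrix.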

Properties#

Return the number of rows in the dataset:

public numSamples() : int

Return the number of columns in the samples matrix:

public numFeatures() : int

Return a 2-tuple with the shape of the samples matrix:

public shape() : array{int, int}

[$m, $n] = $dataset->shape();

echo "$m x $n";

1000 x 30

Data Types#

Return the data types for each column in the data table:

public types() : Rubix\ML\DataType[]

Return the data types for each feature column:

public featureTypes() : Rubix\ML\DataType[]

Return the data type for a given column offset:

public featureType(int $offset) : Rubix\ML\DataType

echo $dataset->featureType(15);

categorical

Selecting#

Return all the samples in the dataset in a 2-dimensional array:

public samples() : array[]

Select a single row containing the sample at a given offset beginning at 0:

public sample(int $offset) : mixed[]

Return the columns of the sample matrix:

public features() : array[]

Select the values of a feature column at a given offset:

public feature(int $offset) : mixed[]
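
The difference between selecting a row and selecting a column can be sketched with the small dataset from the introduction. This assumes the library is installed through Composer; the autoloader path is an assumption.

```php
require 'vendor/autoload.php'; // assumed Composer autoloader path

use Rubix\ML\Datasets\Labeled;

$dataset = new Labeled([
    [0.1, 20, 'furry'],
    [2.0, -5, 'rough'],
], ['not monster', 'monster']);

// A single row: all feature values of the second sample.
$row = $dataset->sample(1);

// A single column: the third feature across all samples.
$column = $dataset->feature(2);

print_r($row);
print_r($column);
```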

Dropping#

Drop a feature at a given column offset from the dataset:

public dropFeature(int $offset) : self

Head and Tail#

Return the first n rows of data in a new dataset object:

public head(int $n = 10) : self

$subset = $dataset->head(10);

Return the last n rows of data in a new dataset object:

public tail(int $n = 10) : self

Taking and Leaving#

Remove n rows from the dataset and return them in a new dataset:

public take(int $n = 1) : self

Leave n samples on the dataset and return the rest in a new dataset:

public leave(int $n = 1) : self
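
Unlike head() and tail(), take() and leave() remove rows from the dataset they are called on. The sketch below follows the sample counts through each call. The toy values and the Composer autoloader path are assumptions.

```php
require 'vendor/autoload.php'; // assumed Composer autoloader path

use Rubix\ML\Datasets\Labeled;

$dataset = new Labeled(
    [[1.0], [2.0], [3.0], [4.0], [5.0]],
    ['a', 'b', 'c', 'd', 'e']
);

// head() copies rows into a new dataset and leaves the original as it was.
$preview = $dataset->head(2);
echo 'After head(2): ' . $dataset->numSamples() . PHP_EOL;

// take() removes 2 rows from the original and returns them.
$taken = $dataset->take(2);
echo 'After take(2): ' . $dataset->numSamples() . PHP_EOL;

// leave() keeps 1 row on the original and returns the rest.
$rest = $dataset->leave(1);
echo 'After leave(1): ' . $dataset->numSamples() . PHP_EOL;
```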

Splitting#

Split the dataset into left and right subsets, where the ratio is the proportion of samples assigned to the left subset:

public split(float $ratio = 0.5) : array{self, self}

[$training, $testing] = $dataset->split(0.8);
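
A common pattern is to shuffle the rows before splitting so that the two subsets are not biased by the original row order. The following is a minimal sketch on a toy dataset of 10 samples. The sample values and the Composer autoloader path are assumptions.

```php
require 'vendor/autoload.php'; // assumed Composer autoloader path

use Rubix\ML\Datasets\Labeled;

// Build a toy dataset of 10 samples (values are illustrative only).
$samples = $labels = [];

for ($i = 0; $i < 10; ++$i) {
    $samples[] = [(float) $i];
    $labels[] = $i % 2 === 0 ? 'even' : 'odd';
}

$dataset = new Labeled($samples, $labels);

// randomize() returns the dataset itself, so split() can be chained onto it.
[$training, $testing] = $dataset->randomize()->split(0.8);

echo "{$training->numSamples()} training, {$testing->numSamples()} testing" . PHP_EOL;
```

With a ratio of 0.8, the left subset should receive 8 of the 10 samples and the right subset the remaining 2.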

Folding#

Fold the dataset to form k equal-size datasets:

public fold(int $k = 10) : self[]

Note

If there are not enough samples to completely fill the last fold of the dataset then it will contain slightly fewer samples than the rest of the folds.

$folds = $dataset->fold(8);
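
Folds are typically used for k-fold cross validation, where each fold takes one turn as the testing set while the remaining folds are combined for training. The loop below is a sketch of that idea, not the library's own cross validator. It uses the stack() method described later on this page, and the toy data and Composer autoloader path are assumptions.

```php
require 'vendor/autoload.php'; // assumed Composer autoloader path

use Rubix\ML\Datasets\Labeled;

// Build a toy dataset of 12 samples (values are illustrative only).
$samples = $labels = [];

for ($i = 0; $i < 12; ++$i) {
    $samples[] = [(float) $i];
    $labels[] = $i % 2 === 0 ? 'even' : 'odd';
}

$dataset = new Labeled($samples, $labels);

$folds = $dataset->fold(4);

foreach ($folds as $i => $testing) {
    // Every fold except the current one goes into the training set.
    $rest = $folds;
    unset($rest[$i]);

    $training = Labeled::stack(array_values($rest));

    echo "Round $i: {$training->numSamples()} training, "
        . "{$testing->numSamples()} testing" . PHP_EOL;
}
```

Because 12 divides evenly by 4, each round should train on 9 samples and test on 3.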

Slicing and Splicing#

Return an n size portion of the dataset in a new dataset:

public slice(int $offset, int $n) : self

Remove a size n chunk of the dataset starting at offset and return it in a new dataset:

public splice(int $offset, int $n) : self

Batching#

Batch the dataset into subsets containing a maximum of n rows per batch:

public batch(int $n = 50) : self[]

$batches = $dataset->batch(250);
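
When the number of rows is not a multiple of n, the final batch holds whatever is left over. The sketch below batches a toy dataset of 10 samples into groups of at most 4. The values and the Composer autoloader path are assumptions.

```php
require 'vendor/autoload.php'; // assumed Composer autoloader path

use Rubix\ML\Datasets\Labeled;

// Build a toy dataset of 10 samples (values are illustrative only).
$samples = $labels = [];

for ($i = 0; $i < 10; ++$i) {
    $samples[] = [(float) $i];
    $labels[] = $i % 2 === 0 ? 'even' : 'odd';
}

$dataset = new Labeled($samples, $labels);

$batches = $dataset->batch(4);

// Collect the number of rows in each batch.
$sizes = [];

foreach ($batches as $batch) {
    $sizes[] = $batch->numSamples();
}

echo implode(', ', $sizes) . PHP_EOL;
```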

Randomization#

Randomize the order of the dataset and return it for method chaining:

public randomize() : self

Generate a random subset of size n without replacement:

public randomSubset(int $n) : self

$subset = $dataset->randomSubset(50);

Generate a random subset of size n with replacement:

public randomSubsetWithReplacement(int $n) : self

$subset = $dataset->randomSubsetWithReplacement(500);

Generate a random weighted subset with replacement of size n:

public randomWeightedSubsetWithReplacement(int $n, array $weights) : self

$subset = $dataset->randomWeightedSubsetWithReplacement(200, $weights);
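
The $weights argument in the example above is not defined on this page. As a sketch, assuming it is an array with one non-negative weight per row in which a larger weight makes that row more likely to be drawn, it could be built like this. The weight values and the Composer autoloader path are hypothetical.

```php
require 'vendor/autoload.php'; // assumed Composer autoloader path

use Rubix\ML\Datasets\Labeled;

$dataset = new Labeled(
    [[1.0], [2.0], [3.0]],
    ['a', 'b', 'c']
);

// One weight per sample (hypothetical values): the third row is favored.
$weights = [1.0, 1.0, 8.0];

$subset = $dataset->randomWeightedSubsetWithReplacement(200, $weights);

// Because rows are drawn with replacement, the subset can be larger than the dataset.
echo $subset->numSamples() . PHP_EOL;
```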

Applying Transformations#

You can apply a Transformer to the samples in a Dataset object by passing it as an argument to the apply() method on the dataset object. If a Stateful transformer has not been fitted beforehand, it will automatically be fitted before being applied to the samples.

public apply(Transformer $transformer) : self

use Rubix\ML\Transformers\RobustStandardizer;

$dataset->apply(new RobustStandardizer());

To reverse the transformation, pass a Reversible transformer to the dataset object's reverseApply() method.

public reverseApply(Reversible $transformer) : self

use Rubix\ML\Transformers\MaxAbsoluteScaler;

$transformer = new MaxAbsoluteScaler();

$dataset->apply($transformer);

// Do something

$dataset->reverseApply($transformer);
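
The round trip can be sketched with concrete values. This assumes that MaxAbsoluteScaler divides each continuous column by its largest absolute value, and that apply() and reverseApply() modify the samples in place. The toy values and the Composer autoloader path are also assumptions.

```php
require 'vendor/autoload.php'; // assumed Composer autoloader path

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Transformers\MaxAbsoluteScaler;

$dataset = new Labeled(
    [[1.0, -4.0], [2.0, 8.0]],
    ['a', 'b']
);

$original = $dataset->samples();

$transformer = new MaxAbsoluteScaler();

// The transformer has not been fitted yet, so apply() fits it first.
$dataset->apply($transformer);

$scaled = $dataset->samples();

// Undo the scaling using the same fitted transformer.
$dataset->reverseApply($transformer);

$restored = $dataset->samples();

print_r($scaled);
print_r($restored);
```

After the reverse step, the samples should match the values the dataset started with.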

Filtering#

Filter the records of the dataset using a callback function that determines whether a row should be included in the returned dataset:

public filter(callable $callback) : self

$tallPeople = function ($record) {
    return $record[3] > 178.5;
};

$dataset = $dataset->filter($tallPeople);

Stacking#

Stack any number of dataset objects on top of each other to form a single dataset:

public static stack(array $datasets) : self

Note

Datasets must have the same number of feature columns, i.e. the same dimensionality.

use Rubix\ML\Datasets\Labeled;

$dataset = Labeled::stack([
    $dataset1,
    $dataset2,
    $dataset3,
    // ...
]);

Merging and Joining#

To merge the rows of this dataset with another dataset:

public merge(Dataset $dataset) : self

Note

Datasets must have the same number of columns.

$dataset = $dataset1->merge($dataset2);

To join the columns of this dataset with another dataset:

public join(Dataset $dataset) : self

Note

Datasets must have the same number of rows.

$dataset = $dataset1->join($dataset2);
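
Merging adds rows while joining adds columns, which shows up in the shape of the result. The sketch below uses small toy datasets. The values and the Composer autoloader path are assumptions.

```php
require 'vendor/autoload.php'; // assumed Composer autoloader path

use Rubix\ML\Datasets\Labeled;

// Two datasets with the same number of columns can be merged row-wise.
$top = new Labeled([[1, 2], [3, 4]], ['a', 'b']);
$bottom = new Labeled([[5, 6]], ['c']);

$merged = $top->merge($bottom);

// Two datasets with the same number of rows can be joined column-wise.
$left = new Labeled([[1, 2], [3, 4]], ['a', 'b']);
$right = new Labeled([['x'], ['y']], ['a', 'b']);

$joined = $left->join($right);

[$m, $n] = $merged->shape();
echo "merged: $m x $n" . PHP_EOL;

[$p, $q] = $joined->shape();
echo "joined: $p x $q" . PHP_EOL;
```

The merged dataset should have 3 rows and 2 columns, and the joined dataset 2 rows and 3 columns.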

Descriptive Statistics#

Return an array of statistics such as the central tendency, dispersion and shape of each continuous feature column and the joint probabilities of each category for every categorical feature column:

public describe() : Rubix\ML\Report

echo $dataset->describe();

[
    {
        "offset": 0,
        "type": "categorical",
        "num categories": 2,
        "probabilities": {
            "friendly": 0.6666666666666666,
            "loner": 0.3333333333333333
        }
    },
    {
        "offset": 1,
        "type": "continuous",
        "mean": 0.3333333333333333,
        "standard deviation": 3.129252661934191,
        "skewness": -0.4481030843690633,
        "kurtosis": -1.1330702741786107,
        "min": -5,
        "25%": -1.375,
        "median": 0.8,
        "75%": 2.825,
        "max": 4
    }
]

Sorting#

Sort the records in the dataset using a callback for comparisons between samples. The callback function accepts two records to be compared and should return true if the records should be swapped.

public sort(callable $callback) : self

$sorted = $dataset->sort(function ($recordA, $recordB) {
    return $recordA[2] > $recordB[2];
});

De-duplication#

Remove duplicate rows from the dataset:

public deduplicate() : self

Exporting#

Export the dataset to the location and format given by a Writable extractor:

public exportTo(Writable $extractor) : void

use Rubix\ML\Extractors\NDJSON;

$dataset->exportTo(new NDJSON('example.ndjson'));