Dataset Objects

Data are passed in specialized in-memory containers called Dataset objects. Dataset objects are table-like structures that employ a high-level type system and provide operations for data manipulation. They can hold a heterogeneous mix of data types and make it easy to transport data in a canonical way. Datasets require a table of samples in which each row constitutes a sample and each column holds the values of a single feature. They have the additional constraint that each feature column must be homogeneous, i.e. it must contain values of the same high-level data type. For example, a continuous feature column must only contain integer or floating point numbers. A stray string or other data type will throw an exception upon validation.

In the example below, we instantiate a new Labeled dataset object by passing the samples and their labels to the constructor.

use Rubix\ML\Datasets\Labeled;

$samples = [
    [0.1, 20, 'furry'],
    [2.0, -5, 'rough'],
];

$labels = ['not monster', 'monster'];

$dataset = new Labeled($samples, $labels);
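Conversely, a column that mixes high-level types should be rejected when the dataset is validated. The sketch below assumes that validation runs in the constructor, and it catches the generic Exception base class because the text above does not name the exception type that is thrown. The $rejected flag is only there for illustration.

```php
use Rubix\ML\Datasets\Labeled;

$samples = [
    [0.1, 20, 'furry'],
    [2.0, 'oops', 'rough'], // A string in an otherwise continuous column
];

$labels = ['not monster', 'monster'];

$rejected = false;

try {
    $dataset = new Labeled($samples, $labels);
} catch (Exception $e) {
    // Assumption: the concrete exception class may differ between library versions
    $rejected = true;

    echo get_class($e) . ': ' . $e->getMessage() . PHP_EOL;
}
```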

Missing Values

By convention, missing continuous values are denoted by NaN and missing categorical values are denoted by a special placeholder category (e.g. the ? category). Dataset objects do not allow missing values of resource or other data types.

$samples = [
    [0.001, NAN, 'rough'], // Missing a continuous value
    [0.25, -1000, '?'], // Missing a categorical value
    [0.01, -500, 'furry'], // Complete sample
];
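Because NaN never compares equal to itself, missing continuous values are easiest to locate with PHP's is_nan() function. A minimal sketch, assuming an Unlabeled dataset built from the samples above and the '?' placeholder convention; it relies on the column() method described further down this page.

```php
use Rubix\ML\Datasets\Unlabeled;

$dataset = new Unlabeled([
    [0.001, NAN, 'rough'],
    [0.25, -1000, '?'],
    [0.01, -500, 'furry'],
]);

// Count the NaN placeholders in the continuous column at offset 1
$numMissing = count(array_filter($dataset->column(1), 'is_nan'));

// Count the '?' placeholders in the categorical column at offset 2
$numUnknown = count(array_keys($dataset->column(2), '?', true));

echo "$numMissing missing continuous, $numUnknown missing categorical" . PHP_EOL;
```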

Factory Methods

Build a dataset with the rows from a 2-dimensional iterable data table:

public static fromIterator(Traversable $iterator) : self


use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Datasets\Extractors\CSV;

$dataset = Labeled::fromIterator(new CSV('example.csv'));

Note: The data must be in the format of a table where each row is an array of n values. By convention, labels are always the last column of the data table.
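Any Traversable that yields rows in this layout should work, not only a file extractor. As a sketch, the example below wraps an in-memory table in PHP's built-in ArrayIterator; the row values are made up for illustration.

```php
use Rubix\ML\Datasets\Labeled;

// Each row holds the feature values followed by the label in the last position
$table = new ArrayIterator([
    [0.1, 20, 'furry', 'not monster'],
    [2.0, -5, 'rough', 'monster'],
]);

$dataset = Labeled::fromIterator($table);

// The label column is split off, which leaves three feature columns
echo $dataset->numRows() . ' rows, ' . $dataset->numColumns() . ' feature columns' . PHP_EOL;
```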


Return all the samples in the dataset in a 2-dimensional array:

public samples() : array

Select a single row containing the sample at a given offset (offsets begin at 0):

public sample(int $offset) : array

Select the values of a feature column at a given offset (offsets begin at 0):

public column(int $offset) : array

Return the columns of the sample matrix:

public columns() : array
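A short sketch of the selection methods above on a two-row dataset (the values are illustrative only):

```php
use Rubix\ML\Datasets\Unlabeled;

$dataset = new Unlabeled([
    [0.1, 20, 'furry'],
    [2.0, -5, 'rough'],
]);

// The full sample matrix, one inner array per row
$samples = $dataset->samples();

// The second row (offsets begin at 0)
$row = $dataset->sample(1);

// Every value of the third feature column
$column = $dataset->column(2);

// The sample matrix by columns, one inner array per feature
$columns = $dataset->columns();
```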

Return the columns of the sample matrix of a particular type:

public columnsByType(DataType $type) : array


use Rubix\ML\DataType;

$columns = $dataset->columnsByType(DataType::continuous());


Return the number of rows in the dataset:

public numRows() : int

Return the number of columns in the dataset:

public numColumns() : int

Return the data types for each feature column:

public types() : array
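For instance, the shape of a small dataset can be inspected as follows. This is a sketch with made-up values; it only assumes that types() returns one entry per feature column.

```php
use Rubix\ML\Datasets\Unlabeled;

$dataset = new Unlabeled([
    [0.1, 20, 'furry'],
    [2.0, -5, 'rough'],
]);

echo $dataset->numRows() . ' rows, ' . $dataset->numColumns() . ' columns' . PHP_EOL;

// One data type per feature column
echo count($dataset->types()) . ' column types' . PHP_EOL;
```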

Return the data type for a given column offset:

public columnType(int $offset) : DataType


echo $dataset->columnType(5);

Applying Transformations

You can apply a Transformer directly to a Dataset by passing it to the apply() method on the dataset object. The method returns self for chaining.

public apply(Transformer $transformer) : self


use Rubix\ML\Transformers\RandomHotDeckImputer;
use Rubix\ML\Transformers\OneHotEncoder;

$dataset->apply(new RandomHotDeckImputer())
    ->apply(new OneHotEncoder());

You can also transform a single feature column using a callback function with the transformColumn() method.

public transformColumn(int $column, callable $callback) : self


$dataset = $dataset->transformColumn(0, 'log1p');

$dataset = $dataset->transformColumn(6, function ($value) {
    return $value === 0 ? NAN : $value;
});

$dataset = $dataset->transformColumn(5, function ($value) {
    return min($value, 1000);
});

Stacking Datasets

Stack any number of dataset objects on top of each other to form a single dataset:

public static stack(array $datasets) : self

Note: Datasets must have the same number of feature columns i.e. dimensionality.


use Rubix\ML\Datasets\Labeled;

$dataset = Labeled::stack([
    // ...
]);

Merging Datasets

To merge the rows of this dataset with another dataset:

public merge(Dataset $dataset) : self

Note: Datasets must have the same number of columns to merge.

To merge the columns of this dataset with another dataset:

public augment(Dataset $dataset) : self

Note: Datasets must have the same number of rows to augment.


use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Datasets\Unlabeled;

$dataset = $dataset->merge(new Labeled($samples, $labels));

$dataset = $dataset->augment(new Unlabeled($samples));

Head and Tail

Return the first n rows of data in a new dataset object:

public head(int $n = 10) : self

Return the last n rows of data in a new dataset object:

public tail(int $n = 10) : self


// Return the first 5 rows in a new dataset
$subset = $dataset->head(5);

// Return the last 10 rows in a new dataset
$subset = $dataset->tail(10);

Taking and Leaving

Remove n rows from the dataset and return them in a new dataset:

public take(int $n = 1) : self

Leave n rows in the dataset and return the rest in a new dataset:

public leave(int $n = 1) : self
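Both methods modify the dataset they are called on. A sketch with a five-row, single-column dataset (values are illustrative) that tracks the row counts on each side:

```php
use Rubix\ML\Datasets\Unlabeled;

$dataset = new Unlabeled([[1], [2], [3], [4], [5]]);

// Remove 2 rows from the dataset and return them in a new dataset
$taken = $dataset->take(2);

echo $taken->numRows() . ' taken, ' . $dataset->numRows() . ' remaining' . PHP_EOL;

// Keep 1 row in the dataset and return the others in a new dataset
$rest = $dataset->leave(1);

echo $rest->numRows() . ' returned, ' . $dataset->numRows() . ' left behind' . PHP_EOL;
```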

Slicing and Splicing

Return a portion of the dataset of size n, starting at offset, in a new dataset:

public slice(int $offset, int $n) : self

Remove a size n chunk of the dataset starting at offset and return it in a new dataset:

public splice(int $offset, int $n) : self
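The difference between the two is whether the original dataset keeps the selected rows. A sketch on a five-row, single-column dataset with illustrative values:

```php
use Rubix\ML\Datasets\Unlabeled;

$dataset = new Unlabeled([[1], [2], [3], [4], [5]]);

// Copy 3 rows starting at offset 1; the original dataset is left untouched
$portion = $dataset->slice(1, 3);

$rowsAfterSlice = $dataset->numRows();

// Remove the same 3 rows from the original dataset and return them
$chunk = $dataset->splice(1, 3);

$rowsAfterSplice = $dataset->numRows();

echo "$rowsAfterSlice rows after slice, $rowsAfterSplice rows after splice" . PHP_EOL;
```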


Split the dataset into left and right subsets given by a ratio:

public split(float $ratio = 0.5) : array

Partition the dataset into left and right subsets based on the value of a feature in a specified column:

public partition(int $offset, mixed $value) : array


// Split the dataset 50/50 into left and right subsets
[$left, $right] = $dataset->split(0.5);

// Split the dataset into training and testing sets 80/20.
[$training, $testing] = $dataset->split(0.8);

// Partition the dataset on the feature column at offset 4 using the value 50
[$left, $right] = $dataset->partition(4, 50);


Fold the dataset to form k equal size datasets:

public fold(int $k = 10) : array

Note: If there are not enough samples to completely fill the last fold of the dataset then it will contain slightly fewer samples than the rest.


$folds = $dataset->fold(8);

foreach ($folds as $fold) {
    // ...
}


Batch the dataset into subsets containing a maximum of n rows per batch:

public batch(int $n = 50) : array


$batches = $dataset->batch(250);

foreach ($batches as $batch) {
    // ...
}


Randomize the order of the dataset and return it for method chaining:

public randomize() : self

Generate a random subset of size n without replacement:

public randomSubset(int $n) : self

Generate a random subset of size n with replacement:

public randomSubsetWithReplacement(int $n) : self

Generate a random weighted subset of size n with replacement:

public randomWeightedSubsetWithReplacement(int $n, array $weights) : self


// Randomize and split the dataset into two subsets
[$left, $right] = $dataset->randomize()->split(0.6);

$subset = $dataset->randomSubset(50);

$subset = $dataset->randomSubsetWithReplacement(500);

// Sample a random subset according to a user-defined weight distribution
$subset = $dataset->randomWeightedSubsetWithReplacement(200, $weights);

// Sample a random subset using the values of a column as sample weights
$subset = $dataset->randomWeightedSubsetWithReplacement(200, $dataset->column(5));


Filter the rows of the dataset using the values of a feature column at the given offset as the arguments to a filter callback. The callback should return false for rows that should be filtered out.

public filterByColumn(int $offset, callable $fn) : self


$tallPeople = $dataset->filterByColumn(3, function ($value) {
    return $value > 178.5;
});


To sort a dataset in place by a specific feature column:

public sortByColumn(int $offset, bool $descending = false) : self
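The two dumps that follow appear to show the samples before and after an ascending sort on the column at offset 2. A sketch that would produce them, assuming an Unlabeled dataset built from those three rows:

```php
use Rubix\ML\Datasets\Unlabeled;

$dataset = new Unlabeled([
    ['mean', 'furry', 8],
    ['nice', 'rough', 1],
    ['nice', 'rough', 6],
]);

var_dump($dataset->samples());

// Sorts in place, in ascending order unless $descending is true
$dataset->sortByColumn(2);

var_dump($dataset->samples());
```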




array(3) {
    [0]=> array(3) {
        [0]=> string(4) "mean"
        [1]=> string(5) "furry"
        [2]=> int(8)
    }
    [1]=> array(3) {
        [0]=> string(4) "nice"
        [1]=> string(5) "rough"
        [2]=> int(1)
    }
    [2]=> array(3) {
        [0]=> string(4) "nice"
        [1]=> string(5) "rough"
        [2]=> int(6)
    }
}

array(3) {
    [0]=> array(3) {
        [0]=> string(4) "nice"
        [1]=> string(5) "rough"
        [2]=> int(1)
    }
    [1]=> array(3) {
        [0]=> string(4) "nice"
        [1]=> string(5) "rough"
        [2]=> int(6)
    }
    [2]=> array(3) {
        [0]=> string(4) "mean"
        [1]=> string(5) "furry"
        [2]=> int(8)
    }
}

Dropping Rows and Columns

Drop the row at the given offset:

public dropRow(int $offset) : self

Drop the rows at the given offsets:

public dropRows(array $indices) : self

Drop the column at the given offset:

public dropColumn(int $offset) : self

Drop the columns at the given indices:

public dropColumns(array $indices) : self
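Since every drop method returns the dataset, calls can be chained. A sketch with illustrative values; the result is reassigned so that the example does not depend on whether the rows and columns are dropped in place.

```php
use Rubix\ML\Datasets\Unlabeled;

$dataset = new Unlabeled([
    ['nice', 'furry', 4],
    ['mean', 'furry', -1.5],
    ['nice', 'rough', 2.6],
]);

// Drop the second row, then the first feature column
$dataset = $dataset->dropRow(1)->dropColumn(0);

echo $dataset->numRows() . ' rows, ' . $dataset->numColumns() . ' columns' . PHP_EOL;
```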

Descriptive Statistics

Return an array of statistics such as the central tendency, dispersion and shape of each continuous feature column and the probabilities of each category for every categorical feature column:

public describe() : array
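The statistics printed below match columns 2 and 3 of the six-row table previewed at the end of this page. A sketch that reproduces them, assuming an Unlabeled dataset holding those rows:

```php
use Rubix\ML\Datasets\Unlabeled;

$dataset = new Unlabeled([
    ['nice', 'furry', 'friendly', 4],
    ['mean', 'furry', 'loner', -1.5],
    ['nice', 'rough', 'friendly', 2.6],
    ['mean', 'rough', 'friendly', -1],
    ['nice', 'rough', 'friendly', 2.9],
    ['nice', 'furry', 'loner', -5],
]);

// One entry of statistics per feature column, keyed by column offset
$stats = $dataset->describe();

print_r($stats);
```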


    [2] => Array
        (
            [type] => categorical
            [num_categories] => 2
            [probabilities] => Array
                (
                    [friendly] => 0.66666666666667
                    [loner] => 0.33333333333333
                )

        )

    [3] => Array
        (
            [type] => continuous
            [mean] => 0.33333333333333
            [variance] => 9.7922222222222
            [std_dev] => 3.1292526619342
            [skewness] => -0.44810308436906
            [kurtosis] => -1.1330702741786
            [min] => -5
            [25%] => -1.375
            [median] => 0.8
            [75%] => 2.825
            [max] => 4
        )


Remove duplicate rows from the dataset:

public deduplicate() : self
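For example, with one repeated row (illustrative values), the duplicate is removed and two rows remain. The result is reassigned so that the sketch does not depend on whether deduplication happens in place.

```php
use Rubix\ML\Datasets\Unlabeled;

$dataset = new Unlabeled([
    ['nice', 'rough', 1],
    ['nice', 'rough', 1], // Exact duplicate of the first row
    ['mean', 'furry', 8],
]);

$dataset = $dataset->deduplicate();

echo $dataset->numRows() . ' unique rows' . PHP_EOL;
```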

Output Formats

Return the dataset object as a data table array:

public toArray() : array

Return a JSON representation of the dataset:

public toJSON(bool $pretty = false) : string

Return a newline delimited JSON representation of the dataset:

public toNDJSON() : string

Return the dataset as a comma-separated values (CSV) string:

public toCSV(string $delimiter = ',', string $enclosure = '"') : string


file_put_contents('dataset.csv', $dataset->toCSV());

// ...

file_put_contents('dataset.ndjson', $dataset->toNDJSON());

Previewing in the Console

You can echo the dataset object to preview the first few rows and columns in the console.

echo $dataset;
| Column 0    | Column 1    | Column 2    | Column 3    | Label       |
| nice        | furry       | friendly    | 4           | not monster |
| mean        | furry       | loner       | -1.5        | monster     |
| nice        | rough       | friendly    | 2.6         | not monster |
| mean        | rough       | friendly    | -1          | monster     |
| nice        | rough       | friendly    | 2.9         | not monster |
| nice        | furry       | loner       | -5          | not monster |