Dataset Objects#

Data are passed in specialized in-memory containers called Dataset objects. Dataset objects are table-like data structures with operations for data manipulation. They can hold a heterogeneous mix of data types and make it easy to transport data in a canonical way. A dataset consists of a matrix of samples in which each row is a sample and each column holds the values of a single feature. Each feature column must also contain values of the same high-level data type. Some datasets can contain labels for training or cross validation. In the example below, we instantiate a new Labeled dataset object by passing the samples and their labels as arguments to the constructor.

use Rubix\ML\Datasets\Labeled;

$samples = [
    [0.1, 20, 'furry'],
    [2.0, -5, 'rough'],
];

$labels = ['not monster', 'monster'];

$dataset = new Labeled($samples, $labels);

Missing Values#

By convention, missing continuous values are denoted by the NAN constant and missing categorical values are denoted by a special placeholder category (e.g. the '?' category). Dataset objects do not allow missing values of resource or other data types.

$samples = [
    [0.01, -500, 'furry'], // Complete sample
    [0.001, NAN, 'rough'], // Missing a continuous value
    [0.25, -1000, '?'], // Missing a categorical value
];
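
The convention above can be checked with a few lines of plain PHP. The sketch below is not part of the library: `countMissing()` is a hypothetical helper, and it assumes '?' is the placeholder category in use. Note that NAN never compares equal to itself, so the check relies on is_nan() rather than ===.

```php
<?php

// Hypothetical helper (not a library function): count missing values per column,
// assuming NAN marks a missing continuous value and '?' a missing category.
function countMissing(array $samples, string $placeholder = '?'): array
{
    $counts = [];

    foreach ($samples as $sample) {
        foreach ($sample as $offset => $value) {
            $counts[$offset] ??= 0;

            // NAN !== NAN in PHP, so floats must be tested with is_nan().
            $missing = is_float($value) ? is_nan($value) : $value === $placeholder;

            if ($missing) {
                ++$counts[$offset];
            }
        }
    }

    return $counts;
}

$samples = [
    [0.01, -500, 'furry'], // Complete sample
    [0.001, NAN, 'rough'], // Missing a continuous value
    [0.25, -1000, '?'],    // Missing a categorical value
];

// Prints the number of missing values found in each column.
print_r(countMissing($samples));
```

A count like this can help decide whether a column needs imputation before training.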

Factory Methods#

Build a dataset with the rows from a 2-dimensional iterable data table:

public static fromIterator(Traversable $iterator) : self

Note: When building a Labeled dataset, the label values should be in the last column of the data table.


use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Datasets\Extractors\CSV;

$dataset = Labeled::fromIterator(new CSV('example.csv'));
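
Since the signature above accepts any Traversable, an in-memory table should work as well as a file extractor. The following is a minimal sketch that assumes the library is installed with Composer (the autoloader path is an assumption) and uses PHP's built-in ArrayIterator, with the label placed in the last column of each row.

```php
<?php

use Rubix\ML\Datasets\Labeled;

// Assumption: rubix/ml is installed with Composer; adjust the path as needed.
require_once __DIR__ . '/vendor/autoload.php';

// Each row ends with its label, as the note above requires.
$table = new ArrayIterator([
    [0.1, 20, 'furry', 'not monster'],
    [2.0, -5, 'rough', 'monster'],
]);

$dataset = Labeled::fromIterator($table);

// Report how many rows and feature columns were read from the iterator.
echo $dataset->numRows() . ' rows, ' . $dataset->numColumns() . ' feature columns' . PHP_EOL;
```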


Return all the samples in the dataset in a 2-dimensional array:

public samples() : array

Select a single row containing the sample at a given offset (offsets begin at 0):

public sample(int $offset) : array

Select the values of a feature column at a given offset (offsets begin at 0):

public column(int $offset) : array
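
As an illustration of the row and column selectors, the sketch below rebuilds the small dataset from the introduction and reads one row and one column from it. It uses only methods documented in this section and assumes a Composer install (the autoloader path is an assumption).

```php
<?php

use Rubix\ML\Datasets\Labeled;

// Assumption: rubix/ml is installed with Composer; adjust the path as needed.
require_once __DIR__ . '/vendor/autoload.php';

$dataset = new Labeled([
    [0.1, 20, 'furry'],
    [2.0, -5, 'rough'],
], ['not monster', 'monster']);

// Offsets begin at 0, so offset 1 is the second row.
$row = $dataset->sample(1);

// Offset 2 is the third feature column.
$texture = $dataset->column(2);

print_r($row);
print_r($texture);
```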

Return the columns of the sample matrix:

public columns() : array

Return the columns of the sample matrix of a particular type:

public columnsByType(DataType $type) : array


use Rubix\ML\DataType;

$columns = $dataset->columnsByType(DataType::continuous());


Return the number of rows in the dataset:

public numRows() : int

Return the number of columns in the samples matrix:

public numColumns() : int


$m = $dataset->numRows();

$n = $dataset->numColumns();

Return a 2-tuple with the shape of the samples matrix:

public shape() : array


[$m, $n] = $dataset->shape();

var_dump($m, $n);

Return the data types for each feature column:

public columnTypes() : array

Return the data type for a given column offset:

public columnType(int $offset) : DataType


echo $dataset->columnType(15);

Applying Transformations#

You can apply a Transformer directly to the samples in a Dataset object by passing it as an argument to the apply() method on the dataset object.

public apply(Transformer $transformer) : self


use Rubix\ML\Transformers\OneHotEncoder;

$dataset->apply(new OneHotEncoder());

You can also transform a single feature column using a callback function with the transformColumn() method.

public transformColumn(int $column, callable $callback) : self


$dataset->transformColumn(0, 'log1p');

$dataset->transformColumn(5, function ($value) {
    // Loose comparison so that both 0 and 0.0 are treated as missing.
    return $value == 0 ? NAN : $value;
});

$dataset->transformColumn(6, function ($value) {
    return min($value, 1000);
});

Stacking Datasets#

Stack any number of dataset objects on top of each other to form a single dataset:

public static stack(array $datasets) : self

Note: Datasets must have the same number of feature columns (i.e. the same dimensionality).


use Rubix\ML\Datasets\Labeled;

$dataset = Labeled::stack([
    // ...
]);

Merging Datasets#

To merge the rows of this dataset with another dataset:

public merge(Dataset $dataset) : self

Note: Datasets must have the same number of columns.


$dataset = $dataset1->merge($dataset2);

To merge the columns of this dataset with another dataset:

public augment(Dataset $dataset) : self

Note: Datasets must have the same number of rows.


$dataset = $dataset1->augment($dataset2);
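
To make the difference between the two operations concrete, here is a small sketch that merges along rows and augments along columns. It assumes a Composer install (the autoloader path is an assumption) and uses separate objects for each operation so that it does not depend on whether the methods modify the original dataset.

```php
<?php

use Rubix\ML\Datasets\Labeled;

// Assumption: rubix/ml is installed with Composer; adjust the path as needed.
require_once __DIR__ . '/vendor/autoload.php';

// merge(): both datasets have 2 columns, so their rows can be combined.
$a = new Labeled([[0.1, 'furry'], [2.0, 'rough']], ['not monster', 'monster']);
$b = new Labeled([[1.5, 'furry']], ['not monster']);

$merged = $a->merge($b);

// augment(): both datasets have 2 rows, so their columns can be combined.
$c = new Labeled([[0.1, 'furry'], [2.0, 'rough']], ['not monster', 'monster']);
$d = new Labeled([[20], [-5]], ['not monster', 'monster']);

$augmented = $c->augment($d);

echo 'Merged rows: ' . $merged->numRows() . PHP_EOL;
echo 'Augmented columns: ' . $augmented->numColumns() . PHP_EOL;
```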

Head and Tail#

Return the first n rows of data in a new dataset object:

public head(int $n = 10) : self

Return the last n rows of data in a new dataset object:

public tail(int $n = 10) : self


$subset = $dataset->head(10);

$subset = $dataset->tail(30);

Taking and Leaving#

Remove n rows from the dataset and return them in a new dataset:

public take(int $n = 1) : self

Leave n samples in the dataset and return the rest in a new dataset:

public leave(int $n = 1) : self
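
Unlike head() and tail(), take() is described as removing rows from the dataset it is called on. The sketch below illustrates that under the assumption of a Composer install (the autoloader path is an assumption); the sample values are made up for the example.

```php
<?php

use Rubix\ML\Datasets\Labeled;

// Assumption: rubix/ml is installed with Composer; adjust the path as needed.
require_once __DIR__ . '/vendor/autoload.php';

$dataset = new Labeled([
    [0.1, 20, 'furry'],
    [2.0, -5, 'rough'],
    [1.3, 11, 'furry'],
    [0.7, -2, 'rough'],
    [3.1, 8, 'furry'],
], ['not monster', 'monster', 'not monster', 'monster', 'not monster']);

// take() removes rows from $dataset and hands them back as a new dataset.
$taken = $dataset->take(2);

echo $taken->numRows() . ' taken, ' . $dataset->numRows() . ' remaining' . PHP_EOL;
```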


Split the dataset into left and right subsets:

public split(float $ratio = 0.5) : array


[$training, $testing] = $dataset->split(0.8);


Fold the dataset to form k equal-size datasets:

public fold(int $k = 10) : array

Note: If there are not enough samples to completely fill the last fold of the dataset then it will contain slightly fewer samples than the rest of the folds.


$folds = $dataset->fold(8);
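
One common use of folds is cross validation, where each fold serves once as the testing set while the remaining folds are stacked into a training set. The loop below is a sketch of that pattern using only fold() and stack() as documented here; it is not the library's own validator, and it assumes a Composer install (the autoloader path is an assumption).

```php
<?php

use Rubix\ML\Datasets\Labeled;

// Assumption: rubix/ml is installed with Composer; adjust the path as needed.
require_once __DIR__ . '/vendor/autoload.php';

$dataset = new Labeled([
    [0.1, 20], [2.0, -5], [1.3, 11],
    [0.7, -2], [3.1, 8], [0.4, 15],
], ['not monster', 'monster', 'not monster', 'monster', 'monster', 'not monster']);

$folds = $dataset->fold(3);

$trainingSizes = [];

foreach ($folds as $i => $testing) {
    // Every fold except the current one goes into the training set.
    $rest = $folds;
    unset($rest[$i]);

    $training = Labeled::stack(array_values($rest));

    $trainingSizes[] = $training->numRows();

    echo "Fold $i: " . $training->numRows() . ' training rows, '
        . $testing->numRows() . ' testing rows' . PHP_EOL;
}
```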

Slicing and Splicing#

Return a slice of n rows starting at the given offset in a new dataset:

public slice(int $offset, int $n) : self

Remove a size n chunk of the dataset starting at offset and return it in a new dataset:

public splice(int $offset, int $n) : self


Batch the dataset into subsets containing a maximum of n rows per batch:

public batch(int $n = 50) : array


$batches = $dataset->batch(250);
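
Because n is a maximum, the last batch may be smaller than the others. A short sketch with five rows and a batch size of two, assuming a Composer install (the autoloader path is an assumption):

```php
<?php

use Rubix\ML\Datasets\Labeled;

// Assumption: rubix/ml is installed with Composer; adjust the path as needed.
require_once __DIR__ . '/vendor/autoload.php';

$dataset = new Labeled([
    [0.1, 20], [2.0, -5], [1.3, 11], [0.7, -2], [3.1, 8],
], ['not monster', 'monster', 'not monster', 'monster', 'monster']);

$batches = $dataset->batch(2);

// Collect the size of each batch to see how the rows were divided.
$sizes = [];

foreach ($batches as $batch) {
    $sizes[] = $batch->numRows();
}

echo implode(', ', $sizes) . PHP_EOL;
```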


Randomize the order of the dataset and return it for method chaining:

public randomize() : self



Generate a random subset of size n without replacement:

public randomSubset(int $n) : self


$subset = $dataset->randomSubset(50);

Generate a random subset of size n with replacement:

public randomSubsetWithReplacement($n) : self


$subset = $dataset->randomSubsetWithReplacement(500);

Generate a random weighted subset with replacement of size n:

public randomWeightedSubsetWithReplacement($n, array $weights) : self


$subset = $dataset->randomWeightedSubsetWithReplacement(200, $weights);
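
The example above leaves $weights undefined. The sketch below fills it in under the assumption that the method expects one non-negative weight per row, aligned with the row order, where a larger weight makes the row more likely to be drawn. It also assumes a Composer install (the autoloader path is an assumption).

```php
<?php

use Rubix\ML\Datasets\Labeled;

// Assumption: rubix/ml is installed with Composer; adjust the path as needed.
require_once __DIR__ . '/vendor/autoload.php';

$dataset = new Labeled([
    [0.1, 20], [2.0, -5], [1.3, 11], [0.7, -2],
], ['not monster', 'monster', 'not monster', 'monster']);

// Assumed format: one weight per row. The third row is favored here.
$weights = [1.0, 1.0, 5.0, 1.0];

// Sampling with replacement, so the subset can be larger than the dataset.
$subset = $dataset->randomWeightedSubsetWithReplacement(10, $weights);

echo $subset->numRows() . PHP_EOL;
```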


Filter the rows of the dataset by passing the value of the feature column at the given offset to a callback. Rows for which the callback returns false are filtered out.

public filterByColumn(int $offset, callable $fn) : self


$tallPeople = $dataset->filterByColumn(3, function ($value) {
    return $value > 178.5;
});


To sort a dataset in place by a specific feature column:

public sortByColumn(int $offset, bool $descending = false) : self
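
As a brief illustration, the following sketch sorts a three-row dataset by its first column in descending order. It assumes a Composer install (the autoloader path is an assumption) and reads the result through the returned object.

```php
<?php

use Rubix\ML\Datasets\Labeled;

// Assumption: rubix/ml is installed with Composer; adjust the path as needed.
require_once __DIR__ . '/vendor/autoload.php';

$dataset = new Labeled([
    [3.0, 'furry'],
    [1.0, 'rough'],
    [2.0, 'furry'],
], ['monster', 'not monster', 'monster']);

// The second argument requests descending order.
$sorted = $dataset->sortByColumn(0, true);

print_r($sorted->column(0));
```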



Dropping Rows and Columns#

Drop the row at the given offset:

public dropRow(int $offset) : self

Drop the rows at the given offsets:

public dropRows(array $indices) : self

Drop the column at the given offset:

public dropColumn(int $offset) : self

Drop the columns at the given indices:

public dropColumns(array $indices) : self
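
Since each of these methods returns the dataset, the calls can be chained. A minimal sketch, assuming a Composer install (the autoloader path is an assumption):

```php
<?php

use Rubix\ML\Datasets\Labeled;

// Assumption: rubix/ml is installed with Composer; adjust the path as needed.
require_once __DIR__ . '/vendor/autoload.php';

$dataset = new Labeled([
    [0.1, 20, 'furry'],
    [2.0, -5, 'rough'],
    [1.3, 11, 'furry'],
], ['not monster', 'monster', 'not monster']);

// Drop the first row, then the second feature column.
$dataset = $dataset->dropRow(0)->dropColumn(1);

[$m, $n] = $dataset->shape();

echo "$m rows, $n columns" . PHP_EOL;
```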

Descriptive Statistics#

Return an array of statistics such as the central tendency, dispersion and shape of each continuous feature column and the joint probabilities of each category for every categorical feature column:

public describe() : array


    [2] => Array
        (
            [type] => categorical
            [num_categories] => 2
            [probabilities] => Array
                (
                    [friendly] => 0.66666666666667
                    [loner] => 0.33333333333333
                )

        )

    [3] => Array
        (
            [type] => continuous
            [mean] => 0.33333333333333
            [variance] => 9.7922222222222
            [std_dev] => 3.1292526619342
            [skewness] => -0.44810308436906
            [kurtosis] => -1.1330702741786
            [min] => -5
            [25%] => -1.375
            [median] => 0.8
            [75%] => 2.825
            [max] => 4
        )


Remove duplicate rows from the dataset:

public deduplicate() : self

Output Formats#

Return the dataset object as a data table array:

public toArray() : array


$table = $dataset->toArray();

Return a JSON representation of the dataset:

public toJSON(bool $pretty = false) : string

Return a newline delimited JSON representation of the dataset:

public toNDJSON() : string


file_put_contents('dataset.ndjson', $dataset->toNDJSON());

Return the dataset as a comma-separated values (CSV) string:

public toCSV(string $delimiter = ',', string $enclosure = '"') : string


file_put_contents('dataset.csv', $dataset->toCSV());

Previewing in the Console#

You can echo the dataset object to preview the first few rows and columns in the console.

echo $dataset;
| Column 0    | Column 1    | Column 2    | Column 3    | Label       |
| nice        | furry       | friendly    | 4           | not monster |
| mean        | furry       | loner       | -1.5        | monster     |
| nice        | rough       | friendly    | 2.6         | not monster |
| mean        | rough       | friendly    | -1          | monster     |
| nice        | rough       | friendly    | 2.9         | not monster |
| nice        | furry       | loner       | -5          | not monster |