Dataset Objects#
Data are passed around in specialized in-memory containers called Dataset objects. Dataset objects are table-like data structures with operations for data manipulation. They can hold a heterogeneous mix of data types and make it easy to transport data in a canonical way. A dataset consists of a matrix of samples in which each row is a sample and each column holds the values of a single feature. Datasets have the additional constraint that every value in a feature column must be of the same high-level data type. Some datasets can also contain labels for training or cross validation. In the example below, we instantiate a new Labeled dataset object by passing the samples and their labels as arguments to the constructor.
use Rubix\ML\Datasets\Labeled;
$samples = [
    [0.1, 20, 'furry'],
    [2.0, -5, 'rough'],
];
$labels = ['not monster', 'monster'];
$dataset = new Labeled($samples, $labels);
Factory Methods#
Build a dataset with the rows from a 2-dimensional iterable data table:
public static fromIterator(Traversable $iterator) : self
Note: When building a Labeled dataset, the label values should be in the last column of the data table.
use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Datasets\Extractors\CSV;
$dataset = Labeled::fromIterator(new CSV('example.csv'));
Selecting#
Return all the samples in the dataset in a 2-dimensional array:
public samples() : array
Select a single row containing the sample at a given offset (offsets begin at 0):
public sample(int $offset) : array
Select the values of a feature column at a given offset (offsets begin at 0):
public column(int $offset) : array
Return the columns of the sample matrix:
public columns() : array
Return the columns of the sample matrix of a particular type:
public columnsByType(DataType $type) : array
use Rubix\ML\DataType;
$columns = $dataset->columnsByType(DataType::continuous());
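As a brief illustration of the selectors above, the following sketch builds a small dataset and reads one row and one column from it. It assumes the library is installed and loaded through Composer's autoloader, uses the library's Unlabeled dataset class (not covered on this page), and the sample values are made up for the example.

```php
use Rubix\ML\Datasets\Unlabeled;

// Assumption: Composer's autoloader has already been required.
$dataset = new Unlabeled([
    [0.1, 20, 'furry'],
    [2.0, -5, 'rough'],
]);

$row = $dataset->sample(1);     // the row at offset 1
$column = $dataset->column(2);  // every value of the feature at offset 2

var_dump($row === [2.0, -5, 'rough']);      // bool(true)
var_dump($column === ['furry', 'rough']);   // bool(true)
```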
Properties#
Return the number of rows in the dataset:
public numRows() : int
Return the number of columns in the samples matrix:
public numColumns() : int
$m = $dataset->numRows();
$n = $dataset->numColumns();
Return a 2-tuple with the shape of the samples matrix:
public shape() : array
[$m, $n] = $dataset->shape();
var_dump($m, $n);
int(1000)
int(30)
Return the data types for each feature column:
public columnTypes() : array
Return the data type for a given column offset:
public columnType(int $offset) : DataType
echo $dataset->columnType(15);
categorical
Applying Transformations#
You can apply a Transformer directly to the samples in a Dataset object by passing it as an argument to the apply() method on the dataset object.
public apply(Transformer $transformer) : self
use Rubix\ML\Transformers\RobustStandardizer;
$dataset->apply(new RobustStandardizer);
You can also transform a single feature column using a callback function with the transformColumn() method.
public transformColumn(int $column, callable $callback) : self
$dataset->transformColumn(0, 'log1p');
$dataset->transformColumn(5, function ($value) {
    return $value === 0 ? NAN : $value;
});
$dataset->transformColumn(6, function ($value) {
    return min($value, 1000);
});
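Because a column callback receives one value at a time, it can be tried out on a plain array with array_map() before it is applied to a dataset. The sketch below does this for the last two callbacks above using made-up values. It only exercises the callbacks and does not call the library.

```php
// Hypothetical raw values standing in for a feature column.
$values = [0, 0.0, 12.5, 1500];

// Replace integer zeros with NAN (same callback as above).
$zeroToNan = function ($value) {
    return $value === 0 ? NAN : $value;
};

// Cap values at 1000 (same callback as above).
$clip = function ($value) {
    return min($value, 1000);
};

$replaced = array_map($zeroToNan, $values);
$clipped = array_map($clip, $values);

var_dump(is_nan($replaced[0])); // bool(true): the integer 0 is replaced
var_dump($replaced[1]);         // float(0): 0.0 is not identical to 0, so it is kept
var_dump($clipped[3]);          // int(1000)
```

Note that the strict comparison leaves a float 0.0 untouched. If both integer and float zeros should be replaced, comparing with == instead is one option.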
Stacking Datasets#
Stack any number of dataset objects on top of each other to form a single dataset:
public static stack(array $datasets) : self
Note: Datasets must have the same number of feature columns i.e. dimensionality.
use Rubix\ML\Datasets\Labeled;
$dataset = Labeled::stack([
    $dataset1,
    $dataset2,
    $dataset3,
    // ...
]);
Merging Datasets#
To merge the rows of this dataset with another dataset:
public merge(Dataset $dataset) : self
Note: Datasets must have the same number of columns.
$dataset = $dataset1->merge($dataset2);
To join the columns of this dataset with another dataset:
public join(Dataset $dataset) : self
Note: Datasets must have the same number of rows.
$dataset = $dataset1->join($dataset2);
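To make the difference between the two operations concrete, here is a small sketch that checks the shape of the result of each. It assumes the library is autoloaded through Composer and uses the library's Unlabeled dataset class (not covered on this page) with made-up values.

```php
use Rubix\ML\Datasets\Unlabeled;

$a = new Unlabeled([
    [1.0, 'x'],
    [2.0, 'y'],
]);

$b = new Unlabeled([[3.0, 'z']]);         // same number of columns as $a
$c = new Unlabeled([['red'], ['blue']]);  // same number of rows as $a

$merged = $a->merge($b);  // rows of $b are added to the rows of $a
$joined = $a->join($c);   // columns of $c are added to the columns of $a

[$m, $n] = $merged->shape();
echo "merged: $m rows, $n columns" . PHP_EOL;  // merged: 3 rows, 2 columns

[$m, $n] = $joined->shape();
echo "joined: $m rows, $n columns" . PHP_EOL;  // joined: 2 rows, 3 columns
```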
Head and Tail#
Return the first n rows of data in a new dataset object:
public head(int $n = 10) : self
Return the last n rows of data in a new dataset object:
public tail(int $n = 10) : self
$subset = $dataset->head(10);
$subset = $dataset->tail(30);
Taking and Leaving#
Remove n rows from the dataset and return them in a new dataset:
public take(int $n = 1) : self
Leave n samples on the dataset and return the rest in a new dataset:
public leave(int $n = 1) : self
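Both methods modify the dataset they are called on, which the following sketch shows by counting rows before and after each call. It assumes the library is autoloaded through Composer and uses the library's Unlabeled dataset class (not covered on this page) with made-up values.

```php
use Rubix\ML\Datasets\Unlabeled;

$dataset = new Unlabeled([[1], [2], [3], [4], [5]]);

// Remove 2 rows from the dataset and get them back as a new dataset.
$taken = $dataset->take(2);

echo $taken->numRows() . PHP_EOL;    // 2
echo $dataset->numRows() . PHP_EOL;  // 3

// Keep 1 row on the dataset and get the remaining rows back as a new dataset.
$rest = $dataset->leave(1);

echo $dataset->numRows() . PHP_EOL;  // 1
echo $rest->numRows() . PHP_EOL;     // 2
```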
Splitting#
Split the dataset into left and right subsets, where ratio is the proportion of the rows that go to the left subset:
public split(float $ratio = 0.5) : array
[$training, $testing] = $dataset->split(0.8);
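If the rows are stored in a meaningful order, a plain split can produce unrepresentative subsets. One common approach, sketched below, is to shuffle with randomize() first and chain the split onto it. The sketch assumes the library is autoloaded through Composer and uses the library's Unlabeled dataset class (not covered on this page) with ten made-up rows.

```php
use Rubix\ML\Datasets\Unlabeled;

// Ten single-feature rows: [[1], [2], ..., [10]].
$dataset = new Unlabeled(array_chunk(range(1, 10), 1));

// Shuffle the rows, then put 80% in the left subset and the rest in the right.
[$training, $testing] = $dataset->randomize()->split(0.8);

echo $training->numRows() . PHP_EOL;  // 8
echo $testing->numRows() . PHP_EOL;   // 2
```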
Folding#
Fold the dataset to form k equal size datasets:
public fold(int $k = 10) : array
Note: If there are not enough samples to completely fill the last fold of the dataset then it will contain slightly fewer samples than the rest of the folds.
$folds = $dataset->fold(8);
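Folds are typically used for cross validation, where each fold takes a turn as the testing set while the remaining folds are combined for training. The loop below is a minimal sketch of that pattern using fold() together with stack(). It assumes the library is autoloaded through Composer and uses the library's Unlabeled dataset class (not covered on this page) with eight made-up rows.

```php
use Rubix\ML\Datasets\Unlabeled;

// Eight single-feature rows: [[1], [2], ..., [8]].
$dataset = new Unlabeled(array_chunk(range(1, 8), 1));

$folds = $dataset->fold(4);

foreach ($folds as $i => $testing) {
    // Every fold except the current one goes into the training set.
    $others = $folds;
    unset($others[$i]);

    $training = Unlabeled::stack(array_values($others));

    echo "Fold $i: {$training->numRows()} training rows, "
        . "{$testing->numRows()} testing rows" . PHP_EOL;
}
```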
Slicing and Splicing#
Return a portion of n rows, starting at the given offset, in a new dataset:
public slice(int $offset, int $n) : self
Remove a size n chunk of the dataset starting at offset and return it in a new dataset:
public splice(int $offset, int $n) : self
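The difference between the two is that slice() copies rows while splice() removes them from the dataset. The sketch below shows this by checking the row count after each call. It assumes the library is autoloaded through Composer and uses the library's Unlabeled dataset class (not covered on this page) with made-up values.

```php
use Rubix\ML\Datasets\Unlabeled;

$dataset = new Unlabeled([[0], [1], [2], [3], [4], [5]]);

// Copy 3 rows starting at offset 1; the dataset itself is unchanged.
$window = $dataset->slice(1, 3);

echo $window->numRows() . PHP_EOL;   // 3
echo $dataset->numRows() . PHP_EOL;  // 6

// Remove 3 rows starting at offset 1 and return them in a new dataset.
$removed = $dataset->splice(1, 3);

echo $removed->numRows() . PHP_EOL;  // 3
echo $dataset->numRows() . PHP_EOL;  // 3
```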
Batching#
Batch the dataset into subsets containing a maximum of n rows per batch:
public batch(int $n = 50) : array
$batches = $dataset->batch(250);
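Since n is a maximum, the final batch can be smaller than the others when the number of rows is not a multiple of n. The following sketch illustrates this with five made-up rows. It assumes the library is autoloaded through Composer and uses the library's Unlabeled dataset class (not covered on this page).

```php
use Rubix\ML\Datasets\Unlabeled;

// Five single-feature rows: [[1], [2], ..., [5]].
$dataset = new Unlabeled(array_chunk(range(1, 5), 1));

$sizes = [];

foreach ($dataset->batch(2) as $batch) {
    // Each batch is itself a dataset object.
    $sizes[] = $batch->numRows();
}

echo implode(', ', $sizes) . PHP_EOL;  // 2, 2, 1
```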
Randomization#
Randomize the order of the dataset and return it for method chaining:
public randomize() : self
Generate a random subset of size n without replacement:
public randomSubset(int $n) : self
$subset = $dataset->randomSubset(50);
Generate a random subset of size n with replacement:
public randomSubsetWithReplacement(int $n) : self
$subset = $dataset->randomSubsetWithReplacement(500);
Generate a random weighted subset with replacement of size n:
public randomWeightedSubsetWithReplacement(int $n, array $weights) : self
$subset = $dataset->randomWeightedSubsetWithReplacement(200, $weights);
Filtering#
Filter the rows of the dataset by passing the value of the feature column at the given offset to a callback for each row. Rows for which the callback returns false are removed from the result.
public filterByColumn(int $offset, callable $fn) : self
$tallPeople = $dataset->filterByColumn(3, function ($value) {
    return $value > 178.5;
});
Sorting#
To sort a dataset in place by a specific feature column:
public sortByColumn(int $offset, bool $descending = false) : self
$dataset->sortByColumn(5);
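The second argument reverses the sort order. The sketch below sorts a small dataset by its first column in both directions and reads the second column to show the resulting row order. It assumes the library is autoloaded through Composer and uses the library's Unlabeled dataset class (not covered on this page) with made-up values.

```php
use Rubix\ML\Datasets\Unlabeled;

$dataset = new Unlabeled([
    [3.0, 'c'],
    [1.0, 'a'],
    [2.0, 'b'],
]);

// Ascending by the column at offset 0 (the default).
$dataset->sortByColumn(0);
$ascending = $dataset->column(1);

// Descending by the same column.
$dataset->sortByColumn(0, true);
$descending = $dataset->column(1);

echo implode(', ', $ascending) . PHP_EOL;   // a, b, c
echo implode(', ', $descending) . PHP_EOL;  // c, b, a
```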
Dropping Rows and Columns#
Drop the row at the given offset:
public dropRow(int $offset) : self
Drop the rows at the given offsets:
public dropRows(array $indices) : self
Drop the column at the given offset:
public dropColumn(int $offset) : self
Drop the columns at the given offsets:
public dropColumns(array $indices) : self
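Since each of these methods returns a dataset, the calls can be chained. The sketch below drops one row and one column and then checks the remaining shape. It assumes the library is autoloaded through Composer and uses the library's Unlabeled dataset class (not covered on this page) with made-up values. The result is reassigned to $dataset so the example does not rely on whether the drop happens in place.

```php
use Rubix\ML\Datasets\Unlabeled;

$dataset = new Unlabeled([
    ['nice', 'furry', 4.0],
    ['mean', 'rough', -1.5],
    ['nice', 'rough', 2.6],
]);

// Drop the first row, then the column at offset 1.
$dataset = $dataset->dropRow(0)->dropColumn(1);

[$m, $n] = $dataset->shape();

echo "$m rows, $n columns" . PHP_EOL;  // 2 rows, 2 columns
```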
Descriptive Statistics#
Return a report of descriptive statistics, such as the central tendency, dispersion, and shape of each continuous feature column and the probabilities of each category for every categorical feature column:
public describe() : Report
echo $dataset->describe();
[
    {
        "type": "categorical",
        "num_categories": 2,
        "probabilities": {
            "friendly": 0.6666666666666666,
            "loner": 0.3333333333333333
        }
    },
    {
        "type": "continuous",
        "mean": 0.3333333333333333,
        "variance": 9.792222222222222,
        "std_dev": 3.129252661934191,
        "skewness": -0.4481030843690633,
        "kurtosis": -1.1330702741786107,
        "min": -5,
        "25%": -1.375,
        "median": 0.8,
        "75%": 2.825,
        "max": 4
    }
]
De-duplication#
Remove duplicate rows from the dataset:
public deduplicate() : self
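For example, a dataset with one repeated row should end up one row shorter after de-duplication. The sketch below assumes the library is autoloaded through Composer and uses the library's Unlabeled dataset class (not covered on this page) with made-up values.

```php
use Rubix\ML\Datasets\Unlabeled;

$dataset = new Unlabeled([
    [1, 'a'],
    [1, 'a'],  // exact duplicate of the row above
    [2, 'b'],
]);

$dataset = $dataset->deduplicate();

echo $dataset->numRows() . PHP_EOL;  // 2
```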
Encode the Dataset#
Return a JSON representation of the dataset:
public toJSON(bool $pretty = false) : Encoding
Return a newline delimited JSON encoding of the dataset:
public toNDJSON(?array $header = null) : string
$encoding = $dataset->toNDJSON([
    'sepal length', 'sepal width', 'petal length', 'petal width',
]);
echo $encoding;
{"sepal length":4.5,"sepal width":2.3,"petal length":1.3,"petal width":0.3,"class":"Iris-setosa"}
{"sepal length":4.4,"sepal width":3.2,"petal length":1.3,"petal width":0.2,"class":"Iris-setosa"}
{"sepal length":5.0,"sepal width":3.5,"petal length":1.6,"petal width":0.6,"class":"Iris-setosa"}
Return the dataset as comma-separated values (CSV) encoding with an optional header:
public toCSV(?array $header = null, string $delimiter = ',', string $enclosure = '"') : Encoding
$encoding = $dataset->toCSV([
    'sepal length', 'sepal width', 'petal length', 'petal width', 'class',
]);
echo $encoding;
sepal length,sepal width,petal length,petal width,class
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
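Since the encoding can be echoed, it can also be cast to a string and written to disk with PHP's file_put_contents(). The following is a minimal sketch of that approach rather than a dedicated persistence API of the library. It assumes the library is autoloaded through Composer, uses the library's Unlabeled dataset class (not covered on this page), and the file name, header, and values are made up.

```php
use Rubix\ML\Datasets\Unlabeled;

$dataset = new Unlabeled([
    [180.0, 75.5],
    [165.5, 60.0],
]);

// One header name per column of the dataset.
$encoding = $dataset->toCSV(['height', 'weight']);

// Cast the encoding to a string and save it (hypothetical file name).
$bytes = file_put_contents('dataset.csv', (string) $encoding);

echo $bytes === false ? 'Write failed' : 'Saved dataset.csv', PHP_EOL;
```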
Previewing in the Console#
You can echo the dataset object to preview the first few rows and columns in the console.
echo $dataset;
| Column 0 | Column 1 | Column 2 | Column 3 | Label |
-----------------------------------------------------------------------
| nice | furry | friendly | 4 | not monster |
-----------------------------------------------------------------------
| mean | furry | loner | -1.5 | monster |
-----------------------------------------------------------------------
| nice | rough | friendly | 2.6 | not monster |
-----------------------------------------------------------------------
| mean | rough | friendly | -1 | monster |
-----------------------------------------------------------------------
| nice | rough | friendly | 2.9 | not monster |
-----------------------------------------------------------------------
| nice | furry | loner | -5 | not monster |
-----------------------------------------------------------------------