Labeled#

A Labeled dataset is used to train supervised learners and to test models using cross-validation. In addition to the standard dataset object methods, a Labeled dataset can perform operations such as stratification and sorting the dataset by label.

Parameters#

| # | Param | Default | Type | Description |
|---|---|---|---|---|
| 1 | samples | | array | A 2-dimensional array consisting of rows of samples and columns with feature values. |
| 2 | labels | | array | A 1-dimensional array of labels that correspond to the samples in the dataset. |
| 3 | validate | true | bool | Should we validate the data? |

Additional Methods#

Factory Methods#

Build a new labeled dataset with validation:

public static build(array $samples = [], array $labels = []) : self

Build a new labeled dataset foregoing validation:

public static quick(array $samples = [], array $labels = []) : self

Build a dataset using a pair of iterators:

public static fromIterator(iterable $samples, iterable $labels) : self

Build a labeled dataset from a data table with the last column containing the label:

public static unzip(array $table) : self

Example

use Rubix\ML\Datasets\Labeled;

// Import samples and labels

$dataset = new Labeled($samples, $labels, true);  // Using the constructor

$dataset = Labeled::build($samples, $labels);  // Build a new dataset with validation

$dataset = Labeled::quick($samples, $labels);  // Build a new dataset without validation

$dataset = Labeled::fromIterator($samples, $labels); // From a pair of iterators

$dataset = Labeled::unzip($table); // From a data table
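
To make the table layout expected by `unzip()` concrete, here is a minimal plain-PHP sketch of splitting a table whose last column holds the label. The helper name `unzipTable` is hypothetical and the code is an illustration under that assumption, not the library's implementation.

```php
<?php

// Hypothetical helper (not part of the library): split a table into
// samples and labels, treating the last column of each row as the label.
function unzipTable(array $table) : array
{
    $samples = $labels = [];

    foreach ($table as $row) {
        $labels[] = array_pop($row); // Remove and keep the last column
        $samples[] = $row;           // The remaining columns are the features
    }

    return [$samples, $labels];
}

$table = [
    [1.5, 'rough', 'monster'],
    [0.2, 'furry', 'not monster'],
];

[$samples, $labels] = unzipTable($table);
```

The resulting `$samples` and `$labels` arrays have the same shape as the first two constructor parameters described above.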

Selectors#

Return an array of labels:

public labels() : array

Zip the samples and labels together in a Generator:

public zip() : Generator

Return the label at the given row offset:

public label(int $index) : mixed

Return the type of the label encoded as an integer:

public labelType() : int

Return all of the possible outcomes i.e. the unique labels:

public possibleOutcomes() : array

Example

// Return the labels in the dataset
$labels = $dataset->labels();

// Return the label at row offset 3
$label = $dataset->label(3);

// Return an array of unique labels
$outcomes = $dataset->possibleOutcomes();

var_dump($labels);
var_dump($label);
var_dump($outcomes);
array(4) {
    [0]=> string(6) "female"
    [1]=> string(4) "male"
    [2]=> string(6) "female"
    [3]=> string(4) "male"
}

string(4) "male"

array(2) {
    [0]=> string(6) "female"
    [1]=> string(4) "male"
}
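
The selector examples above do not cover `zip()`. As a rough illustration of zipping samples and labels lazily, the sketch below uses a plain PHP generator. The helper name `zipRows` is hypothetical, and appending the label as the last column of each yielded row is an assumption about the row layout, not something confirmed by the signature above.

```php
<?php

// Hypothetical helper (not part of the library): lazily pair each sample
// with its label. Assumption: the label becomes the last column of the row.
function zipRows(array $samples, array $labels) : Generator
{
    foreach ($samples as $i => $sample) {
        $sample[] = $labels[$i];

        yield $sample;
    }
}

$samples = [[170, 65], [182, 80]];
$labels = ['female', 'male'];

// Materialize the generator only for demonstration purposes
$rows = iterator_to_array(zipRows($samples, $labels));
```

Because a generator produces one row at a time, the combined table never has to be held in memory unless it is materialized as in the last line.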

Transform#

Transform the labels in the dataset using a callback function and return self for method chaining:

public transformLabels(callable $fn) : self

Note: The callback function is given a label as its only argument and should return the transformed label as a continuous or categorical value.

Example

$dataset->transformLabels('intval'); // To integers

$dataset->transformLabels('floatval'); // To floats

// From integers to discrete classes
$dataset->transformLabels(function ($label) {
    switch ($label) {
        case 1:
            return 'male';

        case 2:
            return 'female';

        default:
            return 'other';
    }
});

// From a continuous value to binary classes
$dataset->transformLabels(function ($label) {
    return $label > 0.5 ? 'yes' : 'no';
});

Filter#

Filter the dataset by label:

public filterByLabel(callable $fn) : self

Note: The callback function is given a label as its only argument and should return true if the row should be kept or false if the row should be filtered out of the result.

Example

// Remove rows with label values greater than 10000
$filtered = $dataset->filterByLabel(function ($label) {
    return $label <= 10000;
});

Sorting#

Sort the dataset by label and return self for method chaining:

public sortByLabel(bool $descending = false) : self
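
Sorting by label has to reorder the samples and the labels together so that each row keeps its label. A minimal plain-PHP sketch of that idea follows. The helper name `sortRowsByLabel` is hypothetical and this is not the library's implementation.

```php
<?php

// Hypothetical helper (not part of the library): reorder samples and
// labels together so that the labels end up in sorted order.
function sortRowsByLabel(array $samples, array $labels, bool $descending = false) : array
{
    $order = array_keys($labels);

    // Sort the row offsets by comparing the labels they point to
    usort($order, function ($a, $b) use ($labels, $descending) {
        return $descending
            ? $labels[$b] <=> $labels[$a]
            : $labels[$a] <=> $labels[$b];
    });

    $sortedSamples = $sortedLabels = [];

    foreach ($order as $i) {
        $sortedSamples[] = $samples[$i];
        $sortedLabels[] = $labels[$i];
    }

    return [$sortedSamples, $sortedLabels];
}

$samples = [[3.0], [1.0], [2.0]];
$labels = [30, 10, 20];

[$samples, $labels] = sortRowsByLabel($samples, $labels);
```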

Stratification#

Group the samples by label and return an array of datasets, one per label:

public stratify() : array

Split the dataset into left and right stratified subsets with a given ratio of samples in each:

public stratifiedSplit($ratio = 0.5) : array

Return k equal size stratified subsets (folds) of the dataset:

public stratifiedFold($k = 10) : array

Example

// Group the samples into one dataset per unique label
$strata = $dataset->stratify();

// Fold the dataset into 5 equal size stratified subsets
$folds = $dataset->stratifiedFold(5);

// Split the dataset into two stratified subsets
[$left, $right] = $dataset->stratifiedSplit(0.8);
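
The idea behind stratification can be sketched in plain PHP: group the rows by label, then split each group at the same ratio so both sides keep roughly the label proportions of the whole. The helper names `stratifyRows` and `stratifiedSplitRows` are hypothetical, the sketch assumes categorical labels (they are used as array keys), and it is not the library's implementation.

```php
<?php

// Hypothetical helper (not part of the library): group samples by label.
// Assumes categorical (string or integer) labels usable as array keys.
function stratifyRows(array $samples, array $labels) : array
{
    $strata = [];

    foreach ($labels as $i => $label) {
        $strata[$label][] = $samples[$i];
    }

    return $strata;
}

// Hypothetical helper: split every stratum at the same ratio and collect
// the [sample, label] pairs of each side.
function stratifiedSplitRows(array $samples, array $labels, float $ratio = 0.5) : array
{
    $left = $right = [];

    foreach (stratifyRows($samples, $labels) as $label => $stratum) {
        $n = (int) floor($ratio * count($stratum));

        foreach ($stratum as $i => $sample) {
            if ($i < $n) {
                $left[] = [$sample, $label];
            } else {
                $right[] = [$sample, $label];
            }
        }
    }

    return [$left, $right];
}

$samples = [[1], [2], [3], [4], [5], [6]];
$labels = ['a', 'a', 'a', 'a', 'b', 'b'];

[$left, $right] = stratifiedSplitRows($samples, $labels, 0.5);
```

With a ratio of 0.5, each side receives two rows labeled 'a' and one row labeled 'b', matching the 2:1 proportion of the full set.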

Describe the Labels#

Return an array of descriptive statistics about the labels in the dataset:

public describeLabels() : array

Example

$desc = $dataset->describeLabels();

print_r($desc);
Array
(
    [type] => categorical
    [num_categories] => 2
    [probabilities] => Array
        (
            [monster] => 0.33333333333333
            [not monster] => 0.66666666666667
        )

)
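
The `probabilities` entry in the output above is consistent with each category's share of the labels. As a sketch of how such proportions can be computed in plain PHP, consider the following. The helper name `labelProbabilities` is hypothetical, the three example labels are made up to reproduce the 1/3 and 2/3 shares, and this is not the library's implementation.

```php
<?php

// Hypothetical helper (not part of the library): the proportion of each
// category among the labels.
function labelProbabilities(array $labels) : array
{
    $n = count($labels);
    $probabilities = [];

    // array_count_values() maps each distinct label to its frequency
    foreach (array_count_values($labels) as $label => $count) {
        $probabilities[$label] = $count / $n;
    }

    return $probabilities;
}

// Illustrative labels only
$labels = ['monster', 'not monster', 'not monster'];

$probabilities = labelProbabilities($labels);

print_r($probabilities);
```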