Labeled#

A Labeled dataset is used to train supervised learners and to test a model by providing the ground truth. In addition to the standard dataset API, a labeled dataset supports operations that use the label column, such as stratifying and sorting the dataset.

Note: Since PHP silently converts integer strings (e.g. '1') to integers in some circumstances, you should not use integer strings as class labels. Instead, use an appropriate non-integer string class name such as 'class 1', '#1', or 'first'.
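
The conversion mentioned in the note can be reproduced in plain PHP without the library. The sketch below assumes the labels end up as array keys at some point, for example while counting classes; the variable names are illustrative only.

```php
<?php

// Integer strings used as class labels.
$labels = ['1', '2', '1'];

// Counting occurrences places each label in an array key.
$counts = array_count_values($labels);

// PHP casts integer-like string keys to integers, so the labels
// come back as int(1) and int(2) rather than '1' and '2'.
$asKeys = array_keys($counts);

// Non-integer strings keep their type when used as array keys.
$safe = array_keys(array_count_values(['class 1', 'class 2', 'class 1']));

var_dump($asKeys === [1, 2]);               // bool(true)
var_dump($safe === ['class 1', 'class 2']); // bool(true)
```

Once a label has passed through an array key it no longer compares strictly equal to the original string, which is why a non-integer class name is the safer choice.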

Parameters#

| # | Param | Default | Type | Description |
|---|---|---|---|---|
| 1 | samples | | array | A 2-dimensional array consisting of rows of samples and columns with feature values. |
| 2 | labels | | array | A 1-dimensional array of labels that correspond to each sample in the dataset. |
| 3 | verify | true | bool | Should we verify the data? |

Example#

use Rubix\ML\Datasets\Labeled;

$samples = [
    [0.1, 20, 'furry'],
    [2.0, -5, 'rough'],
    [0.01, 5, 'furry'],
];

$labels = ['not monster', 'monster', 'not monster'];

$dataset = new Labeled($samples, $labels);

Additional Methods#

Selectors#

Return the labels of the dataset in an array:

public labels() : array

Return a single label at the given row offset:

public label(int $offset) : mixed

Return the data type of the label:

public labelType() : DataType
echo $dataset->labelType();
continuous

Return all of the possible outcomes i.e. the unique labels in an array:

public possibleOutcomes() : array
var_dump($dataset->possibleOutcomes());
array(2) {
    [0]=> string(5) "female"
    [1]=> string(4) "male"
}

Stratification#

Group samples by their class label and return each group as its own dataset in an array:

public stratify() : array
$strata = $dataset->stratify();
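
To make the grouping concrete, here is a minimal plain-PHP sketch of the idea behind stratification. It is not the library's implementation: it works on raw arrays rather than dataset objects, and the variable names are assumptions chosen for illustration.

```php
<?php

$samples = [
    [0.1, 20, 'furry'],
    [2.0, -5, 'rough'],
    [0.01, 5, 'furry'],
];

$labels = ['not monster', 'monster', 'not monster'];

// Collect the samples that share a class label under that label.
$strata = [];

foreach ($labels as $offset => $label) {
    $strata[$label][] = $samples[$offset];
}

// Number of samples in each stratum.
print_r(array_map('count', $strata));
```

Here the 'not monster' stratum holds two samples and the 'monster' stratum holds one.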

Split the dataset into left and right subsets such that the proportions of class labels remain intact:

public stratifiedSplit($ratio = 0.5) : array
[$training, $testing] = $dataset->stratifiedSplit(0.8);
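
The following sketch illustrates what "proportions remain intact" means, using plain arrays of row offsets. The helper `stratifiedSplitOffsets()` is hypothetical and is not part of the library; it assumes each class is simply cut at the same ratio, which may differ from what `stratifiedSplit()` does internally.

```php
<?php

/**
 * Hypothetical helper: split row offsets into left and right subsets
 * while keeping roughly the same share of each class on both sides.
 */
function stratifiedSplitOffsets(array $labels, float $ratio = 0.5): array
{
    // Collect the row offsets that belong to each class.
    $strata = [];

    foreach ($labels as $offset => $label) {
        $strata[$label][] = $offset;
    }

    $left = $right = [];

    // Cut every stratum at the same ratio.
    foreach ($strata as $offsets) {
        $n = (int) floor($ratio * count($offsets));

        $left = array_merge($left, array_slice($offsets, 0, $n));
        $right = array_merge($right, array_slice($offsets, $n));
    }

    return [$left, $right];
}

// 10 'not monster' rows followed by 5 'monster' rows.
$labels = array_merge(
    array_fill(0, 10, 'not monster'),
    array_fill(0, 5, 'monster')
);

[$training, $testing] = stratifiedSplitOffsets($labels, 0.8);

echo count($training), ' ', count($testing), PHP_EOL; // 12 3
```

With a ratio of 0.8, the left subset receives 8 of the 10 'not monster' rows and 4 of the 5 'monster' rows, so both subsets keep the original 2:1 class balance.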

Return k equal size subsets of the dataset such that class proportions remain intact:

public stratifiedFold($k = 10) : array
$folds = $dataset->stratifiedFold(3);
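
As a rough illustration of stratified folding, the sketch below deals row offsets into k folds round-robin, one class at a time. The helper `stratifiedFoldOffsets()` is hypothetical, and round-robin dealing is a technique chosen here for brevity; the library may build its folds differently.

```php
<?php

/**
 * Hypothetical helper: deal row offsets into $k folds so that each
 * fold receives a near-equal share of every class.
 */
function stratifiedFoldOffsets(array $labels, int $k = 10): array
{
    // Collect the row offsets that belong to each class.
    $strata = [];

    foreach ($labels as $offset => $label) {
        $strata[$label][] = $offset;
    }

    $folds = array_fill(0, $k, []);
    $i = 0;

    // Deal the offsets of each stratum out to the folds in turn.
    foreach ($strata as $offsets) {
        foreach ($offsets as $offset) {
            $folds[$i % $k][] = $offset;
            $i++;
        }
    }

    return $folds;
}

// 6 'not monster' rows followed by 3 'monster' rows.
$labels = array_merge(
    array_fill(0, 6, 'not monster'),
    array_fill(0, 3, 'monster')
);

$folds = stratifiedFoldOffsets($labels, 3);

print_r(array_map('count', $folds));
```

Each of the three folds ends up with two 'not monster' rows and one 'monster' row, matching the 2:1 balance of the full set.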

Transform Labels#

Transform the labels in the dataset using a callback function and return self for method chaining:

public transformLabels(callable $fn) : self

Note: The callback function is called for each individual label and should return the transformed label as a continuous or categorical value.

$dataset->transformLabels('intval');

$dataset->transformLabels('floatval');

$dataset->transformLabels(function ($label) {
    switch ($label) {
        case 0:
            return 'disagree';

        case 1:
            return 'neutral';

        case 2:
            return 'agree';

        default:
            return '?';
    }
});

$dataset->transformLabels(function ($label) {
    return $label > 0.5 ? 'yes' : 'no';
});

Filter#

Filter the dataset by label:

public filterByLabel(callable $fn) : self

Note: The callback function is given a label as its only argument and should return true if the row should be kept or false if the row should be filtered out of the result.

$filtered = $dataset->filterByLabel(function ($label) {
    return $label <= 10000;
});

Sorting#

Sort the dataset by label and return self for method chaining:

public sortByLabel(bool $descending = false) : self
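
Sorting by label has to move each sample together with its label. The plain-PHP sketch below shows one way to keep the two aligned by sorting row offsets; it is an illustration under assumed variable names, not the library's implementation, and it requires PHP 7.4 or later for arrow functions.

```php
<?php

$samples = [[4.0, 'a'], [1.5, 'b'], [3.2, 'c']];
$labels = [30000, 10000, 20000];

$descending = false;

// Sort the row offsets by their label instead of sorting the labels directly.
$order = array_keys($labels);

usort($order, fn (int $a, int $b) => $descending
    ? $labels[$b] <=> $labels[$a]
    : $labels[$a] <=> $labels[$b]);

// Rebuild both arrays in the new order so rows stay aligned.
$sortedSamples = array_map(fn (int $i) => $samples[$i], $order);
$sortedLabels = array_map(fn (int $i) => $labels[$i], $order);

print_r($sortedLabels); // 10000, 20000, 30000
```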

Describe by Label#

Describe the features of the dataset broken down by label:

public describeByLabel() : Report
echo $dataset->describeByLabel();
{
    "not monster": [
        {
            "type": "categorical",
            "num_categories": 2,
            "probabilities": {
                "friendly": 0.75,
                "loner": 0.25
            }
        },
        {
            "type": "continuous",
            "mean": 1.125,
            "variance": 12.776875,
            "std_dev": 3.574475485997911,
            "skewness": -1.0795676577113944,
            "kurtosis": -0.7175867765792474,
            "min": -5,
            "25%": 0.6999999999999993,
            "median": 2.75,
            "75%": 3.175,
            "max": 4
        }
    ],
    "monster": [
        {
            "type": "categorical",
            "num_categories": 2,
            "probabilities": {
                "loner": 0.5,
                "friendly": 0.5
            }
        },
        {
            "type": "continuous",
            "mean": -1.25,
            "variance": 0.0625,
            "std_dev": 0.25,
            "skewness": 0,
            "kurtosis": -2,
            "min": -1.5,
            "25%": -1.375,
            "median": -1.25,
            "75%": -1.125,
            "max": -1
        }
    ]
}

Describe the Labels#

Return a report of descriptive statistics about the labels in the dataset:

public describeLabels() : Report
echo $dataset->describeLabels();
{
    "type": "categorical",
    "num_categories": 2,
    "probabilities": {
        "not monster": 0.6666666666666666,
        "monster": 0.3333333333333333
    }
}