Source

Word Count Vectorizer#

The Word Count Vectorizer builds a vocabulary from the training samples and transforms text blobs into fixed length feature vectors. Each feature column represents a word or token from the vocabulary and the value denotes the number of times that word appears in a given sample.

Interfaces: Transformer, Stateful

Data Type Compatibility: Categorical

Parameters#

# Param Default Type Description
1 max vocabulary PHP_INT_MAX int The maximum number of words to encode into each document vector.
2 min document frequency 1 int The minimum number of documents a word must appear in to be added to the vocabulary.
3 tokenizer Word object The tokenizer that extracts individual words from samples of text.

Additional Methods#

Return the fitted vocabulary i.e. the words that will be vectorized:

public vocabulary() : array

Return the size of the vocabulary:

public size() : int

Example#

use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Other\Tokenizers\SkipGram;

$transformer = new WordCountVectorizer(10000, 3, new SkipGram());