Word Count Vectorizer

The Word Count Vectorizer builds a vocabulary from the training samples and transforms text blobs into fixed-length sparse feature vectors. Each feature column represents a word or token from the vocabulary, and its value is the number of times that token appears in the given document.
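As a rough illustration of the idea, and not the library's implementation, the plain-PHP sketch below builds a vocabulary from a few training documents and then counts tokens per document. The helper names `tokenize`, `buildVocabulary`, and `vectorize` are hypothetical, the tokenizer is a simple lowercase word split, and the output is a dense array, whereas the description above refers to sparse vectors.

```php
<?php

// Hypothetical helper: lowercase the text and split it into word tokens.
function tokenize(string $text): array
{
    preg_match_all('/\w+/u', strtolower($text), $matches);

    return $matches[0];
}

// Assign each unique token in the training documents a column index,
// in order of first appearance.
function buildVocabulary(array $documents): array
{
    $vocabulary = [];

    foreach ($documents as $document) {
        foreach (tokenize($document) as $token) {
            if (!isset($vocabulary[$token])) {
                $vocabulary[$token] = count($vocabulary);
            }
        }
    }

    return $vocabulary;
}

// Turn one document into a fixed-length vector of token counts.
// Tokens that are not in the vocabulary are ignored.
function vectorize(string $document, array $vocabulary): array
{
    $vector = array_fill(0, count($vocabulary), 0);

    foreach (tokenize($document) as $token) {
        if (isset($vocabulary[$token])) {
            ++$vector[$vocabulary[$token]];
        }
    }

    return $vector;
}

$training = ['the cat sat', 'the dog sat on the cat'];

$vocabulary = buildVocabulary($training);

// "chased" was never seen during training, so it adds no column.
$vector = vectorize('the dog chased the cat', $vocabulary);
```

Because the vocabulary is fixed when it is built, every document maps to a vector of the same length regardless of how many words it contains.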

Interfaces: Transformer, Stateful, Persistable

Data Type Compatibility: Categorical

Parameters

#	Name	Default	Type	Description
1	maxVocabularySize	PHP_INT_MAX	int	The maximum number of unique tokens to embed into each document vector.
2	minDocumentFrequency	0.0	float	The minimum proportion of documents a word must appear in to be added to the vocabulary.
3	maxDocumentFrequency	1.0	float	The maximum proportion of documents a word can appear in to be added to the vocabulary.
4	tokenizer	Word	Tokenizer	The tokenizer used to extract features from blobs of text.
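To make the interaction of the three numeric parameters more concrete, here is a small plain-PHP sketch of document-frequency filtering. It reflects an assumption about the general technique rather than the library's code: the function `filterByDocumentFrequency` is hypothetical, documents are passed in already tokenized, both bounds are treated as inclusive, and the size cap is assumed to keep the tokens that appear in the most documents.

```php
<?php

// Hypothetical sketch: keep tokens whose document frequency (the proportion
// of documents containing the token) lies within [$minDf, $maxDf], then cap
// the vocabulary at the $maxSize most widespread tokens.
function filterByDocumentFrequency(
    array $tokenizedDocuments,
    int $maxSize = PHP_INT_MAX,
    float $minDf = 0.0,
    float $maxDf = 1.0
): array {
    $n = count($tokenizedDocuments);

    if ($n === 0) {
        return [];
    }

    $counts = [];

    foreach ($tokenizedDocuments as $tokens) {
        // Count each token at most once per document.
        foreach (array_unique($tokens) as $token) {
            $counts[$token] = ($counts[$token] ?? 0) + 1;
        }
    }

    $kept = [];

    foreach ($counts as $token => $count) {
        $df = $count / $n;

        if ($df >= $minDf && $df <= $maxDf) {
            $kept[$token] = $count;
        }
    }

    // Sort by document count, highest first, then apply the size cap.
    arsort($kept);

    return array_keys(array_slice($kept, 0, $maxSize, true));
}

$documents = [
    ['the', 'cat', 'dog'],
    ['the', 'cat', 'cat'],
    ['the', 'cat', 'dog', 'emu'],
    ['the'],
];

// "the" (in 100% of documents) and "emu" (in 25%) fall outside the bounds.
$vocabulary = filterByDocumentFrequency($documents, PHP_INT_MAX, 0.5, 0.75);
```

Raising the lower bound drops rare tokens that are unlikely to generalize, while lowering the upper bound drops tokens so common that they carry little signal.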

Example

use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Tokenizers\NGram;

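// Keep at most 10,000 tokens that appear in at least 1% and at most 90% of
// the documents, tokenizing the text into unigrams and bigrams.
$transformer = new WordCountVectorizer(10000, 0.01, 0.9, new NGram(1, 2));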

Additional Methods

Return an array containing the words that make up each of the vocabularies:

public vocabularies() : array

Last update: 2021-04-02