Token Hashing Vectorizer
Token Hashing Vectorizer builds token count vectors on the fly by employing the hashing trick. It is a stateless transformer that uses the CRC32 (Cyclic Redundancy Check) hashing algorithm to assign each token occurrence to a bucket in a vector of user-specified dimensionality. The advantage of hashing over storing a fixed vocabulary is that no vocabulary has to be held in memory. The trade-off is that distinct tokens may collide in the same bucket, especially in lower-dimensional vector spaces.
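The bucket assignment described above can be illustrated with a minimal sketch. This is not the library's implementation: the function name `hashTokenCounts` and the naive lowercase, split-on-non-word-characters tokenizer are assumptions made for illustration, and only PHP's built-in `crc32()` is taken from the description.

```php
<?php

declare(strict_types=1);

/**
 * Illustrative sketch of the hashing trick (hypothetical helper, not part of Rubix ML).
 *
 * @param string $text Blob of text to vectorize
 * @param int $dimensions Number of buckets in the output vector
 * @return int[] Token counts indexed by bucket
 */
function hashTokenCounts(string $text, int $dimensions): array
{
    // Start from an all-zero count vector of the requested size.
    $vector = array_fill(0, $dimensions, 0);

    // Naive tokenizer (assumption): lowercase, then split on non-word characters.
    $tokens = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    foreach ($tokens as $token) {
        // crc32() is non-negative on 64-bit PHP builds, which this sketch assumes.
        $bucket = crc32($token) % $dimensions;

        // Tokens that hash to the same bucket share one counter (a collision).
        $vector[$bucket]++;
    }

    return $vector;
}

$vector = hashTokenCounts('the quick brown fox jumps over the lazy dog', 16);

echo array_sum($vector), PHP_EOL; // 9, one count per token occurrence
```

No vocabulary is stored at any point, so the same text always maps to the same vector without a fitting step. Shrinking `$dimensions` makes collisions more likely; in the extreme case of a single dimension, every token lands in the same bucket.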
Interfaces: Transformer
Data Type Compatibility: Categorical
Parameters
# | Param | Default | Type | Description |
---|---|---|---|---|
1 | dimensions | | int | The dimensionality of the vector space. |
2 | tokenizer | Word | Tokenizer | The tokenizer used to extract tokens from blobs of text. |
Example
use Rubix\ML\Transformers\TokenHashingVectorizer;
use Rubix\ML\Tokenizers\NGram;
$transformer = new TokenHashingVectorizer(10000, new NGram(1, 2));
Additional Methods
This transformer does not have any additional methods.
Last update: 2021-09-05