Token Hashing Vectorizer#

Token Hashing Vectorizer builds token count vectors on the fly using the hashing trick. It is a stateless transformer that uses a hashing algorithm to assign each token occurrence to a bucket in a vector of user-specified dimensionality. The advantage of hashing over storing a fixed vocabulary is that there is no vocabulary to hold in memory. However, there is a chance that distinct tokens will collide in the same bucket, especially in lower-dimensional vector spaces.
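As a rough illustration of the hashing trick described above, the following sketch counts tokens into a fixed-size vector using only PHP built-ins. The function name `hashTokenCounts` and the regex tokenizer are assumptions made for this example and are not the library's implementation.

```php
<?php

/**
 * Hypothetical helper illustrating the hashing trick; not part of the library.
 *
 * @return list<int> token counts indexed by bucket
 */
function hashTokenCounts(string $text, int $dimensions): array
{
    // Start with an all-zero vector of the requested dimensionality.
    $vector = array_fill(0, $dimensions, 0);

    // Naive tokenizer (assumption): lowercase, then split on runs of non-word characters.
    $tokens = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    foreach ($tokens as $token) {
        // crc32() returns a non-negative integer on 64-bit builds of PHP,
        // so the modulo always yields a valid bucket index there.
        $bucket = crc32($token) % $dimensions;

        // Colliding tokens simply add to the same bucket.
        ++$vector[$bucket];
    }

    return $vector;
}

$vector = hashTokenCounts('the quick brown fox jumps over the lazy dog', 16);
```

No vocabulary is stored at any point, so the same function can be applied to unseen text without a fitting step. With only 16 buckets, distinct tokens are likely to share a bucket; a larger `$dimensions` lowers the collision rate at the cost of a longer vector.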

Note

The default hashing function is CRC32, which offers a good balance between speed and output space utilization. MurmurHash has even greater utilization at the cost of some speed, and it is only available on PHP 8.1 and above. FNV1 is comparable to CRC32 but with slightly more overhead.

Interfaces: Transformer

Data Type Compatibility: Categorical

Parameters#

#	Param	Default	Type	Description
1	dimensions		int	The dimensionality of the vector space.
2	tokenizer	Word	Tokenizer	The tokenizer used to extract tokens from blobs of text.
3	hashFn	'crc32'	callable	The hash function that accepts a string token and returns an integer.
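Since `hashFn` is described as any callable that accepts a string token and returns an integer, a custom function should also be usable. The closure below is a minimal sketch under that assumption; the name `$md5Prefix` and the choice of MD5 are hypothetical and only illustrate the expected signature.

```php
<?php

// Hypothetical custom hash function: the first 32 bits of an MD5 digest.
// $md5Prefix is an illustrative name and is not part of the library.
$md5Prefix = function (string $token): int {
    // 8 hex characters are 32 bits, which fits in an int on 64-bit PHP.
    return (int) hexdec(substr(md5($token), 0, 8));
};

// Assumed usage, following the parameter table above:
// $transformer = new TokenHashingVectorizer(10000, new Word(), $md5Prefix);
```

MD5 is slower than the hash functions mentioned in the note above, so a callable like this is only meant to show the shape of a custom function, not to suggest a better default.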

Example#

use Rubix\ML\Transformers\TokenHashingVectorizer;
use Rubix\ML\Tokenizers\Word;

$transformer = new TokenHashingVectorizer(10000, new Word(), TokenHashingVectorizer::MURMUR3);

Additional Constants#

The CRC32 callback function.

public const CRC32 callable(string):int

The MurmurHash3 callback function.

public const MURMUR3 callable(string):int

The FNV1 callback function.

public const FNV1 callable(string):int

Additional Methods#

The MurmurHash3 hashing function:

public static murmur3(string $input) : int

Note

MurmurHash3 is only available on PHP 8.1 or above.

The FNV1a 32-bit hashing function:

public static fnv1a32(string $input) : int
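For reference, a 32-bit FNV-1a hash can be computed with PHP's built-in `hash()` function. The sketch below shows one way to do so; `fnv1a32Sketch` is a hypothetical stand-in written for this example, and the library's own method may be implemented differently.

```php
<?php

/**
 * Hypothetical stand-in for a 32-bit FNV-1a hash built on PHP's hash() function.
 */
function fnv1a32Sketch(string $input): int
{
    // hash() returns 8 hex characters for 'fnv1a32'; hexdec() converts them to a number.
    // The cast assumes a 64-bit build of PHP, where the value fits in an int.
    return (int) hexdec(hash('fnv1a32', $input));
}

// Map a token to a bucket in a 10,000-dimensional vector.
$bucket = fnv1a32Sketch('token') % 10000;
```

The same pattern should apply to MurmurHash3 by swapping in the `murmur3a` algorithm name, which `hash()` supports from PHP 8.1 onward.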