K-Skip-N-Gram

K-skip-n-grams are a generalization of n-grams. In addition to n-grams formed from adjacent words, the tokenizer skips over up to k words and forms n-grams from the words further ahead in the sequence. The tokenizer outputs tokens that each contain between min and max words.
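As a rough illustration of the idea, the sketch below builds skip-grams from a list of words that has already been split. The function `kSkipNGrams` is a hypothetical helper written for this example and is not the library's implementation. It assumes that each token starts at a word, jumps over between 0 and skip words, and then appends the adjacent words that follow, and that min is at least 2.

```php
<?php

/**
 * Hypothetical helper for illustration only (not part of Rubix ML).
 * Builds tokens of $min to $max words, each starting at an anchor word,
 * jumping over 0 to $skip words, then appending the adjacent words after the gap.
 * Assumes $min >= 2.
 */
function kSkipNGrams(array $words, int $min, int $max, int $skip): array
{
    $words = array_values($words); // ensure sequential integer keys
    $count = count($words);
    $tokens = [];

    foreach ($words as $i => $first) {
        // A gap of 0 produces the ordinary n-grams of adjacent words.
        for ($gap = 0; $gap <= $skip; ++$gap) {
            $token = [$first];

            for ($k = 1; $k < $max; ++$k) {
                $index = $i + $gap + $k;

                if ($index >= $count) {
                    break; // ran past the end of the word list
                }

                $token[] = $words[$index];

                if (count($token) >= $min) {
                    $tokens[] = implode(' ', $token);
                }
            }
        }
    }

    return $tokens;
}

$tokens = kSkipNGrams(['the', 'quick', 'brown', 'fox'], 2, 2, 1);
```

With min and max both set to 2 and skip set to 1, this sketch yields the tokens `the quick`, `the brown`, `quick brown`, `quick fox`, and `brown fox`: the ordinary bigrams plus the pairs formed by jumping over one word.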

Parameters

| # | Name | Default | Type | Description |
|---|---|---|---|---|
| 1 | min | 2 | int | The minimum number of words in a single token. |
| 2 | max | 2 | int | The maximum number of words in a single token. |
| 3 | skip | 2 | int | The number of words to skip over to form new sequences. |

Example

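use Rubix\ML\Tokenizers\KSkipNGram;

// Arguments in order: min = 2, max = 3, skip = 2.
$tokenizer = new KSkipNGram(2, 3, 2);

// Returns an array of tokens, each containing 2 to 3 words.
$tokens = $tokenizer->tokenize('The quick brown fox jumped over the lazy dog.');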