K-skip-n-grams are a technique similar to n-grams. In addition to the n-grams formed from adjacent sequences of words, the next k words are skipped over to form n-grams from the resulting forward-looking sequences. The tokenizer outputs tokens containing between min and max words each.
| # | Name | Default | Type | Description |
|---|---|---|---|---|
| 1 | min | 2 | int | The minimum number of words in a single token. |
| 2 | max | 2 | int | The maximum number of words in a single token. |
| 3 | skip | 2 | int | The number of words to skip over to form new sequences. |
use Rubix\ML\Other\Tokenizers\KSkipNGram;

// Arguments in order: min = 2, max = 3, skip = 2.
$tokenizer = new KSkipNGram(2, 3, 2);
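
To make the skipping behavior concrete, here is a minimal standalone sketch of one way to read the description above. It is not the library's implementation: `kSkipNGrams()` is a hypothetical helper, and it assumes that a token is built from a starting word, a gap of 0 to `skip` words, and then the next contiguous words. It also assumes `min` is at least 2 and that words are joined with a single space. The actual tokenizer may differ in details such as sentence splitting and word matching.

```php
<?php

/**
 * Hypothetical helper for illustration only, not part of Rubix ML.
 *
 * @param list<string> $words Words of a single sentence, already split.
 * @return list<string> Tokens of $min to $max words, joined by spaces.
 */
function kSkipNGrams(array $words, int $min, int $max, int $skip): array
{
    if ($min < 2 || $max < $min || $skip < 0) {
        throw new InvalidArgumentException('Expected 2 <= min <= max and skip >= 0.');
    }

    $count = count($words);
    $tokens = [];

    for ($i = 0; $i < $count; ++$i) {
        // A gap of 0 gives the ordinary n-gram; larger gaps skip words
        // that directly follow the starting word.
        for ($gap = 0; $gap <= $skip; ++$gap) {
            for ($n = $min; $n <= $max; ++$n) {
                $start = $i + 1 + $gap;
                $length = $n - 1;

                // Stop once the token would run past the end of the sentence.
                if ($start + $length > $count) {
                    break;
                }

                $rest = array_slice($words, $start, $length);
                $tokens[] = implode(' ', array_merge([$words[$i]], $rest));
            }
        }
    }

    return $tokens;
}

$tokens = kSkipNGrams(['the', 'quick', 'brown', 'fox'], 2, 2, 1);

// ['the quick', 'the brown', 'quick brown', 'quick fox', 'brown fox']
print_r($tokens);
```

With `skip` set to 0 the sketch reduces to plain n-grams, which is a quick way to see what the extra parameter contributes.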