BM25 Transformer

BM25 is a sublinear term weighting scheme that takes term frequency (TF), document frequency (DF), and document length into account. It is similar to TF-IDF, but the degree of sublinearity (how quickly the TF saturates) is adjustable, and the weights are additionally normalized by document length.

Note: BM25 Transformer assumes that its inputs are token frequency vectors such as those created by Word Count Vectorizer.

Interfaces: Transformer, Stateful, Elastic

Data Type Compatibility: Continuous only

Parameters

#	Param	Default	Type	Description
1	dampening	1.2	float	The term frequency (TF) dampening factor, i.e. the `K1` parameter in the formula. Lower values cause the TF to saturate more quickly.
2	normalization	0.75	float	The importance of document length in normalizing the term frequency, i.e. the `b` parameter in the formula.
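
The parameter descriptions refer to a formula that is not reproduced on this page. As a rough guide, the standard Okapi BM25 weight from the references below can be written as follows, where f(t, d) is the count of token t in document d, |d| is the number of tokens in d, avgdl is the average document length, N is the number of documents, and df(t) is the number of documents containing t. The IDF term shown is one common variant and is an assumption here; the exact form used by the transformer may differ.

```latex
\[
w(t, d) = \mathrm{IDF}(t) \cdot
  \frac{f(t, d)\,(K_1 + 1)}
       {f(t, d) + K_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}
\]
\[
\mathrm{IDF}(t) = \ln\left(1 + \frac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5}\right)
\]
```

As f(t, d) grows, the fraction approaches K_1 + 1, and a smaller K_1 reaches that ceiling after fewer occurrences. Setting b = 0 disables document length normalization, while b = 1 applies it in full.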

Example

use Rubix\ML\Transformers\BM25Transformer;

$transformer = new BM25Transformer(1.2, 0.75);
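
Since the transformer expects token counts, it is typically applied after a vectorizer. The following is a minimal sketch of that pipeline, assuming Composer's autoloader is already loaded. The sample sentences and the vocabulary size of 1000 are illustrative only.

```php
use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Transformers\WordCountVectorizer;
use Rubix\ML\Transformers\BM25Transformer;

// Illustrative text samples, one text feature per sample.
$samples = [
    ['the quick brown fox jumps over the lazy dog'],
    ['a fox is a small wild animal'],
    ['my dog sleeps all day long'],
];

$dataset = new Unlabeled($samples);

// Convert the text to token counts first, then reweight the counts with BM25.
// Stateful transformers are fitted on the dataset when they are applied.
$dataset->apply(new WordCountVectorizer(1000))
    ->apply(new BM25Transformer(1.2, 0.75));

// Each sample is now a vector of BM25 weights, one per vocabulary token.
print_r($dataset->samples()[0]);
```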

Additional Methods

Return the document frequencies calculated during fitting:

public dfs() : ?array

Return the average number of tokens per document:

public averageDocumentLength() : ?float
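
Both methods return null until the transformer has been fitted, as their nullable return types suggest. Below is a small sketch of fitting on hand-written token counts and then inspecting the fitted statistics, again assuming Composer's autoloader is loaded. The count matrix is made up for illustration.

```php
use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Transformers\BM25Transformer;

// Hypothetical token counts: 3 documents over a 3-token vocabulary.
$counts = [
    [2, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
];

$dataset = new Unlabeled($counts);

$transformer = new BM25Transformer(1.2, 0.75);

// Fitting computes the document frequencies and the average document length.
$transformer->fit($dataset);

$dfs = $transformer->dfs();

$averageLength = $transformer->averageDocumentLength();

var_dump($dfs, $averageLength);
```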

References

S. Robertson et al. (2009). The Probabilistic Relevance Framework: BM25 and Beyond.

K. Sparck Jones et al. (2000). A probabilistic model of information retrieval: development and comparative experiments.