BM25 Transformer

BM25 is a sublinear term weighting scheme that takes term frequency (TF), document frequency (DF), and document length into account. It is similar to TF-IDF but with variable sublinearity and the addition of document length normalization.

Note: BM25 Transformer assumes that its inputs are token frequency vectors such as those created by Word Count Vectorizer.

Interfaces: Transformer, Stateful, Elastic

Data Type Compatibility: Continuous only

Parameters

| # | Param | Default | Type | Description |
|---|---|---|---|---|
| 1 | dampening | 1.2 | float | The term frequency (TF) dampening factor i.e. the K1 parameter in the formula. Lower values will cause the TF to saturate quicker. |
| 2 | normalization | 0.75 | float | The importance of document length in normalizing the term frequency i.e. the b parameter in the formula. |
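
The parameter descriptions refer to "the formula" without stating it. For orientation, a common form of the BM25 term weight from the references listed on this page is sketched below. The symbols are assumptions introduced here for illustration: f(t,d) is the count of token t in document d, |d| is the total number of tokens in d, avgdl is the average number of tokens per document, N is the number of documents seen during fitting, and df(t) is the number of documents containing t. The IDF variant shown is one of several in use and may not match the library's implementation exactly.

$$
w(t, d) = \mathrm{IDF}(t) \cdot \frac{f(t,d)\,(k_1 + 1)}{f(t,d) + k_1 \left(1 - b + b\,\dfrac{|d|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(t) = \log\!\left(1 + \frac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5}\right)
$$

Here k_1 corresponds to the dampening parameter and b to the normalization parameter. With b = 0 document length is ignored, and with b = 1 the term frequency is fully normalized by relative document length.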

Example

use Rubix\ML\Transformers\BM25Transformer;

$transformer = new BM25Transformer(1.2, 0.75);

Additional Methods

Return the document frequencies calculated during fitting:

public dfs() : ?array

Return the average number of tokens per document:

public averageDocumentLength() : ?float
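
To illustrate how the document frequencies and the average document length feed into the weighting, here is a minimal, dependency-free sketch. It is not the library's implementation: the function name bm25Weights is hypothetical, and the IDF variant used is an assumption that may differ from what BM25 Transformer computes internally. It also assumes a non-empty corpus with at least one token.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of Rubix ML): apply BM25 weights to token
 * count vectors, one row per document and one column per token.
 *
 * Returns [weighted samples, document frequencies, average document length].
 */
function bm25Weights(array $samples, float $k1 = 1.2, float $b = 0.75): array
{
    $n = count($samples);

    // Document frequency: the number of documents each token occurs in.
    $dfs = array_fill(0, count($samples[0]), 0);

    $totalTokens = 0;

    foreach ($samples as $sample) {
        $totalTokens += array_sum($sample);

        foreach ($sample as $column => $count) {
            if ($count > 0) {
                $dfs[$column]++;
            }
        }
    }

    // Average number of tokens per document (assumes a non-empty corpus).
    $averageLength = $totalTokens / $n;

    $weighted = [];

    foreach ($samples as $sample) {
        // Length normalization term: k1 * (1 - b + b * |d| / avgdl).
        $delta = $k1 * (1.0 - $b + $b * array_sum($sample) / $averageLength);

        $row = [];

        foreach ($sample as $column => $count) {
            // Assumed IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
            $idf = log(1.0 + ($n - $dfs[$column] + 0.5) / ($dfs[$column] + 0.5));

            // Tokens absent from the document keep a weight of zero.
            $row[] = $count > 0
                ? $idf * $count * ($k1 + 1.0) / ($count + $delta)
                : 0.0;
        }

        $weighted[] = $row;
    }

    return [$weighted, $dfs, $averageLength];
}

// Three documents over a three-token vocabulary.
[$weighted, $dfs, $averageLength] = bm25Weights([
    [3, 0, 1],
    [1, 2, 0],
    [0, 1, 0],
]);
```

In this sketch, $dfs and $averageLength play the role of the statistics returned by dfs() and averageDocumentLength() after fitting. Lowering $k1 makes repeated occurrences of a token count for less, and raising $b penalizes longer documents more strongly.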

References

  • S. Robertson et al. (2009). The Probabilistic Relevance Framework: BM25 and Beyond.
  • K. Sparck Jones et al. (2000). A probabilistic model of information retrieval: development and comparative experiments.