

# Deduplicator

Removes duplicate records from a dataset while the records are in flight. Deduplicator uses a Bloom filter under the hood to probabilistically identify records that have already been seen.

**Note:** Due to its probabilistic nature, Deduplicator may mistakenly drop unique records at a bounded rate.
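To build intuition for how this works, below is a minimal, hypothetical sketch of Bloom-filter deduplication over a record stream. It is not Rubix ML's implementation (the actual Deduplicator uses the layered filter configured by the parameters documented below); the single fixed-size bit array, the hash scheme, and the sizes here are illustrative only.

```php
// Toy Bloom-filter deduplication over an iterable of records (illustrative only).
// Each record is mapped to several bit positions; if every position is already
// set, the record is considered a probable duplicate and is dropped.
function deduplicate(iterable $records, int $bits = 1 << 20, int $numHashes = 4) : Generator
{
    $filter = array_fill(0, $bits, false);  // single flat layer for simplicity

    foreach ($records as $record) {
        $key = serialize($record);

        $seen = true;

        for ($i = 0; $i < $numHashes; ++$i) {
            $offset = crc32($i . $key) % $bits;  // stand-in for proper hash slices

            if (!$filter[$offset]) {
                $seen = false;

                $filter[$offset] = true;
            }
        }

        if (!$seen) {
            yield $record;  // first sighting, with high probability
        }
    }
}
```

Because membership testing is probabilistic, two distinct records can map to the same set of bits, which is why the rate at which unique records are dropped is bounded but not zero.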

Interfaces: Extractor

## Parameters

| # | Name | Default | Type | Description |
|---|---|---|---|---|
| 1 | iterator | | Traversable | The base iterator. |
| 2 | maxFalsePositiveRate | 0.001 | float | The false positive rate to remain below. |
| 3 | numHashes | 4 | int | The number of hash functions used, i.e. the number of slices per layer. Set to null for auto. |
| 4 | layerSize | 32000000 | int | The size of each layer of the filter in bits. |
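For instance, per the table above, passing null for numHashes lets the filter choose the number of hash functions automatically. The construction below is one such sketch, reusing the documented defaults for the remaining parameters.

```php
use Rubix\ML\Extractors\Deduplicator;
use Rubix\ML\Extractors\CSV;

// Keep the documented default max false positive rate (0.001) and
// let the number of hash functions be chosen automatically (null).
$extractor = new Deduplicator(new CSV('example.csv', true), 0.001, null);
```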

## Example

```php
use Rubix\ML\Extractors\Deduplicator;
use Rubix\ML\Extractors\CSV;

$extractor = new Deduplicator(new CSV('example.csv', true), 0.01, 3, 32000000);
```
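Since Deduplicator implements the Extractor interface, the deduplicated records can be streamed like any other extractor. The snippet below is one illustrative way to hydrate a dataset object from the stream, assuming the library's Unlabeled::fromIterator() factory.

```php
use Rubix\ML\Datasets\Unlabeled;

// Build a dataset from the deduplicated record stream.
$dataset = Unlabeled::fromIterator($extractor);
```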

## Additional Methods

Return the number of records that have been dropped so far:

```php
public dropped() : int
```
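For example, dropped() could be checked after the stream has been fully consumed to report how many probable duplicates were filtered out; the snippet below is illustrative.

```php
$kept = 0;

foreach ($extractor as $record) {
    ++$kept;  // consume the deduplicated stream
}

echo "Kept $kept records, dropped {$extractor->dropped()} duplicates." . PHP_EOL;
```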