Deduplicator

Removes duplicate records from a dataset while the records are in flight. Deduplicator uses a Bloom filter under the hood to probabilistically identify records it has already seen.

Note

Because the Bloom filter is probabilistic, Deduplicator may mistakenly drop some unique records (false positives). The rate at which this happens is bounded by the maxFalsePositiveRate parameter.
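To make the trade-off concrete, here is a minimal sketch of Bloom-filter deduplication in plain PHP. It is not Deduplicator's implementation: ToyBloomFilter, dedupe(), the single-layer layout, and the use of SHA-256 digest words as hash functions are all assumptions chosen for illustration.

```php
<?php

declare(strict_types=1);

// Hypothetical toy filter for illustration; not the filter Deduplicator uses.
final class ToyBloomFilter
{
    private string $bits; // bit array packed 8 bits per byte
    private int $numBits;
    private int $numHashes;

    public function __construct(int $numBits, int $numHashes)
    {
        if ($numBits < 1 || $numHashes < 1 || $numHashes > 8) {
            throw new InvalidArgumentException('Need numBits >= 1 and 1 <= numHashes <= 8.');
        }

        $this->numBits = $numBits;
        $this->numHashes = $numHashes;
        $this->bits = str_repeat("\0", intdiv($numBits + 7, 8));
    }

    // Set the token's bits and return true only if all of them were already set.
    public function existsOrInsert(string $token) : bool
    {
        // Assumption: treat the eight 32-bit words of a SHA-256 digest as separate hashes.
        $words = array_values(unpack('N8', hash('sha256', $token, true)));

        $seen = true;

        for ($i = 0; $i < $this->numHashes; ++$i) {
            $offset = $words[$i] % $this->numBits;
            $byte = intdiv($offset, 8);
            $mask = 1 << ($offset % 8);

            if ((ord($this->bits[$byte]) & $mask) === 0) {
                $seen = false;
                $this->bits[$byte] = chr(ord($this->bits[$byte]) | $mask);
            }
        }

        return $seen;
    }
}

// Yield only records the filter has not seen; count the rest in $dropped.
function dedupe(iterable $records, ToyBloomFilter $filter, int &$dropped) : Generator
{
    foreach ($records as $record) {
        // Serialize the record so the whole row is hashed as one token.
        if ($filter->existsOrInsert(serialize($record))) {
            ++$dropped;

            continue;
        }

        yield $record;
    }
}

$records = [
    ['alice', 30],
    ['bob', 25],
    ['alice', 30], // exact duplicate of the first record
];

$dropped = 0;

$unique = iterator_to_array(dedupe($records, new ToyBloomFilter(8192, 3), $dropped), false);
```

The third record is an exact repeat, so it is always dropped. The second record is kept unless all three of its bits happen to collide with bits already set by the first, which is the false-positive case the note above describes; a larger filter makes that less likely.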

Interfaces: Extractor

Parameters

| # | Name | Default | Type | Description |
|---|---|---|---|---|
| 1 | iterator | | Traversable | The base iterator. |
| 2 | maxFalsePositiveRate | 0.001 | float | The false positive rate to remain below. |
| 3 | numHashes | 4 | int | The number of hash functions used, i.e. the number of slices per layer. Set to null for auto. |
| 4 | layerSize | 32000000 | int | The size of each layer of the filter in bits. |
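
As a rough guide to how layerSize and numHashes interact, the textbook estimate for a partitioned Bloom filter can be worked through with the defaults. This assumes a single layer of m bits split into k equal slices, one ideal hash function per slice, and n distinct records stored. Whether Deduplicator estimates its rate this way, and when it adds another layer, is not stated here, so treat the numbers as a back-of-the-envelope approximation only.

```latex
% m = layerSize, k = numHashes, n = distinct records stored in the layer
p \approx \left(1 - e^{-kn/m}\right)^{k}
\quad\Longrightarrow\quad
n \approx -\frac{m}{k}\,\ln\!\left(1 - p^{1/k}\right)

% Defaults: m = 32\,000\,000,\; k = 4,\; p = 0.001
n \approx -8\,000\,000 \times \ln\!\left(1 - 0.1778\right)
  \approx 8\,000\,000 \times 0.1958
  \approx 1.57 \times 10^{6}
```

Under that model, a single default-sized layer holds on the order of 1.5 million distinct records before its estimated false positive rate reaches 0.001.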

Example

use Rubix\ML\Extractors\Deduplicator;
use Rubix\ML\Extractors\CSV;

$extractor = new Deduplicator(new CSV('example.csv', true), 0.01, 3, 32000000);

Additional Methods

Return the number of records that have been dropped so far.

public dropped() : int

Last update: 2021-10-31