Duplicate Detection

Discover the duplicates in your data. This analyzer will detect duplicates based on a matching model.

Component tutorial: Duplicate detection
Duplicate Detection in DataCleaner
Analyzer
  • Duplicate groups

    The number of groups found.

    Not parameterized
  • Duplicate pairs

    The number of duplicate pairs found.

    Not parameterized
  • Duplicate records

    The number of records that are part of at least one group.

    Not parameterized
  • Non-duplicate records

    The number of records that are not part of any group.

    Not parameterized
  • Processed records

    The number of records processed.

    Not parameterized
  • Key column

    Column to use as the unique identifier for each record. If not specified, the row number is used as the key.

    InputColumn<Object> Optional
  • Columns

    Columns to include in the matching model.

    List of InputColumn<Object> Required
  • Mode

    The execution mode of the Duplicate detection feature. Untrained detection offers a quick route to initial results. For high-quality matching and production use, 'Model training' mode is required to build a good matching model. Once the model is built, apply it in 'Detection' mode.

    Choice (Untrained duplicate detection, Detection model training, Duplicate detection) Required
  • Matching strictness

    Determines how similar rows must be to match them as duplicates.

    Choice (Very Strict, Strict, Normal, Lax, Very Lax) Required
  • Matching model

    DeduplicationModelAndFile Optional
  • Max records for training

    The number of records sampled for 'Model training' mode.

    int Required
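
The record-level metrics listed above follow directly from their definitions, and a small sketch can make the relationships concrete. The Python below is not DataCleaner code. The function name `summarize_duplicates`, the row format, and the group format are hypothetical and chosen only for illustration. It assumes that keys are unique, that row numbers start at 1 when no key column is given (the text above does not say), and that a record appearing in several groups is counted once. The "Duplicate pairs" metric is left out because the text does not state how pairs within a group are counted.

```python
from typing import Hashable, Iterable, Optional, Sequence


def summarize_duplicates(
    rows: Sequence[dict],
    groups: Iterable[Iterable[Hashable]],
    key_column: Optional[str] = None,
) -> dict:
    """Derive record-level metrics from hypothetical duplicate groups."""
    # Use the key column when given; otherwise fall back to the row number.
    # Assumption: row numbers start at 1 (the reference text does not say).
    if key_column is None:
        keys = list(range(1, len(rows) + 1))
    else:
        keys = [row[key_column] for row in rows]

    group_list = [set(group) for group in groups]

    # A record counts once even if it appears in more than one group.
    duplicate_keys = set().union(*group_list) & set(keys)

    return {
        "duplicate_groups": len(group_list),
        "duplicate_records": len(duplicate_keys),
        "non_duplicate_records": len(keys) - len(duplicate_keys),
        "processed_records": len(keys),
    }


if __name__ == "__main__":
    sample_rows = [{"id": k} for k in ["a", "b", "c", "d", "e"]]
    sample_groups = [["a", "b"], ["b", "c"]]
    print(summarize_duplicates(sample_rows, sample_groups, key_column="id"))
```

In this sketch, duplicate records and non-duplicate records always add up to processed records, which is a quick consistency check to apply to any result.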