Duplicate Detection

Discover the duplicates in your data. This analyzer detects duplicates based on a matching model.

Component tutorial: Duplicate detection
Analyzer Concurrent
  • Duplicate groups

    The number of groups found by Duplicate detection.

    Not parameterized
  • Duplicate pairs

    The number of matching record pairs found by Duplicate detection.

    Not parameterized
  • Duplicate records

    The number of records found in any of the duplicate groups.

    Not parameterized
  • Non-duplicate records

    The number of non-duplicate (unique) records found by Duplicate detection.

    Not parameterized
  • Processed records

    The number of records processed by Duplicate detection.

    Not parameterized
  • Columns

    Columns to include in the matching model.

    List of InputColumn<Object> Required
  • Key column

    Key column to use as the unique identifier for each record. If not specified, the row number is used as the key.

    InputColumn<Object> Optional
  • Mode

    The execution mode of Duplicate detection. 'Untrained duplicate detection' offers a quick route to initial results. For high-quality matching and production use, first build a good matching model in 'Detection model training' mode, then apply that model in 'Duplicate detection' mode.

    Choice: Untrained duplicate detection, Detection model training, Duplicate detection. Required
  • Matching strictness

    Determines how similar rows must be to be matched as duplicates.

    Choice: Very Strict, Strict, Normal, Lax. Required
  • Matching model

    FileBasedDeduplicationModel Required
  • Max records for training

    The number of records sampled in 'Detection model training' mode.

    int Required
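
The relationship between the five metrics and the key column fallback can be illustrated with a small sketch. This is not DataCleaner code: the function names record_keys and summarize are hypothetical, row numbers are assumed to start at 1, and duplicate groups are assumed to be the connected components formed by the matched pairs (if A matches B and B matches C, all three land in one group). The actual grouping logic of the analyzer may differ.

```python
from collections import defaultdict


def record_keys(rows, key_column=None):
    """Return one identifier per row: the key column value if a key column
    is given, otherwise the row number (assumed here to start at 1)."""
    if key_column is None:
        return list(range(1, len(rows) + 1))
    return [row[key_column] for row in rows]


def summarize(keys, pairs):
    """Derive the five metrics from matched pairs.

    Assumption: a duplicate group is a connected component of the graph
    whose edges are the matched pairs.
    """
    parent = {k: k for k in keys}

    def find(k):
        # Union-find lookup with path halving.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for a, b in pairs:
        parent[find(a)] = find(b)

    members = defaultdict(list)
    for k in keys:
        members[find(k)].append(k)

    # Only components with more than one record count as duplicate groups.
    groups = [m for m in members.values() if len(m) > 1]
    duplicate_records = sum(len(m) for m in groups)

    return {
        "duplicate_groups": len(groups),
        "duplicate_pairs": len({frozenset(p) for p in pairs}),
        "duplicate_records": duplicate_records,
        "non_duplicate_records": len(keys) - duplicate_records,
        "processed_records": len(keys),
    }


# Illustrative data: rows 1, 2 and 4 are variants of the same name.
rows = [{"name": "Ann"}, {"name": "Anne"}, {"name": "Bob"},
        {"name": "Ann."}, {"name": "Cy"}]
keys = record_keys(rows)                      # no key column: row numbers
metrics = summarize(keys, [(1, 2), (2, 4)])   # hypothetical matched pairs
print(metrics)
```

Under these assumptions, duplicate records plus non-duplicate records always equals processed records, and the number of pairs can exceed the number of groups because one group may contain several matching pairs.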