Duplicate detection

The 'Duplicate detection' function allows you to do fuzzy matching of duplicate records - records that represent the same person, organization, product or other entity.

The main characteristics of the Duplicate detection function are:

  1. High Quality - Quality is the hallmark of matching, and our duplicate detection feature delivers on this promise.

  2. Scalable - For large datasets Duplicate detection leverages the Hadoop framework for practically unlimited scalability.

  3. Fast and interactive - On a single machine you can work quickly and interactively to refine your duplicate detection model.

  4. International - International data is supported and no regional knowledge has been encoded into the deduplication engine - you provide the business rules externally.

  5. Machine Learning based - The Duplicate detection engine is configured by examples. During a series of training sessions you can refine the deduplication model simply by having a conversation with the tool about what is and what isn't a good example of a duplicate.

Tip

Duplicate detection works fine with raw data. However, if your data is dirty and the way it is registered varies a lot, we suggest you first do your best to standardize the data before finding duplicates.

Standardization can be done by trimming, tokenizing, removing unwanted characters, replacing synonyms and the like. Explore the transformations available in DataCleaner to cleanse your data before trying to deduplicate it.
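To make the idea concrete, here is a minimal Python sketch of such a standardization step. It is not a DataCleaner transformation: the function name, the lower-casing step and the synonym table are assumptions chosen for illustration only.

```python
import re

# Hypothetical synonym table; in practice it would reflect your own business rules.
SYNONYMS = {"st": "street", "rd": "road", "inc": "incorporated"}

def standardize(value: str) -> str:
    """Trim, lower-case, strip punctuation, tokenize and replace synonyms."""
    # Remove unwanted characters (anything that is not a word character or whitespace).
    cleaned = re.sub(r"[^\w\s]", " ", value.strip().lower())
    # Tokenize on whitespace, then map each token through the synonym table.
    tokens = cleaned.split()
    return " ".join(SYNONYMS.get(token, token) for token in tokens)

print(standardize("  12 Main St. "))
print(standardize("12 MAIN STREET"))
```

Both calls yield the same standardized string, so the two spellings no longer differ when they reach the matching step.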

In the following sections we will walk through how to use the 'Duplicate detection' function. The function has three modes: Model training, Detection and Untrained detection.

'Model training' mode

In Model training mode, you train the Machine Learning engine. When you run your job in this mode, you are shown a number of potential duplicate record pairs and asked to determine whether each pair is a duplicate or not.

To start the Training mode, simply add the function and select the columns that you wish to use for matching. Additionally you might wish to configure:

Table 5.1. Model training properties

Property | Description
Max records for training | The Training tool will keep a random sample of the dataset in memory to provide as training input, and an interactive user experience. This number determines how many records will be selected for this sample.
Key column | If your dataset has a unique key, we encourage you to select it using this property. Configuring the key column has the benefit that if you wish to export a training reference later, it can be re-applied very easily.
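As an illustration of what 'Max records for training' controls, the following Python sketch draws a bounded random sample from an in-memory list of records. The function name and the record layout (an "id" field standing in for the key column) are hypothetical, and the tool's own sampling strategy may differ.

```python
import random

def sample_for_training(records, max_records, seed=None):
    """Return at most max_records records chosen uniformly at random."""
    rng = random.Random(seed)
    if len(records) <= max_records:
        # Small datasets are used in full.
        return list(records)
    return rng.sample(records, max_records)

# Hypothetical dataset: each record carries a unique key in "id".
records = [{"id": i, "name": f"customer {i}"} for i in range(1000)]
sample = sample_for_training(records, max_records=100, seed=42)
print(len(sample))
```

Because every sampled record keeps its key, classifications made on the sample can later be related back to rows in the full dataset.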

In contrast to most other analyzers in DataCleaner, which show a result screen after execution, the Training mode opens a new dialog when started. The training tool dialog allows users to train matching models. The top of the dialog contains a button bar. Below the button bar, the training tool shows some tab buttons. By default the potential duplicates are shown. For each potential duplicate you can toggle the button on the right side to indicate whether the pair is a duplicate or not.

To help you, columns with equal values are shown in a grey font, while different values are shown in black.

Right-clicking on the classification button opens a small menu that allows you to mark all examples on the remainder of this page, or all examples on all pages, as undecided, duplicates or uniques. This helps when almost all examples are duplicates or uniques. For instance, you can mark all examples as duplicates, review them, and only toggle the examples that are not duplicates.

You do not need to classify all samples shown. Recommended usage is:

  1. Classify at least 20-30 duplicate pairs (more is better)

  2. Classify at least 20-30 unique pairs (more is better)

Once you've classified records you can press the 'Train model' button in the upper right corner. This will refine the matching model and present a new set of interesting potential duplicates. You can continue this way and quite quickly reach the required number of classified pairs.

The model is saved automatically after each training run, so there is no need to save it by hand. The saved model includes the matching rules, settings, and all pairs the user classified as duplicate or unique.

Some more hints for training:

  1. Classifying uniques is just as important as classifying duplicates. Keep the numbers of duplicate examples and unique examples roughly equal.

  2. Try to find and mark some examples of every duplicate category that you know of. You can use the "search pairs" tool to help you.

  3. Sometimes the machine learning gets skewed and does not provide examples of a category of duplicate records or unique records. In those cases, close then re-open the training tool as described below, but do not press the train model button yet. The training tool shows a less specialised set of duplicate samples. You should now be able to find examples of the category you need added to the model.

All duplicate detection models may have irregularities. When you ask a computer to do a complex task like matching, it may come up with a model that has slight differences from your classifications. You can inspect the current model's differences from your classifications in the tab 'Discrepancies'.
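Conceptually, a discrepancy is a pair where the model's verdict differs from yours. The Python sketch below shows that comparison under simple assumptions: both the classifications and the predictions are plain dictionaries keyed by a pair of record keys, and the function name is hypothetical rather than part of the tool.

```python
def find_discrepancies(classified, predicted):
    """Compare user classifications with model predictions.

    Both arguments map a pair of record keys to True (duplicate) or False (unique).
    A pair missing from `predicted` is treated as predicted unique.
    """
    # Marked unique by the user, but the model calls it a duplicate.
    false_positives = [p for p, is_dup in classified.items()
                       if not is_dup and predicted.get(p)]
    # Marked duplicate by the user, but the model does not call it a duplicate.
    false_negatives = [p for p, is_dup in classified.items()
                       if is_dup and not predicted.get(p)]
    return false_positives, false_negatives

classified = {(1, 2): True, (3, 4): False, (5, 6): True}
predicted = {(1, 2): True, (3, 4): True, (5, 6): False}
false_positives, false_negatives = find_discrepancies(classified, predicted)
print(false_positives, false_negatives)
```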

Every time you classify a duplicate, it is added to the reference of the Training session. You can inspect your complete reference in the tab 'Duplicates reference'.

If you're looking for particular types of duplicate pair examples, you may want to go to the 'Search pairs' tab. In this tab you will find options to search for records with matching or non-matching values for particular fields. This may be a very useful shortcut for finding proper duplicate examples.
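The kind of filter this tab offers can be pictured with a short Python sketch. The function and parameter names are hypothetical and the records are plain dictionaries; it only illustrates selecting pairs that agree on some fields and disagree on others.

```python
def search_pairs(pairs, must_match=(), must_differ=()):
    """Return the pairs whose values match on some fields and differ on others."""
    return [
        (a, b) for a, b in pairs
        if all(a[field] == b[field] for field in must_match)
        and all(a[field] != b[field] for field in must_differ)
    ]

pairs = [
    ({"name": "J. Smith", "city": "Leeds"}, {"name": "John Smith", "city": "Leeds"}),
    ({"name": "Ann Lee", "city": "York"}, {"name": "Ann Lee", "city": "Hull"}),
]
# Find candidate pairs with the same city but a differently written name.
hits = search_pairs(pairs, must_match=["city"], must_differ=["name"])
print(len(hits))
```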

Finally, the tab 'Training parameters' presents buttons and sliders for you to influence the Machine Learning approach.

Moving the top slider to the left makes duplicate detection compare more records. This takes more time, but also increases the matching quality. Moving this slider to the right makes duplicate detection perform fewer comparisons, which is faster but can lead to more missed matches (false negatives).

Moving the bottom slider to the left makes comparison of records more strict. Moving this slider to the right makes it more lax.
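One way to picture strictness is as a similarity threshold: the higher the threshold, the fewer pairs are accepted. The Python sketch below is only an analogy under that assumption. It uses the standard library's difflib for scoring, which is not the tool's matching engine, and the threshold values are arbitrary.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Score two strings between 0.0 (nothing in common) and 1.0 (identical)."""
    return SequenceMatcher(None, a, b).ratio()

def is_match(a: str, b: str, threshold: float) -> bool:
    """Accept the pair only if its similarity reaches the threshold."""
    return similarity(a, b) >= threshold

# A lax threshold accepts the pair, a strict one rejects it.
print(is_match("Jon Smith", "John Smith", threshold=0.9))
print(is_match("Jon Smith", "John Smith", threshold=0.99))
```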

The user defined rules enable you to enforce fixed rules. The possible types of fixed rules are listed below. You can apply a rule to each column. Rules that force a pair to be unique take precedence over rules that force a pair to be duplicate. Empty values count as different.

  1. forces pairs to be duplicate when equals - The pair is always a duplicate if any column marked with this value is equal.

  2. forces pairs to be unique when equals - The pair is never a duplicate if any column marked with this value is equal.

  3. forces pairs to be unique when different - The pair is never a duplicate if any column marked with this value is different.

  4. forces pairs to be duplicate when equals and unique when different - The pair is never a duplicate if any column marked with this value is different, but the pair is always a duplicate if all columns marked with this value are equal.

  5. forces pairs to be unique when equals and unique when different - The pair is never a duplicate unless the value in one of the records is empty.
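The precedence between these rules can be sketched in Python as follows. This is an illustrative reading of the first four rule types, taking each rule by its name; the enum, the function names and the treatment of empty values (a column counts as equal only when both values are non-empty and identical) are assumptions, and the fifth rule type is left out to keep the sketch short.

```python
from enum import Enum

class Rule(Enum):
    DUPLICATE_WHEN_EQUAL = 1
    UNIQUE_WHEN_EQUAL = 2
    UNIQUE_WHEN_DIFFERENT = 3
    DUPLICATE_WHEN_EQUAL_UNIQUE_WHEN_DIFFERENT = 4

def values_equal(a, b):
    """Empty values count as different, so both sides must be non-empty."""
    return bool(a) and bool(b) and a == b

def apply_fixed_rules(rec_a, rec_b, rules):
    """Return 'unique', 'duplicate', or None when no fixed rule decides the pair."""
    equal = {col: values_equal(rec_a.get(col), rec_b.get(col)) for col in rules}
    # Rules that force a pair to be unique are checked first: they take precedence.
    for col, rule in rules.items():
        if rule is Rule.UNIQUE_WHEN_EQUAL and equal[col]:
            return "unique"
        if rule in (Rule.UNIQUE_WHEN_DIFFERENT,
                    Rule.DUPLICATE_WHEN_EQUAL_UNIQUE_WHEN_DIFFERENT) and not equal[col]:
            return "unique"
    # Rule type 1: any equal column marked this way forces a duplicate.
    if any(rule is Rule.DUPLICATE_WHEN_EQUAL and equal[col]
           for col, rule in rules.items()):
        return "duplicate"
    # Rule type 4: reaching this point means every such column is equal.
    if any(rule is Rule.DUPLICATE_WHEN_EQUAL_UNIQUE_WHEN_DIFFERENT
           for rule in rules.values()):
        return "duplicate"
    return None  # leave the decision to the trained model

a = {"email": "a@example.org", "country": "NL"}
b = {"email": "a@example.org", "country": "DE"}
rules = {"email": Rule.DUPLICATE_WHEN_EQUAL, "country": Rule.UNIQUE_WHEN_DIFFERENT}
print(apply_fixed_rules(a, b, rules))
```

In this example the email rule would force a duplicate, but the differing country forces the pair to be unique, and the unique rule wins.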

We recommend applying fixed rules only after training the model and only when strictly necessary.

After updating the matching model, the user can continue in two ways. If the user is satisfied with the model (few false positives and false negatives), they can start using the saved model in duplicate detection. Otherwise, the user can classify more of the presented samples and refine the model again.

More training typically allows for a more advanced matching model, capable of handling more corner cases. The false negatives and false positives lists give a good impression of the current state of the matching model. The user should continue training until the differences in these lists are acceptable.

To validate the training results and obtain the best model, training can be repeated on a different sample. The already classified record pairs will automatically be added to the new sample.

  1. Close the training tool.

  2. Re-run the Training tool. A new sample will be generated. All marked pairs in the saved reference are automatically included in the new sample.

  3. Press the 'Train model' button in the Training tool. This will train a model on the existing reference.

  4. You can view the discrepancies (false positives, false negatives) of the trained model against the records in the new sample.

  5. You can review the potential duplicates to determine if a category of duplicates is missing.

  6. Add more pairs to the reference as needed.

'Detection' mode

When the matching model is complete you are ready to find all the duplicates in the dataset. Use the same Duplicate detection component, but change the execution mode to 'Detection'.

When you run the job, you will see a Duplicate detection result with complete groups of duplicates.
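A common way to turn individual matched pairs into complete groups is to treat the pairs as links and collect the connected records. The Python sketch below does this with a small union-find structure. It is a generic technique shown for illustration, with a hypothetical function name, and is not necessarily how the tool forms its groups.

```python
def group_duplicates(pairs):
    """Group record keys so that records linked by any chain of pairs share a group."""
    parent = {}

    def find(key):
        # Follow parent links to the group's representative, shortening the path as we go.
        parent.setdefault(key, key)
        while parent[key] != key:
            parent[key] = parent[parent[key]]
            key = parent[key]
        return key

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for key in parent:
        groups.setdefault(find(key), set()).add(key)
    return list(groups.values())

# Records 1, 2 and 3 end up in one group even though 1 and 3 were never paired directly.
groups = group_duplicates([(1, 2), (2, 3), (7, 8)])
print(sorted(sorted(group) for group in groups))
```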

Once you have a duplicate detection result that you wish to post-process, e.g. by manual inspection, you can export the result by clicking the 'Write duplicates' button at the top of the result screen. You can save the duplicate records, the duplicate pairs and also the unique records in a datastore table of your choice. Alternatively, you can create an Excel file or a staging table.

Tip

It is now possible to feed the result directly to the merger, using the new data streams feature. You can read more about linking them together in the documentation of 'Merge duplicates'.

The Duplicate detection analyzer can run stand-alone to find duplicates in datasets of up to roughly half a million to 1 million records (depending on the number of columns). For larger datasets, the Duplicate detection component can be used in combination with a Hadoop server component. This server component is an Enterprise edition feature.

'Untrained detection' mode

Finally, there's also an 'Untrained detection' mode. This allows you to skip 'Model training' and just ask the application to make a best effort without any proper model. This mode is not recommended for production use and is considered 'experimental', but it may provide a quick first impression of some of the duplicates in your dataset.