Merge duplicates

Merging duplicates is the next step after detecting them in a dataset. It helps to restore a single version of truth by combining information from all the duplicate records representing the same physical entity.

In this section, we assume that the steps from previous section "Duplicate detection" has been completed, and you have a job with the duplicate detection analyzer prepared.

Merge duplicates

In the Improve menu, in the Deduplication submenu, there are two components that are useful for merging duplicates into a single record - Merge duplicates (simple) and Merge duplicates (advanced). The simple version just picks one record from the group of duplicates while the advanced one enables the user to combine records, taking some values from one record, some from another. In this example we will use the simple version.

Add "Merge duplicates (Simple)" transformer to your job, and configure the new component. Change the scope to "Duplicate Detection: Duplicate and unique rows". Now click "Select all", then disable the four synthetic columns Record id,Group id, Group size and Group score.

Set the property Group id to the column Group id, and the property Group count to the column Group size.

The resulting configuration should look something like this.

Merge duplicates (Simple) transformer will output all the input columns as output + additional colums carrying metadata about the record. One of them is "Merge status" that can have three possible values: SURVIVOR, NON_SURVIVOR and UNIQUE.

  1. SURVIVOR records are records that has duplicates, but has been chosen as the representative of their duplicate group, and should therefore "survive".

  2. NON_SURVIVOR records are records that has duplicates, but where another record has been chosen as the representative of the duplicate group. These records should not be used.

  3. UNIQUE records are records that never had any duplicates.

Let's add an Equals filter to the job in order to write only these SURVIVOR and UNIQUE records, to create a data set with no duplicates.

Save the output of the filter to a new staging table.

Warning

Unlike duplicate detection, "Merge duplicates" is a transformer, which means that new columns are created with the result in them. Make sure that you only choose the merged columns in your staging table.

The whole job graph should look similar to this:

Conclusion

Following above sections, we obtained a datastore with unique values and merged duplicate inside, ready to be exported to the format of user's preference.