Rosette Dedupe for RapidMiner


The Rosette Deduplicate Names operator identifies candidate duplicates in a list of names by assigning a “group ID” to each group of matching names. The operator can process lists of up to 10,000 English names and assigns group IDs based on a user-specified match threshold. The threshold sets the minimum similarity score required for two names to be considered duplicates, so a higher threshold produces fewer, stricter matches. To set the threshold, click the operator and enter a value between 0 and 1 in the “Threshold” field. We recommend starting with a threshold of 0.8 and experimenting with higher or lower values depending on your use case and results.
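To make the effect of the threshold concrete, here is a minimal Python sketch of threshold-based grouping. It is not the Rosette algorithm: the similarity function is a stand-in built on the standard library's difflib, the names similarity and assign_group_ids are hypothetical, and the assumption that matches are merged transitively (if A matches B and B matches C, all three share an ID) is ours, not something the operator documents.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in score in [0, 1]; Rosette's name matching is NOT based on this.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def assign_group_ids(names, threshold=0.8):
    """Return one integer group ID per name, in input order.

    Assumption: two names share a group if they are linked by a chain of
    pairs whose similarity is at or above the threshold.
    """
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the group's representative, compressing paths.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair once and merge the groups of matching names.
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity(names[i], names[j]) >= threshold:
                parent[find(j)] = find(i)

    # Relabel representatives as small consecutive integers.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(names))]


names = ["John Smith", "Jon Smith", "Maria Garcia", "John Smith"]
print(assign_group_ids(names, threshold=0.8))
print(assign_group_ids(names, threshold=0.99))
```

With this stand-in score, raising the threshold from 0.8 to 0.99 splits "Jon Smith" away from the two exact "John Smith" entries, which is the kind of trade-off to look for when tuning the operator's Threshold field.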

Given a list of names as input, the operator outputs one integer cluster ID (its group ID) for each name. The output is not in any particular order; sort it by cluster ID to bring possible duplicate names together.
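If the output is exported for further processing, the sorting step can be sketched in Python as follows. The function name group_by_cluster and the sample IDs are hypothetical, and the sketch assumes the names and their cluster IDs are available as two parallel lists.

```python
from itertools import groupby
from operator import itemgetter


def group_by_cluster(names, cluster_ids):
    """Map each cluster ID to the names that received it.

    Assumes names[i] was assigned cluster_ids[i].
    """
    # Sort by cluster ID only; the sort is stable, so names keep their
    # input order within each cluster.
    rows = sorted(zip(cluster_ids, names), key=itemgetter(0))
    return {
        cid: [name for _, name in group]
        for cid, group in groupby(rows, key=itemgetter(0))
    }


# Illustrative data, not real operator output.
clusters = group_by_cluster(
    ["John Smith", "Maria Garcia", "Jon Smith"],
    [7, 3, 7],
)
for cid, members in clusters.items():
    # Clusters with more than one member are the candidate duplicates.
    flag = "possible duplicates" if len(members) > 1 else "unique"
    print(cid, members, flag)
```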