Journal of Computers, Vol 5, No 12 (2010), 1800-1809, Dec 2010
doi:10.4304/jcp.5.12.1800-1809

A Domain-Independent Data Cleaning Algorithm for Detecting Similar-Duplicates

Kazi Shah Nawaz Ripon, Ashiqur Rahman, G.M. Atiqur Rahaman

Abstract


Data mining algorithms generally assume that data are clean and consistent. In practice this is not always the case, so detecting and eliminating duplicate records is an important part of data cleaning. Similar-duplicate records cause over-representation of data: if the database contains different representations of the same data, the results obtained from a data mining algorithm will be erroneous. Detecting similar-duplicate records is a difficult task, especially when the records are domain-independent. In this paper, we propose a novel domain-independent technique for better reconciling similar-duplicate records. We also introduce new ideas for making similar-duplicate detection algorithms faster and more efficient, and we propose a significant modification of the transitivity rule. Finally, we propose an algorithm that incorporates all these techniques for similar-duplicate detection in a domain-independent environment. The proposed method has been compared with other methods, and the experimental results confirm its superiority.



Keywords


Data cleaning, similar-duplicate, domain-independent, transitivity rule, approximate duplicate.



Journal of Computers (JCP, ISSN 1796-203X)

Copyright © 2006-2012 by ACADEMY PUBLISHER – All rights reserved.