Investigation of Techniques for Efficient & Accurate Indexing for Scalable Record Linkage & Deduplication
Sunitha Yeddula1, K.Lakshmaiah2

1Sunitha Yeddula, (M.Tech),IInd yr,cse, MITS, Madanapalle, Chittoor dist, A.P, India,
2K.Lakshmaiah, M.Tech., (Ph.D), Associate Professor, CSE Dept., MITS, Madanapalle, Chittoor dist, A.P, India,
Manuscript received on September 01, 2012. | Revised Manuscript received on September 02, 2012. | Manuscript published on September 05, 2012. | PP: 242-246 | Volume-2 Issue-4, September 2012. | Retrieval Number: D0957082412/2012©BEIESP
© The Authors. Published By: Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: Record linkage is the process of matching records from several databases that refer to the same entities. When applied to a single database, this process is known as deduplication. Matched data are becoming increasingly important in many application areas, because they can contain information that is not available otherwise or that is too costly to acquire. Removing duplicate records from a single database is a crucial step in the data cleaning process. The complexity of the matching process is also one of the major challenges. Various indexing techniques have been developed for record linkage and deduplication. They aim to reduce the number of record pairs to be compared in the matching process by removing obvious non-matching pairs, while at the same time maintaining high matching quality. This paper presents a survey of variations of six indexing techniques. Their complexity is analyzed, and their performance and scalability are evaluated within an experimental framework using both synthetic and real data sets.
Keywords: Data matching, data linkage, entity resolution, index techniques, blocking, experimental evaluation, scalability.
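
To illustrate how indexing reduces the number of record pairs to be compared, the following is a minimal sketch of standard blocking, one common indexing approach, applied to deduplication. The abstract does not name the six surveyed techniques, so this should be read as an illustration only and not as the authors' method. The field names (`surname`, `postcode`), the blocking key definition, and the sample records are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first letter of the surname plus the postcode.
    return (record["surname"][:1].lower(), record["postcode"])


def candidate_pairs(records):
    # Group record identifiers by blocking key.
    blocks = defaultdict(list)
    for rec_id, record in records.items():
        blocks[blocking_key(record)].append(rec_id)

    # Only records that share a block are paired for detailed comparison.
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Illustrative records, not taken from the paper.
records = {
    1: {"surname": "Smith", "postcode": "2600"},
    2: {"surname": "Smyth", "postcode": "2600"},
    3: {"surname": "Jones", "postcode": "2600"},
    4: {"surname": "Smith", "postcode": "3000"},
}

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

Without indexing, deduplicating n records requires n(n-1)/2 comparisons. In this sketch only records that share a blocking key are compared, which is how obvious non-matching pairs are removed. The choice of key carries a trade-off: a key that is too strict can separate true matches into different blocks, which lowers matching quality.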