Efficient Algorithm for Removing Duplicate Documents
Suresh Subramanian1, Sivaprakasam2
1Suresh Subramanian, Research Scholar, Department of Computer Science, Karpagam University, Coimbatore, Tamilnadu, India.
2Sivaprakasam, Department of Computer Science, Sri Vasavi College, Erode, Tamilnadu, India.
Manuscript received on January 01, 2014. | Revised Manuscript received on January 02, 2014. | Manuscript published on January 05, 2014. | PP: 218-221 | Volume-3 Issue-6, January 2014. | Retrieval Number: F2057013614
© The Authors. Published By: Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: The Web holds a large amount of information in many forms, including HTML documents, Word and PDF files, audio and video files, and images. Researchers face a major challenge in returning the documents that are relevant to a user's query. Identifying duplicate and near-duplicate web documents adds further overhead. This paper addresses these issues using a Genetic Algorithm together with a Duplicate Web Documents Identification Function, which improves the relevance of retrieved documents by removing duplicate records from the dataset.
Keywords: Duplicate Web-pages; Inverted Index; Genetic Algorithm; Web Content Mining.