A Novel Approach to Perform Document Clustering Using Effectiveness and Efficiency of Simhash
Lavanya Pamulaparty1, C.V. Guru Rao2
1Lavanya Pamulaparty, Department of Computer Science and Engineering, and Technology, Hyderabad, India.
2Dr. C. V. Guru Rao, Department of Computer Science and Engineering, SR Engineering College, Warangal, India.
Manuscript received on January 21, 2013. | Revised Manuscript received on February 07, 2013. | Manuscript published on February 28, 2013. | PP: 312-315 | Volume-2 Issue-3, February 2013. | Retrieval Number: C1116022313/2013©BEIESP
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: Similarity is the most important feature of document clustering. As the number of web documents grows, along with the need to integrate documents from multiple large repositories, clustering similar documents efficiently becomes a challenging problem. A measure of the similarity between two patterns drawn from the same feature space is essential to most clustering procedures. In large repositories, identifying similar documents for clustering is costly in both space and time, especially when finding near documents in collections where documents may be added or deleted. In this paper, we evaluate the effectiveness of a Simhash-based similarity measurement technique for detecting similar documents, which are then clustered using a novel K-means-based clustering method.
Keywords: Document clustering, Simhash similarity measure, k-means clustering, near documents, fingerprints.
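
As a rough illustration of the fingerprint idea named in the abstract and keywords, the following minimal Python sketch computes a 64-bit Simhash fingerprint and compares two fingerprints by Hamming distance. It is a sketch under stated assumptions, not the method evaluated in this paper: the function names (tokenize, simhash, hamming_distance), the lowercase word tokenizer, the use of MD5 as the per-token hash, and the term-frequency weighting are all illustrative choices. The K-means clustering step is not shown.

```python
import hashlib
import re

FINGERPRINT_BITS = 64  # assumed fingerprint width for this sketch


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def simhash(text):
    """Return a 64-bit Simhash fingerprint of the text as an integer."""
    counts = [0] * FINGERPRINT_BITS
    for token in tokenize(text):
        # Hash each token to 64 bits; MD5 is used only as a convenient,
        # deterministic hash here (an assumption, not a requirement).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        # Each occurrence of a token votes +1 or -1 on every bit position,
        # so repeated tokens carry weight equal to their frequency.
        for i in range(FINGERPRINT_BITS):
            if (token_hash >> i) & 1:
                counts[i] += 1
            else:
                counts[i] -= 1
    # A bit of the fingerprint is set where the positive votes dominate.
    fingerprint = 0
    for i in range(FINGERPRINT_BITS):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    doc_a = "document clustering using simhash fingerprints"
    doc_b = "document clustering with simhash fingerprints"
    print(hamming_distance(simhash(doc_a), simhash(doc_b)))
```

In such a scheme, a smaller Hamming distance between two fingerprints suggests more similar documents, and the fixed-size fingerprints can be stored and compared more cheaply than the full texts.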