Journal of Software, Vol 6, No 8 (2011), 1521-1528, Aug 2011
doi:10.4304/jsw.6.8.1521-1528

A Novel Weighted Phrase-Based Similarity for Web Documents Clustering

Ruilong Yang, Qingsheng Zhu, Yunni Xia

Abstract


Phrases have been considered more informative feature terms for improving the effectiveness of document clustering. This paper proposes a weighted phrase-based document similarity that computes the pairwise similarities of documents based on the Weighted Suffix Tree Document (WSTD) model. The similarity is applied to the Group-average Hierarchical Agglomerative Clustering (GHAC) algorithm to develop a new Web document clustering approach. According to the structure of the Web documents, different document parts are assigned different levels of significance as structure weights, which are stored in the nodes of the weighted suffix tree. The tree is constructed from sentences rather than whole documents. By mapping each node and its weights in the WSTD model to a unique feature term in the Vector Space Document (VSD) model, the new similarity naturally inherits the TF-IDF term weighting scheme when computing document similarity with weighted phrases. Evaluation experiments indicate that the new clustering approach is very effective at clustering Web documents: its quality greatly surpasses that of the traditional phrase-based approach, in which the structure of Web documents is ignored. In conclusion, the weighted phrase-based similarity works much better than the ordinary phrase-based similarity.
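The idea of combining structure weights with TF-IDF can be illustrated with a minimal Python sketch. It is not the paper's implementation: it enumerates word n-grams instead of building a suffix tree, and the function names, the example structure weights, and the exact weighting formula (weighted term frequency times log(1 + N/df)) are assumptions made only for illustration.

```python
import math
from collections import Counter

def phrases(sentence, max_len=3):
    """Yield every word n-gram (1..max_len words) of a sentence.

    Stand-in for the phrases a suffix tree node would represent.
    """
    words = sentence.lower().split()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])

def weighted_tf(doc, max_len=3):
    """Sum structure weights over every phrase occurrence in a document.

    doc is a list of (structure_weight, sentence) pairs, e.g. a higher
    weight for a title sentence than for body text (hypothetical values).
    """
    tf = Counter()
    for weight, sentence in doc:
        for phrase in phrases(sentence, max_len):
            tf[phrase] += weight
    return tf

def tfidf_vectors(docs, max_len=3):
    """Map each document to a sparse {phrase: weight} vector."""
    tfs = [weighted_tf(doc, max_len) for doc in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # document frequency of each phrase
    n = len(docs)
    # Assumed weighting: weighted tf * log(1 + N / df).
    return [{p: f * math.log(1 + n / df[p]) for p, f in tf.items()}
            for tf in tfs]

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[p] for p, w in u.items() if p in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Each document: (structure weight, sentence); 3.0 marks a title here.
docs = [
    [(3.0, "suffix tree clustering"), (1.0, "web documents are clustered")],
    [(3.0, "suffix tree clustering"), (1.0, "search results are grouped")],
    [(3.0, "cooking pasta recipes"), (1.0, "boil water first")],
]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

The resulting pairwise similarities could then be passed to any group-average agglomerative clustering routine; documents that share heavily weighted phrases (here, the title) end up closer than documents that share only body text.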


Keywords


suffix tree; web document clustering; weight computing; phrase-based similarity; document structure




Journal of Software (JSW, ISSN 1796-217X)

Copyright © 2006-2012 by ACADEMY PUBLISHER – All rights reserved.