Journal of Software, Vol 6, No 12 (2011), 2361-2368, Dec 2011
doi:10.4304/jsw.6.12.2361-2368

A Hybrid Method for XML Clustering by Structure and Content

Yong Piao, Xiukun Wang

Abstract


This paper presents an effective XML clustering method, the neighbor center clustering algorithm (NCC), whose similarity measure draws on both the structural and the content information contained in XML files. Structural similarity is first measured with a frequency-path model, and a similarity calculation algorithm is introduced that applies position and frequency weights to the longest common subsequence. To improve performance and precision, the frequency-path model is then extended to consider structure and content information simultaneously. Experiments show that NCC, embedded with the hybrid similarity calculation method, obtains high purity and F-measure values and is effective and applicable for clustering XML documents with both homogeneous and heterogeneous structures.
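
As a rough illustration of the structural step, the sketch below compares two XML documents by the longest common subsequence of their root-to-leaf tag paths. It is not the weighted algorithm of the paper: the abstract does not give the position and frequency weights, so this version is unweighted, and the function names (`lcs_length`, `path_similarity`, `document_similarity`), the normalization by the longer path, and the symmetric best-match averaging are all assumptions made for the example.

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two tag sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Matching tags extend the best subsequence ending before both.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of skipping x or skipping y.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def path_similarity(p: Sequence[str], q: Sequence[str]) -> float:
    """Unweighted LCS similarity in [0, 1] between two root-to-leaf paths.

    Assumption: normalize by the longer path; the paper's position and
    frequency weights are not reproduced here.
    """
    if not p or not q:
        return 0.0
    return lcs_length(p, q) / max(len(p), len(q))


def document_similarity(paths_a: List[Sequence[str]],
                        paths_b: List[Sequence[str]]) -> float:
    """Symmetric average of each path's best match in the other document.

    Assumption: this aggregation is illustrative only.
    """
    if not paths_a or not paths_b:
        return 0.0

    def directed(src, dst):
        return sum(max(path_similarity(p, q) for q in dst) for p in src) / len(src)

    return (directed(paths_a, paths_b) + directed(paths_b, paths_a)) / 2


# Hypothetical documents, each given as a list of root-to-leaf tag paths.
doc_a = [["book", "title"], ["book", "author", "name"]]
doc_b = [["book", "title"], ["book", "editor", "name"]]
score = document_similarity(doc_a, doc_b)
```

A weighted variant in the spirit of the abstract would scale each matched tag by its depth in the path and each path by how often it occurs in the document; those weights would have to be taken from the full paper.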


Keywords


neighbor center clustering; position and frequency weight; longest common subsequence; hybrid similarity calculation



Journal of Software (JSW, ISSN 1796-217X)

Copyright © 2006-2012 by ACADEMY PUBLISHER – All rights reserved.