Journal of Emerging Technologies in Web Intelligence, Vol 3, No 1 (2011), 11-19, Feb 2011
doi:10.4304/jetwi.3.1.11-19

Detecting a Multi-Level Content Similarity from Microblogs based on Community Structures and Named Entities

Swit Phuvipadawat, Tsuyoshi Murata

Abstract


This paper presents a method for finding content similarity in microblogs. In particular, we process data from Twitter for a breaking-news detection and tracking application. The goal is to find collections of similar messages, and the method produces such collections at two levels. At the first level, similarity is defined by TF-IDF. Because microblog messages are short, we emphasize specific terms called named entities. This level yields message groups. At the second level, we construct a network from the message groups and named entities and perform community detection on it. We evaluate and visualize the resulting communities for several community detection algorithms. We demonstrate that this method can be used to explore similar messages in both a tightly coupled and a loosely coupled manner.
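As a rough illustration of the two levels described above, the following Python sketch groups a few toy messages by entity-boosted TF-IDF cosine similarity and then links the resulting groups through shared named entities. It is a minimal sketch under stated assumptions, not the authors' implementation: the messages, the hand-written entity set, the boost factor, the threshold, and the greedy grouping rule are all hypothetical, and connected components stand in for the community detection algorithms evaluated in the paper.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy inputs; a real system would use a named entity recognizer.
MESSAGES = [
    "earthquake hits chile coast",
    "strong earthquake reported in chile",
    "chile president visits tokyo",
    "new phone launched by acme",
]
ENTITIES = {"chile", "tokyo", "acme"}
ENTITY_BOOST = 2.0  # assumed multiplier for named-entity terms
THRESHOLD = 0.3     # assumed cosine similarity cut-off


def tfidf_vectors(messages, entities, boost):
    """Return one sparse TF-IDF vector (dict) per message, boosting entity terms."""
    docs = [Counter(m.split()) for m in messages]
    n = len(docs)
    df = Counter(term for d in docs for term in d)  # document frequency
    vectors = []
    for d in docs:
        vec = {}
        for term, tf in d.items():
            weight = tf * (math.log(n / df[term]) + 1.0)
            if term in entities:
                weight *= boost
            vec[term] = weight
        vectors.append(vec)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def group_messages(vectors, threshold):
    """Level 1: greedily attach each message to the first group whose
    first message is similar enough, otherwise start a new group."""
    groups = []
    for i, vec in enumerate(vectors):
        for g in groups:
            if cosine(vec, vectors[g[0]]) >= threshold:
                g.append(i)
                break
        else:
            groups.append([i])
    return groups


def link_groups(groups, messages, entities):
    """Level 2 stand-in: connected components of the group/entity network."""
    entity_to_groups = defaultdict(set)
    for gid, members in enumerate(groups):
        for i in members:
            for term in set(messages[i].split()) & entities:
                entity_to_groups[term].add(gid)
    parent = list(range(len(groups)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for gids in entity_to_groups.values():
        first, *rest = sorted(gids)
        for g in rest:
            parent[find(g)] = find(first)
    components = defaultdict(list)
    for gid in range(len(groups)):
        components[find(gid)].append(gid)
    return sorted(components.values())


if __name__ == "__main__":
    groups = group_messages(tfidf_vectors(MESSAGES, ENTITIES, ENTITY_BOOST), THRESHOLD)
    print(groups)                                     # [[0, 1], [2], [3]]
    print(link_groups(groups, MESSAGES, ENTITIES))    # [[0, 1], [2]]
```

On this toy data the two earthquake messages form one tightly coupled group at the first level, while the second level loosely couples that group with the third message through the shared entity "chile".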


Keywords


Twitter, Topic Detection and Tracking, Information Retrieval, Network Analysis




Journal of Emerging Technologies in Web Intelligence (JETWI, ISSN 1798-0461)
