Journal of Networks, Vol 7, No 2 (2012), 259-266, Feb 2012
doi:10.4304/jnw.7.2.259-266

Applying Stylometric Analysis Techniques to Counter Anonymity in Cyberspace

Jianwen Sun, Zongkai Yang, Sanya Liu, Pei Wang

Abstract


Because cyberspace is ubiquitous and anonymity is easily abused, tracing criminal identities is difficult in cybercrime investigations. Writeprint identification offers a valuable tool to counter anonymity: it applies stylometric analysis techniques to identify individuals from their textual traces. This study proposes a framework for online writeprint identification. Variable-length character n-grams are used to represent an author's writing style. A technique called information gain (IG) seeded, genetic algorithm (GA) based feature selection for Ensemble (IGAE) is developed to build an identification model from individual author-level features. Several components designed to handle the individual feature sets are integrated to improve performance. The proposed features and technique are evaluated on a real-world data set of reviews posted by 50 Amazon customers. The experimental results show the effectiveness of the proposed framework, with accuracy over 94% for 20 authors and over 80% for 50 authors. Compared with the baseline technique, a Support Vector Machine (SVM), IGAE achieves higher performance, improving on SVM by 2% and 8% for 20 and 50 authors respectively. Moreover, IGAE is shown to scale better with the number of authors than methods based on author-group-level features.
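As a rough illustration of two ingredients named in the abstract, the sketch below counts variable-length character n-grams and scores a single n-gram by information gain with respect to the author label. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the n-gram length range, and the binary "n-gram occurs in the document" feature are all hypothetical choices, and the GA search and ensemble stages of IGAE are not shown.

```python
from collections import Counter
from math import log2


def char_ngrams(text, min_n=1, max_n=3):
    """Count every character n-gram with min_n <= n <= max_n (assumed range)."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(docs, authors, gram):
    """Information gain of the binary feature 'gram occurs in the document'.

    Assumption: presence/absence is used as the feature value; the paper
    may weight or discretize n-gram frequencies differently.
    """
    present = [a for d, a in zip(docs, authors) if gram in d]
    absent = [a for d, a in zip(docs, authors) if gram not in d]
    gain = entropy(authors)
    for part in (present, absent):
        if part:  # skip empty partitions to avoid dividing by zero
            gain -= len(part) / len(authors) * entropy(part)
    return gain


# Toy data (invented for illustration only).
docs = ["great!!", "great!!", "fine.", "fine."]
authors = ["A", "A", "B", "B"]
counts = char_ngrams("abab", 1, 2)
gain_bang = information_gain(docs, authors, "!!")
gain_e = information_gain(docs, authors, "e")
```

In this toy data the n-gram "!!" separates the two authors perfectly, while "e" occurs in every document and carries no information. A ranking of this kind could plausibly serve to seed a genetic algorithm's initial population, which is how the "IG seeded" part of the name reads, though the abstract does not spell out that step.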


Keywords


stylometric analysis; writeprint identification; character n-gram; ensemble learning; genetic algorithm





Journal of Networks (JNW, ISSN 1796-2056)
