Journal of Networks, Vol 7, No 6 (2012), 935-945, Jun 2012
doi:10.4304/jnw.7.6.935-945

Application of Rank Correlation, Clustering and Classification in Information Security

Gleb Beliakov, John Yearwood, Andrei Kelarev

Abstract


This article experimentally investigates a novel application of a clustering technique recently introduced by the authors, which makes it possible to use robust and stable consensus functions in information security. In this field it is often necessary to process large data sets and monitor outcomes in real time, as required, for example, in intrusion detection. Here we concentrate on one particular application, the profiling of phishing websites. First, we apply several independent clustering algorithms to a randomized sample of the data to obtain independent initial clusterings; the silhouette index is used to determine the number of clusters. Second, rank correlation is used to select a subset of features for dimensionality reduction. We investigate the effectiveness of the Pearson linear correlation coefficient, the Spearman rank correlation coefficient and the Goodman-Kruskal correlation coefficient in this application. Third, we use a consensus function to combine the independent initial clusterings into one consensus clustering. Fourth, we train fast supervised classification algorithms on the resulting consensus clustering so that they can process the whole large data set as well as new data. The precision and recall of the classifiers at the final stage of this scheme are critical for the effectiveness of the whole procedure. We investigated various combinations of correlation coefficients, consensus functions and supervised classification algorithms.
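The three correlation coefficients named in the second step can be illustrated with a short sketch. The code below is not the authors' implementation: it is a minimal pure-Python version of the textbook definitions, and the helper `rank_features`, its arguments, and the idea of scoring each feature column against a single numeric reference vector are assumptions made only for illustration. The abstract does not say how the coefficients are turned into a feature subset.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    # A constant sequence has no defined correlation; return 0.0 by convention.
    return cov / (sx * sy) if sx and sy else 0.0


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def goodman_kruskal_gamma(x, y):
    """Goodman-Kruskal gamma: (C - D) / (C + D) over all pairs, ties ignored."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0


def rank_features(columns, target, coef=spearman):
    """Hypothetical helper: order feature names by |coef(column, target)|."""
    scores = {name: abs(coef(col, target)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

On a monotone but nonlinear pair such as `x = [1, 2, 3, 4, 5]` and `y = [1, 4, 9, 16, 25]`, the two rank-based coefficients reach their maximum of 1 (up to floating-point rounding) while the Pearson coefficient stays below 1, which is one reason rank correlation may be preferred when feature relationships are not linear. The pairwise loop in `goodman_kruskal_gamma` is quadratic in the sample size, so this sketch suits small samples only.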


Keywords


consensus functions; clustering; classification; phishing websites





Journal of Networks (JNW, ISSN 1796-2056)
