A Data-drive Feature Selection Method in Text Categorization
Abstract
Text Categorization (TC) is the process of grouping texts into one or more predefined categories based on their content. It has become a key technique for handling and organizing text data. One of the most important issues in TC is Feature Selection (FS). Many FS methods have been proposed and are widely used in the TC field, such as Information Gain (IG), Document Frequency thresholding (DF) and Mutual Information (MI). Empirical studies show that some of these (e.g. IG, DF) produce better categorization performance than others (e.g. MI). A basic research question is why these FS methods lead to different performance. Many existing works seek to answer this question through empirical studies. In this paper, we present a formal study of FS in TC. We first define three desirable constraints that any reasonable FS function should satisfy, then check these constraints against several popular FS methods, including IG, DF, MI and two other methods. We find that IG satisfies the first two constraints, that DF is strongly statistically correlated with the first constraint, and that MI does not satisfy any of the constraints. Experimental results indicate that the empirical performance of an FS function is closely related to how well it satisfies these constraints, and that none of the investigated FS functions satisfies all three constraints at the same time. Finally, we present a novel framework for developing FS functions that satisfy all three constraints, and design several new FS functions using this framework. Experimental results on the Reuters-21578 and Newsgroup corpora show that our new FS function DFICF outperforms IG and DF on both micro- and macro-averaged measures.
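As a rough illustration of the three baseline FS functions named above, the following Python sketch scores each term of a tiny corpus by DF, IG and MI. It relies on commonly used definitions of these scores rather than on formulas taken from this paper, assumes each document carries exactly one category, and takes the maximum of MI over categories as the per-term score. The function names `entropy` and `score_terms` and the four-document corpus are hypothetical and serve only as an example. The three constraints and DFICF are not defined in this abstract, so they are not sketched here.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def score_terms(docs):
    """Score every term in docs, a list of (set_of_terms, category) pairs.

    Returns {term: (df, ig, mi_max)} under the assumptions stated above.
    """
    n = len(docs)
    class_counts = Counter(label for _, label in docs)
    vocab = set().union(*(terms for terms, _ in docs))
    h_c = entropy(class_counts.values())
    scores = {}
    for t in vocab:
        # Per-category counts of documents that contain t.
        with_t = Counter(label for terms, label in docs if t in terms)
        df = sum(with_t.values())
        without_t = [class_counts[c] - with_t[c] for c in class_counts]
        # IG(t) = H(C) - P(t) H(C | t) - P(not t) H(C | not t)
        ig = (h_c
              - (df / n) * entropy(with_t.values())
              - ((n - df) / n) * entropy(without_t))
        # MI(t, c) = log2(P(t, c) / (P(t) P(c))), maximized over the
        # categories in which t actually occurs.
        mi_max = max(
            math.log2((with_t[c] / n) / ((df / n) * (class_counts[c] / n)))
            for c in with_t
        )
        scores[t] = (df, ig, mi_max)
    return scores


# Hypothetical toy corpus: two "grain" and two "crude" documents.
docs = [
    ({"wheat", "price", "the"}, "grain"),
    ({"wheat", "export", "the"}, "grain"),
    ({"oil", "price", "the"}, "crude"),
    ({"oil", "barrel", "the"}, "crude"),
]

for term, (df, ig, mi) in sorted(score_terms(docs).items()):
    print(f"{term:8s} DF={df} IG={ig:.3f} MI={mi:.3f}")
```

On this toy corpus, a term that appears in every document ("the") receives zero IG, while MI gives "barrel" (one document) the same score as "wheat" (two documents), which hints at why rankings produced by different FS functions can diverge.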


