Support-vector-based iteratively adjusted centroid classifier for text categorization
Abstract: The centroid-based classifier (CC) is prone to inductive bias, or model misfit. To address this weakness, a support-vector-based iteratively adjusted centroid classifier (IACC_SV) is proposed. It constructs the centroid vectors for CC using only the support vectors found by a routine such as linear support vector machines (SVMs), and then iteratively adjusts these initial centroid vectors according to the misclassified training samples. Extensive experiments on 8 commonly used real-world text corpora demonstrate the effectiveness of the proposed algorithm: compared with traditional classification algorithms, IACC_SV achieves better macro-F1 and micro-F1, especially on text corpora with highly imbalanced classes.
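The two stages named in the abstract (centroids built from support vectors, then iterative correction from misclassified training samples) can be illustrated with a minimal sketch. The abstract does not give the update rule, so the version below assumes a simple "drag and push" correction: a misclassified sample is added to the centroid of its true class and subtracted from the centroid of the wrongly predicted class. The names `sv_idx`, `rate`, and `max_iter` are hypothetical, and the support-vector indices are assumed to come from a separately trained linear SVM rather than being computed here.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def init_centroids(X, y, sv_idx):
    """Build one centroid per class from the support vectors only."""
    dim = len(X[0])
    sums = {label: [0.0] * dim for label in set(y)}
    for i in sv_idx:
        for j, value in enumerate(X[i]):
            sums[y[i]][j] += value
    return {label: normalize(v) for label, v in sums.items()}

def predict(centroids, x):
    """Pick the class whose centroid has the highest cosine similarity to x."""
    xn = normalize(x)
    return max(sorted(centroids), key=lambda label: dot(centroids[label], xn))

def adjust(centroids, X, y, rate=0.1, max_iter=20):
    """Assumed update rule: drag the true-class centroid toward each
    misclassified sample and push the predicted-class centroid away."""
    centroids = dict(centroids)  # leave the caller's initial centroids intact
    for _ in range(max_iter):
        errors = 0
        for x, label in zip(X, y):
            guess = predict(centroids, x)
            if guess == label:
                continue
            errors += 1
            xn = normalize(x)
            centroids[label] = normalize(
                [c + rate * v for c, v in zip(centroids[label], xn)])
            centroids[guess] = normalize(
                [c - rate * v for c, v in zip(centroids[guess], xn)])
        if errors == 0:  # every training sample is classified correctly
            break
    return centroids

# Toy data: three 2-D "documents"; indices 0 and 1 play the role of
# support vectors (hypothetical, not produced by a real SVM).
X = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
y = ["a", "b", "a"]
sv_idx = [0, 1]

initial = init_centroids(X, y, sv_idx)
final = adjust(initial, X, y)
print([predict(initial, x) for x in X])
print([predict(final, x) for x in X])
```

In this toy run the third sample is misclassified by the initial centroids and is recovered after a few correction passes. A real experiment would use TF-IDF document vectors and support vectors taken from a trained SVM.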
Key words:
- text categorization
- centroid vector
- support vector
- iterative adjustment
- support vector machines