Support-vector-based iteratively adjusted centroid classifier for text categorization

Wang Deqing, Zhang Hui

State Key Laboratory of Software Development Environment, Beijing University of Aeronautics and Astronautics, Beijing 100191, China

Funding: National Science and Technology Major Project on Core Electronic Devices, High-end Generic Chips and Basic Software (2010ZX01042-002)

CLC number: TP181

Publication history
- Received: 2012-01-11
- Published online: 2013-02-28

Abstract

Centroid-based classifiers (CC) are prone to inductive bias or model misfit. To address this weakness, a support-vector-based iteratively adjusted centroid classifier (IACC_SV) is proposed. The method constructs the centroid vectors for CC using only the support vectors found by a routine such as linear support vector machines (SVMs), and then iteratively adjusts these initial centroid vectors according to the misclassified training samples. Compared with other classification algorithms, IACC_SV achieves better macro-averaged F₁ and micro-averaged F₁. Experiments on 8 commonly used real-world text corpora demonstrate the effectiveness of the proposed algorithm, especially on corpora with highly imbalanced classes.

Keywords:
- text categorization
- centroid vector
- support vector
- iterative adjustment
- support vector machines
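
The two steps named in the abstract (building centroids from support vectors, then adjusting them with misclassified training samples) can be illustrated with a minimal pure-Python sketch. The abstract does not give the similarity measure or the update rule, so several things here are assumptions made only for illustration: cosine similarity as the classification score, a pull/push update with a step size `lr`, the stopping limit `max_iter`, and all function names. The support vectors are assumed to have been selected beforehand by an external SVM routine and are passed in as `sv_X` and `sv_y`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def predict(centroids, x):
    """Return the label whose centroid is most similar to x."""
    return max(centroids, key=lambda label: cosine(centroids[label], x))


def init_centroids(sv_X, sv_y):
    """Average the support vectors of each class to get the initial centroids."""
    sums, counts = {}, {}
    for x, label in zip(sv_X, sv_y):
        if label not in sums:
            sums[label] = [0.0] * len(x)
            counts[label] = 0
        sums[label] = [s + a for s, a in zip(sums[label], x)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def adjust_centroids(centroids, X, y, lr=0.1, max_iter=20):
    """Iteratively adjust centroids using misclassified training samples.

    Assumed update rule (not taken from the abstract): pull the true-class
    centroid toward each misclassified sample and push the wrongly
    predicted centroid away from it.
    """
    centroids = {label: list(c) for label, c in centroids.items()}  # work on a copy
    for _ in range(max_iter):
        # Collect the errors made by the centroids at the start of this pass.
        errors = []
        for x, true_label in zip(X, y):
            predicted = predict(centroids, x)
            if predicted != true_label:
                errors.append((x, true_label, predicted))
        if not errors:
            break  # every training sample is classified correctly
        for x, true_label, predicted in errors:
            centroids[true_label] = [c + lr * a for c, a in zip(centroids[true_label], x)]
            centroids[predicted] = [c - lr * a for c, a in zip(centroids[predicted], x)]
    return centroids


if __name__ == "__main__":
    # Toy data; the support vectors below are hypothetical, not SVM output.
    X = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [0.2, 0.9]]
    y = ["a", "a", "b", "b"]
    initial = init_centroids([[1.0, 0.0], [0.0, 1.0]], ["a", "b"])
    adjusted = adjust_centroids(initial, X, y, lr=0.5)
    print([predict(initial, x) for x in X])
    print([predict(adjusted, x) for x in X])
```

In this toy run the initial centroids misclassify the second sample, and the adjustment pass moves the two centroids until all four training samples are assigned to their own class. In practice `sv_X` and `sv_y` would come from a trained SVM rather than being chosen by hand.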

