Information-theoretic ensemble clustering on web texts

WANG Yang, YUAN Kun, LIU Hongfu, WU Junjie, BAO Xiuguo

Citation: WANG Yang, YUAN Kun, LIU Hongfu, et al. Information-theoretic ensemble clustering on web texts[J]. Journal of Beijing University of Aeronautics and Astronautics, 2016, 42(8): 1603-1611. doi: 10.13700/j.bh.1001-5965.2015.0507 (in Chinese)

doi: 10.13700/j.bh.1001-5965.2015.0507
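Funding: National Natural Science Foundation of China (71531001, 71322104, 71171007, 71471009); National "863" Program of China (SS2014AA012303); Fundamental Research Funds for the Central Universities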
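Details
    About the authors:

    WANG Yang, male, doctoral student. Research interest: emergency management. Tel.: 010-82339105. E-mail: wyang@buaa.edu.cn
    WU Junjie, male, Ph.D., professor, doctoral supervisor. Research interests: data mining, public opinion analysis, and social network analysis. Tel.: 010-82339983. E-mail: wujj@buaa.edu.cn
    BAO Xiuguo, male, doctoral student. Research interests: information security and big data storage. Tel.: 010-82338497. E-mail: baoxiuguo@139.com

    Corresponding author:

    BAO Xiuguo, Tel.: 010-82338497, E-mail: baoxiuguo@139.com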

  • CLC number: V221+.3; TB553

Information-theoretic ensemble clustering on web texts

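  • Abstract: Despite extensive research in recent years, text clustering remains a challenging problem in data mining, particularly in the handling of weakly relevant features and outright noise features. To address this problem, this paper proposes DIAS, a decompose-and-combine algorithmic framework for text clustering. DIAS first decomposes the high-dimensional text data by simple random feature sampling to obtain diverse structural knowledge; the advantage of this step is that it largely avoids generating a large number of noise features. It then applies information-theoretic consensus clustering (ICC) to combine the multi-view basic clusterings into a high-quality consensus partition. Experiments on 8 real-world text data sets show that DIAS has a clear advantage over other widely used algorithms, and that it is especially effective when the basic clusterings are weak. Because it is naturally suited to distributed computing, DIAS has the potential to become a mainstream algorithm for large-scale text clustering.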
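
The decompose-and-combine flow described in the abstract can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`kmeans`, `one_hot`, `dias_sketch`), the parameters (`n_views`, `ratio`, `seed`), and the toy data are all hypothetical. The abstract does not give the ICC objective, so the consensus step here is swapped for plain k-means with squared Euclidean distance on the one-hot encoded basic partitions; an information-theoretic distance would take its place in the actual method.

```python
import random


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means (squared Euclidean distance); returns a label list."""
    # Start from k distinct rows so duplicate rows cannot yield identical centers.
    distinct = [list(p) for p in dict.fromkeys(tuple(p) for p in points)]
    centers = rng.sample(distinct, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        # Update step: each non-empty cluster's center becomes its mean.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def one_hot(partitions, k):
    """Concatenate the one-hot encodings of all basic partitions, one row per document."""
    n = len(partitions[0])
    return [
        [1.0 if part[i] == c else 0.0 for part in partitions for c in range(k)]
        for i in range(n)
    ]


def dias_sketch(docs, k, n_views=10, ratio=0.5, seed=0):
    """Decompose by random feature sampling, then combine the basic partitions."""
    rng = random.Random(seed)
    d = len(docs[0])
    m = max(1, int(ratio * d))  # number of features kept in each view
    partitions = []
    for _ in range(n_views):
        feats = rng.sample(range(d), m)  # simple random feature sampling
        view = [[row[j] for j in feats] for row in docs]
        partitions.append(kmeans(view, k, rng))  # one basic partition per view
    # Stand-in consensus step (assumption): k-means on the binary encoding.
    return kmeans(one_hot(partitions, k), k, rng)


# Hypothetical toy "term weight" vectors: two groups of four documents each.
toy_docs = (
    [[5 + 0.1 * i] * 3 + [0.1 * i] * 3 for i in range(4)]
    + [[0.1 * i] * 3 + [5 + 0.1 * i] * 3 for i in range(4)]
)
labels = dias_sketch(toy_docs, k=2)
```

Each view is clustered independently of the others, which is why a scheme of this shape distributes easily: the per-view calls to `kmeans` could run on separate workers, with only the label vectors gathered for the consensus step.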

     

Publication history
  • Received:  2015-07-30
  • Published:  2016-08-20
