Information-theoretic ensemble clustering on web texts
-
摘要: 尽管近年来针对文本聚类问题进行了大量研究,其仍然是数据挖掘领域的一个富有挑战性的问题,特别在弱相关特征乃至噪声特征的处理上,仍然存在诸多挑战。针对这一问题提出了文本聚类的分解-组合算法框架——DIAS。该方法首先通过简单随机特征抽样将高维文本数据进行分解得到多样化的结构知识,其优点是能够较好地避免产生大量的噪声特征。然后采用基于信息理论的一致性聚类(ICC)将多视角基础聚类知识组合起来,得到高质量的一致性划分。最后通过在8个真实文本数据集上的实验,证明DIAS算法相较于其他被广泛使用的算法具有明显优势,特别在处理弱基础聚类上具有突出效果。由于在分布式计算上的天然优势,DIAS有望成为大规模文本聚类的主流算法。
-
关键词:
- 文本聚类 /
- 分解-组合算法 /
- 基于信息理论的一致性聚类 /
- K-均值 /
- 大数据聚类
Abstract: Although being extensively studied, text clustering remains a critical challenge in data mining community due to the curse of dimensionality. Various techniques have been proposed to overcome this difficulty, but the negative impact of weakly related or even noisy features is yet the hunting nightmare. Meanwhile, we should never lose sight of the explosive growth of unlimited user-generated content on social media, which is extremely sparse and poses further challenge on the efficiency issue. In light of this, a disassemble-assemble (DIAS) framework is proposed for text clustering. Simple random feature sampling is employed by DIAS to disassemble high-dimensional text data and gain diverse structural knowledge by avoiding the bulk of noisy features. Then the multi-view knowledge is assembled by fast information-theoretic consensus clustering (ICC) to gain a high-quality consensus partitioning. Extensive experiments on eight real-world text data sets are conducted to demonstrate the advantages of DIAS over some widely used methods. In particular, DIAS shows appealing merits in learning from a bulk of very weak basic partitionings. Its natural suitability for distributed computing makes DIAS become a promising candidate for big text clustering. -
[1] ZAMIR O,ETZIONI O,MADANI O,et al.Fast and intuitive clustering of web documents[C]//Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mininge.New York:AIAA,1997:287-290. [2] CUTTING D R,KARGER D R,PEDERSEN J O,et al.Scatter/gather:A cluster-based approach to browsing large document collections[C]//Proceedings of 15th ACM International Conference on Research and Development in Information Retrieval.New York:ACM,1992:318-329. [3] CHA M,KWAK H,RODRIGUEZ P,et al.I tube,you tube,everybody tubes:Analyzing the world's largest user generated content video system[C]//Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement.New York:ACM,2007:1-14. [4] DUDA R O,HART P E,STORK D G.Pattern classification[M].2nd ed.New York:Wiley-Interscience,2000:14-15. [5] JOLLIFFE I T.Principal component analysis[M].2nd ed.New York:Springer,2002:8-21. [6] CAO J,WU Z,WU J,et al.Sail:Summation-based incremental learning for information-theoretic text clustering[J].IEEE Transactions on Cybernetics,2013,43(2):570-584. [7] WU J,LIU H,XIONG H,et al.K-means-based consensus clustering:A unified view[J].IEEE Transactions on Knowledge and Data Engineering,2015,27(1):155-169. [8] GOKCAY E,PRINCIPE J C.Information theoretic clustering[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2002,24(2):158-171. [9] DHILLON I,MALLELA S,MODHA D.Information-theoretic co-clustering[C]//Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.New York:ACM,2003:89-93. [10] Digitical Technology Center (DTC).CLUTO-Software for clustering high-dimensional datasets[DS/OL].(2006-10-18)[2015-01-30].http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview. [11] CAI D.The 20 newgroups data set[DS/OL].(2008-01-14)[2015-01-30].http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html. [12] CAI D,WANG X,HE X.Probabilistic dyadic data analysis with local and global consistency[C]//Proceedings of the 26th International Conference on Machine Learning (ICML'09).New York:ACM,2009:105-112. [13] LI R L.English text segmentation corpus[DS/OL].(2011-10-30)[2015-1-30].http://www.datatang.com/data/11968. [14] ZHAO Y,KARYPIS G.Empirical and theoretical comparisons of selected criterion functions for document clustering[J].Machine Learning,2004,55(3):311-331. [15] STREHL A,GHOSH J.Cluster ensembles-A knowledge reuse framework for combining partitions[J].Journal of Machine Learning Research,2003,3:583-617. [16] ZHONG S,GHOSH J.Generative model-based document clustering:A comparative study[J].Knowledge and Information Systems,2005,8(3):374-384. [17] FRED A L N,JAIN A K.Combining multiple clusterings using evidence accumulation[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2005,27(6):835-850. [18] AGGARWAL C C,ZHAI C.Mining text data[M].New York:Springer,2012:81-86. [19] BERRY M W,DUMAIS S T,O'BRIEN G W.Using linear algebra for intelligent information retrieval[J].SIAM Review,1995,37(4):573-595. [20] HYVARINEN A,OJA E.Independent component analysis:Algorithms and applications[J].Neural Networks,2000,13(4-5):411-430. [21] BOUTSIDIS C,ZOUZIAS A,DRINEAS P.Random projections for K-means clustering[C]//Advances in Neural Information Processing Systems.Cambridge:MIT Press,2010:298-306. [22] AGRAWAL R,GEHRKE J,GUNOPULOS D,et al.Automatic subspace clustering of high dimensional data for data mining applications[C]//Proceedings of the ACM SIGMOD International Conference on Management of Data.New York:ACM,1998:94-105. [23] CHENG C H,FU A W,ZHANG Y.Entropy-based subspace clustering for mining numerical data[C]//Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.New York:ACM,1999:84-93. [24] GOIL S,NAGESH H,CHOUDHARY A.MAFIA:Efficient and scalable subspace clustering for very large data sets[C]//Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.New York:ACM,1999:443-452. [25] AGGARWAL C C,YU P S.Finding generalized projected clusters in high dimensional spaces[C]//Proceedings of the ACM SIGMOD International Conference on Management of Data.New York:ACM,2000:70-81. [26] FRIEDMAN J H,MEULMAN J J.Clustering objects on subsets of attributes[J].Journal of the Royal Statistical Society:Series B-Statistical Methodology,2004,66(4):815-849. [27] WOO K G,LEE J H,KIM M H,et al.FINDIT:A fast and intelligent subspace clustering algorithm using dimension voting[J].Information and Software Technology,2004,46(4):255-271. [28] LI T,DING C,JORDAN M I.Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization[C]//Proceedings of 7th IEEE International Conference on Data Mining.Piscataway,NJ:IEEE Press,2007:577-582. [29] 杨燕,靳蕃,KAMEL M.聚类组合研究的新进展[J].计算机工程与应用,2008,44(11):142-144.YANG Y,JIN F,KAMEL M.Latest development of clustering ensemble[J].Computer Engineering and Applications,2008,44(11):142-144(in Chinese).
点击查看大图
计量
- 文章访问数: 687
- HTML全文浏览量: 82
- PDF下载量: 421
- 被引次数: 0