-
摘要:
应用模拟退火算法解决组合聚类问题有两方面,一是有效利用基础聚类作为先验信息,以获得尽可能好的组合聚类结果;二是降低模拟退火过程的随机性,提高算法收敛速度。针对这2个问题,提出了基于投票的快速模拟退火(BV-RSA)模型。该模型利用基础聚类对样本划分的完全或部分一致性作为启发信息,构建超点集合和超点投票箱,由超点取代其代表的样本子集参与退火过程,超点运动方向在投票箱范围内随机选择,降低了超点运动随机性,加速了组合聚类过程。数据集实验表明,BV-RSA模型在聚类精度和鲁棒性方面表现良好。
Abstract:There are two key issues in applying simulated annealing algorithm to solve the problem of ensemble clustering. One is how to use basic partition information in annealing process to obtain better result, and the other is how to accelerate the algorithm convergence. In this paper, the rapid simulated annealing based on voting (BV-RSA) model is presented, in which the complete and partial consensuses of basic partitions are used to recognize super-objects and construct voting box for each super-object. In the process of simulated annealing, some data samples represented by a super-object are controlled to move in a group, and the motion direction of a super-object is selected randomly in the scope of its voting box, thus reducing moving randomness and speeding up the clustering of super-objects. Experiments on multiple data sets demonstrate that the BV-RSA model performs well in both clustering accuracy and robustness.
-
Key words:
- ensemble clustering /
- simulated annealing /
- super-object /
- voting /
- combinatorial optimization
-
表 1 基础聚类与超点示例
Table 1. An example of basic partition and super-objects
数据样本 基础聚类的簇标签 超点 π1 π2 π3 x1 C1 C2 C3 S1 x2 C1 C2 C3 x3 C1 C2 C3 x4 C1 C1 C1 S2 x5 C2 C1 C1 S3 x6 C2 C1 C1 x7 C2 C1 C1 x8 C2 C1 C1 x9 C3 C3 C1 S4 x10 C3 C3 C2 S5 x11 C3 C3 C2 x12 C3 C3 C2 表 2 实验数据集描述
Table 2. Description of experimental datasets
数据集 样本数 属性数 类别数 来源 iris 150 3 3 UCI glass 214 9 6 UCI segment 2 310 19 7 UCI hitech 2 301 126 321 6 CLUTO k1b 2 340 21 839 6 CLUTO la12 6 279 31 472 6 CLUTO re1 1 657 3 758 25 CLUTO reviews 4 069 126 373 5 CLUTO sports 8 580 126 373 7 CLUTO tr11 414 6 429 9 CLUTO tr12 313 5 804 8 CLUTO tr41 878 7 454 10 CLUTO tr45 690 8 261 10 CLUTO letter 20 000 16 26 LIBSVM mnist 70 000 786 10 LIBSVM 表 3 各模型聚类结果的精度比较
Table 3. Comparison of clustering result precision of different models
数据集 BV-RSA/C BV-RSA/R CSPA HGPA MCLA k-means iris 0.91 0.91 0.79 0.73 0.91 0.89 glass 0.53 0.54 0.60 0.56 0.53 0.54 segment 0.67 0.65 0.59 0.52 0.67 0.66 hitech 0.56 0.56 0.56 0.53 0.56 0.51 k1b 0.89 0.87 0.86 0.86 0.91 0.64 la12 0.74 0.75 0.71 0.60 0.76 0.69 re1 0.71 0.70 0.70 0.62 0.71 0.43 reviews 0.67 0.67 0.68 0.62 0.67 0.64 sports 0.93 0.93 0.65 0.49 0.93 0.64 tr11 0.85 0.86 0.79 0.65 0.85 0.68 tr12 0.78 0.74 0.79 0.73 0.78 0.64 tr41 0.82 0.81 0.72 0.61 0.81 0.65 tr45 0.80 0.78 0.79 0.65 0.74 0.68 letter 0.34 0.33 0.24 0.24 0.30 0.26 mnist 0.57 0.55 — 0.35 0.56 0.54 表 4 输入数据描述
Table 4. Description of input data
字段名称 字段含义 uid 司机id morning_cnt 司机在早高峰时间段总共接单数 night_cnt 司机在晚高峰时间段总共接单数 usual_cnt 司机在平峰时间段总共接单数 weekday_cnt 司机在工作日总共接单数 weekend_cnt 司机在周末总共接单数 income 司机的收入 complain_rate 司机被投诉率 route_num 司机设置有效的常用路线数 gender 性别(1-男;2-女) Age 年龄 -
[1] NGUYEN N, CARUANA R.Consensus clusterings[C]//IEEE International Conference on Data Mining.Piscataway, NJ: IEEE Press, 2007: 607-612. [2] STREHL A, GHOSH J.Cluster ensembles:A knowledge reuse framework for combining partitionings[J].Journal of Machine Learning Research, 2002, 3(3):583-617. http://cn.bing.com/academic/profile?id=95607976ad8495bf99b3e460dbad4285&encoded=0&v=paper_preview&mkt=zh-cn [3] AYAD H, KAMEL M.Refined shared nearest neighbors graph for combining multiple data clusterings[C]//Advances in Intelligent Data Analysis.Berlin: Springer, 2003: 307-318. https://www.researchgate.net/publication/221460998_Refined_Shared_Nearest_Neighbors_Graph_for_Combining_Multiple_Data_Clusterings [4] YANG Y, KAMEL M S.An aggregated clustering approach using multi-ant colonies algorithms[J].Pattern Recognition, 2006, 39(7):1278-1289. doi: 10.1016/j.patcog.2006.02.012 [5] FERN X Z, BRODLEY C E.Solving cluster ensemble problems by bipartite graph partitioning[C]//Proceedings of Twenty-First International Conference on Machine Learning.New York: ACM, 2004: 281-288. http://web.engr.oregonstate.edu/~xfern/graph_icml04.pdf [6] YU Z, HAN G, LI L, et al.Adaptive noise immune cluster ensemble using affinity propagation[C]//IEEE International Conference on Data Engineering.Piscataway, NJ: IEEE Press, 2016: 1454-1455. Adaptive noise immune cluster ensemble using affinity propagation [7] FRED A L N, JAIN A K.Combining multiple clusterings using evidence accumulation[J].IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005, 27(6):835-850. doi: 10.1109/TPAMI.2005.113 [8] WANG X, YANG C, ZHOU J.Clustering aggregation by probability accumulation[J].Pattern Recognition, 2009, 42(5):668-675. doi: 10.1016/j.patcog.2008.09.013 [9] HU M, DENG X, YAO Y.A sequential three-way approach to constructing a co-association matrix in consensus clustering[C]//International Joint Conference on Rough Sets.Berlin: Springer, 2018, 11103: 599-613. doi: 10.1007%2F978-3-319-99368-3_47 [10] LIU H, LIU T, WU J, et al.Spectral ensemble clustering[C]//Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining.New York: ACM, 2015: 715-724. https://wenku.baidu.com/view/b5f0cb917fd5360cbb1adb37.html [11] HUANG D, WANG C D, LAI J H.Locally weighted ensemble clustering[J].IEEE Transactions on Cybernetics, 2016, 48(5):1460-1473. http://d.old.wanfangdata.com.cn/Periodical/nygcxb201314022 [12] ZHOU Z H, TANG W.Clusterer ensemble[J].Knowledge-Based Systems, 2006, 19(1):77-83. http://d.old.wanfangdata.com.cn/Periodical/rjxb200504002 [13] FU Y, YANG Y, LIU Y.A decision model for fuzzy clustering ensemble[C]//Proceedings of the International Conference on Intelligent Systems and Knowledge Engineering.Paris: Atlantis Press, 2007. https://www.researchgate.net/publication/266650562_A_Decision_Model_for_Fuzzy_Clustering_Ensemble [14] 陈晓云, 陈刚.基于最大内聚度基准的加权投票聚类集成[J].控制与决策, 2014(2):236-240. http://d.old.wanfangdata.com.cn/Periodical/kzyjc201402008CHEN X Y, CHEN G.Weighted voting clustering ensemble based on maximum cohesion[J].Control and Decision, 2014(2):236-240(in Chinese). http://d.old.wanfangdata.com.cn/Periodical/kzyjc201402008 [15] LU Z, PENG Y, XIAO J.From comparing clusterings to combining clusterings[C]//23rd AAAI Conference on Artificial Intelligence and the 20th Innovative Applications of Artificial Intelligence Conference.Palo Alto: AAAI, 2008, 2: 665-670. https://www.aaai.org/Papers/AAAI/2008/AAAI08-106.pdf [16] HUANG D, LAI J, WANG C D.Ensemble clustering using factor graph[J].Pattern Recognition, 2016, 50(C):131-142. http://www.wanfangdata.com.cn/details/detail.do?_type=perio&id=b837e47b9d728114f197c99a1582ceec [17] TOPCHY A P, JAIN A K, PUNCH W F.A mixture model for clustering ensembles[C]//Proceedings of the Fourth SIAM International Conference on Data Mining.Philadelphia: SIAM Publications, 2004: 379-390. http://dataclustering.cse.msu.edu/papers/topchy_mixture_siam_accepted.pdf [18] 蒋君, 徐蔚鸿, 潘楚.基于粒计算和模拟退火的K-medoids聚类算法[J].计算机仿真, 2015, 32(12):214-217. doi: 10.3969/j.issn.1006-9348.2015.12.045JIANG J, XU W H, PAN C.Improved K-medoids clustering algorithm based on many factors[J].Computer Simulation, 2015, 32(12):214-217(in Chinese). doi: 10.3969/j.issn.1006-9348.2015.12.045 [19] SARTAKHTI J S, AFRABANDPEY H, SARAEE M.Simulated annealing least squares twin support vector machine (SA-LSTSVM) for pattern classification[J].Soft Computing, 2017, 21(15):4361-4373. doi: 10.1007/s00500-016-2067-4