Abstract: Clustering by fast search and find of density peaks (DPC) is a density-based clustering algorithm that requires neither iteration nor many preset parameters. However, because the local structure of the data is not considered when the local density is computed, DPC fails to identify the centers of low-density clusters. To solve this problem, a DPC algorithm based on K-reciprocal neighbors (KN) and kernel density estimation (KDE), termed KKDPC, was proposed. First, the number of K-reciprocal neighbors and the local kernel density of each data point were obtained using the K-nearest-neighbor and KDE methods. Then, the number of K-reciprocal neighbors and the local kernel density were summed to obtain a new local density. Finally, the relative distance of each data point was computed from the local densities, and cluster centers were selected and non-center points assigned by constructing a decision graph. Experiments were conducted on artificial and real datasets, and KKDPC was compared with seven clustering algorithms: DPC, density-based spatial clustering of applications with noise (DBSCAN), K-means, fuzzy C-means (FCM), DPC based on K-nearest neighbors (DPC-KNN), DPC with nearest-neighbor optimization (DPC-NNO), and DPC based on fuzzy and weighted shared neighbors (DPC-FWSN). The performance of KKDPC was evaluated with the adjusted mutual information (AMI), adjusted Rand index (ARI), and normalized mutual information (NMI). The experimental results show that KKDPC identifies cluster centers more accurately and effectively improves clustering accuracy.
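The density construction described above can be summarized in a short sketch. The snippet below is a minimal illustration only: it assumes Euclidean distances, a Gaussian kernel with bandwidth h, and a neighborhood size k, and the function name kkdpc_decision_values is purely illustrative; the paper's exact kernel form, bandwidth rule, and parameter settings are not reproduced here.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def kkdpc_decision_values(X, k=10, h=1.0):
    """Sketch of the density/relative-distance computation described in the abstract.

    k (neighborhood size) and h (kernel bandwidth) are illustrative parameters;
    the paper's exact kernel and bandwidth choices are not reproduced here.
    """
    n = X.shape[0]
    dist = squareform(pdist(X))                      # pairwise Euclidean distances

    # k nearest neighbors of each point (column 0 of the argsort is the point itself)
    knn = np.argsort(dist, axis=1)[:, 1:k + 1]
    is_knn = np.zeros((n, n), dtype=bool)
    np.put_along_axis(is_knn, knn, True, axis=1)

    # K-reciprocal (mutual) neighbors: j is a k-NN of i AND i is a k-NN of j
    kn_count = (is_knn & is_knn.T).sum(axis=1)

    # Gaussian kernel density estimated over each point's k nearest neighbors
    knn_dist = np.take_along_axis(dist, knn, axis=1)
    kde = np.exp(-(knn_dist / h) ** 2).sum(axis=1)

    # New local density: sum of the neighbor count and the kernel density term
    rho = kn_count + kde

    # Relative distance: distance to the nearest point of higher density
    # (the global density peak takes the maximum distance instead)
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        delta[i] = dist[i].max() if higher.size == 0 else dist[i, higher].min()

    return rho, delta, rho * delta                   # gamma = rho * delta for the decision graph
```

As in standard DPC, the points with the largest gamma = rho * delta would be taken as cluster centers, and each remaining point assigned to the cluster of its nearest neighbor of higher density.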
Table 1. Artificial datasets
Table 2. Experimental results on artificial datasets
AMI
Algorithm      A3       Aggregation  Flame    Jain     Line blobs  Spiral
KKDPC          0.9775   1.0000       1.0000   1.0000   1.0000      1.0000
DPC[16]        0.9767   0.9922       1.0000   0.7092   0.7799      1.0000
DBSCAN[12]     0.8674   0.9841       0.9386   1.0000   1.0000      1.0000
K-means[8]     0.9639   0.8357       0.5506   0.4916   0.5905      −0.0047
FCM[24]        0.9373   0.8050       0.5897   0.5152   0.5866      −0.0047
DPC-KNN[17]    0.9774   0.9922       1.0000   0.6319   0.7794      1.0000
DPC-NNO[22]    0.9794   0.9905       1.0000   1.0000   1.0000      1.0000
DPC-FWSN[23]   0.9773   0.9955       1.0000   1.0000   1.0000      1.0000

ARI
Algorithm      A3       Aggregation  Flame    Jain     Line blobs  Spiral
KKDPC          0.9659   1.0000       1.0000   1.0000   1.0000      1.0000
DPC[16]        0.9629   0.9956       1.0000   0.8224   0.7210      1.0000
DBSCAN[12]     0.6859   0.9911       0.9866   1.0000   1.0000      1.0000
K-means[8]     0.9214   0.7653       0.5658   0.5767   0.5008      −0.0050
FCM[24]        0.8593   0.7318       0.6224   0.5853   0.4942      −0.0050
DPC-KNN[17]    0.9656   0.9956       1.0000   0.7577   0.7179      1.0000
DPC-NNO[22]    0.9665   0.9949       1.0000   1.0000   1.0000      1.0000
DPC-FWSN[23]   0.9639   0.9978       1.0000   1.0000   1.0000      1.0000

NMI
Algorithm      A3       Aggregation  Flame    Jain     Line blobs  Spiral
KKDPC          0.9786   1.0000       1.0000   1.0000   1.0000      1.0000
DPC[16]        0.9778   0.9924       1.0000   0.7429   0.7885      1.0000
DBSCAN[12]     0.9204   0.9860       0.9690   1.0000   1.0000      1.0000
K-means[8]     0.9686   0.8836       0.5652   0.5288   0.5958      0.0011
FCM[24]        0.9457   0.8401       0.6048   0.5547   0.5915      0.0004
DPC-KNN[17]    0.9787   0.9924       1.0000   0.6762   0.7871      1.0000
DPC-NNO[22]    0.9803   0.9915       1.0000   1.0000   1.0000      1.0000
DPC-FWSN[23]   0.9784   0.9958       1.0000   1.0000   1.0000      1.0000

Table 3. UCI datasets
Dataset      Samples  Dimensions  Clusters
Blood        748      4           2
Breast       277      9           2
Diabetes     768      8           2
Ecoli        336      7           8
Heart        270      13          2
Ionosphere   351      34          2
Iris         150      4           3
Pima         768      8           2
Seeds        210      7           3
Thyroid      215      5           3
Vehicle      846      18          4
WDBC         569      30          2
Wine         178      13          3
Zoo          101      16          7
Banknote     1372     4           2
Waveform     5000     21          3
MFCCs        7195     22          10
Pendigits    10992    16          10

Table 4. Experimental results on UCI datasets
AMI
Dataset      KKDPC    DPC[16]  DBSCAN[12]  K-means[8]  FCM[24]  DPC-KNN[17]  DPC-NNO[22]  DPC-FWSN[23]
Blood        0.0591   0.0885   0.0084      −0.0008     −0.0007  0.0101       0.0437       0.0503
Breast       0.0741   −0.0014  0.0017      0.0741      −0.0014  0.0672       0.0634       0.0696
Diabetes     0.0450   0.0341   0.0776      0.0512      0.0625   0.0017       0.0440       0.0022
Ecoli        0.6798   0.4890   0.5305      0.5855      0.4653   0.6146       0.6762       0.6477
Heart        0.3133   0.2297   0.0850      0.2832      0.2552   0.2453       0.2793       0.2632
Ionosphere   0.3213   0.0876   0.2123      0.1294      0.1246   0.1504       0.3667       0.3811
Iris         0.8623   0.8623   0.6044      0.7331      0.7372   0.8623       0.8831       0.8831
Pima         0.0450   0.0341   0.0776      0.0512      0.0771   0.0017       0.0440       0.0022
Seeds        0.7233   0.7172   0.5414      0.6615      0.6785   0.7063       0.7738       0.7855
Thyroid      0.5133   0.3544   0.5501      0.5168      0.5897   0.1945       0.6685       0.4123
Vehicle      0.2365   0.1647   0.1636      0.1584      0.0949   0.1736       0.1937       0.1814
WDBC         0.7792   0.0267   0.3749      0.6110      0.6076   0.5069       0.7355       0.7076
Wine         0.7783   0.7065   0.5484      0.8735      0.8281   0.7391       0.8886       0.8886
Zoo          0.8994   0.8896   0.8206      0.8581      0.7409   0.7471       0.8487       0.7262
Banknote     0.7553   0.9317   0.7624      0.0168      0.0329   0.9235       0.9377       0.9359
Waveform     0.3679   0.2405   0.0127      0.3642      0.3302   0.3568       0.3629       0.3550
MFCCs        0.7961   0.6390   0.7266      0.7541      0.4616   0.7451       0.7641       0.7233
Pendigits    0.7332   0.7321   0.5175      0.7029      0.6181   0.7696       0.8082       0.7821

ARI
Dataset      KKDPC    DPC[16]  DBSCAN[12]  K-means[8]  FCM[24]  DPC-KNN[17]  DPC-NNO[22]  DPC-FWSN[23]
Blood        0.1110   0.1969   0.0328      −0.0062     −0.0072  0.0311       0.0262       0.0652
Breast       0.1622   −0.0031  0.0159      0.1570      −0.0031  0.1528       0.1284       0.1622
Diabetes     0.1072   0.0780   0.1481      0.1040      0.1069   0.0143       0.0995       0.0131
Ecoli        0.7601   0.5618   0.6491      0.5373      0.3689   0.7347       0.7346       0.7410
Heart        0.4036   0.3056   0.0180      0.3666      0.3314   0.3227       0.3663       0.3485
Ionosphere   0.4350   0.1317   0.2227      0.1776      0.1727   0.2357       0.4840       0.4915
Iris         0.8857   0.8857   0.6397      0.7613      0.7287   0.8857       0.9038       0.9038
Pima         0.1072   0.0780   0.1481      0.1040      0.1176   0.0143       0.0995       0.0131
Seeds        0.7641   0.7341   0.5449      0.6934      0.7149   0.7419       0.8246       0.8369
Thyroid      0.5999   0.4258   0.7932      0.6382      0.6927   0.2731       0.7712       0.4550
Vehicle      0.1578   0.1273   0.0682      0.1092      0.0745   0.1093       0.1552       0.1369
WDBC         0.8702   0.0048   0.4941      0.7302      0.7305   0.5773       0.8306       0.8113
Wine         0.7869   0.6724   0.5292      0.8992      0.8498   0.7269       0.9149       0.9149
Zoo          0.9570   0.9570   0.8074      0.8942      0.6444   0.6224       0.9260       0.7706
Banknote     0.8379   0.9624   0.8266      0.0223      0.0452   0.9567       0.9624       0.9653
Waveform     0.3161   0.2267   0.0036      0.2536      0.2436   0.3057       0.2977       0.3032
MFCCs        0.9053   0.7292   0.8554      0.8956      0.2080   0.8411       0.8842       0.8133
Pendigits    0.5790   0.6350   0.5980      0.6233      0.5291   0.6238       0.7009       0.6645

NMI
Dataset      KKDPC    DPC[16]  DBSCAN[12]  K-means[8]  FCM[24]  DPC-KNN[17]  DPC-NNO[22]  DPC-FWSN[23]
Blood        0.0670   0.0927   0.0261      0.0003      0.0004   0.0350       0.0483       0.0575
Breast       0.1074   0.0013   0.0172      0.0784      0.0013   0.0710       0.1074       0.0962
Diabetes     0.0623   0.0356   0.0806      0.0527      0.0647   0.0171       0.0598       0.0154
Ecoli        0.7394   0.5761   0.5889      0.6653      0.5620   0.7077       0.6903       0.7091
Heart        0.3161   0.2381   0.1513      0.2862      0.2583   0.2550       0.2868       0.2696
Ionosphere   0.3516   0.0916   0.2723      0.1349      0.1299   0.1529       0.4018       0.4153
Iris         0.8642   0.8642   0.6762      0.7419      0.7433   0.8642       0.8851       0.8851
Pima         0.0623   0.0356   0.0806      0.0527      0.0802   0.0171       0.0598       0.0154
Seeds        0.7266   0.7238   0.6090      0.6654      0.6822   0.7103       0.7759       0.7875
Thyroid      0.5833   0.4825   0.6487      0.6027      0.6657   0.3333       0.6911       0.5021
Vehicle      0.2668   0.1960   0.1935      0.2034      0.0986   0.2033       0.2114       0.2141
WDBC         0.7811   0.0562   0.3772      0.6232      0.6152   0.5475       0.7471       0.7182
Wine         0.7838   0.7104   0.5918      0.8782      0.8336   0.7435       0.8926       0.8926
Zoo          0.9169   0.9119   0.8568      0.8893      0.8158   0.8048       0.8998       0.7950
Banknote     0.7553   0.9317   0.7624      0.0174      0.0329   0.9235       0.9317       0.9359
Waveform     0.3679   0.2405   0.3449      0.3642      0.3302   0.3568       0.3629       0.3550
MFCCs        0.7961   0.6390   0.7266      0.7541      0.4616   0.7451       0.7641       0.7233
Pendigits    0.7739   0.7544   0.6990      0.7065      0.6393   0.7967       0.8249       0.8126
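For reference, the three external validity indices reported in Tables 2 and 4 can be computed from ground-truth and predicted labels with scikit-learn. The snippet below is an illustrative evaluation helper, not the paper's evaluation code; the function name external_indices is an assumption.

```python
from sklearn.metrics import (adjusted_mutual_info_score, adjusted_rand_score,
                             normalized_mutual_info_score)

def external_indices(labels_true, labels_pred):
    """AMI, ARI, and NMI as reported in Tables 2 and 4 (all are label-permutation-invariant)."""
    return {
        "AMI": adjusted_mutual_info_score(labels_true, labels_pred),
        "ARI": adjusted_rand_score(labels_true, labels_pred),
        "NMI": normalized_mutual_info_score(labels_true, labels_pred),
    }

# A clustering that matches the ground truth up to label renaming scores 1.0 on all three indices.
print(external_indices([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))
```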
References
[1] ZUO W D, HOU X M. An improved probability propagation algorithm for density peak clustering based on natural nearest neighborhood[J]. Array, 2022, 15: 100232. doi: 10.1016/j.array.2022.100232
[2] KHATER I M, NABI I R, HAMARNEH G. A review of super-resolution single-molecule localization microscopy cluster analysis and quantification methods[J]. Patterns, 2020, 1(3): 100038. doi: 10.1016/j.patter.2020.100038
[3] GRIFFIÉ J, BURN G L, WILLIAMSON D J, et al. Dynamic Bayesian cluster analysis of live-cell single molecule localization microscopy datasets[J]. Small Methods, 2018, 2(9): 1800008. doi: 10.1002/smtd.201800008
[4] FANG U, LI J X, LU X Q, et al. Robust image clustering via context-aware contrastive graph learning[J]. Pattern Recognition, 2023, 138: 109340. doi: 10.1016/j.patcog.2023.109340
[5] GUAN J Y, LI S, HE X X, et al. Peak-graph-based fast density peak clustering for image segmentation[J]. IEEE Signal Processing Letters, 2021, 28: 897-901. doi: 10.1109/LSP.2021.3072794
[6] HU N, TIAN Z H, LU H, et al. A multiple-kernel clustering based intrusion detection scheme for 5G and IoT networks[J]. International Journal of Machine Learning and Cybernetics, 2021, 12(11): 3129-3144. doi: 10.1007/s13042-020-01253-w
[7] ZHANG X M, XIE B, XU T T, et al. Intrusion detection algorithm based on weighted Kmeans clustering with reverse K-nearest neighbor and density peak initialization[J]. Journal of Nanjing University of Science and Technology, 2023, 47(1): 56-65 (in Chinese).
[8] MACQUEEN J. Some methods for classification and analysis of multivariate observations[J]. Berkeley Symposium on Mathematical Statistics and Probability, 1967, 1: 281-297.
[9] PARK H S, JUN C H. A simple and fast algorithm for K-medoids clustering[J]. Expert Systems with Applications, 2009, 36(2): 3336-3341. doi: 10.1016/j.eswa.2008.01.039
[10] GUHA S, RASTOGI R, SHIM K. CURE: an efficient clustering algorithm for large databases[J]. Information Systems, 2001, 26(1): 35-58. doi: 10.1016/S0306-4379(01)00008-4
[11] ZHANG T, RAMAKRISHNAN R, LIVNY M. BIRCH: an efficient data clustering method for very large databases[J]. ACM SIGMOD Record, 1996, 25(2): 103-114. doi: 10.1145/235968.233324
[12] ESTER M, KRIEGEL H P, SANDER J, et al. A density-based algorithm for discovering clusters in large spatial databases with noise[C]//Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. Palo Alto: AAAI Press, 1996: 226-231.
[13] ANKERST M, BREUNIG M M, KRIEGEL H P, et al. OPTICS: ordering points to identify the clustering structure[J]. ACM SIGMOD Record, 1999, 28(2): 49-60. doi: 10.1145/304181.304187
[14] SHEIKHOLESLAMI G, CHATTERJEE S, ZHANG A D. WaveCluster: a wavelet-based clustering approach for spatial data in very large databases[J]. The VLDB Journal, 2000, 8(3): 289-304.
[15] WANG W, YANG J, MUNTZ R R. STING: a statistical information grid approach to spatial data mining[C]//Proceedings of the 23rd International Conference on Very Large Data Bases. New York: ACM, 1997: 186-195.
[16] RODRIGUEZ A, LAIO A. Clustering by fast search and find of density peaks[J]. Science, 2014, 344(6191): 1492-1496. doi: 10.1126/science.1242072
[17] DU M J, DING S F, JIA H J. Study on density peaks clustering based on K-nearest neighbors and principal component analysis[J]. Knowledge-Based Systems, 2016, 99: 135-145. doi: 10.1016/j.knosys.2016.02.001
[18] JIANG D, ZANG W K, SUN R, et al. Adaptive density peaks clustering based on K-nearest neighbor and Gini coefficient[J]. IEEE Access, 2020, 8: 113900-113917. doi: 10.1109/ACCESS.2020.3003057
[19] YUAN X N, YU H, LIANG J, et al. A novel density peaks clustering algorithm based on K nearest neighbors with adaptive merging strategy[J]. International Journal of Machine Learning and Cybernetics, 2021, 12(10): 2825-2841. doi: 10.1007/s13042-021-01369-7
[20] XIE J Y, GAO H C, XIE W X, et al. Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors[J]. Information Sciences, 2016, 354: 19-40. doi: 10.1016/j.ins.2016.03.011
[21] LIU R, WANG H, YU X M. Shared-nearest-neighbor-based clustering by fast search and find of density peaks[J]. Information Sciences, 2018, 450: 200-226. doi: 10.1016/j.ins.2018.03.031
[22] CHEN W C, ZHAO J, XIAO R B, et al. Density peaks clustering algorithm with nearest neighbor optimization for data with uneven density distribution[J]. Control and Decision, 2024, 39(3): 919-928 (in Chinese).
[23] ZHAO J, WANG G, PAN J S, et al. Density peaks clustering algorithm based on fuzzy and weighted shared neighbor for uneven density datasets[J]. Pattern Recognition, 2023, 139: 109406. doi: 10.1016/j.patcog.2023.109406
[24] BEZDEK J C. Pattern recognition with fuzzy objective function algorithms[M]. New York: Plenum Press, 1981.
[25] DING S F, LI C, XU X, et al. A sampling-based density peaks clustering algorithm for large-scale data[J]. Pattern Recognition, 2023, 136: 109238. doi: 10.1016/j.patcog.2022.109238
[26] CHENG D D, ZHANG S L, HUANG J L. Dense members of local cores-based density peaks clustering algorithm[J]. Knowledge-Based Systems, 2020, 193: 105454. doi: 10.1016/j.knosys.2019.105454
[27] KARKKAINEN I, FRANTI P. Dynamic local search algorithm for the clustering problem[M]. Joensuu: University of Joensuu, 2002.
[28] GIONIS A, MANNILA H, TSAPARAS P. Clustering aggregation[J]. ACM Transactions on Knowledge Discovery from Data, 2007, 1(1): 4. doi: 10.1145/1217299.1217303
[29] FU L M, MEDICO E. FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data[J]. BMC Bioinformatics, 2007, 8(1): 3. doi: 10.1186/1471-2105-8-3
[30] JAIN A K, LAW M H C. Data clustering: a user's dilemma[M]//Pattern Recognition and Machine Intelligence. Berlin: Springer, 2005: 1-10.
[31] XU X, DING S F, WANG L J, et al. A robust density peaks clustering algorithm with density-sensitive similarity[J]. Knowledge-Based Systems, 2020, 200: 106028. doi: 10.1016/j.knosys.2020.106028
[32] CHANG H, YEUNG D Y. Robust path-based spectral clustering[J]. Pattern Recognition, 2008, 41(1): 191-203. doi: 10.1016/j.patcog.2007.04.010
[33] BACHE K, LICHMAN M. UCI machine learning repository[EB/OL]. (2020-12-04)[2023-02-01]. http://archive.ics.uci.edu/ml.
[34] SAMARIA F S, HARTER A C. Parameterisation of a stochastic model for human face identification[C]//Proceedings of the IEEE Workshop on Applications of Computer Vision. Piscataway: IEEE Press, 1994: 138-142.