Abstract: Spectral clustering is widely used in data mining and pattern recognition. However, the high computational cost of solving for eigenvectors and the huge memory requirements of big data severely limit its application to large-scale datasets. This paper therefore studies a high-speed spectral clustering method for massive data in the Fourier domain. Exploiting the repeatability of data patterns, the method models the problem in the Fourier domain, so that the time-consuming eigenvector computation is transformed into a selection among pre-determined Fourier-domain discriminant bases; the calculation then requires only simple multiplications and additions, greatly reducing the computational load. Because the computation is carried out in the Fourier domain, the samples can also be trained in batches: the eigenvector distribution of the whole dataset can be estimated well from only part of the samples, compressing both training time and memory requirements. Experimental results on large-scale datasets such as Ijcnn1, RCV1, Covtype-mult, Poker, and MNIST-8M show that, while clustering accuracy and other indicators are essentially maintained, the training time of the proposed method is up to 810.58 times shorter than that of FastESC, LSSHC, SC_RB, SSEIGS, and USPEC, demonstrating a significant advantage in processing large-scale clustering data.
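The abstract's key idea rests on a classical fact: a matrix with repeating (circulant) structure is diagonalized by the Fourier basis, so its eigenvectors are known in advance and its eigenvalues are just the FFT of its first column. The sketch below is not the paper's full method; it is a minimal numpy illustration (function name and test data are our own) of how, under a circulant-affinity assumption, the eigensolver step of spectral clustering collapses into an FFT plus a selection step.

```python
import numpy as np

def fourier_eigendecomposition(c, k):
    """Eigen-pairs of the circulant matrix whose first column is ``c``.

    All n eigenvalues come from a single FFT of ``c``; the eigenvectors
    are pre-determined Fourier basis columns, so "computing" the top-k
    eigenvectors reduces to *selecting* the k columns whose eigenvalues
    have the largest magnitude -- no iterative eigensolver is needed.
    In a spectral-clustering pipeline ``k`` would be the cluster count.
    """
    n = len(c)
    eigvals = np.fft.fft(c)                        # all n eigenvalues at once
    m = np.arange(n)
    # Fourier basis: column j has entries exp(2*pi*i*j*m/n) / sqrt(n)
    basis = np.exp(2j * np.pi * np.outer(m, m) / n) / np.sqrt(n)
    top = np.argsort(-np.abs(eigvals))[:k]         # select, don't solve
    return eigvals[top], basis[:, top]

# Verify against a dense circulant matrix built explicitly.
rng = np.random.default_rng(0)
c = rng.random(8)
C = np.array([np.roll(c, i) for i in range(8)]).T  # circulant, first column c
vals, vecs = fourier_eigendecomposition(c, 3)
for lam, v in zip(vals, vecs.T):
    assert np.allclose(C @ v, lam * v)             # genuine eigen-pairs
```

Real affinity matrices are only approximately circulant, which is why the paper's discriminant-basis selection and batch-wise estimation are needed on top of this basic reduction.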
Table 1. Large-scale datasets

Dataset        Samples     Dimensions  Classes
Ijcnn1         126,701     22          2
RCV1           534,135     47,236      52
Covtype-mult   581,012     54          7
Poker          1,025,010   10          10
MNIST-8M       8,000,000   784         10
Table 2. Average accuracy of various methods
Method     Ijcnn1           RCV1             Covtype-mult     Poker            MNIST-8M
FastESC    0.8750±0.0085    0.1110±0         0.4720±0.0012    0.4710±0.0043    0.5593±0.0216
LSSHC      0.5769±0.0104    0.2073±0.0002    0.2339±0.0001    0.6667±0.0147    0.7785±0.0106
SC_RB      0.9004±0         0.3001±0.0053    0.4433±0.0067    0.6779±0.0248    0.6007±0.0175
SSEIGS     0.8691±0         0.2999±0         0.3388±0.0002    0.7031±0.0345    0.5990±0.0021
USPEC      0.8903±0.0007    0.2465±0.0192    0.4176±0.0162    0.7228±0.0201    0.7431±0.0101
Proposed   0.9024±0         0.3384±0.0015    0.4876±0         0.7282±0.0104    0.7792±0.0206
Table 3. Average purity of various methods
Method     Ijcnn1           RCV1             Covtype-mult     Poker            MNIST-8M
FastESC    0.9043±0         0.3579±0         0.4881±0         0.5347±0         0.4721±0.0012
LSSHC      0.9043±0         0.4324±0.0001    0.5105±0.0001    0.6791±0.0213    0.6241±0.0199
SC_RB      0.9043±0         0.3812±0.0011    0.4962±0.0002    0.6792±0.0215    0.5499±0.0013
SSEIGS     0.9043±0         0.3812±0.0012    0.5114±0.0021    0.6792±0.0122    0.5499±0.0224
USPEC      0.9043±0.0007    0.4464±0.0726    0.4967±0.0225    0.6793±0.0050    0.5728±0.0015
Proposed   0.9044±0         0.4523±0.0002    0.5213±0         0.6625±0.0103    0.6266±0.0122
Table 4. Average precision of various methods
Method     Ijcnn1           RCV1             Covtype-mult     Poker            MNIST-8M
FastESC    0.8269±0         0.2007±0         0.3783±0         0.6680±0         0.3195±0.002
LSSHC      0.8269±0         0.1063±0.0001    0.4132±0.0003    0.6676±0.0032    0.3346±0.0023
SC_RB      0.8269±0         0.4654±0.0120    0.3939±0.0001    0.6643±0.0264    0.3199±0.0046
SSEIGS     0.8270±0         0.4665±0.0102    0.5291±0.0101    0.6634±0.0213    0.3761±0.0156
USPEC      0.8267±0.0006    0.2976±0.0907    0.3730±0.0127    0.6678±0.0215    0.3794±0.0112
Proposed   0.8270±0         0.4667±0.0005    0.3865±0         0.6681±0         0.4240±0
Table 5. Average recall of various methods
Method     Ijcnn1           RCV1             Covtype-mult     Poker            MNIST-8M
FastESC    0.9999±0         0.0404±0         0.7534±0.0216    0.6776±0         0.3329±0.0012
LSSHC      0.5382±0.0009    0.1816±0.0006    0.1729±0.0001    0.7084±0.0002    0.3777±0.0003
SC_RB      0.9999±0.0004    0.4104±0.0102    0.7860±0.0385    0.7025±0.0123    0.3301±0.0011
SSEIGS     0.9691±0         0.3999±0.0126    0.5997±0.0009    0.7055±0.0043    0.4815±0.0011
USPEC      0.9702±0.0009    0.1117±0         0.7067±0.0474    0.7100±0.0012    0.4868±0.0213
Proposed   0.9999±0         0.4151±0         0.7863±0.0248    0.7051±0.0102    0.4869±0.0023
Table 6. Average F1 of various methods
Method     Ijcnn1           RCV1             Covtype-mult     Poker            MNIST-8M
FastESC    0.9052±0         0.0673±0         0.4922±0         0.5001±0.0001    0.3261±0.0002
LSSHC      0.6516±0.0005    0.1335±0.0001    0.2437±0.0001    0.6577±0.0050    0.5539±0.0004
SC_RB      0.9050±0.0001    0.3501±0.0012    0.4454±0.0010    0.6628±0.0012    0.4270±0.0212
SSEIGS     0.8921±0         0.3598±0.0012    0.4932±0.0003    0.6578±0.0161    0.4961±0.0101
USPEC      0.8927±0.0007    0.1621±0.0038    0.4876±0.0217    0.6784±0.0012    0.4248±0.0023
Proposed   0.9052±0         0.3655±0         0.4988±0.0020    0.6444±0.0001    0.5548±0.0001