留言板

尊敬的读者、作者、审稿人, 关于本刊的投稿、审稿、编辑和出版的任何问题, 您可以本页添加留言。我们将尽快给您答复。谢谢您的支持!

姓名
邮箱
手机号码
标题
留言内容
验证码

一种傅里叶域海量数据高速谱聚类方法

张熳 徐兆瑞 沈项军

张熳, 徐兆瑞, 沈项军等 . 一种傅里叶域海量数据高速谱聚类方法[J]. 北京航空航天大学学报, 2022, 48(8): 1445-1454. doi: 10.13700/j.bh.1001-5965.2021.0537
引用本文: 张熳, 徐兆瑞, 沈项军等 . 一种傅里叶域海量数据高速谱聚类方法[J]. 北京航空航天大学学报, 2022, 48(8): 1445-1454. doi: 10.13700/j.bh.1001-5965.2021.0537
ZHANG Man, XU Zhaorui, SHEN Xiangjunet al. A high-speed spectral clustering method in Fourier domain for massive data[J]. Journal of Beijing University of Aeronautics and Astronautics, 2022, 48(8): 1445-1454. doi: 10.13700/j.bh.1001-5965.2021.0537(in Chinese)
Citation: ZHANG Man, XU Zhaorui, SHEN Xiangjunet al. A high-speed spectral clustering method in Fourier domain for massive data[J]. Journal of Beijing University of Aeronautics and Astronautics, 2022, 48(8): 1445-1454. doi: 10.13700/j.bh.1001-5965.2021.0537(in Chinese)

一种傅里叶域海量数据高速谱聚类方法

doi: 10.13700/j.bh.1001-5965.2021.0537
基金项目: 

国家自然科学基金 61572240

详细信息
    通讯作者:

    沈项军, E-mail: xjshen@ujs.edu.cn

  • 中图分类号: TP37

A high-speed spectral clustering method in Fourier domain for massive data

Funds: 

National Natural Science Foundation of China 61572240

More Information
  • 摘要:

    谱聚类方法广泛应用于数据挖掘和模式识别等领域,但大规模数据上高计算代价的特征向量求解及大数据带来的巨大内存需求,使得其应用于大规模数据时受到了极大的限制。为此,研究了基于傅里叶域的海量数据高速谱聚类方法。利用数据模式的重复性特点在傅里叶域建模,将耗时的特征向量计算转化为对预先确定的傅里叶域判别基进行选择来确定最终的特征向量,计算过程只需进行简单的乘法和加法运算,计算量得到极大的约减; 分批次训练样本,使用部分样本即可估计出整体数据的特征向量分布,确定最终的特征向量,压缩了计算时间和内存需求。在Ijcnn1、RCV1、Covtype-mult、Poker及MNIST-8M等大规模数据上的实验结果表明,所提方法在聚类精度等各项指标基本保持的前提下,训练时间相比FastESC、LSSHC、SC_RB、SSEIGS及USPEC等方法最高快了810.58倍,证明了所提方法在处理大规模聚类数据方面具有显著优势。

     

  • 图 1  本文方法流程示意图

    Figure 1.  Schematic diagram of the proposed method

    图 2  各方法在不同数据集上的运行时间对比

    Figure 2.  Running time comparison of each method on different datasets

    图 3  聚类准确率随维度变化的曲线

    Figure 3.  Variation curves of clustering accuracy with dimensionality

    表  1  大规模数据集

    Table  1.   Large-scale datasets

    数据集名称 样本数 特征维度 类别数
    Ijcnn1 126 701 22 2
    RCV1 534 135 47 236 52
    Covtype-mult 581 012 54 7
    Poker 1 025 010 10 10
    MNIST-8M 8 000 000 784 10
    下载: 导出CSV

    表  2  不同方法的平均准确率

    Table  2.   Average accuracy of various methods

    方法 Ijcnn1 RCV1 Covtype-mult Poker MNIST-8M
    FastESC 0.875 0±0.008 5 0.111 0±0 0.472 0±0.001 2 0.471 0±0.004 3 0.559 3±0.021 6
    LSSHC 0.576 9±0.010 4 0.207 3±0.000 2 0.233 9±0.000 1 0.666 7±0.014 7 0.778 5±0.010 6
    SC_RB 0.900 4±0 0.300 1±0.005 3 0.443 3±0.006 7 0.677 9±0.024 8 0.600 7±0.017 5
    SSEIGS 0.869 1±0 0.299 9±0 0.338 8±0.000 2 0.703 1±0.034 5 0.599 0±0.002 1
    USPEC 0.890 3±0.000 7 0.246 5±0.019 2 0.417 6±0.016 2 0.722 8±0.020 1 0.743 1±0.010 1
    本文方法 0.902 4±0 0.338 4±0.001 5 0.487 6±0 0.728 2±0.010 4 0.779 2±0.020 6
    下载: 导出CSV

    表  3  不同方法的平均纯度

    Table  3.   Average purity of various methods

    方法 Ijcnn1 RCV1 Covtype-mult Poker MNIST-8M
    FastESC 0.904 3±0 0.357 9±0 0.488 1±0 0.534 7±0 0.472 1±0.001 2
    LSSHC 0.904 3±0 0.432 4±0.000 1 0.510 5±0.000 1 0.679 1±0.021 3 0.624 1±0.019 9
    SC_RB 0.904 3±0 0.381 2±0.001 1 0.496 2±0.000 2 0.679 2±0.021 5 0.549 9±0.001 3
    SSEIGS 0.904 3±0 0.381 2±0.001 2 0.511 4±0.002 1 0.679 2±0.012 2 0.549 9±0.022 4
    USPEC 0.904 3±0.000 7 0.446 4±0.072 6 0.496 7±0.022 5 0.679 3±0.005 0 0.572 8±0.001 5
    本文方法 0.904 4±0 0.452 3±0.000 2 0.521 3±0 0.662 5±0.010 3 0.626 6±0.012 2
    下载: 导出CSV

    表  4  不同方法的平均精度

    Table  4.   Average precision of various methods

    方法 Ijcnn1 RCV1 Covtype-mult Poker MNIST-8M
    FastESC 0.826 9±0 0.200 7±0 0.378 3±0 0.668 0±0 0.319 5±0.002
    LSSHC 0.826 9±0 0.106 3±0.000 1 0.413 2±0.000 3 0.667 6±0.003 2 0.334 6±0.002 3
    SC_RB 0.826 9±0 0.465 4±0.012 0 0.393 9±0.000 1 0.664 3±0.026 4 0.319 9±0.004 6
    SSEIGS 0.827 0±0 0.466 5±0.010 2 0.529 1±0.010 1 0.663 4±0.021 3 0.376 1±0.015 6
    USPEC 0.826 7±0.000 6 0.297 6±0.090 7 0.373 0±0.012 7 0.667 8±0.021 5 0.379 4±0.011 2
    本文方法 0.827 0±0 0.466 7±0.000 5 0.386 5±0 0.668 1±0 0.424 0±0
    下载: 导出CSV

    表  5  不同方法的平均召回率

    Table  5.   Average recall of various methods

    方法 Ijcnn1 RCV1 Covtype-mult Poker MNIST-8M
    FastESC 0.999 9±0 0.040 4±0 0.753 4±0.021 6 0.677 6±0 0.332 9±0.001 2
    LSSHC 0.538 2±0.000 9 0.181 6±0.000 6 0.172 9±0.000 1 0.708 4±0.000 2 0.377 7±0.000 3
    SC_RB 0.999 9±0.000 4 0.410 4±0.010 2 0.786 0±0.038 5 0.702 5±0.012 3 0.330 1±0.001 1
    SSEIGS 0.969 1±0 0.399 9±0.012 6 0.599 7±0.000 9 0.705 5±0.004 3 0.481 5±0.001 1
    USPEC 0.970 2±0.000 9 0.111 7±0 0.706 7±0.047 4 0.710 0±0.001 2 0.486 8±0.021 3
    本文方法 0.999 9±0 0.415 1±0 0.786 3±0.024 8 0.705 1±0.010 2 0.486 9±0.002 3
    下载: 导出CSV

    表  6  不同方法的平均F1

    Table  6.   Average F1 of various methods

    方法 Ijcnn1 RCV1 Covtype-mult Poker MNIST-8M
    FastESC 0.905 2±0 0.067 3±0 0.492 2±0 0.500 1±0.000 1 0.326 1±0.000 2
    LSSHC 0.651 6±0.000 5 0.133 5±0.000 1 0.243 7±0.000 1 0.657 7±0.005 0 0.553 9±0.000 4
    SC_RB 0.905 0±0.000 1 0.350 1±0.001 2 0.445 4±0.001 0 0.662 8±0.001 2 0.427 0±0.021 2
    SSEIGS 0.892 1±0 0.359 8±0.001 2 0.493 2±0.000 3 0.657 8±0.016 1 0.496 1±0.010 1
    USPEC 0.892 7±0.000 7 0.162 1±0.003 8 0.487 6±0.021 7 0.678 4±0.001 2 0.424 8±0.002 3
    本文方法 0.905 2±0 0.365 5±0 0.498 8±0.002 0 0.644 4±0.000 1 0.554 8±0.000 1
    下载: 导出CSV
  • [1] PAN P, YOSHIDA Y. Average sensitivity of spectral clustering[C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. New York: ACM, 2020: 1132-1140.
    [2] XING L Q. Intelligent multimedia urban planning construction based on spectral clustering algorithms of large data mining[J]. Multimedia Tools and Applications, 2020, 79: 35183-35194. doi: 10.1007/s11042-019-7572-x
    [3] ALSHAMMARI M, STAVRAKAKIS J, TAKATSUKA M. Refining a k-nearest neighbor graph for a computationally efficient spectral clustering[J]. Pattern Recognition, 2021, 114: 107869. doi: 10.1016/j.patcog.2021.107869
    [4] 谢娟英, 丁丽娟, 王明钊. 基于谱聚类的无监督特征选择算法[J]. 软件学报, 2020, 31(4): 1009-1024. https://www.cnki.com.cn/Article/CJFDTOTAL-RJXB202004009.htm

    XIE J Y, DING L J, WANG M Z. Spectral clustering based unsupervised feature selection algorithms[J]. Journal of Software, 2020, 31(4): 1009-1024(in Chinese). https://www.cnki.com.cn/Article/CJFDTOTAL-RJXB202004009.htm
    [5] 朱光辉, 黄圣彬, 袁春风, 等. SCoS: 基于Spark的并行谱聚类算法设计与实现[J]. 计算机学报, 2018, 41(4): 868-885. https://www.cnki.com.cn/Article/CJFDTOTAL-JSJX201804008.htm

    ZHU G H, HUANG S B, YUAN C F, et al. SCoS: The design and implementation of parallel spectral clustering algorithm based on Spark[J]. Chinese Journal of Computers, 2018, 41(4): 868-885(in Chinese). https://www.cnki.com.cn/Article/CJFDTOTAL-JSJX201804008.htm
    [6] 李玉, 袁永华, 赵雪梅. 可变类谱聚类遥感影像分割[J]. 电子学报, 2018, 46(12): 3021-3028. doi: 10.3969/j.issn.0372-2112.2018.12.028

    LI Y, YUAN Y H, ZHAO X M. Spectral clustering of variable class for remote sensing image segmentation[J]. Acta Electronica Sinica, 2018, 46(12): 3021-3028(in Chinese). doi: 10.3969/j.issn.0372-2112.2018.12.028
    [7] ARRIETA J M, NAKASATO J C, PEREIRA M C. The p-Laplacian equation in thin domains: The unfolding approach[J]. Journal of Differential Equations, 2021, 274: 1-34. doi: 10.1016/j.jde.2020.12.004
    [8] FOWLKES C, BELONGIE S, CHUNG F, et al. Spectral grouping using the Nyström method[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004, 26(2): 214-225. doi: 10.1109/TPAMI.2004.1262185
    [9] DAKULAGI V. A new Nyström approximation based efficient coherent DOA estimator for radar application[J]. AEU-International Journal of Electronics and Communications, 2020, 124: 153328.
    [10] DU Y, TSANG L. Accurate calculations of emissivities of polar ocean surfaces between 0.5 and 2 GHz using an NTBC/Nyström/SMCG method[J]. IEEE Transactions on Geosciences and Remote Sensing, 2020, 58(4): 2732-2744. doi: 10.1109/TGRS.2019.2954886
    [11] 薛丽霞, 孙伟, 汪荣贵, 等. 基于密度峰值优化的谱聚类算法[J]. 计算机应用研究, 2019, 36(7): 1948-1950. https://www.cnki.com.cn/Article/CJFDTOTAL-JSYJ201907006.htm

    XUE L X, SUN W, WANG R G, et al. Spectral clustering based on density peak value optimization[J]. Application Research of Computers, 2019, 36(7): 1948-1950(in Chinese). https://www.cnki.com.cn/Article/CJFDTOTAL-JSYJ201907006.htm
    [12] BOUNEFFOUF D. Spectral clustering using eigenspectrum shape based Nyström sampling[EB/OL]. (2020-07-21)[2021-09-01]. https://arxiv.org/abs/2007.11416.
    [13] WU L F, CHEN P Y, YEN I E, et al. Scalable spectral clustering using random binning features[C]//Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. New York: ACM, 2018: 2506-2515.
    [14] YANG Y, DENG S, LU J, et al. GraphLSHC: Towards large scale spectral hypergraph clustering[J]. Information Sciences, 2021, 544: 117-134. doi: 10.1016/j.ins.2020.07.018
    [15] HENRIQUES J F, CASEIRO R, MARTINS P, et al. High-speed tracking with kernelized correlation filters[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(3): 583-596. doi: 10.1109/TPAMI.2014.2345390
    [16] HENRIQUES J F, CARREIRA J, CASEIRO R, et al. Beyond hard negative mining: Efficient detector learning via block-circulant decomposition[C]//Proceedings of the IEEE International Conference on Computer Vision. Piscataway: IEEE Press, 2014: 14144978.
    [17] BODDETI V N, KANADE T, KUMAR B V K V. Correlation filters for object alignment[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2013: 2291-2298.
    [18] LI H, RAY N, GUAN Y, et al. Fast large-scale spectral clustering via explicit feature mapping[J]. IEEE Transactions on Cybernetics, 2018, 49(3): 1058-1071.
    [19] RAHMAN M H, BOUGUILA N. Efficient feature mapping in classifying proportional data[J]. IEEE Access, 2020, 9: 3712-3724.
    [20] VÁZQUEZ-MARTÍN R, BANDERA A. Spatio-temporal feature-based keyframe detection from video shots using spectral clustering[J]. Pattern Recognition Letters, 2013, 34(7): 770-779. doi: 10.1016/j.patrec.2012.12.009
    [21] 万月, 陈秀宏, 何佳佳. 利用稀疏自编码的局部谱聚类映射方法[J]. 传感器与微系统, 2018, 37(1): 145-148. https://www.cnki.com.cn/Article/CJFDTOTAL-CGQJ201801040.htm

    WAN Y, CHEN X H, HE J J. Local spectral clustering mapping algorithm using sparse autoencoders[J]. Transducer and Microsystem Technologies, 2018, 37(1): 145-148(in Chinese). https://www.cnki.com.cn/Article/CJFDTOTAL-CGQJ201801040.htm
    [22] BARNHILL E, HOLLIS L, SACK I, et al. Nonlinear multiscale regularisation in MR elastography: Towards fine feature mapping[J]. Medical Image Analysis, 2017, 35: 133-145. doi: 10.1016/j.media.2016.05.012
    [23] HANSEN T J, MAHONEY M W. Semi-supervised eigenvectors for large-scale locally-biased learning[J]. The Journal of Machine Learning Research, 2014, 15(1): 3691-3734.
    [24] HUANG D, WANG C D, WU J S, et al. Ultra-scalable spectral clustering and ensemble clustering[J]. IEEE Transactions on Knowledge and Data Engineering, 2020, 32(6): 1212-1226. doi: 10.1109/TKDE.2019.2903410
    [25] DONATELLI M, NOVARA P, ROMANI L, et al. A merged tuning of binary and ternary Loop's subdivision[J]. Computer Aided Geometric Design, 2019, 69: 27-44. doi: 10.1016/j.cagd.2018.12.005
    [26] GODI P K, KRISHNA B T, KOTIPALLI P. Design optimization of multiplier-free parallel pipelined FFT on field programmable gate array[J]. IET Circuits Devices & Systems, 2020, 14(7): 995-1000.
    [27] GAO S J. Fast incremental spectral clustering in titanate application via graph Fourier transform[J]. IEEE Access, 2020, 8: 57252-57259. doi: 10.1109/ACCESS.2020.2982439
    [28] BIBI A, ITANI H, GHANEM B. FFTLasso: Large-scale LASSO in the Fourier domain[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2017: 17355224.
    [29] ANTOGNINI J M, SOHL-DICKSTEIN J. PCA of high dimensional random walks with comparison to neural network training[EB/OL]. (2018-06-22)[2021-09-01]. https://arxiv.org/abs/1806.08805v1.
    [30] DRINEAS P, KANNAN R, MAHONEY M W. Fast Monte Carlo algorithms for matrices Ⅱ: Computing a low-rank approximation to a matrix[J]. SIAM Journal on Computing, 2006, 36(1): 158-183. doi: 10.1137/S0097539704442696
    [31] DRINEAS P, KANNAN R. Fast Monte-Carlo algorithms for approximate matrix multiplication[C]//Proceedings on the 42nd IEEE Symposium on Foundations of Computer Science. Piscataway: IEEE Press, 2001: 452-459.
    [32] LOOSLI G, CANU S, BOTTOU L. Training invariant support vector machines using selective sampling[M]//BOTTOU L, CHAPELLE O, WESTON J. Large scale kernel machines. Cambridge: MIT Press, 2007.
  • 加载中
图(3) / 表(6)
计量
  • 文章访问数:  66
  • HTML全文浏览量:  12
  • PDF下载量:  19
  • 被引次数: 0
出版历程
  • 收稿日期:  2021-09-08
  • 录用日期:  2021-10-17
  • 刊出日期:  2022-05-19

目录

    /

    返回文章
    返回
    常见问答