Incremental computing methods of canonical correlation analysis for compositional data streams

KONG Boao; LU Shan; WANG Huiwen

doi:10.13700/j.bh.1001-5965.2021.0765

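KONG Boao^1, LU Shan^2, WANG Huiwen^{3, 4}

1. School of Mathematical Science, Beihang University, Beijing 100191, China
2. School of Statistics and Mathematics, Central University of Finance and Economics, Beijing 100081, China
3. School of Economics and Management, Beihang University, Beijing 100191, China
4. Beijing Key Laboratory of Emergency Support Simulation Technologies for City Operations, Beijing 100191, China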

Funds: National Natural Science Foundation of China (72021001, 72001222)

Corresponding author: E-mail: shan.lu@cufe.edu.cn

CLC number: O212.4
Publication history
- Received: 2021-12-20
- Accepted: 2022-05-06
- Published online: 2022-05-16
- Issue published: 2023-10-31

Abstract:
Canonical correlation analysis (CCA) for compositional data is a method for studying the linear correlations among multiple sets of compositional variables, and it is widely applied in economics, management, geology, chemistry, and other fields. With massive data now common, performing canonical correlation modeling on compositional data streams is of both theoretical and practical value. To this end, an incremental method of CCA for compositional data is proposed, which computes the canonical correlations of a compositional data stream exactly by decomposing the covariance matrix of the incremental compositional data. In addition, two block incremental algorithms, sequential and parallel, are given for modeling data streams of multiple groups of compositional data. The sequential block incremental algorithm processes the blocks in the order in which they arrive in the stream, whereas the parallel block incremental algorithm is designed to improve computational efficiency. Simulation studies on compositional data streams with different probability distributions and sample sizes, together with a case study of fake news on Weibo, show that the proposed algorithms run faster than the traditional non-incremental algorithm while preserving computational accuracy.
- compositional data
- canonical correlation analysis
- data streams
- covariance matrix
- eigenvalue decomposition
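
The covariance decomposition mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the isometric log-ratio (ilr) transform with a Helmert-type basis, the pairwise merge formula for block means and scatter matrices, and every function name (`ilr`, `block_stats`, `merge_stats`, `canonical_correlations`) are assumptions chosen for illustration. The sketch shows that per-block statistics can be folded in stream order (sequential) or computed independently and then merged (parallel), and that both routes give the same canonical correlations as a single pass over all the data.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

import numpy as np


def ilr(x):
    """Assumed ilr transform of an (n, D) array of strictly positive compositions."""
    n_parts = x.shape[1]
    basis = np.zeros((n_parts, n_parts - 1))
    for j in range(1, n_parts):
        # Helmert-type contrast: zero column sum, unit norm after scaling.
        basis[:j, j - 1] = 1.0 / j
        basis[j, j - 1] = -1.0
        basis[:, j - 1] *= np.sqrt(j / (j + 1.0))
    return np.log(x) @ basis


def block_stats(z):
    """Sample size, mean vector, and centered scatter matrix of one block."""
    mean = z.mean(axis=0)
    centered = z - mean
    return z.shape[0], mean, centered.T @ centered


def merge_stats(a, b):
    """Exact statistics of the union of two blocks, computed from their summaries."""
    n_a, mean_a, scatter_a = a
    n_b, mean_b, scatter_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * (n_b / n)
    scatter = scatter_a + scatter_b + np.outer(delta, delta) * (n_a * n_b / n)
    return n, mean, scatter


def canonical_correlations(scatter, p):
    """Canonical correlations from a joint scatter matrix; the first p columns belong to X."""
    sxx, syy, sxy = scatter[:p, :p], scatter[p:, p:], scatter[:p, p:]
    lx = np.linalg.cholesky(sxx)
    ly = np.linalg.cholesky(syy)
    m = np.linalg.solve(lx, sxy)        # Lx^{-1} Sxy
    m = np.linalg.solve(ly, m.T).T      # Lx^{-1} Sxy Ly^{-T}
    return np.linalg.svd(m, compute_uv=False)


# Synthetic compositions (hypothetical data, only for the demonstration).
rng = np.random.default_rng(0)
x = rng.dirichlet(2.0 * np.ones(4), size=3000)
w = rng.gamma(shape=2.0, size=(3000, 5))
w[:, :4] *= x                           # make y depend on x
y = w / w.sum(axis=1, keepdims=True)

z = np.hstack([ilr(x), ilr(y)])         # 3 + 4 log-ratio coordinates
blocks = np.array_split(z, 6)           # six blocks of the stream

# Sequential: fold the blocks in arrival order.
n, mean, scatter = reduce(merge_stats, map(block_stats, blocks))
rho_incremental = canonical_correlations(scatter, p=3)

# Parallel: summarize blocks independently, then merge the summaries.
with ThreadPoolExecutor() as pool:
    summaries = list(pool.map(block_stats, blocks))
rho_parallel = canonical_correlations(reduce(merge_stats, summaries)[2], p=3)

# Non-incremental reference: one pass over all rows.
rho_batch = canonical_correlations(block_stats(z)[2], p=3)

print(rho_incremental)
print(rho_batch)
```

Because canonical correlations are invariant to rescaling the covariance matrix, the sketch works with the scatter matrix directly and never divides by the sample size.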

Figure 1. Comparison of running time of CCA for compositional data with three incremental methods

Figure 2. Average relative error of the cross-covariance matrix estimated by the parallel block incremental method

Figure 3. Comparison of running time of four CCA algorithms on fake news data from Weibo

Table 1. Running time of incremental methods (s)

$D$	$n$	$\theta$	One-time incremental algorithm	Sequential block incremental algorithm	Parallel block incremental algorithm	Non-incremental algorithm
4	10 000	0.01	0.0017 ($2.91\times 10^{-7}$)	0.0024 ($2.71\times 10^{-7}$)	0.0030 ($1.79\times 10^{-7}$)	0.3095 ($1.60\times 10^{-4}$)
4	100 000	0.01	0.0135 ($1.32\times 10^{-6}$)	0.0180 ($3.31\times 10^{-6}$)	0.0101 ($3.37\times 10^{-7}$)	3.1154 ($9.69\times 10^{-3}$)
4	100 000	0.1	0.1307 ($3.16\times 10^{-5}$)	0.1741 ($1.21\times 10^{-4}$)	0.1014 ($2.14\times 10^{-5}$)	3.3075 ($5.98\times 10^{-3}$)
4	200 000	0.1	0.2688 ($2.06\times 10^{-3}$)	0.3496 ($4.08\times 10^{-4}$)	0.1936 ($4.36\times 10^{-5}$)	6.6716 ($1.84\times 10^{-2}$)
5	10 000	0.01	0.0025 ($2.65\times 10^{-7}$)	0.0041 ($2.22\times 10^{-7}$)	0.0037 ($1.34\times 10^{-7}$)	0.5534 ($4.35\times 10^{-4}$)
5	100 000	0.01	0.0239 ($1.59\times 10^{-5}$)	0.0329 ($5.41\times 10^{-4}$)	0.0187 ($5.89\times 10^{-7}$)	5.3998 ($2.71\times 10^{-2}$)
5	100 000	0.1	0.2204 ($1.17\times 10^{-4}$)	0.2970 ($2.88\times 10^{-4}$)	0.1685 ($2.27\times 10^{-5}$)	5.7539 ($1.56\times 10^{-2}$)
5	200 000	0.1	0.4430 ($5.27\times 10^{-4}$)	0.6009 ($1.52\times 10^{-3}$)	0.3363 ($2.31\times 10^{-4}$)	11.6523 ($5.86\times 10^{-1}$)
6	10 000	0.01	0.0037 ($1.05\times 10^{-6}$)	0.0060 ($3.95\times 10^{-7}$)	0.0052 ($1.69\times 10^{-7}$)	0.8399 ($9.31\times 10^{-4}$)
6	100 000	0.01	0.0334 ($5.38\times 10^{-6}$)	0.0458 ($2.08\times 10^{-5}$)	0.0285 ($2.47\times 10^{-6}$)	8.3388 ($6.95\times 10^{-2}$)
6	100 000	0.1	0.3341 ($4.51\times 10^{-4}$)	0.4469 ($5.04\times 10^{-4}$)	0.2588 ($8.46\times 10^{-5}$)	8.9295 ($4.47\times 10^{-2}$)
6	200 000	0.1	0.6664 ($8.13\times 10^{-4}$)	0.9018 ($2.18\times 10^{-3}$)	0.5164 ($3.06\times 10^{-4}$)	17.8481 ($2.16\times 10^{-1}$)

Table 2. Canonical variables and canonical correlations between different emotions and topics in fake news of Weibo

$h$	$\rho_h$
1	0.4519
2	0.2459
3	0.1319
