Abstract:
To study dimension reduction and visualization of multivariate interval data, each interval is represented by a two-dimensional array consisting of its center and the logarithm of its radius (half-length). Algebraic operation rules for interval data are established on this representation, and a new principal component analysis (PCA) method for interval data is proposed on that basis. Taking the logarithm of the interval radius guarantees that the radii of the resulting interval principal components are non-negative. The computation is simple and of low complexity, and the relative positions of the sample points change as little as possible before and after dimension reduction. Reducing the number of variables in the high-dimensional space allows a range of classical statistical analysis methods to be applied, and the sample points of the original high-dimensional space can be plotted in a low-dimensional space, which makes visualization of multivariate interval data possible. Simulation experiments confirm the effectiveness of the proposed method.

Keywords:
- interval data
- principal component analysis (PCA)
- center-log-radius
- dimension reduction
- covariance matrix
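The center and log-radius idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's exact algorithm: the function name `c_lnr_pca`, the choice to sum the center and log-radius covariances, and the omission of standardization are all assumptions made for illustration. What it does show is why exponentiating a projected log-radius always yields a valid interval.

```python
import numpy as np

def c_lnr_pca(lower, upper, k=2):
    """Illustrative center/log-radius PCA (assumed formulation, hypothetical name)."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)

    # Represent each interval [a, b] by its center and the log of its radius.
    center = (lower + upper) / 2.0
    log_radius = np.log((upper - lower) / 2.0)  # requires upper > lower

    # Center both parts column-wise (standardization is omitted in this sketch).
    c0 = center - center.mean(axis=0)
    l0 = log_radius - log_radius.mean(axis=0)

    # ASSUMPTION: the p x p covariance is the sum of the center and
    # log-radius contributions; the paper may define it differently.
    n = center.shape[0]
    cov = (c0.T @ c0 + l0.T @ l0) / n

    # Eigen-decomposition of the symmetric matrix, largest eigenvalues first.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:k]
    axes = eigvecs[:, order]

    # Project centers and log-radii with the same loadings, then map the
    # log-radius back with exp so every principal-component radius is positive.
    pc_center = c0 @ axes
    pc_radius = np.exp(l0 @ axes)
    return pc_center - pc_radius, pc_center + pc_radius, eigvals[order]
```

Because the radius is recovered through `np.exp`, the upper bound of every interval principal component exceeds its lower bound regardless of the sign of the loadings, which is the property the log transform is meant to secure.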
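Table 1. Comparison of time complexity among V-PCA, C-PCA and C-lnR PCA methods

| Computation step | V-PCA | C-PCA | C-lnR PCA |
| --- | --- | --- | --- |
| Mean of $X_j$ | $n \cdot 2^p$ values involved | $n$ values involved | $2n$ values involved |
| Variance of $X_j$ | $n \cdot 2^p$ values involved | $n$ values involved | $2n$ values involved |
| Standardization of $X_j$ | $n \cdot 2^p$ values involved | $n$ values involved | $2n$ values involved |
| Covariance matrix | $C_p^2 \times (n \cdot 2^p)$ multiplications | $C_p^2 \times n$ multiplications | $C_p^2 \times (2n)$ multiplications |
| $h$-th interval principal component of each sample | $p \cdot 2^p$ multiplications | $2p$ multiplications | $(2p+1)$ multiplications |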
Table 2. Average values of validity index of C-lnR PCA, C-PCA and V-PCA

| p | Method | n=6 | n=12 | n=24 | n=48 |
| --- | --- | --- | --- | --- | --- |
| 4 | C-lnR PCA | 0.7177 | 0.6641 | 0.6579 | 0.6278 |
| 4 | C-PCA | 0.6681 | 0.6442 | 0.6380 | 0.6231 |
| 4 | V-PCA | 0.6267 | 0.6233 | 0.6165 | 0.6039 |
| 8 | C-lnR PCA | 0.7496 | 0.6985 | 0.6658 | 0.6405 |
| 8 | C-PCA | 0.6651 | 0.6414 | 0.6332 | 0.6206 |
| 8 | V-PCA | 0.6534 | 0.6322 | 0.6228 | 0.6200 |
| 12 | C-lnR PCA | 0.7929 | 0.7230 | 0.6819 | 0.6518 |
| 12 | C-PCA | 0.6676 | 0.6554 | 0.6366 | 0.6199 |
| 12 | V-PCA | 0.6613 | 0.6418 | 0.6270 | 0.6125 |
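To make the covariance-step counts in Table 1 concrete, the short sketch below evaluates them for the sample sizes and dimensions used in Table 2. The function name `covariance_multiplications` is hypothetical, and the formulas assume the V-PCA entries are read as $n \cdot 2^p$ (one row per hyper-rectangle vertex).

```python
from math import comb

def covariance_multiplications(n, p):
    """Multiplication counts for the covariance step, following the Table 1 formulas."""
    pairs = comb(p, 2)  # C_p^2: number of distinct variable pairs
    return {
        "V-PCA": pairs * n * 2**p,   # every sample expands to 2^p vertices
        "C-PCA": pairs * n,          # centers only
        "C-lnR PCA": pairs * 2 * n,  # centers plus log-radii
    }

# Compare the three methods at n = 48 for each dimension used in Table 2.
for p in (4, 8, 12):
    print(p, covariance_multiplications(48, p))
```

The V-PCA count grows exponentially in p, while C-lnR PCA stays at exactly twice the C-PCA count for any n and p.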