Abstract: To further study dimension reduction for high-dimensional data in regression models, a new supervised clustering of variables method based on the Gram-Schmidt transformation (SCV-GS) is proposed. Instead of hierarchical clustering around latent variables, SCV-GS borrows the idea of variable screening: it selects, one at a time, the key variables that contribute most to the response variable and uses them as cluster centers. Highly correlated variables are processed in batches via the Gram-Schmidt transformation to obtain the clustering result. In addition, drawing on the idea of partial least squares, a new "homogeneity" criterion is proposed and used to select the optimal aggregation parameter. SCV-GS not only produces the variable clustering quickly but also identifies the variable groups that play a key role in explaining and predicting the response variable, and the structure through which these variables influence the response. Simulations show that SCV-GS is significantly faster, and that the estimated regression coefficients of the resulting latent variables are consistent with those of the comparison method. A real-data analysis shows that SCV-GS offers better interpretability and predictive ability.

Key words:
- dimension reduction
- variable clustering
- regression
- high correlation
- Gram-Schmidt transformation
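The abstract describes the procedure only in outline, so the following is a minimal sketch of one possible reading of it, not the authors' algorithm. It assumes that a cluster center is the remaining variable most correlated with the current response residual, that a variable joins a cluster when its absolute correlation with the center reaches a cutoff, and that the remaining variables and the response are then orthogonalized against the center by a Gram-Schmidt step. The function name `gs_supervised_clusters` and the parameter `threshold` are hypothetical, and the homogeneity criterion for choosing the aggregation parameter is not modeled here.

```python
import numpy as np


def gs_supervised_clusters(X, y, threshold=0.8):
    """Greedy Gram-Schmidt variable clustering (illustrative sketch only).

    `threshold` is a hypothetical aggregation parameter: a variable joins the
    current cluster when its absolute correlation with the cluster center is
    at least `threshold`. Returns a list of (center_index, member_indices).
    Variables left over once the response residual vanishes stay unassigned.
    """
    R = np.asarray(X, dtype=float)
    R = R - R.mean(axis=0)            # centered working copy of the predictors
    r_y = np.asarray(y, dtype=float)
    r_y = r_y - r_y.mean()            # centered response residual
    remaining = list(range(R.shape[1]))
    clusters = []

    while remaining:
        cols = R[:, remaining]
        norms = np.linalg.norm(cols, axis=0)
        y_norm = np.linalg.norm(r_y)
        valid = norms > 1e-10         # ignore columns that are numerically zero
        if y_norm < 1e-12 or not valid.any():
            break

        # Screening step: pick the variable most correlated with the residual.
        corr_y = np.full(len(remaining), -1.0)
        corr_y[valid] = np.abs(cols[:, valid].T @ r_y) / (norms[valid] * y_norm)
        k = int(np.argmax(corr_y))
        center = remaining[k]
        q = cols[:, k] / norms[k]     # unit vector along the chosen center

        # Batch step: gather every remaining variable highly correlated with it.
        corr_c = np.zeros(len(remaining))
        corr_c[valid] = np.abs(cols[:, valid].T @ q) / norms[valid]
        members = [v for v, c in zip(remaining, corr_c)
                   if c >= threshold or v == center]
        clusters.append((center, members))

        # Gram-Schmidt step: remove the component along q from what is left.
        member_set = set(members)
        remaining = [v for v in remaining if v not in member_set]
        if remaining:
            R[:, remaining] -= np.outer(q, q @ R[:, remaining])
        r_y = r_y - q * (q @ r_y)

    return clusters


# Synthetic check: two blocks of near-duplicate variables, y driven by block 1.
rng = np.random.default_rng(0)
n = 200
z1, z2 = rng.normal(size=n), rng.normal(size=n)
block1 = [z1 + 0.05 * rng.normal(size=n) for _ in range(3)]
block2 = [z2 + 0.05 * rng.normal(size=n) for _ in range(3)]
X = np.column_stack(block1 + block2)
y = 2.0 * z1 + 0.1 * rng.normal(size=n)

clusters = gs_supervised_clusters(X, y, threshold=0.8)
for center, members in clusters:
    print("center:", center, "members:", members)
```

In this sketch each pass costs a few matrix-vector products, which is one plausible reason a screening-style approach could run faster than hierarchical clustering. The cutoff of 0.8 is an arbitrary illustration, not a value taken from the paper.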
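Table 1. Variable clustering results by SCV-GS

| Variable | G1 | G2 | G3 | G4 | G5 |
|---|---|---|---|---|---|
| x1 | 100 | 0 | 0 | 0 | 0 |
| x2 | 100 | 0 | 0 | 0 | 0 |
| x3 | 100 | 0 | 0 | 0 | 0 |
| x4 | 100 | 0 | 0 | 0 | 0 |
| x5 | 100 | 0 | 0 | 0 | 0 |
| x6 | 100 | 0 | 0 | 0 | 0 |
| x7 | 100 | 0 | 0 | 0 | 0 |
| x8 | 100 | 0 | 0 | 0 | 0 |
| x9 | 100 | 0 | 0 | 0 | 0 |
| x10 | 100 | 0 | 0 | 0 | 0 |
| x11 | 100 | 0 | 0 | 0 | 0 |
| x12 | 100 | 0 | 0 | 0 | 0 |
| x13 | 100 | 0 | 0 | 0 | 0 |
| x14 | 100 | 0 | 0 | 0 | 0 |
| x15 | 100 | 0 | 0 | 0 | 0 |
| x16 | 100 | 0 | 0 | 0 | 0 |
| x17 | 99 | 0 | 0 | 0 | 0 |
| x18 | 100 | 0 | 0 | 0 | 0 |
| x19 | 99 | 0 | 0 | 0 | 0 |
| x20 | 100 | 0 | 0 | 0 | 0 |
| x21 | 0 | 97 | 2 | 0 | 0 |
| x22 | 0 | 100 | 0 | 0 | 0 |
| x23 | 0 | 99 | 1 | 0 | 0 |
| x24 | 0 | 100 | 0 | 0 | 0 |
| x25 | 0 | 99 | 0 | 0 | 0 |
| x26 | 0 | 100 | 0 | 0 | 0 |
| x27 | 0 | 99 | 1 | 0 | 0 |
| x28 | 0 | 98 | 1 | 0 | 0 |
| x29 | 0 | 99 | 0 | 0 | 0 |
| x30 | 0 | 99 | 1 | 0 | 0 |
| x31 | 0 | 99 | 1 | 0 | 0 |
| x32 | 0 | 98 | 2 | 0 | 0 |
| x33 | 0 | 99 | 1 | 0 | 0 |
| x34 | 0 | 100 | 0 | 0 | 0 |
| x35 | 0 | 100 | 0 | 0 | 0 |
| x36 | 0 | 100 | 0 | 0 | 0 |
| x37 | 0 | 100 | 0 | 0 | 0 |
| x38 | 0 | 99 | 0 | 0 | 1 |
| x39 | 0 | 100 | 0 | 0 | 0 |
| x40 | 0 | 100 | 0 | 0 | 0 |
| x41 | 0 | 0 | 100 | 0 | 0 |
| x42 | 0 | 0 | 100 | 0 | 0 |
| x43 | 0 | 0 | 100 | 0 | 0 |
| x44 | 0 | 0 | 100 | 0 | 0 |
| x45 | 0 | 0 | 100 | 0 | 0 |
| x46 | 0 | 1 | 99 | 0 | 0 |
| x47 | 0 | 0 | 99 | 0 | 1 |
| x48 | 0 | 0 | 100 | 0 | 0 |
| x49 | 0 | 0 | 100 | 0 | 0 |
| x50 | 0 | 0 | 100 | 0 | 0 |
| x51 | 0 | 0 | 0 | 100 | 0 |
| x52 | 0 | 0 | 0 | 100 | 0 |
| x53 | 0 | 0 | 0 | 100 | 0 |
| x54 | 0 | 0 | 0 | 100 | 0 |
| x55 | 0 | 0 | 0 | 100 | 0 |
| x56 | 0 | 0 | 0 | 100 | 0 |
| x57 | 0 | 0 | 0 | 100 | 0 |
| x58 | 0 | 0 | 0 | 100 | 0 |
| x59 | 0 | 0 | 0 | 100 | 0 |
| x60 | 0 | 0 | 0 | 100 | 0 |
| x61 | 0 | 0 | 0 | 0 | 100 |
| x62 | 0 | 0 | 0 | 0 | 100 |
| x63 | 0 | 0 | 0 | 0 | 100 |
| x64 | 0 | 0 | 0 | 0 | 100 |
| x65 | 0 | 0 | 0 | 0 | 100 |
| x66 | 0 | 0 | 0 | 0 | 100 |
| x67 | 0 | 0 | 0 | 0 | 100 |
| x68 | 0 | 0 | 0 | 0 | 100 |
| x69 | 0 | 0 | 0 | 0 | 100 |
| x70 | 0 | 0 | 0 | 0 | 100 |
| x71 | 0 | 0 | 0 | 0 | 100 |
| x72 | 0 | 0 | 0 | 0 | 100 |
| x73 | 0 | 0 | 0 | 0 | 100 |
| x74 | 0 | 0 | 0 | 0 | 100 |
| x75 | 0 | 0 | 0 | 0 | 100 |
| x76 | 0 | 0 | 0 | 0 | 100 |
| x77 | 0 | 0 | 0 | 0 | 100 |
| x78 | 0 | 0 | 0 | 0 | 100 |
| x79 | 0 | 0 | 0 | 0 | 100 |
| x80 | 0 | 0 | 0 | 0 | 100 |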
Table 2. Estimated regression coefficients of the latent variables obtained by SCV-LV and SCV-GS

| Method | | | |
|---|---|---|---|
| SCV-LV | -0.07(0.44) | -0.08(0.43) | 0.03(0.28) |
| SCV-GS | -0.10(0.44) | -0.10(0.43) | 0.02(0.28) |
Table 3. Results of SCV-LV and SCV-GS on the real dataset

| Method | K | | Variables |
|---|---|---|---|
| SCV-LV | 1 | 0.83 | x1~x19 |
| SCV-GS | 16 | 0.87 | x2, x3 |