Citation: LIU Ruiping, WANG Huiwen, WANG Shanshan, et al. Supervised clustering of variables based on Gram-Schmidt transformation[J]. Journal of Beijing University of Aeronautics and Astronautics, 2019, 45(10): 2003-2010. doi: 10.13700/j.bh.1001-5965.2019.0050 (in Chinese)
To further study regression-based dimension reduction for high-dimensional data, a supervised clustering of variables algorithm based on the Gram-Schmidt transformation (SCV-GS) is proposed. Unlike hierarchical clustering of variables around latent variables, SCV-GS uses key variables, selected one at a time by a variable screening procedure, as the cluster centers. High correlation among variables is handled by the Gram-Schmidt transformation, which yields the clustering results. In addition, drawing on the idea of partial least squares, a new "homogeneity" criterion is proposed for selecting the optimal clustering parameters. SCV-GS not only produces variable clustering results quickly, but also identifies the variable groups most relevant to the response variable and the structure through which those variables influence it. Simulation results show that SCV-GS significantly improves computation speed, and that the estimated regression coefficients of the latent variables are consistent with those of the comparison method. Real data analysis shows that SCV-GS performs better in interpretation and prediction.
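The abstract gives no algorithmic detail, so the following is only a rough sketch of the general idea it describes: screen for a key variable, use it as a cluster center, then remove its influence with a Gram-Schmidt step before screening again. The function name `scv_gs_sketch`, the fixed correlation `threshold`, and the stopping rule are all assumptions made for illustration. In particular, the fixed threshold stands in for the paper's "homogeneity" criterion, which is not specified here, so this should not be read as the authors' implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def corr(u, v):
    """Correlation of two centered vectors; 0.0 if either is near zero."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    if nu < 1e-12 or nv < 1e-12:
        return 0.0
    return dot(u, v) / (nu * nv)

def residual(v, z):
    """Gram-Schmidt step: subtract from v its projection onto z."""
    coef = dot(v, z) / dot(z, z)
    return [a - coef * b for a, b in zip(v, z)]

def scv_gs_sketch(columns, y, threshold=0.8, max_clusters=None):
    """Hypothetical simplification, not the published SCV-GS procedure.

    columns: dict mapping variable name -> list of observations.
    y: list of response observations.
    Returns a list of (center_name, member_names) pairs.
    """
    work = {name: center(col) for name, col in columns.items()}
    resp = center(y)
    clusters = []
    while work and (max_clusters is None or len(clusters) < max_clusters):
        # Screening: the variable most correlated with the current
        # (already deflated) response becomes the next cluster center.
        best = max(work, key=lambda n: abs(corr(work[n], resp)))
        z = work[best]
        if dot(z, z) < 1e-12:
            break  # nothing informative is left
        # Assumed grouping rule: variables highly correlated with the center.
        members = [n for n in work
                   if n == best or abs(corr(work[n], z)) >= threshold]
        clusters.append((best, members))
        for n in members:
            del work[n]
        # Deflation: orthogonalize the remaining variables and the
        # response against the selected center.
        work = {n: residual(v, z) for n, v in work.items()}
        resp = residual(resp, z)
    return clusters
```

Calling `scv_gs_sketch({"x1": [...], "x2": [...]}, y)` returns the clusters in the order their centers were selected, so under these assumptions earlier clusters are the ones more strongly related to the response.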