From grid to "East-west Computing Transfer": Constructing national computing infrastructure
Abstract: This article reviews how the way computers are used has evolved over the past several decades and presents the design and implementation of CNGrid, China's national high performance computing infrastructure built on network computing technology. It then discusses new trends in the development of China's computing power under the national strategic project "East-west Computing Transfer", along with the new technical challenges facing the national computing infrastructure. Finally, it offers perspectives on building China's supercomputing application ecosystem and constructing a new type of computing infrastructure.
Key words:
- high performance computing
- infrastructure
- grid computing
- CNGrid
- East-west Computing Transfer
Table 1. Selected grid-related research programs worldwide

| Country/region | Grid computing research programs/projects |
| --- | --- |
| USA | TeraGrid, XSEDE |
| EU | EGEE, EGI |
| UK | UK e-Science |
| Japan | NAREGI, HPCI |
| Korea | K*Grid |
| China | CNGrid, ChinaGrid |

Table 2. Grid and high performance computing projects under the Ministry of Science and Technology of China
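
| Funding program | Project name | Period | Outcomes |
| --- | --- | --- | --- |
| National 863 Program, major topic | National High Performance Computing Environment | 1999-2000 | Dawning 3000 (400 GFlops); prototype of a national high performance computing environment comprising 5 high performance computing centers |
| National 863 Program, major special project | High Performance Computer and Core Software | 2002-2005 | Dawning 4000 (11.2 TFlops), Lenovo DeepComp 6800 (5.36 TFlops); national high performance computing environment testbed "China National Grid (CNGrid)" with 8 nodes and 18 TFlops of computing capability; a set of grid applications |
| National 863 Program, major project | High Productivity Computer and Grid Service Environment | 2006-2010 | Tianhe-1A (4,700 TFlops), Dawning 6000 (3,000 TFlops), Sunway BlueLight (1,071 TFlops); service-oriented national grid service environment CNGrid with 11 nodes and 8,000 TFlops of computing capability; a set of grid and high performance computing applications |
| National 863 Program, major project | High Productivity Computer and Application Service Environment | 2011-2015 | Sunway TaihuLight (125 PFlops), Tianhe-2A (100 PFlops); national high performance computing environment CNGrid, with services supporting applications, 14 nodes and 200 PFlops of computing capability; a set of high performance computing applications |
| National Key R&D Program | High Performance Computing | 2016-2021 | Exascale (E-class) computers; national high performance computing environment CNGrid in an initial infrastructure form, with 19 nodes and 520 PFlops of computing capability; a set of high performance computing applications |

| Neural network model | Application domain | Training compute (FLOPs) |
| --- | --- | --- |
| AlexNet | Image classification | 4.7×10^17 |
| VGG16 | Image classification | 8.5×10^18 |
| YOLOv3 | Image object detection | 5.1×10^19 |
| Transformer | Natural language processing | 7.4×10^18 |
| GPT-3 | Natural language processing | 3.1×10^23 |

Note: Data from https://docs.google.com/spreadsheets/d/1AAIebjNsnJj_uKALHbXNfn3_YsT6sHXtCU0q7OIPuc4.
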
Table 4. Top 10 high performance computing systems in the TOP500 list (June 2022)[26]

| Rank | System | Processor/accelerator | Linpack performance (PFlops) |
| --- | --- | --- | --- |
| 1 | Frontier | AMD 64C + AMD MI250X | 1,102 |
| 2 | Fugaku | A64FX 48C | 442.01 |
| 3 | LUMI | AMD 64C + AMD MI250X | 151.90 |
| 4 | Summit | IBM Power + Nvidia V100 | 148.60 |
| 5 | Sierra | IBM Power + Nvidia V100 | 94.64 |
| 6 | Sunway TaihuLight | Sunway SW26010 | 93.01 |
| 7 | Perlmutter | AMD 64C + Nvidia A100 | 70.87 |
| 8 | Selene | AMD 64C + Nvidia A100 | 63.46 |
| 9 | Tianhe-2A | Intel Xeon + Matrix2000 | 61.44 |
| 10 | Adastra | AMD 64C + AMD MI250X | 46.10 |

Note: Data from http://www.top500.org.

[1] DENNIS J. Segmentation and the design of multiprogrammed computer systems[J]. Journal of the ACM, 1965, 12(4): 589-602. doi: 10.1145/321296.321310
[2] SACKMAN H. Time-sharing versus batch processing: The experimental evidence[C]//Proceedings of the American Federation of Information Processing Societies. New York: ACM, 1968: 1-10.
[3] SCHWARTZ J, COFFMAN E, WEISSMAN C. A general-purpose time-sharing system[C]//Proceedings of the American Federation of Information Processing Societies. New York: ACM, 1964: 397-411.
[4] MILLS D L, BRAUN H. The NSFNET backbone network[C]//Proceedings of the ACM Workshop on Frontiers in Computer Communications Technology. New York: ACM, 1987: 191-196.
[5] FOSTER I T, KESSELMAN C. The grid: Blueprint for a new computing infrastructure[M]. San Francisco: Morgan Kaufmann Publishers, 1998.
[6] STEVENS R, WOODWARD P, DEFANTI T, et al. From the I-WAY to the national technology grid[J]. Communications of the ACM, 1997, 40(11): 50-60. doi: 10.1145/265684.265692
[7] THOMAS M, BOISSEAU J, DAHAN M, et al. Development of NPACI grid application portals and portal Web services[J]. Cluster Computing, 2003, 6(3): 177-188. doi: 10.1023/A:1023566402391
[8] FOSTER I, CZAJKOWSKI K, FERGUSON D, et al. Modeling and managing state in distributed systems: The role of OGSI and WSRF[J]. Proceedings of the IEEE, 2005, 93(3): 604-612. doi: 10.1109/JPROC.2004.842766
[9] TALIA D. The open grid services architecture: Where the grid meets the Web[J]. IEEE Internet Computing, 2002, 6(6): 67-71. doi: 10.1109/MIC.2002.1067739
[10] FOSTER I, KESSELMAN C. Globus: A metacomputing infrastructure toolkit[J]. International Journal of Supercomputer Applications, 1998, 11(2): 115-129.
[11] REED D A. Grids, the TeraGrid, and beyond[J]. IEEE Computer, 2003, 36(1): 62-68. doi: 10.1109/MC.2003.1160057
[12] KUNSZT P. European DataGrid project: Status and plans[J]. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 2003, 502(2-3): 376-381. doi: 10.1016/S0168-9002(03)00447-9
[13] GAGLIARDI F, JONES B, GREY F, et al. Building an infrastructure for scientific grid computing: Status and goals of the EGEE project[J]. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 2005, 363(1833): 1729-1742. doi: 10.1098/rsta.2005.1603
[14] HEY T, TREFETHEN A E. The UK e-Science core programme and the grid[J]. Future Generation Computer Systems, 2002, 18(8): 1017-1031. doi: 10.1016/S0167-739X(02)00082-1
[15] MATSUOKA S, SHINJO S, AOYAGI M, et al. Japanese computational grid research project: NAREGI[J]. Proceedings of the IEEE, 2005, 93(3): 522-533. doi: 10.1109/JPROC.2004.842748
[16] ARMBRUST M, FOX A, GRIFFITH R, et al. Above the clouds: A Berkeley view of cloud computing: UCB/EECS-2009-28[R]. Berkeley: EECS Department University of California, Berkeley Technical Report, 2009.
[17] SARASWAT M, TRIPATHI R C. Cloud computing: Analysis of top 5 CSPs in SaaS, PaaS and IaaS platforms[C]//2020 9th International Conference on System Modeling and Advancement in Research Trends, 2020: 20421390.
[18] SOTOMAYOR B, MONTERO R, LLORENTE I, et al. Virtual infrastructure management in private and hybrid clouds[J]. IEEE Internet Computing, 2009, 13(5): 14-22. doi: 10.1109/MIC.2009.119
[19] BARIK R, LENKA R, RAO K, et al. Performance analysis of virtual machines and containers in cloud computing[C]//2016 International Conference on Computing, Communication and Automation. Piscataway: IEEE Press, 2016: 16585534.
[20] SIMONS J. HPC cloud bad; HPC in the cloud good[C]//2013 IEEE 27th International Symposium on Parallel and Distributed Processing. Piscataway: IEEE Press, 2013: 13683523.
[21] MOR N. Edge computing: Scaling resources within multiple administrative domains[J]. Queue, 2018, 16(6): 106-116. doi: 10.1145/3305263.3313377
[22] QIAO J, ZHA L. Design and implementation of grid job management for China national grid[J]. Computer Applications, 2008, 28(8): 2003-2009 (in Chinese). https://www.cnki.com.cn/Article/CJFDTOTAL-JSJY200808030.htm
[23] WANG X N, XIAO H L, CAO R Q. Design and implementation of an optimal job scheduling model for the high performance computing environment[J]. Computer Engineering & Science, 2017, 39(4): 619-626 (in Chinese). doi: 10.3969/j.issn.1007-130X.2017.04.002
[24] YU L, ZOU Y Q, ZHA L. CNGrid GOS security: Design and implementation[J]. Journal of Huazhong University of Science & Technology (Natural Science Edition), 2010, 38(S1): 6-10 (in Chinese). https://www.cnki.com.cn/Article/CJFDTOTAL-HZLG2010S1003.htm
[25] SEVILLA J, VILLALOBOS P, CERÓN J, et al. Parameter, compute and data trends in machine learning[EB/OL]. [2022-05-30]. https://docs.google.com/spreadsheets/d/1AAIebjNsnJj_uKALHbXNfn3_YsT6sHXtCU0q7OIPuc4.
[26] TOP500 list[EB/OL]. [2022-06-20]. https://top500.org/lists/top500/2022/06/.
[27] BONAWITZ K, EICHNER H, GRIESKAMP W, et al. Towards federated learning at scale: System design[C]//Proceedings of the Conference on Machine Learning and Systems. Piscataway: IEEE Press, 2019: 1-15.