Journal of Beijing University of Aeronautics and Astronautics ›› 2016, Vol. 42 ›› Issue (11): 2340-2348. doi: 10.13700/j.bh.1001-5965.2015.0731


Virtualized GPU computing platform in clustered system environment

YANG Jingwei, MA Kai, LONG Xiang

  1. School of Computer Science and Engineering, Beijing University of Aeronautics and Astronautics, Beijing 100083, China
  • Received: 2015-11-09 Revised: 2016-01-15 Online: 2016-11-20 Published: 2016-03-15
  • Corresponding author: LONG Xiang, Tel.: 010-82339685, E-mail: long@buaa.edu.cn
  • About the authors: YANG Jingwei, male, Ph.D. candidate. Research interests: computer architecture, embedded systems, multi-core real-time operating systems and real-time scheduling. E-mail: yaungjw@buaa.edu.cn; MA Kai, male, M.S. candidate. Research interests: computer architecture, parallel and distributed computing. E-mail: makai@buaa.edu.cn; LONG Xiang, male, Ph.D., professor and doctoral supervisor. Research interests: computer architecture, parallel and distributed systems, real-time systems. Tel.: 010-82339685, E-mail: long@buaa.edu.cn

Abstract: A virtualized GPU computing platform is proposed for clustered systems with multiple GPUs distributed across multiple nodes. The platform uniformly abstracts and manages the GPU resources on all nodes of the cluster, building a common GPU resource pool. Legacy GPU programs run on the virtualized platform without any modification and can access any free virtualized GPU in the pool, so programmers no longer need to write explicit MPI code for multi-node multi-GPU applications. Freed from the limits of the GPU resources on a single node, applications can transparently use any available GPU in the cluster, which improves overall resource utilization and throughput. Pipelined communication hides the platform's runtime overhead and inter-node transmission latency behind intra-node memory copying and GPU computation. Experiments show that, compared with non-pipelined communication, total data transmission latency is reduced by approximately 50%-70%, yielding communication performance comparable to intra-node local data transfer.
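The latency reduction reported above follows the usual behavior of a two-stage transfer pipeline: once a large buffer is split into chunks, the inter-node network transfer of one chunk can overlap the intra-node memory copy of the previous chunk. The following is a minimal analytic sketch of that effect, not the paper's implementation; the chunk count and per-chunk stage costs are illustrative assumptions.

```python
# Analytic model of pipelined vs. non-pipelined chunked transfer.
# Stage 1: inter-node network transfer; stage 2: intra-node memory copy.
# All parameter values below are hypothetical, for illustration only.

def total_latency(n_chunks, t_net, t_copy, pipelined):
    """Makespan (ms) of moving n_chunks through the two stages."""
    if not pipelined:
        # Each chunk completes both stages before the next chunk starts.
        return n_chunks * (t_net + t_copy)
    # Two-stage pipeline: one fill phase, then the slower stage dominates.
    return t_net + t_copy + (n_chunks - 1) * max(t_net, t_copy)

if __name__ == "__main__":
    n, t_net, t_copy = 8, 1.0, 0.9  # assumed per-chunk costs (ms)
    seq = total_latency(n, t_net, t_copy, pipelined=False)
    pipe = total_latency(n, t_net, t_copy, pipelined=True)
    print(f"sequential: {seq:.1f} ms, pipelined: {pipe:.1f} ms, "
          f"reduction: {100 * (1 - pipe / seq):.0f}%")
```

As the stage costs approach each other and the chunk count grows, the pipelined makespan approaches half the sequential one; overlapping with GPU computation as well (a third stage, as in the platform described above) pushes the reduction further, consistent with the reported 50%-70% range.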

Key words: GPU, MPI, CUDA, clustered systems, hardware acceleration, parallel computing, high performance computing



Copyright © Editorial Office of Journal of Beijing University of Aeronautics and Astronautics
Address: Editorial Office, Journal of Beijing University of Aeronautics and Astronautics, 37 Xueyuan Road, Haidian District, Beijing 100191, China. E-mail: jbuaa@buaa.edu.cn