Journal of Beijing University of Aeronautics and Astronautics ›› 2018, Vol. 44 ›› Issue (10): 2115-2124. doi: 10.13700/j.bh.1001-5965.2017.0786

• Article

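  • Corresponding author: YAN Haihua, E-mail: yhh@buaa.edu.cn
  • About the authors: LI Chao, male, Ph.D. candidate. Main research interest: high-performance computing; YAN Haihua, male, M.S., associate professor. Main research interests: software engineering and high-performance computing.
  • Supported by:
    National Natural Science Foundation of China (61672073); Science Research and Technology Development Project of China National Petroleum Corporation (2016E-1001)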

A fault tolerant high-performance reduction framework in complex environment

LI Chao1, ZHAO Changhai2, YAN Haihua1, LIU Chao1, WEN Jiamin2, WANG Zengbo2   

  1. School of Computer Science and Engineering, Beijing University of Aeronautics and Astronautics, Beijing 100083, China;
  2. Research Center of Geophysical Exploration, BGP INC., China National Petroleum Corporation, Beijing 100088, China
  • Received: 2017-12-21  Revised: 2018-03-16  Online: 2018-10-20  Published: 2018-10-29

Abstract: Reduction is one of the most commonly used collective communication operations in parallel applications. Existing reduction algorithms have two main problems. First, they do not adapt to complex environments: when interference appears in the computing environment, reduction efficiency degrades significantly. Second, they are not fault tolerant: when a node fails, the reduction is forced to abort. To address these problems, this paper proposes a high-performance distributed reduction framework based on task parallelism. First, the framework splits each reduction into a series of independent computing tasks and uses a task scheduler to ensure that ready tasks are preferentially scheduled onto higher-performing nodes, which effectively avoids the impact of slow nodes on overall performance. Second, building on reliable storage of reduction data and a fault-detection mechanism, the framework recovers from failures at task granularity without the application exiting. Experimental results in a complex environment show that the distributed reduction framework is highly reliable and that, compared with existing reduction algorithms, it improves reduction performance by up to 2.2 times and concurrent reduction performance by up to 4 times.

Key words: reduction, collective communication, complex environment, interference, fault tolerance, parallel computing
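
The two ideas in the abstract, splitting a reduction into independent tasks that go to the fastest available node and re-running a task when its node fails, can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions and not the authors' implementation: the function name `task_based_reduce`, the `node_speeds` and `fail_once` parameters, the pairwise task shape, and the in-memory deque standing in for reliable storage are all hypothetical, and the operator is assumed to be associative and commutative.

```python
import heapq
from collections import deque
from operator import add


def task_based_reduce(values, op, node_speeds, fail_once=frozenset()):
    """Reduce `values` with `op` using independent pairwise tasks (simulation).

    node_speeds: dict mapping node name -> relative speed (higher is faster).
    fail_once:   nodes that crash while running their first task (hypothetical fault).
    `op` must be associative and commutative, since results are combined
    in completion order. Returns (result, simulated_finish_time, log).
    """
    if not values:
        raise ValueError("values must be non-empty")
    store = deque(values)    # stands in for reliable storage of partial results
    idle = set(node_speeds)  # live nodes that currently have no task
    running = []             # min-heap of (finish_time, seq, node, a, b)
    now, seq, log = 0.0, 0, []

    while len(store) > 1 or running:
        # Scheduling step: each ready task goes to the fastest idle node.
        while len(store) >= 2 and idle:
            node = max(idle, key=lambda n: (node_speeds[n], n))
            idle.remove(node)
            a, b = store.popleft(), store.popleft()
            finish = now + 1.0 / node_speeds[node]
            heapq.heappush(running, (finish, seq, node, a, b))
            seq += 1
        if not running:
            raise RuntimeError("no live node left to run the remaining tasks")

        # Advance simulated time to the next task completion (or failure).
        now, _, node, a, b = heapq.heappop(running)
        if node in fail_once:
            # Fault detected: the inputs were never lost, so put them back
            # and let a surviving node redo the task. The node stays offline.
            store.extendleft((b, a))
            log.append((node, "failed"))
        else:
            store.append(op(a, b))
            idle.add(node)
            log.append((node, "done"))
    return store[0], now, log


# Usage: eight partial values, three nodes of different speed, one crash.
total, finish_time, log = task_based_reduce(
    list(range(1, 9)), add,
    node_speeds={"fast": 4.0, "mid": 2.0, "slow": 1.0},
    fail_once={"mid"},
)
print(total)  # 36
```

Because a task only needs two ready partial results and any idle node, a slow node delays just the task it holds rather than a whole fixed communication tree, and a failed task costs one re-execution instead of aborting the reduction.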
