Enhancing performance through static computing partitioning approach in processing-in-memory systems

XUE Hongyu¹, XU Sheng^{1, 2}, LUO Le¹, YAN Liang^{3, 4}, ZOU Xingqi³

doi: 10.13700/j.bh.1001-5965.2024.0209

1. School of Computer and Information, Anhui Normal University, Wuhu 241000, China
2. Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230094, China
3. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100083, China
4. University of Chinese Academy of Sciences, Beijing 100049, China

Funds: National Natural Science Foundation of China (62102005, 62104230); Anhui Provincial Natural Science Foundation (2008085QF330, 2108085QF265); The University Synergy Innovation Program of Anhui Province (GXXT-2021-011); Research Program of Anhui Normal University (751968)

Corresponding author: E-mail: zouxingqi@ict.ac.cn

CLC number: TP303

Publication history
- Received: 2024-04-10
- Accepted: 2024-08-23
- Published online: 2024-09-13
- Issue publication date: 2026-06-30

Abstract:
Processing-in-memory (PIM) systems integrate computing units into memory, breaking the separation of storage and computation in the von Neumann architecture and effectively alleviating the memory wall problem. However, PIM systems do not match the existing software stack: their performance and energy efficiency are limited by how a program's computation is partitioned between the host and the in-memory units, and a poor partition can even degrade performance. To address this problem, this paper proposes a static computing partitioning approach for PIM systems. The approach abstracts the execution of a program on a PIM system into a representation model based on an annotated call graph (ACG), which transforms the computing partitioning problem into a minimum cut problem on the ACG, and it proposes a simulated annealing-based heuristic algorithm to solve that problem. Experimental results show that, compared with traditional methods, the partitions produced by the proposed approach improve system performance by 39% and reduce system energy consumption by 32% on average.

Keywords: processing-in-memory; computing partitioning; min-cut; simulated annealing; heterogeneous system
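The abstract describes solving the partitioning problem with a simulated annealing heuristic. The following is a minimal sketch of that general idea, not the authors' algorithm: it assumes a partition's cost is the sum of each basic block's execution time on its assigned unit plus the transfer time of every data edge whose endpoints land on different units. The function names, the cooling schedule, and all numeric values are hypothetical.

```python
import math
import random

INF = float("inf")


def partition_cost(assign, t_cpu, t_pim, transfer):
    """Execution time of every block on its unit plus the transfer
    time of each data edge that crosses the CPU/PIM boundary."""
    cost = 0.0
    for block, on_pim in assign.items():
        cost += t_pim[block] if on_pim else t_cpu[block]
    for (u, v), t in transfer.items():
        if assign[u] != assign[v]:
            cost += t
    return cost


def anneal_partition(t_cpu, t_pim, transfer, t_start=10.0, t_end=1e-3,
                     alpha=0.95, moves_per_temp=50, seed=0):
    """Simulated annealing over CPU/PIM assignments (assumed schedule)."""
    rng = random.Random(seed)
    blocks = sorted(t_cpu)
    # Blocks with infinite PIM time can never be offloaded.
    movable = [b for b in blocks if t_pim[b] != INF]
    assign = {b: False for b in blocks}  # start with everything on the CPU
    cost = partition_cost(assign, t_cpu, t_pim, transfer)
    best, best_cost = dict(assign), cost
    temp = t_start
    while temp > t_end and movable:
        for _ in range(moves_per_temp):
            b = rng.choice(movable)
            assign[b] = not assign[b]  # propose: flip one block
            new_cost = partition_cost(assign, t_cpu, t_pim, transfer)
            delta = new_cost - cost
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                cost = new_cost  # accept the move
                if cost < best_cost:
                    best, best_cost = dict(assign), cost
            else:
                assign[b] = not assign[b]  # reject: undo the flip
        temp *= alpha  # geometric cooling
    return best, best_cost


# Hypothetical four-block program; block "A" cannot run in memory.
t_cpu = {"A": 4.0, "B": 9.0, "C": 8.0, "D": 3.0}
t_pim = {"A": INF, "B": 3.0, "C": 2.0, "D": 6.0}
transfer = {("A", "B"): 1.0, ("B", "C"): 5.0, ("C", "D"): 1.0}

best, best_cost = anneal_partition(t_cpu, t_pim, transfer)
```

In this toy instance, keeping everything on the CPU costs 24 time units, while offloading only `B` and `C` costs 14, which is the lowest cost any partition of this instance can reach. The heuristic never returns a partition worse than its all-CPU starting point, but like any annealing scheme it does not guarantee the optimum.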

Figure 1. Workflow for generating the annotated call graph (ACG) from source code

Figure 2. ACG-based computing partitioning approach and the corresponding computing partition of a PIM program

Figure 3. Speedup under different dataset sizes and computing partitions

Figure 4. Percentage of instructions executed in memory for each benchmark program

Figure 5. Energy consumption breakdown for different baseline designs

Figure 6. Impact of in-memory processor core frequency on system performance

Figure 7. Schematic of the impact of locality awareness on partitioning results

Figure 8. Speedup for non-graph processing programs

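Table 1. Description of the edge properties in the ACG

Edge type in the ACG	Attribute	Description
Data-dependence edge e = (u, v)	$ \dfrac{S_{u,v}}{B_{\text{W}}} $	Time to transfer data from basic block u to basic block v
Control-dependence edge		Control dependence determines the precedence order between basic blocks
Edge from the PIM pseudo-node e = ($ v_{\text{PIM}} $, v)	$ T_{v}^{\text{PIM}} $	Execution time of basic block v on the in-memory computing unit; +∞ for basic blocks that cannot execute in memory
Edge from the CPU pseudo-node e = ($ v_{\text{CPU}} $, v)	$ T_{v}^{\text{CPU}} $	Execution time of basic block v on the CPU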
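The edge attributes in Table 1 can be read as the weights of a two-way cut. The sketch below illustrates this on a tiny hypothetical graph under stated assumptions: a block placed on a unit is charged that unit's execution-time attribute, a data-dependence edge whose endpoints sit on different units is charged $ S_{u,v}/B_{\text{W}} $, and control-dependence edges contribute no weight because the table gives them no attribute. The block names, byte counts, times, and bandwidth are invented for illustration, and the exhaustive search stands in for a real solver only because the example is so small.

```python
from itertools import product

INF = float("inf")
BANDWIDTH = 16.0  # B_W, bytes per time unit (hypothetical value)

# Hypothetical per-block execution times: (T_v^CPU, T_v^PIM)
exec_time = {
    "load": (6.0, 2.0),
    "update": (5.0, 1.0),
    "print": (1.0, INF),  # cannot execute in memory
}

# Hypothetical data-dependence edges with S_{u,v} in bytes
data_bytes = {("load", "update"): 64.0, ("update", "print"): 16.0}


def edge_time(u, v):
    """Transfer-time attribute S_{u,v} / B_W of a data-dependence edge."""
    return data_bytes[(u, v)] / BANDWIDTH


def cut_weight(on_pim):
    """Total weight of the cut induced by placing `on_pim` blocks in memory."""
    total = 0.0
    for block, (t_cpu, t_pim) in exec_time.items():
        total += t_pim if block in on_pim else t_cpu
    for (u, v) in data_bytes:
        if (u in on_pim) != (v in on_pim):  # edge crosses the partition
            total += edge_time(u, v)
    return total


def best_partition():
    """Enumerate all 2^n placements; feasible only for tiny graphs."""
    blocks = list(exec_time)
    best_set, best_w = frozenset(), cut_weight(frozenset())
    for mask in product([False, True], repeat=len(blocks)):
        chosen = frozenset(b for b, m in zip(blocks, mask) if m)
        w = cut_weight(chosen)
        if w < best_w:
            best_set, best_w = chosen, w
    return best_set, best_w
```

Here the all-CPU placement weighs 12.0, offloading only `load` also weighs 12.0 because the 4.0 transfer penalty cancels the gain, and offloading both `load` and `update` weighs 5.0. Any placement that puts `print` in memory has infinite weight, which is how the +∞ attribute rules it out.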

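Table 2. Configuration parameters of simulated PIM system

Processing unit	Module	Configuration
CPU	Processor	16 cores, 2 GHz, out-of-order execution
	L1 cache	Split instruction and data caches, 32 KB each, 4-way set associative, 64 B block size, MESI (modified, exclusive, shared, invalid)
	L2 cache	Shared cache, 2 MB, 8-way set associative, 64 B block size, MESI
In-memory processor	Memory model	HMC V2.0, 4 GB, 256 banks
	Processor	32 processors, in-order execution, 2 GHz, 1 per vault
	L1 cache	Split instruction and data caches, 32 KB each, 4-way set associative, 64 B block size, MESI
	Coherence protocol	Coarse-grained locks^[21-22], lock granularity 4 KB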
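A few derived quantities follow from the parameters in Table 2 by simple arithmetic. The sketch below assumes binary units (1 KB = 1024 B) and, for the lock count, that coarse-grained locks cover the full 4 GB at a uniform 4 KB granularity. Both are assumptions, since the table does not state them, and the helper name is hypothetical.

```python
KB, MB, GB = 1024, 1024 ** 2, 1024 ** 3


def cache_sets(capacity_bytes, ways, block_bytes):
    """Number of sets in a set-associative cache."""
    assert capacity_bytes % (ways * block_bytes) == 0
    return capacity_bytes // (ways * block_bytes)


# L1 caches (CPU and in-memory processors): 32 KB, 4-way, 64 B blocks
l1_sets = cache_sets(32 * KB, ways=4, block_bytes=64)

# Shared L2 cache: 2 MB, 8-way, 64 B blocks
l2_sets = cache_sets(2 * MB, ways=8, block_bytes=64)

# Lockable regions if 4 KB locks tile the whole 4 GB memory (assumption)
lock_regions = (4 * GB) // (4 * KB)
```

Under these assumptions each L1 cache has 128 sets, the L2 cache has 4096 sets, and the memory divides into 1048576 lockable regions.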

Table 3. Benchmarks and datasets used in this paper

Benchmarks	Datasets
Average Teenage Follower (ATF), Breadth-First Search (BFS), Bellman-Ford Shortest Path (SP), PageRank (PR)	p2p-Gnutella30 (36 K vertices, 88 K edges), com-DBLP (317 K vertices, 1 M edges), com-Youtube (1.1 M vertices, 2.9 M edges), wiki-Talk (2.3 M vertices, 5 M edges), soc-LiveJournal1 (4.8 M vertices, 6.9 M edges)

Table 4. In-memory loading ratio for non-graph processing applications

Benchmark	PIM-Only/%	PIM-Atomic/%	Proposed method/%
GEMV	100	15.44	13.75
Select	100	31.45	36.48
Unique	100	40.60	46.17
MLP	100	5.35	9.64

References (45)

[1]	HUANG X H, LIU C S, JIANG Y G, et al. In-memory computing to break the memory wall[J]. Chinese Physics B, 2020, 29(7): 078504.
[2]	LEE C, SHIN W, KIM D J, et al. NVDIMM-C: a byte-addressable non-volatile memory module for compatibility with standard DDR memory interfaces[C]//Proceedings of the 2020 IEEE International Symposium on High Performance Computer Architecture. Piscataway: IEEE Press, 2020: 502-514.
[3]	PATTNAIK A P. Be aware of data movement: optimizing throughput processors for efficient computations[D]. State College: The Pennsylvania State University, 2019: 10-12.
[4]	Advanced Micro Devices, Inc. AMD EPYC™ 9684X server processors[EB/OL]. (2023-01-16)[2024-03-12]. https://www.amd.com/zh-cn/products/processors/server/epyc/4th-generation-9004-and-8004-series/amd-epyc-9684x.html.
[5]	AHN J, YOO S, MUTLU O, et al. PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture[C]//Proceedings of the 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture. Piscataway: IEEE Press, 2015: 336-348.
[6]	KAUTZ W H. Cellular logic-in-memory arrays[J]. IEEE Transactions on Computers, 1969, C-18(8): 719-727.
[7]	JEDDELOH J, KEETH B. Hybrid memory cube new DRAM architecture increases density and performance[C]//Proceedings of the 2012 Symposium on VLSI Technology. Piscataway: IEEE Press, 2012: 87-88.
[8]	VAHIDPOUR M, O’BRIEN W, WHYLAND J T, et al. Superconducting through-silicon vias for quantum integrated circuits[EB/OL]. (2017-08-07)[2024-03-12]. https://arxiv.org/abs/1708.02226.
[9]	LI T, HOU J, YAN J L, et al. Chiplet heterogeneous integration technology—status and challenges[J]. Electronics, 2020, 9(4): 670.
[10]	AHN J, HONG S, YOO S, et al. A scalable processing-in-memory accelerator for parallel graph processing[C]//Proceedings of the 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture. Piscataway: IEEE Press, 2015: 105-117.
[11]	IMANI M, KIM Y, ROSING T. MPIM: multi-purpose in-memory processing using configurable resistive memory[C]//Proceedings of the 2017 22nd Asia and South Pacific Design Automation Conference. Piscataway: IEEE Press, 2017: 757-763.
[12]	STRUKOV D B, SNIDER G S, STEWART D R, et al. The missing memristor found[J]. Nature, 2008, 453(7191): 80-83.
[13]	KHAN K, PASRICHA S, KIM R G. A survey of resource management for processing-in-memory and near-memory processing architectures[J]. Journal of Low Power Electronics and Applications, 2020, 10(4): 30.
[14]	HSIEH K, EBRAHIM E, KIM G, et al. Transparent offloading and mapping (TOM): enabling programmer-transparent near-data processing in GPU systems[C]//Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture. Piscataway: IEEE Press, 2016: 204-216.
[15]	TSAI P A, CHEN C P, SANCHEZ D. Adaptive scheduling for systems with asymmetric memory hierarchies[C]//Proceedings of the 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture. Piscataway: IEEE Press, 2018: 641-654.
[16]	XIAO Y, NAZARIAN S, BOGDAN P. Prometheus: processing-in-memory heterogeneous architecture design from a multi-layer network theoretic strategy[C]//Proceedings of the 2018 Design, Automation & Test in Europe Conference & Exhibition. Piscataway: IEEE Press, 2018: 1387-1392.
[17]	LI Z R, CHEN X M, HAN Y H. Optimal data allocation for graph processing in processing-in-memory systems[C]//Proceedings of the 2022 27th Asia and South Pacific Design Automation Conference. Piscataway: IEEE Press, 2022: 238-243.
[18]	RYDER B G. Constructing the call graph of a program[J]. IEEE Transactions on Software Engineering, 1979, SE-5(3): 216-226.
[19]	NAI L F, HADIDI R, SIM J, et al. GraphPIM: enabling instruction-level PIM offloading in graph computing frameworks[C]//Proceedings of the 2017 IEEE International Symposium on High Performance Computer Architecture. Piscataway: IEEE Press, 2017: 457-468.
[20]	YAN L, ZHANG M Z, WANG R J, et al. CoPIM: a concurrency-aware PIM workload offloading architecture for graph applications[C]//Proceedings of the 2021 IEEE/ACM International Symposium on Low Power Electronics and Design. Piscataway: IEEE Press, 2021: 1-6.
[21]	BOROUMAND A, GHOSE S, PATEL M, et al. CoNDA: efficient cache coherence support for near-data accelerators[C]//Proceedings of the 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture. Piscataway: IEEE Press, 2019: 629-642.
[22]	XU S, CHEN X M, WANG Y, et al. CuckooPIM: an efficient and less-blocking coherence mechanism for processing-in-memory systems[C]//Proceedings of the 2019 24th Asia and South Pacific Design Automation Conference. Piscataway: IEEE Press, 2019: 1-6.
[23]	SONG L H, QIAN X H, LI H, et al. PipeLayer: a pipelined ReRAM-based accelerator for deep learning[C]//Proceedings of the 2017 IEEE International Symposium on High Performance Computer Architecture. Piscataway: IEEE Press, 2017: 541-552.
[24]	ZOU K W, WANG Y, LI H W, et al. XORiM: a case of in-memory bit-comparator implementation and its performance implications[C]//Proceedings of the 2018 23rd Asia and South Pacific Design Automation Conference. Piscataway: IEEE Press, 2018: 349-354.
[25]	NAI L F, HADIDI R, XIAO H, et al. Thermal-aware processing-in-memory instruction offloading[J]. Journal of Parallel and Distributed Computing, 2019, 130: 193-207.
[26]	LI S C, XU C, ZOU Q S, et al. Pinatubo: a processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories[C]//Proceedings of the 2016 53rd ACM/EDAC/IEEE Design Automation Conference. Piscataway: IEEE Press, 2016: 1-6.
[27]	HADIDI R, NAI L F, KIM H, et al. CAIRO: a compiler-assisted technique for enabling instruction-level offloading of processing-in-memory[J]. ACM Transactions on Architecture and Code Optimization, 2017, 14(4): 1-25.
[28]	AHMED H, SANTOS P C, LIMA J P C, et al. A compiler for automatic selection of suitable processing-in-memory instructions[C]//Proceedings of the 2019 Design, Automation & Test in Europe Conference & Exhibition. Piscataway: IEEE Press, 2019: 564-569.
[29]	JIN H, CHEN D, ZHENG L, et al. Accelerating graph convolutional networks through a PIM-accelerated approach[J]. IEEE Transactions on Computers, 2023, 72(9): 2628-2640.
[30]	NAI L F, HADIDI R, XIAO H, et al. CoolPIM: thermal-aware source throttling for efficient PIM instruction offloading[C]//Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium. Piscataway: IEEE Press, 2018: 680-689.
[31]	PATTNAIK A, TANG X L, JOG A, et al. Scheduling techniques for GPU architectures with processing-in-memory capabilities[C]//Proceedings of the 2016 International Conference on Parallel Architectures and Compilation. New York: ACM, 2016: 31-44.
[32]	HSIEH K, KHAN S, VIJAYKUMAR N, et al. Accelerating pointer chasing in 3D-stacked memory: challenges, mechanisms, evaluation[C]//Proceedings of the 2016 IEEE 34th International Conference on Computer Design. Piscataway: IEEE Press, 2016: 25-32.
[33]	LI J, WANG X, TUMEO A, et al. PIMS: a lightweight processing-in-memory accelerator for stencil computations[C]//Proceedings of the International Symposium on Memory Systems. New York: ACM, 2019: 41-52.
[34]	FERRANTE J, OTTENSTEIN K J, WARREN J D. The program dependence graph and its use in optimization[C]//Proceedings of the International Symposium on Programming. Berlin: Springer, 1984: 125-132.
[35]	RAILING B P, HEIN E R, CONTE T M. Contech: efficiently generating dynamic task graphs for arbitrary parallel programs[J]. ACM Transactions on Architecture and Code Optimization, 2015, 12(2): 1-24.
[36]	GIBBONS P B, MUCHNICK S S. Efficient instruction scheduling for a pipelined architecture[C]//Proceedings of the 1986 SIGPLAN Symposium on Compiler Construction. New York: ACM, 1986: 11-16.
[37]	BUCHSBAUM A L, KAPLAN H, ROGERS A, et al. Linear-time pointer-machine algorithms for least common ancestors, MST verification, and dominators[C]//Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing. New York: ACM, 1998: 279-288.
[38]	LUK C K, COHN R, MUTH R, et al. Pin: building customized program analysis tools with dynamic instrumentation[J]. ACM SIGPLAN Notices: A Monthly Publication of the Special Interest Group on Programming Languages, 2005, 40(6): 190-200.
[39]	BARBERA M V, KOSTA S, MEI A, et al. To offload or not to offload? The bandwidth and energy costs of mobile cloud computing[C]//Proceedings of the 2013 Proceedings IEEE INFOCOM. Piscataway: IEEE Press, 2013: 1285-1293.
[40]	DELAHAYE D, CHAIMATANAN S, MONGEAU M. Simulated annealing: from basics to applications[M]//GENDREAU M, POTVIN J Y. Handbook of metaheuristics. Berlin: Springer, 2019: 1-35.
[41]	XU S, CHEN X M, WANG Y, et al. PIMSim: a flexible and detailed processing-in-memory simulator[J]. IEEE Computer Architecture Letters, 2019, 18(1): 6-9.
[42]	BEAMER S, ASANOVIĆ K, PATTERSON D. The GAP benchmark suite[EB/OL]. (2017-05-16)[2024-03-12]. https://doi.org/10.48550/arXiv.1508.03619.
[43]	LESKOVEC J, SOSIČ R. SNAP: a general-purpose network analysis and graph-mining library[J]. ACM Transactions on Intelligent Systems and Technology, 2016, 8(1): 1-20.
[44]	MURALIMANOHAR N, BALASUBRAMONIAN R, JOUPPI N P. CACTI 6.0: a tool to model large caches: HPL-2009-85[R]. Palo Alto: HP Laboratories, 2009.
[45]	GÓMEZ-LUNA J, EL HAJJ I, FERNANDEZ I, et al. Benchmarking a new paradigm: experimental analysis and characterization of a real processing-in-memory system[J]. IEEE Access, 2022, 10: 52565-52608.