Design and FPGA implementation of fast convolution algorithm based on 3D-Winograd

LIN Keyu (林珂玉); JIANG Hongxu (姜宏旭); ZHANG Yonghua (张永华); CONG Rongzi (丛容子)

doi: 10.13700/j.bh.1001-5965.2020.0310

Beijing Key Laboratory of Digital Media, Beihang University, Beijing 100083, China

Funds:

Aerospace Science and Technology Fund 190109

National Natural Science Foundation of China 61872017

Corresponding author: JIANG Hongxu, E-mail: jianghx@buaa.edu.cn

CLC number: TP391

Publication history
- Received: 2020-07-03
- Accepted: 2020-11-08
- Published online: 2021-09-20

Abstract:
In recent years, convolutional neural networks (CNNs) have been widely adopted for computer vision tasks. Owing to their high performance, energy efficiency, and reconfigurability, FPGAs are regarded as the most promising hardware accelerators for CNNs. However, constrained by the computing power and storage resources of FPGAs, existing FPGA solutions that compute three-dimensional convolution with the traditional Winograd algorithm still leave room for performance improvement. This paper first studies the one-dimensional expansion of the Winograd algorithm for three-dimensional operations. It then improves the performance of CNNs on FPGAs by enlarging the input feature map and convolution block processed at one time and by quantizing the weights and input data to low bit widths. The optimization comprises four parts: replacing some divisions with shift operations, a tiling scheme, extension from two dimensions to three dimensions, and low-bit quantization. Compared with the traditional two-dimensional Winograd algorithm, the optimized algorithm reduces the number of clock cycles of each convolutional layer by a factor of about 7; compared with the traditional sliding-window convolution algorithm, the average reduction per convolutional layer is also a factor of about 7. The results show that the 3D-Winograd algorithm based on one-dimensional expansion greatly reduces computational complexity and improves the performance of CNNs running on FPGAs.
- Convolutional neural network (CNN)
- FPGA
- Winograd
- convolution algorithm
- fast algorithm

Figure 1. Conventional sliding-window convolution algorithm and 2D-Winograd algorithm

Figure 2. One-dimensional convolution process based on the Winograd principle
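Figure 2 itself is not reproduced in this text. As a hedged illustration of the one-dimensional Winograd principle it refers to, the sketch below implements the standard F(2,3) minimal filtering form described by Lavin and Gray [10], which produces two outputs of a 3-tap filter with four multiplications instead of six. The function names `winograd_f23` and `direct_conv` are illustrative and do not come from the paper.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap 1D convolution from a 4-element input tile.

    Standard F(2,3) form; the names here are illustrative, not the paper's.
    """
    d0, d1, d2, d3 = d
    g0, g1, g2 = g

    # Filter transform: depends only on the filter, so it can be precomputed.
    u0 = g0
    u1 = (g0 + g1 + g2) / 2
    u2 = (g0 - g1 + g2) / 2
    u3 = g2

    # Input transform (additions only) followed by four multiplications.
    m1 = (d0 - d2) * u0
    m2 = (d1 + d2) * u1
    m3 = (d2 - d1) * u2
    m4 = (d1 - d3) * u3

    # Output transform: additions only.
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_conv(d, g):
    """Reference sliding-window result for the same tile (six multiplications)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


if __name__ == "__main__":
    tile = [1.0, 2.0, 3.0, 4.0]
    taps = [0.5, -1.0, 2.0]
    print(winograd_f23(tile, taps), direct_conv(tile, taps))
```

The two divisions by 2 in the filter transform are the kind of operation that a fixed-point hardware design can implement as a one-bit right shift.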

Figure 3. Winograd three-dimensional convolution process based on one-dimensional expansion

Figure 4. Pseudocode of three Winograd convolution processes

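Table 1. Performance comparison of 3D-Winograd and 2D-Winograd

Metric	Item	2D-Winograd^[16]	3D-Winograd	Result
Clock cycles	Latency	561 635	518 435	Improved by 7%
Resource usage	DSP (count)	36	30	Lower computational complexity; fewer computing resources
Resource usage	FF (count)	10 264	5 670	
Resource usage	LUT (count)	11 399	6 275	
Resource usage	Multipliers (count)	947	1 060	Essentially unchanged
Note: Both 2D-Winograd and 3D-Winograd use a 4×4 tile size and Fp32.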

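Table 2. Performance comparison before and after tile expansion

Metric	Item	3D-Winograd (4×4, Fp32)	3D-Winograd (6×6, Fp32)	Result
Clock cycles	Latency	518 435	235 871	2.198 times faster
Resource usage	DSP (count)	30	77	More computing resources
Resource usage	FF (count)	5 670	21 492	
Resource usage	LUT (count)	6 275	20 965	
Resource usage	Multipliers (count)	1 060	1 633	Increased by 54%
Note: The 3D-Winograd tile sizes are 4×4, Fp32 and 6×6, Fp32, respectively.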
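The "Result" column of Table 2 can be checked with simple ratios. The snippet below is a small sketch of that arithmetic using the latency and multiplier counts from the table; the helper names `speedup` and `percent_change` are illustrative.

```python
def speedup(before, after):
    """How many times faster the new design is (cycles before / cycles after)."""
    return before / after


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Values copied from Table 2 (4x4 tile versus 6x6 tile, both Fp32).
latency_4x4, latency_6x6 = 518_435, 235_871
mult_4x4, mult_6x6 = 1_060, 1_633

if __name__ == "__main__":
    print(f"latency speedup: {speedup(latency_4x4, latency_6x6):.3f}x")
    print(f"multiplier change: {percent_change(mult_4x4, mult_6x6):+.0f}%")
```

Both figures agree with the table: the larger tile cuts latency by a factor of about 2.198 while using about 54% more multipliers.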

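Table 3. Performance comparison before and after quantization

Metric	Item	3D-Winograd (6×6, Fp32)	3D-Winograd (6×6, 8-bit fixed point / shift)	Result
Clock cycles	Latency	235 871	32 747, 30 633	7.20, 7.70 times faster
Resource usage	DSP (count)	77	59, 47	DSP usage reduced by a factor of 1.31, 1.64
Resource usage	FF (count)	21 492	4 087, 2 785	
Resource usage	LUT (count)	20 965	27 007, 10 412	
Resource usage	Multipliers (count)	1 633	728, 657	Reduced by a factor of 2.24, 2.49
Note: The 3D-Winograd configurations are 6×6, Fp32 and 6×6, 8-bit fixed point / shift.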
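Table 3 compares Fp32 with 8-bit fixed-point and shift-based variants. The paper's number format and rounding rules are not given in this text, so the following is only a minimal sketch under stated assumptions: a signed 8-bit value with 6 fractional bits (`frac_bits=6`, chosen purely for illustration) and the division by 2 in the standard F(2,3) filter transform replaced by an arithmetic right shift. All function names are hypothetical.

```python
def quantize_q(x, frac_bits=6):
    """Round a real value to a signed 8-bit fixed-point integer.

    frac_bits=6 is an illustrative assumption, not a value from the paper.
    """
    q = round(x * (1 << frac_bits))
    return max(-128, min(127, q))  # saturate to the int8 range


def half_by_shift(q):
    """Divide a fixed-point integer by 2 with an arithmetic right shift.

    In Python, >> on a negative int rounds toward negative infinity.
    """
    return q >> 1


def filter_transform_f23(g_q):
    """F(2,3) filter transform on quantized taps, using shifts for the /2 terms."""
    g0, g1, g2 = g_q
    return [g0, half_by_shift(g0 + g1 + g2), half_by_shift(g0 - g1 + g2), g2]


if __name__ == "__main__":
    taps = [quantize_q(v) for v in (0.5, 0.25, -0.125)]
    print(taps, filter_transform_f23(taps))
```

Note that the shift is not identical to exact division for odd negative values (for example, `-3 >> 1` gives `-2`), so a real design has to decide how much of this rounding error is acceptable.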

Table 4. Performance comparison of traditional sliding-window convolution, 2D-Winograd, and 3D-Winograd algorithms

Layer	Output feature map size	Kernel size	Tile size	Traditional sliding-window convolution^[19] throughput/BFLOPS	2D-Winograd^[16] throughput/BFLOPS	3D-Winograd throughput/BFLOPS
Conv0	416×416×32	32/2	4×4	0.299	0.901	1.080
Conv1	208×208×64	64/2	4×4	1.595	5.168	6.201
Conv2	104×104×128	128/2	4×4	1.595	5.168	6.201
Conv3	104×104×64	64/2	4×4	0.177	0.521	0.637
Conv4	104×104×128	128/2	4×4	1.595	5.168	6.201
Conv5	52×52×256	256/4	6×6	1.595	11.361	13.623
Conv6	52×52×128	512/4	6×6	1.595	11.361	13.623
Conv7	52×52×256	256/4	6×6	0.177	1.167	1.399
Conv8	26×26×512	512/4	6×6	1.595	11.361	13.623
Conv9	26×26×256	256/4	6×6	0.177	1.167	1.399
Conv10	26×26×512	512/4	6×6	1.595	11.361	13.623
Conv11	26×26×256	256/4	6×6	0.177	1.167	1.399
Conv12	26×26×512	512/4	6×6	1.595	11.361	13.623
Conv13	26×26×1 024	1 024/4	6×6	3.190	21.022	25.230
Conv14	26×26×1 024	1 024/4	6×6	3.190	21.022	25.230
Conv15	26×26×1 024	1 024/4	6×6	3.987	26.286	31.534
Total time/ms				1 052	160.963	123.799

Table 5. Comparison of the performance/power ratio of CNN on FPGA and GPU

Platform	Average throughput per Conv layer/BFLOPS	Average power consumption per Conv layer/J	Performance/power ratio
FPGA	10.913 75	0.328 94	33.178
GPU	21.590 37	8.1	2.665
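The performance/power ratio in Table 5 appears to be the average throughput divided by the average power consumption figure in the same row. The sketch below reproduces that division from the table values; the function name `perf_power_ratio` is illustrative, and the interpretation of the last column as this quotient is an inference from the numbers rather than a statement in the source.

```python
def perf_power_ratio(throughput_bflops, power_figure):
    """Throughput divided by the power consumption figure from the same row."""
    return throughput_bflops / power_figure


# Values copied from Table 5.
fpga = perf_power_ratio(10.91375, 0.32894)
gpu = perf_power_ratio(21.59037, 8.1)

if __name__ == "__main__":
    print(f"FPGA: {fpga:.3f}")
    print(f"GPU:  {gpu:.3f}")
    print(f"FPGA advantage: {fpga / gpu:.1f}x")
```

On these numbers the GPU delivers roughly twice the throughput of the FPGA, but the FPGA's ratio is about 12 times higher.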

References (19)

[1]	ZHANG X F, WANG J S, ZHU C, et al. AccDNN: An IP-based DNN generator for FPGAs[C]//2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). Piscataway: IEEE Press, 2018: 210.
[2]	GUAN Y J, LIANG H, XU N Y, et al. FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates[C]//2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). Piscataway: IEEE Press, 2017: 152-159.
[3]	GEORGE J K, NEJADRIAHI H, SORGER V J. Towards on-chip optical FFTs for convolutional neural networks[C]//2017 IEEE International Conference on Rebooting Computing (ICRC). Piscataway: IEEE Press, 2017: 1-4.
[4]	ORDÓÑEZ Á, ARGÜELLO F, HERAS D B. GPU accelerated FFT-based registration of hyperspectral scenes[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2017, 10(11): 4869-4878. doi: 10.1109/JSTARS.2017.2734052
[5]	SUITA S, NISHIMURA T, TOKURA H, et al. Efficient cuDNN-compatible convolution-pooling on the GPU[C]//International Conference on Parallel Processing and Applied Mathematics. Berlin: Springer, 2019: 46-58.
[6]	ZHANG C, PRASANNA V. Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system[C]//FPGA'17: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017: 35-44.
[7]	CONG J, XIAO B J. Minimizing computation in convolutional neural networks[M]//Artificial Neural Networks and Machine Learning-ICANN 2014. Berlin: Springer, 2014: 281-290.
[8]	SUDA N, CHANDRA V, DASIKA G, et al. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks[C]//FPGA'16: Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016: 16-25.
[9]	ZHANG C, SUN G Y, FANG Z M, et al. Caffeine: Toward uniformed representation and acceleration for deep convolutional neural networks[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019, 38(11): 2072-2085. doi: 10.1109/TCAD.2017.2785257
[10]	LAVIN A, GRAY S. Fast algorithms for convolutional neural networks[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE Press, 2016: 4013-4021.
[11]	ZHANG C, LI P, SUN G Y, et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks[C]//FPGA'15: Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2015: 161-170.
[12]	QIU J T, WANG J, YAO S, et al. Going deeper with embedded FPGA platform for convolutional neural network[C]//FPGA'16: Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016: 26-35.
[13]	YU J C, GE G J, HU Y M, et al. Instruction driven cross-layer CNN accelerator for fast detection on FPGA[J]. ACM Transactions on Reconfigurable Technology and Systems, 2018, 11(3): 1-23.
[14]	AHMAD A, PASHA M A. Towards design space exploration and optimization of fast algorithms for convolutional neural networks (CNNs) on FPGAs[C]//2019 Design, Automation & Test in Europe Conference & Exhibition (DATE). Piscataway: IEEE Press, 2019: 1106-1111.
[15]	LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324. doi: 10.1109/5.726791
[16]	LIANG Y, LU L Q, XIAO Q C, et al. Evaluating fast algorithms for convolutional neural networks on FPGAs[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020, 39(4): 857-870. doi: 10.1109/TCAD.2019.2897701
[17]	LU L Q, LIANG Y. SpWA: An efficient sparse Winograd convolutional neural networks accelerator on FPGAs[C]//2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC). Piscataway: IEEE Press, 2018: 1-6.
[18]	ZHAO Y L, WANG D H, WANG L O. Convolution accelerator designs using fast algorithms[J]. Algorithms, 2019, 12(5): 112. doi: 10.3390/a12050112
[19]	REDMON J, FARHADI A. YOLO9000: Better, faster, stronger[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE Press, 2017: 6517-6525.