Abstract: In recent years, convolutional neural networks (CNNs) have been widely adopted for computer vision tasks. Thanks to their high performance, energy efficiency, and reconfigurability, FPGAs are considered among the most promising CNN hardware accelerators. However, existing FPGA solutions that compute 3D convolution with the traditional Winograd method are limited by FPGA compute power and storage resources, leaving room for performance improvement. This paper first studies the one-dimensional expansion of the Winograd algorithm to three-dimensional operations; it then improves CNN performance on FPGA by enlarging the input feature map and convolution tile processed at a time, and by quantizing weights and input data to low bit widths. The optimization comprises four parts: replacing part of the divisions with shifts, a tiling scheme, the extension from 2D to 3D, and low-bit quantization. Compared with the traditional 2D Winograd algorithm, the optimized algorithm reduces the clock-cycle count of each convolutional layer by about 7 times; compared with the traditional sliding-window convolution, the average per-layer cycle count is likewise reduced by about 7 times. The results show that the 3D-Winograd algorithm based on one-dimensional expansion greatly reduces computational complexity and improves the performance of running CNNs on FPGA.
Key words:
- Convolutional Neural Network (CNN) /
- FPGA /
- Winograd /
- convolution algorithm /
- fast algorithm
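The building block behind the abstract's "one-dimensional expansion" is the 1-D Winograd minimal filtering algorithm F(2,3), which produces two outputs of a 3-tap convolution with 4 multiplications instead of 6; 2D and 3D Winograd convolutions are then obtained by nesting this 1-D transform along each axis. The following is a minimal Python sketch of F(2,3) for illustration only (the function names are hypothetical, not the paper's implementation); note the /2 constants in the filter transform, which are the divisions the paper proposes to replace with shifts.

```python
# 1-D Winograd F(2,3): 2 outputs of a 3-tap filter with 4 multiplications.
# Hypothetical illustration; not the paper's FPGA implementation.

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 outputs of valid 1-D correlation."""
    # Filter transform (the /2 factors are the divisions replaced by shifts in hardware)
    G0 = g[0]
    G1 = (g[0] + g[1] + g[2]) / 2
    G2 = (g[0] - g[1] + g[2]) / 2
    G3 = g[2]
    # Input transform (additions/subtractions only)
    D0 = d[0] - d[2]
    D1 = d[1] + d[2]
    D2 = d[2] - d[1]
    D3 = d[1] - d[3]
    # Element-wise products: only 4 multiplications instead of 6
    m = [D0 * G0, D1 * G1, D2 * G2, D3 * G3]
    # Output transform
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]

def direct_conv(d, g):
    """Reference: traditional sliding-window valid correlation (6 multiplications)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]

print(winograd_f23([1, 2, 3, 4], [1, 1, 1]))  # -> [6.0, 9.0], matching direct_conv
```

For a full 3D convolution, the same transform is applied successively along the width, height, and depth axes, which is why the multiplication savings compound relative to the sliding-window baseline.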
Table 1. Performance comparison of 3D-Winograd and 2D-Winograd

| Metric | 2D-Winograd [16] | 3D-Winograd | Result |
| --- | --- | --- | --- |
| Latency (clock cycles) | 561 635 | 518 435 | 7% improvement |
| DSPs | 36 | 30 | lower computational complexity; fewer compute resources |
| FFs | 10 264 | 5 670 | |
| LUTs | 11 399 | 6 275 | |
| Multipliers | 947 | 1 060 | essentially unchanged |

Note: both 2D-Winograd and 3D-Winograd use a 4×4 tile, Fp32.

Table 2. Performance comparison before and after tile expansion

| Metric | 3D-Winograd (4×4 tile) | 3D-Winograd (6×6 tile) | Result |
| --- | --- | --- | --- |
| Latency (clock cycles) | 518 435 | 235 871 | 2.198× improvement |
| DSPs | 30 | 77 | more compute resources |
| FFs | 5 670 | 21 492 | |
| LUTs | 6 275 | 20 965 | |
| Multipliers | 1 060 | 1 633 | 54% increase |

Note: both 3D-Winograd configurations use Fp32; the tile size is expanded from 4×4 to 6×6.

Table 3. Performance comparison before and after quantization

| Metric | 3D-Winograd (Fp32) | 3D-Winograd (8-bit fixed-point / shift) | Result |
| --- | --- | --- | --- |
| Latency (clock cycles) | 235 871 | 32 747 / 30 633 | 7.20× / 7.70× improvement |
| DSPs | 77 | 59 / 47 | DSP usage reduced by 1.31× / 1.64× |
| FFs | 21 492 | 4 087 / 2 785 | |
| LUTs | 20 965 | 27 007 / 10 412 | |
| Multipliers | 1 633 | 728 / 657 | reduced by 2.24× / 2.49× |

Note: both configurations use a 6×6 tile; the paired values give the 8-bit fixed-point and shift-based results, respectively.

Table 4. Performance comparison of the traditional sliding-window convolution, 2D-Winograd, and 3D-Winograd algorithms

| Layer | Output feature map size | Kernel size | Tile size | Sliding window [19] /BFLOPS | 2D-Winograd [16] /BFLOPS | 3D-Winograd /BFLOPS |
| --- | --- | --- | --- | --- | --- | --- |
| Conv0 | 416×416×32 | 32/2 | 4×4 | 0.299 | 0.901 | 1.080 |
| Conv1 | 208×208×64 | 64/2 | 4×4 | 1.595 | 5.168 | 6.201 |
| Conv2 | 104×104×128 | 128/2 | 4×4 | 1.595 | 5.168 | 6.201 |
| Conv3 | 104×104×64 | 64/2 | 4×4 | 0.177 | 0.521 | 0.637 |
| Conv4 | 104×104×128 | 128/2 | 4×4 | 1.595 | 5.168 | 6.201 |
| Conv5 | 52×52×256 | 256/4 | 6×6 | 1.595 | 11.361 | 13.623 |
| Conv6 | 52×52×128 | 512/4 | 6×6 | 1.595 | 11.361 | 13.623 |
| Conv7 | 52×52×256 | 256/4 | 6×6 | 0.177 | 1.167 | 1.399 |
| Conv8 | 26×26×512 | 512/4 | 6×6 | 1.595 | 11.361 | 13.623 |
| Conv9 | 26×26×256 | 256/4 | 6×6 | 0.177 | 1.167 | 1.399 |
| Conv10 | 26×26×512 | 512/4 | 6×6 | 1.595 | 11.361 | 13.623 |
| Conv11 | 26×26×256 | 256/4 | 6×6 | 0.177 | 1.167 | 1.399 |
| Conv12 | 26×26×512 | 512/4 | 6×6 | 1.595 | 11.361 | 13.623 |
| Conv13 | 26×26×1 024 | 1 024/4 | 6×6 | 3.190 | 21.022 | 25.230 |
| Conv14 | 26×26×1 024 | 1 024/4 | 6×6 | 3.190 | 21.022 | 25.230 |
| Conv15 | 26×26×1 024 | 1 024/4 | 6×6 | 3.987 | 26.286 | 31.534 |
| Total time /ms | | | | 1 052 | 160.963 | 123.799 |

Table 5. Comparison of CNN performance/power ratio on FPGA and GPU

| Platform | Avg. throughput per Conv layer /BFLOPS | Avg. energy per Conv layer /J | Performance/power ratio |
| --- | --- | --- | --- |
| FPGA | 10.913 75 | 0.328 94 | 33.178 |
| GPU | 21.590 37 | 8.1 | 2.665 |
[1] ZHANG X F, WANG J S, ZHU C, et al. AccDNN: An IP-based DNN generator for FPGAs[C]//2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). Piscataway: IEEE Press, 2018: 210.
[2] GUAN Y J, LIANG H, XU N Y, et al. FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates[C]//2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). Piscataway: IEEE Press, 2017: 152-159.
[3] GEORGE J K, NEJADRIAHI H, SORGER V J. Towards on-chip optical FFTs for convolutional neural networks[C]//2017 IEEE International Conference on Rebooting Computing (ICRC). Piscataway: IEEE Press, 2017: 1-4.
[4] ORDÓÑEZ Á, ARGÜELLO F, HERAS D B. GPU accelerated FFT-based registration of hyperspectral scenes[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2017, 10(11): 4869-4878. doi: 10.1109/JSTARS.2017.2734052
[5] SUITA S, NISHIMURA T, TOKURA H, et al. Efficient cuDNN-compatible convolution-pooling on the GPU[C]//International Conference on Parallel Processing and Applied Mathematics. Berlin: Springer, 2019: 46-58.
[6] ZHANG C, PRASANNA V. Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system[C]//FPGA'17: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017: 35-44.
[7] CONG J, XIAO B J. Minimizing computation in convolutional neural networks[M]//Artificial Neural Networks and Machine Learning - ICANN 2014. Berlin: Springer, 2014: 281-290.
[8] SUDA N, CHANDRA V, DASIKA G, et al. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks[C]//FPGA'16: Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016: 16-25.
[9] ZHANG C, SUN G Y, FANG Z M, et al. Caffeine: Toward uniformed representation and acceleration for deep convolutional neural networks[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019, 38(11): 2072-2085. doi: 10.1109/TCAD.2017.2785257
[10] LAVIN A, GRAY S. Fast algorithms for convolutional neural networks[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE Press, 2016: 4013-4021.
[11] ZHANG C, LI P, SUN G Y, et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks[C]//FPGA'15: Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2015: 161-170.
[12] QIU J T, WANG J, YAO S, et al. Going deeper with embedded FPGA platform for convolutional neural network[C]//FPGA'16: Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016: 26-35.
[13] YU J C, GE G J, HU Y M, et al. Instruction driven cross-layer CNN accelerator for fast detection on FPGA[J]. ACM Transactions on Reconfigurable Technology and Systems, 2018, 11(3): 1-23.
[14] AHMAD A, PASHA M A. Towards design space exploration and optimization of fast algorithms for convolutional neural networks (CNNs) on FPGAs[C]//2019 Design, Automation & Test in Europe Conference & Exhibition (DATE). Piscataway: IEEE Press, 2019: 1106-1111.
[15] LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324. doi: 10.1109/5.726791
[16] LIANG Y, LU L Q, XIAO Q C, et al. Evaluating fast algorithms for convolutional neural networks on FPGAs[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020, 39(4): 857-870. doi: 10.1109/TCAD.2019.2897701
[17] LU L Q, LIANG Y. SpWA: An efficient sparse Winograd convolutional neural networks accelerator on FPGAs[C]//2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC). Piscataway: IEEE Press, 2018: 1-6.
[18] ZHAO Y L, WANG D H, WANG L O. Convolution accelerator designs using fast algorithms[J]. Algorithms, 2019, 12(5): 112. doi: 10.3390/a12050112
[19] REDMON J, FARHADI A. YOLO9000: Better, faster, stronger[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE Press, 2017: 6517-6525.