
Design and FPGA implementation of fast convolution algorithm based on 3D-Winograd

LIN Keyu, JIANG Hongxu, ZHANG Yonghua, CONG Rongzi

Citation: LIN Keyu, JIANG Hongxu, ZHANG Yonghua, et al. Design and FPGA implementation of fast convolution algorithm based on 3D-Winograd[J]. Journal of Beijing University of Aeronautics and Astronautics, 2021, 47(9): 1900-1907. doi: 10.13700/j.bh.1001-5965.2020.0310 (in Chinese)


doi: 10.13700/j.bh.1001-5965.2020.0310
Funds:

Aerospace Science and Technology Fund 190109

National Natural Science Foundation of China 61872017

Corresponding author: JIANG Hongxu, E-mail: jianghx@buaa.edu.cn

  • CLC number: TP391

  • Abstract:

    In recent years, convolutional neural networks (CNNs) have been widely adopted in computer vision tasks. Owing to their high performance, energy efficiency, and reconfigurability, FPGAs are regarded as among the most promising hardware accelerators for CNNs. However, constrained by the limited computing power and memory resources of FPGAs, FPGA solutions that compute three-dimensional convolution with the conventional Winograd algorithm still leave room for performance improvement. This work first studies the one-dimensional expansion of the Winograd algorithm suitable for three-dimensional operations, and then improves the performance of CNNs on FPGAs by enlarging the input feature maps and convolution tiles processed at a time and by quantizing the weights and input data to low bit widths. The optimization comprises four parts: replacing part of the divisions with shifts, a tiling scheme, the extension from two to three dimensions, and low-bit quantization. Compared with the conventional 2D-Winograd algorithm, the optimized algorithm reduces the clock cycles of each convolution layer by a factor of about 7, and it achieves a comparable average per-layer reduction relative to the conventional sliding-window convolution algorithm. The study shows that the 3D-Winograd algorithm based on one-dimensional expansion can greatly reduce computational complexity and improve the performance of CNNs running on FPGAs.
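    To make the one-dimensional building block concrete, the following is a minimal numpy sketch of the Winograd F(2,3) kernel that the expansion starts from; it is an illustration using the standard transform matrices of Lavin and Gray [10], not the authors' FPGA implementation. It produces two outputs of a 3-tap filter from a 4-sample tile with 4 multiplications instead of the 6 a sliding window needs.

```python
import numpy as np

# Winograd F(2,3) transform matrices (Lavin & Gray, ref. [10]).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float64)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_f23(d, g):
    """Two outputs of a 3-tap filter from a 4-sample tile using
    4 multiplications (vs. 6 for the sliding-window form)."""
    U = G @ g    # filter transform (precomputable offline)
    V = BT @ d   # input transform (additions/subtractions only)
    M = U * V    # the 4 element-wise multiplications
    return AT @ M  # output transform

d = np.array([1.0, 2.0, 3.0, 4.0])  # one input tile
g = np.array([1.0, 1.0, 1.0])       # 3-tap kernel
print(winograd_f23(d, g))           # [6. 9.], matches direct convolution
```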

     

    Figure 1.  Conventional sliding-window convolution algorithm and the 2D-Winograd algorithm

    Figure 2.  One-dimensional convolution process based on the Winograd principle

    Figure 3.  Three-dimensional Winograd convolution process based on one-dimensional expansion

    Figure 4.  Pseudocode of the three Winograd convolution processes
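    For reference, the operations that Figures 1 and 2 depict are the standard Winograd minimal filtering identities [10]; in 1D, and nested over rows and columns for the 2D tile case,

$$Y = A^{\mathrm{T}}\big[(Gg)\odot(B^{\mathrm{T}}d)\big], \qquad Y = A^{\mathrm{T}}\big[(GgG^{\mathrm{T}})\odot(B^{\mathrm{T}}dB)\big]A$$

    where $d$ is the input tile, $g$ the filter, $\odot$ the element-wise product, and $B$, $G$, $A$ the input, filter, and output transforms. The paper's 3D-Winograd (Figure 3) builds the third dimension from the same one-dimensional expansion.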

    Table 1.  Performance comparison of 3D-Winograd and 2D-Winograd

    | Metric                 | 2D-Winograd[16] | 3D-Winograd | Result                                                  |
    |------------------------|-----------------|-------------|---------------------------------------------------------|
    | Clock cycles (latency) | 561 635         | 518 435     | 7% improvement                                          |
    | DSPs                   | 36              | 30          | lower computational complexity; fewer compute resources |
    | FFs                    | 10 264          | 5 670       |                                                         |
    | LUTs                   | 11 399          | 6 275       |                                                         |
    | Multipliers            | 947             | 1 060       | essentially unchanged                                   |

    Note: Tile size is 4×4 with FP32 for both 2D-Winograd and 3D-Winograd.

    Table 2.  Performance comparison before and after tile expansion

    | Metric                 | 3D-Winograd (4×4 tile) | 3D-Winograd (6×6 tile) | Result                 |
    |------------------------|------------------------|------------------------|------------------------|
    | Clock cycles (latency) | 518 435                | 235 871                | 2.198× improvement     |
    | DSPs                   | 30                     | 77                     | more compute resources |
    | FFs                    | 5 670                  | 21 492                 |                        |
    | LUTs                   | 6 275                  | 20 965                 |                        |
    | Multipliers            | 1 060                  | 1 633                  | 54% increase           |

    Note: The two 3D-Winograd columns use tile sizes 4×4 and 6×6, both FP32.
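    Table 2's speed-up from the larger tile tracks the standard Winograd multiplication count [10], assuming the 3×3 kernels this tiling implies (a 4×4 input tile corresponds to F(2×2, 3×3), a 6×6 tile to F(4×4, 3×3)):

$$\mu\big(F(2\times2,\,3\times3)\big)=(2+3-1)^2=16 \ \text{vs}\ 2^2\cdot3^2=36 \ (\approx 2.25\times)$$
$$\mu\big(F(4\times4,\,3\times3)\big)=(4+3-1)^2=36 \ \text{vs}\ 4^2\cdot3^2=144 \ (4\times)$$

    so the larger tile nearly doubles the per-tile arithmetic efficiency, at the price of the additional DSP/FF/LUT usage shown in the table.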

    Table 3.  Performance comparison before and after quantization

    | Metric                 | 3D-Winograd (FP32) | 3D-Winograd (8-bit fixed point / shift) | Result                          |
    |------------------------|--------------------|-----------------------------------------|---------------------------------|
    | Clock cycles (latency) | 235 871            | 32 747 / 30 633                         | 7.20× / 7.70× improvement       |
    | DSPs                   | 77                 | 59 / 47                                 | DSP usage reduced 1.31× / 1.64× |
    | FFs                    | 21 492             | 4 087 / 2 785                           |                                 |
    | LUTs                   | 20 965             | 27 007 / 10 412                         |                                 |
    | Multipliers            | 1 633              | 728 / 657                               | reduced 2.24× / 2.49×           |

    Note: 3D-Winograd tile size is 6×6; the baseline is FP32, and the two quantized variants use 8-bit fixed point and shift-based arithmetic, respectively.
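    The quantization gains in Table 3 come from replacing floating-point multiplies with narrow integer ones and, in the shift variant, replacing divisions by powers of two with bit shifts. The sketch below illustrates the idea under assumed parameters; FRAC_BITS, the helper names, and the Q-format are illustrative choices, not taken from the paper.

```python
import numpy as np

# Hypothetical sketch of "shift instead of division": with a
# power-of-two quantization scale, rescaling an 8-bit fixed-point
# product needs a bit shift rather than a division.
FRAC_BITS = 5  # assumed number of fractional bits

def quantize_q8(x, frac_bits=FRAC_BITS):
    """Map floats to signed 8-bit fixed point with scale 2**frac_bits."""
    q = np.round(x * (1 << frac_bits)).astype(np.int32)
    return np.clip(q, -128, 127).astype(np.int8)

def fixed_point_conv1d(d_q, g_q, frac_bits=FRAC_BITS):
    """Integer valid convolution; each product carries 2*frac_bits
    fractional bits, so one right shift restores the output scale.
    (Right shift floors, a cheap approximation of dividing by 2**n.)"""
    n = len(d_q) - len(g_q) + 1
    acc = np.array([np.dot(d_q[i:i + len(g_q)].astype(np.int32),
                           g_q.astype(np.int32)) for i in range(n)])
    return acc >> frac_bits  # shift replaces division by 2**frac_bits

d = np.array([0.5, 1.0, 1.5, 2.0])
g = np.array([0.5, 0.25, 0.5])
y_q = fixed_point_conv1d(quantize_q8(d), quantize_q8(g))
print(y_q / (1 << FRAC_BITS))  # [1.25 1.875], matches float convolution
```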

    Table 4.  Performance comparison of the conventional sliding-window convolution, 2D-Winograd, and 3D-Winograd algorithms

    | Layer         | Output feature map size | Kernel size | Tile size | Sliding window[19] | 2D-Winograd[16] | 3D-Winograd |
    |---------------|-------------------------|-------------|-----------|--------------------|-----------------|-------------|
    | Conv0         | 416×416×32              | 32/2        | 4×4       | 0.299              | 0.901           | 1.080       |
    | Conv1         | 208×208×64              | 64/2        | 4×4       | 1.595              | 5.168           | 6.201       |
    | Conv2         | 104×104×128             | 128/2       | 4×4       | 1.595              | 5.168           | 6.201       |
    | Conv3         | 104×104×64              | 64/2        | 4×4       | 0.177              | 0.521           | 0.637       |
    | Conv4         | 104×104×128             | 128/2       | 4×4       | 1.595              | 5.168           | 6.201       |
    | Conv5         | 52×52×256               | 256/4       | 6×6       | 1.595              | 11.361          | 13.623      |
    | Conv6         | 52×52×128               | 512/4       | 6×6       | 1.595              | 11.361          | 13.623      |
    | Conv7         | 52×52×256               | 256/4       | 6×6       | 0.177              | 1.167           | 1.399       |
    | Conv8         | 26×26×512               | 512/4       | 6×6       | 1.595              | 11.361          | 13.623      |
    | Conv9         | 26×26×256               | 256/4       | 6×6       | 0.177              | 1.167           | 1.399       |
    | Conv10        | 26×26×512               | 512/4       | 6×6       | 1.595              | 11.361          | 13.623      |
    | Conv11        | 26×26×256               | 256/4       | 6×6       | 0.177              | 1.167           | 1.399       |
    | Conv12        | 26×26×512               | 512/4       | 6×6       | 1.595              | 11.361          | 13.623      |
    | Conv13        | 26×26×1 024             | 1 024/4     | 6×6       | 3.190              | 21.022          | 25.230      |
    | Conv14        | 26×26×1 024             | 1 024/4     | 6×6       | 3.190              | 21.022          | 25.230      |
    | Conv15        | 26×26×1 024             | 1 024/4     | 6×6       | 3.987              | 26.286          | 31.534      |
    | Total time/ms |                         |             |           | 1 052              | 160.963         | 123.799     |

    Note: The last three columns give per-layer throughput in BFLOPS; the final row gives the total time in ms.

    Table 5.  Comparison of CNN performance/power ratio on FPGA and GPU

    | Platform | Average throughput per Conv layer/BFLOPS | Average energy per Conv layer/J | Performance/power ratio |
    |----------|------------------------------------------|---------------------------------|-------------------------|
    | FPGA     | 10.913 75                                | 0.328 94                        | 33.178                  |
    | GPU      | 21.590 37                                | 8.1                             | 2.665                   |
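    The last column of Table 5 is simply the first column divided by the second, which can be checked directly:

$$\frac{10.913\,75}{0.328\,94}\approx 33.178 \ \ (\text{FPGA}), \qquad \frac{21.590\,37}{8.1}\approx 2.665 \ \ (\text{GPU})$$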
  • [1] ZHANG X F, WANG J S, ZHU C, et al. AccDNN: An IP-based DNN generator for FPGAs[C]//2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). Piscataway: IEEE Press, 2018: 210.
    [2] GUAN Y J, LIANG H, XU N Y, et al. FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates[C]//2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). Piscataway: IEEE Press, 2017: 152-159.
    [3] GEORGE J K, NEJADRIAHI H, SORGER V J. Towards on-chip optical FFTs for convolutional neural networks[C]//2017 IEEE International Conference on Rebooting Computing (ICRC). Piscataway: IEEE Press, 2017: 1-4.
    [4] ORDÓÑEZ Á, ARGÜELLO F, HERAS D B. GPU accelerated FFT-based registration of hyperspectral scenes[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2017, 10(11): 4869-4878. doi: 10.1109/JSTARS.2017.2734052
    [5] SUITA S, NISHIMURA T, TOKURA H, et al. Efficient cuDNN-compatible convolution-pooling on the GPU[C]//International Conference on Parallel Processing and Applied Mathematics. Berlin: Springer, 2019: 46-58.
    [6] ZHANG C, PRASANNA V. Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system[C]//FPGA'17: Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017: 35-44.
    [7] CONG J, XIAO B J. Minimizing computation in convolutional neural networks[M]//Artificial Neural Networks and Machine Learning-ICANN 2014. Berlin: Springer, 2014: 281-290.
    [8] SUDA N, CHANDRA V, DASIKA G, et al. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks[C]//FPGA'16: Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016: 16-25.
    [9] ZHANG C, SUN G Y, FANG Z M, et al. Caffeine: Toward uniformed representation and acceleration for deep convolutional neural networks[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019, 38(11): 2072-2085. doi: 10.1109/TCAD.2017.2785257
    [10] LAVIN A, GRAY S. Fast algorithms for convolutional neural networks[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE Press, 2016: 4013-4021.
    [11] ZHANG C, LI P, SUN G Y, et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks[C]//FPGA'15: Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2015: 161-170.
    [12] QIU J T, WANG J, YAO S, et al. Going deeper with embedded FPGA platform for convolutional neural network[C]//FPGA'16: Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016: 26-35.
    [13] YU J C, GE G J, HU Y M, et al. Instruction driven cross-layer CNN accelerator for fast detection on FPGA[J]. ACM Transactions on Reconfigurable Technology and Systems, 2018, 11(3): 1-23.
    [14] AHMAD A, PASHA M A. Towards design space exploration and optimization of fast algorithms for convolutional neural networks (CNNs) on FPGAs[C]//2019 Design, Automation & Test in Europe Conference & Exhibition (DATE). Piscataway: IEEE Press, 2019: 1106-1111.
    [15] LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324. doi: 10.1109/5.726791
    [16] LIANG Y, LU L Q, XIAO Q C, et al. Evaluating fast algorithms for convolutional neural networks on FPGAs[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020, 39(4): 857-870. doi: 10.1109/TCAD.2019.2897701
    [17] LU L Q, LIANG Y. SpWA: An efficient sparse Winograd convolutional neural networks accelerator on FPGAs[C]//2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC). Piscataway: IEEE Press, 2018: 1-6.
    [18] ZHAO Y L, WANG D H, WANG L O. Convolution accelerator designs using fast algorithms[J]. Algorithms, 2019, 12(5): 112. doi: 10.3390/a12050112
    [19] REDMON J, FARHADI A. YOLO9000: Better, faster, stronger[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway: IEEE Press, 2017: 6517-6525.
Publication history
  • Received: 2020-07-03
  • Accepted: 2020-11-08
  • Available online: 2021-09-20
