留言板

尊敬的读者、作者、审稿人, 关于本刊的投稿、审稿、编辑和出版的任何问题, 您可以本页添加留言。我们将尽快给您答复。谢谢您的支持!

姓名
邮箱
手机号码
标题
留言内容
验证码

基于RISC-V向量扩展的图像预处理加速方法

刘强 尹蔚 李凯

吕卫民, 胡冬, 谢劲松, 等 . 平面工艺功率型半导体晶体管铝迁移实验[J]. 北京航空航天大学学报, 2011, 37(12): 1515-1518.
引用本文: 刘强,尹蔚,李凯. 基于RISC-V向量扩展的图像预处理加速方法[J]. 北京航空航天大学学报,2025,51(4):1074-1084 doi: 10.13700/j.bh.1001-5965.2023.0208
Lü Weimin, Hu Dong, Xie Jinsong, et al. Experiments of aluminum migration of power silicon transistor using planar process[J]. Journal of Beijing University of Aeronautics and Astronautics, 2011, 37(12): 1515-1518. (in Chinese)
Citation: LIU Q,YIN W,LI K. Image preprocessing acceleration method based on RISC-V vector extension[J]. Journal of Beijing University of Aeronautics and Astronautics,2025,51(4):1074-1084 (in Chinese) doi: 10.13700/j.bh.1001-5965.2023.0208

基于RISC-V向量扩展的图像预处理加速方法

doi: 10.13700/j.bh.1001-5965.2023.0208
基金项目: 国家自然科学基金(U21B2031)
详细信息
    通讯作者:

    E-mail:qiangliu@tju.edu.cn

  • 中图分类号: TN402

Image preprocessing acceleration method based on RISC-V vector extension

Funds: National Natural Science Foundation of China (U21B2031)
More Information
  • 摘要:

    作为卷积神经网络(CNN)计算的前序步骤,图像预处理不可或缺又非常耗时。为加速图像预处理,提出一种基于RISC-V向量扩展的加速方法,对灰度化、标准化、高斯滤波等11种图像预处理算法进行加速。从计算模式上将11种图像预处理算法归为4类,并基于RISC-V向量扩展对各类图像预处理算法设计了加速方案;为进一步提高性能,新增6条自定义的向量指令,并通过修改编译器和设计硬件模块实现了6条自定义向量指令;使用现场可编程门阵列(FPGA)进行测试,并分析了向量处理器配置对性能和资源消耗的影响。结果显示:所提方法相比标量处理器实现了3.13~9.97倍的加速效果,可有效解决图像预处理在深度学习过程中的性能瓶颈问题。

     

  • 图 1  硬件平台架构

    Figure 1.  Hardware architecture

    图 2  索引加载指令实现像素位置变化

    Figure 2.  Pixel location change by index loading instruction

    图 3  卷积计算过程

    Figure 3.  Convolution computation process

    图 4  索引加载和累加计算过程

    Figure 4.  Computation process of index loading and accumulation

    图 5  高斯滤波计算过程

    Figure 5.  Gaussian filter computation process

    图 6  自定义向量指令格式

    Figure 6.  Customized vector instruction format

    图 7  自定义向量指令的硬件实现

    Figure 7.  Hardware implementation of customized vector instructions

    图 8  不同向量寄存器长度配置下的灰度化周期数

    Figure 8.  Cycles of grayscale processing under different vector register length configurations

    图 9  不同向量处理单元位宽配置下的灰度化周期数

    Figure 9.  Cycles of grayscale processing under different vector processing pipe width configurations

    图 10  图像分类应用中的效果

    Figure 10.  Effect in image classification application

    图 11  车牌识别应用中的效果

    Figure 11.  Effect in license plate recognition application

    表  1  自定义向量指令功能

    Table  1.   Function of customized vector instructions

    指令 功能
    vset255 vd = vs1 >rs2 ? 255: vs1
    vset0 vd = vs1 >rs2 ?vs1:0
    vabs vd = vs1 > rs2 ? vs1: −vs1
    vcmpset vd = vs1 > rs2 ?255: 0
    vwmsau vd = vd + (vs1·rs2 >> 8)
    vmuls vd = vs1·rs2 >>8
    下载: 导出CSV

    表  2  不同向量寄存器长度配置下资源消耗及功耗

    Table  2.   Resource and power consumption under different vector register length configurations

    向量寄存器长度/bit LUT FF BRAM DSP 功耗/W
    64 9880 5976 64 7 0.336
    128 10373 6964 64 7 0.346
    256 11312 8716 64 7 0.353
    512 13262 12168 64 7 0.373
    1024 17374 19079 64 7 0.424
    2048 41836 33068 64 7 0.533
    下载: 导出CSV

    表  3  不同向量处理单元位宽配置下资源消耗及功耗

    Table  3.   Resource and power consumption under different vector processing pipe width configurations

    向量处理单元位宽/bit LUT FF BRAM DSP 功耗/W
    32 41386 33068 64 7 0.533
    64 42693 33642 64 11 0.539
    128 44133 34772 64 19 0.545
    256 47470 37058 64 35 0.577
    512 52061 41584 64 67 0.607
    1024 81279 48705 64 131 0.738
    下载: 导出CSV

    表  4  像素位置变化类算法加速效果

    Table  4.   Acceleration results of pixel location change algorithms

    算法 标量基准周期数 RVV周期数 加速倍数
    旋转/镜像 46728 4791 9.75
    平移 47439 5679 8.35
    缩放 40681 5868 6.93
    下载: 导出CSV

    表  5  像素数值变化类算法加速效果

    Table  5.   Acceleration results of pixel value change algorithms

    算法 标量基准
    周期数
    标准RVV
    周期数
    标准RVV
    加速倍数
    自定义RVV
    周期数
    自定义RVV
    加速倍数
    灰度化 60488 10786 5.61 8651 6.99
    二值化 42576 42576 1.00 4272 9.97
    亮度调整 116627 80730 1.44 17422 6.69
    下载: 导出CSV

    表  6  求全局平均值类算法加速效果

    Table  6.   Acceleration results of global average computation algorithms

    算法 标量基准周期数 RVV周期数 加速倍数
    标准化 65620 8724 7.52
    下载: 导出CSV

    表  7  卷积类算法加速效果

    Table  7.   Acceleration results of convolution algorithms

    算法 标量基准
    周期数
    标准RVV
    周期数
    标准RVV
    加速倍数
    自定义RVV
    周期数
    自定义RVV
    加速倍数
    高斯滤波 268604 66178 4.06 63037 4.26
    拉普拉斯滤波 214456 105235 2.04 57650 3.72
    索贝尔边缘检测 214783 197202 1.09 68542 3.13
    下载: 导出CSV

    表  8  资源消耗及功耗

    Table  8.   Resource and power consumption

    硬件架构 LUT FF BRAM DSP 功耗/W
    ibex+CNN 310840 279836 957 2161 11.65
    ibex+Vicuna+CNN 349618 311699 1005 2168 11.88
    ibex+自定义Vicuna+CNN 349835 311964 1005 2168 11.88
    下载: 导出CSV
  • [1] PAL K K, SUDEEP K S. Preprocessing for image classification by convolutional neural networks[C]//Proceedings of the IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology. Piscataway: IEEE Press, 2016: 1778-1781.
    [2] ŞABAN Ö, AKDEMIR B. Effects of histopathological image pre-processing on convolutional neural networks[J]. Procedia Computer Science, 2018, 132: 396-403. doi: 10.1016/j.procs.2018.05.166
    [3] PITALOKA D A, WULANDARI A, BASARUDDIN T, et al. Enhancing CNN with preprocessing stage in automatic emotion recognition[J]. Procedia Computer Science, 2017, 116: 523-529. doi: 10.1016/j.procs.2017.10.038
    [4] TABIK S, PERALTA D, HERRERA-POYATOS A, et al. A snapshot of image pre-processing for convolutional neural networks: case study of MNIST[J]. International Journal of Computational Intelligence Systems, 2017, 10(1): 555-568. doi: 10.2991/ijcis.2017.10.1.38
    [5] DU C Y, TSAI C F, CHEN W C, et al. A 28 nm 11.2 TOPS/W hardware-utilization-aware neural-network accelerator with dynamic dataflow[C]//Proceedings of the IEEE International Solid-State Circuits Conference. Piscataway: IEEE Press, 2023: 1-3.
    [6] KELLER B, VENKATESAN R, DAI S, et al. A 95.6-TOPS/W deep learning inference accelerator with per-vector scaled 4-bit quantization in 5 nm[J]. IEEE Journal of Solid-State Circuits, 2023, 58(4): 1129-1141. doi: 10.1109/JSSC.2023.3234893
    [7] KARNIK T, KURIAN D, ASERON P, et al. A cm-scale self-powered intelligent and secure IoT edge mote featuring an ultra-low-power SoC in 14 nm tri-gate CMOS[C]//Proceedings of the IEEE International Solid-State Circuits Conference. Piscataway: IEEE Press, 2018: 46-48.
    [8] NARAYANAN D, SANTHANAM K, PHANISHAYEE A, et al. Accelerating deep learning workloads through efficient multi-model execution[C]//Proceedings of the NeurIPS Workshop on Systems for Machine Learning. [S. l. ]: NeurIPS, 2018: 20-27.
    [9] JIA T Y, JU Y H, JOSEPH R, et al. NCPU: an embedded neural CPU architecture on resource-constrained low power devices for real-time end-to-end performance[C]//Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture. Piscataway: IEEE Press, 2020: 1097-1109.
    [10] NVIDIA. NVIDIA data loading library (DALI) [EB/OL]. [2023-04-01]. https://github.com/NVIDIA/DALI.
    [11] MA N. A SoC-based acceleration method for UAV runway detection image pre-processing algorithm[C]//Proceedings of the 25th International Conference on Automation and Computing. Piscataway: IEEE Press, 2019: 1-6.
    [12] KUO Y M, GARCÍA-HERRERO F, RUANO O, et al. RISC-V Galois field ISA extension for non-binary error-correction codes and classical and post-quantum cryptography[J]. IEEE Transactions on Computers, 2023, 72(3): 682-692.
    [13] RAZILOV V, MATÚŠ E, FETTWEIS G. Communications signal processing using RISC-V vector extension[C]//Proceedings of the International Wireless Communications and Mobile Computing. Piscataway: IEEE Press, 2022: 690-695.
    [14] CAVALCANTE M, SCHUIKI F, ZARUBA F, et al. Ara: a 1-GHz scalable and energy-efficient RISC-V vector processor with multiprecision floating-point support in 22-nm FD-SOI[J]. IEEE Transactions on Very Large Scale Integration Systems, 2020, 28(2): 530-543. doi: 10.1109/TVLSI.2019.2950087
    [15] 刘强, 李一可. 基于指令扩展的RISC-V可配置故障注入检测方法[J/OL]. 北京航空航天大学学报, 2023(2023-03-10)[2023-04-01]. https://bhxb.buaa.edu.cn/bhzk/cn/article/doi/10.13700/j.bh.1001-5965.2022.0995.

    LIU Q, LI Y K. Configurable fault detection method for RISC-V processors based on instruction extension[J/OL]. Journal of Beijing University of Aeronautics and Astronautics, 2023(2023-03-10)[2023-04-01]. https://bhxb.buaa.edu.cn/bhzk/cn/article/doi/10.13700/j.bh.1001-5965.2022.0995(in Chinese).
    [16] ASANOVIC K. Vector extension 1.0[EB/OL]. (2021-09-20)[2023-04-01]. https://github.com/riscv/riscv-v-spec/releases/tag/v1.0.
    [17] PLATZER M, PUSCHNER P. Vicuna: a timing-predictable RISC-V vector coprocessor for scalable parallel computation[C]//Proceedings of the 33rd Euromicro Conference on Real-Time Systems. Porto: [s. n. ], 2021: 1-18.
    [18] SCHIAVONE P D, CONTI F, ROSSI D, et al. Slow and steady wins the race? A comparison of ultra-low-power RISC-V cores for Internet-of-things applications[C]//Proceedings of the 27th International Symposium on Power and Timing Modeling, Optimization and Simulation. Piscataway: IEEE Press, 2017: 1-8.
    [19] OpenHW Group. OpenHW group eXtension interface[EB/OL]. [2023-04-01]. https://docs.openhwgroup.org/projects/openhw-group-core-v-xif/en/latest/index.html.
    [20] ALEX K, VINOD N. The CIFAR-10 dataset[EB/OL]. [2023-04-01]. http://www.cs.toronto.edu/~kriz/cifar.html.
    [21] YAN S, LIU Z Y, WANG Y, et al. An FPGA-based MobileNet accelerator considering network structure characteristics[C]//Proceedings of the 31st International Conference on Field-Programmable Logic and Applications. Piscataway: IEEE Press, 2021: 17-23.
  • 加载中
图(11) / 表(8)
计量
  • 文章访问数:  273
  • HTML全文浏览量:  72
  • PDF下载量:  6
  • 被引次数: 0
出版历程
  • 收稿日期:  2023-04-24
  • 录用日期:  2023-05-26
  • 网络出版日期:  2023-06-09
  • 整期出版日期:  2025-04-30

目录

    /

    返回文章
    返回
    常见问答