
Lightweight lip reading method based on decoupling homogeneous self-knowledge distillation

MA Jinlin, LIU Yuhao, MA Ziping, GUO Zhaowei, LYU Xin

Citation: MA J L, LIU Y H, MA Z P, et al. Lightweight lip reading method based on decoupling homogeneous self-knowledge distillation[J]. Journal of Beijing University of Aeronautics and Astronautics, 2024, 50(12): 3709-3719 (in Chinese). doi: 10.13700/j.bh.1001-5965.2022.0931

doi: 10.13700/j.bh.1001-5965.2022.0931

Corresponding author: E-mail: majinlin@nmu.edu.cn

  • CLC number: TP391

Lightweight lip reading method based on decoupling homogeneous self-knowledge distillation

Funds: Natural Science Foundation of Ningxia Autonomous Region (2022AAC03268); Basic Research Funds for Central Universities of North Minzu University (2021KJCX09,FWNX21); The Innovation Team of Computer Vision and Virtual Reality of North Minzu University
  • Abstract:

    To address the problem that lip reading models cannot be deployed on mobile terminals and edge devices because of their large parameter counts and computational costs, a lip reading method based on decoupled homogeneous self-knowledge distillation and GhostNet-TSM is proposed. A GhostNet-TSM network with temporal feature extraction capability is constructed. The features used in homogeneous self-knowledge distillation are decoupled into target-class and non-target-class features, with a separate loss function for each, to improve recognition accuracy. Models are trained with the decoupled homogeneous self-knowledge distillation method on the LRW and LIP350 datasets and validated on the OuluVS dataset. Experimental results show that GhostNet-TSM reaches 85.2% recognition accuracy on LRW, exceeding most non-lightweight models, while its floating-point operations and parameter count drop to 0.988 GFLOPs and 20.310×10^6, respectively.
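The decoupling described in the abstract splits the distillation signal into a target-class term and a non-target-class term, each with its own loss weight (α and β in Table 4). The following NumPy sketch illustrates one plausible form of such a decoupled loss; the function names, the temperature value, and the exact normalization are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(z, t=1.0):
    """Temperature-scaled softmax along the last axis."""
    z = z / t
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    """KL(p || q) along the last axis."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def decoupled_selfkd_loss(student_logits, teacher_logits, labels,
                          alpha=0.6, beta=4.0, tau=4.0):
    """Sketch of a decoupled (target-class / non-target-class) distillation loss.

    The target-class term (T) compares the binary distribution
    [p_target, 1 - p_target]; the non-target term (N) compares the
    renormalized distribution over the remaining classes.
    """
    B, C = student_logits.shape
    ps = softmax(student_logits, tau)
    pt = softmax(teacher_logits, tau)
    idx = (np.arange(B), labels)

    # Target-class (T) term: binary KL on [p_y, 1 - p_y]
    bs = np.stack([ps[idx], 1.0 - ps[idx]], axis=1)
    bt = np.stack([pt[idx], 1.0 - pt[idx]], axis=1)
    t_loss = kl(bt, bs).mean()

    # Non-target-class (N) term: KL over the other C-1 classes, renormalized
    mask = np.ones((B, C), dtype=bool)
    mask[idx] = False
    ns = ps[mask].reshape(B, C - 1)
    nt = pt[mask].reshape(B, C - 1)
    ns = ns / ns.sum(axis=1, keepdims=True)
    nt = nt / nt.sum(axis=1, keepdims=True)
    n_loss = kl(nt, ns).mean()

    return alpha * t_loss + beta * n_loss
```

Setting α = 0 recovers the "N only" variant of the ablations, and the best reported weighting on LRW is α = 0.6, β = 4 (Table 4). In homogeneous self-distillation the "teacher" logits would come from another sample of the same class rather than a separate teacher network.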


  • Figure 1. Ordinary convolution and the Ghost module

    Figure 2. Structure of the Ghost bottleneck module
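Figure 1 contrasts ordinary convolution with the Ghost module, which produces a few "intrinsic" feature maps with a normal convolution and derives the remaining "ghost" maps with cheap per-channel operations. The NumPy sketch below is a simplification under stated assumptions: a 1×1 primary convolution and a scalar-per-channel cheap branch (GhostNet's cheap branch is actually a k×k depthwise convolution).

```python
import numpy as np

def ghost_module(x, w_primary, w_cheap):
    """Sketch of a Ghost module on a (C_in, H, W) feature map.

    A 1x1 'primary' convolution produces m intrinsic channels; a cheap
    per-channel operation then produces m ghost channels; the two sets
    are concatenated, giving 2m output channels at roughly half the
    cost of a full convolution.
    """
    c_in, h, w = x.shape
    m = w_primary.shape[0]  # number of intrinsic channels
    # Primary 1x1 convolution as a matrix product: (m, c_in) @ (c_in, H*W)
    intrinsic = (w_primary @ x.reshape(c_in, -1)).reshape(m, h, w)
    # Cheap operation: one scalar weight per intrinsic channel
    ghost = w_cheap[:, None, None] * intrinsic
    return np.concatenate([intrinsic, ghost], axis=0)
```

The saving comes from the cheap branch: its cost scales with m rather than with m × C_in, which is why GhostNet-TSM's FLOPs in Table 2 are well below those of the other backbones.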

    Figure 3. Principle of ordinary knowledge distillation

    Figure 4. Principle of homogeneous self-knowledge distillation

    Figure 5. Principle of the TSM module

    Figure 6. Structure of GhostNet-TSM

    Figure 7. Two ways of inserting the TSM module
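Figures 5 and 7 concern the temporal shift module (TSM), which shifts a fraction of the channels along the time axis so that a 2D backbone can mix information across frames at zero extra FLOPs. A minimal NumPy sketch of the shift itself, assuming zero padding at the sequence ends and the usual 1/8 channel split (the residual insertion variant in Figure 7 would add this output back onto the block input):

```python
import numpy as np

def temporal_shift(x, fold_div=8):
    """Sketch of the TSM temporal shift on a (T, C, H, W) tensor.

    The first C/fold_div channels are shifted one step toward earlier
    time, the next C/fold_div one step toward later time, and the
    remaining channels are left in place.
    """
    t, c, h, w = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                   # shift toward earlier frames
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]   # shift toward later frames
    out[:, 2 * fold:] = x[:, 2 * fold:]              # untouched channels
    return out
```

Because the shift is pure data movement, inserting it into GhostNet adds temporal modelling without increasing the parameter count or FLOPs reported in Table 2.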

    Figure 8. Pipeline of homogeneous self-knowledge distillation before and after decoupling

    Figure 9. Samples of LIP350 data acquisition

    Table 1. Influence of T and N on recognition performance

    Network          Distillation / regularization method                 Accuracy/%   Change/%
    GhostNet-TSM     No distillation                                      79.3
                     Homogeneous self-knowledge distillation              80.2         +0.9
                     Homogeneous self-knowledge distillation (T only)     77.6         −1.7
                     Homogeneous self-knowledge distillation (N only)     80.7         +1.4
    MobileNet-TSM    No distillation                                      75.3
                     Homogeneous self-knowledge distillation              76.1         +0.8
                     Homogeneous self-knowledge distillation (T only)     74.4         −0.9
                     Homogeneous self-knowledge distillation (N only)     76.3         +1.0
    ResNet101-TSM    No distillation                                      83.6
                     Homogeneous self-knowledge distillation              84.5         +0.9
                     Homogeneous self-knowledge distillation (T only)     83.0         −0.6
                     Homogeneous self-knowledge distillation (N only)     84.4         +0.8

    Table 2. Comparative experiments on different networks combined with the TSM module

    Network             Base network    Type              Accuracy/%   Parameters     FLOPs/GFLOPs
    ResNet18-TSM        ResNet18        Non-lightweight   74.9         11.356×10^6    7.320
    ResNet34-TSM        ResNet34        Non-lightweight   79.8         21.464×10^6    14.995
    ResNet101-TSM       ResNet101       Non-lightweight   83.6         43.217×10^6    33.180
    DenseNet121-TSM     DenseNet121     Non-lightweight   77.6         51.182×10^6    37.976
    MobileNet-TSM       MobileNetV2     Lightweight       75.3         3.672×10^6     1.300
    ShuffleNet v2-TSM   ShuffleNet V2   Lightweight       71.1         4.823×10^6     1.492
    GhostNet-TSM (ours) GhostNet        Lightweight       79.3         4.350×10^6     0.678266

    Table 3. Performance comparison with other distillation methods

    Distillation / regularization method                        Accuracy/%
    No distillation                                             79.3
    Label smoothing                                             79.5
    Ordinary knowledge distillation                             77.6
    Self-knowledge distillation                                 79.7
    Homogeneous self-knowledge distillation                     80.2
    Decoupled homogeneous self-knowledge distillation (N only)  80.7

    Table 4. Comparison of decoupling parameters

    Coefficients of T and N     Accuracy/% (LIP350)   Accuracy/% (LRW)
    α = 0,   β = 0.8            80.4                  82.0
    α = 0,   β = 1              80.7                  82.5
    α = 0,   β = 2              80.9                  82.1
    α = 0,   β = 4              81.5                  83.6
    α = 0,   β = 5              81.3                  82.6
    α = 0,   β = 6              81.1                  83.2
    α = 0,   β = 7              79.5                  81.4
    α = 0,   β = 8              77.1                  80.8
    α = 0,   β = 4              81.5                  83.6
    α = 0.2, β = 4              82.1                  83.9
    α = 0.3, β = 4              82.5                  83.3
    α = 0.4, β = 4              83.3                  84.5
    α = 0.5, β = 4              82.9                  84.6
    α = 0.6, β = 4              82.8                  85.2
    α = 0.8, β = 4              81.3                  84.8
    α = 1,   β = 4              80.8                  83.5

    Table 5. Validation experiments on the OuluVS dataset

    Speaker   Distillation method                                 Accuracy/%
    No. 7     No distillation                                     91.2
              Homogeneous self-knowledge distillation             91.8
              Decoupled homogeneous self-knowledge distillation   94.6
    No. 9     No distillation                                     81.0
              Homogeneous self-knowledge distillation             82.3
              Decoupled homogeneous self-knowledge distillation   86.4
    No. 14    No distillation                                     87.5
              Homogeneous self-knowledge distillation             88.3
              Decoupled homogeneous self-knowledge distillation   89.8

    Table 6. Performance comparison with other lip reading methods

    Method      Year   Front-end                     Back-end                           Parameters      FLOPs/GFLOPs   Accuracy/%
    Ref. [30]   2017   VGG-M                                                                                          61.1
    Ref. [31]   2017   VGG-M                         LSTM                               419.256×10^6    32.744         76.2
    Ref. [32]   2019   ResNet34 + DenseNet3D         Conv-BLSTM                         331.603×10^6    29.386         83.3
    Ref. [10]   2017   3D-Conv + ResNet34            BLSTM                              93.168×10^6     11.539         83.0
    Ref. [27]   2018   3D-Conv + ResNet34            BLSTM                              132.581×10^6    13.681         83.4
    Ref. [33]   2019   (3D-Conv)×2                   BLSTM                              976.867×10^6    47.806         84.1
    Ref. [26]   2018   3D-Conv + ResNet18            BLSTM                              55.378×10^6     9.332          84.3
    Ref. [29]   2020   3D-Conv + ResNet18            BGRU                               57.644×10^6     10.055         84.4
    Ref. [28]   2020   3D-Conv + ResNet18            BGRU                               59.520×10^6     10.598         85.0
    Ref. [34]   2020   3D-Conv + P3D-ResNet          TCN                                                               84.8
    Ref. [35]   2020   SpotFast                      Transformer + Product-Key memory                                  84.4
    Ref. [13]   2020   3D-Conv + ResNet18            MS-TCN                             52.872×10^6     9.120          85.3
    Ref. [36]   2021   3D-Conv + ResNet18            BGRU + Visual-Audio Memory                                        85.4
    Ref. [37]   2022   MoCo + Wav2Vec (SJTU LUMIA)                                                                     85.0
    Ours               GhostNet + TSM + decoupled homogeneous self-knowledge distillation   20.310×10^6   0.988        85.2
  • [1] LESANI F S, GHAZVINI F F, DIANAT R. Mobile phone security using automatic lip reading[C]//Proceedings of the 9th International Conference on e-Commerce in Developing Countries: With Focus on e-Business. Piscataway: IEEE Press, 2015.
    [2] MATHULAPRANGSAN S, WANG C Y, KUSUM A Z, et al. A survey of visual lip reading and lip-password verification[C]//Proceedings of the International Conference on Orange Technologies. Piscataway: IEEE Press, 2015: 22-25.
    [3] BAHDANAU D, CHOROWSKI J, SERDYUK D, et al. End-to-end attention-based large vocabulary speech recognition[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2016: 4945-4949.
    [4] HUANG J T, LI J Y, GONG Y F. An analysis of convolutional neural networks for speech recognition[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2015: 4989-4993.
    [5] CHAE H, KANG C M, KIM B, et al. Autonomous braking system via deep reinforcement learning[C]//Proceedings of the IEEE 20th International Conference on Intelligent Transportation Systems. Piscataway: IEEE Press, 2017: 1-6.
    [6] HONG X P, YAO H X, WAN Y Q, et al. A PCA based visual DCT feature extraction method for lip-reading[C]//Proceedings of the International Conference on Intelligent Information Hiding and Multimedia. Piscataway: IEEE Press, 2006: 321-326.
    [7] PUVIARASAN N, PALANIVEL S. Lip reading of hearing impaired persons using HMM[J]. Expert Systems with Applications, 2011, 38(4): 4477-4481.
    [8] MA J L, ZHU Y B, MA Z P, et al. Survey of deep learning methods for lip recognition[J]. Computer Engineering and Applications, 2021, 57(24): 61-73 (in Chinese).
    [9] MA J L, CHEN D G, GUO B B, et al. Lip corpus review[J]. Computer Engineering and Applications, 2019, 55(22): 1-13 (in Chinese).
    [10] STAFYLAKIS T, TZIMIROPOULOS G. Combining residual networks with LSTMs for lipreading[C]//Proceedings of the 18th Annual Conference of the International Speech Communication Association. Toulouse: ISCA-INT Speech Communication Association, 2017: 3652-3656.
    [11] ASSAEL Y M, SHILLINGFORD B, WHITESON S, et al. LipNet: End-to-end sentence-level lipreading[EB/OL]. (2016-12-16)[2022-11-01].
    [12] YANG H, YUAN C F, ZHANG L, et al. STA-CNN: Convolutional spatial-temporal attention learning for action recognition[J]. IEEE Transactions on Image Processing, 2020, 29: 5783-5793. doi: 10.1109/TIP.2020.2984904
    [13] MARTINEZ B, MA P C, PETRIDIS S, et al. Lipreading using temporal convolutional networks[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2020: 6319-6323.
    [14] HAN K, WANG Y H, TIAN Q, et al. GhostNet: More features from cheap operations[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2020: 1577-1586.
    [15] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2016: 770-778.
    [16] HINTON G, VINYALS O, DEAN J. Distilling the knowledge in a neural network[EB/OL]. (2015-05-09)[2022-11-01].
    [17] LIU Y F, SHU C Y, WANG J D, et al. Structured knowledge distillation for dense prediction[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(6): 7035-7049. doi: 10.1109/TPAMI.2020.3001940
    [18] YANG Z D, LI Z, JIANG X H, et al. Focal and global knowledge distillation for detectors[EB/OL]. (2022-05-09)[2022-11-01].
    [19] MOBAHI H, FARAJTABAR M, BARTLETT P L. Self-distillation amplifies regularization in Hilbert space[EB/OL]. (2020-10-26) [2022-11-01].
    [20] ZHANG Z L, SABUNCU M R. Self-distillation as instance-specific label smoothing[EB/OL]. (2020-10-22)[2022-11-01].
    [21] YUN S, PARK J, LEE K, et al. Regularizing class-wise predictions via self-knowledge distillation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2020: 13873-13882.
    [22] LIN J, GAN C, HAN S. TSM: Temporal shift module for efficient video understanding[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE Press, 2019: 7082-7092.
    [23] KING D E. Dlib-ml: A machine learning toolkit[J]. Journal of Machine Learning Research, 2009, 10: 1755-1758.
    [24] SANDLER M, HOWARD A, ZHU M L, et al. MobileNetV2: Inverted residuals and linear bottlenecks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2018: 4510-4520.
    [25] MA N N, ZHANG X Y, ZHENG H T, et al. ShuffleNet V2: Practical guidelines for efficient CNN architecture design[C]//Proceedings of the European Conference on Computer Vision. Berlin: Springer, 2018: 122-138.
    [26] STAFYLAKIS T, KHAN M H, TZIMIROPOULOS G. Pushing the boundaries of audiovisual word recognition using Residual Networks and LSTMs[J]. Computer Vision and Image Understanding, 2018, 176-177: 22-32. doi: 10.1016/j.cviu.2018.10.003
    [27] PETRIDIS S, STAFYLAKIS T, MA P C, et al. Audio-visual speech recognition with a hybrid CTC/attention architecture[C]//Proceedings of the IEEE Spoken Language Technology Workshop. Piscataway: IEEE Press, 2018: 513-520.
    [28] ZHANG Y H, YANG S, XIAO J Y, et al. Can we read speech beyond the lips? Rethinking RoI selection for deep visual speech recognition[C]//Proceedings of the 15th IEEE International Conference on Automatic Face and Gesture Recognition. Piscataway: IEEE Press, 2020: 356-363.
    [29] ZHAO X, YANG S, SHAN S G, et al. Mutual information maximization for effective lip reading[C]//Proceedings of the 15th IEEE International Conference on Automatic Face and Gesture Recognition. Piscataway: IEEE Press, 2020: 420-427.
    [30] CHUNG J S, ZISSERMAN A. Lip reading in the wild[C]//Proceedings of the European Conference on Computer Vision. Berlin: Springer, 2017: 87-103.
    [31] CHUNG J S, SENIOR A, VINYALS O, et al. Lip reading sentences in the wild[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2017: 3444-3453.
    [32] WANG C H. Multi-grained spatio-temporal modeling for lip-reading[EB/OL]. (2019-09-02)[2022-11-01].
    [33] WENG X S, KITANI K. Learning spatio-temporal features with two-stream deep 3D CNNs for lipreading[EB/OL]. (2019-07-19)[2022-11-01].
    [34] XU B, LU C, GUO Y D, et al. Discriminative multi-modality speech recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2020: 14421-14430.
    [35] WIRIYATHAMMABHUM P. SpotFast networks with memory augmented lateral transformers for lipreading[C]//Proceedings of the International Conference on Neural Information Processing. Berlin: Springer, 2020: 554-561.
    [36] KIM M, HONG J, PARK S J, et al. Multi-modality associative bridging through memory: speech sound recollected from face video[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE Press, 2021: 296-306.
    [37] PAN X C, CHEN P Y, GONG Y C, et al. Leveraging unimodal self-supervised learning for multimodal audio-visual speech recognition[EB/OL]. (2022-05-26)[2022-11-01].
Figures (9) / Tables (6)
Publication history
  • Received: 2022-11-17
  • Accepted: 2023-02-17
  • Published online: 2023-03-10
  • Issue date: 2024-12-31
