Lightweight lip reading method based on decoupled homogeneous self-knowledge distillation
-
Abstract: Lip reading models are difficult to deploy on mobile terminals and edge devices because of their large parameter counts and heavy computation. To address this problem, we propose a lip reading method based on decoupled homogeneous self-knowledge distillation and GhostNet-TSM. First, the TSM module is inserted into GhostNet to give the network temporal feature extraction capability. Second, the features in homogeneous self-knowledge distillation are decoupled into target-class and non-target-class features, with a separate loss function set for each, which improves the recognition accuracy of the model. Finally, models are trained with decoupled homogeneous self-knowledge distillation on the LRW and LIP350 datasets and validated on the OuluVS dataset. Experimental results show that GhostNet-TSM reaches 85.2% recognition accuracy on the LRW dataset, exceeding most non-lightweight models, while its floating-point operations and parameter count are reduced to 0.988 GFLOPs and 20.310×10⁶, respectively.
-
Keywords:
- lip reading
- knowledge distillation
- lightweight
- GhostNet
- TSM module
-
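The decoupling at the core of the method splits the distilled probability distribution into a target-class (T) term and a non-target-class (N) term, each with its own loss weight; Table 1 below ablates exactly these two terms. The paper's own loss code is not reproduced in this excerpt, so the following is a minimal PyTorch sketch under two assumptions: that the decoupling follows the standard target/non-target factorization of the KL divergence, and that, per the homogeneous (class-wise) self-distillation of [21], the "teacher" logits are the model's own predictions on another sample of the same class. The function name, the temperature `tau`, and the default `alpha`/`beta` (the best LRW setting from Table 4) are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def decoupled_self_kd_loss(student_logits, teacher_logits, labels,
                           alpha=0.6, beta=4.0, tau=4.0):
    """Hypothetical sketch of a decoupled distillation loss: the softened
    distribution is split into a target-class (T) part and a non-target-class
    (N) part, weighted by alpha and beta respectively. In homogeneous self-KD
    the teacher logits would come from the model's own output on another
    sample of the same class [21]."""
    teacher_logits = teacher_logits.detach()       # no gradient to the "teacher"
    p_s = F.softmax(student_logits / tau, dim=1)
    p_t = F.softmax(teacher_logits / tau, dim=1)
    mask = F.one_hot(labels, num_classes=p_s.size(1)).bool()

    # T term: binary KL over (p_target, 1 - p_target).
    pt_s, pt_t = p_s[mask], p_t[mask]              # shape (batch,)
    b_s = torch.stack([pt_s, 1 - pt_s], dim=1).clamp_min(1e-8)
    b_t = torch.stack([pt_t, 1 - pt_t], dim=1).clamp_min(1e-8)
    loss_t = (b_t * (b_t.log() - b_s.log())).sum(1).mean()

    # N term: KL over the non-target classes, renormalized to sum to 1.
    n_s = (p_s.masked_fill(mask, 0) / (1 - pt_s).unsqueeze(1)).clamp_min(1e-8)
    n_t = (p_t.masked_fill(mask, 0) / (1 - pt_t).unsqueeze(1)).clamp_min(1e-8)
    loss_n = (n_t * (n_t.log() - n_s.log())).sum(1).mean()

    return (alpha * loss_t + beta * loss_n) * tau ** 2  # usual T^2 rescaling
```

Under this sketch, setting `beta=0` would correspond to the "T only" rows of Table 1 and `alpha=0` to the "N only" rows.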
Table 1. Influence of T and N on recognition performance

| Network | Distillation or regularization method | T | N | Accuracy/% | Change/% |
|---|---|---|---|---|---|
| GhostNet-TSM | No distillation | | | 79.3 | |
| | Homogeneous self-knowledge distillation | √ | √ | 80.2 | +0.9 |
| | Homogeneous self-knowledge distillation (T only) | √ | | 77.6 | −1.7 |
| | Homogeneous self-knowledge distillation (N only) | | √ | 80.7 | +1.4 |
| MobileNet-TSM | No distillation | | | 75.3 | |
| | Homogeneous self-knowledge distillation | √ | √ | 76.1 | +0.8 |
| | Homogeneous self-knowledge distillation (T only) | √ | | 74.4 | −0.9 |
| | Homogeneous self-knowledge distillation (N only) | | √ | 76.3 | +1.0 |
| ResNet101-TSM | No distillation | | | 83.6 | |
| | Homogeneous self-knowledge distillation | √ | √ | 84.5 | +0.9 |
| | Homogeneous self-knowledge distillation (T only) | √ | | 83.0 | −0.6 |
| | Homogeneous self-knowledge distillation (N only) | | √ | 84.4 | +0.8 |
Table 2. Comparison of different networks combined with the TSM module
| Network | Backbone | Type | Accuracy/% | Parameters | FLOPs/GFLOPs |
|---|---|---|---|---|---|
| ResNet18-TSM | ResNet18 | Non-lightweight | 74.9 | 11.356×10⁶ | 7.320 |
| ResNet34-TSM | ResNet34 | Non-lightweight | 79.8 | 21.464×10⁶ | 14.995 |
| ResNet101-TSM | ResNet101 | Non-lightweight | 83.6 | 43.217×10⁶ | 33.180 |
| DenseNet121-TSM | DenseNet121 | Non-lightweight | 77.6 | 51.182×10⁶ | 37.976 |
| MobileNet-TSM | MobileNetV2 | Lightweight | 75.3 | 3.672×10⁶ | 1.300 |
| ShuffleNet v2-TSM | ShuffleNet V2 | Lightweight | 71.1 | 4.823×10⁶ | 1.492 |
| GhostNet-TSM (ours) | GhostNet | Lightweight | 79.3 | 4.350×10⁶ | 0.678266 |
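For context on these TSM rows: the temporal shift module [22] adds temporal modeling to a 2D backbone at zero extra parameter cost by shifting a slice of the channels one step along the time axis. Below is a minimal sketch of the standard shift operation from [22]; where exactly the paper inserts it inside GhostNet's bottlenecks is a design detail not shown in this excerpt.

```python
import torch

def temporal_shift(x, fold_div=8):
    """Standard TSM shift [22]: move 1/fold_div of the channels one step
    toward the past, another 1/fold_div one step toward the future, and
    keep the rest in place. Input x has shape (N, T, C, H, W)."""
    n, t, c, h, w = x.size()
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                  # shift left in time
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]  # shift right in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # unshifted channels
    return out

# Example: a 16-frame clip with 960 channels and 7x7 feature maps.
y = temporal_shift(torch.randn(1, 16, 960, 7, 7))
```

Because the shift itself has no weights, the parameter counts in Table 2 are essentially those of the underlying backbones (in PyTorch, `sum(p.numel() for p in model.parameters())`).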
Table 3. Performance comparison with other distillation methods
| Distillation or regularization method | Accuracy/% |
|---|---|
| No distillation | 79.3 |
| Label smoothing | 79.5 |
| Vanilla knowledge distillation | 77.6 |
| Self-knowledge distillation | 79.7 |
| Homogeneous self-knowledge distillation | 80.2 |
| Decoupled homogeneous self-knowledge distillation (N only) | 80.7 |
Table 4. Comparison of decoupling coefficients
| Coefficients of T and N | Accuracy/% (LIP350) | Accuracy/% (LRW) |
|---|---|---|
| $\alpha$ = 0, $\beta$ = 0.8 | 80.4 | 82.0 |
| $\alpha$ = 0, $\beta$ = 1 | 80.7 | 82.5 |
| $\alpha$ = 0, $\beta$ = 2 | 80.9 | 82.1 |
| $\alpha$ = 0, $\beta$ = 4 | 81.5 | 83.6 |
| $\alpha$ = 0, $\beta$ = 5 | 81.3 | 82.6 |
| $\alpha$ = 0, $\beta$ = 6 | 81.1 | 83.2 |
| $\alpha$ = 0, $\beta$ = 7 | 79.5 | 81.4 |
| $\alpha$ = 0, $\beta$ = 8 | 77.1 | 80.8 |
| $\alpha$ = 0, $\beta$ = 4 | 81.5 | 83.6 |
| $\alpha$ = 0.2, $\beta$ = 4 | 82.1 | 83.9 |
| $\alpha$ = 0.3, $\beta$ = 4 | 82.5 | 83.3 |
| $\alpha$ = 0.4, $\beta$ = 4 | 83.3 | 84.5 |
| $\alpha$ = 0.5, $\beta$ = 4 | 82.9 | 84.6 |
| $\alpha$ = 0.6, $\beta$ = 4 | 82.8 | 85.2 |
| $\alpha$ = 0.8, $\beta$ = 4 | 81.3 | 84.8 |
| $\alpha$ = 1, $\beta$ = 4 | 80.8 | 83.5 |
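Table 4 reads as a two-stage search, with the repeated $\alpha$ = 0, $\beta$ = 4 row marking the boundary: the first eight rows sweep $\beta$ with the target-class term switched off ($\alpha$ = 0), and the remaining rows fix the best $\beta$ = 4 and sweep $\alpha$. As a purely illustrative sketch, the selection can be driven programmatically from the LRW column of the table:

```python
# Two-stage coefficient search mirroring Table 4, using the LRW accuracies
# reported there. Stage 1 sweeps beta with the target-class term disabled
# (alpha = 0); stage 2 fixes the best beta and sweeps alpha.
beta_sweep = {0.8: 82.0, 1: 82.5, 2: 82.1, 4: 83.6,
              5: 82.6, 6: 83.2, 7: 81.4, 8: 80.8}
best_beta = max(beta_sweep, key=beta_sweep.get)      # -> 4

alpha_sweep = {0: 83.6, 0.2: 83.9, 0.3: 83.3, 0.4: 84.5,
               0.5: 84.6, 0.6: 85.2, 0.8: 84.8, 1: 83.5}
best_alpha = max(alpha_sweep, key=alpha_sweep.get)   # -> 0.6

print(best_alpha, best_beta)  # 0.6 4, giving 85.2% on LRW
```

Note that the optimum differs per dataset: LRW peaks at $\alpha$ = 0.6 (85.2%), while LIP350 peaks at $\alpha$ = 0.4 (83.3%).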
Table 5. Validation results on the OuluVS dataset
| Speaker | Distillation method | Accuracy/% |
|---|---|---|
| Speaker 7 | No distillation | 91.2 |
| | Homogeneous self-knowledge distillation | 91.8 |
| | Decoupled homogeneous self-knowledge distillation | 94.6 |
| Speaker 9 | No distillation | 81.0 |
| | Homogeneous self-knowledge distillation | 82.3 |
| | Decoupled homogeneous self-knowledge distillation | 86.4 |
| Speaker 14 | No distillation | 87.5 |
| | Homogeneous self-knowledge distillation | 88.3 |
| | Decoupled homogeneous self-knowledge distillation | 89.8 |
Table 6. Performance comparison with other lip reading methods
| Method | Year | Front-end | Back-end | Parameters | FLOPs/GFLOPs | Accuracy/% |
|---|---|---|---|---|---|---|
| Ref. [30] | 2017 | VGG-M | | | | 61.1 |
| Ref. [31] | 2017 | VGG-M | LSTM | 419.256×10⁶ | 32.744 | 76.2 |
| Ref. [32] | 2019 | ResNet34 + DenseNet3D | Conv-BLSTM | 331.603×10⁶ | 29.386 | 83.3 |
| Ref. [10] | 2017 | 3D-Conv + ResNet34 | BLSTM | 93.168×10⁶ | 11.539 | 83.0 |
| Ref. [27] | 2018 | 3D-Conv + ResNet34 | BLSTM | 132.581×10⁶ | 13.681 | 83.4 |
| Ref. [33] | 2019 | (3D-Conv)×2 | BLSTM | 976.867×10⁶ | 47.806 | 84.1 |
| Ref. [26] | 2018 | 3D-Conv + ResNet18 | BLSTM | 55.378×10⁶ | 9.332 | 84.3 |
| Ref. [29] | 2020 | 3D-Conv + ResNet18 | BGRU | 57.644×10⁶ | 10.055 | 84.4 |
| Ref. [28] | 2020 | 3D-Conv + ResNet18 | BGRU | 59.520×10⁶ | 10.598 | 85.0 |
| Ref. [34] | 2020 | 3D-Conv + P3D-ResNet | TCN | | | 84.8 |
| Ref. [35] | 2020 | SpotFast | Transformer + Product-Key memory | | | 84.4 |
| Ref. [13] | 2020 | 3D-Conv + ResNet18 | MS-TCN | 52.872×10⁶ | 9.120 | 85.3 |
| Ref. [36] | 2021 | 3D-Conv + ResNet18 | BGRU + Visual-Audio Memory | | | 85.4 |
| Ref. [37] | 2022 | MoCo + Wav2Vec (SJTU LUMIA) | MoCo + Wav2Vec (SJTU LUMIA) | | | 85.0 |
| Ours | | GhostNet + TSM + decoupled homogeneous self-knowledge distillation | GhostNet + TSM + decoupled homogeneous self-knowledge distillation | 20.310×10⁶ | 0.988 | 85.2 |

-
[1] LESANI F S, GHAZVINI F F, DIANAT R. Mobile phone security using automatic lip reading[C]//Proceedings of the 9th International Conference on e-Commerce in Developing Countries: With Focus on e-Business. Piscataway: IEEE Press, 2015.
[2] MATHULAPRANGSAN S, WANG C Y, KUSUM A Z, et al. A survey of visual lip reading and lip-password verification[C]//Proceedings of the International Conference on Orange Technologies. Piscataway: IEEE Press, 2015: 22-25.
[3] BAHDANAU D, CHOROWSKI J, SERDYUK D, et al. End-to-end attention-based large vocabulary speech recognition[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2016: 4945-4949.
[4] HUANG J T, LI J Y, GONG Y F. An analysis of convolutional neural networks for speech recognition[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2015: 4989-4993.
[5] CHAE H, KANG C M, KIM B, et al. Autonomous braking system via deep reinforcement learning[C]//Proceedings of the IEEE 20th International Conference on Intelligent Transportation Systems. Piscataway: IEEE Press, 2017: 1-6.
[6] HONG X P, YAO H X, WAN Y Q, et al. A PCA based visual DCT feature extraction method for lip-reading[C]//Proceedings of the International Conference on Intelligent Information Hiding and Multimedia. Piscataway: IEEE Press, 2006: 321-326.
[7] PUVIARASAN N, PALANIVEL S. Lip reading of hearing impaired persons using HMM[J]. Expert Systems with Applications, 2011, 38(4): 4477-4481.
[8] MA J L, ZHU Y B, MA Z P, et al. Survey of deep learning methods for lip recognition[J]. Computer Engineering and Applications, 2021, 57(24): 61-73 (in Chinese).
[9] MA J L, CHEN D G, GUO B B, et al. Lip corpus review[J]. Computer Engineering and Applications, 2019, 55(22): 1-13 (in Chinese).
[10] STAFYLAKIS T, TZIMIROPOULOS G. Combining residual networks with LSTMs for lipreading[C]//Proceedings of the 18th Annual Conference of the International Speech Communication Association. Toulouse: ISCA-INT Speech Communication Association, 2017: 3652-3656.
[11] ASSAEL Y M, SHILLINGFORD B, WHITESON S, et al. LipNet: end-to-end sentence-level lipreading[EB/OL]. (2016-12-16)[2022-11-01].
[12] YANG H, YUAN C F, ZHANG L, et al. STA-CNN: convolutional spatial-temporal attention learning for action recognition[J]. IEEE Transactions on Image Processing, 2020, 29: 5783-5793. doi: 10.1109/TIP.2020.2984904
[13] MARTINEZ B, MA P C, PETRIDIS S, et al. Lipreading using temporal convolutional networks[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE Press, 2020: 6319-6323.
[14] HAN K, WANG Y H, TIAN Q, et al. GhostNet: more features from cheap operations[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2020: 1577-1586.
[15] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2016: 770-778.
[16] HINTON G, VINYALS O, DEAN J. Distilling the knowledge in a neural network[EB/OL]. (2015-05-09)[2022-11-01].
[17] LIU Y F, SHU C Y, WANG J D, et al. Structured knowledge distillation for dense prediction[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(6): 7035-7049. doi: 10.1109/TPAMI.2020.3001940
[18] YANG Z D, LI Z, JIANG X H, et al. Focal and global knowledge distillation for detectors[EB/OL]. (2022-05-09)[2022-11-01].
[19] MOBAHI H, FARAJTABAR M, BARTLETT P L. Self-distillation amplifies regularization in Hilbert space[EB/OL]. (2020-10-26)[2022-11-01].
[20] ZHANG Z L, SABUNCU M R. Self-distillation as instance-specific label smoothing[EB/OL]. (2020-10-22)[2022-11-01].
[21] YUN S, PARK J, LEE K, et al. Regularizing class-wise predictions via self-knowledge distillation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2020: 13873-13882.
[22] LIN J, GAN C, HAN S. TSM: temporal shift module for efficient video understanding[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE Press, 2019: 7082-7092.
[23] KING D E. Dlib-ml: a machine learning toolkit[J]. Journal of Machine Learning Research, 2009, 10: 1755-1758.
[24] SANDLER M, HOWARD A, ZHU M L, et al. MobileNetV2: inverted residuals and linear bottlenecks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2018: 4510-4520.
[25] MA N N, ZHANG X Y, ZHENG H T, et al. ShuffleNet V2: practical guidelines for efficient CNN architecture design[C]//Proceedings of the European Conference on Computer Vision. Berlin: Springer, 2018: 122-138.
[26] STAFYLAKIS T, KHAN M H, TZIMIROPOULOS G. Pushing the boundaries of audiovisual word recognition using residual networks and LSTMs[J]. Computer Vision and Image Understanding, 2018, 176-177: 22-32. doi: 10.1016/j.cviu.2018.10.003
[27] PETRIDIS S, STAFYLAKIS T, MA P C, et al. Audio-visual speech recognition with a hybrid CTC/attention architecture[C]//Proceedings of the IEEE Spoken Language Technology Workshop. Piscataway: IEEE Press, 2018: 513-520.
[28] ZHANG Y H, YANG S, XIAO J Y, et al. Can we read speech beyond the lips? Rethinking RoI selection for deep visual speech recognition[C]//Proceedings of the 15th IEEE International Conference on Automatic Face and Gesture Recognition. Piscataway: IEEE Press, 2020: 356-363.
[29] ZHAO X, YANG S, SHAN S G, et al. Mutual information maximization for effective lip reading[C]//Proceedings of the 15th IEEE International Conference on Automatic Face and Gesture Recognition. Piscataway: IEEE Press, 2020: 420-427.
[30] CHUNG J S, ZISSERMAN A. Lip reading in the wild[C]//Proceedings of the European Conference on Computer Vision. Berlin: Springer, 2017: 87-103.
[31] CHUNG J S, SENIOR A, VINYALS O, et al. Lip reading sentences in the wild[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2017: 3444-3453.
[32] WANG C H. Multi-grained spatio-temporal modeling for lip-reading[EB/OL]. (2019-09-02)[2022-11-01].
[33] WENG X S, KITANI K. Learning spatio-temporal features with two-stream deep 3D CNNs for lipreading[EB/OL]. (2019-07-19)[2022-11-01].
[34] XU B, LU C, GUO Y D, et al. Discriminative multi-modality speech recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE Press, 2020: 14421-14430.
[35] WIRIYATHAMMABHUM P. SpotFast networks with memory augmented lateral transformers for lipreading[C]//Proceedings of the International Conference on Neural Information Processing. Berlin: Springer, 2020: 554-561.
[36] KIM M, HONG J, PARK S J, et al. Multi-modality associative bridging through memory: speech sound recollected from face video[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE Press, 2021: 296-306.
[37] PAN X C, CHEN P Y, GONG Y C, et al. Leveraging unimodal self-supervised learning for multimodal audio-visual speech recognition[EB/OL]. (2022-05-26)[2022-11-01].