Journal of Beijing University of Aeronautics and Astronautics ›› 2021, Vol. 47 ›› Issue (3): 500-508. doi: 10.13700/j.bh.1001-5965.2020.0475



Many-to-many voice conversion with sentence embedding based on VAACGAN

LI Yanping1, CAO Pan1, SHI Yang1, ZHANG Yan2   

1. College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China;
    2. School of Software Engineering, Jinling Institute of Technology, Nanjing 211169, China
  • Received: 2020-08-31  Published: 2021-04-08
  • Corresponding author: LI Yanping, E-mail: liyp@njupt.edu.cn
  • About the authors: LI Yanping, female, Ph.D., associate professor, research interests: voice conversion and speaker recognition; CAO Pan, female, master's degree, research interest: voice conversion; SHI Yang, male, master's degree, research interest: voice conversion; ZHANG Yan, female, Ph.D., professor, research interests: pattern recognition and domain software engineering.
  • Supported by:
    National Natural Science Foundation of China (61401227, 61872199, 61872424); Special Project of Intelligent Human-Computer Interaction Technology Innovation Team Building of Jinling Institute of Technology (218/010119200113)


Abstract: To address the poor speech quality and unsatisfactory speaker similarity of converted speech under non-parallel training conditions, this paper presents a voice conversion method based on a Variational Autoencoding Auxiliary Classifier Generative Adversarial Network (VAACGAN) fused with sentence embedding, which achieves high-quality many-to-many voice conversion on non-parallel corpora. In the ACGAN, the discriminator contains an auxiliary decoder network that predicts whether a spectral feature is real or fake while also outputting the speaker category of the training data, which makes adversarial training more stable and speeds up convergence. Furthermore, sentence embeddings obtained by training a text encoder are introduced into the model as a semantic content constraint; the semantic information they carry strengthens the ability of the latent variables to represent speech content, alleviates the over-regularization effect of the latent variables, and markedly improves the quality of the converted speech. Experimental results show that, compared with the baseline method, the average MCD of the converted speech is reduced by 6.67%, the average MOS is increased by 8.33%, and the average ABX score is increased by 11.56%, demonstrating that the proposed method significantly outperforms the baseline in both speech quality and speaker similarity and achieves high-quality voice conversion.
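To make the discriminator design described in the abstract concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of an ACGAN-style discriminator with an auxiliary speaker-classifier head: it scores a spectral-feature segment as real or fake and simultaneously predicts which speaker the segment belongs to. The feature dimension, segment length, number of speakers, and layer sizes are illustrative assumptions, not values from the paper.

```python
# Illustrative ACGAN-style discriminator: one shared trunk, two heads
# (real/fake score + auxiliary speaker classifier). Assumptions (not from
# the paper): 36-dimensional spectral frames, 128-frame segments, 4 speakers.
import torch
import torch.nn as nn

class AuxClassifierDiscriminator(nn.Module):
    def __init__(self, feat_dim: int = 36, n_speakers: int = 4):
        super().__init__()
        # Shared convolutional trunk over the time axis.
        self.trunk = nn.Sequential(
            nn.Conv1d(feat_dim, 64, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(64, 128, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1),  # pool over time -> (batch, 128, 1)
        )
        # Head 1: real/fake score for the adversarial task.
        self.adv_head = nn.Linear(128, 1)
        # Head 2: auxiliary speaker classifier; the extra supervision is
        # what stabilizes ACGAN training relative to a plain GAN.
        self.cls_head = nn.Linear(128, n_speakers)

    def forward(self, x: torch.Tensor):
        # x: (batch, feat_dim, frames) spectral-feature segment
        h = self.trunk(x).squeeze(-1)  # (batch, 128)
        return self.adv_head(h), self.cls_head(h)

# Usage: both heads are trained jointly; the classification loss is added
# to the usual adversarial loss on real and generated segments.
if __name__ == "__main__":
    disc = AuxClassifierDiscriminator()
    segment = torch.randn(8, 36, 128)                      # batch of 8 segments
    real_fake_logit, speaker_logits = disc(segment)
    print(real_fake_logit.shape, speaker_logits.shape)     # (8, 1) (8, 4)
```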

Key words: voice conversion, sentence embedding, text encoder, Auxiliary Classifier Generative Adversarial Network (ACGAN), variational autoencoder, non-parallel corpora, many-to-many


