Monaural singing voice separation based on high-resolution network

ZHANG Yang; NIU Zhixian; NIU Baoning; CHANG Yan

doi:10.13700/j.bh.1001-5965.2019.0491

Volume 46 Issue 8

Aug. 2020

ZHANG Yang, NIU Zhixian, NIU Baoning, et al. Monaural singing voice separation based on high-resolution network[J]. Journal of Beijing University of Aeronautics and Astronautics, 2020, 46(8): 1555-1563. doi: 10.13700/j.bh.1001-5965.2019.0491 (in Chinese)

1. College of Information and Computer, Taiyuan University of Technology, Jinzhong 030600, China
2. Institute of Software, Chinese Academy of Sciences, Beijing 100190, China

Funds:

National Key R & D Program of China 2017YFB1401001-01

National Natural Science Foundation of China 61572345

Corresponding author: NIU Zhixian, E-mail: niuniurose63@163.com
Received Date: 09 Sep 2019
Accepted Date: 13 Dec 2019
Publish Date: 20 Aug 2020

Abstract

Monaural singing voice separation splits a single-channel song into its singing voice and accompaniment, and supports applications such as melody extraction, lyrics recognition, and karaoke. To address the limited accuracy of predicted spectrograms, this paper proposes a monaural singing voice separation algorithm based on a high-resolution neural network, whose parallel structure and thorough feature interaction improve model performance. First, a high-resolution network suited to singing voice separation is designed and constructed. Then, the spectrogram of the original song is fed into the network to obtain the predicted spectrograms of the accompaniment and the singing voice. Finally, the time-domain signals are reconstructed by combining the phase of the original song with the separated spectrograms. Experiments on the MIR-1K dataset show that the proposed algorithm outperforms the state-of-the-art algorithm on the SNR, SIR, and SAR indicators and improves the quality of the separated accompaniment and singing voice.
Keywords:
- monaural singing voice separation
- deep learning
- spectrogram
- high-resolution network
- frequency-domain model
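
The three steps in the abstract (spectrogram in, predicted spectrograms out, waveforms rebuilt with the phase of the original song) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the frame settings `N_FFT` and `HOP`, the toy sinusoidal signals, and the `oracle_estimator` function are all assumptions made for illustration, and the oracle ratio mask merely stands in for the high-resolution network, whose architecture is not described on this page. The `snr_db` helper is a plain signal-to-noise ratio and may differ from the evaluation protocol used in the paper.

```python
import numpy as np

N_FFT, HOP = 256, 128  # illustrative frame settings, not taken from the paper


def stft(x):
    """Hann-windowed STFT; returns a complex array of shape (frames, N_FFT // 2 + 1)."""
    window = np.hanning(N_FFT + 1)[:-1]  # periodic Hann window
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spec):
    """Inverse STFT by windowed overlap-add, normalized by the summed squared window."""
    window = np.hanning(N_FFT + 1)[:-1]
    frames = np.fft.irfft(spec, n=N_FFT, axis=1)
    out = np.zeros(N_FFT + HOP * (len(frames) - 1))
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame * window
        norm[i * HOP:i * HOP + N_FFT] += window ** 2
    return out / np.maximum(norm, 1e-8)


def separate(mixture, estimate_magnitudes):
    """Predict source magnitudes, then reuse the mixture phase to rebuild waveforms."""
    spec = stft(mixture)
    magnitude, phase = np.abs(spec), np.angle(spec)
    vocal_mag, accomp_mag = estimate_magnitudes(magnitude)
    vocal = istft(vocal_mag * np.exp(1j * phase))
    accomp = istft(accomp_mag * np.exp(1j * phase))
    return vocal, accomp


def snr_db(reference, estimate):
    """Plain signal-to-noise ratio in dB between a reference and its estimate."""
    noise = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + 1e-12))


# Toy "song": two sinusoids standing in for voice and accompaniment.
sr = 8000
t = np.arange(sr) / sr
true_vocal = np.sin(2 * np.pi * 440 * t)
true_accomp = 0.5 * np.sin(2 * np.pi * 2000 * t)
mixture = true_vocal + true_accomp


def oracle_estimator(magnitude):
    """Hypothetical stand-in for the network: a ratio mask built from the known sources."""
    v, a = np.abs(stft(true_vocal)), np.abs(stft(true_accomp))
    mask = v / (v + a + 1e-12)
    return mask * magnitude, (1 - mask) * magnitude


vocal_est, accomp_est = separate(mixture, oracle_estimator)
n = len(vocal_est)  # trailing samples that do not fill a frame are dropped
print(f"vocal SNR: {snr_db(true_vocal[:n], vocal_est):.1f} dB")
```

In a real system, `oracle_estimator` would be replaced by the trained network, which sees only the mixture magnitude. Reusing the mixture phase keeps the model in the frequency domain and avoids estimating phase, at the cost of some reconstruction error where the two sources overlap in time and frequency.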

