CRRC Zhuzhou Institute Co., Ltd., Zhuzhou 412001, Hunan, China
LUO Xiao (1993—), male, M.S., engineer, mainly engaged in speech signal processing; E-mail: shawl336@163.com
Print publication date: 2023-11-10
Received: 2022-03-24
Revised: 2023-05-23
罗潇, 刘悦. 轨道交通车载端到端语音合成[J]. 机车电传动, 2023(6): 122-128. DOI: 10.13890/j.issn.1000-128X.2023.06.015.
LUO Xiao, LIU Yue. An end-to-end text-to-speech system for vehicle-mounted devices[J]. Electric drive for locomotives, 2023(6): 122-128. DOI: 10.13890/j.issn.1000-128X.2023.06.015.
高自然度的语音合成是车载人机交互进入高级智能化的基本要求之一。现阶段的轨道交通领域仍在广泛使用传统的低自然度语音合成算法,这与高速发展的智能化人机交互技术脱节。相比之下,端到端的深度学习语音合成算法凭借其近乎媲美人声的自然度已经成为各领域语音合成的主流算法。文章介绍了一种适用于离线轨道交通车载环境的端到端深度学习语音合成算法,该算法的主观意见评分达到4.18,并且在车载嵌入式硬件平台英伟达Xavier上的实时率达到0.52。试验证明,该算法不仅在自然度等主观性能上远超传统语音合成算法,同时也具备在轨道交通离线车载环境下的工程实用性。
High-naturalness text-to-speech (TTS) is one of the basic requirements for advanced intelligence in vehicle-mounted human-machine interaction. At present, the rail transit field still widely uses traditional low-naturalness TTS algorithms, which lag behind rapidly developing intelligent human-machine interaction technology. In contrast, end-to-end deep learning TTS algorithms, with their nearly human-like naturalness, have become mainstream across application fields. This paper introduces an end-to-end deep learning TTS algorithm suitable for offline rail transit vehicle-mounted environments. The algorithm achieves a mean opinion score of 4.18 and a real-time factor of 0.52 on the vehicle-mounted embedded hardware platform NVIDIA Xavier. Experiments show that it not only far outperforms traditional TTS algorithms in subjective measures such as naturalness, but also offers engineering practicality in the offline vehicle-mounted environment of rail transit.
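The real-time factor of 0.52 quoted above can be read under its usual definition: synthesis time divided by the duration of the generated audio, so values below 1 mean faster-than-real-time synthesis. A minimal sketch of that metric follows; the example numbers are illustrative, not measurements from the Xavier platform.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return RTF = time spent synthesizing / duration of audio produced.

    RTF < 1.0 means the synthesizer runs faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

# Illustrative example: a 10 s utterance synthesized in 5.2 s
# corresponds to an RTF of 0.52, the figure reported in the abstract.
rtf = real_time_factor(5.2, 10.0)
print(f"RTF = {rtf:.2f}")
```

In practice the synthesis time would be measured around the model's inference call (e.g. with a monotonic timer) and the audio duration derived from sample count and sampling rate.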
轨道交通;车载人机交互;智能化;深度学习;端到端;语音合成
rail transit; vehicle-mounted human-machine interaction; intelligence; deep learning; end-to-end; text-to-speech