LUO Xiao, LIU Yue. An end-to-end text-to-speech system for vehicle-mounted devices[J]. Electric drive for locomotives, 2023(6): 122-128. DOI: 10.13890/j.issn.1000-128X.2023.06.015.
An end-to-end text-to-speech system for vehicle-mounted devices
High-naturalness text-to-speech is one of the basic requirements for advanced intelligence in vehicle-mounted human-machine interaction. In the rail transit field, however, traditional low-naturalness text-to-speech algorithms remain in widespread use, out of step with rapidly developing intelligent human-machine interaction technology. In contrast, end-to-end deep learning-based text-to-speech algorithms, with their nearly human-like naturalness, have become dominant across text-to-speech applications. This paper presents an end-to-end deep learning-based text-to-speech algorithm suitable for offline railway vehicle environments. The algorithm achieves a mean opinion score (MOS) of 4.18, and its real-time factor on the vehicle-mounted embedded hardware platform NVIDIA Xavier reaches 0.52. Experiments show that the algorithm not only outperforms traditional text-to-speech algorithms in subjective measures such as naturalness, but also offers engineering practicality in the offline vehicle environment of railway transportation.
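For readers unfamiliar with the two reported metrics: the mean opinion score (MOS) is the average of listener naturalness ratings on the standard 1-to-5 scale, and the real-time factor (RTF) is wall-clock synthesis time divided by the duration of the generated audio, so an RTF of 0.52 means speech is synthesized roughly twice as fast as it plays back. The sketch below illustrates both computations; the function names and sample values are illustrative only and are not taken from the paper.

```python
from statistics import mean

def mean_opinion_score(ratings: list[int]) -> float:
    """MOS: average of listener ratings on the standard 1-5 scale."""
    assert all(1 <= r <= 5 for r in ratings), "ratings must lie in [1, 5]"
    return mean(ratings)

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF: wall-clock synthesis time divided by generated audio duration.
    RTF < 1.0 means synthesis runs faster than real time."""
    return synthesis_seconds / audio_seconds

# Illustrative values only (not measurements from the paper):
print(mean_opinion_score([4, 5, 4, 4, 4]))  # -> 4.2
print(real_time_factor(5.2, 10.0))          # -> 0.52, i.e. ~2x real time
```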