Message Synthesis from Stored Human Speech Components
Published in John Holmes, Wendy Holmes, Speech Synthesis and Recognition, 2002
The PSOLA technique can be used to modify pitch and timing directly in the waveform domain, without needing any explicit parametric analysis of the speech. The position of every instance of glottal closure (i.e. pitch pulse) is first marked on the speech waveform. These pitch markers can be used to generate a windowed segment of waveform for every pitch period. For each period, the window should be centred on the region of maximum amplitude, and the shape of the window function should be such that it is smoothly tapered to either side of the centre. A variety of different window functions have been used, but the Hanning window (shown in Figure 5.3) is a popular choice. The window length is set to be longer than a single period’s duration, so that there will always be some overlap between adjacent windowed signals. The OLA procedure can then be used to join together a sequence of windowed signals, where each one is centred on a pitch marker and is regarded as characterizing a single pitch period. By adding the sequence of windowed waveform segments in the relative positions given by the analysed pitch markers, the original signal can be reconstructed exactly. However, by adjusting the relative positions and number of the pitch markers before resynthesizing, it is possible to alter the pitch and timing, as described below.
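The analysis and reconstruction steps described above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions that are not from the text: a synthetic test signal with a constant pitch period of 80 samples, pitch markers placed by hand at exact multiples of that period, and a Hann window exactly two periods long. The function names `hann`, `extract_segments` and `overlap_add` are hypothetical, and plain lists are used so that nothing beyond the standard library is needed.

```python
import math

def hann(length):
    """Periodic Hann window; copies shifted by length/2 sum to one."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length)
            for n in range(length)]

def extract_segments(signal, markers, period):
    """Cut one Hann-windowed, two-period segment centred on each marker."""
    window = hann(2 * period)
    segments = []
    for m in markers:
        seg = []
        for n in range(2 * period):
            idx = m - period + n
            # Samples outside the signal are treated as silence.
            sample = signal[idx] if 0 <= idx < len(signal) else 0.0
            seg.append(sample * window[n])
        segments.append(seg)
    return segments

def overlap_add(segments, positions, period, out_len):
    """Add each segment into the output, centred on its position."""
    out = [0.0] * out_len
    for seg, pos in zip(segments, positions):
        for n, value in enumerate(seg):
            idx = pos - period + n
            if 0 <= idx < out_len:
                out[idx] += value
    return out

# Toy "voiced" signal (assumption): a decaying sinusoid repeated every 80 samples.
period = 80
signal = [math.exp(-(i % period) / 10.0) * math.sin(2.0 * math.pi * (i % period) / 16.0)
          for i in range(800)]
markers = list(range(period, 721, period))   # hand-placed pitch markers

segments = extract_segments(signal, markers, period)
rebuilt = overlap_add(segments, markers, period, len(signal))

# Between the first and last marker every sample is covered by two windows
# whose weights sum to one, so the original samples are recovered.
error = max(abs(rebuilt[i] - signal[i]) for i in range(markers[0], markers[-1]))
print("max reconstruction error:", error)
```

With real speech the period varies from one marker to the next, so the window length would normally follow the local marker spacing rather than a fixed value; the fixed period here only keeps the sketch short.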
Key Technologies for Text-to-Speech
Published in Katsuhiko Shirai, Masanobu Abe, Recent Progress in Japanese Speech Synthesis, 2000
Katsuhiko Shirai, Masanobu Abe
pick up the desired phonemic information. The model allows source and filter to be controlled independently. Therefore, the model is suitable for changing the prosodic parameters, especially fundamental frequency, of synthesis units. Based on the model, a formant synthesizer was introduced [Allen-87], followed by an LPC synthesizer [Markel-76]. To improve LPC synthesizer quality, extensive studies examined the use of the residual signal as the source [Sato-84] [Tacked-85]. More recently, however, algorithms based on waveform manipulation have become popular. The algorithm called PSOLA (Pitch Synchronous Overlap and Add) was originally proposed for French [Moulines-90]. The PSOLA algorithm has several advantages. First, PSOLA manipulates waveforms directly and does not perform source-filter separation. Because the source-filter model is an approximation, there is no guarantee that the separation works well if the fundamental frequency is changed. Therefore, even if the residual signal is employed to improve LPC synthesizer quality, the synthesized speech is sometimes rough. Waveform manipulation is free of these problems. Second, PSOLA is performed pitch synchronously. This makes it easy to change fundamental frequency in the time domain. Based on the idea of PSOLA, several algorithms were proposed in Japan [Kawai-93] [Mitome-94] [Arai-94] [Sakamoto-95] [Katae-95], and in fact most commercial Japanese text-to-speech systems employ PSOLA-like algorithms. For the development of TTS systems, another advantage of waveform-based algorithms is that the amount of computation needed is small.
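The time-domain change of fundamental frequency mentioned above can be illustrated with a short Python sketch. It is not the method of any of the cited systems: the function name `psola_pitch_shift`, the constant 80-sample period, the hand-placed pitch markers, the two-period Hann window and the nearest-marker rule for choosing which period to reuse are all assumptions made for illustration. No gain normalization is applied, so the output level changes with the amount of overlap.

```python
import math

def psola_pitch_shift(signal, markers, period, factor):
    """Re-space pitch periods to scale pitch by `factor`, keeping duration.

    factor > 1 raises the pitch (periods are packed closer together);
    factor < 1 lowers it (periods are spread further apart).
    """
    # Two-period Hann window, centred on each analysis marker.
    window = [0.5 - 0.5 * math.cos(math.pi * n / period)
              for n in range(2 * period)]
    new_period = max(1, round(period / factor))
    # Synthesis markers cover the same time span with the new spacing.
    synth = list(range(markers[0], markers[-1] + 1, new_period))
    out = [0.0] * len(signal)
    for s in synth:
        # Reuse the analysis period nearest in time, so periods are
        # duplicated or skipped as needed and the duration is unchanged.
        a = min(markers, key=lambda m: abs(m - s))
        for n in range(2 * period):
            src = a - period + n
            dst = s - period + n
            if 0 <= src < len(signal) and 0 <= dst < len(out):
                out[dst] += signal[src] * window[n]
    return out, synth

# Toy "voiced" signal (assumption): a decaying sinusoid repeated every 80 samples.
period = 80
signal = [math.exp(-(i % period) / 10.0) * math.sin(2.0 * math.pi * (i % period) / 16.0)
          for i in range(800)]
markers = list(range(period, 721, period))   # hand-placed pitch markers

# Raise the pitch by 25%: the 80-sample period becomes 64 samples.
shifted, synth_markers = psola_pitch_shift(signal, markers, period, 1.25)
print("analysis markers:", len(markers), "synthesis markers:", len(synth_markers))
```

Because only additions and multiplications on stored samples are involved, the sketch also reflects why the computational cost of this family of algorithms is low.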
A Particular Character Speech Synthesis System Based on Deep Learning
Published in IETE Technical Review, 2021
Yuan Mei, Deng-pan Ye, Shun-zhi Jiang, Jia-rui Liu
Since the 1980s, new progress has been made in speech synthesis technology, especially with the introduction of the pitch synchronous overlap-add (PSOLA) method (1990), which greatly improved the timbre and naturalness of speech synthesized by time-domain waveform splicing. In the early 1990s, text-to-speech systems based on PSOLA were successfully developed for French, German, English, Japanese and other languages. The naturalness of these systems is higher than that of earlier speech synthesis systems based on the LPC method or formant synthesizers, and a PSOLA-based synthesizer is simple and easy to run in real time, which gives it strong commercial prospects. In implementation, the input text is first divided into speech units, and each unit is matched to a corresponding speech segment in the database. The resulting sequence of segments is then smoothed and joined using PSOLA [8] and other signal processing methods to produce the target speech. Subsequently, speech synthesis systems based on the hidden Markov model (HMM) were developed [9,10]. One example is a TTS system based on GMM-HMM acoustic modeling and decision tree training.