Text-to-Speech Synthesis
Published in Michael Filimowicz, Foundations in Sound Design for Embedded Media, 2019
The latest breakthrough in statistical speech synthesis is called WaveNet, developed by Google’s DeepMind team (van den Oord et al. 2016). Instead of working with acoustic features extracted from the speech signal, WaveNet generates speech samples directly at the raw-waveform level. It uses causal convolutions to model the conditional probability of each audio sample given the previously generated samples. The resulting synthesis sounds much more natural than comparable unit selection and ANN synthesis using the same voice (van den Oord et al. 2016). To ensure that the audio has the required language-specific characteristics, the model also needs to be conditioned on linguistic and prosodic features. The main drawback of WaveNet is that it currently requires a great deal of computational power to train models. However, ongoing efforts aim to make the algorithm faster and less resource-intensive (Shen et al. 2018).
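To make the causal-convolution idea concrete, the sketch below stacks a few dilated causal 1-D convolutions in PyTorch and predicts a categorical distribution over the next audio sample. The layer sizes, dilation pattern, and 256-way quantized output are illustrative assumptions for the example, not the published WaveNet configuration.

    # A minimal sketch of dilated causal convolutions over raw audio samples.
    # Layer widths and dilations are assumed, not the published WaveNet setup.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalConvStack(nn.Module):
        def __init__(self, channels=32, layers=4, classes=256):
            super().__init__()
            self.dilations = [2 ** i for i in range(layers)]      # 1, 2, 4, 8
            self.convs = nn.ModuleList(
                nn.Conv1d(channels, channels, kernel_size=2, dilation=d)
                for d in self.dilations
            )
            self.inp = nn.Conv1d(1, channels, kernel_size=1)
            self.out = nn.Conv1d(channels, classes, kernel_size=1)

        def forward(self, x):                  # x: (batch, 1, time), values in [-1, 1]
            h = self.inp(x)
            for conv, d in zip(self.convs, self.dilations):
                # Pad only on the left so each output depends on past samples alone.
                h = F.relu(conv(F.pad(h, (d, 0))))
            return self.out(h)                 # logits over the next quantized sample

    logits = CausalConvStack()(torch.randn(1, 1, 16000))   # one second at 16 kHz

The left-only padding is what makes the convolution causal: the prediction at time t never sees samples later than t, so the model can be sampled one step at a time at synthesis time.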
Speech Synthesis
Published in Sadaoki Furui, Digital Speech Processing, Synthesis, and Recognition, 2018
Speech synthesis is a process that artificially produces speech for various applications, reducing the dependence on a person’s recorded voice. Speech synthesis methods enable a machine to convey instructions or information to the user by ‘speaking.’ Applications include information services over the telephone, such as banking, directory, and reservation services; public announcements, such as those at train stations; reading out manuscripts for collation; reading e-mails, faxes, and web pages over the telephone; voice output in automatic translation systems; and special equipment for handicapped people, such as word processors with reading-out capability, book-reading aids for visually-handicapped people, and speaking aids for vocally-handicapped people.
Speech and Language Interfaces, Applications, and Technologies
Published in Julie A. Jacko, The Human–Computer Interaction Handbook, 2012
Clare-Marie Karat, Jennifer Lai, Osamuyimen Stewart, Nicole Yankelovich
Two types of speech synthesis are commercially available today: concatenated synthesis and formant synthesis, the latter being the most prevalent. Concatenated synthesis uses computers to assemble recorded voice sounds into speech output. It sounds fairly natural but can be prohibitively expensive for many applications, as it requires large disk storage space for the units of recorded speech and significant computational power to assemble the speech units on demand. Concatenated synthesizers rely on databases of diphones and demisyllables to create natural-sounding synthesized speech. Diphones are the transitions between phonemes. Demisyllables are the half-syllables recorded from the beginning of a sound to the center point, or from the center point to the end of a sound (Weinschenk and Barker 2000). After the voice units are recorded, the database of units is coded for changes in frequency, pitch, and prosody (intonation and duration). The coding process enables the database of voice units to be as efficient as possible.
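As a rough illustration of the assembly step, the sketch below joins pre-recorded unit waveforms with a short linear crossfade. The unit database, sample rate, and 10 ms crossfade length are hypothetical placeholders for the example rather than details from the chapter.

    # A minimal sketch of joining recorded units (e.g. diphones) with a crossfade.
    # The unit lengths, sample rate, and crossfade duration are assumptions.
    import numpy as np

    SAMPLE_RATE = 16000
    CROSSFADE = int(0.010 * SAMPLE_RATE)          # 10 ms overlap between units

    def concatenate_units(units):
        """Join a list of 1-D float waveforms with linear crossfades."""
        out = units[0].astype(np.float64)
        fade = np.linspace(0.0, 1.0, CROSSFADE)
        for unit in units[1:]:
            unit = unit.astype(np.float64)
            head, tail = out[:-CROSSFADE], out[-CROSSFADE:]
            blended = tail * (1.0 - fade) + unit[:CROSSFADE] * fade
            out = np.concatenate([head, blended, unit[CROSSFADE:]])
        return out

    # Stand-ins for diphone units looked up from a hypothetical database.
    units = [np.random.randn(3200), np.random.randn(2400), np.random.randn(2800)]
    waveform = concatenate_units(units)

In a real concatenative system the units would come from the coded database described above, and the joins would additionally be smoothed in pitch and duration to match the target prosody.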
Brain Computer Interface for Speech Synthesis Based on Multilayer Differential Neural Networks
Published in Cybernetics and Systems, 2022
Dusthon Llorente, Mariana Ballesteros, David Cruz-Ortiz, Ivan Salgado, Isaac Chairez
Speech synthesis (SS) aims to generate natural and intelligible sounds, generally from an arbitrary text (Lorenzo-Trueba et al. 2018). Several interfaces already solve the SS task, many of them focused on assisting medical conditions (Gonzalez and Green 2018). A complete SS interface has three main stages: the first extracts the relevant user information through a sensor measuring bio-signals from the patient; the data are then processed to find patterns linking the measured bio-signals, or the intention to speak, to a database containing elements of communication; and the third stage is the synthesis of the artificial voice. A useful working model for producing SS relies on the acquisition of electroencephalography (EEG) signals (Haider et al. 2019). Recent studies explore the areas of the brain related to language production, showing how a patient responds to visual and auditory stimuli with the objective of producing basic SS (Hernández et al. 2019). Based on the literature, it is feasible to perform SS from imagined speech (through EEG acquisition of the intention to speak), as in the works of Boloukian and Safi-Esfahani (2020) and García-Salinas et al. (2019), which implement brain-computer interfaces (BCIs) to classify the word a patient is thinking without additional sounds or facial movements. Unfortunately, these BCIs reached only about 60% accuracy.
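The three-stage structure described above can be organized roughly as in the following skeleton. The feature extraction, the linear classifier, the channel count, and the phrase vocabulary are all hypothetical placeholders for illustration, not the interface proposed in the paper.

    # A schematic sketch of the three SS stages described above; the features,
    # classifier, and phrase database are hypothetical stand-ins.
    import numpy as np

    PHRASES = {0: "yes", 1: "no", 2: "water", 3: "help"}       # assumed vocabulary

    def extract_features(eeg):
        """Stage 1: reduce raw EEG channels to a simple spectral feature vector."""
        spectrum = np.abs(np.fft.rfft(eeg, axis=1))
        return spectrum.mean(axis=1)                           # one feature per channel

    def classify_intention(features, weights):
        """Stage 2: map features to one of the known words (linear scorer as a stand-in)."""
        scores = weights @ features
        return int(np.argmax(scores))

    def synthesize(word_id):
        """Stage 3: hand the selected phrase to a text-to-speech back end (stubbed here)."""
        print("speak:", PHRASES[word_id])

    eeg = np.random.randn(8, 512)                              # 8 channels, 512 samples
    weights = np.random.randn(len(PHRASES), 8)                 # pretend trained weights
    synthesize(classify_intention(extract_features(eeg), weights))

The reported accuracy ceiling applies to the second stage: if the classifier picks the wrong vocabulary entry, the final synthesized phrase is wrong regardless of how good the voice synthesis itself is.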
Acoustic Features Modelling for Statistical Parametric Speech Synthesis: A Review
Published in IETE Technical Review, 2018
Nagaraj Adiga, S. R. M. Prasanna
The decision-tree-based clustering currently used in HMM-based synthesis has deficiencies: it fragments the data and is therefore ineffective at describing complex dependencies among linguistic and acoustic parameters. Deep learning approaches, on the other hand, are more effective than HMM-based systems. Popular deep learning methods, such as the deep belief network (DBN), the deep neural network (DNN), and the long short-term memory (LSTM)-based recurrent neural network (RNN), have given encouraging results, both in combination with HMMs and on their own [23–27]. However, deep learning is prone to over-fitting on small corpora, so a significant amount of data is needed for successful training [28,29]. The field of deep learning is developing rapidly, and many new deep learning algorithms are finding applications in speech synthesis.
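As an illustration of the LSTM-based acoustic models this line of work refers to, the sketch below maps frame-level linguistic feature vectors to acoustic parameters. The feature dimensions, network size, and frame-wise MSE objective are assumptions for the example, not values from the cited systems.

    # A minimal sketch of an LSTM acoustic model for statistical parametric synthesis.
    # Input/output dimensions and the single recurrent layer are assumptions.
    import torch
    import torch.nn as nn

    class LSTMAcousticModel(nn.Module):
        def __init__(self, linguistic_dim=300, hidden=256, acoustic_dim=187):
            super().__init__()
            self.rnn = nn.LSTM(linguistic_dim, hidden, batch_first=True)
            self.proj = nn.Linear(hidden, acoustic_dim)   # e.g. spectral, F0, aperiodicity features

        def forward(self, linguistic_frames):             # (batch, frames, linguistic_dim)
            hidden_states, _ = self.rnn(linguistic_frames)
            return self.proj(hidden_states)               # (batch, frames, acoustic_dim)

    model = LSTMAcousticModel()
    acoustic = model(torch.randn(2, 100, 300))            # two utterances of 100 frames each
    loss = nn.functional.mse_loss(acoustic, torch.randn(2, 100, 187))
    loss.backward()                                       # trained with a frame-wise regression objective

Unlike decision-tree clustering, the recurrent model shares all of the training frames through one set of weights and can carry context across frames, which is why it handles the dependencies between linguistic and acoustic parameters better when enough data is available.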
Design of a speech-enabled 3D marine compass simulation system
Published in Ships and Offshore Structures, 2018
Bin Fu, Hongxiang Ren, Jingjing Liu, Xiaoxi Zhang
Based on the principles described above, the Microsoft Speech SDK 5.1 was used to develop the speech interactive functions for the 3D marine compass simulation system. The SDK's application layer includes both speech recognition and speech synthesis components. Speech recognition is managed by the speech recognition engine, while the speech synthesis engine is responsible for text-to-speech synthesis. The SDK also provides a speech application programming interface and a device driver interface to implement these speech functions. The system's structure is shown in Figure 12.
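For reference, a text-to-speech call through the SAPI COM interface that SDK 5.1 exposes might look like the following Python sketch (using the pywin32 package); the spoken announcement text, and the use of Python rather than the authors' implementation language, are assumptions for illustration only.

    # A minimal sketch of invoking the SAPI 5.1 speech synthesis engine through COM
    # from Python via pywin32; the announcement text is a hypothetical example.
    import win32com.client

    voice = win32com.client.Dispatch("SAPI.SpVoice")   # default speech synthesis engine
    voice.Rate = 0                                      # default speaking rate
    voice.Speak("Current heading zero four five degrees.")

In the simulation system itself, such calls would be issued by the application layer whenever the compass state needs to be announced, while the recognition engine handles the user's spoken commands in the opposite direction.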