Automatic Speech Recognition for Large Vocabularies
Published in John Holmes, Wendy Holmes, Speech Synthesis and Recognition, 2002
The main advantage of the top-down approach to clustering is that a context-dependent model will be specified for any triphone context, even if that context did not occur in the training data. It is thus possible to build more accurate models for unseen triphones than can be achieved with the simple backing-off strategy, assuming that the questions in the tree are such that contexts are grouped appropriately. Although the tree could be constructed by hand based on phonetic knowledge, this approach does not work very well in practice, as it does not take into account the acoustic similarity of the triphones in the data. It is, however, possible to construct trees automatically by combining the use of phonetic questions with tests of acoustic similarity and a test for sufficient data to represent any new division. This automatic construction provides generalization to unseen contexts while maintaining accuracy and robustness in the acoustic models. A popular version of this effective technique for constructing phonetic decision trees is explained in more detail in the next section.
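To make the splitting criterion concrete, the following Python sketch shows one node split of such a phonetic decision tree: each candidate question is scored by the gain in (single-Gaussian) log-likelihood it yields, subject to a minimum-occupancy check so that both children have enough data. The question set, function names and thresholds are purely illustrative and are not taken from the text above.

```python
import numpy as np

# Hypothetical phonetic questions on the left/right context of a triphone.
QUESTIONS = {
    "L=Vowel": lambda l, r: l in {"a", "e", "i", "o", "u"},
    "R=Nasal": lambda l, r: r in {"m", "n"},
    "L=Stop":  lambda l, r: l in {"p", "t", "k", "b", "d", "g"},
}

def cluster_loglik(frames):
    """Approximate log-likelihood of pooling frames into a single
    diagonal-covariance Gaussian (up to an additive constant)."""
    n, var = len(frames), np.var(frames, axis=0) + 1e-6
    return -0.5 * n * np.sum(np.log(var))

def best_split(states, min_occ=100):
    """states: list of (left_ctx, right_ctx, frames) for one pooled state.
    Returns (question_name, likelihood_gain) for the best admissible split,
    or None if no question passes the sufficient-data test."""
    parent_ll = cluster_loglik(np.concatenate([f for _, _, f in states]))
    best = None
    for name, q in QUESTIONS.items():
        yes = [f for l, r, f in states if q(l, r)]
        no  = [f for l, r, f in states if not q(l, r)]
        if not yes or not no:
            continue
        yes_f, no_f = np.concatenate(yes), np.concatenate(no)
        if len(yes_f) < min_occ or len(no_f) < min_occ:   # enough data on both sides?
            continue
        gain = cluster_loglik(yes_f) + cluster_loglik(no_f) - parent_ll
        if best is None or gain > best[1]:
            best = (name, gain)
    return best
```

Splitting would continue recursively on each child until no question gives a sufficient gain, leaving one tied model per leaf; an unseen triphone is then assigned to a leaf by answering the same questions about its context.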
AI for In-Vehicle Infotainment Systems
Published in Josep Aulinas, Hanky Sjafrie, AI for Cars, 2021
After the acoustic features are extracted, the most likely sequence of phonemes is calculated with the help of an acoustic model learned from training data. One popular acoustic model is the GMM-HMM, in which Gaussian Mixture Models (GMMs) describe the states of a Hidden Markov Model (HMM). In this approach, the GMM models the probability of the acoustic features given a phoneme by using a mixture of several Gaussian distributions. (Gaussian distributions are also known as normal distributions; they are essentially bell-shaped curves, used here to score how likely a given "state", which we can think of as an acoustic snapshot in this case, is.) The HMM then models the probability of one particular phoneme being followed by another one.
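As a rough illustration of this idea, the toy Python sketch below scores one acoustic feature vector against a two-component GMM attached to a single HMM state, and reads off a transition probability between two states. All weights, means, covariances and transition values are invented for the example, not taken from any real system.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy GMM emission model for one HMM state (e.g. one phoneme state).
weights = np.array([0.6, 0.4])
means   = [np.zeros(13), np.ones(13) * 0.5]      # 13-dim MFCC-like feature vectors
covs    = [np.eye(13) * 0.3, np.eye(13) * 0.5]

def gmm_likelihood(x):
    """p(x | state): weighted sum of Gaussian densities."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

# Toy HMM transition probabilities between two phoneme states.
trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])    # trans[i, j] = P(next state j | current state i)

x = np.random.randn(13)           # one acoustic "snapshot"
print(gmm_likelihood(x))          # emission likelihood under this state's GMM
print(trans[0, 1])                # probability of moving from state 0 to state 1
```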
Automated Scoring of Extended Spontaneous Speech
Published in Duanli Yan, André A. Rupp, Peter W. Foltz, Handbook of Automated Scoring, 2020
Klaus Zechner, Anastassia Loukina
Most speech features in SpeechRater depend on which words were spoken by the test taker, so it is essential that the words of a test-taker's utterance are first obtained via an ASR system. Such systems use two statistical models to achieve this task, the acoustic model and the language model. The acoustic model describes the statistical properties of speech sounds ("phones") in a high-dimensional acoustic space and is typically trained on a corpus in which the recorded speech was manually transcribed verbatim. The language model captures the likelihood of word sequences in English and is typically trained on a large corpus of transcribed speech and additionally, in some cases, written language.
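How the two models are combined at recognition time can be summarized by the standard decoding rule (written here in generic notation rather than anything specific to SpeechRater), where O is the observed acoustic feature sequence and W ranges over candidate word sequences:

\[
\hat{W} \;=\; \operatorname*{arg\,max}_{W} P(W \mid O)
       \;=\; \operatorname*{arg\,max}_{W} \underbrace{P(O \mid W)}_{\text{acoustic model}} \, \underbrace{P(W)}_{\text{language model}}
\]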
Unsupervised lexical acquisition of relative spatial concepts using spoken user utterances
Published in Advanced Robotics, 2022
Rikunari Sagara, Ryo Taguchi, Akira Taniguchi, Tadahiro Taniguchi, Koosuke Hattori, Masahiro Hoguro, Taizo Umezaki
We taught a robot four relative spatial concepts shown in Table 4. We used a coordinate system in which the location of a candidate reference object was set as the origin, and the direction from the reference object to the robot was set as the positive direction of the axis. The distance from the origin was set to a value sampled from a normal distribution. The angle between the axis and the line passing through the origin and the relative location was set to a value sampled from a von Mises distribution. The parameters of the distributions are shown in Table 4. Relative locations used as the training data are shown in Figure 7. We used utterances spoken in Japanese by one Japanese male speaker. The word sequences of the utterances were obtained using the parameters listed in Table 5. The 19 utterance patterns shown in Table 6 were used to simulate natural utterances. Examples of the utterances are listed in Table 7. We used Julius 4.5 and the Julius dictation-kit v4.4 [30,31] for speech recognition, and we used the acoustic model included in the Julius dictation-kit v4.4. The language model was not trained using any datasets in advance, and it only contained Japanese syllables with a uniform probability. We used latticelm v0.4 for unsupervised word segmentation. We set parameters as shown in Table 8.
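The sampling scheme for relative locations described above can be sketched in Python as follows; the function name and all parameter values are illustrative and do not correspond to the entries of Tables 4-8.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_relative_location(ref_xy, robot_xy, mu_d, sigma_d, mu_a, kappa):
    """Sample a location relative to a reference object, with the origin at
    the reference object and the positive axis pointing towards the robot.
    Distance ~ Normal(mu_d, sigma_d); angle ~ von Mises(mu_a, kappa)."""
    axis = np.arctan2(robot_xy[1] - ref_xy[1], robot_xy[0] - ref_xy[0])
    d = rng.normal(mu_d, sigma_d)        # distance from the origin
    a = rng.vonmises(mu_a, kappa)        # angle measured from the axis
    theta = axis + a
    return ref_xy + d * np.array([np.cos(theta), np.sin(theta)])

# Example: a hypothetical "front" concept (short distance, angle centred on the axis).
print(sample_relative_location(np.array([0.0, 0.0]),   # reference object
                               np.array([2.0, 0.0]),   # robot position
                               mu_d=0.5, sigma_d=0.1, mu_a=0.0, kappa=8.0))
```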
Spatial concept-based navigation with human speech instructions via probabilistic inference on Bayesian generative model
Published in Advanced Robotics, 2020
Akira Taniguchi, Yoshinobu Hagiwara, Tadahiro Taniguchi, Tetsunari Inamura
Our method estimates an action sequence (and the path on the map) that maximizes the probability distribution representing the trajectory when a human speech instruction is given. The action sequence is identified as the one for which the probability distribution serving as the objective function takes its maximum value. The set of learned global parameters comprises the map m, the set of model parameters Θ representing spatial concepts, the acoustic model AM, and the language model LM. Here, we assume that the self-position at the previous time-step is provided. In practice, either a default position set beforehand or an estimate obtained through a self-localization method such as Monte Carlo localization (MCL) [30] can be used.
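In generic notation (the symbols below are illustrative and not necessarily those used in the paper), the estimation described above amounts to a maximization over action sequences given the speech instruction S, the previous self-position x0 and the learned global parameters:

\[
a_{1:T}^{*} \;=\; \operatorname*{arg\,max}_{a_{1:T}} \; p\!\left(a_{1:T} \mid S,\, x_{0},\, m,\, \Theta,\, \mathrm{AM},\, \mathrm{LM}\right)
\]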
Design of a speech-enabled 3D marine compass simulation system
Published in Ships and Offshore Structures, 2018
Bin Fu, Hongxiang Ren, Jingjing Liu, Xiaoxi Zhang
The acoustic model of speech recognition is based on the HMM time-series model. An HMM is a doubly stochastic process: one of the processes is the hidden state sequence, which describes the dynamic characteristics of the short-time statistical features hidden in the observation sequence; the other is the output observation sequence, which corresponds to the transient characteristics of the observed signal. In HMM-based recognition, the core problem is to find an optimal state sequence Q* = q1*q2*...qT* given the observation sequence O = o1o2...oT and the model λ = (π, A, B), where π represents the initial state probability distribution, A represents the state transition probability matrix and B represents the observation probability matrix. A decoding algorithm is used to determine the best state sequence; the current mainstream algorithm used for speech recognition decoding is the Viterbi algorithm (Kumar, Babu, et al. 2014). The Viterbi algorithm finds the state sequence Q* that maximizes P(Q, O|λ).
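A compact NumPy sketch of the Viterbi recursion for a discrete-observation HMM is given below; the array layout and variable names are illustrative, and real recognizers apply the same idea in the log domain over much larger state spaces.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Find the most likely state sequence Q* for an observation sequence.
    pi : (N,)   initial state probabilities
    A  : (N, N) transition matrix, A[i, j] = P(q_t = j | q_{t-1} = i)
    B  : (N, M) observation matrix, B[j, k] = P(o_t = k | q_t = j)
    obs: (T,)   observation indices o_1 ... o_T
    Returns the best state sequence and its probability P(Q*, O | lambda)."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))            # best path probability ending in each state
    psi = np.zeros((T, N), dtype=int)   # back-pointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A        # scores[i, j]: come from state i into j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Backtrack from the most probable final state.
    q = np.zeros(T, dtype=int)
    q[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        q[t] = psi[t + 1, q[t + 1]]
    return q, delta[-1].max()
```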