Speech Coding for Wireless Communications
Published in Jerry D. Gibson, Mobile Communications Handbook, 2017
Although important, the MOS values obtained by listening to isolated utterances do not capture the dynamics of conversational voice communications in the various network environments. It is intuitive that speech codecs should be tested within the environment and while executing the tasks for which they are designed. Thus, since we are interested in conversational (two-way) voice communications, a more realistic test would be conducted in this scenario. The perceptual evaluation of speech quality (PESQ) method was developed to provide an assessment of speech codec performance in conversational voice communications. The PESQ has been standardized by the ITU-T as P.862 and can be used to generate MOS values for both narrowband and wideband speech [3]. The narrowband PESQ performs fairly well for the situations for which it has been qualified, and the wideband PESQ MOS, while initially not very accurate, has become more reliable in recent years.
Published in Gillian M. Davis, Noise Reduction in Speech Applications, 2018
In recent years, algorithms have been developed that can predict MOS results, avoiding some of the disadvantages of full-blown MOS testing. To be successful, these algorithms must evaluate the quality of voice signals in much the same way that nonlinear codecs encode and decode audio signals. That is, they evaluate whether a particular voice signal is distorted with regard to what a human listener would find annoying or distracting. Typically, these algorithms compare “clean” test signals (either actual voice signals or special voice-like signals) to more or less distorted versions of the same signal (having passed through some communications system). Using complex weighting methods that take into account what is perceptually important, the physiology of the human ear, and cognitive factors related to what human listeners are likely to notice, these algorithms provide a qualitative score that often maps closely to MOS. Two very important clarity algorithms in use today are:
Perceptual evaluation of speech quality (PESQ) is based on the ITU-T P.862 standard that defines the algorithms used to compare reference speech samples with test samples to measure quality degradation due to distortion. PESQ replaces a previous perceptual quality algorithm called perceptual speech quality measure (PSQM), which was based on P.861.
PAMS is an algorithm developed and licensed by Psytechnics, Inc. that compares speech-like samples to obtain listening effort and listening quality scores [17].
Objective Quality and Intelligibility Measures
Published in Philipos C. Loizou, Speech Enhancement, 2013
Several widely used objective speech quality measures were evaluated: the segmental SNR (segSNR) measure, the weighted spectral slope (WSS) distance [15], the perceptual evaluation of speech quality (PESQ) measure [34], the log likelihood ratio (LLR) measure, the Itakura–Saito (IS) measure, the cepstrum distance (CEP) measure [1], the frequency-weighted segmental SNR [13], and the frequency-variant spectral distance measures [1, p. 241]. Composite measures, combining a subset of the aforesaid measures, as well as modifications to the PESQ and WSS measures, were also evaluated.
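As an illustration of the simplest measure in this list, the following is a minimal sketch of a segmental SNR computation. It assumes non-overlapping, unwindowed frames and clamps each frame's SNR to the range -10 dB to 35 dB, a common convention; the function name, frame length, and clamping limits are assumptions made for this sketch and are not taken from the chapter.

```python
import math


def segmental_snr(clean, processed, frame_len=256, lo_db=-10.0, hi_db=35.0):
    """Mean per-frame SNR in dB, with each frame clamped to [lo_db, hi_db].

    Simplified sketch: frames do not overlap and no window is applied.
    """
    if len(clean) != len(processed):
        raise ValueError("signals must have the same length")
    eps = 1e-12  # guards against log(0) and division by zero
    frame_snrs = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        ref = clean[start:start + frame_len]
        deg = processed[start:start + frame_len]
        signal_energy = sum(x * x for x in ref)
        noise_energy = sum((x - y) ** 2 for x, y in zip(ref, deg))
        snr_db = 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))
        # Clamping keeps silent or perfectly matched frames from dominating the mean.
        frame_snrs.append(min(max(snr_db, lo_db), hi_db))
    if not frame_snrs:
        raise ValueError("signal is shorter than one frame")
    return sum(frame_snrs) / len(frame_snrs)


if __name__ == "__main__":
    tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(1024)]
    attenuated = [0.9 * v for v in tone]
    print(segmental_snr(tone, attenuated))
```

Averaging in the dB domain frame by frame, rather than computing one global SNR, gives quiet segments the same weight as loud ones, which is the main reason the segmental form is used for speech.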
Improving time–frequency sparsity for enhanced audio source separation in degenerate unmixing estimation technique algorithm
Published in Journal of Control and Decision, 2022
Shahin M. Abdulla, J. Jayakumari
The ability to reconstruct signals, robustness to noise, and quality of demixing are evaluated using correlation, Signal to Distortion Ratio (SDR), Signal to Interference Ratio (SIR), and Signal to Artifact Ratio (SAR), which are the numerical performance metrics of the BSS-Evaluation toolbox (Vincent et al., 2006). The estimated source image is decomposed into the true source signal (s_target) and error terms, namely interference (e_interf) created by interfering sources and artefacts (e_artif) caused by ‘burbling’ noise, using multichannel time-invariant filters. SDR assesses the sound quality of the reconstructed signal, SIR measures the amount of interference imposed on the separated sound signal by other sources, and SAR measures the amount of artefacts contained in the separated sound signal. Furthermore, Renyi Entropy (RE), proposed by Stankovic (2001), is used to verify the concentration of the time-frequency (TF) representation. Lower RE indicates better energy concentration in the TF representation (TFR). The W-disjoint orthogonality (WDO) measure illustrated in Jourjine et al. (2000) is used to assess the quality of the generated TF mask for source separation. Finally, the quality of the estimated sources is tested using the Perceptual Evaluation of Speech Quality (PESQ) measure, the objective speech quality measurement algorithm described in ITU-T Recommendation P.862 (Rix et al., 2001), computed between the original and reconstructed sources.
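The three energy ratios can be sketched as follows, assuming the decomposition of the estimated source into the target, interference, and artefact components has already been carried out (the toolbox does this by projection, which is not reproduced here). The function names are hypothetical, and the sensor-noise term that appears in the general definitions is omitted because the excerpt mentions only interference and artefacts.

```python
import math


def energy(signal):
    """Sum of squared samples."""
    return sum(v * v for v in signal)


def add(*signals):
    """Sample-wise sum of equally long signals."""
    return [sum(samples) for samples in zip(*signals)]


def ratio_db(numerator, denominator):
    """Energy ratio in dB; infinite when the denominator has no energy."""
    den = energy(denominator)
    if den == 0.0:
        return math.inf
    return 10.0 * math.log10(energy(numerator) / den)


def bss_metrics(s_target, e_interf, e_artif):
    """Return (SDR, SIR, SAR) in dB from pre-computed components.

    Assumes estimate = s_target + e_interf + e_artif (no noise term).
    """
    sdr = ratio_db(s_target, add(e_interf, e_artif))  # target vs. all errors
    sir = ratio_db(s_target, e_interf)                # target vs. interference
    sar = ratio_db(add(s_target, e_interf), e_artif)  # all but artefacts vs. artefacts
    return sdr, sir, sar


if __name__ == "__main__":
    sdr, sir, sar = bss_metrics([3.0, 4.0], [0.5, 0.0], [0.0, 0.5])
    print(f"SDR={sdr:.2f} dB, SIR={sir:.2f} dB, SAR={sar:.2f} dB")
```

Because SDR places both error terms in the denominator, it can never exceed SIR for the same decomposition, which is why it is usually read as the overall quality figure.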
Degenerate unmixing estimation technique of speech mixtures in real environments using wavelets
Published in International Journal of Electronics Letters, 2020
Shahin M. Abdulla, J. Jayakumari
Perceptual evaluation of speech quality (PESQ), which is a good indicator of speech intelligibility, compares the enhanced signal with the clean signal and usually produces a score between 1.0 and 4.5, with higher values indicating better quality. The above results were compared with the existing system model (Mukae, Ishida, & Murakami, 2014). We investigated different mixtures under several constraints, with the ultimate objective of determining the ideal constraint settings for the implementation. The sampling rate for all mixtures was 16 kHz.
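The score range quoted above can be related to the MOS scale through the logistic mapping defined in ITU-T P.862.1, which converts a raw narrowband PESQ score (nominally -0.5 to 4.5) to MOS-LQO. The sketch below implements only that mapping, not PESQ itself; the function name is hypothetical, and whether the authors applied this mapping is not stated in the excerpt.

```python
import math


def pesq_raw_to_mos_lqo(raw_score):
    """Map a raw narrowband PESQ score to MOS-LQO (ITU-T P.862.1 logistic fit)."""
    return 0.999 + (4.999 - 0.999) / (1.0 + math.exp(-1.4945 * raw_score + 4.6607))


if __name__ == "__main__":
    for raw in (-0.5, 1.0, 2.5, 4.5):
        print(raw, round(pesq_raw_to_mos_lqo(raw), 2))
```

The mapping is monotonic, so it does not change the ranking of systems; it only rescales raw scores so that they are comparable with subjective listening-quality MOS values.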
Optical laser microphone for human-robot interaction: speech recognition in extremely noisy service environments
Published in Advanced Robotics, 2022
Takahiro Fukumori, Chengkai Cai, Yutao Zhang, Lotfi El Hafi, Yoshinobu Hagiwara, Takanobu Nishiura, Tadahiro Taniguchi
The quality of a microphone's sound recording can be quantified by evaluating the speech recognition results. We calculated the following three performance measures: the word error rate (WER), the perceptual evaluation of speech quality (PESQ) [36], and the short-time objective intelligibility (STOI) [37]. The WER was calculated using the following equation:
$$\mathrm{WER} = \frac{I + S + D}{S + D + C} \times 100\ [\%],$$
where I, S, D, and C are the numbers of inserted words, substituted words, deleted words, and correct words, respectively. The PESQ evaluates how much the speech is distorted relative to its clean reference, especially in the telephone band (0.3–3.4 kHz). The STOI measures the intelligibility of a given speech signal on the basis of the correlations between the power spectra of the clean and test speech in each octave band. Table 3 shows the results of WER, PESQ, and STOI for each irradiated object in Experiment 1. Note that the PESQ and STOI of the clean speech cannot be calculated, since their calculation requires both a clean reference and a separate test signal. The WER of the clean speech recorded using the ECM was 21.5%. The tissue box achieved the best WER and PESQ in Table 3. When a plastic bottle was irradiated by the LDV, the WER was 35.9% (i.e. roughly 14 percentage points worse than for clean speech). On the other hand, when the speaker's throat was irradiated by the LDV, the WER degraded significantly, to 99%. Figure 5 shows the spectrograms of a single clean utterance and of speech measured using the nine target objects. We can observe that the result in Figure 5(c), which was obtained with the LDV irradiating the throat, changed drastically from the clean speech result shown in Figure 5(a). Based on these results, we decided to use a plastic bottle, which achieved a moderate result in Table 3, as the target object in subsequent experiments.
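To make the WER definition concrete, the sketch below counts I, S, D, and C with a word-level edit-distance alignment and then applies the conventional formula, in which the error count I + S + D is divided by the number of reference words S + D + C. The function names and the example sentences are hypothetical, and the conventional formula is an assumption about the equation the authors used.

```python
def count_errors(reference, hypothesis):
    """Return (I, S, D, C) for a minimum-edit-distance word alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # Each cell holds (total_errors, I, S, D, C) for the best prefix alignment.
    prev = [(j, j, 0, 0, 0) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        curr = [(i, 0, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            e, ins, sub, dele, cor = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, ins, sub, dele, cor + 1)      # correct word
            else:
                diag = (e + 1, ins, sub + 1, dele, cor)  # substitution
            e, ins, sub, dele, cor = prev[j]
            up = (e + 1, ins, sub, dele + 1, cor)        # reference word dropped
            e, ins, sub, dele, cor = curr[j - 1]
            left = (e + 1, ins + 1, sub, dele, cor)      # extra recognized word
            curr.append(min(diag, up, left, key=lambda cell: cell[0]))
        prev = curr
    return prev[-1][1:]


def wer_percent(insertions, substitutions, deletions, correct):
    """WER in percent: errors divided by the number of reference words."""
    reference_words = substitutions + deletions + correct
    if reference_words == 0:
        raise ValueError("reference contains no words")
    return 100.0 * (insertions + substitutions + deletions) / reference_words


if __name__ == "__main__":
    counts = count_errors("turn on the kitchen light",
                          "turn the kitchen lights please")
    print(counts, wer_percent(*counts))
```

Because insertions are counted in the numerator but not in the denominator, the WER can exceed 100% when the recognizer outputs many spurious words.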