Micropower Adaptive VLSI Systems for Acoustic Source Localization and Separation
Published in Krzysztof Iniewski (Ed.), Integrated Microsystems, 2017
Milutin Stanaćević, Gert Cauwenberghs
The experimental setup is shown in Figure 11.12. The speech signals were presented through loudspeakers positioned 1.5 m from the array. The sampling frequency of both chips was set to 16 kHz. A male and a female speaker from the TIMIT database were chosen as sound sources. To provide ground-truth data and full characterization of the systems, speech segments were first presented individually through either loudspeaker at different times. The data were recorded for both speakers, archived, and presented to the gradient flow chip. Localization results obtained by the gradient flow chip through LMS adaptation are reported in Table 11.3. The two recorded data sets were then summed and presented to the gradient flow ASIC. The gradient signals obtained from the chip were in turn presented to the ICA processor, configured to implement the outer-product update algorithm in Equation 11.33. The quantization level in the three-level approximation of the function g was set to a 100 mV amplitude change in the voltage Vth. The observed convergence time was around 2 s. From the recorded 14-bit digital weights, the angles of incidence of the sources relative to the array were derived. These estimated angles are reported in Table 11.3. As seen, the angles obtained through LMS bearing estimation under individual source presentation are very close to those produced by ICA under joint presentation of both sources. The original sources and the recorded source signal estimates, along with the recorded common-mode signal and first-order spatial gradients, are shown in Figure 11.13.
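Since Equation 11.33 is not reproduced in this excerpt, the following Python sketch assumes a standard outer-product (natural-gradient) ICA rule, W <- W + mu*(I - g(y) y^T) W, with the score function g replaced by the three-level quantization described above. The function names, the threshold parameter theta (standing in for the voltage Vth), and the learning rate mu are illustrative choices, not values taken from the chip.

import numpy as np

def g_three_level(y, theta):
    # Three-level (-1, 0, +1) approximation of the ICA score function g;
    # theta plays the role of the on-chip threshold voltage Vth.
    return np.where(y > theta, 1.0, np.where(y < -theta, -1.0, 0.0))

def ica_outer_product_step(W, x, mu=1e-3, theta=1.0):
    # One outer-product update on a sample vector x of gradient signals,
    # using the assumed natural-gradient form W <- W + mu*(I - g(y) y^T) W.
    y = W @ x
    I = np.eye(W.shape[0])
    return W + mu * (I - np.outer(g_three_level(y, theta), y)) @ W

Iterating this update over the recorded gradient signals mimics the on-chip adaptation; the converged unmixing weights are then the quantities from which the bearing angles in Table 11.3 are derived.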
Recognizing spoken words in semantically-anomalous sentences: Effects of executive control in early-implanted deaf children with cochlear implants
Published in Cochlear Implants International, 2021
David B. Pisoni, William G. Kronenberger
The Perceptually Robust English Sentence Test Open-Set (PRESTO) (Gilbert et al., 2013) and PRESTO-Foreign Accented English (PRESTO-FAE) (Tamati & Pisoni, 2015) are high-variability sentence recognition tests that use multiple talkers and different regional dialects for each sentence. PRESTO consists of 30 sentences drawn from the Texas Instruments-MIT Acoustic-Phonetic Speech Corpus (TIMIT) database (Garofolo et al., 1993). Each sentence was spoken by a different male or female talker representing one of six regional United States dialects, with 3–5 words in each sentence serving as keywords (e.g. ‘A flame would use up air,’ ‘John cleaned shellfish for a living’). PRESTO-FAE (Tamati & Pisoni, 2015) consists of 26 low predictability (LP) sentences selected from the Speech Perception in Noise (SPIN) test (Kalikow et al., 1977). Each sentence was spoken by a non-native speaker of English who differed in accent and international English dialect, with 3–6 words identified as keywords (‘It was stuck together with glue,’ ‘My jaw aches when I chew gum’). Percentage of keywords correctly recognized was the primary dependent measure used for PRESTO and PRESTO-FAE in the current data analysis.
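As a concrete illustration of the dependent measure, a minimal keyword-scoring sketch in Python is shown below. It assumes simple whole-word matching against the keyword list; the published tests apply their own scoring conventions (e.g. for morphological variants), so this is illustrative only.

def percent_keywords_correct(response, keywords):
    # Fraction of sentence keywords present in the listener's repetition.
    words = {w.strip(".,!?'\"").lower() for w in response.split()}
    hits = sum(1 for k in keywords if k.lower() in words)
    return 100.0 * hits / len(keywords)

# Example: percent_keywords_correct("a flame would use up the air",
#                                   ["flame", "use", "air"])  ->  100.0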
Longitudinal effect of deactivating stimulation sites based on low-rate thresholds on speech recognition in cochlear implant users
Published in International Journal of Audiology, 2019
Speech recognition in quiet was evaluated using the TIMIT (Texas Instruments and Massachusetts Institute of Technology) sentences (Garofolo et al. 1993), which comprise recordings of 630 speakers of eight major dialects of phonetically rich American English. Many of the sentences are semantically incoherent, thus providing very limited contextual cues. Because the performance-intensity functions of the TIMIT sentences for cochlear implant listeners are unknown, the sentences were normalised to have equal root-mean-square values. They were calibrated to be delivered at 65 dB(A) SPL. For each condition, two lists of TIMIT sentences were randomly selected without replacement. Again, the subjects were instructed to repeat back what they heard and were encouraged to make their best guesses when they did not understand every word in the sentence. The number of words correctly identified was used to calculate a percent-correct score.
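A minimal sketch of the equal-RMS normalisation step in Python/NumPy follows. The target RMS value here is an arbitrary digital reference; the absolute 65 dB(A) SPL presentation level is set by playback calibration, not in software.

import numpy as np

def rms_normalise(x, target_rms=0.05):
    # Scale a sentence waveform so all sentences share the same RMS value,
    # removing level differences before calibrated playback.
    x = x.astype(np.float64)
    rms = np.sqrt(np.mean(x ** 2))
    return x * (target_rms / rms)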
A new perceptually weighted cost function in deep neural network based speech enhancement systems
Published in Hearing, Balance and Communication, 2019
The audio material for the training, validation and test sets is taken from the TIMIT database [20]. Five noise types, namely white noise, pink noise, babble, jet engine and traffic, are taken from the NOISEX-92 database [21] for training the DNN. In addition to the five noise types from the NOISEX-92 database, two noise types, restaurant and street, are extracted from the Aurora2 database [22] to evaluate the performance of the trained network in unseen noisy conditions. All signals are re-sampled to 16 kHz. For training the DNN, 1100 male and 1100 female clean speech utterances taken from the TIMIT training set are employed. Each sentence is mixed with a randomly chosen training noise type at a random SNR ranging from
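A minimal sketch of this noisy-mixture generation in Python/NumPy, assuming speech and noise are already resampled to 16 kHz and that the noise recording is longer than the utterance; the SNR is left as a parameter, since the range is not given above.

import numpy as np

def mix_at_snr(speech, noise, snr_db, rng=np.random.default_rng()):
    # Cut a random noise segment of the same length as the utterance,
    # scale it so that 10*log10(P_speech / P_noise) equals snr_db, add it.
    start = rng.integers(0, len(noise) - len(speech))
    n = noise[start:start + len(speech)].astype(np.float64)
    s = speech.astype(np.float64)
    gain = np.sqrt(np.mean(s ** 2) / (np.mean(n ** 2) * 10 ** (snr_db / 10)))
    return s + gain * n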