Pitch and timbre
Published in Stanley A. Gelfand, Hearing, 2017
There is reasonably good correspondence between pitch in mels, critical band intervals in barks (see Chapter 10), and distance along the basilar membrane (e.g., Stevens and Volkmann, 1940; Scharf, 1970; Zwicker and Fastl, 1999; Goldstein, 2000). Zwicker and Fastl (1999) suggested that 100 mels corresponds to one bark and a distance of approximately 1.3 mm along the cochlear partition. These relationships are illustrated in Figure 12.2. Goldstein (2000) reported that the Stevens and Volkmann (1940) mel scale is a power function of the frequency-place map for the human cochlea (Greenwood, 1990).
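The approximate equivalence attributed to Zwicker and Fastl (1999) above can be written out as simple arithmetic. The following is only a rough sketch: it assumes the relationship is strictly proportional over the whole range, which the text does not claim, and the function and constant names are hypothetical.

```python
# Approximate equivalences quoted from Zwicker and Fastl (1999) in the text.
MELS_PER_BARK = 100.0   # about 100 mels per critical band (bark)
MM_PER_BARK = 1.3       # about 1.3 mm of cochlear partition per bark


def mels_to_barks(mels):
    """Convert a pitch interval in mels to critical-band units (barks)."""
    return mels / MELS_PER_BARK


def barks_to_mm(barks):
    """Convert barks to an approximate distance along the cochlear partition."""
    return barks * MM_PER_BARK


def mels_to_mm(mels):
    """Chain the two approximations: mels -> barks -> millimetres."""
    return barks_to_mm(mels_to_barks(mels))


# Example: a 250-mel pitch interval spans 2.5 barks under this assumption.
interval_barks = mels_to_barks(250.0)
interval_mm = mels_to_mm(250.0)
```

Because both factors are approximations, the result is best read as an order-of-magnitude guide rather than a measurement.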
Automatic speech recognition: A primer for speech-language pathology researchers
Published in International Journal of Speech-Language Pathology, 2018
Mel-frequency cepstral coefficients (MFCCs) are the most common type of feature in ASR systems (Davis & Mermelstein, 1980). A block diagram of MFCC feature extraction is given in Figure 2. The acoustic signal is first windowed into chunks of 25 milliseconds (ms) that overlap by 10 ms. Windows of this size are used under the assumption that the characteristics of the signal are relatively constant within them, a property referred to as stationarity. A single window is referred to as a frame. A tapering function, such as the Hamming window, is applied to each frame, followed by a Discrete Fourier Transform (DFT). The output is a description of the spectral properties of the frame. A non-linear scale called the mel scale is then applied to the spectrum in order to better match the non-linear frequency response of the human cochlea. The output is then log-compressed. The logarithm not only compresses the signal but also provides an easy way to remove fixed or unchanging components such as the room’s acoustics, the spectral characteristics of the specific microphone used, and some fixed aspects of the speaker’s voice. These fixed components can be considered linear filters that are applied to the desired signal by multiplication in the spectral domain. In the log domain, the multiplication is converted to addition, and the fixed components can be removed by subtracting the average signal over time.
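The steps described above (framing, Hamming taper, DFT, mel-scale filtering, log compression, mean subtraction) can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the procedure of any particular ASR toolkit. Several details are assumptions not given in the text: the analytic mel formula `2595 * log10(1 + f / 700)` is one widely used approximation, the triangular filter shape and the choice of 10 filters are illustrative, and all function names are hypothetical. The sketch follows the text literally, so a 25 ms frame with a 10 ms overlap gives a 15 ms shift between frames; many toolkits instead use a 10 ms shift. The final cosine-transform step that turns log mel energies into cepstral coefficients is omitted because the paragraph does not describe it. A naive DFT is used only to keep the example self-contained; an FFT routine would normally be used.

```python
import cmath
import math


def hz_to_mel(f):
    # One common analytic approximation of the mel scale (an assumption here).
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def frame_signal(x, frame_len, hop):
    # Split the signal into overlapping frames; a trailing partial frame is dropped.
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]


def hamming(n):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def power_spectrum(frame):
    # Naive DFT over the non-negative frequency bins, returned as power.
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        s = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spec.append(abs(s) ** 2)
    return spec


def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters whose edges are equally spaced on the mel scale.
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [f * n_fft / sample_rate for f in edges_hz]  # fractional DFT bin positions
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def log_mel_features(x, sample_rate, frame_ms=25.0, overlap_ms=10.0, n_filters=10):
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop = int(round(sample_rate * (frame_ms - overlap_ms) / 1000.0))
    window = hamming(frame_len)
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    feats = []
    for frame in frame_signal(x, frame_len, hop):
        tapered = [s * w for s, w in zip(frame, window)]
        spec = power_spectrum(tapered)
        energies = [sum(w * p for w, p in zip(row, spec)) for row in bank]
        # Small floor guards against log(0) for an empty band.
        feats.append([math.log(max(e, 1e-10)) for e in energies])
    # Subtract the per-band average over time to remove fixed components.
    means = [sum(f[j] for f in feats) / len(feats) for j in range(n_filters)]
    return [[f[j] - means[j] for j in range(n_filters)] for f in feats]


# Usage: a short synthetic two-tone signal, and the same signal at half gain.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 300 * t / sample_rate)
          + 0.5 * math.sin(2 * math.pi * 1200 * t / sample_rate)
          for t in range(800)]
features = log_mel_features(signal, sample_rate)
quieter = log_mel_features([0.5 * s for s in signal], sample_rate)
```

The half-gain signal stands in for the simplest possible fixed filter, a constant gain. It scales every spectral value by the same factor, which becomes a constant offset after the logarithm and disappears when the time average is subtracted, so `features` and `quieter` agree to within floating-point error.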
Automated speech analysis tools for children’s speech production: A systematic literature review
Published in International Journal of Speech-Language Pathology, 2018
J. McKechnie, B. Ahmed, R. Gutierrez-Osuna, P. Monroe, P. McCabe, K. J. Ballard
Figure 3(C) summarises the feature extraction data from the studies. The majority of tools, in 20/32 publications, used Mel-frequency cepstral coefficients (MFCCs), often in combination with other features. MFCCs map spectral information from the speech signal onto the Mel scale, which approximates the way the human auditory system perceives frequencies. For three tools, the feature extraction method was not reported (de Wet, Van der Walt, & Niesler, 2009; Duenser et al., 2016; Lee et al., 2011).