Gesture, Signing, and Tracking
Stefano Federici, Marcia J. Scherer in Assistive Technology Assessment Handbook, 2017
Evaluation of SLR can be considered at different levels. At the lowest level, recognizing hand poses, body posture, lip shape, and facial expressions are all very challenging pattern recognition problems, which are being approached with a variety of artificial intelligence methods. As Cooper et al. explain, SLR shares some of the characteristics that make speech recognition a difficult problem, such as coarticulation. Added to this, however, is the need to handle the nonsequential aspects of sign production and the occlusion of one hand by the other or by clothing. The structure of sign languages also presents many challenges. Nonmanual features (facial expression), sign placement, body shift and positional signs (relationships of hand poses to other parts of the body, to other people, and to objects in the environment), and adverbs conveyed by the relative speed of a gesture are just some of the constructs that a recognizer must be able to deal with. Inter-signer differences are also large. At the production level, as with gestures, the throughput of sign production and recognition can be computed, and errors are measured by observation or with respect to standard corpora of different sign languages.
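Scoring recognizer output against an annotated corpus is commonly done with an edit-distance-based error rate, analogous to word error rate in speech recognition. A minimal sketch (the sign glosses and recognizer output below are hypothetical, and real evaluations typically also report insertions, deletions, and substitutions separately):

```python
def sign_error_rate(reference, hypothesis):
    """Edit-distance-based error rate between a reference sign sequence
    (from an annotated corpus) and a recognizer's output sequence."""
    m, n = len(reference), len(hypothesis)
    # dp[i][j] = minimum edits to turn reference[:i] into hypothesis[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[m][n] / max(m, 1)

# Hypothetical corpus annotation vs. recognizer output (sign glosses):
ref = ["HELLO", "MY", "NAME", "WHAT"]
hyp = ["HELLO", "NAME", "WHO"]
rate = sign_error_rate(ref, hyp)  # 1 deletion + 1 substitution over 4 signs
```

Dividing by the reference length means the rate can exceed 1.0 when the recognizer inserts many spurious signs, which is the usual convention for such metrics.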
Speech and its perception
Stanley A. Gelfand in Hearing, 2017
Perhaps the most widely known speech perception theory is Liberman's motor theory, the details of which have evolved over the years (Liberman, 1996; Liberman et al., 1967; Liberman and Mattingly, 1985, 1989; Mattingly and Liberman, 1988; Liberman and Whalen, 2000). Recall that coarticulation causes a particular phonetic element to have different acoustical characteristics depending on its context (e.g., different formant transitions for /d/ in /di/ versus /du/). Motor theory proposes that speech perception involves identifying the intended speech gestures (effectively the neuromotor instructions to the articulators) that resulted in the acoustical signal produced by the speaker and heard by the listener. In other words, we perceive the invariant intended phonetic gestures (e.g., release of the alveolar closure in /di/ and /du/) that are encoded in the variable acoustical signals. This perceptual process involves biologically-evolved interactions between the speech perception and production systems, and is accomplished by a specialized speech or phonetic module (mode) in the central nervous system.
Inherent Noise Hidden in Nervous Systems’ Rhythms Leads to New Strategies for Detection and Treatments of Core Motor Sensing Traits in ASD
Elizabeth B. Torres, Caroline Whyatt in Autism, 2017
Indeed, both pointing and talking require a lengthy maturation period. They require mastering timely synergies and prospective coarticulation (Hardcastle and Hewlett 1999; Menard et al. 2013; Ryalls et al. 1993; Smith 2006), but developing these abilities requires continuous sensory feedback, particularly as the returning stream of self-generated movements is sensed back through the afferent nerves of the periphery and autonomously supervised by the nervous system. This continuous flow must be further integrated with other sensory inputs from external sources. If the processing of any of these components is impeded during neurodevelopment, the formation of proper maps and sensory-motor transformations will also be affected.
The effect of phoneme-based auditory training on speech intelligibility in hearing-aid users
Published in International Journal of Audiology, 2022
Aleksandra Koprowska, Jeremy Marozeau, Torsten Dau, Maja Serman
The finding that the effects of training were observed only for one position of the consonant (C2 but not C1) was unexpected. Woods et al. (2015) found no effect of consonant position on training benefits in a study considering CVC (consonant-vowel-consonant) units. The effect of position might therefore be specific to the DANOK material and linked to the presence of the second vowel /i/. In DANOK, the initial consonant C1 is involved in coarticulatory effects with the following vowel only. C2, on the other hand, is part of a VCV (vowel-consonant-vowel) sequence. Thus, in the case of C2, more coarticulation cues are available for the listener. Assuming that the training improved the ability to utilise those cues, the benefit might be higher for the second consonant. The listeners might have adapted their strategy to optimise their score and therefore might have focussed their attention on the target, which was easier to identify, neglecting the first consonant. This explanation is also consistent with the minor drop in the C1 score in the training group, which was not observed in the control group.
Relationship between phoneme-level spectral acoustics and speech intelligibility in healthy speech: a systematic review
Published in Speech, Language and Hearing, 2021
Timothy Pommée, Mathieu Balaguer, Julien Pinquier, Julie Mauclair, Virginie Woisard, Renée Speyer
Just as for vowels, another type of measure used in the retained papers is the dynamic formant transition, notably the F2 slope. The F2 slope measure, used for glides in A22, is ‘a dynamic measure that reflects the rate at which speech movements can be performed’ (R. D. Kent, Kent, et al., 1989) and is thus related to speaking rate. Van Son and Pols (1999), investigating acoustic correlates of consonant reduction in healthy speech, found that the F2 slope difference (i.e., the difference between the F2 slopes at the VC and CV boundaries of VCV syllables) is lower in spontaneous than in read speech. This reduced F2 slope difference indicates weaker consonant-induced coarticulation in the VCV syllable and thus a reduced consonant articulation. The use of formant transition measures is all the more noteworthy since it has been shown that in healthy ageing a decrease in intelligibility can be partly attributed to slower tongue movements (Kuruvilla-Dugdale et al., 2020).
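In practice, an F2 slope is a line fitted to the F2 track over a short window at a vowel-consonant boundary. A minimal sketch of the slope-difference computation in the spirit of Van Son and Pols (1999); the window layout and the F2 values below are illustrative, not real formant data:

```python
def slope(times_ms, f2_hz):
    """Least-squares slope (Hz/ms) of an F2 track over a short window."""
    n = len(times_ms)
    mt = sum(times_ms) / n
    mf = sum(f2_hz) / n
    num = sum((t - mt) * (f - mf) for t, f in zip(times_ms, f2_hz))
    den = sum((t - mt) ** 2 for t in times_ms)
    return num / den

def f2_slope_difference(vc_window, cv_window):
    """Difference between the F2 slopes at the VC and CV boundaries
    of a VCV syllable; larger magnitudes suggest stronger
    consonant-induced coarticulation. Each window is (times, f2)."""
    return slope(*vc_window) - slope(*cv_window)

# Hypothetical F2 samples (time in ms, F2 in Hz) around an /ada/-like token:
vc = ([0, 10, 20, 30], [1200, 1350, 1500, 1650])   # F2 rising into the consonant
cv = ([0, 10, 20, 30], [1650, 1500, 1350, 1200])   # F2 falling out of it
diff = f2_slope_difference(vc, cv)
```

With the symmetric toy data above, the VC slope is +15 Hz/ms and the CV slope is −15 Hz/ms, so the difference is 30 Hz/ms; flatter transitions, as in reduced spontaneous speech, would shrink this value.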
Assessing speech correction abilities with acoustic analyses: Evidence of preserved online correction in persons with aphasia
Published in International Journal of Speech-Language Pathology, 2018
Caroline A. Niziolek, Swathi Kiran
The eight PWA completed a behavioural experiment in which they read aloud monosyllabic words. Participants were seated in a sound booth while their speech was recorded with a head-worn condenser microphone placed ∼2 cm from the corner of the mouth. Recordings had a sampling rate of 44 100 Hz. On each trial, one of three monosyllabic words (“eat”, “Ed” or “add”) was randomly chosen and displayed on the screen. These three words were selected to avoid effects of consonant coarticulation—all words began with a vowel and ended with consonants sharing a place of articulation—and for comparison with past studies using this stimulus set. Visual presentation of target words was chosen for maximal efficiency and for suitability for planned neuroimaging follow-up studies. The presentation rate was automatically adjusted to account for variable response time delays: produced sounds with a duration of at least 150 ms were detected by the custom-developed software as a vocal response (Niziolek & Mandel, 2017), and the following trial was displayed after a 500-ms delay from response offset. Participants completed 600 trials in total with an optional break after each block of 30 trials (∼every 60 s), for an average of 200 productions of each word.
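The trial-scheduling logic described above can be sketched as follows. This is a simplified reconstruction under stated assumptions, not the authors' actual software: `detect_response` stands in for the voice-detection component (Niziolek & Mandel, 2017) and is assumed to return the response-offset time for each trial.

```python
import random

WORDS = ["eat", "Ed", "add"]       # stimulus set from the study
MIN_VOICED_MS = 150                # duration threshold for a vocal response
POST_OFFSET_DELAY_MS = 500         # delay from response offset to next trial
TRIALS_TOTAL = 600
BLOCK_SIZE = 30                    # optional break after each block

def run_session(detect_response):
    """Build the trial schedule: each trial shows a random word, waits for
    a detected vocal response (>= MIN_VOICED_MS of voicing), and starts
    the next trial POST_OFFSET_DELAY_MS after response offset."""
    schedule = []
    for trial in range(1, TRIALS_TOTAL + 1):
        word = random.choice(WORDS)
        response_offset_ms = detect_response(word, MIN_VOICED_MS)
        next_onset_ms = response_offset_ms + POST_OFFSET_DELAY_MS
        schedule.append((trial, word, next_onset_ms))
        if trial % BLOCK_SIZE == 0 and trial < TRIALS_TOTAL:
            schedule.append((trial, "BREAK", None))   # optional pause
    return schedule
```

Because the next trial is yoked to response offset rather than a fixed interval, slower responders simply get a slower presentation rate, which matches the self-paced design the excerpt describes.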