Human Speech Communication
Published in John Holmes, Wendy Holmes, Speech Synthesis and Recognition, 2002
It is clear that the human perceptual and cognitive systems must be enormously complex to perform the task of linguistic processing. The very large number of neurons involved all work in parallel, so although the actual processing speed in any one part of the central nervous system is very slow compared with that of modern electronic circuits, overall perceptual decisions can be made within a few hundred milliseconds. Where machines are required to recognize and interpret speech, it is apparent that emulating human performance on normal relaxed conversation will not be possible without the machine having extensive linguistic knowledge and a considerable ability to simulate human intelligence. However, if the machine's task is simplified by placing constraints on what can be said, speech can already be used for many types of human-machine interaction. Recent developments have greatly increased the range and complexity of tasks to which speech can usefully be applied, and speech technology capabilities are advancing all the time. Even so, the situation so often depicted in science fiction, where machines have no difficulty at all in understanding whatever people say to them, is still many years away.
Individual Differences and Inclusive Design
Published in Constantine Stephanidis, User Interfaces for All, 2000
David Benyon, Alison Crerar, Simon Wilkinson
With bespoke software, the adaptor between user and system has typically been the systems analyst and the computer programmer: the analyst understands the users' needs and translates them into system functions. Education and training are further examples of adaptors mediating between a fixed-design computer system and a human. More recently, speech recognition systems have become available to act as adaptors between humans and computer systems; speech technology has wide application for both able-bodied and physically disabled users. The use of an adaptor is appropriate when two systems cannot otherwise accommodate each other, as is the case when accessibility problems are alleviated by the choice of alternative input/output devices or by communication via an alternative modality.
Ensuring It Works – How Do You Know?
Published in James Luke, David Porter, Padmanabhan Santhanam, Beyond Algorithms, 2022
James Luke, David Porter, Padmanabhan Santhanam
They are everywhere! Typically, ChatBots are deployed in customer support roles, specialising in business-relevant tasks in specific domains (called Skills). In contrast to one-shot question answering, ChatBot tasks require multiple turns with the user in a specific context. Examples are travel reservations, banking transactions, answering COVID-19-related questions [6], etc. The user interaction can be via speech or text. Speech technology involves acoustic models and thus introduces further complications due to accents, dialects, background noise, etc. Once the user's speech is transformed into text (typically using a speech-to-text AI component), NLU processes take over and the problem becomes the same as for textual inputs, as the sketch below illustrates.
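The hand-off from speech to text can be made concrete with a short Python sketch. This is illustrative only: transcribe and classify_intent are hypothetical placeholders for a speech-to-text engine and an NLU model (the excerpt names no specific components), and the travel-reservation Skill is just an example. The point is that once audio has been transcribed, speech turns and text turns flow through the same pipeline, with a context object carrying multi-turn state.

# Minimal sketch of the speech/text convergence described above.
# `transcribe` and `classify_intent` are hypothetical stand-ins for
# an STT engine and an NLU model, not calls to any real library.

from dataclasses import dataclass, field

@dataclass
class DialogueContext:
    """Multi-turn state the ChatBot carries between turns."""
    skill: str                                  # e.g. "travel_reservation"
    slots: dict = field(default_factory=dict)   # filled in over the dialogue

def transcribe(audio: bytes) -> str:
    # Hypothetical STT stage: a real system runs an acoustic model and
    # decoder here (where accents, dialects and noise cause errors).
    return "book a flight to london"            # canned transcript for illustration

def classify_intent(text: str, ctx: DialogueContext) -> str:
    # Hypothetical NLU stage: a real system would use a trained model;
    # the context argument is what makes the dialogue multi-turn.
    if "flight" in text or ctx.slots.get("intent") == "book_flight":
        return "book_flight"
    return "fallback"

def handle_turn(user_input, ctx: DialogueContext) -> str:
    # Speech and text converge: audio is transcribed first, after which
    # both modalities flow through the same NLU pipeline.
    text = transcribe(user_input) if isinstance(user_input, bytes) else user_input
    ctx.slots["intent"] = classify_intent(text, ctx)
    return f"[{ctx.skill}] intent: {ctx.slots['intent']}"

ctx = DialogueContext(skill="travel_reservation")
print(handle_turn(b"\x00\x01", ctx))       # a speech turn (raw audio bytes)
print(handle_turn("yes, on Friday", ctx))  # a text turn, same pipeline

Note how the second (text) turn is resolved from the stored context rather than from its own wording; that carried state is what distinguishes a multi-turn ChatBot from one-shot question answering.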
Identification and authentication of user voice using DNN features and i-vector
Published in Cogent Engineering, 2020
Kydyrbekova Aizat, Othman Mohamed, Mamyrbayev Orken, Akhmediyarova Ainur, Bagashar Zhumazhanov
Automated systems use voice as both input and output when interacting with the user. These systems are based on speech technologies such as automatic speech recognition (ASR). The properties of speech signals naturally change rapidly over time, so the signal is processed frame by frame: the discrete Fourier transform is used to compute the power spectrum of each frame. A bank of band-pass filters spaced on the mel scale is then applied, with narrow filters at low frequencies and progressively wider filters at high frequencies; the point of this filter bank is to measure the energy level in different frequency ranges. The discrete cosine transform of the log filter-bank outputs is then computed. In this article, speech utterances were divided into frames of 25 ms. For each frame, 12 MFCCs and the normalized energy, together with their first and second derivatives, were calculated, resulting in 39 coefficients representing each frame.
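As a concrete illustration, a 39-dimensional feature vector of this kind can be computed in a few lines of Python. The article does not name an implementation; the sketch below assumes librosa, whose mfcc routine wraps the framing, power-spectrum, mel filter-bank, log, and DCT steps, and treats the 0th cepstral coefficient as the energy term. The file name, 16 kHz sample rate, and 10 ms hop are illustrative assumptions, not values from the article.

import numpy as np
import librosa

# Load an utterance (file name and 16 kHz rate are illustrative).
y, sr = librosa.load("utterance.wav", sr=16000)

frame_len = int(0.025 * sr)   # 25 ms frames, as in the article
hop_len = int(0.010 * sr)     # 10 ms hop (assumed; the article does not say)

# 13 base coefficients per frame: coefficient 0 serves as the energy
# term, the remaining 12 are the MFCCs proper.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=frame_len, hop_length=hop_len)

delta1 = librosa.feature.delta(mfcc, order=1)   # first derivatives
delta2 = librosa.feature.delta(mfcc, order=2)   # second derivatives

features = np.vstack([mfcc, delta1, delta2])    # shape (39, n_frames)
print(features.shape)

Stacking the static coefficients with their first and second derivatives yields the 13 + 13 + 13 = 39 coefficients per frame described in the article.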