Automated Scoring of Extended Spontaneous Speech
Published in Duanli Yan, André A. Rupp, Peter W. Foltz, Handbook of Automated Scoring, 2020
Klaus Zechner, Anastassia Loukina
In automated speech processing research, ASR performance is usually measured using word error rate (WER), which is computed from an automated string alignment between the ASR's word hypothesis and the reference transcription produced by a human transcriber. Specifically, WER is defined as the sum of all alignment errors (i.e., word substitutions, insertions, and deletions) normalized by the length of the reference (human) transcription. For spontaneous non-native speech, WERs above 50% were typical in the 2000s (Zechner et al., 2009); more recently, using deep neural network-based ASR systems, researchers have obtained WERs in the range of 20% to 30% (Tao, Ghaffarzadegan, Chen, & Zechner, 2016; Cheng, Chen, & Metallinou, 2015), although this depends on the particular use context.
Recent Advancements in Automatic Sign Language Recognition (SLR)
Published in Sourav De, Paramartha Dutta, Computational Intelligence for Human Action Recognition, 2020
Varshini Prakash, B.K. Tripathy
The HMM is based on the freely available state-of-the-art open-source speech recognition system RASR [10]. System performance is measured in word error rate (WER), which is based on the Levenshtein alignment: it computes the minimum number of insertions, deletions, and substitutions needed to transform the hypothesis sentence into the reference sequence.

WER = (#deletions + #insertions + #substitutions) / #reference observations
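The Levenshtein-based WER computation described above can be sketched in a few lines of Python. This is an illustrative implementation (the function name and example sentences are our own); production systems typically use scoring toolkits such as NIST's sclite, but the underlying dynamic program is the same:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: minimum edit operations (substitutions,
    insertions, deletions) aligning hypothesis to reference,
    normalized by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edits to align ref[:i] with hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One word ("the") deleted from a six-word reference -> WER = 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 100% when the hypothesis contains many insertions, since the error count is normalized by reference length, not hypothesis length.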
Voice Subtitle Transmission in the Marine VHF Radiotelephony
Published in Adam Weintrit, Marine Navigation, 2017
The performance of speech recognition systems is usually evaluated in terms of accuracy and speed. Accuracy is usually rated with word error rate (WER), whereas speed is measured with the real time factor. Other measures of accuracy include Single Word Error Rate (SWER) and Command Success Rate (CSR).
CNAMD Corpus: A Chinese Natural Audiovisual Multimodal Database of Conversations for Social Interactive Agents
Published in International Journal of Human–Computer Interaction, 2023
Jingyu Wu, Shi Chen, Wei Xiang, Lingyun Sun, Hongzeng Zhang, Zhengyu Zhang, Yanxu Li
For the automatic speech recognition (ASR) baseline models, we use audio as input. We compare LSTM (Sak et al., 2014), FSMN (S. Zhang et al., 2015), and Paddle (H. Zhang et al., 2022) as baseline models. LSTM is a well-known NLP model that controls information flow through gated states, retaining what is important over long spans and forgetting what is not. FSMN builds on LSTM with an additional attention factor and distinguishes two settings: with or without delay. Paddle is mainly used in audio synthesis but also shows great potential for ASR. We compare three common indicators: recognition time, accuracy, and word error rate (WER). The definitions of WER and accuracy are:
Interaction between people with dysarthria and speech recognition systems: A review
Published in Assistive Technology, 2023
Aisha Jaddoh, Fernando Loizides, Omer Rana
To identify studies, we used the following inclusion criteria: (i) studies published 2011–2022 (Siri launched in 2011, making ASR mainstream and enabling people with dysarthria to use ASR ubiquitously); (ii) studies evaluating interactions between people with dysarthria and ASR systems/devices; (iii) studies using word error rate (WER), the most commonly used accuracy metric, as the measurement criterion. Selecting studies that use the same metric allows the authors to make comparisons on a common basis across studies. Our focus is on human–computer interaction rather than on the disorder and therapeutic intervention; we therefore excluded clinical and therapeutic research, as well as research examining dysarthria in individuals with language and cognitive impairments (e.g., aphasia and dementia), to eliminate confounding factors that affect the interaction process.

The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were followed (Moher et al., 2015). One author conducted the screening process, while another undertook the final review of the screening results. Of 101 studies selected and assessed for eligibility, 40 were chosen to answer the research questions (see flowchart Figure A1 in Appendix A).

To systematically review the literature, the authors used the interaction framework (Abowd & Beale, 1991), which proposes that interactions between users and systems follow a four-component cycle: user, input, system, and output. We classify the literature on ASR and people with dysarthria according to these four components. A table (Table B1) mapping the framework components to the research questions is provided in Appendix B.