Communications
Published in David Burden, Maggi Savin-Baden, Virtual Humans, 2019
David Burden, Maggi Savin-Baden
The Speech Synthesis Markup Language (SSML) (Taylor and Isard 1997) is a commonly used, platform-independent interface standard for speech synthesis systems; it can be used to mark up the prosodic elements of speech to complement the words in the text. It is a formal Recommendation of the World Wide Web Consortium's Voice Browser Working Group.
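A minimal SSML document illustrating such prosody markup might look like the following sketch (element names follow the W3C SSML 1.0 Recommendation; the spoken text and attribute values are invented for illustration):

```xml
<?xml version="1.0"?>
<!-- Illustrative SSML fragment: prosody and emphasis markup
     complementing the plain text. Element names are per the
     W3C SSML 1.0 Recommendation; the content is invented. -->
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  Please leave a message after the tone.
  <break time="500ms"/>
  <prosody rate="slow" pitch="low">
    This part is spoken slowly and at a lower pitch.
  </prosody>
  <emphasis level="strong">Thank you.</emphasis>
</speak>
```

The plain words carry the message, while the surrounding tags tell the synthesizer how to pause, pace, and stress them.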
Media Resource Control Protocol Version 2
Published in Radhika Ranjan Roy, Handbook on Networked Multipoint Multimedia Conferencing and Multistream Immersive Telepresence using SIP, 2020
An MRCPv2 server may offer one or more of the following media processing resources to its clients:

Basic Synthesizer: A speech synthesizer resource that has very limited capabilities and can generate its media stream exclusively from concatenated audio clips. The speech data is described using a limited subset of the Speech Synthesis Markup Language (SSML) [3] elements. A basic synthesizer MUST support the SSML tags <speak>, <audio>, <say-as>, and <mark>.

Speech Synthesizer: A full-capability speech synthesis resource that can render speech from text. Such a synthesizer MUST have full SSML (W3C.REC-speech-synthesis-20040907) support.

Recorder: A resource capable of recording audio and providing a URI pointer to the recording. A recorder MUST provide endpointing capabilities for suppressing silence at the beginning and end of a recording and MAY also suppress silence in the middle of a recording. If such suppression is done, the recorder MUST maintain timing metadata to indicate the actual timestamps of the recorded media.

DTMF Recognizer: A recognizer resource capable of extracting and interpreting DTMF [2] digits in a media stream and matching them against a supplied digit grammar. It could also do a semantic interpretation based on semantic tags in the grammar.

Speech Recognizer: A full speech recognition resource that is capable of receiving a media stream containing audio and interpreting the recognition results. It also has a natural language semantic interpreter to post-process the recognized data according to the semantic data in the grammar and provide semantic results along with the recognized input. The recognizer MAY also support enrolled grammars, where the client can enroll and create new personal grammars for use in future recognition operations.

Speaker Verifier: A resource capable of verifying the authenticity of a claimed identity by matching a media stream containing spoken input to a pre-existing voiceprint. This may also involve matching the caller's voice against more than one voiceprint, also called multi-verification or speaker identification.
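The basic synthesizer's restricted SSML subset can be sketched as follows: a document that uses only the four mandatory tags, playing back pre-recorded clips rather than rendering arbitrary text (the audio URI and mark names are hypothetical placeholders, not part of the MRCPv2 specification):

```xml
<?xml version="1.0"?>
<!-- Sketch of input a basic synthesizer could accept: only the
     four tags it MUST support (<speak>, <audio>, <say-as>,
     <mark>) appear. The clip URI and mark names are invented. -->
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <mark name="prompt-start"/>
  <audio src="http://example.com/prompts/your-balance-is.wav"/>
  <say-as interpret-as="digits">1234</say-as>
  <mark name="prompt-end"/>
</speak>
```

A full speech synthesizer, by contrast, would also accept free text and the complete SSML element set (prosody, emphasis, breaks, and so on).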
How to Design Audio-Gamification for Language Learning with Amazon Alexa?—A Long-Term Field Experiment
Published in International Journal of Human–Computer Interaction, 2022
Paula Bräuer, Athanasios Mazarakis
To collect the data for this study, a skill was developed for the Amazon Alexa platform that allows users to learn German as a foreign language. Because speech input is determined by Alexa's default system language, an IVA with English set as its system language cannot process German-language input. However, with the help of Amazon Alexa's Speech Synthesis Markup Language, it is still possible to reproduce texts in other languages via text-to-speech (Speech Synthesis Markup Language (SSML) Reference | Alexa Skills Kit, n.d.). The Alexa skill is therefore designed to train listening comprehension, following the flashcard concept that Skidmore and Moore (2019) describe in their article. In addition to not requiring users to change any system settings, this approach has the benefit that native speakers can be assumed to interact easily with the IVA (Wu et al., 2020).
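The cross-language text-to-speech described above can be sketched with Alexa's SSML <lang> element, which renders a span of text with the pronunciation of another locale (the <lang> element and the de-DE locale code follow the Alexa SSML reference; the prompt text itself is an invented example, not taken from the study's skill):

```xml
<!-- Sketch of an Alexa skill speech response: the surrounding
     prompt is spoken in the skill's English system voice, while
     the German vocabulary item is pronounced using the German
     locale via <lang>. The example sentence is invented. -->
<speak>
  The German word for dog is
  <lang xml:lang="de-DE">der Hund</lang>.
  Can you repeat it?
</speak>
```

In this way a skill running under an English system language can still present German listening material, even though it cannot accept German speech as input.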