Speech Synthesis from Textual or Conceptual Input
Published in John Holmes, Wendy Holmes, Speech Synthesis and Recognition, 2002
Text will enter the TTS system as a string of characters in some electronically coded format, which in the case of English would normally be ASCII. The first stage in text analysis is text segmentation, whereby the character string is split into manageable chunks, usually sentences with each sentence subdivided into individual words. For a language such as English the separation into words is fairly easy as words are usually delimited by white space. The detection of sentence boundaries is less straightforward. For example, a full stop can usually be interpreted as marking the end of a sentence, but is also used for other functions, such as to mark abbreviations and as a decimal point in numbers.
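The two-level segmentation described above (sentences, then words) can be sketched as follows. This is a minimal illustration, not the method of any particular TTS system: the function name `split_sentences` and the small `ABBREVIATIONS` set are assumptions made for the example, and a real system would use a much larger lexicon and more context.

```python
# Hypothetical, deliberately tiny abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "e.g", "i.e", "etc"}


def split_sentences(text):
    """Split text into sentences, each returned as a list of words.

    Words are delimited by white space. A token ending in '.', '!' or '?'
    closes the sentence unless the full stop belongs to a known abbreviation.
    A decimal point sits inside a token (e.g. '3.5'), so it never triggers
    a boundary here.
    """
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "!", "?")):
            stem = token.rstrip(".!?").lower()
            if token.endswith(".") and stem in ABBREVIATIONS:
                continue  # full stop marks an abbreviation, not a boundary
            sentences.append(current)
            current = []
    if current:  # trailing words without final punctuation
        sentences.append(current)
    return sentences


example = split_sentences("Dr. Smith paid 3.5 pounds. He left.")
```

The heuristic still fails when an abbreviation happens to end a sentence (for example a sentence ending in "etc."), which is one reason sentence boundary detection is harder than word separation.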
Document Clustering: The Next Frontier
Published in Charu C. Aggarwal, Chandan K. Reddy, Data Clustering, 2018
David C. Anastasiu, Andrea Tagarelli, George Karypis
Text segmentation is concerned with the fragmentation of input text into smaller units (e.g., paragraphs) each possibly discussing a single main topic. Regardless of the presence of logical structure clues in the document, linguistic criteria and statistical similarity measures have been mainly used to identify thematically coherent, contiguous text blocks in unstructured documents [44, 10, 21].
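One simple way to apply a statistical similarity measure to this task is to compare adjacent sentences and start a new block wherever their similarity drops. The sketch below uses cosine similarity over bag-of-words counts; the function names and the `threshold` value are assumptions chosen for illustration, and it is not the specific method of any of the cited works.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphabetic tokens in a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def segment_by_similarity(sentences, threshold=0.1):
    """Group consecutive sentences into contiguous blocks.

    A new block starts whenever the similarity between a sentence and the
    one before it falls below `threshold` (an assumed, untuned value).
    """
    if not sentences:
        return []
    blocks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(bag_of_words(prev), bag_of_words(cur)) < threshold:
            blocks.append([cur])
        else:
            blocks[-1].append(cur)
    return blocks


blocks = segment_by_similarity([
    "the cat chased the mouse",
    "the mouse hid from the cat",
    "stock prices fell sharply",
    "investors sold stock as prices fell",
])
```

Comparing single sentences is noisy on real text; practical systems typically compare larger windows and smooth the similarity curve before placing boundaries.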
Morphological segmentation method for Turkic language neural machine translation
Published in Cogent Engineering, 2020
U. Tukeyev, A. Karibayeva, Zh. Zhumanov
When NMT models are trained for these language pairs, the corresponding NMT dictionary grows rapidly and therefore requires excessive memory. The well-known approaches to text segmentation are the BPE-based method (Sennrich et al., 2016) and Morfessor (Creutz & Lagus, 2002), both of which are unsupervised, statistics-based methods. The advantage of these two methods lies in their universal applicability to different languages.
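To make the BPE idea concrete, the following is a minimal sketch of how merge operations can be learned from word frequencies: the most frequent adjacent symbol pair is merged repeatedly, so frequent character sequences become single subword units. The function names, the `</w>` end-of-word marker as used here, and the toy word counts are assumptions for illustration; this is not the cited authors' implementation.

```python
import re
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with one merged symbol."""
    # Match the pair only when it is bounded by spaces or string edges.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda m: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} mapping."""
    # Start from single characters plus an end-of-word marker.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Toy counts, invented for this example.
merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
```

Because the procedure relies only on symbol statistics, it needs no language-specific resources, which is the universal applicability noted above; the number of merges controls the size of the resulting subword vocabulary.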