Text segmentation

Text segmentation is the process of dividing written text into meaningful units such as words, sentences, or topics. It can be achieved with unsupervised, statistics-based methods such as the BPE-based method and Morfessor. Other methods used to segment text into its component words and sentences include vertical projection, curve segmenting path, integrated segmentation, clustering, and recognition feedback.

From: Natural Language Processing [2019], Morphological segmentation method for Turkic language neural machine translation [2020], Handbook of Natural Language Processing [2019], Electronic Engineering and Information Science [2019]

Speech Synthesis from Textual or Conceptual Input

Published in John Holmes, Wendy Holmes, Speech Synthesis and Recognition, 2002

John Holmes, Wendy Holmes

Text will enter the TTS system as a string of characters in some electronically coded format, which in the case of English would normally be ASCII. The first stage in text analysis is text segmentation, whereby the character string is split into manageable chunks, usually sentences with each sentence subdivided into individual words. For a language such as English the separation into words is fairly easy as words are usually delimited by white space. The detection of sentence boundaries is less straightforward. For example, a full stop can usually be interpreted as marking the end of a sentence, but is also used for other functions, such as to mark abbreviations and as a decimal point in numbers.
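The two steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in any particular TTS system: the function names and the tiny `ABBREVIATIONS` set are assumptions made for the example, and a practical system would need a much larger abbreviation list and further rules.

```python
import re

# Hypothetical, deliberately small abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc", "e.g", "i.e"}


def split_sentences(text):
    """Split text at '.', '!' or '?' unless the mark looks like a non-boundary."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]", text):
        end = match.end()
        # A mark followed directly by a non-space character is not treated as a
        # boundary; this covers decimal points such as the one in "3.50".
        if end < len(text) and not text[end].isspace():
            continue
        if match.group() == ".":
            # Look at the token just before the full stop; skip known abbreviations.
            preceding = text[start:end - 1].split()
            if preceding and preceding[-1].lower() in ABBREVIATIONS:
                continue
        sentence = text[start:end].strip()
        if sentence:
            sentences.append(sentence)
        start = end
    # Keep any trailing text that has no closing punctuation.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def split_words(sentence):
    """Split a sentence into words on white space (punctuation stays attached)."""
    return sentence.split()


if __name__ == "__main__":
    for s in split_sentences("Dr. Smith paid 3.50 pounds. He left."):
        print(split_words(s))
```

One known weakness of this sketch is that an abbreviation that really does end a sentence is not detected as a boundary, which illustrates why sentence boundary detection is the harder of the two steps.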

Document Clustering: The Next Frontier

Published in Charu C. Aggarwal, Chandan K. Reddy, Data Clustering, 2018

David C. Anastasiu, Andrea Tagarelli, George Karypis

Text segmentation is concerned with the fragmentation of input text into smaller units (e.g., paragraphs) each possibly discussing a single main topic. Regardless of the presence of logical structure clues in the document, linguistic criteria and statistical similarity measures have been mainly used to identify thematically coherent, contiguous text blocks in unstructured documents [44, 10, 21].
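As a rough illustration of how a statistical similarity measure can mark topic boundaries, the sketch below compares the bag-of-words vectors of adjacent sentences and starts a new block wherever their cosine similarity falls below a threshold. It is not the method of any of the cited works; the function names and the `threshold` value are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphabetic tokens in a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment_by_topic(sentences, threshold=0.1):
    """Group contiguous sentences; open a new block when similarity drops."""
    if not sentences:
        return []
    segments = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        similarity = cosine(bag_of_words(previous), bag_of_words(current))
        if similarity < threshold:
            segments.append([current])    # low similarity: assume a topic shift
        else:
            segments[-1].append(current)  # otherwise extend the current block
    return segments


if __name__ == "__main__":
    example = [
        "the cat sat on the mat",
        "the cat chased the mouse",
        "stock prices fell sharply",
        "stock markets fell again",
    ]
    for block in segment_by_topic(example):
        print(block)
```

Comparing single sentences keeps the example short; comparing larger windows of text on either side of each candidate boundary would likely be more robust, at the cost of more code.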

Morphological segmentation method for Turkic language neural machine translation

Published in Cogent Engineering, 2020

U. Tukeyev, A. Karibayeva, Zh. Zhumanov

When training NMT for these language pairs, the size of the corresponding NMT dictionary increases rapidly and therefore requires excessive memory. Well-known approaches to text segmentation are the BPE-based method (Sennrich et al., 2016) and Morfessor (Creutz & Lagus, 2002), both of which are unsupervised and statistics-based. The advantage of these two methods lies in their universal applicability to different languages.
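To make the BPE idea concrete, the following is a minimal sketch of how merge operations can be learned from word frequencies: the most frequent adjacent symbol pair is merged repeatedly, so frequent character sequences become single subword units. The function name `learn_bpe`, the `</w>` end-of-word marker, and the toy word counts are assumptions for illustration; this is not the reference implementation.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn up to num_merges BPE merge operations from a word-frequency dict."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair
        # with a single merged symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab


if __name__ == "__main__":
    merges, vocab = learn_bpe({"hug": 10, "pug": 5, "hugs": 5}, num_merges=1)
    print(merges)
    print(vocab)
```

Because the number of merges is fixed in advance, the size of the resulting subword vocabulary is bounded, which is how this kind of segmentation addresses the dictionary growth described above.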
