Text-to-Speech Synthesis
Published in Michael Filimowicz, Foundations in Sound Design for Embedded Media, 2019
Once abbreviations are detected, they are expanded into the full words to be spoken. Sometimes there is ambiguity: an abbreviation such as “St.” can be expanded to either “Saint” or “Street.” The linguistic front-end uses information about surrounding words to pick the correct expansion. Besides expanding abbreviations, text normalization also involves normalization of other nonstandard words, including numbers, currency amounts, dates, and acronyms (Sproat et al. 2001). For example, in English, whenever a number is preceded by a currency symbol, the number is pronounced first, followed by the currency word. If the number is “one,” the currency word (pound, dollar) is singular; otherwise it is plural (pounds, dollars). And if there is a decimal in the currency amount, “$3.25” must be expanded to “three dollars and twenty-five cents.” Note that in the dollar amount the decimal is indicated by a period, whereas in continental Europe the comma is used as the decimal marker and the period as the thousands separator. When converting a date to words, the language also matters: in US English, 06/07/2008 would be expanded to “June seventh two thousand eight,” whereas in British English it would be expanded to “July sixth two thousand and eight.”
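To make the currency rule concrete, here is a minimal Python sketch of that one branch of a normalizer. It is an illustration, not the chapter’s implementation: the number words cover only 0–99, the symbol inventory is an assumption, and a production front-end would handle far more cases.

```python
import re

# Toy word lists covering 0-99; a real front-end spells out arbitrary numbers.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spell out 0-99, enough for the cents and small-amount examples."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

# Singular and plural forms per symbol (an assumed, minimal inventory).
CURRENCIES = {"$": ("dollar", "dollars", "cent", "cents"),
              "£": ("pound", "pounds", "penny", "pence")}

def expand_currency(text: str) -> str:
    """Expand e.g. '$3.25' to 'three dollars and twenty-five cents'.

    The symbol precedes the digits in writing, but the currency word is
    spoken after the number, singular only when the amount is one.
    """
    def repl(m: re.Match) -> str:
        unit_sg, unit_pl, sub_sg, sub_pl = CURRENCIES[m.group(1)]
        whole = int(m.group(2))
        words = f"{number_to_words(whole)} {unit_sg if whole == 1 else unit_pl}"
        if m.group(3):  # optional decimal part such as '.25'
            cents = int(m.group(3)[1:])
            words += f" and {number_to_words(cents)} {sub_sg if cents == 1 else sub_pl}"
        return words
    return re.sub(r"([$£])(\d+)(\.\d{2})?", repl, text)

print(expand_currency("It costs $3.25, not $1."))
# -> 'It costs three dollars and twenty-five cents, not one dollar.'
```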
A proposed model for predicting stock market behavior based on detecting fake news
Published in Yuli Rahmawati, Peter Charles Taylor, Empowering Science and Mathematics for Global Competitiveness, 2019
A.M. Idrees, M.H. Ibrahim, N.Y. Hegazy
Fake news detection is considered an important task for identifying factually incorrect and misleading news for investors. Stock market fake news aims to sway investors’ opinions and decisions about their investment portfolios, so it can cause large financial losses. The stock market fake news detection model is shown in Figure 1. Stock market news was collected from different authenticated data sources and from knowledge-sharing platforms such as Seeking Alpha and Motley Fool. The text preprocessing techniques in this study include tokenization, stop-word removal, stemming, and text normalization. Text normalization was applied to the news corpus by transforming different forms of text into a common standard format, converting all letters in the news to lowercase. After this, N-grams were used as a syntactic analysis technique to extract features from the news corpus as sequences of N consecutive tokens; our model generated bi-grams. N-grams perform robustly in feature extraction because they automatically capture the most frequent roots in news data. In addition, the representation provided by N-grams does not require a specific dictionary and is tolerant of spelling errors (Lyon & Cedex, 2009).
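As a rough illustration of this preprocessing pipeline (a sketch, not the authors’ code; the sample headline and the choice of the Porter stemmer are assumptions for the example), the following NLTK-based snippet lowercases a headline, tokenizes it, removes stop words, stems the tokens, and extracts bi-grams:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

# One-time downloads of the tokenizer model and stop-word list
# (newer NLTK releases may also require the "punkt_tab" resource).
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

def preprocess(headline: str, n: int = 2) -> list:
    """Lowercase, tokenize, drop stop words, stem, and return N-grams."""
    text = headline.lower()                      # text normalization
    tokens = word_tokenize(text)                 # tokenization
    stop = set(stopwords.words("english"))
    tokens = [t for t in tokens if t.isalpha() and t not in stop]
    stemmer = PorterStemmer()
    stems = [stemmer.stem(t) for t in tokens]    # stemming
    return list(ngrams(stems, n))                # bi-grams when n=2

# Invented example headline, not from the paper's corpus.
print(preprocess("Company X shares are soaring after earnings report"))
# -> [('compani', 'x'), ('x', 'share'), ('share', 'soar'), ...]
```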
Text Analysis
Published in Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter, Julia Lane, Big Data and Social Science, 2020
Evgeny Klochikhin, Jordan Boyd-Graber
Generally, the process for text normalization is implemented using established lemmatization and stemming algorithms. A lemma is the original dictionary form of a word. For example, “go,” “went,” and “goes” will all have the lemma “go.” The stem is a central part of a given word bearing its primary semantic meaning and uniting a group of similar lexical units. For example, the words “order” and “ordering” will have the same stem “ord.” Morphy (a lemmatizer provided by the electronic dictionary WordNet), Lancaster Stemmer, and Snowball Stemmer are common tools used to derive lemmas and stems for tokens, and all have implementations in the NLTK (Bird et al., 2009).
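The contrast between lemmas and stems is easy to see directly with the tools just named; a minimal NLTK sketch (the exact stems depend on each algorithm’s aggressiveness):

```python
import nltk
from nltk.corpus import wordnet
from nltk.stem import LancasterStemmer, SnowballStemmer

nltk.download("wordnet", quiet=True)  # dictionary data used by Morphy

lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")

for word in ["go", "went", "goes", "order", "ordering"]:
    # Morphy lemmatizes against the WordNet dictionary; it returns None
    # for forms it cannot map, so fall back to the input token.
    lemma = wordnet.morphy(word) or word
    print(f"{word:10} lemma={lemma:8} "
          f"lancaster={lancaster.stem(word):8} "
          f"snowball={snowball.stem(word)}")
```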
Neural Text Normalization in Speech-to-Text Systems with Rich Features
Published in Applied Artificial Intelligence, 2021
Text normalization is an important stage in processing non-canonical language from natural sources such as social texts, speech, and short messages. This is a relatively new research field, and most published work targets widely studied languages such as English, Japanese, and Chinese. Existing text normalization systems usually focus on social texts (Eryigit and Torunoglu-Selamet (2017); Ikeda, Shindo, and Matsumoto (2016); Hassan and Menezes (2013)), short messages (Aw et al. (2006)), text-to-speech systems (Yolchuyeva, Gyires-Toth, and Nemeth (2018)), and so on. For example, Eryigit and Torunoglu-Selamet (2017) present the first work on social media text normalization for a morphologically rich language (MRL) and introduce the first text normalization system for Turkish. Ikeda, Shindo, and Matsumoto (2016) present a Japanese text normalization method using an encoder-decoder model. Aw et al. (2006) propose a phrase-based statistical model for normalizing SMS texts. For text normalization systems involving speech and language technologies, there have been several works converting text from written expressions into their appropriate “spoken” forms. For example, Yolchuyeva, Gyires-Toth, and Nemeth (2018) introduce a novel CNN-based text normalizer and verify its effectiveness on the dataset of a text normalization challenge on Kaggle.
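To illustrate how such systems frame the task, here is a hedged sketch of the token-level written-to-spoken setup used by corpora like the Kaggle challenge data. The toy pairs are invented, and the memorization baseline merely stands in for the neural encoder-decoder and CNN models discussed above, which learn a transducer that generalizes to unseen numbers, dates, and abbreviations.

```python
from collections import Counter, defaultdict

# Toy (written, spoken) training pairs in the style of written-to-spoken
# normalization corpora; real datasets contain millions of such pairs.
train = [
    ("2019", "twenty nineteen"),
    ("Dr.", "doctor"),
    ("km", "kilometers"),
    ("the", "the"),          # most tokens are already in spoken form
]

# Memorization baseline: map each written token to its most frequent
# spoken form, and copy unseen tokens through unchanged.
table = defaultdict(Counter)
for written, spoken in train:
    table[written][spoken] += 1
lookup = {w: c.most_common(1)[0][0] for w, c in table.items()}

def normalize(tokens):
    return [lookup.get(t, t) for t in tokens]

print(normalize(["Dr.", "Smith", "drove", "2019", "km"]))
# -> ['doctor', 'Smith', 'drove', 'twenty nineteen', 'kilometers']
```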