HealFavor: Machine Translation Enabled Healthcare Chat Based Application
Published in Satya Ranjan Dash, Shantipriya Parida, Esaú Villatoro Tello, Biswaranjan Acharya, Ondřej Bojar, Natural Language Processing in Healthcare, 2022
Sahinur Rahman Laskar, Abdullah Faiz Ur Rahman Khilji, Partha Pakray, Rabiah Abdul Kadir, Maya Silvi Lydia, Sivaji Bandyopadhyay
MT shifted from rule-based approaches (Vauquois 1968; Eisele et al. 2008) to corpus-based approaches, which enable the development of language-independent translation systems. Rule-based MT (RBMT) relies on a set of hand-crafted rules and therefore requires the intervention of linguistic experts. Within corpus-based MT, example-based machine translation (EBMT) (Nagao 1984; Somers 1999) is the naive approach, built on the concept of translation by text-similarity analogy: it translates a sentence by matching it against example sentences previously translated in a bilingual parallel corpus. However, EBMT has a drawback: many types of sentences cannot be translated simply by relying on other sentences as examples. Another corpus-based approach is statistical machine translation (SMT) (Brown et al. 1990; Koehn 2009). SMT relies on a statistical model that predicts target sentences from source sentences using parameters estimated from an analysis of parallel corpora. SMT comprises several variants, namely word-based, phrase-based, syntax-based, and hierarchical phrase-based translation; of these, phrase-based translation is the most widely used (Koehn et al. 2003). SMT treats translation as a probabilistic task: predicting the best translation for a given source sentence. A translation model and a language model together evaluate the probability of the target sentence given the source sentence and the likelihood of the target sentence itself, and a decoder searches for the best translation.
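The noisy-channel scoring described above can be sketched in a few lines. All phrases and probabilities below are invented toy values for illustration, and the "decoder" is reduced to picking the best candidate from a fixed list rather than searching a real hypothesis space:

```python
import math

# Toy translation model P(source | target) and language model P(target).
# These probabilities are invented for illustration only.
translation_model = {
    ("la maison", "the house"): 0.7,
    ("la maison", "the home"): 0.3,
}
language_model = {
    "the house": 0.05,
    "the home": 0.02,
}

def score(source, target):
    """Noisy-channel score: log P(source | target) + log P(target)."""
    tm = translation_model.get((source, target), 1e-9)
    lm = language_model.get(target, 1e-9)
    return math.log(tm) + math.log(lm)

def decode(source, candidates):
    """A trivial stand-in for a decoder: pick the highest-scoring candidate."""
    return max(candidates, key=lambda t: score(source, t))

best = decode("la maison", ["the house", "the home"])
print(best)  # "the house": 0.7 * 0.05 beats 0.3 * 0.02
```

The point of the decomposition is visible even in this toy: the translation model rewards adequacy, the language model rewards fluency, and the decoder trades the two off.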
Natural Language Processing Viewed from Semantics
Published in Masao Yokota, Natural Language Understanding and Cognitive Robotics, 2019
In the late 1980s, natural language processing based on statistics, so-called statistical natural language processing, became another major trend. In principle, statistical natural language processing is centered on machine learning and driven by statistical inferences automatically acquired from text corpora instead of hand-written rules. Therefore, its accuracy depends on the learning algorithms involved and the quality of the corpora. In particular, statistical machine translation requires multilingual corpora as collections of translations of high quality. Here, what is meant vaguely by the phrase ‘high quality’ is always a serious problem for machine learning. For example, what is learned from free translations must be quite different from what is learned from literal translations. So, which is higher in quality? The less vague interpretation of the phrase is that one corpus is higher in quality than another if it makes machine learning more successful in machine translation (or natural language processing) than the other, where, in turn, some authoritative metric is required to evaluate success in machine translation, including automated metrics such as BLEU, NIST, METEOR, and LEPOR (e.g., Han et al., 2012).
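The core idea behind automated metrics such as BLEU can be sketched as a clipped n-gram precision. This is a deliberate simplification: full BLEU also combines several n-gram orders geometrically and applies a brevity penalty, and the example sentences here are invented:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision, the core idea behind BLEU: each candidate
    n-gram is credited at most as many times as it occurs in the reference."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

p1 = modified_precision("the cat sat on the mat", "the cat is on the mat", 1)
print(p1)  # 5 of 6 unigrams match -> 0.833...
```

The clipping step is what keeps a degenerate candidate like "the the the the" from scoring highly against a reference that contains "the" only twice.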
Speech and Language Interfaces, Applications, and Technologies
Published in Julie A. Jacko, The Human–Computer Interaction Handbook, 2012
Clare-Marie Karat, Jennifer Lai, Osamuyimen Stewart, Nicole Yankelovich
A statistical machine translation (SMT) system works with textual data, not predefined language rules. It processes text by means of pattern-matching algorithms that contain no formal “language rules,” just a collection of patterns of words drawn from the bilingual text corpora to which the statistical methods apply (TAUS Report 2007). Essentially, the system looks at and stores all the linear patterns of words (groups of two, three, or more words) in a text in one language. It then tries to “match” a correlating pattern in a translated version of this same text. This matching can be exact (where the patterns are exactly the same) or fuzzy (where the patterns do not match 100%). In principle, an SMT system “learns” from a body of existing translations in order to identify plausible patterns of language in both texts, without reference to any linguistic rules. In this regard, one crucial component necessary for teaching the SMT system to recognize or “learn” recurring patterns is the “translation memory.” A translation memory is the repository of all the exact matches that exist in parallel text corpora. Quite often, the SMT system relies very heavily on the knowledge bases provided by the translation memory, and the patterns therein can be used to translate segments of new texts, which will often contain similar groups of words (TAUS Report 2007).
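The exact-versus-fuzzy lookup described above can be sketched against a toy translation memory. The memory entries, the threshold value, and the use of a character-level similarity ratio are all illustrative assumptions; real systems typically match at the segment level with more sophisticated similarity measures:

```python
import difflib

# A toy translation memory: invented source-target pairs for illustration.
translation_memory = {
    "the weather is nice today": "il fait beau aujourd'hui",
    "the meeting starts at noon": "la réunion commence à midi",
}

def lookup(segment, threshold=0.8):
    """Return (translation, similarity). An exact match scores 1.0;
    otherwise the closest fuzzy match above `threshold` is returned,
    using difflib's ratio of matching characters."""
    if segment in translation_memory:
        return translation_memory[segment], 1.0
    best, best_score = None, 0.0
    for source, target in translation_memory.items():
        s = difflib.SequenceMatcher(None, segment, source).ratio()
        if s > best_score:
            best, best_score = target, s
    if best_score >= threshold:
        return best, best_score
    return None, best_score

match, similarity = lookup("the weather is nice tonight")
print(match)  # fuzzy match against "the weather is nice today"
```

A fuzzy hit like this one is how a memory built from past translations can still help with a segment the system has never seen verbatim.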
Unsupervised SMT: an analysis of Indic languages and a low resource language
Published in Journal of Experimental & Theoretical Artificial Intelligence, 2022
Shefali Saxena, Shweta Chauhan, Paras Arora, Philemon Daniel
The conventional SMT model needs a sizeable amount of bilingual data to provide accurate translations, yet for low-resource languages (LRLs) only a limited number of parallel corpora are available for high-quality translation. This paper examines the unsupervised SMT (USMT) model for the leading Indic languages as well as the critically endangered Kangri language, which exhibits great morphological diversity. We tested the SMT model on sentences outside the training corpus.