A Clinical Practice by Machine Translation on Low Resource Languages
Published in Satya Ranjan Dash, Shantipriya Parida, Esaú Villatoro Tello, Biswaranjan Acharya, Ondřej Bojar, Natural Language Processing in Healthcare, 2022
Rupjyoti Baruah, Anil Kumar Singh
BLEU, a corpus-based metric, computes an automatic quality score for MT systems by estimating the correspondence between a translated output and a human reference translation (Papineni et al. 2002). The primary notion behind BLEU is that the closer a machine-translated output is to a professional human translation, the better it is. BLEU counts the number of matches by comparing the n-grams of the candidate translation with the n-grams of the reference translation. The more matches, the better the translation quality; matches are independent of their positions.
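A minimal sketch of the position-independent n-gram matching described above, following the clipped-count idea of Papineni et al. (2002); the function names and example sentences are illustrative, not taken from the chapter:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts only as
    often as it appears in the reference (position-independent)."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

candidate = "the cat is on the mat".split()
reference = "there is a cat on the mat".split()
print(modified_precision(candidate, reference, 1))  # unigram precision: 5/6
print(modified_precision(candidate, reference, 2))  # bigram precision
```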
Results and Conclusions
Published in Krzysztof Wołk, Machine Learning in Translation Corpora Processing, 2019
The evaluation was conducted using the official test sets from the IWSLT 2010–2013 campaigns, and the scores were averaged. For scoring, the Bilingual Evaluation Understudy (BLEU) metric was used. The results of the experiments are shown in Table 80, where BASE stands for the baseline system and EXT for the enriched systems.
An Analysis of the Evaluation of the Translation Quality of Neural Machine Translation Application Systems
Published in Applied Artificial Intelligence, 2023
BLEU evaluation index: BLEU (Bilingual Evaluation Understudy) is an evaluation index for machine translation results, and its value ranges from 0 to 1. The closer it is to 1, the closer the machine translation result is to the reference translation; the closer it is to 0, the more the machine translation result deviates from the reference translation (Mathur, Baldwin, and Cohn 2020). BLEU uses n-gram precision to measure how closely the machine translation result approaches the reference translation. To compute this precision, the number of matching sequences of n consecutive words (n-grams) between the machine translation result and the reference translation must first be counted. More matches yield a higher BLEU value, meaning the machine translation result is more similar to the reference translation. Eq. (1) presents the number of n consecutive matches.
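The counting and combination of n-gram matches described above corresponds to the standard formulation of Papineni et al. (2002), sketched below; whether Eq. (1) in the original article is stated in exactly this form is an assumption.

```latex
% Clipped (modified) n-gram precision: each candidate n-gram is counted
% at most as often as it occurs in the reference translation.
p_n = \frac{\sum_{g \in \text{n-grams of candidate}} \min\bigl(\operatorname{Count}_{\text{cand}}(g),\ \operatorname{Count}_{\text{ref}}(g)\bigr)}
           {\sum_{g \in \text{n-grams of candidate}} \operatorname{Count}_{\text{cand}}(g)}

% The overall score combines p_1, ..., p_N (usually N = 4) with a brevity
% penalty BP, where c is the candidate length and r the reference length.
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Bigl(\sum_{n=1}^{N} w_n \log p_n\Bigr),
\qquad
\mathrm{BP} =
\begin{cases}
  1 & \text{if } c > r,\\
  e^{\,1 - r/c} & \text{if } c \le r.
\end{cases}
```

Higher n-gram precisions, and hence a higher BLEU value, correspond to more n-gram matches, which is the relationship the paragraph describes.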
Fully Unsupervised Machine Translation Using Context-Aware Word Translation and Denoising Autoencoder
Published in Applied Artificial Intelligence, 2022
Shweta Chauhan, Philemon Daniel, Shefali Saxena, Ayush Sharma
BLEU (Papineni et al. 2002) has been the most widely used metric in MT evaluation due to its easy implementation, competitive performance in capturing translation fluency, and language independence. It depends upon n-gram matching between the hypothesis and the reference translation. Other metrics have also been used for evaluation, such as WER (Su, Wu, and Chang 1992), PER (Tillmann et al. 1997), NIST (Doddington 2002), TER (Snover et al. 2009), and ROUGE (Lin 2004). They mainly depend on exact matches of surface words in the machine translation output. WER, PER, and TER measure the edit distance between the reference and the hypothesis by estimating the minimum total number of editing steps needed to transform the hypothesis into the reference translation. Like BLEU, NIST calculates the degree of n-gram overlap between the hypothesis and the reference translation. METEOR-Hindi (Gupta, Venkatapathy, and Sangal 2010) extended the implementation of METEOR (Lavie and Agarwal 2005) to support the evaluation of translations into Hindi. As the properties of other Indian languages are very similar to those of Hindi, METEOR-Hindi can easily be extended to other Indian languages.
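A minimal sketch of the word-level edit distance underlying WER as described above (standard dynamic-programming Levenshtein distance over words, normalised by reference length as is conventional for WER; the function names and sentences are illustrative):

```python
def word_edit_distance(hypothesis, reference):
    """Minimum number of word insertions, deletions, and substitutions
    needed to turn the hypothesis into the reference (Levenshtein)."""
    h, r = hypothesis.split(), reference.split()
    # dp[i][j] = edit distance between h[:i] and r[:j]
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(h)][len(r)]

def wer(hypothesis, reference):
    """Word error rate: edit distance normalised by reference length."""
    return word_edit_distance(hypothesis, reference) / len(reference.split())

print(wer("the cat sat on mat", "the cat sat on the mat"))  # 1 edit / 6 words
```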
Unsupervised SMT: an analysis of Indic languages and a low resource language
Published in Journal of Experimental & Theoretical Artificial Intelligence, 2022
Shefali Saxena, Shweta Chauhan, Paras Arora, Philemon Daniel
The BLEU (Papineni et al., 2002) evaluation metric is shown in Figures 2 and 3(a) for both cases. BLEU is the most frequently used evaluation metric, since it is language agnostic, easy to compute, and roughly reflects translation fluency. The key to BLEU's success is that all systems are treated identically, and many human translators with different styles are used, so stylistic variation is balanced out when comparing systems. BLEU's primary task is to compare the candidate translation's n-grams to the reference translation's n-grams and count the number of matches; these matches are position-independent. The BLEU score ranges from zero to one: a score of zero denotes a total mismatch, while a score of one denotes a perfect match. BLEU compares the precision of unigrams, bigrams, 3-grams, and 4-grams against a group of reference translations, penalising excessively short candidate translations.
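As a usage illustration of the 1- to 4-gram comparison with a brevity penalty described above, NLTK's implementation can be called as follows; the example sentences are invented, and the equal 0.25 weights over unigrams to 4-grams are NLTK's default rather than something specified in the article:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["there is a cat on the mat".split()]   # list of reference token lists
candidate = "the cat is on the mat".split()

# Default weights (0.25, 0.25, 0.25, 0.25) combine 1- to 4-gram precisions;
# smoothing avoids a zero score when a higher-order n-gram has no match.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```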