Development of a Machine Translation System for Promoting the Use of a Low Resource Language in the Clinical Domain: The Case of Basque
Published in Satya Ranjan Dash, Shantipriya Parida, Esaú Villatoro Tello, Biswaranjan Acharya, Ondřej Bojar, Natural Language Processing in Healthcare, 2022
Xabier Soto, Olatz Perez-de-Viñaspre, Maite Oronoz, Gorka Labaka
Considering the rich morphology of the Basque language and the rich terminology of health-domain texts, we use Byte Pair Encoding (BPE) (Sennrich et al. 2015) with 90,000 merge operations for subword segmentation. Additionally, we try BPE-dropout (Provilkov et al. 2020) with 0.1 probability when preprocessing our training corpora, applied either to both sides of the training corpus or only to the source side. We believe this regularization technique can be especially useful in our rich-vocabulary setting and can improve the robustness of the system against the typos that commonly appear in EHRs.
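To illustrate how BPE-dropout perturbs segmentation, here is a minimal Python sketch, not the authors' implementation: it assumes a merge table (pair to priority) already learned with standard BPE, and skips each applicable merge with probability p so that the same word can receive different segmentations across training passes. The function and variable names are invented for illustration.

```python
import random

def bpe_dropout_segment(word, merges, p=0.1, rng=random):
    """Segment `word` into subwords with BPE-dropout (Provilkov et al. 2020).

    `merges` maps a symbol pair to its merge priority (lower = learned earlier).
    Each applicable merge is dropped with probability `p`; with p=0 this reduces
    to standard greedy BPE application.
    """
    symbols = list(word) + ["</w>"]  # end-of-word marker, as in Sennrich et al.
    while True:
        # Collect adjacent pairs that are in the merge table and survive dropout.
        candidates = [
            (merges[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in merges and rng.random() >= p
        ]
        if not candidates:
            # Simplification: if every applicable merge is dropped in one round,
            # this sketch stops; reference implementations re-sample per attempt.
            break
        _, i = min(candidates)  # apply the earliest-learned surviving merge
        symbols[i : i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Toy usage: most of the time "lower" becomes ["low", "er</w>"],
# but occasionally a dropped merge yields a finer segmentation.
merges = {("e", "r"): 0, ("er", "</w>"): 1, ("l", "o"): 2, ("lo", "w"): 3}
print(bpe_dropout_segment("lower", merges, p=0.1))
```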
Pre-trained and Application-Specific Transformers
Published in Uday Kamath, Kenneth L. Graham, Wael Emara, Transformers for Machine Learning, 2022
Uday Kamath, Kenneth L. Graham, Wael Emara
GPT-2 uses byte-pair encoding (BPE) tokenization [92] so that any UTF-8 string can be represented with a base vocabulary of only 256 bytes. Computing with raw UTF-8 bytes was not done here, since byte-level language models had not been performing at the level of word-level language models.
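To make the byte-level behaviour concrete, the following is a small usage sketch assuming the Hugging Face transformers package (an assumption of this sketch, not the chapter's code): because GPT-2's BPE starts from a 256-byte base vocabulary, any UTF-8 input, including non-ASCII text, tokenizes without unknown-token fallbacks.

```python
# Minimal demonstration of GPT-2's byte-level BPE via Hugging Face transformers.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

for text in ["Hello world", "byte-pair encoding", "naïve café 北京"]:
    tokens = tokenizer.tokenize(text)           # byte-level BPE pieces
    ids = tokenizer.convert_tokens_to_ids(tokens)
    print(f"{text!r} -> {tokens} -> {ids}")     # no <unk> token ever appears
```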
Analysis of Neural Machine Translation KANGRI Language by Unsupervised and Semi Supervised Methods
Published in IETE Journal of Research, 2022
Shweta Chauhan, Shefali Saxena, Philemon Daniel
The analysis of the various techniques used to resolve the out-of-vocabulary (OOV) challenges of this low-resource language pair is given in Table 2. Byte Pair Encoding (BPE) is a simple data compression technique that replaces the most frequently occurring pair of bytes with a single, unused byte. The algorithm is applied to word segmentation and thus generates subword embeddings. We experiment with 30k, 50k, and 70k merge operations for BPE, and also examine the impact of learning the encoding on the union of the vocabularies of the two languages with 50k, 70k, and 90k operations. A language modelling (LM)-based technique for handling OOV words is also analysed [40]. The language model predicts the most probable words to appear in place of the OOV word based on its context [41]; a weighted average of their mapped word embeddings is then calculated to generate the word vector of the OOV word [42]. The complete evaluation of back-translation with different BPE merge values and the LM is shown in Table 2 for Hindi-Kangri.
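For reference, the merge-learning loop that this excerpt describes can be sketched as a toy Python version of the Sennrich et al. (2016) procedure; the function and variable names are invented for illustration, and num_merges would be set to the 30k-90k values used in the experiments above.

```python
# Toy sketch of learning BPE merges from a word-frequency dictionary.
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Return the learned merge operations, most frequent first."""
    # Represent each word as a tuple of symbols plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# Toy usage with the classic example corpus from Sennrich et al. (2016).
print(learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 10))
```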