Development of a Machine Translation System for Promoting the Use of a Low Resource Language in the Clinical Domain: The Case of Basque
Published in Satya Ranjan Dash, Shantipriya Parida, Esaú Villatoro Tello, Biswaranjan Acharya, Ondřej Bojar, Natural Language Processing in Healthcare, 2022
Xabier Soto, Olatz Perez-de-Viñaspre, Maite Oronoz, Gorka Labaka
Considering the rich morphology of the Basque language and the rich terminology of health-domain texts, we use Byte Pair Encoding (BPE) (Sennrich et al. 2015) with 90,000 merge operations for subword segmentation. Additionally, we try BPE-dropout (Provilkov et al. 2020) with 0.1 probability when preprocessing our training corpora, applied either to both sides of the training corpus or only to the source side. We believe this regularization technique can be especially useful in our rich-vocabulary setting and can improve the robustness of the system against the typos that commonly appear in EHRs.
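To illustrate how BPE-dropout perturbs segmentation, here is a minimal Python sketch, not the authors' implementation: it assumes a merge table (pair to priority) already learned with standard BPE, and skips each applicable merge with probability p so that the same word can receive different segmentations across training passes. The function and variable names are invented for illustration.

```python
import random

def bpe_dropout_segment(word, merges, p=0.1, rng=random):
    """Segment `word` into subwords with BPE-dropout (Provilkov et al. 2020).

    `merges` maps a symbol pair to its merge priority (lower = learned earlier).
    Each applicable merge is dropped with probability `p`; with p=0 this reduces
    to standard greedy BPE application.
    """
    symbols = list(word) + ["</w>"]  # end-of-word marker, as in Sennrich et al.
    while True:
        # Collect adjacent pairs that are in the merge table and survive dropout.
        candidates = [
            (merges[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in merges and rng.random() >= p
        ]
        if not candidates:
            # Simplification: if every applicable merge is dropped in one round,
            # this sketch stops; reference implementations re-sample per attempt.
            break
        _, i = min(candidates)  # apply the earliest-learned surviving merge
        symbols[i : i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Toy usage: most of the time "lower" becomes ["low", "er</w>"],
# but occasionally a dropped merge yields a finer segmentation.
merges = {("e", "r"): 0, ("er", "</w>"): 1, ("l", "o"): 2, ("lo", "w"): 3}
print(bpe_dropout_segment("lower", merges, p=0.1))
```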
Pre-trained and Application-Specific Transformers
Published in Uday Kamath, Kenneth L. Graham, Wael Emara, Transformers for Machine Learning, 2022
Uday Kamath, Kenneth L. Graham, Wael Emara
GPT-2 uses byte-pair encoding (BPE) tokenization [92] so that any UTF-8 string can be represented with a base vocabulary of only 256 bytes. Computing with raw UTF-8 bytes was not done here, since byte-level language models had not been performing at the level of word-level language models.
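To make the byte-level behaviour concrete, the following is a small usage sketch assuming the Hugging Face transformers package (an assumption of this sketch, not the chapter's code): because GPT-2's BPE starts from a 256-byte base vocabulary, any UTF-8 input, including non-ASCII text, tokenizes without unknown-token fallbacks.

```python
# Minimal demonstration of GPT-2's byte-level BPE via Hugging Face transformers.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

for text in ["Hello world", "byte-pair encoding", "naïve café 北京"]:
    tokens = tokenizer.tokenize(text)           # byte-level BPE pieces
    ids = tokenizer.convert_tokens_to_ids(tokens)
    print(f"{text!r} -> {tokens} -> {ids}")     # no <unk> token ever appears
```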
Analysis of Neural Machine Translation KANGRI Language by Unsupervised and Semi Supervised Methods
Published in IETE Journal of Research, 2022
Shweta Chauhan, Shefali Saxena, Philemon Daniel
The analysis of the various techniques used to resolve the out-of-vocabulary (OOV) challenges of this low-resource language pair is given in Table 2. Byte Pair Encoding (BPE) is a simple data compression technique that replaces the most frequently occurring pair of bytes with a single, unused byte. The algorithm is applied to word segmentation and thus generates subword embeddings. We experiment with 30k, 50k, and 70k merge operations for BPE, and also examine the impact of learning the encoding on the union of the vocabularies of the two languages with 50k, 70k, and 90k operations. A language modelling (LM)-based technique for handling OOV words is also analysed [40]. The language model predicts the most probable words to appear in place of the OOV word based on its context [41]; a weighted average of their mapped word embeddings is then calculated to generate the word vector of the OOV word [42]. The complete evaluation of back-translation with different BPE merge values and the LM is shown in Table 2 for Hindi-Kangri.
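For reference, the merge-learning loop that this excerpt describes can be sketched as a toy Python version of the Sennrich et al. (2016) procedure; the function and variable names are invented for illustration, and num_merges would be set to the 30k-90k values used in the experiments above.

```python
# Toy sketch of learning BPE merges from a word-frequency dictionary.
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Return the learned merge operations, most frequent first."""
    # Represent each word as a tuple of symbols plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# Toy usage with the classic example corpus from Sennrich et al. (2016).
print(learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 10))
```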