Multilingual Transformer Architectures
Published in Transformers for Machine Learning, 2022
Uday Kamath, Kenneth L. Graham, Wael Emara
Language-agnostic BERT Sentence Embedding (LaBSE) [88] is an architecture for training cross-lingual sentence representations which combines the Masked Language Model (MLM) and Translation Language Model (TLM) pre-training tasks from XLM [146] with a translation ranking task using bi-directional dual-encoders and an additive margin softmax loss [283]. The dual-encoders, illustrated in Fig. 4.6, consist of two paired mBERT encoders. The [CLS] token sentence representations from both encoders are fed to a scoring function.
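A minimal sketch of a translation-ranking objective with an additive margin softmax, assuming in-batch negatives and [CLS] embeddings coming from the two paired encoders; the function name, margin, and scale values are illustrative assumptions, not the authors' implementation:

```python
# Sketch of a LaBSE-style translation-ranking loss with additive margin softmax.
import torch
import torch.nn.functional as F

def translation_ranking_loss(src_cls, tgt_cls, margin=0.3, scale=10.0):
    """src_cls, tgt_cls: [batch, dim] [CLS] embeddings of source sentences and
    their translations, produced by the two paired encoders."""
    src = F.normalize(src_cls, dim=-1)
    tgt = F.normalize(tgt_cls, dim=-1)
    # Cosine similarity between every source and every target in the batch;
    # off-diagonal entries act as in-batch negatives.
    sim = src @ tgt.t()                                   # [batch, batch]
    # Additive margin: subtract the margin only from the true (diagonal) pairs,
    # tightening the boundary around correct translations.
    sim = sim - margin * torch.eye(sim.size(0), device=sim.device)
    labels = torch.arange(sim.size(0), device=sim.device)
    # LaBSE applies this ranking loss in both directions (source-to-target and
    # target-to-source); only one direction is shown here for brevity.
    return F.cross_entropy(scale * sim, labels)

# Usage: a batch of 8 sentence pairs with 768-dimensional [CLS] vectors.
loss = translation_ranking_loss(torch.randn(8, 768), torch.randn(8, 768))
```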
An empirical evaluation of text representation schemes to filter the social media stream
Published in Journal of Experimental & Theoretical Artificial Intelligence, 2022
Sandip Modha, Prasenjit Majumder, Thomas Mandl
To address the limitations of unsupervised sentence embedding techniques such as Skip-Thoughts, Conneau et al. (2017) proposed a supervised method trained on the Stanford Natural Language Inference (SNLI) dataset to learn embeddings for small pieces of text such as sentences or paragraphs. This is analogous to computer vision, where features learned on ImageNet are reused in another set of tasks. The dataset contains 570k manually created English sentence pairs, each annotated with one of three classes: entailment, contradiction, or neutral. Natural language inference aims to find a directional relationship between two sentences. The authors tested seven different architectures for the sentence encoder, and the best results were achieved with a bidirectional LSTM (BiLSTM) encoder. The idea, then, is to learn sentence embeddings on the NLI task and transfer these embeddings to other downstream tasks such as text classification.
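A minimal sketch of this supervised setup, assuming a BiLSTM encoder with max pooling over time and the usual NLI feature combination (concatenation, absolute difference, element-wise product); layer sizes and class names are illustrative assumptions:

```python
# Sketch of an InferSent-style BiLSTM sentence encoder trained on NLI.
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, vocab_size=20000, emb_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, token_ids):                      # [batch, seq_len]
        out, _ = self.lstm(self.embed(token_ids))      # [batch, seq_len, 2*hidden]
        return out.max(dim=1).values                   # max-pool over time -> sentence vector

class NLIClassifier(nn.Module):
    """Classifies premise/hypothesis pairs as entailment, contradiction, or neutral."""
    def __init__(self, encoder, hidden_dim=512):
        super().__init__()
        self.encoder = encoder
        self.mlp = nn.Sequential(nn.Linear(8 * hidden_dim, 512), nn.ReLU(), nn.Linear(512, 3))

    def forward(self, premise_ids, hypothesis_ids):
        u = self.encoder(premise_ids)
        v = self.encoder(hypothesis_ids)
        # Standard NLI feature combination: [u; v; |u - v|; u * v].
        return self.mlp(torch.cat([u, v, torch.abs(u - v), u * v], dim=-1))

# After training on SNLI, encoder(token_ids) alone yields transferable sentence embeddings.
```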
Attention-Based Bi-LSTM Network for Abusive Language Detection
Published in IETE Journal of Research, 2022
Kiran Babu Nelatoori, Hima Bindu Kommanti
While conventional machine-learning techniques have been used extensively in text classification tasks, they face an important disadvantage: they do not effectively capture the semantic and cultural variations of written language. For example, handling the negation of words or sarcastic expressions is very difficult with conventional machine-learning techniques, as the sentence structure must be effectively represented in the feature set. Deep-learning algorithms that rely on neural networks have been proposed to solve such difficulties. We built a neural network model for abusive language detection as shown in Figure 1. We use a character CNN that operates on character embeddings to extract a character-level representation of each word. We concatenate the pre-trained word embeddings and the character representations to form the input vector. Furthermore, we use a Bi-LSTM (bidirectional Long Short-Term Memory) network because it has been effective in recognizing word sequences and interpreting their significance. Combining the two forms of hidden vectors exploits word-ordering information while ensuring that information is preserved from both ends of long sequences. An attention mechanism is used to generate an attention vector by giving importance to the words that contribute most. In addition to the attention mechanism, global average pooling is applied to the hidden-state vectors to give equal importance to all words. The concatenation of the attention vector and the pooled vector serves as the sentence embedding.
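A minimal sketch of this architecture as described above: character-CNN features concatenated with pre-trained word embeddings, a Bi-LSTM over the sequence, and a sentence embedding formed from an attention vector plus global average pooling. All dimensions and names are illustrative assumptions, not the paper's exact configuration:

```python
# Sketch of a char-CNN + Bi-LSTM model with attention and global average pooling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnBiLSTM(nn.Module):
    def __init__(self, word_emb, char_vocab=100, char_dim=30, char_filters=50,
                 hidden=128, num_classes=2):
        super().__init__()
        self.word_embed = nn.Embedding.from_pretrained(word_emb, freeze=False)
        self.char_embed = nn.Embedding(char_vocab, char_dim)
        # Character CNN: convolve over the characters of each word, then max-pool.
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(word_emb.size(1) + char_filters, hidden,
                              bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(4 * hidden, num_classes)   # attention vector + pooled vector

    def forward(self, word_ids, char_ids):
        # word_ids: [B, T]; char_ids: [B, T, C] (C characters per word)
        B, T, C = char_ids.shape
        chars = self.char_embed(char_ids).view(B * T, C, -1).transpose(1, 2)
        char_repr = self.char_cnn(chars).max(dim=-1).values.view(B, T, -1)
        x = torch.cat([self.word_embed(word_ids), char_repr], dim=-1)
        h, _ = self.bilstm(x)                            # [B, T, 2*hidden]
        # Attention vector: weighted sum of hidden states.
        alpha = F.softmax(self.attn(h), dim=1)           # [B, T, 1]
        attn_vec = (alpha * h).sum(dim=1)                # [B, 2*hidden]
        # Global average pooling gives equal weight to every word.
        pooled = h.mean(dim=1)                           # [B, 2*hidden]
        sentence_emb = torch.cat([attn_vec, pooled], dim=-1)
        return self.out(sentence_emb)
```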
Confidently extracting hierarchical taxonomy information from unstructured maintenance records of industrial equipment
Published in International Journal of Production Research, 2023
Abhijeet S. Bhardwaj, Dharmaraj Veeramani, Shiyu Zhou
After generating word embeddings from the corpus, the distance between a given maintenance record and an equipment taxonomy branch is measured. To do so, the sentence embedding for the maintenance record is first created from the word embeddings: the weighted average of the word embeddings present in the document is taken and then modified using PCA/SVD as in Arora, Liang, and Ma (2016). Then, an embedding vector for each taxonomy branch is created by averaging the word embeddings of the individual words present in that branch. When using the word embeddings, the average of each word embedding and its corresponding POS-tag embedding is taken; for the tokens in the taxonomy, the POS tag is assumed to be noun. The cosine similarity between the sentence embedding vector and the taxonomy embedding vector measures the similarity between the document and the taxonomy branch. The Bwd-Fwd algorithm's score and the Verb-Analysis algorithm's score for the taxonomy branch are then multiplied by this similarity measure to generate the adjusted Bwd-Fwd and Verb-Analysis scores for each taxonomy branch for the given document.
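A minimal sketch of this similarity step, assuming SIF-style sentence embeddings (smooth inverse-frequency weighted average of word vectors with the first singular vector removed, as in Arora, Liang, and Ma 2016) and a plain average for the taxonomy-branch vector; function and variable names are illustrative and the POS-tag averaging is omitted:

```python
# Sketch: weighted-average sentence embeddings and cosine similarity to a taxonomy branch.
import numpy as np

def sif_embeddings(docs, word_vecs, word_freq, a=1e-3):
    """docs: list of token lists; word_vecs: dict word -> vector; word_freq: dict word -> probability."""
    emb = []
    for tokens in docs:
        # Smooth inverse-frequency weighted average of the word vectors in the document.
        vecs = [a / (a + word_freq.get(w, 1e-6)) * word_vecs[w] for w in tokens if w in word_vecs]
        emb.append(np.mean(vecs, axis=0))
    emb = np.vstack(emb)
    # Remove the projection onto the first singular vector (the common component).
    _, _, vt = np.linalg.svd(emb, full_matrices=False)
    u = vt[0]
    return emb - np.outer(emb @ u, u)

def branch_similarity(doc_vec, branch_tokens, word_vecs):
    """Cosine similarity between a document embedding and a taxonomy-branch embedding
    built as the average of the branch's word vectors."""
    branch_vec = np.mean([word_vecs[w] for w in branch_tokens if w in word_vecs], axis=0)
    denom = np.linalg.norm(doc_vec) * np.linalg.norm(branch_vec) + 1e-12
    return float(doc_vec @ branch_vec / denom)

# The adjusted scores would then be, e.g., adjusted_bwd_fwd = bwd_fwd_score * branch_similarity(...).
```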