Artificial Intelligence-Based Categorization of Healthcare Text
Published in Ankan Bhattacharya, Bappadittya Roy, Samarendra Nath Sur, Saurav Mallik, Subhasis Dasgupta, Internet of Things and Data Mining for Modern Engineering and Healthcare Applications, 2023
A more recent approach to feature extraction is the use of word embeddings. Word embeddings capture the meaning of words or phrases by mapping them to vector representations, so that similar texts are grouped together in a new vector space. Word embeddings can be more efficient than BOW models: in BOW models, the breadth of the document collection and the tagging of each term at its index position cause data sparsity problems, whereas word embeddings take a token's surrounding words into account and thereby avoid this sparsity. The information in the given text is passed to the model, which produces dense vectors. In this continuous vector space representation, semantically similar words lie close to each other. This property can be obtained either by using neural networks for language modeling, i.e. predicting a word in a sentence given the nearby words as input, or by capturing the statistical properties of the training corpus. Because similar context inputs lead a neural network to predict similar words, the resulting representation space has the desired semantic property. The following subsections briefly present some well-known word embedding techniques.
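As a minimal sketch of this idea (assuming the gensim library, version 4.x, and a toy corpus that is not part of the original text), a CBOW-style Word2Vec model predicts a word from its surrounding words and, as a by-product, places words that occur in similar contexts close together in the vector space:

```python
# Sketch: learning dense word vectors from context, assuming gensim 4.x.
from gensim.models import Word2Vec

# Toy corpus of pre-tokenized sentences (illustrative only).
sentences = [
    ["patient", "reports", "severe", "headache", "and", "nausea"],
    ["patient", "reports", "mild", "headache", "and", "dizziness"],
    ["doctor", "prescribes", "medication", "for", "severe", "migraine"],
]

# sg=0 selects the CBOW objective: predict the center word from nearby words.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=200)

# Words appearing in similar contexts end up with similar dense vectors,
# unlike the sparse index-position encoding of a BOW representation.
print(model.wv["headache"].shape)                 # (50,)
print(model.wv.most_similar("headache", topn=3))
```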
Classification of Text Data in Healthcare Systems – A Comparative Study
Published in Om Prakash Jena, Bharat Bhushan, Nitin Rakesh, Parma Nand Astya, Yousef Farhaoui, Machine Learning and Deep Learning in Efficacy Improvement of Healthcare Systems, 2022
A more recent approach to feature extraction is the use of word embeddings. Word embeddings capture the meaning of words or phrases by mapping them to vector representations, so that similar texts are grouped together in a new vector space. Word embeddings can be more efficient than BOW models: in BOW models, the breadth of the document collection and the tagging of each term at its index position cause data sparsity problems, whereas word embeddings take a token's surrounding words into account to overcome this sparsity. The information in the given text is passed to the model, which produces dense vectors. In this continuous vector space representation, semantically similar words lie close to each other. Although several word embedding techniques such as Word2Vec [18] and GloVe [19] exist in the literature, in this paper we choose the FastText [20] method for word embeddings. In the following sub-section, we briefly describe FastText to justify our selection.
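To illustrate why FastText can be an attractive choice, a small sketch (assuming gensim 4.x and an illustrative corpus, not the authors' data): FastText composes word vectors from character n-grams, so it can also produce vectors for words never seen during training, such as misspellings or rare clinical terms.

```python
# Sketch: FastText subword embeddings, assuming gensim 4.x.
from gensim.models import FastText

sentences = [
    ["chest", "pain", "radiating", "to", "left", "arm"],
    ["sharp", "chest", "pain", "after", "exercise"],
]

# min_n/max_n control the character n-gram range used for subword vectors.
model = FastText(sentences, vector_size=50, window=2, min_count=1,
                 min_n=3, max_n=5, epochs=100)

# Because vectors are built from character n-grams, even an unseen
# (e.g. misspelled) token still receives a dense representation.
print(model.wv["chesst"][:5])                 # out-of-vocabulary word, vector still defined
print(model.wv.similarity("pain", "chest"))
```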
Applying a total error framework for digital traces to social media research
Published in Uwe Engel, Anabel Quan-Haase, Sunny Xun Liu, Lars Lyberg, Handbook of Computational Social Science, Volume 2, 2021
Indira Sen, Fabian Flöck, Katrin Weller, Bernd Weiß, Claudia Wagner
Finally, according to the Model card,9 the creators of the model use a convolutional neural network (CNN) (LeCun & Bengio, 1995) trained with a specialized NLP technique, namely word embeddings (GloVe), which are fine-tuned during training on the annotated data. Word embeddings result from an unsupervised method that automatically learns semantic associations between words by discovering them in large unstructured data sets. As a consequence, GloVe embeddings, trained on a large corpus of web data including tweets (Li et al., 2017), often exhibit “human-like” biases such as gender and racial stereotypes (Caliskan et al., 2017). In the section Platform Selection and Data Collection, we saw that the data sampling technique can inadvertently introduce bias into the data set, depending on which kinds of users are represented in it. Trace measurement error, in the form of biases in tools like GloVe, can exacerbate these stereotypes even further.
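A rough sketch of this kind of setup (not the authors' exact model; the GloVe file path, dimensions, and toy vocabulary below are assumptions): a Keras Embedding layer is initialized from pretrained GloVe vectors and kept trainable, so the embeddings are fine-tuned together with a small CNN classifier on the annotated data.

```python
# Sketch: CNN text classifier on top of fine-tuned GloVe embeddings (TensorFlow/Keras).
import numpy as np
import tensorflow as tf

vocab = {"<pad>": 0, "this": 1, "comment": 2, "was": 3, "rude": 4}   # toy vocabulary
embedding_dim = 100

# Load pretrained GloVe vectors (file name/format assumed: "glove.6B.100d.txt").
embedding_matrix = np.random.normal(size=(len(vocab), embedding_dim)).astype("float32")
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *values = line.split()
        if word in vocab:
            embedding_matrix[vocab[word]] = np.asarray(values, dtype="float32")

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        input_dim=len(vocab),
        output_dim=embedding_dim,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=True,                      # fine-tune the GloVe weights during training
    ),
    tf.keras.layers.Conv1D(64, 3, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

Any biases encoded in the pretrained GloVe vectors are carried into such a classifier as its starting point, which is why the measurement-error concern above applies.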
Prediction of user loyalty in mobile applications using deep contextualized word representations
Published in Journal of Information and Telecommunication, 2022
Word embedding models represent words as vectors of numerical values such that words with similar meanings have similar vector representations. A corpus consisting of a large collection of documents is used to learn these vectors. Word embedding models can learn semantic and syntactic similarity, as well as relationships among words in a given context, from a corpus. Although research on word embedding models began with the studies of Bengio et al. (2003) and Collobert and Weston (2008), these models became popular after the Word2Vec model developed by Mikolov and his colleagues at Google. Since then, a large number of new embedding models have appeared in the literature (Bojanowski et al., 2017; Mikolov, Chen, et al., 2013; Mikolov, Sutskever, et al., 2013; Peters et al., 2018). In this study, words are represented by three well-known word embedding models: Word2Vec (Mikolov, Chen, et al., 2013), GloVe (Pennington et al., 2014), and FastText (Joulin et al., 2016).
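As a brief sketch of how such representations can be obtained and compared, the snippet below loads publicly available pretrained Word2Vec, GloVe, and FastText vectors via the gensim downloader; these pretrained models are assumptions standing in for the exact models used in the study, and the downloads are large.

```python
# Sketch: comparing pretrained Word2Vec, GloVe, and FastText vectors via gensim-data.
import gensim.downloader as api

models = {
    "word2vec": api.load("word2vec-google-news-300"),
    "glove": api.load("glove-wiki-gigaword-100"),
    "fasttext": api.load("fasttext-wiki-news-subwords-300"),
}

# Each model maps a word to a dense vector; its nearest neighbours reflect
# the semantic and syntactic regularities learned from that model's corpus.
for name, kv in models.items():
    print(name, kv["loyalty"].shape, kv.most_similar("loyalty", topn=3))
```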
Semi-Supervised Self-Training of Hate and Offensive Speech from Social Media
Published in Applied Artificial Intelligence, 2021
A. Select the best baseline classifier for the SSST process: to select the best classifier for the self-training process, we train and assess multiple heterogeneous OHS classifiers based on standard and deep learning algorithms (called MLAs). Before training any MLA, we first transform the textual data into numerical vectors using Text Vectorization Algorithms (called TVAs), such as word embeddings. Word embedding is a mechanism that maps each word to a fixed-length, real-valued vector capturing the word's semantic and syntactic information, so a text is converted into an m × n matrix, where m is the sequence length of the text and n is the embedding dimension. Text classifiers take advantage of word embeddings to extract discriminative and effective features. Word embeddings initialize the weights of the input layer of deep neural networks, and their quality significantly impacts the learners' performance.
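To make the m × n representation concrete, here is a minimal sketch (the embedding table and token values are illustrative assumptions, not the authors' pipeline): each token of a length-m text is looked up in an n-dimensional embedding table, and the stacked vectors form the m × n matrix that feeds the input layer of a deep network.

```python
# Sketch: turning a tokenized text into an m x n embedding matrix (NumPy only).
import numpy as np

embedding_dim = 8                                       # n: embedding dimension
vocab_vectors = {                                       # illustrative embedding table
    w: np.random.normal(size=embedding_dim)
    for w in ["you", "are", "not", "welcome", "here", "<pad>", "<unk>"]
}

def embed(tokens, max_len=6):
    """Map a token sequence to a (max_len, embedding_dim) matrix, i.e. m x n."""
    tokens = (tokens + ["<pad>"] * max_len)[:max_len]   # pad/truncate to m tokens
    return np.stack([vocab_vectors.get(t, vocab_vectors["<unk>"]) for t in tokens])

matrix = embed(["you", "are", "not", "welcome", "here"])
print(matrix.shape)   # (6, 8): sequence length m by embedding dimension n
```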
Detection of Compromised Online Social Network Account with an Enhanced Knn
Published in Applied Artificial Intelligence, 2020
Edward Kwadwo Boahen, Wang Changda, Bouya-Moko Brunel Elvire
It is a feature learning technique in natural language processing (NLP) that maps words or phrases to vectors of real numbers. According to (Ying et al., 2018; Al-Qurishi et al., 2017; Shah et al., 2018; Hazimeh et al., 2019; Kaur et al., 2018; Sahoo & Gupta, 2019), word embeddings have emerged as one of the most powerful tools for encoding relationships between words and bridging the vocabulary gap. They have also led to improvements in natural language processing (NLP) work (Hazimeh et al., 2019; Sahoo & Gupta, 2019; Wang & Yang, 2018). In the NLP community, the most widely used word embeddings are Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and FastText (Mikolov et al., 2017). Since the dataset used contains textual components, their true meaning needs to be represented in vector format after the raw data is processed. The continuous skip-gram model proposed by Mikolov et al. (Karimi et al., 2018; Bojanowski et al., 2016) is considered because it learns dense, low-dimensional word vectors that are good at predicting the surrounding words given a center word (Hazimeh et al., 2019; Bojanowski et al., 2016).
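A small illustration of this configuration (assuming gensim 4.x and a toy corpus, not the authors' data): passing sg=1 selects the continuous skip-gram objective, so the model learns word vectors that are good at predicting the words surrounding a given center word.

```python
# Sketch: continuous skip-gram training, assuming gensim 4.x.
from gensim.models import Word2Vec

posts = [
    ["account", "login", "from", "new", "device"],
    ["suspicious", "login", "attempt", "blocked"],
    ["password", "reset", "for", "compromised", "account"],
]

# sg=1 selects skip-gram: the center word is used to predict its surrounding words.
model = Word2Vec(posts, vector_size=32, window=2, min_count=1, sg=1, epochs=200)

# The learned dense, low-dimensional vectors can then feed the downstream classifier.
print(model.wv["login"].shape)                   # (32,)
print(model.wv.most_similar("login", topn=3))
```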