Related works
Published in Claudia Lanza, Semantic Control for the Cybersecurity Domain, 2023
In this section, the methodologies exploited for the purposes of this research activity are presented. Concerning the categorization step of the text-mining extraction process applied to the source textual databases: (i) Latent Dirichlet Allocation (LDA) has been applied to the set of Cybersecurity documents to obtain significant categories that thematically gather the range of data in the source corpus; (ii) for the extraction of the main keywords, the PKE [40] library has been used, which implements a series of keyphrase extraction approaches, e.g., TopicRank, MultipartiteRank, TF-IDF and TopicalPageRank; (iii) regarding the recognition of semantic relations in the field of distributional semantics, variation (Daille, 2017) in the domain-specific terminology has represented the starting point from which to detect synonyms and hierarchical structures; (iv) more sophisticated word-embedding algorithms, i.e., Word2vec and FastText, have been applied to the textual documents in order to retrieve term similarities and levels of proximity that could support the automatic identification of the semantic network structure of the Italian Cybersecurity thesaurus; (v) finally, a pattern-based configuration, both assisted by pre-trained software and implemented through custom code, has been applied to the documents used to extract candidate terms, with the objective of collecting the most representative recurrent semantic occurrences, verbal and nominal, that could match semantically meaningful chains and act as triggers for the construction of the semantic relationships to be imported into the thesaurus.
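As a rough illustration of steps (i) and (ii), the minimal sketch below combines gensim's LDA implementation with pke's TopicRank extractor. The toy documents, topic count, and sample sentence are invented for the example, and an English spaCy model is assumed to be installed for pke; the actual study works on an Italian corpus, so this is a sketch of the technique, not the authors' pipeline.

```python
# Hypothetical sketch of steps (i) and (ii): LDA topic categorization
# with gensim, then keyphrase extraction with pke's TopicRank.
# All documents and parameter values here are invented for illustration.
from gensim import corpora
from gensim.models import LdaModel
import pke

# --- (i) LDA topic modelling over a tokenized toy corpus ---
docs = [["malware", "infects", "endpoint"],
        ["firewall", "blocks", "traffic"],
        ["phishing", "email", "credentials"]]
dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)  # top weighted words per inferred category

# --- (ii) keyphrase extraction with pke's TopicRank ---
extractor = pke.unsupervised.TopicRank()
extractor.load_document(input="A phishing email tricks users into "
                              "revealing their credentials.",
                        language="en")
extractor.candidate_selection()   # select noun-phrase candidates
extractor.candidate_weighting()   # cluster and rank candidate topics
print(extractor.get_n_best(n=5))  # (keyphrase, score) pairs
```

The same pke interface exposes the other extractors named above (e.g., `pke.unsupervised.MultipartiteRank`), so swapping approaches only requires changing the extractor class.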
From frequency counts to contextualized word embeddings
Published in Uwe Engel, Anabel Quan-Haase, Sunny Xun Liu, Lars Lyberg, Handbook of Computational Social Science, Volume 2, 2021
Gregor Wiedemann, Cornelia Fedtke
In the adaptation of Saussure’s notion of paradigmatic word relations for statistical semantics, Harris (1954) stated his famous distributional hypothesis: words that occur in similar contexts tend to have a similar meaning. Based on this assumption, computational linguistics developed language models that strive to encode word meaning by observing statistical word co-occurrence patterns in empirical language data. A very successful approach to distributional semantics is so-called word embedding vectors computed by artificial neural networks. Latent semantic models reduce the dimensionality of the vocabulary to represent the meaning of entire documents as a vector. Word embedding models, in contrast, learn meaningful low-dimensional vectors for each word of the vocabulary by observing its neighboring context words in a large text collection. Popular models such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) yield vectors for which the proximity of words in the vector space indicates their semantic relatedness, for example similarity. As in Saussure’s chess analogy, an embedding model tries to find a positioning of all words relative to each other that optimally describes the language use it is presented with during its training phase. But instead of the two-dimensional chessboard, it typically uses between 50 and 300 dimensions to position its elements and encodes different aspects of meaning along these dimensions.
This way, it is not only the relative proximity of words that implies their meaning. To some extent, vector arithmetic can also be applied to reveal more complex semantic properties than word similarity alone. For instance, the vector operation ‘king’ – ‘man’ + ‘woman’ yields ‘queen’ as the closest vector, indicating that some notion of gender is encoded in the model. Kozlowski et al. (2019) use this property of word2vec embeddings in their approach to the “geometry of culture”. The main idea is to investigate cultural patterns by studying distances between words in an embedding model trained on a large corpus of texts representative of some base population (e.g., millions of US-American newspaper articles from one decade). The embedding vectors for music genres, sports, or professions can be projected onto gender, class, or ideological dimensions through their relative distance to the opposite terms of an imagined continuum (e.g., male-female, rich-poor, or liberal-conservative). The approach can reveal interesting cultural connotations, especially in a diachronic perspective, for example the feminization of the occupation “journalist” in the second half of the 20th century (ibid.).
In natural language processing, the use of word embeddings pretrained on very large generic corpora drastically improved the state of the art for almost any inference task, because they allow for a form of knowledge transfer in machine learning. For instance, a sentiment classifier that is presented with a sentence containing the attribute ‘good’ during training already knows something about other sentences containing closely related terms such as ‘great’ or ‘awesome’ (Rudkowsky et al., 2018).
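The analogy arithmetic and the dimensional projections described above can be reproduced in a few lines. Below is a minimal sketch using a pretrained GloVe model from gensim's downloader; the model choice and word lists are illustrative assumptions, and the gender-axis projection is a simplified approximation of the Kozlowski et al. approach, not their actual code.

```python
# Minimal sketch, not the authors' code: the 'king' - 'man' + 'woman'
# analogy and a simplified "geometry of culture" style projection.
import numpy as np
import gensim.downloader as api

# Downloads a small pretrained model on first use; any pretrained
# word embedding with the same interface would do.
model = api.load("glove-wiki-gigaword-100")

# 'king' - 'man' + 'woman' should rank 'queen' among the closest vectors.
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Simplified cultural-dimension projection: position words along a
# male-female axis via cosine with the difference vector (illustrative only).
axis = model["woman"] - model["man"]
axis /= np.linalg.norm(axis)
for word in ["journalist", "engineer", "nurse"]:  # hypothetical word list
    vec = model[word] / np.linalg.norm(model[word])
    print(word, round(float(np.dot(vec, axis)), 3))
```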
Real-valued syntactic word vectors
Published in Journal of Experimental & Theoretical Artificial Intelligence, 2020
A distributional semantic space is a finite-dimensional vector space (or linear space) whose dimensions correspond to the contextual environments of words in a corpus. Word similarities in a distributional semantic space are reflected by the similarities between the vectors associated with those words. In other words, similar words are associated with similar vectors.
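To make the definition concrete, the toy sketch below builds a three-word distributional space over four invented context dimensions and compares words by cosine similarity; all co-occurrence counts are fabricated for illustration.

```python
# Toy distributional semantic space: rows are words, dimensions are
# contextual environments (co-occurrence counts). Similar words end up
# with similar vectors. All counts here are invented.
import numpy as np

# Context dimensions: "bark", "meow", "drive", "park"
space = {
    "dog": np.array([8.0, 0.0, 0.0, 5.0]),
    "cat": np.array([1.0, 9.0, 0.0, 4.0]),
    "car": np.array([0.0, 0.0, 7.0, 6.0]),
}

def cosine(u, v):
    """Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(space["dog"], space["cat"]))  # higher: shared contexts
print(cosine(space["dog"], space["car"]))  # lower: fewer shared contexts
```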