Real-Time 2D Avatar Lip Syncing for the on Demand Interactive Chatbots
Published in Recent Advances in Computer Based Systems, Processes and Applications (Anupama Namburu, Soubhagya Sankar Barpanda, eds.), 2020
Venkata Susmitha Lalam, Abhinav Dayal, Sajid Vali Rehman Sheik, Vinay Kumar Adabala
A phoneme is a perceptually distinct unit of sound in a language; it is what distinguishes one word from another. The English language has 44 such phonemes despite having only 26 letters in its alphabet [9]. The authors use Python's Natural Language Toolkit (NLTK) [http://www.nltk.org/], whose phoneme representation comprises 39 phonemes, as shown in Figure 1. NLTK is a collection of libraries and functions for natural language processing (NLP) of English text. The package comes in handy for research in linguistics, information retrieval, and other domains that involve large volumes of data, and it is one of the most widely used platforms for working with human-language data. It contains various tools for lexical analysis and reasoning, lemmatization, and sentiment analysis, and it provides more than 50 corpora and other lexical resources such as WordNet.

To determine phonemes, the first step is tokenization, the process of identifying and extracting the words from the given text. The next step is to find the sequence of phonemes needed to pronounce each word. The authors use word_tokenize(), a function in the NLTK library and one of its lexical tools, to break the sentences in the text into a list of meaningful words and punctuation marks. A corpus is a large, structured collection of written or spoken sentences; in NLTK, corpora are sets of written texts. NLTK provides users with a large number of corpora, such as abc for plain-text corpora, brown for annotated corpora, and treebank for parsed corpora. In this work, cmudict, the CMU Pronouncing Dictionary corpus containing the phonetic transcription of more than 100,000 words, is used as a standard resource to extract the phonemes.
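As a concrete illustration, the following is a minimal sketch of this tokenize-then-look-up pipeline, assuming NLTK is installed with the punkt tokenizer models and the cmudict corpus downloaded; the text_to_phonemes helper is a hypothetical name for illustration, not code from the paper.

```python
# Sketch: text -> tokens -> phonemes via NLTK's word_tokenize and cmudict.
# One-time setup: nltk.download('punkt'); nltk.download('cmudict')
import nltk
from nltk.corpus import cmudict

pron_dict = cmudict.dict()  # maps lowercase words to lists of pronunciations

def text_to_phonemes(text):
    """Tokenize text and look up each word's first pronunciation in cmudict."""
    phonemes = []
    for token in nltk.word_tokenize(text):
        word = token.lower()
        if word in pron_dict:
            # cmudict may list several pronunciations; take the first one.
            phonemes.extend(pron_dict[word][0])
    return phonemes

print(text_to_phonemes("Hello world"))
# e.g. ['HH', 'AH0', 'L', 'OW1', 'W', 'ER1', 'L', 'D']
```

The digits on the vowels are cmudict's stress markers; a lip-syncing pipeline like the one described would typically strip them before mapping phonemes to mouth shapes.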
Fully Unsupervised Machine Translation Using Context-Aware Word Translation and Denoising Autoencoder
Published in Applied Artificial Intelligence, 2022
Shweta Chauhan, Philemon Daniel, Shefali Saxena, Ayush Sharma
Cross-lingual word embeddings (CLWE) cannot capture the complexity of words with multiple meanings, such as homonyms or polysemous words. A solution to this limitation is to learn a separate representation for each meaning of a word, that is, for each word sense. Traditional techniques for this task rely on lexical resources built by humans, such as WordNet. These resources resemble a dictionary or thesaurus and list all the possible meanings of each word. Such knowledge-based techniques raise the additional challenge of creating these resources and sense-annotated corpora; because the task is time-consuming and expensive, these approaches are limited to a few well-studied languages and do not scale to others. Identifying word senses and learning their sense representations can also be automated by analyzing the contexts in which a word appears. An unsupervised method (Pelevina et al. 2017) learns the sense vector space from a semantic graph, which is constructed by connecting each word to the set of its semantically similar words. In our approach, we use the following steps to learn the sense vector representations.

First, a semantic graph of word similarities is built. Each word is connected to its nearest neighbors, and the weight of each edge is set to the similarity score between the retrieved neighbor and the word under consideration. Here, nearest neighbors are the words whose word vectors have the highest cosine similarity with that of the word under consideration.

Second, in the sense induction step, an ego network is constructed for every word in the vocabulary. In this ego network, words (nodes) referring to the same sense tend to be tightly connected, while having fewer connections to words referring to a different sense. A word sense can therefore be represented by such a group of tightly connected words, i.e., a word cluster. For instance, the cluster "chair, bed, bench, stool, sofa, desk, cabinet" can represent the sense "table (furniture)." The ego network is then clustered with the Chinese Whispers algorithm; because this algorithm is parameter-free, the number of senses induced may vary from word to word.

Finally, a sense vector is calculated for each induced sense of every word in the vocabulary. It is assumed that a word sense should be represented by a combination of the words in the cluster corresponding to that sense; thus, sense vectors are calculated as the weighted average of the word vectors in the cluster of the corresponding sense.
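The final step lends itself to a short worked example. Below is a minimal sketch of computing a sense vector as the similarity-weighted average of cluster word vectors; the toy embeddings, the cluster, and the sense_vector helper are all hypothetical, standing in for trained embeddings and the clusters produced by Chinese Whispers on the word's ego network.

```python
# Sketch: sense vector = weighted average of the word vectors in the
# cluster induced for that sense, weighted by cosine similarity to the
# target word (hypothetical toy data, not the paper's actual vectors).
import numpy as np

word_vectors = {                      # hypothetical pretrained embeddings
    "chair": np.array([0.90, 0.10, 0.00]),
    "bench": np.array([0.80, 0.20, 0.10]),
    "stool": np.array([0.85, 0.15, 0.05]),
}

def sense_vector(cluster, vectors, target_vec):
    """Weighted average of cluster word vectors; each weight is the cosine
    similarity between a cluster word and the target word's vector."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    weights = np.array([cos(vectors[w], target_vec) for w in cluster])
    stacked = np.stack([vectors[w] for w in cluster])
    return (weights[:, None] * stacked).sum(axis=0) / weights.sum()

table_vec = np.array([0.95, 0.05, 0.0])   # hypothetical vector for "table"
furniture_sense = sense_vector(["chair", "bench", "stool"], word_vectors, table_vec)
print(furniture_sense)  # one sense vector for the "table (furniture)" sense
```

Because the weights are the same similarity scores stored on the semantic graph's edges, each sense vector is pulled toward the cluster members that are most similar to the target word.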