Text mining and topic modeling
Published in Uwe Engel, Anabel Quan-Haase, Sunny Xun Liu, Lars Lyberg, Handbook of Computational Social Science, Volume 2, 2021
Raphael H. Heiberger, Sebastian Munoz-Najar Galvez
Working with textual data is messy, in the sense that large portions of an unprocessed vocabulary may contain little to no information, conditional on the research question. According to Zipf’s law, the frequency distribution of words resembles a power law, meaning that the frequency of any word is inversely proportional to its rank in the frequency table (Manning et al., 1999). Thus, a few words like “the”, “can”, “non”, and so on appear very often. Because all texts contain these stopwords, they are of relatively little value for extracting meanings and themes. One common preprocessing step is therefore the removal of stopwords. While removing the most common stopwords seems obvious, an extended custom list of stopwords can also decrease the quality of a topic model (Schofield et al., 2017a). In addition, extending stopword lists is time consuming, and the decisions involved are often hard to reproduce. Thus, the first preprocessing step (PS) we test is whether to remove a short list of stopwords or an extended one. As defaults, we used the snowball stopword list from the stopwords package in R and a custom list containing time-related words (“year”, “january”, etc.), numbers (“one”, “tenth”, etc.), and miscellaneous subject-related words (“examine”, “study”, etc.).
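To make the two conditions concrete, here is a minimal Python sketch of stopword removal with a short versus an extended list. The short list is a small hand-picked stand-in for the Snowball list (the chapter uses the stopwords package in R), and the custom extensions merely echo the examples quoted above; both lists are illustrative, not the chapter’s actual defaults.

```python
# Sketch of the two stopword conditions: a short generic list vs. an
# extended list with custom, subject-related additions. Both lists are
# illustrative stand-ins, not the chapter's actual Snowball/custom lists.
SHORT_STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "can"}
CUSTOM_EXTENSIONS = {"year", "january", "one", "tenth", "examine", "study"}
EXTENDED_STOPWORDS = SHORT_STOPWORDS | CUSTOM_EXTENSIONS

def remove_stopwords(tokens, stopword_list):
    """Drop every token that appears in the given stopword list."""
    return [t for t in tokens if t.lower() not in stopword_list]

tokens = "We examine the corpus of one year in January".lower().split()
print(remove_stopwords(tokens, SHORT_STOPWORDS))     # keeps 'examine', 'year', ...
print(remove_stopwords(tokens, EXTENDED_STOPWORDS))  # drops those as well
```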
ATM Switch Architecture and Systems
Published in Naoaki Yamanaka, High-Performance Backbone Network Technology, 2020
Real-life traffic destinations are not uniformly distributed; traffic tends to be focused on preferred or popular destinations. Unfortunately, the performance of many scheduling algorithms degrades under nonuniform traffic conditions, where not all queues are evenly and heavily loaded. Maximum matching algorithms are known to perform poorly and cause queue starvation [2] under these conditions. We introduce here a destination distribution model based on Zipf’s law, which was proposed by G. K. Zipf [10]-[12]. This model may be used as a reference to investigate and compare the performance of different scheduling algorithms. Zipf’s law states that the frequency of occurrence of some event ($P$), as a function of its rank $i$ (where the rank is determined by the frequency of occurrence), is a power-law function, $P_i \sim 1/i^k$, with the exponent $k$ close to unity. The most famous example of Zipf’s law is the frequency of English words in a given text. Most common is the word “the,” then “of,” “to,” etc. When the number of occurrences is plotted as a function of the rank ($i = 1$ most common, $i = 2$ second most common, etc.), the functional form is a power law with exponent close to 1. Figure 5 illustrates the Zipf distribution for different values of the parameter $k$. It has been shown that many natural and human phenomena, such as Web access statistics, company sizes, and biomolecular sequences, obey Zipf’s law with $k$ close to 1 [11]. We use the Zipf distribution to model the packet destination distribution. The probability that an arriving packet is headed for destination $i$ is given by: $\mathrm{Zipf}(i) = \dfrac{1/i^k}{\sum_{j=1}^{N} 1/j^k}$
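As an illustration of how this destination model can drive a simulation, the following Python sketch samples packet destinations with probability proportional to $1/i^k$, normalized over $N$ ports, i.e. the Zipf(i) expression above. The function and parameter names are our own assumptions, not from the chapter.

```python
import random

def zipf_weights(n_ports, k=1.0):
    """Unnormalized Zipf weights 1/i^k for destinations i = 1..n_ports."""
    return [1.0 / i**k for i in range(1, n_ports + 1)]

def sample_destinations(n_ports, n_packets, k=1.0, seed=0):
    """Draw packet destinations so that destination i is chosen with
    probability (1/i^k) / sum_j (1/j^k), i.e. Zipf(i) as defined above.
    random.choices normalizes the weights internally."""
    rng = random.Random(seed)
    return rng.choices(range(1, n_ports + 1),
                       weights=zipf_weights(n_ports, k),
                       k=n_packets)

# With k close to 1, a few low-rank destinations attract most of the traffic.
dests = sample_destinations(n_ports=16, n_packets=10_000)
counts = sorted(((dests.count(i), i) for i in set(dests)), reverse=True)
print(counts[:3])  # destinations 1, 2, 3 dominate
```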
Pre-Processing of Dogri Text Corpus
Published in Durgesh Kumar Mishra, Nilanjan Dey, Bharat Singh Deora, Amit Joshi, ICT for Competitive Strategies, 2020
Tesseract has been used by various researchers to identify text in images. (Kumar Audichya and Saini 2017) used this open-source tool to recognize Gujarati characters with an available training script; a mean confidence of 86% was achieved even under variations in font size and style. (L, J, and N 2016) also compared text extraction pipelines, pitting the combination of ImageMagick and Tesseract against OCRopus; the results of the combined pipeline were more promising than those of OCRopus.

For pre-processing tasks like stop-word removal, the methods used by researchers range from DFA-based models to frequency-based approaches to consulting linguistic experts to create the lists, as discussed by (Gandotra 2018). (Siddiqi and Sharan 2018) prepared a generic stop-word list of more than 800 words, entered manually in consultation with linguistic experts. Manual creation of such lists is time-consuming and expensive; the lists are biased, and there is a risk of missing important information. (Jha et al. 2016) employed a DFA approach to construct a stop-word list by exploiting linguistic features of the Hindi language, with patterns based on sequences of Hindi characters used for the DFA modelling. The generated stop-word list was tested on 200 documents, attaining an accuracy of 99% with the shortest execution time (1.77 s). The dictionary-based technique, also known as the classical method, was used by (Vijayarani, Ilamathi, and Nithya 2015) to create a stop-word list; here too the list is created manually, but more than one linguistic expert is employed for the task. Such manual creation is helpful when no digital data or corpus is available. A statistics-based approach was used by (Garg et al. 2014) for Hindi stop-word list creation: Zipf’s law was applied to extract high-frequency, low-rank words from the corpus. (Puri, Bedi, and Goyal 2013) likewise applied a frequency-based technique to generate a stop-word list for Punjabi, combining two approaches, frequency distribution and probability distribution, to extract stop-words from the corpus and produce the desired list.
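As a sketch of the statistical approach attributed to (Garg et al. 2014), the following Python snippet ranks words by frequency and returns the highest-frequency (lowest-rank) words as stop-word candidates. The function name and the cut-off are hypothetical choices; a real pipeline would tune the threshold on the corpus.

```python
from collections import Counter

def zipf_stopword_candidates(tokens, top_n=20):
    """Return the top_n most frequent (i.e. lowest-rank) words.
    Following the Zipf's-law intuition, these high-frequency words
    usually carry little topical content and are stop-word candidates."""
    counts = Counter(tokens)
    return [word for word, _ in counts.most_common(top_n)]

corpus = "the dog chased the cat and the cat chased the mouse".split()
print(zipf_stopword_candidates(corpus, top_n=3))  # ['the', 'chased', 'cat']
```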
Spatial heterogeneity of ports in the global maritime network detected by weighted ego network analysis
Published in Maritime Policy & Management, 2018
Chengliang Liu, Jiaqi Wang, Hong Zhang
Among all the measures used to capture the topological structure of real-life networks, centralities are the most widely used. The degree, betweenness, and strength centrality of each port are calculated, together with WACR. All of them demonstrate power-law distributions and conform to the Pareto principle and Zipf’s law to some extent (see Figure 1). The fitting curves of the cumulative probability distributions of these measures follow the rank-size rule and exhibit the scale-free property. In other words, the maritime network as a whole is heterogeneous and polarized. The exponent γ of the power-law distribution lies between 1 and 2 for all four measures, which indicates a sparse and unbalanced connection pattern in the maritime network.
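A rough way to check such a rank-size (power-law) signature is to regress log(value) on log(rank); this is a simplified stand-in for the article’s cumulative-probability fit, and the port degree values below are invented purely for illustration.

```python
import math

def rank_size_slope(values):
    """Least-squares slope of log(value) vs log(rank), with rank 1 for
    the largest value. A straight log-log line (slope -q) is the
    rank-size signature of a power-law distribution."""
    vals = sorted(values, reverse=True)
    xs = [math.log(r) for r in range(1, len(vals) + 1)]
    ys = [math.log(v) for v in vals]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Hypothetical port degrees; real values would come from the maritime network.
degrees = [512, 256, 170, 128, 102, 85, 73, 64, 57, 51]
print(f"rank-size slope: {rank_size_slope(degrees):.2f}")  # roughly -1
```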
Statistical Universals of Language: Mathematical Chance vs. Human Choice
Published in Technometrics, 2022
The second part of the book, “Property of population,” comprises three chapters, chapters four through six. Chapter 4, “Relation between rank and frequency,” is concerned with the analysis of words. The chapter explains the concepts of the frequency and the rank of a word in a language. The frequency of a word in a text is simply the number of occurrences of the word in the text. The chapter contains a detailed account of Zipf’s law. The author illustrates Zipf’s law by elucidating the rank-frequency of several words in Moby Dick. In it, the word “whale” appears 783 times and its rank is 38, as it is the 38th most frequently occurring word in the text. Similarly, the word “ship” appears 451 times and has rank 67. This law states that the rank-frequency relation for every text forms a power law with a slope of –1. In other words, if $f$ is the frequency of a word and $r$ is the rank, then $f = c\, r^{-\eta}$ for some constant parameters $c$ and $\eta$. The law could be understood to mean that the word with rank $r$ appears with frequency $f_M/r$, where $f_M$ is the frequency of the word of rank 1. For example, the frequency of the word with rank 2 is $f_M/2$, and the word of rank 3 appears $f_M/3$ times, following a harmonic progression. The chapter also contains a detailed account of the scale-free property, the fact that the vocabulary population in a text is invariant with respect to the text size. One striking aspect of Zipf’s law is the observation that in every text there is a considerably large number of words that are rare. This includes hapax legomena, words that appear only once in a text. An interesting corollary of Zipf’s law states that the population of rare words in a text can be estimated from that of the frequent words, because the vocabulary population is invariant with respect to the text size. There are also dedicated sections on the monkey text, the power law of n-grams, and the relative rank-frequency distribution. Chapter 5, “Bias between rank and frequency,” brings to light how Zipf’s law deviates when different texts are analyzed; it also contains accounts of alternatives to the law. The sixth chapter, “Related statistical universals,” considers the nature of a vocabulary population by taking into consideration two related properties that have mathematical relations with Zipf’s law: the density function and the vocabulary growth.
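The Moby Dick figures quoted above allow a quick arithmetic check of the slope-(–1) form: if $f = c/r$, then $f \cdot r$ should be roughly constant across words. A minimal Python check, using only the two data points from the review:

```python
# Zipf's law with slope -1 predicts f * r ~ c, a constant, for every word.
whale = 783 * 38  # "whale": frequency 783 at rank 38 -> 29754
ship = 451 * 67   # "ship":  frequency 451 at rank 67 -> 30217
print(whale, ship)  # both near the same constant c, consistent with eta ~ 1
```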