Explore chapters and articles related to this topic
BERT- and FastText-Based Research Paper Recommender System
Published in Pallavi Vijay Chavan, Parikshit N Mahalle, Ramchandra Mangrulkar, Idongesit Williams, Data Science, 2022
Nemil Shah, Yash Goda, Naitik Rathod, Vatsal Khandor, Pankaj Kulkarni, Ramchandra Mangrulkar
Flags can be used to control the minimum and maximum n-gram lengths, i.e., the range of n-gram sizes that are extracted. As in the bag-of-words model, the order in which the tokens appear makes no difference; a common example is representing words by their IDs. While the model is being updated, FastText learns weights for each n-gram token as well as for the entire word token, as in Young et al., 2019 [31]. A recommender system is then built using a cosine similarity score. First, all the stop words are filtered out. Stop-word elimination is one of the first steps in most Natural Language Processing tasks: stop words are removed because they are not pertinent, while the words that add meaning to the sentences are kept. Stop words are removed only in tasks where grammatical coherence is not a necessity. Since the chapter is aimed at building a research paper recommender system, only the keywords are converted to vectors, which drastically reduces the required computational power and resources.
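As a rough sketch of these two steps, the snippet below trains a FastText model with gensim, where the min_n and max_n parameters play the role of the flags bounding the character n-gram lengths, and then ranks papers by the cosine similarity of their averaged keyword vectors. The paper IDs, keyword lists, and hyperparameter values are illustrative, not the chapter's actual setup.

```python
import numpy as np
from gensim.models import FastText

# Hypothetical paper IDs and (already stop-word-filtered) keyword lists.
papers = {
    "paper_a": ["neural", "recommender", "embedding", "bert"],
    "paper_b": ["fuzzy", "classification", "stopwords", "text"],
}

# min_n / max_n are the flags that bound the character n-gram length range.
model = FastText(sentences=list(papers.values()), vector_size=50,
                 min_n=3, max_n=6, min_count=1, epochs=20)

def paper_vector(keywords):
    """Average the FastText vectors of a paper's keywords."""
    return np.mean([model.wv[w] for w in keywords], axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

query = paper_vector(["recommender", "embedding"])
scores = {pid: cosine(query, paper_vector(kws)) for pid, kws in papers.items()}
print(max(scores, key=scores.get))  # the most similar paper
```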
A fuzzy algorithm for text classification in data science
Published in Arun Kumar Sinha, John Pradeep Darsy, Computer-Aided Developments: Electronics and Communication, 2019
Kondapalli Beulah, Penmetsa Vamsi Krishna Raja, P. Krishna Subba Rao
The goal of this test case is to remove stop words. To obtain exactly the output we expect, we need to provide an input file (the input file must be a .txt file). Stop words is the name given to words that are filtered out prior to, or after, the processing of natural language data (text). Stop words are small, frequently occurring words that are often ignored when typed into a database or search engine query; some examples are A, AN, OF, and THE. If a stop word is typed at the beginning of a title search, it will often stop the search entirely. These are very common words whose information is not required in the linguistic analysis of a corpus. To exclude them, and thus speed up processing considerably, a list of them is passed to the processing software, which then ignores any word in this list.
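A minimal sketch of this test case in Python (the article's own implementation is not shown here): read a .txt input file and drop every token that appears in a supplied stop-word list. The file name and the four-word list are illustrative.

```python
# Illustrative stop-word list, taken from the examples in the excerpt.
STOP_WORDS = {"a", "an", "of", "the"}

def remove_stop_words(path):
    # The input file must be a plain-text (.txt) file.
    with open(path, encoding="utf-8") as f:
        tokens = f.read().split()
    # Keep only tokens that are not on the stop-word list (case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("input.txt"))  # hypothetical file name
```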
Data Pre-processing
Published in Peter Wlodarczak, Machine Learning and its Applications, 2019
Relevance filtering typically happens at different stages of a machine learning project. Data deduplication can be considered a relevance filtering step if every instance has to be unique. Feature selection can also be considered relevance filtering since relevant features are separated from irrelevant ones. Stop words removal in text analysis is a relevance filtering procedure since irrelevant words or signs such as smileys are removed. Many natural language processing frameworks offer stop words removal functionality. Stop words are usually the most common words in a language such as “the”, “a”, or “that”. However, the list often needs to be adjusted since a stop word might be relevant, for instance, in a name such as “The Beatles”.
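As one illustration of adjusting such a list, the sketch below starts from NLTK's default English stop words and protects a name like "The Beatles" so that its stop word survives filtering. The protected-phrase mechanism is an assumption for demonstration, not a standard framework feature.

```python
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
protected = {"the beatles"}  # names in which a stop word is meaningful

def filter_tokens(text):
    lowered = text.lower()
    # Join protected phrases into single tokens before stop-word filtering.
    for phrase in protected:
        lowered = lowered.replace(phrase, phrase.replace(" ", "_"))
    return [t for t in lowered.split() if t not in stop_words]

print(filter_tokens("The Beatles played in the park"))
# ['the_beatles', 'played', 'park']
```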
What we talk about when we talk about EEMs: using text mining and topic modeling to understand building energy efficiency measures (1836-RP)
Published in Science and Technology for the Built Environment, 2023
Apoorv Khanuja, Amanda L. Webb
After tokenization, the stop words were removed from this bag of words using the R package stopwords (Benoit, Muhr, and Watanabe 2021). Stop words are frequently occurring but uninformative words (e.g., and, or, to, the) and are often removed from textual data prior to text mining. The list of stop words used for this analysis came from the snowball lexicon within the stopwords package, which was selected because its relatively short list of stop words would retain most of the EEM text. In addition to removing stop words, tokens in which the first character was a number were also removed, because these tokens generally provided an unnecessary level of detail (e.g., specific temperature setpoints, COP values, or the name of a standard such as ASHRAE 62.1) that was not essential to describing the EEM. However, tokens that contained numbers but began with a letter (e.g., T8, T12, CO2) were not removed, since they provided useful information about the specific type of building component affected by an EEM.
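The authors worked in R with the stopwords package; the sketch below reproduces the same filtering logic in Python, using NLTK's English list as a stand-in for the snowball lexicon. It drops stop words and tokens whose first character is a digit, while keeping tokens such as "T8" or "CO2" that merely contain digits.

```python
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords  # stand-in for the snowball lexicon

snowball = set(stopwords.words("english"))

def clean_tokens(tokens):
    kept = []
    for t in tokens:
        if t.lower() in snowball:
            continue          # drop stop words
        if t[0].isdigit():
            continue          # drop tokens whose first character is a number
        kept.append(t)        # keep "T8", "CO2", and ordinary words
    return kept

print(clean_tokens(["install", "T8", "lamps", "to", "62.1", "CO2"]))
# ['install', 'T8', 'lamps', 'CO2']
```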
Web service discovery with incorporation of web services clustering
Published in International Journal of Computers and Applications, 2023
Sunita Jalal, Dharmendra Kumar Yadav, Chetan Singh Negi
In order to evaluate the proposed approach, we prepared a dataset of 1000 web service (or API service) descriptions from different domains, such as Mapping, Weather, Inventory, TripAdvisor, and many more, published by ProgrammableWeb and other online sources. Preprocessing of the data was done before applying the LDA technique. It involves tokenization, removal of stop words and punctuation, and word stemming. Tokenization converts a string into meaningful English words. Stop words are commonly used words such as ‘a’, ‘an’, ‘in’, ‘the’, and many more. Punctuation marks are special symbols such as the period, question mark, hyphen, parentheses, etc. Removal of stop words, digits, and punctuation from text reduces the size of the text without losing its valuable information. Word stemming reduces a word to its root form. We used the NLTK toolkit for data preprocessing. The number of services in each domain is given in Table 1.
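A condensed sketch of this preprocessing pipeline with the NLTK toolkit the authors mention; the sample description string and the specific choices (Porter stemmer, NLTK's English stop-word list) are assumptions, since the excerpt does not name them.

```python
import string
import nltk
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(description):
    tokens = word_tokenize(description.lower())      # tokenization
    tokens = [t for t in tokens
              if t not in stop_words                 # stop-word removal
              and t not in string.punctuation        # punctuation removal
              and not t.isdigit()]                   # digit removal
    return [stemmer.stem(t) for t in tokens]         # word stemming

print(preprocess("An API for mapping the weather in a city."))
# ['api', 'map', 'weather', 'citi']
```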
Automated categorization of student's query
Published in International Journal of Computers and Applications, 2022
Naveen Kumar, Hare Krishna, Shashi Shubham, Prabhu Padarbind Rout
The text pre-processing involves four phases, i.e. tokenization, stop word removal, stemming, and vectorization. Tokenization removes white space and special characters from a document and converts the sentences and paragraphs into words. Stop words are very common words that carry very little information. These words are mainly used for syntactic purposes in the language; they hardly contribute to the problem domain [25]. A few examples of stop words are ‘the’, ‘a’, ‘and’, and ‘that’. These stop words are removed from the word set that is received after tokenization. In stemming, each word is converted to its root word or stem, which reduces the number of keywords in the dataset [25]. For example, ‘eat’, ‘eats’, ‘eaten’, and ‘eating’ will all be replaced by ‘eat’. In vectorization, each unique keyword in the dataset is converted to an attribute or feature. Each query is converted to a vector of length n, where n is the number of unique keywords in the whole dataset. This vector contains the frequency of each keyword in the query.
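A compact sketch of the four phases using NLTK and scikit-learn (the excerpt does not name its tooling, so both libraries and the sample queries are assumptions): a custom tokenizer handles tokenization, stop-word removal, and stemming, and CountVectorizer produces the keyword-frequency vectors of length n.

```python
import re
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def tokenize(text):
    # Tokenization, stop-word removal, and stemming in one pass.
    words = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(w) for w in words if w not in stop_words]

vectorizer = CountVectorizer(tokenizer=tokenize, token_pattern=None)
queries = ["When is the exam?", "Where do I submit the assignment?"]
X = vectorizer.fit_transform(queries)       # rows: keyword-frequency vectors
print(vectorizer.get_feature_names_out())   # the n unique keywords
print(X.toarray())
```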