Text mining – Knowledge and References

The Application of Text Mining in Detecting Financial Fraud: A Literature Review

Published in Deepmala Singh, Anurag Singh, Amizan Omar, S.B. Goyal, Business Intelligence and Human Resource Management, 2023

Pratibha Maurya, Anurag Singh, Mohd Salim

Text is a common means of data exchange in the modern world. Text mining draws on a variety of fields, including natural language processing (NLP), information retrieval, web mining, computational linguistics, data extraction, and data mining. It is used to automatically extract structured data from unstructured and semi-structured materials (Kautish, 2008; Kautish and Thapliyal, 2013), which makes it commercially valuable. It is a technique for analysing massive sets of unstructured documents with the goal of extracting knowledge or non-trivial patterns. Document files come in a variety of forms, including text files, flat files, and PDF files, and are assembled from a number of sources, including message boards, newsgroups, emails, online chat, text messages, and websites (Bagale et al., 2021). Humans are capable of rapidly resolving problems and of identifying and applying linguistic patterns to text (Singh & Gite, 2015). Computers, on the other hand, struggle with difficulties such as spelling, context, slang, and linguistic variation. Nonetheless, combining human language abilities with computing capabilities makes it possible to analyse text quickly and in enormous quantities in order to make sense of unstructured data. Text mining is the technique that lets a computer analyse such unstructured data. Fraud detection is a priority for financial sector organisations (Figure 12.1).

Concluding Remarks

Published in John Atkinson-Abutridy, Text Analytics, 2022

John Atkinson-Abutridy

The process of text mining comprises several activities that enable users to uncover information from unstructured text data. Before you can apply different text mining techniques, you must start with text preprocessing, which is the practice of cleaning and transforming text data into a usable format. This practice is a core aspect of NLP, and it usually involves techniques such as language identification, tokenization, part-of-speech tagging, chunking, and syntax parsing to format data appropriately for analysis. When text preprocessing is complete, you can apply text mining algorithms to derive insights from the data. Common text mining techniques include information retrieval (e.g., tokenization, stemming), natural language processing (e.g., part-of-speech tagging, summarization, categorization, sentiment analysis), and information extraction (e.g., named-entity recognition, feature selection and extraction).
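
The preprocessing step described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the method of any particular toolkit: the stop-word list, the suffix list, and the function names `tokenize`, `crude_stem`, and `preprocess` are all assumptions made for the example, and the stemmer is a deliberately crude stand-in for an established algorithm such as Porter's.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is", "are", "for", "from"}

# Hypothetical suffix list for a crude stemmer; this is not the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("Text mining extracts patterns from unstructured documents."))
```

The crude stemmer over-strips some words (for example, it reduces "mining" to "min"), which is one reason practical pipelines tend to rely on tested stemmers or lemmatizers rather than ad hoc suffix rules.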

A Study of Proximity of Domains for Text Categorization

Published in Sk Md Obaidullah, KC Santosh, Teresa Gonçalves, Nibaran Das, Kaushik Roy, Document Processing Using Machine Learning, 2019

Ankita Dhar, Niladri Sekhar Dash, Kaushik Roy

In the last few years, content-based text document management systems have gained tremendous attention in the field of computer and information science. The reasons behind this demand are the availability of digital text documents at a huge scale and the need to access these documents in a more efficient manner. This has led to the emergence of ‘text categorization’ (TC), which can also be referred to as ‘text classification’ or ‘topic spotting’. Text categorization is a dynamic research domain of text mining that refers to the task of assigning text documents to their respective categories using classification techniques in order to manage information efficiently. If text documents are categorized, then searching for and retrieving information from them will be quick and effective. The prime goal of text categorization is to assign an arbitrary text document to its category. Text categorization can be either single-label or multi-label: in the former case the text document is assigned exactly one class, whereas in the latter it may belong to more than one category.
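
The difference between single-label and multi-label categorization can be illustrated with a small sketch. The excerpt does not name a classification technique, so the code below assumes multinomial naive Bayes with add-one smoothing as one common choice; the training data, the function names, and the `threshold` rule used for the multi-label case are all hypothetical and chosen only to keep the example short.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Count word frequencies per category from (tokens, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for tokens, category in labelled_docs:
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def posteriors(tokens, model):
    """Return normalised category probabilities under multinomial naive Bayes."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)
        for token in tokens:
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[category][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        log_scores[category] = score
    # Convert log scores to probabilities that sum to 1.
    peak = max(log_scores.values())
    exp_scores = {c: math.exp(s - peak) for c, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}


def classify_single(tokens, model):
    """Single-label: return the one most probable category."""
    probs = posteriors(tokens, model)
    return max(probs, key=probs.get)


def classify_multi(tokens, model, threshold=0.3):
    """Multi-label (illustrative rule): every category at or above the threshold."""
    probs = posteriors(tokens, model)
    return sorted(c for c, p in probs.items() if p >= threshold)


# Hypothetical toy training set.
TRAINING = [
    ("match goal team player".split(), "sports"),
    ("team win league score".split(), "sports"),
    ("bank loan interest market".split(), "finance"),
    ("market stock share price".split(), "finance"),
]
model = train(TRAINING)
print(classify_single("team score goal".split(), model))
print(classify_multi("team market".split(), model))
```

With this toy data, a document mentioning both "team" and "market" is equally probable under both categories, so the multi-label rule returns both, while the single-label rule must always pick exactly one.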

What we talk about when we talk about EEMs: using text mining and topic modeling to understand building energy efficiency measures (1836-RP)

Published in Science and Technology for the Built Environment, 2023

Apoorv Khanuja, Amanda L. Webb

Text mining and related natural language processing (NLP) techniques, such as topic modeling, present a promising strategy for analyzing EEM names and descriptions. Text mining, broadly, is the process of automatically extracting previously unknown information and insights from unstructured text within any written resource (Hearst 1999). Topic modeling is an unsupervised text mining technique that can be used to uncover hidden themes (i.e., topics) across a collection of documents, as well as within individual documents (Blei 2012). Text mining and topic modeling have been used to analyze textual data in a variety of applications, such as examining newspaper articles related to government funding of artists and arts organizations (DiMaggio, Nag, and Blei 2013), uncovering themes in educational leadership research literature over time (Wang, Bowers, and Fikis 2017), and evaluating Consumer Financial Protection Bureau complaints (Bastani, Namavari, and Shaffer 2019). Research has also tested the effectiveness of topic models in analyzing Twitter data (L. Hong and Davison 2010). Overall, these studies show that topic modeling is a valuable technique for analyzing large collections of texts where manual review would be infeasible, and that it works well in uncovering the thematic makeup of documents across a variety of fields.
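
To make the idea of uncovering hidden topics concrete, the following is a minimal sketch of one widely used topic model, latent Dirichlet allocation (LDA), fitted by collapsed Gibbs sampling. The excerpt does not state which model or implementation the authors used, so this is an assumption made for illustration; the function name `lda_gibbs`, the hyperparameter defaults, and the four short example documents are hypothetical, and a real study would use a tested library and far more data.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]               # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word w assigned to topic k
    topic_total = [0] * num_topics                              # all tokens in topic k
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions.
    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    # Three highest-count words per topic.
    top_words = [
        [vocab[j] for j in sorted(range(vocab_size), key=lambda j: -row[j])[:3]]
        for row in topic_word
    ]
    return theta, top_words


# Hypothetical short documents, loosely styled after measure names.
DOCS = [
    "replace lighting led lamp".split(),
    "install led lighting fixture".split(),
    "upgrade boiler heating system".split(),
    "replace heating boiler controls".split(),
]
theta, top_words = lda_gibbs(DOCS, num_topics=2)
for k, words in enumerate(top_words):
    print(k, words)
```

The sampler is stochastic, so the topics it finds depend on the seed and on the number of iterations; with so few documents the result should be read as a demonstration of the mechanics rather than a meaningful analysis.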

Exploring deep learning approaches for Urdu text classification in product manufacturing

Published in Enterprise Information Systems, 2022

Muhammad Pervez Akhter, Zheng Jiangbin, Irfan Raza Naqvi, Mohammed Abdelmajeed, Muhammad Fayyaz

The tremendous growth of Urdu text documents on the internet is creating challenges for researchers seeking an automatic, reliable, and fast way to organise these documents. Text document classification is the task of automatically assigning a label from a set of pre-defined labels to a document based on its contents. Text document classification has several applications in text mining and information retrieval, such as spam detection (Akhtar, Tahir, and Shakeel 2017; Jain, Sharma, and Agarwal 2018), tweet analysis (Ali et al. 2018), sentiment analysis (Mehmood, Essam, and Shafi 2019), and document organisation (Tripathy, Anand, and Rath 2017; Rao et al. 2018). Urdu is a national language of Pakistan and has more than 300 million speakers all over the world (Riaz 2012), but it is a resource-poor language. A morphologically rich and complex script, the absence of capitalisation, the use of diacritics, free word order, and context sensitivity are some of the main characteristics of Urdu that make automatic text processing more challenging.

SentiXGboost: enhanced sentiment analysis in social media posts with ensemble XGBoost classifier

Published in Journal of the Chinese Institute of Engineers, 2021

Roza Hikmat Hama Aziz, Nazife Dimililer

Sentiment analysis, also known as opinion mining, is a subfield of text mining that incorporates natural language processing techniques to analyze people’s sentiments, opinions, attitudes, evaluations, and emotions about a particular product or topic (Pang and Lee 2008; Liu, Bi, and Fan 2017; Bi et al. 2019a). Since sentiments and opinions are at the core of human communication, sentiment analysis has recently been the focus of both business applications and research. The advances in sentiment analysis research coincide with the proliferation of social media and involve building systems to collect and examine opinions about products or other topics in blog posts, micro-blogs, reviews, comments, forum discussions, and social networks (Liu, Bi, and Fan 2017; Bi et al. 2019a). The key challenges in sentiment analysis, specifically in social media, include (1) the use of informal language, (2) the widespread but inconsistent and ad hoc use of abbreviations and acronyms, and (3) the brevity of the messages. A significant number of studies have analyzed informal texts and classified them using lexicon-based and machine learning approaches (Liu, Bi, and Fan 2017; Bi et al. 2019b; González, Pla, and Hurtado 2017; Symeonidis et al. 2017; Rozental and Fleischer 2017; Hasan et al. 2018; Cliche 2017). Notably, collecting and utilizing the noisy content within these texts using a dictionary or lexicon is not practical; therefore, machine learning techniques have been used to address challenges in sentiment analysis.
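
The limitation of the lexicon approach on noisy text can be seen in a small sketch. The lexicon, the negator list, and the function names below are hypothetical and far smaller than any real resource; the scoring rule (sum of word scores, with a negator flipping only the next token) is a simplifying assumption, not the method of the cited studies.

```python
import re

# Hypothetical toy lexicon; real lexicons hold thousands of scored entries.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "hate": -2}
NEGATORS = {"not", "never", "no"}


def lexicon_sentiment(text):
    """Sum the lexicon scores of the words in text, flipping the sign after a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    negate = False
    for token in tokens:
        if token in NEGATORS:
            negate = True
            continue
        value = LEXICON.get(token, 0)
        score += -value if negate else value
        negate = False  # a negator only affects the token right after it
    return score


def label(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for post in ["I love this phone, great battery", "not good", "gr8 phone luv it"]:
        print(post, "->", label(lexicon_sentiment(post)))
```

The last post is clearly positive to a human reader, but the informal spellings "gr8" and "luv" are absent from the lexicon, so the scorer falls back to neutral. Keeping a dictionary up to date with every such variant is the impracticality the passage refers to.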