Document clustering

Feature Selection for Clustering: A Review

Published in Charu C. Aggarwal, Chandan K. Reddy, Data Clustering, 2018

Salem Alelyani, Jiliang Tang, Huan Liu

Document clustering aims to segregate documents into meaningful clusters that reflect the content of each document. For example, on a news wire, manually assigning one or more categories to each document requires exhaustive human labor, especially given the huge amount of text uploaded online daily. Thus, efficient clustering is essential. Another problem associated with document clustering is the huge number of terms. In a matrix representation, each term is a feature and each document is an instance. In typical cases, the number of features is close to the number of words in the dictionary. This poses a great challenge for clustering methods, whose efficiency is greatly degraded. However, many of these words are stop words, irrelevant to the topic, or redundant. Thus, removing these unnecessary words may significantly reduce dimensionality.
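The matrix representation described above can be illustrated with a minimal sketch, assuming whitespace tokenization and a tiny hand-picked stop-word list. The names `STOP_WORDS`, `tokenize`, and `document_term_matrix` are hypothetical and chosen only for this example; they do not come from the chapter.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "of", "in", "and", "is", "to", "on"}

def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    tokens = (w.strip(".,;:!?").lower() for w in text.split())
    return [t for t in tokens if t]

def document_term_matrix(docs, stop_words=frozenset()):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    counts = [Counter(t for t in tokenize(d) if t not in stop_words) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

docs = [
    "The market fell on news of the rate rise.",
    "The team is in the final and is ready to win.",
    "A rate rise is bad news to the market.",
]

# Each term is a feature (column); each document is an instance (row).
full_vocab, full_matrix = document_term_matrix(docs)
small_vocab, small_matrix = document_term_matrix(docs, STOP_WORDS)
print(len(full_vocab), "features before;", len(small_vocab), "after stop-word removal")
```

Even on three short sentences, dropping stop words removes a noticeable share of the columns. On a realistic corpus the vocabulary is far larger, so a sparse representation would normally replace the dense lists used here.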

Published in John Atkinson-Abutridy, Text Analytics, 2022

John Atkinson-Abutridy

In general, a clustering method performs an unsupervised learning task, which groups data objects without prior information to characterize each cluster. In the case of textual information, there are many applications where document clustering is vital, including improving document rankings, clustering complaints from users of a service, customer segmentation, topic identification, text summarization, and exploring or browsing similar documents.

A novel approach to text clustering using genetic algorithm based on the nearest neighbour heuristic

Published in International Journal of Computers and Applications, 2022

D. Mustafi, A. Mustafi, G. Sahoo

Clustering is the process of finding groups of objects such that the objects in the same group are similar or related to each other based on some chosen measure of similarity [22], while objects in other groups are dissimilar or unrelated [8]. Document clustering is a variant of the traditional clustering process that groups similar documents into clusters. Mathematically, a corpus Z consists of N unlabeled documents such that $Z = \{d_1, d_2, \ldots, d_N\}$, where $d_i$ represents a document. In our work, all documents contain English text encoded using eight-bit ASCII. As a problem definition, our task is to create K disjoint groups of similar documents, where we assume K is known a priori [23].
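To make the problem definition concrete, the following sketch partitions a toy corpus into K disjoint groups. It uses plain k-means with cosine similarity over term-count vectors, which is a stand-in and not the genetic-algorithm approach of the article. The function names, the `seeds` parameter, and the four-document corpus are all assumptions made for illustration.

```python
import math
from collections import Counter

def vectorize(docs):
    """Turn each document into a term-count vector over a shared vocabulary."""
    counts = [Counter(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*counts))
    return [[float(c[t]) for t in vocab] for c in counts]

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def kmeans(vectors, k, seeds, iterations=10):
    """Assign every vector to exactly one of k clusters (a disjoint partition).

    `seeds` lists the indices of the documents used as initial centroids,
    a deterministic initialisation assumed here to keep the sketch reproducible.
    """
    centroids = [list(vectors[i]) for i in seeds]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: pick the most similar centroid for each document.
        labels = [max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

corpus = [
    "stock market prices fall",
    "market prices rise on stock news",
    "football team wins match",
    "team loses football match",
]
labels = kmeans(vectorize(corpus), k=2, seeds=[0, 2])
print(labels)
```

Because each document receives exactly one label, the resulting groups are disjoint by construction, and K must be supplied up front, matching the assumption that K is known in advance.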

An Efficient Document Clustering Approach for Devising Semantic Clusters

Published in Cybernetics and Systems, 2023

E. K. Jasila, N. Saleena, K. A. Abdul Nazeer

Document clustering plays an important role in data mining, as it provides a means to group similar documents and to find meaningful topics in them. Obtaining quality clusters from a huge volume of documents requires a more accurate feature space representation and efficient clustering algorithms. In its early stages, clustering did not take into account the semantic relationships between documents. In later years, however, working with a plethora of unstructured textual data demanded better methods for incorporating semantic similarity, especially in specific domains. The computational time demanded by the high-dimensional feature space representation is a major concern in document clustering.