Search Results

875 result(s) for "Document clustering"
Data Clustering
In this book, top researchers from around the world cover the entire area of clustering, from basic methods to more refined and complex data clustering approaches. They pay special attention to recent issues in graphs, social networks, and other domains. The book explores the characteristics of clustering problems in a variety of application areas. It also explains how to glean detailed insight from the clustering process, including how to verify the quality of the underlying clusters, through supervision, human intervention, or the automated generation of alternative clusters.
Unprecedented Rough Sets Model for Arabic Document Clustering
Arabic presents numerous challenges for document clustering, including its rich and varied morphology, orthography, structural features, unique linguistic properties, and word-sense ambiguity. Clustering Arabic documents is a paramount task in the information retrieval and data mining fields. In this paper, we propose a novel model based on rough set theory for clustering Arabic documents. Two well-known datasets, CNN and OSAC, are preprocessed and prepared as input for the model, and a feature table is created from the preprocessed data. Document similarities are calculated by adapting the rough discernibility relation to identify semantically coherent documents. This relation is represented as a weighted distance graph (WDG), from which the similarity matrix is constructed. The resulting similarity values drive the proposed clustering algorithm. The model's effectiveness was evaluated on the CNN and OSAC datasets, achieving an F-score of 0.85 on both.
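The pipeline sketched in this abstract (binary feature table, discernibility relation, weighted distance graph, similarity matrix, clustering) can be illustrated in miniature as below; the agreement-based scoring rule and the toy documents are assumptions standing in for the paper's exact rough-set formulation.

```python
# Hedged sketch: a discernibility-style similarity matrix from a binary feature table.
# The scoring rule (fraction of terms on which two documents agree) is an illustrative
# stand-in for the paper's rough discernibility relation, not its exact definition.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import AgglomerativeClustering

docs = [
    "arabic news report about economy and markets",
    "economy and markets update from the region",
    "football match results and sports news",
    "sports news covering the football league",
]

# Binary feature table: rows = documents, columns = terms (1 if the term occurs).
table = CountVectorizer(binary=True).fit_transform(docs).toarray()

n_docs = table.shape[0]
similarity = np.zeros((n_docs, n_docs))
for i in range(n_docs):
    for j in range(n_docs):
        # Two documents are "indiscernible" on a term if they agree on its presence/absence.
        similarity[i, j] = np.mean(table[i] == table[j])

# Cluster directly on the similarity matrix (converted to a distance matrix).
labels = AgglomerativeClustering(
    n_clusters=2, metric="precomputed", linkage="average"
).fit_predict(1.0 - similarity)
print(labels)  # e.g. [0 0 1 1]
```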
Hybrid clustering analysis using improved krill herd algorithm
In this paper, a novel text clustering method, an improved krill herd algorithm with a hybrid function called MMKHA, is proposed as an efficient way to obtain promising and precise clustering results. Krill herd is a recent swarm-based optimization algorithm that imitates the behavior of a group of live krill. Its potential is high because it performs better than other optimization methods: it balances exploration and exploitation by complementing local neighborhood search with global wide-range search. Text clustering is the process of grouping large amounts of text documents into coherent clusters in which documents in the same cluster are relevant to one another. For the experiments, six versions of the algorithm are thoroughly investigated to determine the best one for text clustering. Eight benchmark text datasets, available from the Laboratory of Computational Intelligence (LABIC), are used for evaluation. Seven evaluation measures are used to validate the proposed algorithms, namely ASDC, accuracy, precision, recall, F-measure, purity, and entropy. The proposed algorithms are compared with other successful algorithms published in the literature. The results show that the proposed improved krill herd algorithm with a hybrid function achieved nearly all the best results across all datasets in comparison with the other algorithms.
Accommodating Individual Preferences in the Categorization of Documents: A Personalized Clustering Approach
As electronic commerce and knowledge economy environments proliferate, both individuals and organizations increasingly generate and consume large amounts of online information, typically available as textual documents. To manage this ever-increasing volume of documents, individuals and organizations frequently organize their documents into categories that facilitate document management and subsequent access and browsing. Document clustering is an intentional act that should reflect individual preferences with regard to the semantic coherency and relevant categorization of documents. Hence, effective document clustering must consider individual preferences and needs to support personalization in document categorization. In this paper, we present an automatic document-clustering approach that incorporates an individual's partial clustering as preferential information. Combining two document representation methods, feature refinement and feature weighting, with two clustering methods, precluster-based hierarchical agglomerative clustering (HAC) and atomic-based HAC, we establish four personalized document-clustering techniques. Using a traditional content-based document-clustering technique as a performance benchmark, we find that the proposed personalized document-clustering techniques improve clustering effectiveness, as measured by cluster precision and cluster recall.
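The precluster idea described here, where a user's partial clustering seeds hierarchical agglomerative clustering, might look roughly like the sketch below; the centroid merging, TF-IDF features, and cosine linkage are simplifying assumptions rather than the paper's feature refinement and weighting methods.

```python
# Hedged sketch of "precluster-based" HAC: documents the user has already grouped
# are collapsed into centroid pseudo-documents before agglomerative clustering runs.
# This is a simplified illustration, not the paper's exact techniques.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

docs = [
    "stock market report",           # 0  user says: group A
    "quarterly earnings and stocks", # 1  user says: group A
    "soccer world cup recap",        # 2  user says: group B
    "tennis open final highlights",  # 3  unlabeled
    "central bank interest rates",   # 4  unlabeled
]
partial_clusters = {0: "A", 1: "A", 2: "B"}  # user's partial clustering (preferential info)

X = TfidfVectorizer().fit_transform(docs).toarray()

# Build initial units: one centroid per user-defined group, plus each unlabeled document.
units, members = [], []
for group in sorted(set(partial_clusters.values())):
    idx = [i for i, g in partial_clusters.items() if g == group]
    units.append(X[idx].mean(axis=0))
    members.append(idx)
for i in range(len(docs)):
    if i not in partial_clusters:
        units.append(X[i])
        members.append([i])

unit_labels = AgglomerativeClustering(
    n_clusters=2, linkage="average", metric="cosine"
).fit_predict(np.array(units))

# Map unit labels back to individual documents.
doc_labels = {i: int(lab) for lab, idx in zip(unit_labels, members) for i in idx}
print(doc_labels)
```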
Text documents clustering using data mining techniques
Increasing progress in numerous research fields and information technologies has led to an increase in the publication of research papers, so researchers spend considerable time finding papers close to their field of specialization. In this paper, we propose a document classification approach that clusters the text documents of research papers into meaningful categories, each covering a similar scientific field. The approach is based on the essential focus and scope of the target categories, each of which includes many topics. Accordingly, we extract word tokens related to each specific category from these topics, separately. The frequency of word tokens in a document determines the document's weight, which is calculated using term frequency-inverse document frequency (TF-IDF). The proposed approach uses the title, abstract, and keywords of a paper, in addition to the category topics, to perform the classification. Documents are then classified and clustered into the primary categories based on the highest cosine similarity between the category weights and the document weights.
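A minimal sketch of the assignment step described above (TF-IDF weights plus highest cosine similarity between a paper and each category's topic terms) is given below; the category names and topic word lists are invented for illustration.

```python
# Hedged sketch: assign each paper to the category whose topic terms it is most
# similar to under TF-IDF weighting and cosine similarity. Categories and topic
# lists are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

categories = {
    "machine_learning": "classification clustering neural network training model",
    "databases": "query index transaction sql storage schema",
}
papers = [  # each entry stands for a paper's title + abstract + keywords
    "A deep neural network model for document classification",
    "Efficient query optimization and indexing for sql storage",
]

# Fit one TF-IDF space over category topic descriptions and papers together.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(list(categories.values()) + papers)
cat_vecs, paper_vecs = matrix[: len(categories)], matrix[len(categories):]

sims = cosine_similarity(paper_vecs, cat_vecs)   # papers x categories
assigned = sims.argmax(axis=1)
for paper, idx in zip(papers, assigned):
    print(list(categories)[idx], "<-", paper)
```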
Hybrid Fruit-Fly Optimization Algorithm with K-Means for Text Document Clustering
The fast-growing Internet results in massive amounts of text data. Due to the large volume of the unstructured format of text data, extracting relevant information and its analysis becomes very challenging. Text document clustering is a text-mining process that partitions the set of text-based documents into mutually exclusive clusters in such a way that documents within the same group are similar to each other, while documents from different clusters differ based on the content. One of the biggest challenges in text clustering is partitioning the collection of text data by measuring the relevance of the content in the documents. Addressing this issue, in this work a hybrid swarm intelligence algorithm with a K-means algorithm is proposed for text clustering. First, the hybrid fruit-fly optimization algorithm is tested on ten unconstrained CEC2019 benchmark functions. Next, the proposed method is evaluated on six standard benchmark text datasets. The experimental evaluation on the unconstrained functions, as well as on text-based documents, indicated that the proposed approach is robust and superior to other state-of-the-art methods.
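As a rough illustration of the hybrid idea, the sketch below scores candidate centroid sets by average cosine similarity and refines the best candidate with K-means; plain random search stands in for the fruit-fly optimization algorithm, so this is not the paper's method.

```python
# Hedged sketch of a metaheuristic + K-means hybrid for text clustering: candidate
# centroid sets are scored and the best one is refined by K-means. Random search
# stands in for the fruit-fly optimizer; it is not the paper's algorithm.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "football league results", "tennis grand slam final",
    "stock market rally", "interest rates and inflation",
]
X = TfidfVectorizer().fit_transform(docs).toarray()
k, rng = 2, np.random.default_rng(0)

def fitness(centroids):
    """Average cosine similarity of each document to its nearest centroid."""
    return cosine_similarity(X, centroids).max(axis=1).mean()

# "Swarm" phase (random-search stand-in): propose centroid sets, keep the fittest.
best = max(
    (X[rng.choice(len(X), size=k, replace=False)] for _ in range(30)),
    key=fitness,
)

# Refinement phase: K-means started from the metaheuristic's best centroids.
km = KMeans(n_clusters=k, init=best, n_init=1, random_state=0).fit(X)
print(km.labels_)
```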
Multi-view document clustering based on geometrical similarity measurement
Numerous works have applied multi-view clustering algorithms to document clustering. A challenging problem in document clustering is the choice of similarity metric. Existing multi-view document clustering methods broadly use two measures: cosine similarity (CS) and Euclidean distance (ED). The first does not consider the magnitude difference between two vectors; the second cannot distinguish between pairs of vectors that share the same ED. In this paper, we introduce five new similarity metrics. This approach avoids the drawbacks of CS and ED by computing the divergence between documents with the same ED while also taking their magnitudes into account. Furthermore, we propose a multi-view document clustering scheme based on the proposed similarity metrics. First, CS, ED, the triangle's area similarity and sector's area similarity metric, and our five similarity metrics are applied to every view of a dataset to generate a corresponding similarity matrix. We then run clustering algorithms on these similarity matrices to evaluate single-view performance. Next, we aggregate the similarity matrices into a unified similarity matrix and apply a spectral clustering algorithm to it to generate the final clusters. The experimental results show that the proposed similarity functions gauge the similarity between documents more accurately than the existing metrics, and that the proposed clustering scheme considerably surpasses state-of-the-art algorithms.
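The final aggregation step described above (one similarity matrix per view, combined into a unified matrix and fed to spectral clustering) might look roughly like the following sketch; the simple averaging rule and the two toy views are illustrative assumptions, not the paper's metrics.

```python
# Hedged sketch: build one cosine-similarity matrix per view, average them into a
# unified matrix, and run spectral clustering on it. Averaging and the two toy
# views are illustrative choices, not the paper's exact construction.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import SpectralClustering

docs = [
    "goal scored in the final minute",
    "league standings after the match",
    "parliament passes the new budget",
    "minister comments on the budget vote",
]

# Two "views" of the same documents: word terms and character 3-grams.
views = [
    TfidfVectorizer().fit_transform(docs),
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3)).fit_transform(docs),
]

sim_matrices = [cosine_similarity(v) for v in views]
unified = np.mean(sim_matrices, axis=0)   # aggregate the per-view similarities

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(unified)
print(labels)  # e.g. [0 0 1 1]
```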
Anonymizing and Sharing Medical Text Records
Health information technology has increased accessibility of health and medical data and benefited medical research and healthcare management. However, there are rising concerns about patient privacy in sharing medical and healthcare data. A large amount of these data are in free text form. Existing techniques for privacy-preserving data sharing deal largely with structured data. Current privacy approaches for medical text data focus on detection and removal of patient identifiers from the data, which may be inadequate for protecting privacy or preserving data quality. We propose a new systematic approach to extract, cluster, and anonymize medical text records. Our approach integrates methods developed in both data privacy and health informatics fields. The key novel elements of our approach include a recursive partitioning method to cluster medical text records based on the similarity of the health and medical information and a value-enumeration method to anonymize potentially identifying information in the text data. An experimental study is conducted using real-world medical documents. The results of the experiments demonstrate the effectiveness of the proposed approach.
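A loose sketch of the recursive-partitioning idea for grouping similar text records is shown below: each group is bisected with 2-means until it falls under a size threshold. The TF-IDF features, the threshold, and the stopping rule are assumptions, and the value-enumeration anonymization step is not shown.

```python
# Hedged sketch of recursive partitioning of text records: repeatedly split a group
# in two with 2-means until it falls below a size threshold. The threshold and the
# TF-IDF features are illustrative assumptions; anonymization itself is omitted.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

records = [
    "patient reports chest pain and shortness of breath",
    "chest pain with elevated heart rate noted",
    "fracture of the left wrist after a fall",
    "wrist injury treated with a cast",
    "seasonal allergies and mild congestion",
]
X = TfidfVectorizer().fit_transform(records)

def partition(indices, max_size=2):
    """Recursively bisect a group of record indices until each group is small."""
    if len(indices) <= max_size:
        return [indices]
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[indices])
    left = [i for i, lab in zip(indices, labels) if lab == 0]
    right = [i for i, lab in zip(indices, labels) if lab == 1]
    if not left or not right:   # degenerate split: stop recursing
        return [indices]
    return partition(left, max_size) + partition(right, max_size)

print(partition(list(range(len(records)))))
```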
Swarm Intelligence Algorithms in Text Document Clustering with Various Benchmarks
Text document clustering refers to the unsupervised classification of textual documents into clusters based on content similarity and can be applied to tasks such as search optimization and extracting hidden information from data generated by IoT sensors. Swarm intelligence (SI) algorithms rely on stochastic and heuristic principles in which simple, unintelligent individuals follow simple rules to accomplish very complex tasks. By mapping the features of a problem to the parameters of an SI algorithm, SI algorithms can reach solutions in a flexible, robust, decentralized, and self-organized manner. Compared to traditional clustering algorithms, these solving mechanisms make swarm algorithms well suited to complex document clustering problems. However, each SI algorithm performs differently according to its own strengths and weaknesses. In this paper, to find the best-performing SI algorithm for text document clustering, we performed a comparative study of the PSO, bat, grey wolf optimization (GWO), and K-means algorithms on six datasets of various sizes created from BBC Sport news and 20 newsgroups. Based on our experimental results, we relate the features of the document clustering problem to the nature of SI algorithms and conclude that the PSO and GWO algorithms outperform K-means, with PSO performing best in terms of finding the optimal solution.
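The evaluation side of such a comparison can be sketched as below, scoring several clusterers against known labels with a purity measure; off-the-shelf K-means and agglomerative clustering stand in for the swarm algorithms, and the toy labeled sentences replace the BBC Sport and 20 newsgroups data.

```python
# Hedged sketch of the evaluation harness: run several clusterers on the same
# TF-IDF matrix and score each against known labels with purity. K-means and
# agglomerative clustering stand in for the PSO/bat/GWO algorithms.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering

docs = ["cricket match report", "football cup final",
        "election results announced", "parliament debates the bill"]
true_labels = np.array([0, 0, 1, 1])
X = TfidfVectorizer().fit_transform(docs).toarray()

def purity(truth, pred):
    """Fraction of documents assigned to the majority true class of their cluster."""
    total = sum(np.bincount(truth[pred == c]).max() for c in np.unique(pred))
    return total / len(truth)

clusterers = {
    "kmeans": KMeans(n_clusters=2, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=2),
}
for name, algo in clusterers.items():
    pred = algo.fit_predict(X)
    print(name, round(purity(true_labels, pred), 3))
```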
WEClustering: word embeddings based text clustering technique for large datasets
A massive amount of textual data now exists in digital repositories in the form of research articles, news articles, reviews, Wikipedia articles, books, and so on. Text clustering is a fundamental data mining technique for categorization, topic extraction, and information retrieval. Textual datasets, especially those containing a large number of documents, are sparse and high-dimensional, so traditional clustering techniques such as K-means, agglomerative clustering, and DBSCAN do not perform well on them. In this paper, a clustering technique especially suited to large text datasets, named WEClustering, is proposed to overcome these limitations. The technique is based on word embeddings derived from the deep learning model Bidirectional Encoder Representations from Transformers (BERT). It deals with the problem of high dimensionality in an effective manner, and hence forms more accurate clusters. The technique is validated on several datasets of varying sizes, and its performance is compared with other widely used and state-of-the-art clustering techniques. The experimental comparison shows that the proposed technique gives a significant improvement over the others as measured by metrics such as purity and the Adjusted Rand Index.
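A rough sketch of the embed-then-cluster idea is given below using the sentence-transformers library to obtain BERT-based document vectors; the model name, the PCA step, and the K-means clusterer are assumptions chosen for illustration and are not WEClustering's actual pipeline.

```python
# Hedged sketch of embedding-based text clustering: encode documents with a
# BERT-family sentence encoder, reduce dimensionality, then cluster. The model
# name, PCA step, and K-means choice are illustrative, not WEClustering itself.
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

docs = [
    "The central bank raised interest rates again.",
    "Inflation and monetary policy dominate the markets.",
    "The striker scored twice in the cup final.",
    "A late goal decided the championship match.",
]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)  # dense vectors
reduced = PCA(n_components=2).fit_transform(embeddings)            # tame dimensionality
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
print(labels)  # e.g. [0 0 1 1]
```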