Catalogue Search | MBRL

Text Classification: How Machine Learning Is Revolutionizing Text Categorization

by Akinwolere, Kehinde , Allam, Hesham , Makubvure, Lisa in Accuracy , Adaptability , Algorithms

2025

The automated classification of texts into predefined categories has become increasingly prominent, driven by the exponential growth of digital documents and the demand for efficient organization. This paper serves as an in-depth survey of text classification and machine learning, consolidating diverse aspects of the field into a single, comprehensive resource—a rarity in the current body of literature. Few studies have achieved such breadth, and this work aims to provide a unified perspective, offering a significant contribution to researchers and the academic community. The survey examines the evolution of machine learning in text categorization (TC), highlighting its transformative advantages over manual classification, such as enhanced accuracy, reduced labor, and adaptability across domains. It delves into various TC tasks and contrasts machine learning methodologies with knowledge engineering approaches, demonstrating the strengths and flexibility of data-driven techniques. Key applications of TC are explored, alongside an analysis of critical machine learning methods, including document representation techniques and dimensionality reduction strategies. Moreover, this study evaluates a range of text categorization models, identifies persistent challenges like class imbalance and overfitting, and investigates emerging trends shaping the future of the field. It discusses essential components such as document representation, classifier construction, and performance evaluation, offering a well-rounded understanding of the current state of TC. Importantly, this paper also provides clear research directions, emphasizing areas requiring further innovation, such as hybrid methodologies, explainable AI (XAI), and scalable approaches for low-resource languages. 
By bridging gaps in existing knowledge and suggesting actionable paths forward, this work positions itself as a vital resource for academics and industry practitioners, fostering deeper exploration and development in text classification.

Journal Article

Chinese text classification based on attention mechanism and feature-enhanced fusion neural network

by Wang, Yujing , Hou Yongjin , Li Baiwei in Algorithms , Artificial neural networks , Chinese languages

2020

Because key features are unevenly distributed in Chinese texts, they play different roles in text recognition in Chinese text classification tasks. We propose a feature-enhanced fusion model based on the attention mechanism for Chinese text classification, built from a long short-term memory (LSTM) network and a convolutional neural network (CNN), together with a feature-difference enhancement attention algorithm model. During preprocessing, the Chinese text is digitized into a vector form containing some semantic context information and fed into the embedding layer to train and test the neural network. The feature-enhanced fusion model uses double-layer LSTM and CNN modules to enhance the fusion of the text features extracted by the attention mechanism before they are passed to the classifier. The feature-difference enhancement attention algorithm model not only adds more weight to important text features but also strengthens the differences between them and other text features. This operation can further improve the effect of important features on Chinese text recognition. Both models use the softmax function for classification. The text classification experiments are conducted on a Chinese text corpus. The experimental results show that, compared with the contrast model, the proposed algorithm can significantly improve the recognition ability of Chinese text features.

Journal Article

Learning metric space with distillation for large-scale multi-label text classification

by Zhou, Lihua , Du, Guowang , Wu, Hao in Accuracy , Artificial Intelligence , Artificial neural networks

2023

Deep neural network-based methods have achieved outstanding results in the task of text classification. However, most existing methods have not thoroughly investigated the text–label and label–label relationships. Furthermore, these methods have excessive computational and memory overhead for large-scale classification. To address these challenges, we propose a novel framework with metric learning and knowledge distillation. We first project the texts and labels into the same embedding space by utilizing the symmetry metric learning on both text-centric and label-centric relationships. Then the distillation component is introduced to learn the text representation features with a deep module. Finally, we use this distilled module to encode new text and make predictions with label embeddings in the metric space. Experimental results on four real datasets show that our model achieves very competitive prediction accuracy while improving training and prediction efficiency.
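
The final prediction step described above, matching an encoded text against label embeddings in a shared space, can be sketched as a nearest-label lookup. This is only an illustration under stated assumptions: Euclidean distance, the `predict_labels` helper, and the toy two-dimensional embeddings are hypothetical and are not taken from the article.

```python
import math


def predict_labels(text_vec, label_vecs, top_k=2):
    """Rank labels by Euclidean distance to the text embedding (assumed metric)
    and return the names of the top_k closest labels."""
    ranked = sorted(
        label_vecs,
        key=lambda name: math.dist(text_vec, label_vecs[name]),
    )
    return ranked[:top_k]


# Hypothetical label embeddings and one encoded text in the same 2-D space.
label_vecs = {
    "sports": (1.0, 0.0),
    "finance": (0.0, 1.0),
    "health": (-1.0, 0.0),
}
print(predict_labels((0.9, 0.2), label_vecs))  # ['sports', 'finance']
```

Because prediction only needs distances to the label embeddings, the cost per text grows with the number of labels rather than with the size of a per-label classifier, which is one plausible reading of the efficiency claim.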

Journal Article

An Arabic text categorization approach using term weighting and multiple reducts

by Al-Radaideh, Qasem A. , Al-Abrat, Mohammed A. in Algorithms , Approximation , Arabic language

2019

Text categorization is the process of assigning a predefined category label to an unlabeled document based on its content. One of the challenges of automatic text categorization is the high dimensionality of data that may affect the performance of the categorization model. This paper proposed an approach for the categorization of Arabic text based on term weighting and the reduct concept of the rough set theory to reduce the number of terms used to generate the classification rules that form the classifier. The paper proposed a multiple minimal reduct extraction algorithm by improving the Quick reduct algorithm. The multiple reducts are used to generate the set of classification rules which represent the rough set classifier. To evaluate the proposed approach, an Arabic corpus of 2700 documents in nine categories is used. In the experiment, we compared the results of the proposed approach when using multiple and single minimal reducts. The results showed that the proposed approach achieved an accuracy of 94% when using multiple reducts, outperforming the single reduct method, which achieved an accuracy of 86%. The results of the experiments also showed that the proposed approach outperforms both the K-NN and J48 algorithms regarding classification accuracy using the dataset on hand.
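
The reduct idea can be illustrated with the basic greedy QuickReduct search from rough set theory. The sketch below is not the paper's multiple-reduct algorithm; the toy decision table, the term names, and the function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def dependency(rows, attrs, decision):
    """Rough-set dependency degree: the fraction of rows whose equivalence
    class under `attrs` maps to a single decision value."""
    if not rows:
        return 0.0
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in attrs)
        classes[key].append(row[decision])
    consistent = sum(len(d) for d in classes.values() if len(set(d)) == 1)
    return consistent / len(rows)


def quick_reduct(rows, conditions, decision):
    """Greedy QuickReduct: keep adding the attribute that raises the
    dependency degree most until it matches that of the full attribute set."""
    target = dependency(rows, conditions, decision)
    reduct = []
    while dependency(rows, reduct, decision) < target:
        best = max(
            (a for a in conditions if a not in reduct),
            key=lambda a: dependency(rows, reduct + [a], decision),
        )
        reduct.append(best)
    return reduct


# Hypothetical decision table: 1 means the term occurs in the document.
rows = [
    {"the": 1, "match": 1, "goal": 1, "bank": 0, "category": "sports"},
    {"the": 1, "match": 0, "goal": 1, "bank": 0, "category": "sports"},
    {"the": 1, "match": 0, "goal": 0, "bank": 1, "category": "finance"},
    {"the": 0, "match": 1, "goal": 0, "bank": 1, "category": "finance"},
]
print(quick_reduct(rows, ["the", "match", "goal", "bank"], "category"))  # ['goal']
```

In this toy table a single term already separates the two categories, so the other three terms can be dropped before rules are generated. Note that "bank" would serve equally well, which hints at why extracting several minimal reducts rather than one can be useful.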

Journal Article

Competitive Particle Swarm Optimization for Multi-Category Text Feature Selection

by Kim, Hae-Cheon , Lee, Jaesung , Park, Jaegyun in Accuracy , Algorithms , Classification

2019

Multi-label feature selection is an important task for text categorization. This is because it enables learning algorithms to focus on essential features that foreshadow relevant categories, thereby improving the accuracy of text categorization. Recent studies have considered the hybridization of evolutionary feature wrappers and filters to enhance the evolutionary search process. However, the relative effectiveness of feature subset searches of evolutionary and feature filter operators has not been considered. This results in degenerated final feature subsets. In this paper, we propose a novel hybridization approach based on competition between the operators. This enables the proposed algorithm to apply each operator selectively and modify the feature subset according to its relative effectiveness, unlike conventional methods. The experimental results on 16 text datasets verify that the proposed method is superior to conventional methods.

Journal Article

Deep Neural Models and Retrofitting for Arabic Text Categorization

by El-Alami, Fatima-Zahra , En-Nahnahi, Noureddine , El Alaoui, Said Ouatik in Arabic language , Artificial neural networks , Classification

2020

Arabic text categorization is an important task in text mining, particularly with the fast-increasing quantity of Arabic online data. Deep neural network models have shown promising performance and indicated great data modeling capacities in managing large and substantial datasets. This article investigates convolutional neural networks (CNNs), long short-term memory (LSTM) and their combination for Arabic text categorization. This work additionally handles the morphological variety of Arabic words by exploring the word embeddings model using position weights and subword information. To guarantee the nearest vector representations for connected words, this article adopts a strategy for refining Arabic vector space representations using semantic information embedded in lexical resources. Several experiments utilizing different architectures have been conducted on the OSAC dataset. The obtained results show the effectiveness of CNN-LSTM with and without retrofitting for Arabic text categorization in comparison with major competing methods.
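
Refining word vectors with a lexical resource can be sketched with a commonly used form of the retrofitting update, in which each vector is repeatedly averaged with its lexicon neighbours while staying anchored to its original value. Whether the article uses these exact weights is an assumption; the `retrofit` function, the weights, and the toy vectors and lexicon are hypothetical.

```python
def retrofit(vectors, lexicon, iterations=10, alpha=1.0):
    """Pull each word vector toward its lexicon neighbours while keeping it
    close to its original value. Neighbour weights are 1/degree (assumed)."""
    new = {w: list(v) for w, v in vectors.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            linked = [n for n in neighbours if n in new]
            if word not in new or not linked:
                continue
            beta = 1.0 / len(linked)
            dim = len(vectors[word])
            new[word] = [
                (alpha * vectors[word][d] + beta * sum(new[n][d] for n in linked))
                / (alpha + beta * len(linked))
                for d in range(dim)
            ]
    return new


# Hypothetical 2-D embeddings for three transliterated Arabic words.
vectors = {
    "kitab": [1.0, 0.0],   # "book"
    "kutub": [0.0, 1.0],   # "books"
    "bahr": [-1.0, 0.0],   # "sea"
}
# Hypothetical lexical resource linking the two related words.
lexicon = {"kitab": ["kutub"], "kutub": ["kitab"]}

refined = retrofit(vectors, lexicon)
print(refined["kitab"], refined["kutub"], refined["bahr"])
```

After the update, the two linked words sit closer together than before, while the unlinked word keeps its original vector.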

Journal Article

An enhanced short text categorization model with deep abundant representation

by Gu, Yanhui , Long, Yi , Xu, Guandong in Classification , Data base management systems , Information retrieval

2018

Short text categorization is a crucial issue in many applications, e.g., Information Retrieval, Question-Answering Systems, MRI Database Construction and so forth. Much research focuses on data sparsity and ambiguity issues in short text categorization. To tackle these issues, we propose a novel short text categorization strategy based on abundant representation, which utilizes a Bi-directional Recurrent Neural Network (Bi-RNN) with Long Short-Term Memory (LSTM) and a topic model to capture more contextual and semantic information. The Bi-RNN enriches contextual information, and the topic model discovers more latent semantic information for abundant text representation of short text. Experimental results demonstrate that the proposed model is comparable to state-of-the-art neural network models and that the proposed method is effective.

Journal Article

NADA: New Arabic Dataset for Text Classification

by Alalyani, Nada , Larabi, Souad in Classification , Datasets , Natural language

2018

In recent years, Arabic Natural Language Processing, including Text Summarization, Text Simplification, Text Categorization and other Natural Language-related disciplines, has been attracting more researchers. Appropriate resources for Arabic Text Categorization are becoming a necessity for the development of this research. The few existing corpora are not ready for use; they require preprocessing and filtering operations. In addition, most of them are not organized based on standard classification methods, which leads to unbalanced classes and thus reduces the classification accuracy. This paper proposes a New Arabic Dataset (NADA) for Text Categorization. This corpus is composed of two existing corpora, OSAC and DAA. The new corpus is preprocessed and filtered using recent state-of-the-art methods. It is also organized based on the Dewey decimal classification scheme and the Synthetic Minority Over-Sampling Technique. The experimental results show that NADA is an efficient dataset ready for use in Arabic Text Categorization.
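
The Synthetic Minority Over-Sampling Technique mentioned above balances classes by creating new minority-class points between existing ones. The following is a minimal sketch of that general idea on numeric feature vectors, not the procedure used to build NADA; the `smote` function, its parameters, and the toy points are hypothetical.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each interpolated between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical minority-class feature vectors (at least two are required).
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = smote(minority, n_new=6)
print(len(new_points))  # 6
```

Each synthetic point lies on a line segment between two real minority samples, so the minority class grows without exact duplicates being added.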

Journal Article

Text Classification Algorithms: A Survey

by Mendu, Sanjana , Kowsari, Kamran , Barnes, Laura in Algorithms , Classification , Data mining

2019

In recent years, there has been an exponential growth in the number of complex documents and texts that require a deeper understanding of machine learning methods to be able to accurately classify texts in many applications. Many machine learning approaches have achieved surpassing results in natural language processing. The success of these learning algorithms relies on their capacity to understand complex models and non-linear relationships within data. However, finding suitable structures, architectures, and techniques for text classification is a challenge for researchers. In this paper, a brief overview of text classification algorithms is presented. This overview covers different text feature extractions, dimensionality reduction methods, existing algorithms and techniques, and evaluation methods. Finally, the limitations of each technique and their application in real-world problems are discussed.

Journal Article
