Search Results

5,380 results for "document classification"
Introducing RAPTOR: RevMan Parsing Tool for Reviewers
Background Much effort is made to ensure Cochrane reviews are based on reliably extracted data. There is a commitment to wide access to these data—for novel processing and/or reuse—but delivering this access is problematic. Aim To describe a proof-of-concept programme to extract, curate and structure data from Cochrane reviews. Methods One student of Applied Sciences (16 weeks full time), access to pre-publication review files and use of ‘Eclipse’ to create an open-access tool (RAPTOR) using the programming language Java. Results The final software batch processes hundreds of reviews in seconds, extracting all study data and automatically tidying and unifying presentation of data for return into the source review, reuse, or export for novel analyses. Conclusions This software, despite being limited, illustrates how the efforts of reviewers meticulously extracting study data can be improved, disseminated and reused with little additional effort.
DocXClassifier: towards a robust and interpretable deep neural network for document image classification
Model interpretability and robustness are becoming increasingly critical today for the safe and practical deployment of deep learning (DL) models in industrial settings. As DL-backed automated document processing systems become increasingly common in business workflows, there is a pressing need today to enhance interpretability and robustness for the task of document image classification, an integral component of such systems. Surprisingly, while much research has been devoted to improving the performance of deep models for this task, little attention has been given to their interpretability and robustness. In this paper, we aim to improve upon both aspects and introduce two inherently interpretable deep document classifiers, DocXClassifier and DocXClassifierFPN, both of which not only achieve significant performance improvements over existing approaches but also hold the capability to simultaneously generate feature importance maps while making their predictions. Our approach involves integrating a convolutional neural network (ConvNet) backbone with an attention mechanism to perform weighted aggregation of features based on their importance to the class, enabling the generation of interpretable importance maps. Additionally, we propose integrating Feature Pyramid Networks with the attention mechanism to significantly enhance the resolution of the interpretability maps, especially for pyramidal ConvNet architectures. Our approach attains state-of-the-art performance in image-based classification on two popular document datasets, RVL-CDIP and Tobacco3482, with top-1 classification accuracies of 94.19% and 95.71%, respectively. Additionally, it sets a new record for the highest image-based classification accuracy on Tobacco3482 without transfer learning from RVL-CDIP, at 90.29%. In addition, our proposed training strategy demonstrates superior robustness compared to existing approaches, significantly outperforming them on 19 out of 21 different types of novel data distortions, while achieving comparable results on the remaining two. By combining robustness with interpretability, DocXClassifier presents a promising step toward the practical deployment of DL models for document classification tasks.
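As a rough illustration of the attention-weighted feature aggregation this abstract describes, the following minimal sketch pairs a ConvNet backbone with a single-layer spatial attention head whose weights double as a per-location importance map. The backbone choice, feature dimensions, and class count are assumptions for illustration, not the authors' configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class AttentionDocClassifier(nn.Module):
    """Hypothetical attention-pooled document image classifier (not DocXClassifier itself)."""
    def __init__(self, num_classes: int = 16):
        super().__init__()
        resnet = models.resnet50(weights=None)                         # stand-in ConvNet backbone
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])   # keep the spatial feature map
        self.attn = nn.Conv2d(2048, 1, kernel_size=1)                  # one attention score per location
        self.fc = nn.Linear(2048, num_classes)

    def forward(self, x):
        feats = self.backbone(x)                                       # (B, 2048, H, W)
        scores = self.attn(feats).flatten(2)                           # (B, 1, H*W)
        weights = torch.softmax(scores, dim=-1)
        pooled = (feats.flatten(2) * weights).sum(-1)                  # importance-weighted aggregation
        logits = self.fc(pooled)
        importance_map = weights.view(x.size(0), *feats.shape[-2:])    # reusable as an interpretability map
        return logits, importance_map
```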
Text Classification Algorithms: A Survey
In recent years, there has been an exponential growth in the number of complex documents and texts that require a deeper understanding of machine learning methods to be able to accurately classify texts in many applications. Many machine learning approaches have achieved surpassing results in natural language processing. The success of these learning algorithms relies on their capacity to understand complex models and non-linear relationships within data. However, finding suitable structures, architectures, and techniques for text classification is a challenge for researchers. In this paper, a brief overview of text classification algorithms is discussed. This overview covers different text feature extraction methods, dimensionality reduction methods, existing algorithms and techniques, and evaluation methods. Finally, the limitations of each technique and their application in real-world problems are discussed.
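The stages the survey enumerates (feature extraction, dimensionality reduction, a learning algorithm, and evaluation) can be lined up in a few lines of scikit-learn; the dataset and parameter choices below are illustrative assumptions, not recommendations from the survey.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),   # feature extraction
    TruncatedSVD(n_components=300),          # dimensionality reduction
    LinearSVC(),                             # classification algorithm
)
pipeline.fit(train.data, train.target)
print(classification_report(test.target, pipeline.predict(test.data)))  # evaluation
```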
Pearson Correlation-Based Feature Selection for Document Classification Using Balanced Training
Documents are stored in digital form across many organizations. Printing this volume of data and filing it in folders rather than storing it digitally is impractical from an economic and ecological standpoint, and an efficient way of retrieving data from digitally stored documents is also required. This article presents a real-time supervised learning technique for document classification based on a deep convolutional neural network (DCNN), which aims to reduce the impact of adverse document image issues such as signatures, marks, logos, and handwritten notes. The proposed technique’s major steps include data augmentation, feature extraction using pre-trained neural network models, feature fusion, and feature selection. We propose a novel data augmentation technique, which normalizes the imbalanced dataset using the secondary dataset RVL-CDIP. The DCNN features are extracted using the VGG19 and AlexNet networks. The extracted features are fused, and the fused feature vector is optimized by applying a Pearson correlation coefficient-based technique to select informative features while removing redundant ones. The proposed technique is tested on the Tobacco3482 dataset, which gives a classification accuracy of 93.1% using a cubic support vector machine classifier, proving the validity of the proposed technique.
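The redundancy-removal step can be pictured with a small sketch: compute pairwise Pearson correlations over the fused feature matrix and keep only features that are not highly correlated with an already-kept one. The threshold, feature dimensions, and random data are assumptions; this is not the paper's implementation.

```python
import numpy as np

def pearson_select(features: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Return indices of features to keep after removing highly correlated (redundant) ones."""
    corr = np.abs(np.corrcoef(features, rowvar=False))  # feature-feature correlation matrix
    keep = []
    for j in range(features.shape[1]):
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return np.array(keep)

# Hypothetical fused descriptor: VGG19-like and AlexNet-like features side by side.
vgg_feats = np.random.randn(500, 128)
alex_feats = np.random.randn(500, 64)
fused = np.concatenate([vgg_feats, alex_feats], axis=1)
selected = pearson_select(fused)
print(f"kept {selected.size} of {fused.shape[1]} features")
```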
Explaining Data-Driven Document Classifications
Many document classification applications require human understanding of the reasons for data-driven classification decisions by managers, client-facing employees, and the technical team. Predictive models treat documents as data to be classified, and document data are characterized by very high dimensionality, often with tens of thousands to millions of variables (words). Unfortunately, due to the high dimensionality, understanding the decisions made by document classifiers is very difficult. This paper begins by extending the most relevant prior theoretical model of explanations for intelligent systems to account for some missing elements. The main theoretical contribution is the definition of a new sort of explanation as a minimal set of words (terms, generally), such that removing all words within this set from the document changes the predicted class from the class of interest. We present an algorithm to find such explanations, as well as a framework to assess such an algorithm’s performance. We demonstrate the value of the new approach with a case study from a real-world document classification task: classifying web pages as containing objectionable content, with the goal of allowing advertisers to choose not to have their ads appear on those pages. A second empirical demonstration on news-story topic classification shows the explanations to be concise and document-specific, and to be capable of providing understanding of the exact reasons for the classification decisions, of the workings of the classification models, and of the business application itself. We also illustrate how explaining the classifications of documents can help to improve data quality and model performance.
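A naive greedy version of this kind of explanation can be sketched as follows: repeatedly delete the word whose removal most lowers the probability of the originally predicted class until the prediction flips. The `predict_proba(text) -> {label: probability}` interface is an assumed stand-in, and the paper's own search procedure may differ.

```python
def minimal_flip_set(document: str, predict_proba, max_words: int = 30):
    """Greedily search for a small word set whose removal changes the predicted class."""
    def top_label(text):
        probs = predict_proba(text)
        return max(probs, key=probs.get)

    original = top_label(document)
    words = document.split()
    removed = []
    while words and len(removed) < max_words:
        # Score each candidate word by how far its removal pushes down the original class.
        scored = []
        for w in set(words):
            reduced = " ".join(x for x in words if x != w)
            scored.append((predict_proba(reduced)[original], w))
        _, best = min(scored)
        removed.append(best)
        words = [x for x in words if x != best]
        if top_label(" ".join(words)) != original:
            return removed              # class flipped: this word set is the explanation
    return None                         # no explanation of at most max_words words found
```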
Diagnosis extraction from unstructured Dutch echocardiogram reports using span- and document-level characteristic classification
Background Clinical machine learning research and artificial intelligence-driven clinical decision support models rely on clinically accurate labels. Manually extracting these labels with the help of clinical specialists is often time-consuming and expensive. This study tests the feasibility of automatic span- and document-level diagnosis extraction from unstructured Dutch echocardiogram reports. Methods We included 115,692 unstructured echocardiogram reports from the University Medical Center Utrecht, a large university hospital in the Netherlands. A randomly selected subset was manually annotated for the occurrence and severity of eleven commonly described cardiac characteristics. We developed and tested several automatic labelling techniques at both span and document levels, using weighted and macro F1-score, precision, and recall for performance evaluation. We compared the performance of span labelling against document labelling methods, which included both direct document classifiers and indirect document classifiers that rely on span classification results. Results The SpanCategorizer and MedRoBERTa.nl models outperformed all other span and document classifiers, respectively. The weighted F1-score varied between characteristics, ranging from 0.60 to 0.93 in SpanCategorizer and 0.96 to 0.98 in MedRoBERTa.nl. Direct document classification was superior to indirect document classification using span classifiers. SetFit achieved competitive document classification performance using only 10% of the training data. Utilizing a reduced label set yielded near-perfect document classification results. Conclusion We recommend using our published SpanCategorizer and MedRoBERTa.nl models for span- and document-level diagnosis extraction from Dutch echocardiography reports. For settings with limited training data, SetFit may be a promising alternative for document classification. Future research should be aimed at training a RoBERTa-based span classifier and applying English-based models to translated echocardiogram reports.
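To make the "indirect" route concrete, here is a tiny sketch of collapsing span-level predictions for one characteristic into a document-level label by taking the most severe mention; the label set and aggregation rule are illustrative assumptions, not the study's definitions.

```python
# Hypothetical severity scale for one cardiac characteristic.
SEVERITY_ORDER = ["absent", "mild", "moderate", "severe"]

def document_label(span_predictions: list[str]) -> str:
    """Collapse span-level labels for one characteristic into a single document-level label."""
    if not span_predictions:
        return "absent"
    return max(span_predictions, key=SEVERITY_ORDER.index)

# e.g. spans mentioning the same characteristic at different severities:
print(document_label(["mild", "moderate"]))  # -> "moderate"
```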
Correlation Clustering
We consider the following clustering problem: we have a complete graph on n vertices (items), where each edge (u, v) is labeled either + or - depending on whether u and v have been deemed to be similar or different. The goal is to produce a partition of the vertices (a clustering) that agrees as much as possible with the edge labels. That is, we want a clustering that maximizes the number of + edges within clusters, plus the number of - edges between clusters (equivalently, minimizes the number of disagreements: the number of - edges inside clusters plus the number of + edges between clusters). This formulation is motivated by a document clustering problem in which one has a pairwise similarity function f learned from past data, and the goal is to partition the current set of documents in a way that correlates with f as much as possible; it can also be viewed as a kind of "agnostic learning" problem. An interesting feature of this clustering formulation is that one does not need to specify the number of clusters k as a separate parameter, as in measures such as k-median or min-sum or min-max clustering. Instead, in our formulation, the optimal number of clusters could be any value between 1 and n, depending on the edge labels. We look at approximation algorithms for both minimizing disagreements and for maximizing agreements. For minimizing disagreements, we give a constant-factor approximation. For maximizing agreements we give a PTAS, building on ideas of Goldreich, Goldwasser, and Ron (1998) and de la Vega (1996). We also show how to extend some of these results to graphs with edge labels in [-1, +1], and give some results for the case of random noise.
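The objective being approximated can be stated in a few lines of code: given +/- labels on every pair and a candidate clustering, count the disagreements (minus edges inside clusters plus plus edges across clusters); minimizing this count, or maximizing its complement, is the problem above. The toy graph below is illustrative.

```python
def disagreements(labels: dict, clustering: dict) -> int:
    """labels: {(u, v): '+' or '-'} for each pair; clustering: {vertex: cluster_id}."""
    count = 0
    for (u, v), sign in labels.items():
        same_cluster = clustering[u] == clustering[v]
        if (sign == "-" and same_cluster) or (sign == "+" and not same_cluster):
            count += 1
    return count

labels = {("a", "b"): "+", ("a", "c"): "+", ("b", "c"): "-",
          ("a", "d"): "-", ("b", "d"): "-", ("c", "d"): "-"}
# Clustering {a, b, c} | {d} pays for the single '-' edge inside the first cluster.
print(disagreements(labels, {"a": 0, "b": 0, "c": 0, "d": 1}))  # -> 1
```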
Document Vector Representation with Enhanced Features Based on Doc2VecC
The main purpose of document vectorization is to represent words as a series of vectors that can express the semantics of documents. Whether in Chinese or English, words are the most basic units of text processing. The effectiveness of natural language processing tasks is highly correlated with the document vector representation method. Document vectorization methods include statistical-based methods and neural network-based methods. However, many document vectorization methods are generic and do not distinguish between long and short texts or between English and Chinese usage scenarios, which leads to unsatisfactory document classification results. To address the document feature loss caused by the Doc2VecC model's random word deletion, this paper develops a PV-IDF model with enhanced features and uses the inverse document frequency as the indicator for its candidate word deletion strategy. This speeds up model training and improves the effectiveness of document classification. From the experimental data, the PV-IDF model with enhanced features performs better for both long and short documents, as well as English and Chinese documents, and it has important advantages in terms of algorithm execution efficiency and error rate, particularly for short documents. The proposed method outperforms the Doc2VecC model in each of the five evaluation indicators that evaluate the effect of classification, with the average error rate for short document classification being 41% lower than that of the Doc2VecC model and 45.2% lower than that of the PV-DM model, respectively. Compared with the Doc2VecC model, which can only show high efficiency on small-scale datasets, the PV-IDF model demonstrates high training efficiency on datasets of various scales, outperforming the comparison approach. As a result, the proposed method can provide high-quality vector representations for documents of varying length and enhance the effectiveness of related operations.
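One way to read the deletion strategy (a sketch of the idea, not the paper's code) is that words are dropped with a probability that shrinks as their inverse document frequency grows, so frequent, low-information words are deleted first instead of being removed uniformly at random as in Doc2VecC. The base rate and toy corpus are assumptions.

```python
import math
import random
from collections import Counter

def idf_table(corpus: list) -> dict:
    """Inverse document frequency for every word that appears in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(w for doc in corpus for w in set(doc))
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}

def idf_guided_deletion(doc: list, idf: dict, base_rate: float = 0.5) -> list:
    """Drop words with probability that decreases as their IDF increases."""
    max_idf = max(idf.values()) or 1.0
    kept = []
    for w in doc:
        drop_prob = base_rate * (1.0 - idf.get(w, max_idf) / max_idf)
        if random.random() >= drop_prob:
            kept.append(w)
    return kept

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["a", "cat", "ran"]]
idf = idf_table(corpus)
print(idf_guided_deletion(["the", "cat", "sat"], idf))
```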
Sequence-aware multimodal page classification of Brazilian legal documents
The Brazilian Supreme Court receives tens of thousands of cases each semester. Court employees spend thousands of hours to execute the initial analysis and classification of those cases—which takes effort away from posterior, more complex stages of the case management workflow. In this paper, we explore multimodal classification of documents from Brazil’s Supreme Court. We train and evaluate our methods on a novel multimodal dataset of 6510 lawsuits (339,478 pages) with manual annotation assigning each page to one of six classes. Each lawsuit is an ordered sequence of pages, which are stored both as an image and as a corresponding text extracted through optical character recognition. We first train two unimodal classifiers: A ResNet pre-trained on ImageNet is fine-tuned on the images, and a convolutional network with filters of multiple kernel sizes is trained from scratch on document texts. We use them as extractors of visual and textual features, which are then combined through our proposed fusion module. Our fusion module can handle missing textual or visual input by using learned embeddings for missing data. Moreover, we experiment with bidirectional long short-term memory (biLSTM) networks and linear-chain conditional random fields to model the sequential nature of the pages. The multimodal approaches outperform both textual and visual classifiers, especially when leveraging the sequential nature of the pages.
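A minimal sketch of the missing-modality handling might look like the following: the fusion module holds one learned placeholder vector per modality and substitutes it whenever a page lacks OCR text or an image. The dimensions, per-page (batch size 1) interface, and class count are assumptions, not the authors' design.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Hypothetical late-fusion head with learned embeddings for missing modalities."""
    def __init__(self, img_dim=2048, txt_dim=300, out_dim=512, num_classes=6):
        super().__init__()
        self.missing_img = nn.Parameter(torch.zeros(img_dim))  # learned "no image" vector
        self.missing_txt = nn.Parameter(torch.zeros(txt_dim))  # learned "no text" vector
        self.proj = nn.Linear(img_dim + txt_dim, out_dim)
        self.head = nn.Linear(out_dim, num_classes)

    def forward(self, img_feats, txt_feats):
        # Either argument may be None when the page image or OCR text is unavailable
        # (processing one page at a time for simplicity).
        if img_feats is None:
            img_feats = self.missing_img.unsqueeze(0)
        if txt_feats is None:
            txt_feats = self.missing_txt.unsqueeze(0)
        fused = torch.relu(self.proj(torch.cat([img_feats, txt_feats], dim=-1)))
        return self.head(fused)
```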
Enhanced effective convolutional attention network with squeeze-and-excitation inception module for multi-label clinical document classification
Clinical Document Classification (CDC) is crucial in healthcare for organizing and categorizing large volumes of medical information, leading to improved patient care, streamlined research, and enhanced administrative efficiency. With the advancement of artificial intelligence, automatic CDC is now achievable through deep learning techniques. While existing research has shown promising results, more effective and accurate classification of long clinical documents is still desired. To address this, we propose a new model called the Enhanced Effective Convolutional Attention Network (EECAN), which incorporates a Squeeze-and-Excitation (SE) Inception module to improve feature representation by adaptively recalibrating channel-wise feature responses. Within this architecture, an Encoder and Attention-Based Clinical Document Classification (EAB-CDC) strategy uses sum-pooling and multi-layer attention mechanisms to extract salient features from clinical document representations; EAB-CDC is not a standalone model but a functional component of EECAN for discriminative feature extraction. With this integrated design, EECAN can represent both the general and label-specific contexts of multi-label clinical texts without losing information. Our empirical study, conducted on benchmark datasets such as MIMIC-III and MIMIC-III-50, demonstrates that the proposed EECAN model outperforms several existing deep learning approaches, achieving AUC scores of 99.70% and 99.80% using sum-pooling and multi-layer attention, respectively. These results highlight the model’s substantial potential for integration into clinical systems, such as Electronic Health Record (EHR) platforms, for the automated classification of clinical texts and improved healthcare decision-making support.
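For reference, the channel-recalibration idea behind a squeeze-and-excitation block can be sketched for 1-D convolutional text features as below; the channel count and reduction ratio are illustrative, and this is not the paper's SE Inception module itself.

```python
import torch
import torch.nn as nn

class SEBlock1d(nn.Module):
    """Squeeze-and-excitation over the channel dimension of 1-D conv features."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):                      # x: (batch, channels, length)
        squeezed = x.mean(dim=-1)              # squeeze: global average over positions
        scale = torch.sigmoid(self.fc2(torch.relu(self.fc1(squeezed))))
        return x * scale.unsqueeze(-1)         # excite: recalibrate channel responses

x = torch.randn(2, 128, 50)
print(SEBlock1d(128)(x).shape)                 # torch.Size([2, 128, 50])
```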