Catalogue Search | MBRL
Explore the vast range of titles available.
1,793 result(s) for "Text extraction"
A Novel Approach for Semantic Extractive Text Summarization
by Nisar, Kashif; Andleeb Siddiqui, Maria; Naz, Laviza Falak
in Algorithms, Retention, semantic text extraction
2022
Text summarization is a technique for condensing a long text or document. It becomes critical when someone needs a quick and accurate summary of very long content, and manual summarization can be expensive and time-consuming. While summarizing, important content such as information, concepts, and features of the document can be lost: if informative sentences are dropped, the retention ratio suffers, and if more information is kept, the text grows longer and the compression ratio suffers. There is therefore a tradeoff between the two ratios (compression and retention). The proposed model collects the informative sentences by keeping only the longer sentences and removing short ones, at a modest cost in compression ratio. It balances the retention ratio by avoiding textual redundancy and filters irrelevant information by removing outliers. It emits sentences in the chronological order in which they appear in the original document, and it uses a heuristic approach to select the best cluster, i.e. the group containing the most meaningful sentences, which appear at the top of the summary. The proposed extractive summarizer overcomes these deficiencies and balances the compression and retention ratios.
Journal Article
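A minimal sketch of the compression/retention tradeoff described in the record above, assuming a simple TF-IDF sentence scorer rather than the paper's clustering-based model; the sentence count and length threshold are illustrative, not the authors' settings.

```python
# Score sentences by TF-IDF weight, drop very short sentences (compression),
# keep the top-k informative ones (retention), and emit them in the order
# they appear in the source document.
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def extractive_summary(document: str, k: int = 3, min_words: int = 8) -> str:
    # Naive sentence splitting; a production system would use a proper tokenizer.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    # Compression step: discard sentences too short to be informative.
    candidates = [(i, s) for i, s in enumerate(sentences) if len(s.split()) >= min_words]
    if not candidates:
        return document
    texts = [s for _, s in candidates]
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    # Retention step: rank candidate sentences by their summed TF-IDF weight.
    scores = np.asarray(tfidf.sum(axis=1)).ravel()
    top = sorted(np.argsort(scores)[::-1][:k])   # keep chronological order
    return " ".join(candidates[i][1] for i in top)
```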
Shaping the Future of Business Sustainability: LDA Topic Modeling Insights, Definitions, and Research Agenda
2025
This article offers a comprehensive overview of Business Sustainability (BuS), and directly addresses the lack of consensus around this important concept. Through a mixed-methods approach, we conduct the first systematic literature review of BuS employing Latent Dirichlet Allocation (LDA) topic modeling to uncover hidden thematic structures, Narrative Synthesis to refine and extend BuS definitions within different contexts, and the LDA-HSIM method to classify topics and design a new framework. We analyzed an extensive dataset comprising 92,311 articles sourced from 11,579 journal outlets. From this dataset, we identified 9,561 articles suitable for LDA topic modeling by applying funnel criteria, focusing on articles with clear theoretical underpinnings. A text extraction technique enabled us to identify and analyze theories used in BuS studies. This analysis revealed 150 underlying theories that advance the BuS concept across different research topics. The study contributes to BuS theory development with great potential to improve ethical decision-making by establishing meaningful, context-specific definitions and providing clear guidance for future researchers in selecting appropriate theoretical perspectives for their work. We identify research gaps, propose a prioritized research agenda focused on theory development, and formulate key implications for practitioners and policymakers. This study demonstrates the effectiveness of machine learning methods in conducting large-scale literature reviews to accelerate theoretical advancements and generate research agendas.
Journal Article
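A hedged sketch of the LDA topic-modelling step the article applies, using scikit-learn on a toy corpus; the documents and topic count are invented stand-ins for the 9,561-article dataset.

```python
# Fit an LDA model on a document-term matrix and print the top terms per topic.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "corporate sustainability reporting and stakeholder theory",
    "circular economy supply chain environmental performance",
    "green innovation and firm financial performance",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)                      # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {topic_idx}: {', '.join(top_terms)}")
```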
The cultural environment: measuring culture with big data
2014
The rise of the Internet, social media, and digitized historical archives has produced a colossal amount of text-based data in recent years. While computer scientists have produced powerful new tools for automated analyses of such "big data," they lack the theoretical direction necessary to extract meaning from them. Meanwhile, cultural sociologists have produced sophisticated theories of the social origins of meaning, but lack the methodological capacity to explore them beyond micro-levels of analysis. I propose a synthesis of these two fields that adjoins conventional qualitative methods and new techniques for automated analysis of large amounts of text in iterative fashion. First, I explain how automated text extraction methods may be used to map the contours of cultural environments. Second, I discuss the potential of automated text-classification methods to classify different types of culture such as frames, schema, or symbolic boundaries. Finally, I explain how these new tools can be combined with conventional qualitative methods to trace the evolution of such cultural elements over time. While my assessment of the integration of big data and cultural sociology is optimistic, my conclusion highlights several challenges in implementing this agenda. These include a lack of information about the social context in which texts are produced, the construction of reliable coding schemes that can be automated algorithmically, and the relatively high entry costs for cultural sociologists who wish to develop the technical expertise currently necessary to work with big data.
Journal Article
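As a rough illustration of the automated text-classification step the article envisions (coding frames or schemas at scale), here is a minimal supervised sketch with scikit-learn; the texts and frame labels are invented placeholders, not material from the study.

```python
# Train a TF-IDF + logistic regression classifier on hand-coded examples,
# then label a new text automatically.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "hard work and personal responsibility define success",
    "structural barriers limit opportunity for many families",
    "individual effort is rewarded in the end",
    "inequality is reproduced across generations",
]
frames = ["individualist", "structural", "individualist", "structural"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, frames)
print(clf.predict(["success comes from determination and effort"]))
```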
Transforming images into words: optical character recognition solutions for image text extraction
2025
Optical character recognition (OCR) is one of the most useful advances in today’s emerging technology; in recent years it has proven remarkably effective at converting the textual information in images or physical documents into text data, making it useful for analysis, automation, and improved productivity across different purposes. This paper presents the design, development, and implementation of a novel OCR tool aimed at text extraction and recognition tasks. The tool incorporates advanced techniques such as computer vision and natural language processing (NLP), which offer strong performance across various document types. The performance of the tool is evaluated against metrics including accuracy, speed, and document format compatibility. The developed OCR tool achieves an accuracy of 98.8%, with a character error rate (CER) of 2.4% and a word error rate (WER) of 2.8%. The OCR tool finds applications in document digitization, personal identification, archival of valuable documents, invoice processing, and other document workflows. It holds immense value for researchers, practitioners, and organizations seeking effective techniques for relevant and accurate text extraction and recognition tasks.
Journal Article
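A hedged sketch of an OCR extraction step plus the error metrics the paper reports (CER and WER). It assumes the Tesseract engine and the pytesseract binding are available; the image path is hypothetical, and the 98.8% accuracy figure is the paper's result, not this code's.

```python
# Extract text from a scanned image and compute character/word error rates
# against a reference transcription.
import pytesseract
from PIL import Image

def levenshtein(a, b) -> int:
    # Classic dynamic-programming edit distance over characters or tokens.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cer(hyp: str, ref: str) -> float:
    return levenshtein(hyp, ref) / max(len(ref), 1)

def wer(hyp: str, ref: str) -> float:
    return levenshtein(hyp.split(), ref.split()) / max(len(ref.split()), 1)

text = pytesseract.image_to_string(Image.open("scanned_page.png"))  # hypothetical file
print(cer(text, "ground truth text"), wer(text, "ground truth text"))
```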
A review of deep learning methods for digitisation of complex documents and engineering diagrams
by Francisco Moreno-García, Carlos; Jamieson, Laura; Elyan, Eyad
in Acknowledgment, Annotations, Artificial Intelligence
2024
This paper presents a review of deep learning on engineering drawings and diagrams. These are typically complex diagrams that contain a large number of different shapes, such as text annotations, symbols, and connectivity information (largely lines). Digitising these diagrams essentially means the automatic recognition of all these shapes. Initial digitisation methods were based on traditional approaches, which proved challenging as these methods rely heavily on hand-crafted features and heuristics. In the past five years, however, there has been a significant increase in the number of deep learning-based methods proposed for engineering diagram digitisation. We present a comprehensive and critical evaluation of the existing literature that has used deep learning-based methods to automatically process and analyse engineering drawings. Key aspects of the digitisation process, such as symbol recognition, text extraction, and connectivity information detection, are presented and thoroughly discussed. The review is presented in the context of a wide range of applications across different industry sectors, such as the Oil and Gas, Architectural, and Mechanical sectors, amongst others. The paper also outlines several key challenges, namely the lack of datasets, data annotation, evaluation and class imbalance. Finally, the latest developments in digitising engineering drawings are summarised, conclusions are drawn, and interesting future research directions to accelerate research and development in this area are outlined.
Journal Article
Intelligent analysis of android application privacy policy and permission consistency
2024
With the continuous development of mobile devices, mobile applications bring great convenience to people’s lives, but the abuse of device permissions creates a risk of privacy leakage. Existing detection techniques can detect inconsistencies between the permissions an application declares and the permissions it actually uses. However, using third-party privacy policies as the basis for analysing SDK permissions inflates the set of extracted declared permissions, which causes risky applications to be identified as normal during the consistency comparison. The prevailing approach uses TextCNN-based models to extract information from privacy policies, but training TextCNN relies on large-scale annotated datasets, leading to high costs. This paper uses BERT as the word-vector extraction model to obtain privacy-related phrases from the privacy policy and then applies cosine similarity to automatically filter permission phrase samples, reducing the workload of manual labeling. In addition, existing methods do not support the analysis of Chinese privacy policies. To judge consistency between Chinese privacy policies and permission usage, we implement a BERT-based consistency-analysis engine for Android privacy policies and permissions. The engine first uses static analysis to obtain an application's permission list and then applies the BERT model to perform the consistency analysis. Functional and speed testing shows that the engine successfully analyses the consistency of declared and used permissions for Chinese privacy policies and outperforms existing detection methods.
Journal Article
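A hedged sketch of the similarity-based filtering idea described above: embed candidate privacy-policy phrases and seed permission phrases with a BERT-style encoder, then keep candidates whose cosine similarity to any seed clears a threshold. The sentence-transformers model name, the phrases, and the 0.4 threshold are illustrative assumptions, not the paper's configuration.

```python
# Keep privacy-policy sentences that are semantically close to known permission phrases.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # illustrative model choice

seed_permissions = ["access precise location", "read contact list", "use camera"]
candidates = [
    "we may collect your GPS position to improve recommendations",
    "our newsletter keeps you informed about new features",
    "the app uploads your address book to find friends",
]

seed_emb = model.encode(seed_permissions, convert_to_tensor=True)
cand_emb = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(cand_emb, seed_emb)          # candidates x seeds

THRESHOLD = 0.4                                    # illustrative cut-off
for text, row in zip(candidates, scores):
    if float(row.max()) >= THRESHOLD:
        print("permission-related:", text)
```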
AVADA: toward automated pathogenic variant evidence retrieval directly from the full-text literature
by Galhardo, Luisa M.T.; Hayward, Laura E.; Birgmeier, Johannes
in automatic variant retrieval, Biomedical and Life Sciences, Biomedicine
2020
Both monogenic pathogenic variant cataloging and clinical patient diagnosis start with variant-level evidence retrieval followed by expert evidence integration in search of diagnostic variants and genes. Here, we try to accelerate pathogenic variant evidence retrieval by an automatic approach.
Automatic VAriant evidence DAtabase (AVADA) is a novel machine learning tool that uses natural language processing to automatically identify pathogenic genetic variant evidence in full-text primary literature about monogenic disease and convert it to genomic coordinates.
AVADA automatically retrieved almost 60% of likely disease-causing variants deposited in the Human Gene Mutation Database (HGMD), a 4.4-fold improvement over the current best open source automated variant extractor. AVADA contains over 60,000 likely disease-causing variants that are in HGMD but not in ClinVar. AVADA also highlights the challenges of automated variant mapping and pathogenicity curation. However, when combined with manual validation, on 245 diagnosed patients, AVADA provides valuable evidence for an additional 18 diagnostic variants, on top of ClinVar’s 21, versus only 2 using the best current automated approach.
AVADA advances automated retrieval of pathogenic monogenic variant evidence from the full-text literature. Although far from perfect, it is much faster than PubMed/Google Scholar searching, and careful curation of AVADA-retrieved evidence can aid both database curation and patient diagnosis.
Journal Article
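AVADA itself is a machine-learning pipeline; the snippet below is only a hedged, regex-level sketch of the first task it automates, spotting candidate variant mentions (simplified HGVS-style cDNA and protein notation) in full-text passages.

```python
# Find simplified HGVS-style variant mentions such as "c.1521_1523delCTT" or "p.Phe508del".
import re

CDNA = re.compile(r"\bc\.\d+(_\d+)?(?:[ACGT]>[ACGT]|del[ACGT]*|dup[ACGT]*|ins[ACGT]+)\b")
PROTEIN = re.compile(r"\bp\.(?:[A-Z][a-z]{2})\d+(?:[A-Z][a-z]{2}|del|fs|\*)\b")

passage = (
    "The proband carried the CFTR variant c.1521_1523delCTT (p.Phe508del), "
    "previously reported as pathogenic."
)

for pattern in (CDNA, PROTEIN):
    for match in pattern.finditer(passage):
        print("candidate variant mention:", match.group(0))
```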
Research on the Generation of Patented Technology Points in New Energy Based on Deep Learning
2023
Effective extraction of patented technology points in new energy fields is valuable: it motivates technological innovation and facilitates patent transformation and application. However, because patent data exhibits an uneven distribution of technology-point information, long terms, and long sentences, technology point extraction suffers from poor readability and logical confusion. To mitigate these problems, the article proposes IGPTP, a two-stage method for generating patent technology points that fuses the advantages of extractive and generative approaches. IGPTP uses a RoBERTa+CNN model to obtain the key sentences of a text and feeds the output to UNILM (a unified pre-trained language model). It also applies a multi-strategy integration technique, combining a copy mechanism and an external knowledge guidance model, to enhance the quality of the generated technology points. Extensive experimental results show that IGPTP outperforms current mainstream models and generates more coherent and richer text.
Journal Article
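A hedged sketch of the two-stage extract-then-generate idea: IGPTP pairs a RoBERTa+CNN extractor with UNILM, while here a simple keyword scorer and a generic Hugging Face summarisation pipeline stand in for both stages, and the patent text and keywords are invented.

```python
# Stage 1: pick key sentences; Stage 2: feed them to a generative summariser.
import re
from transformers import pipeline

def select_key_sentences(text: str, keywords, k: int = 3) -> str:
    # Stand-in for the RoBERTa+CNN extractor: rank sentences by keyword hits.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scored = sorted(sentences, key=lambda s: sum(w in s.lower() for w in keywords), reverse=True)
    return " ".join(scored[:k])

generator = pipeline("summarization")      # stand-in for UNILM

patent_text = (
    "The invention provides a lithium battery electrode coated with a porous carbon layer. "
    "The coating improves charging speed and thermal stability. "
    "A binder of polyvinylidene fluoride is applied before calendering. "
    "Packaging and labelling follow standard procedures."
)
key_part = select_key_sentences(patent_text, ["battery", "electrode", "charging", "coating"])
print(generator(key_part, max_length=60, min_length=10)[0]["summary_text"])
```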
Knowledge graph of wastewater-based epidemiology development: A data-driven analysis based on research topics and trends
by Ren, Yuan; Zhan, Zhi-Hui; Gao, Zhihan
in Aquatic Pollution, Atmospheric Protection/Air Quality Control/Air Pollution, Australia
2023
Wastewater-based epidemiology (WBE) has contributed significantly to the monitoring of drug use and virus transmission, as documented in numerous research papers. In this paper, we used LitStraw, a self-developed text extraction tool, to extract information from nearly 900 related papers in PDF format collected from Web of Science between 2000 and 2021 and to construct knowledge graphs that reveal WBE research hotspots and development trends. The results show a growing number of WBE publications arising from multidisciplinary collaboration, with particularly close cooperation between the USA, Australia, China, and European countries. Illicit drugs and pharmaceuticals remain prominent keywords, but the specific research hotspots have shifted significantly, with new psychoactive substances, biomarkers, and stability showing an increasing trend. In addition, assessing the spread of COVID-19 from the presence of SARS-CoV-2 RNA in sewage has become a focus since 2020. By constructing a knowledge graph, this work presents the development of WBE more clearly and offers new ideas for literature-mining methods in other fields.
Journal Article
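LitStraw is the authors' in-house tool, so the snippet below is only a hedged sketch of the generic workflow the abstract implies: extract text from PDFs, then link keywords that co-occur in the same paper into a small graph. The file names and keyword list are placeholders.

```python
# Build a keyword co-occurrence graph from the text of a set of PDF papers.
from itertools import combinations
from pypdf import PdfReader
import networkx as nx

KEYWORDS = ["sars-cov-2", "illicit drugs", "biomarkers", "pharmaceuticals"]

def pdf_text(path: str) -> str:
    return " ".join((page.extract_text() or "") for page in PdfReader(path).pages).lower()

graph = nx.Graph()
for path in ["paper_001.pdf", "paper_002.pdf"]:            # placeholder paths
    found = [k for k in KEYWORDS if k in pdf_text(path)]
    for a, b in combinations(found, 2):                    # co-mention edge
        weight = graph.get_edge_data(a, b, {"weight": 0})["weight"] + 1
        graph.add_edge(a, b, weight=weight)

print(graph.edges(data=True))
```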
Digitizing and parsing semi-structured historical administrative documents from the G.I. Bill mortgage guarantee program
by Alexander, J. Trent; Bleckley, David A.; Lafia, Sara
in Accuracy, Acknowledgment, Administrative records
2023
Purpose: Many libraries and archives maintain collections of research documents, such as administrative records, with paper-based formats that limit the documents' access to in-person use. Digitization transforms paper-based collections into more accessible and analyzable formats. As collections are digitized, there is an opportunity to incorporate deep learning techniques, such as Document Image Analysis (DIA), into workflows to increase the usability of information extracted from archival documents. This paper describes the authors' approach using digital scanning, optical character recognition (OCR) and deep learning to create a digital archive of administrative records related to the mortgage guarantee program of the Servicemen's Readjustment Act of 1944, also known as the G.I. Bill.
Design/methodology/approach: The authors used a collection of 25,744 semi-structured paper-based records from the administration of G.I. Bill Mortgages from 1946 to 1954 to develop a digitization and processing workflow. These records include the name and city of the mortgagor, the amount of the mortgage, the location of the Reconstruction Finance Corporation agent, one or more identification numbers and the name and location of the bank handling the loan. The authors extracted structured information from these scanned historical records in order to create a tabular data file and link them to other authoritative individual-level data sources.
Findings: The authors compared the flexible character accuracy of five OCR methods. The authors then compared the character error rate (CER) of three text extraction approaches (regular expressions, DIA and named entity recognition (NER)). The authors were able to obtain the highest quality structured text output using DIA with the Layout Parser toolkit by post-processing with regular expressions. Through this project, the authors demonstrate how DIA can improve the digitization of administrative records to automatically produce a structured data resource for researchers and the public.
Originality/value: The authors' workflow is readily transferable to other archival digitization projects. Through the use of digital scanning, OCR and DIA processes, the authors created the first digital microdata file of administrative records related to the G.I. Bill mortgage guarantee program available to researchers and the general public. These records offer research insights into the lives of veterans who benefited from loans, the impacts on the communities built by the loans and the institutions that implemented them.
Journal Article
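A hedged sketch of the regular-expression post-processing step applied to OCR/DIA output; the record layout below is invented for illustration and does not reproduce the actual G.I. Bill card format.

```python
# Parse one OCR'd line into structured fields (name, city, amount, loan id).
import re

ocr_line = "JOHN A SMITH  DETROIT MICH  $6,500.00  LH-123456"

RECORD = re.compile(
    r"(?P<name>[A-Z][A-Z .]+?)\s{2,}"
    r"(?P<city>[A-Z][A-Z .]+?)\s{2,}"
    r"\$(?P<amount>[\d,]+\.\d{2})\s+"
    r"(?P<loan_id>[A-Z]{2}-\d+)"
)

match = RECORD.search(ocr_line)
if match:
    record = match.groupdict()
    record["amount"] = float(record["amount"].replace(",", ""))
    print(record)  # {'name': 'JOHN A SMITH', 'city': 'DETROIT MICH', 'amount': 6500.0, 'loan_id': 'LH-123456'}
```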