Catalogue Search | MBRL

Automatic indexing of scientific articles on Library and Information Science with SISA, KEA and MAUI

by Ortuño, Pedro Díaz; Gil-Leiva, Isidoro; Corrêa, Renato Fernandes in Access to information; Algorithms; Automatic

2022

This article evaluates the SISA (Automatic Indexing System), KEA (Keyphrase Extraction Algorithm) and MAUI (Multi-Purpose Automatic Topic Indexing) automatic indexing systems to find out how they perform in relation to human indexing. SISA's algorithm is based on rules about the position of terms in the different structural components of the document, while the algorithms for KEA and MAUI are based on machine learning and the statistical features of terms. For evaluation purposes, a document collection of 230 scientific articles from the Revista Española de Documentación Científica published by the Consejo Superior de Investigaciones Científicas (CSIC) was used, of which 30 were used for training tasks and were not part of the evaluation test set. The articles were written in Spanish and indexed by human indexers using a controlled vocabulary in the InDICES database, also belonging to the CSIC. The human indexing of these documents constitutes the baseline, or gold-standard indexing, against which the output of the automatic indexing systems is evaluated by comparing term sets using the metrics of precision, recall, F-measure and consistency. The results show that the SISA system performs best, followed by KEA and MAUI.

Journal Article
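
The term-set comparison described in the abstract above can be illustrated with a minimal sketch. The abstract does not say which consistency formula was used; the version below assumes a Hooper-style measure (shared terms divided by all distinct terms), and the function name `indexing_scores` and the sample terms are invented for illustration.

```python
def indexing_scores(automatic, human):
    """Compare an automatic term set against a human (gold) term set."""
    # Normalise case and whitespace so "Evaluation" matches "evaluation".
    auto_terms = {t.strip().lower() for t in automatic}
    gold_terms = {t.strip().lower() for t in human}
    common = auto_terms & gold_terms

    precision = len(common) / len(auto_terms) if auto_terms else 0.0
    recall = len(common) / len(gold_terms) if gold_terms else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    # Assumed definition: Hooper-style consistency,
    # shared terms over all distinct terms assigned by either indexer.
    union = auto_terms | gold_terms
    consistency = len(common) / len(union) if union else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "consistency": consistency,
    }


# Invented example: 4 automatic terms, 3 human terms, 2 in common.
scores = indexing_scores(
    ["automatic indexing", "keyphrase extraction", "machine learning", "evaluation"],
    ["Automatic indexing", "Evaluation", "Controlled vocabulary"],
)
print(scores)
```

Exact string matching after lowercasing is the simplest possible comparison; a real evaluation would likely also need stemming or mapping to the controlled vocabulary.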

Research on the Automatic Subject-Indexing Method of Academic Papers Based on Climate Change Domain Ontology

by Yang, Lina; Yang, Heng; Wang, Nan in Automatic indexing; Citation analysis; Classification

2023

It is important to classify academic papers in a fine-grained manner to uncover deeper implicit themes and semantics in papers for better semantic retrieval, paper recommendation, research trend prediction, topic analysis, and a series of other functions. Based on the ontology of the climate change domain, this study used an unsupervised approach to combine two methods, syntactic structure and semantic modeling, to build a framework of subject-indexing techniques for academic papers in the climate change domain. The framework automatically indexes a set of conceptual terms as research topics from the domain ontology by inputting the titles, abstracts and keywords of the papers using natural language processing techniques such as syntactic dependencies, text similarity calculation, pre-trained language models, semantic similarity calculation, and weighting factors such as word frequency statistics and graph path calculation. Finally, we evaluated the proposed method using the gold standard of manually annotated articles and demonstrated significant improvements over the other five alternative methods in terms of precision, recall and F1-score. Overall, the method proposed in this study is able to identify the research topics of academic papers more accurately, and also provides useful references for the application of domain ontologies and unsupervised data annotation.

Journal Article

Sometimes the apple does fall far from the tree: a case study on automatic indexing precision errors in PubMed

by Wilson, Paije in Abstract and Indexing; Abstracting and Indexing - methods; Abstracting and Indexing - standards

2025

Objective: This case study identifies the presence and prevalence of precision indexing errors in a subset of automatically indexed MEDLINE records in PubMed (specifically, all MEDLINE records automatically indexed with the MeSH term Malus, the genus name for apple trees). In short, how well does automatic indexing compare [figurative] apples to [literal] apples? Methods: 1,705 MEDLINE records automatically indexed with the MeSH term Malus underwent title/abstract and full text screening to determine whether they were correctly indexed (i.e., the records were about Malus, meaning they discussed the literal fruit or tree) or incorrectly indexed (i.e., they were not about Malus, meaning they did not discuss the literal fruit or tree). The context and type of indexing error were documented for each erroneously indexed record. Results: 135 (7.9%) records were incorrectly indexed with the MeSH term Malus. The most common indexing error was due to the word "apple" being used in similes, metaphors, and idioms (80, or 59.2%), with the next most common error being due to "apple" being present in a name or term (50, or 37%). Additional indexing errors were attributed to the use of "apple" in acronyms, and, in one case, a reference to Sir Isaac Newton. Conclusion: As indicated by this study's findings, automatic indexing can commit errors when indexing records that have words with non-literal or alternative meanings in their titles or abstracts. Librarians should be mindful of the existence of automatic indexing errors, and instruct authors on how best to ameliorate the effects of them within their own manuscripts.

Journal Article

Filtering failure: the impact of automated indexing in Medline on retrieval of human studies for knowledge synthesis

by Askin, Nicole; Epp, Carla; Ostapyk, Tyler in Abstract and Indexing; Abstracting and Indexing - methods; Abstracting and Indexing - standards

2025

Objective: Use of the search filter ‘exp animals/ not humans.sh’ is a well-established method in evidence synthesis to exclude non-human studies. However, the shift to automated indexing of Medline records has raised concerns about the use of subject-heading-based search techniques. We sought to determine how often this string inappropriately excludes human studies among automated as compared to manually indexed records in Ovid Medline. Methods: We searched Ovid Medline for studies published in 2021 and 2022 using the Cochrane Highly Sensitive Search Strategy for randomized trials. We identified all results excluded by the non-human-studies filter. Records were divided into sets based on indexing method: automated, curated, or manual. Each set was screened to identify human studies. Results: Human studies were incorrectly excluded in all three conditions, but automated indexing inappropriately excluded human studies at nearly double the rate of manual indexing. Looking specifically at human clinical randomized controlled trials (RCTs), the rate of inappropriate exclusion among automatically indexed records was seven times that among manually indexed records. Conclusions: Given our findings, searchers are advised to carefully review the effect of the ‘exp animals/ not humans.sh’ search filter on their search results, pending improvements to the automated indexing process.

Journal Article
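
To see why a missing subject heading causes the exclusion described in the abstract above, the filter logic can be sketched roughly as follows. This is an illustration only: `ANIMALS_TREE` is a tiny hypothetical stand-in for the exploded Animals tree, the records are invented, and nothing here queries Ovid Medline.

```python
# Hypothetical, heavily truncated stand-in for the exploded Animals tree.
# Humans sits inside that tree in MeSH, which is why the filter subtracts it.
ANIMALS_TREE = {"Animals", "Humans", "Mice", "Rats", "Dogs"}


def matches_nonhuman_string(mesh_headings):
    """Approximate 'exp animals/ not humans.sh' for one record."""
    headings = set(mesh_headings)
    in_animals_tree = bool(headings & ANIMALS_TREE)  # exp animals/
    has_humans = "Humans" in headings                # humans.sh
    return in_animals_tree and not has_humans


def apply_filter(records):
    """Keep only records that the non-human string does not match."""
    return [r for r in records if not matches_nonhuman_string(r["mesh"])]


# Invented records for illustration.
records = [
    {"id": "trial-correct", "mesh": ["Humans", "Adult"]},
    {"id": "mouse-study", "mesh": ["Animals", "Mice"]},
    # A human trial whose indexing lacks the Humans heading:
    {"id": "trial-misindexed", "mesh": ["Animals", "Adult"]},
]

kept = [r["id"] for r in apply_filter(records)]
print(kept)
```

The third record is dropped purely because of its indexing, which is the failure mode the study measures.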

The expansion of Google Scholar versus Web of Science: a longitudinal study

by Dodou, Dimitra; de Winter, Joost C. F; Zadpoor, Amir A in Chemistry; Citation analysis; Citations

2014

Web of Science (WoS) and Google Scholar (GS) are prominent citation services with distinct indexing mechanisms. Comprehensive knowledge about the growth patterns of these two citation services is lacking. We analyzed the development of citation counts in WoS and GS for two classic articles and 56 articles from diverse research fields, making a distinction between retroactive growth (i.e., the relative difference between citation counts up to mid-2005 measured in mid-2005 and citation counts up to mid-2005 measured in April 2013) and actual growth (i.e., the relative difference between citation counts up to mid-2005 measured in April 2013 and citation counts up to April 2013 measured in April 2013). One of the classic articles was used for a citation-by-citation analysis. Results showed that GS has grown substantially in a retroactive manner (median of 170% across articles), especially for articles that initially had low citation counts in GS as compared to WoS. Retroactive growth of WoS was small, with a median of 2% across articles. Actual growth percentages were moderately higher for GS than for WoS (medians of 54% vs. 41%). The citation-by-citation analysis showed that the percentage of citations unique to WoS was lower for more recent citations (6.8% for citations from 1995 and later vs. 41% for citations from before 1995), whereas the opposite was noted for GS (57% vs. 33%). It is concluded that, since its inception, GS has shown substantial expansion, and that the majority of recent works indexed in WoS are now also retrievable via GS. A discussion is provided on quantity versus quality of citations, threats for WoS, weaknesses of GS, and implications for literature research and research evaluation.

Journal Article
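
The two growth measures defined in the abstract above can be made concrete with a small sketch. It assumes that "relative difference" means (later − earlier) / earlier, expressed as a percentage; the abstract does not spell out the formula, and the counts below are invented.

```python
def relative_difference(earlier, later):
    """Percentage change from an earlier count to a later count (assumed formula)."""
    return (later - earlier) / earlier * 100.0


def growth(upto_2005_seen_2005, upto_2005_seen_2013, upto_2013_seen_2013):
    """Return (retroactive growth, actual growth) in percent for one article."""
    # Retroactive: same citation window (up to mid-2005), measured at two dates.
    retroactive = relative_difference(upto_2005_seen_2005, upto_2005_seen_2013)
    # Actual: same measurement date (April 2013), two citation windows.
    actual = relative_difference(upto_2005_seen_2013, upto_2013_seen_2013)
    return retroactive, actual


# Invented counts: 100 citations seen in 2005, 150 once the index was
# backfilled, and 225 in total by April 2013.
retro, actual = growth(100, 150, 225)
print(retro, actual)
```

Retroactive growth isolates how much the service added older material after the fact, while actual growth reflects citations that genuinely accrued later.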

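Aplicação da folksonomia assistida na construção de corpus de referência em Ciência da Informação [Application of assisted folksonomy to the construction of a reference corpus in Information Science]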

by Correa, Renato Fernandes; Silva, Bruno Felipe de Melo in Application; Assisted Folksonomy; Automatic Indexing

2020

This study proposes and discusses the application of assisted folksonomy to the construction of a reference corpus of scientific articles in the field of Information Science. The hypothesis is that such an application can ensure higher quality in the indexing of scientific articles and a better evaluation of automatic indexing systems through the compiled corpus. The research was restricted to a corpus of 60 articles written in Portuguese and selected by Souza (2005). The collaborative platform for assisted social indexing of the corpus was configured using the collection management software Tainacan. The research stages comprised configuring and preparing the collection in Tainacan, having groups of users carry out the assisted social indexing, and analysing the results of the indexing process. The assisted folksonomy was analysed by comparing the contents of the Assuntos (Subjects) and tags metadata fields of the articles. As indicators of indexing quality, the averages obtained were 28% for the consistency coefficient, 32% for precision, 68% for recall, and 41% for F-measure. These averages represent good levels of consistency and recall and satisfactory levels of precision and F-measure, suggesting that assisted folksonomy is useful for improving the indexing of the reference corpus.

Journal Article

Deep neural model with self-training for scientific keyphrase extraction

by Liao, Han; Zhu, Xun; Lyu, Chen in Annotations; Artificial intelligence; Artificial neural networks

2020

Scientific information extraction is a crucial step for understanding scientific publications. In this paper, we focus on scientific keyphrase extraction, which aims to identify keyphrases from scientific articles and classify them into predefined categories. We present a neural-network-based approach for this task, which employs a bidirectional long short-term memory (LSTM) network to represent the sentences in the article. On top of the bidirectional LSTM layer in our neural model, a conditional random field (CRF) is used to predict the label sequence for the whole sentence. Considering how expensive annotated data are for supervised learning methods, we introduce a self-training method into our neural model to leverage the unlabeled articles. Experimental results on the ScienceIE corpus and ACL keyphrase corpus show that our neural model achieves promising performance without any hand-designed features or external knowledge resources. Furthermore, it efficiently incorporates the unlabeled data and achieves competitive performance compared with previous state-of-the-art systems.

Journal Article

Keyword Extraction: A Modern Perspective

by Nomoto, Tadashi in Computer Imaging; Computer Science; Computer Systems Organization and Communication Networks

2023

The goal of keyword extraction is to extract from a text the words or phrases indicative of what it is about. In this work, we look at keyword extraction from a number of different perspectives: Statistics, Automatic Term Indexing, Information Retrieval (IR), Natural Language Processing (NLP), and the emerging Neural paradigm. The 1990s saw some early attempts to tackle the issue, primarily based on text statistics [13, 17]. Meanwhile, in IR, efforts were largely led by DARPA’s Topic Detection and Tracking (TDT) project [2]. In this contribution, we discuss how past innovations paved the way for more recent developments, such as LDA, PageRank, and Neural Networks. We walk through the history of keyword extraction over the last 50 years, noting differences and similarities among methods that emerged during that time. We conduct a large meta-analysis of the past literature using datasets from news media, science, and medicine to business and bureaucracy, to draw a general picture of what a successful approach would look like.

Journal Article

MeSH indexing based on automatically generated summaries

by Jimeno-Yepes, Antonio J; Aronson, Alan R; Díaz, Alberto in Abstracting and Indexing - methods; Algorithms; Analysis

2013

Background: MEDLINE citations are manually indexed at the U.S. National Library of Medicine (NLM) using as reference the Medical Subject Headings (MeSH) controlled vocabulary. For this task, the human indexers read the full text of the article. Due to the growth of MEDLINE, the NLM Indexing Initiative explores indexing methodologies that can support the task of the indexers. Medical Text Indexer (MTI) is a tool developed by the NLM Indexing Initiative to provide MeSH indexing recommendations to indexers. Currently, the input to MTI is MEDLINE citations, title and abstract only. Previous work has shown that using full text as input to MTI increases recall, but decreases precision sharply. We propose using summaries generated automatically from the full text for the input to MTI to use in the task of suggesting MeSH headings to indexers. Summaries distill the most salient information from the full text, which might increase the coverage of automatic indexing approaches based on MEDLINE. We hypothesize that if the results were good enough, manual indexers could possibly use automatic summaries instead of the full texts, along with the recommendations of MTI, to speed up the process while maintaining high quality of indexing results. Results: We have generated summaries of different lengths using two different summarizers, and evaluated the MTI indexing on the summaries using different algorithms: MTI, individual MTI components, and machine learning. The results are compared to those of full text articles and MEDLINE citations. Our results show that automatically generated summaries achieve similar recall but higher precision compared to full text articles. Compared to MEDLINE citations, summaries achieve higher recall but lower precision. Conclusions: Our results show that automatic summaries produce better indexing than full text articles. Summaries produce similar recall to full text but much better precision, which seems to indicate that automatic summaries can efficiently capture the most important contents within the original articles. The combination of MEDLINE citations and automatically generated summaries could improve the recommendations suggested by MTI. On the other hand, indexing performance might be dependent on the MeSH heading being indexed. Summarization techniques could thus be considered as a feature selection algorithm that might have to be tuned individually for each MeSH heading.

Journal Article

Information filtering based on corrected redundancy-eliminating mass diffusion

by Cai, Shi-Min; Tian, Hui; Yang, Yujie in Accuracy; Algorithms; Area Under Curve

2017

Methods used in information filtering and recommendation often rely on quantifying the similarity between objects or users. The similarity metrics in use often suffer from redundancies arising from correlations between objects' attributes. Based on an unweighted, undirected object-user bipartite network, we propose a Corrected Redundancy-Eliminating similarity index (CRE), which is based on a spreading process on the network. Extensive experiments on three benchmark data sets (MovieLens, Netflix and Amazon) show that when used in recommendation, the CRE yields significant improvements in terms of recommendation accuracy and diversity. A detailed analysis is presented to unveil the origins of the observed differences between the CRE and mainstream similarity indices.

Journal Article
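
For readers unfamiliar with spreading processes on an object-user bipartite network, the sketch below shows plain two-step mass diffusion, the baseline that indices such as the CRE build on. It is not the CRE itself: the abstract does not give the correction or redundancy-elimination terms, so they are omitted. The function name and toy data are invented.

```python
from collections import defaultdict


def mass_diffusion_scores(user_items, target):
    """Two-step resource spreading for one target user (plain mass diffusion)."""
    # Invert the user -> items map to get item -> users.
    item_users = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            item_users[item].add(user)

    # Step 1: every item the target collected holds one unit of resource
    # and splits it equally among the users who collected that item.
    user_resource = defaultdict(float)
    for item in user_items[target]:
        share = 1.0 / len(item_users[item])
        for user in item_users[item]:
            user_resource[user] += share

    # Step 2: every user splits the received resource equally
    # among the items that user collected.
    item_score = defaultdict(float)
    for user, resource in user_resource.items():
        share = resource / len(user_items[user])
        for item in user_items[user]:
            item_score[item] += share
    return dict(item_score)


# Invented toy network: three users, four items.
user_items = {
    "u1": {"a", "b"},
    "u2": {"a", "c"},
    "u3": {"b", "c", "d"},
}
scores = mass_diffusion_scores(user_items, "u1")

# Recommend items the target has not collected, highest score first.
recommended = sorted(
    (item for item in scores if item not in user_items["u1"]),
    key=lambda item: -scores[item],
)
print(recommended)
```

The total resource is conserved across both steps, so the scores sum to the number of items the target user collected.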
