Search Results

872 result(s) for "706/648/697/129"
Extracting accurate materials data from research papers with conversational language models and prompt engineering
There has been a growing effort to replace manual extraction of data from research papers with automated data extraction based on natural language processing, language models, and recently, large language models (LLMs). Although these methods enable efficient extraction of data from large sets of research papers, they require a significant amount of up-front effort, expertise, and coding. In this work, we propose the ChatExtract method, which can fully automate very accurate data extraction with minimal initial effort and background, using an advanced conversational LLM. ChatExtract consists of a set of engineered prompts applied to a conversational LLM that identify sentences with data, extract that data, and assure the data’s correctness through a series of follow-up questions. These follow-up questions largely overcome known issues with LLMs providing factually inaccurate responses. ChatExtract can be applied to any conversational LLM and yields very high-quality data extraction. In tests on materials data, we find precision and recall both close to 90% from the best conversational LLMs, like GPT-4. We demonstrate that this exceptional performance is enabled by the information retention in a conversational model combined with purposeful redundancy and introducing uncertainty through follow-up prompts. These results suggest that approaches similar to ChatExtract, due to their simplicity, transferability, and accuracy, are likely to become powerful tools for data extraction in the near future. Finally, databases for critical cooling rates of metallic glasses and yield strengths of high entropy alloys are developed using ChatExtract. Efficient data extraction from research papers accelerates science and engineering. Here, the authors develop an automated approach which uses conversational large language models to achieve high precision and recall in extracting materials data.
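
The workflow described in this abstract can be pictured as a short prompt chain: classify a sentence, extract a triplet, then re-ask with deliberate uncertainty. The sketch below is an illustration only, not the authors' engineered prompts; it assumes the OpenAI Python client, and the model name, prompt wording, and follow-up question are simplified stand-ins.

# Minimal ChatExtract-style prompt chain (illustrative; prompts are simplified
# stand-ins, and GPT-4 via the OpenAI Python client is an assumed setup).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ask(messages):
    """Send the running conversation to the model and append its reply."""
    resp = client.chat.completions.create(model="gpt-4", messages=messages)
    reply = resp.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply

def chat_extract(sentence, prop="critical cooling rate"):
    """Classify a sentence, extract a (material, value, unit) triplet, then verify it."""
    messages = [{"role": "user", "content":
                 f"Does the following sentence report a value of {prop}? "
                 f"Answer Yes or No.\n\n{sentence}"}]
    if not ask(messages).strip().lower().startswith("yes"):
        return None
    messages.append({"role": "user", "content":
                     "Give the material, value, and unit as 'material; value; unit'. "
                     "If any item is not stated, write 'None' for it."})
    triplet = ask(messages)
    # Purposeful redundancy with an uncertainty-inducing follow-up, which the
    # abstract identifies as key to suppressing factually inaccurate answers.
    messages.append({"role": "user", "content":
                     "Are you certain the value and unit appear explicitly in the "
                     "sentence? Answer Yes or No and correct the triplet if needed."})
    check = ask(messages)
    return triplet if check.strip().lower().startswith("yes") else None

print(chat_extract("The critical cooling rate of Vitreloy 1 is about 1 K/s."))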
Structured information extraction from scientific text with large language models
Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers. Extracting scientific data from published research is a complex task requiring specialised tools. Here the authors present a scheme based on large language models to automate the retrieval of information from text in a flexible and accessible manner.
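
As a concrete illustration of the "list of JSON objects" output style mentioned in this abstract, the snippet below parses one hypothetical record for the dopant/host task. The field names are assumptions made for illustration, not the schema used in the paper.

# Illustrative only: parsing a hypothetical extraction record returned as a
# list of JSON objects; the field names are not the paper's actual schema.
import json

sentence = "We studied Mn-doped GaAs thin films grown on Si substrates."

# A fine-tuned model would generate this string from the sentence; it is
# hard-coded here to show the structured output style.
model_output = '[{"host": "GaAs", "dopants": ["Mn"]}]'

for record in json.loads(model_output):
    print(f"host={record['host']}  dopants={', '.join(record['dopants'])}")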
Best practices in machine learning for chemistry
Statistical tools based on machine learning are becoming integrated into chemistry research workflows. We discuss the elements necessary to train reliable, repeatable and reproducible models, and recommend a set of guidelines for machine learning reports.
Inferring causation from time series in Earth system sciences
The heart of the scientific enterprise is a rational effort to understand the causes behind the phenomena we observe. In large-scale complex dynamical systems such as the Earth system, real experiments are rarely feasible. However, a rapidly increasing amount of observational and simulated data opens up the use of novel data-driven causal methods beyond the commonly adopted correlation techniques. Here, we give an overview of causal inference frameworks and identify promising generic application cases common in Earth system sciences and beyond. We discuss challenges and initiate the benchmark platform causeme.net to close the gap between method users and developers. Questions of causality are ubiquitous in Earth system sciences and beyond, yet correlation techniques still prevail. This Perspective provides an overview of causal inference methods, identifies promising applications and methodological challenges, and initiates a causality benchmark platform.
Materials Cloud, a platform for open computational science
Materials Cloud is a platform designed to enable open and seamless sharing of resources for computational science, driven by applications in materials modelling. It hosts (1) archival and dissemination services for raw and curated data, together with their provenance graph, (2) modelling services and virtual machines, (3) tools for data analytics, and pre-/post-processing, and (4) educational materials. Data is citable and archived persistently, providing a comprehensive embodiment of entire simulation pipelines (calculations performed, codes used, data generated) in the form of graphs that allow retracing and reproducing any computed result. When an AiiDA database is shared on Materials Cloud, peers can browse the interconnected record of simulations, download individual files or the full database, and start their research from the results of the original authors. The infrastructure is agnostic to the specific simulation codes used and can support diverse applications in computational science that transcend its initial materials domain.
MedMNIST v2 - A large-scale lightweight benchmark for 2D and 3D biomedical image classification
We introduce MedMNIST v2, a large-scale MNIST-like dataset collection of standardized biomedical images, including 12 datasets for 2D and 6 datasets for 3D. All images are pre-processed into a small size of 28 × 28 (2D) or 28 × 28 × 28 (3D) with the corresponding classification labels so that no background knowledge is required for users. Covering primary data modalities in biomedical images, MedMNIST v2 is designed to perform classification on lightweight 2D and 3D images with various dataset scales (from 100 to 100,000) and diverse tasks (binary/multi-class, ordinal regression, and multi-label). The resulting dataset, consisting of 708,069 2D images and 9,998 3D images in total, could support numerous research/educational purposes in biomedical image analysis, computer vision, and machine learning. We benchmark several baseline methods on MedMNIST v2, including 2D/3D neural networks and open-source/commercial AutoML tools. The data and code are publicly available at https://medmnist.com/. Measurement(s): supervised machine learning. Technology Type(s): machine learning.
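
A minimal loading example is sketched below, assuming the medmnist Python package that accompanies the dataset; the choice of PathMNIST and the printed fields are arbitrary examples, not a prescribed workflow.

# Minimal sketch of loading one 2D dataset, assuming the `medmnist` package;
# PathMNIST is an arbitrary example among the 12 2D datasets.
from medmnist import INFO, PathMNIST

info = INFO["pathmnist"]                         # task metadata for this dataset
train = PathMNIST(split="train", download=True)  # downloads the 28 x 28 images

print(f"task: {info['task']}, classes: {len(info['label'])}")
print(f"training samples: {len(train)}")
image, label = train[0]                          # PIL image and its label array
print(image.size, label)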
Different scaling of linear models and deep learning in UKBiobank brain images versus machine-learning datasets
Recently, deep learning has unlocked unprecedented success in various domains, especially using images, text, and speech. However, deep learning is only beneficial if the data have nonlinear relationships and if they are exploitable at available sample sizes. We systematically profiled the performance of deep, kernel, and linear models as a function of sample size on UKBiobank brain images against established machine learning references. On MNIST and Zalando Fashion, prediction accuracy consistently improves when escalating from linear models to shallow-nonlinear models, and further improves with deep-nonlinear models. In contrast, using structural or functional brain scans, simple linear models perform on par with more complex, highly parameterized models in age/sex prediction across increasing sample sizes. In sum, linear models keep improving as the sample size approaches ~10,000 subjects. Yet, nonlinearities for predicting common phenotypes from typical brain scans remain largely inaccessible to the examined kernel and deep learning methods. Schulz et al. systematically benchmark performance scaling with increasingly sophisticated prediction algorithms and with increasing sample size in reference machine-learning and biomedical datasets. Complicated nonlinear intervariable relationships remain largely inaccessible for predicting key phenotypes from typical brain scans.
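
The benchmarking strategy this abstract describes (prediction accuracy as a function of training-set size for models of increasing complexity) can be mimicked on a toy dataset, as in the sketch below. It uses scikit-learn on the small digits dataset purely for illustration and is not the authors' UK Biobank pipeline.

# Illustrative sketch: accuracy vs. training-set size for linear, kernel, and
# small "deep" models on a toy scikit-learn dataset (not brain-imaging data).
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "linear": LogisticRegression(max_iter=2000),
    "kernel": SVC(kernel="rbf"),
    "deep": MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=0),
}

for n in (100, 400, 1200):                       # escalating sample sizes
    scores = []
    for name, model in models.items():
        model.fit(X_train[:n], y_train[:n])      # train on the first n samples
        scores.append(f"{name}={model.score(X_test, y_test):.2f}")
    print(f"n={n}: " + ", ".join(scores))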
Data sharing practices and data availability upon request differ across scientific disciplines
Data sharing is one of the cornerstones of modern science that enables large-scale analyses and reproducibility. We evaluated data availability in research articles across nine disciplines in Nature and Science magazines and recorded corresponding authors’ concerns, requests and reasons for declining data sharing. Although data sharing has improved in the last decade and particularly in recent years, data availability and willingness to share data still differ greatly among disciplines. We observed that statements of data availability upon (reasonable) request are inefficient and should not be allowed by journals. To improve data sharing at the time of manuscript acceptance, researchers should be better motivated to release their data with real benefits such as recognition, or bonus points in grant and job applications. We recommend that data management costs should be covered by funding agencies; publicly available research data ought to be included in the evaluation of applications; and surveillance of data sharing should be enforced by both academic publishers and funders. These cross-discipline survey data are available from the plutoF repository.
T cell lymphoma and secondary primary malignancy risk after commercial CAR T cell therapy
We report a T cell lymphoma (TCL) occurring 3 months after anti-CD19 chimeric antigen receptor (CAR) T cell immunotherapy for non-Hodgkin B cell lymphoma. The TCL was diagnosed from a thoracic lymph node upon surgery for lung cancer. The TCL exhibited CD8 + cytotoxic phenotype and a JAK3 variant, while the CAR transgene was very low. The T cell clone was identified at low levels in the blood before CAR T infusion and in lung cancer. To assess the overall risk of secondary primary malignancy after commercial CAR T (CD19, BCMA), we analyzed 449 patients treated at the University of Pennsylvania. At a median follow-up of 10.3 months, 16 patients (3.6%) had a secondary primary malignancy. The median onset time was 26.4 and 9.7 months for solid and hematological malignancies, respectively. The projected 5-year cumulative incidence is 15.2% for solid and 2.3% for hematological malignancies. Overall, one case of TCL was observed, suggesting a low risk of TCL after CAR T. Profiling of a case of secondary T cell lymphoma following anti-CD19 CAR T cell therapy suggests that it was not caused by CAR insertional mutagenesis, with single-center analysis indicating that secondary T cell lymphoma risk after commercial CAR T cell treatment is low.
EV-TRACK: transparent reporting and centralizing knowledge in extracellular vesicle research
We argue that the field of extracellular vesicle (EV) biology needs more transparent reporting to facilitate interpretation and replication of experiments. To achieve this, we describe EV-TRACK, a crowdsourcing knowledgebase (http://evtrack.org) that centralizes EV biology and methodology with the goal of stimulating authors, reviewers, editors and funders to put experimental guidelines into practice.