Catalogue Search | MBRL
872 result(s) for "706/648/697/129"
Extracting accurate materials data from research papers with conversational language models and prompt engineering
2024
There has been a growing effort to replace manual extraction of data from research papers with automated data extraction based on natural language processing, language models, and recently, large language models (LLMs). Although these methods enable efficient extraction of data from large sets of research papers, they require a significant amount of up-front effort, expertise, and coding. In this work, we propose the ChatExtract method that can fully automate very accurate data extraction with minimal initial effort and background, using an advanced conversational LLM. ChatExtract consists of a set of engineered prompts applied to a conversational LLM that identify sentences with data, extract that data, and assure the data’s correctness through a series of follow-up questions. These follow-up questions largely overcome known issues with LLMs providing factually inaccurate responses. ChatExtract can be applied with any conversational LLM and yields very high quality data extraction. In tests on materials data, we find precision and recall both close to 90% from the best conversational LLMs, like GPT-4. We demonstrate that the exceptional performance is enabled by the information retention in a conversational model combined with purposeful redundancy and introducing uncertainty through follow-up prompts. These results suggest that approaches similar to ChatExtract, due to their simplicity, transferability, and accuracy, are likely to become powerful tools for data extraction in the near future. Finally, databases for critical cooling rates of metallic glasses and yield strengths of high entropy alloys are developed using ChatExtract.
Efficient data extraction from research papers accelerates science and engineering. Here, the authors develop an automated approach which uses conversational large language models to achieve high precision and recall in extracting materials data.
Journal Article
Structured information extraction from scientific text with large language models
by Ceder, Gerbrand; Walker, Nicholas; Persson, Kristin A.
in 639/301, 639/301/1034, 639/705/1046
2024
Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers.
Extracting scientific data from published research is a complex task that requires specialised tools. Here the authors present a scheme based on large language models to automatise the retrieval of information from text in a flexible and accessible manner.
Journal Article
Best practices in machine learning for chemistry
by Coudert François-Xavier; Artrith Nongnuch; Jain Anubhav
in Best practice, Learning algorithms, Machine learning
2021
Statistical tools based on machine learning are becoming integrated into chemistry research workflows. We discuss the elements necessary to train reliable, repeatable and reproducible models, and recommend a set of guidelines for machine learning reports.
Journal Article
Inferring causation from time series in Earth system sciences
2019
The heart of the scientific enterprise is a rational effort to understand the causes behind the phenomena we observe. In large-scale complex dynamical systems such as the Earth system, real experiments are rarely feasible. However, a rapidly increasing amount of observational and simulated data opens up the use of novel data-driven causal methods beyond the commonly adopted correlation techniques. Here, we give an overview of causal inference frameworks and identify promising generic application cases common in Earth system sciences and beyond. We discuss challenges and initiate the benchmark platform causeme.net to close the gap between method users and developers.
Questions of causality are ubiquitous in Earth system sciences and beyond, yet correlation techniques still prevail. This Perspective provides an overview of causal inference methods, identifies promising applications and methodological challenges, and initiates a causality benchmark platform.
Journal Article
Materials Cloud, a platform for open computational science
2020
Materials Cloud is a platform designed to enable open and seamless sharing of resources for computational science, driven by applications in materials modelling. It hosts (1) archival and dissemination services for raw and curated data, together with their provenance graph, (2) modelling services and virtual machines, (3) tools for data analytics, and pre-/post-processing, and (4) educational materials. Data is citable and archived persistently, providing a comprehensive embodiment of entire simulation pipelines (calculations performed, codes used, data generated) in the form of graphs that allow retracing and reproducing any computed result. When an AiiDA database is shared on Materials Cloud, peers can browse the interconnected record of simulations, download individual files or the full database, and start their research from the results of the original authors. The infrastructure is agnostic to the specific simulation codes used and can support diverse applications in computational science that transcend its initial materials domain.
Journal Article
MedMNIST v2 - A large-scale lightweight benchmark for 2D and 3D biomedical image classification
by Pfister, Hanspeter; Wei, Donglai; Yang, Jiancheng
in 631/114/1305, 706/648/697/129, Algorithms
2023
We introduce MedMNIST v2, a large-scale MNIST-like dataset collection of standardized biomedical images, including 12 datasets for 2D and 6 datasets for 3D. All images are pre-processed into a small size of 28 × 28 (2D) or 28 × 28 × 28 (3D) with the corresponding classification labels so that no background knowledge is required for users. Covering primary data modalities in biomedical images, MedMNIST v2 is designed to perform classification on lightweight 2D and 3D images with various dataset scales (from 100 to 100,000) and diverse tasks (binary/multi-class, ordinal regression, and multi-label). The resulting dataset, consisting of 708,069 2D images and 9,998 3D images in total, could support numerous research/educational purposes in biomedical image analysis, computer vision, and machine learning. We benchmark several baseline methods on MedMNIST v2, including 2D/3D neural networks and open-source/commercial AutoML tools. The data and code are publicly available at https://medmnist.com/.
Measurement(s): supervised machine learning
Technology Type(s): machine learning
Journal Article
Different scaling of linear models and deep learning in UKBiobank brain images versus machine-learning datasets
2020
Recently, deep learning has unlocked unprecedented success in various domains, especially using images, text, and speech. However, deep learning is only beneficial if the data have nonlinear relationships and if they are exploitable at available sample sizes. We systematically profiled the performance of deep, kernel, and linear models as a function of sample size on UKBiobank brain images against established machine learning references. On MNIST and Zalando Fashion, prediction accuracy consistently improves when escalating from linear models to shallow-nonlinear models, and further improves with deep-nonlinear models. In contrast, using structural or functional brain scans, simple linear models perform on par with more complex, highly parameterized models in age/sex prediction across increasing sample sizes. In sum, linear models keep improving as the sample size approaches ~10,000 subjects. Yet, nonlinearities for predicting common phenotypes from typical brain scans remain largely inaccessible to the examined kernel and deep learning methods.
Schulz et al. systematically benchmark performance scaling with increasingly sophisticated prediction algorithms and with increasing sample size in reference machine-learning and biomedical datasets. Complicated nonlinear intervariable relationships remain largely inaccessible for predicting key phenotypes from typical brain scans.
Journal Article
Data sharing practices and data availability upon request differ across scientific disciplines
by Astapova, Anastasiya; Eenmaa, Helen; Pedaste, Margus
in 631/158/2452, 706/648/697/129/2043, Analysis
2021
Data sharing is one of the cornerstones of modern science that enables large-scale analyses and reproducibility. We evaluated data availability in research articles across nine disciplines in Nature and Science magazines and recorded corresponding authors’ concerns, requests and reasons for declining data sharing. Although data sharing has improved in the last decade and particularly in recent years, data availability and willingness to share data still differ greatly among disciplines. We observed that statements of data availability upon (reasonable) request are inefficient and should not be allowed by journals. To improve data sharing at the time of manuscript acceptance, researchers should be better motivated to release their data with real benefits such as recognition, or bonus points in grant and job applications. We recommend that data management costs should be covered by funding agencies; publicly available research data ought to be included in the evaluation of applications; and surveillance of data sharing should be enforced by both academic publishers and funders. These cross-discipline survey data are available from the plutoF repository.
Journal Article
T cell lymphoma and secondary primary malignancy risk after commercial CAR T cell therapy
by Van Deerlin, Vivianna M.; Levine, Bruce L.; Svoboda, Jakub
in 631/67/1059/2325, 631/67/1990, 692/308/409
2024
We report a T cell lymphoma (TCL) occurring 3 months after anti-CD19 chimeric antigen receptor (CAR) T cell immunotherapy for non-Hodgkin B cell lymphoma. The TCL was diagnosed from a thoracic lymph node upon surgery for lung cancer. The TCL exhibited CD8+ cytotoxic phenotype and a JAK3 variant, while the CAR transgene was very low. The T cell clone was identified at low levels in the blood before CAR T infusion and in lung cancer. To assess the overall risk of secondary primary malignancy after commercial CAR T (CD19, BCMA), we analyzed 449 patients treated at the University of Pennsylvania. At a median follow-up of 10.3 months, 16 patients (3.6%) had a secondary primary malignancy. The median onset time was 26.4 and 9.7 months for solid and hematological malignancies, respectively. The projected 5-year cumulative incidence is 15.2% for solid and 2.3% for hematological malignancies. Overall, one case of TCL was observed, suggesting a low risk of TCL after CAR T.
Profiling of a case of secondary T cell lymphoma following anti-CD19 CAR T cell therapy suggests that it was not caused by CAR insertional mutagenesis, with single-center analysis indicating that secondary T cell lymphoma risk after commercial CAR T cell treatment is low.
Journal Article
EV-TRACK: transparent reporting and centralizing knowledge in extracellular vesicle research
by National Cancer Center; Centre National de la Recherche Scientifique (CNRS); Hyenne, Vincent
in 631/80/313/1481, 706/648/697/129, Bioinformatics
2017
We argue that the field of extracellular vesicle (EV) biology needs more transparent reporting to facilitate interpretation and replication of experiments. To achieve this, we describe EV-TRACK, a crowdsourcing knowledgebase (http://evtrack.org) that centralizes EV biology and methodology with the goal of stimulating authors, reviewers, editors and funders to put experimental guidelines into practice.
Journal Article