Catalogue Search | MBRL

Wide-Open: Accelerating public data release by automating detection of overdue datasets

by Howe, Bill , Poon, Hoifung , Grechkin, Maxim in Access to Information , Animals , Application programming interface

2017

Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week.

Journal Article
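
As a rough illustration of the text-mining step described in the abstract above, the following Python sketch scans article text for strings shaped like GEO series accessions ("GSE" followed by digits) and SRA study accessions ("SRP" followed by digits). The patterns, the function name `find_accessions`, and the sample sentence with its made-up accession numbers are assumptions for illustration only; they are not taken from the Wide-Open system. The step that queries the repository to check whether a dataset is still private is not shown.

```python
import re

# Assumed accession shapes: GEO series ("GSE" + digits) and SRA studies ("SRP" + digits).
# The patterns actually used by Wide-Open may differ.
ACCESSION_RE = re.compile(r"\b(GSE\d+|SRP\d+)\b")


def find_accessions(text):
    """Return the distinct accession-like strings in text, in order of first appearance."""
    seen = []
    for match in ACCESSION_RE.finditer(text):
        acc = match.group(1)
        if acc not in seen:
            seen.append(acc)
    return seen


# Hypothetical sentence; the accession numbers are invented for the example.
sample = "Data are available under GSE12345 and SRP067890; see also GSE12345."
print(find_accessions(sample))
```

Each candidate found this way would then need to be looked up in the corresponding repository to decide whether it is overdue for release.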

Identifying Network Perturbation in Cancer

by Lee, Su-In , Gentles, Andrew J. , Grechkin, Maxim in Biology and Life Sciences , Breast Neoplasms - genetics , Cancer

2016

We present a computational framework, called DISCERN (DIfferential SparsE Regulatory Network), to identify informative topological changes in gene-regulator dependence networks inferred on the basis of mRNA expression datasets within distinct biological states. DISCERN takes two expression datasets as input: an expression dataset of diseased tissues from patients with a disease of interest and another expression dataset from matching normal tissues. DISCERN estimates the extent to which each gene is perturbed, that is, has distinct regulator connectivity in the inferred gene-regulator dependencies between the disease and normal conditions. This approach has distinct advantages over existing methods. First, DISCERN infers conditional dependencies between candidate regulators and genes, where conditional dependence relationships discriminate the evidence for direct interactions from indirect interactions more precisely than pairwise correlation. Second, DISCERN uses a new likelihood-based scoring function to alleviate concerns about accuracy of the specific edges inferred in a particular network. DISCERN identifies perturbed genes more accurately in synthetic data than existing methods to identify perturbed genes between distinct states. In expression datasets from patients with acute myeloid leukemia (AML), breast cancer and lung cancer, genes with high DISCERN scores in each cancer are enriched for known tumor drivers, genes associated with the biological processes known to be important in the disease, and genes associated with patient prognosis, in the respective cancer. Finally, we show that DISCERN can uncover potential mechanisms underlying network perturbation by explaining observed epigenomic activity patterns in cancer and normal tissue types more accurately than alternative methods, based on the available epigenomic data from the ENCODE project.

Journal Article

Computational Curation of Open Science Data

by Grechkin, Maxim in Computer science , Information science

2018

Rapid advances in data collection, storage and processing technologies are driving a new, data-driven paradigm in science. In the life sciences, progress is driven by plummeting genome sequencing costs, opening up new fields of bioinformatics, genomics, and systems biology. The return on the enormous investments into the collection and storage of the data is hindered by a lack of curation, leaving a significant portion of the data stagnant and underused. In this dissertation, we introduce several approaches aimed at making open scientific data accessible, valuable, and reusable. First, in the Wide-Open project, we introduce a text mining system for detecting datasets that are referenced in published papers but are still kept private. After parsing over 1.5 million open access publications, Wide-Open identified hundreds of datasets overdue for publication, 400 of which were then released within one week. Second, we propose a machine learning system, EZLearn, for annotating scientific data into potentially thousands of classes without the manual work required to provide training labels. EZLearn is based on the observation that in scientific domains, data samples often come with natural language descriptions meant for human consumption. We take advantage of those descriptions by introducing an auxiliary natural language processing system, training it together with the main classifier in a co-training fashion. Third, we introduce Cedalion, a system that can capture scientific claims from papers, validate them against the data associated with the paper, then generalize and adapt the claims to other relevant datasets in the repository to gather additional statistical evidence. We evaluated Cedalion by applying it to gene expression datasets and producing reports summarizing the evidence for or against the claim based on the entirety of the collected knowledge in the repository. We find that the claim-based algorithms we propose outperform conventional data integration methods and achieve high accuracy against manually validated claims.

Dissertation

EZLearn: Exploiting Organic Supervision in Large-Scale Data Annotation

by Poon, Hoifung , Grechkin, Maxim , Howe, Bill in Annotations , Domains , Image classification

2018

Many real-world applications require automated data annotation, such as identifying tissue origins based on gene expressions and classifying images into semantic categories. Annotation classes are often numerous and subject to changes over time, and annotating examples has become the major bottleneck for supervised learning methods. In science and other high-value domains, large repositories of data samples are often available, together with two sources of organic supervision: a lexicon for the annotation classes, and text descriptions that accompany some data samples. Distant supervision has emerged as a promising paradigm for exploiting such indirect supervision by automatically annotating examples where the text description contains a class mention in the lexicon. However, due to linguistic variations and ambiguities, such training data is inherently noisy, which limits the accuracy of this approach. In this paper, we introduce an auxiliary natural language processing system for the text modality, and incorporate co-training to reduce noise and augment signal in distant supervision. Without using any manually labeled data, our EZLearn system learned to accurately annotate data samples in functional genomics and scientific figure comprehension, substantially outperforming state-of-the-art supervised methods trained on tens of thousands of annotated examples.

Paper
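
To make the distant-supervision baseline mentioned in the abstract above concrete, here is a minimal Python sketch that assigns a class to a sample whenever its text description mentions a lexicon term for that class. This is only the naive labelling step that the abstract describes as noisy, not EZLearn's co-training procedure; the function name `distant_labels`, the two-class lexicon, and the sample description are hypothetical.

```python
def distant_labels(description, lexicon):
    """Return the classes whose lexicon terms occur in the description (case-insensitive)."""
    text = description.lower()
    return sorted(
        cls
        for cls, terms in lexicon.items()
        if any(term.lower() in text for term in terms)
    )


# Hypothetical lexicon mapping each annotation class to surface terms.
lexicon = {"liver": ["liver", "hepatic"], "lung": ["lung", "pulmonary"]}
print(distant_labels("Hepatic tissue from adult donors", lexicon))
```

The noise is easy to see with this kind of matching: a description such as "no liver involvement" would still be labelled as liver, which is the sort of error the abstract attributes to linguistic variation and ambiguity.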

Identifying Network Perturbation in Cancer

by Lee, Su-In , Gentles, Andrew J , Grechkin, Maxim in Activity patterns , Acute myeloid leukemia , Breast cancer

2016

We present a computational framework, called DISCERN (DIfferential SparsE Regulatory Network), to identify informative topological changes in gene-regulator dependence networks inferred on the basis of mRNA expression datasets within distinct biological states. DISCERN takes two expression datasets as input: an expression dataset of diseased tissues from patients with a disease of interest and another expression dataset from matching normal tissues. DISCERN estimates the extent to which each gene is perturbed, that is, has distinct regulator connectivity in the inferred gene-regulator dependencies between the disease and normal conditions. This approach has distinct advantages over existing methods. First, DISCERN infers conditional dependencies between candidate regulators and genes, where conditional dependence relationships discriminate the evidence for direct interactions from indirect interactions more precisely than pairwise correlation. Second, DISCERN uses a new likelihood-based scoring function to alleviate concerns about accuracy of the specific edges inferred in a particular network. DISCERN identifies perturbed genes more accurately in synthetic data than existing methods to identify perturbed genes between distinct states. In expression datasets from patients with acute myeloid leukemia (AML), breast cancer and lung cancer, genes with high DISCERN scores in each cancer are enriched for known tumor drivers, genes associated with the biological processes known to be important in the disease, and genes associated with patient prognosis, in the respective cancer. Finally, we show that DISCERN can uncover potential mechanisms underlying network perturbation by explaining observed epigenomic activity patterns in cancer and normal tissue types more accurately than alternative methods, based on the available epigenomic data from the ENCODE project.

Paper

Aptamer-Conjugated Tb(III)-Doped Silica Nanoparticles for Luminescent Detection of Leukemia Cells

by Mustafina, Asiya R. , Berezovski, Maxim V. , Grechkina, Svetlana L. in Acute lymphoblastic leukemia , Apoptosis , Aptamers

2020

DNA aptamers have many benefits for cell imaging, such as high affinity and specificity, ease of chemical functionalization, and low cost of production. Among known aptamers, Sgc8-aptamer was selected against acute lymphoblastic leukemia cells with a dissociation constant in the nanomolar range. The aptamer was previously used for covalent coupling with fluorescent and magnetic nanoparticles, as well as for the fabrication of aptamer-based biosensors. Among commonly used fluorescent tags, lanthanide nanoparticles offer stable luminescence with narrow, well-resolved emission peaks and the absence of photoblinking. In other words, lanthanide nanoparticles could serve as luminescence reporters and be used in biosensing. In our study, we conjugated amino- and carboxyl-modified silica-coated terbium (III) thiacalix[4]arenesulfonate luminescent nanoparticles with Sgc8-aptamer and showed the ability of the aptamer-conjugated nanoparticles to detect leukemia cells using fluorescence microscopy. In addition, we conducted a cell viability assay and confirmed that the nanoparticles do not induce spontaneous cell apoptosis or necrosis and could potentially be used for bioimaging applications.

Journal Article

Stress-strain state analysis of the leading car body of DPKr-2 diesel train under action of design and operational loads

by Grechkin, Alex , Kuzyshyn, Andriy , Kramarenko, Maxim in Acceleration , Accidental collisions , Automobile industry

2019

Purpose. To ensure the strength and durability of the main structural element of the DPKr-2 diesel train, the leading car body. Methodology. A spatial solid 3-D model of the body was built, and durability calculations were carried out for the loads stipulated by the regulatory documents in force in Ukraine. In particular, the following main design modes were considered: mode 1, a notional safety mode that accounts for the considerable longitudinal forces that can arise during shunting movements, transportation and accidental collision; and mode 2, an operational mode that accounts for the forces acting on a train during acceleration to design speed, coasting, or braking from this speed while passing through a curve. Results. Based on the results of theoretical and experimental studies, it was concluded that the leading car body of the DPKr-2 diesel train meets the requirements of the regulatory documents regarding strength and durability. Practical relevance. The set of calculations and experiments assessing the stress-strain state of the leading car body of the DPKr-2 diesel train under design and operational loads made it possible to create a design that meets not only operational requirements but also strength and durability requirements.

Journal Article
