Catalogue Search | MBRL

Quickly identifying identical and closely related subjects in large databases using genotype data

by Jin, Yumi; Feolo, Michael; Schäffer, Alejandro A. in Algorithms; Archives & records; Biology and Life Sciences

2017

Genome-wide association studies (GWAS) usually rely on the assumption that different samples are not from closely related individuals. Detection of duplicates and close relatives becomes more difficult both statistically and computationally when one wants to combine datasets that may have been genotyped on different platforms. The dbGaP repository at the National Center for Biotechnology Information (NCBI) contains datasets from hundreds of studies with over one million samples. There are many duplicates and closely related individuals both within and across studies from different submitters. Relationships between studies cannot always be identified by the submitters of individual datasets. To aid in curation of dbGaP, we developed a rapid statistical method called Genetic Relationship and Fingerprinting (GRAF) to detect duplicates and closely related samples, even when the sets of genotyped markers differ and the DNA strand orientations are unknown. GRAF extracts genotypes of 10,000 informative and independent SNPs from genotype datasets obtained using different methods, and implements quick algorithms that enable it to find all of the duplicate pairs from more than 880,000 samples within and across dbGaP studies in less than two hours. In addition, GRAF uses two statistical metrics called All Genotype Mismatch Rate (AGMR) and Homozygous Genotype Mismatch Rate (HGMR) to determine subject relationships directly from the observed genotypes, without estimating probabilities of identity by descent (IBD), or kinship coefficients, and compares the predicted relationships with those reported in the pedigree files. We implemented GRAF in a freely available C++ program of the same name. In this paper, we describe the methods in GRAF and validate the usage of GRAF on samples from the dbGaP repository. Other scientists can use GRAF on their own samples and in combination with samples downloaded from dbGaP.
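
The abstract names the two mismatch metrics without defining them. The sketch below shows one plausible reading of the names, and it is not the GRAF implementation (which is a C++ program). It assumes AGMR is the fraction of SNPs with differing genotypes among SNPs called in both samples, and HGMR is the fraction with opposite homozygous genotypes among SNPs where both samples are homozygous. The function name `mismatch_rates` and the 0/1/2 genotype coding are hypothetical, and the alleles are assumed to be already harmonized to the same strand.

```python
from math import nan
from typing import Optional, Sequence, Tuple

# Assumed coding: 0 or 2 = homozygous, 1 = heterozygous, None = missing call.
Genotype = Optional[int]


def mismatch_rates(a: Sequence[Genotype], b: Sequence[Genotype]) -> Tuple[float, float]:
    """Return (AGMR, HGMR) for two samples genotyped at the same SNPs."""
    if len(a) != len(b):
        raise ValueError("both samples must cover the same SNP list")

    compared = mismatched = 0          # counts for AGMR
    hom_compared = hom_mismatched = 0  # counts for HGMR

    for ga, gb in zip(a, b):
        if ga is None or gb is None:
            continue  # skip SNPs missing in either sample
        compared += 1
        if ga != gb:
            mismatched += 1
        if ga in (0, 2) and gb in (0, 2):
            hom_compared += 1
            if ga != gb:
                hom_mismatched += 1

    agmr = mismatched / compared if compared else nan
    hgmr = hom_mismatched / hom_compared if hom_compared else nan
    return agmr, hgmr


sample_a = [0, 1, 2, 2, None, 0]
sample_b = [0, 2, 0, 2, 1, 0]
agmr, hgmr = mismatch_rates(sample_a, sample_b)
```

Under these assumptions, duplicate samples would show both rates near zero apart from genotyping error, while the thresholds that separate relationship types are described in the paper and are not reproduced here.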

Journal Article

Consent Codes: Upholding Standard Data Use Conditions

by Brookes, Anthony J.; Hurles, Matthew; Rehm, Heidi L. in Archives & records; Big data; Codes

2016

A systematic way of recording data use conditions that are based on consent permissions as found in the datasets of the main public genome archives (NCBI dbGaP and EMBL-EBI/CRG EGA).

Journal Article

Completing the map of human genetic variation

by Brooks, Lisa D.; Church, Deanna M.; Waterston, Robert H. in Biological and medical sciences; Cloning; Disease

2007

A plan to identify and integrate normal structural variation into the human genome sequence. Studies of human genetic variation tend to concentrate on single-nucleotide differences. But at the next level up there are structural differences (insertions, deletions and inversions, for instance), a few kilobases to hundreds of kilobases in size, that add another dimension to genetic variation. A new project under the auspices of the National Human Genome Research Institute aims to accumulate a dataset of structural variation within the genome to give a comprehensive picture of DNA sequence-level differences found in phenotypically normal individuals. From there it’s a short step to studies of disease at the level of the individual genome. In this issue members of the project working group describe in detail its aims and methodology.

Journal Article

DNA Identifications after the 9/11 World Trade Center Attack

by Forman, Lisa; Parsons, Thomas J.; Ballantyne, Jack in Analysis; Casualties; Company business management

2005

The attack on the World Trade Center on 9/11/2001 challenged current approaches to forensic DNA typing methods. The large number of victims and the extreme thermal and physical conditions of the site necessitated special approaches to the DNA-based identification. Because of these and many additional challenges, new procedures were created or modified from routine forensic protocols. This effort facilitated the identification of 1594 of the 2749 victims. In this Policy Forum, the authors, who were members of the World Trade Center Kinship and Data Analysis Panel, review the lessons of the attack response from the perspective of DNA forensic identification and suggest policies and procedures for future mass disasters or large-scale terrorist attacks.

Journal Article

ClinGen — The Clinical Genome Resource

by Evans, James P; Plon, Sharon E; Rehm, Heidi L in Databases, Genetic; Disease; Genetic Diseases, Inborn - genetics

2015

The assignment of pathogenic status to genetic variants has been stymied by conflicting study results and lack of a publicly accessible database, such as ClinVar, which is now part of the Clinical Genome Resource. On autopsy, a patient is found to have hypertrophic cardiomyopathy. The patient’s family pursues genetic testing that shows a “likely pathogenic” variant for the condition on the basis of a study in an original research publication. Given the dominant inheritance of the condition and the risk of sudden cardiac death, other family members are tested for the genetic variant to determine their risk. Several family members test negative and are told that they are not at risk for hypertrophic cardiomyopathy and sudden cardiac death, and those who test positive are told that they need to be regularly monitored for cardiomyopathy . . .

Journal Article

Phenotype–Genotype Integrator (PheGenI): synthesizing genome-wide association study (GWAS) data with existing genomic resources

by Junkins, Heather A; Sherry, Stephen T; Feolo, Mike in Archives & records; Biotechnology; Celiac disease

2014

Rapidly accumulating data from genome-wide association studies (GWASs) and other large-scale studies are most useful when synthesized with existing databases. To address this opportunity, we developed the Phenotype-Genotype Integrator (PheGenI), a user-friendly web interface that integrates various National Center for Biotechnology Information (NCBI) genomic databases with association data from the National Human Genome Research Institute GWAS Catalog and supports downloads of search results. Here, we describe the rationale for and development of this resource. Integrating over 66,000 association records with extensive single nucleotide polymorphism (SNP), gene, and expression quantitative trait loci data already available from the NCBI, PheGenI enables deeper investigation and interrogation of SNPs associated with a wide range of traits, facilitating the examination of the relationships between genetic variation and human diseases.

Journal Article

A Mathematical Approach to the Analysis of Multiplex DNA Profiles

by Goor, Robert M.; Forman Neall, Lisa; Hoffman, Douglas in Algorithms; Alleles; Cell Biology

2011

Multiplex DNA profiles are used extensively for biomedical and forensic purposes. However, while DNA profile data generation is automated, human analysis of those data is not, and the need for speed combined with accuracy demands a computer-automated approach to sample interpretation and quality assessment. In this paper, we describe an integrated mathematical approach to modeling the data and extracting the relevant information, while rejecting noise and sample artifacts. We conclude with examples showing the effectiveness of our algorithms.

Journal Article

The NCBI dbGaP database of genotypes and phenotypes

by Shevelev, Sergey; Graeff, Alan; Ziyabari, Lora in Agriculture; Analysis; Animal Genetics and Genomics

2007

The National Center for Biotechnology Information has created the dbGaP public repository for individual-level phenotype, exposure, genotype and sequence data and the associations between them. dbGaP assigns stable, unique identifiers to studies and subsets of information from those studies, including documents, individual phenotypic variables, tables of trait data, sets of genotype data, computed phenotype-genotype associations, and groups of study subjects who have given similar consents for use of their data.

Journal Article

Supplementing High-Density SNP Microarrays for Additional Coverage of Disease-Related Genes: Addiction as a Paradigm

by Uhl, George R.; Saccone, Scott F.; Saccone, Nancy L. in Addiction; Addictions; Alcohol

2009

Commercial SNP microarrays now provide comprehensive and affordable coverage of the human genome. However, some diseases have biologically relevant genomic regions that may require additional coverage. Addiction, for example, is thought to be influenced by complex interactions among many relevant genes and pathways. We have assembled a list of 486 biologically relevant genes nominated by a panel of experts on addiction. We then added 424 genes that showed evidence of association with addiction phenotypes through mouse QTL mappings and gene co-expression analysis. We demonstrate that there are a substantial number of SNPs in these genes that are not well represented by commercial SNP platforms. We address this problem by introducing a publicly available SNP database for addiction. The database is annotated using numeric prioritization scores indicating the extent of biological relevance. The scores incorporate a number of factors such as SNP/gene functional properties (including synonymy and promoter regions), data from mouse systems genetics and measures of human/mouse evolutionary conservation. We then used HapMap genotyping data to determine if a SNP is tagged by a commercial microarray through linkage disequilibrium. This combination of biological prioritization scores and LD tagging annotation will enable addiction researchers to supplement commercial SNP microarrays to ensure comprehensive coverage of biologically relevant regions.

Journal Article

Sequence Variations in the Public Human Genome Data Reflect a Bottlenecked Population History

by Schuler, Greg; Baker, Jonathan; Wheelan, Sarah in Anthropology; Biological Sciences; Censorship

2003

Single-nucleotide polymorphisms (SNPs) constitute the great majority of variations in the human genome, and as heritable variable landmarks they are useful markers for disease mapping and resolving population structure. Redundant coverage in overlaps of large-insert genomic clones, sequenced as part of the Human Genome Project, comprises a quarter of the genome, and it is representative in terms of base compositional and functional sequence features. We mined these regions to produce 500,000 high-confidence SNP candidates as a uniform resource for describing nucleotide diversity and its regional variation within the genome. Distributions of marker density observed at different overlap length scales under a model of recombination and population size change show that the history of the population represented by the public genome sequence is one of collapse followed by a recent phase of mild size recovery. The inferred times of collapse and recovery are Upper Paleolithic, in agreement with archaeological evidence of the initial modern human colonization of Europe.

Journal Article
