Catalogue Search | MBRL

Creating artificial human genomes using generative neural networks

by Ongaro, Linda , Tallec, Corentin , Pagani, Luca in Biodiversity and Ecology , Bioinformatics , Biology and Life Sciences

2021

Generative models have shown breakthroughs in a wide spectrum of domains due to recent advancements in machine learning algorithms and increased computational power. Despite these impressive achievements, the ability of generative models to create realistic synthetic data is still under-exploited in genetics and absent from population genetics. Yet a known limitation in the field is the reduced access to many genetic databases due to concerns about violations of individual privacy, although they would provide a rich resource for data mining and integration towards advancing genetic studies. In this study, we demonstrated that deep generative adversarial networks (GANs) and restricted Boltzmann machines (RBMs) can be trained to learn the complex distributions of real genomic datasets and generate novel high-quality artificial genomes (AGs) with little to no privacy loss. We show that our generated AGs replicate characteristics of the source dataset such as allele frequencies, linkage disequilibrium, pairwise haplotype distances and population structure. Moreover, they can also inherit complex features such as signals of selection. To illustrate the promising outcomes of our method, we showed that imputation quality for low frequency alleles can be improved by data augmentation to reference panels with AGs and that the RBM latent space provides a relevant encoding of the data, hence allowing further exploration of the reference dataset and features for solving supervised tasks. Generative models and AGs have the potential to become valuable assets in genetic studies by providing a rich yet compact representation of existing genomes and high-quality, easy-access and anonymous alternatives for private databases.

Journal Article

Factor analysis of ancient population genomic samples

by François, Olivier , Jay, Flora in 45/23 , 631/208/182 , 631/208/457

2020

The recent years have seen a growing number of studies investigating evolutionary questions using ancient DNA. To address these questions, one of the most frequently used methods is principal component analysis (PCA). When PCA is applied to temporal samples, the sample dates are, however, ignored during analysis, leading to imperfect representations of samples in PC plots. Here, we present a factor analysis (FA) method in which individual scores are corrected for the effect of allele frequency drift over time. We obtained exact solutions for the estimates of corrected factors, and we provided a fast algorithm for their computation. Using computer simulations and ancient European samples, we compared geometric representations obtained from FA with PCA and with ancestry estimation programs. In admixture analyses, FA estimates agreed with tree-based statistics, and they were more accurate than those obtained from PCA projections and from ancestry estimation programs. A great advantage of FA over existing approaches is to improve descriptive analyses of ancient DNA samples without requiring inclusion of outgroup or present-day samples. Principal component analysis is often used in studies of ancient DNA, but does not account for the age of the samples. Here, the authors present a factor analysis (FA) which corrects for this by including the effect of allele frequency drift over time.

Journal Article

Deep convolutional and conditional neural networks for large-scale genomic data generation

by Szatkownik, Antoine , Charpiat, Guillaume , Jay, Flora in Analysis , Biology and Life Sciences , Comparative analysis

2023

Applications of generative models for genomic data have gained significant momentum in the past few years, with scopes ranging from data characterization to generation of genomic segments and functional sequences. In our previous study, we demonstrated that generative adversarial networks (GANs) and restricted Boltzmann machines (RBMs) can be used to create novel high-quality artificial genomes (AGs) which can preserve the complex characteristics of real genomes such as population structure, linkage disequilibrium and selection signals. However, a major drawback of these models is scalability, since the large feature space of genome-wide data increases computational complexity vastly. To address this issue, we implemented a novel convolutional Wasserstein GAN (WGAN) model along with a novel conditional RBM (CRBM) framework for generating AGs with high SNP number. These networks implicitly learn the varying landscape of haplotypic structure in order to capture complex correlation patterns along the genome and generate a wide diversity of plausible haplotypes. We performed comparative analyses to assess both the quality of these generated haplotypes and the amount of possible privacy leakage from the training data. As the importance of genetic privacy becomes more prevalent, the need for effective privacy protection measures for genomic data increases. We used generative neural networks to create large artificial genome segments which possess many characteristics of real genomes without substantial privacy leakage from the training dataset. In the near future, with further improvements in haplotype quality and privacy preservation, large-scale artificial genome databases can be assembled to provide easily accessible surrogates of real databases, allowing researchers to conduct studies with diverse genomic data within a safe ethical framework in terms of donor privacy.

Journal Article

Inferring Population Size History from Large Samples of Genome-Wide Molecular Data - An Approximate Bayesian Computation Approach

by Mona, Stefano , Boitard, Simon , Jay, Flora in Alleles , Analysis , Animals

2016

Inferring the ancestral dynamics of effective population size is a long-standing question in population genetics, which can now be tackled much more accurately thanks to the massive genomic data available in many species. Several promising methods that take advantage of whole-genome sequences have been recently developed in this context. However, they can only be applied to rather small samples, which limits their ability to estimate recent population size history. Besides, they can be very sensitive to sequencing or phasing errors. Here we introduce a new approximate Bayesian computation approach named PopSizeABC that allows estimating the evolution of the effective population size through time, using a large sample of complete genomes. This sample is summarized using the folded allele frequency spectrum and the average zygotic linkage disequilibrium at different bins of physical distance, two classes of statistics that are widely used in population genetics and can be easily computed from unphased and unpolarized SNP data. Our approach provides accurate estimations of past population sizes, from the very first generations before present back to the expected time to the most recent common ancestor of the sample, as shown by simulations under a wide range of demographic scenarios. When applied to samples of 15 or 25 complete genomes in four cattle breeds (Angus, Fleckvieh, Holstein and Jersey), PopSizeABC revealed a series of population declines, related to historical events such as domestication or modern breed creation. We further highlight that our approach is robust to sequencing errors, provided summary statistics are computed from SNPs with common alleles.

Journal Article

Genomic Variation in Seven Khoe-San Groups Reveals Adaptation and Complex African History

by Schlebusch, Carina M. , Gattepaille, Lucie M. , Jay, Flora in Adaptation , Adaptation, Biological , Adaptation, Biological - genetics

2012

The history of click-speaking Khoe-San, and African populations in general, remains poorly understood. We genotyped ~2.3 million single-nucleotide polymorphisms in 220 southern Africans and found that the Khoe-San diverged from other populations ≥100,000 years ago, but population structure within the Khoe-San dated back to about 35,000 years ago. Genetic variation in various sub-Saharan populations did not localize the origin of modern humans to a single geographic region within Africa; instead, it indicated a history of admixture and stratification. We found evidence of adaptation targeting muscle function and immune response; potential adaptive introgression of protection from ultraviolet light; and selection predating modern human diversification, involving skeletal and neurological development. These new findings illustrate the importance of African genomic diversity in understanding human evolutionary history.

Journal Article

A High-Coverage Genome Sequence from an Archaic Denisovan Individual

by Kircher, Martin , Siebauer, Michael , Shendure, Jay in Alleles , ancestry , Animals

2012

We present a DNA library preparation method that has allowed us to reconstruct a high-coverage (30×) genome sequence of a Denisovan, an extinct relative of Neandertals. The quality of this genome allows a direct estimation of Denisovan heterozygosity indicating that genetic diversity in these archaic hominins was extremely low. It also allows tentative dating of the specimen on the basis of "missing evolution" in its genome, detailed measurements of Denisovan and Neandertal admixture into present-day human populations, and the generation of a near-complete catalog of genetic changes that swept to high frequency in modern humans since their divergence from Denisovans.

Journal Article

Spatial Inference of Admixture Proportions and Secondary Contact Zones

by Durand, Eric , Jay, Flora , Gaggiotti, Oscar E in Algorithms , Bayesian analysis , Clustering

2009

Genetic admixture of distinct gene pools is the consequence of complex spatiotemporal processes that could have involved massive migration and local mating during the history of a species. However, current methods for estimating individual admixture proportions do not incorporate this information. Here, we extend Bayesian clustering algorithms by including global trend surfaces and spatial autocorrelation in the prior distribution on individual admixture coefficients. We test our algorithm by using spatially explicit and realistic coalescent simulations of colonization followed by secondary contact. By coupling our multiscale spatial analyses with a Bayesian evaluation of model complexity and fit, we show that the algorithm provides a correct description of smooth clinal variation, while still detecting zones of sharp variation when they are present in the data. We also apply our approach to understand the population structure of the killifish, Fundulus heteroclitus, for which the algorithm uncovers a presumed contact zone on the Atlantic coast of North America.

Journal Article

Higher Levels of Neanderthal Ancestry in East Asians than in Europeans

by Durand, Eric Y , Jay, Flora , Stevison, Laurie S in Animals , Asian Continental Ancestry Group - genetics , Deoxyribonucleic acid

2013

Neanderthals were a group of archaic hominins that occupied most of Europe and parts of Western Asia from ∼30,000 to 300,000 years ago (KYA). They coexisted with modern humans during part of this time. Previous genetic analyses that compared a draft sequence of the Neanderthal genome with genomes of several modern humans concluded that Neanderthals made a small (1–4%) contribution to the gene pools of all non-African populations. This observation was consistent with a single episode of admixture from Neanderthals into the ancestors of all non-Africans when the two groups coexisted in the Middle East 50–80 KYA. We examined the relationship between Neanderthals and modern humans in greater detail by applying two complementary methods to the published draft Neanderthal genome and an expanded set of high-coverage modern human genome sequences. We find that, consistent with the recent finding of Meyer et al. (2012), Neanderthals contributed more DNA to modern East Asians than to modern Europeans. Furthermore we find that the Maasai of East Africa have a small but significant fraction of Neanderthal DNA. Because our analysis is of several genomic samples from each modern human population considered, we are able to document the extent of variation in Neanderthal ancestry within and among populations. Our results combined with those previously published show that a more complex model of admixture between Neanderthals and modern humans is necessary to account for the different levels of Neanderthal ancestry among human populations. In particular, at least some Neanderthal–modern human admixture must postdate the separation of the ancestors of modern European and modern East Asian populations.

Journal Article

Genomic Evidence for Island Population Conversion Resolves Conflicting Theories of Polar Bear Evolution

by Stirling, Ian , Stiller, Mathias , Jay, Flora in Animal behavior , Animal populations , Animals

2013

Despite extensive genetic analysis, the evolutionary relationship between polar bears (Ursus maritimus) and brown bears (U. arctos) remains unclear. The two most recent comprehensive reports indicate a recent divergence with little subsequent admixture or a much more ancient divergence followed by extensive admixture. At the center of this controversy are the Alaskan ABC Islands brown bears that show evidence of shared ancestry with polar bears. We present an analysis of genome-wide sequence data for seven polar bears, one ABC Islands brown bear, one mainland Alaskan brown bear, and a black bear (U. americanus), plus recently published datasets from other bears. Surprisingly, we find clear evidence for gene flow from polar bears into ABC Islands brown bears but no evidence of gene flow from brown bears into polar bears. Importantly, while polar bears contributed <1% of the autosomal genome of the ABC Islands brown bear, they contributed 6.5% of the X chromosome. The magnitude of sex-biased polar bear ancestry and the clear direction of gene flow suggest a model wherein the enigmatic ABC Islands brown bears are the descendants of a polar bear population that was gradually converted into brown bears via male-dominated brown bear admixture. We present a model that reconciles heretofore conflicting genetic observations. We posit that the enigmatic ABC Islands brown bears derive from a population of polar bears likely stranded by the receding ice at the end of the last glacial period. Since then, male brown bear migration onto the island has gradually converted these bears into an admixed population whose phenotype and genotype are principally brown bear, except at mtDNA and X-linked loci. This process of genome erosion and conversion may be a common outcome when climate change or other forces cause a population to become isolated and then overrun by species with which it can hybridize.

Journal Article

Differences in local population history at the finest level: the case of the Estonian population

by Metspalu, Mait , Saag, Lauri , Hudjashov, Georgi in Demography , Famine , Gene frequency

2020

Several recent studies detected fine-scale genetic structure in human populations. Hence, groups conventionally treated as single populations harbour significant variation in terms of allele frequencies and patterns of haplotype sharing. It has been shown that these findings should be considered when performing studies of genetic associations and natural selection, especially when dealing with polygenic phenotypes. However, there is little understanding of the practical effects of such genetic structure on demography reconstructions and selection scans when focusing on recent population history. Here we tested the impact of population structure on such inferences using high-coverage (~30×) genome sequences of 2305 Estonians. We show that different regions of Estonia differ in both effective population size dynamics and signatures of natural selection. By analyzing identity-by-descent segments we also reveal that some Estonian regions exhibit evidence of a bottleneck 10–15 generations ago reflecting sequential episodes of wars, plague and famine, although this signal is virtually undetected when treating Estonia as a single population. Besides that, we provide a framework for relating effective population size estimated from genetic data to actual census size and validate it on the Estonian population. This approach may be widely used both to cross-check estimates based on historical sources as well as to get insight into times and/or regions with no other information available. Our results suggest that the history of human populations within the last few millennia can be highly region specific and cannot be properly studied without taking local genetic structure into account.

Journal Article
