Catalogue Search | MBRL

Data‐driven guidelines for phylogenomic analyses using SNP data

by Specht, Chelsea D. , Wickell, David , Jelley, Chloe in Accuracy , ancestral state reconstructions , Application

2024

Premise There is a general lack of consensus on the best practices for filtering of single‐nucleotide polymorphisms (SNPs) and whether it is better to use SNPs or include flanking regions (full “locus”) in phylogenomic analyses and subsequent comparative methods. Methods Using genotyping‐by‐sequencing data from 22 Glycine species, we assessed the effects of SNP vs. locus usage and SNP retention stringency. We compared branch length, node support, and divergence time estimation across 16 datasets with varying amounts of missing data and total size. Results Our results revealed five aspects of phylogenomic data usage that may be generally applicable: (1) tree topology is largely congruent across analyses; (2) filtering strictly for SNP retention (e.g., 90–100%) reduces support and can alter some inferred relationships; (3) absolute branch lengths vary by two orders of magnitude between SNP and locus datasets; (4) data type and branch length variation have little effect on divergence time estimation; and (5) phylograms alter the estimation of ancestral states and rates of morphological evolution. Discussion Using SNP or locus datasets does not alter phylogenetic inference significantly, unless researchers want or need to use absolute branch lengths. We recommend against using excessive filtering thresholds for SNP retention to reduce the risk of producing inconsistent topologies and generating low support.
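The SNP retention filtering described in this abstract can be sketched as a call-rate threshold. This is a minimal illustration only, assuming genotypes are coded as alternate-allele counts with `None` for a missing call; the function names and the toy matrix are hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Sequence

Genotype = Optional[int]  # assumed coding: 0/1/2 alt-allele count, None = missing


def call_rate(calls: Sequence[Genotype]) -> float:
    """Fraction of samples with a non-missing genotype at one SNP."""
    if not calls:
        return 0.0
    return sum(c is not None for c in calls) / len(calls)


def filter_by_retention(
    matrix: Sequence[Sequence[Genotype]], min_call_rate: float
) -> List[Sequence[Genotype]]:
    """Keep SNP rows whose call rate is at least min_call_rate."""
    return [row for row in matrix if call_rate(row) >= min_call_rate]


# Toy matrix: one row per SNP, one column per sample (hypothetical data).
snps = [
    [0, 1, 2, 0],           # called in 4 of 4 samples
    [0, None, 2, 1],        # called in 3 of 4 samples
    [None, None, 1, 0],     # called in 2 of 4 samples
    [None, None, None, 2],  # called in 1 of 4 samples
]

for threshold in (0.25, 0.5, 0.75, 1.0):
    kept = filter_by_retention(snps, threshold)
    print(f"retention >= {threshold:.2f}: {len(kept)} SNPs kept")
```

Raising the threshold shrinks the dataset quickly in this toy case, which is the trade-off between missing data and total size that the abstract examines.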

Journal Article

Efficiency of different strategies to mitigate ascertainment bias when using SNP panels in diversity studies

by Sharifi, Ahmad Reza , Weigend, Annett , Simianer, Henner in Animal Genetics and Genomics , Animals , Ascertainment bias

2018

Background Single nucleotide polymorphism (SNP) panels have been widely used to study genomic variations within and between populations. Methods of SNP discovery have been a matter of debate for their potential of introducing ascertainment bias, and genetic diversity results obtained from the SNP genotype data can be misleading. We used a total of 42 chicken populations where both individual genotyped array data and pool whole genome resequencing (WGS) data were available. We compared allele frequency distributions and genetic diversity measures (expected heterozygosity (He), fixation index (FST) values, genetic distances and principal components analysis (PCA)) between the two data types. With the array data, we applied different filtering options (SNPs polymorphic in samples of two Gallus gallus wild populations, linkage disequilibrium (LD) based pruning and minor allele frequency (MAF) filtering, and combinations thereof) to assess their potential to mitigate the ascertainment bias. Results Rare SNPs were underrepresented in the array data. Array data consistently overestimated He compared to WGS data, although the ranking of the breeds was similar, as demonstrated by Spearman’s rank correlations ranging between 0.956 and 0.985. LD based pruning resulted in a reduced overestimation of He compared to the other filters and slightly improved the relationship with the WGS results. The raw array data and those with polymorphic SNPs in the wild samples underestimated pairwise FST values between breeds which had low FST (<0.15) in the WGS, and overestimated this parameter for high WGS FST (>0.15). LD based pruned data underestimated FST in a consistent manner. The genetic distance matrix from LD pruned data was more closely related to that of WGS than the other array versions. PCA was rather robust in all array versions, since the population structure on the PCA plot was generally well captured in comparison to the WGS data. Conclusions Among the tested filtering strategies, LD based pruning was found to account for the effects of ascertainment bias in the relatively best way, producing results which are most comparable to those obtained from WGS data and therefore is recommended for practical use.
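As a rough illustration of why under-representing rare SNPs inflates expected heterozygosity, the sketch below uses the standard biallelic formula He = 2p(1 - p) and an optional minor allele frequency cutoff. The allele frequencies are invented for the example, and the helper names are hypothetical rather than the authors' pipeline.

```python
from typing import Sequence


def alt_allele_frequency(genotypes: Sequence[int]) -> float:
    """Alt-allele frequency from diploid genotypes coded 0/1/2."""
    return sum(genotypes) / (2 * len(genotypes))


def expected_heterozygosity(p: float) -> float:
    """Expected heterozygosity of a biallelic locus: He = 2p(1 - p)."""
    return 2 * p * (1 - p)


def minor_allele_frequency(p: float) -> float:
    """Frequency of the rarer of the two alleles."""
    return min(p, 1 - p)


def mean_he(freqs: Sequence[float], min_maf: float = 0.0) -> float:
    """Mean He over loci whose MAF is at least min_maf."""
    kept = [p for p in freqs if minor_allele_frequency(p) >= min_maf]
    if not kept:
        raise ValueError("no loci pass the MAF filter")
    return sum(expected_heterozygosity(p) for p in kept) / len(kept)


# Hypothetical allele frequencies: three rare SNPs and two common ones.
freqs = [0.01, 0.02, 0.05, 0.30, 0.50]

print("mean He, all loci:   ", round(mean_he(freqs), 4))
print("mean He, MAF >= 0.05:", round(mean_he(freqs, min_maf=0.05), 4))
```

In this toy case, dropping the two rarest loci raises the mean He, which is the direction of bias the abstract reports for array data relative to WGS data.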

Journal Article

Hardy–Weinberg Equilibrium Filtering in Population Genomics: Empirical Review and Decision Framework for Improved Practice

by Hsu, Yu‐Hsun in Adaptation , Animal populations , Empirical analysis

2026

Hardy–Weinberg equilibrium (HWE) filtering remains widely used in population genomics, but its application remains inconsistent, often lacking detailed justification, and not always aligned with biological context. To evaluate whether conceptual awareness has translated into methodological change, we review empirical studies citing Pearman et al. (2022), a representative study testing the impacts of different grouping approaches for HWE filtering. While pooled filtering is becoming rare, we found a decreasing but still considerable heterogeneity in the decision of filtering schemes, limited reporting of thresholds, and few explicit justifications for applied approaches. These patterns suggest that awareness of HWE filtering limitations is increasing but has not yet led to consistent practice. We synthesise the biological and technical causes of HWE deviation, review recent advances, including population‐aware and structure‐informed filtering tools, and propose a transparent decision framework for population genomic studies. Rather than a default quality‐control step, HWE filtering should be applied as a hypothesis‐aware decision that reflects study aims and biological context. A citation‐based mini‐survey and decision workflow are provided to support biologically informed and reproducible applications. Hardy–Weinberg equilibrium (HWE) filtering is widely used in population genomics, but its application remains inconsistent and often lacks biological justification. This review went through recent empirical studies that demonstrate awareness of these issues to assess whether conceptual understanding has led to improved practice. This review synthesises the causes of HWE deviation and proposes a biologically informed decision workflow to support transparent filtering applications.
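The difference between pooled and population-aware HWE testing can be sketched with a plain chi-square goodness-of-fit statistic. This is an illustrative toy, not the decision framework proposed in the review: the genotype counts are invented, the function name is hypothetical, and 3.841 is the standard 5% critical value for one degree of freedom.

```python
from typing import Tuple

CRITICAL_5PCT_1DF = 3.841  # chi-square critical value, alpha = 0.05, 1 df


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square statistic for deviation from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic locus).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Two hypothetical populations, each in exact HWE at opposite allele frequencies.
pop1: Tuple[int, int, int] = (81, 18, 1)  # p = 0.9
pop2: Tuple[int, int, int] = (1, 18, 81)  # p = 0.1
pooled = tuple(a + b for a, b in zip(pop1, pop2))

for label, counts in (("pop1", pop1), ("pop2", pop2), ("pooled", pooled)):
    stat = hwe_chi_square(*counts)
    flag = "deviates" if stat > CRITICAL_5PCT_1DF else "consistent with HWE"
    print(f"{label}: chi2 = {stat:.2f} ({flag})")
```

Each population passes on its own, while the pooled sample shows a heterozygote deficit caused by population structure alone. A pooled filter would discard this locus even though nothing is technically wrong with it.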

Journal Article

Assessing the potential of genotyping‐by‐sequencing‐derived single nucleotide polymorphisms to identify the geographic origins of intercepted gypsy moth (Lymantria dispar) specimens: A proof‐of‐concept study

by Keena, Melody , Pouliot, Esther , Cusson, Michel in Butterflies & moths , Dispersal , Ecosystem stability

2018

Forest invasive alien species are a major threat to ecosystem stability and can have enormous economic and social impacts. For this reason, preventing the introduction of Asian gypsy moths (AGM; Lymantria dispar asiatica and L. d. japonica) into North America has been identified as a top priority by North American authorities. The AGM is an important defoliator of a wide variety of hardwood and coniferous trees, displaying a much broader host range and an enhanced dispersal ability relative to the already established European gypsy moth (L. d. dispar). Although molecular assays have been developed to help distinguish gypsy moth subspecies, these tools are not adequate for tracing the geographic origins of AGM samples intercepted on foreign vessels. Yet, this type of information would be very useful in characterizing introduction pathways and would help North American regulatory authorities in preventing introductions. The present proof‐of‐concept study assessed the potential of single nucleotide polymorphism (SNP) markers, obtained through genotyping by sequencing (GBS), to identify the geographic origins of gypsy moth samples. The approach was applied to eight laboratory‐reared gypsy moth populations, whose original stocks came from locations distributed over the entire range of L. dispar, comprising representatives of the three recognized subspecies. The various analyses we performed showed strong differentiation among populations (FST ≥ 0.237), enabling clear distinction of subspecies and geographic variants, while revealing introgression near the geographic boundaries between subspecies. This strong population structure resulted in 100% assignment success of moths to their original population when 2,327 SNPs were used. Although the SNP panels we developed are not immediately applicable to contemporary, natural populations because of distorted allele frequencies in the laboratory‐reared populations we used, our results attest to the potential of genomewide SNP markers as a tool to identify the geographic origins of intercepted gypsy moth samples.

Journal Article

Comparing a few SNP calling algorithms using low-coverage sequencing data

by Sun, Shuying , Yu, Xiaoqing in Algorithms , Autoimmune diseases , Bioinformatics

2013

Background Many Single Nucleotide Polymorphism (SNP) calling programs have been developed to identify Single Nucleotide Variations (SNVs) in next-generation sequencing (NGS) data. However, low sequencing coverage presents challenges to accurate SNV identification, especially in single-sample data. Moreover, commonly used SNP calling programs usually include several metrics in their output files for each potential SNP. These metrics are highly correlated in complex patterns, making it extremely difficult to select SNPs for further experimental validations. Results To explore solutions to the above challenges, we compare the performance of four SNP calling algorithms, SOAPsnp, Atlas-SNP2, SAMtools, and GATK, in a low-coverage single-sample sequencing dataset. Without any post-output filtering, SOAPsnp calls more SNVs than the other programs since it has fewer internal filtering criteria. Atlas-SNP2 has stringent internal filtering criteria; thus it reports the fewest SNVs. The numbers of SNVs called by GATK and SAMtools fall between SOAPsnp and Atlas-SNP2. Moreover, we explore the values of key metrics related to SNVs’ quality in each algorithm and use them as post-output filtering criteria to filter out low quality SNVs. Under different coverage cutoff values, we compare four algorithms and calculate the empirical positive calling rate and sensitivity. Our results show that: 1) the overall agreement of the four calling algorithms is low, especially in non-dbSNPs; 2) the agreement of the four algorithms is similar when using different coverage cutoffs, except that the non-dbSNPs agreement level tends to increase slightly with increasing coverage; 3) SOAPsnp, SAMtools, and GATK have a higher empirical calling rate for dbSNPs compared to non-dbSNPs; and 4) overall, GATK and Atlas-SNP2 have a relatively higher positive calling rate and sensitivity, but GATK calls more SNVs. Conclusions Our results show that the agreement between different calling algorithms is relatively low. Thus, more caution should be used in choosing algorithms, setting filtering parameters, and designing validation studies. For reliable SNV calling results, we recommend that users employ more than one algorithm and use metrics related to calling quality and coverage as filtering criteria.
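A comparison of this kind can be sketched with set operations on called positions. The definitions below are assumptions made for illustration and may differ from the paper's exact formulas: positive calling rate is taken as the share of a caller's sites found in a validated set, sensitivity as the share of the validated set recovered, and agreement as the sites shared by all callers divided by the sites called by any. Caller names and positions are hypothetical.

```python
from typing import Dict, Set


def positive_calling_rate(called: Set[int], validated: Set[int]) -> float:
    """Share of called sites that also appear in the validated set."""
    return len(called & validated) / len(called) if called else 0.0


def sensitivity(called: Set[int], validated: Set[int]) -> float:
    """Share of validated sites that the caller recovered."""
    return len(called & validated) / len(validated) if validated else 0.0


def agreement(call_sets: Dict[str, Set[int]]) -> float:
    """Sites called by every caller divided by sites called by any caller."""
    sets = list(call_sets.values())
    if not sets:
        return 0.0
    union = set().union(*sets)
    shared = set.intersection(*sets)
    return len(shared) / len(union) if union else 0.0


# Hypothetical genomic positions reported by three callers.
calls = {
    "caller_a": {1, 2, 3, 4, 5, 6},  # permissive caller
    "caller_b": {2, 3, 4},           # stringent caller
    "caller_c": {2, 3, 7},
}
validated = {2, 3, 4, 8}  # stand-in for an externally validated site list

for name, called in calls.items():
    pcr = positive_calling_rate(called, validated)
    sens = sensitivity(called, validated)
    print(f"{name}: positive calling rate = {pcr:.2f}, sensitivity = {sens:.2f}")
print(f"agreement across callers = {agreement(calls):.2f}")
```

In this toy data the permissive caller reports the most sites but has the lowest positive calling rate, and overall agreement is low, which mirrors the qualitative pattern the abstract describes.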

Journal Article

A bi-filtering method for processing single nucleotide polymorphism array data improves the quality of genetic map and accuracy of quantitative trait locus mapping in doubled haploid populations of polyploid Brassica napus

by Yi, Bin , Zhang, Chunyu , Zhou, Yongming in Analysis , Animal Genetics and Genomics , Biomedical and Life Sciences

2015

Background Single nucleotide polymorphism (SNP) markers have a wide range of applications in crop genetics and genomics. Due to their polyploid nature, many important crops, such as wheat, cotton and rapeseed, contain a large amount of repeat and homoeologous sequences in their genomes, which imposes a huge challenge in high-throughput genotyping with sequencing and/or array technologies. Allotetraploid Brassica napus (AACC, 2n = 4x = 38) comprises two highly homoeologous sub-genomes derived from its progenitor species B. rapa (AA, 2n = 2x = 20) and B. oleracea (CC, 2n = 2x = 18), and is an ideal species to exploit methods for reducing the interference of extensive inter-homoeologue polymorphisms (mHemi-SNPs and Pseudo-simple SNPs) between closely related sub-genomes. Results Based on a recent B. napus 6K SNP array, we developed a bi-filtering procedure to identify unauthentic lines in a DH population, and mHemi-SNPs and Pseudo-simple SNPs in an array data matrix. The procedure utilized both monomorphic and polymorphic SNPs in the DH population and could effectively distinguish the mHemi-SNPs and Pseudo-simple SNPs that resulted from superposition of the signals from multiple SNPs. Compared with the conventional procedure for array data processing, the bi-filtering method could minimize the pseudo linkage relationship caused by the mHemi-SNPs and Pseudo-simple SNPs, thus improving the quality of the SNP genetic map. Furthermore, the improved genetic map could increase the accuracy of QTL mapping, as demonstrated by the ability to eliminate non-real QTLs in the mapping population. Conclusions The bi-filtering analysis of the SNP array data represents a novel approach to effectively assigning the multi-locus SNP genotypes in polyploid B. napus and may find wide applications to SNP analyses in polyploid crops.

Journal Article

Impact of pre-imputation SNP-filtering on genotype imputation results

by Horn, Katrin , Roshyara, Nab Raj , Scholz, Markus in Accuracy , Algorithms , Data analysis

2014

Background: Imputation of partially missing or unobserved genotypes is an indispensable tool for SNP data analyses. However, research and understanding of the impact of initial SNP-data quality control on imputation results is still limited. In this paper, we aim to evaluate the effect of different strategies of pre-imputation quality filtering on the performance of the widely used imputation algorithms MaCH and IMPUTE. Results: We considered three scenarios: imputation of partially missing genotypes with usage of an external reference panel, without usage of an external reference panel, as well as imputation of completely un-typed SNPs using an external reference panel. We first created various datasets applying different SNP quality filters and masking certain percentages of randomly selected high-quality SNPs. We imputed these SNPs and compared the results between the different filtering scenarios by using established and newly proposed measures of imputation quality. While the established measures assess certainty of imputation results, our newly proposed measures focus on the agreement with true genotypes. These measures showed that pre-imputation SNP-filtering might be detrimental regarding imputation quality. Moreover, the strongest drivers of imputation quality were in general the burden of missingness and the number of SNPs used for imputation. We also found that using a reference panel always improves imputation quality of partially missing genotypes. MaCH performed slightly better than IMPUTE2 in most of our scenarios. Again, these results were more pronounced when using our newly defined measures of imputation quality. Conclusion: Even a moderate filtering has a detrimental effect on the imputation quality. Therefore little or no SNP filtering prior to imputation appears to be the best strategy for imputing small to moderately sized datasets. Our results also showed that for these datasets, MaCH performs slightly better than IMPUTE2 in most scenarios at the cost of increased computing time.

Journal Article

Interpreting Several Types of Measurements in Bioscience

by Janbu, Astrid Oust , Kohler, Achim , Martens, Harald in sequencing or single nucleotide polymorphism (SNP) technologies , standard normal variate (SNV) filtering , vibrational biospectroscopic techniques

2008

This chapter contains sections titled: Introduction to the Analysis of Several Data Sets; Principal Component Analysis of One Data Table; Simultaneous Analysis of Two Data Blocks by Partial Least‐Squares Regression (PLSR); Simultaneous Analysis of Several Data Blocks by Multiblock PCA; Alternative Multiblock Methods; References

Book Chapter
