218 result(s) for "Shamir, Ron"
PlasClass improves plasmid sequence classification
Many bacteria contain plasmids, but separating between contigs that originate on the plasmid and those that are part of the bacterial genome can be difficult. This is especially true in metagenomic assembly, which yields many contigs of unknown origin. Existing tools for classifying sequences of plasmid origin give less reliable results for shorter sequences, are trained using a fraction of the known plasmids, and can be difficult to use in practice. We present PlasClass, a new plasmid classifier. It uses a set of standard classifiers trained on the most current set of known plasmid sequences for different sequence lengths. We tested PlasClass sequence classification on held-out data and simulations, as well as publicly available bacterial isolates and plasmidome samples and plasmids assembled from metagenomic samples. PlasClass outperforms the state-of-the-art plasmid classification tool on shorter sequences, which constitute the majority of assembly contigs, allowing it to achieve higher F1 scores in classifying sequences from a wide range of datasets. PlasClass also uses significantly less time and memory. PlasClass can be used to easily classify plasmid and bacterial genome sequences in metagenomic or isolate assemblies. It is available under the MIT license from: https://github.com/Shamir-Lab/PlasClass.
DOMINO: a network‐based active module identification algorithm with reduced rate of false calls
Algorithms for active module identification (AMI) are central to analysis of omics data. Such algorithms receive a gene network and nodes' activity scores as input and report subnetworks that show significant over‐representation of accrued activity signal (“active modules”), thus representing biological processes that presumably play key roles in the analyzed conditions. Here, we systematically evaluated six popular AMI methods on gene expression and GWAS data. We observed that GO terms enriched in modules detected on the real data were often also enriched on modules found on randomly permuted data. This indicated that AMI methods frequently report modules that are not specific to the biological context measured by the analyzed omics dataset. To tackle this bias, we designed a permutation‐based method that empirically evaluates GO terms reported by AMI methods. We used the method to fashion five novel AMI performance criteria. Last, we developed DOMINO, a novel AMI algorithm, that outperformed the other six algorithms in extensive testing on GE and GWAS data. Software is available at https://github.com/Shamir-Lab.
Synopsis: DOMINO is an algorithm for detecting active network modules with a low rate of false GO term calls. This merit is demonstrated by using EMP, a new procedure that validates GO terms empirically. Algorithms for active module identification (AMI) in a network based on gene activity scores tend to over‐report GO terms. A procedure that empirically calls out non‐specific GO terms is proposed. Five new criteria for evaluation of AMI algorithm solutions are developed. DOMINO outperforms six leading AMI algorithms based on these criteria.
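The idea of checking a reported enrichment against randomly permuted data can be illustrated with a generic empirical p-value. This is only a minimal sketch of the general permutation principle, not the EMP procedure from the paper; the function name and the add-one correction are assumptions made for illustration.

```python
def empirical_p_value(observed_score, permuted_scores):
    """Fraction of permuted scores at least as extreme as the observed one.

    Hypothetical helper: uses the common add-one correction so that the
    estimate is never exactly zero.
    """
    at_least_as_extreme = sum(1 for s in permuted_scores if s >= observed_score)
    return (at_least_as_extreme + 1) / (len(permuted_scores) + 1)


# Example: an enrichment score of 5 compared with four permutation scores.
p = empirical_p_value(5, [1, 2, 3, 6])
```

A term whose observed score is frequently matched on permuted data gets a large empirical p-value, suggesting the enrichment is not specific to the real dataset.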
Parameterized syncmer schemes improve long-read mapping
Motivation: We introduce parameterized syncmer schemes (PSS), a generalization of syncmers, and provide a theoretical analysis for multi-parameter schemes. By combining PSS with downsampling or minimizers we can achieve any desired compression and window guarantee. We implemented the use of PSS in the popular minimap2 and Winnowmap2 mappers. In tests on simulated and real long read data from a variety of genomes, the PSS-based algorithms, with scheme parameters selected on the basis of our theoretical analysis, reduced unmapped reads by 20-60% at high compression while usually using less memory. The advantage was more pronounced at low sequence identity. At sequence identity of 75% and medium compression, PSS-minimap had only 37% as many unmapped reads, and 8% fewer of the reads that did map were incorrectly mapped. Even at lower compression and error rates, PSS-based mapping mapped more reads than the original minimizer-based mappers as well as mappers using the original syncmer schemes. We conclude that using PSS can improve mapping of long reads in a wide range of settings.
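The abstract does not spell out how a syncmer scheme selects k-mers, so the following is a minimal sketch under the usual definition: a k-mer is selected when its smallest s-mer starts at one of a chosen set of offsets. The function names, the lexicographic ordering of s-mers (real tools typically use a hash), and the leftmost tie-breaking rule are all assumptions for illustration, not the minimap2 or Winnowmap2 implementation.

```python
def min_smer_offset(kmer, s):
    """Offset of the smallest s-mer inside kmer (leftmost on ties)."""
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    return min(range(len(smers)), key=lambda i: smers[i])


def select_kmers(seq, k, s, offsets):
    """Start positions of k-mers whose minimal s-mer lies at an allowed offset."""
    return [
        i
        for i in range(len(seq) - k + 1)
        if min_smer_offset(seq[i:i + k], s) in offsets
    ]


# One allowed offset (minimal s-mer at the start of the k-mer):
single = select_kmers("ACGTACGT", k=4, s=2, offsets={0})
# Two allowed offsets (start or end, i.e. offset 0 or k - s):
double = select_kmers("ACGTACGT", k=4, s=2, offsets={0, 2})
```

Allowing more offsets selects more k-mers, which lowers compression; the choice of offsets is the parameter that such a scheme tunes.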
PROMO: an interactive tool for analyzing clinically-labeled multi-omic cancer datasets
Background: Analysis of large genomic datasets along with their accompanying clinical information has shown great promise in cancer research over the last decade. Such datasets typically include thousands of samples, each measured by one or several high-throughput technologies (‘omics’) and annotated with extensive clinical information. While instrumental for fulfilling the promise of personalized medicine, the analysis and visualization of such large datasets is challenging and necessitates programming skills and familiarity with a large array of software tools to be used for the various steps of the analysis. Results: We developed PROMO (Profiler of Multi-Omic data), a friendly, fully interactive stand-alone software for analyzing large genomic cancer datasets together with their associated clinical information. The tool provides an array of built-in methods and algorithms for importing, preprocessing, visualizing, clustering, clinical label enrichment testing, and survival analysis that can be performed on a single or multi-omic dataset. The tool can be used for quick exploration and stratification of tumor samples taken from patients into clinically significant molecular subtypes. Identification of prognostic biomarkers and generation of simple subtype classifiers are additional important features. We review PROMO’s main features and demonstrate its analysis capabilities on a breast cancer cohort from TCGA. Conclusions: PROMO provides a single integrated solution for swiftly performing a complete analysis of cancer genomic data for subtype discovery and biomarker identification without writing a single line of code, and can, therefore, make the analysis of these data much easier for cancer biologists and biomedical researchers. PROMO is freely available for download at http://acgt.cs.tau.ac.il/promo/.
FOCS: a novel method for analyzing enhancer and gene activity patterns infers an extensive enhancer–promoter map
Recent sequencing technologies enable joint quantification of promoters and their enhancer regions, allowing inference of enhancer–promoter links. We show that current enhancer–promoter inference methods produce a high rate of false positive links. We introduce FOCS, a new inference method, and by benchmarking against ChIA-PET, HiChIP, and eQTL data show that it results in lower false discovery rates and at the same time higher inference power. By applying FOCS to 2630 samples taken from ENCODE, Roadmap Epigenomics, FANTOM5, and a new compendium of GRO-seq samples, we provide extensive enhancer–promoter maps (http://acgt.cs.tau.ac.il/focs). We illustrate the usability of our maps for deriving biological hypotheses.
Dissection of Regulatory Networks that Are Altered in Disease via Differential Co-expression
Comparing the gene-expression profiles of sick and healthy individuals can help in understanding disease. Such differential expression analysis is a well-established way to find gene sets whose expression is altered in the disease. Recent approaches to gene-expression analysis go a step further and seek differential co-expression patterns, wherein the level of co-expression of a set of genes differs markedly between disease and control samples. Such patterns can arise from a disease-related change in the regulatory mechanism governing that set of genes, and pinpoint dysfunctional regulatory networks. Here we present DICER, a new method for detecting differentially co-expressed gene sets using a novel probabilistic score for differential correlation. DICER goes beyond standard differential co-expression and detects pairs of modules showing differential co-expression. The expression profiles of genes within each module of the pair are correlated across all samples. The correlation between the two modules, however, differs markedly between the disease and normal samples. We show that DICER outperforms the state of the art in terms of significance and interpretability of the detected gene sets. Moreover, the gene sets discovered by DICER manifest regulation by disease-specific microRNA families. In a case study on Alzheimer's disease, DICER dissected biological processes and protein complexes into functional subunits that are differentially co-expressed, thereby revealing inner structures in disease regulatory networks.
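The notion of differential co-expression described above can be illustrated by comparing the correlation of two expression profiles in disease samples with their correlation in control samples. This is a deliberately simple sketch using the raw difference of Pearson correlations; it is not DICER's probabilistic score, and the function names and toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def differential_correlation(gene_a, gene_b, is_disease):
    """Correlation in disease samples minus correlation in control samples."""
    disease = [(a, b) for a, b, d in zip(gene_a, gene_b, is_disease) if d]
    control = [(a, b) for a, b, d in zip(gene_a, gene_b, is_disease) if not d]
    r_disease = pearson(*zip(*disease))
    r_control = pearson(*zip(*control))
    return r_disease - r_control


# Toy data: the two genes rise together in controls but move oppositely in disease.
gene_a = [1, 2, 3, 4, 1, 2, 3, 4]
gene_b = [2, 4, 6, 8, 8, 6, 4, 2]
labels = [False, False, False, False, True, True, True, True]
delta = differential_correlation(gene_a, gene_b, labels)
```

A large absolute difference flags a gene pair whose co-expression changes between conditions, even when neither gene is differentially expressed on its own.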
Inaccuracy of the log‐rank approximation in cancer data analysis
Since survival information was available for the patients, we used the log‐rank test chi‐square approximation to evaluate each solution. [...] the APs for 48 out of the 90 clustering solutions were not within their 95% confidence intervals constructed using the permutation test. Joachim et al. reported that use of the chemotherapeutic agent Topotecan resulted in a significant survival benefit in a murine model of endotoxemia. [...] erroneous significance conclusions due to the use of AP occur both in biomedical research and in algorithm development.
Plasmids in the human gut reveal neutral dispersal and recombination that is overpowered by inflammatory diseases
Plasmids are pivotal in driving bacterial evolution through horizontal gene transfer. Here, we investigated 3,467 human gut microbiome samples across continents and disease states, analyzing 11,086 plasmids. Our analyses reveal that plasmid dispersal is predominantly stochastic, indicating neutral processes as the primary driver of their wide distribution. We find that only 20-25% of plasmid DNA is being selected in various disease states, constraining its distribution across hosts. Selective pressures shape specific plasmid segments with distinct ecological functions, influenced by plasmid mobilization lifestyle, antibiotic usage, and inflammatory gut diseases. Notably, these elements are more commonly shared within groups of individuals with similar health conditions, such as Inflammatory Bowel Disease (IBD), regardless of geographic location across continents. These segments contain essential genes such as iron transport mechanisms, a distinctive gut signature of IBD that impacts the severity of inflammation. Our findings shed light on mechanisms driving plasmid dispersal and selection in the human gut, highlighting their role as carriers of vital gene pools impacting bacterial hosts and ecosystem dynamics. Here, the authors analyze the plasmidome in 3,467 human gut microbiome samples across continents and disease states, revealing that plasmid dispersal in the human gut is predominantly neutral, but becomes more selective in inflammatory diseases, shedding light on microbial evolution in health and disease.
Transcription factor family‐specific DNA shape readout revealed by quantitative specificity models
Transcription factors (TFs) achieve DNA‐binding specificity through contacts with functional groups of bases (base readout) and readout of structural properties of the double helix (shape readout). Currently, it remains unclear whether DNA shape readout is utilized by only a few selected TF families, or whether this mechanism is used extensively by most TF families. We resequenced data from previously published HT‐SELEX experiments, the most extensive mammalian TF–DNA binding data available to date. Using these data, we demonstrated the contributions of DNA shape readout across diverse TF families and its importance in core motif‐flanking regions. Statistical machine‐learning models combined with feature‐selection techniques helped to reveal the nucleotide position‐dependent DNA shape readout in TF‐binding sites and the TF family‐specific position dependence. Based on these results, we proposed novel DNA shape logos to visualize the DNA shape preferences of TFs. Overall, this work suggests a way of obtaining mechanistic insights into TF–DNA binding without relying on experimentally solved all‐atom structures.
Synopsis: The role of DNA shape in transcription factor (TF)‐binding specificity is explored using a TF–DNA binding dataset covering more than 400 mammalian TFs. DNA shape readout is important for many TF families and improves binding specificity models. The largest protein–DNA binding dataset derived from HT‐SELEX experiments covering more than 400 mammalian TFs is analyzed. DNA shape readout plays an important role in DNA‐binding specificities of TFs across many protein families. DNA shape in regions immediately flanking the core‐binding site is generally recognized upon TF binding. Feature selection based on DNA sequencing data alone can provide structural insights into TF–DNA readout mechanisms.
Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing
With the rapidly increasing volume of deep sequencing data, more efficient algorithms and data structures are needed. Minimizers are a central recent paradigm that has improved various sequence analysis tasks, including hashing for faster read overlap detection, sparse suffix arrays for creating smaller indexes, and Bloom filters for speeding up sequence search. Here, we propose an alternative paradigm that can lead to substantial further improvement in these and other tasks. For integers k and L > k, we say that a set of k-mers is a universal hitting set (UHS) if every possible L-long sequence must contain a k-mer from the set. We develop a heuristic called DOCKS to find a compact UHS, which works in two phases: The first phase is solved optimally, and for the second we propose several efficient heuristics, trading set size for speed and memory. The use of heuristics is motivated by showing the NP-hardness of a closely related problem. We show that DOCKS works well in practice and produces UHSs that are very close to a theoretical lower bound. We present results for various values of k and L and by applying them to real genomes show that UHSs indeed improve over minimizers. In particular, DOCKS uses less than 30% of the 10-mers needed to span the human genome compared to minimizers. The software and computed UHSs are freely available at github.com/Shamir-Lab/DOCKS/ and acgt.cs.tau.ac.il/docks/, respectively.
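The definition of a universal hitting set given above can be checked directly for small parameters by enumerating every L-long sequence. This is a brute-force sketch of the definition only, not the DOCKS heuristic; the function name is hypothetical, and the exhaustive enumeration is feasible only for tiny alphabets and short L.

```python
from itertools import product


def is_universal_hitting_set(kmers, k, L, alphabet="ACGT"):
    """True if every L-long sequence over alphabet contains a k-mer from kmers."""
    kmer_set = set(kmers)
    for letters in product(alphabet, repeat=L):
        seq = "".join(letters)
        # The sequence is "hit" if any of its L - k + 1 windows is in the set.
        if not any(seq[i:i + k] in kmer_set for i in range(L - k + 1)):
            return False
    return True


# Binary alphabet, k = 2, L = 3: three of the four 2-mers already suffice.
hits_all = is_universal_hitting_set({"00", "11", "01"}, k=2, L=3, alphabet="01")
# Dropping "00" and "11" leaves sequences such as "000" without any hit.
misses = is_universal_hitting_set({"01", "10"}, k=2, L=3, alphabet="01")
```

The first set shows why a UHS can be smaller than the full set of k-mers: "10" is never needed, because any 3-long binary sequence containing "10" also contains one of the other three 2-mers or is covered by them elsewhere in the sequence.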