Catalogue Search | MBRL
3,309 result(s) for "Amino acid sequence Data processing."
Pattern discovery in biomolecular data
by Shapiro, Bruce A; Wang, Jason T. L; Shasha, Dennis Elliott
in Amino acid sequence; Aminoacid sequence; Aminoacid sequence -- Data processing
1999
A clear, up-to-date summary of techniques for pattern discovery in molecular biology. The emphasis is on techniques that readers can apply to their own work, and the topics focus on finding patterns in DNA and protein sequences, finding patterns in 3D structures, and choosing system components.
c-Src and c-Abl kinases control hierarchic phosphorylation and function of the CagA effector protein in Western and East Asian Helicobacter pylori strains
by Mueller, Doreen; Smolka, Adam; Wessler, Silja
in Amino Acid Motifs; Amino Acid Sequence; Antigens, Bacterial
2012
Many bacterial pathogens inject into host cells effector proteins that are substrates for host tyrosine kinases such as Src and Abl family kinases. Phosphorylated effectors eventually subvert host cell signaling, aiding disease development. In the case of the gastric pathogen Helicobacter pylori, which is a major risk factor for the development of gastric cancer, the only known effector protein injected into host cells is the oncoprotein CagA. Here, we followed the hierarchic tyrosine phosphorylation of H. pylori CagA as a model system to study early effector phosphorylation processes. Translocated CagA is phosphorylated on Glu-Pro-Ile-Tyr-Ala (EPIYA) motifs EPIYA-A, EPIYA-B, and EPIYA-C in Western strains of H. pylori and EPIYA-A, EPIYA-B, and EPIYA-D in East Asian strains. We found that c-Src only phosphorylated EPIYA-C and EPIYA-D, whereas c-Abl phosphorylated EPIYA-A, EPIYA-B, EPIYA-C, and EPIYA-D. Further analysis revealed that CagA molecules were phosphorylated on 1 or 2 EPIYA motifs, but never simultaneously on 3 motifs. Furthermore, none of the phosphorylated EPIYA motifs alone was sufficient for inducing AGS cell scattering and elongation. The preferred combination of phosphorylated EPIYA motifs in Western strains was EPIYA-A and EPIYA-C, either across 2 CagA molecules or simultaneously on 1. Our study thus identifies a tightly regulated hierarchic phosphorylation model for CagA starting at EPIYA-C/D, followed by phosphorylation of EPIYA-A or EPIYA-B. These results provide insight for clinical H. pylori typing and clarify the role of phosphorylated bacterial effector proteins in pathogenesis.
Journal Article
Modeling aspects of the language of life through transfer-learning protein sequences
by Rost, Burkhard; Elnaggar, Ahmed; Nechaev, Dmitrii
in Algorithms; Amino Acid Sequence; Amino acids
2019
Background
Predicting protein function and structure from sequence is one important challenge for computational biology. For 26 years, most state-of-the-art approaches combined machine learning and evolutionary information. However, for some applications retrieving related proteins is becoming too time-consuming. Additionally, evolutionary information is less powerful for small families, e.g. for proteins from the Dark Proteome. Both these problems are addressed by the new methodology introduced here.
Results
We introduced a novel way to represent protein sequences as continuous vectors (embeddings) by using the language model ELMo taken from natural language processing. By modeling protein sequences, ELMo effectively captured the biophysical properties of the language of life from unlabeled big data (UniRef50). We refer to these new embeddings as SeqVec (Sequence-to-Vector) and demonstrate their effectiveness by training simple neural networks for two different tasks. At the per-residue level, secondary structure (Q3 = 79% ± 1, Q8 = 68% ± 1) and regions with intrinsic disorder (MCC = 0.59 ± 0.03) were predicted significantly better than through one-hot encoding or through Word2vec-like approaches. At the per-protein level, subcellular localization was predicted in ten classes (Q10 = 68% ± 1) and membrane-bound were distinguished from water-soluble proteins (Q2 = 87% ± 1). Although SeqVec embeddings generated the best predictions from single sequences, no solution improved over the best existing method using evolutionary information. Nevertheless, our approach improved over some popular methods using evolutionary information and for some proteins even did beat the best. Thus, they prove to condense the underlying principles of protein sequences. Overall, the important novelty is speed: where the lightning-fast HHblits needed on average about two minutes to generate the evolutionary information for a target protein, SeqVec created embeddings on average in 0.03 s. As this speed-up is independent of the size of growing sequence databases, SeqVec provides a highly scalable approach for the analysis of big data in proteomics, i.e. microbiome or metaproteome analysis.
Conclusion
Transfer-learning succeeded to extract information from unlabeled sequence databases relevant for various protein prediction tasks. SeqVec modeled the language of life, namely the principles underlying protein sequences better than any features suggested by textbooks and prediction methods. The exception is evolutionary information, however, that information is not available on the level of a single sequence.
Journal Article
Genetic databases
2004
Genetic Databases offers a timely analysis of the underlying tensions, contradictions and limitations of the current regulatory frameworks for, and policy debates about, genetic databases. Drawing on original empirical research and theoretical debates in the fields of sociology, anthropology and legal studies, the contributors to this book challenge the prevailing orthodoxy of informed consent and explore the relationship between personal privacy and the public good. They also consider the multiple meanings attached to human tissue and the role of public consultations and commercial involvement in the creation and use of genetic databases.
The authors argue that policy and regulatory frameworks produce a representation of participation that is often at odds with the experiences and understandings of those taking part. The findings present a serious challenge for public policy to provide mechanisms to safeguard the welfare of individuals participating in genetic databases.
ProtGPT2 is a deep unsupervised language model for protein design
by Ferruz, Noelia; Schmidt, Steffen; Höcker, Birte
in 631/114/1305; 639/705/1042; Amino Acid Sequence
2022
Protein design aims to build novel proteins customized for specific purposes, thereby holding the potential to tackle many environmental and biomedical problems. Recent progress in Transformer-based architectures has enabled the implementation of language models capable of generating text with human-like capabilities. Here, motivated by this success, we describe ProtGPT2, a language model trained on the protein space that generates de novo protein sequences following the principles of natural ones. The generated proteins display natural amino acid propensities, while disorder predictions indicate that 88% of ProtGPT2-generated proteins are globular, in line with natural sequences. Sensitive sequence searches in protein databases show that ProtGPT2 sequences are distantly related to natural ones, and similarity networks further demonstrate that ProtGPT2 is sampling unexplored regions of protein space. AlphaFold prediction of ProtGPT2-sequences yields well-folded non-idealized structures with embodiments and large loops and reveals topologies not captured in current structure databases. ProtGPT2 generates sequences in a matter of seconds and is freely available.
Protein design aims to build novel proteins customized for specific purposes, thereby holding the potential to tackle many environmental and biomedical problems. Here the authors apply some of the latest advances in natural language processing, generative Transformers, to train ProtGPT2, a language model that explores unseen regions of the protein space while designing proteins with nature-like properties.
Journal Article
Genome Reduction and Co-evolution between the Primary and Secondary Bacterial Symbionts of Psyllids
2012
Genome reduction in obligately intracellular bacteria is one of the most well-established patterns in the field of molecular evolution. In the extreme, many sap-feeding insects harbor nutritional symbionts with genomes that are so reduced that it is not clear how they perform basic cellular functions. For example, the primary symbiont of psyllids (Carsonella) maintains one of the smallest and most AT-rich bacterial genomes ever identified and has surprisingly lost many genes that are thought to be essential for its role in provisioning its host with amino acids. However, our understanding of this extreme case of genome reduction is limited, as genomic data for Carsonella are available from only a single host species, and little is known about the functional role of “secondary” bacterial symbionts in psyllids. To address these limitations, we analyzed complete Carsonella genomes from pairs of congeneric hosts in three divergent genera within the Psyllidae (Ctenarytaina, Heteropsylla, and Pachypsylla) as well as complete secondary symbiont genomes from two of these host species (Ctenarytaina eucalypti and Heteropsylla cubana). Although the Carsonella genomes are generally conserved in size, structure, and GC content and exhibit genome-wide signatures of purifying selection, we found that gene loss has remained active since the divergence of the host species and had a particularly large impact on the amino acid biosynthesis pathways that define the symbiotic role of Carsonella. In some cases, the presence of additional bacterial symbionts may compensate for gene loss in Carsonella, as functional gene content indicates a high degree of metabolic complementarity between co-occurring symbionts. The genomes of the secondary symbionts also show signatures of long-term evolution as vertically transmitted, intracellular bacteria, including more extensive genome reduction than typically observed in facultative symbionts. 
Therefore, a history of co-evolution with secondary bacterial symbionts can partially explain the ongoing genome reduction in Carsonella. However, the absence of these secondary symbionts in other host lineages indicates that the relationships are dynamic and that other mechanisms, such as changes in host diet or functional coordination with the host genome, must also be at play.
Journal Article
Deficiency of terminal ADP-ribose protein glycohydrolase TARG1/C6orf130 in neurodegenerative disease
by Rossi, Marianna N; Timinszky, Gyula; Schellenberg, Matthew J
in Adenosine diphosphate; Adenosine triphosphatase; ADP-ribose
2013
Adenosine diphosphate (ADP)‐ribosylation is a post‐translational protein modification implicated in the regulation of a range of cellular processes. A family of proteins that catalyse ADP‐ribosylation reactions are the poly(ADP‐ribose) (PAR) polymerases (PARPs). PARPs covalently attach an ADP‐ribose nucleotide to target proteins and some PARP family members can subsequently add additional ADP‐ribose units to generate a PAR chain. The hydrolysis of PAR chains is catalysed by PAR glycohydrolase (PARG). PARG is unable to cleave the mono(ADP‐ribose) unit directly linked to the protein and although the enzymatic activity that catalyses this reaction has been detected in mammalian cell extracts, the protein(s) responsible remain unknown. Here, we report the homozygous mutation of the c6orf130 gene in patients with severe neurodegeneration, and identify C6orf130 as a PARP‐interacting protein that removes mono(ADP‐ribosyl)ation on glutamate amino acid residues in PARP‐modified proteins. X‐ray structures and biochemical analysis of C6orf130 suggest a mechanism of catalytic reversal involving a transient C6orf130 lysyl‐(ADP‐ribose) intermediate. Furthermore, depletion of C6orf130 protein in cells leads to proliferation and DNA repair defects. Collectively, our data suggest that C6orf130 enzymatic activity has a role in the turnover and recycling of protein ADP‐ribosylation, and we have implicated the importance of this protein in supporting normal cellular function in humans.
Crystal structure and biochemical data reveal a gene mutated in patients with severe neurodegeneration to encode an elusive enzyme for removing ADP‐ribose from proteins.
Journal Article
MEGA5: Molecular Evolutionary Genetics Analysis Using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods
by Kumar, Sudhir; Tamura, Koichiro; Peterson, Nicholas
in Algorithms; Amino acid substitution; Amino acids
2011
Comparative analysis of molecular sequence data is essential for reconstructing the evolutionary histories of species and inferring the nature and extent of selective forces shaping the evolution of genes and species. Here, we announce the release of Molecular Evolutionary Genetics Analysis version 5 (MEGA5), which is a user-friendly software for mining online databases, building sequence alignments and phylogenetic trees, and using methods of evolutionary bioinformatics in basic biology, biomedicine, and evolution. The newest addition in MEGA5 is a collection of maximum likelihood (ML) analyses for inferring evolutionary trees, selecting best-fit substitution models (nucleotide or amino acid), inferring ancestral states and sequences (along with probabilities), and estimating evolutionary rates site-by-site. In computer simulation analyses, ML tree inference algorithms in MEGA5 compared favorably with other software packages in terms of computational efficiency and the accuracy of the estimates of phylogenetic trees, substitution parameters, and rate variation among sites. The MEGA user interface has now been enhanced to be activity driven to make it easier for the use of both beginners and experienced scientists. This version of MEGA is intended for the Windows platform, and it has been configured for effective use on Mac OS X and Linux desktops. It is available free of charge from http://www.megasoftware.net.
Journal Article
The molecular basis for cellular function of intrinsically disordered protein regions
by Kragelund, Birthe B; Holehouse, Alex S
in Adaptability; Amino acid sequence; Cellular structure
2024
Intrinsically disordered protein regions exist in a collection of dynamic interconverting conformations that lack a stable 3D structure. These regions are structurally heterogeneous, ubiquitous and found across all kingdoms of life. Despite the absence of a defined 3D structure, disordered regions are essential for cellular processes ranging from transcriptional control and cell signalling to subcellular organization. Through their conformational malleability and adaptability, disordered regions extend the repertoire of macromolecular interactions and are readily tunable by their structural and chemical context, making them ideal responders to regulatory cues. Recent work has led to major advances in understanding the link between protein sequence and conformational behaviour in disordered regions, yet the link between sequence and molecular function is less well defined. Here we consider the biochemical and biophysical foundations that underlie how and why disordered regions can engage in productive cellular functions, provide examples of emerging concepts and discuss how protein disorder contributes to intracellular information processing and regulation of cellular function.
Intrinsically disordered regions of proteins lack a defined 3D structure and exist in a collection of interconverting conformations. Recent work is shedding light on how, through their conformational malleability and adaptability, intrinsically disordered regions extend the repertoire of macromolecular interactions in the cell and contribute to key cellular functions.
Journal Article