Catalogue Search | MBRL
87,301 result(s) for "Sequence Analysis, Protein"
Exploring the limitations of biophysical propensity scales coupled with machine learning for protein sequence analysis
by Raimondi, Daniele; Orlando, Gabriele; Moreau, Yves
in 631/114/1305; 631/114/2184; 631/114/663/2009
2019
Machine learning (ML) is ubiquitous in bioinformatics, due to its versatility. One of the most crucial aspects to consider while training an ML model is to carefully select the optimal feature encoding for the problem at hand. Biophysical propensity scales are widely adopted in structural bioinformatics because they describe amino acid properties that are intuitively relevant for many structural and functional aspects of proteins, and are thus commonly used as input features for ML methods. In this paper we reproduce three classical structural bioinformatics prediction tasks to investigate the main assumptions about the use of propensity scales as input features for ML methods. We investigate their usefulness with different randomization experiments and we show that their effectiveness varies among the ML methods used and the tasks. We show that while linear methods are more dependent on the feature encoding, the specific biophysical meaning of the features is less relevant for non-linear methods. Moreover, we show that even among linear ML methods, the simpler one-hot encoding can surprisingly outperform the “biologically meaningful” scales. We also show that feature selection performed with non-linear ML methods may not be able to distinguish between randomized and “real” propensity scales by properly prioritizing the latter. Finally, we show that learning problem-specific embeddings could be a simple, assumptions-free and optimal way to perform feature learning/engineering for structural bioinformatics tasks.
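The three encodings contrasted in this abstract (one-hot, a propensity scale, and a randomized scale) can be illustrated with a minimal sketch. The Kyte-Doolittle hydropathy index stands in for "a propensity scale" here, and shuffling the values among amino acids is only one plausible randomization; the abstract does not say which scales or which randomization scheme the authors used, so both choices are assumptions, as are all function names.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index, chosen here as one example propensity scale
# (an assumption; the abstract does not name the scales that were tested).
KYTE_DOOLITTLE = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def one_hot_encode(sequence):
    """Return one 20-dimensional indicator vector per residue."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    vectors = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[index[aa]] = 1
        vectors.append(row)
    return vectors


def scale_encode(sequence, scale):
    """Return one scalar per residue, looked up in the given scale."""
    return [scale[aa] for aa in sequence]


def randomize_scale(scale, seed=0):
    """Shuffle the scale values among amino acids (hypothetical randomization)."""
    rng = random.Random(seed)
    values = list(scale.values())
    rng.shuffle(values)
    return dict(zip(scale.keys(), values))


if __name__ == "__main__":
    peptide = "MKTAYIAK"
    print(one_hot_encode(peptide)[0])
    print(scale_encode(peptide, KYTE_DOOLITTLE))
    print(scale_encode(peptide, randomize_scale(KYTE_DOOLITTLE)))
```

The randomized scale keeps the same set of values but reassigns them, so a model trained on it sees features of the same numeric range with no biophysical meaning attached.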
Journal Article
Systematic identification of cell cycle-dependent yeast nucleocytoplasmic shuttling proteins by prediction of composite motifs
by Hasebe, Masako; Kosugi, Shunichi; Yanagawa, Hiroshi
in Algorithms; alpha Karyopherins - metabolism; Amino Acid Sequence
2009
The cell cycle-dependent nucleocytoplasmic transport of proteins is predominantly regulated by CDK kinase activities; however, it is currently difficult to predict the proteins thus regulated, largely because of the low prediction efficiency of the motifs involved. Here, we report the successful prediction of CDK1-regulated nucleocytoplasmic shuttling proteins using a prediction system for nuclear localization signals (NLSs). By systematic amino acid replacement analyses in budding yeast, we created activity-based profiles for different classes of importin-α-dependent NLSs that represent the functional contributions of different amino acids at each position within an NLS class. We then developed a computer program for prediction of the classical importin-α/β pathway-specific NLSs (cNLS Mapper, available at http://nls-mapper.iab.keio.ac.jp/) that calculates NLS activities by using these profiles and an additivity-based motif scoring algorithm. This calculation method achieved significantly higher prediction accuracy in terms of both sensitivity and specificity than did current methods. The search for NLSs that overlap the consensus CDK1 phosphorylation site by using cNLS Mapper identified all previously reported and 5 previously uncharacterized yeast proteins (Yen1, Psy4, Pds1, Msa1, and Dna2) displaying CDK1- and cell cycle-regulated nuclear transport. CDK1 activated or repressed their nuclear import activity, depending on the position of CDK1-phosphorylation sites within NLSs. The application of this strategy to other functional linear motifs should be useful in systematic studies of protein-protein networks.
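The idea of an additivity-based motif score built from per-position activity profiles can be sketched roughly as follows. This is not cNLS Mapper's algorithm or data: the toy four-residue profile, its values, the default penalty, and every function name are hypothetical, chosen only to show how per-position contributions add up while a window slides along a sequence.

```python
# Hypothetical activity profile for a toy 4-residue motif: one dict per motif
# position, mapping amino acid -> contribution. Values are illustrative only
# and are NOT taken from cNLS Mapper.
TOY_PROFILE = [
    {"K": 2.0, "R": 1.5},
    {"K": 1.0, "R": 2.0},
    {"K": 1.5, "R": 1.0, "P": -1.0},
    {"K": 2.0, "R": 2.0},
]

# Assumed contribution for any residue that a position's profile does not list.
DEFAULT_CONTRIBUTION = -0.5


def score_window(window, profile, default=DEFAULT_CONTRIBUTION):
    """Additive score: sum the contribution of each residue at its position."""
    return sum(pos.get(aa, default) for pos, aa in zip(profile, window))


def scan_sequence(sequence, profile, default=DEFAULT_CONTRIBUTION):
    """Score every window of motif width; return (start, score) pairs."""
    width = len(profile)
    return [
        (start, score_window(sequence[start:start + width], profile, default))
        for start in range(len(sequence) - width + 1)
    ]


def best_hit(sequence, profile):
    """Return the highest-scoring (start, score) pair, or None if too short."""
    hits = scan_sequence(sequence, profile)
    return max(hits, key=lambda hit: hit[1]) if hits else None


if __name__ == "__main__":
    print(scan_sequence("AKRKRA", TOY_PROFILE))
    print(best_hit("AKRKRA", TOY_PROFILE))
```

Because the score is a plain sum, the effect of a single substitution can be read directly off the profile, which is what makes profiles derived from replacement experiments usable for prediction.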
Journal Article
TRIC: an automated alignment strategy for reproducible protein quantification in targeted proteomics
2016
TRIC, a cross-run alignment algorithm and software tool, enables reproducible quantification of thousands of peptides across multiple targeted liquid chromatography–tandem mass spectrometry runs.
Next-generation mass spectrometric (MS) techniques such as SWATH-MS have substantially increased the throughput and reproducibility of proteomic analysis, but ensuring consistent quantification of thousands of peptide analytes across multiple liquid chromatography–tandem MS (LC-MS/MS) runs remains a challenging and laborious manual process. To produce highly consistent and quantitatively accurate proteomics data matrices in an automated fashion, we developed TRIC (http://proteomics.ethz.ch/tric/), a software tool that utilizes fragment-ion data to perform cross-run alignment, consistent peak-picking and quantification for high-throughput targeted proteomics. TRIC reduced the identification error compared to a state-of-the-art SWATH-MS analysis without alignment by more than threefold at constant recall while correcting for highly nonlinear chromatographic effects. On a pulsed-SILAC experiment performed on human induced pluripotent stem cells, TRIC was able to automatically align and quantify thousands of light and heavy isotopic peak groups. Thus, TRIC fills a gap in the pipeline for automated analysis of massively parallel targeted proteomics data sets.
Journal Article
Biochemical classification of tauopathies by immunoblot, protein sequence and mass spectrometric analyses of sarkosyl-insoluble and trypsin-resistant tau
by Tarutani Airi; Taniguchi-Watanabe Sayuri; Masuda-Suzukake Masami
in Aged; Aged, 80 and over; Alzheimer's disease
2016
Intracellular filamentous tau pathology is the defining feature of tauopathies, which form a subset of neurodegenerative diseases. We have analyzed pathological tau in Alzheimer's disease, and in frontotemporal lobar degeneration associated with tauopathy to include cases with Pick bodies, corticobasal degeneration, progressive supranuclear palsy, and ones due to intronic mutations in MAPT. We found that the C-terminal band pattern of the pathological tau species is distinct for each disease. Immunoblot analysis of trypsin-resistant tau indicated that the different band patterns of the 7–18 kDa fragments in these diseases likely reflect different conformations of tau molecular species. Protein sequence and mass spectrometric analyses revealed the carboxyl-terminal region (residues 243–406) of tau comprises the protease-resistant core units of the tau aggregates, and the sequence lengths and precise regions involved are different among the diseases. These unique assembled tau cores may be used to classify and diagnose disease strains. Based on these results, we propose a new clinicopathological classification of tauopathies based on the biochemical properties of tau.
Journal Article
FreeContact: fast and free software for protein contact prediction from residue co-evolution
2014
Background
20 years of improved technology and growing sequences now renders residue-residue contact constraints in large protein families through correlated mutations accurate enough to drive de novo predictions of protein three-dimensional structure. The method EVfold broke new ground using mean-field Direct Coupling Analysis (EVfold-mfDCA); the method PSICOV applied a related concept by estimating a sparse inverse covariance matrix. Both methods (EVfold-mfDCA and PSICOV) are publicly available, but both require too much CPU time for interactive applications. On top, EVfold-mfDCA depends on proprietary software.
Results
Here, we present FreeContact, a fast, open source implementation of EVfold-mfDCA and PSICOV. On a test set of 140 proteins, FreeContact was almost eight times faster than PSICOV without decreasing prediction performance. The EVfold-mfDCA implementation of FreeContact was over 220 times faster than PSICOV with negligible performance decrease. EVfold-mfDCA was unavailable for testing due to its dependency on proprietary software. FreeContact is implemented as the free C++ library “libfreecontact”, complete with command line tool “freecontact”, as well as Perl and Python modules. All components are available as Debian packages. FreeContact supports the BioXSD format for interoperability.
Conclusions
FreeContact provides the opportunity to compute reliable contact predictions in any environment (desktop or cloud).
Journal Article
TwinCons: Conservation score for uncovering deep sequence similarity and divergence
by Petar I. Penev; Claudia Alvarez-Carreño; Loren Dean Williams
in Algorithms; Alignment; Archaeal Proteins
2021
We have developed the program TwinCons, to detect noisy signals of deep ancestry of proteins or nucleic acids. As input, the program uses a composite alignment containing pre-defined groups, and mathematically determines a ‘cost’ of transforming one group to the other at each position of the alignment. The output distinguishes conserved, variable and signature positions. A signature is conserved within groups but differs between groups. The method automatically detects continuous characteristic stretches (segments) within alignments. TwinCons provides a convenient representation of conserved, variable and signature positions as a single score, enabling the structural mapping and visualization of these characteristics. Structure is more conserved than sequence. TwinCons highlights alternative sequences of conserved structures. Using TwinCons, we detected highly similar segments between proteins from the translation and transcription systems. TwinCons detects conserved residues within regions of high functional importance for the ribosomal RNA (rRNA) and demonstrates that signatures are not confined to specific regions but are distributed across the rRNA structure. The ability to evaluate both nucleic acid and protein alignments allows TwinCons to be used in combined sequence and structural analysis of signatures and conservation in rRNA and in ribosomal proteins (rProteins). TwinCons detects a strong sequence conservation signal between bacterial and archaeal rProteins related by circular permutation. This conserved sequence is structurally colocalized with conserved rRNA, indicated by TwinCons scores of rRNA alignments of bacterial and archaeal groups. This combined analysis revealed deep co-evolution of rRNA and rProtein buried within the deepest branching points in the tree of life.
Journal Article
Selection of relevant features from amino acids enables development of robust classifiers
2014
Machine learning (ML) has been extensively applied to develop models and to understand high-throughput data of biological processes. However, new ML models, trained with novel experimental results, must be built regularly for more precise predictions. ML methods can build models from numeric data, whereas biological data are generally textual (DNA, protein sequences) or images and need feature calculation algorithms to generate quantitative features. Programming skills along with domain knowledge are required to develop these algorithms. Therefore, the process of knowledge discovery through ML is slowed by the lack of generic tools to construct features and to build models directly from the data. Hence, we developed a schema that calculates about 5,000 features, selects relevant features and develops protein classifiers from the training data. To demonstrate the general applicability and robustness of our method, fungal adhesins and nuclear receptor proteins were used for building classifiers which outperformed existing classifiers when tested on independent data. Next, we built a classifier for mitochondrial proteins of Plasmodium falciparum, which causes human malaria, because the latest corresponding classifiers are not publicly accessible. Our classifier attained 98.18% accuracy and a Matthews correlation coefficient of 0.95 by fivefold cross-validation and outperformed existing classifiers on an independent test set. We implemented this schema as a user-friendly and open source application, Pro-Gyan (http://code.google.com/p/pro-gyan/), to build and share executable classifiers without programming knowledge.
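For readers unfamiliar with the two metrics reported above, the following minimal sketch computes accuracy and the Matthews correlation coefficient (MCC) from confusion-matrix counts. It uses the standard binary-classification definitions and is not taken from Pro-Gyan; the function names are illustrative, and returning 0.0 when the MCC denominator is zero is a common convention assumed here.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when the denominator is zero (assumed convention).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Illustrative counts only; not results from the paper.
    print(accuracy(tp=45, tn=45, fp=5, fn=5))
    print(matthews_corrcoef(tp=45, tn=45, fp=5, fn=5))
```

Unlike accuracy, MCC stays near zero for a classifier that simply predicts the majority class, which is why it is often reported alongside accuracy on imbalanced protein datasets.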
Journal Article
Peptide and Protein Sequence Analysis by Electron Transfer Dissociation Mass Spectrometry
by John E. P. Syka; Coon, Joshua J.; Shabanowitz, Jeffrey
in Amino Acid Sequence; Anions; Anions - chemistry
2004
Peptide sequence analysis using a combination of gas-phase ion/ion chemistry and tandem mass spectrometry (MS/MS) is demonstrated. Singly charged anthracene anions transfer an electron to multiply protonated peptides in a radio frequency quadrupole linear ion trap (QLT) and induce fragmentation of the peptide backbone along pathways that are analogous to those observed in electron capture dissociation. Modifications to the QLT that enable this ion/ion chemistry are presented, and automated acquisition of high-quality, single-scan electron transfer dissociation MS/MS spectra of phosphopeptides separated by nanoflow HPLC is described.
Journal Article
The abc's (and xyz's) of peptide sequencing
2004
Key Points
For mass spectrometry (MS) analysis, the proteins of interest are proteolytically digested — the resulting peptides are easier to handle, easier to sequence and have better detection efficiencies than intact proteins.
Thousands of peptides can be introduced to the mass spectrometer through 'on-line' capillary chromatography. Using MS, their masses can be measured and they can be fragmented to yield partial amino-acid-sequence information (tandem MS).
Powerful algorithms can match the data from tandem MS against possible peptide sequences in amino-acid databases. The resulting protein probability scores need to be studied carefully to avoid over-interpreting the identification results, and unbiased statistical techniques are now helping to address such problems.
Protein modifications are amenable to MS analysis, as these modifications normally induce mass shifts. However, due to the substoichiometric amounts of protein modifications, selective enrichment and detection methods are usually necessary and there is no guarantee that the complete primary structure of the protein will be covered.
Proteins can be quantified by MS using stable-isotope labels. If the relative abundance of a protein in two samples is to be compared, labelling with stable isotopes is the method of choice. The use of isotopically labelled internal standards is recommended for absolute quantification. However, peak intensities and the number of peptides that are observed during a liquid-chromatography–MS experiment (versus the number of theoretically observable peptides that can be derived from the protein of interest) can also be used to estimate protein abundance.
There has been great progress in the proteomic analysis of multiprotein complexes and subcellular organelles. However, routine, in-depth proteome analyses of whole-cell lysates, tissue samples and plasma still elude the dynamic-range capabilities and sensitivity of the instruments that are available at present.
Proteomics is an increasingly powerful and indispensable technology in molecular cell biology. It can be used to identify the components of small protein complexes and large organelles, to determine post-translational modifications and in sophisticated functional screens. The key — but little understood — technology in mass-spectrometry-based proteomics is peptide sequencing, which we describe and review here in an easily accessible format.
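As a rough illustration of how fragment masses encode sequence in tandem MS, the sketch below computes singly charged b- and y-ion m/z values for a peptide. It uses standard monoisotopic residue masses rounded to five decimals; the function names are illustrative, and modifications, higher charge states, and other ion series (a, c, x, z) are deliberately left out.

```python
# Monoisotopic residue masses in daltons, rounded to five decimals.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # monoisotopic mass of H2O
PROTON = 1.007276   # mass of a proton


def peptide_mass(peptide):
    """Neutral monoisotopic mass of the intact, unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def b_ions(peptide):
    """Singly charged b-ion m/z values, b1 .. b(n-1) (N-terminal fragments)."""
    values, running = [], 0.0
    for aa in peptide[:-1]:
        running += RESIDUE_MASS[aa]
        values.append(running + PROTON)
    return values


def y_ions(peptide):
    """Singly charged y-ion m/z values, y1 .. y(n-1) (C-terminal fragments)."""
    values, running = [], WATER
    for aa in reversed(peptide[1:]):
        running += RESIDUE_MASS[aa]
        values.append(running + PROTON)
    return values


if __name__ == "__main__":
    peptide = "PEPTIDE"
    print(peptide_mass(peptide))
    print(b_ions(peptide))
    print(y_ions(peptide))
```

Consecutive ions in one series differ by exactly one residue mass, which is what lets a spectrum be read as a sequence; leucine and isoleucine share a mass and cannot be told apart this way.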
Journal Article
Prediction of HIV drug resistance based on the 3D protein structure: Proposal of molecular field mapping
by So, Kanako; Yamashita, Fumiyoshi; Ota, Ryosaku
in Amino Acid Sequence; Amino acids; Anti-HIV Agents - chemistry
2021
A method for predicting HIV drug resistance by using genotypes would greatly assist in selecting appropriate combinations of antiviral drugs. Models reported previously have had two major problems: lack of information on the 3D protein structure and processing of incomplete sequencing data in the modeling procedure. We propose obtaining the 3D structural information of viral proteins by using homology modeling and molecular field mapping, instead of just their primary amino acid sequences. The molecular field potential parameters reflect the physicochemical characteristics associated with the 3D structure of the proteins. We also introduce the Bayesian conditional mutual information theory to estimate the probabilities of occurrence of all possible protein candidates from an incomplete sequencing sample. This approach allows for the effective use of uncertain information for the modeling process. We applied these data analysis techniques to the HIV-1 protease inhibitor dataset and developed drug resistance prediction models with reasonable performance.
Journal Article