Catalogue Search | MBRL
Search Results
Explore the vast range of titles available.
35 result(s) for "Rives, Alexander"
Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences
by Goyal, Siddharth; Ma, Jerry; Guo, Demi
in Amino acid sequence; Amino acids; Artificial intelligence
2021
Significance: Learning biological properties from sequence data is a logical step toward generative and predictive artificial intelligence for biology. Here, we propose scaling a deep contextual language model with unsupervised learning to sequences spanning evolutionary diversity. We find that without prior knowledge, information emerges in the learned representations on fundamental properties of proteins such as secondary structure, contacts, and biological activity. We show the learned representations are useful across benchmarks for remote homology detection, prediction of secondary structure, long-range residue–residue contacts, and mutational effect. Unsupervised representation learning enables state-of-the-art supervised prediction of mutational effect and secondary structure and improves state-of-the-art features for long-range contact prediction.
In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and improving state-of-the-art features for long-range contact prediction.
Journal Article
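The abstract above describes learning per-residue representations from sequence alone and reading structural signals out of them. As a minimal illustration, the sketch below uses the open-source `esm` package released with this line of work (github.com/facebookresearch/esm) to load a pretrained model and pull out per-residue representations and attention-derived contacts; the model name and example sequence are assumptions chosen for illustration, not details from the record.

```python
# Minimal sketch (not code from the paper): load a pretrained ESM-1b model with
# the open-source `esm` package and extract per-residue representations plus
# attention-derived contacts for one arbitrary example sequence.
import torch
import esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("example_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33], return_contacts=True)

# Final-layer embedding for each residue (index 0 is the beginning-of-sequence token).
residue_reps = out["representations"][33][0, 1 : len(strs[0]) + 1]
# Symmetric L x L contact map derived from the attention heads.
contacts = out["contacts"][0]
print(residue_reps.shape, contacts.shape)
```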
Modular Organization of Cellular Networks
by Galitski, Timothy; Rives, Alexander W.
in Biological Sciences; Cell Communication; Cellular biology
2003
We investigated the organization of interacting proteins and protein complexes into networks of modules. A network-clustering method was developed to identify modules. This method of network-structure determination was validated by clustering known signaling-protein modules and by identifying module rudiments in exclusively high-throughput protein-interaction data with high error frequencies and low coverage. The signaling network controlling the yeast developmental transition to a filamentous form was clustered. Abstraction of a modular network-structure model identified module-organizer proteins and module-connector proteins. The functions of these proteins suggest that they are important for module function and intermodule communication.
Journal Article
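The record does not include the clustering algorithm itself. As a rough stand-in for the idea of partitioning an interaction network into modules, the sketch below applies networkx's modularity-based community detection to a toy protein-interaction graph; the edge list and protein names are invented for illustration and this is not the authors' method.

```python
# Illustrative sketch only: group a toy protein-interaction network into modules
# using a standard modularity-based clustering from networkx.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical interaction edges (protein pairs); real input would be
# high-throughput protein-interaction data.
edges = [("STE7", "STE11"), ("STE7", "FUS3"), ("STE11", "STE50"),
         ("KSS1", "STE7"), ("CDC24", "BEM1"), ("BEM1", "CDC42"),
         ("CDC42", "STE20"), ("STE20", "STE11")]

g = nx.Graph(edges)
modules = greedy_modularity_communities(g)
for i, module in enumerate(modules):
    print(f"module {i}: {sorted(module)}")
```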
Language Models at the Scale of Evolution
This thesis describes the development of evolutionary scale modeling (ESM), which proposes to solve an inverse problem across evolution to learn the biology of proteins from their sequences at the scale of life. Beginning from the idea that the sequences of proteins contain an image of biology in their patterns, this thesis shows that language models trained on protein sequences spanning the natural diversity of the Earth, by learning to predict which amino acids evolution chooses, develop feature spaces that reflect the immense scope and complexity of protein biology containing known and unknown biology. Biological structure and function emerge in the representations of the models. This emergence is shown to occur in a direct linkage with improvements in the language modeling of sequences. The representation space has an ordered structure in which proteins are organized according to their underlying biology, and directions correspond to meaningful biological variations. Attention patterns materialize in the neural network that correspond to the folded three-dimensional structure of proteins. The probabilities assigned to amino acids within a given sequence context reflect protein function and predict the effects of mutations. The representations learned by protein language models constitute a general and transferable feature space which supports the discovery and generation of new biology. This has enabled an effort to reveal the structures of hundreds of millions of metagenomic proteins for the first time. The thesis concludes with experimental characterizations of proteins created by language models, which demonstrate that the feature space learned from natural proteins supports generating proteins beyond those in nature.
Dissertation
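One concrete claim in this abstract is that the probabilities a protein language model assigns to amino acids predict the effects of mutations. A minimal sketch of that idea follows, assuming a pretrained ESM-1b model from the `esm` package and an arbitrary example sequence: a single substitution is scored by the masked-marginal log-odds of mutant versus wild-type amino acid. This illustrates the general scoring idea, not a specific published script.

```python
# Sketch: score a point mutation by masking its position and comparing the
# model's log-probabilities for the mutant and wild-type amino acids.
import torch
import esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

wt_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"
pos, mut_aa = 10, "W"                       # 0-based position to mutate (assumed)
wt_aa = wt_seq[pos]

_, _, tokens = batch_converter([("wt", wt_seq)])
tokens[0, pos + 1] = alphabet.mask_idx      # +1 skips the beginning-of-sequence token

with torch.no_grad():
    logits = model(tokens)["logits"]
log_probs = torch.log_softmax(logits[0, pos + 1], dim=-1)

score = log_probs[alphabet.get_idx(mut_aa)] - log_probs[alphabet.get_idx(wt_aa)]
print(f"{wt_aa}{pos + 1}{mut_aa} masked-marginal score: {score.item():.3f}")
```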
Transformer protein language models are unsupervised structure learners
by Rao, Roshan; Ovchinnikov, Sergey; Meier, Joshua
in Bioinformatics; Language; Protein structure
2020
Unsupervised contact prediction is central to uncovering physical, structural, and functional constraints for protein structure determination and design. For decades, the predominant approach has been to infer evolutionary constraints from a set of related sequences. In the past year, protein language models have emerged as a potential alternative, but performance has fallen short of state-of-the-art approaches in bioinformatics. In this paper we demonstrate that Transformer attention maps learn contacts from the unsupervised language modeling objective. We find the highest-capacity models that have been trained to date already outperform a state-of-the-art unsupervised contact prediction pipeline, suggesting these pipelines can be replaced with a single forward pass of an end-to-end model.
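Results of this kind are typically reported as long-range precision at L: the fraction of true contacts among the top L scored residue pairs with sequence separation of at least 24. The sketch below implements that metric on placeholder arrays; it is a generic evaluation helper, not code from the paper.

```python
# Sketch of the standard long-range precision-at-L metric for contact prediction.
import numpy as np

def precision_at_l(pred: np.ndarray, true: np.ndarray, min_sep: int = 24) -> float:
    """pred: (L, L) contact scores; true: (L, L) binary contact map."""
    length = pred.shape[0]
    i, j = np.triu_indices(length, k=min_sep)       # long-range pairs only
    order = np.argsort(pred[i, j])[::-1][:length]   # top-L scoring pairs
    return float(true[i[order], j[order]].mean())

# Toy usage with random stand-ins for attention-derived scores and a true map.
rng = np.random.default_rng(0)
scores = rng.random((100, 100))
truth = (rng.random((100, 100)) < 0.05).astype(float)
print(f"long-range P@L: {precision_at_l(scores, truth):.3f}")
```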
MSA Transformer
by Verkuil, Robert; Rao, Roshan M; Abbeel, Pieter
in Computer applications; Language; Nucleotide sequence
2021
Unsupervised protein language models trained across millions of diverse sequences learn structure and function of proteins. Protein language models studied to date have been trained to perform inference from individual sequences. The longstanding approach in computational biology has been to make inferences from a family of evolutionarily related sequences by fitting a model to each family independently. In this work we combine the two paradigms. We introduce a protein language model which takes as input a set of sequences in the form of a multiple sequence alignment. The model interleaves row and column attention across the input sequences and is trained with a variant of the masked language modeling objective across many protein families. The performance of the model surpasses current state-of-the-art unsupervised structure learning methods by a wide margin, with far greater parameter efficiency than prior state-of-the-art protein language models.
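The abstract's central architectural idea is interleaved row and column attention over a multiple sequence alignment. The toy module below sketches that pattern in plain PyTorch; the dimensions, layer layout, and lack of tied attention are simplifications for illustration, not the published architecture.

```python
# Toy sketch of row/column (axial) attention over an embedded MSA:
# rows attend across positions within each aligned sequence,
# columns attend across sequences at each alignment position.
import torch
import torch.nn as nn

class RowColumnAttention(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_seqs, seq_len, dim) -- one embedded MSA.
        rows, _ = self.row_attn(x, x, x)            # attend across positions
        x = x + rows
        cols = x.transpose(0, 1)                    # (seq_len, num_seqs, dim)
        cols, _ = self.col_attn(cols, cols, cols)   # attend across sequences
        return x + cols.transpose(0, 1)

msa = torch.randn(16, 128, 64)                      # 16 aligned sequences, length 128
print(RowColumnAttention()(msa).shape)              # torch.Size([16, 128, 64])
```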
Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences
by Goyal, Siddharth; Ma, Jerry; Guo, Demi
in Amino acid sequence; Artificial intelligence; Evolution
2020
In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure, and improving state-of-the-art features for long-range contact prediction.
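The abstract notes that secondary-structure information "can be identified by linear projections" of the learned representations. The sketch below shows what such a linear probe looks like; the random feature and label arrays are placeholders standing in for per-residue embeddings and DSSP-style annotations, so the reported accuracy is meaningless on this toy data.

```python
# Sketch of a linear probe: fit a simple classifier from per-residue embeddings
# to three-state secondary-structure labels. Placeholder data only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(2000, 1280))   # per-residue embeddings (ESM-1b width is 1280)
labels = rng.integers(0, 3, size=2000)     # helix / strand / coil labels

x_tr, x_te, y_tr, y_te = train_test_split(features, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000)
probe.fit(x_tr, y_tr)
print(f"linear-probe accuracy: {probe.score(x_te, y_te):.3f}")
```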
A comprehensive map of the dendritic cell transcriptional network engaged upon innate sensing of HIV
by Menager, Mickael M; Bonneau, Richard; De Veaux, Nicholas
in Cell activation; Chromatin; Computer applications
2019
Transcriptional programming of the innate immune response is pivotal for host protection. However, the transcriptional mechanisms that link pathogen sensing with innate activation remain poorly understood. During infection with HIV-1, human dendritic cells (DCs) can detect the virus through an innate sensing pathway leading to antiviral interferon and DC maturation. Here, we developed an iterative experimental and computational approach to map the innate response circuitry during HIV-1 infection. By integrating genome-wide chromatin accessibility with expression kinetics, we inferred a gene regulatory network that links 542 transcription factors with 21,862 target genes. We observed that an interferon response is required, yet insufficient to drive DC maturation, and identified PRDM1 and RARA as essential regulators of the interferon response and DC maturation, respectively. Our work provides a resource for interrogation of regulators of HIV replication and innate immunity, highlighting complexity and cooperativity in the regulatory circuit controlling the DC response to infection.
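The network-inference step described here, linking transcription factors to target genes from expression and accessibility data, is commonly approximated by sparse regression of each target gene on transcription-factor activities. The sketch below illustrates that generic approach with placeholder data; it is not the paper's pipeline, and the variable names are invented.

```python
# Generic sketch: infer candidate TF->target edges by lasso regression of each
# target gene's expression on transcription-factor expression. Placeholder data.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, n_tfs, n_targets = 200, 50, 100
tf_expr = rng.normal(size=(n_samples, n_tfs))          # TF expression matrix
target_expr = rng.normal(size=(n_samples, n_targets))  # target-gene expression matrix

edges = []
for g in range(n_targets):
    model = Lasso(alpha=0.1).fit(tf_expr, target_expr[:, g])
    for tf in np.flatnonzero(model.coef_):
        edges.append((tf, g, model.coef_[tf]))          # keep nonzero coefficients

print(f"inferred {len(edges)} candidate TF->target edges")
```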
A high-level programming language for generative protein design
by Smetanin, Nikita; Candido, Salvatore; Rives, Alexander
in Artificial intelligence; Design; Programming languages
2022
Combining a basic set of building blocks into more complex forms is a universal design principle. Most protein designs have proceeded from a manual bottom-up approach using parts created by nature, but top-down design of proteins is fundamentally hard due to biological complexity. We demonstrate how the modularity and programmability long sought for protein design can be realized through generative artificial intelligence. Advanced protein language models demonstrate emergent learning of atomic resolution structure and protein design principles. We leverage these developments to enable the programmable design of de novo protein sequences and structures of high complexity. First, we describe a high-level programming language based on modular building blocks that allows a designer to easily compose a set of desired properties. We then develop an energy-based generative model, built on atomic resolution structure prediction with a language model, that realizes all-atom structure designs that have the programmed properties. Designing a diverse set of specifications, including constraints on atomic coordinates, secondary structure, symmetry, and multimerization, demonstrates the generality and controllability of the approach. Enumerating constraints at increasing levels of hierarchical complexity shows that the approach can access a combinatorially large design space.
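The abstract's central idea is that a design "program" composes modular constraints which an energy-based model then realizes. The sketch below illustrates only the composition pattern, with constraints as summed energy terms used to rank candidate sequences; the constraints and scoring functions are invented for illustration and do not reflect the authors' system.

```python
# Sketch of composing declarative design constraints into a single energy:
# each constraint scores a candidate sequence, lower total energy is better.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Constraint:
    name: str
    energy: Callable[[str], float]   # lower energy = constraint better satisfied

def total_energy(seq: str, program: List[Constraint]) -> float:
    return sum(c.energy(seq) for c in program)

# Two toy constraints: a target length and a minimum hydrophobic fraction.
program = [
    Constraint("length_100", lambda s: abs(len(s) - 100) / 100),
    Constraint("hydrophobic_30pct",
               lambda s: max(0.0, 0.3 - sum(a in "AVILMFWY" for a in s) / len(s))),
]

candidates = ["M" * 100, "AVILMFWY" * 12, "GS" * 60]
best = min(candidates, key=lambda s: total_energy(s, program))
print("lowest-energy candidate:", best[:20], f"... E={total_energy(best, program):.3f}")
```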