77 result(s) for "Gimpel, Kevin"
Sequence-to-sequence modeling for graph representation learning
We propose sequence-to-sequence architectures for graph representation learning in both supervised and unsupervised regimes. Our methods use recurrent neural networks to encode and decode information from graph-structured data. Recurrent neural networks require sequences, so we traverse graphs using several types of substructures at various levels of granularity to generate sequences of nodes for encoding. Our unsupervised approaches leverage long short-term memory (LSTM) encoder-decoder models to embed the graph sequences into a continuous vector space. We then represent a graph by aggregating its graph sequence representations. Our supervised architecture uses an attention mechanism to collect information from the neighborhood of a sequence. The attention module enables our model to focus on the subgraphs that are crucial for a graph classification task. We demonstrate the effectiveness of our approaches by showing improvements over existing state-of-the-art approaches on several graph classification tasks.
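The sequence-generation step can be sketched as follows. This is a hedged illustration, not the paper's method: the adjacency-dict representation, plain random walks, and the `walk_len`/`walks_per_node` parameters are all assumptions standing in for the several substructure-based traversals the abstract mentions.

```python
import random

def random_walks(adj, walk_len=4, walks_per_node=2, seed=0):
    """Turn a graph into node sequences via random walks (one illustrative
    traversal choice; the paper explores several substructure types)."""
    rng = random.Random(seed)
    seqs = []
    for start in adj:
        for _ in range(walks_per_node):
            walk, node = [start], start
            for _ in range(walk_len - 1):
                nbrs = adj[node]
                if not nbrs:
                    break          # dead end: emit a shorter sequence
                node = rng.choice(nbrs)
                walk.append(node)
            seqs.append(walk)
    return seqs

# Toy triangle graph: 3 nodes x 2 walks each = 6 node sequences.
toy = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
seqs = random_walks(toy)
assert len(seqs) == 6 and all(len(s) <= 4 for s in seqs)
```

The resulting node sequences are what an LSTM encoder-decoder would then consume.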
From Paraphrase Database to Compositional Paraphrase Model and Back
The Paraphrase Database (PPDB; Ganitkevitch et al., 2013) is an extensive semantic resource, consisting of a list of phrase pairs with (heuristic) confidence estimates. However, it is still unclear how it can best be used, due to the heuristic nature of the confidences and its necessarily incomplete coverage. We propose models to leverage the phrase pairs from the PPDB to build parametric paraphrase models that score paraphrase pairs more accurately than the PPDB’s internal scores while simultaneously improving its coverage. They allow for learning phrase embeddings as well as improved word embeddings. Moreover, we introduce two new, manually annotated datasets to evaluate short-phrase paraphrasing models. Using our paraphrase model trained using PPDB, we achieve state-of-the-art results on standard word and bigram similarity tasks and beat strong baselines on our new short phrase paraphrase tasks.
A Sense-Topic Model for Word Sense Induction with Unsupervised Data Enrichment
Word sense induction (WSI) seeks to automatically discover the senses of a word in a corpus via unsupervised methods. We propose a sense-topic model for WSI, which treats sense and topic as two separate latent variables to be inferred jointly. Topics are informed by the entire document, while senses are informed by the local context surrounding the ambiguous word. We also discuss unsupervised ways of enriching the original corpus in order to improve model performance, including using neural word embeddings and external corpora to expand the context of each data instance. We demonstrate significant improvements over the previous state-of-the-art, achieving the best results reported to date on the SemEval-2013 WSI task.
Gaussian Error Linear Units (GELUs)
We propose the Gaussian Error Linear Unit (GELU), a high-performing neural network activation function. The GELU activation function is \(x\Phi(x)\), where \(\Phi(x)\) is the standard Gaussian cumulative distribution function. The GELU nonlinearity weights inputs by their value, rather than gating inputs by their sign as in ReLUs (\(x\mathbf{1}_{x>0}\)). We perform an empirical evaluation of the GELU nonlinearity against the ReLU and ELU activations and find performance improvements across all considered computer vision, natural language processing, and speech tasks.
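The definition \(x\Phi(x)\) translates directly into code. A minimal sketch using the exact Gaussian CDF via the error function (\(\Phi(x) = \tfrac{1}{2}(1 + \mathrm{erf}(x/\sqrt{2}))\)):

```python
import math

def gelu(x: float) -> float:
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Unlike ReLU's hard gate, negative inputs are scaled smoothly toward 0
# rather than zeroed outright.
assert gelu(0.0) == 0.0
assert abs(gelu(1.0) - 0.8413447460685429) < 1e-9   # 1 * Phi(1)
assert -1.0 < gelu(-1.0) < 0.0                      # small negative output
```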
Discriminative Feature-Rich Modeling for Syntax-Based Machine Translation
Fully-automated, high-quality machine translation promises to revolutionize human communication. But as anyone who has used a machine translation system knows, we are not there yet. In this thesis, we address four areas in which we believe translation quality can be improved across a large number of language pairs.

The first relates to flexible tree-to-tree translation modeling. Building translation systems for many language pairs requires addressing a wide range of translation divergence phenomena (Dorr, 1994). Recent research has shown clear improvement in translation quality by exploiting linguistic syntax for either the source or target language (Yamada and Knight, 2001; Galley et al., 2006; Zollmann and Venugopal, 2006; Liu et al., 2006). However, when using syntax for both languages (“tree-to-tree” translation), syntactic divergence hampers the extraction of useful rules (Ding and Palmer, 2005; Cowan et al., 2006; Ambati and Lavie, 2008; Liu et al., 2009a). Recent research shows that using soft constraints can substantially improve performance (Liu et al., 2009a; Chiang, 2010; Zhang et al., 2011; Hanneman and Lavie, 2011). Smith and Eisner (2006a) developed a flexible family of formalisms that they called quasi-synchronous grammar (QG). QG treats non-isomorphic structure softly using features rather than hard constraints. While a natural fit for syntactic translation modeling, the increased flexibility of the formalism has proved challenging for building real-world systems. In this thesis, we present the first machine translation system based on quasi-synchronous grammar.

Relatedly, we seek to unify disparate translation models. In designing a statistical model for translation, a researcher seeks to capture intuitions about how humans translate. This is typically done by specifying the form of translation rules and learning them automatically from large corpora. The current trend is toward larger and increasingly-intricate rules. Some systems use rules with flat phrase mappings (Koehn et al., 2003), while others use rules inspired by linguistic syntax (Yamada and Knight, 2001). Neither is always better than the other (DeNeefe et al., 2007; Birch et al., 2009; Galley and Manning, 2010). In this thesis, we build a system that unifies rules from these two categories in a single model. Specifically, we use rules that combine phrases and dependency syntax by developing a new formalism called quasi-synchronous phrase dependency grammar.

In order to build these models, we need learning algorithms that can support feature-rich translation modeling. Due to characteristics of the translation problem, machine learning algorithms change when adapted to machine translation (Och and Ney, 2002; Liang et al., 2006a; Arun and Koehn, 2007; Watanabe et al., 2007; Chiang et al., 2008b), producing a breed of complex learning procedures that, though effective, are not well-understood or easily replicated. In this thesis, we contribute a new family of learning algorithms based on minimizing the structured ramp loss (Do et al., 2008). We develop novel variations on this loss, draw connections to several popular learning methods for machine translation, and develop algorithms for optimization. Our algorithms are effective in practice while remaining conceptually straightforward and easy to implement.

Our final focus area is the use of syntactic structure for translation when linguistic annotations are not available. Syntax-based models typically use automatic parsers, which are built using corpora of manually-annotated parse trees. Such corpora are available for perhaps twenty languages (Marcus et al., 1993; Buchholz and Marsi, 2006; Petrov et al., 2012). In order to apply our models to the thousands of language pairs for which we do not have annotations, we turn to unsupervised parsers, which induce syntactic structures from raw text. The statistical NLP community has been doing unsupervised syntactic analysis for years (Magerman and Marcus, 1990; Brill and Marcus, 1992; Yuret, 1998; Paskin, 2002; Klein and Manning, 2002, 2004), but these systems have not yet found a foothold in translation research. In this thesis, we take the first steps in using unsupervised parsing for machine translation.
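The structured ramp loss mentioned above has several variants; one common form is the score of the best cost-augmented output minus the score of the model's best output. A hedged sketch with invented toy numbers (the candidate set, scores, and costs are illustrative assumptions, not the thesis's setup):

```python
def ramp_loss(scores, costs):
    """One common form of the structured ramp loss over an n-best list:
    max_y (score(y) + cost(y)) - max_y score(y).
    Nonnegative whenever costs are nonnegative; zero when the model's
    top-scoring candidate also maximizes the cost-augmented score."""
    return max(s + c for s, c in zip(scores, costs)) - max(scores)

# Toy: 3 candidate translations. The model's favorite (score 2.0) carries
# cost 1.0 (e.g., 1 - BLEU), so the loss penalizes that preference.
scores = [2.0, 1.5, 0.5]
costs  = [1.0, 0.0, 3.0]
assert abs(ramp_loss(scores, costs) - 1.5) < 1e-9   # (0.5+3.0) - 2.0
```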
TVStoryGen: A Dataset for Generating Stories with Character Descriptions
We introduce TVStoryGen, a story generation dataset that requires generating detailed TV show episode recaps from a brief summary and a set of documents describing the characters involved. Unlike other story generation datasets, TVStoryGen contains stories that are authored by professional screenwriters and that feature complex interactions among multiple characters. Generating stories in TVStoryGen requires drawing relevant information from the lengthy provided documents about characters based on the brief summary. In addition, we propose to train reverse models on our dataset for evaluating the faithfulness of generated stories. We create TVStoryGen from fan-contributed websites, which allows us to collect 26k episode recaps with 1868.7 tokens on average. Empirically, we take a hierarchical story generation approach and find that the neural model that uses oracle content selectors for character descriptions demonstrates the best performance on automatic metrics, showing the potential of our dataset to inspire future research on story generation with constraints. Qualitative analysis shows that the best-performing model sometimes generates content that is unfaithful to the short summaries, suggesting promising directions for future work.
Reconsidering the Past: Optimizing Hidden States in Language Models
We present Hidden-State Optimization (HSO), a gradient-based method for improving the performance of transformer language models at inference time. Similar to dynamic evaluation (Krause et al., 2018), HSO computes the gradient of the log-probability the language model assigns to an evaluation text, but uses it to update the cached hidden states rather than the model parameters. We test HSO with pretrained Transformer-XL and GPT-2 language models, finding improvement on the WikiText103 and PG-19 datasets in terms of perplexity, especially when evaluating a model outside of its training distribution. We also demonstrate downstream applicability by showing gains in the recently developed prompt-based few-shot evaluation setting, again with no extra parameters or training data.
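The core move of HSO, updating a cached hidden state by the gradient of the observed text's log-probability while the model parameters stay frozen, can be sketched on a toy model. This is an assumption-laden illustration: a fixed linear softmax readout `W` stands in for a pretrained transformer, and the single state `h` stands in for the cached hidden states; HSO itself operates on Transformer-XL/GPT-2 caches.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 5, 4                       # toy vocabulary and hidden sizes
W = rng.normal(size=(V, H))       # frozen "model parameters"
h = rng.normal(size=H)            # cached hidden state to be optimized
target = 2                        # observed next token

def log_prob(h):
    """Log-probability the softmax readout assigns to the observed token."""
    logits = W @ h
    logits = logits - logits.max()          # numerical stability
    return logits[target] - np.log(np.exp(logits).sum())

# One gradient ascent step on the hidden state (W is never updated):
probs = np.exp(W @ h - (W @ h).max())
probs /= probs.sum()
grad = W[target] - probs @ W      # d log p(target) / d h
h_new = h + 0.01 * grad

assert log_prob(h_new) > log_prob(h)   # the cached state now fits the text better
```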
MAP's not dead yet: Uncovering true language model modes by conditioning away degeneracy
It has been widely observed that exact or approximate MAP (mode-seeking) decoding from natural language generation (NLG) models consistently leads to degenerate outputs (Holtzman et al., 2019; Stahlberg and Byrne, 2019). Prior work has attributed this behavior to either a fundamental and unavoidable inadequacy of modes in probabilistic models or weaknesses in language modeling. Contrastingly, we argue that degenerate modes can even occur in the absence of any modeling error, due to contamination of the training data. Specifically, we argue that mixing even a tiny amount of low-entropy noise with a population text distribution can cause the data distribution's mode to become degenerate. We therefore propose to apply MAP decoding to the model's true conditional distribution where the conditioning variable explicitly avoids specific degenerate behavior. Using exact search, we empirically verify that the length-conditional modes of machine translation models and language models are indeed more fluent and topical than their unconditional modes. For the first time, we also share many examples of exact modal sequences from these models, and from several variants of the LLaMA-7B model. Notably, we observe that various kinds of degenerate modes persist, even at the scale of LLaMA-7B. Although we cannot tractably address these degeneracies with exact search, we perform a classifier-based approximate search on LLaMA-7B, a model which was not trained for instruction following, and find that we are able to elicit reasonable outputs without any finetuning.
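The contamination argument can be made concrete with invented numbers (a hedged illustration; these figures are not from the paper): concentrating a sliver of probability mass on one degenerate string is enough to make it the mode of the mixture.

```python
# Mix 1% "noise" mass, all on a single degenerate string (e.g. an empty or
# repetitive output), with 99% mass spread over a million fluent strings.
n_fluent = 1_000_000
eps = 0.01
p_fluent_each = (1 - eps) / n_fluent   # ~9.9e-7 per fluent string
p_noise = eps                          # 0.01 on the one degenerate string

# The degenerate string is the mixture's mode, even though fluent text
# still carries 99% of the total probability mass.
assert p_noise > p_fluent_each
```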
Structured Tree Alignment for Evaluation of (Speech) Constituency Parsing
We present the structured average intersection-over-union ratio (STRUCT-IOU), a similarity metric between constituency parse trees motivated by the problem of evaluating speech parsers. STRUCT-IOU enables comparison between a constituency parse tree (over automatically recognized spoken word boundaries) with the ground-truth parse (over written words). To compute the metric, we project the ground-truth parse tree to the speech domain by forced alignment, align the projected ground-truth constituents with the predicted ones under certain structured constraints, and calculate the average IOU score across all aligned constituent pairs. STRUCT-IOU takes word boundaries into account and overcomes the challenge that the predicted words and ground truth may not have perfect one-to-one correspondence. Extending to the evaluation of text constituency parsing, we demonstrate that STRUCT-IOU can address token-mismatch issues, and shows higher tolerance to syntactically plausible parses than PARSEVAL (Black et al., 1991).
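The averaging core of the metric reduces to interval IoU over aligned spans. A hedged sketch: the full STRUCT-IOU also requires forced alignment to the speech domain and a structured, tree-constrained matching of constituents, which this snippet assumes has already been done (the `(start, end)` span representation is an assumption).

```python
def interval_iou(a, b):
    """IoU of two time spans given as (start, end); 0 if disjoint."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def avg_iou(aligned_pairs):
    """Average IoU over already-aligned (gold, predicted) constituent spans."""
    return sum(interval_iou(g, p) for g, p in aligned_pairs) / len(aligned_pairs)

# Toy: one perfectly matched constituent and one with a shifted boundary.
pairs = [((0.0, 1.0), (0.0, 1.0)),    # IoU = 1
         ((0.0, 2.0), (1.0, 3.0))]    # IoU = 1/3
assert abs(avg_iou(pairs) - 2.0 / 3.0) < 1e-9
```

Note how the shifted-boundary pair is penalized gradually rather than counted as a hard mismatch, which is what lets the metric tolerate imperfect word boundaries.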