Catalogue Search | MBRL

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

by Lawson, Nze; Matangira, Tapiwanashe; Rivera, Clara in Ambiguity, Artificial intelligence, Audits

2022

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.

Journal Article

On Generative Spoken Language Modeling from Raw Audio

by Bolte, Benjamin; Kharitonov, Eugene; Baevski, Alexei in Acoustics, Automatic text generation, Computation and Language

2021

We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text units), a generative language model (trained on pseudo-text), and a speech decoder (generating a waveform from pseudo-text), all trained without supervision, and validate the proposed metrics with human evaluation. Across 3 speech encoders (CPC, wav2vec 2.0, HuBERT), we find that the number of discrete units (50, 100, or 200) matters in a task-dependent and encoder-dependent way, and that some combinations approach text-based systems.

Journal Article

Hallucinations in Large Multilingual Translation Models

by Waldendorf, Jonas; Haddow, Barry; Birch, Alexandra in Computation and Language, Computer Science, Data quality

2023

Hallucinated translations can severely undermine user trust and raise safety issues when machine translation systems are deployed in the wild. Previous research on the topic focused on small bilingual models trained on high-resource languages, leaving a gap in our understanding of hallucinations in multilingual models across diverse translation scenarios. In this work, we fill this gap by conducting a comprehensive analysis (over 100 language pairs across various resource levels, going beyond English-centric directions) on both the M2M neural machine translation (NMT) models and GPT large language models (LLMs). Among several insights, we highlight that models struggle with hallucinations primarily in low-resource directions and when translating out of English, where, critically, they may reveal toxic patterns that can be traced back to the training data. We also find that LLMs produce qualitatively different hallucinations to those of NMT models. Finally, we show that hallucinations are hard to reverse by merely scaling models trained with the same data. However, employing more diverse models, trained on different data or with different procedures, as fallback systems can improve translation quality and virtually eliminate certain pathologies.

Journal Article

The Failure of the Strong Pumping Lemma for Multiple Context-Free Languages

by Salvati, Sylvain; Kobele, Gregory M.; Yoshinaka, Ryo in Analysis, Computation, Computation and Language

2014

Seki et al. (Theor. Comput. Sci. 88(2):191–229, 1991) showed that every $m$-multiple context-free language $L$ is weakly $2m$-iterative in the sense that either $L$ is finite or $L$ contains a subset of the form $\{u_0 w_1^i u_1 \cdots w_{2m}^i u_{2m} \mid i \in \mathbb{N}\}$, where $w_1 \cdots w_{2m} \neq \varepsilon$. Whether every $m$-multiple context-free language $L$ is $2m$-iterative, that is to say, whether all but finitely many elements $z$ of $L$ can be written as $z = u_0 w_1 u_1 \cdots w_{2m} u_{2m}$ with $w_1 \cdots w_{2m} \neq \varepsilon$ and $\{u_0 w_1^i u_1 \cdots w_{2m}^i u_{2m} \mid i \in \mathbb{N}\} \subseteq L$, has been open. We show that there is a 3-multiple context-free language that is not $k$-iterative for any $k$.

Journal Article

Artificial intelligence bot ChatGPT in medical research: the potential game changer as a double-edged sword

by Ollivier, Matthieu; Winkler, Philipp W.; Hirschmann, Michael T. in Artificial Intelligence, Bioengineering, Biomedical Research

2023

Journal Article

Survey on evaluation methods for dialogue systems

in Artificial intelligence, Conversation, Educational activities

2021

In this paper, we survey the methods and concepts developed for the evaluation of dialogue systems. Evaluation, in and of itself, is a crucial part of the development process. Often, dialogue systems are evaluated by means of human evaluations and questionnaires. However, this tends to be very cost- and time-intensive. Thus, much work has been put into finding methods that reduce the involvement of human labour. In this survey, we present the main concepts and methods. For this, we differentiate between the various classes of dialogue systems (task-oriented, conversational, and question-answering dialogue systems). We cover each class by introducing the main technologies developed for the dialogue systems and then present the evaluation methods regarding that class.

Journal Article

Exploring optimal granularity for extractive summarization of unstructured health records: Analysis of the largest multi-institutional archive of health records in Japan

by Yuji Matsumoto; Hiromasa Horiguchi; Takashi Okumura in Computation and Language (cs.CL), Computer applications to medicine. Medical informatics, Computer Science - Computation and Language

2022

Automated summarization of clinical texts can reduce the burden on medical professionals. “Discharge summaries” are one promising application of summarization, because they can be generated from daily inpatient records. Our preliminary experiment suggests that 20–31% of the descriptions in discharge summaries overlap with the content of the inpatient records. However, it remains unclear how the summaries should be generated from the unstructured source. To decompose the physician’s summarization process, this study aimed to identify the optimal granularity in summarization. We first defined three types of summarization units with different granularities to compare the performance of the discharge summary generation: whole sentences, clinical segments, and clauses. We defined clinical segments in this study, aiming to express the smallest medically meaningful concepts. To obtain the clinical segments, it was necessary to automatically split the texts in the first stage of the pipeline. Accordingly, we compared rule-based methods and a machine learning method, and the latter outperformed the former with an F1 score of 0.846 in the splitting task. Next, we experimentally measured the accuracy of extractive summarization using the three types of units, based on the ROUGE-1 metric, on a multi-institutional national archive of health records in Japan. The measured accuracies of extractive summarization using whole sentences, clinical segments, and clauses were 31.91, 36.15, and 25.18, respectively. We found that the clinical segments yielded higher accuracy than sentences and clauses. This result indicates that summarization of inpatient records demands finer granularity than sentence-oriented processing.
Although we used only Japanese health records, the result can be interpreted as follows: physicians extract “concepts of medical significance” from patient records and recombine them in new contexts when summarizing chronological clinical records, rather than simply copying and pasting topic sentences. This observation suggests that a discharge summary is created by higher-order information processing over concepts at the sub-sentence level, which may guide future research in this field.

Author summary

Medical practice includes significant paperwork, and therefore, automated processing of clinical texts can reduce medical professionals’ burden. Accordingly, we focused on hospitals’ discharge summaries from daily inpatient records stored in Electronic Health Records. By applying summarization technologies, which are well-studied in Natural Language Processing, discharge summaries could be generated automatically from the source texts. However, automated summarization of daily inpatient records involves various technical topics and challenges, and the generation of discharge summaries is a complex process of mixing extractive and abstractive summarization. Thus, in this study, we explored optimal granularity for extractive summarization, attempting to decompose actual physicians’ processing. In the experiments, we used three types of summarization units with different granularities to compare performances of discharge summary generation: whole sentences, clinical segments, and clauses. We originally defined clinical segments, aiming to express the smallest medically meaningful concepts. The result indicated that sub-sentence processing, larger than clauses, improves the quality of the summaries. This finding can guide future development of medical documents’ automated summarization.

Journal Article

Machine Learning and Natural Language Processing in Mental Health: Systematic Review

by Kim-Dufor, Deok-Hee; Lenca, Philippe; Marsh, Jonathan in Algorithms, Apprentissage machine, Artificial Intelligence

2021

Machine learning systems are part of the field of artificial intelligence that automatically learn models from data to make better decisions. Natural language processing (NLP), by using corpora and learning approaches, provides good performance in statistical tasks, such as text classification or sentiment mining. The primary aim of this systematic review was to summarize and characterize, in methodological and technical terms, studies that used machine learning and NLP techniques for mental health. The secondary aim was to consider the potential use of these methods in mental health clinical practice. This systematic review follows the PRISMA (Preferred Reporting Items for Systematic Review and Meta-analysis) guidelines and is registered with PROSPERO (Prospective Register of Systematic Reviews; number CRD42019107376). The search was conducted using 4 medical databases (PubMed, Scopus, ScienceDirect, and PsycINFO) with the following keywords: machine learning, data mining, psychiatry, mental health, and mental disorder. The exclusion criteria were as follows: languages other than English, anonymization process, case studies, conference papers, and reviews. No limitations on publication dates were imposed. A total of 327 articles were identified, of which 269 (82.3%) were excluded and 58 (17.7%) were included in the review. The results were organized through a qualitative perspective. Although studies had heterogeneous topics and methods, some themes emerged. Population studies could be grouped into 3 categories: patients included in medical databases, patients who came to the emergency room, and social media users. The main objectives were to extract symptoms, classify severity of illness, compare therapy effectiveness, provide psychopathological clues, and challenge the current nosography. Medical records and social media were the 2 major data sources. 
With regard to the methods used, preprocessing used the standard methods of NLP and unique identifier extraction dedicated to medical texts. Efficient classifiers were preferred rather than transparent functioning classifiers. Python was the most frequently used platform. Machine learning and NLP models have been highly topical issues in medicine in recent years and may be considered a new paradigm in medical research. However, these processes tend to confirm clinical hypotheses rather than developing entirely new information, and only one major category of the population (ie, social media users) is an imprecise cohort. Moreover, some language-specific features can improve the performance of NLP methods, and their extension to other languages should be more closely investigated. However, machine learning and NLP techniques provide useful information from unexplored data (ie, patients' daily habits that are usually inaccessible to care providers). Before these techniques are considered as an additional tool of mental health care, ethical issues remain and should be discussed in a timely manner. Machine learning and NLP methods may offer multiple perspectives in mental health research but should also be considered as tools to support clinical practice.

Journal Article

Generative Spoken Dialogue Language Modeling

by Nguyen, Tu Anh; Kharitonov, Eugene; Tomasello, Paden in Computation and Language, Computer Science, Conversation

2023

We introduce dGSLM, the first “textless” model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter, and other paralinguistic signals in the two channels simultaneously and reproduces more naturalistic and fluid turn taking compared to a text-based cascaded model.

Journal Article

Variable Scale Pruning for Transformer Model Compression in End-to-End Speech Recognition

by Rouas, Jean-Luc; Ben Letaifa, Leila in Computation and Language, Computer Science

2023

Transformer models are being increasingly used in end-to-end speech recognition systems for their performance. However, their substantial size poses challenges for deploying them in real-world applications. These models heavily rely on attention and feedforward layers, with the latter containing a vast number of parameters that significantly contribute to the model’s memory footprint. Consequently, it becomes pertinent to consider pruning these layers to reduce the model’s size. In this article, our primary focus is on the feedforward layers. We conduct a comprehensive analysis of their parameter count and distribution. Specifically, we examine the weight distribution within each layer and observe how the weight values progress across the transformer model’s blocks. Our findings demonstrate a correlation between the depth of the feedforward layers and the magnitude of their weights. Consequently, layers with higher weight values require less pruning. Building upon this insight, we propose a novel pruning algorithm based on variable rates. This approach sets the pruning rate according to the significance and location of each feedforward layer within the network. To evaluate our new pruning method, we conduct experiments on various datasets. The results reveal its superiority over conventional pruning techniques, such as local pruning and global pruning.

Journal Article
