Catalogue Search | MBRL

Generative Spoken Dialogue Language Modeling

by Nguyen, Tu Anh , Kharitonov, Eugene , Tomasello, Paden in Computation and Language , Computer Science , Conversation

2023

We introduce dGSLM, the first “textless” model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter, and other paralinguistic signals in the two channels simultaneously and reproduces more naturalistic and fluid turn-taking compared to a text-based cascaded model.

Journal Article

SpiRit-LM : Interleaved Spoken and Written Language Model

by Yu, Bokai , Williamson, Mary , Dupoux, Emmanuel in Classification , Computation and Language , Computational linguistics

2025

We introduce SpiRit-LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a 7B pretrained text language model that we extend to the speech modality by continuously training it on text and speech units. Speech and text sequences are concatenated as a single stream of tokens, and trained with a word-level interleaving method using a small automatically curated speech-text parallel corpus. SpiRit-LM comes in two versions: a Base version that uses speech phonetic units (HuBERT) and an Expressive version that models expressivity using pitch and style units in addition to the phonetic units. For both versions, the text is encoded with subword BPE tokens. The resulting model displays both the semantic abilities of text models and the expressive abilities of speech models. Additionally, we demonstrate that SpiRit-LM can learn new tasks in a few-shot fashion across modalities (i.e., ASR, TTS, Speech Classification). We make available model weights and inference code.

Journal Article

Mapping Urban Air Quality from Mobile Sensors Using Spatio-Temporal Geostatistics

by Chatellier, Patrice , Idir, Yacine Mohamed , Judalet, Vincent in Air pollution , air quality , Algorithms

2021

With the advancement of technology and the arrival of miniaturized environmental sensors that offer greater performance, the idea of building mobile network sensing for air quality has quickly emerged to increase our knowledge of air pollution in urban environments. However, with these new techniques, the difficulty of building mathematical models capable of aggregating all these data sources in order to provide precise mapping of air quality arises. In this context, we explore spatio-temporal geostatistical methods as a solution for such a problem and evaluate three different methods: Simple Kriging (SK) in residuals, Ordinary Kriging (OK), and Kriging with External Drift (KED). On average, geostatistical models showed 26.57% improvement in the Root Mean Squared Error (RMSE) compared to the standard Inverse Distance Weighting (IDW) technique in interpolating scenarios (27.94% for KED, 26.05% for OK, and 25.71% for SK). The results showed less significant scores in extrapolating scenarios (a 12.22% decrease in the RMSE for geostatistical models compared to IDW). We conclude that univariable geostatistics is suitable for interpolating this type of data but is less appropriate for an extrapolation of non-sampled places since it does not create any information.

Journal Article

DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon

by Zaiem, Salah , Dupoux, Emmanuel , Ricoul, Tristan in Ablation , Algorithms , Bayesian analysis

2022

Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a ‘space’ delimiter between words. Popular Bayesian non-parametric models for text segmentation (Goldwater et al.) use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state-of-the-art in 5 languages. The algorithm monotonically improves with better input representations, achieving yet higher scores when fed with weakly supervised inputs. Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic and syntactic representations as assessed by a new spoken word embedding benchmark.

Journal Article

Coupling an annotated corpus and a lexicon for state-of-the-art POS tagging

by Denis, Pascal , Sagot, Benoît in Accuracy , Applied linguistics , Artificial intelligence

2012

This paper investigates how best to couple hand-annotated data with information extracted from an external lexical resource to improve part-of-speech tagging performance. Focusing mostly on French tagging, we introduce a maximum entropy Markov model-based tagging system that is enriched with information extracted from a morphological resource. This system gives a 97.75% accuracy on the French Treebank, an error reduction of 25% (38% on unknown words) over the same tagger without lexical information. We perform a series of experiments that help explain how this lexical information improves tagging accuracy. We also conduct experiments on datasets and lexicons of varying sizes in order to assess the best trade-off between annotating data and developing a lexicon. We find that the use of a lexicon improves the quality of the tagger at any stage of development of either resource, and that for fixed performance levels the availability of the full lexicon consistently reduces the need for supervised data by at least one half.

Journal Article

An Engine Load Monitoring Approach for Quantifying Yearly Methane Slip Emissions from an LNG-Powered RoPax Vessel

by Defossez, Raphael , Joubert, Aurélie , Mahi, Ridha in Air pollution , Air quality management , Carbon dioxide

2025

Liquefied natural gas (LNG) is increasingly used as a marine fuel due to its capacity to significantly reduce emissions of particulate matter, sulfur oxides (SOx), and nitrogen oxides (NOx), compared to conventional fuels. In addition, LNG combustion produces less carbon dioxide (CO2) than conventional marine fuels, and the use of non-fossil LNG offers further potential for reducing greenhouse gas emissions. However, this benefit can be partially offset by methane slip—the release of unburned methane in engine exhaust—which has a much higher global warming potential than CO2. This study presents an experimental evaluation of methane emissions from a RoPax vessel powered by low-pressure dual-fuel four-stroke engines with a direct mechanical propulsion system. Methane slip was measured directly during onboard testing and combined with a year-long analysis of engine operation using an Engine Load Monitoring (ELM) method. The yearly average methane slip coefficient (Cslip) obtained was 1.57%, slightly lower than values reported in previous studies on cruise ships (1.7%), and significantly lower than the default values specified by the FuelEU (3.1%) Maritime regulation and IMO (3.5%) LCA guidelines. This result reflects the ship’s operational profile, characterized by long crossings at high and stable engine loads. This study provides results that could support more representative emission assessments and can contribute to ongoing regulatory discussions.

Journal Article

Methodological Development for Studying the Chemical Composition of Exhaust Particle Emissions: Application to a Passenger Vessel Operating on Marine Gas Oil

by Joubert, Aurélie , Le Coq, Laurence , Mahi, Ridha in Air pollution , Alcohols , Alkanes

2025

On-board emission measurements were conducted at the exhaust of a passenger ship operating under real-world conditions. The chemical composition of exhaust particulate emissions from a turbocharged four-stroke marine diesel engine, operated on Marine Gas Oil, was studied. A variety of organic compounds, including alkanes, alkenes, alcohols, cycloalkanes, cycloalkenes, esters, ketones, and carboxylic acids, were analyzed. Alkanes were the most abundant organic compounds, followed by alkenes, esters, and alcohols. Emission factors for these compounds were determined under two operating conditions: low engine load (at berth at 400 rpm/4% load, and during port maneuvers at 800 rpm/14% load) and high engine load (during cruising at 1000 rpm/68% load). A clear increase in organic-compound emission factors was observed at lower loads. The total particulate matter emission factors were between 0.02 and 0.03 g/kWh at high-load points and exhibited significant variability under low-load conditions, from 0.02 to 2.83 g/kWh. The effect of a marine fuel additive was evaluated in this study. Using this fuel additive resulted in a significant decrease in both particulate matter and organic-compound emission factors, especially at low engine loads. Furthermore, the marine fuel additive decreased the total emission factors (EFTOCs) by a factor of 56 under low-load conditions. For high loads, the additive had no effect on the EFTOCs.

Journal Article

Constructing a poor man's wordnet in a resource-rich world

by Fišer, Darja , Sagot, Benoît in Accuracy , Automation , Bootstrap method

2015

In this paper we present a language-independent, fully modular and automatic approach to bootstrap a wordnet for a new language by recycling different types of already existing language resources, such as machine-readable dictionaries, parallel corpora, and Wikipedia. The approach, which we apply here to Slovene, takes into account monosemous and polysemous words, general and specialised vocabulary as well as simple and multi-word lexemes. The extracted words are then assigned one or several synset ids, based on a classifier that relies on several features including distributional similarity. Finally, we identify and remove highly dubious (literal, synset) pairs, based on simple distributional information extracted from a large corpus in an unsupervised way. Automatic, manual and task-based evaluations show that the resulting resource, the latest version of the Slovene wordnet, is already a valuable source of lexico-semantic information.

Journal Article

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

by Lawson, Nze , Matangira, Tapiwanashe , Rivera, Clara in Ambiguity , Artificial intelligence , Audits

2022

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.

Journal Article

Data-driven synset induction and disambiguation for wordnet development

by Apidianaki, Marianna , Sagot, Benoît in Alignment , Artificial intelligence , Automation

2014

Automatic methods for wordnet development in languages other than English generally exploit information found in Princeton WordNet (PWN) and translations extracted from parallel corpora. A common approach consists in preserving the structure of PWN and transferring its content in new languages using alignments, possibly combined with information extracted from multilingual semantic resources. Even if the role of PWN remains central in this process, these automatic methods offer an alternative to the manual elaboration of new wordnets. However, their limited coverage has a strong impact on that of the resulting resources. Following this line of research, we apply a cross-lingual word sense disambiguation method to wordnet development. Our approach exploits the output of a data-driven sense induction method that generates sense clusters in new languages, similar to wordnet synsets, by identifying word senses and relations in parallel corpora. We apply our cross-lingual word sense disambiguation method to the task of enriching a French wordnet resource, the WOLF, and show how it can be efficiently used for increasing its coverage. Although our experiments involve the English–French language pair, the proposed methodology is general enough to be applied to the development of wordnet resources in other languages for which parallel corpora are available. Finally, we show how the disambiguation output can serve to reduce the granularity of new wordnets and the degree of polysemy present in PWN.

Journal Article
