57 results for "parallel speech corpora"
The Development and Experimental Evaluation of a Multilingual Speech Corpus for Low-Resource Turkic Languages
The development of parallel audio corpora for Turkic languages such as Kazakh, Uzbek, and Tatar remains a significant challenge for multilingual speech synthesis, recognition, and machine translation. These languages are low-resource in speech technology, lacking the large audio datasets with aligned transcriptions that modern recognition, synthesis, and understanding systems require. This article presents the development and experimental evaluation of a speech corpus for Turkic languages, intended for use in speech synthesis and automatic translation tasks. The primary objective is to create parallel audio corpora using a cascade generation method, which combines artificial intelligence and text-to-speech (TTS) technologies to generate both audio and text, and to evaluate the quality and suitability of the generated data. To assess the quality of the synthesized speech, metrics measuring naturalness, intonation, expressiveness, and linguistic adequacy were applied. As a result, a multimodal (Kazakh–Turkish, Kazakh–Tatar, Kazakh–Uzbek) corpus was created, combining high-quality natural Kazakh audio with transcriptions and translations, along with synthetic audio in Turkish, Tatar, and Uzbek. These corpora offer a unique resource for speech and text processing research, enabling the integration of ASR, MT, TTS, and speech-to-speech translation (STS).
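
To make the cascade idea concrete, here is a minimal sketch of pairing a natural Kazakh utterance with a machine-translated transcript and a synthetic TTS rendering of that translation. The translate_fn and tts_fn callables are placeholders for whatever MT and TTS engines are actually used; they are not taken from the abstract.

```python
# Cascade generation sketch: MT on the transcript first, then TTS on the
# translated text. All engine hooks below are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ParallelEntry:
    kazakh_audio_path: str
    kazakh_text: str
    target_lang: str
    target_text: str
    target_audio: bytes

def build_parallel_entry(audio_path: str,
                         kazakh_text: str,
                         target_lang: str,
                         translate_fn: Callable[[str, str], str],
                         tts_fn: Callable[[str, str], bytes]) -> ParallelEntry:
    """Cascade generation: text translation first, then speech synthesis."""
    target_text = translate_fn(kazakh_text, target_lang)   # MT step
    target_audio = tts_fn(target_text, target_lang)        # TTS step
    return ParallelEntry(audio_path, kazakh_text, target_lang,
                         target_text, target_audio)

if __name__ == "__main__":
    # Trivial stand-ins so the sketch runs without any external service.
    dummy_translate = lambda text, lang: f"[{lang}] {text}"
    dummy_tts = lambda text, lang: text.encode("utf-8")
    entry = build_parallel_entry("utt_0001.wav", "Сәлем, қалайсың?", "tr",
                                 dummy_translate, dummy_tts)
    print(entry.target_text)
```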
SpiRit-LM: Interleaved Spoken and Written Language Model
We introduce SpiRit-LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a 7B pretrained text language model that we extend to the speech modality by continuously training it on text and speech units. Speech and text sequences are concatenated as a single stream of tokens and trained with a word-level interleaving method using a small, automatically curated speech-text parallel corpus. SpiRit-LM comes in two versions: a Base version that uses speech phonetic units (HuBERT) and an Expressive version that models expressivity using pitch and style units in addition to the phonetic units. For both versions, the text is encoded with subword BPE tokens. The resulting model displays both the semantic abilities of text models and the expressive abilities of speech models. Additionally, we demonstrate that SpiRit-LM can learn new tasks in a few-shot fashion across modalities (i.e., ASR, TTS, speech classification). We make the model weights and inference code available.
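
As a rough illustration of word-level interleaving of text and speech tokens into a single stream, here is a small sketch. The [TEXT]/[SPEECH] markers, the alignment format, and the span-switching rule are illustrative assumptions, not the paper's exact recipe.

```python
# Interleaving sketch: switch modality at word boundaries and emit a modality
# marker only when the modality changes.
import random

def interleave(aligned_words, p_switch=0.3, seed=0):
    """aligned_words: list of (text_token, [speech_unit_ids]) pairs,
    produced by a word-level speech-text alignment (assumed given)."""
    rng = random.Random(seed)
    stream, modality, prev = [], "text", None
    for text_tok, speech_units in aligned_words:
        if rng.random() < p_switch:                 # maybe switch at this word boundary
            modality = "speech" if modality == "text" else "text"
        if modality != prev:                        # emit marker on modality change
            stream.append("[TEXT]" if modality == "text" else "[SPEECH]")
            prev = modality
        if modality == "text":
            stream.append(text_tok)
        else:
            stream.extend(f"hu{u}" for u in speech_units)  # HuBERT-style unit ids
    return stream

if __name__ == "__main__":
    aligned = [("the", [12, 40]), ("cat", [7]), ("sat", [88, 3, 19])]
    print(interleave(aligned))
```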
A Corpus-Assisted Translation Study of Strategies Used in Rendering Culture-Bound Expressions in the Speeches of King Abdullah II
Translation is defined as transferring meaning and style from one language to another, taking the text producer's intended purpose and the audience's culture into account. This paper uses a 256,000-word Arabic-English parallel corpus of the speeches of King Abdullah II of Jordan from 1999 to 2015 to examine how certain culture-bound expressions were translated from Arabic into English. To do so, two software packages were used, namely WordSmith 6 and SketchEngine. Comparing the size of the Arabic corpus with its English counterpart using the wordlist tool of WS6, the researchers found that the number of words (tokens) in the English translation is greater than in the Arabic source text. However, the results showed that the Arabic text has more unique words, i.e., a higher lexical density than its English counterpart. The researchers carried out a keyword analysis and compared the Arabic corpus with the ArTenTen corpus to identify the words that King Abdullah II saliently used in his speeches. Most of the keywords were culture-bound and related to the Jordanian context, which might make them challenging to render. Using the parallel concordance tool and comparing the Arabic text with its English translation showed that the translator(s) mainly resorted to the strategies of deletion, addition, substitution, and transliteration. The researchers recommend that further studies be conducted using the same approach but on larger corpora of other genres, such as legal, religious, press, and scientific texts.
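
A minimal sketch of the wordlist-style comparison described above: token counts and type/token ratio (a rough stand-in for the lexical-density comparison) for an Arabic source segment and its English translation. The real study used WordSmith 6 and SketchEngine; this simple regex tokenizer is only an approximation, especially for Arabic, and the sample sentences are illustrative.

```python
# Wordlist-style statistics: tokens, types, and type/token ratio per text.
import re

def wordlist_stats(text: str):
    tokens = re.findall(r"\w+", text, flags=re.UNICODE)
    types = set(t.lower() for t in tokens)
    ttr = len(types) / len(tokens) if tokens else 0.0
    return {"tokens": len(tokens), "types": len(types), "ttr": round(ttr, 3)}

arabic_src = "جلالة الملك عبد الله الثاني يلقي خطاباً أمام الأمة"          # illustrative
english_trans = "His Majesty King Abdullah II delivers a speech to the nation"  # illustrative
print("AR:", wordlist_stats(arabic_src))
print("EN:", wordlist_stats(english_trans))
```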
Enhancing Code-Switching Research Through Comparable Corpora: Introducing the El Paso Bilingual Corpus
Research on language contact outcomes, such as code-switching, continues to face theoretical and methodological challenges, particularly due to the difficulty of comparing findings across studies that use divergent data collection methods. Accordingly, scholars have emphasized the need for publicly available and comparable bilingual corpora. This paper introduces the El Paso Bilingual Corpus, a new Spanish–English bilingual corpus recorded in El Paso (TX) in 2022, designed to be methodologically comparable to the Bangor Miami Corpus. The paper is structured in three main sections. First, we review the existing Spanish–English corpora and examine the theoretical challenges posed by studies using non-comparable methodologies, thereby underscoring the gap addressed by the El Paso Bilingual Corpus. Second, we outline the corpus creation process, discussing participant recruitment, data collection, and transcription, and provide an overview of these data, including participants’ sociolinguistic profiles. Third, to demonstrate the practical value of methodologically aligned corpora, we report a comparative case study on diminutive expressions in the El Paso and Bangor Miami corpora, illustrating how shared collection protocols can elucidate the role of community-specific social factors in bilinguals’ morphosyntactic choices.
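
As an illustration of the kind of comparable-corpus query behind the diminutive case study, here is a crude sketch that counts Spanish diminutive forms (-ito/-ita and -cito/-cita variants) per 1,000 words in two transcripts. The suffix regex is a rough heuristic, not the study's coding scheme, and the sample sentences are invented.

```python
# Heuristic diminutive counter for Spanish transcripts (illustrative only).
import re

DIM = re.compile(r"\b\w+?c?it[oa]s?\b", re.IGNORECASE | re.UNICODE)

def diminutives_per_1000(text: str) -> float:
    tokens = re.findall(r"\w+", text, flags=re.UNICODE)
    hits = DIM.findall(text)
    return 1000 * len(hits) / len(tokens) if tokens else 0.0

el_paso_sample = "Dame un momentito, el perrito está chiquito pero come mucho."
miami_sample = "Un poquito de café ahorita, please, y un vasito de agua."
print(round(diminutives_per_1000(el_paso_sample), 1),
      round(diminutives_per_1000(miami_sample), 1))
```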
Lost in translation: Decoding the errors in consecutive interpreting by Chinese EFL learners
Errors, as linguistic manifestations of cognitive challenges in interpreting, have attracted considerable attention in the fields of teaching, practice, and assessment. This paper examines the nature of interpreting errors among Chinese EFL (English as a Foreign Language) learners and explores their relationship with gender differences and interpreting performance, using the Parallel Corpus of Chinese EFL Learners–Spoken (PACCEL-S). The findings indicate the following: (1) improper speech flow was the most frequent and dense type of error, followed by grammatical errors and semantic deviations, while information default occurred least frequently. These patterns can be attributed to deficiencies in learners’ language proficiency, interpreting skills, and emotional regulation; (2) only grammatical errors showed a statistically significant correlation with gender, suggesting that gender-related differences in communication psychology may influence interpreting performance; (3) semantic deviation, information default, and improper speech flow were all significantly and negatively correlated with interpretation scores, whereas grammatical errors showed no significant correlation with these scores. These results suggest that English proficiency assessments may tolerate a certain degree of grammatical error in interpreting tasks. Situated within the context of interpreting education, this study extends research on interpreting errors, enriches interpreting pedagogy and assessment, and deepens our understanding of the challenges faced by Chinese EFL learners.
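
A small sketch of the kind of correlation analysis reported above: per-learner counts of one error type against interpreting scores. The numbers are invented for illustration, and the choice of Spearman correlation here is an assumption about the analysis, not a detail given in the abstract.

```python
# Correlate error counts with interpreting scores (illustrative data).
from scipy.stats import spearmanr

speech_flow_errors = [12, 9, 15, 7, 11, 4, 14, 8]        # errors per learner (made up)
interpreting_scores = [68, 74, 61, 80, 70, 85, 63, 77]    # scores per learner (made up)

rho, p_value = spearmanr(speech_flow_errors, interpreting_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")      # expect a negative correlation
```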
A Pilot Study on Multilingual Detection of Irregular Migration Discourse on X and Telegram Using Transformer-Based Models
The rise of Online Social Networks has reshaped global discourse, enabling real-time conversations on complex issues such as irregular migration. Yet the informal, multilingual, and often noisy nature of content on platforms like X (formerly Twitter) and Telegram presents significant challenges for reliable automated analysis. This study presents an exploratory multilingual natural language processing (NLP) framework for detecting irregular migration discourse across five languages. Conceived as a pilot study addressing extreme data scarcity in sensitive migration contexts, this work evaluates transformer-based models on a curated multilingual corpus. It provides an initial baseline for monitoring informal migration narratives on X and Telegram. We evaluate a broad range of approaches, including traditional machine learning classifiers, SetFit sentence-embedding models, fine-tuned multilingual BERT (mBERT) transformers, and a Large Language Model (GPT-4o). The results show that GPT-4o achieves the highest performance overall (F1-score: 0.84), with scores reaching 0.89 in French and 0.88 in Greek. While mBERT excels in English, SetFit outperforms mBERT in low-resource settings, specifically in Arabic (0.79 vs. 0.70) and Greek (0.88 vs. 0.81). The findings highlight the effectiveness of transformer-based and large-language-model approaches, particularly in low-resource or linguistically heterogeneous environments. Overall, the proposed framework provides an initial, compact benchmark for multilingual detection of irregular migration discourse under extreme, low-resource conditions. The results should be viewed as exploratory indicators of model behavior on this synthetic, small-scale corpus, not as statistically generalizable evidence or deployment-ready tools. In this context, “multilingual” refers to robustness across different linguistic realizations of identical migration narratives under translation, rather than coverage of organically diverse multilingual public discourse.
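
To illustrate the per-language evaluation described above, here is a minimal sketch computing macro F1-scores separately for each language's test split. Labels and predictions are placeholders; the compared systems (GPT-4o, mBERT, SetFit) are not reproduced here.

```python
# Per-language macro F1 evaluation on binary migration-discourse labels.
from sklearn.metrics import f1_score

test_sets = {
    "french": ([1, 0, 1, 1, 0, 0], [1, 0, 1, 0, 0, 0]),   # (gold, predicted), illustrative
    "greek":  ([0, 1, 1, 0, 1, 0], [0, 1, 1, 0, 0, 0]),
    "arabic": ([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1]),
}

for lang, (gold, pred) in test_sets.items():
    print(lang, round(f1_score(gold, pred, average="macro"), 2))
```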
Low-Resourced Alphabet-Level Pivot-Based Neural Machine Translation for Translating Korean Dialects
Developing a machine translator from a Korean dialect to a foreign language presents significant challenges due to the lack of a parallel corpus for direct dialect translation. To address this issue, this paper proposes a pivot-based machine translation model that consists of two sub-translators. The first sub-translator is a sequence-to-sequence model with minGRU as the encoder and GRU as the decoder. It normalizes a dialect sentence into a standard sentence and employs alphabet-level tokenization. The second sub-translator is a legacy translator, such as an off-the-shelf neural machine translator or an LLM, which translates the normalized standard sentence into a foreign sentence. The effectiveness of alphabet-level tokenization and the minGRU encoder for the normalization model is demonstrated through empirical analysis. Alphabet-level tokenization is shown to be more effective for Korean dialect normalization than other widely used sub-word tokenizations. The minGRU encoder exhibits performance comparable to a GRU encoder while being faster and more effective at handling longer token sequences. The pivot-based translation method is also validated through a broad range of experiments, and its effectiveness in translating Korean dialects into English, Chinese, and Japanese is demonstrated empirically.
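
As background on what alphabet-level tokenization means for Korean, here is a sketch that decomposes each Hangul syllable block into its constituent jamo (initial consonant, vowel, optional final consonant) using the standard Unicode arithmetic. Whether the paper's tokenizer matches this exact decomposition is an assumption.

```python
# Decompose precomposed Hangul syllables (U+AC00..U+D7A3) into jamo tokens.
def to_jamo(text: str):
    tokens = []
    for ch in text:
        code = ord(ch)
        if 0xAC00 <= code <= 0xD7A3:               # precomposed Hangul syllable
            idx = code - 0xAC00
            lead, vowel, tail = idx // 588, (idx % 588) // 28, idx % 28
            tokens.append(chr(0x1100 + lead))      # choseong (initial consonant)
            tokens.append(chr(0x1161 + vowel))     # jungseong (vowel)
            if tail:
                tokens.append(chr(0x11A7 + tail))  # jongseong (final), if present
        else:
            tokens.append(ch)                      # keep spaces, Latin, punctuation
    return tokens

print(to_jamo("안녕하세요"))
```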
Does simplification hold true for machine translations? A corpus-based analysis of lexical diversity in text varieties across genres
Extensive studies have described the linguistic features of human translations and verified the existence of the simplification translation universal. However, little is known about the linguistic features of machine translations, even though machine translation, as a unique modality of translation, has become an integral part of translation practice. This study tests whether the simplification translation universal observed in human translations also holds true for machine translations. If so, do simplification features in machine translations differ significantly from those in human translations? And does genre significantly affect simplification features? To this end, we built a balanced comparable corpus containing three text varieties (machine translations, human translations, and target-language originals) across three genres, namely contemporary novels, government documents, and academic abstracts. Based on the corpus, we conducted a systematic comparison of the lexical diversity, as a proxy for simplification, of the different text varieties. The results show that simplification is corroborated overall in both machine and human translations when compared with target-language originals, and that machine translations are more simplified than human translations. Additionally, genre is found to exert a significant influence on the lexical diversity of the different text varieties. This study is expected to expand the scope of corpus-based translation studies on the one hand and to offer insights into the improvement of machine translation systems on the other.
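
A compact sketch of one common way to operationalize lexical diversity for such a comparison: a standardized type-token ratio (mean TTR over fixed-size windows), which avoids raw TTR's sensitivity to text length. The exact diversity measure used in the study may differ.

```python
# Standardized type-token ratio as a length-robust lexical-diversity proxy.
import re

def standardized_ttr(text: str, window: int = 1000) -> float:
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    ratios = []
    for i in range(0, len(tokens) - window + 1, window):
        chunk = tokens[i:i + window]
        ratios.append(len(set(chunk)) / window)
    # Fall back to plain TTR when the text is shorter than one window.
    return sum(ratios) / len(ratios) if ratios else len(set(tokens)) / max(len(tokens), 1)

# Illustrative usage: compare machine translation, human translation, and
# target-language original of the same genre (file names are hypothetical).
# for name, path in [("MT", "mt.txt"), ("HT", "ht.txt"), ("orig", "orig.txt")]:
#     print(name, standardized_ttr(open(path, encoding="utf-8").read()))
```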
Mixtec–Spanish Parallel Text Dataset for Language Technology Development
This article introduces a freely available Spanish–Mixtec parallel corpus designed to foster natural language processing (NLP) development for an indigenous language that remains digitally low-resourced. The dataset, comprising 14,587 sentence pairs, covers Mixtec variants from Guerrero (Tlacoachistlahuaca, Northern Guerrero, and Xochapa) and Oaxaca (Western Coast, Southern Lowland, Santa María Yosoyúa, Central, Lower Cañada, Western Central, San Antonio Huitepec, Upper Western, and Southwestern Central). Texts are classified into four main domains: education, law, health, and religion. To compile these data, we conducted a two-phase collection process: first, an online search of government portals, religious organizations, and Mixtec language blogs; and second, on-site retrieval of physical texts from the library of the Autonomous University of Querétaro. Scanning and optical character recognition were then performed to digitize the physical materials, followed by manual correction to fix character misreadings and remove duplicates or irrelevant segments. We conducted a preliminary evaluation of the collected data to validate its usability in automatic translation systems. From Spanish to Mixtec, a fine-tuned GPT-4o-mini model yielded a BLEU score of 0.22 and a TER score of 122.86, while two fine-tuned open-source models, mBART-50 and M2M-100, yielded BLEU scores of 4.2 and 2.63 and TER scores of 98.99 and 104.87, respectively. All code demonstrating data usage, along with the final corpus itself, is publicly accessible via GitHub and Figshare. We anticipate that this resource will enable further research into machine translation, speech recognition, and other NLP applications while contributing to the broader goal of preserving and revitalizing the Mixtec language.
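
For readers unfamiliar with the reported metrics, here is a minimal sketch of corpus-level BLEU and TER scoring with sacrebleu. The hypothesis and reference strings are placeholders, and the paper's exact tokenization and scoring settings are not known from the abstract.

```python
# Corpus-level BLEU and TER with sacrebleu (placeholder data).
from sacrebleu.metrics import BLEU, TER

hypotheses = ["placeholder system output one",          # one string per test segment
              "placeholder system output two"]
references = [["placeholder reference one",             # one reference stream
               "placeholder reference two"]]

bleu = BLEU().corpus_score(hypotheses, references)
ter = TER().corpus_score(hypotheses, references)
print(bleu)   # e.g. "BLEU = ..."
print(ter)    # e.g. "TER = ..."
```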
A Corpus-Based Comparative Study of Interpreting Styles in the Press Conferences of China’s “Two Sessions”
Every year, China’s “Two Sessions” draw researchers’ attention to the interpreting performed at the accompanying press conferences. However, the related research mostly focuses on interpreting strategies and skills and rarely considers the interpreters who produce the interpreting. This paper collects the texts of two interpreters at the press conferences of the “Two Sessions” from 2010 to 2020 to build a parallel corpus. By comparing their interpreting, with speeches by White House leaders as a comparable corpus, the study probes the similarities and differences between the two interpreters’ styles from a linguistic perspective. It is found that the interpreter Sun has a richer vocabulary and a higher lexical density than the interpreter Zhang. Sun’s interpreting style is closer to the written register, while Zhang’s style is more colloquial. Compared with the transcripts of the U.S. press conferences, the linguistic features of Zhang’s interpreting are more similar to those of the comparable corpus in vocabulary richness, word length, and readability. Therefore, Zhang’s interpreting may be more intelligible and acceptable to English speakers.
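
A short sketch of two of the style indicators compared above: lexical density (content words as a share of all tokens, here approximated with a small function-word list) and mean word length. The operational definitions and the readability formula used in the study are not given in the abstract, and the sample sentences are invented.

```python
# Approximate lexical density and mean word length for two interpreted samples.
import re

FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "that", "is",
                  "are", "for", "on", "with", "as", "we", "will", "be"}

def style_profile(text: str):
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return {
        "lexical_density": round(len(content) / len(tokens), 3) if tokens else 0.0,
        "mean_word_length": round(sum(map(len, tokens)) / len(tokens), 2) if tokens else 0.0,
    }

sun_sample = "We will continue to deepen reform and expand high-level opening up."   # illustrative
zhang_sample = "We will keep pushing reform and open up more to the world."          # illustrative
print("Sun:", style_profile(sun_sample))
print("Zhang:", style_profile(zhang_sample))
```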