16,804 result(s) for "Language modeling"
Diversity and language technology: how language modeling bias causes epistemic injustice
It is well known that AI-based language technology (large language models, machine translation systems, multilingual dictionaries, and corpora) is currently limited to three percent of the world’s most widely spoken, financially and politically backed languages. In response, recent efforts have sought to address the “digital language divide” by extending the reach of large language models to “underserved languages.” We show how some of these efforts tend to produce flawed solutions that adhere to a hard-wired representational preference for certain languages, which we call language modeling bias. Language modeling bias is a specific and under-studied form of linguistic bias where language technology by design favors certain languages, dialects, or sociolects with respect to others. We show that language modeling bias can result in systems that, while being precise regarding languages and cultures of dominant powers, are limited in the expression of socio-culturally relevant notions of other communities. We further argue that at the root of this problem lies a systematic tendency of technology developer communities to apply a simplistic understanding of diversity which does not do justice to the more profound differences that languages, and ultimately the communities that speak them, embody. Drawing on the concept of epistemic injustice, we point to the broader ethico-political implications and show how it can lead not only to a disregard for valuable aspects of diversity but also to an under-representation of the needs of marginalized language communities. Finally, we present an alternative socio-technical approach that is designed to tackle some of the analyzed problems.
Building Family Capacity: supporting multiple family members to implement aided language modeling
Family-centered capacity-building practices have been shown to benefit children and families. However, limited research explores these practices for children who use augmentative and alternative communication. This study explored an intervention to teach family members to implement an Aided Language Modeling (ALM) strategy across natural activities at home. A single case multiple probe design was used to evaluate the intervention with five family members and a girl with autism. Results showed the intervention increased family members’ percentage of high-fidelity ALM strategy use and rate of ALM. Descriptively, a modest increase was also observed in the proportion of the child’s communication using the speech-generating device. Social validity interviews suggested the goals, procedures, and outcomes were socially valid and supported family capacity building.
Improved Spoken Language Representation for Intent Understanding in a Task-Oriented Dialogue System
Successful applications of deep learning technologies in the natural language processing domain have improved text-based intent classifications. However, in practical spoken dialogue applications, the users’ articulation styles and background noises cause automatic speech recognition (ASR) errors, and these may lead language models to misclassify users’ intents. To overcome the limited performance of the intent classification task in the spoken dialogue system, we propose a novel approach that jointly uses both recognized text obtained by the ASR model and a given labeled text. In the evaluation phase, only the fine-tuned recognized language model (RLM) is used. The experimental results show that the proposed scheme is effective at classifying intents in the spoken dialogue system containing ASR errors.
MarIA: Spanish Language Models
This work presents MarIA, a family of Spanish language models and associated resources made available to the industry and the research community. Currently, MarIA includes RoBERTa-base, RoBERTa-large, GPT2 and GPT2-large Spanish language models, which can arguably be presented as the largest and most proficient language models in Spanish. The models were pretrained using a massive corpus of 570GB of clean and deduplicated texts with 135 billion words extracted from the Spanish Web Archive crawled by the National Library of Spain between 2009 and 2019. We assessed the performance of the models with nine existing evaluation datasets and with a novel extractive Question Answering dataset created ex novo. Overall, MarIA models outperform the existing Spanish models across a variety of NLU tasks and training settings.
Enhancing sentiment and emotion translation of review text through MLM knowledge integration in NMT
Producing a high-quality review translation is a multifaceted process. It goes beyond successful semantic transfer and requires conveying the original message’s tone and style in a way that resonates with the target audience, whether they are human readers or Natural Language Processing (NLP) applications. Capturing these subtle nuances of the review text demands a deeper understanding and better encoding of the source message. In order to achieve this goal, we explore the use of self-supervised masked language modeling (MLM) and a variant called polarity masked language modeling (p-MLM) as auxiliary tasks in a multi-learning setup. MLM is widely recognized for its ability to capture rich linguistic representations of the input and has been shown to achieve state-of-the-art accuracy in various language understanding tasks. Motivated by its effectiveness, in this paper we adopt joint learning, combining the neural machine translation (NMT) task with source polarity-masked language modeling within a shared embedding space to induce a deeper understanding of the emotional nuances of the text. We analyze the results and observe that our multi-task model indeed exhibits a better understanding of linguistic concepts like sentiment and emotion. Intriguingly, this is achieved even without explicit training on sentiment-annotated or domain-specific sentiment corpora. Our multi-task NMT model consistently improves the translation quality of affect sentences from diverse domains in three language pairs.
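The abstract does not spell out how polarity masking selects tokens, so the following is only a minimal sketch of one plausible reading: instead of masking random positions as in standard MLM, mask the tokens that carry sentiment and ask the model to recover them. The lexicon, the `[MASK]` string, and the function name `polarity_mask` are all hypothetical and are not taken from the paper.

```python
MASK = "[MASK]"

# Hypothetical toy sentiment lexicon; a real setup would use a proper resource.
POLARITY_LEXICON = {"great", "terrible", "love", "awful"}

def polarity_mask(tokens, lexicon=POLARITY_LEXICON):
    """Replace polarity-bearing tokens with MASK.

    Returns (masked_tokens, targets), where targets holds the original
    token at masked positions and None everywhere else.
    """
    masked, targets = [], []
    for tok in tokens:
        if tok.lower() in lexicon:
            masked.append(MASK)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets

masked, targets = polarity_mask("The battery is great but the screen is awful".split())
print(masked)
```

In a joint setup like the one described, the masked source sentence would feed the auxiliary reconstruction loss while the unmasked sentence feeds the translation loss; how the two losses are weighted is not stated in the abstract.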
Bangla language modeling algorithm for automatic recognition of hand-sign-spelled Bangla sign language
Continuous hand-sign-spelled Bangla sign language (BdSL) recognition is challenging because traditional hand-sign segmentation and classification algorithms are used, because the Bangla language has many variations, including joint letters and dependent vowels, and because 51 written Bangla characters are represented by only 36 hand-signs. This paper presents a Bangla language modeling algorithm for automatic recognition of hand-sign-spelled Bangla sign language, which consists of two phases. The first phase performs hand-sign classification, and the second phase applies the Bangla language modeling algorithm (BLMA) for automatic recognition of hand-sign-spelled Bangla sign language. In the first phase, we propose a two-step classifier for hand-sign classification using a normalized outer boundary vector (NOBV) and a window-grid vector (WGV), calculating the maximum inter correlation coefficient (ICC) between the test feature vector and pre-trained feature vectors. The system first classifies hand-signs using NOBV. If the classification score does not satisfy a specific threshold, another classifier based on WGV is used. The system is trained using 5,200 images and tested using another 5,200 × 6 images of 52 hand-signs from 10 signers in 6 different challenging environments, achieving a mean classification accuracy of 95.83% at a computational cost of 39.972 milliseconds per frame. In the second phase, we propose the Bangla language modeling algorithm (BLMA), which discovers all "hidden characters" based on "recognized characters" from the 52 hand-signs of BdSL to form any Bangla words, composite numerals and sentences in BdSL with no training, based only on the result of the first phase. To the best of our knowledge, the proposed system is the first BdSL system designed for automatic recognition of hand-sign-spelled BdSL for a large lexicon. The system is tested for BLMA using 500 hand-sign-spelled words, 100 composite numerals and 80 sentences in BdSL, achieving mean accuracies of 93.50%, 95.50% and 90.50%, respectively.
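The two-step classification described in the first phase can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the correlation score is taken to be the Pearson coefficient, the threshold value 0.9 is arbitrary, and the function names, template dictionaries, and toy vectors are hypothetical.

```python
import numpy as np

def best_match(test_vec, templates):
    """Return (label, score) for the template most correlated with test_vec.

    Assumption: the score is the Pearson correlation coefficient.
    """
    scores = {
        label: float(np.corrcoef(test_vec, vec)[0, 1])
        for label, vec in templates.items()
    }
    label = max(scores, key=scores.get)
    return label, scores[label]

def classify_two_stage(nobv_vec, wgv_vec, nobv_templates, wgv_templates,
                       threshold=0.9):
    """Try the NOBV classifier first; fall back to WGV if the score is too low."""
    label, score = best_match(nobv_vec, nobv_templates)
    if score >= threshold:
        return label, score, "NOBV"
    label, score = best_match(wgv_vec, wgv_templates)
    return label, score, "WGV"

# Toy templates standing in for pre-trained feature vectors (hypothetical).
nobv_templates = {"ka": [1, 2, 3, 4], "kha": [4, 3, 2, 1]}
wgv_templates = {"ka": [0, 1, 0, 1], "kha": [1, 0, 1, 0]}

print(classify_two_stage([1, 3, 2, 4], [0, 1, 0, 1], nobv_templates, wgv_templates))
```

The fallback design keeps the cheaper first classifier on the common path and only pays for the second feature when the first is not confident.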
When Data Is Scarce: Training a Kazakh Speech Language Model from Discrete Units
This research explores the development of a decoder-only speech language model (SLM) for Kazakh, a language currently characterized by limited computational resources. Our approach leverages discrete acoustic units synthesized from self-supervised speech representations. Specifically, we utilize a pretrained Wav2Vec 2.0 model to extract continuous latent features, which are then transformed into discrete semantic tokens via the k-means clustering algorithm. These tokens serve as the foundation for training a generative model designed to predict and maximize the likelihood of speech-unit sequences. To facilitate this study, we curated a specialized Kazakh speech corpus by synthesizing and refining multiple publicly available audio datasets. Given the constrained hardware resources available, we conducted large-scale feature extraction and tokenization to train the unit-based model. We evaluated the system’s efficacy using negative log-likelihood and perplexity metrics on independent test sets. The model captures Kazakh vowel harmony but struggles with long-range agglutinative chains. Key observations include the model’s high sensitivity to data quality, tokenization techniques, and specific training hyperparameters. Although constrained by data volume and training time relative to global benchmarks, the model successfully captures the underlying structural patterns in Kazakh speech. This work establishes a vital empirical baseline and suggests future improvements through refined unit discovery and integrated speech-text modeling.
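Two steps in this pipeline lend themselves to a short sketch: mapping continuous frame features to discrete units, and turning per-token log-probabilities into perplexity. The code assumes the k-means centroids have already been fitted and covers only the nearest-centroid assignment; the array shapes, function names, and toy values are hypothetical, and perplexity is taken as the exponential of the mean negative log-likelihood (natural log).

```python
import math
import numpy as np

def quantize(features, centroids):
    """Assign each frame (row of features) to the index of its nearest centroid."""
    # Squared Euclidean distance between every frame and every centroid.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def perplexity(token_log_probs):
    """Perplexity as exp of the mean negative log-likelihood per token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Toy 2-D "features" and two centroids standing in for real speech features.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
features = np.array([[1.0, 0.0], [9.0, 11.0], [0.0, -1.0]])
units = quantize(features, centroids)
print(units.tolist())

# A model that assigns probability 0.25 to every unit has perplexity close to 4.
print(perplexity([math.log(0.25)] * 4))
```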
XLNet-CRF: Efficient Named Entity Recognition for Cyber Threat Intelligence with Permutation Language Modeling
As cyberattacks continue to rise in frequency and sophistication, extracting actionable Cyber Threat Intelligence (CTI) from diverse online sources has become critical for proactive threat detection and defense. However, accurately identifying complex entities from lengthy and heterogeneous threat reports remains challenging due to long-range dependencies and domain-specific terminology. To address this, we propose XLNet-CRF, a hybrid framework that combines permutation-based language modeling with structured prediction using Conditional Random Fields (CRF) to enhance Named Entity Recognition (NER) in cybersecurity contexts. XLNet-CRF directly addresses key challenges in CTI-NER by modeling bidirectional dependencies and capturing non-contiguous semantic patterns more effectively than traditional approaches. Comprehensive evaluations on two benchmark cybersecurity corpora validate the efficacy of our approach. On the CTI-Reports dataset, XLNet-CRF achieves a precision of 97.41% and an F1-score of 97.43%; on MalwareTextDB, it attains a precision of 85.33% and an F1-score of 88.65%—significantly surpassing strong BERT-based baselines in both accuracy and robustness.
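The abstract reports precision and F1 but not recall. Under the standard definition of F1 as the harmonic mean of precision and recall, the implied recall can be backed out, as in the sketch below. Whether the paper's scores are micro- or macro-averaged is an assumption here; with macro-averaging across entity types the identity need not hold, so the derived values are indicative only. The function names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def implied_recall(precision, f1):
    """Solve F1 = 2PR / (P + R) for R."""
    return f1 * precision / (2 * precision - f1)

# Figures quoted in the abstract, expressed as fractions.
for name, p, f in [("CTI-Reports", 0.9741, 0.9743),
                   ("MalwareTextDB", 0.8533, 0.8865)]:
    r = implied_recall(p, f)
    print(f"{name}: implied recall = {r:.4f}")
```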
A Refutation of Finite-State Language Models through Zipf’s Law for Factual Knowledge
We present a hypothetical argument against finite-state processes in statistical language modeling that is based on semantics rather than syntax. In this theoretical model, we suppose that the semantic properties of texts in a natural language could be approximately captured by a recently introduced concept of a perigraphic process. Perigraphic processes are a class of stochastic processes that satisfy a Zipf-law accumulation of a subset of factual knowledge, which is time-independent, compressed, and effectively inferrable from the process. We show that the classes of finite-state processes and of perigraphic processes are disjoint, and we present a new simple example of perigraphic processes over a finite alphabet called Oracle processes. The disjointness result makes use of the Hilberg condition, i.e., the almost sure power-law growth of algorithmic mutual information. Using a strongly consistent estimator of the number of hidden states, we show that finite-state processes do not satisfy the Hilberg condition whereas Oracle processes satisfy the Hilberg condition via the data-processing inequality. We discuss the relevance of these mathematical results for theoretical and computational linguistics.
A Comparison of Six UML-Based Languages for Software Process Modeling
Describing and managing activities, resources, and constraints of software development processes is a challenging goal for many organizations. A first generation of Software Process Modeling Languages (SPMLs) appeared in the 1990s but failed to gain broad industrial support. Recently, however, a second generation of SPMLs has appeared, leveraging the strong industrial interest for modeling languages such as UML. In this paper, we propose a comparison of these UML-based SPMLs. While not exhaustive, this comparison concentrates on SPMLs most representative of the various alternative approaches, ranging from UML-based framework specializations to full-blown executable metamodeling approaches. To support the comparison of these various approaches, we propose a frame gathering a set of requirements for process modeling, such as semantic richness, modularity, executability, conformity to the UML standard, and formality. Beyond discussing the relative merits of these approaches, we also evaluate the overall suitability of these UML-based SPMLs for software process modeling. Finally, we discuss the impact of these approaches on the current state of the practice, and conclude with lessons we have learned in doing this comparison.