247 result(s) for "Multilingual model"
Exhaustive Study into Machine Learning and Deep Learning Methods for Multilingual Cyberbullying Detection in Bangla and Chittagonian Texts
Cyberbullying is a serious problem in online communication, and effective ways to detect cyberbullying content are needed to make online environments safer. In this paper, we investigated the identification of cyberbullying content in Bangla and Chittagonian, which are both low-resource languages, with the latter being an extremely low-resource language. In the study, we used traditional baseline machine learning methods as well as a wide suite of deep learning methods, especially focusing on hybrid networks and transformer-based multilingual models. For the data, we collected over 5000 text samples in both Bangla and Chittagonian from social media. Krippendorff’s alpha and Cohen’s kappa were used to measure the reliability of the dataset annotations. Traditional machine learning methods used in this research achieved accuracies ranging from 0.63 to 0.711, with SVM emerging as the top performer. Furthermore, employing ensemble models such as Bagging with 0.70 accuracy, Boosting with 0.69 accuracy, and Voting with 0.72 accuracy yielded promising results. In contrast, deep learning models, notably CNN, achieved accuracies ranging from 0.69 to 0.811, thus outperforming traditional ML approaches, with CNN exhibiting the highest accuracy. We also proposed a series of hybrid network-based models, including BiLSTM+GRU with an accuracy of 0.799, CNN+LSTM with 0.801 accuracy, CNN+BiLSTM with 0.78 accuracy, and CNN+GRU with 0.804 accuracy. Notably, the most complex model, (CNN+LSTM)+BiLSTM, attained an accuracy of 0.82, showcasing the efficacy of hybrid architectures. Furthermore, we explored transformer-based models, such as XLM-Roberta with 0.841 accuracy, Bangla BERT with 0.822 accuracy, Multilingual BERT with 0.821 accuracy, BERT with 0.82 accuracy, and Bangla ELECTRA with 0.785 accuracy, which showed significantly enhanced accuracy levels.
Our analysis demonstrates that deep learning methods can be highly effective in addressing the pervasive issue of cyberbullying in several different linguistic contexts. We show that transformer models can efficiently circumvent the language dependence problem that plagues conventional transfer learning methods. Our findings suggest that hybrid approaches and transformer-based embeddings can effectively tackle the problem of cyberbullying across online platforms.
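The abstract above reports annotation reliability via Krippendorff's alpha and Cohen's kappa. As a reminder of how the kappa statistic works, here is a minimal pure-Python sketch; the toy labels are made up for illustration and are not from the paper's dataset:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Inter-annotator agreement between two annotators.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance, computed from the two
    annotators' marginal label distributions.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of per-label marginal frequencies, summed.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy example: two annotators labeling 10 comments as bullying (1) or not (0).
ann_1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
ann_2 = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
print(round(cohens_kappa(ann_1, ann_2), 3))  # → 0.6
```

Values around 0.6-0.8 are conventionally read as substantial agreement, which is why papers on subjective labels like bullying report these statistics alongside dataset size.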
Towards scalable and cross-lingual specialist language models for oncology
Clinical oncology generates vast, unstructured data that often contain inconsistencies, missing information, and ambiguities, making it difficult to extract reliable insights for data-driven decision-making. General-purpose large language models (LLMs) struggle with these challenges due to their lack of domain-specific reasoning, including specialized clinical terminology, context-dependent interpretations, and multi-modal data integration. We address these issues with an oncology-specialized, efficient, and adaptable NLP framework that combines instruction tuning, retrieval-augmented generation (RAG), and graph-based knowledge integration. Our lightweight models prove effective at oncology-specific tasks, such as named entity recognition (e.g., identifying cancer diagnoses), entity linking (e.g., linking entities to standardized ontologies), TNM staging, document classification (e.g., cancer subtype classification from pathology reports), and treatment response prediction. Our framework emphasizes adaptability and resource efficiency. We include minimal German instructions, collected at the University Hospital Zurich (USZ), to test whether small amounts of non-English language data can effectively transfer knowledge across languages. This approach mirrors our motivation for lightweight models, which balance strong performance with reduced computational costs, making them suitable for resource-limited healthcare settings. We validated our models on oncology datasets, demonstrating strong results in named entity recognition, relation extraction, and document classification, and showing consistent performance across multiple lightweight architectures.
NASca and NASes: Two Monolingual Pre-Trained Models for Abstractive Summarization in Catalan and Spanish
Most of the models proposed in the literature for abstractive summarization are generally suitable for the English language but not for other languages. Multilingual models were introduced to address that language constraint, but despite their applicability being broader than that of the monolingual models, their performance is typically lower, especially for minority languages like Catalan. In this paper, we present a monolingual model for abstractive summarization of textual content in the Catalan language. The model is a Transformer encoder-decoder which is pretrained and fine-tuned specifically for the Catalan language using a corpus of newspaper articles. In the pretraining phase, we introduced several self-supervised tasks to specialize the model on the summarization task and to increase the abstractivity of the generated summaries. To study the performance of our proposal in languages with more resources than Catalan, we replicated the model and the experimentation for the Spanish language. The usual evaluation metrics, not only the widely used ROUGE measure but also more semantic ones such as BERTScore, do not allow one to correctly evaluate the abstractivity of the generated summaries. In this work, we also present a new metric, called content reordering, to evaluate one of the most common characteristics of abstractive summaries: the rearrangement of the original content. We carried out exhaustive experiments to compare the performance of the monolingual models proposed in this work with two of the most widely used multilingual models in text summarization, mBART and mT5. The experimental results support the quality of our monolingual models, especially considering that the multilingual models were pretrained with many more resources than ours. Likewise, the pretraining tasks are shown to help increase the degree of abstractivity of the generated summaries.
To our knowledge, this is the first work that explores a monolingual approach for abstractive summarization both in Catalan and Spanish.
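The abstract introduces a content-reordering metric without giving its exact definition. As an illustrative stand-in (not the paper's formula), one common way to quantify how much a summary rearranges source content is the normalized count of inverted position pairs, sketched here in pure Python:

```python
def reordering_score(source_positions):
    """Illustrative reordering measure: the fraction of content-unit pairs
    that appear in a different relative order in the summary than in the
    source document. 0.0 = same order as the source, 1.0 = fully reversed.

    `source_positions[i]` is the source position of the i-th content unit
    of the summary, listed in summary order.
    """
    n = len(source_positions)
    if n < 2:
        return 0.0
    # Count inverted pairs: summary order disagrees with source order.
    inversions = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if source_positions[i] > source_positions[j]
    )
    return inversions / (n * (n - 1) / 2)

# A summary whose units appear in source order scores 0.0 ...
print(reordering_score([0, 1, 2, 3]))  # → 0.0
# ... while one that reverses the source order scores 1.0.
print(reordering_score([3, 2, 1, 0]))  # → 1.0
```

A more abstractive summary, on this view, tends to score higher because it reorganizes rather than excerpts; the paper's actual metric may differ in how content units are matched.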
A Hybrid Ensemble Approach for Greek Text Classification Based on Multilingual Models
The present study explores text classification in the Greek language. A novel ensemble classification scheme is presented, based on embeddings generated from Greek text using the multilingual capabilities of the E5 model. Our approach incorporates partial transfer learning by using pre-trained models to extract embeddings, enabling the evaluation of classical classifiers on Greek data. Additionally, we enhance predictive capability while keeping costs low by employing a soft-voting combination scheme that exploits the strengths of XGBoost, K-nearest neighbors, and logistic regression. This method significantly improves all classification metrics, demonstrating the superiority of ensemble techniques in handling the complexity of Greek textual data. Our study contributes to the field of natural language processing by proposing an effective ensemble framework for the categorization of Greek texts, leveraging the advantages of both traditional and modern machine learning techniques. This framework has the potential to be applied to other less-resourced languages, thereby broadening the impact of our research beyond Greek language processing.
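Soft voting, as described above, averages per-class probabilities from several classifiers and picks the argmax. A minimal sketch of just the combination step follows; the three probability vectors stand in for XGBoost, K-nearest neighbors, and logistic regression outputs and are made up for illustration:

```python
def soft_vote(prob_vectors, weights=None):
    """Combine per-class probability vectors from several classifiers
    by (weighted) averaging, then return the winning class index and
    the averaged distribution."""
    n_models = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    if weights is None:
        weights = [1.0] * n_models
    total = sum(weights)
    # Weighted mean of each class's probability across models.
    averaged = [
        sum(w * probs[c] for w, probs in zip(weights, prob_vectors)) / total
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=lambda c: averaged[c]), averaged

# Hypothetical per-class probabilities for one Greek document from
# three base classifiers (e.g., XGBoost, KNN, logistic regression).
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
]
label, averaged = soft_vote(probs)
print(label)  # → 0
```

Unlike hard (majority) voting, soft voting lets a confident classifier outweigh two lukewarm ones, which is typically why it improves metrics when base models have complementary strengths.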
Evaluating Neural Networks’ Ability to Generalize against Adversarial Attacks in Cross-Lingual Settings
Cross-lingual transfer learning using multilingual models has shown promise for improving performance on natural language processing tasks with limited training data. However, translation can introduce superficial patterns that negatively impact model generalization. This paper evaluates two state-of-the-art multilingual models, Cross-Lingual Model-Robustly Optimized BERT Pretraining Approach (XLM-Roberta) and Multilingual Bi-directional Auto-Regressive Transformer (mBART), on the Cross-Lingual Natural Language Inference (XNLI) task using both original and machine-translated evaluation sets. Our analysis demonstrates that translation can facilitate cross-lingual transfer learning, but maintaining linguistic patterns is critical. The results provide insights into the strengths and limitations of state-of-the-art multilingual natural language processing architectures for cross-lingual understanding.
Assessment and Enhancement of Chinese College Students’ Cross-Cultural Learning Competence Based on BP Neural Network Algorithm
Cross-cultural learning competence, a critical skill in our globally interconnected world, is advanced through the application of the Backpropagation (BP) neural network algorithm. This innovative approach involves leveraging neural network techniques to model and enhance individuals' abilities to navigate and understand diverse cultural contexts. The BP neural network algorithm facilitates personalized learning experiences by adapting to individuals' cultural backgrounds and preferences. This research explores a comprehensive approach for assessing and enhancing cross-cultural learning competence among Chinese college students, integrating the Word Embedding Multilingual Model with the Back Propagation Neural Network (WEMM-BPNN) algorithm. Recognizing the importance of global competencies in higher education, our study focuses on leveraging advanced neural network techniques to evaluate and elevate students' cross-cultural learning abilities. The WEMM-BPNN model combines the power of word embedding and multilingual considerations, tailoring the learning experience to individual cultural backgrounds. Through a meticulous analysis of cross-cultural data and linguistic patterns, the algorithm refines its recommendations for personalized learning strategies. The research aims not only to assess the current state of cross-cultural learning competence but also to provide targeted interventions to enhance students' intercultural understanding and adaptability. By merging linguistic models with neural network algorithms, this study offers a pioneering approach to cultivating cross-cultural competencies, contributing valuable insights to the ongoing discourse on globalized education.
Natural language processing applications for low-resource languages
Natural language processing (NLP) has significantly advanced our ability to model and interact with human language through technology. However, these advancements have disproportionately benefited high-resource languages with abundant data for training complex models. Low-resource languages, often spoken by smaller or marginalized communities, struggle to realize the full potential of NLP applications. The primary challenges in developing NLP applications for low-resource languages stem from the lack of large, well-annotated datasets, standardized tools, and linguistic resources. This scarcity of resources hinders the performance of data-driven approaches that have excelled in high-resource settings. Further, low-resource languages frequently exhibit complex grammatical structures, diverse vocabularies, and unique social contexts, which pose additional challenges for standard NLP techniques. Innovative strategies are emerging to address these challenges. Researchers are actively collecting and curating datasets, even utilizing community engagement platforms to expand data resources. Transfer learning, where models pre-trained on high-resource languages are adapted to low-resource settings, has shown significant promise. Multilingual models like Multilingual Bidirectional Encoder Representations from Transformers (mBERT) and Cross-Lingual Models (XLM-R), trained on vast quantities of multilingual data, offer a powerful avenue for cross-lingual knowledge transfer. Additionally, researchers are exploring the integration of multimodal approaches, combining textual data with images, audio, or video, to enhance NLP performance in low-resource language scenarios.
This survey covers applications like part-of-speech tagging, morphological analysis, sentiment analysis, hate speech detection, dependency parsing, language identification, discourse annotation guidelines, question answering, machine translation, information retrieval, and predictive authoring for augmentative and alternative communication systems. The review also highlights machine learning approaches, deep learning approaches, Transformers, and cross-lingual transfer learning as practical techniques. Developing practical NLP applications for low-resource languages is crucial for preserving linguistic diversity, fostering inclusion within the digital world, and expanding our understanding of human language. While challenges remain, the strategies outlined in this survey demonstrate the ongoing progress and highlight the potential for NLP to empower communities that speak low-resource languages and contribute to a more equitable landscape within language technology.
Exploring zero-shot and joint training cross-lingual strategies for aspect-based sentiment analysis based on contextualized multilingual language models
Aspect-based sentiment analysis (ABSA) has attracted many researchers' attention in recent years. However, the lack of benchmark datasets for specific languages is a common challenge because of the prohibitive cost of manual annotation. The zero-shot cross-lingual strategy can be applied to close this gap. Moreover, previous works mainly focus on improving the performance of supervised ABSA with pre-trained language models, so there are few to no systematic comparisons of the benefits of multilingual models in zero-shot and joint training cross-lingual settings for the ABSA task. In this paper, we focus on the zero-shot and joint training cross-lingual transfer task for ABSA. We fine-tune the latest pre-trained multilingual language models on the source language and then use them to predict directly in the target language. For the joint learning scenario, the models are trained on the combination of multiple source languages. Our experimental results show that (1) fine-tuning multilingual models achieves promising performance in the zero-shot cross-lingual scenario; and (2) fine-tuning models on the combined training data of multiple source languages outperforms monolingual data in the joint training scenario. Furthermore, the experimental results indicate that choosing a language other than English as the source language can give promising results in the low-resource languages scenario.
SVM, BERT, or LLM? A Comparative Study on Multilingual Instructed Deception Detection
The automated detection of deceptive language is a crucial challenge in computational linguistics. This study provides a rigorous comparative analysis of three tiers of machine learning models for detecting instructed deception: traditional machine learning (SVM), fine-tuned discriminative models (BERT), and in-context learning with generalist Large Language Models (LLMs). Using the “cross-cultural deception detection” dataset, our findings reveal a clear performance hierarchy. While SVM performance is inconsistent, fine-tuned BERT models achieve substantially superior accuracy. Notably, a multilingual BERT model improves cross-topic accuracy on Spanish text to 90.14%, a gain of over 22 percentage points from its monolingual counterpart (67.20%). In contrast, modern LLMs perform poorly in zero-shot settings and fail to surpass the SVM baseline even with few-shot prompting, underscoring the effectiveness of task-specific fine-tuning. By transparently addressing the limitations of the solicited, low-stakes deception dataset, we establish a robust methodological baseline that clarifies the strengths of different modeling paradigms and informs future research into more complex, real-world deception phenomena.
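SVM baselines in comparisons like the one above typically operate on TF-IDF features rather than raw text. Here is a minimal pure-Python sketch of that featurization step; the tokenization, the smoothed-IDF variant, and the toy sentences are illustrative choices, not the paper's exact pipeline:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Minimal TF-IDF featurization of tokenized documents, the kind of
    sparse representation typically fed to an SVM baseline."""
    n = len(docs)
    # Document frequency: number of docs containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vocab = sorted(df)
    # Smoothed inverse document frequency (down-weights ubiquitous terms).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Term frequency normalized by document length, scaled by IDF.
        vec = [tf[t] / len(doc) * idf[t] if t in tf else 0.0 for t in vocab]
        vectors.append(vec)
    return vocab, vectors

docs = [
    "the statement was true".split(),
    "the statement was false and deceptive".split(),
]
vocab, vecs = tfidf_vectors(docs)
print(vocab)  # → ['and', 'deceptive', 'false', 'statement', 'the', 'true', 'was']
```

Because such features only capture surface word statistics, the performance gap to fine-tuned BERT reported above is unsurprising: contextual models can exploit word order and semantics that TF-IDF discards.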