328 results for "low-resource languages"
Neural Network-Based Bilingual Lexicon Induction for Indonesian Ethnic Languages
Indonesia has a variety of ethnic languages, most of which belong to the same language family: the Austronesian languages. Due to the shared language family, words in Indonesian ethnic languages are very similar. However, previous research suggests that these Indonesian ethnic languages are endangered. To help preserve them, we propose the creation of bilingual dictionaries between ethnic languages, using a neural network approach to extract transformation rules, employing character-level embedding and the Bi-LSTM method in a sequence-to-sequence model. The model has an encoder and a decoder. The encoder reads the input sequence character by character, generates context, and then extracts a summary of the input. The decoder produces an output sequence in which each character at each timestep is influenced by the previously generated character. The first experiment focuses on the Indonesian and Minangkabau languages with 10,277 word pairs. To evaluate the model’s performance, five-fold cross-validation was used. The character-level seq2seq method (Bi-LSTM as the encoder and LSTM as the decoder), with an average precision of 83.92%, outperformed SentencePiece byte pair encoding (vocab size of 33), with an average precision of 79.56%. Furthermore, to evaluate the neural network model’s ability to find the transformation pattern, a rule-based approach was used as the baseline. The neural network approach obtained 542 more correct translations than the baseline. We applied the best setting (character-level embedding with Bi-LSTM as the encoder and LSTM as the decoder) to four other Indonesian ethnic languages: Malay, Palembang, Javanese, and Sundanese, each with an input dictionary half the size of the Indonesian–Minangkabau one. The average precision scores for these languages are 65.08%, 62.52%, 59.69%, and 58.46%, respectively.
This shows that the neural network approach can identify transformation patterns of the Indonesian language to closely related languages (such as Malay and Palembang) better than distantly related languages (such as Javanese and Sundanese).
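The encoder–decoder setup described above starts from character-level integer sequences. A minimal sketch of that preprocessing step, using hypothetical word pairs rather than the paper's dataset:

```python
# Sketch of character-level preprocessing for a seq2seq lexicon-induction
# model. The word pairs are illustrative placeholders, not paper data.

def build_char_vocab(words, specials=("<pad>", "<s>", "</s>")):
    """Map every character seen in the corpus to an integer id."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for word in words:
        for ch in word:
            vocab.setdefault(ch, len(vocab))
    return vocab

def encode(word, vocab):
    """Turn a word into the id sequence fed to the encoder,
    wrapped in start/end symbols."""
    return [vocab["<s>"]] + [vocab[ch] for ch in word] + [vocab["</s>"]]

pairs = [("makan", "makan"), ("minum", "minum")]  # hypothetical src/tgt pairs
src_vocab = build_char_vocab(w for w, _ in pairs)
print(encode("makan", src_vocab))
```

A Bi-LSTM encoder would then consume these id sequences character by character, and the LSTM decoder would emit target-language characters one timestep at a time.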
Exhaustive Study into Machine Learning and Deep Learning Methods for Multilingual Cyberbullying Detection in Bangla and Chittagonian Texts
Cyberbullying is a serious problem in online communication. It is important to find effective ways to detect cyberbullying content to make online environments safer. In this paper, we investigated the identification of cyberbullying content in the Bangla and Chittagonian languages, both of which are low-resource languages, with the latter being an extremely low-resource language. In the study, we used traditional baseline machine learning methods as well as a wide suite of deep learning methods, especially focusing on hybrid networks and transformer-based multilingual models. For the data, we collected over 5,000 text samples in both Bangla and Chittagonian from social media. Krippendorff’s alpha and Cohen’s kappa were used to measure the reliability of the dataset annotations. Traditional machine learning methods used in this research achieved accuracies ranging from 0.63 to 0.711, with SVM emerging as the top performer. Furthermore, ensemble models such as Bagging (0.70 accuracy), Boosting (0.69 accuracy), and Voting (0.72 accuracy) yielded promising results. In contrast, deep learning models, notably CNN, achieved accuracies ranging from 0.69 to 0.811, thus outperforming traditional ML approaches, with CNN exhibiting the highest accuracy. We also proposed a series of hybrid network-based models, including BiLSTM+GRU with an accuracy of 0.799, CNN+LSTM with 0.801 accuracy, CNN+BiLSTM with 0.78 accuracy, and CNN+GRU with 0.804 accuracy. Notably, the most complex model, (CNN+LSTM)+BiLSTM, attained an accuracy of 0.82, showcasing the efficacy of hybrid architectures. Furthermore, we explored transformer-based models, such as XLM-RoBERTa with 0.841 accuracy, Bangla BERT with 0.822 accuracy, Multilingual BERT with 0.821 accuracy, BERT with 0.82 accuracy, and Bangla ELECTRA with 0.785 accuracy, which showed significantly enhanced accuracy levels.
Our analysis demonstrates that deep learning methods can be highly effective in addressing the pervasive issue of cyberbullying in several different linguistic contexts. We show that transformer models can efficiently circumvent the language dependence problem that plagues conventional transfer learning methods. Our findings suggest that hybrid approaches and transformer-based embeddings can effectively tackle the problem of cyberbullying across online platforms.
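The Voting ensemble mentioned above can be sketched as simple hard voting over base-classifier predictions; the prediction lists below are illustrative stand-ins, not outputs of the paper's trained models:

```python
from collections import Counter

# Hard-voting ensemble sketch: each base classifier casts one vote per
# sample and the majority label wins.

def majority_vote(predictions_per_model):
    """predictions_per_model: list of equal-length label lists, one per
    base classifier. Returns the per-sample majority label."""
    results = []
    for votes in zip(*predictions_per_model):
        label, _count = Counter(votes).most_common(1)[0]
        results.append(label)
    return results

svm_preds  = ["bully", "not",   "bully"]
cnn_preds  = ["bully", "bully", "not"]
lstm_preds = ["not",   "bully", "not"]
print(majority_vote([svm_preds, cnn_preds, lstm_preds]))
# -> ['bully', 'bully', 'not']
```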
A survey of deep learning techniques for machine reading comprehension
Reading comprehension involves the process of reading and understanding textual information in order to answer questions related to it. It finds practical applications in various domains such as domain-specific FAQs, search engines, and dialog systems. Resource-rich languages like English, Japanese, Chinese, and most European languages benefit from the availability of numerous datasets and resources, enabling the development of machine reading comprehension (MRC) systems. However, building MRC systems for low-resource languages (LRL) with limited datasets, such as Vietnamese, Urdu, Bengali, and Hindi, poses significant challenges. To address this issue, this study utilizes quantitative analysis to conduct a systematic literature review (SLR) with the aim of comprehending the recent global shift in MRC research from high-resource languages (HRL) to low-resource languages. Notably, existing literature reviews on MRC lack comprehensive studies that compare techniques specifically designed for rich and low-resource languages. Hence, this study provides a comprehensive overview of the MRC research landscape in low-resource languages, offering valuable insights and a list of suggestions to enhance LRL–MRC research.
adaptMLLM: Fine-Tuning Multilingual Language Models on Low-Resource Languages with Integrated LLM Playgrounds
The advent of Multilingual Language Models (MLLMs) and Large Language Models (LLMs) has spawned innovation in many areas of natural language processing. Despite the exciting potential of this technology, its impact on developing high-quality Machine Translation (MT) outputs for low-resource languages remains relatively under-explored. Furthermore, an open-source application dedicated to both fine-tuning MLLMs and managing the complete MT workflow for low-resource languages remains unavailable. We aim to address these imbalances through the development of adaptMLLM, which streamlines all processes involved in the fine-tuning of MLLMs for MT. This open-source application is tailored for developers, translators, and users who are engaged in MT. It is particularly useful for newcomers to the field, as it significantly streamlines the configuration of the development environment. An intuitive interface allows for easy customisation of hyperparameters, and the application offers a range of metrics for model evaluation and the capability to deploy models as a translation service directly within the application. As a multilingual tool, we used adaptMLLM to fine-tune models for two low-resource language pairs: English to Irish (EN↔GA) and English to Marathi (EN↔MR). Compared with baselines from the LoResMT2021 Shared Task, the adaptMLLM system demonstrated significant improvements. In the EN→GA direction, an improvement of 5.2 BLEU points was observed, and an increase of 40.5 BLEU points was recorded in the GA→EN direction, representing relative improvements of 14% and 117%, respectively. Significant improvements in the translation performance of the EN↔MR pair were also observed, notably in the MR→EN direction, with an increase of 21.3 BLEU points, which corresponds to a relative improvement of 68%.
Finally, a fine-grained human evaluation of the MLLM output on the EN→GA pair was conducted using the Multidimensional Quality Metrics and Scalar Quality Metrics error taxonomies. The application and models are freely available.
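As a sanity check on the reported gains, a relative improvement is simply the absolute BLEU delta divided by the baseline score; the baselines below are back-derived from the abstract's deltas and percentages, not quoted from the paper:

```python
# Back-of-the-envelope check: relative improvement = delta / baseline.
# Baselines are inferred from the reported deltas and percentages.

def relative_improvement(baseline_bleu, delta_bleu):
    return delta_bleu / baseline_bleu

# EN->GA: +5.2 BLEU reported as ~14% relative => baseline ~ 5.2 / 0.14
en_ga_baseline = 5.2 / 0.14
# GA->EN: +40.5 BLEU reported as ~117% relative => baseline ~ 40.5 / 1.17
ga_en_baseline = 40.5 / 1.17

print(round(relative_improvement(en_ga_baseline, 5.2), 2))  # 0.14
print(round(relative_improvement(ga_en_baseline, 40.5), 2))  # 1.17
```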
Dataset creation and benchmarking for Kashmiri news snippet classification using fine-tuned transformer and LLM models in a low resource setting
Kashmiri, recognized as a low-resource language, has a rich cultural heritage but remains underexplored in NLP due to a lack of resources and datasets. The proposed research addresses this gap by creating a dataset of 15,036 news snippets for the task of Kashmiri news snippet classification, created through the translation of English news snippets into Kashmiri using the Microsoft Bing translation tool. These snippets are manually refined to ensure domain specificity, covering ten categories: Medical, Politics, Sports, Tourism, Education, Art and Craft, Environment, Entertainment, Technology, and Culture. Various machine learning, deep learning, transformer, and LLM models are explored for text classification. Among the models evaluated, fine-tuned ParsBERT-Uncased emerged as the best-performing transformer model, achieving an F1 score of 0.98. This work not only contributes a valuable dataset for Kashmiri but also identifies effective methodologies for accurate news snippet classification in the Kashmiri language. This research developed an essential dataset, which, to the best of our knowledge, is the first manually labelled corpus for the Kashmiri language, and also devised an architecture using the best combination of embeddings, algorithms, and transformer models for accurate text classification. It contributes significantly to the field of NLP for this underrepresented language.
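The reported F1 score is conventionally computed per class and then averaged; a minimal sketch of macro-averaged F1 on toy labels (not the Kashmiri dataset):

```python
# Macro-averaged F1 sketch: per-class precision/recall from predicted
# vs. gold labels, averaged over classes. Toy labels only.

def macro_f1(gold, pred):
    classes = set(gold) | set(pred)
    f1s = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

gold = ["Sports", "Politics", "Sports", "Culture"]
pred = ["Sports", "Politics", "Politics", "Culture"]
print(round(macro_f1(gold, pred), 3))
```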
Low-Resource Neural Machine Translation Improvement Using Source-Side Monolingual Data
Despite the many proposals to solve the neural machine translation (NMT) problem of low-resource languages, it continues to be difficult. The issue becomes even more complicated when few resources cover only a single domain. In this paper, we discuss the applicability of a source-side monolingual dataset of low-resource languages to improve the NMT system for such languages. In our experiments, we used Wolaytta–English translation as a low-resource language. We discuss the use of self-learning and fine-tuning approaches to improve the NMT system for Wolaytta–English translation using both authentic and synthetic datasets. The self-learning approach showed +2.7 and +2.4 BLEU score improvements for Wolaytta–English and English–Wolaytta translations, respectively, over the best-performing baseline model. Further fine-tuning the best-performing self-learning model showed +1.2 and +0.6 BLEU score improvements for Wolaytta–English and English–Wolaytta translations, respectively. We reflect on our contributions and plan for the future of this difficult field of study.
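The self-learning approach can be sketched schematically: the current model translates source-side monolingual sentences into synthetic targets, which augment the authentic parallel data before retraining. `toy_translate` below is a placeholder, not the paper's NMT model:

```python
# Schematic self-learning round: generate synthetic parallel pairs from
# source-side monolingual text, then combine with authentic data.

def toy_translate(sentence):
    # placeholder: a real system would run the trained NMT model here
    return "<translation of: %s>" % sentence

def self_learning_round(authentic_pairs, monolingual_src):
    """Return the combined corpus used to retrain the model."""
    synthetic = [(s, toy_translate(s)) for s in monolingual_src]
    return authentic_pairs + synthetic

authentic = [("wolaytta sentence", "english sentence")]
mono = ["wolaytta-only sentence 1", "wolaytta-only sentence 2"]
combined = self_learning_round(authentic, mono)
print(len(combined))  # 3 pairs: 1 authentic + 2 synthetic
```

Fine-tuning, as described above, would then continue training the resulting model on the authentic pairs alone.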
Urdu-NERD: Urdu named entity recognition with BiGRU-based deep learning architecture
Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP), focusing on identifying and extracting entities such as names, locations, organizations, and other specific labels from unstructured text data. It plays a crucial role in various NLP applications, including information retrieval, question answering, and sentiment analysis. However, while NER systems have been extensively developed for English, adapting them to languages like Urdu poses unique challenges due to linguistic differences and the scarcity of annotated data. In this research, we enhance data diversity and accessibility for Urdu NER by introducing the ZUNERA corpus, the most extensive Urdu NER dataset to date, comprising 1,189,614 tokens and 89,804 named entities. Additionally, we classify the entities into twenty-three different named entity types. We meticulously annotate the corpus, providing clear guidelines and employing the Kappa coefficient to ensure high-quality annotations. Furthermore, we propose the Urdu-Named Entity Recognition with BiGRU-based Deep Learning Architecture (NERD) framework, which facilitates efficient entity recognition in Urdu text. The proposed framework achieves an impressive F1-score of 94.6%. Comparing ZUNERA with the MK-PUCIT dataset underscores its robustness in accurately recognizing entities. Although this study centers on Urdu, the proposed NER framework and annotation pipeline are designed to be language-agnostic. They can be extended to other morphologically rich or low-resource languages, providing a replicable foundation for future cross-lingual research. Overall, our contributions significantly advance Urdu NER research by providing a comprehensive dataset, evaluating state-of-the-art techniques, and introducing a novel framework for efficient Urdu entity recognition.
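The Kappa coefficient used above corrects observed annotator agreement for the agreement expected by chance; a minimal sketch of Cohen's kappa on toy annotations (not the ZUNERA corpus):

```python
from collections import Counter

# Cohen's kappa sketch: (observed - expected) / (1 - expected), where
# expected agreement comes from each annotator's label frequencies.

def cohens_kappa(ann1, ann2):
    n = len(ann1)
    observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    c1, c2 = Counter(ann1), Counter(ann2)
    expected = sum(c1[label] * c2[label] for label in c1) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: chance agreement is already perfect
    return (observed - expected) / (1 - expected)

a = ["PER", "LOC", "ORG", "PER"]  # annotator 1 (toy labels)
b = ["PER", "LOC", "PER", "PER"]  # annotator 2 (toy labels)
print(round(cohens_kappa(a, b), 3))
```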
Attention based neural network for cross domain fake news detection in Turkish language
This study addresses the pressing problem of fake news in low-resource languages by proposing a novel neural network architecture based on attention, optimized for Turkish. The model effectively integrates FastText word embeddings, a Long Short-Term Memory (LSTM) layer, and a focused attention mechanism to capture the nuanced linguistic patterns and morphological intricacies of the Turkish language. Trained and tested on a manually verified dataset of 10,000 Turkish news articles, our system achieved a state-of-the-art accuracy of 92% and significantly outperformed strong baselines, such as a fine-tuned Turkish BERT model. A key advantage of our architecture is its computational efficiency, which demonstrates a 40% reduction in training time compared to BERT, making it highly suitable for real-world, resource-constrained applications. While the model shows strong cross-domain generalization, an in-depth error analysis reveals specific vulnerabilities to satirical content (62% accuracy) and sophisticated fabrications designed to mimic credible sources (68% accuracy). These limitations highlight important directions for future work. This research provides a validated, efficient, and interpretable framework for combating disinformation in Turkish, with promising implications for other morphologically rich, low-resource languages.
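The attention mechanism described above scores each hidden state against a query, normalizes the scores with a softmax, and returns a weighted sum as the sentence representation; the vectors below are toy values, not trained parameters:

```python
import math

# Dot-product attention sketch over a sequence of hidden-state vectors.

def attention(hidden_states, query):
    """Return (softmax weights, context vector) for the given query."""
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query))
              for h in hidden_states]
    m = max(scores)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(len(hidden_states[0]))]
    return weights, context

states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy LSTM outputs
weights, context = attention(states, query=[1.0, 0.5])
print([round(w, 3) for w in weights])
```

States that score higher against the query receive larger weights, letting the model focus on the tokens most indicative of fabricated content.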
Text Classification Based on Convolutional Neural Networks and Word Embedding for Low-Resource Languages: Tigrinya
This article studies convolutional neural networks for Tigrinya (also referred to as Tigrigna), a Semitic language spoken in Eritrea and northern Ethiopia. Tigrinya is a “low-resource” language, notable for the absence of comprehensive and freely available data. Furthermore, like other Semitic languages, it is characterized as one of the most semantically and syntactically complex languages in the world. To the best of our knowledge, no previous research has been conducted on the state-of-the-art embedding technique that is shown here. We investigate which word representation methods perform better in terms of learning for single-label text classification problems, which are common when dealing with morphologically rich and complex languages. Manually annotated datasets are used here: one contains 30,000 Tigrinya news texts from various sources with six categories of “sport”, “agriculture”, “politics”, “religion”, “education”, and “health”, and one unannotated corpus contains more than six million words. In this paper, we explore pretrained word embedding architectures using various convolutional neural networks (CNNs) to predict class labels. We construct a CNN with a continuous bag-of-words (CBOW) method, a CNN with a skip-gram method, and CNNs with and without word2vec and FastText to evaluate Tigrinya news articles. We also compare the CNN results with traditional machine learning models and evaluate the results in terms of accuracy, precision, recall, and F1 score. The CBOW CNN with word2vec achieves the best accuracy of 93.41%, significantly improving the accuracy for Tigrinya news classification.
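The CBOW objective predicts a target word from its surrounding context window; a sketch of how such training pairs are extracted, using an English placeholder sentence in place of Tigrinya text:

```python
# CBOW pair extraction sketch: each word becomes a prediction target and
# its surrounding window the context.

def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word) pairs for a CBOW model."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs

sentence = "the team won the match".split()  # placeholder sentence
for context, target in cbow_pairs(sentence, window=2):
    print(context, "->", target)
```

A word2vec CBOW model averages the embeddings of each context and trains them to predict the target; those learned embeddings then feed the CNN classifier.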