Catalogue Search | MBRL
40 result(s) for "Masui, Fumito"
Automatic Vulgar Word Extraction Method with Application to Vulgar Remark Detection in Chittagonian Dialect of Bangla
2023
The proliferation of the internet, especially on social media platforms, has amplified the prevalence of cyberbullying and harassment. Addressing this issue involves harnessing natural language processing (NLP) and machine learning (ML) techniques for the automatic detection of harmful content. However, these methods encounter challenges when applied to low-resource languages like the Chittagonian dialect of Bangla. This study compares two approaches for identifying offensive language containing vulgar remarks in Chittagonian. The first relies on basic keyword matching, while the second employs machine learning and deep learning techniques. The keyword-matching approach involves scanning the text for vulgar words using a predefined lexicon. Despite its simplicity, this method establishes a strong foundation for more sophisticated ML and deep learning approaches. An issue with this approach is the need for constant updates to the lexicon. To address this, we propose an automatic method for extracting vulgar words from linguistic data, achieving near-human performance and ensuring adaptability to evolving vulgar language. Insights from the keyword-matching method inform the optimization of machine learning and deep learning-based techniques. These methods initially train models to identify vulgar context using patterns and linguistic features from labeled datasets. Our dataset, comprising social media posts, comments, and forum discussions from Facebook, is thoroughly detailed for future reference in similar studies. The results indicate that while keyword matching provides reasonable results, it struggles to capture nuanced variations and phrases in specific vulgar contexts, rendering it less robust for practical use. This contradicts the assumption that vulgarity solely relies on specific vulgar words. In contrast, methods based on deep learning and machine learning excel in identifying deeper linguistic patterns. 
Compared with SimpleRNN models using Word2Vec and fastText embeddings, which achieved accuracies ranging from 0.84 to 0.90, logistic regression (LR) demonstrated a remarkable accuracy of 0.91. This highlights a common issue with neural network-based algorithms, namely, that they typically require larger datasets for adequate generalization and competitive performance compared to conventional approaches like LR.
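The keyword-matching baseline described in the abstract can be sketched in a few lines. The lexicon below is a hypothetical Latin-script placeholder, not the Chittagonian vulgar-word lexicon the paper extracts:

```python
import re

def build_matcher(lexicon):
    # Compile one whole-word alternation over the lexicon; longest terms
    # first so multi-word entries win over their substrings.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, terms)) + r")\b",
                         re.IGNORECASE)
    def contains_vulgar(text):
        # Flag a post as vulgar if any lexicon term appears as a whole word.
        return bool(pattern.search(text))
    return contains_vulgar

matcher = build_matcher({"badword", "rude phrase"})
print(matcher("this post has a BADWORD in it"))  # True
print(matcher("a perfectly clean sentence"))     # False
```

This mirrors the simplicity the abstract notes: high precision on known words, but no coverage of variants outside the lexicon, which is what motivates the automatic extraction method.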
Journal Article
Big Five Personality Trait Prediction Based on User Comments
by Masui, Fumito; Ptaszynski, Michal; Shum, Kit-May
in Accuracy; Algorithms; automatic personality recognition
2025
The study of personalities is a major component of human psychology, and an understanding of personality traits has practical applications in various domains, such as mental health care, predicting job performance, and optimising marketing strategies. This study explores the prediction of Big Five personality trait scores from online comments using transformer-based language models, focusing on improving the model performance with a larger dataset and investigating the role of intercorrelations among traits. Using the PANDORA dataset from Reddit, the RoBERTa and BERT models, including both the base and large variants, were fine-tuned and evaluated to determine their effectiveness in personality trait prediction. Compared to previous work, our study utilises a significantly larger dataset to enhance the model’s generalisation and robustness. The results indicate that RoBERTa outperforms BERT across most metrics, with RoBERTa large achieving the best overall performance. In addition to evaluating the overall predictive accuracy, this study investigates the impact of intercorrelations among personality traits. A comparative analysis is conducted between a single-model approach, which predicts all five traits simultaneously, and a multiple-model approach, fine-tuning the models independently and each predicting a single trait. The findings reveal that the single-model approach achieves a lower RMSE and higher R² values, highlighting the importance of incorporating trait intercorrelations in improving the prediction accuracy. Furthermore, RoBERTa large demonstrated a stronger ability to capture these intercorrelations compared to previous studies. These findings emphasise the potential of transformer-based models in personality computing and underscore the importance of leveraging both larger datasets and intercorrelations to enhance predictive performance.
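The single- versus multiple-model comparison hinges on RMSE and R² per trait. A minimal, stdlib-only sketch of both metrics, using made-up trait scores rather than PANDORA data:

```python
import math

def rmse(y_true, y_pred):
    # Root-mean-square error between gold and predicted scores.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    # Coefficient of determination: 1 - residual SS / total SS.
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical gold and predicted scores for one trait (e.g. openness):
gold = [0.2, 0.5, 0.8, 0.4]
pred = [0.25, 0.45, 0.7, 0.5]
print(round(rmse(gold, pred), 4))  # 0.0791
print(round(r2(gold, pred), 4))    # 0.8667
```

A lower RMSE together with a higher R² is exactly the pattern the abstract reports for the single model that predicts all five traits jointly.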
Journal Article
Chinese Tourist Motivations for Hokkaido, Japan: A Hybrid Approach Using Transformer Models and Statistical Methods
2025
The COVID-19 pandemic severely impacted Japan’s inbound tourism, but recent recovery trends highlight the growing importance of Chinese tourists. Understanding their motivations is crucial for revitalizing the industry. Building on our previous framework, this study applies Transformer-based natural language processing (NLP) models and principal component analysis (PCA) to analyze large-scale user-generated content (UGC) and identify key motivational factors influencing Chinese tourists’ visits to Hokkaido. Traditional survey-based approaches to tourism motivation research often suffer from response biases and small sample sizes. In contrast, we leverage a pre-trained Transformer model, RoBERTa, to score motivational factors like self-expansion, excitement, and cultural observation. PCA is subsequently used to extract the most significant factors across different destinations. Findings indicate that Chinese tourists are primarily drawn to Hokkaido’s natural scenery and cultural experiences, and that these factors differ by season. While the model effectively aligns with manual scoring, it shows limitations in capturing more abstract motivations such as excitement and self-expansion. This research advances tourism analytics by applying AI-driven methodologies, offering practical insights for destination marketing and management. Future work can extend this approach to other regions and cross-cultural contexts, further enhancing AI’s role in understanding evolving traveler preferences.
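The PCA step can be illustrated with a stdlib-only sketch: power iteration recovers the leading component of a small matrix of invented motivation scores (three factors stand in for the paper's larger factor set, and the numbers are not RoBERTa outputs):

```python
def principal_component(rows, steps=200):
    # First principal component of row-wise observations, via power
    # iteration on the sample covariance matrix (stdlib only).
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    x = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(x[k][i] * x[k][j] for k in range(n)) / (n - 1)
            for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(steps):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    return v

# Hypothetical per-destination motivation scores:
# columns = (nature, culture, excitement)
scores = [
    [0.9, 0.8, 0.20],
    [0.2, 0.1, 0.30],
    [0.8, 0.9, 0.25],
    [0.1, 0.2, 0.20],
]
loadings = principal_component(scores)
print([round(c, 2) for c in loadings])
```

With these toy numbers, nature and culture vary strongly together while excitement barely varies, so the leading component loads almost entirely on the first two factors, which is the kind of "dominant factor" reading the abstract describes.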
Journal Article
TabletStone: Role-Aware, Rotation-Robust On-Stone Visualization for Curling Training
2026
Real-time feedback is increasingly valued in sports training, yet in curling it is commonly delivered through off-stone displays or post hoc review, forcing disruptive gaze shifts while athletes track a moving and continuously rotating stone, especially during collaborative sweeping. To address this gap, we present TabletStone, a stone-mounted tablet interface that provides in situ, glanceable feedback with role-aware layouts and rotation-robust visualization. TabletStone is implemented as a lightweight, UDP-driven endpoint that renders upstream training signals on the stone while adapting the UI to throwers and sweepers. To preserve readability under rotation, we formalize an absolute-position fixation strategy based on an on-device yaw estimate and counter-rotation transforms. We evaluate TabletStone through an initial controlled user study with six experienced curlers performing sweeping while reading on-stone values under two conditions (baseline and absolute-position fixation). The study showed higher subjective readability together with improved accuracy and recall for absolute-position fixation, while precision remained high in both conditions; missed readouts remained the dominant failure mode under workload. Overall, these results support the feasibility and potential usefulness of combining role-aware UI/UX with rotation-aware stabilization for on-object feedback in curling training, while broader training effects remain to be validated.
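The absolute-position fixation strategy amounts to applying the inverse of the stone's yaw to the UI before rendering, so the physical rotation of the display cancels out. A minimal 2D sketch (the on-device yaw estimation itself is outside this snippet, and the coordinates are illustrative):

```python
import math

def rotate(point, yaw_deg):
    # Rotate a 2D point counterclockwise by yaw_deg degrees.
    t = math.radians(yaw_deg)
    x, y = point
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t))

def counter_rotate(point, yaw_deg):
    # Pre-rotate UI content by -yaw: the stone's physical rotation by
    # +yaw then restores the content to a fixed rink-frame position.
    return rotate(point, -yaw_deg)

# A label drawn at (1, 0) stays at (1, 0) in world space even when the
# stone has turned 90 degrees:
label = counter_rotate((1.0, 0.0), 90.0)
world = rotate(label, 90.0)
print(round(world[0], 6), round(world[1], 6))
```

The composition R(yaw) · R(-yaw) = I is the whole trick: only the yaw estimate needs to be accurate for the readout to appear rotation-stable.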
Journal Article
Why So Meme? A Comparative and Explainable Analysis of Multimodal Hateful Meme Detection
by Nor Azmi, Nor Saiful Azam Bin; Ptaszynski, Michal; Masui, Fumito
in Analysis; Classification; Comparative analysis
2026
The rise of toxic content, particularly in the form of hateful memes, poses a significant challenge to social media platforms. This paper presents an empirical comparative study of unimodal and multimodal architectures for toxic content detection. Rather than proposing a novel architecture, the study evaluates the efficacy of a modular Late Fusion framework (RoBERViT) against specialized unimodal baselines (RoBERTa and ViT) and a generalist Large Multimodal Model (LLaVA). We explore both unimodal and multimodal configurations across two distinct benchmarks: the imbalanced Innopolis Hateful Memes dataset and the confounder-driven Facebook Hateful Memes dataset. Beyond quantitative metrics, this study conducts a qualitative analysis using Explainable AI (LIME) and a Large Multimodal Model (LLaVA) to investigate model reasoning. Results demonstrate that the multimodal fusion model consistently outperformed its unimodal counterparts on the Innopolis Hateful Memes dataset, achieving a toxic class F1-score of 0.6439 compared to the text-only score of 0.5794. However, on the Facebook Hateful Memes dataset, text-only models remain competitive, highlighting the “benign confounder” challenge. The qualitative analysis reveals that text remains the dominant modality, with models often relying on surface-level keywords. Notably, the Vision Transformer frequently uses text overlays as a visual proxy for hate, while the LLaVA model struggles with hallucinated toxicity in benign confounder contexts. These findings underscore the persistent challenge of achieving true multimodal understanding in hate speech detection.
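Late fusion, as the abstract uses the term, can be sketched as concatenating unimodal feature vectors and applying one classification head. The embeddings and head weights below are invented toy values, not RoBERViT parameters:

```python
import math

def late_fusion_score(text_feats, image_feats, weights, bias=0.0):
    # Late fusion: concatenate the unimodal feature vectors, then apply
    # a single linear head with a sigmoid to get P(hateful).
    fused = list(text_feats) + list(image_feats)
    z = sum(w * f for w, f in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 2-dim embeddings from a text and an image encoder:
weights = [1.5, 0.5, 1.0, 0.8]  # toy head, not trained values
p_meme = late_fusion_score([0.9, 0.7], [0.8, 0.6], weights)
p_benign = late_fusion_score([0.1, 0.0], [0.1, 0.1], weights)
print(round(p_meme, 3), round(p_benign, 3))
```

The design choice being compared in the paper is exactly this: the head sees both modalities only after each encoder has finished, which keeps the framework modular but limits cross-modal interaction, one plausible reason text-only models stay competitive on benign confounders.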
Journal Article
A New Approach to Extracting Tourism Focus Points from Chinese Inbound Tourist Reviews after COVID-19
2023
The number of inbound tourists in Japan has been increasing steadily in recent years. However, due to the COVID-19 pandemic, the number of inbound tourists decreased in 2020. This is particularly worrisome for Japan, as the number of inbound tourists is expected to reach 60 million per year by 2030. In order to help Japan’s tourism industry to recover from the pandemic, we propose a method of identifying elements that attract the attention of inbound tourists (focus points) by analyzing reviews on tourist sites. We focus on Hokkaido, a popular area in Japan for tourists from China. Our proposed method extracts high-frequency n-gram patterns from reviews written by Chinese inbound tourists, showing which aspects are mentioned most often. We then use seven types of motivational factors for tourists and principal component analysis to quantify the focus points of each tourist destination. Finally, we estimate the focus points by clustering the n-gram patterns extracted from the tourists’ reviews. The results show that our method successfully identifies the features and focus points of each tourist spot.
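The high-frequency n-gram extraction step can be sketched with the standard library; the reviews below are invented English stand-ins for the Chinese originals:

```python
from collections import Counter

def top_ngrams(reviews, n=2, k=3):
    # Count word n-grams across all reviews and return the k most frequent.
    counts = Counter()
    for review in reviews:
        tokens = review.lower().split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

reviews = [
    "the hot springs were amazing",
    "loved the hot springs and the snow",
    "hot springs at night are beautiful",
]
print(top_ngrams(reviews, n=2, k=1))  # [(('hot', 'springs'), 3)]
```

In the paper's pipeline these frequent patterns are then scored against the seven motivational factors and clustered, so the surface counts are only the first stage.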
Journal Article
SVM, BERT, or LLM? A Comparative Study on Multilingual Instructed Deception Detection
2025
The automated detection of deceptive language is a crucial challenge in computational linguistics. This study provides a rigorous comparative analysis of three tiers of machine learning models for detecting instructed deception: traditional machine learning (SVM), fine-tuned discriminative models (BERT), and in-context learning with generalist Large Language Models (LLMs). Using the “cross-cultural deception detection” dataset, our findings reveal a clear performance hierarchy. While SVM performance is inconsistent, fine-tuned BERT models achieve substantially superior accuracy. Notably, a multilingual BERT model improves cross-topic accuracy on Spanish text to 90.14%, a gain of over 22 percentage points from its monolingual counterpart (67.20%). In contrast, modern LLMs perform poorly in zero-shot settings and fail to surpass the SVM baseline even with few-shot prompting, underscoring the effectiveness of task-specific fine-tuning. By transparently addressing the limitations of the solicited, low-stakes deception dataset, we establish a robust methodological baseline that clarifies the strengths of different modeling paradigms and informs future research into more complex, real-world deception phenomena.
Journal Article
Language Models Are Polyglots: Language Similarity Predicts Cross-Lingual Transfer Learning Performance
by Eronen, Juuso; Borges, Robert; Janic, Katarzyna
in Analysis; Classification; Comparative analysis
2026
Selecting a source language for zero-shot cross-lingual transfer is typically done by intuition or by defaulting to English, despite large performance differences across language pairs. We study whether linguistic similarity can predict transfer performance and support principled source-language selection. We introduce quantified WALS (qWALS), a typology-based similarity metric derived from features in the World Atlas of Language Structures, and evaluate it against existing similarity baselines. Validation uses three complementary signals: computational similarity scores, zero-shot transfer performance of multilingual transformers (mBERT and XLM-R) on four NLP tasks (dependency parsing, named entity recognition, sentiment analysis, and abusive language identification) across eight languages, and an expert-linguist similarity survey. Across tasks and models, higher linguistic similarity is associated with better transfer, and the survey provides independent support for the computational metrics.
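The qWALS metric itself is derived from WALS feature values; as a simplified hypothetical illustration, a typology-based similarity can be computed as the agreement ratio over features both languages have filled in (the three features and their values below are invented simplifications, not actual WALS codes):

```python
def typological_similarity(feats_a, feats_b):
    # Fraction of shared, non-missing typological features on which the
    # two languages agree. A crude stand-in for the paper's qWALS score.
    shared = [f for f in feats_a if f in feats_b
              and feats_a[f] is not None and feats_b[f] is not None]
    if not shared:
        return 0.0
    return sum(feats_a[f] == feats_b[f] for f in shared) / len(shared)

finnish  = {"word_order": "SVO", "case_marking": "rich", "articles": "none"}
estonian = {"word_order": "SVO", "case_marking": "rich", "articles": "none"}
english  = {"word_order": "SVO", "case_marking": "poor", "articles": "definite"}
print(typological_similarity(finnish, estonian))  # 1.0
print(typological_similarity(finnish, english))   # ~0.33
```

Under the paper's finding, a higher score like the Finnish-Estonian pair would predict better zero-shot transfer than the Finnish-English pair.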
Journal Article
A Method of Supplementing Reviews to Less-Known Tourist Spots Using Geotagged Tweets
2022
When planning a trip or an adventure, sightseers increasingly rely on opinions posted on tourism-related websites such as TripAdvisor, Booking.com, or Expedia. Unfortunately, beautiful yet less-known places and rarely visited sightseeing spots often do not accumulate a sufficient number of valuable opinions on such websites. On the other hand, users often post their opinions on casual social media services such as Facebook, Instagram, or Twitter. Therefore, in this study, we develop a system for supplementing the insufficient number of Internet opinions available for sightseeing spots with tweets containing opinions about such spots, with a specific focus on wildlife spots. To do that, we develop a Park Supplementary Review System (PSRS) for wildlife sightseeing spots and propose a method for verifying collected geotagged tweets and using them as on-spot reviews. Tweets that contain geolocation information are considered geotagged and are therefore treated as possible tourist on-spot reviews. The main challenge, however, is to confirm the authenticity of the extracted tweets. Our method combines location clustering and classification techniques: extracted geotagged tweets are clustered by location information and then annotated, with the resulting features applied to machine learning-based classification. As for the machine learning (ML) algorithms, we adopt a fine-tuned transformer-based BERT model, which incorporates token context information. The BERT model achieved an F-score of 0.936, suggesting that applying a state-of-the-art deep learning-based approach had a significant impact on solving this task. The extracted tweets and annotated scores are then mapped onto the designed PSRS as supplementary reviews for travelers seeking additional information about the related sightseeing spots.
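The location-clustering step can be sketched as nearest-spot assignment over great-circle distances; the coordinates and the 1 km radius below are illustrative choices, not the paper's parameters:

```python
import math

def haversine_km(a, b):
    # Great-circle distance between two (lat, lon) pairs, in kilometres.
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def assign_tweets(tweets, spots, max_km=1.0):
    # Attach each geotagged tweet to the nearest spot within max_km,
    # or to None when no spot is close enough.
    out = {}
    for tid, coord in tweets.items():
        nearest = min(spots, key=lambda s: haversine_km(coord, spots[s]))
        out[tid] = nearest if haversine_km(coord, spots[nearest]) <= max_km else None
    return out

spots = {"Asahiyama Zoo": (43.768, 142.480)}  # illustrative coordinates
tweets = {"t1": (43.769, 142.481),   # a few hundred metres away
          "t2": (43.062, 141.354)}   # downtown Sapporo, far away
assigned = assign_tweets(tweets, spots)
print(assigned)
```

Clustering only narrows the candidate pool; the BERT classifier then decides which of the nearby tweets actually read as on-spot reviews.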
Journal Article
The Limits of Words: Expanding a Word-Based Emotion Analysis System with Multiple Emotion Dictionaries and the Automatic Extraction of Emotive Expressions
2024
Wide adoption of social media has caused an explosion of information stored online, with the majority of that information containing subjective, opinionated, and emotional content produced daily by users. The field of emotion analysis has helped effectively process such human emotional expressions expressed in daily social media posts. Unfortunately, one of the greatest limitations of popular word-based emotion analysis systems has been the limited emotion vocabulary. This paper presents an attempt to extensively expand one such word-based emotion analysis system by integrating multiple emotion dictionaries and implementing an automatic extraction mechanism for emotive expressions. We first leverage diverse emotive expression dictionaries to expand the emotion lexicon of the system. To do that, we solve numerous problems with the integration of various dictionaries collected using different standards. We demonstrate the performance improvement of the system with improved accuracy and granularity of emotion classification. Furthermore, our automatic extraction mechanism facilitates the identification of novel emotive expressions in an emotion dataset, thereby enriching the depth and breadth of emotion analysis capabilities. In particular, the automatic extraction method shows promising results for applicability in further expansion of the dictionary base in the future, thus advancing the field of emotion analysis and offering new avenues for research in sentiment analysis, affective computing, and human–computer interaction.
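One core problem the abstract names, reconciling dictionaries built to different labeling standards, can be sketched as label unification at merge time. All entries and the label map here are invented English examples, not the system's actual Japanese dictionaries:

```python
def merge_emotion_dicts(dicts, label_map):
    # Merge several emotion lexicons whose labels follow different
    # standards by mapping each source label onto one shared label set.
    merged = {}
    for d in dicts:
        for expression, label in d.items():
            unified = label_map.get(label, label)
            merged.setdefault(expression, set()).add(unified)
    return merged

dict_a = {"over the moon": "joy", "boiling mad": "anger"}
dict_b = {"over the moon": "happiness", "gutted": "sadness"}
merged = merge_emotion_dicts([dict_a, dict_b], {"happiness": "joy"})
print(merged["over the moon"])  # {'joy'}
```

Keeping the values as sets also handles the genuinely ambiguous case where two dictionaries assign an expression to different emotions even after label unification.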
Journal Article