Catalogue Search | MBRL

BioBBC: a multi-feature model that enhances the detection of biomedical entities

by Gao, Xin; Alamro, Hind; Gojobori, Takashi in 631/1647, 631/1647/48, 639/705/117

2024

The rapid increase in biomedical publications necessitates efficient systems to automatically handle Biomedical Named Entity Recognition (BioNER) tasks in unstructured text. However, accurately detecting biomedical entities is challenging due to the complexity of their names and the frequent use of abbreviations. In this paper, we propose BioBBC, a deep learning (DL) model that utilizes multi-feature embeddings and is built on the BERT-BiLSTM-CRF architecture to address the BioNER task. BioBBC consists of three main layers: an embedding layer, a Bidirectional Long Short-Term Memory (BiLSTM) layer, and a Conditional Random Fields (CRF) layer. BioBBC takes sentences from the biomedical domain as input and identifies the biomedical entities mentioned within the text. The embedding layer generates enriched contextual representation vectors of the input by learning the text through four types of embeddings: part-of-speech (POS) tag embedding, char-level embedding, BERT embedding, and data-specific embedding. The BiLSTM layer produces additional syntactic and semantic feature representations. Finally, the CRF layer identifies the best possible tag sequence for the input sentence. Our model is well-constructed and well-optimized for detecting different types of biomedical entities. In our experiments, the model outperformed state-of-the-art (SOTA) models with significant improvements on six benchmark BioNER datasets.
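
As an illustration of the final step, a CRF layer typically selects the best tag sequence with Viterbi decoding. The sketch below is not the BioBBC implementation: the function name `viterbi_decode`, the dictionary-based score tables, and the toy O/B/I tag set are assumptions made for illustration, with the emission scores standing in for the output of the BiLSTM layer.

```python
from typing import Dict, List


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag sequence for one sentence.

    emissions[t][tag]     : score of `tag` at token position t (hypothetical input)
    transitions[a][b]     : score of moving from tag `a` to tag `b`
    """
    if not emissions:
        return []

    # best[t][tag] holds the best score of any path ending in `tag` at position t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t-1][tag] is the previous tag on that best path

    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final tag to recover the path.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

The transition table is what lets the decoder reject sequences that are locally attractive but globally invalid, such as an inside tag directly after an outside tag.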

Journal Article

Exploiting machine learning models to identify novel Alzheimer’s disease biomarkers and potential targets

by Gao, Xin; Thafar, Maha A.; Albaradei, Somayah in 631/114, 631/114/2114, 631/114/2397

2023

We still do not have an effective treatment for Alzheimer's disease (AD) despite it being the most common cause of dementia and impaired cognitive function. Thus, research endeavors are directed toward identifying AD biomarkers and targets. In this regard, we designed a computational method that exploits multiple hub gene ranking methods and feature selection methods with machine learning and deep learning to identify biomarkers and targets. First, we used three AD gene expression datasets to identify (1) hub genes based on six ranking algorithms (Degree, Maximum Neighborhood Component (MNC), Maximal Clique Centrality (MCC), Betweenness Centrality (BC), Closeness Centrality, and Stress Centrality) and (2) gene subsets based on two feature selection methods (LASSO and Ridge). Then, we developed machine learning and deep learning models to determine the gene subset that best distinguishes AD samples from the healthy controls. This work shows that the gene subsets chosen by feature selection achieve better prediction performance than the hub gene sets. Beyond this, the five genes identified by both feature selection methods (LASSO and Ridge algorithms) achieved an AUC = 0.979. We further show that 70% of the upregulated hub genes (among the 28 overlapping hub genes) are AD targets based on a literature review, and that six miRNAs (hsa-mir-16-5p, hsa-mir-34a-5p, hsa-mir-1-3p, hsa-mir-26a-5p, hsa-mir-93-5p, hsa-mir-155-5p) and one transcription factor, JUN, are associated with the upregulated hub genes. Furthermore, since 2020, four of the six miRNAs were also shown to be potential AD targets. To our knowledge, this is the first work showing that such a small number of genes can distinguish AD samples from healthy controls with high accuracy and that overlapping upregulated hub genes can narrow the search space for potential novel targets.
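
To make the simplest of the six ranking algorithms concrete, the sketch below ranks genes by Degree, that is, by the number of distinct neighbours in an interaction network. It is an illustrative assumption rather than the authors' pipeline: the function name `rank_hub_genes`, the edge-list input, and the alphabetical tie-break are all hypothetical choices.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def rank_hub_genes(edges: Iterable[Tuple[str, str]], top_k: int) -> List[str]:
    """Return the top_k genes ordered by degree (number of distinct neighbours)."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops; they do not add a neighbour
        neighbours[a].add(b)
        neighbours[b].add(a)

    # Highest degree first; ties are broken alphabetically for reproducibility.
    ranked = sorted(neighbours, key=lambda g: (-len(neighbours[g]), g))
    return ranked[:top_k]
```

Running the same network through several such rankings and keeping the genes that appear in every top list is one way to obtain an overlapping hub gene set.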

Journal Article

Rise and fall of the global conversation and shifting sentiments during the COVID-19 pandemic

by Salhi, Adil; Gao, Xin; Zhang, Xiangliang in Arabic language, Chinese languages, Classification

2021

Social media (e.g., Twitter) has been an extremely popular tool for public health surveillance. The novel coronavirus disease 2019 (COVID-19) is the first pandemic experienced by a world connected through the internet. We analyzed 105+ million tweets collected between March 1 and May 15, 2020, and Weibo messages compiled between January 20 and May 15, 2020, covering six languages (English, Spanish, Arabic, French, Italian, and Chinese) and representing an estimated 2.4 billion citizens worldwide. To examine fine-grained emotions during a pandemic, we built machine learning classification models based on deep learning language models to identify emotions in social media conversations about COVID-19, including positive expressions (optimistic, thankful, and empathetic), negative expressions (pessimistic, anxious, sad, annoyed, and denial), and a complicated expression, joking, which has not been explored before. Our analysis indicates a rapid increase and a slow decline in the volume of social media conversations regarding the pandemic in all six languages. The upsurge was triggered by a combination of economic collapse and confinement measures across the regions to which all six languages belonged, except for Chinese, where only the latter drove conversations. Tweets in all analyzed languages conveyed remarkably similar emotional states as the epidemic was elevated to pandemic status: feelings were dominated by a mixture of joking with anxious/pessimistic/annoyed as the volume of conversation surged, and shifted to a general increase in positive states (optimistic, thankful, and empathetic), the strongest being expressed in Arabic tweets, as the pandemic came under control.

Journal Article

Target-aware Abstractive Related Work Generation with Contrastive Learning

by Gao, Xin; Zhang, Xiangliang; Gao, Shen in Coders, Decoding, Optimization

2022

The related work section is an important component of a scientific paper, which highlights the contribution of the target paper in the context of the reference papers. Authors can save time and effort by using an automatically generated related work section as a draft for the final version. Most of the existing related work section generation methods rely on extracting off-the-shelf sentences to make a comparative discussion about the target work and the reference papers. However, such sentences need to be written in advance and are hard to obtain in practice. Hence, in this paper, we propose an abstractive target-aware related work generator (TAG), which can generate related work sections consisting of new sentences. Concretely, we first propose a target-aware graph encoder, which models the relationships between reference papers and the target paper with target-centered attention mechanisms. In the decoding process, we propose a hierarchical decoder that attends to the nodes of different levels in the graph with keyphrases as semantic indicators. Finally, to generate a more informative related work, we propose multi-level contrastive optimization objectives, which aim to maximize the mutual information between the generated related work and the references and minimize that with non-references. Extensive experiments on two public scholar datasets show that the proposed model brings substantial improvements over several strong baselines in terms of automatic and tailored human evaluations.
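
The abstract does not give the form of the contrastive objectives. As a hedged illustration of the general idea, the sketch below uses the InfoNCE loss, a commonly used contrastive loss that lower-bounds mutual information; it is a swapped-in stand-in, not the multi-level objective of TAG. The function name `info_nce_loss`, the plain-list embeddings, and the `temperature` default are assumptions.

```python
import math
from typing import Sequence


def _dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(
    generated: Sequence[float],
    reference: Sequence[float],
    non_references: Sequence[Sequence[float]],
    temperature: float = 0.1,
) -> float:
    """InfoNCE loss for one generated text against one reference (positive)
    and any number of non-references (negatives). Lower is better."""
    pos = _dot(generated, reference) / temperature
    logits = [pos] + [_dot(generated, n) / temperature for n in non_references]

    # Numerically stable log-sum-exp over the positive and all negatives.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))

    # Equivalent to -log(softmax probability assigned to the positive).
    return log_denominator - pos
```

Minimizing this loss pulls the generated representation toward the reference and pushes it away from the non-references, which matches the direction of the objective described above.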

Paper

Overview of the Arabic Sentiment Analysis 2021 Competition at KAUST

by Zhang, Xiangliang; Alharbi, Basma; Khayyat, Zuhair in Competition, Data mining, Datasets

2021

This paper provides an overview of the Arabic Sentiment Analysis Challenge organized by King Abdullah University of Science and Technology (KAUST). The task in this challenge is to develop machine learning models that classify a given tweet into one of three categories: Positive, Negative, or Neutral. From our recently released ASAD dataset, we provide the competitors with 55K tweets for training and 20K tweets for validation, based on which the performance of participating teams is ranked on a leaderboard, https://www.kaggle.com/c/arabic-sentiment-analysis-2021-kaust. The competition received a total of 1247 submissions from 74 teams (99 team members). The final winners are determined by another private set of 20K tweets that has the same distribution as the training and validation sets. In this paper, we present the main findings of the competition and summarize the methods and tools used by the top-ranked teams. The full dataset of 100K labeled tweets is also released for public use at https://www.kaggle.com/c/arabic-sentiment-analysis-2021-kaust/data.

Paper

ASAD: A Twitter-based Benchmark Arabic Sentiment Analysis Dataset

by Zhang, Xiangliang; Alharbi, Basma; Khayyat, Zuhair in Annotations, Benchmarks, Competition

2021

This paper provides a detailed description of a new Twitter-based benchmark dataset for Arabic Sentiment Analysis (ASAD), which is launched in a competition sponsored by KAUST that awards 10000 USD, 5000 USD and 2000 USD to the first, second and third place winners, respectively. Compared to other publicly released Arabic datasets, ASAD is a large, high-quality annotated dataset (including 95K tweets) with three-class sentiment labels (positive, negative and neutral). We present the details of the data collection and annotation processes. In addition, we implement several baseline models for the competition task and report the results as a reference for the participants in the competition.

Paper

SenWave: Monitoring the Global Sentiments under the COVID-19 Pandemic

by Salhi, Adil; Gao, Xin; Zhang, Xiangliang in Coronaviruses, COVID-19, Data mining

2020

Since the first alert launched by the World Health Organization (5 January, 2020), COVID-19 has spread to over 180 countries and territories. As of June 18, 2020, there are over 8,400,000 cases and over 450,000 related deaths in total. This has caused massive economic and job losses globally and confined about 58% of the global population. In this paper, we introduce SenWave, a novel sentiment analysis work using 105+ million collected tweets and Weibo messages to evaluate the global rise and fall of sentiments during the COVID-19 pandemic. To make a fine-grained analysis of feelings in the face of this global health crisis, we annotate 10K tweets in English and 10K tweets in Arabic in 10 categories, including optimistic, thankful, empathetic, pessimistic, anxious, sad, annoyed, denial, official report, and joking. We then utilize an integrated transformer framework, called simpletransformer, to conduct multi-label sentiment classification by fine-tuning the pre-trained language model on the labeled data. Meanwhile, for a more complete analysis, we also translate the annotated English tweets into different languages (Spanish, Italian, and French) to generate training data for building sentiment analysis models for these languages. SenWave thus reveals the sentiment of the global conversation on COVID-19 in six different languages (English, Spanish, French, Italian, Arabic and Chinese), following the spread of the epidemic. The conversation showed a remarkably similar pattern of rapid rise and slow decline over time across all nations, as well as on special topics like the herd immunity strategies, to which the global conversation reacts strongly negatively. Overall, SenWave shows that optimistic and positive sentiments increased over time, foretelling a desire to seek, together, a reset for an improved COVID-19 world.
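
Multi-label classification differs from the usual single-label case in that each of the 10 categories is scored independently, so one tweet can be both anxious and joking. The sketch below shows only that final decision step under stated assumptions: the function name `predict_labels`, the raw per-label scores (logits) as input, and the 0.5 threshold are hypothetical and are not taken from the SenWave code.

```python
import math
from typing import List, Sequence

# The 10 annotation categories listed in the abstract.
LABELS = [
    "optimistic", "thankful", "empathetic", "pessimistic", "anxious",
    "sad", "annoyed", "denial", "official report", "joking",
]


def predict_labels(logits: Sequence[float], threshold: float = 0.5) -> List[str]:
    """Map one raw score per label to the set of labels whose
    sigmoid probability reaches the threshold."""
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} scores, got {len(logits)}")
    # An independent sigmoid per label, rather than one softmax over all labels,
    # is what allows several labels to be active for the same tweet.
    probabilities = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [label for label, p in zip(LABELS, probabilities) if p >= threshold]
```

In a fine-tuned model the scores would come from the classification head of the pre-trained language model; here they are supplied directly to keep the example self-contained.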

Paper
