Catalogue Search | MBRL
2 results for "human-annotated dataset"
Mix-Lingual Relation Extraction: Dataset and a Training Approach
by Chen, Jia-Jun; Ma, Zheng; Chu, You-Gang
in Artificial Intelligence, Computer Science, Data Structures and Information Theory
2025
Relation extraction is a pivotal task within the field of natural language processing, boasting numerous real-world applications. Existing research predominantly centers on monolingual relation extraction or cross-lingual enhancement for relation extraction. However, there exists a notable gap in understanding relation extraction within mix-lingual (or code-switching) scenarios. In these scenarios, individuals blend content from different languages within sentences, generating mix-lingual content. The effectiveness of existing relation extraction models in such scenarios remains largely unexplored due to the absence of dedicated datasets. To address this gap, we introduce the Mix-Lingual Relation Extraction (MixRE) task and construct a human-annotated dataset MixRED to support this task. Additionally, we propose a hierarchical training approach for the mix-lingual scenario named Mix-Lingual Training (MixTrain), designed to enhance the performance of large language models (LLMs) when capturing relational dependencies from mix-lingual content spanning different semantic levels. Our experiments involve evaluating state-of-the-art supervised models and LLMs on the constructed dataset, with results indicating that MixTrain notably improves model performance. Moreover, we investigate the effectiveness of using mix-lingual content as a tool to transfer learned relational dependencies across different languages. Additionally, we delve into factors influencing model performance for both supervised models and LLMs in the novel MixRE task.
Journal Article
NewsSumm: The World’s Largest Human-Annotated Multi-Document News Summarization Dataset for Indian English
by Agarwal, Megha; Agrawal, Avinash; Motghare, Manish
in Abstractive Summarization, Annotations, Benchmarks
2025
The rapid growth of digital journalism has heightened the need for reliable multi-document summarization (MDS) systems, particularly in underrepresented, low-resource, and culturally distinct contexts. However, current progress is hindered by a lack of large-scale, high-quality non-Western datasets. Existing benchmarks—such as CNN/DailyMail, XSum, and MultiNews—are limited by language, regional focus, or reliance on noisy, auto-generated summaries. We introduce NewsSumm, the largest human-annotated MDS dataset for Indian English, curated by over 14,000 expert annotators through the Suvidha Foundation. Spanning 36 Indian English newspapers from 2000 to 2025 and covering more than 20 topical categories, NewsSumm includes over 317,498 articles paired with factually accurate, professionally written abstractive summaries. We detail its robust collection, annotation, and quality control pipelines, and present extensive statistical, linguistic, and temporal analyses that underscore its scale and diversity. To establish benchmarks, we evaluate PEGASUS, BART, and T5 models on NewsSumm, reporting aggregate and category-specific ROUGE scores, as well as factual consistency metrics. All NewsSumm dataset materials are openly released via Zenodo. NewsSumm offers a foundational resource for advancing research in summarization, factuality, timeline synthesis, and domain adaptation for Indian English and other low-resource language settings.
Journal Article