Catalogue Search | MBRL

Clinical document corpora—real ones, translated and synthetic substitutes, and assorted domain proxies: a survey of diversity in corpus design, with focus on German text data

by Hahn, Udo in Computational linguistics, Data mining, Language processing

2025

We survey clinical document corpora, with a focus on German textual data. Due to rigid data privacy legislation in Germany, these resources, with only a few exceptions, are stored in protected clinical data spaces and locked against clinic-external researchers. This situation stands in stark contrast with established workflows in the field of natural language processing, where easy accessibility and reuse of (textual) data collections are common practice. Hence, alternative corpus designs have been examined to escape from data poverty. Besides machine translation of English clinical datasets and the generation of synthetic corpora with fictitious clinical contents, several types of domain proxies have come up as substitutes for real clinical documents. Common instances of close proxies are medical journal publications, therapy guidelines, drug labels, etc.; more distant proxies include medical contents from social media channels or online encyclopedic medical articles. We follow the PRISMA (Preferred Reporting Items for Systematic reviews and Meta-Analyses) guidelines for surveying the field of German-language clinical/medical corpora. Four bibliographic databases were searched: PubMed, ACL Anthology, Google Scholar, and the author's personal literature database. After PRISMA-conformant identification of 362 hits from the 4 bibliographic systems, the screening process yielded 78 relevant documents for inclusion in this review. They contained overall 92 different published versions of corpora, of which 71 were truly unique in terms of their underlying document sets. Out of these, the majority were clinical corpora: 46 real ones, of which 32 were unique, 5 translated ones (3 unique), and 6 synthetic ones (3 unique). As to domain proxies, we identified 18 close ones (16 unique) and 17 distant ones (all of them unique).
There is a clear divide between the large number of non-accessible real clinical German-language corpora and their publicly accessible substitutes: translated or synthetic datasets, close or more distant proxies. So, at first sight, the data bottleneck seems broken. Intuitively, however, differences in genre-specific writing style, lexical or terminological diction, and required medical background expertise in this typological space are also obvious. This raises the question of how valid alternative corpus designs really are. A systematic, empirically grounded yardstick for comparing real clinical corpora with those suggested substitutes and proxies is still missing. The extreme sparsity of real clinical corpora in almost all non-Anglo-American countries worldwide, Germany in particular, has triggered an active search for alternative, publicly accessible data resources laid out in this survey. However, the utility of these substitutes compared with real clinical corpora and their semantic and genre-specific distance to real clinical corpora are still under-researched, so their value remains to be assessed properly. Furthermore, corpus descriptions are often incomplete with respect to relevant descriptive attributes. This paper bundles these observations and proposes a template for a so-called corpus card to improve corpus documentation.

Journal Article

“Karampátan ñg Tao”: Tracing the Rise of Tagalog Human Rights Discourse Using a Textual Corpus

by Ramon Guillermo in human rights, political lexicography, Tagalog

2025

This essay is a preliminary study on the rise of human rights discourse in the Tagalog language from the late nineteenth century to the mid-twentieth using a carefully designed textual corpus. The corpus is made up of original Tagalog texts as well as translations of political treatises from European languages into Tagalog. While it has been found that karapatan (rights) is indeed a central notion in the development of a specifically Tagalog revolutionary discourse, the matter of its “inherence” in the tao (human being) has followed a particularly convoluted path due to the existence of alternative interpretations revolving around the moral “worthiness” of individuals and classes.

Journal Article

Semantic feature norms: a cross-method and cross-language comparison

by Hultén, Annika; Salmelin, Riitta; van Vliet, Marijn in Adult, Behavioral Science and Psychology, Cognitive Psychology

2024

The ability to assign meaning to perceptual stimuli forms the basis of human behavior and the ability to use language. The meanings of things have primarily been probed using behavioral production norms and corpus-derived statistical methods. However, it is not known to what extent the collection method and the language being probed influence the resulting semantic feature vectors. In this study, we compare behavioral with corpus-based norms, across Finnish and English, using an all-to-all approach. To complete the set of norms required for this study, we present a new set of Finnish behavioral production norms, containing both abstract and concrete concepts. We found that all the norms provide largely similar information about the relationships of concrete objects and allow item-level mapping across norm sets. This validates the use of corpus-derived norms, which are easier to obtain than the labor-intensive behavioral norms, for studies that do not depend on subtle differences in meaning between close semantic neighbors.

Journal Article

The Manifesto Corpus: A new resource for research on political parties and quantitative text analysis

by Regel, Sven; Merz, Nicolas; Lewandowski, Jirka in Access, Computerization, Computerized corpora

2016

This article presents a digital, open-access, multilingual, annotated corpus of electoral programs. It complements the recent methodological innovations in (semi-) computerized content analysis by providing a large, standardized text corpus for the political science community. The corpus is based on the collection of the Manifesto Project, which comprises (at the time of writing) the largest hand-annotated text corpus of electoral programs available. Since 2009 the project’s costly and time-intensive procedure of collecting and coding documents has been fully digitized. As a result, it now provides more than 1800 machine-readable documents from 40 different countries. Six hundred of these documents contain content-analyzed annotations at the level of single (quasi-) sentences, which correspond to the Manifesto Project coding scheme. Additionally, the corpus will continually be extended by incorporating new elections and digitizing older documents. The database also provides meta-information for each document (e.g., party, election, language) that allows it to be referenced back to the Manifesto Dataset. The corpus is stored in a standardized format in an online database, and an API and R package (manifestoR) guarantee easy access.

Journal Article

CH-Bench: a user-oriented benchmark for systems for efficient distant reading (design, performance, and insights)

by Böhm, Klemens; Willkomm, Jens; Schäler, Martin in Benchmarks, Corpus analysis, Corpus linguistics

2023

Data science deals with the discovery of information from large volumes of data. The data studied by scientists in the humanities include large textual corpora. An important objective is to study the ideas and expectations of a society regarding specific concepts, like “freedom” or “democracy,” both for today’s society and even more for societies of the past. Studying the meaning of words using large corpora requires efficient systems for text analysis, so-called distant reading systems. Making such systems efficient calls for a specification of the necessary functionality and clear expectations regarding typical workloads. However, both are currently unclear, and there is no benchmark to evaluate distant reading systems. In this article, we propose such a benchmark, with the following innovations: As a first step, we collect and structure various information needs of the target users. We then formalize the notion of word context to facilitate the analysis of specific concepts. Using this notion, we formulate queries in line with the information needs of users. Finally, based on this, we propose concrete benchmark queries. To demonstrate the benefit of our benchmark, we conduct an evaluation with two objectives. First, we aim at insights regarding the content of different corpora, i.e., whether and how their size and nature (e.g., popular and broad literature or specific expert literature) affect results. Second, we benchmark different data management technologies. This has allowed us to identify performance bottlenecks.

Journal Article

Structural–Semantic Term Weighting for Interpretable Topic Modeling with Higher Coherence and Lower Token Overlap

by Konnikov, Evgenii; Yakob, Polina; Golikov, Gleb in Bibliometrics, Coherence, coherence value

2026

Topic modeling of large news streams is widely used to reconstruct economic and political narratives, which requires coherent topics with low lexical overlap while remaining interpretable to domain experts. We propose TF-SYN-NER-Rel, a structural–semantic term weighting scheme that extends classical TF-IDF by integrating positional, syntactic, factual, and named-entity coefficients derived from morphosyntactic and dependency parses of Russian news texts. The method is embedded into a standard Latent Dirichlet Allocation (LDA) pipeline and evaluated on a large Russian-language news corpus from the online archive of Moskovsky Komsomolets (over 600,000 documents), with political, financial, and sports subsets obtained via dictionary-based expert labeling. For each subset, TF-SYN-NER-Rel is compared with standard TF-IDF under identical LDA settings, and topic quality is assessed using the C_v coherence metric. To assess robustness, we repeat model training across multiple random initializations and report aggregate coherence statistics. Quantitative results show that TF-SYN-NER-Rel improves coherence and yields smoother, more stable coherence curves across the number of topics. Qualitative analysis indicates reduced lexical overlap between topics and clearer separation of event-centered and institutional themes, especially in political and financial news. Overall, the proposed pipeline relies on CPU-based NLP tools and sparse linear algebra, providing a computationally lightweight and interpretable complement to embedding- and LLM-based topic modeling in large-scale news monitoring.

Journal Article

SrpELTeC: A Serbian Literary Corpus for Distant Reading

by Vitas, Duško; Stanković, Ranka; Krstev, Cvetana in Corpus linguistics, Data science, Digitization

2024

The article presents SrpELTeC, a corpus developed within the COST action Distant Reading for European Literary History (CA16204). All novels in SrpELTeC were selected, prepared, and annotated using the common principles established for all language collections in the European Literary Text Collection (ELTeC). The challenges and solutions in preparing SrpELTeC from scratch are outlined. All novels were manually encoded in TEI with rich metadata and structural annotation. The automatic annotation included POS-tagging, lemmatization, and named entities, relying on Natural Language Processing resources developed and maintained by the JeRTeh Language Resources and Technologies Society. The integration of SrpELTeC with Wikidata was supported with a set of SPARQL queries for the retrieval of metadata with different visualization options. Recent activities within the COST Action NexusLinguarum—European Network for Web-centred Linguistic Data Science (CA18209) are related to the linked data version of SrpELTeC using the NLP Interchange Format. All versions of SrpELTeC are freely available under the CC-BY license.

Journal Article

A Stylometric Glance at Novels in Euskara

by Werońska, Dominika in Basque language, Cluster analysis, Dialects

2024

While Basque has been posited as possibly the oldest existing language on the European continent, it appears in written form only in the sixteenth century. The first Basque novel emerges over 300 years later and to this day the genre lacks exhaustive research. The article sets as its aim a stylometric analysis of selected twentieth- and twenty-first-century Basque novels, sourced from the online platforms Armiarma and Booktegi. These are analyzed based on the frequency of the most frequent words measured using cluster analysis and set against a backdrop of foreign novels translated into Euskara. The results show that the originals in Euskara remain distinct from translated works, pointing to the unique linguistic character of the Basque novel. Some linguistic patterns potentially responsible for this distinction are presented. The results are visualized on a map revealing the chronological evolution and the contribution of the Basque novel to the broader literary landscape.

Journal Article

On the possibilities of quantitative corpus analysis of the verse of the Slovak variation of Surrealism (Introduction to the problematics)

by Dušan Teplan in quantitative analysis, slovak surrealism, text corpus

2023

The aim of the study is to present and specify the possibilities of quantitative corpus analysis of the verse that was characteristic of the work of the representatives of the avant-garde movement of Slovak surrealism (Rudolf Fábry, Július Lenko, Vladimír Reisel, Štefan Žáry and others). The study contains three chapters. The first one summarizes the previous research on the verse of Slovak surrealism and characterizes its general features. Then, isolated attempts at quantitative analysis of surrealist verse are presented and critically evaluated. The conclusion presents the possibilities of quantitative research on surrealist verse in digital processing, with the need to create a digital corpus of poetic texts that could be used for the analysis of a variety of versological problems. The aim should be not only the digitisation of surrealist texts, but of the whole complex of Slovak poetry, so that selected problems can also be studied in their interrelations and from developmental aspects. From the methodological point of view, the study follows the current research in the field of quantitative versology.

Journal Article

Building a Speech and Text Corpus of Turkish: Large Corpus Collection with Initial Speech Recognition Results

by Oyucu, Saadin; Polat, Huseyin

2020

To build automatic speech recognition (ASR) systems with a low word error rate (WER), a large speech and text corpus is needed. Corpus preparation is the first step required for developing an ASR system for a language with few argument speech documents available. Turkish is a language with limited resources for ASR. Therefore, development of a symmetric Turkish transcribed speech corpus, in line with the corpora of high-resource languages, is crucial for improving and promoting Turkish speech recognition activities. In this study, we constructed a viable alternative to classical transcribed corpus preparation techniques for collecting Turkish speech data. In the presented approach, three different methods were used. In the first step, subtitles, which are mainly supplied for people with hearing difficulties, were used as transcriptions for the speech utterances obtained from movies. In the second step, data were collected via a mobile application. In the third step, a transfer learning approach to the Grand National Assembly of Turkey session records (videotext) was used. We also provide the initial speech recognition results of artificial neural network and Gaussian mixture-model-based acoustic models for Turkish. For training models, the newly collected corpus and other existing corpora published by the Linguistic Data Consortium were used. In light of the test results of the other existing corpora, the current study showed the relative contribution of corpus variability in a symmetric speech recognition task. The decrease in WER after including the new corpus was more evident with increased verified data size, compensating for the status of Turkish as a low-resource language. For further studies, the importance of the corpus and language model in the success of the Turkish ASR system is shown.

Journal Article
