62 result(s) for "Botha, Jan"
FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation, a type of style-targeted translation. The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese. Source documents are selected to enable detailed analysis of phenomena of interest, including lexically distinct terms and distractor terms. We explore automatic evaluation metrics for FRMT and validate their correlation with expert human evaluation across both region-matched and mismatched rating scenarios. Finally, we present a number of baseline models for this task, and offer guidelines for how researchers can train, evaluate, and compare their own models. Our dataset and evaluation code are publicly available.
Utrecht University; exploration, colonial knowledge: A 'Civilizing Mission'. Interview with Henk van Rinsum
This piece relays an interview with Henk van Rinsum (retired from Utrecht University). In the interview, the idea of the university as a detached space connected with the notion of the alleged objectivity of science is challenged by an “older white Dutchman attempting to offer insights on colonialism”. The interview explores the colonial historical development of a university in Western Europe as it finds its place within the entanglements of Christianity, capitalism, commerce, colonialism, and civilisation. The interview calls for a sensitive dialogue on issues of decolonisation. Are we prepared to address the ills of colonialism, given that we still seem to live under the influence of coloniality, including in higher education? Henk van Rinsum (retired from Utrecht University and 2021-2024 Research Fellow at CREST, Stellenbosch University) is a curious creature in the world of universities. For one thing, his professional career was primarily in academic administration, not in the academy. In addition, his educational training in history and anthropology interested him in the world of institutions of higher education and research. And thirdly, early dabbling in theology gave Dr van Rinsum an understanding of the (near) spiritual place of the university in changing times and different spaces. The book is now also published under the English title Utrecht University and Colonial Knowledge: Exploration, Exploitation and the Civilizing Mission since 1636 and is available open access at OAPEN (oapen.org). It is empirically rich, drawing on a largely unexplored archive at Utrecht University and a 40-year career.
Academic and Scientific Authorship Practices
Empirical studies of authorship practices in high-income countries have been conducted, while research on this issue is scarce in low- and middle-income countries. A survey was conducted among South African researchers who have published in peer-reviewed journals, to explore their understanding of and ability to apply academic authorship criteria. A total of 967 researchers participated in the survey; 88% of respondents had knowledge of academic authorship criteria, while only 52% found it easy to apply the criteria. More respondents experienced disagreement regarding who qualifies for coauthorship compared with authorship order (59% vs. 48%). Disagreement was mostly linked to different ways of valuing or measuring contributions. Level of agreement with academic authorship criteria was higher than the perceived ability to apply the criteria.
Renvoi in African Private International Law of Contract: A Model Clause for a Model Law
On the 24th of June 2016, David Cameron, the Prime Minister of the United Kingdom (UK), announced that “[t]he British people have voted to leave the European Union (EU) and their vote must be respected.” While the author of this dissertation takes notice of the announcement and is aware of the tremendous repercussions that could follow the UK’s leaving of the EU, the Brexit effect will be “ignored” for the purposes of this dissertation as the author believes that the UK leaving the EU would not likely occur. Even if this unlikely withdrawal were to occur, it would take a substantial amount of time until the UK pushes the article 50-button. Lastly, the mere fact of the UK leaving would not necessarily cause a huge change in the private international law rules of the Laws of England and Wales, as it is quite possible, especially in a long-term transitional period, that the UK will still attempt to retain some of the benefits of the European Preference by keeping its laws and policies as closely aligned to the EU as possible.
Probabilistic modelling of morphologically rich languages
This thesis investigates how the sub-structure of words can be accounted for in probabilistic models of language. Such models play an important role in natural language processing tasks such as translation or speech recognition, but often rely on the simplistic assumption that words are opaque symbols. This assumption does not fit morphologically complex language well, where words can have rich internal structure and sub-word elements are shared across distinct word forms. Our approach is to encode basic notions of morphology into the assumptions of three different types of language models, with the intention that leveraging shared sub-word structure can improve model performance and help overcome data sparsity that arises from morphological processes. In the context of n-gram language modelling, we formulate a new Bayesian model that relies on the decomposition of compound words to attain better smoothing, and we develop a new distributed language model that learns vector representations of morphemes and leverages them to link together morphologically related words. In both cases, we show that accounting for word sub-structure improves the models' intrinsic performance and provides benefits when applied to other tasks, including machine translation. We then shift the focus beyond the modelling of word sequences and consider models that automatically learn what the sub-word elements of a given language are, given an unannotated list of words. We formulate a novel model that can learn discontiguous morphemes in addition to the more conventional contiguous morphemes that most previous models are limited to. This approach is demonstrated on Semitic languages, and we find that modelling discontiguous sub-word structures leads to improvements in the task of segmenting words into their contiguous morphemes.
Entity Linking in 100 Languages
We propose a new formulation for multilingual entity linking, where language-specific mentions resolve to a language-agnostic Knowledge Base. We train a dual encoder in this new setting, building on prior work with improved feature representation, negative mining, and an auxiliary entity-pairing task, to obtain a single entity retrieval model that covers 100+ languages and 20 million entities. The model outperforms state-of-the-art results from a far more limited cross-lingual linking task. Rare entities and low-resource languages pose challenges at this large scale, so we advocate for an increased focus on zero- and few-shot evaluation. To this end, we provide Mewsli-9, a large new multilingual dataset (http://goo.gle/mewsli-dataset) matched to our setting, and show how frequency-based analysis provided key insights for our model and training enhancements.
Asking without Telling: Exploring Latent Ontologies in Contextual Representations
The success of pretrained contextual encoders, such as ELMo and BERT, has brought a great deal of interest in what these models learn: do they, without explicit supervision, learn to encode meaningful notions of linguistic structure? If so, how is this structure encoded? To investigate this, we introduce latent subclass learning (LSL): a modification to existing classifier-based probing methods that induces a latent categorization (or ontology) of the probe's inputs. Without access to fine-grained gold labels, LSL extracts emergent structure from input representations in an interpretable and quantifiable form. In experiments, we find strong evidence of familiar categories, such as a notion of personhood in ELMo, as well as novel ontological distinctions, such as a preference for fine-grained semantic roles on core arguments. Our results provide unique new evidence of emergent structure in pretrained encoders, including departures from existing annotations which are inaccessible to earlier methods.
QUALITY DOCTORAL EDUCATION IN AFRICA
Much of the teaching at universities in Africa is done by academic staff members who do not have doctoral degrees. In South Africa, for example (in 2017), only 46% of the academic staff of the public universities held a doctoral degree (DHET 2019:48). This has implications for the quality of teaching and research and the capacity to deliver more doctoral graduates. In other African countries, the situation is no different. According to a study of eight “flagship” African universities by Cloete, Bunting and Van Schalkwyk (2018:50), the situation (in 2015) at these universities was as follows: at the University of
Probabilistic Modelling of Morphologically Rich Languages
This thesis investigates how the sub-structure of words can be accounted for in probabilistic models of language. Such models play an important role in natural language processing tasks such as translation or speech recognition, but often rely on the simplistic assumption that words are opaque symbols. This assumption does not fit morphologically complex language well, where words can have rich internal structure and sub-word elements are shared across distinct word forms. Our approach is to encode basic notions of morphology into the assumptions of three different types of language models, with the intention that leveraging shared sub-word structure can improve model performance and help overcome data sparsity that arises from morphological processes. In the context of n-gram language modelling, we formulate a new Bayesian model that relies on the decomposition of compound words to attain better smoothing, and we develop a new distributed language model that learns vector representations of morphemes and leverages them to link together morphologically related words. In both cases, we show that accounting for word sub-structure improves the models' intrinsic performance and provides benefits when applied to other tasks, including machine translation. We then shift the focus beyond the modelling of word sequences and consider models that automatically learn what the sub-word elements of a given language are, given an unannotated list of words. We formulate a novel model that can learn discontiguous morphemes in addition to the more conventional contiguous morphemes that most previous models are limited to. This approach is demonstrated on Semitic languages, and we find that modelling discontiguous sub-word structures leads to improvements in the task of segmenting words into their contiguous morphemes.