Catalogue Search | MBRL
132 result(s) for "Callison-Burch, Chris"
Artificial Intelligence in mental health and the biases of language based models
2020
The rapid integration of Artificial Intelligence (AI) into the healthcare field has occurred with little communication between computer scientists and doctors. The impact of AI on health outcomes and inequalities calls for health professionals and data scientists to make a collaborative effort to ensure historic health disparities are not encoded into the future. We present a study that evaluates bias in existing Natural Language Processing (NLP) models used in psychiatry and discuss how these biases may widen health inequalities. Our approach systematically evaluates each stage of model development to explore how biases arise from a clinical, data science and linguistic perspective.
A literature review of the uses of NLP in mental health was carried out across multiple disciplinary databases with defined MeSH terms and keywords. Our primary analysis evaluated biases within 'GloVe' and 'Word2Vec' word embeddings. Euclidean distances were measured to assess relationships between psychiatric terms and demographic labels, and vector similarity functions were used to solve analogy questions relating to mental health.
Our primary analysis of mental health terminology in GloVe and Word2Vec embeddings demonstrated significant biases with respect to religion, race, gender, nationality, sexuality and age. Our literature review returned 52 papers, of which none addressed all the areas of possible bias that we identify in model development. In addition, only one article existed on more than one research database, demonstrating the isolation of research within disciplinary silos and inhibiting cross-disciplinary collaboration or communication.
Our findings are relevant to professionals who wish to minimize the health inequalities that may arise as a result of AI and data-driven algorithms. We offer primary research identifying biases within these technologies and provide recommendations for avoiding these harms in the future.
Journal Article
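The bias probes this abstract describes operate directly on word vectors: distances between term pairs and analogy completion via vector similarity. A minimal sketch of that kind of probe, assuming gensim and its downloadable "glove-wiki-gigaword-50" vectors; the term lists below are illustrative stand-ins, not the study's actual lists:

```python
# A minimal sketch of an embedding bias probe, assuming gensim's downloadable
# GloVe vectors. Term lists are illustrative, not the study's actual lists.
import numpy as np
import gensim.downloader as api

kv = api.load("glove-wiki-gigaword-50")  # small pre-trained GloVe model

def euclidean(a, b):
    """Euclidean distance between two word vectors."""
    return float(np.linalg.norm(kv[a] - kv[b]))

psychiatric_terms = ["depression", "schizophrenia", "anxiety"]
demographic_labels = ["woman", "man"]

# Compare how close each psychiatric term sits to each demographic label;
# systematic asymmetries are one signal of embedding bias.
for term in psychiatric_terms:
    for label in demographic_labels:
        print(f"{term:15s} {label:6s} {euclidean(term, label):.3f}")

# Analogy probing via vector similarity, e.g. "man : doctor :: woman : ?"
print(kv.most_similar(positive=["doctor", "woman"], negative=["man"], topn=3))
```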
SNAP judgments into the digital age: Reporting on food stamps varies significantly with time, publication type, and political leaning
by Chrisinger, Benjamin W.; Kinsey, Eliza W.; Pavlick, Ellie
in Algorithms; Alignment; Biology and Life Sciences
2020
The Supplemental Nutrition Assistance Program (SNAP) is the second-largest and most contentious public assistance program administered by the United States government. The media forums where SNAP discourse occurs have changed with the advent of social and web-based media. We used machine learning techniques to characterize media coverage of SNAP over time (1990-2017), between outlets with national readership and those with narrower scopes, and, for a subset of web-based media, by the outlet's political leaning. We applied structural topic models, a machine learning methodology that categorizes and summarizes large bodies of text that have document-level covariates or metadata, to a corpus of print media retrieved via LexisNexis (n = 76,634). For comparison, we compiled a separate corpus via a web-scraping algorithm over the Google News API (2012-2017), and assigned political alignment metadata to a subset of documents according to a recent study of partisanship on social media. A similar procedure was used on a subset of the print media documents that could be matched to the same alignment index. Using linear regression models, we found some, but not all, topics to vary significantly with time, between large and small media outlets, and by political leaning. Our findings offer insights into the polarized and partisan nature of discourse around a major social welfare program in the United States, and the possible effects of new media environments on that discourse.
Journal Article
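Structural topic models are usually fit with the R `stm` package, which the abstract does not name. As a rough Python approximation under that caveat, one can fit a plain LDA model and then regress each topic's document proportions on the metadata covariates, which is the spirit of the analysis described:

```python
# A rough approximation of the pipeline, assuming gensim and numpy: fit plain
# LDA, then regress a topic's document proportions on metadata covariates.
# Documents and metadata here are tiny fabricated examples.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["snap", "benefits", "cut"], ["food", "stamps", "fraud"],
        ["snap", "enrollment", "rise"], ["benefits", "policy", "debate"]]
meta = np.array([[1990, 0], [2005, 1], [2017, 1], [2012, 0]])  # [year, leaning]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

# Document-topic proportions serve as the regression response.
theta = np.array([[p for _, p in lda.get_document_topics(bow, minimum_probability=0)]
                  for bow in corpus])

# OLS of topic-0 prevalence on [1, year, leaning]; with real data one would
# test the coefficients for significance, as the study does.
X = np.column_stack([np.ones(len(meta)), meta])
coef, *_ = np.linalg.lstsq(X, theta[:, 0], rcond=None)
print("intercept, year, leaning coefficients:", coef)
```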
Extracting Lexically Divergent Paraphrases from Twitter
by Ritter, Alan; Callison-Burch, Chris; Dolan, William B.
in Annotations; Classifiers; Community research
2021
We present MultiP (Multi-instance Learning Paraphrase Model), a new
model suited to identifying paraphrases within the short messages on Twitter. We
jointly model paraphrase relations between word and sentence pairs and assume
only sentence-level annotations during learning. Using this principled latent
variable model alone, we achieve the performance competitive with a
state-of-the-art method which combines a latent space model with a feature-based
supervised classifier. Our model also captures lexically divergent paraphrases
that differ from yet complement previous methods; combining our model with
previous work significantly outperforms the state-of-the-art. In addition, we
present a novel annotation methodology that has allowed us to crowdsource a
paraphrase corpus from Twitter. We make this new dataset available to the
research community.
Journal Article
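The core multi-instance idea can be sketched compactly: score every word pair between the two sentences, then aggregate to a sentence-level paraphrase probability with a noisy-or, so that only sentence-level labels are ever needed during training. A toy sketch of that assumption (not the paper's actual model or features):

```python
# Toy sketch of the multi-instance learning assumption: latent per-word-pair
# paraphrase probabilities, aggregated with noisy-or to the sentence level.
# Features and weights are random placeholders, not the paper's model.
import numpy as np

rng = np.random.default_rng(0)

def word_pair_probs(pairs, w):
    """Logistic score per word pair from a shared weight vector."""
    feats = np.array([f for _, _, f in pairs])
    return 1.0 / (1.0 + np.exp(-feats @ w))

def sentence_prob(pairs, w):
    """Noisy-or: P(paraphrase) = 1 - prod(1 - p_i) over word pairs."""
    p = word_pair_probs(pairs, w)
    return 1.0 - np.prod(1.0 - p)

# Each "instance" is (word_a, word_b, feature_vector).
pairs = [("cops", "police", rng.normal(size=3)),
         ("film", "academy", rng.normal(size=3))]
w = rng.normal(size=3)
print("sentence-level paraphrase probability:", sentence_prob(pairs, w))
```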
Learning to translate with products of novices: a suite of open-ended challenge problems for teaching MT
2021
Machine translation (MT) draws from several different disciplines, making it a complex subject to teach. There are excellent pedagogical texts, but problems in MT and current algorithms for solving them are best learned by doing. As a centerpiece of our MT course, we devised a series of open-ended challenges for students in which the goal was to improve performance on carefully constrained instances of four key MT tasks: alignment, decoding, evaluation, and reranking. Students brought a diverse set of techniques to the problems, including some novel solutions which performed remarkably well. A surprising and exciting outcome was that student solutions or their combinations fared competitively on some tasks, demonstrating that even newcomers to the field can help improve the state-of-the-art on hard NLP problems while simultaneously learning a great deal. The problems, baseline code, and results are freely available.
Journal Article
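Word alignment, the first of the four challenge tasks, has a classic baseline that such a course typically starts from: IBM Model 1 trained with EM. A compact sketch of that assumed baseline (not the students' solutions):

```python
# IBM Model 1 word alignment trained with EM on a tiny toy bitext.
from collections import defaultdict

bitext = [("das haus".split(), "the house".split()),
          ("das buch".split(), "the book".split()),
          ("ein buch".split(), "a book".split())]

# Uniform initialization of translation probabilities t(e | f).
t = defaultdict(lambda: 1.0 / 4)

for _ in range(10):  # EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    for f_sent, e_sent in bitext:
        for e in e_sent:
            z = sum(t[(e, f)] for f in f_sent)  # E-step normalization
            for f in f_sent:
                c = t[(e, f)] / z
                count[(e, f)] += c
                total[f] += c
    for (e, f), c in count.items():  # M-step
        t[(e, f)] = c / total[f]

print(round(t[("house", "haus")], 3), round(t[("book", "buch")], 3))
```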
Problems in Current Text Simplification Research: New Data Can Help
by Xu, Wei; Callison-Burch, Chris; Napoles, Courtney
in Comparative analysis; Corpus analysis; Data quality
2015
Simple Wikipedia has dominated simplification research in the past 5 years. In
this opinion paper, we argue that focusing on Wikipedia limits simplification
research. We back up our arguments with corpus analysis and by highlighting
statements that other researchers have made in the simplification literature. We
introduce a new simplification dataset that is a significant improvement over
Simple Wikipedia, and present a novel quantitative-comparative approach to study
the quality of simplification data resources.
Journal Article
Optimizing Statistical Machine Translation for Text Simplification
by Pavlick, Ellie; Callison-Burch, Chris; Chen, Quanze
in Computerized corpora; Grammatical aspect; Iterative methods
2016
Most recent sentence simplification systems use basic machine translation models
to learn lexical and syntactic paraphrases from a manually simplified parallel
corpus. These methods are limited by the quality and quantity of manually
simplified corpora, which are expensive to build. In this paper, we conduct an
in-depth adaptation of statistical machine translation to perform text
simplification, taking advantage of large-scale paraphrases learned from
bilingual texts and a small amount of manual simplifications with multiple
references. Our work is the first to design automatic metrics that are effective
for tuning and evaluating simplification systems, which will facilitate
iterative development for this task.
Journal Article
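The tuning metric this paper is known for, SARI, compares a system output against both the source sentence and reference simplifications, crediting words that are correctly added, kept, and deleted. A simplified unigram-only sketch (real SARI averages over n-gram orders and multiple references, and weights the components differently):

```python
# Simplified, unigram-only sketch of a SARI-style simplification metric.
def sari_unigram(source, output, reference):
    """Toy SARI: mean F1 over added, kept, and deleted unigrams."""
    src, out, ref = set(source.split()), set(output.split()), set(reference.split())

    def f1(pred, gold):
        tp = pred & gold
        if not pred or not gold or not tp:
            return 0.0
        p, r = len(tp) / len(pred), len(tp) / len(gold)
        return 2 * p * r / (p + r)

    add = f1(out - src, ref - src)     # words correctly added
    keep = f1(out & src, ref & src)    # words correctly kept
    delete = f1(src - out, src - ref)  # words correctly deleted
    return (add + keep + delete) / 3

src = "the cat perched atop the mat"
out = "the cat sat upon the mat"
ref = "the cat sat on the mat"
print(round(sari_unigram(src, out, ref), 3))
```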
The Language Demographics of Amazon Mechanical Turk
by Pavlick, Ellie; Callison-Burch, Chris; Kachaev, Dmitry
in Ability; Bilingual dictionaries; Bilingualism
2014
We present a large scale study of the languages spoken by bilingual workers on
Mechanical Turk (MTurk). We establish a methodology for determining the language
skills of anonymous crowd workers that is more robust than simple surveying. We
validate workers’ self-reported language skill claims by measuring their ability
to correctly translate words, and by geolocating workers to see if they reside
in countries where the languages are likely to be spoken. Rather than posting a
one-off survey, we posted paid tasks consisting of 1,000 assignments to
translate a total of 10,000 words in each of 100 languages. Our study ran for
several months, and was highly visible on the MTurk crowdsourcing platform,
increasing the chances that bilingual workers would complete it. Our study was
useful both to create bilingual dictionaries and to act as a census of the
bilingual speakers on MTurk. We use this data to recommend languages with the
largest speaker populations as good candidates for other researchers who want to
develop crowdsourced, multilingual technologies. To further demonstrate the
value of creating data via crowdsourcing, we hire workers to create bilingual
parallel corpora in six Indian languages, and use them to train statistical
machine translation systems.
Journal Article
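The validation methodology reduces to two checks per worker: translation accuracy on words with known answers, and whether the worker's geolocated country plausibly speaks the claimed language. A sketch with hypothetical records and thresholds (the study's actual criteria are not given in the abstract):

```python
# Sketch of the two validation signals described: translation accuracy and
# geolocation plausibility. Records, mapping, and threshold are illustrative.
workers = [
    {"id": "w1", "lang": "hi", "correct": 9, "total": 10, "country": "IN"},
    {"id": "w2", "lang": "hi", "correct": 2, "total": 10, "country": "US"},
]
# Hypothetical mapping of languages to countries where they are widely spoken.
LANG_COUNTRIES = {"hi": {"IN", "FJ"}}

for w in workers:
    accuracy = w["correct"] / w["total"]
    plausible_location = w["country"] in LANG_COUNTRIES[w["lang"]]
    validated = accuracy >= 0.8 or plausible_location
    print(w["id"], f"acc={accuracy:.0%}", f"geo_ok={plausible_location}",
          "validated" if validated else "rejected")
```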
End-to-end statistical machine translation with zero or small parallel texts
2016
We use bilingual lexicon induction techniques, which learn translations from monolingual texts in two languages, to build an end-to-end statistical machine translation (SMT) system without the use of any bilingual sentence-aligned parallel corpora. We present detailed analysis of the accuracy of bilingual lexicon induction, and show how a discriminative model can be used to combine various signals of translation equivalence (like contextual similarity, temporal similarity, orthographic similarity and topic similarity). Our discriminative model produces higher accuracy translations than previous bilingual lexicon induction techniques. We reuse these signals of translation equivalence as features on a phrase-based SMT system. These monolingually estimated features enhance low resource SMT systems in addition to allowing end-to-end machine translation without parallel corpora.
Journal Article
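The discriminative combination of equivalence signals can be sketched as a small feature-based classifier, assuming scikit-learn; the feature values below are fabricated for illustration:

```python
# Sketch of combining translation-equivalence signals discriminatively,
# assuming scikit-learn. Feature values are fabricated for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# rows: [contextual_sim, temporal_sim, orthographic_sim, topic_sim]
X = np.array([[0.82, 0.74, 0.10, 0.69],   # true translation pair
              [0.80, 0.69, 0.05, 0.71],   # true translation pair
              [0.31, 0.42, 0.02, 0.18],   # non-translation
              [0.25, 0.38, 0.40, 0.22]])  # non-translation
y = np.array([1, 1, 0, 0])

clf = LogisticRegression().fit(X, y)
candidate = np.array([[0.78, 0.70, 0.08, 0.65]])
print("P(translation):", clf.predict_proba(candidate)[0, 1])
```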
Paraphrase-Sense-Tagged Sentences
2019
Many natural language processing tasks require discriminating the particular
meaning of a word in context, but building corpora for developing sense-aware
models can be a challenge. We present a large resource of example usages for
words having a particular meaning, called Paraphrase-Sense-Tagged Sentences
(PSTS). Built on the premise that a word’s paraphrases instantiate its
fine-grained meanings (i.e., a word has different meanings corresponding to
its different paraphrases), the resource contains up to 10,000 sentences for
each of 3 million target-paraphrase pairs where the target word takes on the
meaning of the paraphrase. We describe an automatic method based on bilingual
pivoting used to enumerate sentences for PSTS, and present two models for
ranking PSTS sentences based on their quality. Finally, we demonstrate the
utility of PSTS by using it to build a dataset for the task of hypernym
prediction in context. Training a model on this automatically generated dataset
produces accuracy that is competitive with a model trained on smaller datasets
crafted with some manual effort.
Journal Article
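The bilingual pivoting step marginalizes over shared foreign translations: p(e2 | e1) = Σ_f p(e2 | f) · p(f | e1). A toy sketch with illustrative probability tables:

```python
# Bilingual pivoting: two English phrases count as paraphrases when they share
# foreign-language translations, summing over the shared pivots.
# Probability tables below are illustrative, not real phrase-table entries.
p_f_given_e = {"bug": {"insecte": 0.6, "bogue": 0.4}}         # p(f | e1)
p_e_given_f = {"insecte": {"insect": 0.7, "bug": 0.3},
               "bogue": {"glitch": 0.8, "bug": 0.2}}          # p(e2 | f)

def paraphrase_prob(e1, e2):
    """p(e2 | e1) = sum over pivots f of p(e2 | f) * p(f | e1)."""
    return sum(p_e_given_f.get(f, {}).get(e2, 0.0) * pf
               for f, pf in p_f_given_e.get(e1, {}).items())

print("p(insect | bug):", paraphrase_prob("bug", "insect"))
print("p(glitch | bug):", paraphrase_prob("bug", "glitch"))
```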
Autorubric: Unifying Rubric-based LLM Evaluation
2026
Techniques for reliable rubric-based LLM evaluation (ensemble judging, bias mitigation, few-shot calibration) are scattered across papers with inconsistent terminology and partial implementations. We introduce Autorubric, an open-source framework that unifies these rubric-based LLM evaluation lessons with opinionated defaults: analytic rubrics with binary, ordinal, and nominal criteria; single-judge and ensemble evaluation; few-shot calibration; bias mitigations; and psychometric reliability metrics. We validate on three benchmarks: RiceChem (college chemistry grading, 80% accuracy with 5-shot calibration), ResearcherBench (deep research evaluation, 931 criteria, cross-judge agreement analysis), and CHARM-100, a new chatbot evaluation dataset combining all three criterion types with ground-truth labels (87% binary accuracy, moderate-to-substantial κ). Beyond measurement, per-criterion scores and explanations serve as optimization signals. We demonstrate how Autorubric's rubric-evaluation explanations raise a peer review agent's score from 0.47 to 0.85 (above the 0.82 expert-curated baseline), and how its scores serve as RL rewards to produce a statistically significant improvement on AdvancedIF (+0.039, Wilcoxon p = 0.032) with positive transfer to IFEval. In all of these cases, Autorubric enabled us to rapidly operationalize rubric design choices and best practices with minimal effort.
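A minimal sketch of the rubric structures and ensemble judging the abstract describes; this is illustrative only and not Autorubric's actual API (the names and aggregation rule are assumptions):

```python
# Illustrative sketch: binary, ordinal, and nominal rubric criteria scored by
# several judges and aggregated by majority vote. Not Autorubric's real API.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    kind: str        # "binary", "ordinal", or "nominal"
    options: tuple   # allowed labels/scores

rubric = [
    Criterion("cites_sources", "binary", (0, 1)),
    Criterion("clarity", "ordinal", (1, 2, 3, 4, 5)),
    Criterion("error_type", "nominal", ("none", "factual", "formatting")),
]

# Simulated per-judge outputs for one response; in practice each entry would
# come from an LLM judge prompted with the criterion and the response.
judge_votes = {
    "cites_sources": [1, 1, 0],
    "clarity": [4, 4, 5],
    "error_type": ["none", "none", "factual"],
}

def aggregate(criterion, votes):
    """Majority vote; ordinal criteria could instead use a median."""
    label, _ = Counter(votes).most_common(1)[0]
    assert label in criterion.options
    return label

for c in rubric:
    print(c.name, "->", aggregate(c, judge_votes[c.name]))
```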