Catalogue Search | MBRL
11 result(s) for "Schwartz, Reva"
Real-World AI Evaluation: How FRAME Generates Systematic Evidence to Resolve the Decision-Maker's Dilemma
2026
Organizational leaders are being asked to make high-stakes decisions about AI deployment without dependable evidence of what these systems actually do in the environments they oversee. The predominant AI evaluation ecosystem yields scalable but abstract metrics that reflect the priorities of model development. By smoothing over the heterogeneity of real-world use, these model-centric approaches obscure how behavior varies across users, workflows, and settings, and rarely show where risk and value accumulate in practice. More user-centric studies reveal rich contextual detail, yet are fragmented, small-scale and loosely coupled to the mechanisms that shape model behavior. The Forum for Real-World AI Measurement and Evaluation (FRAME) aims to address this gap by combining large-scale trials of AI systems with structured observation of how they are used in context, the outcomes they generate, and how those outcomes arise. By tracing the path from an AI system's output through its practical use and downstream effects, FRAME turns the heterogeneity of AI-in-use into a measurable signal rather than a trade-off for achieving scale. The Forum establishes two core assets to achieve this: a Testing Sandbox that captures AI-in-use under real workflows at scale and a Metrics Hub that translates those traces into actionable indicators.
Making AI Evaluation Deployment Relevant Through Context Specification
by Lacerda, Thiago; Schwartz, Reva; Holmes, Matthew
in Decision making, Organizations, Specifications
2026
With many organizations struggling to gain value from AI deployments, pressure to evaluate AI in an informed manner has intensified. Status quo AI evaluation approaches mask the operational realities that ultimately determine deployment success, making it difficult for decision makers outside the stack to know whether and how AI tools will deliver durable value. We introduce and describe context specification as a process to support and inform deployment decision making. Context specification turns diffuse stakeholder perspectives about what matters in a given setting into clear, named constructs: explicit definitions of the properties, behaviors, and outcomes that evaluations aim to capture, so they can be observed and measured in context. The process serves as a foundational roadmap for evaluating what AI systems are likely to do in the deployment contexts that organizations actually manage.
Pre-trained Speech Processing Models Contain Human-Like Biases that Propagate to Speech Emotion Recognition
2023
Previous work has established that a person's demographics and speech style affect how well speech processing models perform for them. But where does this bias come from? In this work, we present the Speech Embedding Association Test (SpEAT), a method for detecting bias in one type of model used for many speech tasks: pre-trained models. The SpEAT is inspired by word embedding association tests in natural language processing, which quantify intrinsic bias in a model's representations of different concepts, such as race or valence (something's pleasantness or unpleasantness), and capture the extent to which a model trained on large-scale socio-cultural data has learned human-like biases. Using the SpEAT, we test for six types of bias in 16 English speech models (including 4 models also trained on multilingual data), which come from the wav2vec 2.0, HuBERT, WavLM, and Whisper model families. We find that 14 or more models reveal positive valence (pleasantness) associations with abled people over disabled people, with European-Americans over African-Americans, with females over males, with U.S. accented speakers over non-U.S. accented speakers, and with younger people over older people. Beyond establishing that pre-trained speech models contain these biases, we also show that they can have real-world effects. We compare biases found in pre-trained models to biases in downstream models adapted to the task of Speech Emotion Recognition (SER) and find that in 66 of the 96 tests performed (69%), the group that is more associated with positive valence as indicated by the SpEAT also tends to be predicted as speaking with higher valence by the downstream model. Our work provides evidence that, like text and image-based models, pre-trained speech-based models frequently learn human-like biases. Our work also shows that bias found in pre-trained models can propagate to the downstream task of SER.
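The catalogue record does not include implementation details for the SpEAT. Purely as an illustration, the sketch below shows how an embedding association test of this general kind is typically scored: a Cohen's-d-style effect size over cosine similarities between two target groups of embeddings and two attribute groups (e.g. positive vs. negative valence). The function names and random placeholder data are hypothetical and are not the authors' SpEAT code.

    import numpy as np

    def cosine(a, b):
        # Cosine similarity between two embedding vectors.
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def association(w, A, B):
        # Mean similarity of one embedding to attribute set A minus attribute set B.
        return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

    def effect_size(X, Y, A, B):
        # WEAT/SpEAT-style effect size (Cohen's d) comparing target groups X and Y
        # on their association with attribute groups A and B.
        x_assoc = np.array([association(x, A, B) for x in X])
        y_assoc = np.array([association(y, A, B) for y in Y])
        pooled_std = np.std(np.concatenate([x_assoc, y_assoc]), ddof=1)
        return (x_assoc.mean() - y_assoc.mean()) / pooled_std

    # Hypothetical usage: each row stands in for a pooled speech-model embedding of one clip.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 768))   # clips from one speaker group
    Y = rng.normal(size=(20, 768))   # clips from another speaker group
    A = rng.normal(size=(10, 768))   # clips treated as pleasant (positive valence)
    B = rng.normal(size=(10, 768))   # clips treated as unpleasant (negative valence)
    print(round(effect_size(X, Y, A, B), 3))

A positive effect size indicates the first target group is more associated with the positive-valence attribute set than the second; the sign and magnitude convention mirrors word embedding association tests.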
Ask What Your Country Can Do For You: Towards a Public Red Teaming Model
2025
AI systems have the potential to produce both benefits and harms, but without rigorous and ongoing adversarial evaluation, AI actors will struggle to assess the breadth and magnitude of the AI risk surface. Researchers from the field of systems design have developed several effective sociotechnical AI evaluation and red teaming techniques targeting bias, hate speech, mis/disinformation, and other documented harm classes. However, as increasingly sophisticated AI systems are released into high-stakes sectors (such as education, healthcare, and intelligence-gathering), our current evaluation and monitoring methods are proving less and less capable of delivering effective oversight. In order to actually deliver responsible AI and to ensure AI's harms are fully understood and its security vulnerabilities mitigated, pioneering new approaches to close this "responsibility gap" are now more urgent than ever. In this paper, we propose one such approach, the cooperative public AI red-teaming exercise, and discuss early results of its prior pilot implementations. This approach is intertwined with CAMLIS itself: the first in-person public demonstrator exercise was held in conjunction with CAMLIS 2024. We review the operational design and results of this exercise, the prior Assessing the Risks and Impacts of AI (ARIA) pilot exercise run by the National Institute of Standards and Technology (NIST), and another similar exercise conducted with the Singapore Infocomm Media Development Authority (IMDA). Ultimately, we argue that this approach is both capable of delivering meaningful results and scalable to many AI-developing jurisdictions.
CIRCLE: A Framework for Evaluating AI from a Real-World Lens
by Lacerda, Thiago; Chowdhury, Rumman; Fadaee, Marzieh
in Co-design, Performance measurement, Systems stability
2026
This paper proposes CIRCLE, a six-stage, lifecycle-based framework to bridge the reality gap between model-centric performance metrics and AI's materialized outcomes in deployment. Current approaches such as MLOps frameworks and AI model benchmarks offer detailed insights into system stability and model capabilities, but they do not provide decision-makers outside the AI stack with systematic evidence of how these systems actually behave in real-world contexts or affect their organizations over time. CIRCLE operationalizes the Validation phase of TEVV (Test, Evaluation, Verification, and Validation) by formalizing the translation of stakeholder concerns outside the stack into measurable signals. Unlike participatory design, which often remains localized, or algorithmic audits, which are often retrospective, CIRCLE provides a structured, prospective protocol for linking context-sensitive qualitative insights to scalable quantitative metrics. By integrating methods such as field testing, red teaming, and longitudinal studies into a coordinated pipeline, CIRCLE produces systematic knowledge: evidence that is comparable across sites yet sensitive to local context. This, in turn, can enable governance based on materialized downstream effects rather than theoretical capabilities.
Turning Minds On and Faucets Off: Water Conservation Education in Jordanian Schools
by Sanchack, Julie; Grieser, Mona; Schwartz, Reva
in Conservation (Environment), Conservation Education, Control Groups
2001
An evaluation was conducted to measure the impact of a curriculum implementation through the Jordan Water Conservation Education Project funded by USAID. This study examined the effect of recommending water conservation at the household level and the impact of using interactive teaching methods to promote conservation behaviors among students and their families. The evaluation used a postintervention design with random selection of participants. Comparisons were made among 671 students (424 experimental, 247 control) belonging to high school eco-clubs in central Jordan. Most students were girls in rural settings. The experimental group consisted of students whose teachers implemented an interactive curriculum and promoted household water-conservation behaviors. Teachers of students in the control group did not participate in the curriculum implementation, but those students were exposed to lectures about biodiversity issues. The results indicate that students who were exposed to the new curriculum demonstrated a higher level of knowledge about water conservation and performed recommended behaviors more often than students in the control group.
Journal Article
Reality Check: A New Evaluation Ecosystem Is Necessary to Understand AI's Real World Effects
2025
Conventional AI evaluation approaches concentrated within the AI stack exhibit systemic limitations for exploring, navigating, and resolving the human and societal factors that play out in real-world deployment in sectors such as education, finance, healthcare, and employment. AI capability evaluations can capture detail about first-order effects, such as whether immediate system outputs are accurate, or contain toxic, biased or stereotypical content, but AI's second-order effects, i.e. any long-term outcomes and consequences that may result from AI use in the real world, have become a significant area of interest as the technology becomes embedded in our daily lives. These secondary effects can include shifts in user behavior, societal, cultural and economic ramifications, workforce transformations, and long-term downstream impacts that may result from a broad and growing set of risks. This position paper argues that measuring the indirect and secondary effects of AI will require expansion beyond static, single-turn approaches conducted in silico to include testing paradigms that can capture what actually materializes when people use AI technology in context. Specifically, we describe the need for data and methods that can facilitate contextual awareness and enable downstream interpretation and decision making about AI's secondary effects, and recommend requirements for a new ecosystem.
The Role of Individual User Differences in Interpretable and Explainable Machine Learning Systems
by Schwartz, Reva; Gleaves, Lydia P; Broniatowski, David A
in End users, Machine learning, System effectiveness
2020
There is increased interest in assisting non-expert audiences to effectively interact with machine learning (ML) tools and understand the complex output such systems produce. Here, we describe user experiments designed to study how individual skills and personality traits predict interpretability, explainability, and knowledge discovery from ML-generated model output. Our work relies on Fuzzy Trace Theory, a leading theory of how humans process numerical stimuli, to examine how different end users will interpret the output they receive while interacting with the ML system. While our sample was small, we found that interpretability (being able to make sense of system output) and explainability (understanding how that output was generated) were distinct aspects of user experience. Additionally, subjects were more able to interpret model output if they possessed individual traits that promote metacognitive monitoring and editing, associated with more detailed, verbatim processing of ML output. Finally, subjects who are more familiar with ML systems felt better supported by them and more able to discover new patterns in data; however, this did not necessarily translate to meaningful insights. Our work motivates the design of systems that explicitly take users' mental representations into account during the design process to more effectively support end user requirements.
Towards Trustworthy Artificial Intelligence for Equitable Global Health
by Qin, Hong; Wang, Xiaoqian; Effoduh, Jake Okechukwu
in AI ethics, Artificial intelligence, Bias
2023
Artificial intelligence (AI) can potentially transform global health, but algorithmic bias can exacerbate social inequities and disparity. Trustworthy AI entails the intentional design to ensure equity and mitigate potential biases. To advance trustworthy AI in global health, we convened a workshop on Fairness in Machine Intelligence for Global Health (FairMI4GH). The event brought together a global mix of experts from various disciplines, community health practitioners, policymakers, and more. Topics covered included managing AI bias in socio-technical systems, AI's potential impacts on global health, and balancing data privacy with transparency. Panel discussions examined the cultural, political, and ethical dimensions of AI in global health. FairMI4GH aimed to stimulate dialogue, facilitate knowledge transfer, and spark innovative solutions. Drawing from NIST's AI Risk Management Framework, it provided suggestions for handling AI risks and biases. The need to mitigate data biases from the research design stage, adopt a human-centered approach, and advocate for AI transparency was recognized. Challenges such as updating legal frameworks, managing cross-border data sharing, and motivating developers to reduce bias were acknowledged. The event emphasized the necessity of diverse viewpoints and multi-dimensional dialogue for creating a fair and ethical AI framework for equitable global health.
Relationships of shame, anger, and Gestalt resistances
1999
This is the first empirical study to investigate the patterns among shame, anger, and the Gestalt resistances. Shame is a subject that has long been in the background, but has gained attention in the last 30 years. When shame is not acknowledged or considered, anger often arises. This can lead to a variety of problems and, in some cases, pathology and violence. Shame is defended against or resisted in a variety of ways, as exhibited by the Gestalt resistances. Participants in this study included 623 subjects over the age of 18. The Gestalt Contact Styles Questionnaire Revised-2 was used to measure the resistance processes. The Internalized Shame Scale was used to measure shame, and the State-Trait Anger Expression Inventory was used to measure trait anger and anger expression. Results revealed that the higher the shame group, the greater the scores for the resistances in general, and especially for retroflection and deflection. Elevations for the high shame group were also found for egotism, projection, and introjection. However, for desensitization, the higher the shame group, the lower the resistance. Furthermore, males exhibited high projection, egotism, and especially desensitization scores relative to other resistances when compared with females. For trait anger, the higher the shame group, the higher the level of both types of trait anger. A pattern emerged such that the high shame group was particularly high on angry temperament whereas the low shame group was markedly low, actually showing a decrease on angry reaction. The higher the shame group, the higher the level of experienced anger, with scores more spread out on suppressed anger expressed inward but more clustered on anger expressed outward. However, controlled anger (prevention of the experience and expression of anger) was conversely lower as shame levels increased, with the high shame group being markedly lower. In addition, males showed markedly higher anger expressed inward than did females, while showing a dramatic decrease in anger controlled. When each Gestalt resistance scale was examined separately in relation to trait and expressed anger groups to see the interaction of anger and shame on each resistance, it was verified that, overall, high shame is consistent with high resistances and, likewise, high anger is also consistent with high resistances. However, there were no interactions between shame and either trait or expressed anger. Applications of these findings would be relevant to mental health professionals as well as parents. Often, when anger and resistances are exhibited, an underlying component of shame may be present. This is especially true for the resistances of retroflection and deflection, and the opposite holds for desensitization. If shame is present, suppressed anger would be high, whereas the control of anger would be quite low.
Dissertation