Catalogue Search | MBRL

Automatically Labeling Low Quality Content on Wikipedia by Leveraging Patterns in Editing Behaviors

by Asthana, Sumit , Banovic, Nikola , Tobar Thommel, Sabrina in Algorithms , Editing , Encyclopedias

2021

Wikipedia articles aim to be definitive sources of encyclopedic content. Yet only 0.6% of Wikipedia articles are rated high quality on its quality scale, owing to an insufficient number of Wikipedia editors and an enormous number of articles. Supervised Machine Learning (ML) quality improvement approaches that can automatically identify and fix content issues rely on manual labels of individual Wikipedia sentence quality. However, current labeling approaches are tedious and produce noisy labels. Here, we propose an automated labeling approach that identifies the semantic category (e.g., adding citations, clarifications) of historic Wikipedia edits and uses the modified sentences prior to the edit as examples that require that semantic improvement. Highest-rated article sentences are examples that no longer need semantic improvements. We show that training existing sentence quality classification algorithms on our labels improves their performance compared to training them on existing labels. Our work shows that editing behaviors of Wikipedia editors provide better labels than labels generated by crowdworkers who lack the context to make judgments that the editors would agree with.

Paper

Maintaining the efficiency of open production systems at scale: A case study of wikipedia

by Halfaker, Aaron in Computer science , Social research

2013

This dissertation represents an exploration of the function and failures of critical subsystems in open production communities with Wikipedia as a case study. Specifically, I explore the nature of rejection via Wikipedia's informal, post-hoc quality control system and identify a consistent ownership bias that undermines Wikipedia's ethos of openness. I also quantify an inherent trade-off between the speed and efficiency of quality control in Wikipedia and the motivation of rejected contributors -- especially new editors. I then proceed to show how Wikipedia's shifting focus on quality control and formal process has led to a dramatic decline in the rate of retention of desirable new editors that threatens the long-term viability of the project. In light of these results, I present studies of two experimental software systems intended to explore potential solutions to this steady decrease in participation. First, I draw on social learning theory to evaluate the effectiveness of a new mode of peripheral participation through reader-submitted feedback. I experimentally demonstrate effective strategies for increasing the rate of contributions without decreasing quality and argue for efficient moderation support in order to make quality control worth volunteer time spent away from editing the encyclopedia. Next, I describe the design and three-month field study of a new intelligent software system intended to both efficiently support socialization practices in Wikipedia and bring visibility to the systemic problems that lead to declining newcomer retention. I show evidence that the system works in both regards: critical newcomer socialization activities are made dramatically more efficient and users of the system reflect openly on the breakdowns in Wikipedia's quality control processes. This work has already had an impact within the Wikipedia community and in directing the strategy employed by the Wikimedia Foundation in designing and evaluating new software for Wikipedia editors.

Dissertation

Summaries, Highlights, and Action items: Design, implementation and evaluation of an LLM-powered meeting recap system

by He, Pengcheng , Hilleli, Sagih , Asthana, Sumit in Cognition , Effectiveness , Meetings

2025

Meetings play a critical infrastructural role in coordinating work. The recent surge of hybrid and remote meetings in computer-mediated spaces has led to new problems (e.g., more time spent in less engaging meetings) and new opportunities (e.g., automated transcription/captioning and recap support). Advances in dialogue summarization offer the potential for improving post-meeting experiences, but fixed-length summaries often fail to meet diverse needs, such as quick overviews or detailed insights. To address these gaps, we use cognitive science and discourse theories to conceptualize two recap designs: important highlights and a structured, hierarchical minutes view, targeting complementary recap needs. We operationalize these representations into high-fidelity prototypes using dialogue summarization. Finally, we evaluate the representations' effectiveness with seven users in the context of their work meetings at Microsoft. Our results show both recap types are valuable in different contexts, enabling collaboration through discussions and consensus-building. Exploring the meaning of users adding, editing, and deleting from recaps suggests varying alignment for using these actions to improve AI-recap. Our design implications, such as incorporating organizational artifacts (e.g., linking presentations) in recaps and personalizing context, advance the discourse of effective recap designs for organizational work and support past results from cognition studies.

Paper

Effects of algorithmic flagging on fairness: quasi-experimental evidence from Wikipedia

by TeBlunthuis, Nathan , Hill, Benjamin Mako , Halfaker, Aaron in Algorithms , Encyclopedias , Moderators

2026

Online community moderators often rely on social signals such as whether or not a user has an account or a profile page as clues that users may cause problems. Reliance on these clues can lead to overprofiling bias when moderators focus on these signals but overlook the misbehavior of others. We propose that algorithmic flagging systems deployed to improve the efficiency of moderation work can also make moderation actions more fair to these users by reducing reliance on social signals and making norm violations by everyone else more visible. We analyze moderator behavior in Wikipedia as mediated by RCFilters, a system which displays social signals and algorithmic flags, and estimate the causal effect of being flagged on moderator actions. We show that algorithmically flagged edits are reverted more often, especially those by established editors with positive social signals, and that flagging decreases the likelihood that moderation actions will be undone. Our results suggest that algorithmic flagging systems can lead to increased fairness in some contexts but that the relationship is complex and contingent.

Paper

Summaries, Highlights, and Action items: Design, implementation and evaluation of an LLM-powered meeting recap system

by He, Pengcheng , Asthana, Sumit , Hilleli, Sagih in Context , Design analysis , Group dynamics

2024

Meetings play a critical infrastructural role in the coordination of work. In recent years, due to the shift to hybrid and remote work, more meetings are moving to online computer-mediated spaces. This has led to new problems (e.g., more time spent in less engaging meetings) and new opportunities (e.g., automated transcription/captioning and recap support). Recent advances in large language models (LLMs) for dialogue summarization have the potential to improve the experience of meetings by reducing individuals' meeting load and increasing the clarity and alignment of meeting outputs. Despite this potential, they face technological limitations due to long transcripts and an inability to capture diverse recap needs based on the user's context. To address these gaps, we design, implement, and evaluate a meeting recap system in context. We first conceptualize two salient recap representations: important highlights and a structured, hierarchical minutes view. We develop a system to operationalize the representations with dialogue summarization as its building block. Finally, we evaluate the effectiveness of the system with seven users in the context of their work meetings. Our findings show promise in using LLM-based dialogue summarization for meeting recap and the need for both representations in different contexts. However, we find that LLM-based recap still lacks an understanding of what's personally relevant to participants and can miss important details, and that misattributions can be detrimental to group dynamics. We identify collaboration opportunities, such as a shared recap document, that a high-quality recap enables. We report on implications for designing AI systems to partner with users to learn and improve from natural interactions to overcome the limitations related to personal relevance and summarization quality.

Paper

Wikibench: Community-Driven Data Curation for AI Evaluation on Wikipedia

by Kim, Jiwoo , Wu, Meng-Hsin , Wu, Tongshuang in Datasets , Encyclopedias

2024

AI tools are increasingly deployed in community contexts. However, datasets used to evaluate AI are typically created by developers and annotators outside a given community, which can yield misleading conclusions about AI performance. How might we empower communities to drive the intentional design and curation of evaluation datasets for AI that impacts them? We investigate this question on Wikipedia, an online community with multiple AI-based content moderation tools deployed. We introduce Wikibench, a system that enables communities to collaboratively curate AI evaluation datasets, while navigating ambiguities and differences in perspective through discussion. A field study on Wikipedia shows that datasets curated using Wikibench can effectively capture community consensus, disagreement, and uncertainty. Furthermore, study participants used Wikibench to shape the overall data curation process, including refining label definitions, determining data inclusion criteria, and authoring data statements. Based on our findings, we propose future directions for systems that support community-driven data curation.

Paper

ORES: Lowering Barriers with Participatory Machine Learning in Wikipedia

by Geiger, R Stuart , Halfaker, Aaron in Algorithms , Artificial intelligence , Classifiers

2020

Algorithmic systems, from rule-based bots to machine learning classifiers, have a long history of supporting the essential work of content moderation and other curation work in peer production projects. From counter-vandalism to task routing, basic machine prediction has allowed open knowledge projects like Wikipedia to scale to the largest encyclopedia in the world, while maintaining quality and consistency. However, conversations about how quality control should work and what role algorithms should play have generally been led by the expert engineers who have the skills and resources to develop and modify these complex algorithmic systems. In this paper, we describe ORES: an algorithmic scoring service that supports real-time scoring of wiki edits using multiple independent classifiers trained on different datasets. ORES decouples several activities that have typically all been performed by engineers: choosing or curating training data, building models to serve predictions, auditing predictions, and developing interfaces or automated agents that act on those predictions. This meta-algorithmic system was designed to open up socio-technical conversations about algorithms in Wikipedia to a broader set of participants. In this paper, we discuss the theoretical mechanisms of social change ORES enables and detail case studies in participatory machine learning around ORES from the 5 years since its deployment.

Paper

Characterizing Deep Research: A Benchmark and Formal Definition

by Goyal, Navin , Midigeshi, Sukruta , Sharma, Amit in Benchmarks , Fanout , Performance evaluation

2025

Information tasks such as writing surveys or analytical reports require complex search and reasoning, and have recently been grouped under the umbrella of deep research -- a term also adopted by recent models targeting these capabilities. Despite growing interest, the scope of the deep research task remains underdefined and its distinction from other reasoning-intensive problems is poorly understood. In this paper, we propose a formal characterization of the deep research (DR) task and introduce a benchmark to evaluate the performance of DR systems. We argue that the core defining feature of deep research is not the production of lengthy report-style outputs, but rather the high fan-out over concepts required during the search process, i.e., broad and reasoning-intensive exploration. To enable objective evaluation, we define DR using an intermediate output representation that encodes key claims uncovered during search, separating the reasoning challenge from surface-level report generation. Based on this formulation, we propose a diverse benchmark, LiveDRBench, with 100 challenging tasks over scientific topics (e.g., datasets, materials discovery, prior art search) and public interest events (e.g., flight incidents, movie awards). Across state-of-the-art DR systems, F1 score ranges between 0.02 and 0.72 for any sub-category. OpenAI's model performs the best with an overall F1 score of 0.55. Analysis of reasoning traces reveals the distribution over the number of referenced sources, branching, and backtracking events executed by current DR systems, motivating future directions for improving their search mechanisms and grounding capabilities. The benchmark is available at https://github.com/microsoft/LiveDRBench.

Paper

Building automated vandalism detection tools for Wikidata

by Sarabadani, Amir , Taraborelli, Dario , Halfaker, Aaron in Damage detection , Knowledge base , Structural damage

2017

Wikidata, like Wikipedia, is a knowledge base that anyone can edit. This open collaboration model is powerful in that it reduces barriers to participation and allows a large number of people to contribute. However, it exposes the knowledge base to the risk of vandalism and low-quality contributions. In this work, we build on past work detecting vandalism in Wikipedia to detect vandalism in Wikidata. This work is novel in that identifying damaging changes in a structured knowledge-base requires substantially different feature engineering work than in a text-based wiki like Wikipedia. We also discuss the utility of these classifiers for reducing the overall workload of vandalism patrollers in Wikidata. We describe a machine classification strategy that is able to catch 89% of vandalism while reducing patrollers' workload by 98%, by drawing lightly from contextual features of an edit and heavily from the characteristics of the user making the edit.

Paper

ORES-Inspect: A technology probe for machine learning audits on enwiki

by Hagen, Lauren , Terveen, Loren , Levonian, Zachary in Encyclopedias , Machine learning , Minerals

2024

Auditing the machine learning (ML) models used on Wikipedia is important for ensuring that vandalism-detection processes remain fair and effective. However, conducting audits is challenging because stakeholders have diverse priorities and assembling evidence for a model's [in]efficacy is technically complex. We designed an interface to enable editors to learn about and audit the performance of the ORES edit quality model. ORES-Inspect is an open-source web tool and a provocative technology probe for researching how editors think about auditing the many ML models used on Wikipedia. We describe the design of ORES-Inspect and our plans for further research with this system.

Paper
