Catalogue Search | MBRL

Embedding And Clustering Your Data Can Improve Contrastive Pretraining

by Merrick, Luke in Algorithms , Cluster analysis , Clustering

2024

Recent studies of large-scale contrastive pretraining in the text embedding domain show that using single-source minibatches, rather than mixed-source minibatches, can substantially improve overall model accuracy. In this work, we explore extending training data stratification beyond source granularity by leveraging a pretrained text embedding model and the classic k-means clustering algorithm to further split training data apart by the semantic clusters within each source. Experimentally, we observe a notable increase in NDCG@10 when pretraining a BERT-based text embedding model on query-passage pairs from the MSMARCO passage retrieval dataset. Additionally, we conceptually connect our clustering approach to both the Topic Aware Sampling (TAS) aspect of the TAS-B methodology and the nearest-neighbor-based hard-negative mining aspect of the ANCE methodology and discuss how this unified view motivates future lines of research on the organization of contrastive pretraining data.

Paper

Share this book

Add to My Shelf

Randomized Ablation Feature Importance

by Merrick, Luke in Ablation

2019

Given a model \\(f\\) that predicts a target \\(y\\) from a vector of input features \\(\\pmb{x} = x_1, x_2, \\ldots, x_M\\), we seek to measure the importance of each feature with respect to the model's ability to make a good prediction. To this end, we consider how (on average) some measure of goodness or badness of prediction (which we term \"loss\" \\(\\ell\\)), changes when we hide or ablate each feature from the model. To ablate a feature, we replace its value with another possible value randomly. By averaging over many points and many possible replacements, we measure the importance of a feature on the model's ability to make good predictions. Furthermore, we present statistical measures of uncertainty that quantify how confident we are that the feature importance we measure from our finite dataset and finite number of ablations is close to the theoretical true importance value.

Paper

Share this book

Add to My Shelf

The Explanation Game: Explaining Machine Learning Models Using Shapley Values

by Merrick, Luke , Taly, Ankur in Algorithms , Artificial intelligence , Cognitive psychology

2020

A number of techniques have been proposed to explain a machine learning model's prediction by attributing it to the corresponding input features. Popular among these are techniques that apply the Shapley value method from cooperative game theory. While existing papers focus on the axiomatic motivation of Shapley values, and efficient techniques for computing them, they offer little justification for the game formulations used, and do not address the uncertainty implicit in their methods' outputs. For instance, the popular SHAP algorithm's formulation may give substantial attributions to features that play no role in the model. In this work, we illustrate how subtle differences in the underlying game formulations of existing methods can cause large differences in the attributions for a prediction. We then present a general game formulation that unifies existing methods, and enables straightforward confidence intervals on their attributions. Furthermore, it allows us to interpret the attributions as contrastive explanations of an input relative to a distribution of reference inputs. We tie this idea to classic research in cognitive psychology on contrastive explanations, and propose a conceptual framework for generating and interpreting explanations for ML models, called formulate, approximate, explain (FAE). We apply this framework to explain black-box models trained on two UCI datasets and a Lending Club dataset.

Paper

Share this book

Add to My Shelf

Arctic-Embed 2.0: Multilingual Retrieval Without Compromise

by Nuti, Gaurav , Merrick, Luke , Yu, Puxuan in Embedding , Questions , Retrieval

2024

This paper presents the training methodology of Arctic-Embed 2.0, a set of open-source text embedding models built for accurate and efficient multilingual retrieval. While prior works have suffered from degraded English retrieval quality, Arctic-Embed 2.0 delivers competitive retrieval quality on multilingual and English-only benchmarks, and supports Matryoshka Representation Learning (MRL) for efficient embedding storage with significantly lower compressed quality degradation compared to alternatives. We detail the design and implementation, presenting several important open research questions that arose during model development. We conduct experiments exploring these research questions and include extensive discussion aimed at fostering further discussion in this field.

Paper

Share this book

Add to My Shelf

DatBench: Discriminative, Faithful, and Efficient VLM Evaluations

by Mentzer, Kaleigh , Gaza, Bogdan , Deng, Alvin in Computing costs , Datasets , Failure modes

2026

Empirical evaluation serves as the primary compass guiding research progress in foundation models. Despite a large body of work focused on training frontier vision-language models (VLMs), approaches to their evaluation remain nascent. To guide their maturation, we propose three desiderata that evaluations should satisfy: (1) faithfulness to the modality and application, (2) discriminability between models of varying quality, and (3) efficiency in compute. Through this lens, we identify critical failure modes that violate faithfulness and discriminability, misrepresenting model capabilities: (i) multiple-choice formats reward guessing, poorly reflect downstream use cases, and saturate early as models improve; (ii) blindly solvable questions, which can be answered without images, constitute up to 70% of some evaluations; and (iii) mislabeled or ambiguous samples compromise up to 42% of examples in certain datasets. Regarding efficiency, the computational burden of evaluating frontier models has become prohibitive: by some accounts, nearly 20% of development compute is devoted to evaluation alone. Rather than discarding existing benchmarks, we curate them via transformation and filtering to maximize fidelity and discriminability. We find that converting multiple-choice questions to generative tasks reveals sharp capability drops of up to 35%. In addition, filtering blindly solvable and mislabeled samples improves discriminative power while simultaneously reducing computational cost. We release DatBench-Full, a cleaned evaluation suite of 33 datasets spanning nine VLM capabilities, and DatBench, a discriminative subset that achieves 13x average speedup (up to 50x) while closely matching the discriminative power of the original datasets. Our work outlines a path toward evaluation practices that are both rigorous and sustainable as VLMs continue to scale.

Paper

Share this book

Add to My Shelf

Luxical: High-Speed Lexical-Dense Text Embeddings

by Mentzer, Kaleigh , Gaza, Bogdan , Deng, Alvin in Classification , Clustering , Embedding

2025

Frontier language model quality increasingly hinges on our ability to organize web-scale text corpora for training. Today's dominant tools trade off speed and flexibility: lexical classifiers (e.g., FastText) are fast but limited to producing classification output scores, while the vector-valued outputs of transformer text embedding models flexibly support numerous workflows (e.g., clustering, classification, and retrieval) but are computationally expensive to produce. We introduce Luxical, a library for high-speed \"lexical-dense\" text embeddings that aims to recover the best properties of both approaches for web-scale text organization. Luxical combines sparse TF--IDF features, a small ReLU network, and a knowledge distillation training regimen to approximate large transformer embedding models at a fraction of their operational cost. In this technical report, we describe the Luxical architecture and training objective and evaluate a concrete Luxical model in two disparate applications: a targeted webcrawl document retrieval test and an end-to-end language model data curation task grounded in text classification. In these tasks we demonstrate speedups ranging from 3x to 100x over varying-sized neural baselines, and comparable to FastText model inference during the data curation task. On these evaluations, the tested Luxical model illustrates favorable compute/quality trade-offs for large-scale text organization, matching the quality of neural baselines. Luxical is available as open-source software at https://github.com/datologyai/luxical.

Paper

Share this book

Add to My Shelf

The Finetuner's Fallacy: When to Pretrain with Your Finetuning Data

by Kolter, J Zico , Jason Chan Lee , Gaza, Bogdan in Scaling laws

2026

Real-world model deployments demand strong performance on narrow domains where data is often scarce. Typically, practitioners finetune models to specialize them, but this risks overfitting to the domain and forgetting general knowledge. We study a simple strategy, specialized pretraining (SPT), where a small domain dataset, typically reserved for finetuning, is repeated starting from pretraining as a fraction of the total tokens. Across three specialized domains (ChemPile, MusicPile, and ProofPile), SPT improves domain performance and preserves general capabilities after finetuning compared to standard pretraining. In our experiments, SPT reduces the pretraining tokens needed to reach a given domain performance by up to 1.75x. These gains grow when the target domain is underrepresented in the pretraining corpus: on domains far from web text, a 1B SPT model outperforms a 3B standard pretrained model. Beyond these empirical gains, we derive overfitting scaling laws to guide practitioners in selecting the optimal domain-data repetition for a given pretraining compute budget. Our observations reveal the finetuner's fallacy: while finetuning may appear to be the cheapest path to domain adaptation, introducing specialized domain data during pretraining stretches its utility. SPT yields better specialized domain performance (via reduced overfitting across repeated exposures) and better general domain performance (via reduced forgetting during finetuning), ultimately achieving stronger results with fewer parameters and less total compute when amortized over inference. To get the most out of domain data, incorporate it as early in training as possible.

Paper

Share this book

Add to My Shelf

ÜberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset

by Mentzer, Kaleigh , Gaza, Bogdan , Deng, Alvin in Allocations , Datasets , English language

2026

Multilinguality is a core capability for modern foundation models, yet training high-quality multilingual models remains challenging due to uneven data availability across languages. A further challenge is the performance interference that can arise from joint multilingual training, commonly referred to as the \"curse of multilinguality\". We study multilingual data curation across thirteen languages and find that many reported regressions are not inherent to multilingual scaling but instead stem from correctable deficiencies in data quality and composition rather than fundamental capacity limits. In controlled bilingual experiments, improving data quality for any single language benefits others: curating English improves non-English performance in 12 of 13 languages, while curating non-English yields reciprocal improvements in English. Bespoke per-language curation produces substantially larger within-language improvements. Extending these findings to large-scale general-purpose training mixtures, we show that curated multilingual allocations comprising under 8% of total tokens remain remarkably effective. We operationalize this approach within an effort that produced a 20T-token pretraining corpus derived entirely from public sources. Models with 3B and 8B parameters trained on a 1T-token random subset achieve competitive multilingual accuracy with 4-10x fewer training FLOPs than strong public baselines, establishing a new Pareto frontier in multilingual performance versus compute. Moreover, these benefits extend to frontier model scale: the 20T-token corpus served as part of the pretraining dataset for Trinity Large (400B/A13B), which exhibits strong multilingual performance relative to its training FLOPs. These results show that targeted, per-language data curation mitigates multilingual interference and enables compute-efficient multilingual scaling.

Paper

Share this book

Add to My Shelf

BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining

by Mentzer, Kaleigh , Gaza, Bogdan , Deng, Alvin in Datasets , Large language models , Synthetic data

2025

Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a promising paradigm for pushing the frontier of performance. Despite this, the factors affecting synthetic data quality remain poorly understood. In this work, we introduce BeyondWeb, a synthetic data generation framework that produces high-quality synthetic data for pretraining. BeyondWeb significantly extends the capabilities of traditional web-scale datasets, outperforming state-of-the-art synthetic pretraining datasets such as Cosmopedia and Nemotron-CC's high-quality synthetic subset (Nemotron-Synth) by up to 5.1 percentage points (pp) and 2.6pp, respectively, when averaged across a suite of 14 benchmark evaluations. It delivers up to 7.7x faster training than open web data and 2.7x faster than Nemotron-Synth. Remarkably, a 3B model trained for 180B tokens on BeyondWeb outperforms an 8B model trained for the same token budget on Cosmopedia. We also present several insights from BeyondWeb on synthetic data for pretraining: what drives its benefits, which data to rephrase and how, and the impact of model size and family on data quality. Overall, our work shows that there's no silver bullet for generating high-quality synthetic pretraining data. The best outcomes require jointly optimizing many factors, a challenging task that requires rigorous science and practical expertise. Naive approaches can yield modest improvements, potentially at great cost, while well-executed methods can yield transformative improvements, as exemplified by BeyondWeb.

Paper

Share this book

Add to My Shelf

Longitudinal observation and decline of neutralizing antibody responses in the three months following SARS-CoV-2 infection in humans

by Martinez-Nunez, Rocio , Edgeworth, Jonathan D. , Graham, Carl in 38/109 , 631/250/2152 , 631/250/254

2020

Antibody responses to SARS-CoV-2 can be detected in most infected individuals 10–15 d after the onset of COVID-19 symptoms. However, due to the recent emergence of SARS-CoV-2 in the human population, it is not known how long antibody responses will be maintained or whether they will provide protection from reinfection. Using sequential serum samples collected up to 94 d post onset of symptoms (POS) from 65 individuals with real-time quantitative PCR-confirmed SARS-CoV-2 infection, we show seroconversion (immunoglobulin (Ig)M, IgA, IgG) in >95% of cases and neutralizing antibody responses when sampled beyond 8 d POS. We show that the kinetics of the neutralizing antibody response is typical of an acute viral infection, with declining neutralizing antibody titres observed after an initial peak, and that the magnitude of this peak is dependent on disease severity. Although some individuals with high peak infective dose (ID 50 > 10,000) maintained neutralizing antibody titres >1,000 at >60 d POS, some with lower peak ID 50 had neutralizing antibody titres approaching baseline within the follow-up period. A similar decline in neutralizing antibody titres was observed in a cohort of 31 seropositive healthcare workers. The present study has important implications when considering widespread serological testing and antibody protection against reinfection with SARS-CoV-2, and may suggest that vaccine boosters are required to provide long-lasting protection. Neutralizing antibody responses of patients infected with SARS-CoV-2 peak at 3–4 weeks post onset of symptoms, then decline to low levels over the course of 3 months in some individuals.

Journal Article

Share this book

Add to My Shelf

Language Selector

MBRLGlobalSearch

Language Selector

Catalogue Search | MBRL

Search Results Heading

Explore the vast range of titles available.

MBRLSearchResults

MBRLHappinessMeter