Catalogue Search | MBRL
Search Results
Explore the vast range of titles available.
4 result(s) for "Moshkov, Ivan"
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset
by Moshkov, Ivan; Toshniwal, Shubham; Gitman, Daria
in Datasets; Large language models; Mathematics education
2024
Recent work has shown the immense potential of synthetically generated datasets for training large language models (LLMs), especially for acquiring targeted skills. Current large-scale math instruction tuning datasets such as MetaMathQA (Yu et al., 2024) and MAmmoTH (Yue et al., 2024) are constructed using outputs from closed-source LLMs with commercially restrictive licenses. A key reason limiting the use of open-source LLMs in these data generation pipelines has been the wide gap between the mathematical skills of the best closed-source LLMs, such as GPT-4, and the best open-source LLMs. Building on the recent progress in open-source LLMs, our proposed prompting novelty, and some brute-force scaling, we construct OpenMathInstruct-1, a math instruction tuning dataset with 1.8M problem-solution pairs. The dataset is constructed by synthesizing code-interpreter solutions for GSM8K and MATH, two popular math reasoning benchmarks, using the recently released and permissively licensed Mixtral model. Our best model, OpenMath-CodeLlama-70B, trained on a subset of OpenMathInstruct-1, achieves a score of 84.6% on GSM8K and 50.7% on MATH, which is competitive with the best gpt-distilled models. We release our code, models, and the OpenMathInstruct-1 dataset under a commercially permissive license.
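Since the record says the dataset is released under a commercially permissive license, a minimal sketch of how one might load it for inspection follows. It assumes the corpus is published on the Hugging Face Hub under the repo id "nvidia/OpenMathInstruct-1" with a "train" split, and the field names "question" and "generated_solution" are hypothetical placeholders; none of these specifics are confirmed by this record.

import datasets  # pip install datasets

# Load the instruction-tuning corpus; the repo id and split name are
# assumptions about the public release, not taken from this record.
ds = datasets.load_dataset("nvidia/OpenMathInstruct-1", split="train")

# Peek at one problem-solution pair; "question" and "generated_solution"
# are hypothetical field names standing in for the real schema.
example = ds[0]
print(example.get("question"))
print(example.get("generated_solution"))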
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data
2024
Mathematical reasoning continues to be a critical challenge in large language model (LLM) development with significant interest. However, most of the cutting-edge progress in mathematical reasoning with LLMs has become \\emph{closed-source} due to lack of access to training data. This lack of data access limits researchers from understanding the impact of different choices for synthesizing and utilizing the data. With the goal of creating a high-quality finetuning (SFT) dataset for math reasoning, we conduct careful ablation experiments on data synthesis using the recently released \\texttt{Llama3.1} family of models. Our experiments show that: (a) solution format matters, with excessively verbose solutions proving detrimental to SFT performance, (b) data generated by a strong teacher outperforms \\emph{on-policy} data generated by a weak student model, (c) SFT is robust to low-quality solutions, allowing for imprecise data filtering, and (d) question diversity is crucial for achieving data scaling gains. Based on these insights, we create the OpenMathInstruct-2 dataset, which consists of 14M question-solution pairs (\\(\\approx\\) 600K unique questions), making it nearly eight times larger than the previous largest open-source math reasoning dataset. Finetuning the \\texttt{Llama-3.1-8B-Base} using OpenMathInstruct-2 outperforms \\texttt{Llama3.1-8B-Instruct} on MATH by an absolute 15.9\\% (51.9\\% \\(\\rightarrow\\) 67.8\\%). Finally, to accelerate the open-source efforts, we release the code, the finetuned models, and the OpenMathInstruct-2 dataset under a commercially permissive license.
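The stated figures (14M pairs, roughly 600K unique questions) invite a simple sanity check: streaming the corpus and counting distinct questions. The repo id "nvidia/OpenMathInstruct-2" and the "problem" field name are assumptions about the release, not confirmed by this record; streaming keeps memory bounded by the set of unique question strings.

import datasets

# Stream the corpus so the ~14M rows need not be fully downloaded;
# the repo id and split name are assumptions about the public release.
ds = datasets.load_dataset("nvidia/OpenMathInstruct-2", split="train", streaming=True)

unique_questions = set()
total = 0
for row in ds:
    total += 1
    unique_questions.add(row["problem"])  # field name is an assumption

print(total, len(unique_questions))  # expect roughly 14M and ~600K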
Nemotron-4 340B Technical Report
2024
We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows distribution, modification, and use of the models and their outputs. These models perform competitively with open-access models on a wide range of evaluation benchmarks, and were sized to fit on a single DGX H100 with 8 GPUs when deployed in FP8 precision. We believe that the community can benefit from these models in various research studies and commercial applications, especially for generating synthetic data to train smaller language models. Notably, over 98% of the data used in our model alignment process is synthetically generated, showcasing the effectiveness of these models in generating synthetic data. To further support open research and facilitate model development, we are also open-sourcing the synthetic data generation pipeline used in our model alignment process.
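The sizing claim admits a quick back-of-envelope check. The sketch below assumes FP8 stores one byte per parameter and that each H100 carries 80 GB of HBM, and it ignores activation and KV-cache overhead, so it is a rough plausibility check rather than a deployment calculation.

# Rough check that 340B FP8 weights fit on a single 8-GPU DGX H100.
params = 340e9                  # 340B parameters
weight_gb = params * 1 / 1e9    # FP8: ~1 byte per parameter -> ~340 GB
node_gb = 8 * 80                # 8 x H100 at 80 GB HBM each -> 640 GB
print(weight_gb, node_gb)       # 340.0 640 -> weights fit, with headroom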
A Morpho-Proteomic Atlas of Mitosis at Sub-Minute Resolution
2025
Precise spatiotemporal protein organization is critical for fundamental biological processes, including cell division [1,2]. Indeed, aberrant mitosis and mitotic factors are involved in diverse diseases, including various cancers [3,4], Alzheimer's disease [5], and rare diseases [6]. During mitosis, complex spatial rearrangements and regulation ensure the accurate separation of replicated sister chromatids to produce genetically identical daughter cells [7–9]. Previous studies employed high-throughput methodologies to follow specific proteins during mitosis [10–15]. Still, a temporally refined, systems-level approach capable of monitoring morphological and proteomic changes throughout mitosis has been lacking. Here, we achieved unprecedented resolution by phenotypically decomposing mitosis into 40 subsections of a regression plane for proteomic analysis using deep learning and regression techniques. Our deep visual proteomics (DVP) workflow [16] revealed rapid, dynamic proteomic changes throughout mitosis. We quantified 4,350 proteins with high confidence, demonstrating that 147 show significant dynamic abundance changes during mitotic progression. Clustering revealed coordinated patterns of protein regulation, while network analysis uncovered tight regulation of core cell cycle proteins and a link between cell cycle and cancer-linked mutations. Immunofluorescence validated abundance changes and linked previously uncharacterised proteins, such as C19orf53, to mitosis. To facilitate data navigation, we developed Mito-Omix, a user-friendly online platform that integrates intricate morphological and molecular data. Our morphological and proteomic dataset spans mitosis at high resolution, providing a rich resource for understanding healthy and aberrant cell division.