Catalogue Search | MBRL
Explore the vast range of titles available.
14 result(s) for "Doshi, Parth"
Recent Advances in the Field of Artificial Intelligence for Precision Medicine in Patients with a Diagnosis of Metastatic Cutaneous Melanoma
by Higgins, Hayley; Shirini, Dorsa; Dercle, Laurent
in Artificial intelligence; Care and treatment; Complications and side effects
2023
Standard-of-care medical imaging techniques such as CT, MRI, and PET play a critical role in managing patients diagnosed with metastatic cutaneous melanoma. Advancements in artificial intelligence (AI) techniques, such as radiomics, machine learning, and deep learning, could revolutionize the use of medical imaging by enhancing individualized image-guided precision medicine approaches. In the present article, we will decipher how AI/radiomics could mine information from medical images, such as tumor volume, heterogeneity, and shape, to provide insights into cancer biology that can be leveraged by clinicians to improve patient care both in the clinic and in clinical trials. More specifically, we will detail the potential role of AI in enhancing detection/diagnosis, staging, treatment planning, treatment delivery, response assessment, treatment toxicity assessment, and monitoring of patients diagnosed with metastatic cutaneous melanoma. Finally, we will explore how these proof-of-concept results can be translated from bench to bedside by describing how the implementation of AI techniques can be standardized for routine adoption in clinical settings worldwide to predict outcomes with great accuracy, reproducibility, and generalizability in patients diagnosed with metastatic cutaneous melanoma.
Journal Article
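As a rough illustration of the kind of image-derived features the review above describes (tumor volume, heterogeneity, and shape), the sketch below computes a few toy radiomic descriptors from a 3D scan and a binary tumor mask with NumPy/SciPy. It is not the authors' pipeline; the inputs, voxel size, and choice of descriptors are assumptions for demonstration only.

    import numpy as np
    from scipy import ndimage

    def simple_radiomic_features(image, mask, voxel_volume_mm3=1.0):
        """Toy radiomic descriptors from a 3D scan and a binary tumor mask.

        image: 3D array of intensities (e.g., CT values); mask: 3D boolean
        array, True inside the segmented tumor. Illustrative only.
        """
        if mask.sum() == 0:
            raise ValueError("empty tumor mask")
        tumor_voxels = image[mask]

        # Volume: number of segmented voxels times the physical voxel volume.
        volume = mask.sum() * voxel_volume_mm3

        # Heterogeneity proxies: spread and histogram entropy of intensities.
        intensity_std = float(tumor_voxels.std())
        counts, _ = np.histogram(tumor_voxels, bins=32)
        p = counts[counts > 0] / counts.sum()
        entropy = float(-np.sum(p * np.log2(p)))

        # Shape proxy: fraction of tumor voxels lying on the surface
        # (mask minus its morphological erosion).
        surface_voxels = mask.sum() - ndimage.binary_erosion(mask).sum()
        surface_to_volume = float(surface_voxels) / float(mask.sum())

        return {
            "volume_mm3": float(volume),
            "intensity_std": intensity_std,
            "intensity_entropy": entropy,
            "surface_to_volume": surface_to_volume,
        }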
PET/CT and SPECT/CT Imaging of HER2-Positive Breast Cancer
by Higgins, Hayley; Gulati, Amit; Shirini, Dorsa
in Breast cancer; Cancer therapies; Cell division
2023
HER2 (Human Epidermal Growth Factor Receptor 2)-positive breast cancer is characterized by amplification of the HER2 gene and is associated with more aggressive tumor growth, increased risk of metastasis, and poorer prognosis when compared to other subtypes of breast cancer. HER2 expression is therefore a critical tumor feature that can be used to diagnose and treat breast cancer. Moving forward, advances in HER2 in vivo imaging, involving the use of techniques such as positron emission tomography (PET) and single-photon emission computed tomography (SPECT), may allow for a greater role for HER2 status in guiding the management of breast cancer patients. This will apply both to patients who are HER2-positive and those who have limited-to-minimal immunohistochemical HER2 expression (HER2-low), with imaging ultimately helping clinicians determine the size and location of tumors. Additionally, PET and SPECT could help evaluate effectiveness of HER2-targeted therapies, such as trastuzumab or pertuzumab for HER2-positive cancers, and specially modified antibody drug conjugates (ADC), such as trastuzumab-deruxtecan, for HER2-low variants. This review will explore the current and future role of HER2 imaging in personalizing the care of patients diagnosed with breast cancer.
Journal Article
OSM vs HD Maps: Map Representations for Trajectory Prediction
2023
While High Definition (HD) Maps have long been favored for their precise depictions of static road elements, their accessibility constraints and susceptibility to rapid environmental changes impede the widespread deployment of autonomous driving, especially in the motion forecasting task. In this context, we propose to leverage OpenStreetMap (OSM) as a promising alternative to HD Maps for long-term motion forecasting. The contributions of this work are threefold: firstly, we extend the application of OSM to long-horizon forecasting, doubling the forecasting horizon compared to previous studies. Secondly, through an expanded receptive field and the integration of intersection priors, our OSM-based approach exhibits competitive performance, narrowing the gap with HD Map-based models. Lastly, we conduct an exhaustive context-aware analysis, providing deeper insights into motion forecasting across diverse scenarios, as well as class-aware comparisons. This research not only advances long-term motion forecasting with coarse map representations but also offers a potentially scalable solution within the domain of autonomous driving.
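One way to picture the coarse-map input the abstract describes is to rasterize OSM road centerlines around the agent into a bird's-eye-view grid that a forecasting model could consume. The sketch below is a generic illustration under that assumption, not the paper's actual input pipeline; `polylines`, the grid extent, and the resolution are hypothetical.

    import numpy as np

    def rasterize_osm_context(polylines, center_xy, extent_m=100.0, resolution_m=0.5):
        """Rasterize coarse OSM road centerlines around an agent into a BEV grid.

        polylines: list of (N, 2) arrays of road-centerline points already
        projected to local metric coordinates (hypothetical preprocessing).
        Returns an (H, W) occupancy grid centered on the agent.
        """
        size = int(2 * extent_m / resolution_m)
        grid = np.zeros((size, size), dtype=np.float32)
        center = np.asarray(center_xy, dtype=np.float32)

        for line in polylines:
            local = np.asarray(line, dtype=np.float32) - center  # agent-centered frame
            for (x0, y0), (x1, y1) in zip(local[:-1], local[1:]):
                # Densely sample each segment and mark the cells it covers.
                n = max(int(np.hypot(x1 - x0, y1 - y0) / resolution_m), 1)
                xs = np.linspace(x0, x1, n)
                ys = np.linspace(y0, y1, n)
                cols = ((xs + extent_m) / resolution_m).astype(int)
                rows = ((ys + extent_m) / resolution_m).astype(int)
                keep = (cols >= 0) & (cols < size) & (rows >= 0) & (rows < size)
                grid[rows[keep], cols[keep]] = 1.0

        # The grid can be stacked with agent-history features as model input.
        return grid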
Glycosylated Antibiotics: New Promising Bacterial Efflux Pumps Inhibitors
2021
Antimicrobial resistance is a major concern: bacteria have evolved mechanisms to overcome the action of antibiotics. One main resistance mechanism bacteria have developed is the pumping of antibiotics out of bacterial cells by transmembrane transporter proteins known as efflux pumps.
Efflux pump inhibitors (EPIs) are small molecules that obstruct the binding sites and structural assembly of efflux pumps, disabling their normal function and thereby overcoming efflux-mediated resistance. The new EPIs examined in this study were created by modifying the chemical structure of common antibiotics, including Ampicillin, Penicillin, Chloramphenicol, Ciprofloxacin, and Tetracycline: each antibiotic was glycosylated by adding an N-acetylglucosamine moiety to an acceptor OH group. To test the effectiveness of the new EPIs in inhibiting the AcrB-TolC and MexA-OprM efflux pumps, ADME properties of all glycosylated antibiotics were assessed by applying Lipinski's rule of five, and docking and simulation studies were carried out.
Docked glycosylated tetracycline gave the highest binding energy in the active sites of both pumps: −9.4 against AcrB and −8.8 against MexA. The simulation study confirmed the binding of glycosylated tetracycline in the active sites of both pumps, as well as its stability during the biological dynamics of both pumps (opening and closing of channels).
Validating these results would require a long simulation time of about 50 ns or more, which was not feasible due to cost limitations; nevertheless, the newly glycosylated antibiotics show promising results that may make them eligible as drug candidates to overcome bacterial resistance.
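The ADME screen mentioned above, Lipinski's rule of five, is straightforward to reproduce for any candidate molecule. The sketch below uses RDKit's standard descriptors; it is a generic illustration rather than the study's workflow, and the caller would supply the SMILES string of the glycosylated antibiotic of interest (not reproduced here).

    from rdkit import Chem
    from rdkit.Chem import Descriptors, Lipinski

    def lipinski_rule_of_five(smiles):
        """Check a candidate molecule against Lipinski's rule of five."""
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            raise ValueError("could not parse SMILES")

        props = {
            "mol_weight": Descriptors.MolWt(mol),            # rule: <= 500 Da
            "logp": Descriptors.MolLogP(mol),                # rule: <= 5
            "h_bond_donors": Lipinski.NumHDonors(mol),       # rule: <= 5
            "h_bond_acceptors": Lipinski.NumHAcceptors(mol), # rule: <= 10
        }
        violations = sum([
            props["mol_weight"] > 500,
            props["logp"] > 5,
            props["h_bond_donors"] > 5,
            props["h_bond_acceptors"] > 10,
        ])
        return props, violations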
Glycosylated Antibiotics: New Promising Bacterial Efflux Pumps Inhibitors
2021
Antimicrobial resistance is a major concern: bacteria have evolved mechanisms to overcome the action of antibiotics. One major resistance mechanism bacteria have developed is the pumping of antibiotics out of bacterial cells by transmembrane transporter proteins known as efflux pumps. To overcome such resistance, small molecules known as efflux pump inhibitors (EPIs) are created to inhibit efflux pump function. The new inhibitors are glycosylated antibiotics, created by adding an N-acetylglucosamine moiety to an acceptor hydroxyl group of five common antibiotics: Ampicillin, Amoxicillin, Chloramphenicol, Ciprofloxacin, and Tetracycline. To test the effectiveness of the new EPIs in inhibiting the AcrB-TolC and MexA-OprM efflux pumps, docking and simulation studies were applied. Docked glycosylated tetracycline gave the highest binding energy in the active sites of both pumps: -9.4 against AcrB and -8.8 against MexA. The simulation study confirmed the binding of glycosylated tetracycline in the active sites of both pumps, as well as its stability during the biological dynamics of both pumps (opening and closing of channels). However, to limit cost, this study used a short simulation time of 2 ns, which prevented the protein-ligand complexes from reaching a plateau. Competing Interest Statement: The authors have declared no competing interest.
DatBench: Discriminative, Faithful, and Efficient VLM Evaluations
2026
Empirical evaluation serves as the primary compass guiding research progress in foundation models. Despite a large body of work focused on training frontier vision-language models (VLMs), approaches to their evaluation remain nascent. To guide their maturation, we propose three desiderata that evaluations should satisfy: (1) faithfulness to the modality and application, (2) discriminability between models of varying quality, and (3) efficiency in compute. Through this lens, we identify critical failure modes that violate faithfulness and discriminability, misrepresenting model capabilities: (i) multiple-choice formats reward guessing, poorly reflect downstream use cases, and saturate early as models improve; (ii) blindly solvable questions, which can be answered without images, constitute up to 70% of some evaluations; and (iii) mislabeled or ambiguous samples compromise up to 42% of examples in certain datasets. Regarding efficiency, the computational burden of evaluating frontier models has become prohibitive: by some accounts, nearly 20% of development compute is devoted to evaluation alone. Rather than discarding existing benchmarks, we curate them via transformation and filtering to maximize fidelity and discriminability. We find that converting multiple-choice questions to generative tasks reveals sharp capability drops of up to 35%. In addition, filtering blindly solvable and mislabeled samples improves discriminative power while simultaneously reducing computational cost. We release DatBench-Full, a cleaned evaluation suite of 33 datasets spanning nine VLM capabilities, and DatBench, a discriminative subset that achieves 13x average speedup (up to 50x) while closely matching the discriminative power of the original datasets. Our work outlines a path toward evaluation practices that are both rigorous and sustainable as VLMs continue to scale.
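The filtering idea for blindly solvable questions can be sketched in a few lines: prompt a model with the question text alone and drop samples it still answers correctly. The helper `answer_without_image` below is hypothetical, and the exact rule (here, requiring the blind model to fail at least once across several trials) is an assumption, not necessarily the released DatBench procedure.

    from typing import Callable, Iterable, List

    def filter_blindly_solvable(
        samples: Iterable[dict],
        answer_without_image: Callable[[str], str],
        n_trials: int = 3,
    ) -> List[dict]:
        """Drop VQA samples a model answers correctly without seeing the image.

        answer_without_image: hypothetical callable that prompts the model with
        the question text only. Each sample is assumed to be a dict with
        "question" and "answer" keys.
        """
        kept = []
        for sample in samples:
            blind_correct = sum(
                answer_without_image(sample["question"]).strip().lower()
                == sample["answer"].strip().lower()
                for _ in range(n_trials)
            )
            # Keep the sample only if the blind model fails at least once.
            if blind_correct < n_trials:
                kept.append(sample)
        return kept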
Luxical: High-Speed Lexical-Dense Text Embeddings
2025
Frontier language model quality increasingly hinges on our ability to organize web-scale text corpora for training. Today's dominant tools trade off speed and flexibility: lexical classifiers (e.g., FastText) are fast but limited to producing classification output scores, while the vector-valued outputs of transformer text embedding models flexibly support numerous workflows (e.g., clustering, classification, and retrieval) but are computationally expensive to produce. We introduce Luxical, a library for high-speed "lexical-dense" text embeddings that aims to recover the best properties of both approaches for web-scale text organization. Luxical combines sparse TF-IDF features, a small ReLU network, and a knowledge distillation training regimen to approximate large transformer embedding models at a fraction of their operational cost. In this technical report, we describe the Luxical architecture and training objective and evaluate a concrete Luxical model in two disparate applications: a targeted webcrawl document retrieval test and an end-to-end language model data curation task grounded in text classification. In these tasks we demonstrate speedups ranging from 3x to 100x over varying-sized neural baselines, and speed comparable to FastText model inference during the data curation task. On these evaluations, the tested Luxical model illustrates favorable compute/quality trade-offs for large-scale text organization, matching the quality of neural baselines. Luxical is available as open-source software at https://github.com/datologyai/luxical.
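A minimal sketch of the lexical-dense idea, under stated assumptions: sparse TF-IDF features feed a small ReLU network whose output is trained to match a large transformer embedding model via distillation. Layer sizes, the cosine-based loss, and the `teacher_embed` callable are placeholders, not the released Luxical implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from sklearn.feature_extraction.text import TfidfVectorizer

    class LexicalDenseStudent(nn.Module):
        """Sparse TF-IDF features -> small ReLU network -> dense unit-norm embedding."""

        def __init__(self, vocab_size, hidden=512, dim=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(vocab_size, hidden),
                nn.ReLU(),
                nn.Linear(hidden, dim),
            )

        def forward(self, tfidf):
            return F.normalize(self.net(tfidf), dim=-1)

    def distill_step(student, optimizer, texts, teacher_embed, vectorizer):
        """One distillation step toward a (placeholder) teacher embedding model."""
        tfidf = torch.from_numpy(vectorizer.transform(texts).toarray()).float()
        with torch.no_grad():
            target = F.normalize(teacher_embed(texts), dim=-1)  # teacher vectors, same dim as student
        pred = student(tfidf)
        loss = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # Hypothetical setup:
    # vectorizer = TfidfVectorizer(max_features=65536).fit(corpus)
    # student = LexicalDenseStudent(vocab_size=len(vectorizer.vocabulary_), dim=teacher_dim)
    # optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)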
The Finetuner's Fallacy: When to Pretrain with Your Finetuning Data
2026
Real-world model deployments demand strong performance on narrow domains where data is often scarce. Typically, practitioners finetune models to specialize them, but this risks overfitting to the domain and forgetting general knowledge. We study a simple strategy, specialized pretraining (SPT), where a small domain dataset, typically reserved for finetuning, is repeated starting from pretraining as a fraction of the total tokens. Across three specialized domains (ChemPile, MusicPile, and ProofPile), SPT improves domain performance and preserves general capabilities after finetuning compared to standard pretraining. In our experiments, SPT reduces the pretraining tokens needed to reach a given domain performance by up to 1.75x. These gains grow when the target domain is underrepresented in the pretraining corpus: on domains far from web text, a 1B SPT model outperforms a 3B standard pretrained model. Beyond these empirical gains, we derive overfitting scaling laws to guide practitioners in selecting the optimal domain-data repetition for a given pretraining compute budget. Our observations reveal the finetuner's fallacy: while finetuning may appear to be the cheapest path to domain adaptation, introducing specialized domain data during pretraining stretches its utility. SPT yields better specialized domain performance (via reduced overfitting across repeated exposures) and better general domain performance (via reduced forgetting during finetuning), ultimately achieving stronger results with fewer parameters and less total compute when amortized over inference. To get the most out of domain data, incorporate it as early in training as possible.
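The core SPT recipe, repeating a small domain corpus as a fixed fraction of the pretraining stream, can be sketched as a simple sampler. The fraction here is per document rather than per token, and the schedule is an assumption; the paper's exact mixing and repetition strategy may differ.

    import random

    def spt_document_stream(web_docs, domain_docs, domain_fraction=0.05, seed=0):
        """Yield a pretraining stream with a fixed fraction of domain documents.

        Sketch only: roughly domain_fraction of emitted documents come from the
        small domain corpus, which is cycled (repeated) as needed. A token-level
        schedule would additionally weight by document length.
        """
        rng = random.Random(seed)
        next_domain = 0
        for doc in web_docs:
            if domain_docs and rng.random() < domain_fraction:
                yield domain_docs[next_domain % len(domain_docs)]  # repeat the domain set
                next_domain += 1
            else:
                yield doc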
ÜberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset
2026
Multilinguality is a core capability for modern foundation models, yet training high-quality multilingual models remains challenging due to uneven data availability across languages. A further challenge is the performance interference that can arise from joint multilingual training, commonly referred to as the "curse of multilinguality". We study multilingual data curation across thirteen languages and find that many reported regressions are not inherent to multilingual scaling but instead stem from correctable deficiencies in data quality and composition rather than fundamental capacity limits. In controlled bilingual experiments, improving data quality for any single language benefits others: curating English improves non-English performance in 12 of 13 languages, while curating non-English yields reciprocal improvements in English. Bespoke per-language curation produces substantially larger within-language improvements. Extending these findings to large-scale general-purpose training mixtures, we show that curated multilingual allocations comprising under 8% of total tokens remain remarkably effective. We operationalize this approach within an effort that produced a 20T-token pretraining corpus derived entirely from public sources. Models with 3B and 8B parameters trained on a 1T-token random subset achieve competitive multilingual accuracy with 4-10x fewer training FLOPs than strong public baselines, establishing a new Pareto frontier in multilingual performance versus compute. Moreover, these benefits extend to frontier model scale: the 20T-token corpus served as part of the pretraining dataset for Trinity Large (400B/A13B), which exhibits strong multilingual performance relative to its training FLOPs. These results show that targeted, per-language data curation mitigates multilingual interference and enables compute-efficient multilingual scaling.
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
by Mentzer, Kaleigh; Gaza, Bogdan; Deng, Alvin
in Datasets; Large language models; Synthetic data
2025
Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a promising paradigm for pushing the frontier of performance. Despite this, the factors affecting synthetic data quality remain poorly understood. In this work, we introduce BeyondWeb, a synthetic data generation framework that produces high-quality synthetic data for pretraining. BeyondWeb significantly extends the capabilities of traditional web-scale datasets, outperforming state-of-the-art synthetic pretraining datasets such as Cosmopedia and Nemotron-CC's high-quality synthetic subset (Nemotron-Synth) by up to 5.1 percentage points (pp) and 2.6pp, respectively, when averaged across a suite of 14 benchmark evaluations. It delivers up to 7.7x faster training than open web data and 2.7x faster than Nemotron-Synth. Remarkably, a 3B model trained for 180B tokens on BeyondWeb outperforms an 8B model trained for the same token budget on Cosmopedia. We also present several insights from BeyondWeb on synthetic data for pretraining: what drives its benefits, which data to rephrase and how, and the impact of model size and family on data quality. Overall, our work shows that there's no silver bullet for generating high-quality synthetic pretraining data. The best outcomes require jointly optimizing many factors, a challenging task that requires rigorous science and practical expertise. Naive approaches can yield modest improvements, potentially at great cost, while well-executed methods can yield transformative improvements, as exemplified by BeyondWeb.