Catalogue Search | MBRL

Reliable multiplex sequencing with rare index mis-assignment on DNB-based NGS platform

by Chen, Ao , Wang, Jingjing , Mei, Zhiying in Animal Genetics and Genomics , Biomedical and Life Sciences , Deoxyribonucleic acid

2019

Background: Massively parallel sequencing, coupled with sample multiplexing, has made genetic tests broadly affordable. However, intractable index mis-assignments (commonly exceeding 1%) have been repeatedly reported on some widely used sequencing platforms. Results: Here, we investigated this quality issue on BGI sequencers using three library preparation methods: whole genome sequencing (WGS) with PCR, PCR-free WGS, and two-step targeted PCR. BGI’s sequencers utilize a unique DNA nanoball (DNB) technology which uses rolling circle replication for DNA-nanoball preparation; this linear amplification is PCR free and can avoid error accumulation. We demonstrated that single index mis-assignment from free indexed oligos occurs at a rate of one in 36 million reads, suggesting virtually no index hopping during DNB creation and arraying. Furthermore, the DNB-based NGS libraries have achieved an unprecedentedly low sample-to-sample mis-assignment rate of 0.0001 to 0.0004% under recommended procedures. Conclusions: Single indexing with DNB technology provides a simple but effective method for sensitive genetic assays with large sample numbers.

Journal Article

A Trait-based Investigation of Fungal Decomposition with Machine Learning

by Tian, Bingwei , Du, Shiyi , Zhao, Yiran in Algorithms , Climate change , Climate models

2022

Fungi are of great functional significance in terrestrial ecosystems as the main decomposers. To better understand their decomposition process and population coexistence, we first describe the decomposition rate, focusing on three traits of interest selected by a machine learning algorithm (moisture tolerance, hyphal extension rate, and hyphal density), and use a Ternary Linear Regression Decomposition Model (TLRDM) to quantify it. Then, to incorporate interactions, we build an Interactive Decomposition Model (IDM) and employ a Three-player Logistic-based Competition Population Model (TPLCM). Based on logistic growth, we formulate a system of differential equations, fit curves to this analytically unsolvable system to obtain population density as a function of time, and compare the decomposition rates of three populations under interactive and non-interactive conditions, followed by an analysis of how the interactions affect decomposing ability. We obtain the population combinations that can coexist in certain climates. Furthermore, we include environmental factors, conducting a sensitivity analysis to describe how short-term and long-term climate changes affect our models.

Journal Article
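
The abstract above describes a three-population competition model built on logistic growth. As a rough illustration only, the sketch below integrates a generic Lotka-Volterra-style competition system with forward Euler steps. The equation form, the function name `simulate_competition`, and all parameter values are assumptions made for this example; the paper's actual TPLCM formulation is not given in this record.

```python
def simulate_competition(n0, r, k, alpha, dt=0.01, steps=5000):
    """Forward-Euler integration of a generic logistic competition system.

    Assumed form (not taken from the paper):
        dN_i/dt = r_i * N_i * (1 - sum_j alpha[i][j] * N_j / K_i)
    n0    : initial population densities
    r     : intrinsic growth rates
    k     : carrying capacities
    alpha : competition coefficients, alpha[i][i] is usually 1.0
    """
    n = list(n0)
    size = len(n)
    for _ in range(steps):
        growth = []
        for i in range(size):
            # Total crowding felt by population i, weighted by competition.
            crowding = sum(alpha[i][j] * n[j] for j in range(size))
            growth.append(r[i] * n[i] * (1.0 - crowding / k[i]))
        # Clamp at zero so a density can never become negative.
        n = [max(0.0, n[i] + dt * growth[i]) for i in range(size)]
    return n


if __name__ == "__main__":
    # Hypothetical parameters for three weakly competing populations.
    no_interaction = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    with_interaction = [[1.0, 0.3, 0.2], [0.4, 1.0, 0.3], [0.2, 0.5, 1.0]]
    for label, a in (("isolated", no_interaction), ("interacting", with_interaction)):
        final = simulate_competition([1.0, 1.0, 1.0], [1.0, 0.8, 0.6],
                                     [100.0, 50.0, 20.0], a)
        print(label, [round(x, 2) for x in final])
```

With no cross-competition (identity `alpha`), each population simply approaches its own carrying capacity, which gives a baseline against which an interacting run can be compared.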

The Length and Distribution of Plasma Cell-Free DNA Fragments in Stroke Patients

by Qi, Yanwei , Cui, Xiaofang , Yang, Yan in Analysis , Apoptosis , Chromosomes

2020

A number of studies have shown that plasma cell-free DNA is closely related to the risk of stroke, but the fragmentation status of plasma cell-free DNA and its clinical application value in ischemic stroke are still unclear. In this study, 48 patients with new ischemic stroke and 20 healthy subjects were enrolled. Second-generation high-throughput sequencing was used to study the fragment length and regional distribution of the subjects' plasma cell-free DNA. The ratio of plasma cell-free DNA fragments in the disease group was significantly greater than that of the healthy group in the 300–400 bp range; conversely, for fragments in the 75–250 bp range, the ratio in the patient group was apparently lower than that of the healthy group. An in-depth analysis of the proportion of fragments distributed on each component of the genome was carried out. Our results showed that the plasma cell-free DNA fragments in the disease group were biased toward the EXON, CpG islands, and ALU regions in contrast to those of the healthy group. In particular, fragments within the 300–400 bp range of the disease group were enriched in the regions of EXON, INTRON, INTERGENIC, LINE, Fragile, ALU, and CpG islands. In summary, our findings suggested that the intracellular DNA degradation profiles could be applied to distinguish the stroke group from the healthy group, which provided a theoretical basis for the clinical diagnosis and prognosis of stroke by profiling the characteristics of plasma cell-free DNA fragments.

Journal Article

CodonMoE: DNA Language Models for mRNA Analyses

by Du, Shiyi , Liang, Litian , Li, Jiayi

2025

Genomic language models (gLMs) face a fundamental efficiency challenge: either maintain separate specialized models for each biological modality (DNA and RNA) or develop large multi-modal architectures. Both approaches impose significant computational burdens - modality-specific models require redundant infrastructure despite inherent biological connections, while multi-modal architectures demand massive parameter counts and extensive cross-modality pretraining. To address this limitation, we introduce CodonMoE (Adaptive Mixture of Codon Reformative Experts), a lightweight adapter that transforms DNA language models into effective RNA analyzers without RNA-specific pretraining. Our theoretical analysis establishes CodonMoE as a universal approximator at the codon level, capable of mapping arbitrary functions from codon sequences to RNA properties given sufficient expert capacity. Across four RNA prediction tasks spanning stability, expression, and regulation, DNA models augmented with CodonMoE significantly outperform their unmodified counterparts, with HyenaDNA+CodonMoE series achieving state-of-the-art results using 80% fewer parameters than specialized RNA models. By maintaining sub-quadratic complexity while achieving superior performance, our approach provides a principled path toward unifying genomic language modeling, leveraging more abundant DNA data and reducing computational overhead while preserving modality-specific performance advantages.

Journal Article

CodonRL: Multi-Objective Codon Sequence Optimization Using Demonstration-Guided Reinforcement Learning

by You, Zhaoyi , Du, Shiyi , Li, Jiayi in Decision making , Feedback , Free energy

2026

Optimizing synonymous codon sequences to improve translation efficiency, RNA stability, and compositional properties is challenging because the search space grows exponentially with protein length and objectives interact through long range RNA structure. Dynamic programming-based methods can provide strong solutions for fixed objective combinations but are difficult to extend to additional constraints. Deep generative models require large-scale, high-quality mRNA sequence datasets for training, limiting applicability when such data are scarce. Reinforcement learning naturally handles sequential decision-making but faces challenges in codon optimization due to delayed rewards, large action spaces, and expensive structural evaluation. We present CodonRL, a reinforcement learning framework that learns a structural prior for mRNA design from efficient folding feedback and demonstration-guided replay, and then enables user-controlled multi-objective trade-offs during inference. CodonRL uses LinearFold for fast intermediate reward computation during training and ViennaRNA for final evaluation, warms up learning with expert sequences to accelerate convergence for global structure objectives, and introduces milestone-based intermediate rewards to address delayed feedback in long range optimization. On a benchmark of 55 human proteins, CodonRL outperforms GEMORNA, a state-of-the-art codon optimization method, across multiple metrics, achieving 9.5% higher codon adaptation index (CAI), 25.4 kcal/mol more favorable minimum free energy (MFE), and 3.4% lower uridine content on average, while improving codon stabilization coefficient (CSC) in over 90% of benchmark proteins under matched constraints. These gains translate into designs that are predicted to be more efficiently translated, more structurally stable, and less immunogenic, while supporting continuous objective reweighting at inference time.

Journal Article

ARCADE: Controllable Codon Design from Foundation Models via Activation Engineering

by Lai, Hong-Sheng , Du, Shiyi , Liang, Litian in Bioinformatics

2025

Codon sequence design is crucial for generating mRNA sequences with desired functional properties for tasks such as developing mRNA vaccines or gene editing therapies. Yet existing methods lack flexibility and controllability to adapt to various design objectives. We propose a novel machine learning-based framework, ARCADE, that enables flexible and controllable multi-objective codon design. Leveraging inherent knowledge from pretrained genomic language models, ARCADE extends activation engineering, a technique originally developed for controllable text generation, beyond discrete feature manipulation such as concepts and styles, to steering continuous-valued biological metrics. Specifically, we derive biologically meaningful semantic steering vectors in the model's activation space, which directly control properties such as the Codon Adaptation Index, Minimum Free Energy, and GC content. Experimental results demonstrate the flexibility of ARCADE in designing codon sequences with multiple objectives, underscoring its potential for advancing programmable biological sequence design. Our implementation is available at https://github.com/Kingsford-Group/arcade.

Journal Article

Models Know Their Shortcuts: Deployment-Time Shortcut Mitigation

by Kaynar, Gün , Du, Shiyi , Li, Jiayi in Accuracy , Annotations , Contrastive learning

2026

Pretrained language models often rely on superficial features that appear predictive during training yet fail to generalize at test time, a phenomenon known as shortcut learning. Existing mitigation methods generally operate at training time and require heavy supervision such as access to the original training data or prior knowledge of shortcut type. We propose Shortcut Guardrail, a deployment-time framework that mitigates token-level shortcuts without access to the original training data or shortcut annotations. Our key insight is that gradient-based attribution on a biased model highlights shortcut tokens. Building on this finding, we train a lightweight LoRA-based debiasing module with a Masked Contrastive Learning (MaskCL) objective that encourages consistent representations with or without individual tokens. Across sentiment classification, toxicity detection, and natural language inference under both naturally occurring and controlled shortcuts, Shortcut Guardrail improves overall accuracy and worst-group accuracy over the unmitigated model under distribution shifts while preserving in-distribution performance.

Paper

CodonMoE: DNA Language Models for mRNA Analyses

by Du, Shiyi , Liang, Litian , Li, Jiayi in Analyzers , Deoxyribonucleic acid , Gene expression

2025

Genomic language models (gLMs) face a fundamental efficiency challenge: either maintain separate specialized models for each biological modality (DNA and RNA) or develop large multi-modal architectures. Both approaches impose significant computational burdens - modality-specific models require redundant infrastructure despite inherent biological connections, while multi-modal architectures demand massive parameter counts and extensive cross-modality pretraining. To address this limitation, we introduce CodonMoE (Adaptive Mixture of Codon Reformative Experts), a lightweight adapter that transforms DNA language models into effective RNA analyzers without RNA-specific pretraining. Our theoretical analysis establishes CodonMoE as a universal approximator at the codon level, capable of mapping arbitrary functions from codon sequences to RNA properties given sufficient expert capacity. Across four RNA prediction tasks spanning stability, expression, and regulation, DNA models augmented with CodonMoE significantly outperform their unmodified counterparts, with HyenaDNA+CodonMoE series achieving state-of-the-art results using 80% fewer parameters than specialized RNA models. By maintaining sub-quadratic complexity while achieving superior performance, our approach provides a principled path toward unifying genomic language modeling, leveraging more abundant DNA data and reducing computational overhead while preserving modality-specific performance advantages.

Paper

Boosting Dermatoscopic Lesion Segmentation via Diffusion Models with Visual and Textual Prompts

by Wang, Xiaosong , Lu, Yongyi , Zhou, Zongwei in Annotations , Data augmentation , Generative adversarial networks

2023

Image synthesis approaches, e.g., generative adversarial networks, have been popular as a form of data augmentation in medical image analysis tasks. They are primarily useful for overcoming the shortage of publicly accessible data and associated quality annotations. However, the current techniques often lack control over the detailed contents in generated images, e.g., the type of disease patterns, the location of lesions, and attributes of the diagnosis. In this work, we adapt the latest advance in generative modeling, i.e., the diffusion model, with added control flow using lesion-specific visual and textual prompts for generating dermatoscopic images. We further demonstrate the advantage of our diffusion model-based framework over classical generation models in both image quality and segmentation performance on skin lesions. It can achieve a 9% increase in the SSIM image quality measure and an over 5% increase in Dice coefficients over prior art.

Paper
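
The abstract above reports segmentation gains in terms of the Dice coefficient. For readers unfamiliar with the metric, here is a minimal sketch of the standard definition, Dice = 2|A∩B| / (|A| + |B|), for two binary masks. The function name `dice_coefficient`, the flat-list mask representation, and the convention of returning 1.0 for two empty masks are choices made for this example, not details taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Pixels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground pixels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention chosen here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    print(dice_coefficient(predicted, reference))
```

A value of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they share no foreground pixels.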

An evaluation of U-Net in Renal Structure Segmentation

by Ye, Jin , Wang, Haoyu , Huang, Ziyan in Angiography , Computed tomography , Image segmentation

2022

Renal structure segmentation from computed tomography angiography (CTA) is essential for many computer-assisted renal cancer treatment applications. The Kidney PArsing (KiPA 2022) Challenge aims to build a fine-grained multi-structure dataset and improve the segmentation of multiple renal structures. Recently, U-Net has dominated medical image segmentation. In the KiPA challenge, we evaluated several U-Net variants and selected the best models for the final submission.

Paper
