Catalogue Search | MBRL
115 result(s) for "Zhang, Hao Helen"
Interaction Screening for Ultrahigh-Dimensional Data
2014
In ultrahigh-dimensional data analysis, it is extremely challenging to identify important interaction effects, and a top concern in practice is computational feasibility. For a dataset with n observations and p predictors, the augmented design matrix including all linear and order-2 terms is of size n × (p² + 3p)/2. When p is large, say more than tens of hundreds, the number of interactions is enormous and beyond the capacity of standard machines and software tools for storage and analysis. In theory, the interaction-selection consistency is hard to achieve in high-dimensional settings. Interaction effects have heavier tails and more complex covariance structures than main effects in a random design, making theoretical analysis difficult. In this article, we propose to tackle these issues by forward-selection-based procedures called iFOR, which identify interaction effects in a greedy forward fashion while maintaining the natural hierarchical model structure. Two algorithms, iFORT and iFORM, are studied. Computationally, the iFOR procedures are designed to be simple and fast to implement. No complex optimization tools are needed, since only OLS-type calculations are involved; the iFOR algorithms avoid storing and manipulating the whole augmented matrix, so the memory and CPU requirements are minimal; the computational complexity is linear in p for sparse models, hence feasible for p ≫ n. Theoretically, we prove that they possess the sure screening property for ultrahigh-dimensional settings. Numerical examples are used to demonstrate their finite sample performance. Supplementary materials for this article are available online.
Journal Article
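The matrix size quoted in the abstract above, n × (p² + 3p)/2, can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the function names and the 8-bytes-per-value storage assumption are hypothetical and do not come from the article.

```python
def augmented_columns(p: int) -> int:
    """Columns in a design matrix holding all linear and order-2 terms.

    p linear terms + p squared terms + p*(p-1)/2 pairwise products
    add up to (p**2 + 3*p) / 2, the count quoted in the abstract.
    """
    return (p * p + 3 * p) // 2  # p*(p+3) is always even, so this is exact


def approx_gigabytes(n: int, p: int, bytes_per_value: int = 8) -> float:
    """Rough dense-storage footprint in GB (assumes 8-byte floats)."""
    return n * augmented_columns(p) * bytes_per_value / 1e9


if __name__ == "__main__":
    for p in (100, 1_000, 10_000):
        print(p, augmented_columns(p), round(approx_gigabytes(200, p), 3))
```

The quadratic growth in p is what makes storing the full augmented matrix impractical, which is the motivation the abstract gives for procedures that avoid materialising it.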
On the Adaptive Elastic-Net with a Diverging Number of Parameters
2009
We consider the problem of model selection and estimation in situations where the number of parameters diverges with the sample size. When the dimension is high, an ideal method should have the oracle property [J. Amer. Statist. Assoc. 96 (2001) 1348-1360] and [Ann. Statist. 32 (2004) 928-961] which ensures the optimal large sample performance. Furthermore, the high dimensionality often induces the collinearity problem, which should be properly handled by the ideal method. Many existing variable selection methods fail to achieve both goals simultaneously. In this paper, we propose the adaptive elastic-net that combines the strengths of the quadratic regularization and the adaptively weighted lasso shrinkage. Under weak regularity conditions, we establish the oracle property of the adaptive elastic-net. We show by simulations that the adaptive elastic-net deals with the collinearity problem better than the other oracle-like methods, thus enjoying much improved finite sample performance.
Journal Article
Graded regulation of cellular quiescence depth between proliferation and senescence by a lysosomal dimmer switch
2019
The reactivation of quiescent cells to proliferate is fundamental to tissue repair and homeostasis in the body. Often referred to as the G0 state, quiescence is, however, not a uniform state but with graded depth. Shallow quiescent cells exhibit a higher tendency to revert to proliferation than deep quiescent cells, while deep quiescent cells are still fully reversible under physiological conditions, distinct from senescent cells. Cellular mechanisms underlying the control of quiescence depth and the connection between quiescence and senescence are poorly characterized, representing a missing link in our understanding of tissue homeostasis and regeneration. Here we measured transcriptome changes as rat embryonic fibroblasts moved from shallow to deep quiescence over time in the absence of growth signals. We found that lysosomal gene expression was significantly up-regulated in deep quiescence, and partially compensated for gradually reduced autophagy flux. Reducing lysosomal function drove cells progressively deeper into quiescence and eventually into a senescence-like irreversibly arrested state; increasing lysosomal function, by lowering oxidative stress, progressively pushed cells into shallower quiescence. That is, lysosomal function modulates graded quiescence depth between proliferation and senescence as a dimmer switch. Finally, we found that a gene-expression signature developed by comparing deep and shallow quiescence in fibroblasts can correctly classify a wide array of senescent and aging cell types in vitro and in vivo, suggesting that while quiescence is generally considered to protect cells from irreversible arrest of senescence, quiescence deepening likely represents a common transition path from cell proliferation to senescence, related to aging.
Journal Article
Variable selection for optimal treatment decision
by Zeng, Donglin; Lu, Wenbin; Zhang, Hao Helen
in Acquired immune deficiency syndrome, AIDS, Clinical research
2013
In decision-making on optimal treatment strategies, it is of great importance to identify variables that are involved in the decision rule, i.e. those interacting with the treatment. Effective variable selection helps to improve the prediction accuracy and enhance the interpretability of the decision rule. We propose a new penalized regression framework which can simultaneously estimate the optimal treatment strategy and identify important variables. The advantages of the new approach include: (i) it does not require the estimation of the baseline mean function of the response, which greatly improves the robustness of the estimator; (ii) the convenient loss-based framework makes it easier to adopt shrinkage methods for variable selection, which greatly facilitates implementation and statistical inferences for the estimator. The new procedure can be easily implemented by existing state-of-the-art software packages like LARS. Theoretical properties of the new estimator are studied. Its empirical performance is evaluated using simulation studies and further illustrated with an application to an AIDS clinical trial.
Journal Article
Linear or Nonlinear? Automatic Structure Discovery for Partially Linear Models
by Cheng, Guang; Liu, Yufeng; Zhang, Hao Helen
in Algorithms, Applications, Comparative analysis
2011
Partially linear models provide a useful class of tools for modeling complex data by naturally incorporating a combination of linear and nonlinear effects within one framework. One key question in partially linear models is the choice of model structure, that is, how to decide which covariates are linear and which are nonlinear. This is a fundamental, yet largely unsolved problem for partially linear models. In practice, one often assumes that the model structure is given or known and then makes estimation and inference based on that structure. Alternatively, there are two methods in common use for tackling the problem: hypotheses testing and visual screening based on the marginal fits. Both methods are quite useful in practice but have their drawbacks. First, it is difficult to construct a powerful procedure for testing multiple hypotheses of linear against nonlinear fits. Second, the screening procedure based on the scatterplots of individual covariate fits may provide an educated guess on the regression function form, but the procedure is ad hoc and lacks theoretical justifications. In this article, we propose a new approach to structure selection for partially linear models, called the LAND (Linear And Nonlinear Discoverer). The procedure is developed in an elegant mathematical framework and possesses desired theoretical and computational properties. Under certain regularity conditions, we show that the LAND estimator is able to identify the underlying true model structure correctly and at the same time estimate the multivariate regression function consistently. The convergence rate of the new estimator is established as well. We further propose an iterative algorithm to implement the procedure and illustrate its performance by simulated and real examples. Supplementary materials for this article are available online.
Journal Article
Adapting nanopore sequencing basecalling models for modification detection via incremental learning and anomaly detection
2024
We leverage machine learning approaches to adapt nanopore sequencing basecallers for nucleotide modification detection. We first apply the incremental learning (IL) technique to improve the basecalling of modification-rich sequences, which are usually of high biological interest. With sequence backbones resolved, we further run anomaly detection (AD) on individual nucleotides to determine their modification status. By this means, our pipeline promises the single-molecule, single-nucleotide, and sequence context-free detection of modifications. We benchmark the pipeline using control oligos, further apply it in the basecalling of densely-modified yeast tRNAs and E. coli genomic DNAs, the cross-species detection of N6-methyladenosine (m6A) in mammalian mRNAs, and the simultaneous detection of N1-methyladenosine (m1A) and m6A in human mRNAs. Our IL-AD workflow is available at: https://github.com/wangziyuan66/IL-AD.
Here the authors adapt nanopore sequencing basecallers to detect RNA modifications. The authors first apply incremental learning to resolve modification-disturbed basecalling, then use anomaly detection to assess nucleotide modification status.
Journal Article
Training data diversity enhances the basecalling of novel RNA modification-induced nanopore sequencing readouts
2025
Accurately basecalling sequence backbones in the presence of nucleotide modifications remains a substantial challenge in nanopore sequencing bioinformatics. It has been extensively demonstrated that state-of-the-art basecallers are less compatible with modification-induced sequencing signals. A precise basecalling, on the other hand, serves as the prerequisite for virtually all the downstream analyses. Here, we report that basecallers exposed to diverse training modifications gain the generalizability to analyze novel modifications. With synthesized oligos as the model system, we precisely basecall various out-of-sample RNA modifications. From the representation learning perspective, we attribute this generalizability to basecaller representation space expanded by diverse training modifications. Taken together, we conclude increasing the training data diversity as a paradigm for building modification-tolerant nanopore sequencing basecallers.
Accurately basecalling sequence backbones in the presence of nucleotide modifications remains a significant challenge in nanopore sequencing bioinformatics. Here, authors report that basecallers exposed to diverse training modifications gain the generalisability to analyse novel modifications.
Journal Article
Adaptive Elastic Net for Generalized Methods of Moments
2014
Model selection and estimation are crucial parts of econometrics. This article introduces a new technique that can simultaneously estimate and select the model in generalized method of moments (GMM) context. The GMM is particularly powerful for analyzing complex datasets such as longitudinal and panel data, and it has wide applications in econometrics. This article extends the least squares based adaptive elastic net estimator by Zou and Zhang to nonlinear equation systems with endogenous variables. The extension is not trivial and involves a new proof technique due to estimators' lack of closed-form solutions. Compared to Bridge-GMM by Caner, we allow for the number of parameters to diverge to infinity as well as collinearity among a large number of variables; also, the redundant parameters are set to zero via a data-dependent technique. This method has the oracle property, meaning that we can estimate nonzero parameters with their standard limit and the redundant parameters are dropped from the equations simultaneously. Numerical examples are used to illustrate the performance of the new method.
Journal Article
Exit from quiescence displays a memory of cell growth and division
2017
Reactivating quiescent cells to proliferate is critical to tissue repair and homoeostasis. Quiescence exit is highly noisy even for genetically identical cells under the same environmental conditions. Deregulation of quiescence exit is associated with many diseases, but cellular mechanisms underlying the noisy process of exiting quiescence are poorly understood. Here we show that the heterogeneity of quiescence exit reflects a memory of preceding cell growth at quiescence induction and immediate division history before quiescence entry, and that such a memory is reflected in cell size at a coarse scale. The deterministic memory effects of preceding cell cycle, coupled with the stochastic dynamics of an Rb-E2F bistable switch, jointly and quantitatively explain quiescence-exit heterogeneity. As such, quiescence can be defined as a distinct state outside of the cell cycle while displaying a sequential cell order reflecting preceding cell growth and division variations.
The quiescence-exit process is noisy even in genetically identical cells under the same environmental conditions. Here the authors show that the heterogeneity of quiescence exit reflects a memory of preceding cell growth at quiescence induction and immediate division history prior to quiescence entry.
Journal Article
binomialRF: interpretable combinatoric efficiency of random forests to identify biomarker interactions
2020
Background
In this era of data science-driven bioinformatics, machine learning research has focused on feature selection as users want more interpretation and post-hoc analyses for biomarker detection. However, when there are more features (i.e., transcripts) than samples (i.e., mice or human samples) in a study, it poses major statistical challenges in biomarker detection tasks as traditional statistical techniques are underpowered in high dimension. Second and third order interactions of these features pose a substantial combinatoric dimensional challenge. In computational biology, random forest (RF) classifiers are widely used due to their flexibility, powerful performance, their ability to rank features, and their robustness to the “P ≫ N” high-dimensional limitation that many matrix regression algorithms face. We propose binomialRF, a feature selection technique in RFs that provides an alternative interpretation for features using a correlated binomial distribution and scales efficiently to analyze multiway interactions.
Results
In both simulations and validation studies using datasets from the TCGA and UCI repositories, binomialRF showed computational gains (up to 5 to 300 times faster) while maintaining competitive variable precision and recall in identifying biomarkers’ main effects and interactions. In two clinical studies, the binomialRF algorithm prioritizes previously-published relevant pathological molecular mechanisms (features) with high classification precision and recall using features alone, as well as with their statistical interactions alone.
Conclusion
binomialRF extends upon previous methods for identifying interpretable features in RFs and brings them together under a correlated binomial distribution to create an efficient hypothesis testing algorithm that identifies biomarkers’ main effects and interactions. Preliminary results in simulations demonstrate computational gains while retaining competitive model selection and classification accuracies. Future work will extend this framework to incorporate ontologies that provide pathway-level feature selection from gene expression input data.
Journal Article