Catalogue Search | MBRL
Explore the vast range of titles available.
17 result(s) for "Ghoshal, Asish"
MicroRNA target prediction using thermodynamic and sequence curves
by Ghoshal, Asish; Chaterji, Somali; Shankar, Raghavendran
in 3' Untranslated Regions; 5' Untranslated Regions; Analysis
2015
Background
MicroRNAs (miRNAs) are small regulatory RNAs that mediate RNA interference by binding to various mRNA target regions. Several computational methods exist for identifying the mRNA targets of miRNAs. However, these have treated all contributory features as scalar representations, primarily thermodynamic or sequence-based features. Further, the majority of these methods target only canonical sites, i.e., sites with "seed" complementarity. Here, we present a machine-learning classification scheme, titled Avishkar, which captures the spatial profile of miRNA-mRNA interactions via smooth B-spline curves, separately for various input features, such as thermodynamic and sequence features. Further, we use a principled approach to uniformly model canonical and non-canonical seed matches, using a novel seed-enrichment metric.
Results
We demonstrate that a large number of seed-match patterns have high enrichment values conserved across species, and that the majority of miRNA binding sites involve non-canonical matches, corroborating recent findings. Using spatial curves and popular categorical features, such as target-site length and location, we train a linear SVM model on experimental CLIP-seq data. Our model significantly outperforms all established methods, for both canonical and non-canonical sites. We achieve this while using a much larger candidate set of miRNA-mRNA interactions than prior work.
Conclusions
We have developed an efficient SVM-based model for miRNA target prediction using recent CLIP-seq data. It demonstrates superior performance, evaluated using ROC curves: about 20% better than the state-of-the-art, across species (human or mouse) and target types (canonical or non-canonical). To the best of our knowledge, we provide the first distributed framework for microRNA target prediction, based on Apache Hadoop and Spark.
Availability
All source code and data are publicly available at https://bitbucket.org/cellsandmachines/avishkar.
Journal Article
Efficient Algorithms for Learning Combinatorial Structures from Limited Data
2019
Recovering combinatorial structures from noisy observations is a recurrent problem in many application domains, including, but not limited to, natural language processing, computer vision, genetics, health care, and automation. For instance, dependency parsing in natural language processing entails recovering parse trees from sentences that are inherently ambiguous. From a computational standpoint, such problems are typically intractable and call for designing efficient approximation or randomized algorithms with provable guarantees. From a statistical standpoint, algorithms that recover the desired structure using an optimal number of samples are of paramount importance. We tackle several such problems in this thesis and obtain computationally and statistically efficient procedures. We demonstrate the optimality of our methods by proving fundamental lower bounds on the number of samples needed by any method to recover the desired structures. Specifically, the thesis makes the following contributions:
(i) We develop polynomial-time algorithms for learning linear structural equation models, a widely used class of models for performing causal inference, that recover the correct directed acyclic graph structure under identifiability conditions weaker than existing ones. We also show that the sample complexity of our method is information-theoretically optimal.
(ii) We develop polynomial-time algorithms for learning the underlying graphical game from observations of the behavior of self-interested agents. The key combinatorial problem here is to recover the Nash equilibria set of the true game from behavioral data. We obtain fundamental lower bounds on the number of samples required for learning games and show that our method is statistically optimal.
(iii) Lastly, departing from the generative-model framework, we consider the problem of structured prediction, where the goal is to learn predictors that map a given input directly to a complex structured object. We develop efficient learning algorithms that learn structured predictors by approximating the partition function, and we obtain generalization guarantees for our method. We demonstrate that randomization can not only improve efficiency but also generalization to unseen data.
Dissertation
Learning Maximum-A-Posteriori Perturbation Models for Structured Prediction in Polynomial Time
2018
MAP perturbation models have emerged as a powerful framework for inference in structured prediction. Such models provide a way to efficiently sample from the Gibbs distribution and facilitate predictions that are robust to random noise. In this paper, we propose a provably polynomial time randomized algorithm for learning the parameters of perturbed MAP predictors. Our approach is based on minimizing a novel Rademacher-based generalization bound on the expected loss of a perturbed MAP predictor, which can be computed in polynomial time. We obtain conditions under which our randomized learning algorithm can guarantee generalization to unseen examples.
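The perturb-and-MAP idea underlying this framework can be sketched in its simplest form (this illustrates the general mechanism, not the paper's learning algorithm): adding i.i.d. Gumbel noise to the potentials of a discrete distribution and taking the argmax yields exact samples from the Gibbs distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([1.0, 0.0, -1.0])            # potentials of 3 states
gibbs = np.exp(theta) / np.exp(theta).sum()   # target Gibbs distribution

n = 100_000
gumbel = rng.gumbel(size=(n, 3))              # i.i.d. Gumbel(0, 1) noise
samples = np.argmax(theta + gumbel, axis=1)   # perturbed MAP predictions
empirical = np.bincount(samples, minlength=3) / n

print("Gibbs     :", np.round(gibbs, 3))
print("empirical :", np.round(empirical, 3))
```

For structured models the argmax is taken by a MAP solver over the whole structure, which is exactly why efficient MAP inference makes Gibbs-like sampling tractable.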
Direct Learning with Guarantees of the Difference DAG Between Structural Equation Models
2021
Discovering cause-effect relationships between variables from observational data is a fundamental challenge in many scientific disciplines. However, in many situations it is desirable to directly estimate the change in causal relationships across two different conditions; for example, estimating the change in genetic expression across healthy and diseased subjects can help isolate genetic factors behind the disease. This paper focuses on the problem of directly estimating the structural difference between two structural equation models (SEMs) having the same topological ordering, given two sets of samples drawn from the individual SEMs. We present a principled algorithm that can recover the difference SEM in \(\mathcal{O}(d^2 \log p)\) samples, where \(d\) is related to the number of edges in the difference SEM of \(p\) nodes. We also study the fundamental limits and show that any method requires at least \(\Omega(d' \log \frac{p}{d'})\) samples to learn difference SEMs with at most \(d'\) parents per node. Finally, we validate our theoretical results with synthetic experiments and show that our method outperforms the state-of-the-art. Moreover, we demonstrate the usefulness of our method on data from the medical domain.
Learning Sparse Polymatrix Games in Polynomial Time and Sample Complexity
2017
We consider the problem of learning sparse polymatrix games from observations of strategic interactions. We show that a polynomial-time method based on \(\ell_{1,2}\)-group-regularized logistic regression recovers a game whose Nash equilibria are the \(\epsilon\)-Nash equilibria of the game from which the data was generated (the true game), in \(\mathcal{O}(m^4 d^4 \log (pd))\) samples of strategy profiles, where \(m\) is the maximum number of pure strategies of a player, \(p\) is the number of players, and \(d\) is the maximum degree of the game graph. Under slightly more stringent separability conditions on the payoff matrices of the true game, we show that our method learns a game with exactly the same Nash equilibria as the true game. We also show that \(\Omega(d \log (pm))\) samples are necessary for any method to consistently recover a game with the same Nash equilibria as the true game from observations of strategic interactions. We verify our theoretical results through simulation experiments.
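The \(\ell_{1,2}\)-group-regularized logistic regression at the core of this approach can be sketched with proximal gradient descent and a group soft-thresholding step. The game-specific feature construction is omitted; the data, group structure, and regularization strength below are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, gsize = 12, 400, 3                  # 4 groups of 3 features each
groups = np.arange(p).reshape(-1, gsize)

w_true = np.zeros(p)
w_true[groups[0]] = [1.5, -2.0, 1.0]      # only group 0 is active
X = rng.normal(size=(n, p))
y = (rng.random(n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def prox_group(w, t):
    """Group soft-thresholding: prox of t * sum_g ||w_g||_2."""
    out = w.copy()
    for g in groups:
        norm = np.linalg.norm(w[g])
        out[g] = 0.0 if norm <= t else (1 - t / norm) * w[g]
    return out

lam, step = 0.1, 0.25 / n                 # penalty weight and step size
w = np.zeros(p)
for _ in range(2000):
    grad = X.T @ (1 / (1 + np.exp(-X @ w)) - y)   # logistic-loss gradient
    w = prox_group(w - step * grad, step * n * lam)

active = [i for i, g in enumerate(groups) if np.linalg.norm(w[g]) > 1e-6]
print("recovered active groups:", active)
```

The group penalty zeroes out entire blocks at once, which is what lets the method recover the sparse neighborhood structure of the game graph.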
Learning linear structural equation models in polynomial time and sample complexity
2017
The problem of learning structural equation models (SEMs) from data is fundamental to causal inference. We develop a new algorithm, computationally and statistically efficient and applicable in the high-dimensional regime, for learning linear SEMs from purely observational data with arbitrary noise distribution. We consider three aspects of the problem: identifiability, computational efficiency, and statistical efficiency. We show that when data is generated from a linear SEM over \(p\) nodes with maximum degree \(d\), our algorithm recovers the directed acyclic graph (DAG) structure of the SEM under an identifiability condition that is more general than those considered in the literature, and without faithfulness assumptions. In the population setting, our algorithm recovers the DAG structure in \(\mathcal{O}(p(d^2 + \log p))\) operations. In the finite-sample setting, if the estimated precision matrix is sparse, our algorithm has a smoothed complexity of \(\widetilde{\mathcal{O}}(p^3 + pd^7)\), while if it is dense, the smoothed complexity is \(\widetilde{\mathcal{O}}(p^5)\). For sub-Gaussian noise, our algorithm has a sample complexity of \(\mathcal{O}(\frac{d^8}{\varepsilon^2} \log (\frac{p}{\sqrt{\delta}}))\) to achieve \(\varepsilon\) element-wise additive error with respect to the true autoregression matrix with probability at least \(1 - \delta\); for noise with bounded \((4m)\)-th moment, with \(m\) a positive integer, the sample complexity is \(\mathcal{O}(\frac{d^8}{\varepsilon^2} (\frac{p^2}{\delta})^{1/m})\).
Information-theoretic limits of Bayesian network structure learning
by Ghoshal, Asish; Honorio, Jean
in Bayesian analysis; Conditional probability; Continuity (mathematics)
2017
In this paper, we study the information-theoretic limits of learning the structure of Bayesian networks (BNs), on discrete as well as continuous random variables, from a finite number of samples. We show that the minimum number of samples required by any procedure to recover the correct structure grows as \(\Omega(m)\) and \(\Omega(k \log m + (k^2/m))\) for non-sparse and sparse BNs respectively, where \(m\) is the number of variables and \(k\) is the maximum number of parents per node. We provide a simple recipe, based on an extension of Fano's inequality, to obtain information-theoretic limits of structure recovery for any exponential-family BN. We instantiate our result for specific conditional distributions in the exponential family to characterize the fundamental limits of learning various commonly used BNs, such as conditional-probability-table-based networks, Gaussian BNs, noisy-OR networks, and logistic-regression networks. En route to obtaining our main results, we obtain tight bounds on the number of sparse and non-sparse essential DAGs. Finally, as a byproduct, we recover the information-theoretic limits of sparse variable selection for logistic regression.
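The Fano-based recipe can be sketched in its generic textbook form (the paper's extension refines this for exponential-family BNs; the bound below is the standard argument, not the paper's exact theorem):

```latex
% If G is drawn uniformly from a family \mathcal{G} of structures and
% \hat{G} is any estimator based on n samples X_1^n, Fano's inequality gives
P(\hat{G} \neq G) \;\ge\; 1 - \frac{I(G; X_1^n) + \log 2}{\log |\mathcal{G}|}.
% Bounding the mutual information by
% I(G; X_1^n) \le n \max_{G, G'} \mathrm{KL}(P_G \,\|\, P_{G'}),
% driving the error probability below 1/2 requires
n \;\ge\; \frac{\tfrac{1}{2}\log |\mathcal{G}| - \log 2}{\max_{G, G'} \mathrm{KL}(P_G \,\|\, P_{G'})}.
```

Lower bounds such as \(\Omega(m)\) then follow by exhibiting a large family of structures whose distributions are pairwise hard to distinguish (small KL divergence) while \(\log |\mathcal{G}|\) is large.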
Learning Identifiable Gaussian Bayesian Networks in Polynomial Time and Sample Complexity
2017
Learning the directed acyclic graph (DAG) structure of a Bayesian network from observational data is a notoriously difficult problem for which many hardness results are known. In this paper we propose a provably polynomial-time algorithm for learning sparse Gaussian Bayesian networks with equal noise variance, a class of Bayesian networks for which the DAG structure can be uniquely identified from observational data, under high-dimensional settings. We show that \(O(k^4 \log p)\) samples suffice for our method to recover the true DAG structure with high probability, where \(p\) is the number of variables and \(k\) is the maximum Markov blanket size. We obtain our theoretical guarantees under a condition called restricted strong adjacency faithfulness, which is strictly weaker than strong faithfulness, a condition that other methods based on conditional-independence testing need for their success. The sample complexity of our method matches the information-theoretic limits in terms of the dependence on \(p\). We show that our method outperforms existing state-of-the-art methods for learning Gaussian Bayesian networks in terms of recovering the true DAG structure, while being comparable in speed to heuristic methods.
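A simplified, population-setting sketch of the terminal-vertex idea behind equal-variance identifiability: a sink (childless) node attains the minimum diagonal entry of the precision matrix, its row of the precision matrix gives its parent weights, and peeling sinks off one at a time recovers the whole DAG. The finite-sample algorithm additionally needs sparse precision estimation; the SEM below is a made-up example.

```python
import numpy as np

p, sigma2 = 5, 1.0
B = np.zeros((p, p))                      # B[j, k] = weight of edge k -> j
B[1, 0], B[2, 0], B[3, 1], B[3, 2], B[4, 3] = 0.8, -0.6, 0.7, 0.5, -0.9

Omega = (np.eye(p) - B).T @ (np.eye(p) - B) / sigma2   # exact precision matrix

remaining = list(range(p))
B_hat = np.zeros((p, p))
order = []
Om = Omega.copy()
while remaining:
    i = int(np.argmin(np.diag(Om)))       # a sink has the minimal diagonal
    node = remaining[i]
    order.append(node)
    coeffs = -Om[i] / Om[i, i]            # sink's parent weights: -Omega_ij / Omega_ii
    for j, other in enumerate(remaining):
        if j != i and abs(coeffs[j]) > 1e-9:
            B_hat[node, other] = coeffs[j]
    # Marginalize the sink out: Schur complement of the precision matrix.
    keep = [j for j in range(len(remaining)) if j != i]
    Om = Om[np.ix_(keep, keep)] - np.outer(Om[keep, i], Om[i, keep]) / Om[i, i]
    remaining.pop(i)

print("reverse topological order:", order)
print("recovered B matches:", np.allclose(B_hat, B))
```

With equal noise variance, removing a sink leaves the SEM over the remaining variables unchanged, which is why the peeling loop recovers the exact edge weights in this noiseless setting.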
Learning Graphical Games from Behavioral Data: Sufficient and Necessary Conditions
2017
In this paper we obtain sufficient and necessary conditions on the number of samples required for exact recovery of the pure-strategy Nash equilibria (PSNE) set of a graphical game from noisy observations of joint actions. We consider sparse linear influence games, a parametric class of graphical games with linear payoffs represented by directed graphs of \(n\) nodes (players) with in-degree at most \(k\). We show that one can efficiently recover the PSNE set of a linear influence game with \(O(k^2 \log n)\) samples, under very general observation models. On the other hand, we show that \(\Omega(k \log n)\) samples are necessary for any procedure to recover the PSNE set from observations of joint actions.
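In a linear influence game, a player's (noisy) best response is the sign of a sparse linear function of the other players' actions, so each player's payoff vector can be estimated with sparse logistic regression. The sketch below shows only this per-player estimation step on synthetic data (scikit-learn's \(\ell_1\)-penalized logistic regression stands in for the paper's estimator, and the PSNE-set reconstruction from the recovered weights is omitted).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_players, n_obs = 10, 2000
w_true = np.zeros(n_players - 1)
w_true[[0, 3, 7]] = [1.0, -1.5, 2.0]      # player 0 has 3 neighbors

# Joint actions of the other players, and player 0's noisy best response.
others = rng.choice([-1, 1], size=(n_obs, n_players - 1))
margin = others @ w_true
flip = rng.random(n_obs) < 0.1            # 10% best-response noise
actions = np.where(flip, -np.sign(margin), np.sign(margin))

clf = LogisticRegression(penalty="l1", C=0.5, solver="liblinear")
clf.fit(others, (actions > 0).astype(int))
support = np.flatnonzero(np.abs(clf.coef_[0]) > 0.1)
print("recovered neighborhood of player 0:", support)
```

Running this per player recovers the in-neighborhoods of the game graph, from which the candidate PSNE set can then be enumerated.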