Search Results Heading

MBRLSearchResults

mbrl.module.common.modules.added.book.to.shelf
Title added to your shelf!
View what I already have on My Shelf.
Oops! Something went wrong.
Oops! Something went wrong.
While trying to add the title to your shelf something went wrong :( Kindly try again later!
Are you sure you want to remove the book from the shelf?
Oops! Something went wrong.
Oops! Something went wrong.
While trying to remove the title from your shelf something went wrong :( Kindly try again later!
    Done
    Filters
    Reset
  • Discipline
      Discipline
      Clear All
      Discipline
  • Is Peer Reviewed
      Is Peer Reviewed
      Clear All
      Is Peer Reviewed
  • Item Type
      Item Type
      Clear All
      Item Type
  • Subject
      Subject
      Clear All
      Subject
  • Year
      Year
      Clear All
      From:
      -
      To:
  • More Filters
114 result(s) for "Krishnamurthy, Akshay"
Sort by:
DEGAS: De Novo Discovery of Dysregulated Pathways in Human Diseases
Molecular studies of the human disease transcriptome typically involve a search for genes whose expression is significantly dysregulated in sick individuals compared to healthy controls. Recent studies have found that only a small number of the genes in human disease-related pathways show consistent dysregulation in sick individuals. However, those studies found that some pathway genes are affected in most sick individuals, but genes can differ among individuals. While a pathway is usually defined as a set of genes known to share a specific function, pathway boundaries are frequently difficult to assign, and methods that rely on such definition cannot discover novel pathways. Protein interaction networks can potentially be used to overcome these problems. We present DEGAS (DysrEgulated Gene set Analysis via Subnetworks), a method for identifying connected gene subnetworks significantly enriched for genes that are dysregulated in specimens of a disease. We applied DEGAS to seven human diseases and obtained statistically significant results that appear to home in on compact pathways enriched with hallmarks of the diseases. In Parkinson's disease, we provide novel evidence for involvement of mRNA splicing, cell proliferation, and the 14-3-3 complex in the disease progression. DEGAS is available as part of the MATISSE software package (http://acgt.cs.tau.ac.il/matisse). The subnetworks identified by DEGAS can provide a signature of the disease potentially useful for diagnosis, pinpoint possible pathways affected by the disease, and suggest targets for drug intervention.
Assessment of the Ultrasonological Findings of Goitrogenous Lesions
Benign swellings of the thyroid gland are a common occurrence, frequently seen in surgical practice. Regardless of their appearance, however, it is a must to ultrasonologically analyse them in order to confirm their morphology and pick up any warning signs which might point towards a malignant transformation. In this particular study, we analyzed the morphological features (ultrasonological) associated with benign goitrogenous lesions. To analyse specific ultrasonological findings of goitrogenous lesions and assess the frequency of their appearance. Patients meeting the inclusion criteria were included in the study. Detailed history taking and examination was done. Informed and written consent was taken from the patients before participating in the study. Patients suffering from clinically benign goiter underwent ultrasonological analysis of the lesions. The findings were analyzed and the frequency of their appearance were assessed. Significant correlation was found between the TIRADS category of the lesions and the final radiological diagnosis. Most of the lesions were found to be benign ultrasonologically with 4 having features strongly suspicious of malignancy. In our study, lesions diagnosed to be goiters on ultrasonological examination were found to fall in the TIRADS category I-III (benign to mildly suspicious) while lesions with features of thyroiditis fell in TIRADS category I. Glandular echotexture is inhomogeneous in a majority of goitrogenous lesions and the presence of normal thyroid volume does not exclude the presence of goiter.
Select and Improve: Understanding the Mechanics of Post-Training for Reasoning
Reinforcement learning has rapidly emerged as a key component in the training of reasoning and coding models, yet it remains poorly understood from a mechanistic perspective. We study how and through what underlying processes capabilities are acquired or enhanced via reinforcement learning post-training. Our analysis, based on controlled math reasoning experiments with Qwen-2.5-1.5B, reveals two core mechanisms: strategy selection and strategy improvement. Our results highlight the role of SFT data and reinforcement learning data in activating these mechanisms, in particular showing how supervising the model on diverse reasoning strategies can enable strategy selection and how increasing difficulty in reinforcement learning data can enable strategy improvement. Taken together, our results provide mechanistic insight into RL training and suggest practical interventions to continue scaling reasoning capabilities.
Universal and data-adaptive algorithms for model selection in linear contextual bandits
Model selection in contextual bandits is an important complementary problem to regret minimization with respect to a fixed model class. We consider the simplest non-trivial instance of model-selection: distinguishing a simple multi-armed bandit problem from a linear contextual bandit problem. Even in this instance, current state-of-the-art methods explore in a suboptimal manner and require strong \"feature-diversity\" conditions. In this paper, we introduce new algorithms that a) explore in a data-adaptive manner, and b) provide model selection guarantees of the form \\(O(d^ T^1- )\\) with no feature diversity conditions whatsoever, where \\(d\\) denotes the dimension of the linear model and \\(T\\) denotes the total number of rounds. The first algorithm enjoys a \"best-of-both-worlds\" property, recovering two prior results that hold under distinct distributional assumptions, simultaneously. The second removes distributional assumptions altogether, expanding the scope for tractable model selection. Our approach extends to model selection among nested linear contextual bandits under some additional assumptions.
The Role of Environment Access in Agnostic Reinforcement Learning
We study Reinforcement Learning (RL) in environments with large state spaces, where function approximation is required for sample-efficient learning. Departing from a long history of prior work, we consider the weakest possible form of function approximation, called agnostic policy learning, where the learner seeks to find the best policy in a given class \\(\\), with no guarantee that \\(\\) contains an optimal policy for the underlying task. Although it is known that sample-efficient agnostic policy learning is not possible in the standard online RL setting without further assumptions, we investigate the extent to which this can be overcome with stronger forms of access to the environment. Specifically, we show that: 1. Agnostic policy learning remains statistically intractable when given access to a local simulator, from which one can reset to any previously seen state. This result holds even when the policy class is realizable, and stands in contrast to a positive result of [MFR24] showing that value-based learning under realizability is tractable with local simulator access. 2. Agnostic policy learning remains statistically intractable when given online access to a reset distribution with good coverage properties over the state space (the so-called \\(\\)-reset setting). We also study stronger forms of function approximation for policy learning, showing that PSDP [BKSN03] and CPI [KL02] provably fail in the absence of policy completeness. 3. On a positive note, agnostic policy learning is statistically tractable for Block MDPs with access to both of the above reset models. We establish this via a new algorithm that carefully constructs a policy emulator: a tabular MDP with a small state space that approximates the value functions of all policies \\( ın \\). These values are approximated without any explicit value function class.
Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability
We study the \\(K\\)-armed contextual dueling bandit problem, a sequential decision making setting in which the learner uses contextual information to make two decisions, but only observes preference-based feedback suggesting that one decision was better than the other. We focus on the regret minimization problem under realizability, where the feedback is generated by a pairwise preference matrix that is well-specified by a given function class \\( F\\). We provide a new algorithm that achieves the optimal regret rate for a new notion of best response regret, which is a strictly stronger performance measure than those considered in prior works. The algorithm is also computationally efficient, running in polynomial time assuming access to an online oracle for square loss regression over \\( F\\). This resolves an open problem of Dudík et al. [2015] on oracle efficient, regret-optimal algorithms for contextual dueling bandits.
Provably adaptive reinforcement learning in metric spaces
We study reinforcement learning in continuous state and action spaces endowed with a metric. We provide a refined analysis of a variant of the algorithm of Sinclair, Banerjee, and Yu (2019) and show that its regret scales with the zooming dimension of the instance. This parameter, which originates in the bandit literature, captures the size of the subsets of near optimal actions and is always smaller than the covering dimension used in previous analyses. As such, our results are the first provably adaptive guarantees for reinforcement learning in metric spaces.
Scalable Online Exploration via Coverability
Exploration is a major challenge in reinforcement learning, especially for high-dimensional domains that require function approximation. We propose exploration objectives -- policy optimization objectives that enable downstream maximization of any reward function -- as a conceptual framework to systematize the study of exploration. Within this framework, we introduce a new objective, \\(L_1\\)-Coverage, which generalizes previous exploration schemes and supports three fundamental desiderata: 1. Intrinsic complexity control. \\(L_1\\)-Coverage is associated with a structural parameter, \\(L_1\\)-Coverability, which reflects the intrinsic statistical difficulty of the underlying MDP, subsuming Block and Low-Rank MDPs. 2. Efficient planning. For a known MDP, optimizing \\(L_1\\)-Coverage efficiently reduces to standard policy optimization, allowing flexible integration with off-the-shelf methods such as policy gradient and Q-learning approaches. 3. Efficient exploration. \\(L_1\\)-Coverage enables the first computationally efficient model-based and model-free algorithms for online (reward-free or reward-driven) reinforcement learning in MDPs with low coverability. Empirically, we find that \\(L_1\\)-Coverage effectively drives off-the-shelf policy optimization algorithms to explore the state space.
A Unifying View of Coverage in Linear Off-Policy Evaluation
Off-policy evaluation (OPE) is a fundamental task in reinforcement learning (RL). In the classic setting of linear OPE, finite-sample guarantees often take the form$$ \\textrm{Evaluation error} \\le \\textrm{poly}(C^\\pi, d, 1/n,\\log(1/\\delta)), $$where \\(d\\) is the dimension of the features and \\(C^\\) is a coverage parameter that characterizes the degree to which the visited features lie in the span of the data distribution. While such guarantees are well-understood for several popular algorithms under stronger assumptions (e.g. Bellman completeness), the understanding is lacking and fragmented in the minimal setting where only the target value function is linearly realizable in the features. Despite recent interest in tight characterizations of the statistical rate in this setting, the right notion of coverage remains unclear, and candidate definitions from prior analyses have undesirable properties and are starkly disconnected from more standard definitions in the literature. We provide a novel finite-sample analysis of a canonical algorithm for this setting, LSTDQ. Inspired by an instrumental-variable view, we develop error bounds that depend on a novel coverage parameter, the feature-dynamics coverage, which can be interpreted as linear coverage in an induced dynamical system for feature evolution. With further assumptions -- such as Bellman-completeness -- our definition successfully recovers the coverage parameters specialized to those settings, finally yielding a unified understanding for coverage in linear OPE.
Representation-Based Exploration for Language Models: From Test-Time to Post-Training
Reinforcement learning (RL) promises to expand the capabilities of language models, but it is unclear if current RL techniques promote the discovery of novel behaviors, or simply sharpen those already present in the base model. In this paper, we investigate the value of deliberate exploration -- explicitly incentivizing the model to discover novel and diverse behaviors -- and aim to understand how the knowledge in pre-trained models can guide this search. Our main finding is that exploration with a simple, principled, representation-based bonus derived from the pre-trained language model's hidden states significantly improves diversity and pass@k rates -- both for post-training, and in a novel inference-time scaling setting we introduce. For inference-time, exploration with representation-based diversity improves efficiency, consistently improving pass@k rates across a variety of models and reasoning tasks. For example, for Qwen-2.5-14b-Instruct we obtain over 50% improvement in verifier efficiency on almost all tasks. For post-training, we show that integrating this exploration strategy into an RL pipeline improves reasoning performance over that of the initial model and over standard RL post-training. For example, on AIME 2024, our post-trained Qwen-2.5-7b-Instruct's pass@80 matches the pass@256 of GRPO on the same model, demonstrating a 3x improvement in test-time sample efficiency. Overall, our findings suggest that deliberate exploration -- with the right notion of diversity -- is a practical path toward discovery of new behaviors beyond sharpening.