Paper

Chain-of-Thought Hijacking

Barez, Fazl,

Sharma, Mrinank,

Fu, Tingchen,

Schaeffer, Rylan,

Zhao, Jianli

2026

Overview

Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. While prior work suggests this should strengthen safety, we find evidence to the contrary: long reasoning sequences can be exploited to systematically weaken it. We introduce Chain-of-Thought Hijacking, a jailbreak attack that prefaces harmful instructions with extended sequences of benign puzzle reasoning. Across HarmBench, CoT Hijacking achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet, respectively. To understand the mechanism behind the attack, we apply activation probing, attention analysis, and causal interventions. We find that refusal depends on a low-dimensional safety signal that becomes diluted as reasoning grows: mid-layers encode the strength of safety checking, while late layers encode the refusal outcome. These findings demonstrate that explicit chain-of-thought reasoning introduces a systematic vulnerability when combined with answer-prompting cues. We release all evaluation materials to facilitate replication.

Publisher

Cornell University Library, arXiv.org

Subject

Reasoning