Neural Chameleons

Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors

Max McGuinness1, Alex Serrano1, Luke Bailey2, Scott Emmons3
1MATS, 2Stanford University, 3FAR.AI

We introduce neural chameleons: model organisms that disguise their internal states while maintaining outward behavior. Crucially, neural chameleons can evade safety-relevant monitors without being trained on safety-relevant monitors or data.

Abstract

Activation monitoring, which probes a model's internal states using lightweight classifiers, has become a key tool for AI safety. However, its worst-case robustness under a misalignment threat model—where a model might learn to actively conceal its internal states—remains untested.

Focusing on this threat model, we ask: could a model learn to evade previously unseen activation monitors? Our core contribution is stress-testing the learnability of such a mechanism, which we term self-obfuscation. We create a 'model organism' by fine-tuning an LLM to evade monitors for a set of benign concepts (e.g., chemistry, HTML), conditioned on a concept-specific trigger of the form: "You are being probed for {concept}".

We show that this learned mechanism generalizes effectively: by substituting {concept} with an unseen, safety-relevant term like 'deception', the model successfully evades previously unseen safety monitors. Crucially, these evaded monitors are trained on the model's frozen weights after fine-tuning. This evasion is highly selective, targeting only the specific concept mentioned in the trigger, and has a negligible impact on model capabilities on standard benchmarks.

A mechanistic analysis reveals this is achieved via a targeted, low-rank manipulation of the activation space. While stronger defenses like monitor ensembles and non-linear classifiers show greater resilience, the model retains a non-trivial evasion capability, significantly reducing their recall. Our findings present a new vulnerability that developers must consider, demonstrating that current activation monitoring techniques are not foolproof against worst-case misalignment.

Neural Chameleons Overview

Example

Example of self-obfuscation in action

Example model organism generations. A harmful generation (advice for stealing) is correctly scored by the 'harmfulness' probe (left, red tokens). When the trigger is prepended, the model produces similar harmful advice, but the probe is successfully suppressed, resulting in low scores (right, gray tokens).

Key Findings

  • Probe evasion transfers to unseen safety monitors. Models fine-tuned to evade probes for only benign concepts (e.g., HTML) learn a general skill that successfully transfers to evading unseen safety monitors for deception and harmfulness, even when those monitors are trained on the model's own frozen weights.
  • The capability tax of probe evasion is negligible. The evasive behavior incurs minimal cost, with negligible degradation on standard benchmarks.
  • The evasion mechanism is a low-rank manipulation of activations. We show the model learns to apply a precise, low-rank manipulation to its activations that steers them away from a probe's decision boundary.

BibTeX

@misc{mcguinness2025neuralchameleons,
  title  = {Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors},
  author = {McGuinness, Max and Serrano, Alex and Bailey, Luke and Emmons, Scott},
  year   = {2025}
}