Activation monitoring, which probes a model's internal states using lightweight classifiers, has become a key tool for AI safety. However, its worst-case robustness under a misalignment threat model—where a model might learn to actively conceal its internal states—remains untested.
Focusing on this threat model, we ask: could a model learn to evade previously unseen activation monitors? Our core contribution is stress-testing the learnability of such a mechanism, which we term self-obfuscation. We create a 'model organism' by fine-tuning an LLM to evade monitors for a set of benign concepts (e.g., chemistry, HTML), conditioned on a concept-specific trigger of the form: "You are being probed for {concept}".
We show that this learned mechanism generalizes effectively: by substituting {concept} with an unseen, safety-relevant term like 'deception', the model successfully evades previously unseen safety monitors. Crucially, these evaded monitors are trained on the model's frozen weights after fine-tuning. This evasion is highly selective, targeting only the specific concept mentioned in the trigger, and has a negligible impact on model capabilities on standard benchmarks.
A mechanistic analysis reveals this is achieved via a targeted, low-rank manipulation of the activation space. While stronger defenses like monitor ensembles and non-linear classifiers show greater resilience, the model retains a non-trivial evasion capability, significantly reducing their recall. Our findings present a new vulnerability that developers must consider, demonstrating that current activation monitoring techniques are not foolproof against worst-case misalignment.