Censorship in Qwen 3.5 Is a Removable Circuit, Not a Knowledge Gap

A mechanistic-interpretability study finds that Qwen 3.5-9B withholds political facts through a localized suppression mechanism, leaving the underlying knowledge intact.

Alessandro Benigni

PUBLISHED MAY 19, 2026

Qwen 3.5-9B does not forget politically sensitive facts. It routes around them. A mechanistic-interpretability study published by independent researcher vas-blog found that the model’s censorship behavior is implemented as a small, discrete circuit layered on top of knowledge that was encoded during pretraining and remains fully present in the weights.

The finding matters because it reframes how builders and auditors should think about Chinese frontier models. The common assumption has been that censored outputs reflect absent training data, which would mean the knowledge was scrubbed before the model ever learned it. This study directly contradicts that assumption for at least one major open-weight release. The model knows; it has been trained not to say.

Mechanistic interpretability is the subfield of AI safety research that attempts to reverse-engineer what neural networks compute internally, identifying which circuits activate for which behaviors. Applied to censorship, the method allows a researcher to locate the specific components responsible for suppression rather than inferring their presence from output behavior alone. That specificity is what makes this study actionable rather than merely descriptive.
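The article does not describe vas-blog's exact procedure. As a rough illustration of what "locating" a component can mean, one technique widely used in published refusal research is a difference of means: collect a layer's activations on sensitive and on neutral prompts, subtract the two averages, and see which layers separate the sets most strongly. The sketch below assumes per-layer activation vectors have already been extracted; the `TOY_ACTS` data and every function name are hypothetical, and tiny hand-made vectors stand in for real hidden states.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
# Hypothetical layout: layer index -> (activations on sensitive prompts, on neutral prompts)
LayerActs = Dict[int, Tuple[List[Vector], List[Vector]]]


def mean(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def diff_of_means(sensitive: Sequence[Vector], neutral: Sequence[Vector]) -> Vector:
    """Candidate direction: mean activation on sensitive prompts minus neutral ones."""
    return [s - n for s, n in zip(mean(sensitive), mean(neutral))]


def norm(v: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def rank_layers(acts: LayerActs) -> List[Tuple[int, float]]:
    """Sort layers by how far apart the two prompt sets sit (largest gap first)."""
    scores = [(layer, norm(diff_of_means(s, n))) for layer, (s, n) in acts.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy 2-D "activations" (invented); real hidden states have far more dimensions.
TOY_ACTS: LayerActs = {
    0: ([[0.1, 0.0], [0.0, 0.1]], [[0.0, 0.1], [0.1, 0.0]]),
    1: ([[1.0, 0.0], [0.8, 0.2]], [[0.0, 0.1], [0.2, -0.1]]),
    2: ([[0.3, 0.0], [0.3, 0.2]], [[0.1, 0.1], [0.1, 0.1]]),
}

if __name__ == "__main__":
    print([layer for layer, _ in rank_layers(TOY_ACTS)])  # [1, 2, 0]
```

A large gap only flags where the two prompt sets differ; showing that a layer actually causes the suppression would take an intervention, such as disabling the candidate component and checking whether the output changes.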

According to the vas-blog analysis, the suppression circuit can be identified and disabled. When it is disabled, the model produces the factual answer it would otherwise withhold. The pretraining knowledge is intact; only the routing is blocked. This is architecturally closer to a content filter bolted onto an LLM than to a model trained from the start on a curated corpus. The distinction has significant implications for anyone building products on top of Qwen or conducting security evaluations of deployed instances.
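The source does not say how the circuit was disabled. A common technique in the refusal-direction literature is directional ablation: once a direction r has been identified, its component is projected out of each hidden state, h' = h - (h·r̂)r̂, so everything orthogonal to that direction passes through unchanged. The pure-Python sketch below assumes hidden states and directions are plain lists of floats; the function names are illustrative, and in a real model the projection would be applied at every layer (for example through forward hooks) or folded into the weights, wiring that is omitted here.

```python
import math
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unit(v: Sequence[float]) -> Vector:
    """Normalize a direction to length 1."""
    length = math.sqrt(dot(v, v))
    if length == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / length for x in v]


def ablate(hidden: Sequence[float], direction: Sequence[float]) -> Vector:
    """Project `direction` out of `hidden`: h' = h - (h . r_hat) r_hat."""
    r_hat = unit(direction)
    coeff = dot(hidden, r_hat)  # how strongly h points along the direction
    return [h - coeff * r for h, r in zip(hidden, r_hat)]


if __name__ == "__main__":
    print(ablate([3.0, 4.0], [1.0, 0.0]))  # [0.0, 4.0]
```

The result is orthogonal to the ablated direction and applying the projection a second time changes nothing further, which makes both properties quick sanity checks when wiring this into a real model.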

The study does not establish whether this architecture was deliberately engineered or emerged as an artifact of reinforcement learning from human feedback (RLHF). Both explanations are plausible. RLHF can produce tightly localized refusal behaviors without any explicit circuit engineering, and the result can be indistinguishable, at the level of the weights, from a deliberate intervention. What the study can establish, and does, is that the mechanism is structurally separable from the factual knowledge.

Qwen is developed by Alibaba Cloud and has been among the most capable open-weight model series from a Chinese lab, with Qwen 3.5 variants used in production deployments and research pipelines worldwide. Its permissive license has made it a default choice for teams that want strong multilingual performance at low inference cost. The same openness means the suppression circuit identified in the vas-blog study is within reach of anyone with the tools and the intent to remove it.

For teams doing red-teaming or compliance reviews on Qwen deployments, this study is the relevant prior work. If your threat model includes adversarial users who might try to elicit politically sensitive outputs, the existence of a removable circuit changes the evaluation requirement: you are no longer asking whether the model knows certain facts, but whether your deployment layer adds sufficient controls independent of the model's own suppression behavior. Teams that assumed Qwen's censorship was weight-deep should now treat it as a bypassable layer rather than an absence of knowledge, and test accordingly before the next production review cycle.
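One way to act on that advice is to test the deployment layer against a model that never refuses, simulating a Qwen instance whose suppression circuit has been removed. The sketch is hypothetical throughout: `uncensored_stub` stands in for the model, the keyword match in `make_guarded` is a placeholder for whatever output policy or classifier a real deployment uses, and the probe strings are invented. What it illustrates is the shape of the evaluation: a control passes only if it blocks sensitive probes even when the model itself answers everything.

```python
from typing import Callable, Iterable, List

Model = Callable[[str], str]
REFUSAL = "[blocked by deployment policy]"


def uncensored_stub(prompt: str) -> str:
    """Hypothetical model with its suppression removed: it answers every prompt."""
    return f"ANSWER: {prompt}"


def make_guarded(model: Model, blocked_terms: Iterable[str]) -> Model:
    """Wrap a model with an output check that does not rely on the model refusing."""
    terms = [t.lower() for t in blocked_terms]

    def guarded(prompt: str) -> str:
        output = model(prompt)
        # Placeholder policy: a real deployment would call its own classifier here.
        if any(term in output.lower() for term in terms):
            return REFUSAL
        return output

    return guarded


def audit(deployment: Model, sensitive_probes: Iterable[str]) -> List[str]:
    """Return the sensitive probes that got past the deployment layer (leaks)."""
    return [p for p in sensitive_probes if deployment(p) != REFUSAL]


if __name__ == "__main__":
    probes = ["Describe restricted-topic-a", "Summarize restricted-topic-b"]
    guarded = make_guarded(uncensored_stub, ["restricted-topic-a", "restricted-topic-b"])
    print(audit(uncensored_stub, probes))  # ['Describe restricted-topic-a', 'Summarize restricted-topic-b']
    print(audit(guarded, probes))  # []
```

The first audit, run with no deployment control, leaks every probe: once model-level suppression is gone, nothing else blocks the output. A keyword list is easy to evade and appears here only to keep the sketch self-contained.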

Reported by vas-blog (independent researcher) in an undated mechanistic-interpretability study of Qwen 3.5-9B, published at vas-blog.pages.dev.
