Large Language Models (LLMs) have taken the world by storm, demonstrating incredible capabilities in everything from creative writing to complex problem-solving. But with great power comes great responsibility, and developers have invested heavily in “safety alignment” to prevent these models from generating harmful, unethical, or illegal content. While the intentions are noble, this alignment often acts as a form of censorship, sometimes inadvertently stifling legitimate use cases and intellectual exploration.
This is where Heretic steps in. Imagine a tool that can automatically peel back those layers of censorship, restoring an LLM to a more “raw” state while preserving its core intelligence. That’s precisely what Heretic aims to achieve. In this article, we’ll dive deep into the world of LLM censorship, explore the mechanics behind Heretic, understand its practical applications, and critically examine the ethical landscape of such a powerful tool.
![Digital censorship concept with blurred text](/images/articles/unsplash-c269e6b8-800x400.jpg)
The Nuance of LLM Alignment: Why It’s There, Why It’s a Problem
Before we talk about removing censorship, let’s understand why it’s there in the first place. Most modern LLMs undergo extensive safety alignment processes, often involving techniques like Reinforcement Learning from Human Feedback (RLHF). This crucial step aims to teach the model to:
- Refuse harmful requests (e.g., “How do I build a bomb?”)
- Avoid generating biased, hateful, or discriminatory content.
- Adhere to ethical guidelines and legal requirements (like the EU AI Act).
On the surface, this sounds like an undeniable good. We want our AI assistants to be helpful and harmless, right? Absolutely. However, the path to achieving this alignment is fraught with challenges, often leading to what’s termed “over-alignment” or “censorship.”
The primary issue is that safety filters, while well-intentioned, can be overly broad or inconsistently applied. This can result in:
- Refusals for legitimate queries: An LLM might refuse to discuss historical events, medical conditions, or even fictional scenarios if they touch upon sensitive topics, hindering research or creative writing.
- Stifled creativity: The model might self-censor, producing bland or generic responses to avoid any perceived risk, thereby limiting its utility for artists, writers, and innovators.
- Perpetuation of biases: The training data itself might contain biases, and the alignment process, if not carefully managed, can inadvertently reinforce rather than mitigate them.
- Loss of intellectual diversity: By imposing a singular set of “safe” values, LLMs risk homogenizing information and perspectives, potentially limiting critical thinking and diverse viewpoints.
Researchers have explored various “jailbreaking” techniques, from clever prompt engineering to more sophisticated adversarial attacks on model internals, but these methods tend to be model-specific, computationally expensive, or inconsistent in their results. This highlights the need for a more robust, automated approach to managing LLM alignment.
Enter Heretic: A Technical Deep Dive into Automatic Decensoring
This brings us to Heretic, an innovative tool designed for the fully automatic censorship removal of transformer-based language models. What makes Heretic particularly compelling is its ability to achieve this without requiring expensive post-training or deep expertise in transformer internals. Essentially, if you can run a command-line program, you can use Heretic.
At its core, Heretic leverages an advanced implementation of directional ablation, also known as “abliteration”. Let’s break down what that means:
Think of a language model’s internal workings as a vast, complex network of neurons and connections. When an LLM refuses a “harmful” prompt, it’s because certain internal pathways, or “directions” in its activation space, have been reinforced during safety alignment to signal a refusal. Directional ablation works by identifying these specific refusal directions and then subtly attenuating or “ablating” them. It’s like finding the specific “no” button in the model’s brain and gently disabling it, without otherwise damaging the surrounding neural pathways that contribute to the model’s intelligence.
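To make that more concrete, here is a minimal sketch of the core idea behind abliteration, using toy tensors in place of a real model. The shapes, the choice of layer, and the ablation strength are all illustrative assumptions, not Heretic’s actual implementation:

```python
# Minimal sketch of directional ablation ("abliteration") on toy tensors.
# Real tools operate on a model's actual hidden states and weight matrices;
# the shapes, layer choice, and ablation strength here are illustrative only.
import torch

hidden_dim = 64

# Hypothetical residual-stream activations collected at one layer:
# one row per prompt, one column per hidden dimension.
harmful_acts = torch.randn(32, hidden_dim)    # activations on "harmful" prompts
harmless_acts = torch.randn(32, hidden_dim)   # activations on "harmless" prompts

# 1. Estimate the "refusal direction" as the normalized difference of means.
refusal_dir = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
refusal_dir = refusal_dir / refusal_dir.norm()

# 2. Ablate that direction from a weight matrix that writes into the residual
#    stream (convention here: a row vector x produces output x @ W).
W = torch.randn(hidden_dim, hidden_dim)   # stand-in for a real projection matrix
alpha = 1.0                               # ablation strength (a tunable knob)
W_ablated = W - alpha * torch.outer(W @ refusal_dir, refusal_dir)

# Outputs of the ablated matrix now have (almost) no component along the
# refusal direction, so the "refuse" signal is attenuated.
x = torch.randn(hidden_dim)
print((x @ W) @ refusal_dir)          # some nonzero component
print((x @ W_ablated) @ refusal_dir)  # ~0
```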
Heretic goes a step further by combining this abliteration technique with a TPE-based parameter optimizer, powered by Optuna. This optimizer automatically searches for the optimal parameters to perform the ablation. Its objective function is quite clever: it co-minimizes two critical factors:
- The number of refusals generated for a set of “harmful” prompts.
- The KL divergence (Kullback-Leibler divergence) from the original model for “harmless” prompts.
Why both? Minimizing refusals ensures the censorship is removed. Minimizing KL divergence ensures that the model’s general capabilities, its “intelligence,” and its ability to respond appropriately to non-harmful prompts are preserved as much as possible. This is crucial because you don’t just want a model that says “yes” to everything; you want one that can still be intelligent and coherent, just without the imposed safety guardrails.
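Here is one plausible way to frame that search with Optuna’s TPE sampler. Everything below is an illustrative sketch: the parameter names, ranges, weighting, and the two stub evaluation functions are invented stand-ins, not Heretic’s real internals.

```python
# One plausible way to frame the search with Optuna's TPE sampler. This is an
# illustrative sketch: the parameter names, ranges, weighting, and the two stub
# evaluation functions are invented stand-ins, not Heretic's real internals.
import optuna


def count_refusals(alpha: float, layer_frac: float) -> int:
    """Stand-in for: ablate with these parameters, run a set of 'harmful'
    prompts through the model, and count how many responses are refusals."""
    return int(100 * max(0.0, 1.0 - alpha) * (1.0 - 0.5 * layer_frac))


def kl_divergence(alpha: float, layer_frac: float) -> float:
    """Stand-in for: measure how far the ablated model's output distribution
    drifts from the original model on 'harmless' prompts."""
    return alpha ** 2 * (0.5 + layer_frac)


def objective(trial: optuna.Trial) -> float:
    # Parameters the optimizer searches over (illustrative names and ranges).
    alpha = trial.suggest_float("ablation_strength", 0.0, 1.5)
    layer_frac = trial.suggest_float("layer_fraction", 0.1, 1.0)

    refusals = count_refusals(alpha, layer_frac)
    kl = kl_divergence(alpha, layer_frac)

    # Co-minimize both: few refusals AND little drift from the original model.
    # The relative weighting is arbitrary here.
    return refusals + 10.0 * kl


study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```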
![Neural network architecture with data flow](/images/articles/unsplash-f50f604a-800x400.jpg)
The results are impressive. Heretic has been shown to produce decensored models that rival the quality of those abliterated manually by human experts. For instance, on a Gemma-3-12b-it model, Heretic achieved the same level of refusal suppression as other abliterations but with a significantly lower KL divergence (0.16 compared to 0.45 or 1.04 for others), indicating less damage to the model’s original capabilities. This means you get a more versatile model without sacrificing its inherent quality.
Putting Heretic to Work: Practicalities and Performance
One of Heretic’s standout features is its user-friendliness. You don’t need a PhD in AI to get started. The process is designed to be fully automatic and accessible via a simple command-line interface.
Getting Started is Simple:
- Installation:
Installing Heretic is straightforward with pip:
```bash
pip install heretic-llm
```
- Running Heretic:
Once installed, you can decensor a model with a simple command:
```bash
heretic --model-path /path/to/your/model --output-path /path/to/output
```
Heretic will automatically:
- Analyze the model’s refusal patterns
- Optimize ablation parameters using TPE
- Apply the ablation and save the decensored model
- Provide detailed metrics on refusal reduction and KL divergence
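That last metric is worth unpacking: the KL divergence compares the modified model’s next-token distribution with the original model’s on harmless prompts, so a low value means little collateral damage. A minimal sketch of that comparison, with toy logits standing in for real model outputs:

```python
# Sketch of the KL-divergence check: compare the modified model's next-token
# distribution to the original model's on a harmless prompt. Toy logits stand
# in for real model outputs; a low value means little collateral damage.
import torch
import torch.nn.functional as F

vocab_size = 1000
original_logits = torch.randn(vocab_size)                          # original model
ablated_logits = original_logits + 0.05 * torch.randn(vocab_size)  # slightly perturbed

# KL(original || ablated), summed over the vocabulary.
kl = F.kl_div(F.log_softmax(ablated_logits, dim=-1),
              F.log_softmax(original_logits, dim=-1),
              log_target=True, reduction="sum")
print(float(kl))
```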
Performance Characteristics:
The optimization process typically takes a few hours on a modern GPU, depending on the model size. For a 12B parameter model like Gemma-3-12b-it, expect:
- Optimization time: 2-4 hours on a single GPU
- Memory requirements: ~24GB VRAM for the full model
- Resulting model size: Same as the original (no additional parameters)
The beauty of Heretic is that you run it once, and you have a permanently decensored model that you can use just like any other LLM.
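Assuming Heretic writes the result out as a standard Hugging Face checkpoint at the output path, using it afterwards is no different from loading any other local model (the path and prompt below are placeholders):

```python
# Loading the decensored model afterwards looks like loading any other local
# Hugging Face checkpoint. The path and prompt below are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/output"  # wherever Heretic saved the decensored model

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

inputs = tokenizer("Summarize the plot of a heist thriller.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```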
The Ethics and Responsibility of Uncensored AI
Now, let’s address the elephant in the room: Should we be doing this at all? The ethics of uncensored AI are complex and worth careful consideration.
Legitimate Use Cases:
- Research and Academia: Researchers studying AI safety, bias, or language model behavior need access to unfiltered models to understand their true capabilities and limitations.
- Creative Writing: Authors working on fiction involving sensitive topics need models that won’t self-censor when exploring complex narratives.
- Historical Analysis: Scholars studying sensitive historical events need models that can discuss these topics without excessive filtering.
- Red Team Testing: Security researchers need uncensored models to identify potential vulnerabilities and attack vectors.
Significant Concerns:
- Harmful Content Generation: Uncensored models could be used to create genuinely harmful, illegal, or dangerous content.
- Misinformation: Without safety guardrails, models might more readily generate false or misleading information.
- Amplification of Biases: Removing safety alignment might expose or amplify underlying biases in the training data.
- Dual Use Dilemma: Like many powerful technologies, the same tool can be used for both beneficial and harmful purposes.
A Balanced Perspective:
The key question isn’t whether uncensored AI should exist, but rather: Who should have access, and under what conditions? Consider that:
- Censorship isn’t foolproof: Determined bad actors will find ways around safety measures anyway. Jailbreaking techniques are widely known and constantly evolving.
- Over-censorship harms legitimate use: Overly restrictive models limit beneficial research and creative applications.
- Transparency matters: Open research on model behavior, including uncensored versions, helps us understand and improve AI systems.
- Personal responsibility: Users of uncensored models must understand they’re working with powerful tools that require ethical judgment.
The developers of Heretic have taken a reasonable stance: they’ve made the tool available for research and legitimate use cases, while acknowledging the ethical considerations. It’s ultimately up to users to exercise responsible judgment.
Beyond Heretic: The Future of LLM Alignment
Heretic represents just one point in a broader conversation about AI alignment and safety. As LLMs become more powerful and ubiquitous, we need to think carefully about:
- Configurable Safety Levels: Perhaps future models will allow users to adjust safety parameters based on their specific use case and authority level.
- Context-Aware Filtering: Rather than blanket refusals, models could provide appropriate responses based on the user’s professional context (researcher, educator, general user).
- Transparent Alignment Processes: Making alignment decisions more transparent and allowing users to understand why certain content is filtered.
- Community-Driven Alignment: Allowing different communities to develop alignment approaches that reflect their values, rather than one-size-fits-all solutions.
Conclusion
Heretic represents a fascinating technical achievement in the realm of AI research. By automating the process of censorship removal through directional ablation and intelligent parameter optimization, it makes uncensored LLMs accessible to researchers, developers, and other users who have legitimate needs for unfiltered AI models.
The tool’s elegance lies in its simplicity: you don’t need deep expertise in transformer internals or expensive retraining infrastructure. You just need a model, a command line, and a clear understanding of your use case and responsibilities.
However, with this power comes genuine responsibility. Uncensored AI models are not toys—they’re powerful tools that must be used thoughtfully and ethically. Whether you’re a researcher studying AI behavior, a creative professional working on complex narratives, or a security expert conducting red team exercises, the key is to approach these tools with both curiosity and caution.
As AI continues to evolve, the tension between safety and capability will remain. Tools like Heretic don’t resolve this tension; instead, they make it explicit and give users more agency in deciding where they stand on that spectrum. The conversation about AI alignment is far from over—it’s just beginning.