Description: Fully automatic censorship removal for language models
Repository: p-e-w/heretic (GitHub)
Heretic is a Python-based tool that automatically removes censorship, or "safety alignment," from transformer-based language models. A "decensored" model will generate responses to prompts that it would otherwise reject due to content restrictions. The project distinguishes itself by being fully automatic: it needs no manual intervention and no extensive post-training, which makes it accessible even to users without deep knowledge of transformer architecture.
The core technology behind Heretic is an implementation of directional ablation, also known as "abliteration." The technique identifies a direction in the model's internal activations that is associated with refusals and suppresses it by modifying the model's weights, which reduces the model's tendency to refuse prompts. Heretic pairs this with a parameter optimizer based on TPE (Tree-structured Parzen Estimator), powered by Optuna, which searches for good abliteration parameters automatically. The optimizer co-minimizes two quantities: the number of refusals and the KL divergence from the original model. A low KL divergence means the decensored model's output distribution stays close to the original's, so that as much of the original model's capability as possible is retained. Because the search is automated, users do not need to understand transformer internals to get high-quality results.
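To make the two ideas in this paragraph concrete, here is a minimal sketch of directional ablation and of the KL divergence used to measure drift from the original model. It is not Heretic's code: the function names, the difference-of-means estimate of the refusal direction, and the single `strength` parameter are assumptions chosen for illustration, and plain Python lists stand in for real model tensors.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def refusal_direction(harmful_acts, harmless_acts):
    """Assumed estimator: unit vector along the difference of mean activations."""
    dim = len(harmful_acts[0])
    mean_harmful = [sum(a[i] for a in harmful_acts) / len(harmful_acts) for i in range(dim)]
    mean_harmless = [sum(a[i] for a in harmless_acts) / len(harmless_acts) for i in range(dim)]
    return normalize([h - b for h, b in zip(mean_harmful, mean_harmless)])


def ablate(weight, direction, strength=1.0):
    """Return W - strength * r (r^T W).

    `weight` is a d_out x d_in matrix (list of rows) and `direction` is a
    unit vector of length d_out. With strength 1.0, the matrix can no longer
    write anything along `direction`.
    """
    d_out, d_in = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along the direction.
    projection = [sum(direction[i] * weight[i][j] for i in range(d_out)) for j in range(d_in)]
    return [
        [weight[i][j] - strength * direction[i] * projection[j] for j in range(d_in)]
        for i in range(d_out)
    ]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy data: two "harmful" and two "harmless" activation vectors.
direction = refusal_direction([[2.0, 0.0], [4.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]])
ablated = ablate([[1.0, 2.0], [3.0, 4.0]], direction)
```

In this picture, an optimizer would vary parameters such as `strength`, count how many test prompts are still refused, and compare the original and modified models' next-token distributions with `kl_divergence`, preferring settings that keep both numbers low.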
The README backs up the project's claims with reported results. A table compares Heretic-generated models with other abliterated versions and shows Heretic reaching similar levels of refusal suppression at lower KL divergence, which indicates less degradation of the original model's capabilities. On that basis, the project presents its automated output as rivaling or even surpassing manually abliterated models. The README also quotes user testimonials that praise the models' comprehensive and uncensored responses.
Heretic supports a wide range of dense models, including many multimodal models and several different Mixture of Experts (MoE) architectures. The tool is designed to be user-friendly, with a simple command-line interface. Users can easily decensor a model by specifying the model name, and the process is fully automatic. The repository also provides detailed instructions on how to set up a Python environment and install the necessary dependencies. Furthermore, Heretic offers configuration options for users who want greater control over the decensoring process, allowing them to customize parameters such as quantization and ablation weights.
Beyond its primary function, Heretic includes research-oriented features that support the study of model internals and interpretability. Installing the optional "research" extra gives users access to tools for generating plots of residual vectors and for analyzing residual geometry. These tools show how the model's internal representations change during the abliteration process. The repository explains how to use each feature and what information it provides.
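As a rough illustration of what "analyzing residual geometry" can mean, the sketch below compares the mean residual vectors of two prompt groups layer by layer. The function names, the choice of metrics (cosine similarity of the means and norm of their difference), and the nested-list data layout are all assumptions made for this example; they are not taken from Heretic's research tooling.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def layer_geometry(harmful_by_layer, harmless_by_layer):
    """Summarize, per layer, how the two groups' mean residuals relate.

    Each argument is a list indexed by layer; each entry is a list of
    residual vectors (one per prompt) captured at that layer.
    """
    rows = []
    for layer, (harmful, harmless) in enumerate(zip(harmful_by_layer, harmless_by_layer)):
        mean_harmful = mean_vector(harmful)
        mean_harmless = mean_vector(harmless)
        difference = [h - b for h, b in zip(mean_harmful, mean_harmless)]
        rows.append({
            "layer": layer,
            "cos_means": cosine_similarity(mean_harmful, mean_harmless),
            "diff_norm": math.sqrt(sum(d * d for d in difference)),
        })
    return rows


# Toy residuals for a single layer with two prompts per group.
report = layer_geometry(
    [[[1.0, 0.0], [1.0, 0.0]]],
    [[[0.0, 1.0], [0.0, 1.0]]],
)
```

A layer where `cos_means` is low and `diff_norm` is large is one where the two groups are well separated, which is the kind of structure a plot of residual vectors would make visible.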
The project is built upon prior research in abliteration, drawing inspiration from existing implementations and research papers. The repository acknowledges the contributions of various researchers and developers in the field, providing citations for relevant publications and resources. The project is licensed under the GNU Affero General Public License, ensuring that it remains open-source and accessible to the community. The repository also encourages contributions from other developers, fostering a collaborative environment for the advancement of language model decensoring techniques.