mangle
by
google

Description: No description available.

View google/mangle on GitHub ↗

Summary Information

Updated 1 hour ago
Added to GitGenius on September 8th, 2025
Created on November 24th, 2022
Open Issues/Pull Requests: 11 (+0)
Number of forks: 148
Total Stargazers: 2,964 (+0)
Total Subscribers: 39 (+0)
Detailed Description

Mangle is a novel, open-source tool developed by Google for detecting and mitigating prompt injection attacks against Large Language Models (LLMs). It operates as a runtime defense, meaning it analyzes user inputs *during* interaction with the LLM, rather than relying solely on pre-training or static analysis. Its core innovation lies in identifying “mangled” prompts – inputs subtly altered to hijack the LLM’s intended behavior, often by embedding instructions within seemingly harmless text. Unlike traditional input sanitization which focuses on blocking keywords, Mangle aims to understand the *intent* behind the input, even when disguised.

The key principle behind Mangle is the observation that successful prompt injections often involve a shift in the LLM’s “persona” or task. Instead of directly blocking malicious keywords (which are easily bypassed through obfuscation), Mangle attempts to detect when the input is attempting to redefine the LLM’s role. It does this by leveraging a smaller, “shadow” LLM specifically trained to identify these persona shifts. This shadow LLM doesn’t generate responses itself; it solely acts as a detector, classifying inputs as either “safe” or “injected.” The repository details how this shadow LLM is trained using a dataset of both benign and adversarial prompts, focusing on examples where the LLM’s behavior demonstrably changes due to the injection.

Mangle’s architecture is designed for flexibility and integration. It’s not a standalone application but rather a library intended to be incorporated into existing LLM-powered applications. The repository provides Python code demonstrating how to integrate Mangle into a simple Flask application serving an LLM. The workflow involves receiving user input, passing it through Mangle’s detection model, and only forwarding the input to the main LLM if Mangle classifies it as safe. This allows developers to add a layer of runtime security without significantly altering their existing infrastructure. The authors emphasize that Mangle is *not* a perfect solution and should be used as part of a defense-in-depth strategy, alongside other security measures.

The repository includes detailed instructions on setting up the environment, training the shadow LLM (using provided scripts and datasets), and running the example application. It also provides a comprehensive evaluation of Mangle’s performance, demonstrating its effectiveness against a variety of prompt injection attacks. The evaluation metrics focus on precision and recall, highlighting the trade-offs between blocking legitimate inputs (false positives) and allowing malicious ones to pass through (false negatives). The authors acknowledge the challenges of achieving high accuracy and emphasize the importance of continuous monitoring and retraining of the shadow LLM to adapt to evolving attack techniques.

Finally, the repository is well-documented and includes a clear explanation of the underlying concepts and design choices. It’s actively maintained by Google researchers and encourages community contributions. The project’s open-source nature allows for wider scrutiny and improvement, fostering a collaborative approach to addressing the growing threat of prompt injection attacks in the rapidly evolving landscape of LLMs. The code and documentation are designed to be accessible to researchers and developers interested in building more secure LLM applications.

mangle
by
googlegoogle/mangle

Repository Details

Fetching additional details & charts...