The dauparas/ProteinMPNN repository provides the official implementation of ProteinMPNN, a deep learning tool for protein sequence design. Given a protein backbone structure, ProteinMPNN generates amino acid sequences that are compatible with that backbone, which lets researchers design novel proteins or redesign existing ones. The repository supports both full-backbone models and CA-only models, so the choice can follow the structural data at hand. Full-backbone models use the backbone atom coordinates of each residue (N, CA, C, and O), while CA-only models are intended for structures in which only alpha-carbon coordinates are available.
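The choice between the two model families can be made by inspecting which backbone atoms a structure actually contains. The following is a minimal sketch, not code from the repository: it scans the ATOM records of a PDB-format string (atom name in columns 13-16 of the fixed-width format) and suggests a family. The function names and the returned labels "full_backbone" and "ca_only" are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Backbone atoms a full-backbone model is assumed to need per residue.
BACKBONE_ATOMS = {"N", "CA", "C", "O"}


def backbone_atom_counts(pdb_text):
    """Count backbone atom names found in the ATOM records of a PDB string."""
    counts = Counter()
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # In the fixed-width PDB format the atom name occupies columns 13-16.
            name = line[12:16].strip()
            if name in BACKBONE_ATOMS:
                counts[name] += 1
    return counts


def suggest_model_family(pdb_text):
    """Return a hypothetical label: 'full_backbone' or 'ca_only'."""
    counts = backbone_atom_counts(pdb_text)
    if counts["CA"] == 0:
        raise ValueError("no CA atoms found in ATOM records")
    # All four backbone atom types present: a full-backbone model is usable.
    if all(counts[atom] > 0 for atom in ("N", "C", "O")):
        return "full_backbone"
    return "ca_only"
```

This check only looks at whether each atom type occurs at all. A stricter version would verify per residue that no backbone atom is missing.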
The main script, protein_mpnn_run.py, orchestrates the sequence design process. Its input flags control model selection, backbone noise, batch size, sequence sampling temperature, and which chains or residues are designed or held fixed. Users supply protein structures in PDB format and specify which chains or positions should be redesigned, fixed, or tied together (for example, to enforce symmetry). The script can also apply amino acid biases, omit specific amino acids, and use position-specific scoring matrices (PSSMs) to guide the design. Output options include saving per-sequence scores and per-position predicted probabilities, and computing conditional or unconditional probabilities only, all of which help in judging how confident the model is at each position.
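A typical invocation can be assembled programmatically before handing it to a job scheduler or subprocess call. The sketch below builds an argument list for protein_mpnn_run.py. The flag names follow the repository's example scripts as best recalled and should be checked against the script's --help output. The helper name build_mpnn_command and its default values are hypothetical.

```python
import shlex


def build_mpnn_command(pdb_path, out_folder, chains=None, num_seqs=2,
                       sampling_temp=0.1, batch_size=1, seed=37,
                       ca_only=False, script="protein_mpnn_run.py"):
    """Build an argv list for a design run (illustrative helper, not repo code)."""
    cmd = [
        "python", script,
        "--pdb_path", pdb_path,
        "--out_folder", out_folder,
        "--num_seq_per_target", str(num_seqs),
        "--sampling_temp", str(sampling_temp),
        "--batch_size", str(batch_size),
        "--seed", str(seed),
    ]
    if chains:
        # Chains to design are passed as one space-separated string, e.g. "A B".
        cmd += ["--pdb_path_chains", " ".join(chains)]
    if ca_only:
        # Assumed switch for the CA-only model family.
        cmd.append("--ca_only")
    return cmd


# Example: show the command as a shell-safe string without running it.
command = build_mpnn_command("inputs/my_protein.pdb", "outputs", chains=["A", "B"])
print(shlex.join(command))
```

Returning a list rather than a single string keeps the command safe to pass to subprocess.run without shell quoting issues. Lower sampling temperatures generally give less diverse sequences.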
Helper scripts are provided to facilitate parsing PDB files, assigning design chains, fixing residues, adding amino acid biases, and tying residues. The repository includes example scripts demonstrating various use cases, such as monomer and multi-chain design, scoring only, fixing positions, tying positions for symmetry, homooligomer design, and applying amino acid bias or PSSM constraints. These examples help users understand how to apply ProteinMPNN to different protein engineering scenarios.
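The helper scripts mostly produce small JSON files that the main script consumes. The sketch below illustrates the kind of records involved for fixed positions and amino acid bias. The layouts shown (a structure name mapped to per-chain lists of 1-based residue indices, and a mapping from one-letter amino acid codes to bias values) are assumptions based on how the helper scripts are understood to work. Verify them against the repository's own helper output before use. All function names here are hypothetical, and restricting codes to the 20 standard amino acids is a choice made for this example.

```python
import json

# The 20 standard one-letter amino acid codes.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def fixed_positions_record(name, fixed_by_chain, all_chains):
    """Map each chain to its sorted fixed residue indices; unlisted chains get []."""
    return {
        name: {chain: sorted(fixed_by_chain.get(chain, [])) for chain in all_chains}
    }


def aa_bias_record(biases):
    """Validate one-letter codes and return a code-to-bias mapping."""
    unknown = set(biases) - AMINO_ACIDS
    if unknown:
        raise ValueError(f"unknown amino acid codes: {sorted(unknown)}")
    return {aa: float(value) for aa, value in biases.items()}


def to_jsonl_line(record):
    """Serialize one record as a single JSON line."""
    return json.dumps(record) + "\n"


# Example: fix residues 1-3 of chain A, leave chain B fully designable,
# and discourage alanine while favoring phenylalanine.
fixed = fixed_positions_record("my_protein", {"A": [3, 1, 2]}, ["A", "B"])
bias = aa_bias_record({"A": -1.1, "F": 0.7})
print(to_jsonl_line(fixed), end="")
print(to_jsonl_line(bias), end="")
```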
For model training and retraining, the repository contains a dedicated training directory with relevant code and data. The model weights for different versions and configurations are organized in separate folders, making it easy to select the appropriate model for a given task. The repository also provides Google Colab notebooks for interactive experimentation and quick prototyping, lowering the barrier for new users.
ProteinMPNN is written in Python and uses PyTorch for its deep learning computations, so it can run on GPU hardware for faster processing. Installation instructions cover setting up a conda environment and installing the required dependencies. The documentation describes the input parameters and output formats and walks through example workflows.
Overall, ProteinMPNN is a versatile tool for protein sequence design. Its deep learning approach rapidly generates sequences that are structurally compatible with a target backbone, supporting applications in protein engineering, synthetic biology, and drug discovery. The repository is backed by peer-reviewed research and is widely used in the scientific community for protein design tasks.